Written by Satu Korhonen on the 25th of July 2025
When beginning any endeavor, one plans the approach, the goals, the players, and the strategy. This is especially true for businesses, as the amount of design, planning, strategizing, and answering to funders, investors, banks, and so forth is not insignificant. AI businesses are no different, or at least they should not be. Each company plans what it wants to do and finds others who share its view of how to do things. If it is successful enough in getting people and money on its side, the company gets started and grows.
This post is about thinking through the potential strategies behind some of the big players in AI. The goal is to raise discussion about what we want to allow and under what rules. It stems from a misconception I frequently run into: that AI is not regulated because it is new technology. Just because something is new does not mean it is unregulated; it does not exist outside regulation. For instance, you cannot discriminate with AI even if no AI-specific rule forbids it, because existing regulation already states that you cannot discriminate. AI is also not new, but that is a topic, or potentially a rant, for another post.
Starting as research
The roots of AI lie heavily in academia. Many of the technological underpinnings enabling ChatGPT were developed decades ago, often in universities and research groups. During this time, training data was gathered wherever it could be found. As the research was often for the betterment of science and humanity and shared openly, I see this as permissible. While I am not saying that the process of science allows trampling human rights, I am saying that the situation was different.
When I started getting interested in AI, back in 2016, several very public ethical problems with AI were being publicized. The COMPAS algorithm predicting the risk of recidivism was shown to be racially biased. The Apple credit card and Amazon's recruitment algorithm were biased against women. Microsoft's Tay bot was taken offline after only 16 hours of use due to, let's say, questionable behavior.
AI development and deployment were already heavily influenced by the Silicon Valley mentality of "move fast and break things". In the nine years I have spent increasingly in the industry, the one thing I have hoped to facilitate is the discussion of whether this really is how we want to proceed. A quote from the first Jurassic Park summed up my feelings well for a long time, and on some days still does.
“Your scientists were so preoccupied with whether or not they could that they didn’t stop to think if they should.” – Dr. Ian Malcolm, Jurassic Park
Scraping the net for profit
All AI systems are data hungry. They need original, human-created data. While human-created content can be augmented with AI-generated variations of the same, the original content remains important. By now the internet has been scraped for pretty much everything that can technically be used to create the systems we use today. Some of this scraping was legal. Some of it definitely was not.
This practice of ignoring intellectual property and creators' rights is one of the major objections I hear from people concerning the for-profit AI industry. And I agree with them. The industry profits from the work of others that it depends on, yet does nothing to compensate for it. And while compensating fairly for that work could potentially have made this technology impossible to create, at least as it exists now, the rules of engagement that have been adopted are problematic to say the least. Essentially, AI companies are saying to content creators: while we depend on your creations for our existence and services, and while we need you to keep providing them, we will not share the profits we make from them or acknowledge them in any way. We will profit while you do the work. There is a word for this. Does it come to mind?
This is not a way to build a sustainable ecosystem of commerce. It is, however, a certain way to provoke resistance and pushback. And soon after, we saw libraries and tools helping creators poison their creations online to prevent AI companies from using them, or, if they do, to actively degrade the resulting model. One example is Nightshade, but there are others as well. It also leads to reluctance to use the technology, as many feel that doing so sends the companies engaging in this practice a dangerous signal that it is acceptable.
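For illustration, here is a minimal, hypothetical sketch in Python of the general idea behind such tools: adding a small, targeted perturbation to an image so that a vision model reads it as something else while it still looks essentially unchanged to a human. This is emphatically not Nightshade's actual algorithm, which targets generative models and optimises far more carefully; the file name and class index below are placeholders chosen only for the example.

```python
# Conceptual sketch only: a one-step targeted adversarial perturbation (FGSM-style)
# on a single image. It illustrates the general idea of altering an artwork so that
# models misread it, NOT the method used by Nightshade or any other specific tool.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# "artwork.jpg" is a placeholder for the creator's original image.
img = preprocess(Image.open("artwork.jpg").convert("RGB")).unsqueeze(0)
img.requires_grad_(True)

target = torch.tensor([555])  # an arbitrary, unrelated ImageNet class index
loss = torch.nn.functional.cross_entropy(model(normalize(img)), target)
loss.backward()

# Step *towards* the target class with a change too small for a viewer to notice.
epsilon = 2 / 255
poisoned = (img - epsilon * img.grad.sign()).clamp(0, 1).detach()
```

Real poisoning tools push this much further, but the trade-off is the same: a change small enough for humans to ignore yet structured enough to mislead models that consume the image.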
Decided in the courts
Many rules of engagement end up being decided in the court system, and this topic is no different. Ever since the advent of ChatGPT and other similar models and services, there have been court proceedings to determine whether the rules of engagement chosen by a company are defensible. I have been waiting for the rulings, as they are one of the main ways for individuals and organisations to state their opinion on companies' operating practices.
On the 24th of June, a judge in the US ruled that AI training does not need special permissions: copyrighted books can be acquired and used in AI model training, but they need to be lawfully acquired. This means AI companies are allowed to buy rights to texts and use them for AI training as well. This ability is vital for the industry, as books are one of the main sources of high-quality human-created text. Just as important, however, is the decision that acquiring these works through piracy is not acceptable. Another trial on this is to be held later this year.
“While training AI models with copyrighted data may be considered fair use, Anthropic’s separate action of building and storing a searchable repository of pirated books is not, Alsup ruled. Alsup noted that the fact that Anthropic later bought a copy of a book it earlier stole off the internet “will not absolve it of liability for the theft, but it may affect the extent of statutory damages.” – source
Fair use policy in AI training
Copyright, as an idea, is designed to enable the creator of content to profit from their work. At its very core is the promotion of human creativity and ingenuity, recognized as one of the cornerstones of human progress. Fair use is basically what I would do if I got inspired by a novel and wrote my own that is in some ways similar but still original. It is about creating more original content, not copying the content created by another.
The fair use referred to here is, then, about the intent of use. Anthropic, the company behind Claude and a participant in the court case, stated that it uses the books to study the way they are written. And indeed, this is what AI does: it looks for statistical patterns in how words follow one another so it can generate plausible text when prompted. The idea of training a model is not to copy the style of the author as such, but to have that content affect the probability of which word follows the next in a text that is hopefully original and not pirated.
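To make that concrete, here is a minimal, hypothetical sketch in Python of that core idea: count which words tend to follow which, then sample the next word from those frequencies. The example text and helper are made up for illustration; real models learn these patterns with neural networks over far larger contexts, but the principle of "the probability of what word follows" is the same.

```python
# A toy next-word model: count word-to-word transitions, then sample from them.
# This illustrates "statistical patterns of how words follow one another",
# not how production language models are actually built.
import random
from collections import defaultdict, Counter

text = "the cat sat on the mat and the dog sat on the rug"
words = text.split()

# Count how often each word is followed by each other word.
follows = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    follows[current][nxt] += 1

def next_word(word):
    """Sample a plausible next word according to the observed frequencies."""
    counts = follows[word]
    if not counts:
        return None
    choices, weights = zip(*counts.items())
    return random.choices(choices, weights=weights)[0]

print(next_word("the"))  # e.g. "cat", "mat", "dog" or "rug", weighted by frequency
```

Only the transition counts are stored in `follows`, not the sentences themselves, which is the essence of the "studying the way of writing" argument.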
Actually, for AI companies developing foundational models, it is quite problematic when the original data used in training can be prompted back out of the model. This gets them into trouble with copyright as well. But it goes deeper than that. The data going into the models is no longer openly shared; it is proprietary knowledge. It is also kind of where some of the bodies of the industry are buried, so AI companies would, I assume, very much want the training data to stop leaking.
Further, the risk of data fed to a model leaking out has been one of the main concerns companies have about using these systems. In 2023, Samsung forbade its personnel from using them after trade secrets were leaked through such a tool. This concern slows the adoption of these tools significantly, which is not in the interest of the companies providing the services. So they do not want the original data to leak. Why it still does, well, that is a topic for another blog post.
Fair rules enable industries to develop
So we have an industry reliant on material created by individuals. It started out by stealing that material and using it for its own benefit without giving anything back. But it went further than this: it also openly invites narratives about how it can now do the work of these people, so others don't need to buy the artists' and journalists' content either. For me, this is a facepalm kind of idea when novel content is so important to the development of the industry.
At the core of this is the profound question: why in the world would you anger and injure the very people on whom you depend? And it goes back to the quote from Jurassic Park. These companies were focused on advancing a technology, a science. They were so busy being the first to succeed that they didn't stop to think, or if they did, they need to take a moment to do so again.
I am a firm proponent of science and of a research-based way to see the world and develop it further. But the end result does not excuse the means of getting there. We have ethical rules in research nowadays; many experiments done earlier in the history of science would be utterly impossible to get past ethics boards today. I want to see the same development in the field I am in. I want companies changing the way we communicate to stop and think about unintended consequences and about how their actions and choices affect the world we all live in.
If AI companies buy content from authors and publishers and train their models on it, it gives people an incentive to keep creating content. That, in turn, allows AI companies to keep developing their models on new, original, human-created content. This is vital because language and culture are constantly changing. If all the data available for training comes from before 2023 and the advent of ChatGPT, the models will soon stop reflecting the language used today and become increasingly outdated and less useful.
So instead of choosing rules of engagement based on theft, I suggest that the industry creating foundation models engage content creators and publishers in a dialogue on rules of engagement that everyone can live with. Until that happens, these lawsuits will keep coming. This is not the last ruling on the topic, and it is not the only jurisdiction that matters, as this is being discussed across the globe.
What happens next is up to all of us
The companies developing foundational models have so far been the ones making the rules. The court cases are people stating their opinion. People choosing to use these systems, or refusing to do so, are also making a point. It can be easy to feel disempowered and see ourselves as unable to have an effect. But the court case above was brought by just three authors. That is a very small group of people making a difference in the rules of engagement of an entire industry with very deep pockets.
These models are not cheap to train or run. A risk I see is that being forced to also pay for the data makes the industry less able to be profitable. Once the hype dies down, and the investor money alongside it, we will see how the rules of engagement change.
Maybe I am being naive, but I prefer to see systems that benefit all involved and bring prosperity and well-being to as many as possible on this small globe floating around in space. I still use AI, and sometimes feel what the Germans call "Weltschmerz", literally world-pain, or as the Google AI summary puts it, "a feeling of melancholy and weariness with the world, often stemming from the contrast between one's ideal vision of the world and its actual state". I see these tools as useful and would very much like the industry developing them to evolve in a way where I could feel good about using them, knowing that I could still also enjoy human-created art and literature in the future.
