
Overly strict? Copyright obligations for training AI models under the AI Act will be difficult to meet

The process of adopting the Artificial Intelligence Act is reaching its finale. The legislation, which is resonating strongly in the tech and legal worlds, is due to be adopted by the European Parliament's plenary session tomorrow and to take effect in 2026. There is a common misconception that it merely regulates the technical aspects of AI models and systems. In fact, a major copyright dimension was added to the Act following the autumn trilogue of the European Union (EU),[1] and it could have far-reaching negative implications for the development and distribution of AI systems in the EU.

The boom around generative AI over the past year has resulted in the addition of several provisions to the draft of the Artificial Intelligence Act (the "Act") that have shuffled the deck and introduced more issues into the document than they have resolved. Specifically, popular generative AI tools such as ChatGPT, Midjourney and others have been brought under the newly defined category of so-called general-purpose AI models (GPAI).[2]

Among the many obligations that the Act will impose on GPAI providers is the obligation to put in place a policy to comply with EU copyright law, and in particular to identify and comply with, including through state-of-the-art technologies, reservations of rights expressed under Article 4(3) (the data mining provision) of Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market (the “Directive”).

Data mining conditions and cross-border applicability

The Act currently expressly states that compiling datasets and training AI models on those datasets constitute data mining within the meaning of the Directive, subject to the conditions and restrictions expressed therein. In general, data mining is prohibited without the consent of the authors of the works being mined, except for scientific research purposes under Article 3 of the Directive or for other purposes, including commercial ones, under the terms of Article 4 of the Directive. The Article 4 exception applies if the mining entity has “lawful access” to the mined data (for example, if the data in question is publicly available) and the holders of the relevant rights have not reserved the use of their works (data) for data mining in a machine-readable format (e.g. via robots.txt or relevant terms of use).

In other words, if rights holders exercise such a data mining “opt-out”, the data in question may not be mined or used by anyone to train AI, as doing so would infringe the exclusive rights of third parties.
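
For illustration only, the sketch below shows how a crawler might check one common machine-readable reservation format, robots.txt, before mining a page. This is a minimal, simplified Python example: the user agent string "ExampleAIBot" and the URL are hypothetical, and actual compliance with Article 4(3) of the Directive may require honouring other opt-out formats (such as terms of use or emerging metadata standards) as well.

```python
# Minimal sketch (hypothetical): checking a robots.txt reservation before
# mining a page. robots.txt is only one possible machine-readable opt-out
# format under Article 4(3) of the Directive; this is not a complete
# compliance mechanism.
from urllib.parse import urlsplit
from urllib import robotparser


def may_mine(page_url: str, user_agent: str = "ExampleAIBot") -> bool:
    """Return True if the site's robots.txt does not reserve the page
    against this (hypothetical) crawler user agent."""
    parts = urlsplit(page_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetches and parses the site's robots.txt
    return parser.can_fetch(user_agent, page_url)


# Example usage with a placeholder URL:
if __name__ == "__main__":
    url = "https://example.com/articles/some-work.html"
    print("mining permitted:", may_mine(url))
```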

According to the current draft of the Act, the EU seems to be taking the path of relatively strict regulation. For example, the recitals to the Act state that any provider placing an AI system on the single market should comply with EU copyright law, regardless of the jurisdiction in which the underlying models are trained.

The aim is to ensure a “level playing field” among providers, preventing anyone from gaining a competitive advantage by training an AI model in a jurisdiction outside the EU where data mining laws are less restrictive. However, this concept may collide with the principle of territoriality of copyright (i.e. each country's legal system autonomously regulates the creation and scope of copyright, as well as interferences with it), and it therefore remains very unclear whether and how the concept will be applied and enforced in practice.

Approaches abroad

While the EU has chosen an approach that protects content rights holders, Japanese law, for example, appears to be much more favourable to developers of AI solutions. Under current copyright law there, authors are unlikely to be able to bring copyright infringement claims against developers who use their copyrighted works to compile AI datasets,[3] a conclusion recently confirmed, inter alia, at government level.[4]

In the U.S., data mining could fall under the “fair use” doctrine, which justifies the use of a work without the author's consent in some cases (assessed on a case-by-case basis – most often in the context of parody[5]).[6] In this regard, clarification of the respective rules could come in the wake of a recent lawsuit filed by The New York Times against OpenAI and Microsoft for copyright infringement, in which The New York Times claims that “millions of articles from The New York Times were used to train chatbots that now compete with it”.[7] For its part, OpenAI argues that training AI models using publicly available Internet resources constitutes “fair use”, which is “supported by long-standing and widely accepted precedents” and which the company considers “fair to creators, necessary for innovators, and critical for US competitiveness”.[8] Given the potentially precedential nature of this lawsuit, it will be interesting to see which of these views prevails in court.

Possible implications

In contrast to the above rules in other jurisdictions, the EU’s planned approach is quite restrictive for developers and may represent another reason[9] why major tech players could choose to develop their AI solutions outside the EU or distribute them in the EU only in a very limited form. Microsoft’s investment in Mistral AI,[10] a French AI startup, could be an early indication of this trend, pointing to the potential wider implications of an over-regulated AI market in the EU.

This is because the strict European regulatory conditions are unlikely to be achievable for most entities, given the vast amount of training data and the way it is collected (i.e. by automated scraping across the internet). 

Moreover, the current wording of the Act entails many other related issues. For example, how should providers proceed where AI models were already trained before the Act takes effect?[11] Does this mean that such models will need to be “de-trained” or have the affected data substantially filtered out (which will be either technically impossible or at least very costly)? And will data mining in violation of the Directive be considered copyright infringement per se, such that even a filtering mechanism in the AI model preventing the generation of “infringing” content would not retroactively remedy the infringement?

As is evident from the above, the link between the Act and EU copyright law raises considerable ambiguity. It also raises the question of whether, rather than navigating these complexities or providing only limited services, technology companies might simply opt for a regional “opt-out”, focusing their services on developing regions with relaxed regulation or on developed countries with a more permissive approach towards developers.

Conclusion

In terms of copyright law, the current wording of the Act is very ambiguous. This is one of the reasons why we believe that a more balanced and innovative approach to the above issues might be to introduce a safe harbour mechanism similar to the one companies can rely on in relation to liability for third-party online content.[12]

A safe harbour mechanism for AI developers would, in our view, provide a clear and understandable legal framework that takes into account the challenges and realities of developing AI solutions, particularly in terms of data mining and the use of copyright works for training purposes. With a structured and flexible legal framework, AI developers would have greater legal certainty about their obligations, the breach of which would give rise to liability towards the relevant rights holders.

It remains to be seen how the Act and the discussion about its wording will evolve. If the EU continues to crack down on AI developers and sanction them under a strict liability regime, even where they make all reasonable efforts to prevent infringements of third-party rights, it will very quickly send the figurative AI train off to Japan or other like-minded countries.
