The adoption of AI for commercial purposes shows no signs of slowing down. Whether it is used for decision-making, data modelling, or generating content, AI is fast becoming an indispensable tool for many businesses. For all its convenience, however, the use of AI is not without risks and, as the law struggles to catch up, those risks may lie in wait, presenting significant costs down the line.
In this post, we take a look at how AI is trained and the issues this presents, particularly where intellectual property rights are concerned.
AI takes a range of forms and has a range of applications. In this post, we are focusing on AI models that are trained using vast amounts of pre-existing content, including data, text, images, and video. Where does this content come from? Many AI models are trained on publicly available content which is “scraped” from the internet. Text and data mining (“TDM”) is a technique that is often used in such training, although it should be noted that TDM and AI model training are separate activities, with the former typically preceding the latter. Indeed, as Recital 105 to the EU’s Artificial Intelligence Act explains:
“General-purpose AI models, in particular large generative AI models, capable of generating text, images, and other content, present unique innovation opportunities but also challenges to artists, authors, and other creators and the way their creative content is created, distributed, used and consumed. The development and training of such models require access to vast amounts of text, images, videos, and other data. Text and data mining techniques may be used extensively in this context for the retrieval and analysis of such content, which may be protected by copyright and related rights.”
The intellectual property issues involved may seem obvious, and yet they appear to elude those developing and training AI models, whether through deliberate choice or a lack of understanding of IP law. While other IP rights such as database rights may be relevant in some cases, copyright is the chief concern given that the text, images, videos, music, and other web content scraped from the internet for training AI models will generally be copyright works. A further issue, as noted below, concerns the ownership of content created using AI.
The question is, how can the issue be resolved? Content owners are having their valuable IP used without their consent and without remuneration, and AI developers are either knowingly or blindly risking IP infringement actions being brought against them. Here we consider three possible solutions: licence agreements, legal exceptions, and technical measures.
Intellectual Property
Where content is protected by IP rights, using it for AI training purposes potentially amounts to infringement. As noted above, this harms creators and content owners who are effectively denied a source of revenue by having their material unlawfully used for free and exposes AI developers to significant legal risks.
To cite but one example, Getty Images brought an action against Stability AI for copyright infringement, database right infringement, trade mark infringement, and passing off. Getty alleged that Stability AI (developer of Stable Diffusion) had scraped millions of images from Getty’s websites. Furthermore, Getty argued that the outputs from Stable Diffusion also reproduced its IP.
Licence Agreements
One solution to the scraping and TDM problem is licence agreements. Without a legal exception to rely on (see below), some AI developers are already choosing this option in order to train their AI models without infringing the IP rights of others.
A number of large content organisations, including AP, the Financial Times, News Corp, and Condé Nast, have already entered into licensing arrangements with AI developers.
Where such high-profile organisations are involved, these deals carry the additional benefit of raising the profile of the AI developer by association.
Licensing agreements are not an all-encompassing solution, however. For one thing, small businesses may not always benefit. Small AI developers may not be in a position to afford the licensing fees likely demanded by large organisations such as those noted above, and small content creators and owners are arguably less likely to be approached by AI developers. This problem becomes all the more acute when considering the bargaining power imbalance between SMEs and “big tech”. Moreover, smaller creators are more likely to host their content on larger services, raising the question of their compensation if the larger service enters into a licence agreement with an AI developer.
Finally, not every content creator or owner wants to license their content for AI training purposes, or have it scraped. This either limits the availability of content for AI developers to use for AI training or deprives rightsholders of revenue and control over their content if it is scraped irrespective of their wishes.
Exceptions in the Law
Another solution, which is arguably very one-sided, is to create an exception in the law to allow scraping and TDM for AI training. Here in the UK, there is currently no such exception for commercial use: Section 29A of the Copyright, Designs and Patents Act 1988 provides a TDM exception only for non-commercial research purposes.
TDM exceptions themselves are becoming increasingly widespread and indeed have potential benefits that extend far beyond the scope of retrieving and analysing content for training AI models. The EU, Japan, and Singapore, for example, all have TDM exceptions that cover both commercial and non-commercial purposes. An important aspect of the EU approach is the ability for content owners to opt-out – a contentious issue which we will discuss in more depth shortly. The US, on the other hand, does not have a specific TDM exception, but this may be counterbalanced by the broader approach to “fair use” in US copyright law.
Meanwhile, back in the UK, in 2022 the previous government proposed an extension to the S29A TDM exception that would have allowed a broader range of purposes. In 2023, however, the proposal was dropped following objections from the creative industries. The Culture, Media and Sport Committee, in its report Connected Tech: AI and Creative Technology (external link), concluded that the government should not pursue an expansion of the exception and should focus on licensing instead.
Fast forward to 2024 and the new Labour government may be reconsidering the issue. As reported in The Guardian (external link) on 26 October, the government is facing opposition from content owners, including the BBC, to plans that would allow AI developers to train AI models on online content by default unless content owners have opted out (in a similar vein to the EU approach). It is clear that the government will face pressure from both sides of the divide. The same article notes that Google has warned that the UK risks being left behind unless AI developers are allowed to use IP-protected content for AI training.
In addition to strong and numerous objections from rightsholders, Dame Caroline Dinenage MP, chair of the Culture, Media and Sport Committee, wrote to Lisa Nandy MP, the Secretary of State for Culture, Media and Sport, voicing her concerns about “recent comments that ‘New law may be needed to end AI copyright dispute’”, further commenting on X (external link) that she is “increasingly worried that this Government is sleepwalking into a policy with disastrous implications for our cultural & creative industries”.
So, is a model similar to the EU’s approach feasible? On the face of it, it may appear to be a neat solution: AI developers get to use the content they need to train their systems while content owners are free to object. In practice, however, there are problems with the opt-out model, and a number of publishers, news organisations, and celebrities have voiced their opposition to it, labelling the unlicensed use of their work to train AI a “major, unjust threat” to artists’ livelihoods (as reported in The Times (external link, paywall) on 22 October). As reported in The Guardian article (above), critics who instead favour an opt-in approach (i.e., licence agreements) argue that the opt-out model is impractical because content owners may not know when their content is being scraped – the technical equivalent of the horse being a mere dot on the horizon before anyone realises that the barn door needs closing.
If the onus is to fall on AI developers to seek consent in the form of licence agreements rather than relying on an (arguably unfeasible) opt-out mechanism, how effective is this likely to be? Smaller developers may not have sufficient awareness or understanding of the IP issues involved, and the sheer power (and financial resources) of big tech leads one to question how trustworthy such companies may be in this context.
While the government is yet to address the topic in detail, The Times (external link, paywall) on 28 October quoted the Prime Minister, Sir Keir Starmer, as saying that news organisations should be paid for allowing the use of their work for AI training, and as promising to protect journalism from the threat posed by big tech. The same article quotes the Prime Minister as saying that “both artificial intelligence and the creative industries – which include news media – are central to this government’s driving mission on economic growth”. While such statements are indeed encouraging, it is questionable how much protection smaller creators and content owners may receive if policy focuses on the bigger players.
Technical Measures
It is clear that law and policy will take time to catch up with technology, as is so often the case, and that larger organisations on both sides of the equation will have the greatest influence over the shape of that law and policy.
In the meantime, what can smaller rightsholders do to protect their content from scraping and use in AI training? Certain technical and UX / UI methods can be effective, to a greater or lesser degree. These include paywalls, specific prohibitions in terms and conditions (coming soon to our own range of Website Terms & Conditions templates), and using robots.txt files to block AI scraping bots.
Paywalls represent something of a double-edged sword, particularly given how accustomed users have become to accessing online content for free, even when bombarded with ads. Many users prefer an arguably inferior, ad-supported experience to a paid one, even where the latter is objectively better.
Terms and conditions are an important component of any website and should state clearly how the website may be used, setting out users’ rights and obligations, any applicable restrictions on use, asserting intellectual property rights, and providing other key information. By including a specific clause prohibiting scraping and TDM, a website’s terms and conditions effectively serve as an opt-out, as discussed above. It is, therefore, important to bring your terms and conditions to users’ attention and, ideally, to require users to read and accept them before your site and content can be accessed. Ultimately, however, the effectiveness of this approach relies on the diligence, honesty, and integrity of the would-be AI developer looking to acquire training data.
What about more technical measures to prevent scraping? A website’s robots.txt file can be used to block scraping bots. When using this method, it is important to be aware of the difference between web scrapers and web crawlers. Web crawling is used by search engines to discover and index the information on web pages; crawlers capture only generic information and can be valuable to content owners who want their websites to be effectively found online. Web scraping, on the other hand, is used to extract specific information and content from sites. It is therefore important to add only the scraping bots that you wish to block to your robots.txt file.
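By way of illustration, here is a minimal robots.txt sketch that blocks some well-known AI training bots while leaving ordinary search crawlers untouched. The user-agent names shown (GPTBot, CCBot, and Google-Extended) are examples current at the time of writing; lists of AI bots change frequently, so verify each entry against the operator’s own documentation before relying on it:

```
# Block known AI training bots (example names only; verify against
# each operator's current documentation before relying on this list)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other bots, including ordinary search engine crawlers, are unaffected
User-agent: *
Allow: /
```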
This method, however, is also far from perfect. Not only are we once again back to an opt-out situation, but, as noted above, content owners need to know about the scraping bots before the scraping takes place in order to exclude them. Moreover, robots.txt files are not inviolable: compliance is entirely voluntary, and bots can simply be coded to ignore them.
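For completeness, a further machine-readable signal worth watching is the TDM Reservation Protocol (“TDMRep”), a W3C Community Group draft designed to express the kind of EU-style opt-out discussed above. As a hedged sketch, assuming the field names used in the draft, a site might embed the reservation in its pages like this (the policy URL is purely illustrative):

```
<!-- TDMRep sketch: reserves text and data mining rights for this page.
     Field names follow the W3C Community Group draft; as with robots.txt,
     compliance by bots remains voluntary. The policy URL is illustrative. -->
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://www.example.com/tdm-policy.json">
```

Like robots.txt, however, such signals bind only those who choose to honour them.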
It remains clear, then, that better legal protections are required.
Clarification Required
Thus far, there has been surprisingly little discussion about the infringement risks for end users of AI models that have been trained on scraped data.
On the question of ownership, the author of a computer-generated work is defined in Section 9(3) of the Copyright, Designs and Patents Act 1988 as “the person by whom the arrangements necessary for the creation of the work are undertaken”. Whether or not the threshold of originality required for copyright (requiring sufficient skill, labour, and judgement) is met by a user entering prompts into an AI model is at the very least open to question, but beyond the scope of this post.
Whether or not they qualify for copyright protection themselves, AI-generated works can infringe IP rights and, since the AI itself cannot be held liable, it is the user who would be liable if a rightsholder proved that the AI’s output infringed their IP.
Moreover, while the focus of legal reform appears to be on broadening the TDM exception in the Copyright, Designs and Patents Act, a significant concern is that such an exception would benefit only those using content to train AI models (as, one assumes, would a standard licence agreement of the kind discussed above). This leaves unresolved the position of end users and their potential liability for IP infringement.
Conclusion
As has happened so many times before, we are once again faced with a classic example of technological development outpacing the law. Many jurisdictions are making progress in various forms, but it is clear that the solutions are not yet perfect.
Those developing AI and using training data would be wise to exercise caution and consider the rights subsisting in the content that they wish to use, taking voluntary steps such as entering into licence agreements with rightsholders before scraping content and using it for AI training.
Content owners should also consider what steps they can take to avoid their content being scraped and used in this way. Updating terms and conditions is one helpful step to clarify the position, at least in the case of more IP-aware developers. Paywalling content may also help, as noted above, but this can be a difficult decision depending on your business model, given the potential to exclude large numbers of users who are not willing to pay. Technical steps such as the aforementioned robots.txt method are also advisable, but should be taken with care and require a proactive approach.
End users of AI should be similarly cautious. As noted above and in our previous AI blog post, the outputs produced by generative AI models could potentially infringe another party’s IP rights, leaving the user open to the possibility of infringement proceedings.
It is clear that there are many more questions than answers in this area at present, and that is unlikely to change in the immediate future. With the Budget dominating the government’s business until now, it is not altogether surprising that we have not yet seen a great deal of attention paid to this issue. Will that change in the coming weeks and months? Only time will tell.