The digital library Internet Archive supports a broader text and data mining exception under the UK’s copyright law, it says in its submission to the government’s consultation on artificial intelligence (AI) and copyright. This consultation has asked people for responses on different ways the government should consider updating the copyright law, with one of the ways being a broad data mining exception that would allow companies to mine copyrighted works without the right holders’ permission.
Other approaches include making no changes to the current regulations and allowing data mining unless copyright holders opt out. Of all the approaches, the government recommends allowing copyright holders the option to opt out. This approach would also require companies to include a machine-readable version of their reservation of rights in their works.
Internet Archive mentions that all the other approaches that the government suggested in its consultation would “effectively allow publishing industries to act as gatekeepers to AI innovation; this will hold the UK back in ways that will make it impossible to compete with jurisdictions that have no such restrictions.” It also says that it is because of text and data mining (TDM) techniques that US researchers have published more impactful papers than researchers in other countries.
The need for broader data mining exceptions:
UK government’s approach is not ambitious enough:
Internet Archive claims that the opt-out approach will “likely make the UK’s standing in AI markets worse”. It adds that this approach does not meet the objectives set out under the UK’s AI Opportunities Action Plan. This plan, released in January this year, seeks to lay the foundations to enable AI, including through securing sufficient compute for AI operations and “unlocking datasets in the public and private sectors” to enable innovation by UK startups.
Charging companies for mining facts:
The digital library claimed that, apart from the broader data mining exception, all the approaches expand the scope of rights under copyright law, thus allowing copyright holders to “extract payment for underlying facts and information that have nothing to do with creative expression”.
Cultural heritage content in datasets:
“The Internet Archive, Google, and others in the United States, have digital cultural heritage materials widely pursued as datasets for US-based AI organizations,” Internet Archive says. A broader data mining exception would allow AI companies in the UK to do the same.
Collective licensing hinders innovation:
Collective licensing is where creators assign the rights of their copyrighted works to collective management organisations (CMOs) who then license the creators’ works. As a part of the opt-out approach, the UK government had suggested that copyright holders could use collective licensing to seek remuneration from AI companies.
Internet Archive argues against collective licensing, stating that AI companies seek massive quantities of data and, in such a case, remuneration for individual creators is unlikely to be substantial. Further, within these collective licensing schemes, larger copyright holders will end up becoming gatekeepers on what data can and cannot be used for AI. This would mean that only larger companies that can afford to pay for data will be able to access training datasets.
Why it matters:
The UK’s proposed changes to copyright law saw a flood of dissent from the creative community, who fear that AI companies will use their copyrighted works without any remuneration. They argue that the government’s suggested opt-out approach has a major flaw in that people do not even know which companies are trying to mine their data. If creators object even to the opt-out approach, which the Internet Archive calls “not ambitious enough,” they would naturally also oppose a broader data mining exception.
The intersections between AI and copyright law are a major topic of debate across jurisdictions. In India, OpenAI is embroiled in a legal battle against news agency ANI, with the Federation of Indian Publishers and Digital News Publishers Association (DNPA) also looking to join the case. Given the ongoing debate around the topic, perspectives in the UK consultation are likely to shape arguments around the scope of control copyright holders have over their works.
Advantages of data mining exceptions:
Internet Archive said that a clear text and data mining exception would allow libraries (like itself) to “partner with the UK government, universities, and other mission-aligned organisations to work on public-interest AI tools that could benefit society as a whole”. So far, only for-profit businesses have access to the data and computational resources needed for utilising modern AI tools. The digital library argues that if the UK government enforces a copyright regime that requires licenses for training AI models on copyrighted works, it will further centralise control in the hands of a few companies.
Further, the Internet Archive says that the exemption for data mining does not have to be at the expense of copyright holders. Copyright holders will continue to have protections in the form of the enforcement of their copyright “against the particular outputs of AI systems that are themselves infringing.”
Arguments against web-crawler regulations:
The Internet Archive argues that regulating web crawlers can adversely impact public interest by preventing legitimate uses of publicly available data. It also mentions that its own work (in terms of preserving historical records) heavily relies on web crawling. Beyond the Internet Archive, web crawling also supports search engines, which make the web navigable.
Should AI companies share their data sources?
While the government should encourage AI companies to disclose where they get training data from, there is no need to mandate such disclosure, the Internet Archive explains.
“As a library, the Internet Archive seeks to support information integrity and authenticity. In the digital realm, this is often realised through metadata practices. Ideally, these provide enough context and history for individuals to understand, in a verifiable manner, who produced or created a particular item, where it came from, and whether that item has been changed over time,” it says.
The digital library also argues that given the scale of training datasets, if the government asks AI companies to add sources to the responses their chatbots generate, the result would be a list of millions or billions of URLs. Researchers would not be able to look at these lists and identify where a specific idea came from.
Some questions:
- What would be the impact of the other approaches that the UK government proposed on the Internet Archive?
- The Internet Archive argues that web-crawler regulations could negatively impact public interest initiatives. Do you see a need for any safeguards or best practices to ensure responsible web crawling without infringing on content creators’ rights?
- Given concerns from the creative community about unauthorized use of copyrighted works, how would you address fears that a broader exemption could undermine creators’ ability to monetize their content?
We have reached out to the Internet Archive with these questions and will update the story based on their responses.