But how copyright holders flag their digital works to be excluded from generative AI scraping is not set in stone, and will likely need to be clarified by EU courts, experts say.
"Currently, there is no generally recognized market standard of what qualifies as such 'machine-readable means.' This arguably leads to legal uncertainty," Clemens Molle, an IP associate at Bird & Bird LLP, said.
The AI Act does not introduce any new exceptions for copyrighted works concerning generative AI. But it states that the use of copyrighted material to train AI models "requires the authorization of the rights holder" unless relevant exceptions and limitations apply.
Such exceptions include those outlined in Directive (EU) 2019/790, which introduced the option for rights holders to "reserve their rights over their works," to prevent text and data mining.
"Where the right to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorization from rights holders if they want to carry out text- and data-mining over such works," the act said.
But while protocols for excluding digital works from data scraping exist, IP experts say there is so far no standardized system for such machine-readable exemptions for AI training.
"Rights holders are currently using a range of different methods to attempt to reserve their rights, and consider they ought to be able to do so in the manner most convenient to them," Xuyang Zhu, a partner at Taylor Wessing LLP, said.
"Whereas AI developers would favor means that would allow them to more easily track and implement these 'opt-outs' in light of the technology they are using," she added.
The Robots Exclusion Protocol, or robots.txt, is a standard used by web developers to flag which parts of a given website can be visited by internet bots. Although the protocol was initially developed in 1994, websites now also use it to block bots that scrape the web for AI training data.
Some of the largest generative AI platforms, such as the Microsoft-backed OpenAI, publish information for developers on their own websites about the specific tags their bots use when scraping for data.
"A number of AI developers have been proactive on this topic by releasing information confirming exactly how to stop their services carrying out training on other content," Tanguy Van Overstraeten, the global head of data protection and a partner at Linklaters LLP, said.
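As a rough illustration of how such a robots.txt opt-out can work in practice, the sketch below uses Python's standard-library robots.txt parser to test whether a named crawler may fetch a page. The "GPTBot" tag, the example.com address and the helper name may_crawl are assumptions chosen for illustration and do not come from this article. Nothing in the sketch speaks to whether such a file counts as a "machine-readable" reservation under EU law.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow every other bot.
# "GPTBot" is used here as an illustrative crawler tag (an assumption).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/photos/1.jpg"  # placeholder URL
    print("GPTBot allowed:", may_crawl(ROBOTS_TXT, "GPTBot", page))
    print("Other bot allowed:", may_crawl(ROBOTS_TXT, "SomeOtherBot", page))
```

A file like this only signals the site owner's wishes. It takes effect only if the crawler operator chooses to read and honor it.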
The TDM Reservation Protocol is a more targeted alternative for text- and data-mining that explicitly abides by the constraints of Directive (EU) 2019/790.
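A minimal sketch of how a crawler might check for such a reservation follows. It assumes, based on the protocol's published community specification rather than on anything in this article, that a server can signal a reservation with a "tdm-reservation" response header set to "1" and can point to licensing terms with a "tdm-policy" header. The function names and the example policy address are hypothetical.

```python
from typing import Mapping, Optional


def _normalize(headers: Mapping[str, str]) -> dict:
    """Lower-case header names, since HTTP header names are case-insensitive."""
    return {name.lower(): value.strip() for name, value in headers.items()}


def tdm_rights_reserved(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers signal a TDM rights reservation.

    Assumption: a value of "1" in the "tdm-reservation" header means the
    rights holder has reserved text- and data-mining rights.
    """
    return _normalize(headers).get("tdm-reservation") == "1"


def tdm_policy_url(headers: Mapping[str, str]) -> Optional[str]:
    """Return the policy URL from the assumed "tdm-policy" header, if any."""
    return _normalize(headers).get("tdm-policy")


if __name__ == "__main__":
    # Hypothetical response headers from a rights holder's server.
    response_headers = {
        "Content-Type": "text/html",
        "TDM-Reservation": "1",
        "TDM-Policy": "https://example.com/tdm-policy.json",
    }
    print("Rights reserved:", tdm_rights_reserved(response_headers))
    print("Policy URL:", tdm_policy_url(response_headers))
```

Unlike a robots.txt rule aimed at a named bot, a signal of this kind addresses text- and data-mining as an activity, whichever crawler is doing it.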
Outside these protocols, the German courts are also considering the scope of copyright exceptions for text- and data-mining in Robert Kneschke v. LAION eV.
The case, launched by Kneschke, a photographer, alleges that the LAION 5B dataset included his copyrighted work found on his blog. Kneschke claimed that LAION scraped his website despite the inclusion of an opt-out there.
Judges heard arguments at trial on, among other points, whether Kneschke's specific opt-out was "machine-readable."
The decision in this case, according to Zhu, could offer further guidance on "where the TDM exception is engaged."
--Editing by Joe Millis.
For a reprint of this article, please contact reprints@law360.com.