OpenAI wants to bend copyright rules. Study suggests it isn’t waiting for permission

GPT-4o likely trained on O'Reilly books without permission, figures appear to show


Tech textbook tycoon Tim O'Reilly claims OpenAI mined his publishing house's copyright-protected tomes for training data and fed it all into its top-tier GPT-4o model without permission. This comes as the generative AI upstart faces lawsuits over its use of copyrighted material, allegedly without due consent or compensation, to train its GPT-family of neural networks. OpenAI denies any wrongdoing.

O'Reilly (the man) is one of three authors of a study [PDF] titled, "Beyond Public Access in LLM Pre-Training Data: Non-public book content in OpenAI's Models," issued by the AI Disclosures Project. By non-public, the authors mean books that sit behind a paywall and aren't publicly available to read for free, unless you count sites that illegally pirate this kind of material. The trio set out to determine whether GPT-4o had, without the publisher's permission, ingested 34 copyrighted O'Reilly Media books.



To probe the model, which powers the world-famous ChatGPT, they performed so-called DE-COP inference attacks described in this 2024 preprint. Here's how that worked: The team posed OpenAI's model a series of multiple-choice questions. Each question asked the software to select, from a group of paragraphs labeled A to D, the one that is a verbatim passage of text from a given O'Reilly (the publisher) book.

One of the options was lifted straight from the book; the others were machine-generated paraphrases of the original. If the OpenAI model tended to answer correctly and identify the verbatim paragraphs, that suggested it was probably trained on that copyrighted text. More specifically, the model's choices were used to calculate what's dubbed an Area Under the Receiver Operating Characteristic (AUROC) score, with higher figures indicating a greater likelihood the neural network was trained on passages from the 34 O'Reilly books.
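For the curious, the quiz setup can be sketched roughly as below. This is an illustrative reconstruction of the idea, not code from the study, and `build_decop_question` is a hypothetical helper name.

```python
import random

def build_decop_question(verbatim: str, paraphrases: list[str]) -> tuple[str, str]:
    """Mix one verbatim passage with machine-generated paraphrases,
    shuffle them into options A-D, and record the correct letter."""
    options = [verbatim] + list(paraphrases)
    random.shuffle(options)
    letters = "ABCD"
    answer = letters[options.index(verbatim)]
    prompt = "Which option is a verbatim passage from the book?\n" + "\n".join(
        f"{letters[i]}. {text}" for i, text in enumerate(options)
    )
    return prompt, answer
```

A model that reliably picks the right letter across many such questions is, by the study's logic, likely to have seen the verbatim text during training.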

Scores closer to 50 percent, meanwhile, were considered an indication that the model hadn't been trained on the data. Testing of OpenAI models GPT-3.5 Turbo and GPT-4o Mini, as well as GPT-4o, across 13,962 paragraphs uncovered mixed results.
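As a rough illustration of the scoring step, AUROC can be computed from per-paragraph labels and model scores using the standard pairwise-ranking formula. The function below is a plain-Python sketch, not the researchers' code.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """Probability that a randomly chosen positive (suspected training
    paragraph) outscores a randomly chosen negative; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A score near 0.5 means the model can't distinguish suspected training text from paraphrases any better than chance, while values approaching 1.0 point toward memorization.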

GPT-4o, which was released in May 2024, scored 82 percent, a strong signal it was likely trained on the publisher's material. The researchers speculated OpenAI may have trained the model using the LibGen database, which contains all 34 of the books tested. You may recall Meta has also been accused of training its Llama models using this notorious dataset.

The AUROC score for 2022's GPT-3.5 model came in at just above 50 percent. The researchers asserted that the higher score for GPT-4o is evidence that "the role of non-public data in OpenAI's model pre-training data has increased significantly over time."

However, the trio also found that the smaller GPT-4o Mini model, also released in 2024 after a training process that ended at the same time as the full GPT-4o model's, doesn't appear to have been trained on O'Reilly books. They think that's not an indicator their tests are flawed, but that the smaller parameter count in the mini-model may impact its ability to "remember" text. "These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training," the authors wrote.

"Although the evidence present here on model access violations is specific to OpenAI and O'Reilly Media books, this is likely a systematic issue," they added. The trio – which included Sruly Rosenblat and Ilan Strauss – also warned that a failure to adequately compensate creators for their works could result in – and if you can pardon the jargon – the enshittification of the entire internet. "If AI companies extract value from a content creator's produced materials without fairly compensating the creator, they risk depleting the very resources upon which their AI systems depend," they argued.

"If left unaddressed, uncompensated training data could lead to a downward spiral in the internet's content quality and diversity."

AI giants seem to know they can't rely on internet scraping to find the material they need to train models, as they have started signing content licensing agreements with publishers and social networks. Last year, OpenAI inked deals with Reddit and Time Magazine to access their archives for training purposes.

Google also did a deal with Reddit. Recently, however, OpenAI has urged the US government to relax copyright restrictions in ways that would make training AI models easier. Last month, the super-lab submitted an open letter to the White House Office of Science and Technology Policy in which it argued that "rigid copyright rules are repressing innovation and investment," and that if action isn't taken to change this, Chinese model builders could surpass American companies.

While model-makers apparently struggle, lawyers are doing well. As we recently reported, Thomson Reuters won a partial summary judgment against Ross Intelligence after a US court found the startup had infringed copyright by using headnotes from Thomson Reuters' Westlaw legal research service to train its AI system. While neural network trainers push for unfettered access, others in the tech world are introducing roadblocks to protect copyrighted material.

Last month Cloudflare rolled out a bot-busting AI designed to make life miserable for scrapers that ignore robots.txt directives. Cloudflare's “AI Labyrinth” works by luring rogue crawler bots into a maze of decoy pages, wasting their time and compute resources while shielding real content.

OpenAI didn't immediately respond to a request for comment; we'll let you know if we hear anything back. ®