New York Times accuses OpenAI of deleting crucial evidence in copyright lawsuit

Tensions between The New York Times and OpenAI have intensified in their ongoing copyright lawsuit. The Times has accused OpenAI of accidentally erasing vital evidence that its legal team had spent over 150 hours extracting, an action that could have significant implications for the case. According to the newspaper’s legal team, OpenAI’s engineers unintentionally deleted data that was essential for determining whether its articles were used in training OpenAI’s AI models, including the widely used ChatGPT.

Although OpenAI managed to recover some of the lost data, the Times claims that the absence of original file names and folder structures makes it impossible to trace where its articles were incorporated into OpenAI’s models. In a court filing, the Times’ lawyer, Jennifer B. Maisel, noted that the missing information hindered the identification of potential copyright violations.

The Times has been embroiled in a lawsuit against both OpenAI and Microsoft, accusing them of unlawfully using its articles to train AI tools without permission. This case is one of several ongoing legal battles between publishers and AI companies over the use of copyrighted content in training AI systems. OpenAI has yet to publicly disclose the specifics of the data used to train its models, which makes the Times’ legal pursuit particularly important.

As part of the discovery process, the court required OpenAI to share its training data with the Times, which led to the creation of a “sandbox” environment. In this space, the Times’ legal team could examine the data used to build OpenAI’s AI models. However, the data that was supposed to be organised by the Times’ team was reportedly deleted.

Although OpenAI admitted to the mistake, it has not been able to fully restore the data in its original form. As a result, the Times has had to redo much of its work, causing significant delays and additional costs. OpenAI has denied any malicious intent behind the deletion, calling it a technical glitch. A spokesperson for the company stated that it would file a formal response to the claims soon.

Despite this, the deletion of data has added fuel to the fire of an already contentious legal dispute. The Times’ legal team stressed that it was crucial for OpenAI to provide a complete and organised set of training data to properly assess any infringement. The lawsuit has also highlighted ongoing disputes over who is responsible for sorting through the data.

The Times has argued that OpenAI is in the best position to handle this task, as it holds the most information about how the models were trained. In addition, the Times has demanded further documentation, including Slack messages and social media conversations between key OpenAI figures, in an effort to strengthen its case. As the legal proceedings unfold, the Times and OpenAI continue to clash over the scope of the case.

Microsoft, which is also named in the lawsuit, has requested that the Times turn over documents related to its own use of generative AI, including materials concerning how its tech columnists engage with such tools. Beyond the courtroom, OpenAI is pursuing licensing deals with other major publishers like The Atlantic, Axel Springer, and Condé Nast. Lawsuits such as this one will have far-reaching consequences for how AI companies operate in the US and may set important precedents for content licensing and the use of copyrighted material in training artificial intelligence.

The outcome of the lawsuits could reshape the future of AI regulation and its relationship with the media industry.