The Library Of Congress Is A Training Data Playground For AI Companies

With archives hosting about 180 million works, the world’s largest library is drawing interest from AI startups looking to train their large language models on content that won’t get them sued.


Black and white portraits of Rosa Parks, letters penned by Thomas Jefferson and The Giant Bible of Mainz, a 15th-century manuscript known to be one of the last handwritten Bibles in Europe. These are among the 180 million items, including books, manuscripts, maps and audio recordings, housed within the Library of Congress.

Every year, hundreds of thousands of visitors walk through the library’s high-ceilinged, pillared halls, passing beneath Renaissance-style domes embellished with murals and mosaics. But of late, the more than 200-year-old library has attracted a new type of patron: AI companies eager to access the library’s digital archives — and the 185 petabytes of data stored within them — to develop and train their most advanced AI models. “We know that we have a large amount of digital material that large language model companies are very interested in,” Judith Conklin, chief information officer at the Library of Congress (LOC), told Forbes.

“It's extraordinarily popular.”

The upsurge in interest in the library’s data is also reflected in the numbers. The congress.gov site, which is managed by the LOC and hosts data about bills, statutes and laws, gets anywhere from 20 million to 40 million monthly hits on its API, an interface that allows programmers to download the library’s data in a machine-readable format. Conklin said traffic to the congress.gov API has grown consistently since it became available in September 2022.
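To make the idea of a machine-readable interface concrete, here is a minimal sketch of a congress.gov API call in Python. It is hypothetical: the `/v3/bill` endpoint shape, the `CONGRESS_API_KEY` environment variable and the response fields are assumptions based on the API's public v3 documentation, not details from this article.

```python
import os

import requests

# Hypothetical sketch: list a few recent bills via the congress.gov v3 API.
# Assumes a free API key (issued through api.data.gov) exported as CONGRESS_API_KEY.
API_KEY = os.environ["CONGRESS_API_KEY"]

resp = requests.get(
    "https://api.congress.gov/v3/bill",
    params={"api_key": API_KEY, "format": "json", "limit": 5},
    timeout=30,
)
resp.raise_for_status()

# The v3 /bill endpoint is assumed to return JSON with a top-level "bills" list.
for bill in resp.json().get("bills", []):
    print(bill.get("number"), "-", bill.get("title"))
```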

The library’s own API now gets about a million visits every month. Its digital archives host an abundance of rare, original and authoritative information. The data is also diverse: the collections feature content in more than 400 languages, spanning art, music and most disciplines.

But what makes this data especially appealing to AI developers is that these works are in the public domain, neither copyrighted nor otherwise restricted. While a growing group of artists and organizations is locking up its data to prevent AI companies from scraping it, the Library of Congress has made its data reserves freely available to anyone who wants them. For AI companies that have already mined the entirety of the internet, scraping everything from YouTube videos to copyrighted books to train their models, the Library is one of the few remaining “free” resources.

Otherwise, they must strike licensing deals with publishers or rely on AI-generated “synthetic data,” which can be problematic, leading to degraded responses from the model. The only caveat: people who want the Library’s data must collect it via the API, a portal through which anyone, from a genealogist to an AI researcher, can download data. They are prohibited from scraping content directly from the site, a common practice among AI companies and one that Conklin said has become a real “hurdle” for the library because it slows public access to its archives.
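As an illustration of what downloading through such a portal might look like, the sketch below queries the library's public loc.gov JSON API, where appending `fo=json` to a search URL returns machine-readable results. The parameter names and response fields here are assumptions drawn from the library's published API documentation rather than from this article.

```python
import requests

# Hypothetical sketch: search the loc.gov JSON API for digitized items.
# fo=json requests machine-readable output; "c" is assumed to cap results per page.
resp = requests.get(
    "https://www.loc.gov/search/",
    params={"q": "Rosa Parks", "fo": "json", "c": 5},
    timeout=30,
)
resp.raise_for_status()

# Each result is assumed to carry a "title" and an "id" (its canonical URL).
for item in resp.json().get("results", []):
    print(item.get("title"), "->", item.get("id"))
```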

“There are others who want our data to train their own models, but they want it fast and so they just scrape our websites,” she said. “If they're hurting the performance of our websites, we have to manually slow them down.”

The hunt for data is just one part of the story.

Companies like OpenAI, Amazon and Microsoft are also courting the world’s largest library as a customer. They claim AI models can help librarians and subject matter specialists with tasks like navigating catalogs, searching records and summarizing long documents. This is certainly possible, but there are some rough edges that need to be ironed out first.

Natalie Smith, the LOC’s director of digital strategy, told Forbes that AI models, trained on contemporary data, sometimes struggle with historical accuracy — identifying a person holding a book as someone holding a cell phone, for example. “There is an overwhelming bias towards today's times and so they often apply modern concepts to historical documents,” Smith said. Beyond that lies the risk of hallucination: models propagating inaccurate information about the works in the world’s largest library.

In March, the Congressional Research Service, a research institute that is part of the LOC, announced that it is developing AI models to write bill summaries, in the hopes that the tool could help clear a backlog of thousands of pending reports. But in tests, the model repeatedly hallucinated. It listed the District of Columbia as a U.S. state in a bill that outlined the definition of a “state,” and incorrectly claimed that students from Taiwan and Hong Kong would be affected by a bill that prohibited student visas for specific Chinese citizens.

While it’s carefully considering how to use AI tools internally, the Library wants to make more of its unrestricted data available to the world.

In the coming years, it plans to digitize more of its special collections, a boon for the public. That AI companies will also make use of it is inevitable. “Libraries and federal agencies have been the backbone of data that has spurred the economy in so many different ways,” Smith said.

“We often say you wouldn't really have had an Uber without having geospatial data that came from a federal agency.”