Copyright-ignoring AI scraper bots laugh at robots.txt so the IETF is trying to improve it

Recently formed AI Preferences Working Group has August deadline to develop ideas on how to tell crawlers to go away, or come for a feast

The Internet Engineering Task Force has chartered a group it hopes will create a standard that lets content creators tell AI developers whether it’s OK to use their work. Named the AI Preferences Working Group (AIPREF), the group has been asked to develop two things: a common vocabulary for expressing preferences about how AI uses content, and a way to attach those preferences to content. The AIPREF charter suggests “attaching preferences to content either by including preferences in content metadata or by signaling preferences using the protocol that delivers content” as the ways to get the latter done. AIPREF co-chair Mark Nottingham thinks those items are needed because current systems aren’t working.

He thinks the “non-standard signals” that publishers currently place in robots.txt files – the IETF-standardized format (RFC 9309) that tells crawlers whether they may access web content – aren’t working. “As a result, authors and publishers lose confidence that their preferences will be adhered to, and resort to measures like blocking their [AI vendors’] IP addresses.”

Content creators resort to IP blocking because major model-makers did not ask for permission or seek licenses before scraping the internet for the content needed to train their AIs. OpenAI is now lobbying for copyright reform that would allow it to scrape more content without payment. Copyright-holders are fighting back with lawsuits against those who used copyrighted material to build models, and with licensing deals under which AI players pay to access content.
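The robots.txt mechanism Nottingham refers to is simple for a well-behaved crawler to honor. As a minimal sketch, Python's standard library can parse exclusion rules and answer "may this agent fetch this URL?" (the rules below are illustrative; GPTBot is OpenAI's published crawler user agent):

```python
# Check robots.txt rules the way a well-behaved crawler would,
# using Python's standard-library Robots Exclusion Protocol parser.
from urllib.robotparser import RobotFileParser

# Illustrative policy: block GPTBot everywhere, allow everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))         # False
print(parser.can_fetch("SomeSearchBot", "https://example.com/article"))  # True
```

The catch, of course, is that nothing in the protocol forces a crawler to ask before fetching – compliance is voluntary, which is exactly the gap AIPREF is chartered to narrow.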

AI crawlers are also costing publishers money. The Wikimedia Foundation recently complained that the bandwidth it devotes to serving image retrieval requests has risen by 50 percent over the last year, mostly due to AI crawlers downloading material. The IETF doesn’t concern itself with those legal and operational matters: It just wants to build tech that enables people to express their preferences, in the hope that scraper operators buy in and ingest only content that creators are happy to have fed into AIs.

To get the ball rolling, AIPREF met at the IETF 122 conference in mid-March, and has already developed two draft proposals. One proposes “Short Usage Preference Strings for Automated Processing” and suggests those strings could be used in robots.txt files or HTTP header fields.
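Neither draft’s final syntax is settled, so purely as an illustration – the `Content-Usage` field name and its value here are hypothetical, not taken from either draft – a robots.txt that pairs today’s exclusion rules with a short usage preference string might look something like:

```
# Standard Robots Exclusion Protocol rules (RFC 9309)
User-agent: GPTBot
Disallow: /

# Hypothetical short usage preference string; the real
# syntax is still being worked out in the AIPREF drafts
Content-Usage: train-ai=n
```

The same short string could, per the draft, travel as an HTTP response header instead, so preferences reach crawlers that fetch content without ever reading robots.txt.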

The other, from the Common Crawl Foundation, is titled “Vocabulary for Expressing Content Preferences for AI Training”. It likewise suggests that preference syntax be stored in robots.txt files or HTTP header fields, and proposes the vocabulary for use beyond those channels too.

The Working Group has given itself a deadline of August 2025 to deliver proposals. Participants seem to know that’s a tight deadline, and that the group will therefore need to act with some urgency. ®