How AI scraper bots are putting Wikipedia under strain


For more than a year, the Wikimedia Foundation, which publishes the online encyclopedia Wikipedia, has seen a surge in traffic with the rise of AI web-scraping bots. This increase in network traffic poses major infrastructure and cost management issues. The Wikimedia Foundation is a non-profit organisation that manages Wikipedia and other projects related to free knowledge.

The foundation is highlighting the growing impact of web crawlers on its projects, particularly Wikipedia. These bots are automated programs that mass-retrieve freely licensed articles, images and videos to train generative artificial intelligence models. Since January 2024, Wikimedia has seen a 50% increase in the bandwidth used to download multimedia content from its servers.



This increase is mainly attributed to these scraper bots, and now represents a significant burden on the foundation's operations. For example, when former US President Jimmy Carter died last December, his English-language Wikipedia page received more than 2.8 million views in one day, which is high but manageable.

But at the same time, numerous bots also "read" in full a 1.5-hour video of the 1980 presidential debate between him and Ronald Reagan, which doubled the usual network traffic and saturated access to the servers. For users, this meant a significant slowdown in page loading.

This shows that, in certain situations, Wikimedia can be significantly affected by the activity of these bots. The foundation emphasises the importance of implementing new mechanisms to manage this influx of traffic. The idea is, for example, to regulate bot-generated traffic, starting by limiting the number of requests per second that a bot can send to a site's servers, or by imposing a minimum delay between requests to avoid congestion.
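The foundation has not published the details of such a mechanism, but request throttling of this kind is commonly implemented with a token bucket that refills at a fixed rate. The Python sketch below is a hypothetical illustration only: the MAX_REQUESTS_PER_SECOND and BUCKET_CAPACITY values, the per-client keying and the class name are assumptions for the example, not Wikimedia's actual policy or code.

```python
import time
from collections import defaultdict

# Hypothetical limits for illustration; Wikimedia has not published real thresholds.
MAX_REQUESTS_PER_SECOND = 5.0
BUCKET_CAPACITY = 10.0  # allows a short burst before throttling kicks in


class TokenBucketLimiter:
    """Per-client token bucket: each request spends one token,
    and tokens refill at MAX_REQUESTS_PER_SECOND."""

    def __init__(self):
        # client id -> (available tokens, timestamp of last update)
        self._buckets = defaultdict(lambda: (BUCKET_CAPACITY, time.monotonic()))

    def allow(self, client_id: str) -> bool:
        tokens, last = self._buckets[client_id]
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at the bucket capacity.
        tokens = min(BUCKET_CAPACITY, tokens + (now - last) * MAX_REQUESTS_PER_SECOND)
        if tokens >= 1.0:
            self._buckets[client_id] = (tokens - 1.0, now)
            return True   # serve the request
        self._buckets[client_id] = (tokens, now)
        return False      # reject or delay (e.g. answer HTTP 429 Too Many Requests)


if __name__ == "__main__":
    limiter = TokenBucketLimiter()
    # Simulate a bot firing 20 requests back to back: the initial burst passes,
    # the rest are throttled until tokens refill.
    results = [limiter.allow("bot-203.0.113.7") for _ in range(20)]
    print(results.count(True), "allowed,", results.count(False), "throttled")
```

In practice a limiter like this would sit in front of the web servers and key requests by IP address or user agent, turning away traffic that exceeds the chosen rate instead of letting it saturate the infrastructure.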

It could also be necessary to develop algorithms capable of differentiating real visitors from bots, or even to charge companies that make massive use of its data for access to its services. In any case, the foundation will need to negotiate directly with AI software companies very quickly to ensure that the development of their models does not degrade the quality of service of Wikipedia and other websites. – AFP Relaxnews