Premium news companies may be the biggest victims of AI data scraping



At this point, we already know that AI models need to ingest a ton of data from numerous sources to learn. Companies extract data from sources all over the Internet like ebooks, social media sites, video sites, news websites, blogs, and so on. Much of the data is free to the public, but AI companies also take a ton of data from premium sources.

We’re talking about pay-walled, copyrighted content. This might not mean much to the average person, but what are the implications of this practice, and is it justified? We’re seeing a shift in the industry nowadays: large news and media companies are signing deals that hand their content over to AI companies like OpenAI and Meta.



This shocked many observers, as AI technology has had a negative effect on journalism. So, it’s a little surprising that so many news companies are happily serving their content to AI companies whose tools could make journalists obsolete. Among other things, signing these deals lets publishers avoid legal battles with AI companies.

Not too long after the AI explosion, we found out where AI companies got the data to train their models. Several major companies didn’t like that AI companies were scraping their content, and one of the most prominent was The New York Times. At the time of writing, The New York Times is locked in a huge legal battle with OpenAI.

OpenAI scraped a ton of The New York Times’ copyrighted articles. Not only that, but The New York Times alleges that ChatGPT reproduces sections of its articles verbatim. Other lawsuits like this have popped up over the past year, and we expect more from different companies.

This is especially true because more stories are surfacing that shed light on how much premium content AI companies scraped to train their models. People are looking back at the datasets that some of the biggest AI models used to train, and they’re finding that much of the content comes from pay-walled websites.

The analysis

As stated, reports are coming out revealing just how much premium and pay-walled data AI companies scraped to train their models.

News Media Alliance published a report last year showing that some of the biggest datasets in the world used a substantial amount of premium content. It found that OpenWebText, the dataset used to train OpenAI’s GPT-2 model, consisted of about 10% premium content. That may not sound like a lot, but the dataset comprises about 23 million web pages.

So, 10% of a 23-million-page pie is a hefty slice: roughly 2.3 million pages. Not only that, but premium news sites make up a tiny fraction of the Internet as a whole, so even a small percentage of a dataset is disproportionate.
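As a back-of-the-envelope check of those figures (the 10% share and 23-million-page size come from the News Media Alliance report), the slice works out like this:

```python
# Rough arithmetic on the report's numbers; both figures are approximations.
total_pages = 23_000_000   # approximate size of OpenWebText
premium_share = 0.10       # share reported as premium content

premium_pages = int(total_pages * premium_share)
print(f"{premium_pages:,} premium pages")  # → 2,300,000 premium pages
```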

What does this mean? It means that firms like OpenAI aren’t just crawling the internet and feeding their models whatever happens to pop up. AI companies often target data from premium sites for their models. The abovementioned report opened the door for more reporting to follow.

A recent analysis from Ziff Davis points to something similar: datasets used to train major models contain a large amount of pay-walled content. The Ziff Davis report, however, takes four datasets into account, and it reveals something about AI companies’ intentions. The four datasets are Common Crawl, C4, OpenWebText, and OpenWebText2.

Several AI companies use these four datasets among others to train their models. Common Crawl was used to train OpenAI’s GPT-3 and Meta’s LLaMA. C4 was used to train Google’s LaMDA and T5 models along with LLaMA.

OpenWebText was used to train GPT-2 and OpenWebText2 was used to train GPT-3. Other major models most likely used these datasets, but the abovementioned models were featured in the report. So, these datasets trained some rather big models.

Obviously, these datasets are somewhat dated. OpenAI is currently several iterations into its GPT-4 series and Meta is on LLaMA 3, so the models listed above are well past their prime. The sheer amount of data in these datasets is nothing to sneeze at, however.

OpenWebText contains about 23 million web pages, while OpenWebText2 contains over 17 million. C4 towers above them with 365 million web pages, but the reigning champ is Common Crawl with a scale-tipping 3.15 billion web pages.

Going by the numbers, it seems like GPT-3 and LLaMA should be the smartest models on the list. However, the opposite might be true. When you’re in school, your teacher doesn’t just stand in front of you and rattle off arbitrary facts for six hours straight.

The information they tell you has to be curated by the teacher, school, and school board. This is why you have lesson plans and a standard curriculum. What does this have to do with AI models? Well, AI models are more like human beings than you think.

If you’re an AI model being fed a dataset, you’d want that dataset to contain high-quality, relevant information. As such, companies don’t always stuff their models with a boatload of random data. Datasets are sometimes cleaned and curated.

Dataset cleaning is a process that eliminates duplicate data, errors, inconsistent information, incomplete data, and more. In a way, it trims the fat. Dataset curation organizes the dataset to make the information more accessible.
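To make the cleaning step concrete, here’s a minimal sketch of the idea: drop exact duplicates and obviously incomplete records. Real pipelines are far more involved, and the record fields below are illustrative assumptions, not any company’s actual format.

```python
# Toy dataset cleaner: removes empty (incomplete) records and exact
# duplicate bodies, keeping the first occurrence of each page.
def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        text = rec.get("text", "").strip()
        if not text:        # incomplete record: no usable body
            continue
        if text in seen:    # duplicate body already kept
            continue
        seen.add(text)
        cleaned.append({"url": rec.get("url", ""), "text": text})
    return cleaned

raw = [
    {"url": "a.com", "text": "Breaking news story."},
    {"url": "b.com", "text": "Breaking news story."},  # duplicate body
    {"url": "c.com", "text": ""},                      # empty page
]
print(len(clean(raw)))  # → 1
```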

These are oversimplifications, but they capture the idea. In any case, cleaning and curating a dataset processes the data and dresses it up so that it’s easier for the model to ingest. This is similar to how your school curriculum is organized to gradually increase in difficulty as the year goes on.

Time for a small yet necessary tangent. There’s another angle to this report: domain authority. Roughly speaking, the higher a site’s domain authority, the more trusted and reputable the site is.

So, you’d expect a site like The New York Times, a major news corporation, to have a higher domain authority than a brand-new news site that gets a maximum of 10 views every day. The report took into account 15 of the news companies with the highest domain authority. The list consists of “Advance (Condé Nast, Advance Local), Alden Global Capital (Tribune Publishing, MediaNews Group), Axel Springer, Bustle Digital Group, Buzzfeed, Inc., Future plc, Gannett, Hearst, IAC (Dotdash Meredith and other divisions), News Corp, The New York Times Company, Penske Media Corporation, Vox Media, The Washington Post, and Ziff Davis.”

The report scores domain authority on a 1-to-100-point scale, where 100 means the site has the most domain authority.

The list above consists of sites with rather high domain authorities. What does that have to do with datasets and AI models? Well, let’s put this all together. In the report, we see a breakdown of the four datasets.

In the graph below, we see an interesting trend. The X-axis shows domain authority scores divided into 10-point intervals, and the Y-axis shows the percentage of each dataset that falls into each interval. It shows that just over 50% of the websites in Common Crawl have domain authority scores between 0 and 9.

The share drops sharply as domain authority increases: each subsequent 10-point bucket accounts for less than 10% of the dataset. Moving over to C4, the results aren’t much better.

About 20% of C4’s sites have a domain authority score between 10 and 20 points, and it drops off considerably from there as well. Still, C4 remains consistently higher than Common Crawl for the majority of the graph.

However, we see a dramatic shift once we look at the two OpenWebText datasets. In fact, we see the exact opposite! Both datasets start from a similar place on the graph for scores from 0 to 9, but they steadily rise as domain authority increases. More than 30% of OpenWebText’s data came from sites with domain authority scores between 90 and 100.

As for OpenWebText2, about 40% of the dataset consists of sites with domain authority scores between 90 and 100. Here’s a graph that shows similar data; however, instead of data from all scraped sites, it only shows data from the aforementioned premium websites.
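The bucketing behind these charts can be sketched in a few lines: group each site’s domain authority score into a 10-point bin and report the bin’s share of the dataset. The scores below are made up for illustration; the report’s real data is only in its graphs.

```python
# Illustrative recreation of the chart's binning: 10-point domain
# authority intervals on the X-axis, each bin's share on the Y-axis.
from collections import Counter

scores = [3, 8, 12, 15, 44, 67, 91, 95, 97, 99]  # hypothetical DA scores

bins = Counter((s // 10) * 10 for s in scores)   # 3 -> bin 0, 95 -> bin 90
for low in range(0, 100, 10):
    share = 100 * bins.get(low, 0) / len(scores)
    print(f"{low:>2}-{low + 9}: {share:.0f}%")
```

With these made-up scores, the 90–99 bin holds 40% of the sites, a top-heavy shape like the one the report shows for OpenWebText2.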

Below, we have a graph showing each of the aforementioned publications and how much each was used in each dataset. We see that the percentages skyrocket for the two OpenWebText datasets; these datasets contain substantially less data, though, so it’s easier for one source to make up a higher percentage. So, more high-quality websites’ data exists in the OpenWebText datasets, but here’s the kicker.

Remember how we talked about cleaning and curating datasets? That process takes raw, unfiltered data and refines it. Well, per the report, Common Crawl and C4 were not cleaned or curated, while the two OpenWebText datasets were.

This means that the datasets with the higher volume of premium content just so happen to be the ones that have been touched by human hands. This hints that AI companies specifically target premium data to scrape. Until this point, we assumed that these companies decided to just crawl websites and dump as much data into their models as possible, paying no heed to where it comes from.

However, the reality is that many of these companies may be specifically looking for content they shouldn’t be using. This report shows that much of the content used to train OpenAI’s models is pay-walled. So, the question is, how many other datasets are processed to favor premium data?

Can AI companies taking premium data be justified?

On the surface, the companies seem to be in the wrong, but when you dig a bit deeper, the line between right and wrong starts to blur.

We know about the legal implications. AI companies overstep their boundaries when they train their models on pay-walled material. Aside from reproducing bits of pay-walled content verbatim in some cases, these companies are taking data to train models that could put those very publishers out of business.

That’s pretty messed up. However, there are two sides to this conversation. The fact is that AI models are here, and there’s nothing that anyone can do about it.

They’re delivering answers to our questions, teaching us, and more. Not only that, but these AI tools are poised to be used in some rather crucial and understaffed fields like medicine and education. If they’re going to be trained on content from the internet, it’d be best for them to be trained on high-quality content.

While it’s tough to admit that there could be some merit in this practice, more and more of our lives will be touched by AI in some way. Honestly, it’d be better to use models trained on high-quality data than models trained on whatever. Much of the population doesn’t like the march of AI, but no one can stop progress.

AI will take over more and more tasks, so having models trained on higher-quality content may be the lesser of two evils. But does this justify the use of pay-walled content? One of the worst things in any industry is when a large company can simply act as it pleases. Would you trust your 8-year-old alone in an unguarded candy store? Obviously, with no staff around to stop them from pigging out, your child will come home with a bellyache.

Justifying companies surreptitiously taking pay-walled content basically gives them free rein to gorge on as much data as they can, much like the child. It grants them a hall pass to freely take data from other paid services. The companies that exist on the Internet unfortunately have to live by the internet’s rules, and Rule #1 is that all sites get crawled; there’s very little that anyone can do about it.

The reports from Ziff Davis and the News Media Alliance show that several AI companies knowingly siphoned data from premium publications without acknowledging it. Companies are filing lawsuits, as they rightfully should, because there’s no telling how much of their data lives within the same chatbots that are stealing journalists’ jobs.