Reddit stands firm against AI companies scraping content for training without paying (2024)

A hot potato: Reddit has been making moves as part of a crackdown on companies indiscriminately scraping the website for AI training purposes. Its philosophy is that AI companies stand to make millions or billions on large language models they are developing with resources they do not own. It's analogous to someone taking two-by-fours from a lumberyard to build their house just because the yard doesn't have a locked gate. But the issue goes way beyond Reddit and is central to how the open web has worked so far.

The Robots Exclusion Protocol is a web standard used to control and manage web crawler and bot access to websites. A site publishes its rules in a robots.txt file, which tells search engines which parts of the site can be crawled or indexed, helping webmasters protect sensitive content and manage traffic efficiently. However, it works on the honor system, with few ways to enforce it.
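To make the honor system concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The rules, the crawler names (ExampleSearchBot, ExampleAIScraper), and the example.com URL are all hypothetical and are not taken from Reddit's actual robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one allowlisted crawler, everyone else blocked.
# These rules are illustrative only, not Reddit's real file.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/r/technology/"
    for agent in ("ExampleSearchBot", "ExampleAIScraper"):
        # Nothing forces a crawler to run this check or obey the answer;
        # compliance is entirely voluntary.
        print(agent, "may fetch" if allowed(agent, page) else "is asked not to fetch", page)
```

The check runs entirely on the crawler's side. A scraper that never calls it, or ignores the result, faces no technical barrier from robots.txt alone.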

Reddit CEO Steve Huffman said that the biggest thorn in his side is that some companies scraping Reddit data are turning around and selling it to other AI firms via their APIs. He specifically called out Microsoft AI CEO Mustafa Suleyman for recently comparing all public data on the internet to "freeware."

"We've had Microsoft, Anthropic, and Perplexity act as though all of the content on the internet is free for them to use," said Huffman. "That's their real position." While Microsoft Bing has been gracious in respecting Reddit's decision to block its crawlers, the company managed to slip in a denigrating remark.

Microsoft AI CEO Mustafa Suleyman: the social contract for content that is on the open web is that it's "freeware" for training AI models pic.twitter.com/FN1xrqnJC0
– Tsarathustra (@tsarnick) June 26, 2024

"Reddit has blocked Bing from crawling their site for search, favoring another search engine and impacting competition from Bing and Bing-powered engines," Microsoft spokesperson Caitlin Roulston said last week. "We honor the directions provided by websites that do not want content on their pages to be used with our generative AI models."

So far, Google and OpenAI are the only companies on Reddit's whitelist. If other engines return anything but outdated Reddit content, then they are not abiding by the website's robots.txt document.

Reddit profiting from user-generated content through these licensing deals is still a hot potato. On the one hand, the lucrative fees do not go into the pockets of the community who make up Reddit's forums. On the other hand, these licensing deals are not much different from those of other companies.

OpenAI already pays licensing fees to large publishers like Dotdash Meredith, Axel Springer, the Associated Press, and The Atlantic. It is unconfirmed but doubtful that these publications pass those profits to their writers via raises or bonuses. Does that make it right? No, and the courts are still trying to decide about this unprecedented activity. However, it's par for the course at this point.

This issue is not limited to Reddit but extends to all online publishers, big and small. In the race against AI training abuse, Reddit is one of the few with the muscle and influence to call out AI companies. While big media companies try to monetize and reach agreements, the rest of the internet is struggling. In fact, some subreddits have their own bots that copy and paste entire written content from original sources and display it as the first comment in the thread, effectively copying the content and then selling that to AI companies.

Until there are governing regulations, the AI gold rush will be like the California gold rush of 1848. Artificial intelligence firms will continue flocking to shovel AI products down everyone's throats for profit or to gather more data. Meanwhile, companies like Reddit and Vox will keep handing them the shovels.

Image credit: Jernej Furman
