Reddit stands firm against AI companies scraping content for training without paying (2024)

A hot potato: Reddit has been making moves as part of a crackdown on companies indiscriminately scraping the website for AI training purposes. Its philosophy is that AI companies stand to make millions or billions on large language models they are developing with resources they do not own. It's analogous to someone taking two-by-fours from a lumberyard to build their house just because the yard doesn't have a locked gate. But the issue goes way beyond Reddit and is central to how the open web has worked so far.

The Robots Exclusion Protocol is a web standard used to control and manage web crawler and bot access to websites. Defined by the robots.txt file, it tells search engines which parts of a site can be crawled or indexed, helping webmasters protect sensitive content and manage traffic efficiently. However, it works on the honor system with few ways to enforce it.
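To make the honor system concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The rules, the bot names (`ExampleSearchBot`, `SomeOtherCrawler`), and the URL are hypothetical and chosen only for illustration; they are not Reddit's actual file. The sketch assumes an allowlist-style policy in which one named crawler is permitted and everyone else is asked to stay out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Reddit's actual robots.txt.
# One named crawler is allowed everywhere; all other agents are disallowed.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""


def can_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleSearchBot", "SomeOtherCrawler"):
        allowed = can_fetch(agent, "https://example.com/r/python/")
        print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

Note that the check runs entirely on the crawler's side. Nothing in the protocol stops a scraper from skipping this step, which is why enforcement ultimately depends on goodwill, blocking at the server, or contracts.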

Last week, Ars Technica reported that Reddit posts were not appearing in any search engine except Google. It is no secret that Reddit has already signed a $60 million licensing deal with Alphabet to use its content for training. Meanwhile, Reddit has been ranking increasingly high in Google searches over the past year (quid pro quo, or maybe not...).

The company also recently notified users that it changed its robots.txt file to exclude bots and crawlers that didn't have permission to access its data. Reddit CEO Steve Huffman said he believes in an open internet but that companies now use search engine web crawlers to scrape information for profit, a far cry from their historical use. "I think the traditional value exchange from search engines has changed," Huffman told The Verge.

"Search and summarization and training are merging, and the value exchange of crawling in exchange for traffic back is becoming muddied."

To this point, Huffman said that blocking companies unwilling to pay for data harvesting has been "a real pain in the ass," prompting the changes to Reddit's robots.txt. For the most part, companies have respected Reddit's wishes, but a few, including Microsoft, Anthropic, and Perplexity, have refused to negotiate a license for its content.

Huffman said that the biggest thorn in his side is that some companies scraping Reddit data are turning around and selling it to other AI firms via their APIs. He specifically called out Microsoft AI CEO Mustafa Suleyman for recently comparing all public data on the internet to "freeware."

"We've had Microsoft, Anthropic, and Perplexity act as though all of the content on the internet is free for them to use," said Huffman. "That's their real position." While Microsoft Bing has been gracious in respecting Reddit's decision to block its crawlers, the company managed to slip in a denigrating remark.

Microsoft AI CEO Mustafa Suleyman: the social contract for content that is on the open web is that it's "freeware" for training AI models pic.twitter.com/FN1xrqnJC0

– Tsarathustra (@tsarnick) June 26, 2024

"Reddit has blocked Bing from crawling their site for search, favoring another search engine and impacting competition from Bing and Bing-powered engines," Microsoft spokesperson Caitlin Roulston said last week. "We honor the directions provided by websites that do not want content on their pages to be used with our generative AI models."

So far, Google and OpenAI are the only companies on Reddit's whitelist. If other search engines return anything but outdated Reddit content, then they are not abiding by the website's robots.txt file.

Reddit profiting from user-generated content through these licensing deals is still a hot potato. On the one hand, the lucrative fees do not go into the pockets of the community who make up Reddit's forums. On the other hand, these licensing deals are not much different from those of other companies.

OpenAI already pays licensing fees to large publishers like Dotdash Meredith, Axel Springer, the Associated Press, and The Atlantic. It is unconfirmed but doubtful that these publications pass those profits to their writers via raises or bonuses. Does that make it right? No, and the courts are still trying to decide about this unprecedented activity. However, it's par for the course at this point.

And this issue is not limited to Reddit: it affects all online publishers, big and small. In the race against AI training abuse, Reddit is one of the few with the muscle and influence to call out AI companies. While big media companies try to monetize and reach agreements, the rest of the internet is struggling. In fact, some subreddits have their own bots that copy and paste entire articles from the original sources and display them as the first comment in the thread, effectively copying the content and then selling it to AI companies.

Until there are governing regulations, the AI gold rush will be like the California gold rush of 1848. Artificial intelligence firms will continue flocking to shovel AI products down everyone's throats for profit or to gather more data. Meanwhile, companies like Reddit and Vox will keep handing them the shovels.

Image credit: Jernej Furman

