Latest news with #dataScraping


The National
4 days ago
- Business
- The National
AI audits and 'pay per crawl': How Cloudflare is trying to fix a 'broken' web model
An announcement from network giant Cloudflare is deepening a divide between the worlds of tech and content publishing, which are at odds over the data used to train AI platforms. Cloudflare said publishers using the cloud company's tools for hosting websites will now block AI crawlers by default from accessing and poaching content without permission. 'Upon sign-up with Cloudflare, every new domain will now be asked if they want to allow AI crawlers, giving customers the choice upfront to explicitly allow or deny AI crawlers access,' the company said. 'This significant shift means that every new domain starts with the default of control, and eliminates the need for webpage owners to manually configure their settings to opt out.' Cloudflare first addressed concerns about AI data-scraping last year when it gave websites the option to block AI companies from poaching content. 'Now by default you have control over who crawls your site and what that information is used for,' said Stephanie Cohen, Cloudflare's chief strategy officer. 'The benefit of that is that it creates the conditions for a new business model of the internet to develop,' she told The National after Cloudflare introduced the new settings and options. Some of the strongest proponents of AI tools, and many of the tools' creators, have justified data scraping, saying it is akin to the early days of search engines when controversy briefly surfaced over whether or not search companies should be able to index sites. Others say that comparison isn't appropriate, because search engines didn't poach the contents of entire websites. Additionally, during the early days of web browsers, search engines and the crawlers they implemented provided a framework that built much of the internet as we know it. It was a win-win situation for the likes of Google and for media companies: publishers provided information and sought to attract audiences, and search engines delivered web traffic to them through internet searches.
The debut of OpenAI's ChatGPT in 2022 and other AI platforms turned that economic model on its head. Instead of directing traffic to websites, AI summaries have quickly become a destination unto themselves, siphoning traffic from the same websites from which they scrape data. Ms Cohen said publishers and content creators using Cloudflare's services soon noticed a major dip in web traffic. 'Not only was it getting more difficult to get web traffic – to the tune of it being 10 times harder – but it was also getting more difficult at a faster and faster rate,' she said. The web's economics based on search that built up over the last decade, she said, started to erode over a period of six months. Ms Cohen said that in 2024 Cloudflare allowed users to see which AI companies were scraping their sites and to turn off that ability. This year the company is taking things further by introducing 'pay per crawl'. The tool gives publishers and website operators the option of allowing AI scraping for free, charging for it 'at the configured domain-wide price,' or blocking scraping entirely. As AI development quickens, so too does the bad blood between media organisations and the tech firms driving the AI boom. Several lawsuits have been filed. The New York Times has sued OpenAI and Microsoft for allegedly using its articles to power increasingly popular chatbots.
Yahoo
08-06-2025
- Business
- Yahoo
Reddit Lawsuit Against Anthropic AI Has Stakes for Sports
In a new lawsuit, Reddit accuses AI company Anthropic of illegally scraping its users' data, including posts authored by sports fans who use the popular online discussion platform. Reddit's complaint, drafted by John B. Quinn and other attorneys from Quinn Emanuel Urquhart & Sullivan, was filed on Wednesday in a California court. It contends Anthropic breached the Reddit user agreement by scraping Reddit content through its web crawler, ClaudeBot. The web crawler provides training data for Anthropic's AI tool, Claude, which relies on large language models (LLMs) that distill data and language. Other claims in the complaint include tortious interference and unjust enrichment. Scraping Reddit content is portrayed as undermining Reddit's obligations to its more than 100 million daily active unique users, including to protect their privacy. Reddit also contends Anthropic subverts its assurances to users that they control their expressions, including when deleting posts from public view. Scraping is key to AI. Automated technology makes requests to a website, then copies the results and tries to make sense of them. Anthropic, Reddit claims, finds Reddit data 'to be of the highest quality and well-suited for fine-tuning AI models' and useful for training AI. Anthropic allegedly violates users' privacy, since those users 'have no way of knowing' their data has been taken. Reddit, valued at $6.4 billion in its initial public offering last year, has hundreds of thousands of 'subreddits,' or online communities that cover numerous shared interests. Many subreddits are sports related, including r/sports, which has 22 million fans, r/nba (17 million) and the college football-centered r/CFB (4.4 million).
Some pro franchises, including the Miami Dolphins (r/miamidolphins) and Dallas Cowboys (r/cowboys), have official subreddits. Reddit contends its unique features elevate its content and thus make the content more attractive to scraping endeavors. Reddit users submit posts, which can include original commentary, links, polls and videos, and they upvote or downvote content. This voting influences whether a post appears on the subreddit's front page or is more obscurely placed. Subreddit communities also self-police, with prohibitions on personal attacks, harassment, racism and spam. These practices can generate thoughtful and detailed commentary. Reddit estimates that ClaudeBot's scraping of Reddit has 'catapulted Anthropic into its valuation of tens of billions of dollars.' Meanwhile, Reddit says the company and its users lose out, because they 'realize no benefits from the technology that they helped create.' Anthropic allegedly trained ClaudeBot to extract data from Reddit starting in December 2021. Anthropic CEO Dario Amodei is quoted in the complaint as praising Reddit content, especially content found in prominent subreddits. Although Anthropic indicated it had stopped scraping Reddit in July 2024, Reddit says audit logs show Anthropic 'continued to deploy its automated bots to access Reddit content' more than 100,000 times in subsequent months. Reddit also unfavorably compares Anthropic to OpenAI and Google, which are 'giants in the AI space.' Reddit says OpenAI and Google 'entered into formal partnerships with Reddit' that permitted them to use Reddit content but only in ways that 'protect Reddit and its users' interests and privacy.' In contrast, Anthropic is depicted as engaging in unauthorized activities. In a statement shared with media, an Anthropic spokesperson said, 'we disagree with Reddit's claims, and we will defend ourselves vigorously.' 
In the weeks ahead, attorneys for Anthropic will answer Reddit's complaint and argue the company has not broken any laws. Reddit v. Anthropic has implications beyond the posts of Reddit users. Scraping by web crawlers is a constant activity on the Internet, including on message boards, blogs and other forums where sports fans and followers express viewpoints. The use of this content to train AI without users' knowledge or explicit consent is a legal topic sure to stir debate in the years ahead.


Asharq Al-Awsat
05-06-2025
- Business
- Asharq Al-Awsat
Reddit Sues AI Giant Anthropic Over Content Use
Social media outlet Reddit filed a lawsuit Wednesday against artificial intelligence company Anthropic, accusing the startup of illegally scraping millions of user comments to train its Claude chatbot without permission or compensation. The lawsuit in a California state court represents the latest front in the growing battle between content providers and AI companies over the use of data to train increasingly sophisticated language models that power the generative AI revolution. Anthropic, valued at $61.5 billion and heavily backed by Amazon, was founded in 2021 by former executives from OpenAI, the creator of ChatGPT. The company, known for its Claude chatbot and AI models, positions itself as focused on AI safety and responsible development. "This case is about the two faces of Anthropic: the public face that attempts to ingratiate itself into the consumer's consciousness with claims of righteousness and respect for boundaries and the law, and the private face that ignores any rules that interfere with its attempts to further line its pockets," the suit said. According to the complaint, Anthropic has been training its models on Reddit content since at least December 2021, with CEO Dario Amodei co-authoring research papers that specifically identified high-quality content for data training. The lawsuit alleges that despite Anthropic's public claims that it had blocked its bots from accessing Reddit, the company's automated systems continued to harvest Reddit's servers more than 100,000 times in subsequent months. Reddit is seeking monetary damages and a court injunction to force Anthropic to comply with its user agreement terms. The company has requested a jury trial. In an email to AFP, Anthropic said "We disagree with Reddit's claims and will defend ourselves vigorously." 
Reddit has entered into licensing agreements with other AI giants including Google and OpenAI, which allow those companies to use Reddit content under terms that protect user privacy and provide compensation to the platform. Those deals have helped lift Reddit's share price since it went public in 2024. Reddit shares closed up more than six percent on Wednesday following news of the lawsuit. Musicians, book authors, visual artists and news publications have sued the various AI companies that used their data without permission or payment. AI companies generally defend their practices by claiming fair use, arguing that training AI on large datasets fundamentally changes the original content and is necessary for innovation. Though most of these lawsuits are still in early stages, their outcomes could have a profound effect on the shape of the AI industry.