
Latest news with #AILabyrinth

Cloudflare blocks AI crawlers by default in a major win for web publishers

Hindustan Times

07-07-2025

  • Business
  • Hindustan Times

Cloudflare blocks AI crawlers by default in a major win for web publishers

Cloudflare has rolled out a major policy update: new sites will now block AI crawlers by default. This is a critical turning point for content creators and publishers, who have long faced bots scraping their content without compensation or return traffic.

Historical context: How it used to work

Traditionally, the web operated on a 'content-for-traffic' model: search engines crawled sites and sent visitors back via links. This model supported ad revenue and visibility. With the rise of generative AI, firms have harvested vast amounts of web content, including text, images, and code, without driving any traffic back to publishers.

What Cloudflare is doing now

New Cloudflare users are now opted out of AI scraper access by default; owners must explicitly choose to allow bots. An optional Pay Per Crawl program is also in development, under which publishers can monetize access by charging AI companies per crawl. Major players like Time, Reddit, BuzzFeed, and The Atlantic have already expressed support.

How this affects AI companies

Firms that build large language models, such as OpenAI, Google, Meta, and Anthropic, will face new hurdles. They must now seek permission or negotiate access fees, reducing free access to diverse data. The rise of 'AI Labyrinth,' a honeypot of decoy pages, further complicates unregulated scraping.

Technical and operational roadblocks

AI scrapers pose challenges: they ignore existing protections such as robots.txt, cause server strain, and increase bandwidth costs. Cloudflare's answer includes advanced bot detection powered by behavioural analysis and machine learning, plus the labyrinth tactic to confuse scrapers.

Industry reactions and implications

Content owners, including Universal Music Group and Reddit, have welcomed the change, calling it essential to sustainable web economics. Cloudflare's CEO Matthew Prince summed it up: 'If the Internet is going to survive the age of AI, we need to give publishers the control they deserve.' But critics worry that a permission-based web could limit AI innovation.

New web marketplace on the horizon

Cloudflare is exploring a content-access marketplace where AI firms pay for usage, transforming free scrapes into transactions. This could value content by its data rather than by clicks, offering publishers a fresh revenue channel.

Cloudflare's move shifts control back to creators, forcing a rethink of how AI companies gather data. Publishers now have the power to block, charge, or allow bots. This shift may define the economic and ethical foundations of the AI-powered web.
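The default-deny, explicit-opt-in model described above can be sketched in a few lines of Python. The bot names and the allowlist mechanism below are illustrative assumptions for this example, not Cloudflare's actual rule set.

# Minimal sketch of a default-deny policy for AI crawlers with explicit opt-in.
# The user-agent names and the allowlist mechanism are illustrative, not Cloudflare's rules.
AI_CRAWLER_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Bytespider")

def allow_request(user_agent, owner_allowlist):
    """Serve the request unless it comes from an AI crawler the owner has not opted in."""
    for bot in AI_CRAWLER_AGENTS:
        if bot in user_agent:
            return bot in owner_allowlist  # denied unless explicitly allowed
    return True  # ordinary browsers and unlisted agents pass through

# Example: the site owner has opted in only to GPTBot.
print(allow_request("Mozilla/5.0 (compatible; GPTBot/1.0)", {"GPTBot"}))     # True
print(allow_request("Mozilla/5.0 (compatible; ClaudeBot/1.0)", {"GPTBot"}))  # False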

Cloudflare CEO says people aren't checking AI chatbots' source links

Engadget

20-06-2025

  • Business
  • Engadget

Cloudflare CEO says people aren't checking AI chatbots' source links

Companies that develop generative AI always make it a point to say that they include links to websites in the answers their chatbots generate for users. But Cloudflare CEO Matthew Prince has revealed to Axios that search traffic referrals keep plummeting. Publishers are facing an existential threat, he said, because people aren't clicking through those chatbot links and are relying more and more on AI summaries without digging deeper.

Prince told Axios that 10 years ago, Google sent a publisher one visitor for every two pages it had crawled. Six months ago, the ratio was one visitor for every six pages, and now it's one for every 18. OpenAI sent one visitor to a publisher for every 250 pages it crawled six months ago, while Anthropic sent one visitor for every 6,000 pages. These days, OpenAI sends one visitor to a publisher for every 1,500 pages, whereas Anthropic sends one visitor for every 60,000 pages. People have come to trust AI chatbots more over the past few months. The problem for publishers is that they don't earn from advertisements if people don't click through links leading to their websites, and that's why Prince is encouraging them to take action to make sure they're fairly compensated.

Prince said Cloudflare is currently working on a tool to block bots that scrape content for large language models even if a web page already has a "no crawl" instruction. If you'll recall, several outlets reported in 2024 that AI companies had been ignoring websites' Robots Exclusion Protocol, or robots.txt files, and taking their content anyway to train their technologies. Cloudflare has been looking for ways to block scrapers since last year. But it was only in March that Cloudflare officially introduced AI Labyrinth, which uses AI-generated content to "slow down, confuse, and waste the resources of AI Crawlers and other bots that don't respect 'no crawl' directives." It works by linking an unauthorized crawler to a series of AI-generated pages that are convincing enough but don't actually contain the content of the site the tool is protecting. That way, the crawler ends up wasting time and resources.

"I go to war every single day with the Chinese government, the Russian government, the Iranians, the North Koreans, probably Americans, the Israelis, all of them who are trying to hack into our customer sites," Prince said. "And you're telling me, I can't stop some nerd with a C-corporation in Palo Alto?"
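For scale, the crawl-to-referral figures Prince cited can be turned into referral rates with a little arithmetic; the numbers below are exactly those quoted above, and the script is only a worked illustration.

# Pages crawled per visitor referred, as cited by Matthew Prince.
pages_per_visitor = {
    "Google, 10 years ago": 2,
    "Google, 6 months ago": 6,
    "Google, now": 18,
    "OpenAI, 6 months ago": 250,
    "OpenAI, now": 1_500,
    "Anthropic, 6 months ago": 6_000,
    "Anthropic, now": 60_000,
}

for source, pages in pages_per_visitor.items():
    # Referral rate: the fraction of crawled pages that send back one human visitor.
    print(f"{source}: 1 visitor per {pages:,} pages ({1 / pages:.4%} referral rate)")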

A new, 'diabolical' way to thwart Big Tech's data-sucking AI bots: Feed them gibberish

Business Insider

01-05-2025

  • Business
  • Business Insider

A new, 'diabolical' way to thwart Big Tech's data-sucking AI bots: Feed them gibberish

  • Bots now generate more internet traffic than humans, according to cybersecurity firm Thales.
  • This is being driven by web crawlers from tech giants that harvest data for AI model training.
  • Cloudflare's AI Labyrinth misleads and exhausts bots with fake content.

A data point caught my eye recently. Bots generate more internet traffic to websites than humans now, according to cybersecurity company Thales. This is being driven by a swarm of web crawlers unleashed by Big Tech companies and AI labs, including Google, OpenAI, and Anthropic, that slurp up copyrighted content for free. I've warned about these automated scrapers before. They're increasingly sophisticated and persistent in their quest to harvest information to feed the insatiable demand for AI training datasets. Not only do these bots take data without permission or payment, but they're also causing traffic surges in some parts of the internet, increasing costs for website owners and content creators.

Thankfully, there's a new way to thwart this bot swarm. If you're struggling to block them entirely, you can send them down new digital rabbit holes where they ingest garbage content. One software developer recently called this "diabolical," in a good way.

"Absolutely diabolical Cloudflare feature. love to see it" — hibakod (@hibakod) April 25, 2025

It's called AI Labyrinth, and it's a tool from Cloudflare. Described as a "new mitigation approach," AI Labyrinth uses generative AI not to inform, but to mislead. When Cloudflare detects unauthorized activity, typically from bots ignoring "no crawl" directives, it deploys a trap: a maze of convincingly real but irrelevant AI-generated content designed to waste bots' time and chew through AI companies' computing power. Cloudflare pledged in a recent announcement that this is only the first iteration of using generative AI to thwart bots.

Digital gibberish

Unlike traditional honeypots, AI Labyrinth creates entire networks of linked pages that are invisible to humans but highly attractive to bots. These decoy pages don't affect search engine optimization and aren't indexed by search engines. They are specifically tailored to bots, which get ensnared in a meaningless loop of digital gibberish. When bots follow the maze deeper, they inadvertently reveal their behavior, allowing Cloudflare to fingerprint and catalog them. These data points feed directly into Cloudflare's evolving machine learning models, strengthening future detection for customers.

Will Allen, VP of Product at Cloudflare, told me that more than 800,000 domains have fired up the company's general AI bot blocking tool. AI Labyrinth is the next weapon to wield when sneaky AI companies get around blockers. Cloudflare hasn't released data on how many customers use AI Labyrinth, which suggests it's still early days for adoption. "It's still very new, so we haven't released that particular data point yet," Allen said.

I asked him why AI bots are still so active if most of the internet's data has already been scraped for model training. "New content," Allen replied. "If I search for 'what are the best restaurants in San Francisco,' showing high-quality content from the past week is much better than information from a year or two prior that might be out of date."

Turning AI against itself

Bots are not just scraping old blog posts; they're hungry for the freshest data to keep AI outputs relevant. Cloudflare's strategy flips this demand on its head. Instead of serving up valuable new content to unauthorized scrapers, it offers them an endless buffet of synthetic articles, each more irrelevant than the last. As AI scrapers become more common, innovative defenses like AI Labyrinth are becoming essential. By turning AI against itself, Cloudflare has introduced a clever layer of defense that doesn't just block bad actors but exhausts them. For web admins, enabling AI Labyrinth is as easy as toggling a switch in the Cloudflare dashboard. It's a small step that could make a big difference in protecting original content from unauthorized exploitation in the age of AI.
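To make the maze mechanism described above concrete, here is a minimal, self-contained sketch of the general honeypot-link idea: suspected crawlers are served decoy pages that link only to more decoys and are marked noindex. This is not Cloudflare's implementation; the user-agent list, URL scheme, and filler text are hypothetical placeholders.

# Toy honeypot maze: suspected crawlers get endless decoy pages; humans get the real page.
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

SUSPECT_AGENTS = ("GPTBot", "CCBot", "Bytespider")   # illustrative list, not Cloudflare's

def random_slug(length=12):
    return "".join(random.choices(string.ascii_lowercase, k=length))

def decoy_page():
    """Build a throwaway page whose only outgoing links lead deeper into the maze."""
    filler = " ".join(random.choices(["archive", "report", "update", "notes", "data"], k=80))
    links = "".join(f'<a href="/maze/{random_slug()}">more</a> ' for _ in range(5))
    return (
        "<html><head><meta name='robots' content='noindex, nofollow'></head>"
        f"<body><p>{filler}</p>{links}</body></html>"
    ).encode()

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        suspicious = any(bot in agent for bot in SUSPECT_AGENTS)
        if suspicious or self.path.startswith("/maze/"):
            body = decoy_page()                      # trap: decoys link only to more decoys
        else:
            body = b"<html><body>Real content lives here.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), MazeHandler).serve_forever()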

AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%

Yahoo

02-04-2025

  • Yahoo

AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%

The Wikimedia Foundation, the umbrella organization of Wikipedia and a dozen or so other crowdsourced knowledge projects, said on Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged by 50% since January 2024. The reason, the outfit wrote in a blog post Tuesday, isn't growing demand from knowledge-thirsty humans, but automated, data-hungry scrapers looking to train AI models. "Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs," the post reads. Wikimedia Commons is a freely accessible repository of images, videos, and audio files that are available under open licenses or are otherwise in the public domain.

Digging down, Wikimedia says that almost two-thirds (65%) of the most "expensive" traffic -- that is, the most resource-intensive in terms of the kind of content consumed -- was from bots. However, just 35% of overall pageviews come from these bots. The reason for this disparity, according to Wikimedia, is that frequently accessed content stays closer to the user in its cache, while less frequently accessed content is stored further away in the "core data center," which is more expensive to serve content from. This is the kind of content that bots typically go looking for.

"While human readers tend to focus on specific – often similar – topics, crawler bots tend to 'bulk read' larger numbers of pages and visit also the less popular pages," Wikimedia writes. "This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources."

The long and short of all this is that the Wikimedia Foundation's site reliability team is having to spend a lot of time and resources blocking crawlers to avert disruption for regular users. And all this before we consider the cloud costs that the Foundation is faced with.

In truth, this represents part of a fast-growing trend that is threatening the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault bemoaned the fact that AI crawlers ignore robots.txt files that are designed to ward off automated traffic. And "pragmatic engineer" Gergely Orosz also complained last week that AI scrapers from companies such as Meta have driven up bandwidth demands for his own projects. While open source infrastructure, in particular, is in the firing line, developers are fighting back with "cleverness and vengeance," as TechCrunch wrote last week. Some tech companies are doing their bit to address the issue, too — Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow crawlers down. However, it's very much a cat-and-mouse game that could ultimately force many publishers to duck for cover behind logins and paywalls — to the detriment of everyone who uses the web today.
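The cache-locality point above is easy to illustrate with a toy model (not Wikimedia's real traffic data): human requests concentrate on a small set of popular pages that stay cached, while a crawler sweeping the long tail keeps missing the cache and hitting the expensive core data center.

# Toy model: why bulk-reading crawlers are disproportionately expensive to serve.
# All numbers are made up for illustration; only the qualitative pattern matters.
import random

random.seed(0)
CACHE = set(range(1_000))          # assume only the most popular 1% of 100,000 pages stay cached

def cache_miss_rate(requests):
    """Fraction of requests that fall through to the core data center."""
    return sum(1 for page in requests if page not in CACHE) / len(requests)

# Humans: 90% of requests go to the popular, cached pages.
human_traffic = [random.randrange(1_000) if random.random() < 0.9 else random.randrange(100_000)
                 for _ in range(50_000)]
# Crawler: sweeps the whole corpus roughly uniformly, popular or not.
crawler_traffic = [random.randrange(100_000) for _ in range(50_000)]

print(f"human cache-miss rate:   {cache_miss_rate(human_traffic):.1%}")   # low
print(f"crawler cache-miss rate: {cache_miss_rate(crawler_traffic):.1%}") # high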
