AI bots, web scraping only one block away after Cloudflare’s new feature


Published 9 Jul 2024


Cloudflare debuted its “Declare your AIndependence” initiative on July 3, releasing a one-click option that lets website owners block all artificial intelligence (AI) bots and crawlers from scraping their content without a license.

Cloudflare, one of the leading internet service and hosting companies worldwide, handles one-fifth of all web traffic. Its latest tool for countering generative AI scrapers is part of its flagship content delivery network and is available to all customers, including those on the free plan.

The new system can be activated by toggling the “AI Scrapers and Crawlers” switch in the Security section of the Cloudflare dashboard, which will prevent AI bots from visiting and accessing websites without permission.

The one-click feature follows the bot categorization program the company launched last year, which lets users choose specific categories of bots to allow or block.

Continued Struggle versus Scrapers

The rapid development of generative AI has created growing demand for training data drawn from original web content. AI bots, crawlers, and scrapers browse and gather data from websites across the Internet, and that data is used to train large language models (LLMs) and support AI-powered technologies. The main issue is that some AI bots do not follow established protocols, raising ethical concerns about copyright infringement and intellectual property violations.

Based on Cloudflare’s analysis of AI bot traffic across its network, Bytespider, Amazonbot, ClaudeBot, and GPTBot were the most active AI crawlers. Among these four, Bytespider, a crawler operated by TikTok’s parent company ByteDance, was the most frequently blocked by websites, followed by OpenAI’s GPTBot.

Moreover, 39% of the top one million websites using Cloudflare have been accessed by AI bots, but only 2.98% of them have blocked or challenged these crawlers. The study also found that AI bots are more likely to target higher-ranked websites, which in turn are more likely to block them.

Although website operators can control the access of AI crawlers using robots.txt, not all bot operators identify themselves honestly, and some ignore the Robots Exclusion Protocol altogether.
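To illustrate why robots.txt alone is a weak defense, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt contents, the example.com URL, and the is_allowed helper are assumptions made up for illustration; they are not taken from Cloudflare or from any real site. The sketch shows that the rules are matched against whatever user agent a crawler chooses to declare, and that the check only happens if the crawler bothers to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows two of the AI crawlers named above
# and allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL only.
url = "https://example.com/article"

# A crawler that declares itself honestly is told to stay out.
print(is_allowed("GPTBot", url))

# The same crawler presenting a browser-like user agent matches only the
# wildcard rule, so the rules no longer apply to it.
print(is_allowed("Mozilla/5.0 (Windows NT 10.0; Win64; x64)", url))
```

Nothing in the protocol forces a crawler to perform this check at all, which is why enforcement has to happen on the server or network side.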

Seeing Through the Veil

“Sadly, we’ve observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent,” Cloudflare stated in its blog. “We’ve monitored this activity over time, and we’re proud to say that our global machine learning model has always recognized this activity as a bot.”

The American company claims that its latest feature can identify web scrapers despite their attempts to avoid detection. It does this with a global bot score system that assigns every request a score from 1 to 99, with lower values indicating a higher chance that the request was made by a bot.

In the case of Perplexity AI, Cloudflare noted that the bot working for the “answer engine” consistently received a score below 30. That falls within the range the company recommends blocking, so the bot’s traffic would be blocked automatically without website owners having to take any action. Cloudflare expects the same to happen with future AI crawlers that use similar techniques to hide themselves.
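The scoring logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classify_request function and its block_below parameter are hypothetical names, the cutoff of 30 is taken from the Perplexity example in this article, and none of it represents Cloudflare's actual API or rule syntax.

```python
def classify_request(bot_score: int, block_below: int = 30) -> str:
    """Decide what to do with a request based on its bot score.

    Scores run from 1 to 99, and lower scores mean the request is more
    likely automated. The default cutoff of 30 is an assumption drawn
    from the article's Perplexity example.
    """
    if not 1 <= bot_score <= 99:
        raise ValueError("bot score must be between 1 and 99")
    return "block" if bot_score < block_below else "allow"


# Illustrative scores only, not real measurements.
for score in (12, 29, 30, 85):
    print(score, classify_request(score))
```

Because the decision depends on the score rather than on the declared user agent, a crawler that spoofs a browser identity is still blocked as long as its traffic scores below the cutoff.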

“When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint,” Cloudflare’s engineers explained in the blog post.

Looking Ahead

The company also outlined two ways for website owners to report misbehaving AI crawlers. Enterprise Bot Management customers can file false negative feedback reports through Bot Analytics, while all Cloudflare customers can use a dedicated reporting tool to flag AI bots that access their content without a license.

Cloudflare also vows to keep updating its AI Scrapers and Crawlers rules and models as AI companies develop new methods, fingerprints, or bots to avoid detection.