Cloudflare’s new feature will block AI scrapers from pilfering internet content

technology July 4, 2024

Cloudflare will help internet content creators prevent AI developers from using their content to train LLMs. This service will become part of Cloudflare’s existing CDN platform for securely managing web content.

Cloudflare is adding functionality to combat AI developers who scrape internet content with dedicated bots. These bots are becoming increasingly sophisticated and better at disguising their true identities.

According to the CDN provider, AI developers are trying to disguise their scraping bots as legitimate web browsers. They do this by sending a spoofed user agent, which makes it difficult to distinguish genuine visits from bot visits. Cloudflare has developed a proprietary machine learning model that detects this spoofing, so the bots can be blocked.
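To see why a spoofed user agent is a problem, consider a filter that only looks at the User-Agent header. The sketch below is a simplified illustration and not Cloudflare's method: the function name and the token list are assumptions made for this example, and the two header strings are made-up samples.

```python
# Illustrative list of tokens that self-identifying AI crawlers put in their
# User-Agent header. This list is an assumption for the example, not a
# complete or authoritative registry.
DECLARED_AI_CRAWLER_TOKENS = ("GPTBot", "CCBot", "PerplexityBot")


def looks_like_declared_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header names a known AI crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in DECLARED_AI_CRAWLER_TOKENS)


# A crawler that identifies itself honestly (sample string).
honest = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"

# A scraper that copies an ordinary desktop browser string (sample string).
spoofed = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    print(looks_like_declared_ai_crawler(honest))
    print(looks_like_declared_ai_crawler(spoofed))
```

The header check catches the honest crawler but lets the spoofed request through, which is why detection has to rely on signals other than the self-reported user agent.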

One of the AI developers using this form of scraping is Perplexity AI, as recently discovered by Wired magazine.

Tip: Is Perplexity a preview of online search’s AI-driven future?

Score rating of website visits

In response to this unwanted scraping, Cloudflare has developed a no-code solution within its CDN platform that checks whether website traffic may have come from a scraping bot, even if the bot is trying to hide itself.

To do this, Cloudflare gives each website visit it processes a score between 1 and 99. The lower the score, the more certain it is that the request came from a bot. For example, Perplexity AI’s scraping bot gets a score lower than 30.
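A site owner could act on such a score with a simple threshold rule. The following is a minimal sketch under stated assumptions: the function name, the labels, and the cut-off of 30 (borrowed from the Perplexity example above) are hypothetical and do not represent Cloudflare's API or its recommended settings.

```python
# Assumed cut-off, taken from the article's example of a bot scoring below 30.
BOT_THRESHOLD = 30


def classify_visit(score: int, threshold: int = BOT_THRESHOLD) -> str:
    """Label a visit based on a 1-99 score, where lower means more bot-like."""
    if not 1 <= score <= 99:
        raise ValueError("score must be between 1 and 99")
    return "likely_bot" if score < threshold else "likely_human"


if __name__ == "__main__":
    for score in (5, 29, 30, 87):
        print(score, classify_visit(score))
```

Raising the threshold blocks more bots at the cost of more false positives, so the cut-off is a trade-off each site would need to tune for itself.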

The technology used to assign the scores allows Cloudflare to give a specific ‘fingerprint’ to certain tools and frameworks that scrapers use to collect data. For this, the company draws on its own network, which processes about 57 million web requests per second. The company indicates that it relies heavily on this technology.

In addition to identifying possible visits by AI scraper bots, Cloudflare users also get a special reporting tool to notify the company of possible active bots on their website. Enterprise Bot Management customers can also send a False Negative Feedback Loop report through Bot Analytics. This can be done by simply clicking the data traffic segment when they identify suspicious behaviour that is not automatically spotted by Cloudflare’s own tech.

The Cloudflare tools against web content scraping by AI developers are available for both the free and paid versions of the Cloudflare CDN platform.

AI content scraping is a contentious topic

Content scraping by AI developers has long been under discussion. OpenAI and Google offer opt-out options, but other LLM developers still use web content without the creator’s consent.
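Opt-outs of this kind are typically expressed in a site's robots.txt file. The sketch below assumes that mechanism and uses the crawler tokens GPTBot (OpenAI) and Google-Extended (Google); the robots.txt content, the helper name `is_allowed`, and the example URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of two AI crawlers and allows
# everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether robots.txt permits the given user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    print(is_allowed(ROBOTS_TXT, "GPTBot", page))
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", page))
```

A robots.txt rule only works if the crawler chooses to honor it, which is why it offers no protection against scrapers that ignore it or hide their identity.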

Content providers are, therefore, not only trying to prevent this scraping but also striking deals that let AI developers use their content. Consider Reddit, for example. This community platform has exclusive paid deals with OpenAI and Google but actively repels AI scraping by other developers.

Also read: Reddit to block AI data scrapers who haven’t signed a lucrative deal
