Websites Turn to Charging AI Scrapers They’ve Struggled to Block

Aug. 28, 2025, 9:00 AM UTC

Unnerved by AI-powered bots able to extract data from websites at tremendous speeds, companies are considering a new tactic startling to some: allow scraping for a fee.

Online platforms TollBit and Cloudflare are two pioneers of this strategy—known as a pay-per-crawl model—which lets companies block, allow, or monetize web scrapers’ access to their content.

Companies that can’t stop increasingly sophisticated bots—either through their own technology or in the courts—face the loss of their proprietary data, an existential threat for small firms. Selling information to scrapers, however, comes with its own perils, from copyright infringement to privacy law violations, ushering online platforms into uncharted legal territory.

Pay-per-crawl models change the nature of the claims users or copyright holders could wield against companies, said Austin T. Chambers, partner at Dorsey & Whitney LLP.

“Given the complex copyright, data protection types of issues, any client considering this has to kind of come at this from multiple angles,” he said, “and make sure that they’re well advised in thinking through sort of the concrete risks here.”

Do Not Scrape

Web crawlers have been scraping online content since the dawn of the internet.

In response, many companies wanting to block them have relied on publicly facing instructions for bots about what crawlers can or can’t access. But the instructions file—called robots.txt—lacks legal weight and has largely served as a warning.

Most scraping bots now “ignore it,” Chambers said. “It’s just sort of a passive signaling mechanism that doesn’t actually block it.”
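For reference, robots.txt is just a plain-text file of advisory directives published at a site's root. A site hoping to refuse AI crawlers might post something like the following (GPTBot and CCBot are real AI-crawler user agents, but as the article notes, compliance is entirely voluntary):

```text
# robots.txt — advisory only; well-behaved crawlers honor it, others ignore it
User-agent: GPTBot        # OpenAI's crawler
Disallow: /

User-agent: CCBot         # Common Crawl
Disallow: /

User-agent: *             # everyone else may index the public pages
Disallow: /private/
```

Nothing in the file is enforced by the server; that gap between signal and enforcement is what the paid-access tools aim to close.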

Some platforms have coupled their do-not-scrape signs with more aggressive language in their terms of use, giving them legal grounds to sue crawlers under federal statutes like the Computer Fraud and Abuse Act. Others have put some of their content behind paywalls or deployed traffic-control-like tools to block IP addresses belonging to bots.

Companies with more resources have been able to fend off scrapers despite the CFAA’s limitations. Microsoft Corp.’s LinkedIn sued Proxycurl in January under the federal law after it said the startup scraped millions of its users’ profiles. Proxycurl shut down in July, pointing to LinkedIn’s “unlimited war chest.”

But small website owners face tougher challenges.

“Maybe a small artist goes and puts up their website with all their art, and they had a thriving business. And then, three weeks later, any of the text-to-image models can generate their style flawlessly,” said Shayne Longpre, who leads the Data Provenance Initiative, a volunteer group of AI researchers. “They don’t have the tech savvy to block it.”

Some businesses now get 95% of their web traffic from AI scraping bots, according to statistics from HUMAN, a cybersecurity company. One e-commerce business was hit by 725 million scraping attempts in April 2024 alone, the company said.

Companies need to start thinking about how they want to control this web traffic, said Bryan Becker, senior director of product management at HUMAN. “What specific things do they want to allow it to or not allow it to do?” he said.

Legal Questions

Financial deals with AI companies are attractive for platforms that have lost key revenue streams, but they may open them up to more legal risks.

“The monetization part gives me more pause, because we already have a regime for monetizing content, and it’s intellectual property,” said Andy Sellars, partner at Albert Sellars LLP. “And it is much more complicated than sort of a yes, no paywall thing.”

Content publishers including Dow Jones & Co., New York Times Co. and Reddit have sued generative-AI platforms like OpenAI Inc., Perplexity AI Inc., and Anthropic PBC over copyright infringement. The landmark lawsuits, still making their way through US courts, seek to answer whether AI companies’ use of publishers’ works to train their large language models counts as fair use. The risks to the AI companies aren’t insubstantial, either: Anthropic, warning that it faced hundreds of billions of dollars in statutory damages, said Tuesday that it had reached a class-wide settlement with authors in a copyright lawsuit over millions of pirated books.

For website owners, any monetization of personal information would have to abide by privacy laws. While many state privacy statutes include exemptions for publicly available personal information, those laws still give consumers rights over how their data is sold.

Companies operating in California or Colorado, for example, need to notify users and give them a chance to opt out of the sale or sharing of their personal information.
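In practice, a publisher licensing scraped-page data would need to plumb those opt-outs into its pipeline before anything is sold. A minimal sketch of that filtering step, with an entirely made-up record format standing in for a real user database:

```python
# Minimal sketch: before licensing a dataset to a scraper, drop records for
# users who opted out of "sale or sharing" (e.g., via a state-law opt-out
# link). The record format here is an assumption for illustration only.

def sellable_records(records):
    """Keep only records whose owner has not opted out of sale/sharing."""
    return [r for r in records if not r.get("opted_out_of_sale", False)]

users = [
    {"id": 1, "state": "CA", "opted_out_of_sale": True},
    {"id": 2, "state": "CO", "opted_out_of_sale": False},
    {"id": 3, "state": "CA"},  # no recorded preference -> sale not barred
]
```

Real compliance is broader than this (notice, verification, signal handling like Global Privacy Control), but the point is that monetized scraping access inherits the same sale/sharing obligations as any other data sale.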

Selling Data

It’s too soon to say whether pay-per-crawl models will become pervasive.

Cloudflare’s offering is still in beta, and the company doesn’t yet have a timeline for a broader release, according to its VP of product, Will Allen.

The company is engaging with both sides of the marketplace, he said, to determine how “publishers feel about it? Is it working for them?” Cloudflare is also learning what crawlers are asking for, like content safety scores and other classifications.

“It’s not obvious to me that this is going to be scalable across the full ecosystem of publishers on the internet,” said Marc McCollum, chief growth officer at Raptive, which helps sell digital ads for independent creators and supports Cloudflare’s model. “It’s also not obvious to me that AI companies are going to be willing to pay per crawl, so there’s a lot that would have to come together for that to work at scale.”

Financial arrangements with scrapers may also be out of the question for certain industries.

Platforms operating in the retail, airline, financial services, or social media industries, for example, may have stronger incentives to protect their data, said Amy Mushahwar, chair of the data, privacy and cybersecurity practice at Lowenstein Sandler LLP.

Scraped financial data could impact a platform’s ability to set competitive prices, for instance, while allowing bots to scrape user data could hurt consumer trust and trigger lawsuits.

She added, “There is a camp of ‘scraping must be limited.’”

An incorrect AI summary previously at the top of this story was removed.

To contact the reporters on this story: Cassandre Coyer in Washington at ccoyer@bloombergindustry.com; Kaustuv Basu in Washington at kbasu@bloombergindustry.com

To contact the editors responsible for this story: Jeff Harrington at jharrington@bloombergindustry.com; David Jolly at djolly@bloombergindustry.com
