AI’s Voracious Appetite for Data Imperils Key Privacy Principles

Oct. 2, 2025, 9:00 AM UTC

The massive demand for data to train AI models is threatening to undermine one of privacy advocates’ key goals: data minimization.

Policymakers have long argued that consumer privacy is best protected if entities collect and process only the data they need for a specific purpose, and retain that data only as long as needed. The principle is enshrined in the EU’s General Data Protection Regulation and similar laws in Canada, Australia, Brazil, and India, as well as in California and many other US states.

But the need for AI training data is so great that the entire corpus of written English produced until the present day isn’t enough, said Jennifer King, privacy and data-policy fellow at the Stanford University Institute for Human-Centered Artificial Intelligence.

“Everybody in the field is talking about how we’re out of data, that pretty much everything that can be scraped and processed has basically been done at this point,” she said.

Some privacy advocates are optimistic that AI and data minimization can coexist as developers explore the benefits of using curated and anonymized data to train their AI tools rather than stuffing them with every bit of information they can capture.

But others are less certain that data minimization will survive in a world where remaining competitive in AI development is a key factor in economic growth.

“Those who follow this closely see that our personal data and the kinds of data we produce in chatbots and online conversations is only going to increase in value,” King said. “And that flies in the face of the principles of data minimization that would restrict access to that kind of data.”

“I’m very curious to see which is going to win,” she said.

Data Hunger

The root of the conflict is large language models’ insatiable hunger for data to train on, said D. Reed Freeman Jr., a partner with ArentFox Schiff LLP who focuses on privacy and data security.

“You need an extraordinary amount of data to develop a large language model that’s going to be used in a generative AI capacity,” he said. “And if a state comes along with a law that says you can only collect and use data for the purpose for which the consumer gave it to you, guess what? You can’t train.”

That’s why Google LLC was willing to pay $60 million last year in a deal to gain access to Reddit posts for the training of its AI models, King said.

Other tech companies placing big bets on AI, such as Meta Platforms Inc., are similarly “data hungry,” she said.

The current mantra among AI developers appears to be “the more data, the better,” which accounts for tech industry lobbying efforts against regulations that could slow AI development, said Calli Schroeder, who heads the global privacy project at the Electronic Privacy Information Center.

Need to Curate

But there’s reason to think that mantra is flawed, she said.

“If you’re training your AI on huge, huge volumes of data, but you’re not curating that data at all, you’re going to be building your model on random online posts with wildly improper grammar, conspiracy theories, stolen information, or things that are wildly inaccurate,” Schroeder said.

A recently released Meta AI tool that had been trained on Facebook and Instagram posts was “like talking to a caffeine-addled teenager,” King said. There was “nothing but ‘lol,’ ‘Oh my God,’ and ‘Are you sure?’,” she said.

Data quality concerns could help ease the tensions over the use of data by AI developers, said Eduardo Ustaran, a partner with Hogan Lovells in London who is co-head of the firm’s privacy and cybersecurity practice.

“The way you reconcile the development of AI tools, which need data, with the principle of data minimization, is by ensuring that the type of data and the nature of the data that is used is necessary for the training that is being undertaken,” he said. “That can create the room for maneuver to justify AI development under a data-minimization regime.”

China’s DeepSeek AI model provided an interesting counter-example to the argument that AI tools must be trained on as much data as possible, Ustaran said. “DeepSeek was as good as other open AI models, but was developed with a lot less data,” he said.

Not needing as much data can also provide a competitive advantage, he said.

“If you need to source 10 times the amount of data, it’s going to require more energy, it’s going to take more time, and it’s going to be more expensive,” he said.

Flexible Regulation

Nevertheless, AI developers remain wary of regulations that could slow advances in the technology, Freeman said.

He argues for a flexible approach to regulating AI that focuses on specific harms, such as its improper use in hiring decisions or granting credit, rather than abstract privacy harms related to data collection for training purposes.

But even a flexible approach to regulation may not satisfy a tech industry willing to play regulatory regimes against one another by raising the specter of competitive disadvantage, said Amba Kak, co-executive director of the AI Now Institute and a former AI policy adviser at the Federal Trade Commission.

There are even rumors that the EU is considering a second look at the GDPR and its data-minimization principles under the pressure of the AI arms race, King said.

“I don’t think they’re proposing throwing away the fundamental rights, but I think you are seeing an adjustment to the reality that’s at least partially being driven by AI,” she said.

To contact the reporter on this story: Christopher Brown in St. Louis at ChrisBrown@bloombergindustry.com

To contact the editors responsible for this story: Nicholas Datlowe at ndatlowe@bloombergindustry.com; Laura D. Francis at lfrancis@bloombergindustry.com
