Personal Info in AI Models Threatens Split in US, EU Approach

Aug. 28, 2024, 9:03 AM UTC

A German privacy regulator’s controversial finding that AI models don’t memorize personal information like names and birth dates has intensified an ongoing international debate over the fate of a person’s data once it’s gobbled up by large language models.

The Hamburg Commissioner for Data Protection and Freedom of Information determined in a July 15 paper that even though an AI system like a chatbot might be able to spit out personal data, the LLMs that power it don’t store that information. That’s because after an LLM is trained on data, it “lacks the necessary direct, targeted association to individuals” to be considered personal data, according to the paper.

If this reasoning takes hold in Europe and parts of the US, individuals could lose the ability to exercise their rights to access, correct, or delete personal data once it has been ingested by an LLM.

The conclusion, from the German equivalent of a state enforcement authority, roiled not just privacy advocates but also technologists who’ve established that LLMs can memorize personal data and regurgitate it when prompted or attacked. A coalition of researchers from institutions including Cornell University; University of California, Berkeley; and Google DeepMind coaxed OpenAI Inc.’s ChatGPT to divulge 10,000 examples of identifiable data using a $200 budget, according to their paper published in November.
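For a sense of how researchers quantify such leaks, extraction studies typically generate large volumes of model output and then scan it for identifiable strings. The sketch below is purely illustrative — the patterns and sample text are hypothetical, and no real model is queried — but it shows the kind of post-processing step such work relies on:

```python
import re

# Hypothetical sketch: scan model-generated text for personally
# identifiable strings, as extraction-attack researchers do after
# prompting a model at scale. Patterns here are illustrative only.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return every pattern match found in a model's output text."""
    return {label: pat.findall(text) for label, pat in PII_PATTERNS.items()}

# Example on a made-up completion (not real training data):
sample = "Contact Jane Doe at jane.doe@example.com or 555-867-5309."
hits = find_pii(sample)
```

In practice, researchers run far more sophisticated matching (against known training corpora, for instance); this only conveys the basic shape of the measurement.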

“It is a really hotly debated policy question, just because for privacy and data protection laws, so much hinges on this initial question of, what is personal data? Where does it live?” said Cobun Zweifel-Keegan, managing director of the International Association of Privacy Professionals in Washington, D.C. “LLMs are just very different systems than a database. They don’t store information in neat rows and columns.”

In the US, California lawmakers are pushing for a more absolutist approach. Proposed amendments to the California Consumer Privacy Act aim to clarify that the law’s protections apply to personal information, whatever digital format it may take, including in “artificial intelligence systems that are capable of outputting personal information” such as chatbots. The proposals, though, don’t specify which components of an AI system might contain the personal data.

The split regulatory approaches stem from the black-box nature of most large language models, which has left privacy watchdogs on both sides of the Atlantic grappling with how to determine if AI models contain personal data, and how to apply existing privacy laws to generative AI. While the German regulator’s approach diverges from technologists’ findings, California’s broader scope skirts answering critical questions about the technical feasibility of granting certain data subjects rights.

“Memorization is certainly a thing. Data can sometimes come out, but it’s not necessarily very, very common,” said Florian Tramèr, an assistant professor of computer science at ETH Zürich who works on computer security, privacy, and machine learning. “Whether this matters or not is ultimately a legal or a policy question.”

Models Can Memorize

Most current machine-learning research points to evidence that models memorize fragments of the data they were trained on—poems, pieces of code, country names, or people’s full names.

The German regulator’s position is “a really idealized interpretation of what happens,” said Patrick Hall, teaching assistant professor of decision sciences at the George Washington University School of Business. “For me, it’s pretty clear sometimes neural networks do memorize data.”

AI models’ capacity to retain certain types of personal information has been proven in several ways, including through certain fine-tuning processes and targeted attacks that can coax a chatbot into reproducing training data.

Groups of tech-savvy users have even taken on the goal of “breaking” models, and are often successful in extracting personal data, Hall said.

Research on how and why models retain certain bits of information is still developing. Some of the biggest LLMs on the market have been trained on a huge subset of the internet, Tramèr explained, making it nearly impossible for them to actually memorize it all without “the biggest breakthrough in data compression.”

“It doesn’t mean that the model doesn’t somehow store or memorize individual bits and pieces of data it’s been trained on,” he said.

In many cases, that’s the desired end result—users likely want the model to remember things like that Paris is the capital of France, for instance. But controlling what the model retains and what it doesn’t remains an unsolved problem, Tramèr added.

Probability ‘Isn’t Sufficient’

Because machine-learning research is still developing, some of these findings are too probabilistic for privacy regulators to act on, at least for now, said Markus Wünschelbaum, policy and data strategy advisor to the Hamburg Data Protection Commissioner.

(Wünschelbaum said he was not speaking on behalf of the Hamburg Data Protection Commissioner.)

“The problem with this is that the regulatory logic of the GDPR does not recognize the ‘probable’ piece of personal data,” Wünschelbaum said. “It’s either personal data or it isn’t.”

Under Europe’s General Data Protection Regulation, authorities like the Hamburg Data Protection Commissioner bear the burden of determining whether personal data is being processed by an organization.

“In the end, a conceivable, hypothetical, probable piece of personal data, is not sufficient to justify action by a supervisory authority,” Wünschelbaum said. “We cannot issue fines because we find it possible that there’s personal data. There has to be something and this is a definite discrepancy between the GDPR and AI models.”

The data-protection authority’s interpretation received mixed feedback in the EU and opened up questions about the future of GDPR enforcement for generative AI products. The European Data Protection Board, which oversees enforcement of the GDPR across the EU, hasn’t yet weighed in on the question.

The board convened a ChatGPT task force last year that concluded that LLMs are trained and enhanced using “a huge amount of data, including personal data.” That data must be protected under the guardrails of GDPR and a “technical impossibility cannot be invoked to justify non-compliance,” it noted in a May 2024 paper on the task force’s findings.

California Assemblymember Rebecca Bauer-Kahan, the author of one of the bills to amend the CCPA’s treatment of personal information in AI models, voiced dismay at the Hamburg Commissioner’s paper.

“It’s fascinating that anyone would contend that private data is not being stored in these models,” she said.

Regulators’ deliberations over what counts as personal data are likely a precursor to a more technical question of how AI tools and privacy can co-exist, said Cassandra L. Gaedt-Sheckter, co-chair of Gibson, Dunn & Crutcher’s AI practice.

“It can be difficult to execute on obligations, privacy-related obligations, once personal information is used in an algorithm, for example, either for training or other purposes,” she said. That complicates “the question of what you do with a request to delete” under privacy laws.

The German regulator’s business-friendly approach relies on false statements to back up its enforcement perspective, said Max Schrems, the Austrian privacy advocate and founder of noyb—European Center for Digital Rights, one of the EU’s leading privacy advocacy groups.

While penalties require a higher burden of proof than a probability under the GDPR, “this is not true for a mere order to stop, delete or rectify data, which is not a ‘penalty’ or ‘fine,’” Schrems told Bloomberg Law in an email.

He added, “It really sounds like a circular ‘if you shut your eyes it’s not there’ argument.”

To contact the reporter on this story: Cassandre Coyer in Washington at ccoyer@bloombergindustry.com

To contact the editors responsible for this story: Kartikay Mehrotra at kmehrotra@bloombergindustry.com; Adam M. Taylor at ataylor@bloombergindustry.com
