- Bill would require notice of works used to train AI models
- Practicality questioned as legal, ethical issues debated
A bill that would require generative AI developers to disclose when they train their models on copyrighted works raises major logistical issues on top of raging legal questions.
The Generative AI Copyright Disclosure Act (H.R. 7913), introduced by Rep. Adam Schiff (D-Calif.) April 9, would require notice to the US Copyright Office of “copyrighted works used in building generative AI systems.” The bill is being praised by writers, artists, and other creators. But advocates of the technology argue the measure is impractical and unnecessary.
Whether using millions of works to train large language models infringes copyrights, regardless of the AI’s output, is also the subject of ongoing litigation. If it does, even identifying, collecting, and cataloging all the works ingested by an AI model—much less tracking down and paying rightsholders—presents daunting logistical challenges that, along with the threat of royalties, could pose an existential threat to the emerging technology.
“I don’t even really understand what he’s trying to do,” Eric Goldman, an intellectual property and technology law professor at Santa Clara University, said of Schiff and the bill. “Trying to disclose your sources when building an AI generative index—it’s not possible.”
The more material AI trains on, the more powerful and useful it becomes, Goldman said. He wondered how it would be possible to find anything on a disclosed list of billions of URLs, and what the possible benefit could be—other than an unimaginably vast “roadmap for litigation.”
“Having talked to creators and rightsholders, there’s a perpetual frustration to have this process of trying to go after these people who have continuously used their works without having to get a license,” Simpson said. “If these AI developers are relying on human creativity to design their machines, legalities aside, creators still need to benefit from their works.”
Schiff’s office didn’t respond to a request for comment. Its press release announcing the bill said the measure “champions innovation while safeguarding the rights and contributions of creators, ensuring they are aware when their work contributes to AI training datasets.”
“AI has the disruptive potential of changing our economy, our political system, and our day-to-day lives. We must balance the immense potential of AI with the crucial need for ethical guidelines and protections,” Schiff said in the release. “This is about respecting creativity in the age of AI and marrying technological progress with fairness.”
The bill has yet to gain any cosponsors and faces long odds for passage. But the proposal and its considerable support highlight the broader debate over what generative AI developers can and should be required to do—or, potentially, pay—regarding training materials for their models.
Scale
Schiff introduced his bill at a time when courtrooms across the country have already begun assessing whether training on copyrighted materials infringes copyrights.
The bill would require anyone who “creates” or “alters” a training dataset used by generative AI to provide the Copyright Office a “sufficiently detailed summary of any copyrighted works used.”
The definition of copyrighted works doesn’t require them to be registered with the Copyright Office. Because works are legally protected upon creation, that would rope in any original work—from a painting or novel to a blog post or TikTok video.
Simpson said AI companies have a better idea of what they’re using to train their models than artists do, especially given the “black box” posed by inaccessible models.
“The onus should be on the AI developer to talk about what they think is the right solution,” Simpson said.
But “the sheer scale does present practical issues,” Computer & Communications Industry Association senior counsel Joshua Landau said, noting that models can train on more than 500 million works. Beyond the difficulty of identifying creators, that scale would leave little money per creator even for lucrative AI systems if royalties were eventually required, he said.
Simpson noted that AI companies could get a big head start with a relatively small number of contacts given the magnitude of rights held by big publishers and rights-management organizations. Many of those organizations lined up behind Schiff’s bill in the press release—groups representing the music industry, book publishers, photographers, film and television writers, authors, and other artists.
“As an AI developer, I think you could clear a lot of works but don’t think you could clear everything,” Simpson acknowledged. “Individual creators not affiliated with a broader rights organization, that’s where a legislative solution might be needed.”
Goldman said the training should be covered by the fair use doctrine, which allows unlicensed copying in some circumstances. He said the training sets are legally indistinguishable from Google Books’ scanning of millions of books to provide a reference service that reproduced snippets of text—which the Second Circuit affirmed as fair use in 2015.
Incentivizing ‘Junk’
The bill wouldn’t explicitly require permission or royalty payments for using works as training data.
But Landau and Goldman both noted that a pay-for-input system would erect a large barrier to entry for new generative AI companies, cutting out smaller ones that might provide competition, innovation, or niche offerings.
Simpson pushed back on the idea that it’s impractical to pay creators. During a stint with the House Judiciary Committee, she worked on the Music Modernization Act, a law that created a clearinghouse that now distributes royalties on millions of songs played by music streaming services.
She said she’d challenge the premise that half a billion works are needed to build a viable model. While it’s unclear what a potential clearinghouse model might look like for AI, the Copyright Office’s registry provides a starting point as a database for linking royalties to creators, she said.
“You’ve got to start somewhere, and registered works are ones that creators thought were worthy of being registered. Worry about those first,” she said.
AI and ‘Human Potential’
Advocates for AI have suggested machine learning is conceptually little different from human learning: both take in information about the work of others, then produce their own creative work influenced by it. In AI’s case, each individual work makes up an even smaller fraction of the influence on the output, given the multitude of sources.
“Everyone who is an artist today got their inspiration and training from other sources. They put their unique spin on it,” Goldman said. “It’s just really hard to disentangle the generative AI index arguments with the way artists generally build their craft. In all cases you have to learn from somebody.”
Simpson called that reasoning “overly simplistic.”
“That analogy for AI being compared to a human learning is kind of flawed on its face,” Simpson said, noting that no human can ingest the volume of works AI can. “Seems to me that it’s more dependent on copying and repackaging than a human is.”
While creators lament the removal of humanity from their art forms, AI advocates worry that the damage from hamstringing the technology goes beyond efficiently creating art or text. Goldman noted, for example, that generative AI is often used to write software code, with boundless potential applications for people who have better ideas than coding skills—if AI training on robust code were legal.
“If the chatbot can reduce the access barriers to coding,” Goldman said, “that can really expand the range of human potential.”