The Intercept Bolsters OpenAI Copyright Suit with More Evidence

The Intercept Media Inc. amended its complaint against OpenAI Inc. and Microsoft Corp. to add examples of text “regurgitations” demonstrating the firms removed author information from articles in data sets used to train its generative AI tools.

Over 6,000 URLs from The Intercept are included in one training set used by OpenAI and over 2,000 are in another, according to the amended complaint filed June 21 in the US District Court for the Southern District of New York. The complaint detailed how OpenAI used two sets of algorithms, Dragnet and Newspaper, to strip copyright information from news articles then used to train its artificial intelligence products.

The Intercept Media Inc., Raw Story Media Inc., and Alternet Media Inc. filed a pair of lawsuits against OpenAI and Microsoft in February, accusing the organizations of violating the 1998 Digital Millennium Copyright Act by intentionally removing author names, titles, copyright notices, and terms of use from their content. The ChatGPT AI chatbot has since produced verbatim or near-verbatim versions of copyright-protected articles without the corresponding credits and other identifying data, the complaints said.

After Judge Jed S. Rakoff on June 3 heard oral argument regarding OpenAI and Microsoft’s motions to dismiss the lawsuit, he ordered The Intercept to amend its complaint to “attempt to rectify some of the seeming lack of specificity in its current complaint.”

The Intercept amended its complaint to include a 374-page exhibit listing articles it owns the copyrights to, a 225-page exhibit listing its URLS included in a data set used by OpenAI to train its models, and examples of how the Dragnet and Newspaper algorithms produced text “substantively identical” to its content without copyright information. The output was at times identical except for “the seemingly random addition of an extra space between two words, or the exclusion of a description associated with an embedded photo,” the updated complaint said.

OpenAI has admitted to intentionally using the Dragnet and Newspaper codes when assembling a data set, The Intercept said, adding it continues to use the same or similar text extraction methods when creating training sets for its chatbot.

Because the codes were publicly known to remove copyright information and OpenAI employs “highly skilled data scientists” familiar with how they work, The Intercept said OpenAI “intentionally and knowingly removed this copyright management information.”

The media outlet’s new filingalso said OpenAI possesses a repository of every regurgitation of its work since it retains users’ chat histories unless they disable that feature. And the firm has effectively admitted regurgitation of copyrighted works is infringement because it updated ChatGPT to reduce regurgitations for copyright reasons.

The amended complaint contains additional details about Microsoft’s involvement as well, including CEO Satya Nadella’s quote from a November 2023 New York Magazine’s Intelligencer interview saying the tech giant could continue innovation even if OpenAI disappeared because “we have the data, we have everything.”

OpenAI and Microsoft have until July 8 to file a supplemental brief in support of their motions to dismiss.

Loevy & Loevy represents Intercept. Latham & Watkins LLP, Morrison & Foerster LLP, and Keker, Van Nest & Peters LLP represent OpenAI. Orrick, Herrington & Sutcliffe LLP and Keker, Van Nest & Peters LLP represent Microsoft.

The case is The Intercept Media Inc. v. OpenAI Inc., S.D.N.Y., No. 24-cv-01515, 6/21/24.

Learn more about Bloomberg Law or Log In to keep reading:

See Breaking News in Context

Bloomberg Law provides trusted coverage of current events enhanced with legal analysis.

Learn more about Bloomberg Law or Log In to keep reading:

See Breaking News in Context

Already a subscriber?