OpenAI is challenging a lawsuit from The New York Times in the court of public opinion. In a blog post, OpenAI said the lawsuit lacks merit and suggested that the Times isn’t providing a complete picture of the situation.
The Times claims copyright infringement, asserting that OpenAI and Microsoft used the newspaper’s articles to train their chatbot, ChatGPT.
The lawsuit argues that the Times stands to lose customers and revenue if it’s forced to compete with ChatGPT as a news source.
This raises the question of whether OpenAI is operating within the bounds of fair use, or whether this litigation will open the door to a new framework for making such assessments.
“As a copyright lawyer and an academic, this is the first thing that I wanted to know,” said Matthew Sag, a professor of law at Emory University specializing in the intersection of intellectual property and generative AI.
Courts have indicated that large language models processing huge amounts of data to generate abstract information on copyrighted material may qualify for fair use under U.S. law.
The way machine learning works is that rather than starting with a theory and then, you know, testing that like a normal statistician, you basically throw an incredible amount of data at a model and the model keeps tweaking itself in successive rounds of training, trying to get better.
Matthew Sag, Emory University
“Generative AI is kind of a slippery term,” Sag said. “I mean, what we’re really talking about is sort of a subset of machine learning programs. And the way machine learning works is that rather than starting with a theory and then, you know, testing that like a normal statistician, you basically throw an incredible amount of data at a model and the model keeps tweaking itself in successive rounds of training, trying to get better.”
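The iterative training Sag describes can be illustrated with a toy sketch: rather than starting from a theory, the model is repeatedly shown data and adjusts its own parameters to reduce its error in successive rounds. This is a minimal, hypothetical illustration (all names and data are invented), not how GPT-4 itself is implemented.

```python
# Toy illustration of iterative model training: the model is shown
# examples over and over and tweaks its single parameter a little each
# time to reduce its error. All values here are made up.

def train(examples, rounds=1000, lr=0.01):
    """Fit y ~ w*x by successive small corrections (gradient descent)."""
    w = 0.0
    for _ in range(rounds):
        for x, y in examples:
            error = w * x - y      # how wrong the model is on this example
            w -= lr * error * x    # nudge the parameter to shrink the error
    return w

# Data secretly generated by y = 3x; the model "discovers" that pattern
# without ever being told the rule.
data = [(1, 3), (2, 6), (3, 9)]
print(round(train(data), 2))
```

A large language model works on the same principle, just with billions of parameters and an enormous corpus of text instead of one parameter and three numbers.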
Whether that training process amounts to infringement is where the debate sharpens.
“One of the things that’s really impressive about The New York Times complaint is that they show, like a lot of examples of, ‘Hey, you didn’t just learn abstract things, you kind of seem to have learned how to copy our works exactly.’ And quite frankly, I was shocked at how impressive the evidence was, but that evidence has not been tested,” Sag said.
For example, someone can ask ChatGPT to summarize a specific historical event, and within seconds the generative AI will produce a summary of whatever length the user requested. The AI can also do things like write songs in the style of a particular artist, which raises questions about where that output comes from and whether it is a direct copy of the original source material.
The chatbot can produce “regurgitations,” or memorizations, meaning the model generates text that is similar or identical to phrases, sentences or passages from the data it was trained on. In this phenomenon, the model reproduces specific patterns memorized from its training set rather than generating a novel response.
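One simple way to spot possible regurgitation is to check whether a model’s output shares long word-for-word sequences (n-grams) with a known source text. The sketch below is a hypothetical illustration with invented placeholder sentences, not the method the Times or OpenAI actually used.

```python
# Flag possible "regurgitation" by finding the longest word-for-word
# run shared between a source text and a model's output. The texts
# below are invented placeholders, not real articles.

def ngrams(text, n):
    """All n-word sequences in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def longest_shared_run(source, output, max_n=50):
    """Length of the longest word sequence present in both texts."""
    best = 0
    for n in range(1, max_n + 1):
        if ngrams(source, n) & ngrams(output, n):
            best = n
        else:
            break
    return best

source = "the quick brown fox jumps over the lazy dog near the river"
copied = "witnesses say the quick brown fox jumps over the lazy dog today"
fresh = "a fast auburn fox leapt across a sleeping hound"

print(longest_shared_run(source, copied))  # long run: likely copying
print(longest_shared_run(source, fresh))   # short run: likely novel text
```

A long shared run, like the nine-word overlap in the first pair, is the kind of verbatim reproduction the Times’ complaint documents; paraphrased text, as in the second pair, shares almost nothing word for word.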
“The regurgitations The New York Times induced appear to be from years-old articles that have proliferated on multiple third-party websites,” OpenAI said in response to the Times’ lawsuit. “It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate.”
OpenAI and Microsoft have not yet filed formal responses in the New York case. The companies are required to answer the summons by Jan. 18.
“You know, the NYT complaint, it’s impressive,” Sag said. “And if what they’re showing really is representative of what the GPT-4 is doing, then you know you can, you know they’re hard put to argue that it’s a non expressive use. I’m still skeptical but I think it’s one we have to wait and see how it plays out.”