- Duke Law’s Keith Porcaro reviews research showing LLMs’ limits
- Lawyers should closely test AI on tasks like issue spotting
A recent paper from a team of Apple researchers reveals cracks in a crucial pillar of large language model marketing. Every lawyer experimenting with artificial intelligence should read it and absorb its warning.
The paper’s main finding: LLMs, the technology powering popular AI chatbots such as ChatGPT, aren’t capable of formal reasoning. To show this, the researchers started with a popular benchmark that tests state-of-the-art LLMs with grade-school mathematical word problems. Then they made two modifications to the problem set.
First, the researchers generated numerous variations of each problem, each with changed names and numerical values. For example, “Emily picked 4 apples” might become “John picked 7 apples,” “Lee picked 19 apples,” and so on. If LLMs were capable of formal reasoning, they should perform identically well on each problem variant. Trouble was, they didn’t—the models’ performance dropped slightly.
Second, the researchers modified the wording of each problem by adding minor details that seemed relevant, but that didn’t change the problem’s conclusion or the reasoning required to reach it. For example, a problem that involved counting apples might have the added detail “five of the apples were a bit smaller than average.” The result was a “catastrophic” collapse in performance, affecting even the best LLMs available today. Worse yet, the researchers found that even when shown several examples of similarly modified problems before answering, LLMs still failed to improve.
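To make the setup concrete, here is a minimal Python sketch of that kind of perturbation. It is an invented illustration, not the researchers’ benchmark or code; the problem template, the name list, the `make_variant` function, and the distractor sentence are all made up for the example.

```python
import random

# Hypothetical illustration only; not the researchers' benchmark code.
# Two perturbations of the same toy word problem: (1) swap names and numbers,
# (2) optionally append an irrelevant detail. The arithmetic, and therefore
# the correct answer, never changes.

NAMES = ["Emily", "John", "Lee", "Priya"]
DISTRACTOR = "Five of the apples were a bit smaller than average."

def make_variant(add_distractor: bool = False) -> tuple[str, int]:
    name = random.choice(NAMES)
    picked = random.randint(6, 20)           # varied numerical values
    eaten = random.randint(1, picked - 1)
    problem = (
        f"{name} picked {picked} apples and ate {eaten} of them. "
        f"How many apples does {name} have left?"
    )
    if add_distractor:
        problem += " " + DISTRACTOR           # irrelevant to the answer
    return problem, picked - eaten            # prompt plus ground-truth answer

if __name__ == "__main__":
    for flag in (False, True):
        text, answer = make_variant(add_distractor=flag)
        print(text, "->", answer)
```

A system that genuinely reasons should treat every variant the same way; the study’s point is that today’s models often don’t.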
It’s this finding that should significantly dampen enthusiasm about using AI to help ordinary people with legal issues. A core part of legal training is issue spotting: teaching lawyers to sift through a client’s messy story, pick out the details that matter legally, and give the problem structure.
A client may not know which parts of their story are irrelevant and which matter, especially when they’re facing an urgent legal issue. If LLMs are thrown wildly off course by extraneous or minor changes in wording, it’s hard to see how they can be the basis for reliable legal help.
For lawyers and law firms evaluating LLM tools for their own use, this study is a warning that demos you’ve received, and use cases you’ve discovered, may not be reliable indicators of a product’s overall performance. While we might reasonably infer a human’s skill level from their performance on a test, this research suggests that we can’t and shouldn’t make the same inference for a large language model.
This isn’t an argument to simply ignore large language models and the products built on them. LLM-powered products are likely to find traction in legal workflows, from basic proofreading to complex searches of large document sets. Other products may sidestep the problem because they don’t rely on unstructured, chaotic user input. And lawyers will still need to help clients navigate emerging problems created by LLMs and other generative AI tools.
But lawyers must choose sobriety over hype. Before and during adoption, it’s critical to subject LLM products to repeated, dynamic testing, with many variations of individual problems. As the Apple team’s research demonstrated, a static benchmark of tests won’t be enough to evaluate a product’s true, real-world performance. If you aren’t trying to expose a product’s flaws, your colleagues and clients will end up discovering them by accident.
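For firms that want to put that advice into practice, the testing loop itself can be simple. The sketch below builds on the toy generator above; `query_model` is a hypothetical placeholder for whatever product is under evaluation, `consistency_score` is an invented helper, and the scoring check is deliberately crude. The point is to measure consistency across many perturbed versions of the same task rather than a single run over a fixed demo set.

```python
# A minimal sketch of dynamic testing, reusing make_variant from the sketch
# above. query_model is a hypothetical stand-in for whatever product is being
# evaluated; swap in the real tool's interface before running.

def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with the tool under evaluation")

def consistency_score(n_trials: int = 50, add_distractor: bool = True) -> float:
    """Fraction of perturbed variants the tool answers correctly."""
    correct = 0
    for _ in range(n_trials):
        problem, expected = make_variant(add_distractor=add_distractor)
        reply = query_model(problem)
        if str(expected) in reply:    # crude check: expected number appears
            correct += 1
    return correct / n_trials

# Comparing consistency_score(add_distractor=False) against
# consistency_score(add_distractor=True) probes for the kind of collapse the
# Apple team observed; a large gap is a red flag, whatever the demo showed.
```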
Even with robust testing, legal users still must build workflows that help catch hidden errors. This is harder than it sounds: large language model errors look different from human errors, and even savvy users may not be well trained to catch and correct them. Even as someone who teaches law students about large language models, I’m continually amazed at the diverse and strange errors that emerge from sophisticated LLMs, even during seemingly simple work.
Of course, technology could improve. A new model might one day be capable of formal reasoning, rather than brittle pattern matching. But it isn’t yet. And buying a product today in hopes that it might work tomorrow is a sure path to disappointment.
It’s easy to get swept up in AI fever. See enough demos, hear enough pitches, and you might start to believe in the mythical robot lawyer: the brilliant, computerized associate who never sleeps, is always available, and is paid in massive electricity bills.
But lawyers have a duty to face the messy reality of what large language models can and can’t accomplish today. To do otherwise is just, well, unreasonable.
This article does not necessarily reflect the opinion of Bloomberg Industry Group, Inc., the publisher of Bloomberg Law and Bloomberg Tax, or its owners.
Author Information
Keith Porcaro is a senior lecturing fellow and director of the digital governance design studio at Duke University School of Law.