- Legal research tools rely on retrieval-augmented generation
- Recent study revealed factual mistakes by RAG-based tools
Generative AI holds out the promise of making legal research faster and more thorough, if a technique that aims to stop the technology from making things up can be perfected.
A method called retrieval-augmented generation, or RAG, is the leading contender to prevent AI hallucinations. Developers of AI tools across a wide range of applications are using RAG, and the method has been widely embraced by the legal technology industry over the last year. But a recent study and real-world use cases are raising questions about how thoroughly RAG eliminates errors in legal research AI systems, according to several people in the field.
“Ultimately, these legal research-specific RAG tools can be very helpful,” said Adam Rouse, senior counsel and director of e-discovery at Walgreens. “I think they can increase efficiency quite a bit. But I still think we’re still at the ‘check your work’ phase. We are not at the ‘rely on this as truth.’”
Large language models like ChatGPT are prone to making mistakes because they answer a query by guessing the most likely next word, based on a vast quantity of training data. RAG goes a step further by first directing the AI model to a specific dataset, such as a set of documents, so it can find relevant material and craft an answer grounded in it.
Imagine a bright student who walks into a history class mid-semester. Ask her to write about the French Revolution before she’s studied it, and the essay may be well-written and correctly formatted, but some facts will likely be wrong or wholly invented. That’s a large language model. But give that same student a textbook to reference in an open-book test, and she can craft a factually accurate essay. That’s RAG.
Building RAG into a legal research tool seems to help avoid one of the more embarrassing hallucinations that generative AI is prone to in legal tasks: fabricating cases that don’t exist. But RAG-based systems still make errors, like misunderstanding what cases are about and confusing precedent, according to a recent Stanford University study. They can point lawyers to real cases, but they won’t necessarily be the right cases to cite.
Expecting Accuracy
When many law firms evaluate legal AI tools, they’re looking at accuracy and speed, said Jesse Klee, who works on knowledge management and innovation at Cleary Gottlieb. Even though legal research tools are much faster than human lawyers at finding cases, the time it takes to check their work can make time savings a wash, he said.
UK-based firm Ashurst ran a trial for several months of three generative AI tools, measuring factors like accuracy and time savings. The tools aren’t perfect, said the firm’s chief digital officer, Tara Waters. But she expects second- and third-generation versions to be much better.
At Walgreens, Rouse’s legal department is still only evaluating generative AI tools, not yet fully adopting them—and concerns about accuracy are part of the reason they’re still in the testing phase, he said.
The nature of generative AI, which relies on creativity to produce new answers to users’ questions, means it may be impossible to entirely eliminate wrong answers.
The term “RAG” was coined several years ago, but the method has risen to prominence in legal technology tools over the last year.
Ed Sohn, global head of insights and innovation at legal services company Factor, said RAG-based tools should be “third or fourth in line” in a law firm’s generative AI toolbox, behind less fact-based uses like brainstorming ideas to add value to legal work.
“Where we should be going first are areas where 100% truth and accuracy are not actually the table stakes,” Sohn said. That accuracy will improve with time, he added.
“This thing is going to get leaps and bounds better,” Sohn said. “We’re in the very early innings of this. This is the worst it’s ever going to be.”
How RAG Works
In theory, RAG combines the intelligence of generative AI with the reliability of search technology. It’s seemingly a good fit for the needs of the legal tech market, which seeks tools that can write lawyer-quality answers in seconds, instead of hours, but without the dangerous hallucinations.
There are two important steps to RAG: retrieval, or finding documents that are relevant to the question, and generation, producing an answer from those documents.
The retrieval is basically a search function. It could use simple keyword search, or more sophisticated searching that finds words with similar meanings, like “employee” and “worker,” said Damien Riehl, a vice president at the legal research company VLex. If the data the RAG-based tool is searching is organized and tagged well, the results will be better.
To perform well, the tool should fetch material that’s relevant to the question at hand, not miss anything, and then generate an answer that’s accurate to what’s in the source documents, Riehl said.
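To make those two steps concrete, here is a minimal sketch in Python. Everything in it is a simplification for illustration: the tiny corpus is invented, the scoring uses plain word overlap where production tools use semantic embeddings (so that “employee” matches “worker”), and the generation step only shows how retrieved text is assembled into the prompt a language model would receive.

```python
from collections import Counter
import math

# Toy corpus standing in for a legal research database (hypothetical entries).
DOCUMENTS = {
    "smith_v_jones": "employee overtime pay dispute under state labor law",
    "acme_v_doe": "worker classification as independent contractor",
    "roe_v_poe": "contract formation requires offer acceptance consideration",
}

def score(query: str, doc: str) -> float:
    """Step 1 (retrieval): rank documents by word overlap with the query.

    Real systems use semantic embeddings so near-synonyms such as
    'employee' and 'worker' match even when the literal words differ.
    """
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum((q & d).values())
    return overlap / math.sqrt(sum(q.values()) * sum(d.values()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the names of the top-k documents for the query."""
    ranked = sorted(DOCUMENTS, key=lambda n: score(query, DOCUMENTS[n]), reverse=True)
    return ranked[:k]

def generate_answer(query: str, sources: list[str]) -> str:
    """Step 2 (generation): a real RAG tool places the retrieved text in the
    language model's prompt so the answer is grounded in it. Here we only
    show that prompt assembly."""
    context = "\n".join(f"[{name}] {DOCUMENTS[name]}" for name in sources)
    return f"Answer '{query}' using only these sources:\n{context}"

print(generate_answer("overtime pay for workers", retrieve("overtime pay for workers")))
```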
Why Does RAG Still Make Mistakes?
A team at Stanford University tested RAG-based legal research tools from LexisNexis and Thomson Reuters, including Westlaw, on more than 200 questions, and found two main types of mistakes. The answer could be incorrect, containing a factual error about the law, or “mis-grounded,” citing a real source that didn’t support the argument in the model’s answer, according to a pre-print paper published in May.
The tools’ system for retrieving information relies in part on looking for similar terms, but the model may struggle to pick out the relevant similar terms, said Daniel Ho, the Stanford Law School professor who led the research.
The models also made errors when they couldn’t determine which authority was binding on a legal question, the study said.
Lexis and Thomson Reuters, which owns Westlaw, challenged the results of the Stanford study. Both said their tools performed better in the companies’ own tests than the Stanford researchers found. Bloomberg Law, which sells legal research tools, including some that make use of generative AI and RAG, was not included in the study.
While some legal questions will have a purely factual answer, “a lot of legal research also involves questions that have much more judgment involved,” like finding the most applicable precedent in a novel issue, Ho said. That’s a common task for attorneys, but the study shows RAG systems can struggle with it, he said.
“I think to some degree, people maybe have expected perfection out of these systems,” said Joel Hron, head of AI at Thomson Reuters. “And while that may be our goal, I don’t think that that’s anyone’s expectation, that we’ll get there in near term, or maybe even ever.”
Lexis is focusing on improving its generative AI research tool using techniques including RAG, said Jeff Pfeifer, chief product officer at LexisNexis for North America and the UK, in an emailed statement.
“At this stage of large language model evolution, RAG and fine-tuning are the best methods available to ensure answer quality,” he wrote. Pre-training models specifically for the legal market wouldn’t be a good solution because the models wouldn’t do a good job keeping up to date with legal opinions, laws, and regulations, he added.
The quality of the responses in a RAG system can also depend on the quality of the data the system is querying—so it matters how you retrieve, store, and link the data, said Ashish Agrawal, a founder at a legal tech startup who formerly worked on machine learning for Amazon’s Alexa.
There are also quirks to the way the models operate that have to be overcome. For example, some generative AI models are more accurate about the beginning and end of a document than the middle, said Chris Dzoba, chief technology officer at the legal tech startup Briefly.
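One common mitigation for that middle-of-the-document weakness, sketched below purely as an illustration (not Briefly’s method), is to reorder retrieved passages so the highest-ranked ones sit at the beginning and end of the prompt, where models tend to attend best.

```python
def reorder_for_middle_blindness(chunks_ranked_best_first: list[str]) -> list[str]:
    """Interleave ranked chunks so the strongest evidence lands at the start
    and end of the prompt. A hypothetical mitigation sketch, not any
    vendor's documented method."""
    front, back = [], []
    for i, chunk in enumerate(chunks_ranked_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example: the two best chunks, "A" and "B", end up at the edges.
print(reorder_for_middle_blindness(["A", "B", "C", "D", "E"]))
# -> ['A', 'C', 'E', 'D', 'B']
```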
Building a Better Mousetrap
Researchers and tech companies are trying to figure out how to make RAG better.
One approach might be to set up two or more models to check each other’s work: the first retrieves and generates an answer, and a second (or third) verifies it. That method could reduce hallucinations, but it isn’t widely used yet, though Riehl said VLex is experimenting with the technique.
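A rough sketch of that generate-then-verify pattern follows, with stub functions standing in for the two model calls; the names and checks here are hypothetical illustrations, not VLex’s implementation.

```python
def draft_model(question: str, sources: dict[str, str]) -> dict:
    """First model: produce an answer plus the citations it relied on.
    Stubbed here; a real system would call an LLM with the sources in-prompt."""
    return {"answer": "...", "citations": list(sources)}

def verify_model(claim: str, source_text: str) -> bool:
    """Second model: answer a narrower question -- does this source actually
    support the claim? Stubbed; a real verifier is another LLM call."""
    return claim != "" and source_text != ""

def answer_with_verification(question: str, sources: dict[str, str]) -> dict:
    draft = draft_model(question, sources)
    # Keep only citations the verifier confirms; flag the answer if any fail.
    supported = [c for c in draft["citations"]
                 if verify_model(draft["answer"], sources.get(c, ""))]
    draft["unsupported_citations"] = sorted(set(draft["citations"]) - set(supported))
    draft["needs_human_review"] = bool(draft["unsupported_citations"])
    return draft
```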
One of the Stanford paper’s important findings was that just citing a case is insufficient to curb mistakes, because there is also context to the citation—like whether part of the case was subsequently overturned, said Bryan Davis, co-founder and chief technology officer at the legal tech startup Centari. A relatively simple fix is encoding better rules about hierarchy into the systems, he said.
And while the tech will get better, expectations for what it should do may also have to be adjusted, Sohn said.
“I think a Gen AI legal research tool is very likely to meet slightly better-calibrated expectations in the coming years,” he said. “But we need to have more realistic expectations.”