INSIGHT: Evaluating AI Accuracy for Contract Comparisons

Many corporate legal departments and law firms are implementing contract abstraction tools powered by artificial intelligence (AI). Contract abstraction is about giving structure to contracts, which are inherently unstructured data, ultimately looking to answer the question “What do our contracts say?”

The use cases are broad and the potential benefits significant. Some examples:

understanding the obligations being inherited from a target company as part of an M&A transaction,
identifying contracts that need to be amended in response to a regulatory event (e.g., GDPR, Brexit),
determining when revenue contracts are up for renewal or renegotiation,
pinpointing contracts with non-standard terms to better characterize risk exposure, and
analyzing notice and consent issues for internal reorganizations.

If you are considering a contract abstraction system such as eBrevia, Kira Systems, ContractSifter or iManage RAVN, you can find an evaluation framework in this Case Study on Evaluating AI Abstraction Tools.

The value of the AI in these systems is its ability to automate the process of finding relevant text in agreements. But how accurately can the tools do this? Only after answering this question can you decide when the AI results can be relied upon and where a human quality review is needed.

Evaluating Accuracy

At the risk of oversimplification, accuracy is defined as the AI’s ability to find the correct text in a document. If, for example, we ask an AI model to find the “Change of Control” provision in an agreement, here are the possible results.

This scenario illustrates two types of wrong results: (1) not finding the right text (false negative), and (2) finding wrong text (false positive).

Traditionally, each result is evaluated individually to determine whether it is right or wrong. For example, if the AI abstracts the correct Change of Control text, but also finds additional wrong text, it would be scored with one right result (true positive) and one wrong result (false positive).

Similarly, if the agreement has two areas of relevant text and only one is found, it would be scored with one right result (true positive) and one wrong result (false negative). But is this helpful? If you can’t determine whether a Change of Control is triggered without knowing what both provisions say, should we not instead give the overall agreement one wrong result?
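The traditional per-result scoring described above can be sketched in a few lines of Python. This is only an illustration: the function name and the passage identifiers (such as "s8.2") are hypothetical, and it assumes each piece of text can be matched exactly against what a human reviewer marked as relevant.

```python
def score_per_result(found, correct):
    """Count per-result outcomes for one agreement.

    found   -- passages the AI returned (hypothetical section IDs)
    correct -- passages a human reviewer marked as relevant
    """
    found, correct = set(found), set(correct)
    return {
        "true_positive": len(found & correct),    # right text found
        "false_positive": len(found - correct),   # wrong text found
        "false_negative": len(correct - found),   # right text missed
    }


# The correct Change of Control text is found, plus one wrong passage.
extra_case = score_per_result(found={"s8.2", "s12.1"}, correct={"s8.2"})

# Two relevant passages exist, but only one is found.
partial_case = score_per_result(found={"s8.2"}, correct={"s8.2", "s14.3"})

print(extra_case)
print(partial_case)
```

Both agreements come back with one right result and one wrong result, even though the second is the one where the reviewer cannot tell whether a Change of Control is triggered.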

An Alternative Approach

An alternative methodology for testing the accuracy of AI abstraction accounts for these nuances by grading accuracy at the agreement level, using five categories.

In our example, if there is only one instance of correct text for Change of Control and the AI finds that plus additional wrong text, it would be scored as “Extra.” If we ask an AI model to find the Change of Control provision in an agreement that has two instances of correct text and the AI finds one but not the other, it would be scored as “Some.”
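As a rough sketch of agreement-level grading, the Python function below assigns one grade per agreement. Only the three categories named in this article ("All," "Extra" and "Some") are modeled; the remaining categories are not named here, so the function returns None for anything else rather than guessing. The exact rules, including reading "All" as every relevant passage with nothing extra, are assumptions made for illustration, as are the function name and passage identifiers.

```python
def grade_agreement(found, correct):
    """Assign one agreement-level grade for a single provision type.

    found   -- passages the AI returned (hypothetical section IDs)
    correct -- passages a human reviewer marked as relevant
    """
    found, correct = set(found), set(correct)
    hits = found & correct      # relevant text the AI found
    wrong = found - correct     # irrelevant text the AI returned
    missed = correct - found    # relevant text the AI did not find

    if correct and not missed and not wrong:
        return "All"    # assumption: everything relevant, nothing extra
    if correct and not missed and wrong:
        return "Extra"  # all relevant text plus additional wrong text
    if hits and missed:
        return "Some"   # part of the relevant text, but not all of it
    return None         # categories not named in this article


# One instance of correct text, found along with wrong text.
print(grade_agreement(found={"s8.2", "s12.1"}, correct={"s8.2"}))

# Two instances of correct text, only one found.
print(grade_agreement(found={"s8.2"}, correct={"s8.2", "s14.3"}))
```

Unlike per-result counting, each agreement receives a single label, which is what a reviewer needs when deciding whether the abstraction for that agreement can be trusted.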

Why This Matters

Building a workflow in a live project requires understanding when you can rely on the AI results and when you need to implement a quality review.

If a given AI model is “50% accurate,” we may think we can rely on it 50% of the time. But which 50%? Using an agreement-level framework gets us closer to understanding where to direct the quality review.

For example, if the tool is consistently returning a score of “Some,” every document needs to be reviewed, because the AI is having trouble finding all the relevant provisions. Scores of “All” or “Extra” would require a far less demanding quality review.
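One way to see which grades dominate a test set is to tally them. The sketch below uses an invented list of grades (not real test results), and the 50% threshold for triggering a full review is an arbitrary assumption chosen for illustration.

```python
from collections import Counter

# Invented sample of agreement-level grades, for illustration only.
grades = ["Some", "All", "Some", "Extra", "Some", "Some"]

tally = Counter(grades)


def needs_full_review(grades, threshold=0.5):
    """Return True when the share of "Some" grades reaches the threshold.

    The threshold is a hypothetical setting, not a recommended value.
    """
    if not grades:
        return False
    return Counter(grades)["Some"] / len(grades) >= threshold


for grade, count in tally.most_common():
    print(f"{grade}: {count} of {len(grades)}")

print("Review every document:", needs_full_review(grades))
```

A breakdown like this says more than a single accuracy percentage, because it shows whether the errors are mostly missed text or mostly extra text.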

This framework is also more flexible. Results can be weighted to properly align the quality review process with potential risks. For example, since your M&A due diligence exercise cannot miss a Change of Control provision, you can apply a higher risk rating to results that miss relevant text (e.g., “Highly Critical” or a risk score of 5 out of 5) and a lower risk score to results that find all relevant text plus extra text (e.g., “Not Critical” or a risk score of 1 out of 5).
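The weighting idea can be sketched as a simple lookup from grade to risk score. The scores for "Some" (5) and "Extra" (1) follow the example above; the score for "All," the choice to treat any other grade as highest risk, the cutoff of 3 and the agreement names are all assumptions made for illustration.

```python
RISK_SCORES = {
    "Some": 5,   # misses relevant text: "Highly Critical" in the example above
    "Extra": 1,  # all relevant text plus extra: "Not Critical"
    "All": 1,    # assumption: no score is given for this case in the article
}

# Assumption: a grade without a known score is treated as highest risk,
# so it is never silently skipped.
DEFAULT_RISK = 5


def review_queue(graded, min_risk=3):
    """Return agreements at or above min_risk, highest risk first.

    graded   -- mapping of agreement name to its agreement-level grade
    min_risk -- hypothetical cutoff for sending an agreement to review
    """
    scored = [(RISK_SCORES.get(grade, DEFAULT_RISK), name)
              for name, grade in graded.items()]
    flagged = [(risk, name) for risk, name in scored if risk >= min_risk]
    flagged.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in flagged]


# Hypothetical agreements and grades.
graded = {
    "supply_agreement": "Some",
    "license_agreement": "Extra",
    "lease": "All",
    "services_agreement": "Some",
}

queue = review_queue(graded)
print(queue)
```

For a review where extra text is costly but missed text is tolerable, the same structure works with different numbers in the lookup table.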

For an example of how results from testing using this alternative framework compare to results from testing using a traditional approach, see this Case Study on Evaluating AI Abstraction Tools.

Evaluating AI at the contract level, with more nuanced scoring than “right” or “wrong,” leads to more usable results, supports a framework that helps align the review workflow, and allows you to adjust your evaluation to the specifics of your review.

This column does not necessarily reflect the opinion of The Bureau of National Affairs, Inc. or its owners.

Author Information

Chase D’Agostino is AVP of Corporate Solutions at QuisLex, a leading legal services provider. He works with legal departments and law firms to design, implement and oversee the delivery of legal services around contracts, M&A, legal operations and compliance. He draws from his experiences as part of the leadership team at QuisLex, as an in-house lawyer at Colgate-Palmolive and as an associate at Simpson Thacher & Bartlett.

Special thanks to Aditya Mohanty, head of Quality and Business Excellence at QuisLex.
