Bloomberg Law

Text Analytics, Big Data and Law

Sept. 9, 2015, 3:16 PM

Editor’s Note: The author of this post leads product teams to develop legal technology.

By Mark Kerzner, Chief Product Architect, LexInnova

Big Data and text analytics are all the rage in the field of law today. However, there is a dearth of precise definitions of these technologies and little understanding of what they do. This is the void that the present article will try to fill.

Definitions such as “What is TAR (Technology Assisted Review)? It is something that helps a lawyer deal with documents” abound. However, this definition is so generic that it misses the very essence of TAR. Eyeglasses also help a lawyer deal with documents, yet they are not TAR, or at least not in the modern sense of the word.

Also popular is the following train of thought: “Predictive analytics is used everywhere. We buy products based on it (Amazon recommendations), make friends (Facebook recommendations), and drive based on it (Google Maps telling you that ‘you are on the best road’). And yet, with the abundance of TAR and predictive analytics offerings from vendors, who is to blame for TAR not being used more often? It must be the old-fashioned partners!”

This logic has an internal flaw, though. These same “old-fashioned partners” ARE using Amazon, Facebook and Google recommendations. And if these same people are not jumping at predictive analytics offerings from eDiscovery vendors, one may have a nagging suspicion that something is lacking in the offerings of the “old-fashioned vendors.” I believe I can point out what that is. But first let’s look at the available TAR technologies clearly, and with examples.

Text analytics - number-based

Firstly, there is number-based text analytics. Say I’m looking for documents about patents. Looking at each document in my set, I can count how many times the word “patent” occurs within it, if at all. The more the merrier: Documents that contain the word “patent” more often are probably more related to patents than other documents. So, by counting the number of times that “patent” is mentioned, I can group similar documents.

But there are also other words in these documents! No problem, we will compute the ratio of our target word “patent” to other words. But there are also other documents! For this, we will look at how many times the word “patent” occurs in other documents. If it occurs too often there, then our document is nothing special, and should not be singled out as one talking about patents.
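The two ratios just described are commonly known as term frequency and inverse document frequency (TF-IDF). Below is a minimal Python sketch of that idea. The three sample documents, the function names, and the particular logarithmic weighting are illustrative assumptions, not the formula of any specific eDiscovery product.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(word, tokens):
    # Ratio of the target word to all words in one document.
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0

def inverse_document_frequency(word, corpus):
    # The more documents contain the word, the lower its weight.
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(word, tokens, corpus):
    return term_frequency(word, tokens) * inverse_document_frequency(word, corpus)

# Hypothetical sample documents, for illustration only.
documents = [
    "The patent office granted the patent for the new engine.",
    "The court dismissed the appeal.",
    "The lease was renewed for another year.",
]
corpus = [tokenize(d) for d in documents]

patent_score = tf_idf("patent", corpus[0], corpus)  # frequent here, rare elsewhere
the_score = tf_idf("the", corpus[0], corpus)        # occurs in every document

print(f"patent: {patent_score:.3f}, the: {the_score:.3f}")
```

In this toy set, “the” scores zero because it occurs in every document, while “patent” stands out in the first one.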

Thus, I can have a graph that will group together documents whose words have similar frequencies, adjusted for how often those words occur in other documents. Here is an example of such a technique applied to find important words in newspaper articles.

[Image: kerzner word capture]

This is number-based text analytics. Is it helpful? Probably yes. Does it really understand what the documents are talking about? Everybody will agree, far from it.

Based on the very simple number crunching described above, the computer can do the “show more documents like this” trick. Will it be extremely precise, or will it contain enough noise to become “not so useful,” to put it mildly? Very likely the latter.
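One common way to implement a “show more documents like this” feature on top of word counts is cosine similarity between count vectors. The sketch below assumes that approach; the sample texts and function names are made up for illustration, and real products may rank differently.

```python
import math
from collections import Counter

def word_counts(text):
    # Represent a document as a bag of lowercase word counts.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # 1.0 means identical word proportions, 0.0 means no shared words.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def more_like_this(seed, candidates):
    # Rank candidate documents by similarity to the seed document.
    seed_counts = word_counts(seed)
    return sorted(
        candidates,
        key=lambda c: cosine_similarity(seed_counts, word_counts(c)),
        reverse=True,
    )

# Hypothetical documents, for illustration only.
seed = "patent infringement claim"
candidates = [
    "lunch menu for friday",
    "claim of patent infringement filed",
    "patent pending",
]
ranked = more_like_this(seed, candidates)
print(ranked)
```

The ranking depends only on shared words, which is exactly why the results can be noisy: a document can share vocabulary with the seed without sharing its meaning.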

Beside this palpable need for improvement, the previous few paragraphs might have scared away anyone who is not into “data science” or “number crunching.” As the lawyers like to joke, “I went into law so that I would NOT have to do math.” Let’s respect this feeling.

Text analytics - grammar-based

There is also a mature branch of text analytics dealing with the analysis of the language itself, or text engineering. One example of this approach is GATE, a project developed at the University of Sheffield since the mid-1990s, which has become one of the standard tools for text analytics. GATE stands for General Architecture for Text Engineering.

GATE can reasonably break down the text into paragraphs, sentences, nouns and verbs. It can also find people, companies, organizations and places mentioned in the documents. How does it do this? For example, the first rule for sentence detection is “It must end with a period (dot), and there must be a word starting with an uppercase letter after it.” As you can imagine, this rule will indeed catch perhaps 80 percent of the sentences found in documents overall. Then there will be some exceptions, and some additional rules to catch them. The detection will never be 100 percent perfect (even human reviewers may disagree on how to break up sentences), but it can get pretty accurate.
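The first sentence-detection rule can be sketched as a single regular expression. This is a plain Python illustration of the rule as stated, not GATE’s actual implementation; the sample sentences are invented.

```python
import re

# A sentence ends at a period that is followed by whitespace
# and then a word starting with an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")

def split_sentences(text):
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

simple = split_sentences("The court ruled. The appeal failed.")
print(simple)

# An exception the first rule gets wrong: the abbreviation "Mr."
# is treated as the end of a sentence.
tricky = split_sentences("Mr. Smith filed a motion. It was denied.")
print(tricky)
```

The second example is split into three pieces instead of two, which is the kind of exception that the additional rules exist to catch.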

For people, companies, places, etc., GATE uses the concept of gazetteers. A gazetteer is simply a list of all possible values: for example, all counties in the U.S., if we are trying to detect the county court in a legal proceeding.
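In its simplest form, a gazetteer lookup is a search for every entry of a list in the text. The sketch below assumes that simplest form; the three-entry county list and the function name are hypothetical stand-ins for a complete list, and GATE’s own gazetteers are configured differently.

```python
import re

# Tiny sample list; a real gazetteer would contain every U.S. county.
COUNTY_GAZETTEER = ["Harris County", "Cook County", "Los Angeles County"]

def find_gazetteer_entries(text, gazetteer):
    # Collect every entry found in the text, in order of appearance.
    hits = []
    for entry in gazetteer:
        for match in re.finditer(re.escape(entry), text):
            hits.append((match.start(), entry))
    return [entry for _, entry in sorted(hits)]

text = "The case was filed in Harris County and later moved to Cook County."
counties = find_gazetteer_entries(text, COUNTY_GAZETTEER)
print(counties)
```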

The screenshot below shows GATE configured to analyze US Court of Appeals documents. It can detect judges, courts, counsel, etc. The tag on the right is convenient: once you click on a checkbox which says “Date,” you will see all dates in all possible formats that GATE has detected in the given document.

[Image: Kerzner screenshot]

Well, how do you use this display? After all, you could read the document yourself! The answer is that technologies like GATE help you extract meaningful entities and add them to the existing metadata fields. The standard metadata fields number a few dozen and include fields like “author,” “date created,” “date received,” “recipient,” etc. Using entity extraction, one can add many more fields dealing with the document contents, actors, places, etc.
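As a rough illustration of adding extracted entities to the metadata, the sketch below pulls dates in two formats out of a document body and stores them in a new field. The field name `dates_mentioned`, the sample metadata, and the two date patterns are assumptions made for this example; a tool like GATE recognizes many more formats.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Matches dates such as "4/15/2014" or "March 3, 2014".
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{4}|(?:" + MONTHS + r") \d{1,2}, \d{4})\b"
)

def enrich_metadata(metadata, text):
    # Return a copy of the metadata with an extra, content-derived field.
    enriched = dict(metadata)
    enriched["dates_mentioned"] = DATE_PATTERN.findall(text)
    return enriched

# Hypothetical document, for illustration only.
metadata = {"author": "J. Doe", "date created": "2014-02-01"}
body = "The hearing on March 3, 2014 was continued to 4/15/2014."
enriched = enrich_metadata(metadata, body)
print(enriched)
```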

That, in turn, will help to form an accurate picture of the case. For example, the screenshot below shows each document as a larger white circle, while people and places extracted from it are shown as smaller filled circles. If you can further act on this chart by drilling in or zooming out, it can be used as a very effective investigation tool.

[Image: Kerzner3]

(Image courtesy of DARPA Memex.)

Big Data

The sheer number of documents may present a problem by itself. Let’s do a simple back-of-the-napkin calculation. A million documents may take a million seconds to process. This assumes that a document takes a second — not too unlikely if you think of opening this document with Microsoft Word. If you add optical character recognition (OCR) required for scanned documents, one second may even seem low.

A million seconds is about 16,700 minutes, or about 278 hours: more than 11 days on a single computer. Now, that’s too much. However, if I had a hundred computers to do the work, it would take only about 167 minutes, or under three hours. That is reasonable. And this is where Big Data comes in.
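The estimate can be worked through in a few lines of Python. The one-second cost per document and the hundred machines are the article’s figures; perfect scaling across machines, with no coordination overhead, is an added simplifying assumption.

```python
SECONDS_PER_DOCUMENT = 1   # assumed processing cost per document
DOCUMENTS = 1_000_000
MACHINES = 100             # assumes the work divides evenly, with no overhead

total_seconds = DOCUMENTS * SECONDS_PER_DOCUMENT
single_machine_minutes = total_seconds / 60
single_machine_hours = total_seconds / 3600
cluster_minutes = total_seconds / MACHINES / 60

print(f"One machine: {single_machine_minutes:,.0f} minutes "
      f"({single_machine_hours:,.0f} hours)")
print(f"{MACHINES} machines: {cluster_minutes:,.0f} minutes")
```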

Big Data deals with too much data, more data than can be stored and processed on one computer. It is capable of connecting billions of people around the globe and of navigating billions of cars through millions of places. But for our purposes, we can simply think of it as a glue that combines our hundred computers into a cluster and lets them work together on solving one problem. This glue is called Hadoop. A more modern version, which uses in-memory computing and is much faster, is called Spark. Such feats as spreading a million documents across a hundred machines, and finishing in a small fraction of the single-machine time, are commonplace there.
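The “glue” idea can be imitated on one computer: split the documents into partitions, process each partition independently, then merge the partial results. The sketch below uses Python threads as stand-ins for machines. It is not the Hadoop or Spark API, and the function names and sample documents are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(documents):
    # "Map" step: each worker counts words in its own share of documents.
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts

def partition(items, n):
    # Deal the documents out to n workers, round-robin.
    return [items[i::n] for i in range(n)]

def cluster_word_count(documents, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partition(documents, workers)))
    # "Reduce" step: merge the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical documents, for illustration only.
documents = ["patent claim", "patent appeal", "appeal denied", "patent"]
totals = cluster_word_count(documents, workers=2)
print(totals.most_common(2))
```

The result is the same as counting on a single worker; only the elapsed time changes when the partitions really do run on separate machines.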

So what’s missing?

We started by stating that the lack of acceptance of predictive analytics by law firm partners perhaps shouldn’t be blamed on them. We showed you how text analytics works, whether based on numbers or on language parsing. We also showed what role Big Data plays there.

Of course, this is a simplification. All the techniques, examples of which are given above, analyze not one word but thousands of words in a collection of documents. They compute complicated statistics, use machine learning algorithms to find the best matches, and use multiple rules for text analysis.

Two things are missing, however. First, the litigation industry is not known for innovation. The common predictive analytics techniques use only the simplest common algorithms. Therefore, they fall short in their usability. This can be fixed. Google, one of the leaders in text analysis, came up with an algorithm that takes into account not only individual words, but also their position in the sentence. This algorithm is much more precise (think of Google Translate), and it is called “paragraph vectors.”

The paragraph vectors algorithm has been published and even implemented in the open source world. What is required, therefore, is more depth of analysis of this kind. This would allow us to bridge the gap between the simple first results and the demands of legal analysis. It may bring our analytics to the level of usefulness required by law firm partners and by everybody else involved.
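To see why word position matters, consider the toy comparison below. It is not the paragraph vectors algorithm. It only shows what a pure word-counting approach throws away, using adjacent word pairs as a crude stand-in for position. The two sample sentences are invented.

```python
from collections import Counter

def bag_of_words(text):
    # Word counts only; the order of the words is discarded.
    return Counter(text.lower().split())

def ordered_pairs(text):
    # Counts of adjacent word pairs, which keep some word-order information.
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

a = "the licensee sued the patent holder"
b = "the patent holder sued the licensee"

same_bag = bag_of_words(a) == bag_of_words(b)
same_pairs = ordered_pairs(a) == ordered_pairs(b)
print(f"Same word counts: {same_bag}, same word pairs: {same_pairs}")
```

The two sentences describe opposite lawsuits, yet their word counts are identical. Only a representation that keeps word order can tell them apart.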

Secondly, the exposition above shows that the tools will require maintenance and per-case configuration. Here, I believe that the research work done by the Memex team at DARPA (of which I am a member) is pointing in the right direction. For example, their memex-gate is an extension of the standard GATE tool for text engineering, but configured for court documents. This does require constant updates from the team (actually, crawlers that do the updates are included in the open source code that the Memex team publishes). But, as is well known to all Russians, “One cannot even get a fish out of a pond without some labor involved.”

With these improvements, perhaps we may see the reverse situation: Instead of judges ordering the use of predictive analytics and law firms sometimes balking at that, lawyers will flock to these tools of their own volition, because these tools will provide them with a tangible advantage. Compare this to the Bring Your Own Device (BYOD) to work movement.