Deep learning, a much-ballyhooed aspect of artificial intelligence (AI), is showing great promise for everything from image analysis to self-driving cars, but when it comes to legal applications such as e-discovery, specifically document review, legal professionals need to be wary of the hype.
Deep learning, now and in the foreseeable future, is not a viable approach to technology-assisted review (TAR)—and has deep shortcomings.
There are four key reasons why deep learning amounts to nothing more than hype in e-discovery, and why domain-specific approaches to supervised learning, particularly TAR 2.0 based on continuous active learning, will continue to be far more effective in solving complex legal document review tasks.
Enter Deep Learning
Deep learning is the multiple layering of learning algorithms, from the raw input to the final prediction of responsiveness. Feature extraction, the transformation of unstructured text into a set of “units of meaning” such as words, phrases, and concepts, is not an independent stage. The output of every layer serves as the input to the next. Ultimately, all the layers work together to predict a document’s responsiveness.
When a training document is added, every layer in the stack adjusts. Elements of a layer that yielded a bad prediction are demoted, while those responsible for a good prediction are boosted. This adjustment is carried down through every layer in the stack, so that training data changes not only the predictions but also the construction of the features themselves.
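To make this concrete, here is a minimal sketch of that idea, using toy scalar weights and an invented example document rather than any production TAR system: a three-layer stack in which a single training example propagates back through every layer, adjusting the learned features as well as the final prediction.

```python
import math

# Toy three-layer "deep" stack with scalar weights (illustration only):
# raw input -> layer 1 features -> layer 2 features -> responsiveness.
w1, w2, w3 = 0.5, 0.5, 0.5

def forward(x):
    h1 = math.tanh(w1 * x)            # layer 1: learned representation of raw input
    h2 = math.tanh(w2 * h1)           # layer 2: features built on features
    p = 1 / (1 + math.exp(-w3 * h2))  # layer 3: probability of responsiveness
    return h1, h2, p

x, y = 2.0, 1.0                       # one training document, labeled responsive
h1, h2, p = forward(x)
loss_before = -math.log(p)            # cross-entropy loss for y = 1

# Backpropagation: the error signal flows down through EVERY layer, so the
# training example reshapes the features themselves, not just the prediction.
lr = 0.5
d3 = p - y                            # gradient at the output
d2 = d3 * w3 * (1 - h2 ** 2)          # pushed down into layer 2
d1 = d2 * w2 * (1 - h1 ** 2)          # and further down into layer 1
w3 -= lr * d3 * h2
w2 -= lr * d2 * h1
w1 -= lr * d1 * x

_, _, p_after = forward(x)
loss_after = -math.log(p_after)       # the whole stack has moved toward the label
```

After one update, every one of the three weights has changed; that end-to-end coupling, scaled up to millions of weights across a dozen or more layers, is what distinguishes deep learning from a pipeline of independent stages.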
Arguing about the minimum number of layers needed in order for learning to be deep is a bit like arguing about the minimum number of sand grains needed to form a heap, but there is general consensus that fewer than three layers is not deep. Most of the deep learning breakthroughs described in the popular press use a dozen layers or more.
The Hype of Deep Learning in E-Discovery
Deep learning is not suitable for e-discovery for four reasons:
- The relative power of keywords
- The problem of training data
- Scientific research
- Two-layer neural networks are not deep learning
1. Keywords Are More Powerful Than You Think
First, deep learning might not be effective on e-discovery tasks because of the nature of the underlying raw data. A cursory survey of the domains in which deep learning has had success reveals a common pattern: the raw data often has no inherent semantics. The most notable include computer vision (object detection, scene recognition), audio analysis (speech recognition, content-based music recommendation), and self-driving cars.
In the image domain, the raw inputs are pixels. With audio, the raw input is tens of thousands of quantized pulses per second. The LiDAR sensor input data from self-driving cars are point clouds. None of these raw signals are semantically meaningful.
In these domains, feature extraction algorithms traditionally have been manually crafted, characterizing a researcher’s best guess at transforming pixels, pulses, or points into meaningful representations on which later supervised machine learning algorithms can accurately predict. If the representation is wrong, later learning becomes much more difficult. Deep learning solves this representation problem by learning which feature representations are best and how to give good predictions using those features simultaneously – all dozen or more layers work inextricably together.
Human language, on the other hand, despite occasional polysemous collisions such as Jaguar the car versus jaguar the animal, already has words to express most important concepts. Deep layers of “convolved” words upon words are not necessary to build a semantically rich representation for prediction.
2. The Problem of Training Data
Second, training a deep network requires a large amount of manually labeled data. Getting good predictions from traditional supervised learning algorithms requires correctly setting numerous parameters. Setting a parameter is like twisting a knob or setting a slider on a mixing board. To get the sound just right, you have to twist lots of knobs. The more knobs there are, the more difficult it is to get that twisting right. It is therefore helpful to have a myriad of examples of the kind of sound (i.e., training data) you will be recording. This allows you to verify that the knobs are set correctly by seeing how the settings perform in a variety of circumstances.
Deep learning’s multiple layers explode the number of “knobs” by many orders of magnitude. Good predictions from a deep learning algorithm require massive amounts of training data. State-of-the-art systems responsible for high-profile breakthroughs often have incorporated millions of training examples.
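A back-of-the-envelope count makes the explosion of “knobs” concrete. The sizes below are hypothetical but typical for text models, not figures from any particular product:

```python
# Hypothetical sizes for illustration only.
vocab = 50_000          # distinct terms in the collection's feature space

# Shallow model (e.g., logistic regression): one weight per term, plus a bias.
shallow_params = vocab + 1

# A modest deep stack: embed terms into 300 dimensions, then two
# 1,024-unit hidden layers, then a single responsiveness output.
layers = [vocab, 300, 1024, 1024, 1]
deep_params = sum(a * b + b for a, b in zip(layers, layers[1:]))

print(f"shallow: {shallow_params:,} parameters")  # 50,001
print(f"deep:    {deep_params:,} parameters")     # 16,359,149
print(f"ratio:   {deep_params / shallow_params:.0f}x")
```

Even this modest hypothetical stack has more than 300 times as many parameters to fit as the shallow model, which is why its appetite for labeled training examples is correspondingly larger.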
An e-discovery matter with 100,000 documents can never have millions of labeled training examples. Similarly, a matter with millions of documents will never have millions of labeled training examples, because no one is going to review them all.
This problem was acknowledged in a recent workshop report, Culpepper et al., Report from the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018), in which top Information Retrieval (IR) research scientists from industry and academia noted that “existing deep learning approaches often require a massive amount of training data to generalize suitably, which is hard to obtain in [the] IR area.”
3. Scientific Research
Scientists have studied the problem discussed above extensively. For example, academics at Georgetown University studied one form of deep learning known as “multilayer perceptrons.” Yang et al., “Effectiveness Results for Popular e-Discovery Algorithms” (ICAIL 2017). They found that this deep learning approach substantially underperformed more traditional supervised learning approaches, scoring on average 15.3 percent worse than logistic regression and 18.2 percent worse than SVM.
In addition to multilayer perceptrons, there are hundreds of other deep learning variants. The IR community has tested a large and growing number of these, as summarized in the aforementioned SWIRL report: “While these methods [deep learning] have been very successful in some domains – such as vision, speech, audio, and NLP – these successes have not been observed for information retrieval tasks.”
So while the Information Retrieval research community holds out hope that deep learning may one day outperform traditional shallow approaches, the research to date does not support the position that deep learning is better for text retrieval tasks.
4. Two-Layer Neural Networks Are Not Deep Learning
Finally, some of the hype surrounding deep learning in e-discovery comes down to marketing tactics. Many e-discovery technologies marketed as deep learning are actually not.
Recall that what differentiates deep learning is that feature extraction and prediction are inextricably intertwined. What is frequently pitched as deep learning, however, boils down to a two-stage system. First, two-layer (shallow) neural networks, typically algorithms like word2vec that “embed” words into a semantic space in which the word “truck” is closer to “lorry” than it is to “airplane,” are used for feature extraction. Second, those “embedding” features become static inputs to a single-layer (shallow) neural network for document coding prediction.
Notice the sleight of hand: if a two-layer shallow neural network performs feature extraction, followed by a one-layer shallow neural network for prediction, the combined system is not deep. Though there are three layers, they are not integrated such that a training example can propagate through all of them to alter the way in which features are constructed. The stages are independent.
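The independence of the two stages can be seen in a minimal sketch. The embedding table, word list, and training loop below are toy stand-ins invented for illustration, not any vendor’s code: the “embeddings” are frozen, so no amount of classifier training ever reaches back to change how features are built.

```python
import math

# Stage 1: pre-computed word vectors (a stand-in for word2vec output).
# These are FROZEN inputs; this stage never learns again.
embeddings = {
    "truck":    [0.9, 0.1],
    "lorry":    [0.8, 0.2],
    "airplane": [0.1, 0.9],
}

def featurize(doc):
    # Average the frozen word vectors into one document vector.
    vecs = [embeddings[w] for w in doc if w in embeddings]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

# Stage 2: a one-layer (shallow) classifier trained on top of the features.
w, b = [0.0, 0.0], 0.0
docs = [(["truck"], 1), (["lorry"], 1), (["airplane"], 0)]
snapshot = {k: list(v) for k, v in embeddings.items()}

for _ in range(200):
    for doc, y in docs:
        x = featurize(doc)
        p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
        g = p - y                                   # log-loss gradient at the logit
        w = [wi - 0.5 * g * xi for wi, xi in zip(w, x)]
        b -= 0.5 * g

# Training moved the classifier's weights, but the feature representation
# is untouched: no error signal ever flowed back into stage 1.
assert embeddings == snapshot
```

Contrast this with genuine deep learning, where a training example would also reshape the embedding layer itself; here the `assert` passes precisely because the pipeline is two disconnected shallow stages.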
The market seems to be capitalizing on a confusion between neural networks and deep learning; not all neural networks are deep. In sum, the combination of hype with attempts to pass off shallow techniques as deep does not bode well for deep learning as an approach to e-discovery.
Continuous Active Learning is a More Effective Alternative to Deep Learning
The core take-away is that the best solution to a problem is often situation-dependent. Researchers may one day solve the problem of AI with a master algorithm that handles any problem as well as or better than a human. That day is not today, and deep learning is not that algorithm. Until that day arrives, domain-specific approaches will continue to be far more effective.
One approach that has proven effective for e-discovery is process-based: continuous active learning (CAL). Some of the problems that word2vec tries to solve at the feature extraction stage can instead be handled trivially by CAL at the process stage. For example, instead of trying to determine statically during feature extraction whether the terms “truck” and “lorry” should be conflated into a single semantic concept, CAL’s continuous nature dynamically grows the set of terms that are positive indicators of responsiveness based on the matter itself. If the matter is about transportation vehicles, then both truck and lorry will be iteratively added as positive indicators. If the matter is about New Jersey (and not Sheffield) drivers systematically exceeding the speed limit, then truck will be added as a positive indicator and lorry as a negative one.
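The CAL loop described above can be sketched schematically. The six-document collection, term weights, and scoring rule below are toy inventions for illustration, not a real TAR engine: each review decision immediately re-ranks what gets reviewed next, so matter-specific terms like “lorry” acquire positive or negative weight from the matter itself.

```python
from collections import Counter

# Toy collection: each document is a set of terms plus its true label
# (1 = responsive, 0 = not). A label is revealed only upon "review."
docs = [
    ({"truck", "cargo"}, 1),
    ({"lorry", "cargo"}, 1),
    ({"lorry", "freight"}, 1),
    ({"airplane", "ticket"}, 0),
    ({"airplane", "pilot"}, 0),
    ({"jaguar", "zoo"}, 0),
]

weights = Counter()            # per-term indicator weights, learned as we go

def learn(idx):
    words, label = docs[idx]
    for t in words:
        weights[t] += 1 if label else -1

def score(words):
    return sum(weights[t] for t in words)

order = [0]                    # start from a single reviewed seed document
learn(0)

# CAL loop: review the highest-scoring unreviewed document, fold the
# reviewer's judgment straight back into the model, and repeat.
while len(order) < len(docs):
    remaining = [i for i in range(len(docs)) if i not in order]
    nxt = max(remaining, key=lambda i: score(docs[i][0]))
    order.append(nxt)
    learn(nxt)
```

Starting from a single “truck” seed, the shared term “cargo” pulls the first “lorry” document to the top of the queue, and from then on “lorry” itself carries positive weight; in the speeding-drivers scenario, the same loop would instead push its weight negative.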
CAL does not try to solve all problems up front, but instead learns and improves continuously throughout the review. CAL is exactly the right approach when large amounts of training data are not available a priori, and keywords themselves are rich in semantics. Rather than following the hype of deep learning, it is better to implement solutions developed specifically for the e-discovery problem.
Dr. Jeremy Pickens is chief scientist at Catalyst Repository Systems, which designs, builds and runs platforms for complex e-discovery, regulatory investigations, and compliance. Pickens, who has nine patents and patents pending in the field of search and information retrieval, is one of the world’s leading information retrieval scientists and a pioneer in the field of collaborative exploratory search.