Five Practical Tips to Avoid a Sensitive Data Disaster in E-Discovery

Jan. 18, 2018, 8:59 PM UTC

As volumes of stored data increase, documents and email collected for legal matters are highly likely to contain sensitive data. Protecting this information in the electronic discovery process is a significant challenge for organizations and their counsel. Stewards of data in an organization that faces litigation, government investigations, or other legal actions are likely well aware of the risks of inadvertent production of sensitive data and may have established processes for clearing such data from ESI.

Though tools are improving, practices do vary, and some approaches are more effective than others. The goal of this article is to share several practical workflow steps that can increase efficiency and accuracy of e-discovery processes impacting sensitive data.

What Is Sensitive Data?

A quick definition can help frame the problem. We define “sensitive data” fairly broadly. It includes personally identifiable information, which is “information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual.” (Definitions of PII vary somewhat depending on the source; this one comes from 2 C.F.R. § 200.79.) PII includes, for example, Social Security numbers, passport information, driver’s license and state ID information, birthdate, address, and phone number. “Sensitive data” also includes medical insurance information, health care-related information, and personal health information (PHI). And it encompasses personal financial information, such as bank account numbers and credit and debit card numbers. Put simply, sensitive information, if disclosed, could result in harm to the individual whose privacy has been breached.

Five Key Tips for Finding and Protecting Sensitive Data

In handling e-discovery, it is highly likely that sensitive data resides within the data set. Litigation teams have a unique obligation to identify and remove, redact, or anonymize sensitive data to ensure that the opposing side does not receive any individual’s personal information. In cases where inadvertent disclosure of personal data occurs, the legal team may face serious court sanctions. Following are five best practices we have refined during years of e-discovery management on hundreds of cases.

Identify It: Scanning for Sensitive Data during Processing

Software tools that process ESI for legal discovery include scanning features intended to find sensitive data. These scans typically target content that can be pattern-matched with regular expressions, such as Social Security numbers, birthdates, credit card numbers, and driver’s license numbers. Unfortunately, when used “out of the box,” these scanning tools often perform poorly in finding sensitive data, with unacceptably low recall (here, the ratio of sensitive documents identified to the total number of sensitive documents in the collection) and precision (the ratio of sensitive documents identified to the total number of documents identified).

But these scans can be refined and optimized to improve the levels of precision and recall. A best practice is to run optimized scans on every collection of data during processing, to find as much sensitive data as possible early in the workflow.

Relying on a standard scanning tool without understanding its recall could be dangerous. Users may get a false sense of security that the scan is finding the sensitive data in the collection, when in fact the scan may have missed significant amounts. Likewise, poor precision could lead to time and money wasted through chasing down false hits for sensitive data that do not help mitigate risk. It’s a good practice to hold a healthy level of skepticism about scanner tools, test the results, and consider making improvements.
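The ideas above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's actual scanner: the patterns shown are hypothetical and deliberately naive (for instance, the Social Security number pattern matches only the hyphenated form), which is exactly why measuring recall against a human-verified sample matters.

```python
import re

# Hypothetical "out of the box" patterns -- real scans need tuning and validation.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # misses "123 45 6789"
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan(text):
    """Return the set of pattern names that match the document text."""
    return {name for name, rx in PATTERNS.items() if rx.search(text)}

def precision_recall(flagged, truly_sensitive):
    """Compare scan hits against a human-verified ground-truth sample.

    flagged and truly_sensitive are sets of document IDs.
    """
    true_positives = len(flagged & truly_sensitive)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_sensitive) if truly_sensitive else 0.0
    return precision, recall
```

Running this over a sample where a reviewer has already marked the truly sensitive documents exposes the gap: a document containing an SSN written with spaces slips past the hyphen-only pattern, so recall falls below 1.0 even though precision looks perfect.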

Analyze and Cull Sensitive Data

After processing, consider using analytics tools to find and protect sensitive data that may not have been uncovered in the first round. For example, predictive coding, concept clustering, and other forms of technology-assisted review can identify some types of sensitive data and group and organize documents containing sensitive data for more efficient handling. Once sensitive data is found in a document, “find more like” tools can locate similar documents in the collection. Email threading can logically group communications containing sensitive information. Keyword searches – properly tested and validated, of course – can also be effective at finding sensitive data.

Once all documents containing sensitive data have been found and categorized, those that are irrelevant to the matter can be culled out—thus minimizing the volume of sensitive data that advances to the next phase of the workflow. Tools in use today enable quick and efficient culling across large groups of documents.
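To make the “find more like” idea concrete, here is a toy sketch using Jaccard similarity over word sets, plus a culling step. Commercial analytics platforms use far more sophisticated methods (concept vectors, trained classifiers); this example only illustrates the workflow, and every function name here is invented for illustration.

```python
def token_set(text):
    """Reduce a document to a lowercase set of words for rough comparison."""
    return set(text.lower().split())

def find_more_like(seed_text, docs, threshold=0.5):
    """Return IDs of documents whose Jaccard overlap with the seed meets the threshold.

    docs is a dict of doc_id -> text. A known-sensitive document serves as the seed.
    """
    seed = token_set(seed_text)
    hits = []
    for doc_id, text in docs.items():
        toks = token_set(text)
        union = seed | toks
        if union and len(seed & toks) / len(union) >= threshold:
            hits.append(doc_id)
    return hits

def cull(docs, irrelevant_ids):
    """Drop documents coded irrelevant so they never advance in the workflow."""
    return {doc_id: text for doc_id, text in docs.items()
            if doc_id not in irrelevant_ids}
```

In practice the similarity hits would feed a reviewer's queue, and only documents confirmed both sensitive and irrelevant to the matter would be culled.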

Conduct a Careful Review for Sensitive Data

A thoughtful review protocol can assist in protecting sensitive data as the project moves into attorney review. First, take advantage of the hosting platform’s highlighting or flagging capabilities to call out sensitive data for anyone who accesses the documents.

Next, any team member conducting a review or otherwise working in the documents—whether counsel or a specialized document review team is handling the matter—must be fully educated on the types of information considered sensitive and how to handle that information when it is encountered. Both substantive guidelines and a process workflow should be defined—and documented in writing—for handling sensitive data.

How should sensitive data be coded? What data should be withheld? What data should be redacted? If redactions will be made, who is responsible for the redactions, and what is that workflow? What information should be produced but marked with a confidentiality or other protective designation? What are the operative provisions of the applicable protective orders and/or confidentiality agreements, and how should those provisions be reflected in the workflow? All these questions must be answered at the outset of discovery before any review or other work with the documents takes place.

In some matters, counsel or their clients do not want to conduct a traditional review of the document collection before production, but instead want to run screens to filter out privileged content and then produce whatever remains. Although strong protective orders and Federal Rule of Evidence 502(d) orders can allow for that approach while still protecting privilege, such arrangements may not appropriately protect other types of sensitive data, especially PII and PHI. For a “screen then produce” approach to fully protect sensitive personal data, a robust, validated screen for sensitive data must be used to identify any protected content before production. And for certain types of sensitive data not amenable to an automated screen, some amount of review must be performed to prevent production of that information.

Whatever the review type, this screening need not be an inefficient, expensive, manual linear process. Technology can dramatically increase overall review efficiency. For instance, coding decisions for sensitive data can be automatically propagated across duplicate documents. Redactions of sensitive data can be automated. Even in native documents like spreadsheets, redactions can be performed quickly using specialized workflows and tools.

Check for Quality, Consistency, and Gaps in Treatment of Sensitive Data

Once documents are potentially ready for production, use the technology features of the hosting platform to check for mistakes or inconsistencies in how sensitive data is handled. An experienced project manager can suggest appropriate checks based on the workflow used. For example, run searches to confirm that every document slated for production has an affirmative coding decision, that all duplicate documents are coded for sensitive data identically, and that all documents noted for redaction of sensitive data do in fact have redactions applied. Consider whether to segregate documents that contain nonsearchable content to confirm that they don’t contain sensitive data that analytics couldn’t capture.
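The quality checks described above amount to a handful of simple queries over the review database. As a hedged sketch, assuming a minimal record structure (the field names here are hypothetical, not any platform's schema), the three searches might look like this:

```python
from collections import defaultdict

def qc_checks(docs):
    """Run pre-production quality checks over coded review records.

    docs maps doc_id -> record with keys:
      'dup_group'       duplicate-group identifier
      'sensitive'       bool, sensitive-data coding decision
      'needs_redaction' bool, flagged for redaction
      'redacted'        bool, redaction actually applied
      'coded'           production coding decision, or None if missing

    Returns a list of (doc_id, issue) pairs for human follow-up.
    """
    issues = []
    # Check 1: every document slated for production has an affirmative coding decision.
    # Check 2: every document noted for redaction actually has redactions applied.
    for doc_id, rec in docs.items():
        if rec.get("coded") is None:
            issues.append((doc_id, "no coding decision"))
        if rec.get("needs_redaction") and not rec.get("redacted"):
            issues.append((doc_id, "redaction noted but not applied"))
    # Check 3: all duplicates carry identical sensitive-data coding.
    codings_by_group = defaultdict(set)
    for rec in docs.values():
        codings_by_group[rec["dup_group"]].add(rec["sensitive"])
    for doc_id, rec in docs.items():
        if len(codings_by_group[rec["dup_group"]]) > 1:
            issues.append((doc_id, "inconsistent duplicate coding"))
    return issues
```

Anything this report flags goes back to a reviewer before the production set is finalized.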

Finally, before any production goes “out the door,” consider running one last scan/search for sensitive data elements to confirm that any sensitive content in the production correctly belongs there and, if so, that the appropriate protections are in place for those documents. That final check can also be used for an additional measurement and validation of the recall and precision of your sensitive data scans, to provide further proof of the defensibility and reasonableness of your efforts.

Use Knowledge Gained in Discovery to Inform Information Governance

Organizations frequently express surprise when the legal discovery process finds sensitive data in places where the information should never be. We refer to this situation as “data exhaust” or “data destructuring”—the movement of sensitive data from protected systems of record (such as structured databases, vaults and repositories) into unprotected systems that should not contain the information, such as file shares, network drives, email, and personal computers/devices.

By identifying sensitive data exhaust or destructuring when it turns up in discovery collections, then working “upstream” in the organization to determine the source or cause, litigation counsel can play a valuable role in improving information governance practices across the organization. Working with other stakeholders such as compliance, human resources, information management, and IT, the company can identify root causes of the information governance failures, prevent future occurrences, and mitigate the risk of compromising sensitive data.
