Companies of all types and sizes are drowning in data. A major consequence for corporate litigation is that data filtering to reduce volume is an essential part of eDiscovery. The good news is there are a number of proven, cost-effective strategies and technology tools for filtering at the identification and processing stages.
Most eDiscovery projects benefit from a combination of several filtering options prior to review. For discussion purposes, I’ve organized these options into three categories:
- “Substantive filtering” is based on legal determinations of relevance given the subject matter of the case and the scope of requests for production.
- “Technical filtering” is the use of standard eDiscovery technology tools to cull irrelevant and duplicative files.
- “Hybrid filtering” is automated culling based on substantive decisions about relevance.
Substantive filtering may be used during both identification and processing. Technical filtering and hybrid filtering options are applied during processing.
Substantive filtering: include what’s relevant and exclude what’s not
The key substantive filters are custodian and data source identification, date range restrictions and keyword searches.
In eDiscovery terminology, “custodians” are employees who have relevant documents. Most custodians also have personal knowledge of the subject matter of the case. The exception is so-called “records custodians,” such as IT staff, records management personnel and administrative assistants, who control relevant data sources without personal knowledge of the underlying dispute.
Data sources include computers, phones, email servers, document management systems, websites, backup tapes and so on. Every data source has a custodian. Generally speaking, custodians can be expected to have several data sources (although a records custodian may control only a single data source, such as the records manager for a document management system).
Identifying the right custodians and data sources at the outset is hugely important. Comprehensive identification is the foundation for a document collection that captures the relevant data while minimizing, to the extent possible, the collection of irrelevant data.
In addition, counsel should consider applying a date restriction. Date filtering excludes documents that were created before and/or after the relevant time period. Unlike custodian and data source identification, date filtering isn’t appropriate in all cases. The relevant time period may be so lengthy that date filtering does not have a meaningful impact on total data volume, or it may not be possible to determine what cut-off date to use. However, it can be very helpful in cases where the relevant time period is reasonably limited.
For technical reasons most date filtering is done during processing. A prominent exception is email servers, where it’s usually feasible to apply a date restriction during collection. Occasionally it’s even possible to exclude an entire data source based on date; an example would be a legacy system that was taken off-line prior to the relevant time period.
A strategic advantage of date filtering during processing is that different date ranges can be applied on a granular level. It’s not uncommon for a case to have multiple relevant time periods tied to different issues and custodians. For instance, many cases have different cut-off dates for liability and damages. Because financial records are typically segregated from other company records, it’s usually a straightforward matter to apply a distinct date limitation to that part of the dataset.
Custodian-based time periods arise when a change in job duties moved a custodian either into or out of the scope of the case. A specific date restriction can easily be applied to email and MS Office documents, which generally make up the bulk of the collection for individual custodians.
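To make the granular approach concrete, here is a minimal sketch of per-custodian date filtering. The custodian names and date ranges are purely illustrative, not drawn from any real matter; in practice the eDiscovery platform applies these restrictions during processing.

```python
from datetime import date

# Hypothetical per-custodian relevant time periods (illustrative only).
DATE_RANGES = {
    "jsmith": (date(2018, 1, 1), date(2020, 6, 30)),
    "mjones": (date(2019, 3, 1), date(2021, 12, 31)),
}

def within_range(custodian, doc_date):
    """Keep a document only if it falls inside its custodian's relevant period."""
    start, end = DATE_RANGES.get(custodian, (date.min, date.max))
    return start <= doc_date <= end

docs = [
    {"custodian": "jsmith", "date": date(2019, 5, 1)},
    {"custodian": "jsmith", "date": date(2021, 1, 1)},  # after jsmith's cut-off
    {"custodian": "mjones", "date": date(2020, 7, 4)},
]
kept = [d for d in docs if within_range(d["custodian"], d["date"])]
```

Documents outside their custodian's window are culled; custodians without a specified range pass through unfiltered, which mirrors the point that date filtering is applied only where a defensible cut-off can be determined.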
The third significant substantive filtering option is keyword searching. Keyword searching has long been and continues to be one of the most popular strategies for reducing data volume prior to review. It can be used by itself or in conjunction with predictive coding or other advanced review tools. For a full treatment of keyword searching, see my post “Building Blocks of Effective Keyword Search.”
Best practice is to run keyword search during processing, rather than collection. eDiscovery platforms are designed to make keyword search during processing defensible and effective. These software tools use indexing to make otherwise non-searchable documents searchable, have robust search capabilities, permit sampling and iterative searching and minimize corporate client involvement in the eDiscovery process.
It’s commonplace for new keywords to arise as issues in the case change and develop. If the collection was keyword-limited, then the only way to run additional keyword searches is to make supplemental collections. As a consequence, applying keyword searches during collection is likely to increase the burden on the corporate client in the long run.
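At its core, the iterative search workflow described above amounts to running candidate terms against an indexed dataset and reviewing hit counts before finalizing the list. The toy example below sketches that idea with a case-insensitive, whole-word search over an in-memory document set; real platforms search a full-text index rather than raw strings, and the documents here are invented for illustration.

```python
import re

# Toy document set; in practice the platform's full-text index is searched.
documents = {
    1: "Meeting notes re merger negotiations with Acme",
    2: "Lunch schedule for the week",
    3: "Draft merger agreement, privileged and confidential",
}

def keyword_hits(terms, docs):
    """Return {term: set of matching doc ids} for case-insensitive whole-word search."""
    hits = {}
    for term in terms:
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        hits[term] = {doc_id for doc_id, text in docs.items() if pattern.search(text)}
    return hits

report = keyword_hits(["merger", "acme"], documents)
```

Per-term hit sets like these support the sampling and iteration the platforms enable: an overly broad term can be narrowed, and a new term added, without going back to the client for a supplemental collection.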
Technical filtering: simple and powerful
There are two essential and customary technical filters used in processing: deNISTing and de-duplication. These automated processes are used across the board. Technical filters are managed by the eDiscovery service provider and require minimal if any lawyer input.
“DeNISTing” is industry parlance for culling irrelevant files, including system files, malware and other non-user-created files, based upon a list of known file types published by the US National Institute of Standards and Technology (NIST). Outside of exceptional circumstances (for example, a data breach case where malware is actually at issue), deNISTing is a standardized processing tool used in all cases.
As the name indicates, de-duplication removes duplicates so that only one copy of a file is passed through to the review dataset. Today’s eDiscovery platforms like Relativity and Viewpoint populate a “duplicates” field that cross-references all the copies. This allows a reviewer to see at a glance the full list of custodians who had a copy of the document in their electronic files. The lawyer managing the review project can set the de-dupe order according to a priority custodian list or the data can simply be de-duped in the order it’s received by the service provider.
Both deNISTing and de-duplication use hash value comparison. A hash value is a unique alphanumeric value generated by running the binary data of an electronic file through a mathematical algorithm, a process known as hashing. A file’s hash value is often described as its unique digital fingerprint.
In technical filtering, each file’s hash value is compared to, respectively, the NIST list and the other files in the dataset. If it’s a match, it’s removed from the review set. Technical filtering, especially de-duplication, can and usually does significantly reduce overall data volume.
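The mechanics can be sketched in a few lines: hash each file, drop it if the hash appears on the NIST list, and otherwise keep only the first copy while recording every custodian who held one. The file contents and custodians below are hypothetical, and MD5 is used here simply because it is one of the hashes traditionally employed by eDiscovery tools.

```python
import hashlib

def hash_file_bytes(data: bytes) -> str:
    """Compute a hash value (the file's digital fingerprint) from its binary content."""
    return hashlib.md5(data).hexdigest()

# Hypothetical collected files: (custodian, file content).
files = [
    ("jsmith", b"Q3 forecast spreadsheet"),
    ("mjones", b"Q3 forecast spreadsheet"),  # exact duplicate
    ("mjones", b"Q4 forecast spreadsheet"),
]

NIST_LIST = set()  # in practice, hashes of known system files from NIST

review_set = {}   # hash -> first copy passed through to review
duplicates = {}   # hash -> every custodian who had a copy
for custodian, content in files:
    digest = hash_file_bytes(content)
    if digest in NIST_LIST:
        continue  # deNISTing: cull known system/non-user files
    duplicates.setdefault(digest, []).append(custodian)
    review_set.setdefault(digest, (custodian, content))  # de-duplication
```

The `duplicates` mapping plays the role of the “duplicates” field in platforms like Relativity: one copy goes to review, but the reviewer can still see at a glance every custodian who held the document.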
Hybrid filtering: technical-type filters guided by substantive relevance
In addition to these “technical filtering” options, there are automated filtering options during processing that are based on substantive determinations. The two most common tools in this category are file type filtering and email domain filtering.
File type filtering culls documents based on file extension. For instance, MS Office file extensions (e.g., .DOC, .PPT, .XLS) may be retained while graphics file extensions (e.g., .JPG, .PNG, .BMP) are removed. The decision to cull certain file types and not others is based on a substantive assessment of relevance.
The most common use of file type filtering is culling music, videos and pictures. This can be surprisingly effective in reducing overall data volume even when the total number of files removed is modest. This is because on average, the file size of audio/video and graphics files is much larger than other file types. From a review perspective, it’s primarily used to remove employees’ irrelevant personal files such as iTunes downloads, vacation photos and the like. Like date filtering, file type filtering may be applied on a granular level (i.e., custodian by custodian).
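As a rough sketch, file type filtering reduces to checking each file's extension against an exclusion list. The extensions excluded below are illustrative; the actual list in any matter is a substantive relevance call made by counsel.

```python
from pathlib import Path

# Illustrative exclusion list; the real list is a substantive relevance decision.
EXCLUDED_EXTENSIONS = {".jpg", ".png", ".bmp", ".mp3", ".mp4", ".mov"}

def keep_for_review(path: str) -> bool:
    """Cull files whose extensions were deemed irrelevant (e.g., media files)."""
    return Path(path).suffix.lower() not in EXCLUDED_EXTENSIONS

collected = ["memo.doc", "budget.xls", "vacation.JPG", "song.mp3"]
review = [f for f in collected if keep_for_review(f)]
```

Note the case-insensitive comparison: extensions like `.JPG` and `.jpg` are the same file type and should be culled together.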
Email domain filtering removes all messages from domains (e.g., @amazon.com) that are known to be irrelevant. While largely a technical process, it is based on substantive review of the domains present in the email dataset to identify which ones can be excluded. This underutilized tool is a cost-effective way to tackle the overwhelming volume of email discovery. My post “Leveraging Email Domains for Data Filtering” is an in-depth look at email domain filtering.
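The automated half of the workflow is straightforward once the substantive domain review is done: extract each sender's domain and drop messages from excluded domains. The domains and messages below are hypothetical placeholders.

```python
# Illustrative list of domains deemed irrelevant after substantive review
# of the domains actually present in the email dataset (hypothetical names).
EXCLUDED_DOMAINS = {"newsletter.example.com", "retail-promos.example.net"}

def sender_domain(address: str) -> str:
    """Extract the domain portion of an email address."""
    return address.rsplit("@", 1)[-1].lower()

messages = [
    {"from": "ceo@opposingparty.com", "subject": "Contract terms"},
    {"from": "deals@retail-promos.example.net", "subject": "50% off this weekend"},
]
review = [m for m in messages if sender_domain(m["from"]) not in EXCLUDED_DOMAINS]
```

In practice the service provider first generates a report of all domains in the dataset with message counts; counsel then reviews that report to decide which domains can defensibly be excluded.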
The scope of discovery is delineated by a party’s obligation to produce relevant and non-privileged information that is either requested by the opposing party or needed to support the producing party’s claims and defenses. However, discovery should also be proportional to the needs and value of the case. As a practical matter, this means both identifying potentially relevant files for collection and removing irrelevant files during processing to reduce data volume prior to review. Defensible filtering is an important means to achieving and balancing these goals.
Helen Geib is General Counsel and Practice Support Consultant for QDiscovery. Prior to joining QDiscovery, Helen practiced law in the intellectual property litigation department of Barnes and Thornburg’s Indianapolis office where her responsibilities included managing large scale discovery and motion practice. She brings that experience and perspective to her work as an eDiscovery consultant. She also provides trial consulting services in civil and criminal cases. Helen has published articles on topics in eDiscovery and trial technology. She is a member of the bar of the State of Indiana and the US District Court for the Southern District of Indiana and a registered patent attorney.
This post is for general informational and educational purposes only. It is not intended as legal advice or to substitute for legal counsel, and does not create an attorney-client relationship.