Processing in eDiscovery is an umbrella term for a number of necessary technical processes. Its purpose is two-fold: first, to get the ESI collection into the best format for analysis and review; second, to remove unneeded files from the dataset prior to review.
To best serve their clients, litigators need to understand processing basics and how they impact eDiscovery workflow and costs. Lawyers provide information to clients and help them make decisions about eDiscovery. To be most effective, they must be able to explain why processing is necessary and important. In addition, lawyers who are familiar with filtering options are able to successfully collaborate with their eDiscovery service providers on cost-reduction strategies. Finally, many states’ ethical rules now include the duty of technological competence, which extends to standard eDiscovery technology like processing.
Five powerful processing filters to reduce data volume
Reducing data volumes is the key purpose of filtering during processing. Each of these filters could be applied individually; however, the best results come from a strategic combination of these common filters.
DeNISTing – “DeNISTing” is industry parlance for culling system files, malware and other non-user-created files based on a list of known file types published by the US National Institute of Standards and Technology (NIST). DeNISTing uses hash value comparison. A hash value is a unique alphanumeric value generated by running the binary data of an electronic file through a mathematical algorithm, a process known as hashing. Each file’s hash value is compared to the NIST list. If it’s a match, it’s removed. This article provides an in-depth explanation of hash values in eDiscovery.
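To make the hash comparison concrete, here is a minimal Python sketch. The hash list below is a one-entry, invented stand-in; real deNISTing tools work from the full NIST reference list, not a hand-built set.

```python
import hashlib

# Hypothetical stand-in for the NIST list of known-file hash values.
# A real deNISTing workflow loads the published NIST list instead.
KNOWN_SYSTEM_HASHES = {
    "5d41402abc4b2a76b9719d911017c592",  # illustrative entry only
}

def md5_of_file_bytes(data: bytes) -> str:
    """Run the file's binary data through the MD5 algorithm (hashing)."""
    return hashlib.md5(data).hexdigest()

def is_nist_match(data: bytes) -> bool:
    """A file whose hash value appears on the list is culled."""
    return md5_of_file_bytes(data) in KNOWN_SYSTEM_HASHES
```

Because the hash is computed from the file’s binary content, the same content always produces the same value, which is what makes list-based culling reliable.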
De-duplication – De-duplication removes exact duplicates so only one copy of a file is published to the review dataset. Global de-duplication across all processed data (as opposed to per-custodian de-duplication) is the norm. Leading eDiscovery platforms like Relativity and Viewpoint populate a “common custodian” field that cross-references all the copies. This allows a reviewer to see at a glance the full list of custodians who had a copy of the document in their electronic files. Like deNISTing, de-duplication is based on hash value comparison.
Date range – Date filtering excludes documents that were created before and/or after the relevant time period. A single date filter can be applied globally or different date filters can be applied to data subsets based on issues or custodians. For instance, many cases have different cut-off dates for liability and damages.
File type – File type filtering can be based on file extensions (e.g., .DOC, .PDF, .JPG, .WAV) and/or file headers (e.g., DLL files). The main purpose of file type filtering is to remove all files of a type known to have no relevance to the case. A typical use of file type filtering is culling custodians’ personal music, videos and pictures. This can be surprisingly effective in reducing overall data volume even when the total number of files removed is modest. This is because on average, the file size of multimedia and graphics files is much larger than other file types.
Email domain – Email domain filtering removes all messages from domains (e.g., @amazon.com) that are known to be irrelevant. My post Leveraging Email Domains for Data Filtering is an in-depth look at email domain filtering.
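Since the best results come from combining filters, here is a minimal sketch of the date-range, file-type and email-domain filters applied together to a single record. Every field name, cut-off date and exclusion value is invented for illustration.

```python
from datetime import date

# Illustrative cut-offs and exclusion lists -- all values are invented.
RELEVANT_RANGE = (date(2018, 1, 1), date(2020, 12, 31))
EXCLUDED_EXTENSIONS = {".mp3", ".mp4", ".jpg", ".wav"}
EXCLUDED_DOMAINS = {"amazon.com"}

def passes_filters(doc: dict) -> bool:
    """Apply date-range, file-type and email-domain filters to one record.
    `doc` is a hypothetical metadata dict produced during processing."""
    start, end = RELEVANT_RANGE
    if not (start <= doc["created"] <= end):
        return False                     # outside the relevant time period
    if doc["extension"].lower() in EXCLUDED_EXTENSIONS:
        return False                     # known-irrelevant file type
    if doc.get("sender_domain") in EXCLUDED_DOMAINS:
        return False                     # known-irrelevant email domain
    return True

doc = {"created": date(2019, 6, 1), "extension": ".doc", "sender_domain": None}
# This record survives all three filters and is published for review.
```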
Understanding the processes that make up processing
Processing should always be handled by an experienced eDiscovery service provider or litigation support department. It’s a complex technical specialty with significant room for error in inexperienced hands. Substantial QC must be done throughout the workflow.
In addition, it must be performed using validated, purpose-built eDiscovery software. Some eDiscovery platforms, such as Relativity and Viewpoint, include robust processing functionality. Other platforms require data to be processed using standalone software prior to loading into the database.
The numerous processes and tasks that make up the processing workflow are directed to three ends:
1) make it possible to load the ESI into an eDiscovery review platform;
2) organize the ESI for analysis and review; and
3) prepare for production.
Five main processes to publish data for review
The five main processes used to publish ESI in an eDiscovery review platform are:
- Extracting container files – Files are extracted from containers such as .ZIP and .PST files. Container files are often compressed to reduce storage demand in the source system. When post-processed data volume is higher than collection volume, it’s usually because there were compressed container files that had to be extracted.
- Indexing – Text is indexed to create a master index for keyword searching in the database. This includes text in the body of the document and metadata fields.
- OCRing – Optical character recognition (OCR) is a technical process to convert images into searchable text. It’s used to make files such as scanned handwritten documents and non-searchable PDFs searchable.
- Extracting embedded objects – Embedded objects such as a table embedded in a Word document are extracted and published to the review database as separate records. A parent-child relationship is maintained between the file and embedded object. Best practice is to place reasonable limits on what objects are extracted; for instance, to not extract the graphic in an email signature.
- Converting legacy files – File types that were created in obsolete technical formats must be converted into processable file types. A common example is archived mail that originated in a company’s inactive, legacy email system.
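The container-extraction point above explains why post-processed volume can exceed collection volume: compressed data expands when extracted. A small sketch, using invented filenames and content, builds a .ZIP container in memory and compares the sizes.

```python
import io
import zipfile

# Build a small in-memory .ZIP container holding two invented files.
payload = {
    "memo.txt": b"quarterly results " * 500,   # highly compressible text
    "notes.txt": b"action items " * 500,
}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in payload.items():
        zf.writestr(name, data)

archive_size = len(buf.getvalue())

# Extraction during processing: each member becomes its own record.
with zipfile.ZipFile(buf) as zf:
    extracted = {name: zf.read(name) for name in zf.namelist()}

extracted_size = sum(len(d) for d in extracted.values())
# extracted_size exceeds archive_size, which is why post-processed data
# volume is often higher than the volume collected.
```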
Every dataset contains files that can’t be processed. This includes corrupted files, encrypted files and non-processable technical file types like .EXE files. Non-processed files are termed exceptions. They are logged by the processing software. Your provider will generate an exceptions report for your review.
Some exceptions can be resolved. For example, the client may be able to provide replacement copies of corrupted files and passwords for password-protected files. Depending on the type of encryption that was used, password cracking software may be successful in decrypting files. However, as a general matter most exceptions don’t require further action because they fall into the category of technical file types that by definition don’t contain any user-created data.
Four ways processing lays the groundwork for efficient review
There are four main ways processing organizes data for review:
- DocID – A unique document identifier (DocID) is assigned to every electronic file. The DocID is used to track the file from processing through production.
- Family relationships – Document families are broken out into their individual files in the review database for convenience in searching, sorting, issue tagging and production. Parent-child relationships (e.g., email and attachments, PowerPoint with embedded video) are maintained using database fields.
- Custodian – A custodian identifier is coded in a database field for each group of files ingested into the processing software. Your eDiscovery service provider will ask you for the custodian information prior to processing.
- Time zone normalization – Time zone is normalized to Coordinated Universal Time (UTC) (i.e., date and timestamps are converted from the custodian’s local time zone into UTC). Normalization is necessary for de-duplication, chronological sorting and to prevent reviewer confusion. The original time zone is shown in a database field and can be used during production.
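The time zone normalization step can be sketched in a few lines of Python using the standard library’s `zoneinfo` module (Python 3.9+). The timestamp and zone name are invented for illustration; processing software performs this conversion for every date field it extracts.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

def normalize_to_utc(local_dt: datetime, tz_name: str) -> datetime:
    """Convert a custodian's local timestamp to UTC. In a real platform,
    the original zone would be preserved in a separate database field."""
    return local_dt.replace(tzinfo=ZoneInfo(tz_name)).astimezone(ZoneInfo("UTC"))

# A message sent at 9:00 a.m. in New York (UTC-4 during June daylight time)
sent = datetime(2023, 6, 1, 9, 0)
utc = normalize_to_utc(sent, "America/New_York")
# utc is 2023-06-01 13:00 UTC
```

Normalizing before de-duplication matters: two copies of the same email collected from custodians in different time zones must hash and sort identically.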
Processing is the first step in producing metadata
Finally, processing prepares for production of metadata. Production specifications typically include a list of metadata that will be produced for each file (where the metadata field exists in the original). While these lists are largely standardized, they may vary due to specific case issues or the requirements of the requesting party’s eDiscovery platform.
In order to produce a metadata field, it must first be extracted during processing. A copy of the production specifications should be provided to your eDiscovery service provider or litigation support staff before processing. QDiscovery has created template production specifications to assist clients in requesting and negotiating form of production.
Best practice is to have your provider review the proposed production format in advance. They will flag any fields that are not supported by the processing software so you can strike them from the specification during the meet and confer process.
The processing stage isn’t just an eDiscovery necessity. It’s an opportunity to reduce overall eDiscovery spend. It’s widely estimated that document review accounts for 70% of the cost of discovery. Data volume is the primary driver of review costs. Consequently, pre-review filtering in processing has an outsized positive effect on total project cost.
Helen Geib is General Counsel and Practice Support Consultant for QDiscovery. Prior to joining QDiscovery, Helen practiced law in the intellectual property litigation department of Barnes and Thornburg’s Indianapolis office where her responsibilities included managing large scale discovery and motion practice. She brings that experience and perspective to her work as an eDiscovery consultant. She also provides trial consulting services in civil and criminal cases. Helen has published articles on topics in eDiscovery and trial technology. She is a member of the bar of the State of Indiana and the US District Court for the Southern District of Indiana and a registered patent attorney.
This post is for general informational and educational purposes only. It is not intended as legal advice or to substitute for legal counsel, and does not create an attorney-client privilege.