Beyond Email: Deduplication in eDiscovery
In addition to email, an effective deduplication strategy will address Microsoft 365 documents, chat and text messages.
When searching, processing, and reviewing documents, discovery teams must deal with tremendous volumes of duplicate documents. In companywide email chains, for example, a message is sent to multiple recipients and stored in each person's mailbox. The attached documents might also be found on employees' hard drives, file servers, or company backup locations.
Duplicate documents waste time and money in eDiscovery. When attorneys review the same files repeatedly, problems arise: one reviewer may tag a document as responsive, for example, while another reviewer sees a duplicate of that document and categorizes it as privileged, to say nothing of the time wasted reviewing the same document twice. This article discusses the major elements of deduplication and how the process differs in today's eDiscovery world, which must handle not just email but short-form messages as well.
Deduplication in Relativity Server Processing
In Relativity Server Processing (RSP), deduplication during processing happens at the family level. RSP identifies the first published copy of a family as the master copy and then deduplicates any subsequent copies against that master. There are three deduplication options available for each processing set:
Global deduplication deduplicates documents against all documents previously processed and published to the workspace.
Custodial deduplication deduplicates documents against all documents previously processed and published to the workspace from data sources with the same custodian value.
No deduplication means all documents are processed and published to the workspace without any removal of duplicates (even within the processing set).
To learn more, talk with your ProSearch project manager, or see the Relativity support documentation on deduplication considerations.
How Does It Work?
Deduplication algorithms generate a practically unique cryptographic fingerprint, or hash, for the data they process. A cryptographic hash function is a mathematical formula that turns any stream of data into such a fingerprint. Various algorithms exist. The SHA family (SHA-1 and the SHA-2 variants SHA-256 and SHA-512) can all be used for hash generation; which one is used depends on the software in question.
Tools for eDiscovery run one or more of these hash algorithms on the binary content of the files they process to generate the fingerprint. The algorithm hashes the actual zeros and ones that represent the file, not the words or other representative content of the file. This is generally the case for loose files, not email. Content-based deduplication and analytics are beyond the scope of this article.
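To make this concrete, here is a minimal Python sketch of hash generation using the standard hashlib library. The file names are hypothetical, and the choice of SHA-256 is illustrative; any of the SHA algorithms can be swapped in.

```python
# Minimal sketch of file fingerprinting with Python's hashlib.
# File names are hypothetical; algorithm choice is illustrative.
import hashlib

def fingerprint(path: str, algorithm: str = "sha256") -> str:
    """Hash a file's raw bytes (the zeros and ones) and return the hex digest."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):  # read in 64 KB chunks
            h.update(chunk)
    return h.hexdigest()

print(fingerprint("contract.docx"))          # SHA-256 fingerprint
print(fingerprint("contract.docx", "sha1"))  # same bytes, different algorithm
```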
The first file seen with a unique hash is flagged as the master, to which all subsequent files are compared. Any later file with the same hash is a duplicate: its location and a few attributes are noted, and further processing of that duplicate is dropped.
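A simplified sketch of that first-seen logic, reusing the hypothetical fingerprint() helper above, might look like this:

```python
# Simplified first-seen ("master") deduplication. Real processing tools
# record far more attributes (custodian, path, timestamps) per duplicate.
def deduplicate(paths: list[str]) -> dict[str, dict]:
    masters: dict[str, dict] = {}  # hash -> master record
    for path in paths:
        digest = fingerprint(path)
        if digest in masters:
            # Duplicate: note its location, then drop further processing.
            masters[digest]["duplicate_locations"].append(path)
        else:
            # First file seen with this hash becomes the master.
            masters[digest] = {"master": path, "duplicate_locations": []}
    return masters
```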
Envelope Metadata
Envelope metadata can be thought of as the information a file system or operating system needs to manage a file it stores. All file systems (FAT, NTFS, exFAT) require basic information about each file in order to store and open it correctly. The file's name, its last modified, created, and accessed time stamps, and its security permissions are all examples of this metadata. All such metadata is stored by the file system in its own data structures, not in the file itself. Changes to envelope metadata do not affect the content of the file, and therefore its hash is unaffected. Changing envelope metadata does not change deduplication behavior.
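A quick sketch (hypothetical file names again) shows why: renaming a file and resetting its time stamps changes only what the file system records about the file, so the content hash, and therefore the deduplication result, is identical.

```python
# Envelope metadata lives outside the file's bytes, so changing it
# leaves the content hash untouched. Paths are hypothetical.
import os

before = fingerprint("report.pdf")
os.rename("report.pdf", "renamed_report.pdf")  # change the file name
os.utime("renamed_report.pdf", (0, 0))         # reset accessed/modified times
after = fingerprint("renamed_report.pdf")
assert before == after  # same bytes, same hash, same dedup outcome
```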
Extended Metadata
More advanced file formats store more metadata than is available in the envelope fields. Microsoft Office files, for example, track who saved the document last, how many words are in the document, the editing time, and other details. Photographs store information on the make and model of the camera used to capture the image, the GPS coordinates of the location, and camera settings. Music files record the artist, year, track length, and the album cover art information. There are standards governing how and where this metadata is stored (EXIF for images, ID3 for audio, Open XML for Microsoft Office), which Windows can recognize and expose in the file's properties window.
All this metadata, however, is stored as part of the file’s content. Changing any one piece of extended metadata is in effect the same as changing the content of that file. This affects the hash and thus the deduplication results.
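The following toy example simulates a format that stores an author field inline with the document body; real formats (EXIF, ID3, Open XML) behave the same way, just with more structure.

```python
# Extended metadata is part of the file's own bytes, so editing it is a
# content change. Toy simulation; real formats embed metadata similarly.
import hashlib

body = b"...document text..."
v1 = b"author=Alice;" + body
v2 = b"author=Bob;" + body  # only the embedded metadata differs

print(hashlib.sha256(v1).hexdigest() == hashlib.sha256(v2).hexdigest())  # False
```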
Email Metadata
Emails are effectively an aggregate of a series of extended metadata fields. Metadata such as the sender, recipients, CC, BCC, originating IP address, subject line, email body, content type, authentication results, SMTP version, and dozens of other fields is all packaged in a single email. These fields may be added to and changed as an email moves through the internet to its destination. Some email archiving programs even wrap the entire email file in its own file wrapper. The underlying binary content of an email therefore changes constantly, and so does its content hash. Basic binary deduplication would therefore fail on most emails.
Email makes up a large share of eDiscovery data, however, so deduplication algorithms cannot simply treat emails as normal files, fail to deduplicate them, and leave it at that. To address this, rather than hashing all of an email's content, email deduplication algorithms select a few fields and hash only those. These fields are usually from, to, CC, subject, date, email body, and attachments. All other fields in the email are ignored.
Different eDiscovery providers implement this in slightly different ways. Relativity, for example, combines the metadata fields into one string and then hashes the aggregate. Others hash the fields individually, combine the results, and hash that combination. The effect is the same: uniqueness is determined from select information in the file rather than from all of its binary content. Both patterns are sketched below.
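The fields chosen, the delimiter, and the algorithm all vary by tool; nothing in this sketch reproduces any vendor's exact implementation.

```python
# Field-based email deduplication, two common patterns. Field selection,
# delimiters, and algorithm are illustrative, not any vendor's actual spec.
import hashlib

fields = {
    "from": "damir@prosearch.com",
    "to": "team@prosearch.com",
    "cc": "",
    "subject": "Q3 report",
    "date": "2023-02-01T09:00:00Z",
    "body": "Please see attached.",
}
order = ("from", "to", "cc", "subject", "date", "body")

# Pattern 1: concatenate the selected fields, then hash the aggregate.
aggregate = "|".join(fields[k] for k in order)
hash_a = hashlib.sha256(aggregate.encode("utf-8")).hexdigest()

# Pattern 2: hash each field individually, then hash the combined digests.
digests = [hashlib.sha256(fields[k].encode("utf-8")).hexdigest() for k in order]
hash_b = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
```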
Normalization
Different email clients store the key metadata in different ways. Gmail, for example, may store sender information as Kahvedzic, Damir <Damir@prosearch.com> while Outlook may store it as Damir Kahvedzic <Damir@prosearch.com>. Outlook may add a carriage return or an extra space character to the email body while Mozilla Thunderbird may not.
These slight, imperceptible differences in representation are enough to throw off the hash algorithm and therefore the deduplication process. To account for this, deduplication algorithms include a step called normalization, which standardizes the information to a single format before the hash is generated. Email addresses are converted to a single standard format, and email content is normalized by removing spaces and other whitespace characters.
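A minimal, hypothetical normalization pass might look like the following; production implementations standardize far more than this (display-name order, date formats, quoted text, and so on).

```python
# Hypothetical normalization run before hashing. Real implementations
# standardize many more data points than shown here.
import re

def normalize_address(raw: str) -> str:
    """Reduce 'Kahvedzic, Damir <Damir@prosearch.com>' to 'damir@prosearch.com'."""
    match = re.search(r"<([^>]+)>", raw)
    address = match.group(1) if match else raw
    return address.strip().lower()

def normalize_body(raw: str) -> str:
    """Strip all whitespace so client-specific spacing cannot affect the hash."""
    return re.sub(r"\s+", "", raw)

# Both client representations now normalize to the same value.
assert normalize_address("Kahvedzic, Damir <Damir@prosearch.com>") == \
       normalize_address("Damir Kahvedzic <Damir@prosearch.com>")
```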
Normalization is the secret sauce of deduplication. It’s a fine art to standardize the data points. It’s not always 100% accurate, but it does significantly improve detection of duplicates.
Chat Is Not Email: Short Message Deduplication
Now let’s consider text and chat messages. Modern business communications aren’t linear. Conversations can span channels, direct messages, and platforms. They can last minutes, hours, days, or longer. We can’t meet the discovery challenge by forcing messages into arbitrary bundles for linear review.
ProSearch uses a proprietary solution called WorkStream™ to manage short message data sets. With WorkStream your review can be as dynamic as the way we communicate.
Short messages such as texts, chats, and similar data are not stored as individual files at all; they are stored in databases. The deduplication algorithm needs to extract the relevant data first, then identify specific data points to determine uniqueness.
WorkStream takes each message's owner, date, event (content), and communication name, hashes each individually, and then combines the results to generate a unique hash for comparison purposes. WorkStream supports export to RSMF for processing and loading in Relativity for review and production. We can generate RSMF files for distinct review populations and export in 24-hour segments.
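WorkStream's internals are proprietary, so the following is only a hypothetical sketch of the general pattern just described: hash each data point individually, then hash the combined result.

```python
# Hypothetical sketch only; WorkStream's actual implementation is proprietary.
import hashlib

def message_hash(owner: str, date: str, event: str, communication: str) -> str:
    """Hash each data point individually, then hash the combined digests."""
    parts = [hashlib.sha256(p.encode("utf-8")).hexdigest()
             for p in (owner, date, event, communication)]
    return hashlib.sha256("".join(parts).encode("utf-8")).hexdigest()

message_hash("dkahvedzic", "2023-02-01T09:00:00Z", "Running late", "Team Chat")
```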
Scope: Optimizing the Deduplication Process
Both WorkStream and Relativity Server Processing add a scope element to their deduplication process to allow the user to control what gets deduplicated. Relativity’s scope options allow global deduplication, custodian level deduplication, or no deduplication at all. This is set in the processing profile.
WorkStream allows the user to control which documents get deduplicated against each other by managing the value in the workspace field. This field is set by the user for every processing job and is added to the hash generation. Making the field the same for all processing batches in effect results in global deduplication; making the value a custodian's name produces custodial deduplication; and making the field unique for each processing set results in batch-level deduplication. There is no option to turn off deduplication completely in WorkStream.
The process is very flexible and lends itself to some creative deduplication scoping. For example, you can in effect cause device-level deduplication by setting the same workspace value on processing jobs from the same device.
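Conceptually, folding a scope value into the hash input means only items that share that value can ever collide. A hypothetical sketch:

```python
# Hypothetical sketch of scope-controlled deduplication: items can only
# be duplicates of one another if they share the same scope value.
import hashlib

def scoped_hash(content_hash: str, scope_value: str) -> str:
    return hashlib.sha256(f"{scope_value}|{content_hash}".encode("utf-8")).hexdigest()

h = "9f2c..."                          # a message's content hash (truncated)
scoped_hash(h, "GLOBAL")               # same value everywhere -> global dedup
scoped_hash(h, "custodian:kahvedzic")  # per custodian -> custodial dedup
scoped_hash(h, "batch-0042")           # unique per job -> batch-level dedup
scoped_hash(h, "device:laptop-17")     # same device -> device-level dedup
```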
Deduplication in Microsoft 365
When using Purview eDiscovery tools to export the results of an eDiscovery search, you have the option to deduplicate the results that are exported. This means that when you enable deduplication, only one copy of an email message is exported even though multiple instances of the same message might have been found in the mailboxes that were searched. Deduplication helps you save time by reducing the number of items to be reviewed after the search results are exported.
It’s important to understand how deduplication works and be aware that there are limitations to the Microsoft algorithm that might cause a unique item to be marked as a duplicate during the export process. Work with your ProSearch project manager to determine the best approach to M365 deduplication.
Keeping Track of It All
Keeping track of duplicates is extremely important for knowing where every document came from. In our Relativity templates, XM_MD5HASH is the field that contains the key hash value used for deduplication. Other fields such as SHA1_HASH and SHA256_HASH are also kept, but only to comply with certain production specifications.
A number of fields contain deduplication information, and care must be taken to understand their behavior, especially if deduplication settings change between batches. The All Custodians field for a document stores the custodians of every duplicate that document has been deduplicated against. If a batch of documents in the same workspace was processed with deduplication set to the custodial level or to none, its duplicates may not be reflected in that field.
New Standard for Email Deduplication across Platforms
Teams working on eDiscovery at times run into situations where they want to deduplicate data across platforms. This occurs when clients or providers want or need to change processing and review platforms, or wish to use multiple tools that have different duplicate identification standards.
Without a solution, parties end up paying for duplicative review and analysis of the same documents.
To address the problem, the EDRM organization formed a task force to create a solution. The resulting EDRM Cross-Platform Email Duplicate Identification Specification provides a framework for identifying duplicates across multiple email platforms. The solution involves hashing an email's Message-ID metadata field to produce what is known as the EDRM message identification hash, or MIH. This new approach will not replace current email deduplication methods but will enable cross-platform email duplicate identification.
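As a rough illustration only: the specification itself governs the exact normalization and hash algorithm, so consult it before implementing anything. This sketch assumes SHA-256 and simple lowercasing for demonstration.

```python
# Rough illustration only. The EDRM specification defines which hash
# algorithm is used and how the Message-ID is normalized; SHA-256 and
# lowercasing are assumptions made for this sketch.
import hashlib

message_id = "<CABc123xyz@mail.example.com>"  # hypothetical Message-ID header
mih = hashlib.sha256(message_id.strip().lower().encode("utf-8")).hexdigest()
```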
MIH was introduced in February 2023 and is not yet in widespread use. Learn more about the message identification hash standard in a post by Craig Ball. Read it here: Introducing the EDRM E-Mail Duplicate Identification Specification and Message Identification Hash (MIH).
The ProSearch Approach
On every matter, client objectives are taken into consideration before a deduplication methodology is implemented. We encourage clients to consider plans for data reuse across multiple matters, hosting requirements, and other strategies that might inform a decision to deduplicate globally or at the custodian level.