A Data Science Approach to Keyword Searching

March 7, 2018 | March 20th, 2024

I recently read the January 3rd order in the In Re Broiler Chicken Antitrust Litigation matter. In eDiscovery circles, this case is gaining interest because of its document-intensive discovery involving some of the nation’s largest chicken producers.

Like many who read Special Master Maura Grossman’s order, I was pleased to see the level of detail with which the industry is now discussing methods used in analyzing data during discovery. Specifically, the order went into great detail about the validation and quality control measures parties must employ to demonstrate that their search protocols were adequate. However, unlike many others, the major takeaway for me was not the importance of using TAR or predictive coding to locate relevant documents. Rather, the biggest revelation was the attention paid to search terms. Grossman put TAR and search terms on a level playing field, without favoring the productivity of one over the other. Search terms are not going away, even in today’s world of Artificial Intelligence and machine learning. In fact, a great set of search terms can complement a well-defined TAR process quite nicely.

A New Era of Search Terms

In Re Broiler Chicken ushers in a new era in the use of search terms. In identifying documents for discovery, parties used to take an “educated guess” approach, choosing and agreeing to terms that seemed beneficial. In many instances, this still happens today. Two parties at a meet and confer look at the lists and, in essence, shake hands over a list of words, oftentimes without knowing how those terms will actually perform across a data set.

Now, don’t get me wrong: much thought often goes into these educated guesses, search term revisions, and handshake meetings. Data custodians are interviewed, subject matter experts weigh in, and software gurus hone lists. Sometimes even a keyword “hit list” is generated from a processing tool and used to select search terms. But at the end of the day, most parties underestimate the power and importance of creating a compelling and accurate set of highly relevant search terms through true statistical validation. (Note: Using a keyword hit list from a processing tool is not true validation, especially when you are randomly creating term combinations with the syntax that you think will return the desired relevant documents.)


A Data-driven Approach to Searching

Legal teams must have a methodology for selecting search terms and validating their results. Most of the time, organizations need expert assistance in truly understanding their data, and any analysis of search terms should be informed by actual data. For example, in analyzing search terms for clients, ProSearch developed an automated, iterative process, working in collaboration with data scientists, linguists, and attorneys with subject matter expertise in the case. The result is a highly relevant set of search terms, backed up by statistics. This process can include predictive coding, but it can also be used solely to create the most effective set of search terms.

But how do you go about applying a data-driven approach to keyword selection? This is where sampling plays the starring role: sampling methods form the backbone of the statistical analysis. Data points are gathered with a variety of sampling strategies, including:

  • Sampling against a population with as little duplication as possible (e.g., de-duplicated or consolidated, depending on the client’s preference)
  • Stratified sampling
  • Term sampling
  • Uncertainty sampling (if including predictive modeling in the process)
  • Random sampling
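
To make a few of these strategies concrete, here is a minimal Python sketch of simple random, stratified, and term sampling. It is an illustration only, not ProSearch's actual process: the document records, the "hit" flag, the population size, and the sample sizes are all hypothetical.

```python
import random
from collections import defaultdict

def simple_random_sample(docs, n, seed=0):
    """Draw n documents uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(docs, min(n, len(docs)))

def stratified_sample(docs, key, n, seed=0):
    """Draw about n documents, allocated to strata in proportion to their size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)
    sample = []
    for members in strata.values():
        # Take at least one document per stratum so small groups are not skipped.
        quota = max(1, round(n * len(members) / len(docs)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical population: 10,000 documents, 5% of which hit a search term.
docs = [{"id": i, "hit": i % 20 == 0} for i in range(10_000)]

# Random sampling: every document has the same chance of selection.
random_sample = simple_random_sample(docs, 400)

# Stratified sampling: hits and non-hits are represented proportionally.
by_hit = stratified_sample(docs, key=lambda d: d["hit"], n=400)

# Term sampling: sample only from documents that hit the term, so reviewers
# can estimate how many of the term's hits are actually relevant.
term_sample = simple_random_sample([d for d in docs if d["hit"]], 100)
```

Fixing the seed makes each draw reproducible, which helps when the sampling steps need to be documented and explained later.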

Obviously, the more you can automate the sampling iterations, the more effective you will be at determining the most appropriate key terms. At ProSearch, we also take process documentation seriously, so that no opposing party can ever claim that keywords were chosen in a “black box” that no one really understands. The end result is a set of search terms that is both highly effective and defensible.

The bottom line? With a data science approach, legal teams are no longer guided by what they think is best; they are informed by what they know is best. And the results are intuitive: cost and risk reduction without sacrificing defensibility.

Validation Protocols that Apply for Keyword Searching and TAR

Developing the search terms is one task at hand, but the work does not stop there. A legal team needs to prove that its search terms are performing at an adequate level of effectiveness. This is where a validation methodology comes into play.

Grossman’s validation proposal has the potential to change what it means to conduct a complete and adequate document review using search terms. The longest part of the January 3rd order focuses on results, as established through validation, rather than on the steps taken in a workflow. A process alone will not determine the adequacy of a review; validation metrics will show whether a review is insufficient or inadequate.

Grossman’s validation schema is very similar, at a high level, to the approach we take at ProSearch in recall calculation. If anything, the order highlights the difficulties of using recall for validation, and ensuring a certain level of recall only gets harder when the richness (percentage of relevant documents) in a document set is low.

For example, Grossman’s order puts a stake in the ground on the volume of documents needed for validation (3,000 documents), no matter the richness of a document set. Admittedly, the order leverages a relatively large sample from the unreviewed population of documents to ensure that error margins for calculating remaining richness are relatively small.
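
As a rough illustration of how sample size and richness interact, the sketch below computes a margin of error for an estimated proportion. It assumes a simple random sample of 3,000 documents, a 95% confidence level (z = 1.96), and the normal approximation; these are my simplifying assumptions for illustration, not the calculation prescribed in the order.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

n = 3000  # sample size, used here only for illustration
for richness in (0.10, 0.01):
    moe = margin_of_error(richness, n)
    print(f"richness {richness:.0%}: +/- {moe:.2%} absolute, "
          f"{moe / richness:.0%} of the estimate itself")
```

Under these assumptions the absolute margin is small in both cases (roughly one percentage point at 10% richness and about a third of a point at 1%), but at 1% richness the margin is about a third of the estimate itself. That is one way to see why low richness makes validation harder.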

However, when it comes to calculating recall, it’s not just about the size of the sample taken; it’s also about the size of the unreviewed population compared to the reviewed population. For example, in an overall collection of 1 million documents, a remaining richness of 1% in an unreviewed population of 900,000 vs. 90,000 would have a dramatic impact on both recall and the error margins for recall.
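
The arithmetic can be sketched as follows. The 1% remaining richness and the two unreviewed population sizes come from the example above; the 20,000 relevant documents already found, the 3,000-document sample, and the 95% normal-approximation interval are hypothetical values chosen only for illustration. The sketch also treats the number of relevant documents found as exact, which ignores reviewer error.

```python
import math

def recall_range(found, unreviewed, remaining_richness, sample_size, z=1.96):
    """Return (low, point, high) recall estimates from a sample of the unreviewed set."""
    moe = z * math.sqrt(remaining_richness * (1 - remaining_richness) / sample_size)

    def recall(rate):
        missed = rate * unreviewed  # estimated relevant documents left behind
        return found / (found + missed)

    low = recall(remaining_richness + moe)             # more missed, lower recall
    high = recall(max(remaining_richness - moe, 0.0))  # fewer missed, higher recall
    return low, recall(remaining_richness), high

FOUND = 20_000  # hypothetical count of relevant documents found in review

for unreviewed in (900_000, 90_000):
    low, point, high = recall_range(FOUND, unreviewed, 0.01, 3000)
    print(f"unreviewed {unreviewed:>7,}: recall {point:.1%} "
          f"(range {low:.1%} to {high:.1%})")
```

With these illustrative numbers, the same 1% remaining richness corresponds to about 9,000 missed documents and recall near 69% in the 900,000-document case, versus about 900 missed documents and recall near 96% in the 90,000-document case. The plausible range for recall is also far wider in the first case.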

The Grossman order also does not account for variability in how production requests are interpreted, even within the same review team. In addition, early in a review the team’s definition of relevance changes as its understanding of the case improves. Both factors affect the efficacy of the blind test. Admittedly, there must be a process through which the team doing the broader review can learn from a subject matter expert what it means to be relevant.


The End Game: Fulfill Your Discovery Obligation

At the end of the day, legal teams are simply trying to fulfill their discovery obligation and produce the desired relevant documents. Done correctly – mission accomplished!

Along the way, it helps to complete the mission at the lowest possible cost, with a defensible methodology. Have we simply left search terms in the dust because we have been told that AI/TAR is less expensive and more accurate than search terms? Did we miss an opportunity to refine the process of using search terms? It’s hard to say, but one thing is certain: while no single process can account for every data set being readied for review, a strong set of search terms is an extremely powerful tool, whether stand-alone or in tandem with TAR. I applaud Special Master Maura Grossman for recognizing the importance of data science applications and of statistical validation.


Xiao He, Ph.D. Data Scientist, Linguistics, Analytics, & Data Science (LADS)

Dr. Xiao He is a data scientist on the Linguistics, Analytics, & Data Science team at ProSearch. In addition to implementing the Technology Assisted Review solution for ProSearch, Xiao develops custom solutions and workflows, and researches machine learning applications in eDiscovery. Xiao received his Ph.D. in Linguistics with emphases in experimentation and statistics from the University of Southern California, Los Angeles, and B.A. in Psychology from the University of California, Berkeley. Prior to joining ProSearch, Xiao worked as an assistant professor of linguistics and quantitative analysis at the University of Manchester, United Kingdom.

Xiao He

Dr. Xiao He is a senior data scientist on the Applied Science team at ProSearch. Dr. He engineers information retrieval and text classification solutions for a range of eDiscovery needs – from Technology Assisted Review to identifying and protecting private and privileged information. Xiao received his Ph.D. in Linguistics with emphases in experimentation and statistics from the University of Southern California, Los Angeles, and B.A. in Psychology from the University of California, Berkeley. Prior to joining ProSearch, Xiao worked as an assistant professor of linguistics and quantitative analysis at the University of Manchester, United Kingdom.