Technology and eDiscovery: State of the Union.

I rounded out 2014 by looking back over the last 15-plus years in eDiscovery, focusing on the historical context that influenced, and continues to influence, the legal profession’s alliance with the tech industry. I want to ring in 2015 with a conversation about what has been happening since these two lovebirds came together, moving from a shaky partnership born of necessity into a relationship that has endured and matured, and even spawned a new industry: legal technology.

As the legal tech industry has evolved over the years, a range of tech solutions has come onto the market for legal professionals. A relatively young industry, legal tech is sort of like the Wild West in a lot of ways, eager to accommodate eDiscovery homesteaders in particular with a variety of products and services. Large-scale hosting platforms abound. Data analytics that help you make sense of your massive amount of ESI are customary offerings. All manner of technology, tools and processes are available to assist you in wrangling that nebulous Everest of electronica in the context of eDiscovery, and beyond.

To be sure, there is nothing more daunting than being faced with scaling this Everest of ESI in the context of eDiscovery. And there is nothing more comforting than having an expert show up and tell you they’re going to outfit you with everything you need to make this herculean task manageable. Or better yet, easy.

Be that as it may, there’s no elixir vitae in legal tech that meets all of the demands of eDiscovery. In fact, the field relies on an “à la carte” assortment of technologies and technical expertise to accommodate different parts of the spectrum of eDiscovery processes. And as we all know, there are a lot of stages and aspects to eDiscovery, many of which require a range of technical expertise. Although technology and technical expertise are paramount in eDiscovery, navigating all of our options is an ambitious task under the circumstances in which we’re operating: the legal infrastructure in which we work places an extra layer of complexity on a series of processes that are already highly complex.

As with every consumer product or service, the usual “buyer beware” caveat applies, and legal tech is no exception. Not all tech is created equal. Products may appear similar but turn out to be very different if you peek under the hood. Many times, these distinctions are singular and far-reaching. This makes it even more complicated to choose the right tool/process/application to suit your eDiscovery needs, which vary and shift at every stage of the game.

Thus, finding and employing the *right* technology and technical expertise is important. Assessing and understanding what this program does, or what that automated method achieves, is key in legal tech. But it’s challenging for a number of reasons.

First, it’s hard to understand the constraints or drawbacks of a highly technical product or service because of the nature of the data we’re working with. Here’s something that everybody who works with unstructured, text-based natural language needs to understand: It is the most complex, varied and ever-evolving data type going. Unlike structured data, it does not offer a neat one-to-one correspondence between form and function. Language is just not like that, as any linguist will tell you. Language, and the text used to graphically represent Language, is infinitely variable and innovative, and it changes every day. This is an empirical fact that needs to be accommodated by every single tech solution in eDiscovery. Period.

(Garden path alert! Skip this paragraph if you value continuity.) In the early days of auto-correct, I used to ponder the statistical methods used as the foundation of these potentially useful predictive algorithms. These algorithms were originally developed to work on structured, numerical data. They were designed to work on data that had a consistent one-to-one relationship between symbol and value. But as with everything interesting and useful, folks wanted to expand predictive modelling algorithms to work in different contexts, on different data types. As previously mentioned, text-based natural language data does not manifest on the surface as tidy, regular form/meaning relationships. In their infancy, auto-correct programs were pretty terrible, making it obvious to me that whatever data was used for beta-testing wasn’t a representative sample of the kind of text-based language that characterizes much of our everyday, computer-mediated communication. I would think to myself: Did they even consult an applied/empirical/corpus linguist in the research and development of these applications? It irritated me. But then we had all those awesome BuzzFeed tributes to auto-corrects gone awry, which amused me greatly. I’m conflicted. But I digress…

Second, employing a technical solution on huge collections of complex data is an even bigger adventure. Margins of error look very different depending on how much data you have. Not only that, but it becomes more complicated to systematically assess the validity and reliability of a tool or automated method in large collections of unstructured, text-based natural language data. Complex data and lots and lots of it. Double whammy. This makes it harder for a non-expert to empirically verify whether a particular program or application is doing what it should, or what it purports to do. To be sure, it’s doing something. But what are its limitations? What isn’t it doing? What sort of information is it eliminating or not returning? What sort of information does it privilege?

Third, it is our nature to think that when we find a tech solution that works in one context, maybe we can use it successfully in another, seemingly related context. But just because a tool/process/application works well in one particular context doesn’t mean it is a perfect fit for another, related but slightly different one. Forging ahead anyway will often produce uninterpretable or inconsistent results, unbeknownst to the user. I’ve seen this with a lot of borrowed technology that has made its way into eDiscovery, most notably predictive coding programs. Predictive coding has mostly been used for large-scale, automated categorization of produced ESI. Now predictive coding is being considered “proven technology” for relevancy identification and review in pre-production stages (see Da Silva Moore v. Publicis Groupe). In order to assess whether or not this is a good idea, you really need to have a nuanced understanding of how predictive coding algorithms work. And specifically, how these algorithms, which again were developed for structured data, react when used on unstructured, text-based natural language. I actually have a good deal to say about this, but at the risk of another garden path, I’ll save it for its own post.

Here’s a quick example that illustrates a mash-up of all these points. You’re tasked with document review. You’re using a large document hosting platform. It has a complex search feature. The search feature operates by indexing the words in the collection. This index is created and stored in a database, as is standard fare in searching tools on large-scale document hosting platforms. Nothing interesting to see here. Well, not to most people anyway. But I have questions, namely: Does the tool’s method of indexing involve dropping the “noise” words to expedite the process? Noise words are those words that occur frequently in a language and are thought to impart little to no information. They include words like the, a, of and, well, and. These words occur at very high rates and make up a large percentage of an English language corpus (in linguistics, corpus is another word for collection). Most indexers drop them because doing so cuts down tremendously on the amount of text the indexer has to process and sift through, ultimately making the search feature faster.

Whether the indexer you’re working with eliminates “noise words” is something that probably nobody using these tools, or purchasing them for use, ever really thinks about or cares about. On its face, it seems not at all worth considering. Is it even important? Maybe not if you’re just using the search tool for organizing content, or making broad inquiries in your collection, with the goal of then organizing big piles of text into smaller piles of text. But what if you’re trying to use this search feature to conduct any sort of fine-tuned, forensic text investigation? What if you’re looking for very specific answers to very specific questions? What if you’re investigating to find a smoking gun document? Then you better believe that noise word elimination matters.

Consider the difference between the following statements, “I have noticed a problem” versus “I have noticed the problem.” Now consider that the first statement was written in an email by one scientist to another regarding the analysis of endpoints in a clinical trial conducted by a big pharma company. Now consider the second statement in the same context. Is there a difference? You bet. “A problem” could be any old problem, but “the problem” has obviously been identified before, as it now warrants a definite article to qualify it. It has evolved from being just a random problem to being the problem. It also indicates that both emailing parties are aware of said problem, as it warranted no further explanation. So much meaning and contextual information all wrapped up in one little “noise” word. And one of the most frequent noise words in the English language at that. A word only a linguist could love.
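For the tinkerers in the audience, here is a minimal sketch of what that looks like from the indexer's side. It is not how any particular hosting platform works: the `NOISE_WORDS` list, the two sample emails and the helper names (`tokenize`, `build_index`, `phrase_search`) are all made up for illustration, and real indexers are far more elaborate. The idea is simply to build the same tiny positional index twice, once keeping every word and once dropping noise words, and then run the same phrase search against both.

```python
import re

# Hypothetical noise-word list for illustration; real tools ship their own lists.
NOISE_WORDS = {"a", "an", "and", "of", "the"}

def tokenize(text, drop_noise):
    """Lowercase the text, split it into words, optionally drop noise words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if drop_noise:
        tokens = [t for t in tokens if t not in NOISE_WORDS]
    return tokens

def build_index(docs, drop_noise):
    """Build a positional index: token -> {doc_id: [positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text, drop_noise)):
            index.setdefault(tok, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase, drop_noise):
    """Return the ids of documents containing the phrase as consecutive tokens."""
    terms = tokenize(phrase, drop_noise)
    if not terms:
        return set()
    hits = set()
    for doc_id, positions in index.get(terms[0], {}).items():
        for start in positions:
            # Every later term must sit at the next position in the same doc.
            if all(start + i in index.get(t, {}).get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

docs = {
    "email_1": "I have noticed a problem",
    "email_2": "I have noticed the problem",
}

full_index = build_index(docs, drop_noise=False)
lean_index = build_index(docs, drop_noise=True)

full_hits = phrase_search(full_index, "the problem", drop_noise=False)
lean_hits = phrase_search(lean_index, "the problem", drop_noise=True)

print(sorted(full_hits))  # ['email_2']
print(sorted(lean_hits))  # ['email_1', 'email_2']
```

With the full index, a search for “the problem” finds only the second email. With noise words dropped, the query quietly collapses to “problem” and both emails come back, so the index can no longer tell the two statements apart.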

But seriously, if you’re still using a document review search tool for any sort of document investigation, stop reading this and get in touch with me right now. I can tell you definitively that document investigation is NOT the same as document review. Different processes, different objectives.

Well, this was a long one. What can I say? I’m a linguist. I love the words. If you’ve made it this far, you’re kindred at best, and at the very least, you’re a rebel in this virtual theatre of short attention spans.

Next week I’m going to do what everybody else is doing as we head into a new year and talk about the future of technology and technical expertise in eDiscovery. Like 99.9% of my blogging compatriots, I have some predictions. And I’m gonna lay them out for you! Before then, I’ll get to work on this year’s resolution: Brevity.

Happy New Year, folks.
