eDiscovery and Linguistics: Strange bedfellows?

If there’s one thing that a dozen years in eDiscovery has shown me, it’s that there is a range of expertise that bolsters this industry. There are the usual suspects: IT expertise, digital forensic science expertise, legal tech expertise, legal knowledge of the Discovery process under FRCP and the amendments therein, etc. All of which contribute to the eDiscovery process, all of which are necessary ingredients in a comprehensive eDiscovery strategy.

Something that is not often talked about, however, is this: If you are tasked with knowledge discovery, and the data set you are working with is a collection of some sort of natural language, and in the case of eDiscovery, unstructured text-based natural language, you need to have language/linguistic expertise. And I am not saying that you need to have perfect command of your native language. I’m not saying that you have to be a grammar maven or a clever orator or rhetorician. I am saying that you need expertise about linguistic processes at work when people communicate ideas, sentiments, and information through Language.

Linguists and language experts have training in areas such as lexical semantics (word meaning and word relationships), pragmatics (how context informs how we impart and interpret meaning), language variation and change (all of the different ways we can express one idea and how this evolves over time), and all of the other linguistic and extra-linguistic variables that characterize human communication, in both written and spoken language. If you don’t consult this particular area of expertise, especially when you are engaged in finding evidence in huge collections of ESI, which by its very nature is linguistic evidence, then you are overlooking a valuable resource in eDiscovery in particular, and knowledge discovery in general.

I liken it to this: If your car won’t start and is in obvious need of repair, you don’t call the engineers who designed the engine. You don’t take it to the guy next door who has extensive knowledge of antique cars. You don’t take it to a body shop. And you certainly don’t assume that because you can change your own oil and spark plugs, you can diagnose the problem and have the tools and expertise to fix the problem. You take it to a mechanic. What a mechanic is to the inner workings of your automobile, so is a linguist to the inner workings of human communication and Language.

This assertion may not be received well by people in this profession, but it’s a fact: We use language to communicate. We use language to express ideas and opinions. If you’re trying to discover critical information buried in an Everest of computer-mediated communication, the medium is, in fact, the message and the message is linguistic in nature. You should be leveraging this area of expertise in any and all eDiscovery-related tasks.

Now that I’ve made this assertion, I’ll go a step further and say it’s not just any linguist you need, but an empirical linguist who works in an industry setting and one who works with language data. Particularly, one who works in the legal industry and understands that the goal of employing linguistics in this framework isn’t to develop a model of Language, or a model of human communication as it were. Rather, the goal is to use knowledge of Language, linguistic processes and models of communication to extract ideas, information and patterns imparted in the language people use to communicate them.

I’ll put a quick example out there, and one that is central in my practice of large-scale forensic text investigation in eDiscovery. Redflag language is language that *may* lead to some interesting investigative trajectories in the context of a larger legal narrative. It’s language that represents a little warning signal, or something that may warrant a further look. I always evaluate entire collections to see whether redflag language occurs at a higher rate than linguistic norms would predict. Or a lower one. I investigate how redflag language clusters around certain dates or subject matter. I generate reports of content-of-interest that is statistically correlated with redflag language in a collection because I want to see what other language regularly co-occurs with it. Evaluating redflag language in a collection is a great way to see where the trouble spots in a business or corporation lie.

See there? Warning signal, trouble spots. I used two great examples of redflag language to define redflag language. These phrases impart the concept of “problem.” And the concept of a “problem” can be expressed in a huge variety (I’m talking thousands) of ways. You can have a serious issue or are dealing with a serious matter. Or maybe you face a challenge. You’ve bumped up against an obstacle. You’ve encountered a tricky situation. Or you sense something is amiss and you email your colleague and ask “what is going on here.” They respond that they have some bad news.
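To make the idea concrete, here is a minimal sketch of how a redflag lexicon might be matched against a collection and tallied by date. It is an illustration only, not a description of any production workflow: the phrase list is a hypothetical starter set lifted from the examples in this post (a real lexicon would run to thousands of entries), and the names `REDFLAG_PHRASES`, `redflag_hits`, and `redflag_rate_by_date` are made up for this example.

```python
import re
from collections import Counter

# Hypothetical starter lexicon, taken from the phrases used in this post.
# A real lexicon would be far larger and built from data, not intuition.
REDFLAG_PHRASES = [
    "warning signal", "trouble spot", "serious issue", "serious matter",
    "challenge", "obstacle", "tricky situation", "amiss",
    "what is going on", "bad news",
]

# One alternation pattern. \b anchors matches at word boundaries, and the
# optional "s" catches simple plurals such as "trouble spots".
REDFLAG_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in REDFLAG_PHRASES) + r")s?\b",
    re.IGNORECASE,
)


def redflag_hits(text):
    """Return every redflag match in the text, lowercased, in order."""
    return [m.group(0).lower() for m in REDFLAG_RE.finditer(text)]


def redflag_rate_by_date(docs):
    """docs: iterable of (date_string, text) pairs.

    Returns {date: redflag hits per 1,000 words} so that dates with
    unusually dense redflag language stand out.
    """
    hits, words = Counter(), Counter()
    for date, text in docs:
        hits[date] += len(redflag_hits(text))
        words[date] += len(text.split())
    return {d: 1000 * hits[d] / words[d] for d in words if words[d]}
```

Normalizing by word count rather than reporting raw hit counts is a deliberate choice in this sketch: a day with ten times the email volume will have more hits by chance alone, and it is the rate that signals a cluster worth a closer look.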

If you’re trying to find documentary evidence to support a legal narrative, you want to incorporate redflag language into your investigation. However, you shouldn’t just rely on your own intuition about how people talk about problems. Or how they express negative sentiment, or give advice or imply causation, or threaten a co-worker, or sexually harass a co-worker. A linguist will use data-driven, principled methods to uncover all of these areas of investigation in a consistent and comprehensive way. It’s what we do.

Recently I was talking with an individual who is involved in risk mitigation in a Fortune 500 company. He told me that they do email sweeps and investigations into various areas by relying on keyword searches. I asked him who came up with the keywords and the query algorithms. What methodology did they use to validate their term list(s) and what data did they validate it against? What was their margin of error when conducting these keyword sweeps? I had a lot of questions. He said, “No, no, no. We literally have a list of several dozen words and do searches for each.” He then told me the people on his team (and it’s a vast team, to be sure) came up with the terms based on internet research. I was stunned. I could not believe in this day and age, with all of the unbelievable expertise out there, a company of this size and stature could not do better than this.
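For readers wondering what validating a term list could even look like, here is a minimal sketch. It assumes you have a sample of documents that a human reviewer has already judged relevant or not, and it measures how a keyword list performs against that sample. The function names and the substring-matching approach are assumptions made for illustration; the margin of error uses the standard normal approximation for a proportion, which is only one of several ways to estimate it.

```python
import math


def keyword_hit(text, keywords):
    """True if any keyword appears in the text (case-insensitive substring)."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)


def validate_terms(labeled_docs, keywords):
    """labeled_docs: list of (text, is_relevant) pairs judged by a human.

    Returns (precision, recall) of the keyword list on that sample.
    """
    tp = fp = fn = 0
    for text, is_relevant in labeled_docs:
        hit = keyword_hit(text, keywords)
        if hit and is_relevant:
            tp += 1          # keyword found a relevant document
        elif hit:
            fp += 1          # keyword fired on an irrelevant document
        elif is_relevant:
            fn += 1          # relevant document the keywords missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error (95% by default) for a
    proportion p measured on n sampled documents."""
    return z * math.sqrt(p * (1 - p) / n) if n else float("nan")
```

Run against even a small labeled sample, a single literal keyword tends to miss relevant documents that never use the word and to fire on irrelevant ones that do. Putting numbers on both kinds of error is the point of the exercise.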

For the record, if you’re trying to figure out if somebody is sending sexually explicit and harassing emails to a co-worker, simply searching for “sex” does not do the trick. Likewise, if you’re trying to uncover evidence of fraud in a huge collection of ESI, searching for “fraud” isn’t going to turn up anything useful. But I can tell you from experience, people engaged in harassing other people through computer-mediated communication, as well as people engaged in defrauding another person or an entire entity, these folks use specific language and patterns of linguistic behavior that leave an electronic trail of their misdeeds. You can uncover these trails. Do you want to know how? Then you should seek out linguistic expertise.

Document Review versus Document Investigation: Apples to Oranges.

In my last post I asserted that document review and large-scale, discovery-driven text investigation are not the same thing. I said that comparing the two is like equating combing the beach with a metal detector with an archaeological dig. Now I want to flesh out this analogy a bit to underscore the sentiment behind it.

In the first endeavor, you show up at the beach with your tool, maybe having mapped out a long stretch for exploring, or maybe breaking up the search into smaller, defined areas, or perhaps just planning on walking as long and as far as you can in the time that you have. Then you start making passes and sweeps across the sand with your detector, hoping you find something interesting. You may find a coin or a ring, but you never know what sort of treasures you’ll dig up and put in your pocket. If you’re lucky, you’ll stumble across something valuable. If not, the most you can say for your day is you got some exercise.

In the archaeology example, you use your expertise and experience to carefully pinpoint the location of the dig. You understand the nature of the terrain, which in turn informs the types of your tools and your methods of excavation. You have an expectation of what you’re going to uncover. You know where to look for certain artifacts. You have a point of reference for what you’ll find because you know what type of site you’re excavating before you ever disturb the soil. For example, if you’re excavating a dwelling structure, you immediately recognize a pottery shard as such and you set out to uncover the rest of the pieces in the immediate area in order to reconstruct the entire artifact. You proceed in a principled manner, recording your findings and placing them in a larger context of discovery. Finally, you assess the artifacts you’ve uncovered in a larger context, one of an entire culture or a historical point in time.

If you’re tasked with finding evidence in a large produced collection, no day should go by where you don’t uncover valuable information that supports your case narrative in a meaningful way. Document review isn’t an exercise in reading, just as combing the beach in search of treasures is not an exercise in, well, getting exercise.

Finding documentary evidence in a produced collection of ESI should be a dynamic, flexible endeavor that represents the intersection between various tools and methods and the right kind of expertise. Discovering case-winning information should not simply be a linear process by which a memo or an email or a sales slide is read, categorized by checking a box, and moved off of one pile onto another. Categorizing documents as “hot” or “super hot” is not the same as deriving facts and intel by way of meaningful, data-driven forensic investigation.

I have seen it happen all too often that huge review endeavors begin with a certain set of expectations and objectives only to uncover information months down the line that changes the course of everything, rendering efforts up to that point counterproductive. Large, resource-intensive review efforts may or may not be what is needed to uncover winning documentary evidence, but regardless, where the review team is the army, the forensic text investigators are the scouts. We ride out ahead of the army and find critical intel and facts that inform overall strategy in the most productive manner possible. We uncover information and find investigative leads quickly, which can transform the very nature of your case. We find the story in the documents that underscores your legal narrative.

Look, document review and large scale analytics (predictive modeling, etc.) may be a valuable part of your eDiscovery strategy, but if you want to be ahead of the curve, as well as save time and money, you should recognize that you have options. You can hire investigative experts who will tackle your collection to find out: What did they know (and who *they* are), when did they know it, and what did they do about it. And we will have a variety of tools at our disposal, as we will use any and all applications or processes (often making our own tools) that we need in order to extract critical, case-building information.

Knowing the difference between reviewing and investigating, and including both in your eDiscovery strategy, is going to ensure that at the end of the day, you have the most valuable documentary evidence you need, uncovered in the most efficient way, and in the quickest turnaround possible.

Notes from the eDiscovery trenches: Every collection tells a story.

When you’re tasked with uncovering critical documentary evidence in a produced collection of electronically stored information, or ESI, you’re doing so to support a legal narrative that is the centerpiece of a case. In an eDiscovery setting, everything you do with respect to collecting, managing, and assessing ESI should be directed at finding the evidence you need to inform your legal narrative in a meaningful way. And to be certain, there is a story to be told in your collection that correlates to the theme(s) of your case. There always is.

I’ve been doing large scale forensic investigation in produced collections of computer-mediated communications and other business related electronica for over a decade. I’ve worked on some of the biggest class action civil suits in that time frame, related to everything from shareholder fraud to product liability, and if you know how and where to focus your investigation, you can find case-building evidence. Always.

When I first started investigating the communications and documents of Fortune 100 companies, I was amazed that smart and informed business people would go on the record and say the most brazen things. And not just rank and file employees of a large corporation, but higher ups, upper management, CEOs, people that one would think should know to exercise caution. The fact remains, there is a trove of evidence in a company’s electronic data. The digital artifacts of doing business in a computer-mediated world are full of critical information.

People talk, in writing, a lot. You can hang your company communication policy on the wall in front of them, but it won’t matter. Millennials in particular have grown up online. They have grown up communicating with people in every time zone, in every part of the world, about, well, everything. They write it all down. All of their opinions, feelings, all of their business dealings, all of their personal business. It all goes into a computer-mediated communication of some kind intended for an audience of some sort. Let’s face it, nowadays writing is really text-based speech and not simply a formal, graphical linguistic representation of a Language. And this communication genre creates the linguistic timeline of our lives.

Millennials entering the work force seemed to usher in a change in the very nature of computer-mediated communication and documentation in a business setting. For example, I have witnessed an evolution in company email communication patterns over the years: In the late 90s, just as email started to really become the method of casual business communication, these communications resembled written letters. You would have a formal greeting, an introduction paragraph laying out the topic(s) at hand and consecutive paragraphs dealing with said topic(s), and a formal closing. You could expect standard punctuation conventions, as well as perfect spelling and grammar.

Now? Even formal business emails resemble a tweet: No greeting, topics laid out in a bulleted list containing around 100 characters, mixed content covering a range of both personal and business related topics, all rolled up into one communication, usually to several people at a time. Maybe capitalization and some punctuation, or maybe not. Actually, you’re more likely to see capitalization to indicate emphasis than to indicate a sentence boundary. Contextualizing IM-ese abounds, replete with lols and smhs (or LOL if the participants are really excited). All of these “non standard” writing conventions proliferate in contemporary business communications at every level.

Forget the saying “never put anything in an email that you don’t want your mother to hear in a trial” (a 21st century version of a famous Sydney Biddle Barrows quote). Millennials put everything in writing somewhere, whether it’s on social media, their blog, in a comment section on a news site, or in a company-wide email. Their moms have already read it all. They talk with text and you can’t censor or “policy-away” these communication habits, even in a professional setting. Especially in a professional setting.

At some point, contemporary corporate information governance will have to accept these facts and deal with them effectively, but that is a post for another day. Suffice it to say that in today’s eDiscovery ventures, you are going to have to deal with not only a large quantity of data, but data that varies mightily in quality as well.

During my tenure in the legal profession working in eDiscovery, I’ve learned 4 key pieces of wisdom about doing large scale text-based investigation for civil litigation support. I’ll be fleshing each point out over the next week or so, but here they are in a nutshell:

1) Document review and large-scale, discovery-driven text investigation are not the same thing. Comparing the two is like equating combing the beach using a metal detector with an archaeological dig.

2) Language/linguistic expertise. You need it. This may not be received well by people in my profession, but it’s a fact. We use language to communicate. We use language to express ideas and opinions. Knowledge discovery is, at its very core, linguistic in nature.

3) Technology and technical expertise are paramount. BUT all tech is not created equal and just because one tool/process/application works well in one particular context does not mean it will work in another. Specifically, you have to understand the limitations and margins of error with every tech solution you consider.

4) The future of eDiscovery is going to require innovation, ultimately resulting in a true marriage of human expertise and technical expertise. Oh, do I have a lot to say about this particular point…

Stay tuned for a discussion on each of these things individually. In the meantime, feel free to muse about any and all of these points in the comment section.

Illocution Inc.

We are expert language investigators. We work with law firms to discover critical information quickly. We combine expertise, experience, and technology to uncover the “who, what, when, where and why” in large, electronic document collections.
