eDiscovery and Linguistics: Strange bedfellows?

If there’s one thing that a dozen years in eDiscovery has shown me, it’s that there is a range of expertise that bolsters this industry. There are the usual suspects: IT expertise, digital forensic science expertise, legal tech expertise, legal knowledge of the Discovery process under FRCP and the amendments thereto, etc. All of these contribute to the eDiscovery process, and all of them are necessary ingredients in a comprehensive eDiscovery strategy.

Something that is not often talked about, however, is this: If you are tasked with knowledge discovery, and the data set you are working with is a collection of some sort of natural language, and in the case of eDiscovery, unstructured text-based natural language, you need to have language/linguistic expertise. And I am not saying that you need to have perfect command of your native language. I’m not saying that you have to be a grammar maven or a clever orator or rhetorician. I am saying that you need expertise about linguistic processes at work when people communicate ideas, sentiments, and information through Language.

Linguists and language experts have training in areas such as lexical semantics (word meaning and word relationships), pragmatics (how context informs how we impart and interpret meaning), language variation and change (all of the different ways we can express one idea and how this evolves over time), and all of the other linguistic and extra-linguistic variables that characterize human communication, in both written and spoken language. If you don’t consult this particular area of expertise, especially when you are engaged in finding evidence in huge collections of ESI, which by its very nature is linguistic evidence, then you are overlooking a valuable resource in eDiscovery in particular, and knowledge discovery in general.

I liken it to this: If your car won’t start and is in obvious need of repair, you don’t call the engineers who designed the engine. You don’t take it to the guy next door who has extensive knowledge of antique cars. You don’t take it to a body shop. And you certainly don’t assume that because you can change your own oil and spark plugs, you can diagnose the problem and have the tools and expertise to fix the problem. You take it to a mechanic. What a mechanic is to the inner workings of your automobile, so is a linguist to the inner workings of human communication and Language.

This assertion may not be received well by people in this profession, but it’s a fact: We use language to communicate. We use language to express ideas and opinions. If you’re trying to discover critical information buried in an Everest of computer-mediated communication, the medium is, in fact, the message and the message is linguistic in nature. You should be leveraging this area of expertise in any and all eDiscovery-related tasks.

Now that I’ve made this assertion, I’ll go a step further and say it’s not just any linguist you need, but an empirical linguist who works in an industry setting and one who works with language data. Particularly, one who works in the legal industry and understands that the goal of employing linguistics in this framework isn’t to develop a model of Language, or a model of human communication as it were. Rather, the goal is to use knowledge of Language, linguistic processes and models of communication to extract ideas, information and patterns imparted in the language people use to communicate them.

I’ll put a quick example out there, and one that is central in my practice of large-scale forensic text investigation in eDiscovery. Redflag language is language that *may* lead to some interesting investigative trajectories in the context of a larger legal narrative. It’s language that represents a little warning signal, or something that may warrant a further look. I always evaluate entire collections to see whether redflag language occurs more often than linguistic norms would predict. Or less often. I investigate how redflag language clusters around certain dates or subject matter. I generate reports of content-of-interest that is statistically correlated with redflag language in a collection because I want to see what other language regularly co-occurs with it. Evaluating redflag language in a collection is a great way to see where the trouble spots in a business or corporation lie.
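To make this concrete, here is a minimal Python sketch of two of the checks described above: how the share of messages containing redflag language varies by month, and which other words show up alongside it. The phrase list, the toy collection and the function names are all hypothetical, and the co-occurrence figures are raw counts rather than a statistical test, so treat it as an illustration under those assumptions and not as a real methodology.

```python
from collections import Counter
import re

# Hypothetical redflag phrases, chosen only for illustration
REDFLAGS = ["warning signal", "trouble spot", "bad news", "serious issue"]

# Hypothetical toy collection of (ISO date, message text) pairs
DOCS = [
    ("2010-01-14", "Quarterly numbers attached. Nothing unusual."),
    ("2010-02-03", "I have some bad news about the audit."),
    ("2010-02-09", "This is a serious issue with the audit figures."),
    ("2010-02-20", "Lunch on Friday?"),
]

def has_redflag(text):
    """True if the text contains any phrase from the redflag list."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in REDFLAGS)

def monthly_rates(docs):
    """Share of messages per month (YYYY-MM) that contain redflag language."""
    totals, hits = Counter(), Counter()
    for date, text in docs:
        month = date[:7]
        totals[month] += 1
        if has_redflag(text):
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in totals}

def cooccurring_words(docs):
    """Count, per word, how many redflag messages also contain that word."""
    counts = Counter()
    for _, text in docs:
        if has_redflag(text):
            lowered = text.lower()
            # Drop the redflag phrases themselves so only their neighbors remain
            for phrase in REDFLAGS:
                lowered = lowered.replace(phrase, " ")
            counts.update(set(re.findall(r"[a-z]+", lowered)))
    return counts

if __name__ == "__main__":
    print(monthly_rates(DOCS))
    print(cooccurring_words(DOCS).most_common(5))
```

In a real collection the monthly rates would be compared against a baseline from comparable data, and the co-occurrence counts would be replaced by an association measure that accounts for how common each word is overall.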

See there? Warning signal, trouble spots. I used two great examples of redflag language to define redflag language. These phrases impart the concept of “problem.” And the concept of a “problem” can be expressed in a huge variety (I’m talking thousands) of ways. You can have a serious issue or are dealing with a serious matter. Or maybe you face a challenge. You’ve bumped up against an obstacle. You’ve encountered a tricky situation. Or you sense something is amiss and you email your colleague and ask “what is going on here.” They respond that they have some bad news.
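The gap between a single keyword and the many ways of expressing one concept can be sketched in a few lines of Python. The pattern list below is a deliberately tiny, hypothetical lexicon built from the phrases in the paragraph above, and the sample messages are invented. A real lexicon would be far larger and derived from data.

```python
import re

# Hypothetical, deliberately tiny lexicon of ways to express "problem"
PROBLEM_PATTERNS = [
    r"\bproblems?\b",
    r"\bserious (issue|matter)\b",
    r"\bchallenges?\b",
    r"\bobstacles?\b",
    r"\btricky situation\b",
    r"\bamiss\b",
    r"\bwhat is going on\b",
    r"\bbad news\b",
]
PROBLEM_RE = re.compile("|".join(PROBLEM_PATTERNS), re.IGNORECASE)

# Invented sample messages
MESSAGES = [
    "We have a problem with the invoice.",
    "I sense something is amiss. What is going on here?",
    "We've bumped up against an obstacle.",
    "See you at the offsite.",
]

def literal_hits(messages):
    """Messages found by searching only for the literal word 'problem'."""
    return [m for m in messages if "problem" in m.lower()]

def lexicon_hits(messages):
    """Messages found by matching any pattern in the small lexicon."""
    return [m for m in messages if PROBLEM_RE.search(m)]

if __name__ == "__main__":
    print(len(literal_hits(MESSAGES)), "literal match(es)")
    print(len(lexicon_hits(MESSAGES)), "lexicon match(es)")
```

Even this toy lexicon will over-match in practice ("challenge" is not always a problem), which is one reason a term list needs to be validated against real data rather than trusted on intuition.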

If you’re trying to find documentary evidence to support a legal narrative, you want to incorporate redflag language into your investigation. However, you shouldn’t just rely on your own intuition about how people talk about problems. Or how they express negative sentiment, or give advice or imply causation, or threaten a co-worker, or sexually harass a co-worker. A linguist will use data-driven, principled methods to uncover all of these areas of investigation in a consistent and comprehensive way. It’s what we do.

Recently I was talking with an individual who is involved in risk mitigation in a Fortune 500 company. He told me that they do email sweeps and investigations into various areas by relying on keyword searches. I asked him who came up with the keywords and the query algorithms. What methodology did they use to validate their term list(s) and what data did they validate it against? What was their margin of error when conducting these keyword sweeps? I had a lot of questions. He said, “No, no, no. We literally have a list of several dozen words and do searches for each.” He then told me the people on his team (and it’s a vast team, to be sure) came up with the terms based on internet research. I was stunned. I could not believe in this day and age, with all of the unbelievable expertise out there, a company of this size and stature could not do better than this.
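For anyone wondering what validating a term list might look like, here is one minimal Python sketch. It assumes you have a sample of messages that a reviewer has already labeled as relevant or not, runs the keyword list against that sample, and reports precision and recall with a Wilson score interval as the margin of error. The labeled sample, the keyword list and the function names are hypothetical, and the Wilson interval is simply one common choice for this kind of estimate.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def evaluate(keywords, labeled_docs):
    """Return (successes, trials) for precision and recall of a keyword list."""
    tp = fp = fn = 0
    for text, relevant in labeled_docs:
        lowered = text.lower()
        hit = any(keyword in lowered for keyword in keywords)
        if hit and relevant:
            tp += 1
        elif hit:
            fp += 1
        elif relevant:
            fn += 1
    return {"precision": (tp, tp + fp), "recall": (tp, tp + fn)}

# Hypothetical hand-labeled sample: (message text, reviewer marked it relevant)
LABELED = [
    ("Fraud awareness training is on Tuesday.", False),
    ("Move it off the books before the auditors arrive.", True),
    ("Can we backdate the invoice?", True),
    ("This looks like fraud to me.", True),
    ("Lunch?", False),
]

if __name__ == "__main__":
    for name, (k, n) in evaluate(["fraud"], LABELED).items():
        low, high = wilson_interval(k, n)
        print(f"{name}: {k}/{n}, interval {low:.2f} to {high:.2f}")
```

With a sample this small the intervals are very wide, which is the point: without a labeled sample of reasonable size, there is no way to say how much a keyword sweep is missing.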

For the record, if you’re trying to figure out if somebody is sending sexually explicit and harassing emails to a co-worker, simply searching for “sex” does not do the trick. Likewise, if you’re trying to uncover evidence of fraud in a huge collection of ESI, searching for “fraud” isn’t going to turn up anything useful. But I can tell you from experience, people engaged in harassing other people through computer-mediated communication, as well as people engaged in defrauding another person or an entire entity, these folks use specific language and patterns of linguistic behavior that leave an electronic trail of their misdeeds. You can uncover these trails. Do you want to know how? Then you should seek out linguistic expertise.
