Forensic Linguistic Investigation as a Risk Mitigation Strategy

Tap, tap, tap… Is this thing still on? Yes? Excellent. Blogs can be like laundry. When you get busy with life and work and things, laundry and blogs are the first sacrificial lambs on the altar of Time. Laundry piles up and blogs grow cobwebs. Now that the semester is coming to a close, I am wearing a clean pair of jeans, and I am dusting off this corner of the internet. Here we go.

I have all sorts of observations that have accumulated over the past few months that have to do with the intersection of the law and linguistics, but I think the first one should be a companion piece to the last entry (better late than never?). However, I want to reframe the discussion of “profits over people” before moving on to “taking the temperature of a company.”

To be clear, both of these themes are about identifying and mitigating risk. If there is evidence that can be construed as your company valuing profit over human safety, your company is at risk. If your company turns a blind eye to hostile work environments, again, your company is at risk.

Let’s keep going: If you have a rogue employee acting against your company’s best interests, that individual is an insider threat to the organization and exposing you to risk. If you have an entire department perpetrating fraud in order to cover up a range of misdeeds, or mistakes even, you have a substantial point of vulnerability that exposes you to risk.

Corporations have all sorts of ways they manage risk identification, analysis, and mitigation, but the least popular and perhaps most complex method is regularly auditing their business communications, the digital artifacts of doing business in today’s computer-mediated environment. I suspect inside counsel and the teams that oversee risk-mitigation strategies treat investigating and auditing their company’s text-based artifacts for things like “insider threats” (a most definite risk) or “fraud” (risk personified) as solely a technology/IT issue. The legal profession, and “risk” stakeholders in general, seem to believe this to be true.

I also believe that auditing a company’s business communications is easier theorized about than accomplished. In fact, everything having to do with managing information today is easier said than done. You can’t go three clicks on the internet without reading a theoretical piece about information governance in the era of Big Data; it is a hot topic of discussion. Again, this discussion is almost always framed as a technology issue, an IT issue. And to be sure, there is a tech/IT component to it. A big one. BUT, it is also an unstructured data issue, as unstructured data (which includes text-based data) makes up 80% of Big Data. Unfortunately, unstructured text is the most complex data type going. It is hard to deal with. It is hard to qualify and quantify. It is hard to investigate and analyze. It is especially hard to govern.

Here’s why: Unstructured text-based data is natural language data. And today, it’s email, IMs, social media, and computer-mediated communication in general, that are filling up a company’s virtual landscape, making information governance a beast of a thing. And these types of communication? They don’t necessarily represent standard, systematic written language. These communication genres are more like written speech than anything else. We write with our keyboards like we talk on the phone, not like we’re writing a report or a thesis (unless of course we’re writing a report or thesis). Here’s something: The vast majority of a Fortune 500 company’s text-based artifacts are email.

No, text-based natural language data isn’t simply a tech/IT issue. Not if you want to extract useful information from the text, or if you want to archive it in some way that gives you visibility into the ideas/themes expressed in the text itself. And certainly not if you want to audit your text-based communications to identify risk. In order to do this effectively, you have to have linguistic expertise, and not just any type of linguistic expertise. A phonologist isn’t going to come to the rescue here. But generally speaking, linguistic and language expertise are the essential components regularly missing from risk mitigation strategies, as well as information governance strategies.

This is not simply a classification issue, either. Effective risk-mitigation strategies, as well as smart information governance protocols, are not just about classifying and categorizing information. How would you classify an email that said “If anybody finds out about what we are doing, we are going to end up on the news”? (That’s an actual email I’ve come across, fyi.) There is no keyword or “bag of words,” clustering, or predictive coding schema that is going to identify and categorize that email as “risk.” BUT, there are forensic text investigation strategies that will discover communications like this, and discovery is the first step.

There is a forensic linguistic component to insider threat detection, as well as fraud detection. Employees who are becoming increasingly disgruntled demonstrate it through their language use. Just as employees who threaten (overtly and using “veiled” threats) do so through their language use. People trying to conceal or downplay use linguistic strategies to do so. Companies that value money over human safety leave a trail of linguistic evidence that transforms into a “profits over people” narrative. All of this is extant in email communications, if you have the expertise to identify it and measure it, that is.

Here is my point: These sorts of forensic linguistic investigations should be framed as risk identification and mitigation strategies, pure and simple. This is the framework that we should consult when talking about “profits over people” and “taking a company’s temperature.” What we are really saying here is that we are investigating and analyzing a company’s text-based artifacts in order to identify risk. You can’t mitigate risk if you can’t detect it in the first place. And you can’t effectively detect it if you aren’t employing the right expertise and methodologies to do so. Then, and only then, can a company take steps to reduce risk, or eliminate it altogether.

The right forensic linguistic expertise plus the right tech/IT infrastructure can make all the difference in effective risk mitigation strategies. It might just be the critical factor between theorizing and doing.


Profits Over People.

This is our name for a classic case theme; for a compelling story that provides motive and shows intent. It’s human nature to want to know why something happens, why a company took, or didn’t take, a particular action. The jury, the press, the public want a motive; they want to know why. It’s a classic tale that’s been told in many courtrooms.

The story itself is pretty straightforward: one party paints the other as so greedy, so driven by the bottom line as to place its own profit over the well-being of people. But while the theme itself is straightforward, finding the supporting documentary evidence is a complex endeavor.

A profits over people language investigation is one of the most sophisticated and complex searches we’ve undertaken. It’s the epitome of the marriage between linguistic and legal subject matter expertise. In this blog, we provide an overview of our approach.

To recap, a corporation can appear to be putting the almighty dollar before human safety and dignity in a number of ways. Again, we call this profits over people. This case theme involves a corporation’s indifference towards its customers, clients, vendors, or the public in a quest to make money.

In a variation of the theme, a company can also exhibit disregard for its own workers, perpetuating a corporate culture based on greed, hostile working environments, unfair labor practices, institutionalized corporate caste systems, and other systemic inequality within its own borders.

All of these things are fodder for litigation. If profits over people is a case theme central to your litigation, you’ll be obligated to uncover evidence that supports, or refutes, this claim.  This theme is one that entails a lot of different moving parts, different in every case, or even different in every deposition. It can be nebulous with respect to how it plays out in ESI. You can’t simply type “profits” or “people” or “profits w/5 people” and expect anything interesting to come of it.  One thing is certain: to find this evidence, you need to think like a detective.

People use language to express ideas and information. People use their words (in addition to their deeds, which they also write about) to accomplish both good things and bad things. So, when we investigate profits over people, what we are really doing is investigating people, as either individuals or as a group or entity.

So the first thing to do when investigating language usage for evidence in this context, is acknowledge that you’re not *just* sifting through text-based language. You’re scrutinizing linguistic evidence to prove or disprove agency.  That said, let’s look at how profits over people plays out in a Big Pharma products liability situation.

Before diving in, let’s be clear here: companies have to make money. There’s nothing wrong with making money. BUT, when the balance between making money and corporate responsibility shifts into “all bottom line, all the time” territory, then a company is vulnerable. It has moved into a dangerous neighborhood where the neighbors (lawyers, shareholders, employees, regulatory agencies, the public, politicians) aren’t exactly going to roll out the welcome wagon.

Let’s consider what it looks like when a Big Pharma company is potentially putting their bottom line over consumers. As I mentioned upstream, the first step in a profits over people investigation is to shift your focus from the “what” (the bottom line) to the “who.” You’re just as interested in who is talking about the bottom line as in the bottom line itself. For example, accountants and finance departments talk about money. Shareholders want to hear about money. This is normal. No red flags raised here. BUT, what about the scientists? The ones who work on clinical trials, the ones who liaise with regulatory agencies? Are they talking about money? That is potentially interesting content.

Secondly, if scientists are talking about money, what’s the context? Are they talking about their budgets? That’s probably normal. Are they talking about not conducting a particular study because it won’t have an impact on sales, or because it will potentially have a negative impact on sales? Are they talking solely about the commercial benefit of some action or inaction? That’s worth looking into.

Third, how do these conversations align with the broader picture? Once you have the first two pieces of the puzzle, you have to broaden the scope of your investigation. Are the money-related conversations these scientists are having an anomaly? Or are these conversations typical and possibly indicative of some bigger corporate trend? How are folks responding to these conversations? Was the decision to halt the study or test an abrupt one, with no opportunity for discussion? Or is there evidence that it was an informed decision, put to committee and decided on after a lively debate? The journey is just as important as the destination.
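To make the first two steps concrete, here is a minimal Python sketch of a first-pass triage: flagging money talk that comes from custodians whose role would not ordinarily warrant it. The role map, the money lexicon, and the message format are invented placeholders for illustration, not a real workflow or real data.

```python
# Hypothetical first-pass triage: flag money-related language coming from
# scientific custodians. Roles, lexicon, and message structure are
# illustrative placeholders only.

MONEY_TERMS = {"sales", "revenue", "commercial", "market share", "bottom line"}
ROLES = {"jdoe": "clinical_scientist", "asmith": "finance", "blee": "regulatory_liaison"}
SCIENTIFIC_ROLES = {"clinical_scientist", "regulatory_liaison"}

def flag_anomalous_money_talk(messages):
    """Return messages in which a scientific custodian uses money terms."""
    flagged = []
    for msg in messages:
        role = ROLES.get(msg["sender"])
        text = msg["body"].lower()
        if role in SCIENTIFIC_ROLES and any(term in text for term in MONEY_TERMS):
            flagged.append(msg)
    return flagged

emails = [
    {"sender": "asmith", "body": "Q3 revenue projections attached."},
    {"sender": "jdoe", "body": "Dropping the follow-up study; no impact on sales."},
    {"sender": "blee", "body": "FDA meeting minutes attached."},
]
hits = flag_anomalous_money_talk(emails)
print([m["sender"] for m in hits])  # finance talking revenue is normal; a scientist talking sales is flagged
```

A keyword hit from finance is left alone; the same vocabulary from a clinical scientist becomes an investigative lead, which is exactly the who-before-what shift described above.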

This is just one of several investigative frameworks that could uncover a profits over people narrative: scientists talking about money-related and commercially driven issues, such as not conducting a particular post-marketing study because such a study wouldn’t be commercially relevant to drug X sales and marketing. Or perhaps the scientist’s recommendation was overruled and the study went ahead. The next step is finding evidence of whether this is an isolated incident or par for the course.

When you’re investigating a broad theme like profits over people, remember it is not just what is being talked about; it’s who is doing the talking and what the broader context is. Don’t simply flash one “hot doc” and expect the story to be told. One document is just a part of the narrative. What you’re looking for are investigative leads that demonstrate a bigger picture, a story of either a few people misrepresenting their department or company, or a corporate culture that encourages and even dictates putting money above everything else. Or you could uncover evidence that there was no wrongdoing at all.

Next week we’re going to focus on one of our favorite types of investigation: Taking the temperature of a company. This involves investigating things like the morale of a company’s employees, the rumor mill, the synergy between departments, etc…


ESI Investigation 101: Finding the company big mouth.

I talk a lot about the difference between document review and document investigation, but today I’m going to start the first in a series of posts that give some pointers on investigative techniques that can help those of you who are more inclined to approach a collection of ESI like a detective rather than say, a data entry clerk.

Today’s investigative tip is going to focus on finding the big mouth in a collection. Every collection has at least one, and if you’re lucky, several. Why do you want to find the big mouth? That’s easy: Because they talk. And they talk. And they go on record and express their dismay, and delight, with everything and everyone. And they do so often because they just can’t help themselves. It’s a part of their personality. In forensic linguistics we call this sort of thing authorship profiling.

Before delving into finding the talkers in a collection, I want to back it up a little. I want to give a brief investigative primer to put us in the right framework for the task at hand.

There is more than one reason to investigate a document collection. Sometimes you investigate to get a lay-of-the-land in terms of the existence, as well as the extent, of content-of-interest. Sometimes you investigate the communicative habits of the persons-of-interest in your collection, with an eye toward determining who is talking about what topics and with whom. Sometimes you just go in and find the dirt outright, because really, that’s what it all boils down to a lot of the time.

Sometimes it is extremely important to investigate a collection to determine what isn’t present, as much as what is. It’s not just the absence of particular ideas or information that creates gaps in your collection; the presence or absence of a variety of types of computer-mediated communication also speaks to the quality of your collection. And we all know that the outcome of good investigative efforts hinges on a good data set. So determining the quality of a collection of produced ESI is something best done without delay. For example, do you know how much email versus edocs you should have? Do you know how to create a norm of distribution for communication patterns so you can have visibility into upticks in email quantity? A flurry of email activity surrounding a particular date or custodian or topic can be a great way to generate productive investigative leads…
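By way of illustration, here is one minimal way to operationalize that norm-of-distribution idea, assuming you have already reduced the collection to per-day message counts for a single custodian. The leave-one-out baseline and the two-standard-deviation threshold are arbitrary choices for the sketch, not a forensic standard.

```python
from statistics import mean, stdev

def volume_spikes(daily_counts, threshold=2.0):
    """Flag days whose email volume sits more than `threshold` standard
    deviations above the custodian's baseline (computed from all other days).
    Toy sketch: real work would use a rolling window and robust statistics."""
    spikes = []
    for day, n in daily_counts.items():
        baseline = [c for d, c in daily_counts.items() if d != day]
        mu, sigma = mean(baseline), stdev(baseline)
        if n > mu + threshold * sigma:
            spikes.append(day)
    return spikes

# Hypothetical custodian: steady traffic, then a flurry on one day.
daily = {"2011-03-10": 12, "2011-03-11": 9, "2011-03-12": 11,
         "2011-03-13": 10, "2011-03-14": 55}
print(volume_spikes(daily))  # ['2011-03-14']
```

A flagged day is not evidence of anything by itself; it is a pointer telling you where to read.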

But I digress. Sort of. In fact, determining patterns of communication is a first step in locating that individual who is likely talking about your case themes in abundance. This requires generating an email network analysis, which will elucidate who is sending the most email in your collection, who is receiving the most email in your collection, and which individuals are more likely to co-occur on the receiving end of email communications.
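A bare-bones version of that tally might look like the following sketch, which assumes messages have been reduced to (sender, recipients) pairs. Real network analysis would go on to compute centrality measures and the like, but sent counts, received counts, and recipient co-occurrence are the starting point.

```python
from collections import Counter
from itertools import combinations

def email_network(messages):
    """Tally sent counts, received counts, and recipient co-occurrence
    from (sender, recipients) pairs. Illustrative sketch only."""
    sent, received, cooccur = Counter(), Counter(), Counter()
    for sender, recipients in messages:
        sent[sender] += 1
        for r in recipients:
            received[r] += 1
        # Count every pair of recipients who appear on the same email.
        for pair in combinations(sorted(recipients), 2):
            cooccur[pair] += 1
    return sent, received, cooccur

msgs = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob", "carol", "dave"]),
    ("bob", ["alice"]),
]
sent, received, cooccur = email_network(msgs)
print(sent.most_common(1))     # [('alice', 2)]
print(cooccur.most_common(1))  # [(('bob', 'carol'), 2)]
```

From counts like these you can compute the sent-to-received ratio per custodian, which is the quantity clue the next paragraph starts from.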

So, the first step is looking at who is producing the most email, and the ratio of sent email to received email. Quantity isn’t the only clue you will need to go on here. After all, the quantity of an individual’s email communication is dependent on many things, only one of which may be that they are indeed the company big mouth. However, once you determine quantity, then you can investigate the quality of the communications. Now, I’m going to just do you one solid and give you the things to look out for, the things that generate evidence that you may have yourself a talker on your hands.

  1. Big mouths use a lot of first person singular and plural pronouns. A whole lot. I this and we that, but mostly I this. Big mouths are usually extroverts and extroverts like to be a part of the group, but they also like to talk about themselves, so look for clues in pronominal distribution.
  2. Big mouths use a lot of stative verbs, or verbs that describe how they feel or think: *I believe that…*, *I think that…* A big mouth will go on record (often inadvertently) with their thoughts on a matter because it is just impossible for them not to do so.
  3. Big mouths use a lot of negative and positive sentiment. Not only do they use a lot of stative verb constructions, but they express a lot of positive and negative emotion about everything from their wonderful family vacation, to the worrisome outcomes of a particular clinical trial.
  4. Big mouths talk about a wide-range of topics, and typically exhibit interesting type-token ratios in their email communication. The more types they have in comparison to the number of tokens, the richer the range of vocabulary they use. It’s not just about how much they talk, but about the variation they exhibit in their conversations.
  5. Big mouths often exhibit an informal, personal style of communication that includes attempts at humor, sarcasm, friendly banter, phatic communication, or those bits of communication we use to solidify personal relationships, such as “How are you doing?” “Hope you are well” and the like. Again, big mouths are often extroverts and exhibit communicative habits that are indicative of the personality type.

Essentially, it is not any one of these things that designates a big mouth as such, but rather the interplay between all of these features. With that, you can now go and create a “big mouth” model and use it to home in on the custodians in a collection who are likely to give you the most bang for your investigative buck.
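For illustration only, here is a toy sketch of a few of those features in Python: first-person pronoun rate, stative-verb rate, and type-token ratio. The word lists are invented stand-ins; a real model would need validated lexicons, part-of-speech tagging, and normalization for text length.

```python
# Toy word lists; placeholders, not validated forensic lexicons.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
STATIVE_VERBS = {"think", "believe", "feel", "know", "hope", "want"}

def big_mouth_features(text):
    """Compute per-token rates for a few 'talker' cues. Sketch only."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    n = len(tokens)
    return {
        "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n,
        "stative_rate": sum(t in STATIVE_VERBS for t in tokens) / n,
        # Type-token ratio: unique words (types) over total words (tokens).
        "type_token_ratio": len(set(tokens)) / n,
    }

sample = "I think we should talk. I believe the trial results worry me."
feats = big_mouth_features(sample)
print(feats)
```

No single rate designates a talker; in practice you would score every custodian and look for the outliers on the combination of features.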

For plaintiff attorneys, if you’re really lucky, once you figure out who the company big mouth is, you’ll get to depose them. As my colleague can tell you, if they can’t keep themselves from going on record about anything and everything in company email, then they will be just as likely to be a runaway train in a deposition.

For defense attorneys, if you’re really lucky, you’ll identify who the company big mouth is before your production goes out the door so you can mitigate any risk created by their tidal wave of communications.

For in-house counsel, if you’re really lucky, you’ll identify these individuals before any litigation hold ensues, and perhaps gently reiterate your email policy or communication policy with them.

Next week we’re going to look at how to investigate the theme of “profits over people” in a collection of business communications. In the meantime, I will leave you with this piece of sage advice, from one language investigator to another: Figure out what you need from your collection and then just go get it.


The World Doesn’t Need Another Document Review Platform and Other Hard Truths, by Betsy Barry and Suzanne Smith.

We’ve probably already offended some of you. We don’t mean to hurt anybody’s feelings. Our intent is certainly not to offend, but rather to have a bold, honest conversation about the possibilities that exist for truly effective Information Governance and E-Discovery.

So, if you are interested in reading some hard truths, take a deep breath and plunge right in. We’ll make this brief. Well, we’re not exactly known for brevity, so we’ll focus on making it thought-provoking instead.

Hard truth #1: We don’t need another tool.

If the legal profession is longing for some disruptive technology that will answer all their eDiscovery prayers (and comments coming out of Legal Tech certainly seem to indicate that they are) then let us say that what the legal profession needs isn’t wishful thinking about a tool or a platform, or some other yet-to-be-invented TAR review system. It needs a fundamental shift in thinking, a completely different approach to finding case-winning information in ESI. Or relevant information in ESI. Or potentially privileged information in ESI. Or any kind of evidence in ESI. The problem isn’t that the right technology hasn’t been invented yet. The problem is that the right teams of experts aren’t being assembled to navigate eDiscovery workflow processes yet.

Hard truth #2: Effective eDiscovery and Information Governance is, at its core, a problem of control.

Companies, inside counsel, and outside counsel all need to control ESI *before* it goes out the door. 1) You have to meet your legal obligation regarding relevancy; and, 2) You don’t want to produce any privileged or confidential information. To meet these 2 essential responsibilities you need control. Control means visibility into your information. Visibility means devising reasonable, defensible, data-driven means of identifying the scope of your ESI and the content of your ESI. That right there? That entails investigation. Simply put, you can’t control what you don’t know.

This control issue is a problem for the party receiving a production as well. What is devising an eDiscovery workflow process other than a massive effort in controlling for quantity versus quality? The whole notion of document review is rooted in discipline and control. Otherwise, why would the legal professionals driving this process disregard research confirming inter- and intra-individual variation in tasks and task-switching endeavors involving executive control (such as multi-criteria document review and coding tasks), variation that ultimately renders tasks like reviewing and coding inconsistent at best and, at worst, unreliable? There’s also great research regarding task complexity that concludes that when you give an individual several classification/evaluation variables to consider all at once, they will simplify the decision-making process by eliminating information, thereby constraining choice. In other words: inherent inconsistency and unreliability.

So, a lawyer’s eyes on every document? A team of attorneys using several criteria with which to code and classify each and every document in a collection? This incarnation of the “doctrine of control” isn’t reasonable in today’s Big Data/Big ESI environment. A lawyer’s eyes on the *right* set of documents? The most valuable set of documents that resulted from a targeted investigation into a particular topic or individual? Now that makes sense.

Hard truth #3: Investigation, not review

Instead of throwing money and temps and a dime-a-dozen review platforms at eDiscovery, consider this: You shouldn’t be reviewing ESI. You’re not studying for a test. You shouldn’t be classifying documents. You’re not readying your collection to be archived in the Library of Congress. Decide what you need. That is the question: What do you need from your ESI? Figure that out and then go in and get it. That, my friends, is investigation, not review. And this sort of large-scale forensic investigation endeavor is fast, efficient, and because of those two things, much, much more cost effective. It’s also defensible, both scientifically and legally. A win-win.

Hard truth #4: You need linguistic/language experts.

We know you’re sick of hearing it, but that doesn’t mean it isn’t true. You need people on your team who understand things like orderly variation, relative frequency and distribution, language innovation and change, how context shapes meaning, how words are known by the company they keep, and a host of other linguistic principles that interact and govern how we concretely express ideas. Linguistic principles govern things like how we ask for and give advice–the foundation of potentially privileged communications. Going with the idea of privilege, consider this: There are both linguistic and extra-linguistic components to a privilege “recipe.” You need to find all the potentially privileged documents in your ESI? Then task the right expert, the one who understands and utilizes patterns of linguistic behavior, who knows how to control for extra-linguistic variables, the person who can go in and grab all the potentially privileged documents. It really is that simple. The right expertise trumps the *right* tool. The right expertise trumps sheer human muscle. Every. Time. Somebody with the right expertise and training, and a copy of dtSearch (a great off-the-shelf search tool), can prevail against an entire review team of randomly assembled temp attorneys with the swankiest TAR platform. We’re serious, people.

Let’s put it this way: Why wouldn’t you have a language expert/linguist on your eDiscovery team? Why wouldn’t you consult the group of experts who specialize in natural language production? Who better to consult with respect to ESI, which, let’s face it, is largely linguistic in nature?

So, is the reason there are a million document review platforms and sky-rocketing eDiscovery costs that nobody has figured this out? We don’t need another tool. We need to recognize the value of linguistic/language expertise. We need to incorporate expert training with respect to concrete language usage instead of just thinking up search terms. Searching is an investigative skill, not simply a matter of typing a bunch of words into a field. Likewise, predictive coding hinges on identifying and plugging in the right “seed set” to train the modeling algorithm. How are you finding that seed set of “responsive documents”? Or who is finding them, we should ask. Is it somebody with knowledge of language variation and context-based meaning? We surely hope so, because we can tell you definitively that for every one document in your seed set, there are dozens upon dozens out there in the universe of ESI that impart the exact same information in a completely different way, linguistically speaking. And if you don’t include all of them in your seed set? Then a significant amount of relevant material is going to be left behind. Period. The algorithm can’t predict and account for the range of language variation. It can’t “learn” what it doesn’t “know” outside of the seed set. That is not how predictive modeling works. It’s not magic, after all.
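A toy demonstration of that seed-set point, with invented documents and bag-of-words cosine similarity standing in for a real TAR model: a document that expresses the same idea with entirely different vocabulary shares no features with the seed, so nothing learned from that seed can surface it.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def bag(text):
    return Counter(text.lower().split())

# Invented example documents, for illustration only.
seed = bag("halt the study it will hurt sales")
same_idea_same_words = bag("we should halt the study to protect sales")
same_idea_new_words = bag("kill that trial because revenue suffers")

print(cosine(seed, same_idea_same_words))  # well above zero: shared vocabulary
print(cosine(seed, same_idea_new_words))   # exactly 0.0: same idea, disjoint vocabulary
```

The second document is just as relevant as the first, but a model trained only on the seed’s vocabulary has no feature overlap to find it with; that is the language-variation gap a linguist is there to close.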

We’ll end this intentionally provocative conversation by circling back to how it began: What the legal profession needs to recognize is that the landscape is changing. Document review is the fossil fuel of the legal world. Document investigation is the wind farm on the horizon.


Storytelling, Linguistics & ESI: All the Better to Win Your Case With, My Dear, By Suzanne Smith.

A few weeks ago, I wrote about the powerful and suggestive smoking gun document. At the risk of repeating myself, a smoking gun document is the Holy Grail of document investigation. And you should certainly put yourself in the best position to find them. But like the cup, the smoking gun can be elusive; and perhaps, expert investigators should work with legal teams on more productive areas like identifying material that proves or disproves the elements of a case, that solidly support crucial case themes, or that are the foundation for the story you want to tell the jury. And so, last week’s conclusion is this week’s segue to a topic near and dear to my heart: storytelling.

We all have our favorite stories, whether it’s a folk tale from childhood, a classic novel, or the story of how your grandparents met. A well-told story can leave you sitting in your car, in your driveway with the radio on, just so you can hear the ending. A well-told story can make your day, break your heart, or maybe even change your life. Gifted storytellers make it seem as if words spring spontaneously from their lips; but of course, this is rarely the case. More often, but much less romantically, hours, days, weeks, months of painstaking research, writing, re-writing and editing lie at the heart of a well-told story.

Nowhere is this more true than in litigation. After all the filings, discovery, depositions, motions, settlement conferences…it basically comes down to who can tell the best story. And of course, since the story is told in the courtroom, it must be based on the facts. As a good storyteller, you have to make sure you are in the best position to have the facts you need as the foundation of your story. Obviously, there’s a lot more factual evidence than that which is found in a document production. But it should come as no great shock that I’m going to speak to that great wealth of storytelling fodder, the multi-million page electronic document production.

If you do it properly, or if you are very lucky, you can find the facts that will help you (1) set the stage of your story; (2) describe your characters; and (3) deliver your message via your overall case themes.

Some of the things I like to look for include:


What is the corporate culture like?

How is employee morale?

Does the team work well together?


Find out what your custodians are talking about; what they are expressing negative sentiment about; what advice did they give?

Look beyond produced custodians to third parties; email confidantes; non-human “players” like committees or agencies and other entities.

And of course, find out what the characters knew, when they knew it, and what they did about it:


Did they put profits over people?

Did they lack the resources to do the job well?

Was it a one time mistake?

Did one bad apple spoil the barrel?

The reality is that this type of investigation — one in aid of crafting a great case-winning story — can be really hard to do when you are using a large, disparate team to review the collection. Traditional methods don’t accommodate both the granular and the over-arching views of what is actually happening in the collection. The smartest, hardest-working reviewer is only able to develop this knowledge over the course of the review…over weeks, months or even years. A critical document, a smoking gun document, could easily be overlooked if encountered in the early days of the review. That reviewer can only make connections between the documents she’s read, and some of the patterns, like communication networks or topical trends, can be hard to spot: Without insight into the email communications of the deponent and an unproduced third-party consultant, you might not get the opportunity to ask a critical question during the deposition. You might miss your opportunity to tell a great story.

So instead of relying on traditional document review, we suggest building a team that specializes in targeted investigation. They should understand the substance of the matter. They should understand your goals (which may be as disparate as finding support for an element of the case like “intent”  or understanding the key admission you need from a deponent) and they should understand that the goals may change during the course of discovery. The team should be savvy enough to not be distracted by “juicy” but ultimately needless details. It’s also important that the team knows it is ok to say that the supporting documents that you are hoping for simply do not exist in the collection. Don’t waste time searching for something that isn’t there. Finally, your investigative team should be armed with the proper tools and have the expertise to use those tools.

You know what I’m going to tell you next: You need language/linguistic expertise in order to deal with the complexity of ESI, which is mostly unstructured text, created for the sole purpose of expressing ideas and communicating information, which are the essential building blocks of any great story. In the same way a writer will pluck ideas and facts out of everyday life and use them as the foundation of a compelling story, the language/linguistic expert will pluck the most important facts out of a sea of ESI so that the attorneys can do the same. And to be sure, every collection tells a compelling story — if you have the right expertise and tools to tease it out, that is.

Finally, obviously, but very importantly, you can’t tell these stories if the documentary evidence is never produced, so be sure to craft your discovery requests to ensure the most qualitatively robust collection. Dedicating the time and expertise at the beginning of discovery will pay off when you have the evidence to tell a case-winning story.


The Lure of the Smoking Gun: Use Linguistics not Luck to Find Your Best Evidence, By Suzanne Smith.

“We rushed into the captain’s cabin . . . there he lay with his brains smeared over the chart of the Atlantic . . . while the chaplain stood with a smoking pistol in his hand.”

Used to mean indisputable evidence, the “smoking pistol” first appeared in the Sherlock Holmes story “The Gloria Scott” (1893). Modern parlance now refers to it as a “smoking gun,” and as NY Times columnist William Safire noted, the phrase figured in both the Watergate scandal and the Iraqi nuclear arms controversy.

As a document investigator in litigation, I live and die by the concept of the smoking gun. What trial lawyer doesn’t rub his or her hands with glee when presented with a “smoking gun” document…the next best thing to an on-the-stand admission (maybe even better, hmmm?). It’s that tangible thing, the witness’s own words coming back to haunt her.

Over the years, I’ve trained a number of lawyers and legal professionals on the best way to search in a large document collection. I go through my presentation and then set them to it. Their training task is to locate something “of interest” that relates to their case. (They are the lawyers – they decide what’s key.) They are searching a produced custodial document set of a few thousand emails. There’s some knuckle cracking, friendly wagers are made. They get to it. Can you guess what most folks type into that search box first? It splits about 50/50 into (1) the searcher’s name or (2) curse words.

Of course this is simply human nature. When I sit down to crack open a brand new corpus, I go straight for the negative sentiment and red-flag language as well. But there are better, much more productive words and phrases (language markers) to start with than F*** you. My favorites include: a bad situation, the problem with, will have an issue with.
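To make the idea concrete, here is a minimal, hypothetical sketch of starting a corpus pass with marker phrases rather than curse words. The marker list echoes the favorites above; the emails and the `flag_emails` helper are invented for illustration, not a real tool.

```python
# Hypothetical sketch: scanning a tiny email "corpus" for marker phrases.
# The phrases come from the post; the emails are made up.
MARKERS = ["a bad situation", "the problem with", "will have an issue with"]

emails = [
    "Re: Q3 numbers -- honestly, we will have an issue with the auditors.",
    "Lunch on Friday? The new place has great tacos.",
    "The problem with the trial data is the missing endpoints.",
]

def flag_emails(docs, markers):
    """Return (document index, matched marker) pairs."""
    hits = []
    for i, doc in enumerate(docs):
        text = doc.lower()
        for marker in markers:
            if marker in text:
                hits.append((i, marker))
    return hits

print(flag_emails(emails, MARKERS))
# -> [(0, 'will have an issue with'), (2, 'the problem with')]
```

Even this toy version surfaces the two substantive emails and skips the taco invitation, which is the whole point of starting from language markers instead of expletives.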

Search and retrieval requires a level of language expertise and subject matter familiarity that can prove elusive for those who do not routinely work with huge natural language datasets, or in the particular genre of business communication, or its subgenres. In traditional document review settings, finding smoking gun documents often rests on chance and the ability to dedicate resources to the task. However, when dealing with the huge collections of text typical of today’s productions, collections that are growing exponentially every day, leaving things to chance and simply assigning a team of contract lawyers to read millions of pages of text are neither valid nor reliable approaches.

Let’s consider how a linguist, or in this case an applied linguist specializing in legal document collections and a lawyer trained in linguistics, could apply their expertise to finding smoking gun documents. (Yes, I am referring to my partner Dr. Barry and yours truly.) Over the years, we have found dozens upon dozens of smoking gun documents. Over time we learned that it isn’t just the content of the documents that makes them important, and not just the combination of linguistic features or patterns; it’s also context and extra-linguistic variables, such as when the document was produced, who wrote the document and to whom. All of these things together make a document smoking gun material. For example, a whistle blower’s memo wouldn’t be so impressive if it were written after the DOJ started its investigation of the company, would it? It would just look like some self-serving CYA, something easily explained away by opposing counsel. Or an admission by a pharmaceutical sales rep that “drug X has some problematic and concerning side effects” AFTER the FDA has withdrawn it from the market isn’t very crucial. It’s an intersection of linguistic and extra-linguistic variables that has legal import; and it requires language AND legal expertise to engage in this process.

We have had the good fortune of combining our expertise and informing each other’s research objectives for a decade now, and what’s more, we’ve worked with countless corpora consisting of business communications and documentation produced specifically in the context of civil litigation. Here’s what we know: Who better to develop and implement investigative methods for large natural language datasets of unstructured text than linguists, particularly forensic linguists who understand and are comfortable working within a legal setting? Here is another important point: Language is complex, to say the least, but investigating Language in a legal setting adds another layer of complexity to the task. Remember, the linguist isn’t making the call about what is or is not a smoking gun. The linguist is leveraging their expertise about Language and patterns of communication in tandem with another legal professional’s expert opinion, oftentimes a lawyer’s, in order to wring every bit of critical information out of a collection of ESI.

For example, when we see patterns of overlapping content like financial language and risk/benefit language that includes talk of fatalities in a document devoid of personal opinion and emotive language, we understand this is something our client should look at immediately. Likewise, when we see a communication among higher-ups, expressing negative sentiment and including a lot of informal, personal language, co-occurring with business strategy-related language and unique terms of art, we understand this is also something the legal team should look at immediately. We then distill these linguistic patterns of communication into an algorithm and refer to this algorithm going forward.
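As a rough illustration of what “distilling a pattern into an algorithm” might look like, here is a toy sketch that flags documents where financial and risk/benefit language co-occur with fatality terms while emotive language is absent. Every word list and the `needs_review` helper are invented stand-ins for illustration, not our actual methodology.

```python
# Hypothetical sketch of the first pattern described above. All word lists
# are invented; a real model would use far richer linguistic features.
FINANCIAL = {"revenue", "margin", "forecast", "cost"}
RISK_BENEFIT = {"risk", "benefit", "exposure", "liability"}
FATALITY = {"fatality", "fatalities", "death", "deaths"}
EMOTIVE = {"love", "hate", "awesome", "terrible", "lol"}

def needs_review(text):
    """Flag documents mixing financial, risk/benefit, and fatality
    language while staying impersonal (no emotive vocabulary)."""
    words = set(text.lower().split())
    return (
        bool(words & FINANCIAL)
        and bool(words & RISK_BENEFIT)
        and bool(words & FATALITY)
        and not (words & EMOTIVE)
    )

doc = "forecast shows cost exposure if fatalities exceed projected risk"
print(needs_review(doc))  # -> True
```

The value of writing the pattern down this way is repeatability: once the legal team confirms a hit is important, the same rule can be run against the rest of the collection going forward.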

So yes, a smoking gun document is the Holy Grail of document investigation. And you should certainly put yourself in the best position to find them. But like the cup, the smoking gun can be elusive; and perhaps expert investigators should work with legal teams on more productive areas like identifying material that proves or disproves the elements of a case, that solidly supports crucial case themes, or that forms the foundation for the story you want to tell the jury. Sure, it would be great to have a smoking gun as the centerpiece. But more on that next week. Until then, good luck.


I Love Words, or Why a Lawyer Needs a Linguist, by Suzanne Smith*

Why does a lawyer need a linguist? Lawyers are master wordsmiths. We spend years honing the fine art of persuasion. Rhetoric, logic, oral and written arguments. We spend the first year of law school learning how to write persuasively, how to craft a legal brief that is a thing of beauty. I have known trial lawyers who could deliver a closing argument that had opposing counsel weeping with bitter frustration over the power of their words. To some extent, to be a lawyer is to love words. Whether spoken or written, words are the currency with which our profession trades. And yet here I am, telling you that if you are involved in any litigation that involves a large collection of text-based documents, you need a linguist. I am. I am telling you this with the fervor of a religious convert. Once, I was just like you. Comfortable with doing it the way it had been done for years and years and years. Then, something changed. A decade ago this year, I met my linguists.

Ten years ago, when I was introduced to them for the first time, I was skeptical. Very skeptical. I remember thinking, “OK. Linguists. Cool. Great for translating these Spanish documents.” Then I met them and they were talking about corpora of communications and identifying potentially privileged communications and using language markers for expediting relevancy review. I thought, “Huh. This is different.” And I went away for a while. I was busy, very busy doing lawyerly stuff. Well, sort of… I was actually working for an eDiscovery vendor as a legal subject matter expert and was in the process of preparing for a big document review platform sales pitch. Part of my job was to find interesting (read: juicy) and legally relevant material to use in the demo.

I prepared, using traditional review methods (using search to generate a stack of documents that I would then read through, one at a time, page after page). I struck gold! I got lucky, finding a memo from a former employee, now a whistle blower. In the memo, the whistle blower indicates to the CEO that there is something rotten in Denmark and urges the CEO to do the right thing and set the company to rights. Great. I’d found the centerpiece to my demo. Now I just had to find supporting or refuting content to round out my document set. I had over 200,000 emails and e-docs to work with. I had a name, I had a project name, I had a time frame and I had a fancy search tool. How hard could it be?

Time was ticking away. I had one good document. I did NOT want to resort to searching for curse words in a last ditch effort to locate examples of juicy documents.

Then I remembered the linguists. I was getting pretty desperate (thank goodness it was just a sales pitch, not an actual deposition.) So I called them up and told them what I was trying to do. I didn’t want every memo in this collection. I didn’t want everything authored by the whistle blower. I didn’t want EVERY document referencing the project. I wanted material that supported the author’s claim that other, highly placed executives knew about this, or other questionable deals, and had reservations about the wisdom of proceeding. Oh, by the way, I needed these key documents really, really quickly.

They laughed at me. Ten years ago, they laughed. No, they didn’t, but they wanted to. They didn’t because they are from the South and are too polite. They spoke about the infinite variability of language and the dynamic nature of modern business communication. They told me that it would take years of language research to create a suite of linguistic markers that could reliably pull back the content I was interested in. They did help me come up with a pretty snazzy Boolean search query that got me a respectable stack of demo documents. I didn’t have to resort to the curse words. The sales pitch went just fine. We maybe even got the client. I don’t remember.

But that experience was an epiphany. I realized that despite my ability to frame my legal needs, to pose my question, I didn’t have the linguistic expertise to reach into a massive collection of documents and reliably, let alone efficiently, locate the critical content.

I realized that reviewing documents for crucial information and important content comes down to a couple of key things: Luck and having an abundance of time, money and human-power to throw at this part of the discovery process. I don’t want to rely on luck. I don’t want to spend my time and resources this way. In contrast, linguists have insight into language and experience working with natural language data and can leverage these in productive ways to assist in document investigation of this magnitude. Why not partner legal and linguistic expertise? That, in combination with technology that can process massive amounts of data quickly is the ideal, multidisciplinary approach to discovery.

After a decade of working with legal teams, we are convinced that the collaboration potential for linguists and lawyers has only increased over time. The legal profession should be looking to fields like natural language processing and corpus linguistics for solutions to dealing with these massive collections of text-based language. The collaboration should reach across all stages of discovery. We need to collaborate with language experts in a research and development capacity, in the best research environment, working with relevant collections to address the specific types of issues that the legal community deals with on an ongoing basis.

Language expertise is paramount, but it’s the application of this expertise in this legal setting that matters. After all, the linguist isn’t making a model of language, or studying language for language’s sake. The linguist isn’t making a call about what is or is not legally relevant, or privileged, or a smoking gun document. The linguist isn’t deciding which case themes should be argued. The linguist studies how people use language and patterns of linguistic behavior to impart specific information, and mines that intelligence for crucial pieces of evidence as determined by the legal team. The linguist is working with a legal expert — a trial lawyer, a litigation specialist — and applying their expertise in order to winnow down the pool of evidence to uncover documents that have the most potential and/or legal import with respect to a case. The linguist is able to leverage patterns of communication and control for extra-linguistic variables as they co-occur with legal case themes in ways that not only contribute to the investigative process, but in ways that shape that process.

The legal community stands to gain a lot by utilizing the expertise of linguists. After all, we employ forensic accountants, accident reconstruction specialists, epidemiologists and jury consultants. Why wouldn’t we use linguists when we need to find language-based evidence? After a decade of experiencing what can happen when one small group of lawyers and linguists work together, as the exception to traditional discovery practices, I’d like to spend the next decade watching as wide scale collaboration becomes the rule.

*Suzanne is founder and CEO of Illocution Inc.


eDiscovery in 2015: Predicting the Unpredictable

I’ve been mulling over my eDiscovery predictions for 2015 and I realized that they represent more of a “wish list” than anything. Suffice it to say, I feel like we have a lot of ground to cover with respect to innovation and evolution in eDiscovery-related products and services. Here’s my road map for the journey forward:

1) The contrast between document review and document investigation will become more stark, especially as we continue to confront the exponential growth of ESI. Finding text-based documentary evidence is more than creating different ways to categorize and read piles of electronica, simple information search and retrieval, and/or large scale data analytics. There are forensic linguistic investigative methodologies that can offer real solutions to finding out the ‘who, what, when, and why’ in a collection of ESI. And they don’t require huge platforms, armies of human power, or a lot of time and money, but rather the right expertise and expert tools. Why set up huge review efforts to find evidence when you can consult experts who just go in and find data-driven answers to your questions? There are significant areas of expertise that eDiscovery has yet to properly tap, but I expect this will change in the coming months and years.

2) Big, clunky, “everything but the kitchen sink” review platforms are going to go the way of the dinosaur. We’re becoming a more a la carte (a la app?) industry, and society in general, with respect to technology. We have a specific enterprise and we want a tool that accomplishes it. On the horizon I see lean, efficient, agnostic expert tools and solutions, and expert users manning the tools which provide these solutions.

3) eDiscovery tools and solutions are going to evolve beyond simply reacting to large quantities of data. And in the same vein, we will start to move past focusing primarily on eDiscovery processes for culling or reducing data. Rather than exclusion, I predict a move toward quite the opposite end of the spectrum: Inclusion. Let’s stop hand-wringing over reducing size and start concentrating on identifying and collecting the most qualitatively relevant, robust data set possible, using the most valid and reliable methods available. When your objective is to identify and produce the most qualitatively valuable data set, the idea of culling becomes a moot point. We simply have to stop letting the fact that we’re dealing with a lot of data be the driving force behind adopting eDiscovery tech solutions.

4) Much to my dismay, predictive coding is going to continue to make headway into relevancy identification. This should not imply that I am not a fan of predictive modeling, because in the right context, it is a hugely powerful and productive methodology. In fact, I love using predictive models in some investigative contexts. However, I’m going on record to say that relevancy/responsiveness identification and collection is not the right context for this methodology. And when you use this methodology in an improper context, you will get unreliable and uninterpretable outcomes. It’s a risky venture.

5) Large corporations and businesses will move to handle more eDiscovery processes internally. There’s a pervasive conversation happening about information governance and management. eDiscovery is a natural extension of this conversation. If large companies advance toward principled, data-driven information governance, and truly get a handle on their ESI, then they are going to have to create teams representing the right combinations of technical experts and subject matter experts. When these expert teams settle into place and develop/implement smart information management processes, then wholesale out-sourcing of things like relevant/responsive document identification and collection will not be necessary. As a result, the data flowing outside of a company will be contained to the information most relevant to the task at hand, whether it is in the context of litigation, arbitration, or compliance. In sum, good corporate information governance will have a huge impact on eDiscovery.

There you have it. My stab at prognosticating about the future of our field. And now instead of just writing about all this fun stuff, I’m off to practice what I preach.


Technology and eDiscovery: State of the Union.

I rounded out 2014 by looking back over the last 15 plus years in eDiscovery, focusing on the historical context that influenced, and continues to influence, the legal profession’s alliance with the tech industry.  I want to ring in 2015 with a conversation about what has been happening since these two love birds came together, moving from a shaky partnership born of necessity, into a relationship that has endured and matured, and even spawned a new industry: Legal technology.

As the legal tech industry has evolved over the years, a range of tech solutions has come onto the market for legal professionals. A relatively young industry, legal tech is sort of like the wild west in a lot of ways, eager to accommodate eDiscovery homesteaders in particular with a variety of products and services. Large-scale hosting platforms abound. Data analytics that help you make sense of your massive amount of ESI are customary offerings. All manner of technology, tools and processes are available to assist you in wrangling that nebulous Everest of electronica in the context of eDiscovery, and beyond.

To be sure, there is nothing more daunting than being faced with scaling this Everest of ESI in the context of eDiscovery. And there is nothing more comforting than having an expert show up and tell you they’re going to outfit you with everything you need to make this herculean task manageable. Or better yet, easy.

Be that as it may, there’s no elixir vitae in legal tech that meets all of the demands of eDiscovery. In fact, the field relies on a stable of “a la carte” technologies and technical expertise to accommodate different aspects of the spectrum of eDiscovery processes. And as we all know, there are many stages and aspects to eDiscovery, many requiring a gamut of technical expertise. Although technology and technical expertise in eDiscovery are paramount, it is an ambitious task navigating all of our options under the circumstances in which we’re operating: The legal infrastructure in which we endeavor places an extra layer of complexity on a series of processes that are already highly complex.

As with every consumer product or service, there’s a usual “buyer beware” caveat and legal tech is no exception. All tech is not created equal. Products may appear similar, but may be very distinct if you peek under the hood. Many times, these distinctions are singular and far-reaching. This makes it even more complicated when choosing the right tool/process/application to suit your eDiscovery needs, which vary and shift at every stage of the game.

Thus, finding and employing the *right* technology and technical expertise is important. Assessing and understanding what this program does, or what that automated method achieves, is key in legal tech. But it’s challenging for a number of reasons.

First, it’s hard to understand the constraints or drawbacks of a highly technical product or service because of the nature of the data we’re working with. Here’s something that everybody who works with unstructured, text-based natural language needs to understand: It is the most complex, varied and ever-evolving data type going. It is not like structured data, which offers a neat one-to-one correspondence between form and function. Language is just not like that, as any linguist will tell you. Language, and the text used to graphically represent Language, is infinitely variable, innovative and changes every day. This is an empirical fact that needs to be accommodated by every single tech solution in eDiscovery. Period.

(Garden path alert! Skip this paragraph if you value continuity.) In the early days of auto-correct, I used to ponder the statistical methods used as the foundation of these potentially useful predictive algorithms. These algorithms were originally developed to work on structured, numerical data. They were designed to work on data that had a consistent one-to-one relationship between symbol and value. But as with everything interesting and useful, folks wanted to expand predictive modeling algorithms to work in different contexts, on different data types. As previously mentioned, text-based natural language data does not organize itself into tidy, regular form/meaning relationships. In their incipience, auto-correct programs were pretty terrible, making it obvious to me that whatever data was used for beta-testing wasn’t a representative sample of the kind of text-based language that characterizes much of our every-day, computer-mediated communication. I would think to myself: Did they even consult an applied/empirical/corpus linguist in the research and development of these applications? It irritated me. But then we had all those awesome BuzzFeed tributes to auto-corrects gone awry, which amused me greatly. I’m conflicted. But I digress…

Second, employing a technical solution on huge collections of complex data is an even bigger adventure. Margins of error look very different depending on how much data you have. Not only that, but it becomes more complicated to systematically assess a tool or automated method’s validity and reliability in large collections of unstructured, text-based natural language data. Complex data and lots and lots of it. Double whammy. This makes it harder for a non-expert to empirically verify whether a particular program or application is doing what it should, or what it purports to do. To be sure, it’s doing something. But what are its limitations? What isn’t it doing? What sort of information is it eliminating or not returning? What sort of information does it privilege?

Third, it is our nature to think that when we find a tech solution that works in one context, maybe we can use it successfully in another, seemingly related context. While one tool/process/application may work well in one particular context, that doesn’t mean it is a perfect fit for another, related but slightly different one. Forging ahead anyway will often produce uninterpretable or inconsistent results, unbeknownst to the user. I’ve seen this with a lot of borrowed technology that has made its way into eDiscovery, namely, predictive coding programs. Predictive coding has mostly been used for large-scale, automated categorization on produced ESI. Now predictive coding is being considered as “proven technology” for relevancy identification and review in pre-production stages (see Da Silva Moore v. Publicis). In order to assess whether or not this is a good idea, you really need to have a nuanced understanding of how predictive coding algorithms work. And specifically, how these algorithms, which again were developed for structured data, react when used on unstructured, text-based natural language. I actually have a good deal to say about this, but at the risk of another garden path, I’ll save it for its own post.

Here’s a quick example that illustrates a mash-up of all these points. You’re tasked with document review. You’re using a large document hosting platform. It has a complex search feature. The search feature operates by indexing the words in the collection. This index is created and stored in a database, as is standard fare in searching tools on large-scale document hosting platforms. Nothing interesting to see here. Well, not to most people anyway. But I have questions, namely: Does the tool’s method of indexation involve dropping the “noise” words to expedite the process? Noise words are those words that occur frequently in a language and are thought to impart little to no information. They include words like the, a, of, and, well, and. These words occur at very frequent rates and make up a large percentage of an English language corpus (in linguistics, corpus is another word for collection). Most indexers drop them because it cuts down tremendously on the amount of text the indexer will be processing and sifting through, ultimately making the search feature faster.

Whether the indexer you’re working with does “noise word” elimination is something that probably nobody using these tools, or purchasing these tools for use, ever really thinks about, or cares about. On its face, it seems not at all worth considering. Is it even important? Maybe not if you’re just using the search tool for organizing content, or making broad inquiries in your collection, with the goal of then organizing big piles of text into smaller piles of text. But what if you’re trying to use this search feature to conduct any sort of fine-tuned, forensic text investigation? What if you’re looking for very specific answers to very specific questions? What if you’re investigating to find a smoking gun document? Then you better believe that noise word elimination matters.

Consider the difference between the following statements, “I have noticed a problem” versus “I have noticed the problem.” Now consider that the first statement was written in an email by one scientist to another regarding the analysis of endpoints in a clinical trial conducted by a big pharma company. Now consider the second email in the same context. Is there a difference? You bet. A problem could be any old problem, but the problem has obviously been identified before, as now it warrants a definite article to qualify it. It has evolved from being just a random problem, to being the problem. It also indicates that both emailing parties are aware of said problem, as it warranted no further explanation.  So much meaning and contextual information all wrapped up in one little “noise” word. And one of the most frequent noise words in the English language at that. A word only a linguist could love.
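A toy inverted index makes the stakes visible. In this hypothetical sketch (the `NOISE` list and the two statements mirror the example above; the `index` helper is invented), an index built with noise-word elimination can no longer distinguish “a problem” from “the problem,” while the full index can.

```python
# Hypothetical sketch: a toy inverted index built with and without
# "noise word" elimination, using the two statements from the example.
NOISE = {"the", "a", "of", "and", "i", "have"}

docs = {
    1: "I have noticed a problem",
    2: "I have noticed the problem",
}

def index(docs, drop_noise):
    """Map each word to the set of document IDs containing it."""
    inv = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if drop_noise and word in NOISE:
                continue
            inv.setdefault(word, set()).add(doc_id)
    return inv

lean = index(docs, drop_noise=True)
full = index(docs, drop_noise=False)

# With noise words dropped, "the" never makes it into the index,
# so a search can no longer separate the two statements.
print("the" in lean)  # -> False
print(full["the"])    # -> {2}
```

Both versions will find “problem” in both documents; only the full index can tell you which writer already knew about the problem.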

But seriously, if you’re still using a document review search tool for any sort of document investigation, stop reading this and get in touch with me right now. I can tell you definitively that document investigation is NOT the same as document review. Different processes, different objectives.

Well, this was a long one. What can I say? I’m a linguist. I love the words. If you’ve made it this far, you’re kindred at best, and at the very least, you’re a rebel in this virtual theatre of short attention spans.

Next week I’m going to do what everybody else is doing as we head into a new year and talk about the future of technology and technical expertise in eDiscovery. Like 99.9% of my blogging compatriots, I have some predictions. And I’m gonna lay them out for you! Before then, I’ll get to work on this year’s resolution: Brevity.

Happy New Year, folks.

Posted in document investigation, document review, predictive coding

The ghost of Discovery past.

I’m going to start out today’s post with an exercise in stating the obvious: Technology is a central part of Discovery in the legal profession. Indisputably so. Take away computers and hardware and software and automated processes and all the rest of it, and you’re left with banker boxes full of physical documents, a highlighter, hundreds of hours of reading, and a whole lot of paper cuts on your fingers.

Computers and automation literally transformed the discovery process in the legal profession (it gave it a prefix!), and it seems like it happened in the blink of an eye, which is not a temporal pace one usually associates with legal. And it’s not just that the legal profession had to adjust to the fact that computers and automation and advances in tech permeated every facet of every industry. That’s a speeding train that everybody has had to board. It’s that computers and automation and tech advances profoundly changed the quality and quantity of the artifacts of every industry, as well as the artifacts of personal documentation and communication. And these artifacts are the very centerpieces of discovery processes. A discernible pile of physical documents became a limitless and ever-expanding universe of electronically stored information. Banker boxes were traded for hosting platforms, highlighters for radio buttons, and paper cuts for carpal tunnel. Discovery became eDiscovery.

Interesting aside: Here you have a profession that operates on the very notions of precedence, historical relevance and traditional methods, having to quickly adjust and evolve due to broad external technological forces, and doing so at a speed that could be construed as “uncomfortably brisk” and at a pace that shows little or no pause for precedence and history.

Traditional by nature though the legal profession may be, eDiscovery has had to operate in a cutting-edge manner: it is as much a driver of technology in the field as it is a result of technology in the field.

And if technology is a foundation of eDiscovery, then data is the massive metropolis erected on top of this foundation.

The legal profession has been dealing with, and reacting to, “Big Data” (or Big ESI, as it were) long before it was a regular part of the tech lexicon. It could be argued that the legal profession in general, and eDiscovery in particular, has been at the fore in adopting practical tech solutions to deal with large quantities of computer-generated data, or what I referred to upstream as the artifacts of doing business in our computer-mediated world.

Now, I am getting to the point (finally!) that I want to frame a larger discussion of technology, technical expertise and eDiscovery: I believe technology and technical expertise in eDiscovery proper has been driven almost solely by the fact that our computer-mediated world produces a lot of data. I want to reiterate this point because it’s an important one: In eDiscovery, the drivers of incorporating technology, and adopting tech solutions, have been primarily a reaction to the quantity of ESI, as much as to the digital environment in which it is produced. And even if you don’t completely agree with this assessment, it’s an interesting idea to ponder, nonetheless.

And with that, I’ll conclude this historical look at technology and eDiscovery. I’ll end this last post of 2014 by looking ahead to next week’s first post of 2015, which will be a present-day technology/eDiscovery state of the union, segueing into a look to the future. Until then, I hope your holidays have been uncommonly merry and undeniably bright.
