Recently, SAS announced support for White House efforts in the fight against patent trolls. As indicated in the announcement, lawsuits filed by patent trolls cost innovators $500 billion in lost wealth from 1990 to 2010 - and are growing at an average rate of 22% a year.
Finding the right information in seas of documents is a challenge for many organizations – patent search and litigation are no exception. Legal organizations are awash in hard drives filled with reports, emails, communications, and the like.
Which brings me to the topic of predictive coding. In following some of the historic debate about the usefulness of this approach in alleviating some of the burden of manual review, I’ve asked: why was this called ‘predictive’ in the first place? For the legal profession, a field founded on facts, predictive notions might even be downright scary. And ‘predictive’ doesn’t really describe what this text analytics method does to improve legal searches.
According to Wikipedia, a prediction is “a statement about the way things will happen in the future, often but not always based on experience or knowledge”. Well, that’s not what predictive coding does. This type of analysis uses computer software to analyze documents, with the goal of finding important or highly related content within existing material. There is nothing futuristic about it.
Predictive inference (in statistics) considers extending the characteristics of a sample to an entire population. Well, that’s not what predictive coding does either. A document is examined to determine its membership in one or more topics, terms, themes or phrases. A relevance score is defined – one that reflects a probability of membership.
In fact, defining the relevance of a document – describing its membership in a fact, taxonomy or topic – falls within the well-established field of categorization. Categorization of content is a descriptive analysis method, putting text/documents into relevant buckets. Descriptive analysis is different from predictive analysis – the first explains, while the second forecasts or projects. And probabilities are different from predictions – asking whether something will happen is different from asking when it might happen. But ‘descriptive coding’, while perhaps more accurate, isn’t very catchy. Established alternate names for this eDiscovery technique, such as ‘technology assisted review’ or ‘computer assisted review’, seem more helpful in describing what it is.
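To make the idea of a relevance score concrete, here is a toy sketch: score a document by the share of a topic’s defining terms it contains, and read that share as a rough probability of membership. The topic definition and document are made up for illustration – real categorization engines use far richer linguistic models.

```python
def relevance(document, topic_terms):
    """Toy relevance score: fraction of the topic's defining terms
    that appear in the document (a rough membership probability)."""
    terms = set(document.lower().split())
    return len(terms & topic_terms) / len(topic_terms)

# Hypothetical topic definition for illustration only.
patent_topic = {"patent", "claim", "infringement", "license"}

doc = "the patent infringement claim was dismissed"
score = relevance(doc, patent_topic)  # 3 of 4 topic terms present
```

A score near 1 would put the document firmly in the ‘patent’ bucket; a score near 0 would filter it out – descriptive bucketing, not forecasting.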
I’ve even gone so far as to interview lawyers on this topic. Their conclusion was that for extremely high-volume cases, and as a method of triage for certain types of documents, computer assisted review can be quite helpful. The goal is to filter out materials that are unrelated to the case at hand. Ideally, the remaining, potentially relevant materials are grouped into different topics – providing context. Then an intensive search exercise occurs to isolate pertinent documents. Still – nothing futuristic.
So, one may ask: ‘How do you predict from text data? Or any kind of document, for that matter?’
Prediction from text happens once the text is numerically represented – structured in such a way that it retains the essence of the text’s meaning, but described as numbers (like the presence or absence of a term). There are very sophisticated ways to do this, well defined in the field of text mining. Once documents are numerically structured, they are in the format needed for predictive models – to see if the terms, phrases, facts and themes are meaningful to future events.
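The simplest version of this numeric representation is the presence-or-absence encoding mentioned above: each document becomes a row of 0/1 flags, one per vocabulary term. A minimal sketch (real text-mining tools add tokenization rules, stemming, term weighting and much more):

```python
def build_vocabulary(documents):
    """Collect the sorted set of unique terms across all documents."""
    vocab = set()
    for doc in documents:
        vocab.update(doc.lower().split())
    return sorted(vocab)

def to_presence_vectors(documents, vocab):
    """Represent each document as a binary term-presence vector."""
    vectors = []
    for doc in documents:
        terms = set(doc.lower().split())
        vectors.append([1 if term in terms else 0 for term in vocab])
    return vectors

docs = ["warranty claim filed", "claim denied", "warranty expired"]
vocab = build_vocabulary(docs)   # alphabetical list of unique terms
matrix = to_presence_vectors(docs, vocab)
```

Each row of `matrix` is now an ordinary numeric observation that any predictive model can consume.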
Will customers leave in the future based on a dissatisfying experience that they had?
- Say they’ve called into the 1-800 line and complained, or written emails. First you’d analyze the text to understand the issues. These ‘issues’ (whether topics, concepts, or even linguistic rules) are translated into structured representations (as new variables or taxonomies). In turn, these new elements are used as input to a churn model, which estimates the probability that the customer will leave at some time in the future.
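The step from text-derived flags to a churn probability can be sketched with a simple logistic model. The feature names and coefficients below are assumptions for illustration – in practice the weights would be estimated from historical churn data:

```python
import math

def churn_probability(features, weights, bias):
    """Logistic model: probability of churn given 0/1 text-derived features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features extracted from calls and emails:
# [complained_by_phone, billing_issue_in_email, praised_support]
weights = [1.2, 0.8, -1.5]   # illustrative coefficients, not a fitted model
bias = -1.0

angry_customer = [1, 1, 0]
happy_customer = [0, 0, 1]

p_angry = churn_probability(angry_customer, weights, bias)
p_happy = churn_probability(happy_customer, weights, bias)
```

The complaining customer scores a much higher churn probability than the satisfied one – the textual ‘issues’ have become forward-looking signals.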
When might a car no longer be roadworthy, given its history of repairs, age, use, etc.?
- Text mining of service notes for that make/model, warranty claims, reported issues, and the like creates structured, numeric variables. These new insights, along with other numeric information (like mileage), would be inputs into a model that predicts the future failure of the vehicle.
When will demand for a product increase?
- Monitor social media and identify the ‘buzz’ by crawling external information sources and extracting pertinent commentary. Use these identified elements, along with sales trend data, in a model to forecast when more demand is expected.
… the list goes on…
Text mining is a well-established discipline – and as many of our customers know, it is a discovery process. Sound familiar? Based on the data – not humans – documents are classified, with machine learning methods that identify clusters and topics, and can even create taxonomies or profile how a term changes over time.
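That data-driven grouping can be illustrated with a tiny clustering sketch: documents whose term sets overlap strongly (here, by Jaccard similarity above an assumed threshold) land in the same bucket, with no human labels involved. The documents and threshold are made up for the example:

```python
def jaccard(a, b):
    """Term-overlap similarity between two token sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

def cluster(documents, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose
    representative document overlaps enough, else start a new one."""
    clusters = []  # list of (representative token set, member indices)
    for i, doc in enumerate(documents):
        terms = set(doc.lower().split())
        for rep, members in clusters:
            if jaccard(terms, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((terms, [i]))
    return [members for _, members in clusters]

docs = [
    "patent claim infringement lawsuit",
    "patent infringement claim filed",
    "quarterly sales report summary",
]
groups = cluster(docs)
```

The two patent documents group together and the sales report stands apart – purely from the data.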
Text mining is, however, only part of the electronic data discovery technology solution described by Joel Henry. Today, text mining can help remove the burden of manually developing training sets, and provides a method for active learning – allowing machine-generated categories to learn from human conditioning.
ESI in Joel Henry’s article stands for ‘electronically stored information’. Having documents in electronic form is a requirement for any type of machine learning exercise.
SAS announced a commitment to converting 38 years of user documentation and technical papers to electronic form for IP.com, which, in turn, works with the US Patent and Trademark Office (USPTO). With the documentation in electronic form, IP.com will be able to publish, aggregate and analyze technical documentation, supporting USPTO efforts to reduce the burden of patent troll litigation.
The future is predicted to be very bright for organizations committed to stemming abusive patent business practices, as well as for those who are making use of advanced analytics to address big data burdens.
Findings from the Boston University School of Law study: http://www.bu.edu/law/news/BessenMeurer_patenttrolls.shtml