~ Semantic Search Art ~

Predictive Coding

May 19, 2013


Predictive coding is a combination of programmatic methods and human actions directed towards obtaining a subset of specific documents from a large collection. It is typically used in litigation, when attorneys need to find documents to back up a lawsuit and know the concept of the wanted documents rather than a set of keywords.

The term predictive coding is quite misleading, though its origin can be explained. Coding means assigning a responsiveness rating to each document: how well the document matches the concept, on a scale from 0 to 1. Since this rating is assigned by the program and not by a human, it is called a prediction, hence predictive coding. Highly rated documents are then examined by experts for a final decision. The advantage of this method is a significant reduction in the number of documents to be examined (usually from several million down to several thousand). The method that existed before predictive coding was linear search, which is another misleading term, because it is not clear what is linear about it. It simply means manual examination of every document in the collection, which is extremely slow and expensive.

Typically, predictive coding is an iterative procedure. It starts with so-called seeding, which is finding a few responsive documents by any means; it does not really matter how. Using this seeding set, the rest of the collection is programmatically examined and each document is assigned a responsiveness rating. Part of the found responsive documents are manually examined and the seeding set is updated, after which the programmatic search is repeated. Obviously, this procedure can be repeated multiple times, until the attorneys reach some confidence that they have found everything, or close to everything, they could.
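A minimal sketch of this loop in Python, purely my own illustration and not any vendor's actual implementation, could look as follows. The names score and manual_review are hypothetical placeholders: score is any method that rates a document against the seed set on the 0 to 1 scale, and manual_review stands in for the attorneys' examination step.

    def predictive_coding(collection, seed_set, score, manual_review,
                          threshold=0.5, max_rounds=5):
        """Iterative predictive coding loop (illustrative sketch).

        collection    -- all documents in the matter
        seed_set      -- list of responsive documents found by any means
        score         -- score(doc, seed_set) -> responsiveness in [0, 1]
        manual_review -- manual_review(docs) -> documents humans confirm
        """
        rated = []
        for _ in range(max_rounds):
            # Rate every document against the current seed set.
            rated = [(score(doc, seed_set), doc) for doc in collection]
            # The program proposes candidates above the threshold ...
            candidates = [doc for r, doc in rated if r >= threshold]
            # ... a part of which is examined manually.
            confirmed = manual_review(candidates)
            new_docs = [doc for doc in confirmed if doc not in seed_set]
            if not new_docs:
                break                       # nothing new found: stop iterating
            seed_set = seed_set + new_docs  # update seeds and search again
        return sorted(rated, key=lambda pair: pair[0], reverse=True)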

There is a mystification of predictive coding, spread mostly by companies involved in e-Discovery. Some of them even suggest that the programmatic search is conducted according to rules established by the attorneys, which they can modify at every step, or that predictive coding identifies the responsiveness of documents based on the meaning they convey and not on the words they use, and so on. After some research it becomes clear that by the rules the attorneys make they mean the rates the attorneys assign to the examined documents, and by search by the meaning of the document they mean the classical and long-known use of word co-occurrence statistics in such methods as LSA, PLSA, LDA, NB, HAC and others. Speaking of machine comprehension, at the moment there is no program that can tell that the phrase "I lost ten pounds" is good news and the phrase "I lost my wallet" is bad news, unless it is specifically trained on these specific phrases. There is also no semantic search engine that answers the simple question "Is it possible to visit London without visiting England?" correctly, the correct answer being "Yes, if London is in Canada."
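To make the first point concrete, here is a small sketch of my own (not any product's code) showing that a plain bag-of-words cosine similarity, exactly the kind of word statistics behind the methods above, rates the two "I lost..." phrases as fairly similar, because it sees shared words, not opposite sentiments.

    import math
    from collections import Counter

    def cosine(a, b):
        # Cosine similarity between two bags of words.
        wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(wa[t] * wb[t] for t in wa)
        norm = lambda w: math.sqrt(sum(v * v for v in w.values()))
        return dot / (norm(wa) * norm(wb))

    good_news = "I lost ten pounds"
    bad_news = "I lost my wallet"
    print(cosine(good_news, bad_news))  # 0.5: half the words are shared,
                                        # so word statistics call the phrases
                                        # related, good news or bad news alike.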

According to the available research, the average accuracy that can be expected from linear search is about 60%, and from predictive coding about 80%. Needless to repeat, predictive coding is also much quicker and cheaper. There is no need to prove that predictive coding is effective in order to push this procedure into the courtrooms, because they have no choice: they will have to use it simply because linear search will soon become unaffordable.

As already explained, predictive coding programmatically examines a large set of documents by comparing them to a small set of preselected documents and rates all of them, i.e. assigns each document a degree of proximity to the preselected set. The documents can then be sorted according to the assigned rate and examined by attorneys. Suppose you have two million sorted documents with rates from 0.999999 down to 0.000001; where should the manual examination start? The document with rate 0.999999 is the most responsive and the document with rate 0.000001 is the least responsive. This is where a crucial mistake can be made. The manual examination should start from the tail: attorneys should begin with the least responsive documents, not the most responsive. Let me explain with an example:

Presume we are looking for documents about the weather. Our seeding set contains a document with the key phrase "Blue sky and sunshine". Predictive coding will give a high rating to a document with the phrase "The small single white cloud on the blue sky covered the sun" and will miss, or rate as unresponsive, a document with the phrase "Snow storm, bitter cold, strong wind and darkness". By examining the tail, an attorney has a chance to find a relevant document with completely different vocabulary and include it in the seeding set, after which a much wider set of documents can be found in the next run.
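The same bag-of-words cosine similarity makes this effect easy to see in numbers. The sketch below is again my own illustration; the tiny stopword list is made up just to keep function words like "the" and "and" from counting as matches.

    import math
    import re
    from collections import Counter

    STOP = {"the", "and", "on", "a", "of"}   # tiny illustrative stopword list

    def bag(text):
        # Lowercase, keep letters only, drop stopwords, count the words.
        return Counter(t for t in re.findall(r"[a-z]+", text.lower())
                       if t not in STOP)

    def cosine(a, b):
        wa, wb = bag(a), bag(b)
        dot = sum(wa[t] * wb[t] for t in wa)
        norm = lambda w: math.sqrt(sum(v * v for v in w.values()))
        return dot / (norm(wa) * norm(wb))

    seed = "Blue sky and sunshine"
    cloud = "The small single white cloud on the blue sky covered the sun"
    storm = "Snow storm, bitter cold, strong wind and darkness"

    print(cosine(seed, cloud))  # ~0.41: shares "blue" and "sky", rated high
    print(cosine(seed, storm))  # 0.0: no shared content words, rated as
                                # unresponsive despite being about weather

The snow storm document scores exactly zero against the seed, which is why it lands in the tail, and the tail is precisely where the missing vocabulary is found.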