~ Semantic Search Art ~

Demo for Split and Merge plus Triple Clustering

Split and Merge

The essence of this algorithm is two steps: the first splits each cluster into two clusters, and the second merges the newly created clusters until their number equals the original number of clusters. It starts from randomly created clusters. For example, if we have a simple array of bits sitting in two clusters in a random way,

0100110101010 - 110101010100011,

we identify the minimum and the maximum in each cluster and split each cluster into two, assigning each element to the left or to the right depending on its proximity to either the minimum or the maximum:

0000000 - 111111 - 0000000 - 11111111.

The merging step is obvious: we merge similar clusters.

00000000000000 - 11111111111111.
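
Here is a minimal sketch of both steps on the bit example, in plain Python (the split and merge helper names are mine, for illustration):

    def split(cluster):
        # Split one cluster in two by proximity to its minimum and maximum.
        lo, hi = min(cluster), max(cluster)
        left = [x for x in cluster if abs(x - lo) <= abs(x - hi)]
        right = [x for x in cluster if abs(x - lo) > abs(x - hi)]
        return left, right

    def merge(clusters, target):
        # Greedily merge the two most similar clusters until `target` remain;
        # for bits, similarity is just closeness of the cluster means.
        while len(clusters) > target:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = abs(sum(clusters[i]) / len(clusters[i])
                            - sum(clusters[j]) / len(clusters[j]))
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i] += clusters.pop(j)
        return clusters

    # the two randomly mixed clusters from the example above
    a = [int(ch) for ch in "0100110101010"]
    b = [int(ch) for ch in "110101010100011"]
    parts = [p for c in (a, b) for p in split(c) if p]  # drop empty halves
    print(merge(parts, 2))  # two pure clusters of 0s and 1s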

If the clusters contain documents, we need to find the pair of most different documents and sort all the others based on their proximity to either of these documents. The merge is conducted the way it is performed in hierarchical agglomerative clustering. Identification of the critical pair may need a more detailed explanation. We may call this pair opposite, critical, or even polar. My suggestion is to pick a random document, then find the document with the most different vocabulary, and then another document with the most different vocabulary again. Let us say we randomly picked document A. The document with the most different vocabulary is B, and the document with the most different vocabulary compared to B is C; so documents B and C form our polar pair, not A. The similarity or dissimilarity can be computed simply as the cosine of the vectors that represent the counts of vocabulary words.
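
A sketch of this polar-pair selection, assuming each document is already reduced to a vector of vocabulary-word counts (the cosine and polar_pair names are mine, for illustration):

    import math
    import random

    def cosine(u, v):
        # Cosine similarity of two term-count vectors.
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def polar_pair(docs):
        # Pick a random document A, find the least similar document B,
        # then the document least similar to B; B and C form the polar pair.
        a = random.randrange(len(docs))
        b = min(range(len(docs)), key=lambda i: cosine(docs[a], docs[i]))
        c = min(range(len(docs)), key=lambda i: cosine(docs[b], docs[i]))
        return b, c

    def split_by_poles(docs):
        # Assign every document to the pole it is more similar to.
        b, c = polar_pair(docs)
        left, right = [], []
        for i in range(len(docs)):
            pole = left if cosine(docs[i], docs[b]) >= cosine(docs[i], docs[c]) else right
            pole.append(i)
        return left, right

The same cosine helper is reused in the sketches below.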

The obvious advantage is that this algorithm is applicable to extremely large collections. It can be executed concurrently for each cluster, and some operations may be conducted concurrently even within each cluster. The merge step must be single-threaded and requires a triangular matrix, but the matrix size equals the number of clusters, which is always relatively small (10 to 100).
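
The merge step itself can be sketched as follows, reusing the cosine helper above; representing each cluster by a summed term vector is my assumption here (cosine is insensitive to scale, so a sum works as well as a mean):

    def merge_step(centroids, clusters, target):
        # Agglomerative merge driven by the upper triangle of the
        # cluster-similarity matrix (pairs with i < j only).
        while len(clusters) > target:
            best_i, best_j, best_s = 0, 1, -1.0
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = cosine(centroids[i], centroids[j])
                    if s > best_s:
                        best_i, best_j, best_s = i, j, s
            # merge the most similar pair and update its centroid
            clusters[best_i] += clusters.pop(best_j)
            gone = centroids.pop(best_j)
            centroids[best_i] = [x + y for x, y in zip(centroids[best_i], gone)]
        return clusters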

Human-machine interaction

The accuracy of the S&M algorithm is lower than that of classical HAC (about 75%, compared to 85% on the same data). However, for collections of 10,000 documents and more, classical HAC is simply inapplicable and users have no other choice. Every algorithm has limited accuracy. Even in the best case, an accuracy of 85% to 90% means that a large number of documents may not be found. Many users are looking for a way to improve accuracy by limited human intervention. This opportunity exists for S&M. Since the most accurate algorithms need a training sample (NB, PO and RF), it is suggested to use S&M only for obtaining the training sample and not for the final solution. The training sample is selected after the clustering process is completed. The user selects only the size of the training sample. It should be larger than the number of clusters but not necessarily divisible by the number of clusters. The algorithm for selecting the training sample may be briefly described as follows (a sketch in code follows the list):
  • Select a random document from the first cluster.
  • Select the document most different from it in the next cluster.
  • Select the document most different from the above couple in the next cluster, and so on...
  • After the last cluster is reached, the process continues from the first cluster again.
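
In code, the selection might look like this (a sketch reusing the cosine helper and random import from above; select_training_sample is an illustrative name):

    def select_training_sample(clusters, docs, sample_size):
        # clusters: lists of document indices; docs: term-count vectors.
        sample_size = min(sample_size, sum(len(c) for c in clusters))
        selected, labels = [], []
        k = 0
        while len(selected) < sample_size:
            label = k % len(clusters)
            candidates = [i for i in clusters[label] if i not in selected]
            if candidates:
                if not selected:
                    pick = random.choice(candidates)  # random seed document
                else:
                    # the document least similar to everything selected so far
                    pick = min(candidates,
                               key=lambda i: max(cosine(docs[i], docs[j])
                                                 for j in selected))
                selected.append(pick)
                labels.append(label)
            k += 1
        return list(zip(selected, labels))  # (document index, cluster label)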

The documents selected for the training sample are labeled according to the cluster from which they were selected. After it is fully constructed, the training sample is subjected to user inspection and approval. The size of the training sample is small, so this human intervention step is not a big burden. The dissimilarity between documents that is used when collecting the training sample provides good coverage of all possible types of documents in the collection.

The next step in the processing of large collections that allows limited human intervention is triple clustering, using the three fastest and most accurate algorithms: Naive Bayes, Polar Orthogonalization and Random Forest. They all need a training sample and assign a label to each processed document. After triple clustering is completed, users have three groups of labeled documents: one group where all three algorithms agree, another where only two algorithms agree, and a third where all documents are classified differently. Obviously, users can selectively examine some of the differently classified documents and include them in the training sample. Triple clustering may be repeated multiple times after the user intervention step, or interrupted and restarted. In this case the accuracy depends on the time the user is willing to spend on selective inspections, and this opens the way to clustering all documents with very high accuracy.
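
The grouping by agreement is simple to express, assuming each of the three algorithms returns one label per document (agreement_groups is an illustrative name):

    def agreement_groups(labels_nb, labels_po, labels_rf):
        # Partition document indices by how many of the three classifiers agree.
        full, partial, none = [], [], []
        for i, (a, b, c) in enumerate(zip(labels_nb, labels_po, labels_rf)):
            distinct = len({a, b, c})
            if distinct == 1:
                full.append(i)     # all three agree
            elif distinct == 2:
                partial.append(i)  # two agree; the majority label can be kept
            else:
                none.append(i)     # all differ: candidates for user inspection
        return full, partial, none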