~ Semantic Search Art ~

A demo comparing several of the best clustering technologies

This article reviews my one-year research into semantic search and document clustering technologies. I found several algorithms to be significantly more effective and accurate than the others:

  • Naive Bayes
  • Hierarchical Agglomerative Clustering
  • Polar Orthogonalization
  • k-means with Polar Orthogonalization
  • Random Forest with Polar Orthogonalization used in classifiers

Except for HAC, all of these methods require a training sample. If a training sample is not available, it can easily be obtained with HAC: after HAC has created several clusters, they can be examined and used as the training sample. In the demo at the top, HAC also uses the training sample. Before merging two clusters, it checks whether the clusters to be merged contain training files that were categorized differently; if so, HAC skips that merge and takes the next best pair (see the first sketch below).

Polar Orthogonalization can be used along with other methods and improves them. Random Forest and k-means are two algorithms that PO improves significantly. In Random Forest, the critical part of the algorithm is the construction of the classifier, and articles about Random Forest do not explain how to do it. In my code, the classifier is a pair of orthogonal vectors. For the group of vectors in each node, I find the two most orthogonal vectors and examine the others: I classify each remaining vector as closer to one of the two, add it to the corresponding vector, which produces two groups, then orthogonalize the two accumulated vectors and use them as the classifier for navigating further down the tree (see the second sketch below).
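Here is a minimal sketch of the merge check in Java, assuming each cluster keeps the set of training categories of the files it contains; the names Cluster, trainingCategories, and canMerge are illustrative, not the project's actual identifiers.

    import java.util.HashSet;
    import java.util.Set;

    class Cluster {
        // Categories of the training-sample files that ended up in this cluster.
        Set<String> trainingCategories = new HashSet<>();

        // HAC may merge two clusters only if the merge does not mix
        // training files from two different categories.
        static boolean canMerge(Cluster a, Cluster b) {
            Set<String> union = new HashSet<>(a.trainingCategories);
            union.addAll(b.trainingCategories);
            return union.size() <= 1;
        }
    }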
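The node classifier itself can be sketched as follows, assuming documents are equal-length double[] term vectors, closeness is measured by cosine similarity, and "most orthogonal" means the pair with the smallest absolute cosine. All names here (NodeClassifier, build, goesLeft) are assumptions made for illustration, not the project's exact code.

    import java.util.List;

    public class NodeClassifier {
        double[] pole1; // first accumulated group vector
        double[] pole2; // second group vector, orthogonalized against pole1

        static double dot(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += a[i] * b[i];
            return s;
        }

        static double cosine(double[] a, double[] b) {
            double na = Math.sqrt(dot(a, a));
            double nb = Math.sqrt(dot(b, b));
            return (na == 0.0 || nb == 0.0) ? 0.0 : dot(a, b) / (na * nb);
        }

        // Build the classifier for one tree node from the vectors that
        // reached it; expects at least two vectors.
        static NodeClassifier build(List<double[]> vectors) {
            // 1. Find the two most orthogonal vectors (smallest absolute cosine).
            int bestI = 0, bestJ = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < vectors.size(); i++) {
                for (int j = i + 1; j < vectors.size(); j++) {
                    double c = Math.abs(cosine(vectors.get(i), vectors.get(j)));
                    if (c < best) { best = c; bestI = i; bestJ = j; }
                }
            }
            double[] g1 = vectors.get(bestI).clone();
            double[] g2 = vectors.get(bestJ).clone();

            // 2. Assign every other vector to the closer of the two and accumulate.
            for (int k = 0; k < vectors.size(); k++) {
                if (k == bestI || k == bestJ) continue;
                double[] v = vectors.get(k);
                double[] target = cosine(v, g1) >= cosine(v, g2) ? g1 : g2;
                for (int d = 0; d < v.length; d++) target[d] += v[d];
            }

            // 3. One Gram-Schmidt step: make g2 orthogonal to g1.
            double coef = dot(g2, g1) / dot(g1, g1);
            for (int d = 0; d < g2.length; d++) g2[d] -= coef * g1[d];

            NodeClassifier nc = new NodeClassifier();
            nc.pole1 = g1;
            nc.pole2 = g2;
            return nc;
        }

        // Route a document down the tree: left if closer to pole1, right otherwise.
        boolean goesLeft(double[] v) {
            return cosine(v, pole1) >= cosine(v, pole2);
        }
    }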

My remarks should be clear to those who are familiar with the algorithms mentioned in this review. Detailed descriptions and code samples can be found on this site, along with references to other articles.

The project at the top can be used with any data corpus, but different data requires different training samples and different expected results for estimating accuracy. It is recommended to run it first on PATENTCORPUS128; for this corpus, no code changes are required other than passing the correct data location to the program. The code is not optimized for quick processing of several million documents; it is designed to demonstrate the concepts and compare the algorithms. Interestingly, disabling stemming degrades HAC, PO, and k-means but improves NB and RF. Disabling PO in k-means and RF still allows them to run but reduces accuracy by 25%. Stemming is enabled or disabled by a flag passed to the function WordProcessing.processFiles.

The program also shows how different technologies may be used for simple voting. All five technologies cluster the data with slight differences, at accuracies between 65 and 90 percent. Applying simple voting improves the result by only one or two percent, which is somewhat disappointing and clearly shows that adding a new but inaccurate technique may not improve the end result.
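Simple voting amounts to taking the majority label across the five methods, assuming their cluster labels have already been mapped to a common label space. A minimal sketch, with all names and example labels illustrative:

    import java.util.HashMap;
    import java.util.Map;

    public class SimpleVoting {
        // Return the label chosen by the most methods.
        static String majorityVote(String[] labels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String label : labels) counts.merge(label, 1, Integer::sum);
            String winner = labels[0];
            int best = 0;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > best) { best = e.getValue(); winner = e.getKey(); }
            }
            return winner;
        }

        public static void main(String[] args) {
            // Hypothetical votes from NB, HAC, PO, k-means+PO, RF+PO:
            String[] votes = { "chemistry", "chemistry", "software", "chemistry", "software" };
            System.out.println(majorityVote(votes)); // prints "chemistry"
        }
    }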

Having multiple technologies can, however, be used to improve accuracy. When different techniques disagree on the clustering of a particular document, that document can be examined manually and added to the training set; repeating this step quickly improves clustering. The user spots a document classified differently by different techniques, examines it manually, adds it to the training sample, and reruns the program (a sketch of the disagreement check follows below). This strategy is widely used (Recommind predictive coding, for example).
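A sketch of the disagreement check, assuming the assignments are collected into a matrix with one row per document and one column per method; the layout and names are assumptions for illustration.

    import java.util.ArrayList;
    import java.util.List;

    public class DisagreementFinder {
        // labels[doc][method] is the cluster assigned to document doc by each method.
        static List<Integer> findDisagreements(String[][] labels) {
            List<Integer> candidates = new ArrayList<>();
            for (int doc = 0; doc < labels.length; doc++) {
                for (int m = 1; m < labels[doc].length; m++) {
                    if (!labels[doc][0].equals(labels[doc][m])) {
                        candidates.add(doc); // at least two methods disagree here
                        break;
                    }
                }
            }
            return candidates;
        }
    }

The returned document indices are the candidates for manual review before the next rerun.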