~ Semantic Search Art ~

k-means test in C#


k-means algorithm involve two iterative steps: assignment and updating. We can define problem as clustering of large set of same size vectors. In categorization of documents these vectors are rows in document/term matrix. Solution starts from making node points called centroids, number of which is equal to the number of presumed categories. On assignment step each testing vector is compared to each centroid and obtains the label of the closest centroid. When all vectors are labeled the centroids are recalculated, which is called updating step. Centroids are simple mean points for the group of labeled vectors. The choice of proximity measure is optional. I chose the cosine similarity. Some technical papers say that initial centroids can be chosen freely, I, however, applied Hierarchical Agglomerative Clustering (HAC) at the first step. HAC continues until first several clusters are made and then program switch to assignment/updating steps. Since the goal was to test accuracy, the performance is not optimized for this test project but it is quick enough non-the-less. The accuracy for PATENTCORPUS128 is about 50 percent. HAC, for comparison, shows 84 percent accuracy for the same data, so switch to k-means reduce precision significantly. k-means is still better than Latent Dirichlet Allocation for the same data.