~ Semantic Search Art ~

Testing Strategy

The rule of thumb for testing semantic search and document clustering software:

The software must be tested by someone the developer has never met, on data the developer has never seen.

My own experiments in developing and testing available semantic search technologies frequently showed results opposite to those reported on-line so far. The biggest surprise was that stemming may actually reduce accuracy in document clustering (on average by 5%), while shrinking the dictionary by only 30%. To date I have not seen any other article reporting this very strange phenomenon. There is no need to take my word for it, because it can be verified quickly and easily. The code fragment performing stemming in HAC or NB can be located and commented out:

// Snowball stemmer call inside HAC or NB; comment out these
// lines to disable stemming.
englishStemmer.SetCurrent(strWord);
if (englishStemmer.Stem())
{
	strWord = englishStemmer.GetCurrent();
}
Enabling and disabling stemming changes accuracy in an unpredictable way. This was tested only on English texts and may work differently for other languages. The QLANGO program can output the dictionary, so the effect of enabling or disabling stemming can be verified.
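The dictionary-size effect is easy to reproduce in miniature. The sketch below uses a deliberately crude suffix stripper, not the Snowball stemmer from the fragment above, only to show how stemming collapses several word forms into one dictionary entry:

```python
def naive_stem(word):
    # Crude illustration only: strip a few common English suffixes.
    # Real stemmers (Porter, Snowball) apply many more rules.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["cluster", "clusters", "clustering", "clustered",
         "search", "searches", "searching"]

raw_dictionary = sorted(set(words))
stemmed_dictionary = sorted(set(naive_stem(w) for w in words))

# The raw dictionary has 7 entries; after stemming only 2 remain
# ("cluster" and "search").
print(len(raw_dictionary), len(stemmed_dictionary))
```

The smaller dictionary is not automatically better: distinct forms such as "clustering" and "clustered" may carry different clustering signal, which is one plausible reason disabling stemming can raise accuracy.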

While trying to make implementations of known algorithms more effective, I discovered many options that can raise accuracy by up to 20% for particular data while, at the same time, reducing accuracy for other data. So I have serious doubts about research that compares different technologies and reports differences of a fraction of a percent: it simply cannot be determined with such precision. For example, when merging files in HAC, NB or k-means, we need to add the counts for the same words occurring in two different files. Instead of adding we can, however, take the average or the maximum. Any of these options may affect the result in an unpredictable way. Adding or removing stop words from the default list can affect accuracy by 10%, and using tf-idf opens even better opportunities for manipulating the result. Accuracy figures reported by developers on published data corpora should simply be ignored. To evaluate accuracy and performance, each user should assemble an individual data set that was never seen by the developer.
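The merge choices mentioned above can be sketched as follows. This is an illustration of the three options (sum, average, maximum), not code taken from QLANGO; the function name and the sample documents are invented:

```python
def merge_counts(a, b, mode="sum"):
    # a, b: word -> count dictionaries for two files or clusters.
    merged = {}
    for word in set(a) | set(b):
        x, y = a.get(word, 0), b.get(word, 0)
        if mode == "sum":
            merged[word] = x + y
        elif mode == "avg":
            merged[word] = (x + y) / 2
        elif mode == "max":
            merged[word] = max(x, y)
    return merged

doc1 = {"engine": 4, "piston": 1}
doc2 = {"engine": 2, "valve": 3}

print(merge_counts(doc1, doc2, "sum"))  # "engine" becomes 6
print(merge_counts(doc1, doc2, "max"))  # "engine" becomes 4
```

All three variants produce a valid merged vector, yet they rank subsequent merge candidates differently, which is exactly why two honest implementations of "the same" algorithm can disagree by several percent.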

QLANGO is a good candidate for evaluating and verifying claims of technical progress in semantic search and document clustering. The user needs to find data that was never published on-line, for which the correct clustering is established manually, and then compare the results of QLANGO and the tested software. QLANGO is built on algorithms that have been known for the last 30 years, and the core code constituting those algorithms is only about 1000 lines, so it can be seen immediately whether the claimed advantage of the tested technology is real.
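One common way to score such a comparison against a manually labelled data set is cluster purity: the fraction of documents that fall into their cluster's majority category. This is a hedged sketch of that comparison step; QLANGO's own scoring may differ, and the labels below are invented:

```python
from collections import Counter

def purity(clusters, true_labels):
    # clusters: list of lists of document ids produced by the software.
    # true_labels: document id -> manually assigned category.
    correct = 0
    total = 0
    for cluster in clusters:
        label_counts = Counter(true_labels[doc] for doc in cluster)
        # Credit the documents matching the cluster's majority label.
        correct += label_counts.most_common(1)[0][1]
        total += len(cluster)
    return correct / total

labels = {0: "sport", 1: "sport", 2: "news", 3: "news", 4: "news"}
clusters = [[0, 1, 2], [3, 4]]
print(purity(clusters, labels))  # 4 of 5 documents match: 0.8
```

Running the same metric on QLANGO's output and on the tested software's output, over data the developers never saw, gives the apples-to-apples comparison the article argues for.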