~ Semantic Search Art ~

n-grams test in C#

n-grams

The n-gram concept only affects the way the document/term matrix is created. Although the concept is simple and clear, the details can differ considerably between implementations. This test is close to the trigrams-plus-k-means approach described in an article that I found. Trigrams here are all possible sequences of 3 consecutive symbols that remain after stop-word filtering and stemming. In this test the spaces between words are also removed. The test suite can be downloaded from the link above.

I can show how trigrams are constructed using the following text fragment as an example: FIELD OF INVENTION.

The fragment is passed through a stop-word filter, which removes the word OF, and then through a stemmer, which converts the word INVENTION into INVENT. The space is then removed, giving FIELDINVENT, and the expression is split into the trigrams FIE, IEL, ELD, LDI, DIN, INV, NVE, VEN, ENT.

Obviously, the number of trigrams in a file is significantly smaller than the theoretically possible number. It is typically between 1000 and 7000. After each file is processed, we have its statistical data as a vector, which is recorded in the document/term matrix. When the matrix is ready, we apply the k-means algorithm. The accuracy of the result is 53 percent. When k-means was applied to words, the accuracy was 50 percent. Presumably, this slight improvement is due to the implicit use of word order, because some trigrams contain symbols from two adjacent words.
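The steps described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the C# of the test suite, and it is not the code from that suite. The stop-word list `STOP_WORDS`, the suffix-stripping `toy_stem` function, and the helper names `trigrams` and `document_term_matrix` are all assumptions made for illustration; a real run would use a full stop-word list and a proper stemmer.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"OF", "THE", "A", "AN", "AND"}


def toy_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer (assumption)."""
    for suffix in ("ION", "ING", "ED", "S"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def trigrams(text):
    """Filter stop words, stem, drop spaces, then slide a 3-symbol window."""
    words = [w for w in re.findall(r"[A-Z]+", text.upper()) if w not in STOP_WORDS]
    joined = "".join(toy_stem(w) for w in words)
    return [joined[i:i + 3] for i in range(len(joined) - 2)]


def document_term_matrix(documents):
    """Return the sorted trigram vocabulary and one count vector per document."""
    counts = [Counter(trigrams(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocabulary] for c in counts]
    return vocabulary, matrix


if __name__ == "__main__":
    print(trigrams("FIELD OF INVENTION"))
    vocabulary, matrix = document_term_matrix(["FIELD OF INVENTION", "FIELD"])
    print(vocabulary)
    print(matrix)
```

Each row of the resulting matrix is the count vector for one document, and these rows are what a k-means routine would then cluster. Because the window slides across the joined string, trigrams such as LDI and DIN span two adjacent words, which is the implicit word-order information mentioned above.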