~ Semantic Search Art ~

C# DEMO of Support Vector Machine (libsvm library)
SVM test project (formatted data preparation)

Support Vector Machine

Support Vector Machine is very well explained in many papers. There is no need to repeat algorithm here. There are also libraries. I found one of them libsvm very convenient. It includes one library (libsvm) and two executables (svm_predict and svm_train). I incorporated them into one Visual Studio solution, the structure of which is shown in the image below.

svm_train.exe is used for training program to recognize category. The training data set is provided in text file TRAINSET. The file format represents data matrix as follows:
+1 1:8 2:1 3:1
+1 1:5 2:2 
-1 4:1 5:8
-1 3:1 4:1 5:5
Each row starts from either of two categories denoted as +1 or -1. Only non-zero elements are shown in the row as ELEMENT:VALUE pairs. The program svm_train.exe takes name of the TRAINSET as input parameter and returns TRAINSET.model file. The other program svm_predict.exe performs categorization. It needs TRAINSET.model and TESTSET data as matrix
3: 1:5 2:1 3:1
2: 1:4 2:1
2: 4:1 5:7
3: 3:1 4:1 5:3
2: 1:7 2:2
where each row contains only non-zero elements shown as ELEMENT:VALUE pairs plus the number of elements in each row (shown at the left). The result of categorization is saved to the file, but, in original code binary writer saves some non-ASCII characters that makes result not readable, so I added clear print out to the console. The category for each row of TESTSET is identified as +1 or -1. For shown above data it is +1 +1 -1 -1 +1, which is very obvious.

In order to test text data for PATENTCORPUS128 I needed to prepare two properly formatted data files: one with training set and another with tested data. This project for processing of text data and preparation of proper formatted data can be downloaded from the second link at the top. When executed, it outputs two files with the same names TRAINSET and TESTSET. The list of files chosen for train set can be altered within the project code. In order to run SVM and see the result both file names should be passed to svm_train.exe and svm_predict.exe. The command lines are follows:

svm_train.exe TRAINSET 
svm_predict.exe TESTSET TRAINSET.model RESULT 
File TRAINSET.model is generated by svm_train.exe, file RESULT is output for SVM result. As I said, the format of RESULT is not properly set in the program, so I added output to console.

Since SVM can distinguish only two categories, I made test data out of PATENTCORPUS128 consisting of all possible couples of 8 categories, 28 sets. Each set contained 2 categories in 32 files, 16 files in each category. Two files were chosen as training data and other 30 files were used for testing. The average accuracy was 63 percent. The training data was chosen identical to one used in Naive Bayes. For Naive Bayes accuracy for the same training and testing sample was 89 percent. I have to add that Naive Bayes recognize categories in 120 files having training set of 8 files, but SVM recognize categories within each set of 32 files, having 2 of them as training set. For SVM data, the random choice returns 50 percent of correct results, so accuracy 63 percent is very low. The presumed reason for very low accuracy is large number of function words (words with low meaning) such as NEXT, PREVIOUS, RETURN, CONSIDER, ... and so on, that can be in any category. The number of words that distinguish one category from another is much smaller compared to function words. Naive Bayes use only words presented in training set and ignore others, SVM (according to the method) use all words. As I already mentioned the accuracy on any individual data corpus is not enough to judge the efficiency of the method.