Bag-Of-Words Binary SVM Classification

The utility learns a binary Support Vector Machine (SVM) classifier from the input file ("-i") for classifying documents into one category ("-cat"). It produces a model ("-o") in the Bag-Of-Words format ".bowmd". Both positive and negative examples are needed for learning. Input vectors can be weighted ("-w") using different weighting schemes.

The parameter "-c" sets the value of the SVM cost parameter, which must be greater than 0. The cost parameter can be weighted differently for positive and negative examples with the parameter "-j" (C+ = jC, C- = C). The parameter "-t" selects the kernel used for learning:

  0 - linear kernel (much faster than the others)
  1 - polynomial kernel k(x, y) = (s x^T y + c)^p
  2 - radial kernel k(x, y) = exp(-gamma ||x - y||^2)
  3 - sigmoid kernel k(x, y) = tanh(s x^T y + c)

The parameters "-ker_p", "-ker_s", "-ker_c" and "-ker_gamma" set the parameters of the non-linear kernels.
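
As a minimal sketch, the four kernels listed above can be written in Python as follows. The function names are illustrative and are not part of the utility. The default values mirror the documented defaults of "-ker_p", "-ker_s", "-ker_c" and "-ker_gamma"; that the sigmoid kernel reuses "-ker_s" and "-ker_c" is an assumption.

```python
import math


def dot(x, y):
    """Inner product x^T y of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # -t:0
    return dot(x, y)


def polynomial_kernel(x, y, s=1.0, c=1.0, p=3):
    # -t:1, k(x, y) = (s x^T y + c)^p
    return (s * dot(x, y) + c) ** p


def radial_kernel(x, y, gamma=1.0):
    # -t:2, k(x, y) = exp(-gamma ||x - y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, s=1.0, c=1.0):
    # -t:3, k(x, y) = tanh(s x^T y + c)
    # Assumption: s and c are shared with the polynomial kernel.
    return math.tanh(s * dot(x, y) + c)
```

The linear kernel needs only one inner product per pair of documents, which is consistent with the note that it is much faster than the others.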

The parameter "-cachesize" sets the size of the cache (in MB) that a non-linear SVM can use for caching evaluated kernel functions. The parameter "-time" sets the maximum time in seconds allowed for learning the classifier. The parameter "-v" sets the verbosity during learning. The parameter "-subsize" sets the size of the sub-problems used by the learning algorithm (-1 lets the classifier decide). The parameter "-ter" sets the termination criterion; increasing it makes learning faster, but the resulting classifier is less accurate. The parameter "-shrink" determines whether support vectors are predicted while learning. Using this option can increase learning time dramatically.
The parameter "-td" applies to the Reuters21578 dataset. It determines which documents from the ModApte split of this dataset are used for learning (0 - all, 1 - train, 2 - test).

usage: BowTrainBinSVM.exe
-i:Input-BagOfWords-FileName (default:'')
-o:Output-Binary-SVM-Model-FileName (default:'')
-cat:Category-Name (default:'')
-td:Training-Documents (0 - all, 1 - train, 2 - test) (default:0)
-w:Weighting (none, norm, bin, tfidf) (default:'tfidf')
-c:Cost-Parameter (default:1.0)
-j:Weight-for-Cost-Parameter-for-Relevant-Documents (default:1)
-t:SVM-Type: 0-linear, 1-polynomial, 2-radial, 3-sigmoid (default:0)
-ker_p:Degree-of-Polynomail-Kernel (default:3)
-ker_s:Linear-Part-in-Polynomial-Kernel (default:1)
-ker_c:Constant-Part-in-Polynomail-Kernel (default:1)
-ker_gamma:Gamma-for-Radial-Kernel (default:1)
-cachesize:Memory-Cache-Size (default:50)
-time:Upper-Time-Limit (default:-1)
-v:Verbosity (default:0)
-subsize:Subproblem-Size (default:-1)
-ter:Terminating-Condition (default:0.001)
-shrink:Shrinking (default:'T')

Example 1:
BowTrainBinSVM.exe -i:Reuters21578.Bow -w:tfidf -cat:corn -c:2.0 -j:4.0 -td:1

The above example learns a linear SVM classifier for the category corn using the documents from Reuters21578 tagged as training documents. The cost parameter is set to 2.0 for negative examples and to 8.0 (2.0*4.0) for positive examples. This is done using the parameter "-j". The model is saved into the file reuters21578.BowMd.
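
The per-class costs in Example 1 follow from the relation C+ = jC, C- = C. A small Python sketch of that arithmetic is shown here; the helper name class_costs is hypothetical and is not part of the utility.

```python
def class_costs(c, j=1.0):
    """Return (C+, C-) for cost parameter c and weight j: C+ = j*C, C- = C."""
    if c <= 0:
        raise ValueError("cost parameter must be greater than 0")
    return j * c, c


# Values from Example 1: -c:2.0 -j:4.0
c_pos, c_neg = class_costs(2.0, 4.0)
print(c_pos, c_neg)  # prints: 8.0 2.0
```

With the default "-j" of 1, both classes receive the same cost.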

Example 2:
BowTrainBinSVM.exe -i:Reuters21578.Bow -w:tfidf -cat:corn -t:2 -ker_gamma:2 -td:1 -cachesize:100

The above example learns an SVM classifier with the RBF kernel for the category corn using the documents from Reuters21578 tagged as training documents. The gamma parameter of the radial kernel is set to 2.0, and 100 MB of memory is used for caching calculated kernels. The model is saved into the file reuters21578.BowMd.