Feature subset selection in text-learning
This paper describes several known and some new methods
for feature subset selection on large text data.
An experimental comparison on real-world data collected
from Web users shows that the characteristics of the problem
domain and of the machine learning algorithm
should be considered when a feature scoring measure is selected.
Our problem domain consists of hyperlinks given in the form of
small documents represented as word vectors.
In our learning experiments, a naive Bayesian classifier was used
on text data.
The best performance was achieved by the feature selection methods
based on the feature scoring measure called Odds ratio,
which is known from information retrieval.
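
As an illustration of how an Odds ratio score can drive feature subset selection, the following is a minimal sketch, not the implementation used in the experiments. It assumes a two-class problem, estimates P(W|class) as the smoothed fraction of documents in a class that contain word W, and scores each word as log[P(W|pos)(1 - P(W|neg)) / ((1 - P(W|pos)) P(W|neg))]. The function names, the add-one smoothing, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels):
    """Score each word by Odds ratio.

    docs:   list of token lists (one per small document)
    labels: list of bools, True for the positive class
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos

    # Document frequencies per class: in how many documents a word occurs.
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos_df if y else neg_df).update(set(tokens))

    scores = {}
    for w in set(pos_df) | set(neg_df):
        # Add-one smoothing (an assumption) keeps both estimates strictly
        # inside (0, 1), so the logarithm below is always defined.
        p_pos = (pos_df[w] + 1) / (n_pos + 2)
        p_neg = (neg_df[w] + 1) / (n_neg + 2)
        scores[w] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring words; ties are broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy data (hypothetical): two positive and two negative small documents.
docs = [
    ["machine", "learning"],
    ["learning", "web"],
    ["sports", "news"],
    ["news", "web"],
]
labels = [True, True, False, False]

scores = odds_ratio_scores(docs, labels)
selected = select_top_k(scores, 2)
```

Words that occur mostly in positive documents receive large positive scores, words spread evenly across both classes score near zero, and words typical of the negative class score below zero, so keeping only the top-ranked words favors features that characterize the positive class.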