----------------------------------------

PhD thesis project page: Machine Learning on non-homogeneous, distributed text data

----------------------------------------

Here is the postscript version of the whole thesis in English (compressed) and in Slovene (compressed). Hyperlinks to the separate chapters (compressed postscript) are given below in the overview.

The PhD thesis defense took place on October 8, 1998 at the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.
Here are some photos from the event!

[Photos: lunch, waiting to start, waiting..., introduction, talking, listening, more talking..., questions, answers, announcement of the PhD, congratulations, happy end :), the next day]
----------------------------------------

Overview (Table of contents and Literature)

This dissertation proposes new elements of machine learning methods where the corresponding learning problem is characterized by a high number of features (several tens of thousands), an unbalanced class distribution (less than 1%-10% of examples belong to the target class value) and asymmetric misclassification costs. By asymmetric misclassification costs we mean that one of the class values is the target class value for which we want to get predictions, and we prefer false positives over false negatives. The input is given as a set of text documents or their Web addresses (URLs). The induced target concept is appropriate for the classification of new documents, including shortened documents describing individual hyperlinks. (Introduction and Experimental methods and materials)
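As a hedged illustration of this preference, the Python sketch below lowers the decision threshold for the target class according to assumed misclassification costs; the cost values and the function are illustrative assumptions, not taken from the thesis.

    # Illustrative only: preferring false positives over false negatives
    # by lowering the decision threshold below 0.5. The costs are assumed.

    def classify(p_target, cost_fn=5.0, cost_fp=1.0):
        # Predict the target class when the expected cost of predicting
        # "other" (p_target * cost_fn) exceeds that of predicting
        # "target" ((1 - p_target) * cost_fp).
        threshold = cost_fp / (cost_fp + cost_fn)  # about 0.17 here
        return "target" if p_target >= threshold else "other"

    print(classify(0.3))  # -> 'target', accepting more false positives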

We propose a document representation that extends the bag-of-words representation by adding word sequences. An experimental comparison of the extended document representation with the commonly used bag-of-words representation has confirmed its usefulness. Based on the experiment on data obtained from the Yahoo hierarchy of Web documents, we suggest using word sequences of up to two words. (Document representation and learning algorithms)
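A minimal Python sketch of such a representation is given below; it counts single words together with word sequences of up to two words. The whitespace tokenization is an illustrative assumption, not the thesis implementation.

    from collections import Counter

    def extended_bag_of_words(text, max_sequence_length=2):
        # Bag-of-words extended with word sequences up to the given length.
        words = text.lower().split()
        features = Counter()
        for n in range(1, max_sequence_length + 1):
            for i in range(len(words) - n + 1):
                features[" ".join(words[i:i + n])] += 1
        return features

    doc = "machine learning on text data and text mining"
    print(extended_bag_of_words(doc))
    # single words plus two-word sequences, e.g. 'text': 2, 'text data': 1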

An approach to automatic document categorization based on a large categorization hierarchy is proposed, where a high number of class values, examples and features is handled by (1) dividing the problem into subproblems based on the hierarchical structure of class values and examples, (2) applying feature subset selection and (3) considering only promising categories during the classification. Pruning unpromising categories during the classification is based on the minimal number of highly scored features shared by the subproblem classifier and the testing example. This pruning either improves or has no significant influence on the classification results while pruning 85%-95% of categories. When learning from the categorization hierarchy, on different domains the correct category is among the twenty categories with the highest predicted probability out of several thousand categories. When assigning keywords to a document based on the category prediction, we get about 80% of the correct keywords (Recall is 0.80), while about 50% of all the predicted keywords are correct (Precision is 0.50). The models for category prediction and keyword assignment can be used as background knowledge giving higher-level information about the document content. (Learning from class hierarchy)
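The following Python sketch illustrates only the pruning idea under made-up data: a category is kept for classification when its classifier shares at least a minimal number of highly scored features with the testing example. The feature lists and the threshold are hypothetical, not taken from the thesis.

    def promising_categories(doc_words, category_top_features, min_shared=2):
        # Keep only categories whose highly scored features overlap
        # the testing example in at least min_shared features.
        doc_set = set(doc_words)
        return [cat for cat, top in category_top_features.items()
                if len(doc_set & set(top)) >= min_shared]

    category_top_features = {
        "ai":     ["learning", "neural", "agent", "search"],
        "sports": ["game", "team", "score", "season"],
        "music":  ["album", "band", "guitar", "concert"],
    }
    doc = ["machine", "learning", "search", "agent"]
    print(promising_categories(doc, category_top_features))  # -> ['ai']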

An experimental comparison and analysis of different feature scoring measures on text data has shown the importance of considering problem specifics and learning algorithm characteristics during feature subset selection. Instead of incorporating them using the computationally intensive `wrapper approach' to feature subset selection, we carefully select the feature scoring measure and use a rather simple feature subset selection approach. Our experiments show that in general the most important characteristics of a good feature scoring measure for text are favoring common features and considering characteristics of the learning algorithm. In the case of domains where one of the class values is the target class value, the most important characteristic of a good feature scoring measure is to discriminate the target class value from the other class values. The best performing measures are Odds ratio (and its variants), which discriminate the target class value from the other class values. Very good performance is also achieved by Cross entropy for text and Term frequency, which both favor common features. The best results are achieved when only a small number of features is used: the 50-100 best features, or in other words only 0.2%-5% of all features. (Feature subset selection)
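As a sketch, one common form of the Odds ratio score for a word compares its class-conditional probabilities; the Laplace-style smoothing below is an illustrative assumption, not necessarily the variant used in the thesis.

    import math

    def odds_ratio(pos_docs_with_word, n_pos, neg_docs_with_word, n_neg):
        # Score of a word for the target (positive) class value:
        # log[ P(w|pos) * (1 - P(w|neg)) / ((1 - P(w|pos)) * P(w|neg)) ]
        p_pos = (pos_docs_with_word + 1) / (n_pos + 2)
        p_neg = (neg_docs_with_word + 1) / (n_neg + 2)
        return math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))

    # A word frequent in target-class documents but rare elsewhere scores
    # high and would be kept among the 50-100 selected features.
    print(odds_ratio(pos_docs_with_word=40, n_pos=50,
                     neg_docs_with_word=100, n_neg=5000))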

Preliminary experiments on predicting word occurrences by learning mutually dependent class attributes have shown that in many cases the hyperlink description can be successfully used to predict part of the content of the document the hyperlink points to. Using the proposed approach we successfully get about 35% to 45% of the document words (Recall is 0.35 to 0.45), while among all the predicted words about 20% are correct (Precision is 0.2). In this way, human knowledge and associations captured on the Web in the text of hyperlinks can be modeled and used for learning on hypertext. (Predicting word occurrences) In addition to the research motivation, this work is also motivated by the potential applications of the developed methods. (Conclusions, discussion and further work)
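A hedged Python sketch of the underlying idea, with made-up data: associate hyperlink (anchor) words with words of the documents they point to, then predict the most strongly associated words for a new hyperlink. The simple co-occurrence counting below only stands in for the mutually dependent class attributes learned in the thesis.

    from collections import defaultdict

    training_pairs = [  # (hyperlink text, words of the linked document)
        ("machine learning papers", {"learning", "algorithm", "paper"}),
        ("learning resources", {"learning", "course", "tutorial"}),
    ]

    cooccurrence = defaultdict(lambda: defaultdict(int))
    for anchor, doc_words in training_pairs:
        for a in anchor.split():
            for w in doc_words:
                cooccurrence[a][w] += 1

    def predict_words(anchor, top_k=3):
        # Rank document words by how often they co-occurred with the
        # words of the given hyperlink text in the training pairs.
        scores = defaultdict(int)
        for a in anchor.split():
            for w, count in cooccurrence[a].items():
                scores[w] += count
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    print(predict_words("learning tutorial"))  # e.g. ['learning', ...]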

----------------------------------------

Thesis advisors

----------------------------------------

This project is strongly related to the Personal WebWatcher project and the Yahoo Planet project.

----------------------------------------

Dunja Mladenic
Last modified: Wed Jul 21 14:38:33 METDST