Archive for the ‘Text’ Category

Hub Miner Development

April 3rd, 2015 Comments off

Hub Miner (https://github.com/datapoet/hubminer) has been significantly improved since its initial release. It now has full OpenML support for networked classification experiments and a detailed user manual covering all common use cases.

There have also been many new method implementations, especially for data filtering, reduction and outlier detection.

I have ambitious implementation plans for future versions.
If you would like to join the project as a contributor, let me know!

While I am still dedicated to the project, I have had somewhat less time since joining Google in January 2015, so I have decided to open the project up to new contributors who can help make it an awesome machine learning library.

I am also interested in developing Python/R/Julia/C++ implementations of hubness-aware approaches, so feel free to ping me if you would be interested in that as well.

PhD thesis: The Role of Hubness in High-dimensional Data Analysis

November 24th, 2013 Comments off

On December 18th, 2013, I am scheduled to present my PhD thesis, titled “The Role of Hubness in High-dimensional Data Analysis”.

The thesis discusses the issues involving similarity-based inference in intrinsically high-dimensional data and the consequences of emerging hub points. It integrates the work presented in my journal and conference papers, and proposes and discusses novel techniques for designing nearest-neighbor-based learning models in many dimensions. Lastly, it mentions potential practical applications and promising future research directions.

I would like to thank everyone who gave me advice and helped in shaping this thesis.

The full text of the thesis is available here.

Learning under Class Imbalance in High-dimensional Data: a new journal paper

August 30th, 2013 Comments off

I am pleased to say that our paper titled ‘Class Imbalance and The Curse of Minority Hubs’ has just been accepted for publication in Knowledge-Based Systems (IF 4.1 (2012)). The research presented in the paper is one of the pillars of my PhD thesis, so I am happy to have received more quality feedback from the reviews and to have further improved the paper during the process.

In the paper, we examine a novel aspect of the well-known curse of dimensionality, one that we have named ‘The Curse of Minority Hubs’. It has to do with learning under class imbalance in high-dimensional data. Class-imbalanced problems are known to pose great difficulties for standard machine learning approaches, just as a high number of features and data sparsity pose problems of their own. Surprisingly, these two phenomena have not often been considered simultaneously. In our analysis, we have focused on the high-dimensional phenomenon of hubness: the skewness of the distribution of influence/relevance in similarity-based models. Hubs emerge as centers of influence within the data, and a small number of points determines most properties of the model. In the case of classification, a small number of points is responsible for most correct or incorrect classification decisions. The points that cause many label mismatches in k-nearest neighbor sets are called ‘bad hubs’, while the others are referred to as ‘good hubs’.
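For readers less familiar with these quantities, here is a minimal sketch (not taken from the paper or from Hub Miner) of how k-occurrence counts and bad occurrences can be computed, assuming numpy and scikit-learn are available:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hubness_scores(X, y, k=5):
    # Ask for k + 1 neighbors, because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neighbors = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self

    n = X.shape[0]
    occurrences = np.zeros(n, dtype=int)      # N_k(x): how often x appears in kNN sets
    bad_occurrences = np.zeros(n, dtype=int)  # BN_k(x): occurrences with a label mismatch
    for i in range(n):
        for j in neighbors[i]:
            occurrences[j] += 1
            if y[j] != y[i]:
                bad_occurrences[j] += 1
    return occurrences, bad_occurrences

# Points whose occurrence count greatly exceeds k act as hubs; those with many
# bad occurrences are the 'bad hubs' discussed above.
```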

It just so happens that points belonging to the minority classes have a tendency to become prominent bad hubs in high-dimensional data, hence the phrase ‘The Curse of Minority Hubs’. This is surprising for several reasons. First of all, it is exactly the opposite of the standard assumption in class-imbalanced problems: that most misclassification is caused by the majority class, due to an average relative difference in density in the borderline regions. In low- or medium-dimensional data, this is indeed the case. Therefore, most machine learning methods tailored for class-imbalanced data try to improve the classification of the minority class points by penalizing the majority votes.

However, it seems that, in high-dimensional data, minority hubs often induce a disproportionately large percentage of all misclassifications. Standard methods for class-imbalanced classification therefore face great difficulties when learning in many dimensions, as their base premise is turned upside down.

In our paper, we take a closer look at the related phenomena and also propose that hubness-aware kNN classification methods could be used in conjunction with other class-imbalanced learning strategies in order to alleviate the arising difficulties.
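To make the idea of hubness-aware kNN classification a bit more concrete, here is a rough sketch of one such weighting scheme, in the spirit of the published hw-kNN approach (votes are down-weighted by standardized bad hubness). It is illustrative only and not necessarily one of the exact methods evaluated in the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hw_knn_predict(X_train, y_train, X_test, k=5):
    # Count bad occurrences (label mismatches in kNN sets) on the training data
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
    train_nbrs = nn.kneighbors(X_train, return_distance=False)[:, 1:]  # drop self
    bad = np.zeros(len(X_train))
    for i, row in enumerate(train_nbrs):
        for j in row:
            if y_train[j] != y_train[i]:
                bad[j] += 1

    # Standardize bad hubness and turn it into a vote weight:
    # points that cause many label mismatches get exponentially smaller votes
    h_b = (bad - bad.mean()) / (bad.std() + 1e-12)
    vote_weight = np.exp(-h_b)

    classes = np.unique(y_train)
    test_nbrs = nn.kneighbors(X_test, n_neighbors=k, return_distance=False)
    preds = []
    for row in test_nbrs:
        votes = {c: 0.0 for c in classes}
        for j in row:
            votes[y_train[j]] += vote_weight[j]
        preds.append(max(votes, key=votes.get))
    return np.array(preds)
```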

If you are working in one of these areas and feel that this might be relevant for your work, you can have a look at the paper here.

It seems that Knowledge-Based Systems also encourages authors to associate a short (less than 5 min.) presentation with their papers, in order to clarify the main points. This is a cool feature, and we have added a few slides with brief explanations and suggestions.

Improving the semantic representations for cross-lingual document retrieval

May 4th, 2013 Comments off

I have had the pleasure of presenting some of our recent results at PAKDD 2013 in Gold Coast, Australia. The conference was great and the location couldn’t have been better, so we were able to catch some sun and walk along the beaches while discussing future collaboration, theory and applications.

Hubs are known to be centers of influence and are known to arise in textual data. They are also known to cause problems by frequently appearing as neighbors of (i.e., being very similar to) semantically different types of documents. However, it was previously unknown whether this property is language-dependent and how it affects cross-lingual information retrieval.

What we have shown by analyzing aligned text corpora can be summarized as follows: hubs in one language are not necessarily hubs in another language; different documents become influential. However, surprisingly, the percentage of label mismatches in reverse neighbor sets remains more or less unchanged. In other words, the nature of occurrences is preserved across languages. This comes as a bit of a surprise, since hubness is arguably a geometric property arising from the interplay of metrics and data representations. Yet it seems that more semantics than was previously thought remains hidden there, captured and preserved across different languages.
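As an illustration of how such a comparison could be set up, here is a small self-contained sketch: given two aligned vector representations of the same labeled documents (one per language), it measures how much the strongest hubs overlap and what fraction of k-occurrences are label mismatches in each view. The function and parameter names are hypothetical, not taken from the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def occurrence_stats(X, y, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    occ = np.zeros(len(X))
    bad = np.zeros(len(X))
    for i, row in enumerate(idx):
        for j in row:
            occ[j] += 1
            bad[j] += (y[j] != y[i])
    return occ, bad

def compare_aligned_views(X_lang1, X_lang2, y, k=5, top=50):
    occ1, bad1 = occurrence_stats(X_lang1, y, k)
    occ2, bad2 = occurrence_stats(X_lang2, y, k)
    # Do the same documents act as hubs in both languages?
    hubs1 = set(np.argsort(occ1)[-top:].tolist())
    hubs2 = set(np.argsort(occ2)[-top:].tolist())
    hub_overlap = len(hubs1 & hubs2) / top
    # Does the share of label mismatches among all occurrences change much?
    return hub_overlap, bad1.sum() / occ1.sum(), bad2.sum() / occ2.sum()
```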

We have used this observation to show that it is possible to improve the common semantic representation built via CCA (canonical correlation analysis) by simply introducing hubness-aware instance weights. This is certainly not the only way to go about it, and probably not the best one, but it served as a good proof of concept.
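As a proof-of-concept sketch only (the weighting formula below is an assumption chosen for illustration, not the exact scheme from the paper), one could derive per-document weights from bad hubness and fit scikit-learn's CCA on row-weighted views:

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.neighbors import NearestNeighbors

def goodness_weights(X, y, k=5):
    # Down-weight documents that attract many label mismatches in kNN sets
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    bad = np.zeros(len(X))
    for i, row in enumerate(idx):
        for j in row:
            if y[j] != y[i]:
                bad[j] += 1
    return 1.0 / (1.0 + bad)  # assumed weighting formula, for illustration only

def weighted_cca(X_lang1, X_lang2, y, n_components=2, k=5):
    # Combine per-view weights and apply them as row weights to both (dense) views
    w = goodness_weights(X_lang1, y, k) * goodness_weights(X_lang2, y, k)
    sw = np.sqrt(w)[:, None]
    cca = CCA(n_components=n_components)
    cca.fit(X_lang1 * sw, X_lang2 * sw)  # learn the common subspace on weighted data
    return cca
```

Scaling both views by the square roots of the weights simply makes documents that act as strong bad hubs contribute less to the fitted canonical directions.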

The entire paper can be found here: The Role of Hubs in Cross-lingual Supervised Document Retrieval

Categories: Application, Data Mining, Hubness, Text