
Learning under Class Imbalance in High-dimensional Data: a new journal paper

August 30th, 2013

I am pleased to say that our paper titled ‘Class Imbalance and The Curse of Minority Hubs’ has just been accepted for publication in Knowledge-Based Systems (IF 4.1 (2012)). The research presented in the paper is one of the pillars of my PhD thesis, so I am happy to have received more quality feedback through the reviews and to have further improved the paper in the process.

In the paper, we examine a novel aspect of the well-known curse of dimensionality, one that we have named ‘The Curse of Minority Hubs’. It concerns learning under class imbalance in high-dimensional data. Class-imbalanced problems are known to pose great difficulties for standard machine learning approaches, just as a high number of features and data sparsity are known to pose problems of their own. Surprisingly, these two phenomena have rarely been considered simultaneously. In our analysis, we focus on the high-dimensional phenomenon of hubness: the skewness of the distribution of influence/relevance in similarity-based models. Hubs emerge as centers of influence within the data, and a small number of points determines most properties of the model. In the case of classification, a small number of points is responsible for most correct or incorrect classification decisions. The points that cause many label mismatches in k-nearest neighbor sets are called ‘bad hubs’, while the others are referred to as ‘good hubs’.
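To make the notion of hubness concrete, here is a minimal sketch (in Python with NumPy; the function name and the synthetic data are my own illustration, not code from the paper) that computes the k-occurrence counts N_k and the bad k-occurrence counts BN_k on a high-dimensional dataset with imbalanced labels:

```python
import numpy as np

def k_occurrence_counts(X, y, k=5):
    """Compute N_k (k-occurrences) and BN_k (bad k-occurrences).

    N_k[i]  = how many points list point i among their k nearest neighbors.
    BN_k[i] = how many of those occurrences are label mismatches, i.e.
              how often point i acts as a 'bad' neighbor.
    """
    n = X.shape[0]
    # Brute-force pairwise squared Euclidean distances (fine for small n).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)           # a point is not its own neighbor
    N_k = np.zeros(n, dtype=int)
    BN_k = np.zeros(n, dtype=int)
    for i in range(n):
        nn = np.argsort(d2[i])[:k]         # k nearest neighbors of point i
        N_k[nn] += 1
        BN_k[nn[y[nn] != y[i]]] += 1       # neighbors with a mismatched label
    return N_k, BN_k

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))        # high-dimensional Gaussian cloud
y = (rng.random(200) < 0.2).astype(int)    # imbalanced labels, ~20% minority
N_k, BN_k = k_occurrence_counts(X, y, k=5)
skew = ((N_k - N_k.mean()) ** 3).mean() / N_k.std() ** 3
print("max N_k:", N_k.max(), "| skewness of N_k:", round(skew, 2))
```

On data like this, the skewness of the N_k distribution is typically markedly positive, which is precisely the hubness phenomenon: a few points appear in a disproportionate number of k-nearest neighbor lists.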

It just so happens that points belonging to the minority classes have a tendency to become prominent bad hubs in high-dimensional data, hence the phrase ‘The Curse of Minority Hubs’. This is surprising for several reasons. First of all, it is exactly the opposite of the standard assumption in class-imbalanced problems: that most misclassification is caused by the majority class, due to the average relative difference in density in the borderline regions. In low- or medium-dimensional data, this is indeed the case. Therefore, most machine learning methods tailored for class-imbalanced data try to improve the classification of minority class points by penalizing majority votes.

However, it seems that, in high-dimensional data, the minority hubs often induce a disproportionately large percentage of all misclassifications. Standard methods for class-imbalanced classification therefore face great difficulties when learning in many dimensions, as their basic premise is turned upside down.

In our paper, we take a closer look at the related phenomena and propose that hubness-aware kNN classification methods can be used in conjunction with other class-imbalanced learning strategies to alleviate these difficulties.
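As a rough illustration of what ‘hubness-aware’ can mean in practice, the following sketch damps each training point's vote by its standardized bad hubness, loosely in the spirit of the hw-kNN weighting idea (the function, the exponential weighting, and the toy data are illustrative assumptions, not the exact methods evaluated in the paper):

```python
import numpy as np

def hw_knn_predict(X_train, y_train, x_query, k=5):
    """kNN vote where each neighbor's vote is damped by its bad hubness.

    A training point that frequently appears as a label-mismatched neighbor
    inside the training set (a bad hub) receives an exponentially smaller
    vote weight.
    """
    n = X_train.shape[0]
    # Bad k-occurrence counts within the training set.
    d2 = ((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    BN_k = np.zeros(n)
    for i in range(n):
        nn = np.argsort(d2[i])[:k]
        BN_k[nn[y_train[nn] != y_train[i]]] += 1
    # Standardize and turn into vote weights: bad hubs get damped votes.
    h_b = (BN_k - BN_k.mean()) / (BN_k.std() + 1e-12)
    w = np.exp(-h_b)
    # Weighted vote over the query's k nearest neighbors.
    dq = ((X_train - x_query) ** 2).sum(axis=-1)
    votes = {}
    for i in np.argsort(dq)[:k]:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w[i]
    return max(votes, key=votes.get)
```

Down-weighting points that frequently occur as bad neighbors limits the damage that bad (minority) hubs can do without removing or resampling any points, so a scheme like this composes naturally with resampling or cost-sensitive strategies.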

If you’re working in one of these areas and feel that this might be relevant to your work, you can have a look at the paper here.

It seems that Knowledge-Based Systems also encourages authors to accompany their papers with a short (less than 5 min.) presentation clarifying the main points. This is a cool feature, and we have added a few slides with brief explanations and suggestions.
