Archive for October, 2014

First Hub Miner release

October 18th, 2014

This is the announcement for the first release of Hub Miner code.

Hub Miner is the machine learning library that I have been working on during the course of my Ph.D. research. It is written in Java and released as open source on GitHub. This is the first release and updates are already underway, so please be a little patient. The code is well documented, with many comments, but the library is quite large and not easy to navigate without a manual.

Luckily, a full manual should be done by the end of October and will also appear on GitHub along with the code, as well as on this website, under the Hub Miner page.

Hub Miner is a hubness-aware machine learning library that implements methods for classification, clustering, instance selection, metric learning, stochastic optimization, and more. It handles both dense and sparse data representations, as well as continuous, discrete, and discretized features. There is also some basic support for text and image data processing.
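For readers unfamiliar with hubness, here is a minimal, self-contained sketch (illustrative only, not Hub Miner's actual API; all class and method names are made up) of the core statistic that hubness-aware methods build on: the k-occurrence count of each point, i.e. how often it appears among the k nearest neighbors of the other points. Points with unusually high counts are hubs; points with zero counts are orphans.

```java
import java.util.*;

public class HubnessSketch {

    // Count how often each point occurs among the k nearest
    // neighbors of the other points (its k-occurrence score).
    static int[] kOccurrences(double[][] data, int k) {
        int n = data.length;
        int[] counts = new int[n];
        for (int i = 0; i < n; i++) {
            Integer[] idx = new Integer[n];
            for (int j = 0; j < n; j++) idx[j] = j;
            final int q = i;
            Arrays.sort(idx, Comparator.comparingDouble(j -> dist(data[q], data[j])));
            // idx[0] is the query point itself; take the next k as neighbors.
            for (int r = 1; r <= k; r++) counts[idx[r]]++;
        }
        return counts;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0, 1}, {1, 0}, {5, 5}, {0.5, 0.5} };
        // The central point (index 4) becomes a hub; the outlier (index 3)
        // and point 2 end up as orphans with zero occurrences.
        System.out.println(Arrays.toString(kOccurrences(data, 2)));
    }
}
```

In high-dimensional data the distribution of these counts becomes highly skewed, which is precisely what the hubness-aware methods in the library exploit.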

The Hub Miner source also includes Image Hub Explorer, a GUI for visual hubness inspection in image data.

A powerful experimentation framework under learning.unsupervised.evaluation.BatchClusteringTester allows for testing the various baselines under challenging conditions.

OpenML support is also under way and should be completed by the end of October, so expect it to appear in the next release.

New publications

October 12th, 2014

Two of our new papers have recently been accepted.

The paper titled Boosting for Vote Learning in High-dimensional kNN Classification has been accepted for presentation at the International Conference on Data Mining (ICDM) workshop on High-dimensional Data Analysis. The paper examines the possibility of using boosting for vote learning in high-dimensional data, since it turns out that hubness-aware k-nearest neighbor classifiers permit boosting in the classical sense. Standard kNN baselines are known to be robust to training data sub-sampling, so the instance sampling and instance re-weighting approaches to boosting do not typically work on kNN, which can instead be boosted by feature sub-sampling. In the case of hubness-aware classifiers, it is possible to use the re-weighting type of boosting without greatly increasing the computational complexity, as the kNN graph only needs to be calculated once on the training data for the neighbor occurrence model. We have extended the basic neighbor occurrence models by introducing instance weights and weighted neighbor occurrences, with trivial changes to the hubness-aware voting frameworks. The results look promising, though we have only tried the AdaBoost.M2 boosting approach so far, and other boosting variants are less prone to over-fitting and more robust to noise, so there is more work to be done here.
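The weighted neighbor occurrence idea can be sketched roughly as follows. This is an illustrative simplification under made-up names, not the paper's exact formulation or Hub Miner code: the kNN graph is computed once, and a boosting re-weighting round only changes how occurrences are accumulated, not the graph itself.

```java
import java.util.Arrays;

public class WeightedOccurrenceSketch {

    // Accumulate weighted, class-conditional neighbor occurrence counts
    // from a fixed kNN graph. knn[i] lists the k nearest neighbors of
    // training instance i; occ[x][c] sums the boosting weights of all
    // instances of class c that have x in their neighbor list.
    static double[][] weightedOccurrences(int[][] knn, int[] labels,
                                          double[] weights, int numClasses) {
        double[][] occ = new double[labels.length][numClasses];
        for (int i = 0; i < knn.length; i++) {
            for (int nb : knn[i]) {
                occ[nb][labels[i]] += weights[i];
            }
        }
        return occ;
    }

    public static void main(String[] args) {
        int[][] knn = { {1, 2}, {0, 2}, {0, 1} };      // fixed kNN graph
        int[] labels = { 0, 1, 0 };
        double[] weights = { 1.0, 2.0, 0.5 };          // boosting weights
        System.out.println(Arrays.deepToString(
                weightedOccurrences(knn, labels, weights, 2)));
    }
}
```

The point is that each boosting iteration only needs to re-run this cheap accumulation with updated weights, while the expensive kNN graph construction happens once.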

Speaking of noise, our paper on Hubness-aware kNN Classification of High-dimensional Data in Presence of Label Noise has just been accepted for publication in the Neurocomputing Special Issue on Learning from Label Noise. It is an in-depth study of the impact of data hubness and the curse of dimensionality on classification performance, and of the inherent robustness of hubness-aware approaches in particular. Additionally, we have introduced the novel concept of hubness-proportional random label noise as a way to test worst-case scenarios. To show that this noise model is realistic, we have demonstrated an adversarial label-flip attack on SMS spam data, based on estimated TF-IDF message weights that were inversely correlated with point-wise hubness under standard TF-IDF normalization. We hope to do more work on hubness-aware learning under label noise soon.
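The hubness-proportional noise model can be sketched as follows; the names and the exact sampling scheme here are my illustrative assumptions, not the paper's implementation. Instances are sampled for label flipping with probability proportional to their k-occurrence counts, so hubs, whose labels influence many votes, are preferentially corrupted.

```java
import java.util.Arrays;
import java.util.Random;

public class HubnessNoiseSketch {

    // Flip the labels of numFlips instances, sampled without replacement
    // with probability proportional to their hubness (k-occurrence) counts,
    // so that influential hub points are the most likely to be mislabeled.
    static int[] hubnessProportionalFlip(int[] labels, int[] hubness,
                                         int numFlips, int numClasses,
                                         Random rnd) {
        int[] noisy = labels.clone();
        boolean[] flipped = new boolean[labels.length];
        double total = 0;
        for (int h : hubness) total += h;
        for (int f = 0; f < numFlips && total > 0; f++) {
            // Roulette-wheel selection over the remaining hubness mass.
            double r = rnd.nextDouble() * total;
            int pick = -1;
            for (int i = 0; i < labels.length; i++) {
                if (flipped[i] || hubness[i] == 0) continue;
                r -= hubness[i];
                if (r <= 0) { pick = i; break; }
            }
            if (pick < 0) break; // floating-point edge case
            // Flip to a uniformly chosen different class.
            int newLabel = rnd.nextInt(numClasses - 1);
            if (newLabel >= noisy[pick]) newLabel++;
            noisy[pick] = newLabel;
            flipped[pick] = true;
            total -= hubness[pick];
        }
        return noisy;
    }

    public static void main(String[] args) {
        int[] labels = { 0, 0, 0, 0 };
        int[] hubness = { 0, 0, 0, 10 };
        // Only the hub (index 3) has non-zero mass, so its label is flipped.
        System.out.println(Arrays.toString(
                hubnessProportionalFlip(labels, hubness, 1, 2, new Random(13))));
    }
}
```

Compared with uniform random label noise, this concentrates the same flip budget on the points that appear in the most neighbor lists, which is why it approximates a worst-case scenario for kNN-style methods.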