Shonan Seminar: Dimensionality and Scalability (II)

August 1st, 2015

In June, Shonan Village in Japan hosted a seminar dedicated to the issues surrounding high intrinsic data dimensionality, distance concentration, similarity search, and scalability. It was an amazing opportunity to spend a few days brainstorming these topics with leading experts in the field and to hear about many surprising new results.

We are all looking forward to some new projects and collaborations that were set up during the seminar.

A big shout-out to the organizers, who did an amazing job, and to the participants for their individual contributions.

The official report from the seminar can be found here.

Hub Miner Development

April 3rd, 2015

Hub Miner (https://github.com/datapoet/hubminer) has been significantly improved since its initial release: it now has full OpenML support for networked experiments in classification and a detailed user manual covering all common use cases.

There have also been many new method implementations, especially for data filtering, reduction and outlier detection.

I have ambitious implementation plans for future versions.
If you would like to join the project as a contributor, let me know!

While I am still dedicated to the project, I have had somewhat less time for it since joining Google in January 2015, so I have decided to open the project up to new contributors who can help make this an awesome machine learning library.

I am also interested in developing Python/R/Julia/C++ implementations of hubness-aware approaches, so feel free to ping me if you would be interested in that as well.

First Hub Miner release

October 18th, 2014

This is the announcement for the first release of Hub Miner code.

Hub Miner is the machine learning library that I have been working on during the course of my Ph.D. research. It is written in Java and released as open source on GitHub. This is the first release and updates are already underway, so please be a little patient. The code is well documented, with many comments, but the library is quite large and not that easy to navigate without a manual.

Luckily, a full manual should be done by the end of October and will also appear on GitHub along with the code, as well as on this website, under the Hub Miner page.

Hub Miner is a hubness-aware machine learning library that implements methods for classification, clustering, instance selection, metric learning, stochastic optimization, and more. It handles both dense and sparse data representations, as well as continuous, discrete, and discretized features. There is also some basic support for text and image data processing.

The Hub Miner source also includes Image Hub Explorer, a GUI for visual inspection of hubness in image data.

A powerful experimentation framework, implemented in learning.supervised.evaluation.cv.BatchClassifierTester and learning.unsupervised.evaluation.BatchClusteringTester, makes it possible to test the various baselines under challenging conditions.

OpenML support is also under way and should be completed by the end of October, so expect it to appear in the next release.

New publications

October 12th, 2014

Two of our new papers have recently been accepted.

The paper titled Boosting for Vote Learning in High-dimensional kNN Classification has been accepted for presentation at the International Conference on Data Mining (ICDM) workshop on High-dimensional Data Analysis. The paper examines the possibility of using boosting for vote learning in high-dimensional data, since hubness-aware k-nearest neighbor classifiers turn out to permit boosting in the classical sense. Standard kNN baselines are known to be robust to training data sub-sampling, so the instance sampling and instance re-weighting approaches to boosting do not typically work on kNN, which is usually boosted by feature sub-sampling instead.

In the case of hubness-aware classifiers, however, it is possible to use the re-weighting type of boosting without greatly increasing the computational complexity, as the kNN graph only needs to be calculated once on the training data for the neighbor occurrence model. We have extended the basic neighbor occurrence models by introducing instance weights and weighted neighbor occurrences, which requires only trivial changes to the hubness-aware voting frameworks. The results look promising, though we have only tried the AdaBoost.M2 boosting approach so far, and other boosting variants are less prone to over-fitting and more robust to noise, so there is more work to be done here.
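To illustrate the weighted occurrence idea, here is a minimal Python sketch under assumptions, with hypothetical names, and not the Hub Miner implementation itself: the kNN graph is computed once, and each boosting round only re-accumulates occurrence counts under the current instance weights.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, k=5):
    # k + 1 neighbors because each point is its own nearest neighbor;
    # this graph is computed only once and reused across boosting rounds.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return idx[:, 1:]  # drop the self-occurrences

def weighted_occurrences(idx, y, instance_weights, n_classes):
    # occ[j, c] is the weighted number of times point j occurs among
    # the k nearest neighbors of class-c training points.
    occ = np.zeros((idx.shape[0], n_classes))
    for i, neighbors in enumerate(idx):
        for j in neighbors:
            occ[j, y[i]] += instance_weights[i]
    return occ
```

A hubness-aware voter can then derive its class votes from the occ rows of a test point's nearest neighbors, while AdaBoost.M2 updates instance_weights between rounds.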

Speaking of noise, our paper on Hubness-aware kNN Classification of High-dimensional Data in Presence of Label Noise has just been accepted for publication in the Neurocomputing Special Issue on Learning from Label Noise. It is an in-depth study of the impact of data hubness and the curse of dimensionality on classification performance, and of the inherent robustness of hubness-aware approaches in particular. Additionally, we have introduced the novel concept of hubness-proportional random label noise as a way to test worst-case scenarios. To show that this noise model is realistic, we have demonstrated an adversarial label-flip attack on SMS spam data, based on estimated TF-IDF message weights that were inversely correlated with point-wise hubness under standard TF-IDF normalization. We hope to do more work on hubness-aware learning under label noise soon.
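To make the noise model concrete, here is a rough sketch of hubness-proportional label flipping (an illustrative, assumption-laden version with hypothetical names, not the exact protocol from the paper): points are selected for mislabeling with probability proportional to their k-occurrence counts, so the most influential points are the most likely to receive wrong labels.

```python
import numpy as np

def hubness_proportional_noise(y, occurrence_counts, noise_rate, seed=0):
    # Flip the labels of a noise_rate fraction of the points, sampling
    # points with probability proportional to their hubness
    # (k-occurrence) counts, so hubs are preferentially mislabeled.
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    p = occurrence_counts / occurrence_counts.sum()
    flip_idx = rng.choice(len(y), size=int(noise_rate * len(y)),
                          replace=False, p=p)
    classes = np.unique(y)
    for i in flip_idx:
        # Assign a random label different from the original one.
        y_noisy[i] = rng.choice(classes[classes != y[i]])
    return y_noisy
```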

Image Hub Explorer: Journal Paper

August 26th, 2014

We were notified today that the extended version of the paper that we presented at last year’s European Conference on Machine Learning has been accepted for publication in the Multimedia Tools and Applications journal. The paper is titled “Image Hub Explorer: Evaluating Representations and Metrics for Content-based Image Retrieval and Object Recognition”. The full text of the article will soon be available online on the publications page.

The paper is about the Image Hub Explorer system for interactive evaluation and visualization of the utility of various image feature representations and metrics, from the perspective of the semantic consistency of the top-k result sets and the emergence of beneficial and/or detrimental image hubs in the data. Indeed, our results indicate that different image feature representations have different levels of susceptibility to the hubness phenomenon and the curse of dimensionality. In the paper, we have examined the quantized bag-of-feature representations for SIFT, SURF, ORB and BRIEF descriptors, though the system itself was built to be applicable to generic representations as well, including DeCAF and similar learned feature types.
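For context, a quantized bag-of-features representation can be sketched in a few lines (hypothetical helper names, independent of the Image Hub Explorer code): local descriptors are assigned to their nearest codebook centroids, and each image becomes a normalized visual-word histogram.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=256, seed=0):
    # Cluster local descriptors (e.g. SIFT or SURF) into visual words.
    return KMeans(n_clusters=n_words, random_state=seed).fit(all_descriptors)

def bag_of_features(image_descriptors, codebook):
    # Quantize one image's descriptors and build its normalized histogram.
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()
```

Different descriptor types then induce different histogram spaces, which is exactly where their susceptibility to hubness can differ.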

The system implements state-of-the-art hubness-aware machine learning methods for metric learning, ranking and classification, as well as several novel visualization layers and components. It will be made freely available in about a month as a part of the Hub Miner library that is to be released soon as open source. We will post more notifications soon.

A video of the demo of Image Hub Explorer is available here.

A Novel Kernel Clustering Algorithm

July 26th, 2014

We have a new book chapter on high-dimensional data clustering coming out in a book on partitional clustering algorithms. It is titled ‘Hubness-Based Clustering of High-Dimensional Data’ and it extends our earlier work, in which we have shown that it is possible to exploit kNN hubs for effective data clustering in many dimensions.

In our chapter, we have extended the original algorithm to incorporate a ‘kernel trick’ in order to handle non-hyperspherical clusters in the data. This has resulted in the Kernel Global Hubness-Proportional K-Means algorithm (Kernel-GHPKM), which our experiments show to be highly promising and preferable to standard kernel K-means on some high-dimensional datasets.
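As a toy illustration of the hubness-proportional selection idea (a simplified sketch with hypothetical names, not the actual (Kernel-)GHPKM procedure), cluster prototypes can be drawn from each cluster’s members with probability proportional to their hubness scores, rather than always using the mean:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hubness_scores(X, k=5):
    # N_k(x): how often each point occurs in other points' kNN lists.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return np.bincount(idx[:, 1:].ravel(), minlength=len(X)).astype(float)

def hubness_proportional_prototypes(X, labels, n_clusters, k=5, seed=0):
    # For each cluster, pick a member point as its prototype with
    # probability proportional to that member's hubness score.
    rng = np.random.default_rng(seed)
    scores = hubness_scores(X, k) + 1e-9  # guard against all-zero clusters
    prototypes = np.empty(n_clusters, dtype=int)
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        p = scores[members] / scores[members].sum()
        prototypes[c] = rng.choice(members, p=p)
    return prototypes
```

Alternating such prototype selection with nearest-prototype reassignment yields a K-means-style loop; the kernel variant replaces the Euclidean distance computations with kernel-induced ones.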

The implementation is available in Hub Miner and will be released very soon along with the rest of the library.

Stay tuned for more updates.

PhD thesis: The Role of Hubness in High-dimensional Data Analysis

November 24th, 2013

On December 18th, 2013, I am scheduled to present my PhD thesis, titled “The Role of Hubness in High-dimensional Data Analysis”.

The thesis discusses the issues involved in similarity-based inference in intrinsically high-dimensional data and the consequences of emerging hub points. It integrates the work presented in my journal and conference papers and proposes and discusses novel techniques for designing nearest-neighbor-based learning models in many dimensions. Lastly, it outlines potential practical applications and promising future research directions.

I would like to thank everyone who gave me advice and helped in shaping this thesis.

The full text of the thesis is available here.

@ ECML PKDD 2013

September 27th, 2013

This year’s ECML/PKDD has definitely exceeded my expectations: great talks, an inspiring atmosphere, an engaging poster session, and lots of feedback from different people.

My own presentations also went quite well.

Here are the posters that I used to present the papers on Image Hub Explorer and the Augmented Naive Hubness-Bayesian k-Nearest Neighbor classifier for intrinsically high-dimensional data.

Learning under Class Imbalance in High-dimensional Data: a new journal paper

August 30th, 2013

I am pleased to say that our paper titled ‘Class Imbalance and The Curse of Minority Hubs’ just got accepted for publication in Knowledge-Based Systems (IF 4.1 (2012)). The research presented in the paper is one of the pillars of the work in my PhD thesis, so I am happy to have gotten some more quality feedback from the reviews and further improved the paper during the whole process.

In the paper, we examine a novel aspect of the well-known curse of dimensionality, one that we have named ‘The Curse of Minority Hubs’. It has to do with learning under class imbalance in high-dimensional data. Class-imbalanced problems have long been known to pose great difficulties for standard machine learning approaches, just as a high number of features and data sparsity are known to pose problems of their own. Surprisingly, these two phenomena have not often been considered simultaneously. In our analysis, we have focused on the high-dimensional phenomenon of hubness, the skewness of the distribution of influence/relevance in similarity-based models. Hubs emerge as centers of influence within the data, and a small number of points determines most properties of the model. In the case of classification, a small number of points is responsible for most correct or incorrect classification decisions. The points that cause many label mismatches in k-nearest neighbor sets are called ‘bad hubs’, while the others are referred to as ‘good hubs’.
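The basic quantities are straightforward to compute. Here is a minimal numpy/scikit-learn sketch (illustrative only, not the paper’s implementation) that counts each point’s total, good, and bad k-occurrences:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def occurrence_profile(X, y, k=5):
    # For each point j: N[j] is the total number of k-occurrences,
    # BN[j] counts occurrences with a label mismatch ('bad' occurrences),
    # and GN[j] = N[j] - BN[j] counts the label-matching ('good') ones.
    n = len(y)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    N, BN = np.zeros(n), np.zeros(n)
    for i in range(n):
        for j in idx[i, 1:]:  # skip the self-occurrence
            N[j] += 1
            if y[i] != y[j]:  # j would contribute a wrong label for i
                BN[j] += 1
    return N, N - BN, BN

# Hubness is the skewness of N; 'bad hubs' are points with large BN.
```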

It just so happens that points belonging to the minority classes have a tendency to become prominent bad hubs in high-dimensional data, hence the phrase ‘The Curse of Minority Hubs’. This is surprising for several reasons. First of all, it is exactly the opposite of the standard assumption in class-imbalanced problems: that most misclassification is caused by the majority class, due to an average relative difference in density in the borderline regions. In low- or medium-dimensional data this is indeed the case, which is why most machine learning methods tailored for class-imbalanced data try to improve the classification of minority class points by penalizing majority votes.

However, it seems that, in high-dimensional data, the minority hubs often induce a disproportionately large percentage of all misclassifications. Standard methods for class-imbalanced classification therefore face great difficulties when learning in many dimensions, as their base premise is turned upside down.

In our paper, we take a closer look at all the related phenomena and propose that hubness-aware kNN classification methods could be used in conjunction with other class-imbalanced learning strategies in order to alleviate these difficulties.

If you’re working in one of these areas and feel that this might be relevant for your work, you can have a look at the paper here.

It seems that Knowledge-Based Systems also encourages authors to include a short (less than 5 min.) presentation with their papers, in order to clarify the main points. This is a cool feature, and we have added a few slides with brief explanations and suggestions.

Upcoming ECML talks

June 18th, 2013

I was just notified that two of my papers got accepted for presentation at the European Conference on Machine Learning (ECML). This is great news and I am looking forward to the conference and the opportunity to share my results and get some valuable feedback.

The regular paper that got accepted is titled “Hub Co-occurrence Modeling for Robust High-dimensional kNN Classification” and has to do with learning from second-order neighbor dependencies (co-occurrences) in intrinsically high-dimensional data. We have analyzed the consequences of hubness for the neighbor co-occurrence distributions and utilized them in a novel kNN classification method, the Augmented Naive Hubness-Bayesian k-NN (ANHBNN). The method is based on the Hidden Naive Bayes model and introduces hidden nodes in order to model dependencies between individual attributes, where the attributes of the model are the neighbor occurrences themselves. This paper solves some problems but also raises new issues, and it shows how difficult and multi-faceted the hubness issue can become.
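To give a concrete feel for the second-order statistics involved (a hedged sketch with hypothetical names, not the ANHBNN implementation itself), co-occurrences simply count how often pairs of points appear together in the same kNN list:

```python
from collections import Counter
from itertools import combinations

from sklearn.neighbors import NearestNeighbors

def neighbor_cooccurrences(X, k=5):
    # Count how often each pair of training points appears together in
    # some point's k-nearest-neighbor list.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    pair_counts = Counter()
    for neighbors in idx[:, 1:]:  # skip the self-occurrences
        for a, b in combinations(sorted(neighbors), 2):
            pair_counts[a, b] += 1
    return pair_counts
```

ANHBNN-style models can then treat such occurrence and co-occurrence statistics as evidence when computing class votes for a query point.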

The other accepted paper is a demo paper on the Image Hub Explorer tool, which means that I will get the opportunity to present my software at the conference and demonstrate its capabilities in front of the gathered audience. I am really happy about this and certain that it will be a great experience. The demo paper is titled “Image Hub Explorer: Evaluating Representations and Metrics for Content-based Image Retrieval and Object Recognition”.