Archive for the ‘Application’ Category

Hub Miner Development

April 3rd, 2015

Hub Miner has been significantly improved since its initial release. It now has full OpenML support for networked experiments in classification and a detailed user manual for all common use cases.

There have also been many new method implementations, especially for data filtering, reduction and outlier detection.

I have ambitious implementation plans for future versions.
If you would like to join the project as a contributor, let me know!

While I am still dedicated to the project, I have had somewhat less time for it since I joined Google in January 2015, so I have decided to open up the project to new contributors who can help make this an awesome machine learning library.

I am also interested in developing Python/R/Julia/C++ implementations of hubness-aware approaches, so feel free to ping me if you would be interested in that as well.

First Hub Miner release

October 18th, 2014

This is the announcement for the first release of Hub Miner code.

Hub Miner is the machine learning library that I have been working on during the course of my Ph.D. research. It is written in Java and released as open source on GitHub. This is the first release and updates are already underway, so please be a little patient. The code is well documented, with many comments – but the library is quite large and it is not that easy to navigate without a manual.

Luckily, a full manual should be done by the end of October and will also appear on GitHub along with the code, as well as on this website, under the Hub Miner page.

Hub Miner is a hubness-aware machine learning library that implements methods for classification, clustering, instance selection, metric learning, stochastic optimization and more. It handles both dense and sparse data representations, with continuous, discrete and discretized features. There is also some basic support for text and image data processing.

The Hub Miner source also includes Image Hub Explorer, a GUI for visual hubness inspection in image data.

A powerful experimentation framework under and learning.unsupervised.evaluation.BatchClusteringTester allows for testing the various baselines in challenging conditions.

OpenML support is also under way and should be completed by the end of October, so expect it to appear in the next release.

PhD thesis: The Role of Hubness in High-dimensional Data Analysis

November 24th, 2013

On December 18th, 2013, I am scheduled to present my PhD thesis, titled “The Role of Hubness in High-dimensional Data Analysis”.

The thesis discusses the issues involving similarity-based inference in intrinsically high-dimensional data and the consequences of emerging hub points. It integrates the work presented in my journal and conference papers, and it proposes and discusses novel techniques for designing nearest-neighbor based learning models in many dimensions. Lastly, it mentions potential practical applications and promising future research directions.

I would like to thank everyone who gave me advice and helped in shaping this thesis.

The full text of the thesis is available here.

Learning under Class Imbalance in High-dimensional Data: a new journal paper

August 30th, 2013

I am pleased to say that our paper titled ‘Class Imbalance and The Curse of Minority Hubs’ just got accepted for publication in Knowledge-Based Systems (IF 4.1 (2012)). The research presented in the paper is one of the pillars of the work in my PhD thesis, so I am happy to have gotten some more quality feedback from the reviews and further improved the paper during the whole process.

In the paper, we examine a novel aspect of the well known curse of dimensionality, one that we have named ‘The Curse of Minority Hubs’. It has to do with learning under class imbalance in high-dimensional data. Class-imbalanced problems have been known to pose great difficulties to standard machine learning approaches, just as it was known that a high number of features and sparsity pose problems of their own. Surprisingly, these two phenomena have not often been considered simultaneously. In our analysis, we have focused on the high-dimensional phenomenon of hubness, the skewness of the distribution of influence/relevance in similarity-based models. Hubs emerge as centers of influence within the data, and a small number of points determines most properties of the model. In the case of classification, a small number of points is responsible for most correct or incorrect classification decisions. The points that cause many label mismatches in k-nearest neighbor sets are called ‘bad hubs’, while the others are referred to as ‘good hubs’.
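
To make the good/bad distinction concrete, here is a minimal sketch of how occurrence counts could be computed. It is not taken from Hub Miner: the function names `knn_lists` and `occurrence_counts`, the Euclidean metric and the toy one-dimensional data are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_lists(points, k):
    """Return, for each point, the indices of its k nearest neighbors (Euclidean)."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point, sorted by distance and then by index.
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append([j for _, j in others[:k]])
    return result


def occurrence_counts(knn, labels):
    """Count, for every point, how often it occurs as a good or bad neighbor.

    An occurrence of j in the k-NN set of i is 'good' when the labels match
    and 'bad' when they differ.
    """
    good, bad = Counter(), Counter()
    for i, neighbors in enumerate(knn):
        for j in neighbors:
            if labels[j] == labels[i]:
                good[j] += 1
            else:
                bad[j] += 1
    n = len(knn)
    return [good[j] for j in range(n)], [bad[j] for j in range(n)]


# Toy data (hypothetical): five points on a line, two classes.
points = [(0.0,), (1.0,), (2.5,), (3.0,), (10.0,)]
labels = ["a", "a", "b", "b", "a"]
knn = knn_lists(points, k=1)
good, bad = occurrence_counts(knn, labels)
print(knn)   # [[1], [0], [3], [2], [3]]
print(good)  # [1, 1, 1, 1, 0]
print(bad)   # [0, 0, 0, 1, 0]
```

Points with a high total count are hubs; those where the bad count dominates are the bad hubs discussed above.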

It just so happens that the points that belong to the minority classes have a tendency to become prominent bad hubs in high-dimensional data, hence the phrase ‘The Curse of Minority Hubs’. This is surprising for several reasons. First of all, it is exactly the opposite of what is usually the standard assumption in class-imbalanced problems: that most misclassification is caused by the majority class, due to an average relative difference in density in the borderline regions. In low or medium-dimensional data, this is indeed the case. Therefore, most machine learning methods that are tailored for class-imbalanced data try to improve the classification of the minority class points by penalizing the majority votes.

However, it seems that, in high-dimensional data, it is often the case that the minority hubs induce a disproportionately large percentage of all misclassifications. Therefore, standard methods for class-imbalanced classification face great difficulties when learning in many dimensions, as their base premise is turned upside-down.
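
One simple way to check this on a given dataset is to compare each class's share of the data with its share of the induced label mismatches. The sketch below assumes precomputed k-NN lists; the function name `bad_hubness_by_class` and the toy neighbor lists are hypothetical and are not results from the paper.

```python
from collections import Counter


def bad_hubness_by_class(knn, labels):
    """Map each class to (share of all points, share of all bad occurrences).

    A bad occurrence is attributed to the class of the neighbor point that
    appears in a k-NN set of a point with a different label.
    """
    bad = Counter()
    for i, neighbors in enumerate(knn):
        for j in neighbors:
            if labels[j] != labels[i]:
                bad[labels[j]] += 1
    total_bad = sum(bad.values())
    sizes = Counter(labels)
    n = len(labels)
    return {
        c: (sizes[c] / n, bad[c] / total_bad if total_bad else 0.0)
        for c in sizes
    }


# Hand-made toy neighbor lists: point 4 is the only minority point,
# yet it appears in every majority point's neighbor set.
labels = ["maj", "maj", "maj", "maj", "min"]
knn = [[1, 4], [0, 4], [3, 4], [2, 4], [0, 1]]
for cls, (size_share, bad_share) in bad_hubness_by_class(knn, labels).items():
    print(cls, round(size_share, 2), round(bad_share, 2))
```

In this toy case the minority class holds 20% of the points but induces two thirds of the bad occurrences, which is the kind of imbalance the paragraph above describes.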

In our paper, we take a closer look at all the related phenomena and also propose that the hubness-aware kNN classification methods could be used in conjunction with other class-imbalanced learning strategies in order to alleviate the arising difficulties.

If you’re working in one of these areas and feel that this might be relevant for your work, you can have a look at the paper here.

It seems that Knowledge-Based Systems also encourages authors to associate a short (less than 5 min.) presentation with the papers, in order to clarify the main points. This is a cool feature and we have also added a few slides with brief explanations and suggestions.

Upcoming ECML talks

June 18th, 2013

I was just notified that two of my papers got accepted for presentation at the European Conference on Machine Learning (ECML). This is great news and I am looking forward to the conference and the opportunity to share my results and get some valuable feedback.

The regular paper that got accepted is titled “Hub Co-occurrence Modeling for Robust High-dimensional kNN Classification” and has to do with learning from the second-order neighbor dependencies (co-occurrences) in intrinsically high-dimensional data. We have analyzed the consequences of hubness for the neighbor co-occurrence distributions and utilized them in a novel kNN classification method, the Augmented Naive Hubness-Bayesian k-NN (ANHBNN). The method is based on the Hidden Naive Bayes model and introduces hidden nodes in order to model dependencies between individual attributes. The attributes of the model are the neighbor occurrences themselves. This paper solves some problems but also raises new issues and it shows how difficult and multi-faceted the hubness issue can become.
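
As a rough illustration of what a neighbor co-occurrence is, the sketch below counts how often two points appear together in the same k-NN set. This is only the counting step, under the assumption of precomputed neighbor lists; it is not the ANHBNN classifier, and the function name `neighbor_cooccurrences` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def neighbor_cooccurrences(knn):
    """Count how often each unordered pair of points shares a k-NN set."""
    pairs = Counter()
    for neighbors in knn:
        # Sort so that every pair is stored as (smaller index, larger index).
        for a, b in combinations(sorted(set(neighbors)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hand-made toy neighbor lists with k = 2.
knn = [[1, 2], [0, 2], [0, 1], [1, 2]]
counts = neighbor_cooccurrences(knn)
print(counts[(1, 2)])  # 2
print(counts[(0, 2)])  # 1
print(counts[(0, 1)])  # 1
```

Under hubness, a few hub pairs tend to accumulate most of these counts, which is what makes the second-order dependencies worth modeling.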

The other paper that got accepted is actually a demo-paper on the Image Hub Explorer tool, which means that I will get the opportunity to present my software at the conference and demonstrate its capabilities in front of the gathered audience. I am really happy about this and I am certain that it will be a great experience. The demo paper is titled: Image Hub Explorer: Evaluating Representations and Metrics for Content-based Image Retrieval and Object Recognition.

Image Hub Explorer Demo Video

May 10th, 2013

We have completed the initial demo video of the Image Hub Explorer system and it is now available on YouTube.

The demo covers the basic functionality of the system, demonstrating some of the predicted use cases.
The user interface will undergo further changes and improvements and the functions will be extended by new models and learning approaches.

Improving the semantic representations for cross-lingual document retrieval

May 4th, 2013

I have had the pleasure of presenting some of our recent results at PAKDD 2013 in Gold Coast, Australia. The conference was great and the location couldn’t have been better, so we were able to catch some sun and walk along the beaches while discussing future collaboration, theory and applications.

Hubs are known to be the centers of influence and are known to arise in textual data. Also, they are known to cause problems by being frequent neighbors (= very similar) to semantically different types of documents. However, it was previously unknown whether this property is language-dependent and how it affects the cross-lingual information retrieval process.

What we have shown by analyzing aligned text corpora can be summarized as follows: hubs in one language are not necessarily hubs in another language, so different documents become influential. However, surprisingly, the percentage of label mismatches in reverse neighbor sets remains more or less unchanged. In other words, the nature of occurrences is preserved over different languages. This comes as a bit of a surprise, since hubness is arguably a geometric property arising from the interplay of metrics and data representations. Yet, it seems that more semantics than was previously thought remains hidden there, captured and preserved across different languages.
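
The two observations can be expressed as two simple measurements on an aligned corpus: the overlap between the top hubs in each language and the overall rate of label mismatches among neighbor occurrences. The sketch below is an assumption-laden illustration: the function names, the Jaccard overlap and the hand-made neighbor lists are mine, not the procedure or the data from the paper.

```python
from collections import Counter


def top_hubs(knn, m):
    """Return the set of the m most frequently occurring neighbor points."""
    counts = Counter(j for neighbors in knn for j in neighbors)
    ranked = sorted(range(len(knn)), key=lambda i: (-counts[i], i))
    return set(ranked[:m])


def hub_overlap(knn_a, knn_b, m):
    """Jaccard overlap between the top-m hub sets of two representations."""
    a, b = top_hubs(knn_a, m), top_hubs(knn_b, m)
    return len(a & b) / len(a | b) if a | b else 0.0


def bad_occurrence_rate(knn, labels):
    """Fraction of all neighbor occurrences that are label mismatches."""
    total = sum(len(neighbors) for neighbors in knn)
    bad = sum(labels[j] != labels[i] for i, nb in enumerate(knn) for j in nb)
    return bad / total


# Toy aligned corpus: document i is the same document in both languages.
labels = ["x", "x", "y", "y"]
knn_lang_a = [[1], [0], [0], [0]]  # document 0 is the hub
knn_lang_b = [[3], [3], [3], [2]]  # document 3 is the hub
print(hub_overlap(knn_lang_a, knn_lang_b, m=1))  # 0.0
print(bad_occurrence_rate(knn_lang_a, labels))   # 0.5
print(bad_occurrence_rate(knn_lang_b, labels))   # 0.5
```

The toy lists are constructed so that the hubs differ completely while the mismatch rate is identical, mirroring the pattern described above in a deliberately exaggerated form.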

We have used this observation to show that it was possible to improve the common semantic representation made via the CCA method (canonical correlation analysis) by simply introducing some hubness-aware instance weights. This is certainly not the only way to go about it and probably not the very best one, but it served as a good proof-of-concept.

The entire paper can be found here: The Role of Hubs in Cross-lingual Supervised Document Retrieval

Categories: Application, Data Mining, Hubness, Text

Demo: exploring the influence of hubs in object recognition and image retrieval

January 15th, 2013

As we have been mainly focused on research recently, I decided to take a short detour and put together some libraries that I have been working on into a small, lightweight demo for exploring image sets. I will demonstrate some of the features of the initial version here, though it will certainly grow in time.

I use my own Java libraries for data mining, which have been optimized for nearest neighbor methods in particular. There are other benchmark methods as well, but they are not the focus of my work, so I have devoted less attention to them. The library itself is approximately 100,000 lines of code and I intend to release it eventually, once I get my PhD and document it properly, as I suppose that it might be of help to some other researchers, as well as enable them to use the methods that we have been developing. This involves novel methods for classification, clustering, metric learning, re-ranking, instance selection, anomaly detection, etc. Not all of it is included in the demo, but there is enough functionality to perform some interesting data analysis.

So, briefly, what I am working on is the phenomenon of the emergence of hubs in high-dimensional feature spaces, learning under the skewed distribution of influence in scale-free-like k-nearest neighbor networks. In this context, a hub is an influential point that is a very frequent neighbor, a very frequently retrieved object.
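
A common way to quantify how skewed the distribution of influence is would be the skewness (standardized third moment) of the neighbor occurrence counts. The sketch below assumes precomputed k-NN lists; the function name `occurrence_skewness` and the toy lists are hypothetical.

```python
from collections import Counter


def occurrence_skewness(knn):
    """Skewness of the k-occurrence distribution (standardized third moment)."""
    n = len(knn)
    counts = Counter(j for neighbors in knn for j in neighbors)
    values = [counts[i] for i in range(n)]
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        return 0.0  # every point occurs equally often: no hubs at all
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


# Toy neighbor lists: point 0 is retrieved by almost everyone.
hubby = [[1], [0], [0], [0], [0]]
# Toy neighbor lists: every point is retrieved exactly once.
uniform = [[1], [2], [0]]
print(round(occurrence_skewness(hubby), 3))  # 1.291
print(occurrence_skewness(uniform))          # 0.0
```

A clearly positive value indicates a long right tail, i.e. a few points that are very frequent neighbors while most points are rarely retrieved.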

There are four main views/tabs of the applet, as well as several method-selecting menus. The Data Overview tab shown in the screenshot below outlines the basic k-nearest neighbor-based properties of the data and its neighbor occurrence distribution. All the statistics, as well as the graphs and charts in all the tabs, respond interactively to a change in neighborhood size made by moving the slider. The initially covered range is [1,50].

We have used image data to make the visualizations easier, but the system design is such that the distance matrix used to find the neighbor sets need not be calculated from the image data. One could use an aligned document/image dataset and visualize documents by their associated images.

The Data Overview Tab also shows a 2D visualization of the dataset, obtained by performing multidimensional scaling. The background is colored based on the good or bad hubness quantities in the relevant points. Good hubness involves occurrences that do not constitute a label mismatch, while bad hubness marks label mismatches in reverse k-neighbor sets. A certain number of hub-points is initially drawn onto the panel and they can be selected. The frame color of each image corresponds to a certain class. A chart in the bottom right part shows the occurrence frequency distribution itself, while various labels contain information about the statistical moments of the distribution and the purity of the k-nearest neighbor sets and the reverse k-nearest neighbor sets.

The Neighbor View Tab shows a bit more structure. The k-nearest neighbors and the reverse k-nearest neighbors are shown in a list for the selected image. It is possible to select a new image from either of the lists or the local view of the k-NN graph. What is shown in the graph on the left is a restriction of the k-NN graph on the selected set of points. One can add either the selected point or its neighbors or reverse neighbors. It is also possible to remove points from the graph. These graphs are generated automatically for the entire range of k-values and are updated when the k-slider is moved. Additionally, one can see the occurrence profile of a neighbor point as a pie chart and examine how often it occurs as a good neighbor and how often as a bad neighbor.

The Class View offers an insight into the hub-structure of each class, as well as the interplay between hubs in different classes, which is summarized in the class-to-class hubness on the lower right side. The main set of panels in this view is contained in a scroll pane and shows an ordered list of major hubs, good hubs and bad hubs for each class separately. As before, they are selectable. Additionally, there is a point type distribution, where the points are labeled either as safe, borderline, rare or outliers, based on the percentage of label mismatches in the respective k-neighbor sets. The chart in the upper part shows a distribution of classes, which allows us to see if the data is imbalanced. Imbalanced data is known to pose some difficulties for many data mining techniques.
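
The point type labeling can be sketched as a simple thresholding of the label mismatch fraction in each point's k-neighbor set. The cut-off values below (0.3 and 0.7) are assumptions chosen so that, for k = 5, at most one mismatch is safe, two or three are borderline, four are rare and five mark an outlier; the thresholds used in the applet may differ.

```python
def point_types(knn, labels, safe_below=0.3, borderline_below=0.7):
    """Label each point as safe, borderline, rare or outlier.

    The label depends on the fraction of neighbors in the point's k-NN set
    whose class differs from its own. Threshold values are illustrative.
    """
    types = []
    for i, neighbors in enumerate(knn):
        mismatches = sum(labels[j] != labels[i] for j in neighbors)
        fraction = mismatches / len(neighbors)
        if fraction < safe_below:
            types.append("safe")
        elif fraction < borderline_below:
            types.append("borderline")
        elif fraction < 1.0:
            types.append("rare")
        else:
            types.append("outlier")
    return types


# Hand-made toy neighbor lists with k = 5 (five 'a' points, three 'b' points).
labels = ["a", "a", "a", "a", "a", "b", "b", "b"]
knn = [
    [1, 2, 3, 4, 5],  # one mismatch
    [0, 2, 3, 4, 5],  # one mismatch
    [0, 1, 3, 4, 6],  # one mismatch
    [0, 1, 2, 4, 7],  # one mismatch
    [0, 1, 2, 3, 5],  # one mismatch
    [6, 7, 0, 1, 2],  # three mismatches
    [5, 0, 1, 2, 3],  # four mismatches
    [0, 1, 2, 3, 4],  # five mismatches
]
print(point_types(knn, labels))
# ['safe', 'safe', 'safe', 'safe', 'safe', 'borderline', 'rare', 'outlier']
```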

The last View deals with potential queries to the image database, i.e. the similarity search. A user can upload an image and the applet will return the set of most similar images, based on the quantized SIFT features extended by binned color histograms. The applet extracts the features of the new image and does the metric comparisons. Apart from the k-neighbor set, a user can also look at how various variants of the k-nearest neighbor algorithms would assign the label, based on the retrieved points. Eight such algorithms are currently supported, some of which are our own and have been recently proposed precisely for dealing with this sort of data. The applet shows the classification confidence of: kNN, FNN, NWKNN, AKNN, hw-kNN, h-FNN, HIKNN and NHBNN.
Apart from classification, a user can also try to invoke hubness-based re-ranking of the neighbor set, performed based on what was learned from the previous occurrences of those neighbor points. In practice, this works quite well – and we have also done it for other forms of data.

All the outlined analysis is possible for several primary and several secondary similarity/distance measures. The primary ones are quite standard and include Manhattan, Euclidean, Cosine and Jaccard, while the secondary distances include: simcos, simhub, mutual proximity, NICDM and local scaling. Simcos has long been used as a sort of a cure for the dimensionality curse. Simhub is a hubness-aware extension of the simcos measure and is one of our achievements. It has been analyzed in detail in our paper in the Knowledge and Information Systems journal. Mutual proximity is a slightly different, yet quite effective approach by a group of authors from Austria and was published in 2012 in the Journal of Machine Learning Research. NICDM and local scaling are other notable attempts at tackling hubness.
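
To give a feel for what a secondary measure does, here is a minimal sketch of three of them in their commonly cited forms: the shared-neighbor similarity (simcos) as the fraction of common k-nearest neighbors, NICDM as the primary distance divided by the geometric mean of the two points' average k-NN distances, and local scaling as 1 - exp(-d^2 / (sigma_x * sigma_y)) with sigma set to the distance to the k-th neighbor. The function names and the toy distance matrix are assumptions, and the code is not taken from the applet.

```python
import math


def knn_from_matrix(dist, k):
    """Indices of the k nearest neighbors of each point, closest first."""
    n = len(dist)
    return [
        sorted((j for j in range(n) if j != i), key=lambda j: (dist[i][j], j))[:k]
        for i in range(n)
    ]


def simcos(knn, i, j):
    """Shared-neighbor similarity: fraction of k-NN that i and j have in common."""
    return len(set(knn[i]) & set(knn[j])) / len(knn[i])


def nicdm(dist, knn, i, j):
    """Primary distance rescaled by the mean k-NN distances of both points."""
    mu_i = sum(dist[i][n] for n in knn[i]) / len(knn[i])
    mu_j = sum(dist[j][n] for n in knn[j]) / len(knn[j])
    return dist[i][j] / math.sqrt(mu_i * mu_j)


def local_scaling(dist, knn, i, j):
    """Primary distance rescaled by the distances to the k-th neighbors."""
    sigma_i = dist[i][knn[i][-1]]
    sigma_j = dist[j][knn[j][-1]]
    return 1.0 - math.exp(-dist[i][j] ** 2 / (sigma_i * sigma_j))


# Toy primary distance matrix: absolute differences of points on a line.
xs = [0.0, 1.0, 2.5, 3.0, 10.0]
dist = [[abs(a - b) for b in xs] for a in xs]
knn = knn_from_matrix(dist, k=2)
print(knn)                                    # [[1, 2], [0, 2], [3, 1], [2, 1], [3, 2]]
print(simcos(knn, 0, 1))                      # 0.5
print(round(nicdm(dist, knn, 0, 1), 4))       # 0.6761
print(round(local_scaling(dist, knn, 0, 1), 4))  # 0.2341
```

All three rescale or replace the primary distance using local neighborhood information, which is why they can be layered on top of any of the primary measures listed above.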

This is merely a first version of the app and we intend to polish the interface a little bit and add more methods and functionalities. Yet, I feel that it is a nice way to visualize medium-sized image collections and gain some insights into their k-nearest neighbor topologies in order to improve the performance of either the object recognition or the image retrieval systems.

Categories: Application, Hubness, Images, Visualization

Outlier/error detection in sensor data based on bad hubs

October 23rd, 2012

A lot of sensor data is being collected every minute and used for various sorts of prediction. Yet, these measurements are not perfect and the sensors sometimes break or malfunction. Detecting these anomalies is a part of the data cleaning and preparation process.
There are many ways to do outlier and anomaly detection and there is a whole body of literature devoted to the problem.
What we have taken a look at instead was one specific test scenario – whether the curse of dimensionality affects the time series enough that the emerging hubs in the data can be used as potential markers for such anomalous measurement records. It turns out that they can and that high bad hubness of measurement points clearly indicates that something is not right. What exactly, well – that is for experts to say in any particular test-case. Here are some graphs from the tool we’ve developed and described in one of our papers.

What is the ‘bad’ hubness of the suspicious points here? They are frequent neighbors to points in other geographical regions, so the distance in measurements does not correspond well to the spatial distance. Of course, the properties of a region are not homogeneous and it is certainly possible for correct, non-noisy sensors to produce such data. However, the number of such measurement points is usually small and they are the prime candidates for taking a closer look. This makes for a good semi-automated anomaly detection system.
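
The semi-automated part can be sketched as a ranking step: count, for each measurement point, how often it appears as a neighbor of points from a different region, and hand the top candidates to an expert. The function name `suspicious_sensors`, the `min_bad` threshold and the toy neighbor lists are hypothetical and do not describe the tool from our paper.

```python
from collections import Counter


def suspicious_sensors(knn, regions, min_bad=3):
    """Return (point index, bad occurrence count) pairs, most suspicious first.

    A bad occurrence is an appearance in the k-NN set (computed on the
    measurements) of a point that lies in a different geographical region.
    """
    bad = Counter()
    for i, neighbors in enumerate(knn):
        for j in neighbors:
            if regions[j] != regions[i]:
                bad[j] += 1
    flagged = [(j, count) for j, count in bad.items() if count >= min_bad]
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


# Hand-made toy neighbor lists: sensor 5 sits in the south but its
# measurements resemble those of all the northern sensors.
regions = ["north", "north", "north", "south", "south", "south"]
knn = [[1, 5], [0, 5], [0, 5], [4, 5], [3, 5], [3, 4]]
print(suspicious_sensors(knn, regions))  # [(5, 3)]
```

The threshold only narrows down the list; deciding whether a flagged sensor is actually faulty remains a job for the domain expert.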

Categories: Application, Hubness, Sensor data