Archive for the ‘Visualization’ Category

Hub Miner Development

April 3rd, 2015 Comments off

Hub Miner has been significantly improved since its initial release. It now has full OpenML support for networked classification experiments and a detailed user manual covering all common use cases.

There have also been many new method implementations, especially for data filtering, reduction and outlier detection.

I have ambitious implementation plans for future versions.
If you would like to join the project as a contributor, let me know!

While I am still dedicated to the project, I have somewhat less time than before since joining Google in January 2015, so I have decided to open the project up to new contributors who can help make this an awesome machine learning library.

I am also interested in developing Python/R/Julia/C++ implementations of hubness-aware approaches, so feel free to ping me if you would be interested in that as well.

First Hub Miner release

October 18th, 2014 Comments off

This is the announcement for the first release of Hub Miner code.

Hub Miner is the machine learning library that I have been working on during the course of my Ph.D. research. It is written in Java and released as open source on GitHub. This is the first release and updates are already underway, so please be a little patient. The code is well documented, with many comments, but the library is quite large and not easy to navigate without a manual.

Luckily, a full manual should be done by the end of October and will also appear on GitHub along with the code, as well as on this website, under the Hub Miner page.

Hub Miner is a hubness-aware machine learning library and it implements methods for classification, clustering, instance selection, metric learning, stochastic optimization – and more. It handles both dense and sparse data representations, as well as continuous, discrete, and discretized features. There is also basic support for text and image data processing.

The Hub Miner source also includes Image Hub Explorer, a GUI for visual hubness inspection in image data.

A powerful experimentation framework under learning.unsupervised.evaluation.BatchClusteringTester allows for testing the various baselines in challenging conditions.

OpenML support is also under way and should be completed by the end of October, so expect it to appear in the next release.

PhD thesis: The Role of Hubness in High-dimensional Data Analysis

November 24th, 2013 Comments off

On December 18th, 2013, I am scheduled to present my PhD thesis, titled “The Role of Hubness in High-dimensional Data Analysis”.

The thesis discusses the issues involved in similarity-based inference in intrinsically high-dimensional data and the consequences of emerging hub points. It integrates the work presented in my journal and conference papers, and proposes and discusses novel techniques for designing nearest-neighbor-based learning models in many dimensions. Lastly, it outlines potential practical applications and promising future research directions.

I would like to thank everyone who gave me advice and helped in shaping this thesis.

The full text of the thesis is available here.

Upcoming ECML talks

June 18th, 2013 Comments off

I was just notified that two of my papers got accepted for presentation at the European Conference on Machine Learning (ECML). This is great news and I am looking forward to the conference and the opportunity to share my results and get some valuable feedback.

The regular paper that got accepted is titled “Hub Co-occurrence Modeling for Robust High-dimensional kNN Classification” and deals with learning from second-order neighbor dependencies (co-occurrences) in intrinsically high-dimensional data. We have analyzed the consequences of hubness for the neighbor co-occurrence distributions and utilized them in a novel kNN classification method, the Augmented Naive Hubness-Bayesian k-NN (ANHBNN). The method is based on the Hidden Naive Bayes model and introduces hidden nodes in order to model dependencies between individual attributes, where the attributes of the model are the neighbor occurrences themselves. This paper solves some problems but also raises new issues, and it shows how difficult and multi-faceted the hubness problem can become.

The other paper that got accepted is actually a demo-paper on the Image Hub Explorer tool, which means that I will get the opportunity to present my software at the conference and demonstrate its capabilities in front of the gathered audience. I am really happy about this and I am certain that it will be a great experience. The demo paper is titled: Image Hub Explorer: Evaluating Representations and Metrics for Content-based Image Retrieval and Object Recognition.

Image Hub Explorer Demo Video

May 10th, 2013 Comments off

We have completed the initial demo video of the Image Hub Explorer System and it is now available on YouTube.

The demo covers the basic functionality of the system, demonstrating some of the predicted use cases.
The user interface will undergo further changes and improvements and the functions will be extended by new models and learning approaches.

Demo: exploring the influence of hubs in object recognition and image retrieval

January 15th, 2013 Comments off

As we have been mainly focused on research recently, I decided to take a short detour and combine some libraries that I have been working on into a small, lightweight demo for exploring image sets. I will demonstrate some of the features of the initial version here, though it will certainly grow in time.

I use my own Java libraries for data mining, which have been optimized for nearest-neighbor methods in particular. There are other benchmark methods as well, but they are not the focus of my work, so I have devoted less attention to them. The library itself is about 100,000 lines of code and I intend to release it eventually, once I get my PhD and document it properly, as I suppose it might be of help to other researchers and enable them to use the methods that we have been developing. These include novel methods for classification, clustering, metric learning, re-ranking, instance selection, anomaly detection, etc. Not all of it is included in the demo, but there is enough functionality for some interesting data analysis.

So, briefly, what I am working on is the phenomenon of the emergence of hubs in high-dimensional feature spaces: learning under the skewed distribution of influence in scale-free-like k-nearest neighbor networks. In this context, a hub is an influential point that appears as a very frequent neighbor, i.e., a very frequently retrieved object.
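To make the notion of a hub concrete, here is a minimal Python sketch (Hub Miner itself is written in Java; the data and function name here are purely illustrative) that computes the k-occurrence counts N_k, i.e., how often each point appears in other points' k-nearest-neighbor sets:

```python
import numpy as np

def k_occurrence_counts(X, k):
    """Count how often each point appears among the k nearest
    neighbors of the other points (its k-occurrence, N_k)."""
    n = len(X)
    # brute-force pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(d[i])[:k]:
            counts[j] += 1
    return counts

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # toy high-dimensional Gaussian data
n_k = k_occurrence_counts(X, k=5)
# The mean of N_k is always exactly k, but in many dimensions the
# distribution becomes skewed: a few points (hubs) occur far more
# often than the mean, while many points rarely occur at all.
```

In a real pipeline one would use an indexed nearest-neighbor search rather than the brute-force distance matrix, but the occurrence-counting logic stays the same.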

There are four main views/tabs of the applet, as well as several method-selecting menus. The Data Overview tab shown in the screenshot below outlines the basic k-nearest-neighbor-based properties of the data and its neighbor occurrence distribution. All the statistics, graphs, and charts in all the tabs respond interactively to a change in neighborhood size invoked by moving the slider. The initial covered range is [1, 50].

We have used image data to make the visualizations easier, but the system design is such that the distance matrix used to find the neighbor sets need not be calculated from the image data. One could use an aligned document/image dataset and visualize documents by their associated images.

The Data Overview Tab also shows a 2D visualization of the dataset, obtained by performing multidimensional scaling. The background is colored based on the good or bad hubness quantities at the relevant points. Good hubness involves occurrences that do not constitute a label mismatch, while bad hubness marks label mismatches in reverse k-neighbor sets. A certain number of hub points is initially drawn onto the panel and they can be selected. The frame color of each image corresponds to its class. A chart in the bottom right shows the occurrence frequency distribution itself, while various labels contain information about the statistical moments of the distribution and the purity of the k-nearest neighbor sets and the reverse k-nearest neighbor sets.
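The good/bad split described above can be sketched in a few lines of Python (again, an illustrative toy, not Hub Miner code): each time a point occurs in some query point's k-neighbor set, the occurrence is counted as good if the labels match and bad otherwise.

```python
import numpy as np

def good_bad_hubness(X, labels, k):
    """Split each point's k-occurrences into good occurrences
    (label matches the query point's label) and bad occurrences
    (label mismatch)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    good = np.zeros(n, dtype=int)
    bad = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(d[i])[:k]:
            if labels[i] == labels[j]:
                good[j] += 1  # j is a good neighbor of i
            else:
                bad[j] += 1   # j is a bad neighbor of i
    return good, bad

# two well-separated toy classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 20)),
               rng.normal(3, 1, (50, 20))])
y = np.array([0] * 50 + [1] * 50)
good, bad = good_bad_hubness(X, y, k=5)
```

Good and bad occurrence counts per point are exactly what the background coloring and the pie-chart occurrence profiles in the tool summarize.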

The Neighbor View Tab shows a bit more structure. The k-nearest neighbors and the reverse k-nearest neighbors of the selected image are shown in lists. It is possible to select a new image from either of the lists or from the local view of the k-NN graph. The graph on the left is a restriction of the k-NN graph to the selected set of points. One can add either the selected point, its neighbors, or its reverse neighbors, and it is also possible to remove points from the graph. These graphs are generated automatically for the entire range of k-values and are updated when the k-slider is moved. Additionally, one can see the occurrence profile of a neighbor point as a pie chart and examine how often it occurs as a good neighbor and how often as a bad one.

The Class View offers insight into the hub structure of each class, as well as the interplay between hubs in different classes, which is summarized in the class-to-class hubness view on the lower right. The main set of panels in this view is contained in a scroll pane and shows an ordered list of major hubs, good hubs and bad hubs for each class separately. As before, they are selectable. Additionally, there is a point type distribution, where points are labeled as safe, borderline, rare, or outliers, based on the percentage of label mismatches in their respective k-neighbor sets. The chart in the upper part shows the class distribution, which allows us to see if the data is imbalanced. Imbalanced data is known to pose difficulties for many data mining techniques.
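The safe/borderline/rare/outlier point typing can be sketched as follows; the thresholds below are one common choice for a mismatch-based taxonomy and are not necessarily the exact cut-offs used in Image Hub Explorer:

```python
import numpy as np

def point_types(X, labels, k=5):
    """Label each point safe / borderline / rare / outlier by the
    fraction of label mismatches among its k nearest neighbors.
    The thresholds are illustrative, not the tool's exact values."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    types = []
    for i in range(n):
        nn = np.argsort(d[i])[:k]
        mismatch = np.mean(labels[nn] != labels[i])
        if mismatch <= 0.3:
            types.append("safe")        # mostly same-class neighborhood
        elif mismatch <= 0.6:
            types.append("borderline")  # mixed neighborhood
        elif mismatch <= 0.8:
            types.append("rare")        # mostly other-class neighbors
        else:
            types.append("outlier")     # isolated within its class
    return types

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 10)),
               rng.normal(4, 1, (40, 10))])
y = np.array([0] * 40 + [1] * 40)
types = point_types(X, y)
```

On well-separated toy classes like these, nearly all points come out safe; on real high-dimensional image data the borderline and rare categories grow, which is exactly what the point type distribution panel makes visible.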

The last View deals with potential queries to the image database, i.e. the similarity search. A user can upload an image and the applet will return the set of most similar images, based on the quantized SIFT features extended by binned color histograms. The applet extracts the features of the new image and does the metric comparisons. Apart from the k-neighbor set, a user can also look at how various variants of the k-nearest neighbor algorithms would assign the label, based on the retrieved points. Eight such algorithms are currently supported, some of which are our own and have been recently proposed precisely for dealing with this sort of data. The applet shows the classification confidence of: kNN, FNN, NWKNN, AKNN, hw-kNN, h-FNN, HIKNN and NHBNN.
Apart from classification, a user can also try to invoke hubness-based re-ranking of the neighbor set, performed based on what was learned from the previous occurrences of those neighbor points. In practice, this works quite well – and we have also done it for other forms of data.
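To give a flavor of how occurrence information enters classification, here is a simplified vote-weighting sketch in the spirit of hw-kNN, where neighbors with high bad-occurrence counts are down-weighted; this is an illustrative toy, not the exact Hub Miner implementation, and the standardization here is done over the neighbor set only for brevity:

```python
import numpy as np

def hw_knn_vote(neighbor_labels, bad_counts, classes):
    """Hubness-weighted kNN voting sketch: each neighbor's vote is
    scaled by exp(-standardized bad-occurrence count), so neighbors
    that often cause label mismatches (bad hubs) count less."""
    mu = bad_counts.mean()
    sigma = bad_counts.std() or 1.0  # guard against zero variance
    votes = {}
    for lbl, b in zip(neighbor_labels, bad_counts):
        w = np.exp(-(b - mu) / sigma)  # lower weight for bad hubs
        votes[lbl] = votes.get(lbl, 0.0) + w
    total = sum(votes.values())
    return {c: votes.get(c, 0.0) / total for c in classes}

# toy query: 5 retrieved neighbors; the two class-1 neighbors are
# bad hubs (high bad-occurrence counts), so class 0 should win
conf = hw_knn_vote(np.array([0, 0, 0, 1, 1]),
                   np.array([1, 0, 1, 8, 9]), classes=[0, 1])
```

The other hubness-aware variants (h-FNN, HIKNN, NHBNN) use richer occurrence models, but all share this idea of learning from past neighbor occurrences rather than trusting each retrieved neighbor equally.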

All the outlined analysis is possible for several primary and several secondary similarity/distance measures. The primary ones are quite standard and include Manhattan, Euclidean, Cosine and Jaccard, while the secondary distances include simcos, simhub, mutual proximity, NICDM and local scaling. Simcos has long been used as a sort of cure for the dimensionality curse. Simhub is a hubness-aware extension of the simcos measure and is one of our contributions; it has been analyzed in detail in our paper in the Knowledge and Information Systems journal. Mutual proximity is a slightly different, yet quite effective approach by a group of authors from Austria, published in 2012 in the Journal of Machine Learning Research. NICDM and local scaling are other notable attempts at tackling hubness.
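The shared-neighbor idea behind simcos is simple enough to sketch directly: two points are considered similar when their s-nearest-neighbor sets overlap. The following toy Python version shows the plain simcos measure; simhub additionally weights each shared neighbor by its occurrence informativeness, which is omitted here.

```python
import numpy as np

def simcos(X, s=10):
    """Secondary shared-neighbor similarity: the (normalized) size
    of the overlap between the s-nearest-neighbor sets of two
    points. Illustrative sketch, not the Hub Miner implementation."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn_sets = [set(np.argsort(d[i])[:s]) for i in range(n)]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = len(nn_sets[i] & nn_sets[j]) / s
    return sim

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 30))
S = simcos(X, s=10)  # symmetric, values in [0, 1], ones on the diagonal
```

Because the secondary measure depends only on neighbor-set overlap and not on raw distance magnitudes, it tends to flatten out the influence of hubs induced by the primary metric.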

This is merely a first version of the app and we intend to polish the interface a little bit and add more methods and functionalities. Yet, I feel that it is a nice way to visualize medium-sized image collections and gain some insights into their k-nearest neighbor topologies in order to improve the performance of either the object recognition or the image retrieval systems.

Categories: Application, Hubness, Images, Visualization Tags:

The Hub Word-Cloud

October 23rd, 2012 Comments off

My research has recently been going in several different directions.
I have generated some topic word clouds from my papers to help summarize what it is about.

First of all, there is clustering: exploiting hubness, which arises as a consequence of the dimensionality curse, to improve clustering performance in high-dimensional data. This is an extension of my work first presented in the award-winning PAKDD paper The Role of Hubness in Clustering High-dimensional Data.

Hubs in Clustering

Then, there is the work I did on metric learning under the assumption of hubness, which yielded some surprisingly good results. I will probably give some updates on that soon. The approach I proposed can essentially be viewed as an extension of the cosine similarity used in collaborative filtering, so that the influence of hub points is taken into account and properly modeled.

Shared neighbor distances for high-dimensional data

We have also worked on cross-lingual supervised document retrieval, where hubs play an important role.

Cross-lingual hubness-aware document retrieval

Last, but not least, there is my work on improving k-nearest neighbor classification in high-dimensional data. I have explored several ways of doing so and I am in the process of revising some of the proposed approaches in order to better model the co-occurrences and provide a more robust alternative to other nearest neighbor methods. The initial tests, however, show that hubness-aware kNN classification is in fact very good even in its basic (proof-of-concept) form and outperforms many standard methods on high-dimensional data.

Hubness-aware classification

Categories: Hubness, Visualization Tags: