Image Hub Explorer Demo Video

May 10th, 2013 Comments off

We have completed the initial demo video of the Image Hub Explorer system, and it is now available on YouTube.

The demo covers the basic functionality of the system, demonstrating some of the anticipated use cases.
The user interface will undergo further changes and improvements, and the functionality will be extended with new models and learning approaches.

Improving the semantic representations for cross-lingual document retrieval

May 4th, 2013 Comments off

I have had the pleasure of presenting some of our recent results at PAKDD 2013 in Gold Coast, Australia. The conference was great and the location couldn’t have been better, so we were able to catch some sun and walk along the beaches while discussing future collaboration, theory and applications.

Hubs are centers of influence and are known to arise in textual data. They are also known to cause problems by being frequent neighbors (i.e., very similar) to semantically different types of documents. However, it was previously unknown whether this property is language-dependent and how it affects the cross-lingual information retrieval process.

What we have shown by analyzing aligned text corpora can be summarized as follows: hubs in one language are not necessarily hubs in another language; different documents become influential. However, surprisingly, the percentage of label mismatches in reverse neighbor sets remains more or less unchanged. In other words, the nature of the occurrences is preserved across languages. This comes as a bit of a surprise, since hubness is arguably a geometric property arising from the interplay of metrics and data representations. Yet, it seems that more semantics than was previously thought remains hidden there, captured and preserved across different languages.

We have used this observation to show that it is possible to improve the common semantic representation built via canonical correlation analysis (CCA) by simply introducing some hubness-aware instance weights. This is certainly not the only way to go about it, and probably not the best one, but it served as a good proof of concept.
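To make the idea concrete, here is an illustrative sketch in plain Java (not the exact weighting scheme from the paper) of how per-document instance weights could be derived from good-occurrence fractions and folded into the data matrix before running any off-the-shelf CCA routine. The good/bad occurrence counts are assumed to have been computed beforehand from the k-nearest neighbor structure of the corpus.

// Illustrative sketch: hubness-aware instance weights before CCA.
// good[i] / bad[i] count how often document i occurred as a good / bad
// neighbor; rows of the aligned feature matrix X are rescaled so that
// documents with purer occurrence profiles carry more weight.
static void applyHubnessWeights(double[][] X, int[] good, int[] bad) {
    for (int i = 0; i < X.length; i++) {
        int total = good[i] + bad[i];
        // Neutral weight for documents that never occur as neighbors.
        double w = (total > 0) ? (double) good[i] / total : 0.5;
        // sqrt(w) so that the weight enters the covariance estimates linearly.
        for (int j = 0; j < X[i].length; j++) {
            X[i][j] *= Math.sqrt(w);
        }
    }
}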

The entire paper can be found here: The Role of Hubs in Cross-lingual Supervised Document Retrieval

Categories: Application, Data Mining, Hubness, Text Tags:

A publication in the IEEE Transactions on Knowledge and Data Engineering

April 3rd, 2013 Comments off

Our clustering paper was recently accepted for publication in the IEEE TKDE journal (IF 1.7). It is titled The Role of Hubness in Clustering High-Dimensional Data and is an extended version of the paper that was previously awarded the Best research paper runner-up award at the PAKDD 2011 conference in Shenzhen, China.

We have shown that hubs in the kNN topology can be successfully exploited for data clustering. Furthermore, we have shown that point hubness (a point's in-degree in the kNN graph, i.e., its neighbor occurrence frequency) is a much better measure of local cluster centrality than density, under the assumption of high intrinsic data dimensionality.

We have analyzed three proof-of-concept hubness-based clustering algorithms: K-hubs, global hubness-proportional clustering, and global hubness-proportional K-means.
The experimental evaluation confirms the robustness of hubness-based clustering and suggests that this might indeed be a promising research direction.
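For the curious, here is a minimal sketch of the K-hubs idea in plain Java. The iteration mirrors K-means, but each cluster prototype is a point, namely the highest-hubness point of the cluster, rather than the mean. It assumes a precomputed distance matrix and kNN sets, and it glosses over the initialization and convergence details of the actual algorithm.

// Minimal K-hubs sketch: K-means-style iteration where each prototype is
// the point with the highest within-cluster occurrence count, not the mean.
static int[] kHubs(double[][] d, int[][] knn, int numClusters, int iterations) {
    int n = d.length;
    java.util.Random rnd = new java.util.Random(42);
    int[] prototypes = new int[numClusters];
    for (int c = 0; c < numClusters; c++) {
        prototypes[c] = rnd.nextInt(n); // naive init; the paper is more careful
    }
    int[] assignment = new int[n];
    for (int it = 0; it < iterations; it++) {
        // Assignment step: each point joins its nearest prototype.
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int c = 1; c < numClusters; c++) {
                if (d[i][prototypes[c]] < d[i][prototypes[best]]) best = c;
            }
            assignment[i] = best;
        }
        // Update step: the prototype becomes the highest-hubness point of
        // its cluster, counting only within-cluster kNN occurrences.
        int[] counts = new int[n];
        for (int i = 0; i < n; i++) {
            for (int x : knn[i]) {
                if (assignment[x] == assignment[i]) counts[x]++;
            }
        }
        for (int c = 0; c < numClusters; c++) {
            int best = -1;
            for (int i = 0; i < n; i++) {
                if (assignment[i] == c && (best < 0 || counts[i] > counts[best])) best = i;
            }
            if (best >= 0) prototypes[c] = best;
        }
    }
    return assignment;
}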

GHPC clustering process

The change in cluster entropy (non-homogeneity) with increasing levels of noise. The proposed methods greatly outperform the standard K-means++ algorithm.

Demo: exploring the influence of hubs in object recognition and image retrieval

January 15th, 2013 Comments off

As we have been mainly focused on research recently, I decided to take a short detour and put together some of the libraries that I have been working on into a small, lightweight demo for exploring image sets. I will demonstrate some of the features of the initial version here, though it will certainly grow in time.

I use my own Java libraries for data mining, which have been optimized for nearest neighbor methods in particular. There are other benchmark methods as well, but they are not the focus of my work, so I have given them less attention. The library itself is ca. 100,000 lines of code and I intend to release it eventually, once I get my PhD and document it properly, as I suppose that it might be of help to other researchers and enable them to use the methods that we have been developing. This involves novel methods for classification, clustering, metric learning, re-ranking, instance selection, anomaly detection, etc. Not all of it is included in the demo, but there is enough functionality for some interesting data analysis.

So, briefly, what I am working on is the phenomenon of the emergence of hubs in high-dimensional feature spaces: learning under the skewed distribution of influence in scale-free-like k-nearest neighbor networks. In this context, a hub is an influential point that is a very frequent neighbor, i.e., a very frequently retrieved object.
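For readers who prefer code to definitions, here is a minimal, self-contained sketch in plain Java (not the actual library code) of how the kNN sets, the neighbor occurrence counts and the skewness of their distribution can be computed from a distance matrix. Strong positive skewness is the usual sign of hubness.

import java.util.Arrays;

// Minimal sketch of neighbor occurrence counts (hubness), computed
// from a symmetric distance matrix.
public class HubnessSketch {

    // Returns the indices of the k nearest neighbors of each point;
    // d[i][i] is skipped so a point is never its own neighbor.
    static int[][] kNearestNeighbors(double[][] d, int k) {
        int n = d.length;
        int[][] knn = new int[n][k];
        for (int i = 0; i < n; i++) {
            Integer[] idx = new Integer[n];
            for (int j = 0; j < n; j++) idx[j] = j;
            final int q = i;
            Arrays.sort(idx, (a, b) -> Double.compare(d[q][a], d[q][b]));
            int filled = 0;
            for (int j = 0; j < n && filled < k; j++) {
                if (idx[j] != i) knn[i][filled++] = idx[j];
            }
        }
        return knn;
    }

    // Occurrence count N_k(x): how often x appears in other points' kNN
    // sets. Hubs are the points with unusually high N_k.
    static int[] occurrenceCounts(int[][] knn, int n) {
        int[] counts = new int[n];
        for (int[] neighbors : knn) {
            for (int x : neighbors) counts[x]++;
        }
        return counts;
    }

    // Skewness of the N_k distribution; high positive skew signals hubness.
    static double skewness(int[] counts) {
        int n = counts.length;
        double mean = Arrays.stream(counts).average().orElse(0);
        double m2 = 0, m3 = 0;
        for (int c : counts) {
            double dev = c - mean;
            m2 += dev * dev;
            m3 += dev * dev * dev;
        }
        m2 /= n;
        m3 /= n;
        if (m2 == 0) return 0; // degenerate: all counts equal
        return m3 / Math.pow(m2, 1.5);
    }
}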

There are four main views/tabs in the applet, as well as several method-selecting menus. The Data Overview tab shown in the screenshot below outlines the basic k-nearest-neighbor properties of the data and its neighbor occurrence distribution. All the statistics, graphs and charts in all the tabs respond interactively to a change in neighborhood size, invoked by moving the slider. The initially covered range is [1, 50].

We have used image data to make the visualizations easier, but the system design is such that the distance matrix used to find the neighbor sets need not be calculated from the image data. One could use an aligned document/image dataset and visualize documents by their associated images.

The Data Overview tab also shows a 2D visualization of the dataset, obtained by performing multidimensional scaling. The background is colored based on the good and bad hubness quantities of the relevant points. Good hubness counts occurrences that do not constitute a label mismatch, while bad hubness marks label mismatches in reverse k-neighbor sets. A certain number of hub-points is initially drawn onto the panel and they can be selected. The frame color of each image corresponds to its class. A chart in the bottom right part shows the occurrence frequency distribution itself, while various labels contain information about the statistical moments of the distribution and the purity of the k-nearest neighbor sets and the reverse k-nearest neighbor sets.
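Continuing the sketch from above, the good and bad occurrence counts behind this coloring can be computed in a few lines, assuming the kNN sets from the previous sketch and an array of integer class labels.

// Sketch: splitting occurrence counts into good and bad hubness.
// An occurrence of x in the kNN set of point i is "good" if their
// labels match and "bad" otherwise.
static int[][] goodBadHubness(int[][] knn, int[] labels) {
    int n = labels.length;
    int[] good = new int[n];
    int[] bad = new int[n];
    for (int i = 0; i < n; i++) {
        for (int x : knn[i]) {
            if (labels[x] == labels[i]) good[x]++;
            else bad[x]++;
        }
    }
    return new int[][] { good, bad };
}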

The Neighbor View tab shows a bit more structure. The k-nearest neighbors and the reverse k-nearest neighbors of the selected image are shown in lists. It is possible to select a new image from either of the lists or from the local view of the k-NN graph. The graph on the left is a restriction of the k-NN graph to the selected set of points. One can add either the selected point, its neighbors, or its reverse neighbors, and it is also possible to remove points from the graph. These graphs are generated automatically for the entire range of k-values and are updated when the k-slider is moved. Additionally, one can see the occurrence profile of a neighbor point as a pie chart and examine how often it occurs as a good neighbor and how often as a bad one.

The Class View tab offers insight into the hub-structure of each class, as well as the interplay between hubs in different classes, which is summarized in the class-to-class hubness display on the lower right side. The main set of panels in this view is contained in a scroll pane and shows an ordered list of major hubs, good hubs and bad hubs for each class separately. As before, they are selectable. Additionally, there is a point type distribution, where points are labeled as safe, borderline, rare or outliers, based on the percentage of label mismatches in their respective k-neighbor sets. The chart in the upper part shows the class distribution, which allows us to see if the data is imbalanced. Imbalanced data is known to pose difficulties for many data mining techniques.
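The point typing can be sketched as follows; the thresholds below are illustrative rather than the applet's exact scheme.

// Sketch: labeling a point as safe / borderline / rare / outlier by the
// fraction of label mismatches among its own k nearest neighbors.
// Threshold values are illustrative assumptions.
static String pointType(int i, int[][] knn, int[] labels) {
    int mismatches = 0;
    for (int x : knn[i]) {
        if (labels[x] != labels[i]) mismatches++;
    }
    double p = (double) mismatches / knn[i].length;
    if (p <= 0.3) return "safe";
    if (p <= 0.5) return "borderline";
    if (p <= 0.8) return "rare";
    return "outlier";
}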

The last view deals with potential queries to the image database, i.e., similarity search. A user can upload an image and the applet will return the set of most similar images, based on the quantized SIFT features extended by binned color histograms. The applet extracts the features of the new image and performs the metric comparisons. Apart from the k-neighbor set, a user can also look at how various variants of the k-nearest neighbor algorithms would assign the label, based on the retrieved points. Eight such algorithms are currently supported, some of which are our own and were recently proposed precisely for dealing with this sort of data. The applet shows the classification confidence of kNN, FNN, NWKNN, AKNN, hw-kNN, h-FNN, HIKNN and NHBNN.
Apart from classification, a user can also invoke hubness-based re-ranking of the neighbor set, performed based on what was learned from the previous occurrences of those neighbor points. In practice, this works quite well, and we have also applied it to other forms of data.
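As an illustration, this is roughly how hubness-informed voting works in the spirit of hw-kNN: each neighbor's vote is scaled down by its standardized bad hubness, so that historically unreliable neighbors count less. The eight supported classifiers differ in their details, and this sketch should not be read as the applet's exact implementation.

// Sketch of hubness-weighted kNN voting: neighbors with high bad hubness
// (relative to the dataset average) get exponentially smaller votes.
static int hwKnnVote(int[] neighbors, int[] labels, int[] bad, int numClasses) {
    int n = bad.length;
    double mean = 0, var = 0;
    for (int b : bad) mean += b;
    mean /= n;
    for (int b : bad) var += (b - mean) * (b - mean);
    double std = Math.sqrt(var / n) + 1e-12; // avoid division by zero
    double[] votes = new double[numClasses];
    for (int x : neighbors) {
        double hb = (bad[x] - mean) / std;   // standardized bad hubness
        votes[labels[x]] += Math.exp(-hb);
    }
    int best = 0;
    for (int c = 1; c < numClasses; c++) {
        if (votes[c] > votes[best]) best = c;
    }
    return best;
}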

All the outlined analysis is possible for several primary and several secondary similarity/distance measures. The primary ones are quite standard and include Manhattan, Euclidean, cosine and Jaccard, while the secondary distances include simcos, simhub, mutual proximity, NICDM and local scaling. Simcos has long been used as a sort of cure for the dimensionality curse. Simhub is a hubness-aware extension of the simcos measure and one of our contributions; it has been analyzed in detail in our paper in the Knowledge and Information Systems journal. Mutual proximity is a slightly different, yet quite effective approach by a group of authors from Austria, published in 2012 in the Journal of Machine Learning Research. NICDM and local scaling are other notable attempts at tackling hubness.
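A sketch of the simcos idea, assuming precomputed s-nearest-neighbor sets: the similarity between two points is the overlap between their neighbor sets, normalized to [0, 1]. Simhub additionally weights each shared neighbor by the purity and informativeness of its past occurrences, which is omitted here.

// Sketch of the simcos shared-neighbor similarity: the size of the
// intersection of the two s-nearest-neighbor sets, divided by s.
static double simcos(int a, int b, int[][] snn) {
    int shared = 0;
    for (int x : snn[a]) {
        for (int y : snn[b]) {
            if (x == y) shared++;
        }
    }
    return (double) shared / snn[a].length;
}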

This is merely a first version of the app, and we intend to polish the interface a little and add more methods and functionality. Yet, I feel that it is a nice way to visualize medium-sized image collections and gain some insight into their k-nearest neighbor topologies, in order to improve the performance of either object recognition or image retrieval systems.

Categories: Application, Hubness, Images, Visualization Tags:

A publication in Knowledge and Information Systems Journal

January 15th, 2013 Comments off

I’ve just had a paper published in the Knowledge and Information Systems (KAIS) journal. It proposes a novel secondary similarity measure, tailored specifically for high-dimensional data under the assumption of hubness. The experimental results are quite encouraging and the analysis reveals many interesting properties. The overall percentage of label mismatches is reduced, the occurrence profile purity is increased, and the overall classifier performance is therefore significantly better under the new measure. If you are working with similarity-based methods in high-dimensional data, it is worth checking out:

The Electronic Version

Categories: Uncategorized Tags:

The Hub Word-Cloud

October 23rd, 2012 Comments off

My research has recently been going in several different directions, so I have generated some topic word clouds from my papers to help summarize what each direction is about.

First of all, there is clustering: exploiting hubness, a consequence of the dimensionality curse, to improve clustering performance in high-dimensional data. This is an extension of my work first presented in the awarded PAKDD paper The Role of Hubness in Clustering High-Dimensional Data.

Hubs in Clustering

Then, there is the work I did on metric learning under the assumption of hubness, which yielded some surprisingly good results; I will probably give some updates on that soon. The approach I proposed can essentially be viewed as an extension of a cosine similarity used in collaborative filtering, so that the influence of hub-points is taken into account and properly modeled.

Shared neighbor distances for high-dimensional data

We have also worked on cross-lingual supervised document retrieval, where hubs play an important role.

Cross-lingual hubness-aware document retrieval

Last, but not least, there is my work on improving k-nearest neighbor classification in high-dimensional data. I have explored several ways of doing so and am in the process of revising some of the proposed approaches in order to better model the co-occurrences and provide a more robust alternative to other nearest neighbor methods. The initial tests, however, show that hubness-aware kNN classification is in fact very good even in its basic (proof-of-concept) form and outperforms many standard methods on high-dimensional data.

Hubness-aware classification

Categories: Hubness, Visualization Tags:

Outlier/error detection in sensor data based on bad hubs

October 23rd, 2012 Comments off

A lot of sensor data is being collected every minute and used for various sorts of prediction. Yet, these measurements are not perfect and the sensors sometimes break or malfunction. Detecting these anomalies is part of the data cleaning and preparation process.
There are many ways to do outlier and anomaly detection and there is a whole body of literature devoted to the problem.
What we took a look at instead was one specific test scenario: whether the curse of dimensionality affects the time series enough that the emerging hubs in the data can be used as potential markers for anomalous measurement records. It turns out that they can, and that high bad hubness of measurement points clearly indicates that something is not right. What exactly, well, that is for the experts to say in any particular test case. Here are some graphs from the tool we have developed and described in one of our papers.

What constitutes the ‘bad’ hubness of the suspicious points here? They are frequent neighbors to points in other geographical regions, so the distance between measurements does not correspond well to the spatial distance. Of course, the properties of a region are not homogeneous and it is certainly possible for correct, non-noisy sensors to produce such data. However, the number of such measurement points is usually small and they are prime candidates for a closer look. This makes for a good semi-automated anomaly detection system.
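The flagging step can be sketched quite compactly in plain Java; the region assignments and the cutoff below are illustrative assumptions rather than details of our actual tool.

// Sketch: flag measurement points that are frequent neighbors to points
// in OTHER geographical regions, i.e., points with high cross-region
// ("bad") hubness relative to their total occurrence count.
static java.util.List<Integer> flagSuspicious(int[][] knn, int[] regionOf, double cutoff) {
    int n = regionOf.length;
    int[] crossRegion = new int[n];
    int[] total = new int[n];
    for (int i = 0; i < n; i++) {
        for (int x : knn[i]) {
            total[x]++;
            if (regionOf[x] != regionOf[i]) crossRegion[x]++;
        }
    }
    java.util.List<Integer> flagged = new java.util.ArrayList<>();
    for (int i = 0; i < n; i++) {
        if (total[i] > 0 && (double) crossRegion[i] / total[i] > cutoff) {
            flagged.add(i); // candidate for closer inspection by an expert
        }
    }
    return flagged;
}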

Categories: Application, Hubness, Sensor data Tags:

Hubness in 3D

August 13th, 2011 Comments off

As I was delving deeper into hubness-related topics, I also started making some visualizations to enrich our intuition about the problems at hand. This was not, however, an easy task. Hubness is a property of inherently high-dimensional data, where influential points (hubs) emerge as a consequence of distance concentration and other peculiar geometric phenomena that plague such high-dimensional spaces. We humans cannot, however, visualize these things in their original dimensionality; we have to significantly reduce it and represent it in some other way. This being so, I have either been applying multidimensional scaling to high-dimensional data or playing around with 2D data (even though hubness as such does not and cannot exist in 2D) to see if I could gain some more insight into the related problems. After all, the hubness-aware classification algorithms that I have been developing (and still am) are also applicable to lower-dimensional data and can lead to significant improvements even in such restricted cases.

So, some time ago I thought: 3D data is still not high-dimensional and no proper hubness can be found there, but it has to be better than 2D, so maybe I could see some more interesting things if I dared to take a peek into 3D nearest-neighbor structures. And so I did. Long story short: I took some UCI data, performed dimensionality reduction down to 3D, and then determined the class affiliation probabilities of each voxel by applying the above-mentioned hubness-aware algorithms (most of which you can find in my publication list). I also compared the results to the basic k-nearest neighbor methods. The Blinn-Phong shading model was used for lighting the images (so that the 3D structures can be more easily perceived). The volume of each cube was projected onto each of its sides, so six images were created for each cube.

I am not going to discuss any of the comparisons or the few selected images given here. I have generated many such images, but what I wanted to do was select some that simply looked cool 🙂 and make a sort of gallery of interesting/weird hubness-related 3D images. Not very scientific, I know, but there are always publications/conferences/journals for the serious stuff. No reason why we shouldn't have some fun every now and then. So, enjoy 😉

(A gallery of twelve 3D visualizations follows, generated by the kNN, hFNN and HIKNN methods.)
Categories: Hubness Tags:

About me

July 7th, 2011 Comments off

So, this is going to be my web page, apparently: the place where I will upload some of my research papers and discuss topics of interest. As you may have figured out, I am a computer scientist, currently working on my PhD in machine learning at the Jožef Stefan Institute in Ljubljana, Slovenia. My interests extend to other topics as well, such as stochastic optimization, artificial life, AI, game theory, social network analysis, bioinformatics, as well as some more mathematically rich areas like dynamical systems, chaos theory, graph theory, etc.

I was born in Serbia, in the lovely city of Novi Sad (mostly known for its summer festival, Exit), where I finished my basic education and where I graduated in informatics in 2008. For my success during my studies, as well as for my graduation thesis, I was awarded the Aleksandar Saša Popović award of excellence. Along with my interests in mathematics and computer science, I was also into biology back then, doing some field work and mostly studying insect populations. I have participated in various educational seminars and courses at the Petnica Science Center, both attending them as a high-school student and lecturing there when I was older. I also had the opportunity to lead a short project at the Višnjan summer school for gifted high-school kids, titled Evolution in the Core: a Journey Through the Basics of Artificial Life.

I am presently working on high-dimensional phenomena in real-world data, focusing mostly on the phenomenon of hubness, which is an aspect of the dimensionality curse pertaining to nearest neighbor methods in general, in both classification and clustering. I am constantly finding new applications and new ways of exploiting it, so it is an exciting new research area. If you have any comments/ideas/proposals related to this or to machine learning in general, do not hesitate to get in touch.

And so it begins…

Cheers,

Nenad

Categories: Uncategorized Tags: