
Demo: exploring the influence of hubs in object recognition and image retrieval

January 15th, 2013

As we have been mainly focused on research recently, I decided to take a short detour and combine some of the libraries I have been working on into a small, lightweight demo for exploring image sets. I will demonstrate some features of this initial version here, though the demo will certainly grow over time.

I use my own Java libraries for data mining, which have been optimized for nearest neighbor methods in particular. Other benchmark methods are included as well, but as they are not the focus of my work, I have paid them less attention. The library itself is about 100,000 lines of code and I intend to release it eventually, once I finish my PhD and document it properly, as I suppose it might be of help to other researchers and enable them to use the methods that we have been developing. These include novel methods for classification, clustering, metric learning, re-ranking, instance selection, anomaly detection, and more. Not all of it is included in the demo, but there is enough functionality for some interesting data analysis.

So, briefly, what I am working on is the phenomenon of hub emergence in high-dimensional feature spaces: learning under the skewed distribution of influence that arises in scale-free-like k-nearest neighbor networks. In this context, a hub is an influential point that appears very frequently as a nearest neighbor, i.e. a very frequently retrieved object.
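To make the notion concrete, here is a minimal sketch of how such neighbor occurrence counts can be computed from a precomputed distance matrix. This is not the demo's actual code, just an illustration with hypothetical names; a point's occurrence count N_k is simply the number of k-neighbor lists it appears in.

```java
import java.util.Arrays;

// A minimal, brute-force sketch of computing k-occurrence counts from a
// distance matrix; all names here are illustrative.
public class HubnessSketch {

    /** Returns occ[i] = N_k(i): how many points list i among their k
     *  nearest neighbors. Points with the largest counts are the hubs. */
    public static int[] kOccurrences(double[][] dist, int k) {
        int n = dist.length;
        int[] occ = new int[n];
        for (int i = 0; i < n; i++) {
            Integer[] idx = new Integer[n];
            for (int j = 0; j < n; j++) {
                idx[j] = j;
            }
            final int query = i;
            // Sort candidate indices by their distance to the query point.
            Arrays.sort(idx, (a, b) -> Double.compare(dist[query][a], dist[query][b]));
            int taken = 0;
            for (int j = 0; j < n && taken < k; j++) {
                if (idx[j] == query) {
                    continue; // a point is not its own neighbor
                }
                occ[idx[j]]++;
                taken++;
            }
        }
        return occ;
    }
}
```

In high-dimensional data, the distribution of these counts becomes heavily skewed: a few points end up with very large N_k values, which is precisely the hubness phenomenon the demo visualizes.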

There are four main views/tabs in the applet, as well as several method-selecting menus. The Data Overview tab shown in the screenshot below outlines the basic k-nearest-neighbor properties of the data and its neighbor occurrence distribution. All the statistics, graphs, and charts in all the tabs respond interactively to a change in neighborhood size invoked by moving the slider. The initially covered range is [1, 50].

We have used image data to make the visualizations easier, but the system is designed so that the distance matrix used to find the neighbor sets need not be calculated from the image data itself. One could, for instance, use an aligned document/image dataset and visualize documents via their associated images.

The Data Overview tab also shows a 2D visualization of the dataset, obtained by multidimensional scaling. The background is colored according to the good or bad hubness quantities at the relevant points. Good hubness counts occurrences that do not constitute a label mismatch, while bad hubness marks label mismatches in reverse k-neighbor sets. A certain number of hub-points is initially drawn onto the panel, and they can be selected. The frame color of each image corresponds to its class. A chart in the bottom-right part shows the occurrence frequency distribution itself, while various labels report the statistical moments of the distribution and the purity of the k-nearest neighbor and reverse k-nearest neighbor sets.
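For clarity, here is how the good and bad occurrence counts could be tallied once the k-NN lists are known; a rough sketch (a method that could live in the sketch class above), not the demo's implementation:

```java
// Splits each point's occurrences into good (label match) and bad
// (label mismatch) ones, given precomputed k-NN lists knn[i] and class
// labels. Purely illustrative.
public static int[][] goodBadOccurrences(int[][] knn, int[] labels) {
    int n = knn.length;
    int[] good = new int[n];
    int[] bad = new int[n];
    for (int i = 0; i < n; i++) {
        for (int neighbor : knn[i]) {
            // An occurrence of 'neighbor' in i's k-NN list is good if
            // their labels agree, and bad otherwise.
            if (labels[neighbor] == labels[i]) {
                good[neighbor]++;
            } else {
                bad[neighbor]++;
            }
        }
    }
    return new int[][] { good, bad };
}
```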

The Neighbor View tab shows a bit more structure. The k-nearest neighbors and the reverse k-nearest neighbors of the selected image are shown in lists. It is possible to select a new image from either of the lists or from the local view of the k-NN graph. The graph on the left shows the restriction of the k-NN graph to the selected set of points. One can add the selected point, its neighbors, or its reverse neighbors, and it is also possible to remove points from the graph. These graphs are generated automatically for the entire range of k-values and are updated when the k-slider is moved. Additionally, one can see the occurrence profile of a neighbor point as a pie chart and examine how often it occurs as a good neighbor and how often as a bad one.
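The reverse neighbor lists themselves come almost for free once the k-NN lists exist; one possible way to build them, again just a sketch:

```java
// Builds reverse k-neighbor lists by inverting the k-NN lists: point i
// appears in the reverse list of each of its own neighbors. Illustrative.
public static java.util.List<java.util.List<Integer>> reverseNeighbors(int[][] knn) {
    java.util.List<java.util.List<Integer>> reverse =
            new java.util.ArrayList<java.util.List<Integer>>();
    for (int i = 0; i < knn.length; i++) {
        reverse.add(new java.util.ArrayList<Integer>());
    }
    for (int i = 0; i < knn.length; i++) {
        for (int neighbor : knn[i]) {
            reverse.get(neighbor).add(i);
        }
    }
    return reverse; // reverse.get(x).size() equals the occurrence count N_k(x)
}
```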

The Class View offers insight into the hub structure of each class, as well as the interplay between hubs in different classes, which is summarized in the class-to-class hubness shown on the lower right side. The main set of panels in this view is contained in a scroll pane and shows an ordered list of major hubs, good hubs, and bad hubs for each class separately. As before, they are selectable. Additionally, there is a point type distribution, where the points are labeled as safe, borderline, rare, or outliers, based on the percentage of label mismatches in their respective k-neighbor sets. The chart in the upper part shows the distribution of classes, which allows us to see whether the data is imbalanced. Imbalanced data is known to pose difficulties for many data mining techniques.
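A simplified sketch of this kind of point typing follows; the mismatch thresholds below are merely illustrative placeholders:

```java
// Types a point by the fraction of label mismatches among its k nearest
// neighbors. The threshold values here are illustrative assumptions.
public static String pointType(int[] neighbors, int[] labels, int self) {
    int mismatches = 0;
    for (int neighbor : neighbors) {
        if (labels[neighbor] != labels[self]) {
            mismatches++;
        }
    }
    double ratio = (double) mismatches / neighbors.length;
    if (ratio <= 0.3) return "safe";
    if (ratio <= 0.5) return "borderline";
    if (ratio <= 0.8) return "rare";
    return "outlier";
}
```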

The last view deals with potential queries to the image database, i.e. similarity search. A user can upload an image and the applet will return the set of most similar images, based on quantized SIFT features extended by binned color histograms. The applet extracts the features of the new image and performs the metric comparisons. Apart from the k-neighbor set, a user can also see how various variants of the k-nearest neighbor algorithm would assign the label, based on the retrieved points. Eight such algorithms are currently supported, some of which are our own and have recently been proposed precisely for dealing with this sort of data. The applet shows the classification confidence of kNN, FNN, NWKNN, AKNN, hw-kNN, h-FNN, HIKNN and NHBNN.
Apart from classification, a user can also invoke hubness-based re-ranking of the neighbor set, performed on the basis of what was learned from the previous occurrences of those neighbor points. In practice, this works quite well, and we have also done it for other forms of data.
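To give a flavor of what hubness-aware voting looks like, here is a rough sketch in the spirit of hw-kNN, where each neighbor's vote is down-weighted by its standardized bad hubness. Consider it an illustration of the general idea rather than a faithful reproduction of any of the listed algorithms:

```java
// Hubness-weighted voting sketch: each neighbor votes for its class with
// weight exp(-standardized bad hubness), so points that frequently cause
// label mismatches contribute less. Parameter names are illustrative.
public static int hubnessWeightedVote(int[] neighbors, int[] labels,
                                      double[] badOccCounts, double meanBad,
                                      double stdBad, int numClasses) {
    double[] votes = new double[numClasses];
    for (int neighbor : neighbors) {
        double standardized = (badOccCounts[neighbor] - meanBad) / stdBad;
        votes[labels[neighbor]] += Math.exp(-standardized);
    }
    int best = 0;
    for (int c = 1; c < numClasses; c++) {
        if (votes[c] > votes[best]) {
            best = c;
        }
    }
    return best;
}
```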

All the outlined analysis is possible for several primary and several secondary similarity/distance measures. The primary ones are quite standard and include Manhattan, Euclidean, Cosine and Jaccard, while the secondary distances include simcos, simhub, mutual proximity, NICDM and local scaling. Simcos has long been used as a sort of cure for the dimensionality curse. Simhub is a hubness-aware extension of the simcos measure and one of our own contributions; it has been analyzed in detail in our paper in the Knowledge and Information Systems journal. Mutual proximity is a slightly different, yet quite effective approach by a group of authors from Austria, published in 2012 in the Journal of Machine Learning Research. NICDM and local scaling are other notable attempts at tackling hubness.
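For intuition, here is a simplified sketch of a shared-neighbor similarity of the simcos kind, which scores two points by the fraction of k nearest neighbors they have in common (simhub additionally weights the shared neighbors by their occurrence profiles, which is omitted here):

```java
// Shared-neighbor similarity sketch: the fraction of k nearest neighbors
// two points share, in [0, 1] when both lists have length k. Illustrative.
public static double sharedNeighborSimilarity(int[] knnX, int[] knnY) {
    java.util.Set<Integer> neighborsOfX = new java.util.HashSet<Integer>();
    for (int neighbor : knnX) {
        neighborsOfX.add(neighbor);
    }
    int common = 0;
    for (int neighbor : knnY) {
        if (neighborsOfX.contains(neighbor)) {
            common++;
        }
    }
    return (double) common / knnX.length;
}
```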

This is merely the first version of the app, and we intend to polish the interface a bit and add more methods and functionality. Still, I feel it is a nice way to visualize medium-sized image collections and gain some insight into their k-nearest neighbor topologies, in order to improve the performance of object recognition or image retrieval systems.
