Image Hub Explorer

May 12th, 2013

Image Hub Explorer is a new tool for exploring large image collections, aimed specifically at data mining practitioners and system developers. It allows users to select the best feature representations and metrics for their data and examine the consequences of their choices. Metric learning, search, re-ranking and classification are also supported.

I have developed it as a demonstration of some of the capabilities of my data mining library, Hub Miner.

The main idea behind Image Hub Explorer is to look at the emerging image hubs, the data points of highest impact for the overall system performance.

Here is a demo video:

Hubs are the centers of influence that arise in various sorts of networks and graphs. They are one of the hallmarks of scale-free (power law) degree distributions. In such networks, a small number of nodes (points) dominate and account for the majority of transpiring events.

In the context of kNN (k-nearest neighbor) graphs, where two points are connected if one is a k-neighbor of the other, hubs are simply very frequent neighbors, connected to many other points. It has been shown that intrinsically high-dimensional data (images, text, audio, time series, etc.) exhibits high hubness, so such influential neighbor points arise naturally.
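To make the notion of a "frequent neighbor" concrete, here is a minimal Java sketch that counts how often each point occurs in the k-neighbor lists of the others. It is an illustration only, not code from Hub Miner: the class name `KOccurrenceSketch`, its method names and the toy 1-D data are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

// Hypothetical illustration class, not part of Hub Miner.
class KOccurrenceSketch {

    /** Indices of the k nearest neighbors of every point (the point itself is excluded). */
    static int[][] kNearestNeighbors(double[][] dist, int k) {
        int n = dist.length;
        int[][] neighbors = new int[n][];
        for (int i = 0; i < n; i++) {
            final int query = i;
            neighbors[i] = IntStream.range(0, n)
                    .filter(j -> j != query)
                    .boxed()
                    // Stable sort by distance to the query; ties keep index order.
                    .sorted(Comparator.comparingDouble((Integer j) -> dist[query][j]))
                    .limit(k)
                    .mapToInt(Integer::intValue)
                    .toArray();
        }
        return neighbors;
    }

    /** How many times each point appears in the k-neighbor lists of other points. */
    static int[] occurrenceCounts(int[][] neighbors) {
        int[] counts = new int[neighbors.length];
        for (int[] list : neighbors) {
            for (int j : list) {
                counts[j]++;
            }
        }
        return counts;
    }

    /** Toy helper: pairwise absolute differences between 1-D points. */
    static double[][] pairwiseDistances(double[] x) {
        double[][] dist = new double[x.length][x.length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < x.length; j++) {
                dist[i][j] = Math.abs(x[i] - x[j]);
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        double[][] dist = pairwiseDistances(new double[] {0.0, 1.0, 1.5, 4.0});
        int[] counts = occurrenceCounts(kNearestNeighbors(dist, 1));
        System.out.println(Arrays.toString(counts));
    }
}
```

On this toy data with k = 1, the two middle points each occur twice as neighbors, while the two outer points never occur at all. Points with very high counts are the hubs; points with a count of zero are the orphans.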

So, why should we care? Well, kNN methods are very common in machine learning and data mining and any significant change in the underlying neighbor occurrence distribution affects a wide spectrum of learning techniques, usually in a detrimental way. More importantly, one of the main functions of information retrieval (IR) and recommendation systems is to return the “top-K” result set for a given query. This implicitly corresponds to the kNN case, even if the ranking was not produced directly by some simple metric, but rather via collaborative filtering or some other advanced approach.

In order to demonstrate the consequences of the hubness phenomenon and allow researchers and developers to track emerging hubs in their databases, I have developed Image Hub Explorer, an analytic tool for visualizing various aspects of k-nearest neighbor topologies. As the name suggests, the primary goal was to help in visualizing image data, though it can also be applied to other data types. Different feature representations and different metrics exhibit different degrees of hubness, so this tool makes it possible to choose the underlying representation and metric in a way that reduces the overall hubness of the data, as well as the percentage of label mismatches in k-nearest neighbor sets. Image Hub Explorer supports 7 different primary metrics and 5 recently proposed secondary metrics for high-dimensional data analysis, which amounts to 35 different metric combinations for any given feature representation. This includes some approaches proposed as recently as 2012 and 2013: simhub hubness-aware shared neighbor distances (published in Knowledge and Information Systems) and mutual proximity (published in JMLR).

Hubness is a recently discovered phenomenon (an aspect of the well-known curse of dimensionality, related to distance concentration) and this is the first tool of this kind. So far, hubness has received the most attention in the music retrieval and recommendation community. It is yet to be carefully examined in object recognition tasks, which was the main motivation behind Image Hub Explorer.

It is a complex application built on top of the Hub Miner data mining and machine learning library that I built from scratch during the course of my PhD. Hub Miner implements many state-of-the-art methods for hubness-aware learning and high-dimensional data analysis. Image Hub Explorer offers several different views of the data and some additional interactive functions that we will go through one by one. The underlying library contains almost 600 classes, is implemented in Java and has over 100 000 lines of code. I intend to publish it as open source sometime next year (2014).

We will show the analysis on the Leeds Butterfly dataset, from which we have extracted quantized SIFT feature representations for each image, along with binned color histograms. We have performed similar analysis on Caltech image data, the 17 Flowers dataset, subsets of the ImageNet repository and the Essex face recognition database. We intend to extend our experiments to include the WIKImage data and to implement textual search over images and image search over the associated text.

Let’s have a look at the first screen:

It offers an overview of the main aggregate properties of the data, related to the kNN graph topology and data hubness. The skewness and kurtosis (the 3rd and 4th standardized distribution moments) of neighbor occurrence frequencies are shown, along with the purity of direct and reverse k-neighbor sets and the percentages of hubs, orphans and regular points in the data. All statistics are available for any neighborhood size k between 1 and 50, and users can change the current k by simply moving the k-selection slider.
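For readers who want to reproduce the two summary statistics, the following Java sketch computes the skewness and kurtosis of a list of neighbor occurrence counts from population central moments. The class `OccurrenceMoments` is a hypothetical name for this example, and I am assuming plain (non-excess) kurtosis here; the convention used in the tool itself may differ.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Hub Miner.
class OccurrenceMoments {

    /** Population central moment of the given order. */
    static double centralMoment(int[] counts, int order) {
        double mean = Arrays.stream(counts).average().orElse(0.0);
        return Arrays.stream(counts)
                .mapToDouble(c -> Math.pow(c - mean, order))
                .average()
                .orElse(0.0);
    }

    /** Third standardized moment: m3 / m2^(3/2). NaN if all counts are equal. */
    static double skewness(int[] counts) {
        return centralMoment(counts, 3) / Math.pow(centralMoment(counts, 2), 1.5);
    }

    /** Fourth standardized moment (non-excess): m4 / m2^2. NaN if all counts are equal. */
    static double kurtosis(int[] counts) {
        return centralMoment(counts, 4) / Math.pow(centralMoment(counts, 2), 2.0);
    }

    public static void main(String[] args) {
        int[] balanced = {0, 2, 2, 0};   // occurrences spread symmetrically
        int[] hubby = {0, 0, 0, 4};      // one point takes all occurrences
        System.out.println(skewness(balanced) + " vs " + skewness(hubby));
    }
}
```

A strongly positive skewness means that a few points collect most of the neighbor occurrences, which is exactly what high hubness looks like in the occurrence distribution.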

On the left, the main hubs from various classes are projected onto a plane via a multi-dimensional scaling (MDS) procedure (currently performed by using the MDSJ library, developed at the University of Konstanz). The background is calculated within Image Hub Explorer and is determined by the nature of the projected hub points, with red corresponding to bad hubs and green to good hubs. So, when are the hubs bad and when are they good? Well, bad occurrences are those where the labels of an image and its neighbor do not match, that is, when they belong to different classes. This is what we wish to avoid. Each image is given a frame in the color of its class, for easier overview. All images are selectable by mouse clicks.
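The split into good and bad occurrences is easy to compute once the neighbor lists and class labels are known. Here is a small Java sketch under that assumption; `GoodBadHubness` and the five-point toy example are made up for illustration and are not taken from the tool.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Hub Miner.
class GoodBadHubness {

    /**
     * Returns {good, bad}: for each point, how many of its occurrences as a
     * neighbor match the label of the point it is a neighbor of (good) and
     * how many do not (bad).
     */
    static int[][] goodAndBadCounts(int[][] neighbors, int[] labels) {
        int[] good = new int[labels.length];
        int[] bad = new int[labels.length];
        for (int i = 0; i < neighbors.length; i++) {
            for (int j : neighbors[i]) {
                if (labels[i] == labels[j]) {
                    good[j]++;
                } else {
                    bad[j]++;
                }
            }
        }
        return new int[][] {good, bad};
    }

    public static void main(String[] args) {
        // Assumed toy data: 2-neighbor lists of five points from two classes.
        int[][] neighbors = {{1, 2}, {0, 2}, {1, 3}, {4, 2}, {3, 2}};
        int[] labels = {0, 0, 0, 1, 1};
        int[][] counts = goodAndBadCounts(neighbors, labels);
        System.out.println("good: " + Arrays.toString(counts[0]));
        System.out.println("bad:  " + Arrays.toString(counts[1]));
    }
}
```

In this toy example point 2 is the most frequent neighbor overall, but half of its occurrences are in neighborhoods of the other class, so it would lean towards the red end of the scale.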

Here is one very bad hub from the examined butterfly data:

In this case, the Artogeia rapae image shown in the middle acts as a neighbor only to points that are not from its own class (species), which is obviously detrimental to kNN-based analysis. This image was generated by the subgraph visualization component in the second tab/view of Image Hub Explorer, the “Neighbor View”:

The neighbor view allows the user to select certain points and examine the local kNN structure defined around those points. One can also automatically add all the neighbors and reverse neighbors of any selected image and examine the lists of neighbors and reverse neighbors separately. A pie chart representing the aggregate neighbor occurrence profile for the selected image is shown in the upper right corner. It is possible to move the components around the graph and export its image to a desired *.jpg file on the disk. All nodes are selectable.

The graph is drawn by the Java JUNG graph library.

An overview of the point type distribution in each class, as well as the list of top hubs, good hubs and bad hubs within each class separately, is available in the next screen, the class view:

Within each class, we can distinguish between different types of points: safe points are those that are easy to properly classify by kNN methods, contained in the class interiors. Borderline points exist in the borderline regions between different classes and are much more difficult to handle. Rare points and outliers have most of their neighbors belonging to different classes and are very difficult to properly label.
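One simple way to assign these types is to count how many of a point's nearest neighbors share its label. The Java sketch below uses a common convention for k = 5 (4 or 5 same-class neighbors: safe, 2 or 3: borderline, 1: rare, 0: outlier). These thresholds are an assumption for the sake of the example; the exact rule used in Image Hub Explorer is not spelled out here, and `PointTypeSketch` is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Hub Miner.
class PointTypeSketch {

    enum PointType { SAFE, BORDERLINE, RARE, OUTLIER }

    /** Classifies a point from the labels of its 5 nearest neighbors (assumed thresholds). */
    static PointType classify(int pointLabel, int[] neighborLabels) {
        if (neighborLabels.length != 5) {
            throw new IllegalArgumentException("this sketch assumes k = 5");
        }
        long sameClass = Arrays.stream(neighborLabels)
                .filter(label -> label == pointLabel)
                .count();
        if (sameClass >= 4) {
            return PointType.SAFE;
        } else if (sameClass >= 2) {
            return PointType.BORDERLINE;
        } else if (sameClass == 1) {
            return PointType.RARE;
        }
        return PointType.OUTLIER;
    }

    public static void main(String[] args) {
        System.out.println(classify(0, new int[] {0, 0, 0, 0, 1}));
        System.out.println(classify(0, new int[] {1, 1, 1, 2, 2}));
    }
}
```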

We can see that different classes have significantly different distributions of point types:

In this particular case, images of Heliconius erato seem to be much more difficult to handle than those of Danaus plexippus. Image Hub Explorer helps in discovering these patterns, and the individual culprits are easy to find by selecting the top bad hubs in various classes and examining their neighbor occurrence profiles. An overall class-to-class occurrence matrix is shown on the right of the Class View screen, where we can see which classes cause the most misclassification in which other classes.
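A class-to-class occurrence matrix of this kind can be built in a few lines. In the Java sketch below, entry [a][b] counts how many times points of class a occur as neighbors of points of class b. The orientation of rows and columns, the class name `ClassOccurrenceMatrix` and the toy data are assumptions for this example, not a description of how the Class View lays out its matrix.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Hub Miner.
class ClassOccurrenceMatrix {

    /** matrix[a][b] = occurrences of class-a points in k-neighbor lists of class-b points. */
    static int[][] build(int[][] neighbors, int[] labels, int numClasses) {
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < neighbors.length; i++) {
            for (int j : neighbors[i]) {
                matrix[labels[j]][labels[i]]++;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Assumed toy data: 2-neighbor lists of five points from two classes.
        int[][] neighbors = {{1, 2}, {0, 2}, {1, 3}, {4, 2}, {3, 2}};
        int[] labels = {0, 0, 0, 1, 1};
        System.out.println(Arrays.deepToString(build(neighbors, labels, 2)));
    }
}
```

The diagonal holds the good occurrences and the off-diagonal entries hold the label mismatches, so a large off-diagonal value points to a pair of classes that are frequently confused.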

Some points are bad hubs, and in the case of images we can visualize what helps an image become a good or a bad hub, feature-wise. Image Hub Explorer includes a feature visualization and assessment component, currently supporting the standard SIFT features, that allows the user to visualize the location of good and bad features on each image. Good features are those that occur mostly on images within a single class and therefore help in classification. Bad features are those that occur across different classes and carry little or no discriminative information. This is shown in the following screenshots:
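One simple way to score a quantized feature (a visual word) along these lines is its class purity: the share of its occurrences that fall into its most frequent class. The Java sketch below is an assumption about how such a score could be computed, not the assessment procedure used in the tool; `VisualWordPurity` and the toy histograms are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Hub Miner.
class VisualWordPurity {

    /**
     * histograms[img][w] is the count of visual word w in image img.
     * Returns, per word, the fraction of its occurrences in its dominant class.
     */
    static double[] purity(int[][] histograms, int[] labels, int numClasses) {
        int numWords = histograms[0].length;
        double[] purity = new double[numWords];
        for (int w = 0; w < numWords; w++) {
            int[] perClass = new int[numClasses];
            int total = 0;
            for (int img = 0; img < histograms.length; img++) {
                perClass[labels[img]] += histograms[img][w];
                total += histograms[img][w];
            }
            int dominant = Arrays.stream(perClass).max().orElse(0);
            // Words that never occur get a purity of 0.
            purity[w] = total == 0 ? 0.0 : (double) dominant / total;
        }
        return purity;
    }

    public static void main(String[] args) {
        int[][] histograms = {{3, 1}, {2, 1}, {0, 2}};
        int[] labels = {0, 0, 1};
        System.out.println(Arrays.toString(purity(histograms, labels, 2)));
    }
}
```

Here the first word occurs only in class 0 and would count as a good feature, while the second is split evenly between the two classes and carries no discriminative information.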

In the case of Heliconius charitonius (the upper image), the white stripes on the wings are correctly determined to carry discriminative information, as are the black veins on the wings of Danaus plexippus.

The final screen offers the search functionality over the examined database. A user is able to load a new image and look for the most similar existing images. The system automatically extracts the features and uses the existing SIFT codebook to quantize the image representation. A set of kNN classification methods is available, suggesting potential labellings of the loaded image.
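The quantization step can be pictured as follows: every local descriptor of the new image is assigned to its closest codebook vector, and the image is then represented by the histogram of those assignments. This Java sketch assumes squared Euclidean distance and uses tiny 2-D vectors instead of real 128-dimensional SIFT descriptors; `CodebookQuantizer` is a hypothetical name, not a Hub Miner class.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Hub Miner.
class CodebookQuantizer {

    /** Index of the codebook vector closest to the descriptor (squared Euclidean distance). */
    static int nearestCodeword(double[] descriptor, double[][] codebook) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < codebook.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < descriptor.length; d++) {
                double diff = descriptor[d] - codebook[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Bag-of-visual-words histogram of an image's descriptors. */
    static int[] quantize(double[][] descriptors, double[][] codebook) {
        int[] histogram = new int[codebook.length];
        for (double[] descriptor : descriptors) {
            histogram[nearestCodeword(descriptor, codebook)]++;
        }
        return histogram;
    }

    public static void main(String[] args) {
        double[][] codebook = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] descriptors = {{1.0, 1.0}, {9.0, 8.0}, {0.0, 2.0}};
        System.out.println(Arrays.toString(quantize(descriptors, codebook)));
    }
}
```

The resulting histogram lives in the same feature space as the database images, so the loaded image can be compared to them with the chosen metric.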

The following classifiers are available by default: kNN (Cover and Hart, 1967), FNN (Keller, 1985), NWKNN (Tan, 2005), AKNN (Wang et al., 2007), hw-kNN (Radovanović et al., 2009), h-FNN (Tomašev et al., 2011), HIKNN (Tomašev et al., 2011) and NHBNN (Tomašev et al., 2011). The last four are hubness-aware and have been proposed specifically for handling high-hubness data. We intend to add even more models over the coming weeks and months.

There is also a self-adaptive hubness-aware re-ranking procedure proposed in Exploiting Hubs for Self-Adaptive Secondary Re-Ranking In Bug Report Duplicate Detection (Tomašev et al. 2013). The basic idea is to try and increase the distances between points and those neighbors that are known to act as bad hubs, as well as reduce the distances to good hubs. On average, this helps in reducing the percentage of label mismatches in kNN sets and helps with classification.
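To give a feel for the idea, here is a deliberately simplified Java sketch. It is not the formula from the paper: the scaling rule, the parameter `alpha` and the class name `HubAwareReRankSketch` are all assumptions made up for this illustration. Each distance from the query to a neighbor is stretched in proportion to how bad that neighbor's past occurrences were, and shrunk in proportion to how good they were.

```java
import java.util.Arrays;

// Hypothetical illustration class; NOT the published re-ranking formula.
class HubAwareReRankSketch {

    /**
     * dist[i]  : original distance from the query to neighbor i
     * good[i]  : number of good (label-matching) occurrences of neighbor i
     * bad[i]   : number of bad (label-mismatching) occurrences of neighbor i
     * alpha    : assumed strength of the adjustment, expected in [0, 1)
     */
    static double[] adjust(double[] dist, int[] good, int[] bad, double alpha) {
        double[] adjusted = new double[dist.length];
        for (int i = 0; i < dist.length; i++) {
            int total = good[i] + bad[i];
            // Ranges from -1 (only good occurrences) to +1 (only bad occurrences).
            double badBalance = total == 0 ? 0.0 : (double) (bad[i] - good[i]) / total;
            adjusted[i] = dist[i] * (1.0 + alpha * badBalance);
        }
        return adjusted;
    }

    public static void main(String[] args) {
        double[] dist = {1.0, 1.0, 1.0};
        int[] good = {4, 0, 2};
        int[] bad = {0, 4, 2};
        System.out.println(Arrays.toString(adjust(dist, good, bad, 0.5)));
    }
}
```

With three neighbors at the same original distance, the purely good hub moves closer to the query, the purely bad hub moves further away and the mixed one stays where it was, so the re-ranked result list favors neighbors with a good track record.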

Image Hub Explorer is applicable to other data types, as well. As long as there are some images aligned with the features from which the distances are calculated, the data can be easily visualized. If the images are not available, they can be generated as rectangles containing the object names. This might not be as visually appealing, but the analysis runs just the same. We will prepare some use-cases of this sort, on both textual and audio data.

I hope that this tool proves useful and helps in dealing with hubs and the skewed distribution of influence in image databases. Image Hub Explorer will be available on request to fellow researchers. Just email me with your ideas and we can coordinate and start a collaboration.
