Archive for October, 2012

The Hub Word-Cloud

October 23rd, 2012

My research has recently been going in several different directions.
I have generated some topic word clouds from my papers to help summarize what it is about.

First of all, there is clustering: exploiting hubness, a consequence of the curse of dimensionality, to improve clustering performance in high-dimensional data. It is an extension of my work first presented in the award-winning PAKDD paper The Role of Hubness in Clustering High-dimensional Data.

Hubs in Clustering
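
To make the notion of a hub concrete, here is a minimal sketch of how k-occurrence counts can be computed, that is, how often each point appears among the k nearest neighbors of other points. The function names, the toy points and the choice of Euclidean distance are assumptions made for illustration; this is not the code used in the papers.

```python
import math

def knn_indices(points, k):
    """Return, for each point, the indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Distances to all other points; ties are broken by index.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

def k_occurrences(neighbors, n):
    """Count how often each point appears in other points' neighbor lists."""
    counts = [0] * n
    for nbrs in neighbors:
        for j in nbrs:
            counts[j] += 1
    return counts

# Toy data (hypothetical): one central point surrounded by four others.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
neighbors = knn_indices(points, k=1)
counts = k_occurrences(neighbors, len(points))
print(counts)  # the central point collects most of the occurrences
```

Points with unusually high counts are the hubs. In high-dimensional data the distribution of these counts tends to become strongly skewed, which is the effect the clustering work builds on.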

Then, there is the work I did on metric learning under the assumption of hubness, which yielded some surprisingly good results. I will probably give some updates on that soon. The approach I proposed can essentially be viewed as an extension of the cosine similarity used in collaborative filtering, in which the influence of hub points is taken into account and properly modeled.

Shared neighbor distances for high-dimensional data
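
As a rough illustration of the starting point, the sketch below computes the plain shared-neighbor similarity between two points: the fraction of their k-nearest-neighbor lists that overlap. It deliberately does not include the hub-aware weighting, and the function names and example neighbor lists are hypothetical.

```python
def shared_neighbor_similarity(neighbors_a, neighbors_b):
    """Fraction of neighbors shared by two points.

    Assumes both lists have the same length k and contain no duplicates.
    """
    shared = set(neighbors_a) & set(neighbors_b)
    return len(shared) / len(neighbors_a)

def shared_neighbor_distance(neighbors_a, neighbors_b):
    """Secondary distance derived from the similarity: 0 means identical neighborhoods."""
    return 1.0 - shared_neighbor_similarity(neighbors_a, neighbors_b)

# Hypothetical neighbor lists (indices of the 4 nearest neighbors of x and y).
knn_x = [1, 2, 3, 4]
knn_y = [2, 3, 5, 6]
similarity = shared_neighbor_similarity(knn_x, knn_y)
distance = shared_neighbor_distance(knn_x, knn_y)
print(similarity, distance)
```

In this plain version every shared neighbor counts equally. A hubness-aware variant would weight shared neighbors differently depending on how often they occur as neighbors, since a shared hub says less about two points being related than a shared rare neighbor does.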

We have also worked on cross-lingual supervised document retrieval, where hubs play an important role.

Cross-lingual hubness-aware document retrieval

Last but not least is my work on improving k-nearest neighbor classification in high-dimensional data. I have explored several ways of doing so and I am in the process of revising some of the proposed approaches in order to better model the co-occurrences and provide a more robust alternative to other nearest neighbor methods. The initial tests, however, show that hubness-aware kNN classification is in fact very good even in its basic (proof-of-concept) form and outperforms many standard methods on high-dimensional data.

Hubness-aware classification

Categories: Hubness, Visualization

Outlier/error detection in sensor data based on bad hubs

October 23rd, 2012

A lot of sensor data is being collected every minute and used for various sorts of prediction. Yet these measurements are not perfect, and the sensors sometimes break or malfunction. Detecting these anomalies is a part of the data cleaning and preparation process.
There are many ways to do outlier and anomaly detection and there is a whole body of literature devoted to the problem.
Instead, we looked at one specific test scenario: whether the curse of dimensionality affects the time series enough that the hubs emerging in the data can be used as potential markers for such anomalous measurement records. It turns out that they can, and that high bad hubness of measurement points clearly indicates that something is not right. What exactly is wrong is for experts to say in any particular test case. Here are some graphs from the tool we’ve developed and described in one of our papers.

What is the ‘bad’ hubness of the suspicious points here? They are frequent neighbors of points in other geographical regions, so the distance in measurements does not correspond well to the spatial distance. Of course, the properties of a region are not homogeneous, and it is certainly possible for correct, non-noisy sensors to produce such data. However, the number of such measurement points is usually small, and they are the prime candidates for a closer look. This makes for a good semi-automated anomaly detection system.
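
The idea can be sketched in a few lines, assuming the k-nearest-neighbor lists in measurement space and a region label per measurement point are already available. The names, the toy data and the threshold below are hypothetical; this is not the code of the tool mentioned above.

```python
def bad_k_occurrences(neighbors, regions):
    """Count, for each point, how often it is a neighbor of a point from another region."""
    bad = [0] * len(regions)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            if regions[j] != regions[i]:
                bad[j] += 1  # j is close in measurements but far away spatially
    return bad

# Hypothetical neighbor lists (k = 2) in measurement space and region labels.
neighbors = [[1, 4], [0, 4], [3, 4], [2, 4], [0, 1]]
regions = ["north", "north", "south", "south", "north"]

bad = bad_k_occurrences(neighbors, regions)

# Flag points whose bad hubness reaches an assumed threshold for manual inspection.
threshold = 2
suspects = [i for i, b in enumerate(bad) if b >= threshold]
print(bad, suspects)
```

The flagged points are only candidates. As noted above, a correct sensor can also end up on this list, so the final call stays with a domain expert.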

Categories: Application, Hubness, Sensor data