Archive for the ‘Sensor data’ Category

PhD thesis: The Role of Hubness in High-dimensional Data Analysis

November 24th, 2013

On December 18th, 2013, I am scheduled to present my PhD thesis, titled “The Role of Hubness in High-dimensional Data Analysis”.

The thesis discusses the issues involved in similarity-based inference in intrinsically high-dimensional data and the consequences of emerging hub points. It integrates the work presented in my journal and conference papers, and it proposes and discusses novel techniques for designing nearest-neighbor based learning models in many dimensions. Lastly, it mentions potential practical applications and promising future research directions.

I would like to thank everyone who gave me advice and helped in shaping this thesis.

The full text of the thesis is available here.

Outlier/error detection in sensor data based on bad hubs

October 23rd, 2012

A lot of sensor data is being collected every minute and used for various sorts of prediction. Yet these measurements are not perfect, and the sensors sometimes break or malfunction. Detecting these anomalies is part of the data cleaning and preparation process.
There are many ways to do outlier and anomaly detection and there is a whole body of literature devoted to the problem.
Instead, we looked at one specific test scenario: whether the curse of dimensionality affects the time series enough that the emerging hubs in the data can be used as potential markers for such anomalous measurement records. It turns out that they can, and that high bad hubness of measurement points clearly indicates that something is not right. What exactly is wrong is for experts to say in any particular test case. Here are some graphs from the tool we’ve developed and described in one of our papers.

What makes the hubness of the suspicious points ‘bad’ here? They are frequent neighbors to points in other geographical regions, so the distance in measurements does not correspond well to the spatial distance. Of course, the properties of a region are not homogeneous, and it is certainly possible for correct, non-noisy sensors to produce such data. However, the number of such measurement points is usually small, and they are the prime candidates for a closer look. This makes for a good semi-automated anomaly detection system.
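
To make the idea concrete, here is a minimal sketch of how bad hubness could be counted. It is not the code of the tool mentioned above. It assumes Euclidean distance between measurement vectors and a region label for each measurement point; the function name `bad_hubness` and the parameters `regions` and `k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def bad_hubness(X, regions, k=5):
    """Return (bad, total) k-occurrence counts for every point.

    total[j] is how many times point j appears among the k nearest
    neighbors of other points; bad[j] counts only those appearances
    where the other point belongs to a different region.
    Euclidean distance is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    regions = np.asarray(regions)
    n = X.shape[0]

    # Pairwise squared Euclidean distances between measurement vectors.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of each point, shape (n, k).
    knn = np.argsort(dist, axis=1)[:, :k]

    bad = np.zeros(n, dtype=int)
    total = np.zeros(n, dtype=int)
    for i in range(n):
        for j in knn[i]:
            total[j] += 1
            if regions[j] != regions[i]:
                bad[j] += 1  # j is a neighbor of a point from another region
    return bad, total


# Toy data: three points per region, plus one point labeled as region "A"
# whose measurements look like those of region "B".
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [10.3, 10.3]]
regions = ["A", "A", "A", "B", "B", "B", "A"]
bad, total = bad_hubness(X, regions, k=2)
print(int(np.argmax(bad)))  # index of the point with the highest bad hubness
```

In this toy example the last point collects the most bad occurrences, because all three region "B" points list it among their nearest neighbors. In a semi-automated setting, points ranked highest by `bad` would be the ones handed to an expert for inspection.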

Categories: Application, Hubness, Sensor data