Hub Miner is a tool developed by Nenad Tomasev

Hub Miner is a machine learning / data mining Java library with a clear focus on applying and evaluating k-nearest neighbor methods in the context of high-dimensional data analysis.

The code is available on GitHub.

When I was starting my PhD, I had two options: I could have used existing data mining libraries for my experiments, or built my own. Most people choose the former, for obvious reasons: it is faster, easier, and saves a lot of time. However, I had a slightly different view.

I decided to build a new library from scratch, for the following reasons. First, it allowed me to learn much more about the existing methods, their scalability, and all the non-trivial implementation details. Second, it allowed me to design a system architecture tailored specifically to the type of experiments I had in mind. This was very important, as I had limited computational resources, so the code needed to run as fast as possible with as little memory consumption as possible. Lastly, it gave me the flexibility to quickly tweak or change any parameter, without spending hours going through documentation and fiddling with other people's code. The result is the Hub Miner library: more than 100,000 lines of code and more than 550 classes implementing various machine learning and data mining methodologies.

This took some time to build, but I am very pleased with the outcome. Hub Miner is a robust and efficient library supporting multi-threaded data analysis and method evaluation. I intend to publish it as an open-source project sometime in 2014, after I add a few more features and document it properly. Below, I will briefly describe the main implemented functionality.

The figure below gives a quick overview of the package clusters. The orange color merely signifies that the most recent work has gone into the corresponding parts of the library; all of the related groups have been extensively developed.

Data is loaded from **CSV**, **TSV** or **ARFF** files (the WEKA format). **Sparse** representations are supported, as I needed them for experiments on textual data. Methods for creating bag-of-words (**BOW**) representations are included.
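
To illustrate the idea behind a bag-of-words representation, here is a minimal sketch. The class name and methods are illustrative only, not Hub Miner's actual API; it builds a vocabulary on the fly and stores each document as a sparse map from term index to count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal bag-of-words sketch (illustrative, not Hub Miner's API):
// builds a vocabulary incrementally and produces sparse term counts.
public class BagOfWords {
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    // Returns a sparse representation of one document: term index -> count.
    public Map<Integer, Integer> transform(String document) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (String token : document.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Assign the next free index to previously unseen tokens.
            int index = vocabulary.computeIfAbsent(token, t -> vocabulary.size());
            counts.merge(index, 1, Integer::sum);
        }
        return counts;
    }

    public int vocabularySize() {
        return vocabulary.size();
    }
}
```

Storing only the non-zero counts is what makes sparse text representations practical, since any single document uses a tiny fraction of the full vocabulary.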

If you wish to work with synthetic data instead, there are classes that generate various kinds of Gaussian mixtures and other synthetic data types, and the interface allows full control over this process. Some of the data extension techniques are aimed at over-sampling the minority classes in class-imbalanced data analysis.
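
The basic mechanism behind Gaussian mixture generation can be sketched as follows. This is a simplified illustration with a shared spherical covariance and uniform component weights; the class name, parameters, and choices are mine, not the library's.

```java
import java.util.Random;

// Sketch of synthetic data generation from a spherical Gaussian mixture
// (illustrative, not Hub Miner's API). All components share one standard
// deviation and are picked with equal probability.
public class GaussianMixtureSampler {
    private final double[][] means;
    private final double stdDev;
    private final Random random;

    public GaussianMixtureSampler(double[][] means, double stdDev, long seed) {
        this.means = means;
        this.stdDev = stdDev;
        this.random = new Random(seed);
    }

    // Draws one point: pick a component uniformly at random,
    // then perturb its mean with Gaussian noise in every dimension.
    public double[] sample() {
        double[] mean = means[random.nextInt(means.length)];
        double[] point = new double[mean.length];
        for (int d = 0; d < mean.length; d++) {
            point[d] = mean[d] + stdDev * random.nextGaussian();
        }
        return point;
    }
}
```

A full implementation would also support per-component covariance matrices and mixing weights, which is what allows fine-grained control over cluster shape and class imbalance.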

Once the data is loaded, it can be preprocessed in various ways. Methods for noise detection and removal are included, as well as many feature normalization and weighting techniques (such as **TF-IDF**).
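
As a reminder of what TF-IDF does, here is the standard weighting in minimal form, a sketch rather than the library's implementation: a term's weight grows with its frequency in the document and shrinks with the number of documents that contain it.

```java
// Minimal TF-IDF weighting sketch: weight = tf * log(N / df).
// Names and signature are illustrative, not Hub Miner's API.
public class TfIdf {
    // termFrequency: occurrences of the term in the document;
    // documentFrequency: number of documents containing the term;
    // totalDocuments: size of the corpus.
    public static double weight(int termFrequency, int documentFrequency, int totalDocuments) {
        if (termFrequency == 0 || documentFrequency == 0) {
            return 0.0;
        }
        return termFrequency * Math.log((double) totalDocuments / documentFrequency);
    }
}
```

Note that a term occurring in every document gets weight zero, which is exactly the point: such terms carry no discriminative information.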

The library implements various **primary metrics** (**Manhattan, Euclidean, Fractional distances, Cosine, symmetrized Kullback-Leibler divergence, Tanimoto, Bray-Curtis, Canberra**, etc.), **secondary metrics** (**local scaling, NICDM, Mutual Proximity, simcos shared neighbor similarity, simhub hubness-aware shared neighbor similarity**) and **kernels** (**ANOVA, Cauchy, Chi squared, Circular, Exponential, Gaussian, Student t, Histogram intersection, Multiquadratic, Laplacian, Log-kernel, Polynomial, Power kernel, Rational Quadratic, Sigmoid and Spline**). In other words, there are many ways to calculate the similarity between pairs of points.
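
A few of the primary metrics and kernels from the lists above, written out in minimal form. These are textbook definitions, not Hub Miner's classes:

```java
// Minimal implementations of a few standard distance measures and one kernel.
public class Distances {
    // Manhattan (L1) distance: sum of absolute coordinate differences.
    public static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Euclidean (L2) distance.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Cosine distance = 1 - cosine similarity (angle-based, length-invariant).
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Gaussian (RBF) kernel over the Euclidean distance.
    public static double gaussianKernel(double[] a, double[] b, double sigma) {
        double d = euclidean(a, b);
        return Math.exp(-d * d / (2 * sigma * sigma));
    }
}
```

Secondary metrics such as Mutual Proximity or simhub are then computed on top of distances like these, by re-scaling them based on the neighborhood structure of the whole dataset.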

I have analyzed many image datasets, so special support is provided for quantized image feature representations, primarily **SIFT** (scale-invariant feature transform). The existing methods allow for easy and fast quantization of the feature space, as well as codebook creation and use. Color information in the form of binned color histograms is also easily obtainable.

Many **instance selection / data reduction** methods are provided: **INSIGHT, CNN, GCNN, IPT_RT3, ENN, RNNR and Random**.

I have done a lot of work on data **classification**, and the following classifiers are currently available in the library: **kNN, dw-kNN, FNN, LSM, Naïve Bayes, KNNNB, LWNB, ID3, AKNN, PKNN, HIKNN, NHBNN, ANHBNN, h-FNN, hw-kNN, NWKNN, CBW-kNN**. More will be implemented soon, and I also intend to provide wrappers for WEKA classifiers, so that those can be included in the comparisons.
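
Since most of these classifiers are k-nearest-neighbor variants, here is the baseline they all build upon, a plain majority-vote kNN over Euclidean distance. This is a didactic sketch, not the library's kNN class:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Minimal majority-vote kNN classifier sketch (Euclidean distance).
public class KnnClassifier {
    private final double[][] data;
    private final int[] labels;
    private final int k;

    public KnnClassifier(double[][] data, int[] labels, int k) {
        this.data = data;
        this.labels = labels;
        this.k = k;
    }

    public int classify(double[] query) {
        Integer[] indices = new Integer[data.length];
        for (int i = 0; i < data.length; i++) {
            indices[i] = i;
        }
        // Sort training points by distance to the query point.
        Arrays.sort(indices, Comparator.comparingDouble(i -> squaredDistance(data[i], query)));
        // Majority vote among the k nearest neighbors.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) {
            votes.merge(labels[indices[i]], 1, Integer::sum);
        }
        return votes.entrySet().stream().max(Map.Entry.comparingByValue()).get().getKey();
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Variants like dw-kNN weight the votes by distance, while the hubness-aware methods (HIKNN, h-FNN, NHBNN, etc.) weight them by each neighbor's past k-occurrence behavior.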

Classifier comparison and evaluation is usually done via **10-times 10-fold cross-validation**, which is fully supported in Hub Miner, with automatic statistical comparisons via the **corrected resampled t-test**. **Precision, recall, accuracy, the F-score** (micro- and macro-averaged) and the **Matthews correlation coefficient** are measured. I will soon add a GUI for controlling the comparisons, but they are currently run from configuration files, and I will give you an example of one such experimental run:
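
The corrected resampled t-test (due to Nadeau and Bengio) adjusts the variance estimate to account for the overlap between training sets across cross-validation runs, which would otherwise make the ordinary t-test far too optimistic. A sketch of the test statistic, with names of my own choosing:

```java
// Corrected resampled t-test statistic (Nadeau & Bengio) for comparing
// two classifiers over n repeated cross-validation runs.
// differences[i] is the accuracy difference between the two classifiers
// on run i; testTrainRatio is n_test / n_train (1/9 for 10-fold CV).
public class CorrectedResampledTTest {
    public static double statistic(double[] differences, double testTrainRatio) {
        int n = differences.length;
        double mean = 0.0;
        for (double d : differences) {
            mean += d;
        }
        mean /= n;
        double variance = 0.0;
        for (double d : differences) {
            variance += (d - mean) * (d - mean);
        }
        variance /= (n - 1);
        // The (1/n + n_test/n_train) factor inflates the variance to
        // compensate for the dependence between overlapping training folds.
        return mean / Math.sqrt((1.0 / n + testTrainRatio) * variance);
    }
}
```

For 10-times 10-fold cross-validation there are n = 100 differences and the ratio is 1/9; the resulting statistic is compared against a Student t distribution with n - 1 degrees of freedom.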

Each algorithm (training/testing) is executed **in a separate thread**, in parallel over the same dataset. The **distance matrix** and the **neighbor sets** are calculated ONCE at the beginning, across several threads, and are then dynamically set/updated for each individual fold. The neighbor sets are initially calculated for a slightly larger k, so that only a few new lookups need to be performed during cross-validation. The cross-validation procedure supports many options, including **secondary distances, instance selection, various feature normalizations, adding noise to feature values, and adding mislabeling to the data** (using the mislabeled array for training and the original labels for testing). The distance matrix for each metric and each normalization on a particular dataset is persisted to disk and loaded, instead of recalculated, the next time classification on that dataset is invoked with the same parameters. The user can define a range of neighborhood sizes.
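
The compute-once, multi-threaded distance matrix can be sketched roughly as below. This is my own simplified illustration of the pattern, not Hub Miner's code: each row of the upper-triangular matrix is computed as an independent task on a fixed thread pool, so the matrix is filled once and then reused by every fold and every algorithm.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: compute the upper-triangular distance matrix once, in parallel.
// Row i holds the distances from point i to points i+1 .. n-1, which
// halves the memory footprint for a symmetric metric.
public class ParallelDistanceMatrix {
    public static double[][] compute(double[][] data, int numThreads) {
        int n = data.length;
        double[][] matrix = new double[n][];
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int i = 0; i < n; i++) {
            final int row = i;
            // Each row is an independent task, so no locking is needed.
            pool.execute(() -> {
                matrix[row] = new double[data.length - row - 1];
                for (int j = row + 1; j < data.length; j++) {
                    matrix[row][j - row - 1] = euclidean(data[row], data[j]);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return matrix;
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

Once such a matrix is on disk, every later run with the same metric and normalization can skip the O(n²) recomputation entirely, which is exactly the persistence optimization described above.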

Classification is not the only thing Hub Miner supports, as it also includes an extensive clustering library. The following clustering methods are currently available: **K-means, K-means++, K-means pruning, K-harmonic means, K-medoids, Kernel K-means, Agglomerative clustering, DBSCAN, GKH, GHPC, GHPKM, Kernel-GHPKM**. Evaluating clustering performance is not an easy task, so many clustering quality indices are implemented and available: **Silhouette index, Dunn index, Davies-Bouldin index, SD index, RS index, Entropy (non-homogeneity), Rand index (stability), Goodman-Kruskal index, Jaccard index, Isolation index, C-index, Total squared error**. Visualization of the clustering progress is also possible for some methods:
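
To make one of these quality indices concrete, here is the total squared error (within-cluster SSE) in minimal form, a sketch with my own naming rather than the library's class. It is the quantity K-means itself minimizes: lower is better for a fixed number of clusters.

```java
// Total squared error (within-cluster SSE) sketch: the sum of squared
// distances from each point to the centroid of its assigned cluster.
public class ClusteringQuality {
    public static double totalSquaredError(double[][] data, int[] assignments, double[][] centroids) {
        double sse = 0.0;
        for (int i = 0; i < data.length; i++) {
            double[] centroid = centroids[assignments[i]];
            for (int d = 0; d < data[i].length; d++) {
                double diff = data[i][d] - centroid[d];
                sse += diff * diff;
            }
        }
        return sse;
    }
}
```

Indices like Silhouette or Davies-Bouldin go further by also rewarding separation between clusters, which is why several complementary indices are usually reported together.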

The topic of my PhD was “The Role of Hubness in High-dimensional Data Analysis”, and Hub Miner contains many classes for analyzing and tracking the hubness of the data. A GUI that allows for easy visualization of the influence of major data hubs has recently been released: Image Hub Explorer.
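
The central quantity in hubness analysis is the k-occurrence N_k(x): how many times point x appears among the k nearest neighbors of the other points. In high-dimensional data, the distribution of N_k becomes strongly skewed and a few hub points dominate the neighbor sets. A minimal sketch of computing it from precomputed neighbor lists (names are illustrative, not Hub Miner's API):

```java
// Hubness sketch: compute the k-occurrence counts N_k from kNN lists.
public class Hubness {
    // kNeighbors[i] holds the indices of the k nearest neighbors of point i.
    public static int[] kOccurrences(int[][] kNeighbors, int numPoints) {
        int[] counts = new int[numPoints];
        for (int[] neighbors : kNeighbors) {
            for (int neighbor : neighbors) {
                // Every appearance in someone's kNN list increments N_k.
                counts[neighbor]++;
            }
        }
        return counts;
    }
}
```

The skewness of the resulting counts is the usual summary statistic: a strongly right-skewed N_k distribution signals that hubs are present and that hubness-aware methods are likely to help.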

Hub Miner has some support for structured data, including basic graph and network mining / analysis tools.

Additionally, several stochastic optimization techniques are provided, including **genetic algorithms, simulated annealing, stochastic hill climbing, particle swarm optimization** and others. I have always been very fond of biologically-inspired techniques, as I was involved with biology before studying computer science and mathematics.
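
As a flavor of these techniques, here is a simulated annealing sketch minimizing a toy objective, f(x) = (x - 3)². The objective, step size, and cooling schedule are illustrative choices of mine, not the library's defaults:

```java
import java.util.Random;

// Simulated annealing sketch minimizing f(x) = (x - 3)^2 over the reals.
// Worse moves are accepted with probability exp(-delta / temperature),
// which lets the search escape local minima early on; as the temperature
// cools, the procedure degenerates into hill climbing.
public class SimulatedAnnealing {
    public static double minimize(long seed) {
        Random random = new Random(seed);
        double x = 0.0;
        double best = x;
        double temperature = 1.0;
        for (int iter = 0; iter < 10000; iter++) {
            double candidate = x + 0.5 * random.nextGaussian();
            double delta = objective(candidate) - objective(x);
            if (delta < 0 || random.nextDouble() < Math.exp(-delta / temperature)) {
                x = candidate;
            }
            if (objective(x) < objective(best)) {
                best = x;
            }
            temperature *= 0.999; // geometric cooling schedule
        }
        return best;
    }

    static double objective(double x) {
        return (x - 3) * (x - 3);
    }
}
```

The same accept-worse-moves-with-decaying-probability idea applies unchanged to combinatorial problems such as feature or prototype selection, which is where such optimizers tend to be used in a data mining library.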

As for scalability and approximate kNN set calculation, an algorithm that uses **recursive Lanczos bisections** to construct an approximate kNN graph is implemented, and I also intend to add various **locality-sensitive hashing (LSH)** techniques that should further improve the scalability of the system.