Hub Miner library

November 12th, 2014

Hub Miner is a machine learning / data mining Java library with a clear focus on tackling the challenges of computing relevance in high-dimensional data and on applying and evaluating instance-based methods in that setting.

Hub Miner is an OpenML-compatible library, supporting networked science and repeatable, verifiable experiments.

The code is available on GitHub.

There is also a user manual.

Hub Miner is a robust and efficient library supporting multi-threaded data analysis and method evaluation.

The figure below gives a quick overview of the package clusters. The orange color merely signifies that most recent work has gone into the corresponding parts of the library; all of the groups have been extensively developed.

The data is loaded from CSV, TSV or ARFF files (WEKA format). Sparse representations are supported, as I needed to do some experiments on textual data. Methods for creating bag-of-words (BOW) representations are included.
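
As a point of reference, here is a minimal bag-of-words sketch in plain Java; the class and method names are hypothetical, and this is far simpler than Hub Miner's own sparse representation classes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal bag-of-words sketch: maps each tokenized document to a sparse
// term-frequency vector keyed by vocabulary index. Illustrative only.
public class BagOfWordsSketch {

    private final Map<String, Integer> vocabulary = new HashMap<>();

    /** Converts one tokenized document into a sparse index -> count map. */
    public Map<Integer, Integer> toSparseVector(List<String> tokens) {
        Map<Integer, Integer> vector = new HashMap<>();
        for (String token : tokens) {
            String term = token.toLowerCase();
            Integer index = vocabulary.get(term);
            if (index == null) {
                index = vocabulary.size();  // next free dimension
                vocabulary.put(term, index);
            }
            vector.merge(index, 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        BagOfWordsSketch bow = new BagOfWordsSketch();
        System.out.println(bow.toSparseVector(
                Arrays.asList("high", "dimensional", "data", "data")));
    }
}
```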

For users who wish to perform experiments on synthetic data, there are utility classes that can generate various sorts of Gaussian mixtures and other synthetic data types. The interface allows full control over this process. Some of the data extension techniques are aimed at over-sampling the minority classes in class-imbalanced data analysis.
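
To illustrate the idea, here is a minimal Gaussian-mixture sampler with diagonal covariance; the names are hypothetical, and Hub Miner's generators offer far more control than this.

```java
import java.util.Random;

// Hedged sketch of Gaussian-mixture data generation with diagonal
// covariance: each point is drawn from a uniformly chosen component.
public class GaussianMixtureSketch {

    /** Draws numInstances points, each from a randomly chosen component. */
    public static double[][] sample(double[][] means, double[][] stdDevs,
                                    int numInstances, Random rand) {
        int dim = means[0].length;
        double[][] data = new double[numInstances][dim];
        for (int i = 0; i < numInstances; i++) {
            int c = rand.nextInt(means.length);  // uniform component choice
            for (int d = 0; d < dim; d++) {
                data[i][d] = means[c][d] + stdDevs[c][d] * rand.nextGaussian();
            }
        }
        return data;
    }

    public static void main(String[] args) {
        double[][] means = {{0, 0}, {5, 5}};
        double[][] devs = {{1, 1}, {0.5, 0.5}};
        for (double[] point : sample(means, devs, 4, new Random(42))) {
            System.out.println(point[0] + ", " + point[1]);
        }
    }
}
```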

Once the data is loaded, it can be pre-processed in various ways. Methods for noise detection and removal are available, as well as many feature normalization techniques (such as TF-IDF).
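
For illustration, a bare-bones TF-IDF reweighting over sparse term-count vectors might look like the following sketch; it is not Hub Miner's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal TF-IDF sketch over sparse term-weight vectors: each raw count
// is scaled by log(N / df), down-weighting terms common to many documents.
public class TfIdfSketch {

    /** Reweights raw counts in place: tf * log(N / df). */
    public static void applyTfIdf(List<Map<Integer, Double>> docs) {
        // First pass: document frequency of each term index.
        Map<Integer, Integer> docFrequency = new HashMap<>();
        for (Map<Integer, Double> doc : docs) {
            for (Integer term : doc.keySet()) {
                docFrequency.merge(term, 1, Integer::sum);
            }
        }
        // Second pass: multiply each term frequency by its IDF.
        int numDocs = docs.size();
        for (Map<Integer, Double> doc : docs) {
            for (Map.Entry<Integer, Double> e : doc.entrySet()) {
                double idf = Math.log(
                        (double) numDocs / docFrequency.get(e.getKey()));
                e.setValue(e.getValue() * idf);
            }
        }
    }
}
```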

The library implements various primary metrics (Manhattan, Euclidean, Fractional distances, Cosine, symmetrized Kullback-Leibler divergence, Tanimoto, Bray-Curtis, Canberra, etc.), secondary metrics (local scaling, NICDM, Mutual Proximity, simcos shared neighbor similarity, simhub hubness-aware shared neighbor similarity) and kernels (ANOVA, Cauchy, Chi squared, Circular, Exponential, Gaussian, Student t, Histogram intersection, Multiquadratic, Laplacian, Log-kernel, Polynomial, Power kernel, Rational Quadratic, Sigmoid and Spline). In other words, there are many ways to calculate similarity between pairs of points.
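
For orientation, here are a few of the listed measures written out in their standard textbook form; these direct formulas are illustrative and not tied to Hub Miner's metric classes.

```java
// Three of the listed measures in their textbook form.
public class MetricSketch {

    /** Manhattan (L1) distance: sum of absolute coordinate differences. */
    public static double manhattan(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += Math.abs(x[i] - y[i]);
        return sum;
    }

    /** Cosine similarity: dot product over the product of norms. */
    public static double cosineSimilarity(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    /** Gaussian (RBF) kernel with bandwidth sigma. */
    public static double gaussianKernel(double[] x, double[] y, double sigma) {
        double sqDist = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sqDist += diff * diff;
        }
        return Math.exp(-sqDist / (2 * sigma * sigma));
    }

    public static void main(String[] args) {
        double[] a = {1, 0, 2};
        double[] b = {0, 1, 2};
        System.out.println(manhattan(a, b));          // 2.0
        System.out.println(cosineSimilarity(a, b));   // 0.8
        System.out.println(gaussianKernel(a, b, 1));  // ~0.368
    }
}
```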

Hub Miner provides the basic pipelines for image feature representation quantization, since images are a prime example of high-hubness data and a lot of hubness-related research has been done on images. The existing methods allow for easy and fast quantization of the feature space and for codebook creation and use. Color information in the form of binned color histograms is also easily attainable.
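
The quantization step itself is simple at heart: assign each local image feature to its nearest codeword and summarize the image as a codeword histogram. A hedged sketch, with hypothetical names:

```java
// Codebook quantization sketch: nearest-codeword assignment followed by
// histogram normalization. Illustrative, not Hub Miner's pipeline code.
public class CodebookSketch {

    /** Returns a normalized histogram of codeword assignments. */
    public static double[] quantize(double[][] features, double[][] codebook) {
        double[] histogram = new double[codebook.length];
        for (double[] f : features) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            // Find the nearest codeword by squared Euclidean distance.
            for (int c = 0; c < codebook.length; c++) {
                double d = 0;
                for (int i = 0; i < f.length; i++) {
                    double diff = f[i] - codebook[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            histogram[best]++;
        }
        // Normalize so images with different feature counts are comparable.
        for (int c = 0; c < histogram.length; c++) {
            histogram[c] /= features.length;
        }
        return histogram;
    }
}
```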

Many instance selection / data reduction methods are provided: INSIGHT, CNN, GCNN, IPT_RT3, ENN, RNNR and Random.
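
Of these, ENN (Wilson's edited nearest neighbor rule) is the easiest to illustrate: an instance is dropped whenever its label disagrees with the majority label of its k nearest neighbors. A sketch, assuming precomputed kNN sets, and not Hub Miner's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Wilson's ENN editing rule: keep only the instances whose
// label matches the majority label among their k nearest neighbors.
public class EnnSketch {

    /**
     * @param kNeighbors kNeighbors[i] holds indexes of i's k nearest neighbors
     * @return indexes of the retained instances
     */
    public static List<Integer> select(int[][] kNeighbors, int[] labels,
                                       int numClasses) {
        List<Integer> retained = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            int[] votes = new int[numClasses];
            for (int neighbor : kNeighbors[i]) {
                votes[labels[neighbor]]++;
            }
            int majority = 0;
            for (int c = 1; c < numClasses; c++) {
                if (votes[c] > votes[majority]) majority = c;
            }
            if (majority == labels[i]) retained.add(i);
        }
        return retained;
    }
}
```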

A lot of exciting work has recently been done on hubness-aware classification. Hub Miner offers many hubness-aware classifiers and many standard baselines to compare them against. The following classifiers are currently available in the library: kNN, dw-kNN, FNN, LSM, Naïve Bayes, KNNNB, LWNB, ID3, AKNN, PKNN, HIKNN, NHBNN, ANHBNN, h-FNN, hw-kNN, NWKNN, CBW-kNN. More will be implemented soon, and cross-system comparisons can be made via OpenML against baselines from Weka, R or RapidMiner.
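
As a baseline reference point, plain kNN with majority voting, the simplest entry on the list, can be sketched as follows; this is illustrative and not Hub Miner's kNN code.

```java
import java.util.Arrays;
import java.util.Comparator;

// Plain kNN majority vote over squared Euclidean distances.
public class KnnSketch {

    public static int classify(double[][] train, int[] labels,
                               double[] query, int k, int numClasses) {
        // Sort training indexes by distance to the query.
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(
                i -> squaredDistance(train[i], query)));
        // Tally the labels of the k nearest points.
        int[] votes = new int[numClasses];
        for (int i = 0; i < k; i++) votes[labels[order[i]]]++;
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;
    }

    private static double squaredDistance(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```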

Classifier comparison and evaluation is usually done via 10-times 10-fold cross-validation, which is fully supported in Hub Miner, with automatic statistical comparisons via the corrected re-sampled t-test. Precision, recall, accuracy, F-score (micro- and macro-averaged) and the Matthews correlation coefficient are measured. I will soon add a GUI that controls the comparisons, but they are currently run from configuration files; how one such experimental run proceeds is described below.
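
The corrected re-sampled t-test itself is easy to state: it replaces the usual 1/n variance factor with 1/n + n_test/n_train to account for the overlap between training sets across folds (Nadeau and Bengio, 2003). A sketch of the statistic, independent of Hub Miner's evaluation code:

```java
// Corrected resampled t-test statistic over per-fold score differences
// from r-times k-fold cross-validation.
public class CorrectedResampledTTest {

    /**
     * @param diffs        per-fold score differences, r * k values in total
     * @param testFraction n_test / n_train, e.g. 1.0 / 9 for 10-fold CV
     */
    public static double tStatistic(double[] diffs, double testFraction) {
        int n = diffs.length;
        double mean = 0;
        for (double d : diffs) mean += d;
        mean /= n;
        double var = 0;
        for (double d : diffs) var += (d - mean) * (d - mean);
        var /= (n - 1);
        // The correction replaces 1/n with 1/n + n_test/n_train.
        return mean / Math.sqrt((1.0 / n + testFraction) * var);
    }
}
```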

Each algorithm (training/testing) is executed in a separate thread, in parallel over the same dataset. The distance matrix and the neighbor sets are calculated once at the beginning, across several threads, and then dynamically set/updated for each individual fold. The neighbor sets are initially calculated for a slightly higher k, so that only a few new lookups need to be done during the cross-validation. The cross-validation procedure supports many things, including secondary distances, instance selection, various feature normalizations, adding noise to feature values, and adding mislabeling to the data (the mislabeled array is used for training and the original labels for testing). The distance matrix for each metric and each normalization on a particular dataset is persisted to disk and loaded, instead of recalculated, the next time classification on that dataset is invoked with the same parameters. The user can define a range of neighborhood sizes.
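
The one-thread-per-algorithm pattern described above might be sketched roughly as follows; the Classifier interface here is a placeholder, not Hub Miner's actual evaluation API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Every classifier is evaluated in parallel against the same shared,
// precomputed distance matrix.
public class ParallelEvaluationSketch {

    interface Classifier {
        double evaluateFold(float[][] distanceMatrix, int fold);
    }

    public static double[] evaluateAll(List<Classifier> classifiers,
                                       float[][] distanceMatrix, int fold)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(classifiers.size());
        try {
            List<Future<Double>> results = new ArrayList<>();
            for (Classifier c : classifiers) {
                // The matrix is only read here, so sharing it across
                // threads needs no synchronization.
                Callable<Double> task = () -> c.evaluateFold(distanceMatrix, fold);
                results.add(pool.submit(task));
            }
            double[] scores = new double[classifiers.size()];
            for (int i = 0; i < scores.length; i++) {
                scores[i] = results.get(i).get();  // blocks until done
            }
            return scores;
        } finally {
            pool.shutdown();
        }
    }
}
```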

Classification is not the only thing supported by Hub Miner, as it includes an extensive library for clustering. The following clustering methods are currently supported: K-means, K-means++, K-means pruning, K-harmonic means, K-medoids, Kernel K-means, Agglomerative clustering, DBSCAN, GKH, GHPC, GHPKM, Kernel-GHPKM. Evaluating clustering performance is not an easy task, so many clustering quality indices are implemented and available: Silhouette index, Dunn index, Davies-Bouldin index, SD index, RS index, Entropy (non-homogeneity), Rand index (stability), Goodman-Kruskal index, Jaccard index, Isolation index, C-index, Total squared error. Visualization of the clustering progress is also possible for some methods.
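
As one example from the list above, K-means++ differs from plain K-means only in its seeding: each next center is drawn with probability proportional to the squared distance from the nearest center chosen so far. A sketch, not Hub Miner's clustering API:

```java
import java.util.Arrays;
import java.util.Random;

// K-means++ seeding via D^2 sampling.
public class KMeansPlusPlusSeeding {

    public static double[][] chooseCenters(double[][] data, int k, Random rand) {
        double[][] centers = new double[k][];
        centers[0] = data[rand.nextInt(data.length)];  // first center: uniform
        double[] minSqDist = new double[data.length];
        Arrays.fill(minSqDist, Double.MAX_VALUE);
        for (int c = 1; c < k; c++) {
            // Update each point's distance to its nearest chosen center.
            double total = 0;
            for (int i = 0; i < data.length; i++) {
                minSqDist[i] = Math.min(minSqDist[i],
                        squaredDistance(data[i], centers[c - 1]));
                total += minSqDist[i];
            }
            // Draw the next center proportionally to squared distance.
            double target = rand.nextDouble() * total;
            int chosen = 0;
            for (double running = 0; chosen < data.length - 1; chosen++) {
                running += minSqDist[chosen];
                if (running >= target) break;
            }
            centers[c] = data[chosen];
        }
        return centers;
    }

    private static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            s += d * d;
        }
        return s;
    }
}
```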

There are many exploratory classes in Hub Miner for analyzing and tracking hubness in the data. A GUI that allows for easy visualization of the influence of major data hubs has recently been released: Image Hub Explorer.
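
The core hubness diagnostic is simple to state: count how often each point occurs in other points' k-nearest-neighbor lists, then measure the skewness of that distribution, since high positive skew indicates the presence of hubs. A sketch, assuming precomputed kNN sets:

```java
// Hubness sketch: skewness of the k-occurrence distribution N_k.
public class HubnessSketch {

    /** kNeighbors[i] holds the indexes of point i's k nearest neighbors. */
    public static double kOccurrenceSkewness(int[][] kNeighbors) {
        int n = kNeighbors.length;
        // N_k(x): how many kNN lists each point appears in.
        int[] occurrences = new int[n];
        for (int[] neighborList : kNeighbors) {
            for (int neighbor : neighborList) occurrences[neighbor]++;
        }
        double mean = 0;
        for (int occ : occurrences) mean += occ;
        mean /= n;
        double m2 = 0, m3 = 0;
        for (int occ : occurrences) {
            double dev = occ - mean;
            m2 += dev * dev;
            m3 += dev * dev * dev;
        }
        m2 /= n;
        m3 /= n;
        return m3 / Math.pow(m2, 1.5);  // standardized third moment
    }
}
```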

Hub Miner has some support for structured data, including basic graph and network mining / analysis tools.

Additionally, several stochastic optimization techniques are provided: genetic algorithms, simulated annealing, stochastic hill climbing, particle swarm optimization, and others.
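
As an illustration of this group, here is a generic simulated-annealing loop in its textbook form; the functional interfaces are placeholders rather than Hub Miner's optimization API.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

// Generic simulated annealing with a geometric cooling schedule.
public class SimulatedAnnealingSketch {

    public static <S> S minimize(S initial, ToDoubleFunction<S> cost,
                                 UnaryOperator<S> neighbor,
                                 double startTemp, double cooling,
                                 int iterations, Random rand) {
        S current = initial, best = initial;
        double currentCost = cost.applyAsDouble(current);
        double bestCost = currentCost;
        double temp = startTemp;
        for (int i = 0; i < iterations; i++) {
            S candidate = neighbor.apply(current);
            double candidateCost = cost.applyAsDouble(candidate);
            double delta = candidateCost - currentCost;
            // Accept improvements outright; accept worse moves with
            // probability exp(-delta / temp), so the early, hot phase
            // can escape local minima.
            if (delta < 0 || rand.nextDouble() < Math.exp(-delta / temp)) {
                current = candidate;
                currentCost = candidateCost;
                if (currentCost < bestCost) {
                    best = current;
                    bestCost = currentCost;
                }
            }
            temp *= cooling;  // geometric cooling schedule
        }
        return best;
    }
}
```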

As for scalability and approximate kNN set calculation, an algorithm that constructs the approximate kNN graph via recursive Lanczos bisections is implemented. Support for various locality-sensitive hashing (LSH) techniques is also planned for future releases.
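
The divide-and-conquer idea behind such kNN-graph construction can be sketched as follows; note that this toy version splits along a random coordinate, whereas the actual algorithm (Chen, Fang and Saad) splits along the dominant singular direction found by Lanczos iteration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Simplified divide-and-conquer kNN-graph construction: recursively split
// the point set, solve small subsets exactly, and let the overlap (glue)
// region between the two halves catch cross-boundary neighbors.
public class ApproxKnnGraphSketch {

    /**
     * candidates.get(i), pre-filled with empty sets by the caller,
     * accumulates the approximate neighbor indexes of point i.
     */
    public static void build(double[][] data, List<Integer> subset,
                             List<Set<Integer>> candidates, int k,
                             int leafSize, Random rand) {
        if (subset.size() <= leafSize) {
            bruteForce(data, subset, candidates, k);
            return;
        }
        // Toy split criterion: median of one random coordinate.
        final int dim = rand.nextInt(data[0].length);
        List<Integer> sorted = new ArrayList<>(subset);
        sorted.sort(Comparator.comparingDouble(i -> data[i][dim]));
        int mid = sorted.size() / 2;
        int overlap = sorted.size() / 10;  // glue region shared by halves
        build(data, sorted.subList(0, mid + overlap), candidates, k, leafSize, rand);
        build(data, sorted.subList(mid - overlap, sorted.size()), candidates, k, leafSize, rand);
    }

    private static void bruteForce(double[][] data, List<Integer> subset,
                                   List<Set<Integer>> candidates, int k) {
        for (final int i : subset) {
            subset.stream()
                  .filter(j -> j != i)
                  .sorted(Comparator.comparingDouble(j -> dist(data[i], data[j])))
                  .limit(k)
                  .forEach(candidates.get(i)::add);
        }
    }

    private static double dist(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            s += d * d;
        }
        return s;
    }
}
```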
