Datasets

May 21st, 2013

During the course of my research, I have created or helped create several datasets. The links are below, each with a brief description of the data. Please cite the specified papers if you choose to use the data.

WIKImage is an aligned, labeled image-text dataset built from public Wikipedia images. The D3 dataset currently has about 15,000 images. We have extracted SIFT, SURF and ORB features and quantized them, so the “bag of visual words” representations are available for download, together with the accompanying text from captions and paragraphs. If you use this data, please cite:

@InProceedings{pracner-sikdd2011,
author = {Doni Pracner and Nenad Toma\v{s}ev and Milo\v{s}
Radovanovi\'c and Dunja Mladeni\'c and Mirjana
Ivanovi\'c},
title = {WIKImage: Correlated image and text datasets},
booktitle = {Proc. of the 14th International Multiconference on
Information Society (IS 2011)},
pages = {141--144},
year = 2011,
month = 10,
volume = {A},
publisher = {Institut "Jožef Stefan", Ljubljana},
url = {http://is.ijs.si},
isbn = {978-961-264-035-4},
address = {Ljubljana, Slovenia}
}
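
For readers unfamiliar with the “bag of visual words” representation mentioned for WIKImage, here is a minimal sketch of the quantization step: each local descriptor is assigned to its nearest codebook entry and the assignments are counted. The function names, the toy codebook and the toy descriptors are assumptions made for illustration only; this is not the pipeline that produced the downloadable files.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bovw_histogram(descriptors, codebook):
    """Count how many descriptors fall closest to each visual word.

    descriptors: list of local feature vectors (e.g. SIFT descriptors).
    codebook:    list of visual words (e.g. cluster centroids).
    """
    counts = Counter(
        min(range(len(codebook)), key=lambda j: squared_distance(d, codebook[j]))
        for d in descriptors
    )
    return [counts[j] for j in range(len(codebook))]


# Toy 2-d data, NOT taken from WIKImage.
codebook = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
descriptors = [[0.5, -0.2], [9.0, 11.0], [0.1, 0.3], [1.0, 9.5]]
histogram = bovw_histogram(descriptors, codebook)
print(histogram)  # [2, 1, 1]
```

Real descriptors are much longer (SIFT vectors have 128 components) and real codebooks are much larger, but the counting logic is the same.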

The H1C CoreWar dataset is an unusual one and very challenging for classification. It has 13 classes and 666 instances (no pun intended). The data objects are Redcode assembly-like programs that take part in the CoreWar artificial life simulation and are usually referred to as ‘warriors’. They compete over shared resources and try to eliminate their opponents. The dataset consists of well-known human-coded warriors (warriors can also be evolved via genetic programming). Two views of the data are available: a static one, a ‘bag of instructions’ built from n-grams, and a dynamic one, in which the features are the results the programs achieve in the simulation against a predefined and carefully selected benchmark. The classes are different strategies, and I have manually labeled most of the programs in the dataset. The data is very high-dimensional and difficult to handle, which makes it a good test bed for new classifiers. If you decide to use this data, please cite:

@incollection{tomasev-pkdd2007,
year={2007},
isbn={978-3-540-74975-2},
booktitle={Knowledge Discovery in Databases: PKDD 2007},
volume={4702},
series={Lecture Notes in Computer Science},
editor={Kok, Joost N. and Koronacki, Jacek and Lopez de Mantaras, Ramon and Matwin, Stan and Mladenič, Dunja and Skowron, Andrzej},
doi={10.1007/978-3-540-74976-9_62},
title={Automatic Categorization of Human-Coded and Evolved CoreWar Warriors},
url={http://dx.doi.org/10.1007/978-3-540-74976-9_62},
publisher={Springer Berlin Heidelberg},
author={Tomašev, Nenad and Pracner, Doni and Radovanović, Miloš and Ivanović, Mirjana},
pages={589--596}
}
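
To give a rough idea of what a ‘bag of instructions’ built from n-grams can look like for the CoreWar data, here is a minimal sketch that counts opcode n-grams in a warrior's source. It assumes label-free source with one instruction per line, and the helper names are hypothetical. The exact feature extraction used for the dataset is described in the paper and may well differ.

```python
from collections import Counter


def opcodes(source):
    """Return the opcode sequence of a warrior.

    Simplifying assumption: no labels, one instruction per line,
    and ';' starts a comment.
    """
    ops = []
    for line in source.splitlines():
        code = line.split(";", 1)[0].strip()
        if code:
            # Keep the mnemonic only, dropping any '.X' instruction modifier.
            ops.append(code.split()[0].split(".")[0].upper())
    return ops


def instruction_ngrams(source, n=2):
    """Count n-grams of consecutive opcodes."""
    ops = opcodes(source)
    return Counter(tuple(ops[i:i + n]) for i in range(len(ops) - n + 1))


# The classic 'Dwarf' warrior, used here only as a toy input.
dwarf = """
ADD #4, 3      ; advance the bomb pointer
MOV 2, @2      ; drop a bomb
JMP -2         ; loop
DAT #0, #0
"""

bigrams = instruction_ngrams(dwarf, n=2)
for gram, count in sorted(bigrams.items()):
    print(gram, count)
```

Stacking such counts over a fixed vocabulary of n-grams gives one sparse, high-dimensional vector per warrior.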

This is the high-dimensional Gaussian mixture data with high class overlap that was used in “Hubness-aware shared neighbor distances for high-dimensional k-nearest neighbor classification”, published in Knowledge and Information Systems. There are 10 datasets, each 100-dimensional with 10 classes and more than a thousand points. The data is very difficult for kNN classification, although Bayesian methods handle it easily. If you wish to use this data, please cite:

@article{nenadKAIS,
year={2013},
issn={0219-1377},
journal={Knowledge and Information Systems},
doi={10.1007/s10115-012-0607-5},
title={Hubness-aware shared neighbor distances for high-dimensional k-nearest neighbor classification},
url={http://dx.doi.org/10.1007/s10115-012-0607-5},
publisher={Springer-Verlag},
keywords={Hubs; Metric learning; Curse of dimensionality; $k$-nearest neighbor; Classification; Shared neighbor distances},
author={Toma\v{s}ev, Nenad and Mladeni\'{c}, Dunja},
pages={1--34},
language={English}
}
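
For anyone who wants to experiment with data of a similar flavor before downloading the files, here is a minimal sketch that draws points from overlapping Gaussian classes. Everything in it is an assumption made for illustration: one spherical unit-variance Gaussian per class, class means drawn uniformly from a small range, and the parameter names. It is not the procedure that generated the published datasets.

```python
import random


def make_overlapping_gaussians(n_points=1000, n_dims=100, n_classes=10,
                               mean_spread=0.5, seed=0):
    """Sample labeled points from overlapping spherical Gaussians.

    A small mean_spread keeps the class means close together relative
    to the unit variance, so the classes overlap heavily.
    """
    rng = random.Random(seed)
    means = [[rng.uniform(-mean_spread, mean_spread) for _ in range(n_dims)]
             for _ in range(n_classes)]
    points, labels = [], []
    for i in range(n_points):
        label = i % n_classes  # balanced classes
        points.append([rng.gauss(mu, 1.0) for mu in means[label]])
        labels.append(label)
    return points, labels


points, labels = make_overlapping_gaussians(n_points=200, n_dims=100)
print(len(points), len(points[0]), sorted(set(labels)))
```

Increasing `mean_spread` pulls the classes apart and makes the problem easier for nearest-neighbor methods.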
