(c) Marko Grobelnik, Dunja Mladenic
Artificial Intelligence Laboratory
Jozef Stefan Institute, Slovenia
Text-Garden components enable easy handling of text documents for data analysis, including automatic model generation, document classification, document clustering, document visualization, handling of Web documents, crawling the Web, and many other tasks. The code is written in C++ and was originally developed for the Windows platform; it can be run on Linux/Unix using Wine or a similar utility. The code grew out of our own research needs, guided by our research projects, and was refined as time permitted. The top-level components, built on the core of the software, were contributed over time by several people from our group, including Luka Bradesko, Janez Brank, Blaz Fortuna, Miha Grcar, Jure Leskovec, and Blaz Novak.
Please reference the Web site if you are using any of the provided utilities.
Lexical text processing
Lexical text processing in Text-Garden includes operations such as tokenization, stop-word removal, stemming, n-gram generation, and WordNet usage. This functionality is exposed mainly through the parameters of the utility that transforms textual data into a vector representation in Bag-Of-Words format, stored with the file extension “.Bow”.
- Text To Bag-Of-Words Converter (Txt2Bow) Download
Transforms various raw text formats, such as Text-Base, Transactions-File, Compact-Documents-File, and some standard datasets (e.g., Reuters), into a file in the Bag-Of-Words format “.Bow”.
Parameters and example call.
- Html To Xml Converter (Html2Xml) Download
Transforms HTML documents into cleaned XML documents.
Parameters and example call.
- Html To Text Converter (Html2Txt) Download
Transforms HTML documents into cleaned text documents.
Parameters and example call.
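The lexical steps named above (tokenization, stop-word removal, n-grams) can be illustrated with a minimal Python sketch. It is not the Txt2Bow implementation and does not read or write the “.Bow” format; the stop-word list, the function names and the `max_ngram_len` parameter are assumptions made for illustration, and stemming is left out.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, max_ngram_len=2):
    """Count word n-grams (up to max_ngram_len) after stop-word removal.

    Note: n-grams are built after stop words are dropped, so a bigram
    may join two words that were separated by a stop word in the text.
    """
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    counts = Counter()
    for n in range(1, max_ngram_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    text = "Crawling the Web and clustering of Web documents"
    for term, count in sorted(bag_of_words(text).items()):
        print(term, count)
```

Each document becomes a sparse mapping from terms to counts, which is the kind of vector representation the clustering and learning utilities below take as input.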
Unsupervised Learning
- Bag-Of-Words K-Means clustering (BowKMeans) Download
Performs K-Means clustering on documents in the Bag-Of-Words format and outputs the clustering in different formats, such as a text file or an XML file.
Parameters and example call.
- Bag-Of-Words Hierarchical-K-Means clustering (BowHKMeans) Download
Performs hierarchical K-Means clustering on documents in the Bag-Of-Words format and outputs the clustering in different formats, such as a text file or an XML file.
Parameters and example call.
- One-Class Support-Vector-Machine training algorithm (BowTrainOneClassSVM) Download
Learns a model by training a one-class Support Vector Machine on the set of input documents provided in the Bag-Of-Words format.
Parameters and example call.
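As an illustration of K-Means on sparse document vectors, here is a minimal Python sketch. It is not the BowKMeans code: the use of cosine similarity, the seeding with the first k documents, and all function names are assumptions chosen to keep the example short and reproducible.

```python
import math


def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def dot(a, b):
    """Dot product of two sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(docs, k, iterations=20):
    """Cluster sparse document vectors; returns one cluster index per document."""
    docs = [normalize(d) for d in docs]
    # Assumption: the first k documents seed the centroids (deterministic).
    centroids = [dict(d) for d in docs[:k]]
    assignment = [-1] * len(docs)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        new_assignment = [
            max(range(k), key=lambda c: dot(d, centroids[c])) for d in docs
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the normalized sum of its members.
        for c in range(k):
            total = {}
            for d, a in zip(docs, assignment):
                if a == c:
                    for t, w in d.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the old centroid if the cluster is empty
                centroids[c] = normalize(total)
    return assignment


if __name__ == "__main__":
    docs = [{"svm": 1, "kernel": 1}, {"web": 1, "crawl": 1}, {"svm": 2}, {"web": 3}]
    print(kmeans(docs, k=2))
```

A hierarchical variant can be obtained by applying the same procedure recursively to the documents of each resulting cluster.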
Semi-Supervised Learning
- Active learning on sparse training sets using binary SVM model (ALTrainBinSVM) Download
Performs an active learning loop on the specified input.
Parameters and example call.
- Semi-Supervised transduction (BowTrainRegSVM) Download
Performs transductive inference on a joint labelled and unlabelled dataset.
Parameters and example call.
Supervised Learning
- Binary-class Support-Vector-Machine training algorithm (BowTrainBinSVM) Download
Learns a model by training a binary-class Support Vector Machine on the set of input documents provided in the Bag-Of-Words format.
Parameters and example call.
- Logistic Regression training algorithm (BowTrainLogReg) Download
Learns a model using logistic regression on the set of input documents provided in the Bag-Of-Words format.
Parameters and example call.
- Winnow training algorithm (BowTrainWinnow) Download
Learns a model by training a Winnow classifier on the set of input documents provided in the Bag-Of-Words format.
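For readers unfamiliar with Winnow, the following Python sketch shows the mistake-driven, multiplicative update rule on binary features. It is an illustration only, not the BowTrainWinnow code: it uses the common variant that divides weights on demotion, and the function names, the promotion factor `alpha`, and the choice of threshold (the number of features) are assumptions.

```python
def predict_winnow(weights, threshold, active):
    """Predict True if the summed weights of the active features reach the threshold."""
    return sum(weights[i] for i in active) >= threshold


def train_winnow(examples, num_features, alpha=2.0, epochs=10):
    """Train on (active_feature_indices, label) pairs, with labels True/False."""
    weights = [1.0] * num_features      # all weights start at 1
    threshold = float(num_features)     # assumption: threshold = number of features
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            if predict_winnow(weights, threshold, active) == label:
                continue                # weights change only on mistakes
            mistakes += 1
            # Promote on a missed positive, demote on a false positive.
            factor = alpha if label else 1.0 / alpha
            for i in active:
                weights[i] *= factor
        if mistakes == 0:
            break                       # a full pass without errors
    return weights, threshold


if __name__ == "__main__":
    # Toy data: the label is True exactly when feature 0 is present.
    examples = [([0, 1], True), ([1, 2], False), ([0, 3], True), ([2, 3], False)]
    weights, threshold = train_winnow(examples, num_features=4)
    print(weights, predict_winnow(weights, threshold, [0]))
```

In a text setting, the active features of a document would be the terms that occur in its Bag-Of-Words vector.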
Classification of Documents
- Generic classifier using models created with one of the training algorithms (BowClassify) Download
Classifies input documents provided in the Bag-Of-Words format using the provided model.
- Classification using nearest-neighbour algorithm (Bow2NNbrs) Download
Classifies input documents provided in the Bag-Of-Words format using the nearest-neighbour algorithm.
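Nearest-neighbour classification over Bag-Of-Words vectors can be sketched as follows in Python. This is a generic illustration rather than the Bow2NNbrs implementation: cosine similarity, majority voting, the parameter `k`, and the function names are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, labelled_docs, k=3):
    """Return the majority label among the k documents most similar to query.

    labelled_docs is a list of (sparse_vector, label) pairs.
    """
    ranked = sorted(
        labelled_docs, key=lambda item: cosine(query, item[0]), reverse=True
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        ({"svm": 1, "margin": 1}, "ml"),
        ({"kernel": 1, "svm": 2}, "ml"),
        ({"crawl": 1, "web": 1}, "web"),
        ({"web": 2, "page": 1}, "web"),
    ]
    print(knn_classify({"svm": 1}, training, k=3))
```

Unlike the model-based classifiers above, this approach needs no training step; the labelled documents themselves act as the model.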
Feature construction/extraction
- Projecting on the semantic-space of documents (ProjBow2SemSpace) Download
Projects documents provided in the Bag-Of-Words format to a semantic-space representation of documents.
Parameters and example call.
- Feature Extraction from Images (ImgFeatures) Download
The utility extracts various groups of features from input images.
Parameters and example call.
Visualization of documents based on clustering
- Creating graph representation of documents (Bow2VizGraph) Download
Creates a graph representation of input documents provided in the Bag-Of-Words format and outputs an .xml file.
Parameters and example call.
- Creating tiling representation of documents (Bow2VizTile) Download
Creates a tiling representation of input documents provided in the Bag-Of-Words format and outputs an .xml file.
Parameters and example call.
- Visualization of documents represented with graph representation (BowGraphViz) Download
Provides visualization of documents as a graph using a graphical interface.
Parameters and example call.
- Visualization of documents represented with tiling representation (BowTileViz) Download
Provides visualization of documents in a tiling representation using a graphical interface.
Parameters and example call.
Visualization of documents based on semantic-space
- Visualization of semantic-space of documents (Bow2VizMap) Download
Provides visualization of documents as a 2-D map based on the semantic-space representation.
Parameters and example call.
- Exploring visualization of semantic-space of documents (VizMap) Download
Explores visualizations created by Bow2VizMap via a graphical user interface.
Parameters and example call.
Crawling
- Get one page from the web (GetWebPage) Download
Retrieves a Web document from the address specified by a URL.
Parameters and example call.
- Search on Google (Google2RSet) Download
Searches Google using the provided keywords and writes the result set to an XML file.
- Crawl Google Scholar (GoogleScholar2Xml) Download
Searches Google Scholar using the provided keywords and writes the result set to an XML file.