---------------------------------------- ----------------------------------------

Text-Garden -- Text-Mining Software Tools

---------------------------------------- ----------------------------------------

(c) Marko Grobelnik, Dunja Mladenic
Department of Knowledge Technologies
Jozef Stefan Institute, Slovenia 

Text-Mining Software Tools make it easy to handle text documents for data analysis, including automatic model generation and document classification, document clustering, document visualization, handling Web documents, crawling the Web, and many other tasks. The code is written in C++. It was originally developed for the Windows platform and can be run on Linux/Unix using Wine or a similar utility. The code grew out of our own research needs, guided by our research projects, and was refined and polished as time permitted. The top-level components, built on the core of the software, were contributed over time by several people from our group, including Janez Brank, Blaz Fortuna, Miha Grcar, Jure Leskovec, and Blaz Novak.

Please reference the Web site <www.textmining.net> if you use any of the provided utilities.

----------------------------------------

File formats used for document representation

Three main file formats for document representation are used by the tools. They cover different ways of handling text documents:

  1. Compact-Documents format with the file extension ".Cpd"
  2. Text-Base format with the file extension ".TBs"
  3. Bag-Of-Words format with the file extension ".Bow"
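
The mapping from file extension to format listed above can be sketched in C++ as follows. This is only an illustration and not part of the Text-Garden code: the enum DocFormat and the functions lowerExtension and formatFromFileName are hypothetical names, and the case-insensitive comparison is an assumption based on the tools originally targeting Windows.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical enum for illustration; not part of the Text-Garden API.
enum class DocFormat { CompactDocuments, TextBase, BagOfWords, Unknown };

// Returns the lower-cased extension (including the dot) of a file name,
// or an empty string if the last path component has no extension.
static std::string lowerExtension(const std::string& fileName) {
    const std::string::size_type dot = fileName.find_last_of('.');
    const std::string::size_type sep = fileName.find_last_of("/\\");
    if (dot == std::string::npos) return "";
    // A dot that appears before the last path separator belongs to a directory.
    if (sep != std::string::npos && dot < sep) return "";
    std::string ext = fileName.substr(dot);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext;
}

// Maps a file name to one of the three document formats by its extension.
// Assumption: extensions are compared case-insensitively.
DocFormat formatFromFileName(const std::string& fileName) {
    const std::string ext = lowerExtension(fileName);
    if (ext == ".cpd") return DocFormat::CompactDocuments;
    if (ext == ".tbs") return DocFormat::TextBase;
    if (ext == ".bow") return DocFormat::BagOfWords;
    return DocFormat::Unknown;
}
```

For example, under these assumptions formatFromFileName("news.Bow") yields DocFormat::BagOfWords, while a name with any other extension yields DocFormat::Unknown.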

----------------------------------------

Pre-processing of Documents

----------------------------------------

Document Clustering

----------------------------------------   

Learning Model for Classification of Documents

----------------------------------------   

Using unlabeled data

----------------------------------------

Classification of Documents

----------------------------------------

Feature construction/extraction

----------------------------------------

Visualization of documents based on clustering

----------------------------------------

Visualization of documents based on semantic space

----------------------------------------

Simple Web Mining

----------------------------------------

Crawling

----------------------------------------

Search engine

---------------------------------------- ----------------------------------------