---------------------------------------- ----------------------------------------

Text-Garden -- Text-Mining Software Tools

---------------------------------------- ----------------------------------------

(c) Marko Grobelnik, Dunja Mladenic
Department of Knowledge Technologies
Jozef Stefan Institute, Slovenia 

Text-Mining Software Tools enable easy handling of text documents for data analysis, including automatic model generation and document classification, document clustering, document visualization, handling Web documents, crawling the Web, and many other tasks. The code is written in C++ and was originally developed for the Windows platform; it can be run on Linux/Unix using Wine or a similar utility. The code grew out of our own research needs, guided by our research projects, and was refined and polished as time permitted. The top-level components, built on the core of the software, were contributed by several people from our group.

----------------------------------------

File formats used for document representation

Three main file formats for document representation are used by the tools. They cover different ways of handling text documents:

  1. Compact-Documents format with the file extension ".Cpd"
  2. Text-Base format with the file extension ".TBs"
  3. Bag-Of-Words format with the file extension ".Bow"

----------------------------------------

Pre-processing of Documents

----------------------------------------

Document Clustering

----------------------------------------

Html2Xml.Exe

Transforms HTML documents into cleaned XML documents. The format of the output file ('-o' parameter) is controlled by several parameters, all of which are turned on by default. '-otxt' enables output of continuous parts of text. '-ourl' enables output of URLs appearing in the text (which may be made absolute by providing a base URL in the '-u' parameter). '-otok' enables output of single tokens from the original HTML. '-otag' enables output of tags, and '-oarg' enables output of tag parameters (arguments).

Command line parameters:

  • -i:Input-Html-File (default:'')
  • -o:Output-XML-File (default:'')
  • -u:Base-Url (default:'')
  • -otxt:Output-Text (default:'T')
  • -ourl:Output-Urls (default:'T')
  • -otok:Output-Tokens (default:'T')
  • -otag:Output-Tags (default:'T')
  • -oarg:Output-Arguments (default:'T')

Example:

    Html2Xml.Exe -i:test.html -o:test.xml -u:http://www.ijs.si/
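
The parameter list above can be assembled programmatically. The following Python sketch is not part of Text-Garden; the helper name html2xml_args is hypothetical, and it assumes that a switch listed with default 'T' can be turned off by passing 'F' in the same -name:value form.

```python
def html2xml_args(html_path, xml_path, base_url="", **switches):
    """Build an argument list for Html2Xml.Exe (hypothetical helper).

    switches maps the documented option names (otxt, ourl, otok, otag, oarg)
    to booleans. Options that are not given are left out, so the tool's
    documented default ('T') applies to them.
    """
    allowed = ("otxt", "ourl", "otok", "otag", "oarg")
    unknown = set(switches) - set(allowed)
    if unknown:
        raise ValueError(f"unknown Html2Xml switches: {sorted(unknown)}")

    args = ["Html2Xml.Exe", f"-i:{html_path}", f"-o:{xml_path}"]
    if base_url:
        # Base URL used to make relative URLs absolute.
        args.append(f"-u:{base_url}")
    for name in allowed:
        if name in switches:
            # Assumption: 'T'/'F' are the accepted on/off values.
            args.append(f"-{name}:{'T' if switches[name] else 'F'}")
    return args


if __name__ == "__main__":
    print(" ".join(html2xml_args("test.html", "test.xml",
                                 "http://www.ijs.si/", otok=False)))
```

The function only builds the command line; running it is left to the caller, so the sketch makes no claim about the tool's exit codes or output.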
----------------------------------------

Html2Txt.Exe

Transforms HTML documents into cleaned text documents. By default it writes only text. The '-ourl' parameter adds the URLs appearing in the HTML file, and the '-otag' parameter adds a plain version of the tags.

Command line parameters:

  • -i:Input-Html-File (default:'')
  • -o:Output-Text-File (default:'')
  • -u:Base-Url (default:'')
  • -ourl:Output-Urls (default:'F')
  • -otag:Output-Tags (default:'F')

Example:

    Html2Txt.Exe -i:test.html -o:test.txt -u:http://www.ijs.si/
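
Html2Txt.Exe takes one input file per call, so converting a collection means one invocation per document. The Python sketch below prepares those invocations; the helper name html2txt_commands and the choice to name each output after its input file are illustrative assumptions, not behaviour of the tool.

```python
from pathlib import Path


def html2txt_commands(html_files, out_dir, base_url=""):
    """Return one Html2Txt.Exe argument list per input file (hypothetical helper).

    Each output file keeps the input's base name and gets a .txt extension
    inside out_dir.
    """
    commands = []
    for html_file in html_files:
        txt_file = Path(out_dir) / Path(html_file).with_suffix(".txt").name
        args = ["Html2Txt.Exe", f"-i:{html_file}", f"-o:{txt_file}"]
        if base_url:
            args.append(f"-u:{base_url}")
        commands.append(args)
    return commands


if __name__ == "__main__":
    for cmd in html2txt_commands(["test.html", "index.htm"], "txt"):
        print(" ".join(cmd))
```

Each returned list can be handed to a process launcher such as Python's subprocess.run.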
----------------------------------------

GetWebPage.Exe

Retrieves a Web document from the address given in the input URL parameter ('-i'). The document can be saved to an HTTP file ('-ohttp' parameter), an HTTP-body file ('-o' parameter), an XML file ('-oxml' parameter), or a text file ('-otxt' parameter), or written to the screen ('-oscr' parameter).

Command line parameters:

  • -i:Input-Url (default:'')
  • -ohttp:Output-Http-File (default:'WebPage.Http')
  • -o:Output-Http-Body-File (default:'WebPage.Body')
  • -oxml:Output-Xml-File (default:'WebPage.Xml')
  • -xotxt:Xml-Output-Text (default:'T')
  • -xourl:Xml-Output-Urls (default:'T')
  • -xotok:Xml-Output-Tokens (default:'T')
  • -xotag:Xml-Output-Tags (default:'T')
  • -xoarg:Xml-Output-Arguments (default:'T')
  • -otxt:Output-Text-File (default:'WebPage.Txt')
  • -tourl:Text-Output-Urls (default:'F')
  • -totag:Text-Output-Tags (default:'F')
  • -oscr:Output-To-Screen (default:'F')

Example:

    GetWebPage.Exe -i:http://www.ijs.si/
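
Since the tools are Windows executables that can be run on Linux/Unix through Wine, a wrapper script may want to pick the right launcher for the current platform. The Python sketch below shows one way to do that for GetWebPage.Exe; the helper name getwebpage_command is hypothetical, it assumes the wine command is on the PATH, and it assumes '-oscr:T' switches on screen output because the documented default is 'F'.

```python
import platform


def getwebpage_command(url, to_screen=False, system=None):
    """Build a GetWebPage.Exe command, prefixed with 'wine' off Windows.

    system defaults to platform.system(); pass it explicitly to build a
    command for another platform.
    """
    system = system or platform.system()
    args = ["GetWebPage.Exe", f"-i:{url}"]
    if to_screen:
        # Assumption: 'T' enables the option whose default is 'F'.
        args.append("-oscr:T")
    if system != "Windows":
        # Assumption: Wine is installed and reachable as 'wine'.
        args.insert(0, "wine")
    return args


if __name__ == "__main__":
    print(" ".join(getwebpage_command("http://www.ijs.si/", to_screen=True)))
```

With no output options given, the tool's documented default file names (WebPage.Http, WebPage.Body, WebPage.Xml, WebPage.Txt) apply.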
---------------------------------------- ----------------------------------------