Paired-Set To Semantic-Space

The utility learns a language-independent semantic space for two languages from a paired corpus ("-ips"). It also uses one Bag-Of-Words file per language ("-ibow1", "-ibow2"). It outputs two Semantic-Space files, one per language ("-ossp1", "-ossp2").

Parameter "-t" is the regularization parameter tau from the derivations. Parameter "-tnrm" determines how basis vectors are normalized after learning ("none" means no normalization, "one" means normalizing to norm 1 and "eigval" means normalizing to the vector's eigenvalue). Parameter "-eps" determines the stopping criterion for the incomplete Cholesky decomposition. Parameter "-docs" determines the number of documents that are randomly selected from the paired corpus (the randomizer is initialized with parameter "-seed"). Parameter "-dim" determines the dimension of the calculated semantic space. Parameter "-len" determines the maximal length of documents from the paired corpus that are used for learning. If documents are split into paragraphs and a document is longer than the maximal length, then only a random subset of its paragraphs is used. Parameter "-stat" determines whether a text file with statistics is made for each semantic space.
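
The document selection controlled by "-docs", "-seed" and "-len" can be illustrated with a minimal Python sketch. It is not the utility's actual code. The function names are hypothetical, document length is assumed to be counted in whitespace-separated words, and the way paragraphs are picked (shuffle, then keep those that still fit) is only one possible reading of "random subset of paragraphs".

```python
import random

def doc_length(paragraphs):
    # Assumption: length is the number of whitespace-separated words.
    return sum(len(p.split()) for p in paragraphs)

def truncate_doc(paragraphs, max_len, rng):
    # "-len": -1 means no limit; short documents are kept whole.
    if max_len < 0 or doc_length(paragraphs) <= max_len:
        return list(paragraphs)
    # Visit paragraphs in random order and keep those that still fit.
    order = list(range(len(paragraphs)))
    rng.shuffle(order)
    kept, total = [], 0
    for i in order:
        n = len(paragraphs[i].split())
        if total + n <= max_len:
            kept.append(i)
            total += n
    # Restore the original paragraph order for the kept subset.
    return [paragraphs[i] for i in sorted(kept)]

def select_training_docs(docs, n_docs, max_len, seed=0):
    # docs: list of documents, each a list of paragraph strings.
    rng = random.Random(seed)                           # "-seed"
    chosen = rng.sample(docs, min(n_docs, len(docs)))   # "-docs"
    return [truncate_doc(d, max_len, rng) for d in chosen]
```

With a fixed seed the same subset is returned on every call, which is presumably why the utility exposes "-seed" at all.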

usage: PrSet2SemSpace.exe
-ips:Input-PrSet-File-Name (default:'')
-ibow1:Input-Bow-File-Name-For-First-Language (default:'')
-ibow2:Input-Bow-File-Name-For-Second-Language (default:'')
-ossp1:Output-Semantic-Space-File-Name-For-First-Language (default:'')
-ossp2:Output-Semantic-Space-File-Name-For-Second-Language (default:'')
-t:Regularization-Parameter-For-KCCA (default:0.5)
-tnrm:Correlation-Normalization-Type (none, one, eigval) (default:'one')
-eps:Threshold-For-Partial-Gram-Schmidt (default:0.4)
-docs:Number-Of-Documents-For-Training-KCCA (default:1000)
-dim:Number-Of-Calculated-Dimensions (default:500)
-len:Maximal-Length-Of-Training-Document (-1 for no limit) (default:1000)
-seed:Seed-For-Randomizer (default:0)
-stat:Make-Semantic-Space-Statistics (default:'F')
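
The three "-tnrm" options listed above can be sketched in Python as follows. This is an illustration under assumptions, not the utility's implementation: the function name is hypothetical, the norm is assumed to be Euclidean, and "eigval" is read as rescaling the vector so that its norm equals its eigenvalue.

```python
import math

def normalize_basis_vector(vec, eigval, mode="one"):
    # mode mirrors the "-tnrm" values: "none", "one" or "eigval".
    if mode == "none":
        return list(vec)
    norm = math.sqrt(sum(x * x for x in vec))  # assumed Euclidean norm
    if norm == 0.0:
        return list(vec)  # a zero vector cannot be rescaled
    if mode == "one":
        target = 1.0
    elif mode == "eigval":
        target = eigval   # assumed meaning: resulting norm equals eigenvalue
    else:
        raise ValueError("unknown normalization type: " + mode)
    return [x * target / norm for x in vec]
```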

Example 1:
PrSet2SemSpace -ips:En-De.PrSet -ibow1:bow.en.bow -ossp1:en.ssp -ossp2:de.ssp -odid:didv.dat -t:0.5 -tnrm:one -docs:500 -len:5000 -stat:T

The above example learns a common semantic space for two languages from the paired set "En-De.PrSet". Bag-Of-Words files "bow.en.bow" and "" are used to define the word space for each language. It uses a random subset of 500 documents, each with a maximal length of 5000, to calculate the semantic space. The regularization parameter is 0.5 and all directions are normalized to norm 1.
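
When the utility is driven from a script, the "-name:value" argument format can be assembled programmatically. The Python sketch below is a hypothetical helper, not part of the utility. It uses only flags that appear in the usage text and takes its values from Example 1. "-ibow2" is left out because the example gives no file name for it.

```python
import subprocess

def build_args(exe="PrSet2SemSpace", **params):
    # Each keyword becomes one "-name:value" argument, in the given order.
    return [exe] + ["-{}:{}".format(name, value) for name, value in params.items()]

def run(args):
    # Assumes the executable is on PATH; raises CalledProcessError on failure.
    return subprocess.run(args, check=True)

args = build_args(
    ips="En-De.PrSet",
    ibow1="bow.en.bow",
    ossp1="en.ssp",
    ossp2="de.ssp",
    t=0.5,
    tnrm="one",
    docs=500,
    len=5000,
    stat="T",
)
```

Calling run(args) would then start the utility with these arguments.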