The utility learns a language-independent semantic space for two languages from a paired corpus ("-ips"). It also uses Bag-Of-Words files, one for each language ("-ibow1", "-ibow2"), and outputs two Semantic-Space files, one per language ("-ossp1", "-ossp2").
Parameter "-t" is the regularization parameter tau from the derivations.
Parameter "-tnrm" determines how basis vectors are normalized after learning
("none" means no normalization, "one" means normalizing to norm 1 and "eigval"
means normalizing each vector to its eigenvalue). Parameter "-eps" determines
the stopping criterion for the incomplete Cholesky decomposition. Parameter
"-docs" determines the number of documents that are randomly selected from the
paired corpus (the randomizer is initialized with parameter "-seed"). Parameter
"-dim" determines the dimension of the calculated semantic space. Parameter
"-len" determines the maximal length of documents from the paired corpus used
for learning ("-1" means no limit). If documents are split into paragraphs and
a document is longer than the maximal length, then only a random subset of its
paragraphs is used. Parameter "-stat" determines whether a text file with
statistics is written for each semantic space.
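
The three "-tnrm" modes described above can be illustrated with a short
Python sketch. It is not taken from the utility's source: the function name
"normalize_basis" and the list-of-lists vector layout are assumptions made
for illustration, and "eigval" is read here as rescaling each vector so that
its norm equals its eigenvalue.

```python
import math


def normalize_basis(vectors, eigenvalues, mode="one"):
    """Rescale basis vectors in the spirit of the "-tnrm" option (hypothetical helper).

    vectors     -- list of basis vectors, each a list of floats
    eigenvalues -- one eigenvalue per basis vector
    mode        -- "none", "one" or "eigval"
    """
    if mode not in ("none", "one", "eigval"):
        raise ValueError("mode must be 'none', 'one' or 'eigval'")
    result = []
    for vec, eigval in zip(vectors, eigenvalues):
        norm = math.sqrt(sum(x * x for x in vec))
        # Leave the vector untouched for "none" and for zero vectors.
        if mode == "none" or norm == 0.0:
            result.append(list(vec))
            continue
        # Target norm: 1 for "one", the eigenvalue for "eigval" (assumed reading).
        target = 1.0 if mode == "one" else eigval
        scale = target / norm
        result.append([x * scale for x in vec])
    return result


# Example call: one basis vector of norm 5 with eigenvalue 0.5.
unit = normalize_basis([[3.0, 4.0]], [0.5], mode="one")
scaled = normalize_basis([[3.0, 4.0]], [0.5], mode="eigval")
```

With "one" all directions carry equal weight in the semantic space, while
"eigval" keeps the relative importance of the directions found by KCCA.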
usage: PrSet2SemSpace.exe
-ips:Input-PrSet-File-Name (default:'')
-ibow1:Input-Bow-File-Name-For-First-Language (default:'')
-ibow2:Input-Bow-File-Name-For-Second-Language (default:'')
-ossp1:Output-Semantic-Space-File-Name-For-First-Language (default:'')
-ossp2:Output-Semantic-Space-File-Name-For-Second-Language (default:'')
-t:Regularization-Parameter-For-KCCA (default:0.5)
-tnrm:Correlation-Normalization-Type (none, one, eigval) (default:'one')
-eps:Threshold-For-Partial-Gram-Schmidt (default:0.4)
-docs:Number-Of-Documents-For-Training-KCCA (default:1000)
-dim:Number-Of-Calculated-Dimensions (default:500)
-len:Maximal-Length-Of-Training-Document (-1 for no limit) (default:1000)
-seed:Seed-For-Randomizer (default:0)
-stat:Make-Semantic-Space-Statistics (default:'F')
Example 1:
PrSet2SemSpace -ips:En-De.PrSet -ibow1:bow.en.bow -ibow2:bow.de.bow -ossp1:en.ssp -ossp2:de.ssp -odid:didv.dat -t:0.5 -tnrm:one -docs:500 -len:5000 -stat:T
The above example learns a common semantic space for two languages based on the
paired set "En-De.PrSet". Bag-Of-Words files "bow.en.bow" and "bow.de.bow" are
used to define the word space for each language. It uses a random subset of 500
documents, each with a maximal length of 5000, to calculate the semantic space.
The regularization parameter is 0.5, all basis vectors are normalized to norm 1
and a statistics file is written for each semantic space.
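
The document selection controlled by "-docs", "-seed" and "-len" can be
pictured with the following Python sketch. It is only an illustration under
stated assumptions, not the utility's actual code: the helper names are
hypothetical, a document is modeled as a list of paragraphs, a paragraph as a
list of words, and length is assumed to be counted in words. For a paired
corpus the same selected document ids would be applied to both languages.

```python
import random


def sample_doc_ids(n_total, n_docs, seed=0):
    """Pick up to n_docs document ids at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), min(n_docs, n_total)))


def truncate_paragraphs(paragraphs, max_len, rng):
    """Keep a random subset of paragraphs whose total length fits into max_len.

    A negative max_len means no limit, mirroring "-len:-1".
    """
    total = sum(len(p) for p in paragraphs)
    if max_len < 0 or total <= max_len:
        return list(paragraphs)
    # Visit paragraphs in random order and keep those that still fit.
    order = list(range(len(paragraphs)))
    rng.shuffle(order)
    kept, length = [], 0
    for i in order:
        if length + len(paragraphs[i]) <= max_len:
            kept.append(i)
            length += len(paragraphs[i])
    # Restore the original paragraph order for the kept subset.
    return [paragraphs[i] for i in sorted(kept)]


# Example call mirroring "-docs:500 -len:5000 -seed:0" on a toy corpus of
# 2000 documents with 8 paragraphs of 1000 words each.
corpus = [[["w"] * 1000 for _ in range(8)] for _ in range(2000)]
rng = random.Random(0)
ids = sample_doc_ids(len(corpus), 500, seed=0)
training = [truncate_paragraphs(corpus[i], 5000, rng) for i in ids]
```

Fixing the seed makes the selection repeatable, so two runs with the same
parameters train on the same subset of the paired corpus.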