NiuTrans Open Source Statistical Machine Translation System

Command
$ cd NiuTrans/bin/
$ mkdir -p ../work/lex/
$ ./NiuTrans.PhraseExtractor --LEX \
-src ../sample-data/sample-submission-version/TM-training-set/chinese.txt \
-tgt ../sample-data/sample-submission-version/TM-training-set/english.txt \
-aln ../sample-data/sample-submission-version/TM-training-set/Alignment.txt \
-out ../work/lex/lex
where
--LEX, which indicates that the program (NiuTrans.PhraseExtractor) extracts lexical translations.
-src, which specifies the source sentences of the bilingual training corpus.
-tgt, which specifies the target sentences of the bilingual training corpus.
-aln, which specifies the word alignments between the source and target sentences.
-out, which specifies the prefix of the output files (i.e., the lexical translation files).
There are also some optional parameters:
-temp, which specifies the directory for the temporary files generated by sorting during processing.
-stem, which specifies whether stemming is used, i.e., if -stem is specified, all the words are stemmed.
Output: two files, "lex.s2d.sorted" and "lex.d2s.sorted", are generated in "/NiuTrans/work/lex/".
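To make the idea of a lexical translation file more concrete, the following is a minimal sketch of how word-level translation probabilities can be estimated from a word-aligned corpus by relative frequency. It is not NiuTrans code and does not reproduce the format of the generated files. The function name, the toy sentences, and the "i-j" alignment notation (0-based source index, then target index) are all assumptions made for illustration.

```python
from collections import Counter


def lexical_translation_probs(src_sents, tgt_sents, alignments):
    """Estimate p(target word | source word) by relative frequency.

    Assumption: each alignment line is a space-separated list of "i-j"
    links, where i indexes the source words and j the target words
    (both 0-based). Unaligned words are simply ignored in this sketch.
    """
    pair_counts = Counter()   # count(source word, target word)
    src_counts = Counter()    # count(source word) over all links
    for src, tgt, aln in zip(src_sents, tgt_sents, alignments):
        s_words = src.split()
        t_words = tgt.split()
        for link in aln.split():
            i, j = (int(x) for x in link.split("-"))
            pair_counts[(s_words[i], t_words[j])] += 1
            src_counts[s_words[i]] += 1
    return {
        (s, t): count / src_counts[s]
        for (s, t), count in pair_counts.items()
    }


# Hypothetical toy corpus (romanized source side, for illustration only).
src = ["wo xihuan mao", "wo xihuan gou"]
tgt = ["I like cats", "I love dogs"]
aln = ["0-0 1-1 2-2", "0-0 1-1 2-2"]

probs = lexical_translation_probs(src, tgt, aln)
for (s, t), p in sorted(probs.items()):
    print(s, t, p)
```

Swapping the roles of the source and target sides in the same procedure gives the estimates for the opposite direction, which is why two files are produced.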
Output (/NiuTrans/work/lex/)
- lex.s2d.sorted    "source → target" lexical translation file
- lex.d2s.sorted    "target → source" lexical translation file
3.2.3 Generating Phrase Translation Table
The next step is to generate the phrase translation table, which is then used in the decoding step.
Basically, the phrase table is a collection of phrase-pairs with associated scores (or features). In NiuTrans,
all the phrase-pairs are sorted in alphabetical order, which allows the system to load the phrase table
efficiently and organize it in an internal data structure. Each entry of the phrase table is made up of several
fields. To illustrate their meaning, Figure 3.7 shows a sample table.
In this example, each line is separated into five fields using " ||| ". Their meanings are:
• The first field is the source side of the phrase-pair.