Download Tilde's wrapper system for CollTerm

Contract no. 248347






“--output” specifies the file to which the extracted parallel sentence pairs are written;
“--param seg=true” specifies that the text in the source and target documents is
already sentence-split and tokenized (default “false”);
“--param maxrep=<integer>” specifies the maximum number of target sentences that may be
aligned to one source sentence (default “1”);
“--param kif=true” instructs LEXACC to keep the intermediary files it produces
instead of deleting them (default “false”). This is useful for debugging.
When processing very large corpora, setting this parameter to “true” is recommended
because LEXACC may crash while sorting (in memory) the extracted
pairs by the translation similarity measure;
“--param t=<float>” causes LEXACC to output only those sentence pairs that have a
translation similarity measure above the specified real value (default “0.2”);
“--param filter=false” disables the pre-filtering step that LEXACC otherwise applies to the
candidate sentence pairs before computing the PEXACC translation similarity
measure (default “true”). Filtering greatly reduces the running time, but it also reduces
the recall of LEXACC.
For instance, to run LEXACC on an English-Romanian comparable corpus with available
document alignments, requesting at most 2:2 sentence alignments with a translation
similarity score of at least 0.3, with filtering and with LEXACC-supplied sentence splitting
and tokenization, the command line would be:
lexacc32.exe --source en --target ro --docalign en-ro-docalign-list.txt \
--param seg=false --param filter=true --param maxrep=2 \
--param t=0.3 --output results.txt
or, using the defaults:
lexacc32.exe --source en --target ro --docalign en-ro-docalign-list.txt \
--param maxrep=2 --param t=0.3 --output results.txt
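The same invocation can also be assembled programmatically. The following is only a minimal Python sketch, not part of LEXACC: the helper name build_lexacc_args and its keyword arguments are hypothetical, and it uses only the options shown in this section. Options left unset are omitted so that LEXACC applies its own defaults.

```python
from typing import List, Optional


def build_lexacc_args(
    source: str,
    target: str,
    docalign: str,
    output: str,
    maxrep: Optional[int] = None,
    t: Optional[float] = None,
    seg: Optional[bool] = None,
    filter_pairs: Optional[bool] = None,
    kif: Optional[bool] = None,
    executable: str = "lexacc32.exe",
) -> List[str]:
    """Assemble a LEXACC argument list (hypothetical helper).

    Parameters left as None are not emitted, so LEXACC falls back to the
    defaults documented above (seg=false, maxrep=1, kif=false, t=0.2,
    filter=true).
    """
    args = [executable, "--source", source, "--target", target,
            "--docalign", docalign]
    params = [("seg", seg), ("filter", filter_pairs), ("kif", kif),
              ("maxrep", maxrep), ("t", t)]
    for name, value in params:
        if value is None:
            continue
        if isinstance(value, bool):
            # LEXACC expects lowercase "true"/"false" on the command line.
            value = "true" if value else "false"
        args += ["--param", f"{name}={value}"]
    args += ["--output", output]
    return args


if __name__ == "__main__":
    # Reproduces the second (defaults-based) command line shown above.
    print(" ".join(build_lexacc_args(
        "en", "ro", "en-ro-docalign-list.txt", "results.txt",
        maxrep=2, t=0.3)))
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting problems when document paths contain spaces.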
2.7.6 I/O data formats and constraints
When running with document alignments, LEXACC requires as input a single document
alignment file, produced for instance by EMACC or DictMetric. The format of that file has
already been presented, but it is repeated below:
/path/to/source/document1.txt<TAB>/path/to/target/document15.txt<TAB>-0.5
/path/to/source/document1.txt<TAB>/path/to/target/document10.txt<TAB>-1
/path/to/source/document2.txt<TAB>/path/to/target/document2.txt<TAB>-2
…
Thus each line contains a pair of documents and an alignment score (a probability expressed
as a natural logarithm). The source and target document paths are separated by “\t” (the TAB
character), and the alignment score is likewise separated from the pair by “\t”. The source
and target documents themselves must be UTF-8 encoded, without a byte order mark (BOM) at
the beginning.
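To illustrate the format, a document alignment file could be read and checked as in the Python sketch below. This is not LEXACC code: the names DocPair, parse_docalign and has_bom are hypothetical, and skipping blank lines is an assumption, since the text does not say whether they are allowed. Because the scores are natural-log probabilities, math.exp recovers the probability.

```python
import math
from typing import List, NamedTuple

UTF8_BOM = b"\xef\xbb\xbf"  # byte order mark that must NOT start a document


class DocPair(NamedTuple):
    source: str      # path to the source document
    target: str      # path to the target document
    log_prob: float  # alignment score (natural-log probability)


def parse_docalign(text: str) -> List[DocPair]:
    """Parse the TAB-separated document alignment format shown above."""
    pairs = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated and skipped
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {number}: expected 3 TAB-separated fields, "
                f"got {len(fields)}")
        source, target, score = fields
        pairs.append(DocPair(source, target, float(score)))
    return pairs


def has_bom(raw: bytes) -> bool:
    """Return True if the raw document bytes start with a UTF-8 BOM."""
    return raw.startswith(UTF8_BOM)


sample = (
    "/path/to/source/document1.txt\t/path/to/target/document15.txt\t-0.5\n"
    "/path/to/source/document1.txt\t/path/to/target/document10.txt\t-1\n"
    "/path/to/source/document2.txt\t/path/to/target/document2.txt\t-2\n"
)

if __name__ == "__main__":
    for pair in parse_docalign(sample):
        # exp() converts the log score back to a probability in (0, 1].
        print(pair.source, pair.target, round(math.exp(pair.log_prob), 3))
```

Running has_bom on the first bytes of each listed document before starting LEXACC is one inexpensive way to check the encoding constraint.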
D2.6 V3.0
Page 71 of 164
