Contract no. 248347
This file calculates the cosine similarity of term frequency overlap between the
two documents.
5. TermFreqOverlapStemmedCosineSimilaritySub.pm:
This file calculates the cosine similarity of term frequency overlap between the
two documents. Both documents are stemmed beforehand.
6. TFIDFOverlapStemmedCosineSimilaritySub.pm:
This file calculates the cosine similarity of TF*IDF scores between the two
documents. Both documents are stemmed beforehand.
7. TriGramFreqOverlapStemmedCosineSimilaritySub.pm:
This file calculates the cosine similarity of word tri-gram frequency overlap of a
document pair. The content of both documents is stemmed beforehand.
8. WordOverlapSub.pm:
This file calculates the word overlap between both documents.
9. WordOverlapCosineSimilaritySub.pm:
This file calculates the cosine similarity of word overlap between both
documents.
10. WordOverlapStemmedSub.pm:
This file calculates the word overlap between both documents. Both documents
are stemmed beforehand.
11. WordOverlapStemmedCosineSimilaritySub.pm:
This file calculates the cosine similarity of word overlap between both
documents. Both documents are stemmed beforehand.
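As an illustration of the cosine-similarity features listed above, the following Python sketch computes the cosine similarity of term-frequency vectors and of word tri-gram frequency vectors for a document pair. It is not the code of the Perl modules: the function names, the whitespace tokenisation and the lowercasing are assumptions made for this example, and any stemming is assumed to have been applied to the input beforehand.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count word occurrences (assumed tokenisation: lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def trigram_frequencies(text):
    """Count word tri-grams, i.e. every run of three consecutive tokens."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two sparse frequency vectors."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[key] * freq_b[key] for key in shared)
    norm_a = math.sqrt(sum(value * value for value in freq_a.values()))
    norm_b = math.sqrt(sum(value * value for value in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = "the cat sat on the mat"
    doc_b = "the cat lay on the mat"
    print(cosine_similarity(term_frequencies(doc_a), term_frequencies(doc_b)))
    print(cosine_similarity(trigram_frequencies(doc_a), trigram_frequencies(doc_b)))
```

A TF*IDF variant could reuse the same cosine function with TF*IDF weights in place of the raw counts.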
Other features that do not require translation are extracted using
“CalculateIndependentFeatures.pl”. This tool takes exactly the same parameters as the
previous script and can be run with this command:
perl CalculateIndependentFeatures.pl --source [sourceLang] --target [targetLang] --metadata [metadataFile] --outputFolder [outputFolder]
This file calls the following modules:
1. AllInterLinksOverlapSub.pm:
This file calculates the overlap of inter links between the two documents.
2. AllOutLinksOverlapSub.pm:
This file calculates the overlap of outlinks between the two documents.
3. DocLengthWithoutTranslationSub.pm:
This file calculates the difference between the lengths of the original
documents (neither document is translated).
4. ImageLinksFilenameOverlapSub.pm:
This file calculates the character overlap of image filenames in both documents.
5. ImageLinksOverlapSub.pm:
This file calculates the character overlap of the entire image links in both
documents.
6. URLLevelAndCharacterOverlapSub.pm:
This file calculates the URL level overlap and URL character overlap of both
documents.
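The link-based features above can be pictured with a short Python sketch. The text does not define how overlap is measured, so everything here is an assumption made for illustration: overlap is taken as the number of shared items divided by the size of the smaller set, an image filename is the last path segment of the image URL, and a URL's "levels" are its host followed by its path segments. The function names are hypothetical and do not correspond to the Perl modules.

```python
import posixpath
from urllib.parse import urlparse


def set_overlap(items_a, items_b):
    """Assumed overlap measure: shared items / size of the smaller set."""
    set_a, set_b = set(items_a), set(items_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def image_filename(url):
    """Last path segment of an image URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


def url_levels(url):
    """Host plus non-empty path segments (assumed meaning of 'URL level')."""
    parsed = urlparse(url)
    return [parsed.netloc] + [part for part in parsed.path.split("/") if part]


def shared_leading_levels(url_a, url_b):
    """Number of levels the two URLs have in common, counted from the host."""
    count = 0
    for level_a, level_b in zip(url_levels(url_a), url_levels(url_b)):
        if level_a != level_b:
            break
        count += 1
    return count


if __name__ == "__main__":
    out_a = ["http://a.org/x", "http://a.org/y", "http://a.org/z"]
    out_b = ["http://a.org/y", "http://b.org/w"]
    print(set_overlap(out_a, out_b))
    print(image_filename("http://example.org/img/Riga_map.png?size=200"))
    print(shared_leading_levels("http://example.org/wiki/Riga",
                                "http://example.org/wiki/Tallinn"))
```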
D2.6 V3.0