Contract no. 248347

This file calculates the cosine similarity of term frequency overlap between the two documents.
5. TermFreqOverlapStemmedCosineSimilaritySub.pm: This file calculates the cosine similarity of term frequency overlap between the two documents. Both documents are stemmed beforehand.
6. TFIDFOverlapStemmedCosineSimilaritySub.pm: This file calculates the cosine similarity of TF*IDF scores between the two documents. Both documents are stemmed beforehand.
7. TriGramFreqOverlapStemmedCosineSimilaritySub.pm: This file calculates the cosine similarity of word tri-gram frequency overlap for a document pair. The content of these documents is stemmed beforehand.
8. WordOverlapSub.pm: This file calculates the word overlap between both documents.
9. WordOverlapCosineSimilaritySub.pm: This file calculates the cosine similarity of word overlap between both documents.
10. WordOverlapStemmedSub.pm: This file calculates the word overlap between both documents. Both documents are stemmed beforehand.
11. WordOverlapStemmedCosineSimilaritySub.pm: This file calculates the cosine similarity of word overlap between both documents. Both documents are stemmed beforehand.

Other features, which do not require translations, are extracted using "CalculateIndependentFeatures.pl". This tool takes exactly the same parameters as the previous script and can be run using this command:

    perl CalculateIndependentFeatures.pl --source [sourceLang] --target [targetLang] --metadata [metadataFile] --outputFolder [outputFolder]

This file calls the following modules:
1. AllInterLinksOverlapSub.pm: This file calculates the overlap of inter-links between the two documents.
2. AllOutLinksOverlapSub.pm: This file calculates the overlap of outlinks between the two documents.
3. DocLengthWithoutTranslationSub.pm: This file calculates the difference between the lengths of the original documents (neither document is translated).
4. ImageLinksFilenameOverlapSub.pm: This file calculates the character overlap of image filenames in both documents.
5. ImageLinksOverlapSub.pm: This file calculates the character overlap of the entire image links in both documents.
6. URLLevelAndCharacterOverlapSub.pm: This file calculates the URL level overlap and URL character overlap of both documents.

D2.6 V3.0 Page 49 of 164
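To make the cosine-similarity and word-overlap features listed above more concrete, the following is a minimal sketch in Python rather than the toolkit's Perl. It is not the code of the modules named above: the function names, the whitespace tokenisation, and the use of Jaccard overlap as the "word overlap" measure are all assumptions made for illustration, and stemming is left out.

```python
import math
from collections import Counter


def term_frequencies(text, n=1):
    """Count word n-grams in a lower-cased, whitespace-tokenised text.

    n=1 gives plain term frequencies, n=3 gives word tri-gram frequencies.
    (Hypothetical helper; the toolkit's own tokenisation may differ.)
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two sparse frequency vectors."""
    # A Counter returns 0 for missing keys, so unshared terms add nothing.
    dot = sum(count * freq_b[term] for term, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def word_overlap(text_a, text_b):
    """Jaccard overlap of the two vocabularies (an assumed definition)."""
    vocab_a = set(text_a.lower().split())
    vocab_b = set(text_b.lower().split())
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


if __name__ == "__main__":
    doc_a = "the cat sat on the mat"
    doc_b = "the cat lay on the mat"
    print(cosine_similarity(term_frequencies(doc_a), term_frequencies(doc_b)))
    print(cosine_similarity(term_frequencies(doc_a, 3), term_frequencies(doc_b, 3)))
    print(word_overlap(doc_a, doc_b))
```

In this sketch the tri-gram variant reuses the same cosine function and only changes how the frequency vector is built, which mirrors how the modules above differ mainly in the unit being counted.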