Other abstract: Paraphrase recognition consists of detecting whether an expression restated as another expression contains the same information. Traditionally, several lexical, syntactic, and semantic techniques are used to solve this problem. To measure word overlap, most works use n-grams; however, syntactic n-grams have been scarcely explored. We propose using syntactic dependency and constituent n-grams combined with common NLP techniques such as stemming, synonym detection, similarity measures, and linear combination, together with a similarity matrix built in turn from syntactic n-grams. We measure and compare the performance of our system using the Microsoft Research Paraphrase Corpus. An in-depth study is presented to show the strengths and weaknesses of each approach, along with a common error analysis section. Our main motivation was to determine which syntactic approach performs better for this task: syntactic dependency n-grams or syntactic constituent n-grams. We also compare both approaches with traditional n-grams and state-of-the-art systems.
Keywords: Paraphrase recognition; Microsoft Research Paraphrase Corpus; similarity measures; syntactic n-grams; constituent analysis; dependency analysis