Article information

Title: Evaluating the Performance of LSA for Source-code Plagiarism Detection
Authors: Georgina Cosma; Mike Joy
Journal: Informatica
Print ISSN: 1514-8327
Online ISSN: 1854-3871
Year: 2012
Volume: 36
Issue: 4
Publisher: The Slovene Society Informatika, Ljubljana
Abstract: Latent Semantic Analysis (LSA) is an intelligent information retrieval technique that uses mathematical algorithms for analyzing large corpora of text and revealing the underlying semantic information of documents. LSA is a highly parameterized statistical method, and its effectiveness is driven by the setting of its parameters, which are adjusted based on the task to which it is applied. This paper discusses and evaluates the importance of parameterization for LSA-based similarity detection of source-code documents, and the applicability of LSA as a technique for source-code plagiarism detection when its parameters are appropriately tuned. The parameters involve preprocessing techniques, weighting approaches, and parameter tweaking inherent to LSA processing, in particular the choice of dimensions for the step of reducing the original post-SVD matrix. The experiments revealed that the best retrieval performance is obtained after removal of in-code comments (Java comment blocks) and applying a combined weighting scheme based on term frequencies, normalized term frequencies, and a cosine-based document normalization. Furthermore, the use of similarity thresholds (instead of mere rankings) requires the use of a higher number of dimensions.
Keywords: LSA; source-code similarity detection; parameter tuning
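
The pipeline outlined in the abstract (comment removal, term weighting, SVD, dimensionality reduction, similarity comparison) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`strip_java_comments`, `tokenize`, `lsa_similarity`), the regex-based tokenizer, and the exact weighting formula are assumptions made for illustration: the weighting is one plausible reading of the scheme named in the abstract (raw term frequency, per-term length normalization, then cosine normalization of each document vector). The parameter `k` stands for the number of dimensions retained after the SVD.

```python
import re
import numpy as np


def strip_java_comments(source: str) -> str:
    """Remove /* ... */ blocks and // line comments.

    Naive on purpose: comment markers inside string literals are not handled.
    """
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", " ", source)


def tokenize(source: str) -> list[str]:
    """Split source code into lower-cased identifier-like tokens."""
    return re.findall(r"[a-z_][a-z_0-9]*", source.lower())


def lsa_similarity(docs: list[str], k: int) -> np.ndarray:
    """Return a document-by-document cosine similarity matrix in a k-dimensional LSA space."""
    token_lists = [tokenize(strip_java_comments(d)) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    row = {t: i for i, t in enumerate(vocab)}

    # Term-by-document matrix of raw term frequencies.
    A = np.zeros((len(vocab), len(docs)))
    for j, tokens in enumerate(token_lists):
        for t in tokens:
            A[row[t], j] += 1.0

    # Assumed weighting: scale each term (row) to unit length,
    # then scale each document (column) to unit length.
    term_norms = np.linalg.norm(A, axis=1, keepdims=True)
    term_norms[term_norms == 0] = 1.0
    A = A / term_norms
    doc_norms = np.linalg.norm(A, axis=0, keepdims=True)
    doc_norms[doc_norms == 0] = 1.0
    A = A / doc_norms

    # SVD, then keep only the k largest singular values.
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = max(1, min(k, len(s)))
    D = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document

    # Cosine similarity between the reduced document vectors.
    lengths = np.linalg.norm(D, axis=1, keepdims=True)
    lengths[lengths == 0] = 1.0
    D = D / lengths
    return D @ D.T
```

A ranking-based use would sort each row of the returned matrix, while a threshold-based use would flag pairs whose similarity exceeds a chosen cutoff. According to the abstract, the latter requires a larger `k`.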