
Article Information

  • Title: Evaluating the Performance of LSA for Source-code Plagiarism Detection
  • Authors: Georgina Cosma; Mike Joy
  • Journal: Informatica
  • Print ISSN: 1514-8327
  • Online ISSN: 1854-3871
  • Year: 2012
  • Volume: 36
  • Issue: 4
  • Publisher: The Slovene Society Informatika, Ljubljana
  • Abstract: Latent Semantic Analysis (LSA) is an intelligent information retrieval technique that uses mathematical algorithms for analyzing large corpora of text and revealing the underlying semantic information of documents. LSA is a highly parameterized statistical method, and its effectiveness is driven by the setting of its parameters, which are adjusted based on the task to which it is applied. This paper discusses and evaluates the importance of parameterization for LSA-based similarity detection of source-code documents, and the applicability of LSA as a technique for source-code plagiarism detection when its parameters are appropriately tuned. The parameters involve preprocessing techniques, weighting approaches, and parameter tweaking inherent to LSA processing – in particular, the choice of dimensions for the step of reducing the original post-SVD matrix. The experiments revealed that the best retrieval performance is obtained after removal of in-code comments (Java comment blocks) and applying a combined weighting scheme based on term frequencies, normalized term frequencies, and a cosine-based document normalization. Furthermore, the use of similarity thresholds (instead of mere rankings) requires the use of a higher number of dimensions.
  • Keywords: LSA; source-code similarity detection; parameter tuning
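
The pipeline outlined in the abstract (comment removal, term weighting, SVD, dimensionality reduction, then similarity scoring) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the naive tokenizer, and the reading of the weighting scheme as raw term frequency, a per-term length normalization, and cosine document normalization are all assumptions made for illustration, as are the values of `k` and `threshold`.

```python
import re
import numpy as np

def strip_java_comments(source):
    """Naively remove /* ... */ blocks and // line comments (ignores string literals)."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", " ", source)

def term_document_matrix(docs):
    """Build a raw term-frequency matrix (terms x documents) from identifier-like tokens."""
    tokenized = [re.findall(r"[A-Za-z_]\w*", strip_java_comments(d).lower()) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for t in toks:
            A[index[t], j] += 1.0
    return A

def weight(A):
    """Assumed weighting: scale each term row to unit length, then each document column."""
    row_norms = np.linalg.norm(A, axis=1, keepdims=True)
    A = A / np.where(row_norms == 0, 1.0, row_norms)
    col_norms = np.linalg.norm(A, axis=0, keepdims=True)
    return A / np.where(col_norms == 0, 1.0, col_norms)

def lsa_similarity(docs, k):
    """Return the document-by-document cosine similarity matrix in a k-dimensional LSA space."""
    W = weight(term_document_matrix(docs))
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    k = min(k, len(s))
    D = Vt[:k].T * s[:k]  # one row per document, k coordinates each
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    D = D / np.where(norms == 0, 1.0, norms)
    return D @ D.T

def pairs_above(sim, threshold):
    """List document index pairs (i, j), i < j, whose similarity meets the threshold."""
    n = sim.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if sim[i, j] >= threshold]

docs = [
    "int sum = 0; for (int i = 0; i < n; i++) { sum += values[i]; } /* total */",
    "int sum = 0; // add up\nfor (int i = 0; i < n; i++) { sum += values[i]; }",
    "String name = reader.readLine(); System.out.println(name);",
]
sim = lsa_similarity(docs, k=2)
flagged = pairs_above(sim, threshold=0.8)
```

Here the first two snippets differ only in their comments, so after comment removal they map to the same point in the reduced space and are the only pair flagged. In practice `k` and `threshold` would need tuning per corpus, which is the parameterization question the abstract raises.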