Journal: International Journal of Data Mining & Knowledge Management Process
Print ISSN: 2231-007X
Electronic ISSN: 2230-9608
Publication year: 2014
Volume: 4
Issue: 6
Page: 39
DOI: 10.5121/ijdkp.2014.4604
Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a large collection of data such as the Internet. The presence of these web pages contributes substantially to performance degradation when integrating data from heterogeneous sources. These pages increase either the index storage space or the serving costs. Detecting these pages has many potential applications; for example, it may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents in support of document clustering. We demonstrate our approach in the domain of web news articles. The experimental results show that our algorithm performs better in terms of similarity measures. Identifying duplicate and near-duplicate documents reduced the memory required in repositories.
Keywords: Web Content Mining; Information Retrieval; document clustering; duplicate; near-duplicate detection; similarity; web documents