Journal: International Journal of Advanced Research in Computer Engineering & Technology (IJARCET)
Print ISSN: 2278-1323
Publication year: 2017
Volume: 6
Issue: 5
Pages: 610-613
Publisher: Shri Pannalal Research Institute of Technology
Abstract: Duplicate detection is a major task in data processing and cleaning. In this paper we discuss various methods of duplicate detection for a given dataset. Calculating edit distance is the most preferred approach for duplicate detection, and methods such as EdJoin and Winnowing are based on it. Strings can be divided into a number of small substrings known as grams; the VGRAM algorithm uses this gram-based approach. Alternatively, while calculating edit distance, strings can be divided into a number of small strings called chunks; the VChunkJoin algorithm uses this chunking scheme. The methods are compared on their results to identify the best one for duplicate detection of records.
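
The two building blocks named in the abstract, edit distance and grams, can be illustrated with a minimal Python sketch. This is not the paper's implementation and it is not EdJoin, Winnowing, VGRAM, or VChunkJoin: it uses the textbook Levenshtein recurrence and fixed-length grams (VGRAM uses variable-length grams), and the function names and the threshold value are assumptions chosen only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgrams(s, q=2):
    """Fixed-length grams: every substring of s with length q.
    (Assumption: fixed q for simplicity; VGRAM varies the length.)"""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def is_duplicate(a, b, threshold=2):
    """Hypothetical duplicate test: two strings count as duplicates
    when their edit distance is at most the threshold."""
    return edit_distance(a, b) <= threshold
```

For example, `is_duplicate("Jon Smith", "John Smith")` holds because the two strings differ by one inserted character. Gram-based and chunk-based methods exist to avoid running this quadratic comparison on every pair of records: pairs that share too few grams or chunks can be discarded before the edit distance is computed.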