Article Information

Title: Detecting Duplicates and near Duplicates Records in Large Datasets
Authors: Shailesh Singh; Syed Imtiyaz Hassan
Journal: International Journal on Computer Science and Engineering
Print ISSN: 2229-5631
Online ISSN: 0975-3397
Year: 2017
Volume: 9
Issue: 05
Pages: 178-185
Publisher: Engg Journals Publications
Abstract: The rapid growth in data volumes and the need to integrate data from heterogeneous sources make efficient detection of duplicate records in databases a pressing challenge. Because the data sources are autonomous and mutually inconsistent, each may adopt its own conventions, and integrating data from them often introduces erroneous redundancy. To ensure high-quality data, the database must validate and filter incoming data from external sources, which makes data normalization a necessity. Identifying the record pairs that represent the same entity is commonly known as duplicate record detection, and it is one of the most important tasks in data cleansing. The proposed work suggests an approach to improve the accuracy of duplicate record detection which, when used in combination with two other concepts, text similarity and edit distance, leads to well-filtered data. The approach was trialled on data from a Scholarship Portal developed for various organizations. There, it was a dire need to find and identify such records to the greatest possible extent while ensuring that genuine students are not debarred from scholarships, since the portal has various kinds of reservation/quota mechanisms.
Keywords: Big Data; Trigrams; Similarity; Levenshtein Edit Distance; Database data mining; Scholarships.
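
Illustrative note: the abstract names trigrams, text similarity, and edit distance but does not spell out how they are combined. The following is a minimal Python sketch of those two measures applied to a pair of records, not the authors' implementation. The function names, the Jaccard formulation of trigram similarity, the padding scheme, and the thresholds `min_similarity` and `max_distance` are all assumptions chosen for illustration.

```python
def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace (assumed cleanup step)."""
    return " ".join(text.lower().split())


def trigrams(text: str) -> set[str]:
    """Return the set of character trigrams of the normalized, space-padded text."""
    padded = f"  {normalize(text)} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets, in the range [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    # Padding guarantees both sets are non-empty, so the union is never empty.
    return len(ta & tb) / len(ta | tb)


def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_near_duplicate(a: str, b: str,
                      min_similarity: float = 0.6,
                      max_distance: int = 3) -> bool:
    """Flag a pair only when both measures agree (hypothetical rule and thresholds)."""
    similar_enough = trigram_similarity(a, b) >= min_similarity
    close_enough = levenshtein(normalize(a), normalize(b)) <= max_distance
    return similar_enough and close_enough


if __name__ == "__main__":
    print(is_near_duplicate("Shailesh Singh", "shailesh  sing"))
    print(is_near_duplicate("Shailesh Singh", "Syed Imtiyaz Hassan"))
```

Requiring both measures to agree is a conservative choice that reduces false positives, which matches the abstract's concern that genuine applicants should not be wrongly excluded. A real deployment would need thresholds tuned on labelled data.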