Detection and elimination of duplicate data using token-based method for a data warehouse: a clustering based approach.
Tamilselvi, J. Jebamalar ; Saravanan, V.
The process of detecting and removing database defects and duplicates is referred to as data cleaning. The fundamental difficulty of duplicate detection is that inexact duplicates in a database may refer to the same real-world object because of errors and missing data. Duplicate elimination is hard because duplicates arise from different types of errors, such as typographical errors, missing values, abbreviations and different representations of the same logical value. In existing approaches, duplicate detection and elimination is domain dependent. These domain-dependent methods rely on similarity functions and thresholds for duplicate elimination and produce many false positives.
This research work presents a general sequential framework for duplicate detection and elimination. The proposed framework uses six steps to improve the process of duplicate detection and elimination. First, an attribute selection algorithm is used to identify the attributes best suited for duplicate identification and elimination. In the next step, tokens are formed from the selected attribute field values. After token formation, a clustering algorithm or blocking method is used to group the records based on the similarity of the block-token-key values. The best blocking key is selected by comparing the duplicate detection performance of the candidate keys. In the next step the threshold value is calculated based on the similarities between records and fields. Then, a rule-based approach is used to detect duplicates and to eliminate poor-quality copies, retaining only the best record of each duplicate group. Finally, all the cleaned records are grouped or merged and made available for the next process.
This research work reduces the number of false positives without missing true duplicates. In contrast to previous approaches, the token concept is included to speed up the data cleaning process and reduce its complexity. Several blocking keys are analyzed through extensive experiments to select the key that best brings similar records together, so that comparing all pairs of records is avoided. A rule-based approach is used to identify exact and inexact duplicates and to eliminate them.
Introduction
In the 1990s, as large organizations began to need more timely data for their business, they found that traditional information systems technology was simply too cumbersome to provide relevant data efficiently and quickly. Completing reporting requests could take days or weeks using antiquated reporting tools that were designed more or less to 'execute' the business rather than 'run' the business.
A data warehouse is essentially a database, and unintentional duplication of records, created from the millions of records loaded from other sources, can hardly be avoided. In the data warehousing community, the task of finding duplicated records within a data warehouse has long been a persistent problem and has become an area of active research. There have been many research efforts to address the problems caused by duplicate contamination of data.
There are two issues to be considered for duplicate detection: accuracy and speed. The accuracy of duplicate detection depends on the number of false negatives (duplicates that were not classified as such) and false positives (non-duplicates that were classified as duplicates). The algorithm's speed is mainly affected by the number of records compared and by how costly these comparisons are. Generally, CPUs are not able to perform duplicate detection on large databases within any reasonable time, so the number of record comparisons normally needs to be cut down [4].
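As a rough illustration of why the number of comparisons must be reduced (the figures below are illustrative and not taken from this work), comparing every pair of records grows quadratically with the number of records n, while blocking restricts comparisons to records within the same block:

\[
\text{pairs}(n) = \frac{n(n-1)}{2}, \qquad
\text{pairs}_{\text{blocked}} \approx b \cdot \frac{(n/b)\big((n/b)-1\big)}{2} \approx \frac{n^{2}}{2b}
\]

For example, n = 1,000,000 records give about 5 x 10^11 naive pairs, whereas b = 10,000 blocks of roughly 100 records each leave only about 5 x 10^7 pairs to compare.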
In this research work, a framework is developed to handle any
duplicate data in a data warehouse. The main objective of this research
work is to improve data quality and increase speed of the data cleaning
process. A high quality, scalable blocking algorithm, similarity
computation algorithm and duplicate elimination algorithm are used and
evaluated on real datasets from an operational data warehouse to achieve
the objective.
Framework Design
A sequential framework is developed for the detection and elimination of duplicate data. This framework comprises some existing data cleaning approaches and new approaches, which are used to reduce the complexity of duplicate data detection and elimination and to clean with more flexibility and less effort.
Fig. 1 shows the framework to clean the duplicate data in a
sequential order. Each step of the framework serves a different purpose. The framework adapts to the data by using a software agent in each step, with little user interaction.
The principal steps of the framework are as follows:
A. Selection of attributes: There is a clear need to identify and
select attributes. These selected attributes are to be used in the other
steps.
B. Formation of tokens: Well-suited tokens are created to check the similarities between records as well as fields.
C. Clustering/Blocking of records: A clustering algorithm or blocking method is used to group the records based on the similarity of the block-token-key value.
D. Similarity computation for selected attributes: The Jaccard similarity method is used for token-based similarity computation.
E. Detection and elimination of duplicate records: A rule-based detection and elimination approach is used for detecting and eliminating the duplicates.
F. Merge: The resulting cleaned data is merged.
A. Selection of attributes
The data cleaning process is complex because of the large amount of data in the data warehouse. Attribute selection is very important for reducing the time and effort of the subsequent work, such as the record similarity and elimination processes. Attribute selection is also very important when comparing two records [5]. This step is the foundation for all the remaining steps. Therefore, time and effort are needed to select the attributes to be considered promptly and carefully.
[FIGURE 1 OMITTED]
A missing value is expressed and treated as a string of blanks. Obviously, missing character values are not the smallest strings. Distinct is used to retrieve the number of rows that have unique values for each attribute. Uniqueness is the characteristic of an attribute that determines its relevance for duplicate detection. Uniqueness is reflected in the distribution of the similarity of attribute pairs of non-duplicates, i.e. if an attribute is highly unique, the mean similarity of non-duplicates for this attribute will be low. The measurement type of the values is also considered for attribute selection. Data cleaning based only on numeric data is not effective; categorical data is more useful for the data cleaning process.
To identify relevant attributes for the further data cleaning process, three criteria are used: (a) identifying key attributes, (b) classifying attributes with high distinct values and low missing values and (c) the measurement types of the attributes. Based on this information, a 'weight' or 'rank value' is calculated for all the attributes. Finally, the highest priority attributes are selected for the further data cleaning process [2].
The flow diagram in Fig. 3 shows the attribute selection procedure in a sequential way. First, the data set is identified for the data cleaning process. From this data set, the attributes are analyzed by identifying the type of each attribute, the relationships between attributes, and the properties of each attribute, in order to select the appropriate attributes. The type of an attribute is classified using its data type and its size or length. A threshold value is used to identify the best attributes for the data cleaning process. The threshold value is assessed using three criteria: a high threshold value, data quality and a high rank. A high threshold value identifies a high-power attribute for the data cleaning process. Attributes are ranked based on the threshold value and the data quality value. Finally, high-rank attributes are selected for the next cleaning step to improve the speed of the data cleaning process. The developed attribute selection algorithm is presented in Fig. 2.
The attribute selection algorithm can eliminate both irrelevant and redundant attributes and is applicable to any type of data (nominal, numeric, etc.). It can also handle data of different attribute types smoothly. The quality of the algorithm is confirmed by applying a set of rules. The main purpose of this attribute selection for data cleaning is to reduce time and to improve the speed of the further data cleaning steps, such as token formation, record similarity and the elimination process, in an efficient way.
Figure 2: Algorithm for attribute selection.

Input     : N attributes, number of tuples n, relation instance r
Output    : S -- subset of the attributes
Initialize: L -- temporary relation instance, X -- attribute set
Var       : L -- temporary relation instance, X -- attribute set, F -- field set, i, j

begin
  I.  Analyze attribute set X to select the best attributes
  II. Calculate the threshold value sigma for each attribute
      for each attribute x_i, i in {0, 1, 2, ..., N}
      do
        i)  Assign a threshold value of 1 for a key attribute and put it into L
        ii) Calculate the threshold sigma from (D, M, MT, S), where
            a. Distinct (D) is the number of distinct values t_i of attribute x_i
            b. Missing (M) is the number of tuples with t_i = NULL for attribute x_i
            c. Measurement type (MT) of attribute x_i (ordinal, nominal, interval or ratio)
            Put the result into L.
      end
  III. Select the attributes with a high threshold value sigma_i and put them into L.
  IV.  Ignore the attributes with a low threshold value.
  V.   Calculate the data quality dq_j of the selected attributes
       for each attribute x_i, i in {0, 1, 2, ..., N}
         for each field f_j, j in {0, 1, 2, ..., n}
           compute
             a. Completeness_j = (sum_{i=0..N} completeness_{i,j}) / n
             b. Accuracy_j     = (sum_{i=0..N} accuracy_{i,j}) / n
             c. Consistency_j  = (sum_{i=0..N} consistency_{i,j}) / n
             d. dq_j = (alpha * Completeness_j + beta * Accuracy_j + gamma * Consistency_j) / n
         end for
       end for
  VI.  Rank the attributes based on data quality and threshold value
  VII. Select the high-rank attributes
End
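A minimal sketch of how the attribute scoring in Figure 2 could be realised is shown below. It is illustrative only: the weighting of distinct and missing ratios, the measurement-type weights and the number of attributes kept are assumptions, not values from this work.

# Illustrative attribute scoring (distinct / missing / measurement type), loosely
# following Figure 2. Weights and measurement-type scores are assumed.
from typing import Dict, List, Optional

MEASUREMENT_WEIGHT = {"nominal": 1.0, "ordinal": 0.8, "interval": 0.5, "ratio": 0.3}

def attribute_threshold(values: List[Optional[str]], mtype: str) -> float:
    """Score one attribute: high distinct ratio, low missing ratio, preferred type."""
    n = len(values)
    non_missing = [v for v in values if v not in (None, "", " ")]
    distinct_ratio = len(set(non_missing)) / n if n else 0.0
    missing_ratio = 1.0 - (len(non_missing) / n) if n else 1.0
    return distinct_ratio * (1.0 - missing_ratio) * MEASUREMENT_WEIGHT.get(mtype, 0.5)

def select_attributes(table: Dict[str, List[Optional[str]]],
                      mtypes: Dict[str, str],
                      keep: int = 7) -> List[str]:
    """Rank attributes by their threshold score and keep the highest-ranked ones."""
    scores = {name: attribute_threshold(col, mtypes.get(name, "nominal"))
              for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]

# Hypothetical usage:
# best = select_attributes({"name": [...], "phone": [...]}, {"name": "nominal"})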
[FIGURE 3 OMITTED]
B. Formation of Tokens
This step makes use of the selected attribute field values to form
a token. The tokens can be created for a single attribute field value or
for combined attributes. For example, the contact name attribute may be selected to create a token for the further cleaning process. The contact name attribute is split into first name, middle name and last name; here, first name and last name are combined as the contact name to form a token. Tokens are formed from numeric values, alphanumeric values and alphabetic values by selecting some combination of characters. Unimportant elements (title tokens such as Mr., Dr. and so on) are removed before token formation [6].
Numeric tokens comprise only digits [0-9]. Alphabetic tokens consist only of letters (A-Z, a-z); the first character of each word in the field is taken and the characters are sorted. Alphanumeric tokens comprise both numeric and alphabetic parts: a given alphanumeric element is decomposed into its numeric and alphabetic components [7].
This step eliminates the need to use the entire string of a record, with multiple passes, for duplicate identification. It also solves the similarity computation problem in a large database by forming a token key from selected fields, which reduces the number of comparisons.
Fig. 4 shows an algorithm for token formation. In this algorithm, rules are specified to form the tokens, and the algorithm works according to the type of the data. For example, if the address attribute is selected, the alphanumeric rule is used to form the tokens. The formed tokens are stored in a LOG table. The idea behind this algorithm is to define smart tokens from the fields of the selected most important attributes by applying simple rules for numeric, alphabetic, and alphanumeric tokens. A temporary table then consists of smart token records composed from the field tokens of the records. These smart token records are sorted using the block-token-key.
Figure 4: Algorithm for Token Formation.

Input : Tables with dirty data, Reference table, Selected attributes
Output: LOG table with tokens
Var   : i, j, m -- attribute set, n -- number of records

begin
  for attribute i = 1 to m
    for row j = 1 to n
    do
      i)   remove unimportant characters
      ii)  expand abbreviations using the Reference table
      iii) if row(j) is numeric then
             a. convert the string into a number
             b. sort or arrange the digits in order
             c. form a token, then put it into the LOG table
           end if
      iv)  if row(j) is alphanumeric then
             a. separate the numeric and alphanumeric parts
             b. split the alphanumeric part into numeric and alphabetic parts
             c. sort the numeric and alphabetic parts separately
             d. form a token, then put it into the LOG table
           end if
      v)   if row(j) is alphabetic then
             a. select the first character of each word
             b. sort these letters in a specific order
             c. string them together
             d. if only one word is present, take its first three characters
                as the token, then sort the characters
             e. form a token, then put it into the LOG table
           end if
    end
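The rules in Figure 4 could be implemented along the following lines. This is a sketch under stated assumptions: the abbreviation table, the handling of titles and punctuation and the sort order are illustrative choices, not details fixed by this work.

# Illustrative token formation following the rules in Figure 4 (assumptions noted).
import re

ABBREVIATIONS = {"st": "street", "rd": "road"}   # stand-in for the Reference table

def form_token(value: str) -> str:
    """Build a smart token for a single field value (numeric / alphabetic / alphanumeric)."""
    # Remove unimportant elements such as titles and punctuation.
    cleaned = re.sub(r"\b(mr|mrs|dr|ms)\.?\b", "", value.strip().lower())
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    words = [ABBREVIATIONS.get(w, w) for w in cleaned.split()]

    digits = sorted(ch for w in words for ch in w if ch.isdigit())
    alpha_words = [w for w in words if any(ch.isalpha() for ch in w)]

    if digits and not alpha_words:                      # numeric rule
        return "".join(digits)
    if digits and alpha_words:                          # alphanumeric rule
        letters = sorted(w[0] for w in alpha_words)
        return "".join(digits) + "".join(letters)
    if len(alpha_words) == 1:                           # single-word rule: first 3 chars, sorted
        return "".join(sorted(alpha_words[0][:3]))
    return "".join(sorted(w[0] for w in alpha_words))   # alphabetic rule: sorted initials

# form_token("Dr. John A. Smith") -> "ajs"   (hypothetical example)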
C. Clustering/Blocking of records
Data mining primarily works with large databases. Sorting large datasets and eliminating duplicates in them raises scalability problems. Clustering techniques are used to cluster or group the dataset into small groups, based on distance values or threshold values, to reduce the time needed for the elimination process. Blocking methods are used to reduce the huge number of comparisons. This step groups the records that are most likely to be duplicates based on the similarity of the block-token-key; that is, the block-token-key is used to split the data set into blocks, so the block-token-key is very important in blocking the records. Four types of block-token-key generation are used, and the best key for blocking is identified: i) blocking with a single attribute, ii) blocking with multiple attributes, iii) array-based block-token-key and iv) token-based blocking key. The choice of a good block-token-key can greatly reduce the number of record pair evaluations to be performed, so the user can achieve significant results.
Figure 5: Algorithm for blocking single and multiple attribute records.

Input : Data set (records) D, keys k[]
Output: Blocked records
Var   : n <- number of records, blocks b[], i, j

begin
  1. Sort the database using the Block-Token-Key
  2. Initialize a new block
  3. Block the records
     i. Single attribute
        for each record i = 0 to n
          for each key j = 1 to kn
            for each block k = 1 to bn
              if distance(key[j], key[j + 1]) > threshold
                add key[j] to block b[k]
              else
                add key[j + 1] to b1
                initialize new block b[k + 1]
              end if
              j = j + 1
            end for
        end
     ii. Multiple attributes
        for each record i = 0 to rn
          for each column j = 0 to cn
            for each key k = 1 to kn
              for each block b = 1 to bn
                calculate the distance d(values) of all selected column key values
                if distance(d[values]) > threshold
                  add key[j] to block b[k]
                else
                  add key[j + 1] to b1
                  initialize new block b[k + 1]
                end if
                j = j + 1
              end for
        end
This algorithm efficiently identifies exact and inexact duplicate records based on the selected block-token-key. The number of comparisons is reduced compared with other existing methods, which increases the speed of the data cleaning process. Because it is based on the block-token-key, this blocking method gives better performance than methods that shrink or expand the window size.
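A simplified sketch of the single-attribute blocking idea in Figure 5 is given below. It assumes, for illustration, that records are first sorted by their block-token-key and that a new block is started whenever the similarity between adjacent keys drops below a threshold; the similarity function used here (Jaccard over character sets) is an assumption, not the method prescribed above.

# Illustrative single-attribute blocking on a sorted block-token-key (assumptions noted).
from typing import Dict, List

def key_similarity(a: str, b: str) -> float:
    """Assumed similarity between two keys: Jaccard over their character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def block_records(records: List[Dict], key_field: str, threshold: float = 0.6) -> List[List[Dict]]:
    """Sort records by block-token-key and group adjacent records with similar keys."""
    ordered = sorted(records, key=lambda r: r[key_field])
    blocks: List[List[Dict]] = []
    current: List[Dict] = []
    prev_key = None
    for rec in ordered:
        key = rec[key_field]
        if prev_key is None or key_similarity(prev_key, key) >= threshold:
            current.append(rec)                  # same block: keys are similar enough
        else:
            blocks.append(current)               # similarity dropped: start a new block
            current = [rec]
        prev_key = key
    if current:
        blocks.append(current)
    return blocks

# Hypothetical usage: blocks = block_records(rows, key_field="block_token_key")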
D. Similarity Computation for Selected Attributes
Record linkage algorithms fundamentally depend on string similarity functions for comparing record fields as well as on record similarity functions for comparing whole records. Similarity computation functions depend on the data type, so the user must choose the function according to the attribute's data type, for example numerical, string and so on.
This step uses the Jaccard similarity function to compare the token values of adjacent field values for the selected attributes. Tokenization is typically performed by treating each individual word of a certain minimum length as a separate token or by taking the first character of each word. In step (ii), tokens were created for the selected attributes. Each function measures the similarity of the selected attributes with other record fields and assigns a similarity value to each field. In the next step, clustering techniques are used to group the fields based on the similarity values.
Accurate similarity functions are important for clustering and for the duplicate detection problem. A better string distance is also useful for labeling a record pair as a match or a non-match. These matching and non-matching pairs are used for clustering and for eliminating the duplicates.
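A minimal version of the token-based Jaccard comparison described above might look as follows. Treating each word as a token is one of the tokenisation choices mentioned in the text; the field weights in the usage note are assumptions for illustration.

# Illustrative token-based Jaccard similarity between two records (field weights assumed).
from typing import Dict, Set

def tokens(value: str) -> Set[str]:
    """Tokenise a field value by treating each individual word as a token."""
    return set(value.lower().split())

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two field values."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def record_similarity(r1: Dict[str, str], r2: Dict[str, str], fields: Dict[str, float]) -> float:
    """Weighted average of per-field Jaccard similarities over the selected attributes."""
    total_weight = sum(fields.values())
    if not total_weight:
        return 0.0
    return sum(w * jaccard(r1.get(f, ""), r2.get(f, "")) for f, w in fields.items()) / total_weight

# Hypothetical usage:
# sim = record_similarity(rec1, rec2, fields={"name": 0.5, "address1": 0.3, "phone": 0.2})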
E. Detection and elimination of duplicate data
In step (v), the rule-based duplicate detection and elimination approach is used to detect and eliminate duplicate records. During the elimination process, only one copy of each set of duplicated records is retained and the other duplicates are eliminated [8][9]. The elimination process is very important for producing cleaned data. The preceding steps are used to identify the duplicate records; this step detects and removes the duplicate records from one or many clusters. Before the elimination process, the similarity threshold values for all the records in the dataset are calculated; these values are important for the elimination process.
The same comparison process is performed in step (iv), and step (v) also requires comparisons. To reduce these comparisons, the threshold or similarity values are stored in a LOG file. Threshold criteria and certainty factors are used to detect and eliminate the duplicate records. Finally, one record is retained as the prime representative and this value is maintained in the LOG file. This primary copy will also be used for the incremental cleaning process and for further work. This approach can substantially reduce the probability of false mismatches, with a relatively small increase in running time.
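The detection and elimination step could be sketched as below. The specific rule (keep the most complete record of each group as the prime representative and log the rest) and the threshold value are illustrative assumptions consistent with the description above; the record_similarity function is reused from the previous sketch.

# Illustrative rule-based elimination: keep one prime representative per duplicate group.
from typing import Dict, List, Tuple

def completeness(record: Dict[str, str]) -> int:
    """Assumed quality rule: count the non-empty fields of a record."""
    return sum(1 for v in record.values() if v not in (None, ""))

def eliminate_duplicates(block: List[Dict[str, str]],
                         similarity,                 # e.g. record_similarity from the previous sketch
                         fields: Dict[str, float],
                         threshold: float = 0.7) -> Tuple[List[Dict], List[Dict]]:
    """Return (retained records, eliminated duplicates) for one block of records."""
    retained: List[Dict] = []
    eliminated: List[Dict] = []
    for rec in block:
        match = next((r for r in retained
                      if similarity(r, rec, fields) >= threshold), None)
        if match is None:
            retained.append(rec)                     # no duplicate found so far
        elif completeness(rec) > completeness(match):
            retained[retained.index(match)] = rec    # new record is the better copy
            eliminated.append(match)
        else:
            eliminated.append(rec)                   # keep the existing prime representative
    return retained, eliminated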
F. Merge
This step merges the corrected data into a single cluster [10],[11]. The user must maintain the merged records and the prime representative as a separate file in the data warehouse. This information helps the user with further changes in the duplicate elimination process. The merge step is also useful for incremental data cleaning: when new data enters the data warehouse, incremental data cleaning checks the new data against the already created LOG file, which reduces the time needed for the data cleaning process.
Experimental Results
An attribute selection algorithm is implemented to select the most important attributes, i.e. those that have enough information for identifying duplicate records. The selected attributes are used for duplicate record detection. The results are presented below for different attribute values, numbers of detected duplicate records and token formation. The efficiency of duplicate detection and elimination largely depends on the selection of attributes. The token formation algorithm is used to form tokens for the selected attribute field values in order to reduce the time needed for the cleaning process.
Attribute Selection with parameters
Wrong selection of attributes affects the performance and accuracy of the data cleaning process. Hence, key fields are selected so that they contain sufficient information to identify duplicate records. Fig. 6 shows the attribute selection for the data cleaning process. The threshold value is calculated using four important criteria: size of the field, missing values, distinct values and measurement type. The best attributes are selected for the data cleaning process based on this threshold value. The name attribute has the highest threshold value compared to the other attributes, and seven attributes are selected for the next stage of data cleaning.
[FIGURE 6 OMITTED]
Attribute Vs Duplicates
The identification of duplicates is mainly based on the selection of attributes and the selection of the window size. In existing methods, a fixed-size sliding window is used to minimize the number of comparisons. In this method, a dynamic window size is used, based on the similarities of the field values, and the best duplicate detection accuracy is obtained with this dynamic method. Fig. 7 shows how the number of detected duplicate records varies as the sliding window size changes and for a dynamic window size. To test this, results are taken by varying the attribute values for each run, with the window size set between 10 and 50 and with dynamic sizing.
[FIGURE 7 OMITTED]
Duplicates Vs No. of Attributes
The efficiency of the record matching algorithm mainly depends on the selection of the attributes. It can be observed from Fig. 8 that the selection of key columns greatly influences the result. Table 1 shows the selected key fields and the approximate number of duplicate records detected for those key fields. Selecting a combination of multiple attributes is more useful for identifying exact and inexact duplicates.
Time Vs Token Formation
In this research work, the token formation step takes very little time. Table loading and attribute selection from the data warehouse also take only a few seconds to select the best attributes for the data cleaning process (shown in Fig. 9). The time varies with the size of the data set.
[FIGURE 9 OMITTED]
Similarity Time Vs Database Size
In the multiple-similarity method, entire records and fields are compared character by character using an exact string matching function. In the token-similarity method, the token values of adjacent selected attribute fields are compared. The time taken by the token-based similarity method is lower than that of the multiple-similarity method. Fig. 10 shows the difference in time between token-based similarity and multiple-attribute similarity for different database sizes. The speed of the similarity computation is increased with token-based similarity computation. As Fig. 10 shows, the time taken for similarity computation increases with the size of the data; nevertheless, token-based similarity computation takes less time than multi-similarity computation.
[FIGURE 10 OMITTED]
Duplicates Vs Size of dataset and attributes
The number of blocks varies with the size of the dataset and the number of duplicates; basically, the size of a block depends on the number of duplicates available in the data set. In Fig. 11, the number of identified duplicates is low for the name and address2 attributes and high for the phone and address1 attributes. The phone and address1 attributes identify more duplicates because they have more distinct values than the other attributes. As a result, high-power attributes identify more duplicates than other attributes. At the same time, the number of identified duplicates varies for different sizes of data.
Duplicates Vs Window
Fig. 12 shows the relationship between the number of identified duplicates, the window size and the attributes. In this figure, the size of each window and the number of windows vary for each attribute, and the number of identified duplicates also varies with each attribute. Accurate results are obtained using the attributes address1 and phone; these two attributes have enough information, i.e. high distinct values, for duplicate identification. The attributes name and address2 have fewer unique values than the other attributes, and the number of wrongly identified duplicates is high with these two attributes. Overall, smaller blocks result in fewer comparisons but some matching pairs are missed, and pair completeness improves with larger windows. The figure shows that the number of identified duplicates varies with the size of the window; as the duplicates increase, the window size also increases for the different attributes.
[FIGURE 12 OMITTED]
Pair completeness for blocking keys and dynamic window
Fig. 13 shows pair completeness for different types of blocking keys. Pair completeness (PC) is defined as PC = S_M / N_M, where S_M is the number of true match record pairs in the set of record pairs produced for comparison by the blocking keys and the dynamic window size blocking method, and N_M is the total number of true match record pairs in the entire data. These results show pair completeness when using the blocking keys and the dynamic blocking method. Pair completeness varies as the size of the dataset increases, depending on the blocking key, and the size of the window varies based on the resemblance of the records. In this research work, the token-based blocking key gives the best pair completeness results: it identifies more duplicates than the other three blocking key methods. Hence, the token-based blocking key gives better results in pair completeness.
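As a hypothetical worked example of the pair completeness measure (the numbers below are illustrative only and are not experimental results):

\[
PC = \frac{S_M}{N_M}, \qquad \text{e.g. } S_M = 9{,}000 \text{ of } N_M = 10{,}000 \text{ true match pairs gives } PC = 0.9 .
\]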
To evaluate the performance of this research work, the errors introduced in duplicate records range from small typographical changes to large changes of some fields. Generally, the database duplicate ratio is the ratio of the number of duplicate records to the number of records in the database. To analyze the efficiency of this research work, the proposed approach is applied to a selected data warehouse with varying window sizes, database duplicate ratios and database sizes. In all tests, the time taken by the duplicate data detection and elimination processes is analyzed to evaluate the time savings of this research work.
[FIGURE 13 OMITTED]
Duplicates Vs Threshold Values
Fig. 14 shows the performance of duplicate detection, including false mismatches, for varying threshold values. Table 2 contains the number of duplicates detected by each combination of attributes for a particular threshold value. The data values shown in bold represent the total number of duplicate records detected at the optimum threshold value. The optimum threshold values are 0.6, 0.7 and 0.8. The results are not accurate if the threshold value is greater or less than the optimum: the number of mismatches and false mismatches increases and it is not possible to detect exact and inexact duplicates reliably, so the accuracy of duplicate detection suffers. The threshold value is therefore an important parameter for the duplicate detection process.
[FIGURE 14 OMITTED]
Percent of incorrectly detected duplicated pairs Vs Attributes
Fig. 15 shows that the false positive ratio increases as the window size increases. Identification of duplicates depends on the selected keys and the size of the window. The figure shows the percentage of records incorrectly marked as duplicates as a function of the window size. The percentage of false positives is almost insignificant for each independent run and grows slowly as the window size increases; it grows very slowly if the window size is dynamic. This suggests that a dynamic window size is more accurate than a fixed-size window for duplicate detection.
[FIGURE 15 OMITTED]
Time taken on databases with different dataset size
Fig. 16 shows the variation of the time taken for different database sizes, broken down by each step of this research work. The time taken by the proposed research work increases as the database size increases, and it also increases when the duplicate ratio in the dataset increases. The time taken for duplicate data detection and elimination therefore depends mainly on the size of the dataset and the duplicate ratio in the dataset. The time savings are much larger than in existing work because the token-based approach reduces the time taken for the cleaning process and improves the quality of the data.
[FIGURE 16 OMITTED]
Time performance Vs Different size databases and Percentage of
created duplicates
For result analysis, 10%, 30% and 50% duplicates are created in the selected datasets. The results are shown in Fig. 17. The time taken for duplicate detection and elimination varies with the size of the data and the percentage of duplicates available in the dataset. For these relatively large databases, the time seems to increase linearly with database size, independent of the duplicate factor.
[FIGURE 17 OMITTED]
Conclusion
Deduplication and data linkage are important tasks in the
pre-processing step for many data mining projects. It is important to
improve data quality before data is loaded into the data warehouse. Locating
approximate duplicates in large databases is an important part of data
management and plays a critical role in the data cleaning process. In
this research work, a framework is designed to clean duplicate data for
improving data quality and also to support any subject oriented data.
This framework is useful to develop a powerful data cleaning tool by
using the existing data cleaning techniques in a sequential order.
A new attribute selection algorithm is implemented and evaluated
through extensive experiments. The attribute selection algorithm can
eliminate both irrelevant and redundant attributes and is applicable to
any type of data (nominal, numeric, etc.). Also, this attribute
selection algorithm can handle data of different attribute types
smoothly. The quality of the algorithm's results is confirmed by applying a set of rules. The main purpose of this attribute selection for data
cleaning is to reduce the time for the further data cleaning process
such as token formation, record similarity and elimination process in an
efficient way.
The token formation algorithm is used to form smart tokens for data
cleaning and it is suitable for numeric, alphanumeric and alphabetic
data. There are three different rules described for the numeric,
alphabetic, and alphanumeric tokens. The result of the token based data
cleaning is to remove duplicate data in an efficient way. The time is reduced by the selection of attributes and by the token-based approach. The formed tokens are stored in the LOG table. The time required to compare entire strings is greater than that needed to compare tokens. The formed tokens are used as the blocking key in the further data cleaning process, so token formation is very important for defining the best, smart tokens.
Using an unsuitable key that cannot group the duplicates together degrades the result, i.e. many false duplicates are detected in comparison with the true duplicates when using, say, the address key. Hence, key creation and the selection of attributes are important in the blocking method for grouping similar records together. The selection of the most suitable blocking key (parameter) for the blocking method is addressed in this research work. Dynamically adjusting the blocking key for the blocking method is effective in record linkage algorithms at execution time. The blocking key is selected based on the type of the data and the usage of the data in the data warehouse. The dynamically adjusting blocking key, the token-based blocking key and the dynamic window size SNM method are used in this research work. An agent is used to tune the parameters, so everything is set dynamically for the blocking method without human intervention to yield better performance. In most real-world problems, where expert knowledge is hard to obtain, it is helpful to have methods that can automatically choose reasonable parameters.
Time is critical when cleansing large databases. In this research work, an efficient token-based blocking method and similarity computation method are used to reduce the time taken for each comparison, and an efficient duplicate detection and elimination approach is developed to obtain good results by reducing false positives. The performance of this research work shows significant time savings and better duplicate detection results than the existing approaches.
In contrast to previous approaches, the token concept is included in this new framework to speed up the data cleaning process and reduce its complexity. Each step of the framework is specified clearly in a sequential order by means of the six data cleaning processes offered: attribute selection, token formation, clustering, similarity computation, elimination, and merge. An agent is used in this framework to reduce the effort required from the user; this agent works according to the type and size of the data set. The framework is flexible for all kinds of data in relational databases.
The framework is mainly developed to increase the speed of the duplicate data detection and elimination process and to increase the quality of the data by identifying true duplicates while being strict enough to keep out false positives. The accuracy and efficiency of the duplicate elimination strategies are improved by introducing the concept of a certainty factor for a rule. Data cleansing is a complex and challenging problem; this rule-based strategy helps to manage the complexity, but does not remove it. The approach can be applied to any subject-oriented database in any domain. The proposed research work maintains LOG files of the whole cleaning process for incremental data cleaning.
References
[1] Dorian Pyle, Data Preparation for Data Mining, Published by
Morgan Kaufmann, ISBN 1558605290, 9781558605299, 540 pages, 1999.
[2] Jiawei Han, Micheline Kamber, Data Mining: Concepts and
Techniques, Publisher: Elsevier Science & Technology Books,
ISBN-13:9781558609013, March 2006.
[3] Hans-Peter Kriegel, Karsten M. Borgwardt, Peer Kröger, Alexey Pryakhin, Matthias Schubert, Arthur Zimek, Future trends in data mining, Data Mining and Knowledge Discovery, Volume 15, Issue 1, pp. 87-97, ISSN 1384-5810, 2007.
[4] Robert Leland, Duplicate Detection with PMC--A Parallel
Approach to Pattern Matching Department of Computer and Information
Science, Norwegian University of Science and Technology, Ph.D. Thesis,
August 2007.
[5] I. Kononenko and S.J. Hong, Attribute Selection for Modeling, Future Generation Computer Systems, Elsevier, Volume 13, Number 2, pp. 181-195, November 1997.
[6] C.I. Ezeife and Timothy E. Ohanekwu, Use of Smart Tokens in
Cleaning Integrated Warehouse Data, the International Journal of Data
Warehousing and Mining (IJDW), Vol. 1, No. 2, pp. 1-22, Ideas Group
Publishers, April June 2005.
[7] Ohanekwu, T.E. & Ezeife, C.I., A token-based data cleaning
technique for data warehouse systems. Proceedings of the International
Workshop on Data Quality in Cooperative Information Systems (pp. 21-26),
held in conjunction with the 9th International Conference on Database
Theory (ICDT 2003), Siena, Italy, January 2003.
[8] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, Eliminating
Fuzzy Duplicates in Data Warehouses. VLDB, pages 586-597, 2002.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios.
Duplicate Record Detection: A Survey, IEEE TKDE, 19(1):1-16, 2007.
[10] Hernandez, M.A. & Stolfo, S.J. The merge/purge problem for
large databases, Proceedings of the ACM SIGMOD International Conference
on Management of Data, pp. 127-138, 1995.
[11] M.A. Hernandez and S.J. Stolfo, Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem, Data Mining and Knowledge Discovery, pp. 9-27, 1998.
[12] Patrick Lehti, Unsupervised Duplicate Detection Using Sample Non-duplicates, Lecture Notes in Computer Science, Number 4244, pp. 136-164, 2006.
[13] Prabhu CSR, Data Warehousing: Concepts Techniques 3/e, ISBN:
8120336275, ISBN-13: 9788120336278, Publisher: Prentice-hall Of India
Pvt Ltd, 2008
[14] J. Paul Kirby, Tom Pohlmann, Shelby Semmes, Topic Overview:
Data Warehousing, Information & Knowledge Management Professionals,
Forrester Research, January 23, 2007.
[15] Su Yan, Dongwon Lee, Min-Yen Kan, C. Lee Giles, Adaptive
sorted neighborhood methods for efficient record linkage, 185-194, JCDL
2007.
(1) J. Jebamalar Tamilselvi and (2) V. Saravanan
(1) PhD Research Scholar, Department of Computer Application,
Karunya University, Coimbatore--641114, Tamilnadu, INDIA
(2) Professor & HOD, Department of Computer Application,
Karunya University, Coimbatore--641114, Tamilnadu, INDIA
Table 1: Key columns and number of duplicates detected.

No. of columns | Key columns selected | No. of duplicates detected
1 | ADD_ADDRESS1 | 60988
2 | ADD_ADDRESS1, ADD_NAME | 58944
3 | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1 | 50860
4 | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE | 45128
5 | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL | 40156
6 | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_PARENTTYPE | 39160
7 | ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_PARENTTYPE, ADD_PINCODE | 33172
Table 2: Attributes and threshold value (number of duplicates detected at each threshold).

Attributes | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
ADD_ADDRESS1 | 40153 | 38511 | 37462 | 34423 | 31048 | 30988
ADD_ADDRESS1, ADD_NAME | 48523 | 45263 | 45139 | 42874 | 40743 | 38944
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1 | 49898 | 48990 | 48987 | 45214 | 41985 | 39860
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE | 52984 | 50234 | 49136 | 46987 | 42035 | 40128
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL | 55133 | 53134 | 52456 | 50156 | 42542 | 40156
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_ADDRESS2 | 55982 | 53634 | 53452 | 50598 | 42590 | 40160
ADD_ADDRESS1, ADD_NAME, ADD_PHONE1, ADD_CDATE, ADD_DEL, ADD_ADDRESS2, ADD_PINCODE | 57213 | 54136 | 53874 | 52345 | 42898 | 40172
Figure 8: Duplicates detected vs. no. of attributes selected.

No. of key columns | No. of duplicates detected
1 | 60988
2 | 58944
3 | 50860
4 | 45128
5 | 40156
6 | 39160
7 | 33172

Note: Table made from line graph.
Figure 11: Duplicates vs. size of dataset and attributes.

Dataset size | Phone | Address2 | Name | Address1
100000 | 45923 | 14275 | 12472 | 44075
200000 | 66479 | 36457 | 36315 | 63841
300000 | 73425 | 52346 | 51043 | 85353
400000 | 118953 | 707398 | 778231 | 103891
500000 | 189623 | 83752 | 98734 | 148737

Note: Table made from line graph.