Journal: Bulletin of the Technical Committee on Data Engineering
Year of publication: 2010
Volume: 33
Issue: 1
Publisher: IEEE Computer Society
Abstract: Consider a universe of tokens, each of which is associated with a weight, and a database consisting of
strings that can be represented as subsets of these tokens. Given a query string, also represented as a
set of tokens, a weighted string similarity query identifies all strings in the database whose similarity
to the query is larger than a user-specified threshold. Weighted string similarity queries are useful
in applications like data cleaning and integration for finding approximate matches in the presence of
typographical mistakes, multiple formatting conventions, data transformation errors, etc. We show that
this problem has semantic properties that can be exploited to design index structures that support very
efficient algorithms for query answering.
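
To make the query definition concrete, the following is a minimal sketch of a weighted string similarity query answered by a linear scan. The abstract does not name a particular similarity function, so weighted Jaccard is used here purely as an assumption; the token weights, the example strings, and the names `weighted_jaccard` and `similarity_query` are likewise hypothetical. The scan is only a baseline that states the problem; it is not the index-based approach the abstract refers to.

```python
from typing import Dict, FrozenSet, List, Tuple

TokenSet = FrozenSet[str]


def weighted_jaccard(a: TokenSet, b: TokenSet, weight: Dict[str, float]) -> float:
    """Weight of the shared tokens divided by the weight of all tokens in either set.

    Assumption: weighted Jaccard stands in for the unspecified similarity measure.
    """
    shared = sum(weight[t] for t in a & b)
    total = sum(weight[t] for t in a | b)
    return shared / total if total else 0.0


def similarity_query(
    query: TokenSet,
    database: List[TokenSet],
    weight: Dict[str, float],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return (position, similarity) for every string whose similarity exceeds threshold.

    This is a brute-force scan over the whole database, with no index.
    """
    hits = []
    for position, tokens in enumerate(database):
        sim = weighted_jaccard(query, tokens, weight)
        if sim > threshold:
            hits.append((position, sim))
    return hits


# Hypothetical token weights: rarer tokens carry more weight.
WEIGHTS = {"main": 1.0, "st": 0.5, "street": 0.5, "maine": 2.0, "springfield": 3.0}

# Each database string is represented as a set of tokens.
DATABASE = [
    frozenset({"main", "st", "springfield"}),
    frozenset({"main", "street", "springfield"}),
    frozenset({"maine", "st"}),
]

QUERY = frozenset({"main", "street", "springfield"})

hits = similarity_query(QUERY, DATABASE, WEIGHTS, threshold=0.7)
for position, sim in hits:
    print(position, sorted(DATABASE[position]), round(sim, 3))
```

With these assumed weights, the first two strings exceed the 0.7 threshold despite the "st" versus "street" formatting difference, while the third shares no tokens with the query and is excluded. Every query touches every string in this sketch, which is the cost an index structure would aim to avoid.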