Article Information

Title: Weighted Set-Based String Similarity
Authors: Marios Hadjieleftheriou; Divesh Srivastava
Journal: Bulletin of the Technical Committee on Data Engineering
Year: 2010
Volume: 33
Issue: 01
Publisher: IEEE Computer Society
Abstract: Consider a universe of tokens, each of which is associated with a weight, and a database consisting of strings that can be represented as subsets of these tokens. Given a query string, also represented as a set of tokens, a weighted string similarity query identifies all strings in the database whose similarity to the query is larger than a user-specified threshold. Weighted string similarity queries are useful in applications like data cleaning and integration for finding approximate matches in the presence of typographical mistakes, multiple formatting conventions, data transformation errors, etc. We show that this problem has semantic properties that can be exploited to design index structures that support very efficient algorithms for query answering.
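
The query described in the abstract can be illustrated with a minimal sketch. The abstract does not name a specific similarity function, so weighted Jaccard is assumed here purely for illustration. The token weights, the sample strings, and the function names `weighted_jaccard` and `similarity_query` are all hypothetical. The sketch scans the database linearly and does not attempt to reproduce the index structures the paper proposes.

```python
def weighted_jaccard(a, b, weights):
    """Weighted Jaccard (an assumed measure): weight of shared tokens
    divided by weight of all tokens appearing in either set."""
    shared = sum(weights[t] for t in a & b)
    total = sum(weights[t] for t in a | b)
    return shared / total if total else 0.0


def similarity_query(query, database, weights, threshold):
    """Return every token set in the database whose similarity to the
    query is strictly larger than the threshold (brute-force scan)."""
    q = frozenset(query)
    return [s for s in database if weighted_jaccard(q, s, weights) > threshold]


# Hypothetical token weights; rarer tokens are given larger weights.
weights = {"main": 2.0, "maine": 2.0, "st": 0.5, "street": 0.5}

# Each database string is represented as a set of tokens.
database = [
    frozenset({"main", "st"}),
    frozenset({"main", "street"}),
    frozenset({"maine", "st"}),
]

matches = similarity_query({"main", "st"}, database, weights, threshold=0.6)
for tokens in matches:
    print(sorted(tokens))
```

With these assumed weights, the variant that swaps "st" for "street" still matches, because the heavy token "main" is shared, while the one sharing only the light token "st" falls below the threshold.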