Article Information

Title: Efficient Similarity Join Method Using Unsupervised Learning
Authors: Bilal Hawashin; Farshad Fotouhi; William Grosky et al.
Journal: International Journal of Computer Science & Information Technology (IJCSIT)
Print ISSN: 0975-4660
Online ISSN: 0975-3826
Publication year: 2012
Volume: 4
Issue: 5
Page: 23
Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: This paper proposes an efficient similarity join method using unsupervised learning, for cases where no labeled data is available. In our previous work, we showed that the performance of similarity join could improve when long string attributes, such as paper abstracts, movie summaries, product descriptions, and user feedback, are used under supervised learning, where a training set exists. In this work, we adopt long string attributes during the similarity join under unsupervised learning. Besides its importance when no labeled data is available, unsupervised learning also acts as a quick preprocessing method for huge datasets. Here, we show that using long attributes during unsupervised learning can further enhance the performance. Moreover, we provide an efficient, dynamically expandable algorithm for databases with frequent transactions.
Keywords: Similarity Join; Unsupervised Learning; Diffusion Maps; Databases; Machine Learning.