
Article Information

  • Title: SampleClean: Fast and Reliable Analytics on Dirty Data
  • Authors: Sanjay Krishnan; Jiannan Wang; Michael J. Franklin
  • Journal: Bulletin of the Technical Committee on Data Engineering
  • Year of publication: 2015
  • Volume: 38
  • Issue: 3
  • Publisher: IEEE Computer Society
  • Abstract: An important obstacle to accurate data analytics is dirty data in the form of missing, duplicate, incorrect, or inconsistent values. In the SampleClean project, we have developed a new suite of techniques to estimate the results of queries when only a sample of data can be cleaned. Some forms of data corruption, such as duplication, can affect sampling probabilities, and thus, new techniques have to be designed to ensure correctness of the approximate query results. We first describe our initial project on computing statistically bounded estimates of sum, count, and avg queries from samples of cleaned data. We subsequently explored how the same techniques could apply to other problems in database research, namely, materialized view maintenance. To avoid expensive incremental maintenance, we maintain only a sample of rows in a view, and then leverage SampleClean to approximate aggregate query results. Finally, we describe our work on a gradient-descent algorithm that extends the key ideas to the increasingly common Machine Learning-based analytics.