Journal: IOP Conference Series: Earth and Environmental Science
Print ISSN: 1755-1307
Electronic ISSN: 1755-1315
Publication year: 2019
Volume: 252
Issue: 3
Pages: 1-8
DOI: 10.1088/1755-1315/252/3/032218
Publisher: IOP Publishing
Abstract: Data is one of the most valuable assets, but raw data is often problematic and not conducive to training algorithm models. To cope with this, dirty data can be processed with cleaning systems [1] to obtain standard, clean data for statistics, data mining, and other uses. Instead of manually modifying data, writing SQL statements, or relying on other cumbersome methods that are currently popular for cleaning data, this article proposes an approach that uses the Hadoop big data platform to handle massive data volumes and to support the cleaning of multiple heterogeneous data sources. Moreover, the system prototype supports custom rules and algorithms and can export results to a specified database, which greatly reduces the workload of data cleaning personnel. Based on the system design and theoretical verification presented in this paper, the author implemented a big data cleaning tool on a big data platform. A typical data cleaning process shows that, on the basis of the approach proposed in this paper, data cleaning can be achieved and user operations can be simplified.