Journal: International Journal of Grid and Distributed Computing
Print ISSN: 2005-4262
Year: 2012
Volume: 5
Issue: 3
Publisher: SERSC
Abstract: Since the volume of data generated by scientific data experiments has grown exponentially, new scientific methods to analyze and organize the data are required. These methods, in turn, need an effective infrastructure of computing resources for pre-processing and post-processing the data. This demand has led to the development of techniques that reduce dataset size and of new programming models and their implementations, such as MapReduce. In this paper, we describe an empirical study on handling the dataset of a scientific data experiment to support data transformation, an essential phase in handling large-scale data in scientific data experiments. In this study we show how to optimize a dataset stored in netCDF through data reduction with a sub-setting method, and how to process a dataset on tornado outbreaks in the US with Hadoop, a MapReduce framework. These methods can be applied to pre-processing and post-processing in scientific data experiments.
Keywords: MapReduce; Scientific Data Experiment; Sub-Setting; Data Transformation
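
The abstract describes sub-setting a netCDF dataset as a data-reduction step. The paper's exact procedure is not given here, so the following is only a minimal sketch using the netCDF4 Python library: it copies a restricted time window of a few variables into a smaller file. The file name, variable names, dimension names, and index bounds are all hypothetical, not taken from the paper.

    # sub-setting sketch: extract a time window from a netCDF file
    from netCDF4 import Dataset

    src = Dataset("tornado_full.nc", "r")                 # hypothetical input
    dst = Dataset("tornado_subset.nc", "w", format="NETCDF4")

    t0, t1 = 0, 240                                       # hypothetical time window
    dst.createDimension("time", t1 - t0)
    dst.createDimension("lat", len(src.dimensions["lat"]))
    dst.createDimension("lon", len(src.dimensions["lon"]))

    # copy only the variables of interest, sliced along time where applicable
    for name in ("time", "lat", "lon", "wind_speed"):     # hypothetical variables
        var = src.variables[name]
        out = dst.createVariable(name, var.dtype, var.dimensions)
        out[:] = var[t0:t1] if "time" in var.dimensions else var[:]

    src.close()
    dst.close()

Writing the subset to a separate file keeps the reduction step independent of later processing, so the smaller file can be fed directly into the cluster.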
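The abstract also mentions processing the tornado dataset with Hadoop. The paper's actual job is not specified here, so below is a generic Hadoop Streaming mapper/reducer pair in Python, shown purely to illustrate the MapReduce model the abstract refers to: the mapper emits one (state, 1) pair per record, the reducer sums counts per state. The comma-separated record layout with the state in the second column is a hypothetical stand-in for the real data.

    #!/usr/bin/env python
    # mapper.py -- emit (state, 1) for each input record on stdin
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) > 1:
            print("%s\t1" % fields[1])

    #!/usr/bin/env python
    # reducer.py -- sum counts per key; Hadoop Streaming delivers
    # mapper output to the reducer sorted by key
    import sys

    current, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        print("%s\t%d" % (current, total))

Such scripts would be launched along the lines of "hadoop jar hadoop-streaming.jar -input tornado/ -output counts/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py", where the jar location and the input/output paths are placeholders for the cluster's own configuration.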