
Basic Article Information

  • Title: Clustering of Datasets Using K-Means Algorithm in SPARK
  • Authors: Subiksha N; Pallavi R Reddy; Mounica B
  • Journal: International Journal of Innovative Research in Computer and Communication Engineering
  • Print ISSN: 2320-9798
  • Electronic ISSN: 2320-9801
  • Year: 2017
  • Volume: 5
  • Issue: 5
  • Pages: 9788
  • DOI: 10.15680/IJIRCCE.2017.0505263
  • Publisher: S&S Publications
  • Abstract: With the growing volume of data generated every day, storing and processing the data is a great challenge. Existing solutions such as SAS, R, Excel and MapReduce prove to be inefficient. MapReduce is a part of Apache Hadoop that allows distributed processing of unstructured data, where each distributed node has its own storage. Most of these algorithms are iterative in nature, but MapReduce handles iterative algorithms poorly because it involves an undesirable amount of disk reads and writes. As an advancement, in this paper we use Spark to implement algorithms such as K-means and linear regression on real-time data. Spark is a unified engine that can run on Hadoop, Mesos, the cloud or standalone. It stores intermediate results in RAM, thus avoiding reads from and writes to disk. Like MapReduce, Spark is used for batch processing; in addition, Spark can handle streaming data, queries and machine learning. In this paper, we prefer to use cloud services to access the large datasets because the cloud is flexible, scalable and involves less hardware cost than using our own infrastructure, which requires much software (Hadoop, Ubuntu, Java etc.) to be installed. Once the data is obtained, we run the K-means algorithm on Spark using EMR (Elastic MapReduce). (A minimal sketch of this K-means-on-Spark workflow follows the record below.)
  • Keywords: MapReduce; iterative algorithms; Spark; K-means; cloud services
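
The abstract describes clustering a cloud-hosted dataset with K-means on Spark via EMR. The sketch below is a minimal illustration of that workflow using PySpark's pyspark.ml API, not the authors' actual code: the S3 path, column names and the choice of k = 3 are placeholder assumptions, since the paper's dataset and parameters are not given in this record.

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Start a Spark session (on an EMR cluster this runs on the cluster's resource manager).
spark = SparkSession.builder.appName("KMeansClustering").getOrCreate()

# Load a CSV dataset from cloud storage; the bucket path and column names are placeholders.
df = spark.read.csv("s3://example-bucket/dataset.csv", header=True, inferSchema=True)

# Combine the numeric columns into a single feature vector column, as pyspark.ml expects.
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
vectorized = assembler.transform(df)

# Fit K-means with an assumed k of 3; Spark keeps the working set in memory
# across iterations instead of re-reading it from disk on every pass.
kmeans = KMeans(k=3, seed=1, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(vectorized)

# Assign each record to its nearest centroid and inspect the learned centroids.
clustered = model.transform(vectorized)
clustered.select("feature1", "feature2", "cluster").show(5)
for center in model.clusterCenters():
    print(center)

spark.stop()

On EMR, a script like this would typically be submitted with spark-submit as a cluster step, which is consistent with the paper's preference for Elastic MapReduce over a self-managed Hadoop/Ubuntu/Java installation.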