
Basic Article Information

  • Title: Clustering of Datasets Using K-Means Algorithm in SPARK
  • Authors: Subiksha N; Pallavi R Reddy; Mounica B
  • Journal: International Journal of Innovative Research in Computer and Communication Engineering
  • Print ISSN: 2320-9798
  • Electronic ISSN: 2320-9801
  • Year: 2017
  • Volume: 5
  • Issue: 5
  • Pages: 9788
  • DOI: 10.15680/IJIRCCE.2017.0505263
  • Publisher: S&S Publications
  • Abstract: With the growing volume of data generated every day, storing and processing the data is a great challenge. Existing solutions such as SAS, R, Excel and MapReduce prove to be inefficient. MapReduce is a part of Apache Hadoop that allows distributed processing of unstructured data, where each distributed node has its own storage. Most of these algorithms are iterative in nature, but MapReduce handles iterative algorithms poorly because it involves an undesirable amount of disk reads and writes. As an advancement, in this paper we use Spark to implement algorithms such as K-means and linear regression on real-time data. Spark is a unified engine that can run on Hadoop, Mesos, the cloud or standalone. It stores intermediate results in RAM, thus avoiding reads from and writes to disk. Like MapReduce, Spark is used for batch processing; in addition, Spark can handle streaming data, queries and machine learning. In this paper, we prefer to use cloud services to access the large datasets because the cloud is flexible, scalable and involves less hardware cost than using our own infrastructure, which requires much software (Hadoop, Ubuntu, Java etc.) to be installed. Once the data is obtained, we run the K-means algorithm on Spark using EMR (Elastic MapReduce). (A minimal sketch of this K-means-on-Spark workflow follows the record below.)
  • Keywords: MapReduce; iterative algorithms; Spark; K-means; cloud services
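
The abstract describes clustering a cloud-hosted dataset with K-means on Spark via EMR. The sketch below is a minimal illustration of that workflow using PySpark's pyspark.ml API, not the authors' actual code: the S3 path, column names and the choice of k = 3 are placeholder assumptions, since the paper's dataset and parameters are not given in this record.

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Start a Spark session (on an EMR cluster this runs on the cluster's resource manager).
spark = SparkSession.builder.appName("KMeansClustering").getOrCreate()

# Load a CSV dataset from cloud storage; the bucket path and column names are placeholders.
df = spark.read.csv("s3://example-bucket/dataset.csv", header=True, inferSchema=True)

# Combine the numeric columns into a single feature vector column, as pyspark.ml expects.
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
vectorized = assembler.transform(df)

# Fit K-means with an assumed k of 3; Spark keeps the working set in memory
# across iterations instead of re-reading it from disk on every pass.
kmeans = KMeans(k=3, seed=1, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(vectorized)

# Assign each record to its nearest centroid and inspect the learned centroids.
clustered = model.transform(vectorized)
clustered.select("feature1", "feature2", "cluster").show(5)
for center in model.clusterCenters():
    print(center)

spark.stop()

On EMR, a script like this would typically be submitted with spark-submit as a cluster step, which is consistent with the paper's preference for Elastic MapReduce over a self-managed Hadoop/Ubuntu/Java installation.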