Basic Article Information

Title: A COMPARISON STUDY OF DOCUMENT CLUSTERING USING DOC2VEC VERSUS TFIDF COMBINED WITH LSA FOR SMALL CORPORA
Authors: AMALIA AMALIA ; OPIM SALIM SITOMPUL ; ERNA BUDHIARTI NABABAN et al.
Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Year of publication: 2020
Volume: 98
Issue: 17
Pages: 3644-3657
Publisher: Journal of Theoretical and Applied
Abstract: The choice of word vector representation is an essential factor in document clustering because it affects clustering performance. A good word vector representation can produce a good clustering result even with a simple clustering algorithm such as K-Means. Doc2Vec, one such representation, has been studied extensively on large text datasets and has been shown to outperform traditional word vector representations in document categorization. However, only a few studies analyze word vector representations for small corpora. Studying small corpora is also important because a large corpus is not always available, particularly for low-resource languages such as Bahasa Indonesia. Moreover, clustering small datasets plays an essential role in pattern recognition and can be an initial step before applying the analysis to a larger corpus. This experimental study compares in depth document clustering using Doc2Vec versus TFIDF-LSA for small corpora in Bahasa Indonesia. The quality of each word vector representation is measured by cluster performance, using intrinsic and extrinsic measurements, and by time and memory consumption. The study also seeks an optimal word vector representation by tuning the relevant hyper-parameters. The word vector representations were tested on small corpora of various sizes using the K-Means algorithm. The results show that TFIDF-LSA achieves better cluster performance, while the Doc2Vec model is more efficient in time and memory usage.
Keywords: Clustering; Word Vector Representation; Word Embedding; Clustering Comparison; Small Corpora
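
The evaluation loop described in the abstract (vectorize documents, cluster them with K-Means, score the clusters against known categories) can be illustrated with a minimal standard-library sketch. This is not the authors' code. It covers only TF-IDF weighting, plain Euclidean K-Means, and purity; the LSA and Doc2Vec steps are omitted because they need a linear-algebra or embedding library. Purity is used here only as one example of an extrinsic measure, since the abstract does not name the measures used. The function names, the toy documents, and the seed indices are all hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return one L2-normalised dense TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    index = {term: i for i, term in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for term, count in Counter(tokens).items():
            # Smoothed IDF (an assumption; other variants are common).
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            row[index[term]] = count * idf
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return rows


def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, seed_indices, n_iter=20):
    """Plain K-Means; initial centroids are the rows at seed_indices."""
    centroids = [list(rows[i]) for i in seed_indices]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(r, centroids[k]))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(labels, truth):
    """Fraction of documents that fall in their cluster's majority category."""
    matched = 0
    for k in set(labels):
        members = [t for lab, t in zip(labels, truth) if lab == k]
        matched += Counter(members).most_common(1)[0][1]
    return matched / len(labels)


# Hypothetical toy corpus: two topics with disjoint vocabularies.
docs = [
    "cat eats fish",
    "cat drinks milk",
    "kitten and cat play",
    "stock market rises",
    "stock price falls",
    "investor buys stock",
]
truth = [0, 0, 0, 1, 1, 1]

rows = tfidf_matrix(docs)
labels = kmeans(rows, seed_indices=[0, 3])
print(labels, purity(labels, truth))
```

Fixed seed indices replace random initialisation so that the run is repeatable. To compare representations in the way the abstract describes, `tfidf_matrix` would be swapped for another vectorizer while the clustering and scoring steps stay the same.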