期刊名称:International Journal of Advanced Computer Science and Applications(IJACSA)
印刷版ISSN:2158-107X
电子版ISSN:2156-5570
出版年度:2019
卷号:10
期号:8
DOI:10.14569/IJACSA.2019.0100815
出版社:Science and Information Society (SAI)
摘要:Sentiment analysis in non-English language can be more challenging than the English language because of the scarcity of publicly available resources to build the prediction model with high accuracy. To alleviate this under-resourced problem, this article introduces the leverage of byte-level recurrent neural model to generate text representation for twitter sentiment analysis in the Indonesian language. As the main part of the proposed model training is unsupervised and does not require much-labeled data, this approach can be scalable by using a huge amount of unlabeled data that is easily gathered on the Internet, without much dependencies on human-generated resources. This paper also introduces an Indonesian dataset for general sentiment analysis. It consists of 10,806 twitter data (tweets) selected from a total of 454,559 gathered tweets which taken directly from twitter using twitter API. The 10,806 tweets are then classified into 3 categories, positive, negative, and neutral. This Indonesian dataset could help the development of Indonesian sentiment analysis especially general sentiment analysis and encouraged others to start publishing similar dataset in the future.
关键词:Sentiment analysis; under-resourced problem; Indonesian dataset; twitter