Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: The increased popularity of different text representations has brought many improvements in Natural Language Processing (NLP) tasks. Without the need for supervised data, embeddings trained on large corpora provide meaningful relations that can be used in many NLP tasks. Even though training these vectors is relatively easy with recent methods, the information gained from the data depends heavily on the structure of the corpus language. Since the most commonly studied languages share a similar morphological structure, problems that arise for morphologically rich languages are largely disregarded in the literature. For morphologically rich languages, context-free word vectors ignore the morphological structure of the language. In this study, we prepared texts in morphologically different forms in Turkish, a morphologically rich language, and compared the results on different intrinsic and extrinsic tasks. To see the effect of morphological structure, we trained word2vec models on texts in which lemmas and suffixes are treated differently. We also trained the subword model fastText and compared the embeddings on word analogy, text classification, sentiment analysis, and language modeling tasks.
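The training setup the abstract describes can be sketched with gensim. The following is a minimal, hypothetical example, not the paper's actual pipeline: the tiny corpus and the lemma+suffix segmentation (e.g. "evlerde" split into "ev" and "+lerde") are illustrative assumptions, standing in for whatever corpora and morphological analyzer the authors used.

```python
# Minimal sketch: word2vec on surface forms vs. on lemma+suffix segmented
# text, plus fastText with subword n-grams. Corpus and segmentation are
# illustrative assumptions, not the paper's data.
from gensim.models import Word2Vec, FastText

# Surface-form sentences (raw Turkish text).
raw_sentences = [
    ["evlerde", "kediler", "uyuyor"],
    ["evde", "kedi", "uyuyor"],
]

# The same sentences with lemmas and suffixes as separate tokens, so the
# model learns embeddings for stems and affixes independently.
segmented_sentences = [
    ["ev", "+lerde", "kedi", "+ler", "uyu", "+yor"],
    ["ev", "+de", "kedi", "uyu", "+yor"],
]

# word2vec on surface forms: each inflected form is an opaque token.
w2v_surface = Word2Vec(raw_sentences, vector_size=100, window=5,
                       min_count=1, sg=1)

# word2vec on segmented text: separate lemma and suffix vectors.
w2v_segmented = Word2Vec(segmented_sentences, vector_size=100, window=5,
                         min_count=1, sg=1)

# fastText on surface forms: character n-grams (min_n..max_n) share
# information across inflections of a lemma without explicit segmentation.
ft = FastText(raw_sentences, vector_size=100, window=5,
              min_count=1, min_n=3, max_n=6)

print(w2v_segmented.wv.most_similar("ev", topn=2))
print(ft.wv["evlerde"][:5])  # subword vectors cover even rare forms
```

The contrast between `w2v_surface` and `w2v_segmented` mirrors the abstract's comparison of treating lemmas and suffixes differently, while `ft` stands in for the subword baseline; the paper's evaluation on analogy, classification, sentiment, and language modeling tasks is not reproduced here.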