Article Information

Title: Improving Morphosyntactic Tagging of Slovene Language through Meta-tagging
Authors: Jan Rupnik; Miha Grčar; Tomaž Erjavec et al.
Journal: Informatica
Print ISSN: 1514-8327
Electronic ISSN: 1854-3871
Year: 2008
Volume: 32
Issue: 4
Publisher: The Slovene Society Informatika, Ljubljana
Abstract: Part-of-speech (PoS) or, better, morphosyntactic tagging is the process of assigning morphosyntactic categories to words in a text, an important pre-processing step for most human language technology applications. PoS-tagging of Slovene texts is a challenging task since the size of the tagset is over one thousand tags (as opposed to English, where the size is typically around sixty) and the state-of-the-art tagging accuracy is still below desired levels. The paper describes an experiment aimed at improving tagging accuracy for Slovene by combining the outputs of two taggers: a proprietary rule-based tagger developed by the Amebis HLT company, and TnT, a tri-gram HMM tagger, trained on a hand-annotated corpus of Slovene. The two taggers have comparable accuracy, but there are many cases where, if the predictions of the two taggers differ, one of the two does assign the correct tag. We investigate training a classifier on top of the outputs of both taggers that predicts which of the two taggers is correct. We experiment with selecting different classification algorithms and constructing different feature sets for training and show that some cases yield a meta-tagger with a significant increase in accuracy compared to that of either tagger in isolation.
Keywords: PoS tagging; meta-tagger; Slavic languages; FidaPLUS; JOS corpus; machine learning; Orange; decision trees; CN2 rules; Naive Bayes