Article Information

  • Title: Developing a POS Tagged Corpus of Urdu Tweets
  • Authors: Amber Baig; Mutee U Rahman; Hameedullah Kazi
  • Journal: Computers
  • Electronic ISSN: 2073-431X
  • Year of publication: 2020
  • Volume: 9
  • Issue: 4
  • Pages: 90-102
  • DOI: 10.3390/computers9040090
  • Publisher: MDPI Publishing
  • Abstract: Processing social media text such as tweets is challenging for traditional Natural Language Processing (NLP) tools, which were developed for well-edited text, because such text is noisy. However, demand for tools and resources that correctly process noisy text has increased in recent years because of its usefulness in various applications. The literature reports various efforts to develop tools and resources for processing such text in various languages, notably for part-of-speech (POS) tagging, an NLP task that directly affects the performance of subsequent text processing activities. Still, no attempt has been made to develop a POS tagger for Urdu social media content. Thus, the focus of this paper is on POS tagging of Urdu tweets. We introduce a new tagset for POS tagging of Urdu tweets along with a POS-tagged corpus of Urdu tweets. We also investigate bootstrapping as a potential solution for overcoming the shortage of manually annotated data and present a supervised POS tagger that achieves 93.8% precision, 92.9% recall, and 93.3% F-measure.
  • Keywords: natural language processing; part-of-speech tagging; user-generated text; Urdu; data-driven NLP tasks; social media; tweets; noisy; bootstrapping