
Article information

  • Title: Lexical normalization of Roman Urdu text
  • Authors: Zareen Sharf; Dr Saif Ur Rahman
  • Journal: International Journal of Computer Science and Network Security
  • Print ISSN: 1738-7906
  • Year: 2017
  • Volume: 17
  • Issue: 12
  • Pages: 213-221
  • Publisher: International Journal of Computer Science and Network Security
  • Abstract: Social media text usually consists of short messages, which typically contain a high percentage of abbreviations, typos, phonetic substitutions and other informal ways of writing. This inconsistent text representation poses challenges for Natural Language Processing and other forms of analysis on the available data. To overcome these issues, the text needs to be normalized for effective processing and analysis. In this work, we perform a comparative study of how social media text in different languages, including Chinese, Arabic, Japanese, Polish, Bangla, Dutch and Roman Urdu, has been normalized to achieve consistency. We discuss in detail the proposed normalization methods, their success rates and their shortcomings. Based on our analysis, we also propose a model for lexical normalization of Roman Urdu text.
  • Keywords: Normalization; Standardization; Transliteration; Roman Urdu.