Article information

  • Title: Automatic Sandhi Spliting Method for Telugu, an Indian Language
  • Authors: Phani Chaitanya Vempaty; Satish Chandra Prasad Nagalla
  • Journal: Procedia - Social and Behavioral Sciences
  • Print ISSN: 1877-0428
  • Publication year: 2011
  • Volume: 27
  • Pages: 218-225
  • DOI: 10.1016/j.sbspro.2011.10.601
  • Language: English
  • Publisher: Elsevier
  • Abstract: Developing better methods for segmenting continuous text into words is important for improving the processing of Indian languages. In this paper we discuss the methodology of building a tool for Sandhi splitting for Telugu, an Indian language. Sandhi is a process in which two or more words unite to form a compound word by undergoing some modification, caused by the influence of adjacent words. We propose a method that uses simple finite state automata to find the possible candidate splits for a given compound word. We then apply a linguistically driven empirical scoring mechanism for pruning, and compute scores based on the joint probability between the possible syllables that undergo Sandhi. We used a corpus of 158k words as base words for building the finite state machines. We discuss our scoring mechanisms; our system achieves an accuracy of 80.30% on a test set of 500 words. We also briefly discuss the errors made by our system and our reflections on them.
  • Keywords: Sandhi; automata; linguistic
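
The abstract describes generating candidate splits with finite state automata built from base words, then ranking them by score. The Python sketch below is only an illustration of that general idea under stated assumptions, not the authors' implementation: the lexicon, the single junction rule, the romanized toy strings, and the numeric scores are all hypothetical placeholders, and the score is a stand-in for the joint-probability scoring mentioned in the abstract. A trie serves as a simple deterministic automaton that accepts base words.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon of base words (romanized; not from the paper).
LEXICON = {"ramudu", "atanu", "ramu", "datanu"}

# Hypothetical junction rules: (surface, left_end, right_start, score).
# "surface" is what appears at the junction in the compound; the rule
# restores left_end on the left word and right_start on the right word.
# The scores are made-up placeholders for a learned probability.
RULES: List[Tuple[str, str, str, float]] = [
    ("a", "u", "a", 0.6),  # final "u" dropped before an initial "a"
    ("", "", "", 0.4),     # plain concatenation, no modification
]


def build_trie(words) -> Dict:
    """Build a trie; it acts as a finite automaton accepting the words."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # marks an accepting state
    return root


def accepts(trie: Dict, word: str) -> bool:
    """Return True if the automaton ends in an accepting state on word."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


def split_candidates(compound: str, trie: Dict, rules) -> List[Tuple[str, str, float]]:
    """Try every split point and rule; keep splits whose parts are base words."""
    found = []
    for i in range(1, len(compound)):
        for surface, left_end, right_start, score in rules:
            if not compound.startswith(surface, i):
                continue
            left = compound[:i] + left_end
            right = right_start + compound[i + len(surface):]
            if accepts(trie, left) and accepts(trie, right):
                found.append((left, right, score))
    # Highest-scoring candidate first.
    return sorted(found, key=lambda c: c[2], reverse=True)


TRIE = build_trie(LEXICON)
for left, right, score in split_candidates("ramudatanu", TRIE, RULES):
    print(left, "+", right, score)
```

In this toy setup the automaton filters out splits whose parts are not base words, and the rule score decides between the remaining candidates. A real system would need rules and probabilities estimated from data.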