Basic article information

Title: Automatic Sandhi Spliting Method for Telugu, an Indian Language
Authors: Phani Chaitanya Vempaty; Satish Chandra Prasad Nagalla; et al.
Journal: Procedia - Social and Behavioral Sciences
Print ISSN: 1877-0428
Year: 2011
Volume: 27
Pages: 218-225
DOI: 10.1016/j.sbspro.2011.10.601
Language: English
Publisher: Elsevier
Abstract: Developing better methods for segmenting continuous text into words is important for improving the processing of Indian languages. In this paper we discuss the methodology of building a tool for Sandhi splitting for Telugu, an Indian language. Sandhi is a process in which two or more words unite to form a compound word, undergoing some modification due to the influence of adjacent words. We propose a method that uses simple finite state automata to find the possible candidate splits for a given compound word. We then use a linguistically driven empirical scoring mechanism for pruning, and compute scores based on the joint probability of the possible syllables that undergo Sandhi. We used a corpus of 158k words as base words for building the finite state machines. We discuss our scoring mechanisms; our system achieves an accuracy of 80.30% on a test set of 500 words. We also briefly discuss the errors made by our system and our reflections on them.
Keywords: Sandhi; automata; linguistic
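
The two stages named in the abstract (candidate generation with a finite state acceptor, then ranking by a joint probability at the Sandhi boundary) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the romanized toy lexicon (including the made-up distractor "datadu"), the single vowel-restoring rule, the two-character approximation of boundary syllables, and the probability values. The linguistically driven pruning step is omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon standing in for the 158k base words.
# "datadu" is a made-up distractor so that two candidates compete.
LEXICON = {"ramudu", "atadu", "ramu", "datadu"}

# Hypothetical Sandhi rules: (suffix to restore on the left word,
# whether the right word must start with a vowel for the rule to apply).
RULES: List[Tuple[str, bool]] = [("", False), ("u", True)]

# Hypothetical joint probabilities of boundary units (toy values).
BOUNDARY_PROB: Dict[Tuple[str, str], float] = {
    ("du", "at"): 0.6,
    ("mu", "da"): 0.05,
}

VOWELS = set("aeiou")


def build_trie(words):
    """Build a trie; it acts as a simple finite state acceptor for the lexicon."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # marks an accepting state
    return root


def accepts(trie, word: str) -> bool:
    """Return True if the acceptor reaches an accepting state on `word`."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node


def candidates(compound: str, trie) -> List[Tuple[str, str]]:
    """Enumerate splits whose restored halves are both accepted."""
    found = []
    for i in range(1, len(compound)):
        left, right = compound[:i], compound[i:]
        for suffix, needs_vowel in RULES:
            if needs_vowel and right[0] not in VOWELS:
                continue
            restored = left + suffix
            if accepts(trie, restored) and accepts(trie, right):
                found.append((restored, right))
    return found


def score(split: Tuple[str, str]) -> float:
    """Score a split by the toy joint probability of its boundary units."""
    left, right = split
    return BOUNDARY_PROB.get((left[-2:], right[:2]), 1e-6)


def best_split(compound: str, trie) -> Tuple[str, str]:
    """Pick the highest-scoring candidate split."""
    return max(candidates(compound, trie), key=score)


TRIE = build_trie(LEXICON)
print(candidates("ramudatadu", TRIE))
print(best_split("ramudatadu", TRIE))
```

In this sketch the trie proposes both "ramu" + "datadu" and "ramudu" + "atadu" for the compound "ramudatadu", and the boundary score selects the second. A real system would need proper Telugu syllabification and probabilities estimated from a corpus.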