Abstract: Developing better methods for segmenting continuous text into words is important for improving the processing of Indian languages. In this paper we discuss the methodology of building a Sandhi-splitting tool for Telugu, an Indian language. Sandhi is a process in which two or more words unite to form a compound word by undergoing some modification, due to the influence of adjacent words. We propose a method that uses simple finite state automata to find the possible candidates for a given compound word. We then apply a linguistically driven empirical scoring mechanism for pruning, and compute scores based on the joint probability of the syllables that undergo Sandhi. We used a corpus of 158k base words to build the finite state machines. We discuss our scoring mechanisms; the system achieves an accuracy of 80.30% on a test set of 500 words. We also briefly discuss the errors made by our system and our reflections on them.
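To make the candidate-generation step concrete, the following is a minimal sketch of how a finite state automaton over base words (here realized as a simple trie) can enumerate possible split points of a compound word. The class names, romanized toy lexicon, and splitting logic are illustrative assumptions, not the authors' actual implementation; the real system would operate on Telugu script, handle the phonological modifications introduced by Sandhi, and use the full 158k-word corpus.

```python
class TrieNode:
    """Node in a trie acting as a simple acceptor over base words."""
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    """Build a trie (a deterministic finite state acceptor) from base words."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def candidate_splits(compound, root):
    """Enumerate (prefix, remainder) pairs where the prefix is accepted by
    the automaton. In a full system the remainder would then be validated
    (possibly after undoing the Sandhi modification) and the candidates
    scored and pruned."""
    candidates = []
    node = root
    for i, ch in enumerate(compound):
        node = node.children.get(ch)
        if node is None:
            break  # no base word continues along this path
        if node.is_word:
            candidates.append((compound[:i + 1], compound[i + 1:]))
    return candidates

# Toy example with a hypothetical romanized lexicon.
trie = build_trie(["rama", "ramayya", "gari"])
print(candidate_splits("ramagari", trie))
# → [('rama', 'gari')]
```

In practice the prefix and remainder need not concatenate back to the surface form, since Sandhi alters the characters at the junction; the automaton is only the first, recall-oriented stage before the scoring mechanism prunes implausible splits.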