Title: PHMM: Stemming on Persian Texts using Statistical Stemmer Based on Hidden Markov Model
Journal: International Journal of Information Science and Management (IJISM)
Print ISSN: 2008-8302
Electronic ISSN: 2008-8310
Publication year: 2016
Volume: 14
Issue: 2
Language: English
Publisher: REGIONAL INFORMATION CENTER FOR SCIENCE AND TECHNOLOGY
Other abstract: Stemming is the process of finding the main morpheme of a word, and it is used in natural language processing, text mining and information retrieval systems. A stemmer extracts the stem of a word. Persian stemmers fall into three main classes: structural stemmers, dictionary-based stemmers and statistical stemmers. The precision of structural stemmers is low and the cost of dictionary-based stemmers is high, so the main goal of this research is to design and implement a high-precision statistical stemmer based on a hidden Markov model (HMM) that can reduce the size of the index file and increase the speed of information retrieval systems. The proposed stemmer finds the prefixes and suffixes of a word and removes them, so that the rest of the word is the stem. However, some Persian words are exceptions that would be stemmed incorrectly this way, so we also compile a dictionary of Persian stems. The proposed stemmer first looks a word up in the dictionary; if the word is not there, it finds the stem with the HMM-based stemmer. The stemmer is tested on the Bijankhan corpus and the Hamshahri test collection. The results show an increase in mean average precision and recall. The algorithm also increases the speed of the information retrieval system and decreases the size of the index files.
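The two-stage lookup described in the abstract (dictionary first, statistical stemmer as fallback) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the HMM step is replaced by a naive affix stripper, the words are Latin transliterations rather than Persian script, and every name (`EXCEPTIONS`, `PREFIXES`, `SUFFIXES`, `strip_affixes`, `stem_word`) and every list entry is hypothetical.

```python
from typing import Callable, Dict, Sequence

# Hypothetical toy data in Latin transliteration; not taken from the paper.
# "mihan" (homeland) begins with the letters of the verbal prefix "mi",
# so blind affix stripping would damage it.
EXCEPTIONS: Dict[str, str] = {"mihan": "mihan"}
PREFIXES: Sequence[str] = ("mi", "na")
SUFFIXES: Sequence[str] = ("ha", "am", "i")


def strip_affixes(
    word: str,
    prefixes: Sequence[str] = PREFIXES,
    suffixes: Sequence[str] = SUFFIXES,
    min_stem: int = 2,
) -> str:
    """Naive stand-in for the HMM step: drop at most one prefix and one suffix."""
    core = word
    # Try longer affixes first; never shrink the stem below min_stem characters.
    for p in sorted(prefixes, key=len, reverse=True):
        if core.startswith(p) and len(core) - len(p) >= min_stem:
            core = core[len(p):]
            break
    for s in sorted(suffixes, key=len, reverse=True):
        if core.endswith(s) and len(core) - len(s) >= min_stem:
            core = core[: -len(s)]
            break
    return core


def stem_word(
    word: str,
    exceptions: Dict[str, str] = EXCEPTIONS,
    fallback: Callable[[str], str] = strip_affixes,
) -> str:
    """Return the dictionary stem if the word is listed, else use the fallback."""
    if word in exceptions:
        return exceptions[word]
    return fallback(word)


if __name__ == "__main__":
    for w in ("ketabha", "miravam", "mihan"):
        print(w, "->", stem_word(w))
```

Passing the fallback in as a callable keeps the dictionary check independent of the statistical model, so a trained HMM segmenter could be substituted for `strip_affixes` without touching `stem_word`.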