Template Based Affix Stemmer for a Morphologically Rich Language
Sajjad Khan1, Waqas Anwar1,2, Usama Bajwa1, and Wang Xuan2
1Department of Computer Science, COMSATS Institute of Information Technology, Pakistan
2Harbin Institute of Technology Shenzhen Graduate School, China
Abstract: Word stemming is one of the most significant factors affecting the performance of Natural Language Processing (NLP) applications such as information retrieval systems, part-of-speech tagging, machine translation systems, and syntactic parsing. The Urdu language poses several challenges for NLP, largely because of its rich morphology. Stemming in Urdu differs from stemming in other languages because it depends not only on removing prefixes and suffixes but also on removing infixes. In this paper we introduce a template-based stemmer that removes all kinds of affixes, i.e., prefixes, infixes, and suffixes, depending on the morphological pattern of the word. The presented results are excellent, and this stemmer can prove to be very effective for a morphologically rich language.
Keywords: Information retrieval, stemming, prefix, infix, suffix, exception lists.
Received October 19, 2012; accepted December 4, 2013