Template Based Affix Stemmer for a Morphologically Rich Language


 

  Sajjad Khan1, Waqas Anwar1,2, Usama Bajwa1, and Wang Xuan2

1Department of Computer Science, COMSATS Institute of Information Technology, Pakistan

2Harbin Institute of Technology Shenzhen Graduate School, China

Abstract: Word stemming is one of the most significant factors affecting the performance of Natural Language Processing (NLP) applications such as information retrieval systems, part-of-speech tagging, machine translation, and syntactic parsing. Urdu poses several challenges for NLP, largely because of its rich morphology. Stemming in Urdu differs from stemming in other languages: it requires removing not only prefixes and suffixes but also infixes. In this paper we introduce a template-based stemmer that removes all kinds of affixes, i.e., prefixes, infixes, and suffixes, depending on the morphological pattern of the word. The reported results are excellent, and the stemmer can prove very effective for a morphologically rich language.

 Keywords: Information retrieval, stemming, prefix, infix, suffix, exception lists.
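
The idea of removing prefixes, infixes, and suffixes according to a word's morphological pattern can be illustrated with a minimal sketch. It is not the stemmer described in the paper: the template notation, the three templates, the romanized toy words, and the single exception entry are all assumptions chosen for illustration. In each template, `C` marks a position whose letter is kept as part of the stem, and every other character is an affix letter that must match literally and is dropped.

```python
# Minimal sketch of template-based affix stripping (illustrative only).
# Assumption: templates, exception list, and romanized words are hypothetical
# and are NOT taken from the paper.

# 'C' = slot whose letter is kept; any other character is a literal affix letter.
TEMPLATES = ["maCCuuC", "CaaCiC", "CuCuuC"]

# Words that happen to fit a template but must not be stemmed (hypothetical entry).
EXCEPTIONS = {"mazduur"}


def match_template(word, template):
    """Return the letters in the 'C' slots if word fits template, else None."""
    if len(word) != len(template):
        return None
    kept = []
    for ch, slot in zip(word, template):
        if slot == "C":
            kept.append(ch)      # stem letter: keep it
        elif ch != slot:
            return None          # affix letter does not match: template fails
    return "".join(kept)


def stem(word):
    """Apply the first matching template; leave exceptions and non-matches as-is."""
    if word in EXCEPTIONS:
        return word
    for template in TEMPLATES:
        result = match_template(word, template)
        if result is not None:
            return result
    return word


if __name__ == "__main__":
    for w in ["maktuub", "kaatib", "kitaab", "mazduur"]:
        print(w, "->", stem(w))
```

In this sketch the prefix `ma` and the infix `uu` are removed in a single step because both are encoded in one template, which is the property that distinguishes template matching from plain prefix and suffix stripping. The exception list is consulted first so that a word that merely resembles a template is returned unchanged.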

 

Received October 19, 2012; accepted December 4, 2013



 

 
