Comprehensive Stemmer for Morphologically Rich Urdu Language
Mubashir Ali1, Shehzad Khalid2, and Muhammad Saleemi2
1Department of Computer Science & IT, University of Lahore, Gujrat
Campus, Pakistan
2Department of Computer Engineering,
Bahria University Islamabad, Pakistan
Abstract: The Urdu language is used by approximately 200 million people for
spoken and written communication. A large volume of unstructured Urdu text is
available, and data mining techniques can be employed to extract useful
information from this large potential information base. Many text
processing systems are available. However, these systems are mostly
language specific, and a large proportion of them are applicable to
English text. This is primarily due to language-dependent pre-processing
requirements, mainly stemming. Stemming is a vital pre-processing
step in the text mining process, and its core aim is to reduce the many grammatical
forms of a word (e.g., part of speech, gender, tense) to their root form. In
this work, we have developed a comprehensive rule-based stemming
method for Urdu text. The proposed Urdu stemmer can generate
the stem of Urdu words as well as loan words (words borrowed from other
languages, e.g., Arabic, Persian, Turkish) by removing prefixes, infixes, and
suffixes. The proposed stemming technique introduces six novel Urdu infix word
classes and a minimum word length rule. To cope with the challenge of
Urdu infix stemming, we have developed infix stripping rules for the introduced
infix word classes and generic rules for prefix and suffix stemming. The
experimental results show the superiority of our proposed stemming approach
compared to existing techniques.
Keywords: Urdu stemmer, infix classes, infix rules,
stemming rules, stemming lists.
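
To make the idea of rule-based affix stripping with a minimum word length rule concrete, the following is a minimal Python sketch. It is not the stemmer proposed in this paper: the affix lists, the threshold of three characters, and the function name strip_affixes are assumptions chosen only for illustration, and the infix word classes and their stripping rules are not covered.

```python
# Illustrative sketch only. The affix lists and the length threshold below are
# assumptions for demonstration; they are not the paper's stemming lists.
PREFIXES = ("نا",)             # assumed example: negating prefix "na-"
SUFFIXES = ("وں", "یں", "ی")   # assumed examples: common inflectional endings
MIN_STEM_LENGTH = 3            # assumed minimum length of the remaining stem


def strip_affixes(word, prefixes=PREFIXES, suffixes=SUFFIXES,
                  min_len=MIN_STEM_LENGTH):
    """Strip at most one prefix and one suffix, longest match first.

    An affix is removed only if the remaining string still has at least
    `min_len` characters (a simple form of a minimum word length rule).
    """
    stem = word

    # Prefix stripping: try longer prefixes before shorter ones.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_len:
            stem = stem[len(prefix):]
            break

    # Suffix stripping: try longer suffixes before shorter ones.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_len:
            stem = stem[:-len(suffix)]
            break

    return stem


if __name__ == "__main__":
    for w in ("کتابوں", "ناخوش", "نام"):
        print(w, "->", strip_affixes(w))
```

In this sketch the length check guards against over-stemming: a short word such as "نام" begins with the assumed prefix but is left unchanged because stripping would leave fewer than three characters.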