Contextual Text Categorization: An Improved Stemming Algorithm
to Increase the Quality of Categorization in Arabic Text
Said Gadri and Abdelouahab Moussaoui
Department of Computer Science, University Ferhat
Abbas of Setif, Algeria
Abstract: One of the methods used to reduce the size of the
term vocabulary in Arabic text categorization is to replace the different
variants (forms) of a word with their common root. This process is called
root-based stemming. However, root extraction is more difficult in Arabic
than in other languages, because Arabic is a very rich language with a
complex morphology. Many algorithms have been
proposed in this field. Some of them are based on morphological rules and
grammatical patterns, so they are quite complex and require deep linguistic
knowledge. Others are statistical, so they are simpler and rely only on
numerical computations. In this paper, we propose an improved stemming algorithm
based on root extraction and the n-gram technique, which returns the
stems of Arabic words without using any morphological rules or grammatical
patterns.
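To illustrate the general idea of n-gram based root matching, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes contiguous character bigrams, the Dice coefficient as the similarity measure, and a small hypothetical list of candidate roots; the function names `bigrams`, `dice`, and `best_root` are introduced here for illustration only.

```python
def bigrams(word):
    """Return the set of contiguous character bigrams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two words (assumed measure)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0  # words shorter than two characters have no bigrams
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def best_root(word, roots):
    """Pick the candidate root whose bigrams overlap most with the word."""
    return max(roots, key=lambda root: dice(word, root))


# Hypothetical candidate roots, chosen only for this example.
candidate_roots = ["كتب", "درس"]
print(best_root("كاتب", candidate_roots))
```

No morphological rule or grammatical pattern is consulted here: the choice depends only on counting shared bigrams, which is the kind of purely statistical computation the abstract refers to.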
Keywords: Root extraction, information retrieval,
bigrams, stemming, Arabic morphological rules, feature selection.
Received February 22, 2015; accepted August 12, 2015