Conditional Arabic Light Stemmer: CondLight

Yaser Al-Lahham, Khawlah Matarneh, and Mohammad Hassan

Computer Science Department, Zarqa University, Jordan

Abstract: The Arabic language has a complex morphological structure, which makes it hard to select index terms for an IR system. This complexity is caused by multimode terms, the use of diacritics, letters that take different forms according to their location in the word, and affixes that can be added at any location in a word. Several methods have been proposed to overcome these problems, such as root extraction and light stemming. Light stemming shows better retrieval efficiency. Light10 is the best stemmer among a series of light stemmers; it simply removes suffixes and prefixes if they are listed in a predefined table. Light10 places no restrictions on the affixes, so two different terms with different meanings can be reduced to the same token. This paper proposes the CondLight stemmer, which adds new prefixes and suffixes to the Light10 table and imposes a set of conditions on removing these affixes. The implementation and testing of the proposed method show that CondLight achieves 38% precision, whereas the Light10 stemmer achieves an average precision of 36.7%. Moreover, CondLight shows better average precision whether all of the conditions are imposed or only part of them.

Keywords: Arabic IR, light stemming, morphological analysis, affixes’ removal, term selection, Arabic document indexing.

Received February 14, 2018; accepted April 18, 2018
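
The table-driven affix removal described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the affix lists are small subsets rather than the actual Light10 or CondLight tables, the function name `strip_affixes` is hypothetical, and the single condition shown (a minimum number of remaining letters) merely stands in for the conditions the paper defines, which the abstract does not list.

```python
# Illustrative affix tables only: small subsets chosen for this example,
# NOT the full Light10 or CondLight tables. Longer prefixes come first
# so that the longest listed prefix is tried before shorter ones.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]


def strip_affixes(word, min_stem_len=0):
    """Remove at most one listed prefix and one listed suffix.

    min_stem_len is a hypothetical condition: an affix is removed only
    if at least that many letters remain afterwards. With the default
    of 0, removal is effectively unconditional (one letter must remain).
    """
    required = max(min_stem_len, 1)

    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= required:
            word = word[len(prefix):]
            break

    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= required:
            word = word[:-len(suffix)]
            break

    return word


# Unconditional removal treats the first letter of "ولد" (boy) as a prefix.
print(strip_affixes("ولد"))
# With the illustrative condition, the word is left intact.
print(strip_affixes("ولد", min_stem_len=3))
# A genuine definite article is still removed under the condition.
print(strip_affixes("الكتاب", min_stem_len=3))
```

In this sketch, the unconditional call strips a letter that belongs to the word itself, which is the kind of over-stemming that conditions on affix removal are meant to limit.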
