An Arabic Lemma-Based Stemmer for Latent Topic Modeling

Abderrezak Brahmi¹, Ahmed Ech-Cherif², and Abdelkader Benyettou²
¹Department of Computer Sciences, Abdelhamid Ibn Badis University, Mostaganem, Algeria
²Department of Computer Sciences, USTO-MB University, Oran, Algeria
 

Abstract: 
Developments in Arabic information retrieval have not kept pace with the growing use of the Arabic Web over the last decade. Semantic indexing in a language with highly inflectional morphology, such as Arabic, is not a trivial task and requires analyzing the text in its original language. Apart from cross-language retrieval methods and a few limited studies, the main efforts to develop semantic analysis methods and topic models have not included Arabic text. This paper describes our approach to analyzing semantics in Arabic texts. A new lemma-based stemmer is developed and compared with a root-based one for characterizing Arabic text. The Latent Dirichlet Allocation (LDA) model is adapted to extract Arabic latent topics from various real-world corpora. In addition to uncovering interesting subjects in press articles from the 2007-2009 period, the experiments show that classification performance in the topic space improves with lemma-based stemming compared with root-based stemming.

Keywords: Arabic stemming, topic model, semantic analysis, classification, test collection.
 
Received October 22, 2010; accepted May 24, 2011