Clustering with Probabilistic Topic Models on Arabic Texts: A Comparative Study of LDA and K-means

Abdessalem Kelaiaia1 and Hayet Merouani2

1Department of Computer Sciences, University of May 08, Algeria
2Department of Computer Sciences, University of Badji Mokhtar, Algeria


Abstract: Probabilistic topic models such as Latent Dirichlet Allocation (LDA) have recently been widely used in text mining tasks such as retrieval, summarization and clustering across different languages. In this paper, we present a first comparative study of LDA and K-means, two well-known methods for topic identification and clustering respectively, applied to Arabic texts. Our aim is to compare how the morpho-syntactic characteristics of the Arabic language influence the performance of the first method relative to the second. To study different aspects of these methods, the study is conducted on four benchmark document collections, and clustering quality is measured with four well-known evaluation measures: Rand index, Jaccard index, F-measure and Entropy. The results show that LDA outperforms K-means in most cases.
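
For readers unfamiliar with the evaluation measures named in the abstract, the following is a minimal sketch of three of them (Rand index, Jaccard index and Entropy) using their common textbook definitions. The abstract does not give the formulas the authors used, so these definitions, the function names and the toy labels are assumptions for illustration only; F-measure is omitted because it has several variants for clustering.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(true_labels, cluster_labels):
    """Count document pairs by class/cluster agreement (assumed pair-counting scheme)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            ss += 1  # same class, same cluster
        elif same_class:
            sd += 1  # same class, different clusters
        elif same_cluster:
            ds += 1  # different classes, same cluster
        else:
            dd += 1  # different classes, different clusters
    return ss, sd, ds, dd


def rand_index(true_labels, cluster_labels):
    """Fraction of pairs on which the clustering and the reference classes agree."""
    ss, sd, ds, dd = pair_counts(true_labels, cluster_labels)
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard_index(true_labels, cluster_labels):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    ss, sd, ds, _ = pair_counts(true_labels, cluster_labels)
    denom = ss + sd + ds
    return ss / denom if denom else 0.0  # convention chosen here for the empty case


def clustering_entropy(true_labels, cluster_labels):
    """Size-weighted average entropy of class labels inside each cluster (lower is better)."""
    n = len(true_labels)
    total = 0.0
    for cluster in set(cluster_labels):
        members = [t for t, c in zip(true_labels, cluster_labels) if c == cluster]
        counts = Counter(members)
        h = -sum((k / len(members)) * log2(k / len(members)) for k in counts.values())
        total += (len(members) / n) * h
    return total


# Hypothetical toy data: four documents, two reference classes, two clusters.
classes = [0, 0, 1, 1]
clusters = [0, 0, 1, 0]
print(rand_index(classes, clusters))
print(jaccard_index(classes, clusters))
print(clustering_entropy(classes, clusters))
```

Rand and Jaccard reward agreement between the clustering and the reference classes (higher is better), while Entropy measures how mixed the classes are within each cluster (lower is better), which is why the measures are usually reported together.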

Keywords: Clustering, topic identification, Arabic text, LDA, K-means, preprocessing.

Received November 7, 2012; accepted November 27, 2013

 
