Arabic Text Classification Using K-Nearest Neighbour Algorithm

Roiss Alhutaish and Nazlia Omar

Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia

Abstract: Many algorithms have been applied to the problem of Automatic Text Categorization (ATC). Most of the work in this area has been carried out on English texts, with only a few researchers addressing Arabic texts. We investigated the use of the K-NN classifier with a new similarity measure (Inew) and with the Cosine, Jaccard, and Dice similarities in order to enhance Arabic ATC. We represent the dataset as both unstemmed and stemmed data, using TREC-2002 to remove prefixes and suffixes. For statistical text representation, Bag-Of-Words (BOW) and character-level 3-grams (3-Gram) were used. To reduce the dimensionality of the feature space, we used several feature selection methods. Experiments conducted with Arabic text showed that the K-NN classifier with the new similarity measure (Inew) achieved 92.6% Macro-F1 and outperformed the K-NN classifier with the Cosine, Jaccard, and Dice similarities. Chi-Square feature selection with the BOW representation gave the best performance among the feature selection methods tested with BOW and 3-Gram.

 Keywords: ATC, K-NN, similarity measures, feature selection methods.
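
The similarity measures and representations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the set-based forms of Jaccard and Dice, the raw term counts, and the toy English data are all assumptions made for illustration. The new measure (Inew) is not defined in the abstract, so it is not reproduced here, and stemming and feature selection are omitted.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Character n-grams of a string (the 3-Gram representation when n=3)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_words(text):
    """Bag-Of-Words: whitespace tokens mapped to raw counts (assumed weighting)."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Set-based Jaccard: shared terms divided by all distinct terms."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dice(a, b):
    """Set-based Dice: twice the shared terms divided by the sum of set sizes."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def knn_classify(query, training, k=3, similarity=cosine):
    """Majority vote among the k training documents most similar to the query.

    `training` is a list of (vector, label) pairs; a hypothetical layout.
    """
    ranked = sorted(training, key=lambda item: similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, purely illustrative.
training = [
    (bag_of_words("goal match team"), "sport"),
    (bag_of_words("team player goal"), "sport"),
    (bag_of_words("bank market price"), "economy"),
]
query = bag_of_words("goal team win")
label = knn_classify(query, training, k=3, similarity=jaccard)
```

Passing a different function as `similarity` swaps the measure without changing the classifier, which mirrors the comparison described in the abstract. To use the 3-Gram representation instead of BOW, build the vectors with `Counter(char_ngrams(text))`.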

 

Received May 3, 2012; accepted March 13, 2014
