Arabic Text Classification Using K-Nearest Neighbour Algorithm
Roiss Alhutaish and Nazlia Omar
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia
Abstract: Many algorithms have been applied to the problem of Automatic Text Categorization (ATC). Most of the work in this area has been carried out on English texts, with only a few researchers addressing Arabic texts. We investigate the use of the K-NN classifier with a new similarity measure (Inew) and with the Cosine, Jaccard, and Dice similarities, in order to enhance Arabic ATC. We represent the dataset in both unstemmed and stemmed forms, using TREC-2002 to remove prefixes and suffixes. For statistical text representation, we used Bag-Of-Words (BOW) and character-level 3-grams (3-Gram). To reduce the dimensionality of the feature space, we used several feature selection methods. Experiments conducted on Arabic text showed that the K-NN classifier with the new similarity measure (Inew) achieved 92.6% Macro-F1 and outperformed the K-NN classifier with the Cosine, Jaccard, and Dice similarities. Chi-Square feature selection with the BOW representation gave the best performance among the feature selection methods tested with BOW and 3-Gram.
Keywords: ATC, K-NN, similarity measures, feature selection methods.
Received May 3, 2012; accepted March 13, 2014
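As an illustration of the standard measures named in the abstract, the following minimal Python sketch shows Cosine, Jaccard, and Dice similarity over BOW or character 3-gram features, combined with a majority-vote K-NN classifier. It is a sketch under stated assumptions, not the authors' implementation: the function names, the toy documents, and the choice k=3 are hypothetical, the toy tokens are ASCII placeholders rather than Arabic text, and the Inew measure is omitted because it is not defined in this section.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Bag-Of-Words features: whitespace tokens mapped to their counts."""
    return Counter(text.split())


def char_ngrams(text, n=3):
    """Character-level n-gram features (n=3 gives 3-grams) mapped to counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity on the sets of terms: |A and B| / |A or B|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dice(a, b):
    """Dice similarity on the sets of terms: 2|A and B| / (|A| + |B|)."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def knn_predict(query, training, k=3, similarity=cosine, featurize=bag_of_words):
    """Label a query by majority vote among its k most similar training documents."""
    q = featurize(query)
    scored = sorted(
        ((similarity(q, featurize(doc)), label) for doc, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus (placeholder tokens, not taken from the paper's dataset).
training = [
    ("goal match team", "sport"),
    ("team player goal", "sport"),
    ("bank market price", "economy"),
    ("price market trade", "economy"),
]
predicted = knn_predict("goal team win", training, k=3, similarity=cosine)
```

Passing `similarity=jaccard` or `similarity=dice`, or `featurize=char_ngrams`, swaps the measure or the representation without changing the classifier, which mirrors the kind of comparison the abstract describes.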