Feature Selection Method Based on Statistics of Compound Words for Arabic Text Classification
Aisha Adel, Nazlia Omar, Mohammed Albared, and Adel Al-Shabi
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia
Abstract: One of the main problems of text
classification is the high dimensionality of the feature space. Feature
selection methods are normally used to reduce the dimensionality of datasets to
improve the performance of the classification, or to reduce the processing
time, or both. This paper presents a feature selection algorithm, based on
terminology extracted from the statistics of compound words, that reduces the
dimensionality of the feature space and thereby improves text classification
performance. The proposed method is evaluated as a standalone method and in
combination with other feature selection methods (two-stage method). The
performance of the proposed algorithm is compared with that of six
well-known feature selection methods: Information Gain, Chi-Square,
Gini Index, Support Vector Machine-Based, Principal Component Analysis, and
Symmetric Uncertainty. A wide range of comparative experiments was conducted
on three standard Arabic datasets with three classification algorithms. The
experimental results show the superiority of the proposed method both as a
standalone method and in the two-stage scenario. It outperforms traditional
approaches in classification performance, with a 6-10% gain in
macro-averaged F1.
Keywords: Feature selection method, compound words, Arabic text
classification.
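
The abstract reports its gain in macro-averaged F1, which is the unweighted mean of the per-class F1 scores, so that small classes count as much as large ones. The sketch below shows the standard definition only; it is not the authors' evaluation code, and the function name macro_f1 and the class labels in the usage note are illustrative assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with hypothetical labels y_true = ["sport", "sport", "economy", "economy", "culture", "culture"] and y_pred = ["sport", "sport", "economy", "culture", "culture", "culture"], the per-class F1 scores are 1.0, 2/3, and 0.8, so macro_f1 returns their mean, roughly 0.822.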