Direct Text Classifier for Thematic Arabic Discourse
Documents
Khalid Nahar1, Ra’ed Al-Khatib1,
Moy'awiah Al-Shannaq1, Mohammad Daradkeh2, and Rami
Malkawi3
1Department of Computer Sciences, Yarmouk
University, Jordan
2Department of Management Information System, Yarmouk
University, Jordan
3Department of Computer Information System, Yarmouk
University, Jordan
Abstract: Maintaining topical coherence while writing a
discourse is a major challenge for novice and experienced writers alike.
This challenge is even greater for Arabic discourse because of the
complex morphology and the prevalence of synonyms in the Arabic language. In this
research, we present a direct classification of Arabic discourse documents while
they are being written. The proposed prescriptive framework consists of the following stages:
data collection, pre-processing, construction of a Language Model (LM), topic
identification, topic classification, and topic notification. To
demonstrate the proposed framework, we designed a system and applied it to a
corpus of 2800 Arabic discourse documents synthesized into four predefined
topics: Culture, Economy, Sport, and Religion. System performance was
analysed in terms of accuracy, recall, precision, and F-measure. The results
demonstrate that the proposed topic modeling-based decision framework can
classify topics while a discourse is being written, with an accuracy of 91.0%.
Keywords: Text mining, Arabic discourse, text classification, topic modeling, n-gram
language model, topical coherence.
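
As an illustration of the idea summarized in the abstract, the following is a minimal sketch of topic classification with per-topic n-gram language models. It is not the system evaluated in this paper: the bigram order, add-one smoothing, whitespace tokenization, the names BigramLM, train, and classify, and the four toy sentences are all assumptions made for the example. Each topic gets its own bigram model, and a partially written text is assigned to the topic whose model gives it the highest log-likelihood.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-one smoothing (an assumed choice)."""

    def __init__(self, documents):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        for doc in documents:
            padded = ["<s>"] + doc.split() + ["</s>"]
            for prev, word in zip(padded, padded[1:]):
                self.bigram_counts[(prev, word)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, text, vocab_size):
        """Smoothed log-likelihood of a whitespace-tokenized text."""
        padded = ["<s>"] + text.split() + ["</s>"]
        total = 0.0
        for prev, word in zip(padded, padded[1:]):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def train(corpus):
    """Build one model per topic; corpus maps topic -> list of documents."""
    vocab = {"</s>"}
    for documents in corpus.values():
        for doc in documents:
            vocab.update(doc.split())
    models = {topic: BigramLM(docs) for topic, docs in corpus.items()}
    return models, len(vocab)


def classify(text, models, vocab_size):
    """Return the topic whose model scores the text highest."""
    return max(models, key=lambda t: models[t].log_prob(text, vocab_size))


# Toy training data, invented for this sketch (not from the paper's corpus).
toy_corpus = {
    "Sport": ["فاز الفريق في المباراة", "سجل اللاعب هدفا في المباراة"],
    "Economy": ["ارتفع سعر النفط في السوق", "انخفض سعر الذهب في السوق"],
}
models, vocab_size = train(toy_corpus)

# Classify a text fragment as it might look part-way through writing.
print(classify("فاز الفريق", models, vocab_size))
```

Because the score is computed from whatever text is available, the same classify call can be repeated as the writer adds sentences, which is the sense in which classification happens while writing. A real system would also need the Arabic pre-processing stage that this sketch omits.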