Issues of Dialectal Saudi Twitter Corpus
Meshrif Alruily
College of Computer and Information Sciences, Jouf University, Saudi
Arabia
Abstract: Text mining research relies heavily on the
availability of a suitable corpus. This paper presents a dialectal Saudi corpus
that contains 207,452 tweets generated by Saudi Twitter users. In addition, the
Saudi tweets dataset was compared with an Egyptian Twitter corpus and an Arabic
top news raw corpus (representing Modern Standard Arabic (MSA)) in various
aspects, such as the differences between formal and colloquial texts.
Moreover, the issues and phenomena that arise in this type of dataset, such as
shortening, concatenation, colloquial language, compounding, foreign language,
spelling errors and neologisms, were investigated.
Keywords: Microblogs, tweets, Saudi colloquial,
corpus, Modern Standard Arabic.
Received January 27, 2018; accepted August 13,
2018
https://doi.org/10.34028/iajit/17/3/10