Issues of Dialectal Saudi Twitter Corpus

Meshrif Alruily

College of Computer and Information Sciences, Jouf University, Saudi Arabia

Abstract: Text mining research relies heavily on the availability of a suitable corpus. This paper presents a dialectal Saudi corpus containing 207452 tweets generated by Saudi Twitter users. The Saudi tweets dataset is compared with an Egyptian Twitter corpus and an Arabic top news raw corpus (representing Modern Standard Arabic (MSA)) in various aspects, such as the differences between formal and colloquial texts. In addition, the issues and phenomena found in this type of dataset, such as shortening, concatenation, colloquial language, compounding, foreign language, spelling errors and neologisms, are investigated.

Keywords: Microblogs, tweets, Saudi colloquial, corpus, Modern Standard Arabic.

Received January 27, 2018; accepted August 13, 2018
https://doi.org/10.34028/iajit/17/3/10
