Google N-Gram Viewer does not Include Arabic Corpus!
Towards N-Gram Viewer for Arabic Corpus
Izzat Alsmadi¹ and Mohammad Zarour²
¹Department of Computing and Cyber Security, Texas A&M University, USA
²Information Systems Department, Prince Sultan University, KSA
Abstract: Google N-gram Viewer is one of Google's more recently published
services. Google has digitized and archived a large number of books in
different languages and populated the viewer's corpora from over 5 million
books published up to 2008. The service lets users enter word queries and then
charts time-based data showing how frequently the query words were used.
Although Arabic is one of the most widely spoken languages in the world, it is
not among the corpora indexed by Google N-gram Viewer. This research work
discusses the development of a large Arabic corpus and its indexing using
N-grams so that it can be included in Google N-gram Viewer. A showcase is
presented in which a dataset is built to initiate the process of digitizing
Arabic content and preparing it for incorporation into Google N-gram Viewer.
One of the major goals of including Arabic content in Google N-gram Viewer is
to enrich public Arabic content, which has been very limited relative to the
number of people who speak Arabic. We believe that the adoption of the Arabic
language by Google N-gram Viewer can significantly benefit researchers in
different fields related to the Arabic language and the social sciences.
Keywords: Arabic language processing, corpus, Google N-gram viewer.
Received May 7, 2015; accepted September 20, 2015