Exploiting Multilingual Wikipedia to Improve Arabic
Named Entity Resources
Mariam Biltawi, Arafat Awajan,
Sara Tedmori, and Akram Al-Kouz
King Hussein Faculty of Computing Sciences, Princess Sumaya University for
Technology, Jordan
Abstract: This paper focuses on the creation of Arabic named entity
gazetteers by exploiting Wikipedia and using a Naïve Bayes classifier to
classify named entities into three main categories: person, location, and
organization. The process of building the gazetteer starts with automatically
creating the datasets. The training dataset is constructed using only Arabic
text, whereas the testing dataset is derived from English text using the
Stanford Named Entity Recognizer. Each English named entity is then checked
for existence as a Wikipedia page title. If such a page exists, a check for a
parallel Arabic page is conducted. Finally, the Naïve Bayes classifier is
applied to verify the tag of the Arabic named entity or to assign it a new
one. Due to the lack of available resources, the proposed system is evaluated
manually by calculating accuracy, recall, and precision. Results show an
accuracy of 53%.
Keywords: Arabic named entity resources; Naïve Bayes classifier; Wikipedia.
Received February 7, 2017; accepted May 10, 2017
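
The final step described in the abstract, using a Naïve Bayes classifier to verify or reassign the tag of an Arabic named entity, can be illustrated with a minimal sketch. The abstract does not state which features the classifier uses, so everything below is an assumption made for illustration: the bag-of-words features, the Laplace smoothing, the transliterated toy tokens, and the names `NaiveBayes`, `TRAIN`, and `verify_or_assign` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing constant (assumed)
        self.class_counts = Counter()             # training examples per class
        self.token_counts = defaultdict(Counter)  # token frequencies per class
        self.vocab = set()

    def fit(self, samples):
        """samples: iterable of (tokens, label) pairs."""
        for tokens, label in samples:
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return the unnormalised log posterior of each class."""
        total_examples = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_examples in self.class_counts.items():
            score = math.log(n_examples / total_examples)  # log prior
            total_tokens = sum(self.token_counts[label].values())
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # tokens never seen in training are ignored
                count = self.token_counts[label][tok]
                score += math.log(
                    (count + self.alpha)
                    / (total_tokens + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


def verify_or_assign(model, source_tag, tokens):
    """Compare the tag carried over from the English entity with the
    classifier's prediction. Returns (final_tag, was_verified)."""
    predicted = model.predict(tokens)
    return predicted, predicted == source_tag


# Toy training data: transliterated context words, purely hypothetical.
TRAIN = [
    (["wulida", "katib", "shair"], "PERSON"),
    (["wulida", "rais", "katib"], "PERSON"),
    (["madina", "asima", "nahr"], "LOCATION"),
    (["madina", "muhafaza", "asima"], "LOCATION"),
    (["sharika", "muassasa", "taassasat"], "ORGANIZATION"),
    (["jamia", "sharika", "taassasat"], "ORGANIZATION"),
]

model = NaiveBayes().fit(TRAIN)
```

In this sketch, `verify_or_assign(model, "PERSON", ["sharika", "taassasat"])` overrides the tag carried over from the English entity because the classifier favours the organization class, whereas a matching prediction leaves the original tag in place and marks it as verified.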