Exploring the Potential of Schemes in Building NLP Tools for Arabic
Language
Mohamed Ben Mohamed, Souheyl Mallat, Mohamed Nahdi and Mounir Zrigui
LaTICE Laboratory, Faculty of Sciences of Monastir, Tunisia
Abstract: Arabic is known for its sparseness, which makes its automatic
processing difficult. The Arabic language is based on schemes: lemmas are
produced by derivation from roots and schemes. This property offers two major
advantages. First, the “hidden side” of Arabic formed by its schemes suffers
much less from sparseness, since the schemes form a finite set. Second, schemes
retain a large number of the language's features in a much smaller vocabulary.
Schemes therefore have great potential for building accurate natural language
processing tools for Arabic. In this work, we explore this potential by
building NLP tools that rely entirely on schemes. The work covers text
classification and Probabilistic Context-Free Grammar (PCFG) parsing.
Keywords: Arabic language, schemes, roots, derivation, text classification, PCFG, parsing
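
To make the root-and-scheme derivation mentioned in the abstract concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a simplified Latin transliteration, triliteral roots only, and a hypothetical scheme notation in which the digits 1, 2 and 3 mark the slots for the root consonants. The function name `apply_scheme` and the sample roots and schemes are chosen here for illustration.

```python
# Illustrative sketch only: derive lemmas by filling a scheme with a root.
# Assumptions (not from the paper): Latin transliteration, triliteral roots,
# and digits 1/2/3 as placeholders for the three root consonants.

def apply_scheme(root: str, scheme: str) -> str:
    """Fill the digit slots 1, 2, 3 of a scheme with the root's consonants."""
    if len(root) != 3:
        raise ValueError("this sketch handles triliteral roots only")
    slots = {"1": root[0], "2": root[1], "3": root[2]}
    # Characters that are not slots (vowels, affixes) are copied unchanged.
    return "".join(slots.get(ch, ch) for ch in scheme)

# Three sample roots and three sample schemes (transliterated).
roots = ["ktb", "drs", "qtl"]
schemes = ["1a2a3a", "1aa2i3", "ma12uu3"]

# Every root/scheme pair yields a distinct surface lemma, so the lemma
# vocabulary grows with the number of roots while the scheme set stays fixed.
lemmas = {apply_scheme(r, s) for r in roots for s in schemes}

if __name__ == "__main__":
    for r in roots:
        print(r, [apply_scheme(r, s) for s in schemes])
    print(len(lemmas), "lemmas from", len(schemes), "schemes")
```

In this toy setting the number of distinct lemmas is the number of roots times the number of schemes, whereas the scheme inventory stays constant, which is the intuition behind the reduced sparseness claimed above.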