A Deep Learning Approach for the Romanized Tunisian Dialect Identification
1Université de Tunis, ISGT, Tunisia
2Université de Tunis, ENSIT, Tunisia
Abstract: Language identification is an important task in natural language processing that consists of determining the language of a given text. It has increasingly piqued the interest of researchers in recent years, especially for informal, code-switched textual content. In this paper, we focus on identifying the Romanized, user-generated Tunisian dialect on the social web. We segment and annotate a corpus extracted from social media and propose a deep learning approach for the identification task. We use a Bidirectional Long Short-Term Memory neural network with Conditional Random Fields decoding (BLSTM-CRF). For word embeddings, we combine a word-character BLSTM vector representation with FastText embeddings, which take character n-gram features into account. The approach achieves an overall accuracy of 98.65%.
Keywords: Tunisian dialect, language identification, deep learning, BLSTM, CRF, natural language processing.
Received August 25, 2019; accepted April 28, 2020
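As a brief illustration of the character n-gram features mentioned in the abstract, the following minimal sketch shows how a word can be decomposed into FastText-style character n-grams. The function name `char_ngrams` and the n-gram range (3 to 6) are assumptions made for illustration; the abstract does not state which sizes were used, and this is not the authors' implementation.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return FastText-style character n-grams of a word.

    The word is wrapped in '<' and '>' boundary markers so that
    prefixes and suffixes are distinguished from word-internal
    sequences. min_n and max_n are illustrative defaults, not values
    taken from the paper.
    """
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


if __name__ == "__main__":
    # Trigrams only, for a short English example word.
    print(char_ngrams("where", min_n=3, max_n=3))
```

Because such subword units are shared across spelling variants, this kind of representation can remain informative for Romanized dialect text, where the same word is often written in several ways.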