Multi-Lingual Language Variety Identification using Conventional Deep Learning and Transfer Learning Approaches

  • Written by Ghadeer
  • Update: 31/08/2022

Sameeah Noreen Hameed

School of Software, East China Jiaotong University, China

Muhammad Adnan Ashraf

Department of Computer Science, Northwestern Polytechnical University, China

Qiao Ya-nan

School of Computer Science and Technology, Xi’an Jiaotong University, China

Abstract: Language variety identification aims to identify lexical and semantic variations across different varieties of a single language. It helps build the linguistic profile of an author from written text, which can be used for cyber forensics and marketing purposes. Among previous efforts on language variety identification, we found hardly any study that experiments with transfer learning approaches or performs a thorough comparison of different deep learning approaches on a range of benchmark datasets. To bridge this gap, we propose transfer learning approaches for language variety identification and compare them extensively with deep learning approaches on multiple varieties of four widely spoken languages: Arabic, English, Portuguese, and Spanish. We treat the task as a binary classification problem for Portuguese and as a multi-class classification problem for Arabic, English, and Spanish. We applied two transfer learning approaches, Bidirectional Encoder Representations from Transformers (BERT) and Universal Language Model Fine-tuning (ULMFiT); three deep learning approaches, Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and Gated Recurrent Units (GRU); and an ensemble approach to identify the different varieties. A thorough comparison suggests that the transfer learning based ULMFiT model outperforms all other approaches and produces the best accuracy for both the binary and the multi-class language variety identification tasks.
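
The abstract does not say how the ensemble combines its member models. As a minimal sketch only, assuming a simple majority vote over per-model labels (one common way to build an ensemble, not necessarily the one used in the paper), the idea can be illustrated as follows. The function name `majority_vote`, the model keys, and the variety labels `pt-BR` and `pt-PT` are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most models.

    Assumption: ties are broken in favour of the label that
    appears first in the input list.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical per-model outputs for one Portuguese text (binary task).
model_outputs = {"cnn": "pt-BR", "bilstm": "pt-PT", "gru": "pt-BR"}
ensemble_label = majority_vote(list(model_outputs.values()))
```

The same function applies unchanged to the multi-class setting, since it only counts label occurrences and does not depend on the number of classes.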

Keywords: Language variety identification, deep learning, transfer learning, binary classification.

Received July 25, 2021; accepted December 13, 2021

https://doi.org/10.34028/iajit/19/5/1
