Semantic Similarity Analysis for Corpus Development and Paraphrase Detection in Arabic
1University of Monastir, Research Laboratory in Algebra, Numbers Theory and Intelligent Systems RLANTIS, Tunisia
2University of Sousse, Higher Institute of Computer Science and Communication Techniques ISITCom, Tunisia
Abstract: Paraphrase detection determines the extent to which an original document and a suspect
document convey the same meaning. It has attracted attention from researchers in many Natural
Language Processing (NLP) tasks, such as plagiarism detection, question
answering, and information retrieval. Traditional methods (e.g., Term
Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA),
and Latent Semantic Analysis (LSA)) cannot efficiently capture hidden semantic
relations when sentences share no common words or when words rarely
co-occur. Therefore, we propose a deep learning model based on
Global Vectors for word representation (GloVe) and a Recurrent Convolutional Neural Network
(RCNN). The model proved effective at capturing more contextual dependencies between
word vectors with precise semantic meanings. Given the lack of publicly available resources for the Arabic language, we developed
a paraphrased corpus automatically. The corpus preserves the syntactic and semantic
structures of Arabic sentences by using the word2vec model and Part-Of-Speech (POS) annotation.
Overall, experiments showed that our proposed model outperformed
state-of-the-art methods in terms of precision and recall.
Keywords: Arabic language processing, word2vec, part-of-speech annotation, paraphrasing, semantic analysis, recurrent convolutional neural networks.
Received January 24, 2019; accepted February 5, 2020
https://doi.org/10.34028/iajit/18/1/1
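
The limitation of word-overlap methods noted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cosine_bow` and the English toy sentences are assumptions chosen for readability, and plain term-frequency vectors stand in for TF-IDF. The point is only that two paraphrases with no tokens in common receive a similarity of zero under a bag-of-words representation, which is the gap that embedding-based models aim to close.

```python
import math
from collections import Counter


def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two sentences.

    Hypothetical helper for illustration; tokenization is whitespace-only.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only tokens present in both sentences contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Toy sentences (assumed for illustration): same meaning, no shared tokens.
original = "the physician examined the child"
paraphrase = "a doctor checked that kid"

print(cosine_bow(original, paraphrase))  # no overlap, so the score is 0.0
print(cosine_bow(original, original))    # identical text scores close to 1
```

A dense embedding model would instead place "physician" and "doctor" near each other in vector space, so the paraphrase pair could score well above zero even without lexical overlap.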