Enhanced Latent Semantic Indexing Using Cosine
Similarity Measures for Medical Application
Fawaz Al-Anzi¹ and Dia AbuZeina²
¹Department of Computer Engineering, Kuwait University, Kuwait
²Computer Science Department, Palestine Polytechnic University, Palestine
Abstract: The Vector Space Model (VSM) is widely used in data mining and Information
Retrieval (IR) systems as a common document representation model. However,
this technique faces challenges such as a high-dimensional feature space and the
semantic looseness of the representation. Consequently, Latent Semantic
Indexing (LSI) was suggested to reduce the feature dimensions and to generate
semantically rich features that can represent conceptual term-document
associations. LSI has been effectively employed in search engines and
many other Natural Language Processing (NLP) applications, and researchers
continue to seek better performance. In this paper, we propose a
method that can be used in search engines to find better-matched
content among the retrieved documents. The proposed method introduces a new
extension of the LSI technique based on cosine similarity measures. The
performance evaluation was carried out using an Arabic-language data collection
that contains 800 medical-related documents with more than 47,222 unique
words. The proposed method was assessed using a small testing set that contains
five medical keywords. The results show that the proposed method
outperforms standard LSI.
Keywords: Arabic text, latent semantic indexing, search engine, dimensionality reduction, text classification.
Received December 25, 2018; accepted January 28, 2020
https://doi.org/10.34028/iajit/17/5/7
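
To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of the plain Vector Space Model with cosine-similarity ranking. It is not the proposed LSI extension and performs no dimensionality reduction. The function names (`term_vector`, `cosine_similarity`, `rank_documents`) and the toy English corpus are assumptions made purely for illustration; the paper's own data set is Arabic.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (score, document index) pairs, highest similarity first."""
    q = term_vector(query)
    scores = [(cosine_similarity(q, term_vector(d)), i)
              for i, d in enumerate(documents)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical toy corpus, for illustration only.
docs = [
    "diabetes insulin blood sugar",
    "heart blood pressure",
    "football match score",
]
ranking = rank_documents("blood sugar", docs)
```

Because this sketch matches only exact terms, a document that uses a synonym of the query keyword would score zero. This is the semantic looseness that LSI aims to address by comparing documents in a reduced concept space instead.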