Enhanced Latent Semantic Indexing Using Cosine Similarity Measures for Medical Application

Fawaz Al-Anzi¹ and Dia AbuZeina²

¹Department of Computer Engineering, Kuwait University, Kuwait

²Computer Science Department, Palestine Polytechnic University, Palestine

Abstract: The Vector Space Model (VSM) is widely used in data mining and Information Retrieval (IR) systems as a common document representation model. However, the technique faces challenges such as a high-dimensional feature space and the semantic looseness of the representation. Latent Semantic Indexing (LSI) was therefore proposed to reduce the feature dimensions and to generate semantically rich features that can represent conceptual term-document associations. LSI has been employed effectively in search engines and many other Natural Language Processing (NLP) applications, and researchers continue to seek better performance. In this paper, we propose a method that search engines can use to find better-matched content among the retrieved documents. The proposed method extends the LSI technique using cosine similarity measures. Performance was evaluated on an Arabic-language data collection of 800 medical documents containing more than 47,222 unique words. The method was assessed using a small test set of five medical keywords. The results show that the proposed method outperforms standard LSI.
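
For readers unfamiliar with the baseline, the following is a minimal sketch of standard LSI retrieval with cosine similarity. It is not the extension proposed in the paper. The function name `lsi_cosine_scores`, the four-term English vocabulary, and the toy term-document counts are assumptions made purely for illustration, and the sketch uses the projection variant in which documents and the query are both mapped onto the top-k left singular vectors.

```python
import numpy as np

def lsi_cosine_scores(term_doc, query, k):
    """Score documents against a query in a k-dimensional latent space.

    term_doc: (n_terms, n_docs) count matrix; query: (n_terms,) vector.
    Illustrative baseline only, not the paper's proposed method.
    """
    # Singular value decomposition of the term-document matrix.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]                      # top-k left singular vectors
    # Project every document and the query onto the same latent subspace.
    docs_k = (U_k.T @ term_doc).T       # shape (n_docs, k)
    q_k = U_k.T @ query                 # shape (k,)
    # Cosine similarity, guarding against zero-length vectors.
    denom = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return docs_k @ q_k / np.maximum(denom, 1e-12)

# Hypothetical toy data: rows are terms, columns are documents d0..d3.
terms = ["heart", "cardiac", "diabetes", "insulin"]
term_doc = np.array([
    [1, 0, 0, 0],   # heart
    [1, 1, 0, 0],   # cardiac
    [0, 0, 1, 0],   # diabetes
    [0, 0, 1, 1],   # insulin
], dtype=float)

query = np.array([1, 0, 0, 0], dtype=float)   # the single keyword "heart"
scores = lsi_cosine_scores(term_doc, query, k=2)
```

In this toy collection, document d1 contains only "cardiac" and shares no term with the query "heart", so its raw VSM cosine similarity is zero. Because "heart" and "cardiac" co-occur in d0, the two terms fall into the same latent dimension and d1 receives a high LSI score, while the diabetes-related documents d2 and d3 score near zero.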

Keywords: Arabic Text, Latent Semantic Indexing, Search Engine, Dimensionality Reduction, Text Classification.

Received December 25, 2018; accepted January 28, 2020

https://doi.org/10.34028/iajit/17/5/7
