Utilizing Corpus Statistics for Hindi Word Sense Disambiguation

Satyendr Singh and Tanveer Siddiqui

Department of Electronics and Communication, University of Allahabad, India

 

Abstract: Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense of a polysemous word in a given context. This paper compares three WSD algorithms for Hindi based on corpus statistics. The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn weights of Content Words (CWs).

These weights are used in the disambiguation process to assign a score to each sense. We experimented with four metrics for computing the weight of matching words: Term Frequency (TF), Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF), and CW count in a fixed window size. The second algorithm uses, for disambiguation, the conditional probability of words and phrases co-occurring with each sense of an ambiguous word. The third algorithm is based on the classification information model. The first method yields an overall maximum precision of 85.87% using the TF-IDF weighting scheme. The WSD algorithm using word co-occurrence statistics achieves an average precision of 68.73%, and the algorithm using the classification information model achieves an average precision of 76.34%. All three algorithms perform significantly better than the direct overlap method, which achieves an average precision of 47.87%.
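The corpus-based Lesk scoring described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the training examples of each sense are pooled into one document per sense for the TF-IDF computation, and that a sense's score is the summed weight of context words found in its weighted vocabulary (both are assumptions for illustration).

```python
from collections import Counter
import math

def tfidf_weights(sense_docs):
    """Learn per-sense TF-IDF weights for content words.

    sense_docs maps a sense label to the list of content words drawn
    from its definition and tagged training examples (hypothetical
    data layout, assumed for this sketch)."""
    n_docs = len(sense_docs)
    # Document frequency: in how many sense-documents each word appears.
    df = Counter()
    for words in sense_docs.values():
        for w in set(words):
            df[w] += 1
    weights = {}
    for sense, words in sense_docs.items():
        tf = Counter(words)
        # TF-IDF weight for every content word of this sense.
        weights[sense] = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return weights

def disambiguate(context_words, weights):
    """Score each sense as the summed weight of matching context
    words and return the highest-scoring sense."""
    scores = {sense: sum(wts.get(w, 0.0) for w in context_words)
              for sense, wts in weights.items()}
    return max(scores, key=scores.get)
```

Words shared by all senses (e.g. the target word itself) receive zero IDF and thus do not influence the choice, which is the usual motivation for IDF-style weighting over raw overlap counts.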

 Keywords: Supervised Hindi WSD, corpus-based Lesk, TF-IDF, statistical WSD, word co-occurrence, information theory, classification information model.

 Received August 15, 2013; accepted May 6, 2014

 
