Utilizing Corpus Statistics for Hindi Word Sense Disambiguation
Satyendr Singh and Tanveer Siddiqui
Department of Electronics and Communication, University of Allahabad, India
Abstract: Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense to a polysemous word in a given context. This paper compares three corpus-statistics-based algorithms for Hindi WSD. The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn weights of Content Words (CWs).
These weights are used in the disambiguation process to assign a score to each sense. We experimented with four metrics for computing the weight of matching words: Term Frequency (TF), Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF), and CW count in a fixed window size. The second algorithm uses the conditional probability of words and phrases co-occurring with each sense of an ambiguous word for disambiguation. The third algorithm is based on the classification information model. The first method yields an overall maximum precision of 85.87% using the TF-IDF weighting scheme. The WSD algorithm using word co-occurrence statistics achieves an average precision of 68.73%, and the one using the classification information model achieves 76.34%. All three algorithms perform significantly better than the direct overlap method, which achieves an average precision of 47.87%.
Keywords: Supervised Hindi WSD, corpus-based Lesk, TF-IDF, statistical WSD, word co-occurrence, information theory, classification information model.
Received August 15, 2013; accepted May 6, 2014
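To make the TF-IDF weighted matching in the corpus-based Lesk approach concrete, the following is a minimal sketch, not the authors' implementation: weights for each sense's content words are learned from the sense definitions and tagged examples, and the sense with the highest weighted overlap with the context wins. All names here are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(sense_docs):
    """Learn TF-IDF weights per sense.

    sense_docs: {sense: list of content words drawn from the sense
    definition plus its sense-tagged training examples}. Each sense's
    word list is treated as one 'document' for the IDF computation.
    """
    n_docs = len(sense_docs)
    df = Counter()  # document frequency: in how many senses a word occurs
    for words in sense_docs.values():
        df.update(set(words))
    weights = {}
    for sense, words in sense_docs.items():
        tf = Counter(words)  # term frequency within this sense's document
        weights[sense] = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return weights

def disambiguate(context_words, weights):
    """Assign each sense the summed weight of its words found in the
    context, and return the highest-scoring sense."""
    def score(sense):
        return sum(weights[sense].get(w, 0.0) for w in context_words)
    return max(weights, key=score)
```

Words shared by every sense receive zero IDF and so contribute nothing to the overlap score, which is the usual motivation for TF-IDF over raw counts in this setting.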