Development of a Hindi Named Entity Recognition System without Using Manually Annotated Training Corpus

Sujan Kumar Saha¹ and Mukta Majumder²

¹Department of Computer Science and Engineering, Birla Institute of Technology, India

²Department of Computer Science and Application, University of North Bengal, India

Abstract: Machine learning based approaches to Named Entity Recognition (NER) require a sufficiently large annotated corpus to train the classifier. Other NER resources, such as gazetteers, are also needed to make the classifier more accurate. In many languages and domains, however, the relevant NER resources are still not available, and creating adequate resources is costly and time consuming. In contrast, a large amount of resources and several NER systems are available for resource-rich languages such as English. Suitable language adaptation techniques, the NER resources of a resource-rich language, and minimally supervised learning may help to overcome this scarcity. In this paper we study a few such techniques in order to develop a Hindi NER system. Without using any Hindi NE annotated corpus, the developed system achieves a reasonable F-measure of 73.87.

Keywords: Natural language processing, machine learning, named entity recognition, resource scarcity, language transfer, semi-supervised learning.

Received July 22, 2015; accepted October 7, 2015
  