Development of a Hindi Named Entity Recognition System
without Using Manually Annotated Training Corpus
Sujan Kumar Saha¹ and Mukta Majumder²
¹ Department of Computer Science and Engineering, Birla Institute of Technology, India
² Department of Computer Science and Application, University of North Bengal, India
Abstract: Machine learning based approaches to Named Entity Recognition
(NER) require a sufficiently large annotated corpus to train the classifier. Other
NER resources, such as gazetteers, are also needed to make the classifier more
accurate. In many languages and domains, however, the relevant NER resources are
still not available, and creating adequate resources is costly and time
consuming. In contrast, a large amount of resources and several NER systems are
available for resource-rich languages such as English. Suitable language adaptation
techniques, the NER resources of a resource-rich language, and minimally supervised
learning might help to overcome this scarcity. In this paper we study a
few such techniques in order to develop a Hindi NER system. Without using any
Hindi NE annotated corpus, the developed system achieves a reasonable F-measure
of 73.87.
Keywords: Natural language processing, machine learning, named entity
recognition, resource scarcity, language transfer, semi-supervised learning.