Prediction of Part of Speech Tags for Punjabi using Support Vector Machines
Dinesh Kumar1 and Gurpreet Josan2
1Department of Information Technology, DAV Institute of Engineering and Technology, India
2Department of Computer Science, Punjabi University, India
Abstract: Part-Of-Speech (POS) tagging is the task of assigning the appropriate
POS, or lexical category, to each word in a natural language sentence. In this
paper, we address the automated annotation of POS tags for Punjabi. We
collected a corpus of around 27,000 words, drawn from stories, essays,
day-to-day conversations, poems, etc., and divided it into files of different
sizes for training and testing. Our approach uses a Support Vector Machine
(SVM) to tag Punjabi sentences. To the best of our knowledge, SVMs have not
previously been used for tagging Punjabi text. The results show that the
SVM-based tagger outperforms the existing taggers. In the existing POS taggers
for Punjabi, tagging accuracy for unknown words is lower than for known words,
whereas our proposed tagger achieves high accuracy for unknown and ambiguous
words. The average accuracy of our tagger is 89.86%, which is better than that
of the existing approaches.
Keywords: POS tagging, SVM, feature set, vectorization, machine learning,
tagger, Punjabi, Indian languages.
Received September 18, 2013; accepted February 28, 2014; published online December 23, 2015
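
The abstract distinguishes tagging accuracy on known words from accuracy on unknown words. As a rough illustration of how such a breakdown is commonly computed, the sketch below assumes that "unknown" means "not seen in the training data". The function name, the placeholder tokens (w1, w2, ...), and the tag labels (NN, VB, JJ) are hypothetical and are not taken from the paper; the snippet does not reproduce the authors' evaluation code or results.

```python
from collections import Counter


def accuracy_by_word_type(train_words, test_tokens):
    """Return accuracy for known, unknown, and all test words.

    train_words: iterable of word forms seen during training.
    test_tokens: iterable of (word, gold_tag, predicted_tag) triples.
    A test word counts as "known" if it occurs in train_words
    (an assumption made for this sketch).
    """
    vocab = set(train_words)
    correct = Counter()
    total = Counter()
    for word, gold, pred in test_tokens:
        kind = "known" if word in vocab else "unknown"
        for bucket in (kind, "overall"):
            total[bucket] += 1
            if gold == pred:
                correct[bucket] += 1
    # Only buckets that actually occurred are reported,
    # so there is no division by zero.
    return {bucket: correct[bucket] / total[bucket] for bucket in total}


# Toy data with placeholder tokens and hypothetical tag labels.
train_words = ["w1", "w2", "w3"]
test_tokens = [
    ("w1", "NN", "NN"),  # known, correct
    ("w2", "VB", "NN"),  # known, wrong
    ("w3", "NN", "NN"),  # known, correct
    ("w4", "NN", "NN"),  # unknown, correct
    ("w5", "JJ", "NN"),  # unknown, wrong
]
scores = accuracy_by_word_type(train_words, test_tokens)
for bucket, value in sorted(scores.items()):
    print(f"{bucket}: {value:.2%}")
```

Reporting the three figures separately makes it visible when a tagger's overall accuracy hides weak performance on words absent from the training files.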