Designing Punjabi Poetry Classifiers Using Machine Learning and Different Textual Features
Jasleen Kaur1 and Jatinderkumar Saini2
1Department of Computer Engineering, PP Savani University, India
2Symbiosis Institute of Computer Studies and Research, India
Abstract: Analysing poetic text is challenging from a computational
linguistics perspective, and literary arts, especially poetry, are particularly
difficult to classify automatically. For a library recommendation system, poems
can be classified by various criteria such as poet, time period, sentiment, and
subject matter. In this work, a content-based Punjabi poetry classifier was
developed using the Weka toolset. Four categories were manually populated with
2034 poems: Nature and Festival (NAFE), Linguistic and Patriotic (LIPA),
Relation and Romantic (RORE), and Philosophy and Spiritual (PHSP), consisting
of 505, 399, 529, and 601 poems, respectively.
These poems were passed through several pre-processing sub-phases: tokenization,
noise removal, stop word removal, and special symbol removal. The 31938 extracted
tokens were weighted using the Term Frequency (TF) and Term Frequency-Inverse
Document Frequency (TF-IDF) weighting schemes. Based on poetry elements, three
types of textual features (lexical, syntactic, and semantic) were used to build
classifiers with different machine learning algorithms: Naive Bayes (NB),
Support Vector Machine (SVM), Hyper pipes, and K-nearest neighbour. The results
revealed that semantic features performed better than lexical and syntactic
features. The best-performing algorithm was SVM, and the highest accuracy
(76.02%) was achieved by incorporating semantic information associated with
words.
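
The TF and TF-IDF weighting mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Weka pipeline: the toy corpus, the function names, and the specific formula idf(t) = ln(N / df(t)) are assumptions chosen for illustration, and the exact TF-IDF variant used in the paper is not stated here.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each "poem" is assumed to be already tokenized
# and cleaned (noise, stop words, and special symbols removed).
poems = [
    ["rain", "river", "festival", "rain"],
    ["nation", "language", "nation"],
    ["love", "mother", "rain"],
]

def term_frequency(tokens):
    """Raw count of each token in a single document (TF)."""
    return Counter(tokens)

def inverse_document_frequency(docs):
    """idf(t) = ln(N / df(t)); df(t) is the number of documents containing t."""
    n_docs = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    return {token: math.log(n_docs / count) for token, count in df.items()}

def tf_idf(docs):
    """One {token: weight} dictionary per document, weight = tf * idf."""
    idf = inverse_document_frequency(docs)
    return [
        {token: tf * idf[token] for token, tf in term_frequency(doc).items()}
        for doc in docs
    ]

weights = tf_idf(poems)
```

In this toy corpus, "rain" occurs in two of the three poems and so receives a lower TF-IDF weight than "nation", which occurs twice in a single poem; plain TF would weight the two tokens equally within their own poems.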
Keywords: Classification, naïve Bayes, hyper pipes, k-nearest neighbour,
Punjabi, poetry, support vector machine, WordNet.
Received April 7, 2017; accepted July 8, 2018
https://doi.org/10.34028/iajit/17/1/5