Semantic Similarity based Web Document
Classification Using Support Vector Machine
Kavitha Chinniyan, Sudha Gangadharan, and Kiruthika Sabanaikam
Department of Computer Science and Engineering, PSG College of Technology, India
Abstract: With the rapid growth of information on the World
Wide Web (WWW), classification of web documents has become important for
efficient information retrieval. The relevance of the retrieved information can
also be improved by considering the semantic relatedness between words, a basic
research problem in natural language processing, intelligent retrieval,
document clustering and classification, word sense disambiguation, etc. Semantic
relationships mined from a huge web corpus through a web search engine can improve
the classification of documents. This paper proposes an approach for web document
classification that exploits information returned by a web search engine, namely
page counts and snippets. To identify the semantic relations between the query words, a lexical
pattern extraction algorithm is applied to the snippets. A sequential pattern
clustering algorithm then groups the extracted patterns into clusters. The page
count based measures are combined with the clustered patterns to define the
features extracted from the word pairs. These features are used to train a
Support Vector Machine (SVM) that classifies the web documents. Experimental
results demonstrate improvements of 5% and 9% in the classifier's F1 measure on
the Reuters-21578 and 20 Newsgroups datasets, respectively.
Keywords: Document classification, text mining, SVM, latent semantic indexing.
Received October 4, 2013, accepted March 19, 2014
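
To make the feature construction summarized in the abstract more concrete, the following is a minimal sketch in Python and not the authors' implementation. The abstract does not name the page count based measures, so the Jaccard, overlap, Dice, and pointwise mutual information scores used here are assumptions, as is the index size N_PAGES. The pattern extractor is a simplified stand-in for the lexical pattern extraction algorithm, and pattern_to_cluster is a hypothetical mapping standing in for the output of the sequential pattern clustering step. All function names are illustrative.

```python
import math
import re

# Assumed number of pages indexed by the search engine (hypothetical value).
N_PAGES = 10**10


def page_count_measures(h_p, h_q, h_pq, n=N_PAGES):
    """Co-occurrence scores from the page counts of P, Q, and "P AND Q".

    The choice of these four measures is an assumption; the abstract only
    says that page count based measures are used.
    """
    if h_pq == 0:
        return {"jaccard": 0.0, "overlap": 0.0, "dice": 0.0, "pmi": 0.0}
    return {
        "jaccard": h_pq / (h_p + h_q - h_pq),
        "overlap": h_pq / min(h_p, h_q),
        "dice": 2 * h_pq / (h_p + h_q),
        "pmi": math.log2((h_pq / n) / ((h_p / n) * (h_q / n))),
    }


def extract_patterns(snippet, p, q, max_gap=4):
    """Replace the query words with X and Y, then collect each token span
    that runs from one placeholder to the nearest following occurrence of
    the other placeholder, with at most max_gap words in between."""
    tokens = re.findall(r"\w+", snippet.lower())
    tokens = ["X" if t == p else "Y" if t == q else t for t in tokens]
    patterns = []
    for i, tok in enumerate(tokens):
        if tok not in ("X", "Y"):
            continue
        other = "Y" if tok == "X" else "X"
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j] == other:
                patterns.append(" ".join(tokens[i:j + 1]))
                break
    return patterns


def feature_vector(counts, snippets, p, q, pattern_to_cluster, n_clusters):
    """Concatenate the page count scores with per-cluster pattern counts."""
    scores = page_count_measures(*counts)
    cluster_freq = [0] * n_clusters
    for snippet in snippets:
        for pattern in extract_patterns(snippet, p, q):
            cluster = pattern_to_cluster.get(pattern)
            if cluster is not None:
                cluster_freq[cluster] += 1
    return [scores["jaccard"], scores["overlap"],
            scores["dice"], scores["pmi"]] + cluster_freq


# Usage with made-up page counts and a single made-up snippet.
vec = feature_vector(
    counts=(100, 50, 25),
    snippets=["An ostrich is a large bird that cannot fly"],
    p="ostrich",
    q="bird",
    pattern_to_cluster={"X is a large Y": 0},
    n_clusters=2,
)
```

Vectors of this form, one per word pair, are the kind of input that could then be passed to an SVM trainer. The classifier itself is omitted here because the abstract does not specify its kernel or parameters.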