Semantic Similarity based Web Document Classification Using Support Vector Machine

Kavitha Chinniyan, Sudha Gangadharan, and Kiruthika Sabanaikam

Department of Computer Science and Engineering, PSG College of Technology, India

Abstract: With the rapid growth of information on the World Wide Web (WWW), classification of web documents has become important for efficient information retrieval. The relevance of retrieved information can be improved by considering the semantic relatedness between words, a fundamental research topic in natural language processing, intelligent retrieval, document clustering and classification, and word sense disambiguation. Semantic relationships derived from web search engine results over a huge web corpus can improve the classification of documents. This paper proposes an approach to web document classification that exploits both page counts and snippets. To identify the semantic relations between the query words, a lexical pattern extraction algorithm is applied to the snippets. A sequential pattern clustering algorithm is then used to group the extracted patterns into clusters. The page count based measures are combined with the clustered patterns to define the features extracted from the word pairs. These features are used to train a Support Vector Machine (SVM) to classify the web documents. Experimental results demonstrate improvements of 5% and 9% in the classifier's F1 measure on the Reuters-21578 and 20 Newsgroups datasets, respectively.
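
The feature construction described in the abstract can be illustrated with a minimal sketch. The abstract does not name the page count based measures or the feature layout, so everything here is an assumption: the Jaccard and Dice coefficients stand in for the page count measures, and the function names, page counts, and per-cluster pattern counts are hypothetical.

```python
# Illustrative sketch only: the measures (Jaccard, Dice) and the feature
# layout are assumptions, not the method confirmed by the paper.

def web_jaccard(count_p, count_q, count_pq):
    """Jaccard coefficient from page counts for P, Q, and the query 'P AND Q'."""
    denom = count_p + count_q - count_pq
    return count_pq / denom if denom > 0 else 0.0


def web_dice(count_p, count_q, count_pq):
    """Dice coefficient from the same three page counts."""
    denom = count_p + count_q
    return 2 * count_pq / denom if denom > 0 else 0.0


def word_pair_features(count_p, count_q, count_pq, cluster_counts):
    """Concatenate page count measures with per-cluster pattern counts.

    cluster_counts[i] is the (hypothetical) number of lexical patterns from
    cluster i found in the snippets retrieved for the word pair.
    """
    return [
        web_jaccard(count_p, count_q, count_pq),
        web_dice(count_p, count_q, count_pq),
    ] + [float(c) for c in cluster_counts]


# Made-up page counts and pattern cluster counts for one word pair.
features = word_pair_features(1000, 500, 250, [3, 0, 1])
print(features)
```

Vectors of this shape, one per word pair, could then serve as training input to an SVM implementation of the reader's choice.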

 

Keywords: Document classification, text mining, SVM, latent semantic indexing.

 

Received October 4, 2013, accepted March 19, 2014
