Experimenting N-Grams in Text Categorization

Abdellatif Rahmoun and Zakaria Elberrichi

Faculty of Computer and Information Technology, University of King Faisal, KSA

Abstract: This paper deals with the automatic supervised classification of documents. The suggested approach is based on a vector representation of documents centred not on words but on character n-grams for varying n. The effects of this method are examined in several experiments using the multivariate chi-square to reduce dimensionality, the cosine and Kullback-Leibler distances, and two benchmark corpora for evaluation: the Reuters-21578 newswire articles and the 20 Newsgroups data. Evaluation was done using the macroaveraged F1 measure. The results show the effectiveness of this approach compared to the bag-of-words and stem representations.

Keywords: Text categorization, n-grams, multivariate chi-square, cosine measure, Reuters-21578, 20 Newsgroups.

Received April 5, 2006; accepted June 1, 2006
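
As an illustration of the representation described in the abstract, the following is a minimal sketch of how a document might be turned into a vector of character n-gram counts and compared to another document with the cosine measure. It is not the authors' implementation: the function names `char_ngrams` and `cosine_similarity`, the lowercasing and whitespace normalization, and the use of raw counts as weights are all assumptions made for this example. The chi-square dimensionality reduction and the Kullback-Leibler distance mentioned in the abstract are not shown.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return a Counter of the overlapping character n-grams of length n."""
    # Assumption: lowercase and collapse whitespace before extraction.
    # The abstract does not state which preprocessing the paper applies.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical example documents, chosen only to exercise the functions.
doc_a = char_ngrams("the cat sat on the mat", 3)
doc_b = char_ngrams("the cat sat on the hat", 3)
doc_c = char_ngrams("quarterly profits rose", 3)

similar = cosine_similarity(doc_a, doc_b)
dissimilar = cosine_similarity(doc_a, doc_c)
```

Because n-grams overlap, two documents that differ by a single letter still share most of their n-grams, so `similar` comes out much higher than `dissimilar`. Varying `n` changes how much local context each feature captures, which is the parameter the paper experiments with.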
