An Anti-Spam Filter Based on One-Class IB Method in
Small Training Sets
Chen Yang1, Shaofeng Zhao2, Dan Zhang3, and Junxia Ma1
1School of Software Engineering, Zhengzhou University of Light Industry, China
2Henan University of Economics and Law, China
3Geophysical Exploration Center of China Earthquake Administration, China
Abstract: We present an approach to email filtering based on the one-class
Information Bottleneck (IB) method for small training sets. When the themes of
emails change continually, the available training set that is highly relevant
to the current theme will be small. Hence, we further show how to estimate the
learning algorithm and how to filter spam with small training sets.
First, to preserve classification accuracy and avoid over-fitting while
substantially reducing the training set size, we formulate the learning
framework as the computation of a one-class centroid averaged only over highly
positive emails. Second, we design a simple binary classification model that
filters spam by comparing the similarity between emails and centroids.
Experimental results show that, on small training sets, our method
significantly improves classification accuracy compared with currently popular
methods such as Naive Bayes, AdaBoost and SVM.
Keywords: IB method, one-class IB, anti-spam filter, small training sets.
Received September 5, 2014; accepted November 25, 2014
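
The two-step scheme outlined in the abstract (average a few emails into a centroid, then classify a new email by comparing its similarity to the centroids) can be illustrated with a minimal Python sketch. This is not the authors' implementation: cosine similarity over word-frequency distributions stands in for the IB-based measure, which the abstract does not specify, and the selection of "highly positive" emails is assumed to have been done already. The function names term_distribution, centroid, cosine and is_spam are hypothetical.

```python
from collections import Counter
import math


def term_distribution(text):
    """Return a normalized word-frequency distribution for one email."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def centroid(emails):
    """Average the word distributions of a (small) set of emails.

    Assumption: the emails passed in are the pre-selected training
    examples for one class; no selection step is performed here.
    """
    accumulated = Counter()
    for text in emails:
        for word, prob in term_distribution(text).items():
            accumulated[word] += prob
    return {word: prob / len(emails) for word, prob in accumulated.items()}


def cosine(p, q):
    """Cosine similarity between two sparse distributions (dicts).

    Assumption: used here as a simple stand-in similarity measure.
    """
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def is_spam(text, spam_centroid, ham_centroid):
    """Label an email as spam if it is closer to the spam centroid."""
    dist = term_distribution(text)
    return cosine(dist, spam_centroid) > cosine(dist, ham_centroid)
```

With two tiny training sets, for example spam_c = centroid(["win free money now", "free prize claim money"]) and ham_c = centroid(["meeting agenda for monday", "project report attached for review"]), a call such as is_spam("claim your free money", spam_c, ham_c) compares the new email against both centroids and returns a boolean label.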