Preceding Document Clustering by Graph Mining Based Maximal Frequent Termsets Preservation
Syed Shah and Mohammad Amjad
Department of Computer Engineering, Jamia Millia Islamia, India
Abstract: This paper presents an approach to document clustering. It
introduces a novel graph-mining-based algorithm for finding the frequent
termsets present in a document set. The document set is first mapped onto a
bipartite graph. Based on the results of the algorithm, the document set is
modified to reduce its dimensionality. The Bisecting K-means algorithm is then
executed over the modified document set to obtain a set of meaningful
clusters. The proposed approach, Clustering preceded by Graph Mining based
Maximal Frequent Termsets Preservation (CGFTP), is shown to produce
better-quality clusters than some classical document clustering algorithms,
and the resulting clusters are shown to be easily interpretable. Cluster
quality is measured in terms of F-measure.
Keywords: Bipartite graph, graph mining, frequent termsets mining, bisecting K-means.
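
As an illustration of the first two steps named in the abstract (mapping a document set onto a bipartite graph and mining frequent termsets from it), the following minimal sketch may help. It is not the CGFTP algorithm itself, whose details are not given here; it assumes a generic Apriori-style level-wise search in which the support of a termset is the number of document nodes adjacent to all of its term nodes. The function names (`build_bipartite`, `frequent_termsets`, `maximal_termsets`), the whitespace tokenizer, and the `min_support` parameter are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def build_bipartite(docs):
    """Map documents onto a bipartite graph, stored as term node -> set of document nodes."""
    term_to_docs = {}
    for doc_id, text in enumerate(docs):
        # Assumption: naive lowercase whitespace tokenization, no stemming or stop words.
        for term in set(text.lower().split()):
            term_to_docs.setdefault(term, set()).add(doc_id)
    return term_to_docs


def frequent_termsets(term_to_docs, min_support):
    """Level-wise search: a termset is frequent if its terms share >= min_support documents."""
    current = {
        frozenset([term]): doc_ids
        for term, doc_ids in term_to_docs.items()
        if len(doc_ids) >= min_support
    }
    frequent = dict(current)
    while current:
        next_level = {}
        # Join pairs of k-termsets that differ in exactly one term.
        for (a, docs_a), (b, docs_b) in combinations(current.items(), 2):
            union = a | b
            if len(union) == len(a) + 1 and union not in next_level:
                shared = docs_a & docs_b  # documents adjacent to every term in the union
                if len(shared) >= min_support:
                    next_level[union] = shared
        frequent.update(next_level)
        current = next_level
    return frequent


def maximal_termsets(frequent):
    """Keep only frequent termsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Toy document set (hypothetical data).
docs = [
    "graph mining cluster",
    "graph mining termset",
    "graph mining cluster",
    "kmeans cluster",
]
graph = build_bipartite(docs)
frequent = frequent_termsets(graph, min_support=2)
maximal = maximal_termsets(frequent)
```

On this toy input with `min_support=2`, the only maximal frequent termset is {graph, mining, cluster}, supported by documents 0 and 2. A later dimensionality-reduction step could, for example, retain only the terms that occur in such maximal termsets before clustering, although the exact modification used by CGFTP is not specified in the abstract.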