Machine Learning based Intelligent
Framework for Data Preprocessing
Sohail Sarwar1, Zia Ul Qayyum2, and Abdul Kaleem1
Abstract: Data preprocessing plays a pivotal role in data mining: cleansing
inconsistent, incomplete, and irrelevant data reduces cost and helps knowledge
workers make effective decisions through knowledge extraction. Prevalent
techniques are not very effective because of greater manual effort, longer
processing time, lower accuracy, and constraints on data volume. In this
research, a comprehensive,
semi-automatic preprocessing framework based on a hybrid of two machine learning
techniques, namely Conditional Random Fields (CRF) and Hidden Markov Models
(HMM), is devised for data cleansing. The proposed framework is envisaged to be
effective and flexible enough to handle data sets of any size. A collection of
inconsistent data (a customer address directory) from Pakistan
Telecommunication Company (PTCL) is used to conduct experiments for training
and validating the proposed approach. A small percentage of the semi-cleansed
data (the output of preprocessing) is passed to the hybrid of HMM and CRF for
learning, and the rest of the data is used for testing the model. Experiments show
a higher average accuracy of 95.50% for the proposed hybrid approach than for
CRF (84.5%) and HMM (88.6%) applied separately.
Keywords: Machine learning, hidden Markov model, conditional
random fields, preprocessing.
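
The evaluation protocol summarized in the abstract (a small share of the semi-cleansed records for learning, the remainder for testing, and accuracy reported on the test portion) can be sketched as follows. This is only an illustration under stated assumptions: the 10% training fraction, the token-level definition of accuracy, and the names split_records and token_accuracy are hypothetical and are not taken from the paper.

```python
import random


def split_records(records, train_fraction=0.1, seed=0):
    """Shuffle records and split them into a small training set and a larger test set.

    train_fraction=0.1 is an assumed value; the abstract only says
    "a small percentage" is used for learning.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = max(1, int(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]


def token_accuracy(predicted, gold):
    """Fraction of tokens whose predicted label equals the gold label.

    Both arguments are lists of label sequences, one sequence per record.
    Token-level accuracy is an assumption; the paper may measure accuracy
    differently.
    """
    total = 0
    correct = 0
    for pred_seq, gold_seq in zip(predicted, gold):
        if len(pred_seq) != len(gold_seq):
            raise ValueError("label sequences must have equal length")
        for p, g in zip(pred_seq, gold_seq):
            total += 1
            correct += int(p == g)
    return correct / total if total else 0.0
```

Any sequence labeler (an HMM, a CRF, or a combination of the two) could be trained on the first part returned by split_records and scored with token_accuracy on the second.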