A Novel Handwriting Grading System Using
Gurmukhi
Characters
Munish Kumar1, Manish Jindal2,
and Rajendra Sharma3
1Department of Computational Sciences, Maharaja Ranjit
Singh Punjab Technical University, India
2Department of Computer Science and Applications, Panjab
University Regional Centre, India
3Department of Computer
Science and Engineering, Thapar University, India
Abstract: This paper presents a new technique for grading writers based on
their handwriting. Such grading can be helpful in organizing handwriting
competitions and then deciding the winners through an automated process. For
the testing data set, we have collected samples from one hundred different
writers. In order to establish the correctness of our approach, we have also
included these characters, taken from one printed Gurmukhi font (Anandpur
Sahib), in the testing data set. For the training data set, we have considered
these characters taken from four printed Gurmukhi fonts, namely, Language
Materials Project (LMP) Taran, Maharaja, Granthi and Gurmukhi_Lys. A Nearest
Neighbour classifier has been used to obtain a classification score for each
writer. Finally, the writers are graded based on their classification scores.
Keywords: Gradation, feature extraction, peak extent based features, modified division point based features, NN.
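The grading step itself is easy to illustrate. A minimal Python sketch under stated assumptions (random placeholder vectors stand in for the paper's peak extent and division point features, and the font/character counts are invented): each writer is scored by the mean nearest-neighbour distance of their characters to the printed-font prototypes, and writers are ranked by that score.

    import numpy as np

    def grade_writers(writer_feats, font_feats):
        """writer_feats: writer id -> (n_chars, d) feature array;
        font_feats: (m, d) features of printed-font characters."""
        scores = {}
        for writer, feats in writer_feats.items():
            # distance of every character to every font prototype
            d = np.linalg.norm(feats[:, None, :] - font_feats[None, :, :], axis=2)
            scores[writer] = d.min(axis=1).mean()  # mean 1-NN distance
        return sorted(scores, key=scores.get)      # best (closest) writer first

    rng = np.random.default_rng(0)
    fonts = rng.normal(size=(140, 16))  # assumed: 35 characters x 4 fonts
    writers = {f"w{i}": rng.normal(size=(35, 16)) for i in range(3)}
    print(grade_writers(writers, fonts))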
SynchroState: A SPEM-based Solution for Synchronizing Activities
and Products through State Transitions
Amal Rochd1, Maria Zrikem1,
Thierry Millan2, Christian Percebois2, Claude
Baron3,
and Abderrahmane Ayadi1
1Laboratory of Modeling and
Information Technologies, University of Cadi Ayyad, Morocco
2Institut
de Recherche en Informatique de Toulouse, Université de Toulouse, France
3Laboratoire
d’Analyse et d’Architecture des Systèmes, Université de Toulouse, France
Abstract: Software engineering
research has always focused on the efficiency of software development
processes. Recently, we have noticed an increasing interest in model-driven
approaches in this context. Models that were once merely descriptive are
nowadays playing a productive role in defining engineering processes and
managing their lifecycles. However, one problem has not been considered
enough: sustaining consistency between products and the activities implicated
during the process lifecycle. This issue, identified in this paper as the
synchronization problem, needs to be resolved in order to guarantee a flawless
execution of a software process. In this paper, we present a SPEM-based
solution named SynchroState that highlights the relationship between process
activities and products. SynchroState's goal is to ensure synchronization
between activities and products so that, if one of these two entities
undergoes a change, the dependent entities are notified and evolved to sustain
consistency. In order to evaluate SynchroState, we have implemented the
solution using the AspectJ language and validated it through a case study
inspired by the ISPW-6 software process example. Results of this study
demonstrate the automated synchronization of product states following a change
in activity state during the evolution of the process execution.
Keywords: SynchroState, SPEM, metamodeling, process model, synchronization, AspectJ.
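A minimal Python sketch of the synchronization idea only (not the authors' AspectJ implementation; entity names and the transition rule are invented): a state change in one entity notifies its dependents, which evolve according to a rule, with a visited set guarding against notification cycles.

    class Entity:
        def __init__(self, name, state="initial"):
            self.name, self.state, self.dependents = name, state, []

        def set_state(self, new_state, _visited=None):
            _visited = _visited if _visited is not None else set()
            if self.name in _visited:            # avoid notification cycles
                return
            _visited.add(self.name)
            self.state = new_state
            for dep, rule in self.dependents:    # propagate via transition rules
                dep.set_state(rule(new_state), _visited)

    activity = Entity("DesignActivity")
    product = Entity("DesignDocument")
    # assumed rule: when the activity finishes, the product becomes reviewable
    activity.dependents.append((product, lambda s: "toReview" if s == "finished" else s))
    activity.set_state("finished")
    print(product.state)  # -> toReview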
Security Mechanism against Sybil Attacks for
High-Throughput Multicast Routing in Wireless
Mesh Networks
Anitha Periasamy1 and Periasamy Pappampalayam2
1Master of Computer Applications Department, Anna
University, India
2Electronics and Communication Engineering
Department, Anna University, India
Abstract: Wireless Mesh Networks (WMNs) have become
one of the important domains in wireless communications. They comprise a
number of static wireless routers which form an access network providing end
users with IP-based services. Unlike conventional Wireless Local Area Network
(WLAN) deployments, wireless mesh networks offer multihop routing,
facilitating an easy and cost-effective deployment. This paper concentrates on
efficient and secure multicast routing in such wireless mesh networks. It
identifies novel attacks against high-throughput multicast protocols in
wireless mesh networks through the Secure On-Demand Multicast Routing Protocol
(S-ODMRP). The Sybil attack, in which a node illegitimately claims multiple
identities, has recently been observed to be the most harmful attack in WMNs.
This paper systematically analyzes the threat posed by the Sybil attack to
WMNs. The Sybil attack is countered by a defense mechanism called the Random
Key Predistribution (RKP) technique. The performance of the proposed approach,
which integrates S-ODMRP and RKP, is evaluated using the throughput
performance metric. The experimental results show that the proposed approach
provides good security against the Sybil attack with very high throughput.
Keywords: Secure multicast routing, Sybil attack, random key predistribution.
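A minimal Python sketch of the Random Key Predistribution idea (pool and ring sizes are assumed): every legitimate node is preloaded with a random subset of a global key pool, and a link is secured only when two nodes share at least one key; a Sybil identity fabricated without a valid key ring cannot establish such links.

    import random

    POOL_SIZE, RING_SIZE = 1000, 50        # assumed parameters
    pool = list(range(POOL_SIZE))

    def key_ring():
        return set(random.sample(pool, RING_SIZE))

    a, b = key_ring(), key_ring()
    shared = a & b
    if shared:
        link_key = min(shared)             # deterministic pick of a common key
        print("secure link established with key", link_key)
    else:
        print("no shared key; route through a key-path of intermediate nodes")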
An Effective Sample Preparation Method for
Diabetes Prediction
Shima Afzali and Oktay
Yildiz
Computer Engineering Department, Gazi University, Turkey
Abstract: Diabetes is a chronic disorder caused by metabolic
malfunction in carbohydrate metabolism and has become a serious health
problem worldwide. Early and correct detection of diabetes can significantly
influence the treatment process of diabetic patients and thus eliminate the
associated side effects. Machine learning is an emerging field of high
importance for providing prognosis and a deeper understanding of the
classification of diseases such as diabetes. This study proposes a
high-precision diagnostic system based on a modified k-means clustering
technique. First, noisy, uncertain and inconsistent data were detected by the
new clustering method and removed from the data set. Then, a diabetes
prediction model was generated using a Support Vector Machine (SVM). Employing
the proposed diagnostic system to classify the Pima Indians Diabetes (PID)
data set resulted in 99.64% classification accuracy with 10-fold cross
validation. The results of our analysis show that the new system is highly
successful compared to SVM alone and to the classical k-means algorithm
combined with SVM, in terms of both classification performance and time
consumption. Experimental results indicate that the proposed approach
outperforms previous methods.
Keywords: Diabetes, clustering, classification,
K-means, SVM, sample preparation.
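A minimal scikit-learn sketch of the two-stage pipeline under stated assumptions: synthetic data stands in for the PID set, and a distance-to-centroid percentile rule stands in for the paper's modified k-means criterion for flagging noisy samples.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=8, random_state=0)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = dist < np.percentile(dist, 90)   # drop the farthest 10% as noise
    clf = SVC().fit(X[keep], y[keep])       # train SVM on the cleansed set
    print("accuracy on cleansed training data:", clf.score(X[keep], y[keep]))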
Evaluating Social
Context in Arabic Opinion Mining
Mohammed Al-Kabi1,
Izzat Alsmadi2, Rawan Khasawneh3, and Heider Wahsheh4
1Computer Science Department, Zarqa University, Jordan
2Computer Science Department, University of New Haven,
USA
3Computer Information Systems Department, Jordan
University of Science and Technology, Jordan
4Computer
Science Department, King Khaled University, Saudi Arabia
Abstract: This study is based on a benchmark corpus consisting
of 3,015 textual Arabic opinions collected from Facebook. These collected
Arabic opinions are distributed equally among three domains (Food, Sport, and
Weather) to create a balanced benchmark corpus. To accomplish this study, ten
Arabic lexicons were constructed manually, and a new tool called Arabic
Opinions Polarity Identification (AOPI) was designed and implemented to
identify the polarity of the collected Arabic opinions using the constructed
lexicons. Furthermore, this study includes a comparison between the
constructed tool and two free online sentiment analysis tools (SocialMention
and SentiStrength) that support the Arabic language. The effect of stemming on
the accuracy of these tools is also tested. The evaluation results using
machine learning classifiers show that AOPI is more effective than the other
two free online sentiment analysis tools on a stemmed dataset.
Keywords: Big data, social networks, sentiment analysis, Arabic text classification and analysis, opinion mining.
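A minimal Python sketch of lexicon-based polarity identification of the kind the abstract describes (AOPI itself is not public; the lexicon entries below are invented placeholders): an opinion's polarity is the sign of the difference between positive and negative lexicon hits.

    POSITIVE = {"رائع", "لذيذ", "جميل"}   # assumed positive-lexicon entries
    NEGATIVE = {"سيء", "ممل", "بارد"}     # assumed negative-lexicon entries

    def polarity(tokens):
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(polarity(["الطعام", "لذيذ"]))   # "the food is delicious" -> positive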
Image Quality Assessment Employing RMS
Contrast and Histogram Similarity
Al-Amin Bhuiyan1 and Abdul Raouf
Khan2
1Department of Computer Engineering, King Faisal
University, KSA
2Department
of Computer Science, King Faisal University, KSA
Abstract: This paper presents a new approach for evaluating
image quality. The method is based on computing the histogram similarity
between images and combines quality index factors derived from the correlation
coefficient, average luminance distortion and RMS contrast. The effectiveness
of this proposed RMS Contrast and Histogram Similarity (RCHS) based hybrid
quality index has been demonstrated on Lena images under different well-known
distortions and on standard image databases. Experimental results demonstrate
that this image quality assessment method performs better than the widely used
image distortion metrics Mean Squared Error (MSE), Structural SIMilarity
(SSIM) and Histogram-based Image Quality (HIQ).
Keywords: Image quality measures, RMS contrast, histogram similarity, SSIM, HIQ, Minkowski distance metric.
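The two measurements named in the title are straightforward to compute; how RCHS weights them together with the correlation and luminance terms is not given in the abstract, so the Python sketch below only shows the individual ingredients on assumed 8-bit grayscale arrays.

    import numpy as np

    def rms_contrast(img):
        return (img.astype(float) / 255.0).std()   # std of normalized intensities

    def hist_similarity(a, b, bins=256):
        ha, _ = np.histogram(a, bins=bins, range=(0, 255), density=True)
        hb, _ = np.histogram(b, bins=bins, range=(0, 255), density=True)
        return np.minimum(ha, hb).sum() / np.maximum(ha, hb).sum()  # overlap ratio

    rng = np.random.default_rng(1)
    ref = rng.integers(0, 256, (64, 64))
    noisy = np.clip(ref + rng.normal(0, 10, ref.shape), 0, 255)
    print(rms_contrast(noisy), hist_similarity(ref, noisy))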
Enhancing Anti-phishing by a Robust
Multi-Level Authentication Technique (EARMAT)
Adwan Yasin
and Abdelmunem Abuhasan
College of Engineering and Information Technology, Arab
American University, Palestine
Abstract: Phishing is a kind of social engineering attack in
which experienced persons or entities fool novice users into sharing their
sensitive information, such as usernames, passwords and credit card numbers,
through spoofed emails, spam and Trojan hosts. The proposed scheme is based on
designing a secure two-factor authentication web application that prevents
phishing attacks rather than relying on phishing detection methods and user
experience. The proposed method guarantees that authenticating users to
services, such as online banking or e-commerce websites, is done in a very
secure manner. The proposed system uses a mobile phone as a software token
that plays the role of a second factor in the user authentication process. The
web application generates a session-based one-time password and delivers it
securely to the mobile application after notifying the user through the Google
Cloud Messaging (GCM) service. The user's mobile software then completes the
authentication process, after user confirmation, by encrypting the received
one-time password with its own private key and sending it back to the server
in a mechanism that is secure and transparent to the user. Once the server
decrypts the received one-time password and mutually authenticates the client,
it automatically authenticates the user's web session. We implemented a
prototype system of our authentication protocol that consists of an Android
application, a Java-based web server and GCM connectivity between them. Our
evaluation results indicate the viability of the authentication protocol in
securing web application authentication against various types of threats.
Keywords: Phishing, two-factor authentication, web security, Google Cloud Messaging, mobile authentication.
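A minimal Python sketch of the second-factor exchange (delivery over GCM and session handling are omitted). "Encrypting the one-time password with its own private key" corresponds cryptographically to signing, which is what the sketch does using the cryptography package; key registration is assumed to have happened beforehand.

    import secrets
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    otp = secrets.token_hex(8)   # server: session-based one-time password

    # mobile: sign the received OTP with the device's private key
    device_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)
    signature = device_key.sign(otp.encode(), pss, hashes.SHA256())

    # server: verify with the registered public key; raises on failure
    device_key.public_key().verify(signature, otp.encode(), pss, hashes.SHA256())
    print("second factor verified; web session authenticated")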
Detection of Neovascularization in Proliferative
Diabetic Retinopathy Fundus Images
Suma Gandhimathi1 and Kavitha Pillai2
1Department of Computer Science
and Engineering, Sree
Vidyanikethan Engineering College, India
2Department of Computer Science
and Engineering, University
College of Engineering, India
Abstract: Neovascularization is a serious disease with severe visual
consequences that arises from Proliferative Diabetic Retinopathy (PDR). The
condition causes progressive retinal damage in persons suffering from Diabetes
mellitus, and is characterized by the uncontrolled growth of abnormal blood
vessels from the normal vasculature, which hampers proper blood flow into the
retina owing to oxygen insufficiency in the retinal capillaries. The present
paper aims at detecting PDR neovascularization with the help of the Adaptive
Histogram Equalization technique, which enhances the green plane of the fundus
image, resulting in enrichment of the details present in the fundus image. The
neovascularization blood vessels and the normal blood vessels were both
segmented from the equalized image using the Fuzzy C-means clustering
technique. Marking of the neovascularization region was achieved with a
function matrix box based on a compactness classifier, which applied
morphological and thresholding techniques to the segmented image.
Subsequently, a Feed-Forward Back-propagation Neural Network was applied to
the extracted features (number of segments, gradient variation, mean,
variance, standard deviation, contrast, correlation, entropy, energy,
homogeneity and cluster shade of the neovascularization detection region) to
achieve accurate identification. The method was tested on images from three
online datasets as well as two hospital eye clinics. The performance of the
detection technique, evaluated on these five image sources, showed an overall
accuracy of 94.5%, with a sensitivity of 95.4% and a specificity of 49.3%,
suggesting that the method can play a vital role in the study and analysis of
Diabetic Retinopathy.
Keywords: Diabetic retinopathy, neovascularization, fuzzy C-means clustering, compactness
classifier, feature extraction, neural network.
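A minimal sketch of a compactness measure of the kind such a classifier can use (the abstract does not define the "function matrix box", so the normalized perimeter-squared-over-area form below is an assumption): neovascular segments are thin and convoluted, so their compactness is far above that of a disc.

    import numpy as np

    def compactness(perimeter, area):
        # 1.0 for a perfect disc; grows for thin, convoluted regions
        return perimeter ** 2 / (4 * np.pi * area)

    # assumed illustrative values for two segmented regions
    print(compactness(perimeter=300.0, area=450.0))  # tangled vessels -> large
    print(compactness(perimeter=80.0, area=500.0))   # roundish blob   -> near 1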
Machine Learning based Intelligent
Framework for Data Preprocessing
Sohail
Sarwar1, Zia Ul Qayyum2, and Abdul Kaleem1
Abstract: Data preprocessing plays a pivotal role in data mining: by handling
inconsistent, incomplete and irrelevant data through data cleansing, it
reduces cost and assists knowledge workers in making effective decisions
through knowledge extraction. Prevalent techniques are not very effective, as
they demand considerable manual effort and processing time while achieving
lower accuracy, and they operate on constrained data volumes. In this
research, a comprehensive, semi-automatic preprocessing framework based on a
hybrid of two machine learning techniques, namely Conditional Random Fields
(CRF) and Hidden Markov Models (HMM), is devised for data cleansing. The
proposed framework is envisaged to be effective and flexible enough to handle
data sets of any size. An inconsistent data set (comprising the customer
address directory) of the Pakistan Telecommunication Company (PTCL) is used to
conduct different experiments for training and validation of the proposed
approach. A small percentage of semi-cleansed data (the output of
preprocessing) is passed to the hybrid of HMM and CRF for learning, and the
rest of the data is used for testing the model. Experiments show a superior
average accuracy of 95.50% for the proposed hybrid approach, compared to CRF
(84.5%) and HMM (88.6%) applied separately.
Keywords: Machine learning, hidden Markov model, conditional random fields, preprocessing.
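A minimal Python sketch of the hybrid decision only, with stand-in functions replacing the trained CRF and HMM (their outputs and confidences below are invented): each labeller tags an address token sequence with a confidence score, and the more confident labelling is kept for cleansing.

    def crf_label(tokens):   # stand-in for a trained CRF tagger
        return ["HOUSE", "STREET", "CITY"][:len(tokens)], 0.93

    def hmm_label(tokens):   # stand-in for a trained HMM tagger
        return ["HOUSE", "CITY", "CITY"][:len(tokens)], 0.88

    def hybrid(tokens):
        (l1, s1), (l2, s2) = crf_label(tokens), hmm_label(tokens)
        return l1 if s1 >= s2 else l2   # keep the more confident labelling

    print(hybrid(["12", "Mall-Road", "Lahore"]))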
An Empirical Study to Evaluate the Relationship of Object-Oriented Metrics and Change Proneness
Ruchika Malhotra and Megha Khanna
Department
of Computer Science and Engineering, Technological University,
India
Abstract: Software maintenance deals with the changes or modifications which
software goes through. Change prediction models help in identifying
classes/modules which are prone to change in future releases of a software
product. As change prone classes are probable sources of defects and
modifications, they represent the weak areas of a product. Thus, change
prediction models would aid software developers in delivering an effective,
quality software product by allocating more resources to change prone
classes/modules, as they need greater attention and resources for verification
and meticulous testing. This would reduce the probability of defects in future
releases and would yield a better quality product and satisfied customers.
This study deals with the identification of change prone classes in
Object-Oriented (OO) software in order to evaluate whether a relationship
exists between OO metrics and the change proneness attribute of a class. The
study also compares the effectiveness of two sets of methods for change
prediction tasks, i.e., traditional statistical methods (logistic regression)
and widely used machine learning methods such as Bagging and the Multi-layer
Perceptron.
Keywords: Change proneness, empirical validation, machine learning, object-oriented, software quality.
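A minimal scikit-learn sketch of the statistical baseline, with toy values for three common OO metrics (CBO, WMC and LCOM are assumed example features): logistic regression maps a class's metrics to a change-proneness probability.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # assumed toy rows of [CBO, WMC, LCOM]; label 1 = class changed
    X = np.array([[3, 10, 0.2], [14, 45, 0.9], [5, 12, 0.3], [18, 60, 0.8]])
    y = np.array([0, 1, 0, 1])
    model = LogisticRegression().fit(X, y)
    print(model.predict_proba([[12, 40, 0.7]])[0, 1])  # P(change prone)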
Impulse Noise
Reduction for Texture Images Using Real Word Spelling Correction Algorithm and
Local Binary Patterns
Shervan
Fekri-Ershad1, Seyed Fakhrahmad2, and Farshad Tajeripour2
1Faculty of
Computer Engineering, Najafabad Branch, Islamic Azad University, Najafabad, Iran
2Department of
Computer Science and Engineering, Shiraz University, Shiraz, Iran
Abstract: Noise reduction is one of the most important steps in a very broad
range of image processing applications, such as face identification, motion
tracking and visual pattern recognition. Texture images make up a large share
of the images collected in the databases of these applications. In this paper,
an approach is proposed for noise reduction in texture images which is based
on real-word spelling correction theory from natural language processing. The
proposed approach includes two main steps. In the first step, the candidate
pixels most similar to the noisy pixel in terms of textural features are
generated using local binary patterns. Next, the best of the candidates is
selected based on a two-gram algorithm. The quality of the proposed approach
is compared with some state-of-the-art noise reduction filters in the results
section. High accuracy, low blurring effect and low computational complexity
are some of the advantages of the proposed approach.
Keywords:
Image noise reduction, local binary pattern, real word spelling correction,
texture analysis.
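A minimal Python sketch of the basic 3x3 local binary pattern used as the textural descriptor (the candidate generation and two-gram selection steps are not reproduced): each neighbour is thresholded against the centre pixel and the resulting bits form an 8-bit code.

    import numpy as np

    def lbp_code(patch):
        """patch: 3x3 array; returns the 8-bit LBP code of the centre pixel."""
        centre = patch[1, 1]
        # neighbours read clockwise from the top-left corner
        nb = patch[[0, 0, 0, 1, 2, 2, 2, 1], [0, 1, 2, 2, 2, 1, 0, 0]]
        bits = (nb >= centre).astype(int)
        return int((bits * 2 ** np.arange(8)).sum())

    patch = np.array([[5, 9, 1], [4, 6, 7], [2, 8, 3]])
    print(lbp_code(patch))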
Using Data Mining for Predicting Cultivable
Uncultivated Regions in the Middle East
Ahsan Abdullah1, Ahmed
Bakhashwain2, Abdullah Basuhail3, and Ahtisham Aslam4
1Department
of Information Technology, King Abdulaziz University, Saudi Arabia
2Department
of Arid Regions Agriculture, King Abdulaziz University, Saudi Arabia
3Department
of Computer Science, King Abdulaziz University, Saudi Arabia
4Department of Information Systems, King Abdulaziz
University, Saudi Arabia
Abstract: The Middle East region is
mostly characterized by a hot and dry climate, vast deserts and long
coastlines. Deserts cover large areas, while agricultural lands are limited to
small areas of arable land under perennial grass pastures or crops. In view of
the harsh climate and falling ground-water levels, it is critical to identify
which agricultural produce to grow, and where to grow it. The traditional
methods used for this purpose are expensive, complex, prone to subjectivity,
risky and time-consuming; this points to the need to explore novel IT
techniques using Geographic Information Systems (GIS). In this paper, we
present a data-driven, stand-alone, flexible analysis environment, the Spatial
Prediction and Overlay Tool (SPOT). SPOT is a predictive spatial data mining
GIS tool designed to facilitate decision support by processing and analysing
agro-meteorological and socio-economic thematic maps and generating
geo-referenced crop cultivation prediction maps through predictive data
mining. In this paper, we present a case study of Saudi Arabia using
decade-old wheat cultivation data, and compare the historically uncultivated
regions predicted by SPOT with their current cultivation status. The
prediction results were found to be promising after verification in time and
space using the latest satellite imagery, followed by on-site physical ground
verification using GPS.
Keywords: Data mining, image processing, GIS, prediction,
wheat, alfalfa.
Mining Consumer Knowledge from Shopping
Experience: TV Shopping Industry
Chih-Hao Wen1,
Shu-Hsien Liao2, and Shu-Fang Huang2
1Department
of Logistics Management, National Defense University, Taiwan
2Department of Management Sciences and Decision Making, Tamkang
University, Taiwan
Abstract:
TV shopping has become far more popular in recent years. TV is now almost
everywhere; people watch TV and, meanwhile, have grown more and more
accustomed to buying goods via TV shopping channels. Even in recession, the
industry is thriving and has become one of the most important consumption
modes. This study uses cluster analysis to identify the profiles of TV
shopping consumers. The rules linking TV shopping spokespersons and the
commodities consumers buy are recognized using association analysis. By
depicting the marketing knowledge map of spokespersons, the best endorsement
portfolio is found in order to make recommendations. From the analysis of
spokespersons, periods, customer profiles and products, four business modes of
TV shopping are proposed for consumers: new product, knowledge, low price and
luxury product; related recommendations are also provided for industry
reference.
Keywords: Consumer
knowledge, data mining, TV shopping, association rules, clustering.
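A minimal Python sketch of how a spokesperson-to-commodity rule would be scored (transactions and names are invented placeholders): the rule's confidence is the share of the spokesperson's transactions that also contain the commodity.

    transactions = [
        {"spokesA", "skincare"}, {"spokesA", "skincare"},
        {"spokesA", "jewelry"}, {"spokesB", "skincare"},
    ]

    def confidence(antecedent, consequent):
        with_a = [t for t in transactions if antecedent in t]
        return sum(consequent in t for t in with_a) / len(with_a)

    print(confidence("spokesA", "skincare"))  # 0.67: rule spokesA => skincare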
A Physical Topology Discovery Method Based on AFTs of
Down Constraint
Bin Zhang1,
Xingchun Diao2, Donghong Qin3, Yi Liu4, and
Yun Yu2
1Cyberspace Security Research Center, Pengcheng
Laboratory, China
2Nanjing
Telecommunication Technology Research Institute, China
3School
of Information Science and Engineering, GuangXi University for Nationalities,
China
4National Innovation
Institute of Defense Technology, Beijing, China
Abstract: Network physical topology discovery is a key issue for network
management and applications, and physical topology discovery based on Address
Forwarding Tables (AFTs) is a hot topic in current research. This paper
defines three constraints on AFTs and proposes a tree chopping algorithm based
on AFTs satisfying the down constraint, which can discover the physical
topology of a subnet accurately. The proposed algorithm dramatically decreases
the demand for AFT completeness and imposes the loosest constraint for
discovering physical topology, relying only on the AFTs of down ports. The
proposed algorithm can also be used in switch domains spanning multiple
subnets.
Keywords: Physical topology discovery, address
forwarding table, network management.
Modified Binary Bat Algorithm for Feature
Selection in Unsupervised Learning
Rajalaxmi Ramasamy and Sylvia Rani
Department of Computer Science and Engineering, Kongu Engineering
College, India
Abstract: Feature selection is the process of selecting a
subset of optimal features by removing redundant and irrelevant ones. In
supervised learning, the feature selection process uses class labels, but
feature selection is difficult in unsupervised learning since class labels are
not present. In this paper, we present a wrapper-based unsupervised feature
selection method that combines a modified binary bat approach with the k-means
clustering algorithm. To ensure diversification in the search space, a
mutation operator is introduced into the proposed algorithm. To validate the
features selected by our method, classification algorithms such as decision
tree induction, the Support Vector Machine and the Naïve Bayesian classifier
are used. The results show that the proposed method identifies a minimal
number of features with improved accuracy when compared with the other
methods.
Keywords: Feature selection, unsupervised
learning, binary bat algorithm, mutation.
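A minimal Python sketch of one iteration of the modified binary bat update (all parameter values are assumed): the velocity moves a bat toward the global best subset, a sigmoid transfer maps velocity to bit probabilities, and the added mutation step flips random bits to preserve diversity.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, f_min, f_max, mutation_rate = 10, 0.0, 2.0, 0.05
    position = rng.integers(0, 2, n_features)   # current feature subset
    velocity = np.zeros(n_features)
    best = rng.integers(0, 2, n_features)       # global best subset so far

    freq = f_min + (f_max - f_min) * rng.random()
    velocity += (position - best) * freq
    prob = 1.0 / (1.0 + np.exp(-velocity))      # sigmoid transfer function
    position = (rng.random(n_features) < prob).astype(int)

    flip = rng.random(n_features) < mutation_rate   # mutation for diversity
    position[flip] ^= 1
    print(position)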
Explicitly Symplectic Algorithm for Long-time
Simulation of Ultra-flexible Cloth
Xiao-hui
Tan1, Zhou Mingquan2, Yachun Fan2, Wang
Xuesong2, and Wu Zhongke2
1College of Information and Engineering,
Capital Normal University, China
2College of Information Science and
Technology, Beijing Normal University, China
Abstract: In
this paper, a symplectic structure-preserving algorithm is presented to solve
the Hamiltonian dynamic model of ultra-flexible cloth simulation with high
computational stability. Our method preserves the conserved quantity of a
Hamiltonian, which enables long-time stable simulation of ultra-flexible
cloth. Firstly, the dynamic equation of ultra-flexible cloth simulation is
transformed into a Hamiltonian system which is slightly perturbed from the
original one but retains generalized structure preservability. Secondly,
semi-implicit symplectic Runge-Kutta and Euler algorithms are constructed,
which can be converted into explicit algorithms for separable dynamic models.
Thirdly, in order to show their advantages, the presented algorithms are used
to solve a conservative system which is the primary ultra-flexible cloth model
unit. The results show that the presented algorithms keep the system energy
constant and give accurate results even at large time steps, whereas ordinary
non-symplectic explicit methods exhibit large errors as the time step
increases. Finally, the presented algorithms are adopted to simulate a
large-area ultra-flexible cloth to validate the computational capability and
stability. The method exploits symplectic features and analytically integrates
the force for better stability and accuracy while keeping the integration
scheme explicit. Experimental results show that our symplectic schemes are
more powerful for integrating Hamiltonian systems than non-symplectic methods.
Our method is a general scheme for physically based systems that
simultaneously maintains real-time and long-time simulation. It has been
implemented in the scene building platform World Max Studio.
Keywords: Flexible cloth simulation, numerical
integration, symplectic method, scene building system.
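The structure-preserving behaviour is easy to demonstrate on a single conservative unit, here taken (as an assumption) to be one mass-spring with separable Hamiltonian H = p²/(2m) + kq²/2. A minimal Python sketch of the semi-implicit (symplectic) Euler step: the update stays explicit yet keeps the energy bounded over long runs, where ordinary explicit Euler drifts.

    m, k, dt = 1.0, 4.0, 0.1     # assumed mass, stiffness, time step
    q, p = 1.0, 0.0              # initial position and momentum, H = 2.0
    for _ in range(1000):
        p -= dt * k * q          # kick: momentum update with the force at q
        q += dt * p / m          # drift: position update with the new momentum
    print(q, p, 0.5 * p * p / m + 0.5 * k * q * q)  # energy stays near 2.0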
Semi Fragile Watermarking for Content based Image
Authentication and Recovery in the
DWT-DCT Domains
Jayashree Pillai1 and Padma Theagarajan2
1Department
of Computer Science, Acharya Institute of Management and Sciences, India
2Department
of Computer Applications, Sona College of Technology, India
Abstract: Content authentication requires that image watermarks highlight
malicious attacks while tolerating incidental modifications that do not alter
the image contents beyond a certain tolerance limit. This paper proposes an
authentication scheme that uses content-invariant features of the image as a
self-authenticating watermark and a quantized, down-sampled approximation of
the original image as a recovery watermark, both embedded securely using a
pseudorandom sequence into multiple sub-bands in the Discrete Wavelet
Transform (DWT) domain. The scheme is blind, as it does not require the
original image during the authentication stage, and is highly tolerant to
JPEG2000 compression. The scheme also ensures highly imperceptible watermarked
images and is suitable for applications with low tolerance to image quality
degradation after watermarking. Both the Discrete Cosine Transform (DCT) and
DWT domains are used in the watermark generation and embedding process.
Keywords: Content authentication, self authentication,
recovery watermark, DWT, PQ sequence.
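A minimal PyWavelets sketch of DWT-domain embedding (single-level Haar; the paper's pseudorandom sequence, multiple sub-bands and DCT stage are omitted): watermark bits perturb coefficients of one detail sub-band and the inverse transform yields the watermarked image.

    import numpy as np
    import pywt

    img = np.float64(np.random.default_rng(0).integers(0, 256, (8, 8)))
    LL, (LH, HL, HH) = pywt.dwt2(img, "haar")
    bits = np.array([1, 0, 1, 1])              # assumed watermark payload
    flat = HL.reshape(-1)                      # view into the HL sub-band
    flat[:bits.size] += (2 * bits - 1) * 2.0   # +/- delta per watermark bit
    watermarked = pywt.idwt2((LL, (LH, HL, HH)), "haar")
    print(watermarked.shape)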
Recognition of Handwritten Characters Based on
Wavelet Transform and SVM Classifier
Malika Ait Aider1,
Kamal Hammouche1, and Djamel Gaceb2
1Laboratoire
Vision Artificielle et Automatique des Systèmes, Université Mouloud Mammeri,
Algérie
2Laboratoire D'informatique
en Image et Systèmes D'information, Institut
National des Sciences Appliquées de Lyon, France
Abstract: This paper is devoted to off-line handwritten
character recognition based on the two-dimensional wavelet transform and a
single support vector machine classifier. The wavelet transform provides a
representation of the image in independent frequency bands. It performs a
local analysis to characterize images of characters in time and scale space.
At each level of decomposition, the wavelet transform provides four
sub-images: a smooth or approximation sub-image and three detail sub-images.
In handwritten character recognition, the wavelet transform has received
considerable attention, and its performance depends not only on the type of
wavelet used but also on the type of sub-image used to provide features. Our
objective here is thus to study these two points by conducting several tests
using several wavelet families and several combinations of features derived
from the sub-images. The tests show that the symlet wavelet of order 8 is the
most efficient and that the features derived from the approximation sub-image
allow the best discrimination between handwritten digits.
Keywords: Feature extraction, wavelet transform, handwritten character recognition, support vector machine, OCR.
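A minimal sketch of the reported best configuration, with random arrays standing in for character images: level-1 'sym8' approximation coefficients as the feature vector, classified by an SVM.

    import numpy as np
    import pywt
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    imgs = rng.random((40, 32, 32))            # stand-ins for character images
    labels = rng.integers(0, 2, 40)

    def features(img):
        cA, _ = pywt.dwt2(img, "sym8")         # keep the approximation sub-image
        return cA.reshape(-1)

    X = np.array([features(im) for im in imgs])
    clf = SVC().fit(X, labels)
    print(clf.score(X, labels))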
Development of a Hindi Named Entity
Recognition System without Using Manually Annotated Training Corpus
Sujan Kumar Saha1 and
Mukta Majumder2
1Department of Computer Science and Engineering, Birla
Institute of Technology, India
2Department of Computer Science and Application, University
of North Bengal, India
Abstract: A machine learning based approach to Named Entity Recognition (NER)
requires a sufficient annotated corpus to train the classifier. Other NER
resources, such as gazetteers, are also required to make the classifier more
accurate. But in many languages and domains, relevant NER resources are still
not available, and the creation of adequate and relevant resources is costly
and time-consuming. However, a large number of resources and several NER
systems are available for resource-rich languages such as English. Suitable
language adaptation techniques, the NER resources of a resource-rich language
and minimally supervised learning might help to overcome such scenarios. In
this paper, we study a few such techniques in order to develop a Hindi NER
system. Without using any Hindi NE-annotated corpus, the developed system
achieves a reasonable F-measure of 73.87.
Keywords: Natural
language processing, machine learning, named
entity recognition, resource scarcity, language transfer, semi-supervised
learning.
Combining Instance Weighting and Fine
Tuning for Training Naïve Bayesian Classifiers with Scant Training Data
Khalil El Hindi
Department
of Computer Science, King Saud University, Saudi Arabia
Abstract: This work addresses the problem of having to train a
Naïve Bayesian classifier using limited data. It first presents an improved
instance-weighting algorithm that is accurate and robust to noise, and then
shows how to combine it with a fine-tuning algorithm to achieve even better
classification accuracy. Our empirical work using 49 benchmark data sets shows
that the improved instance-weighting method outperforms the original algorithm
on both noisy and noise-free data sets. Another set of empirical results
indicates that combining the instance-weighting algorithm with the fine-tuning
algorithm gives better classification accuracy than using either of them
alone.
Keywords: Naïve Bayesian algorithm, classification, machine learning, noisy data sets, instance weighting.
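A minimal scikit-learn sketch of the instance-weighting idea (sample_weight stands in for the paper's weighting algorithm, and the weights below are assumed rather than learned): suspect instances contribute less to the Naïve Bayes parameter estimates.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    weights = np.ones(100)
    weights[:10] = 0.2   # down-weight instances suspected to be noisy
    clf = GaussianNB().fit(X, y, sample_weight=weights)
    print(clf.score(X, y))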