Mining Recent Maximal Frequent Itemsets Over Data Streams with Sliding Window
Saihua Cai1, Shangbo Hao1, Ruizhi Sun1, and Gang Wu2
1College of Information and Electrical Engineering, China Agricultural University, China
2Secretary of Computer Science Department, Tarim University, China
Abstract: The huge volume of data streams makes it impractical to mine all recent frequent itemsets directly. Because maximal frequent itemsets compactly imply all frequent itemsets and are far fewer in number, mining maximal frequent itemsets instead costs much less time and memory. This paper proposes an improved method called Recent Maximal Frequent Itemsets Mining (RMFIsM) to mine recent maximal frequent itemsets over data streams with a sliding window. The RMFIsM method uses two matrices to store the information of the data stream: the first matrix stores the information of each transaction, and the second stores the frequent 1-itemsets. Frequent p-itemsets are mined by an "extension" process applied to frequent 2-itemsets, and the maximal frequent itemsets are obtained by deleting the sub-itemsets of long frequent itemsets. Finally, the performance of the RMFIsM method is evaluated through a series of experiments; the results show that the proposed RMFIsM method can mine recent maximal frequent itemsets efficiently.
Keywords: Data streams, recent maximal frequent itemsets, sliding window, matrix structure.
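As a quick illustration of the final pruning step described in the abstract (obtaining maximal itemsets by deleting sub-itemsets of longer frequent itemsets), here is a minimal Python sketch; it is not the authors' RMFIsM implementation:

```python
# Minimal sketch (not the authors' RMFIsM code): given a collection of
# frequent itemsets, keep only the maximal ones, i.e., those with no
# frequent proper superset.
def maximal_itemsets(frequent):
    frequent = [frozenset(i) for i in frequent]
    maximal = []
    for itemset in frequent:
        # An itemset is maximal if no other frequent itemset strictly contains it.
        if not any(itemset < other for other in frequent):
            maximal.append(itemset)
    return maximal

# Example: {a, b} and {b, c} are subsumed by {a, b, c}; {d} survives.
print(maximal_itemsets([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d"}]))
```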
Evaluation of Quran Recitation via OWL Ontology Based System
Eman Elsayed and Doaa Fathy
Mathematics and Computer Science Department, Al-Azhar University (Girls Branch), Egypt
Abstract: The linguistic miracle of the Holy Quran poses many challenges for automating Quran recitation evaluation. This paper considers one suggestion for how natural language processing can benefit from using ontology. We propose a general automatic system to evaluate Quran recitation according to the Hafs reading. This is achieved by integrating ontology, as an artificial-intelligence knowledge representation method, with Automatic Speech Recognition (ASR) technology as a means of interaction with the computer. Our proposed system solves the problem of evaluating all intonation rules (Tajweed) at the same time, in addition to evaluating a set of Quran segments in the right order of reading. The system uses Mel-Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ) for feature extraction and dimension reduction, respectively, on Arabic speech. We also construct a Quran ontology for Quranic speech and integrate it with an information retrieval system. This Quran ontology is the first version to merge Quran meaning ("Tafseer") with its recitation in a universal oral exam, taking advantage of the semantic properties of ontology. Experimentally, our system gives good accuracy for Quran recitation evaluation.
Keywords: Holy Quran, ontology, automatic speech recognition, feature extraction, information retrieval system.
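A minimal sketch of the MFCC plus Vector Quantization front end named in the abstract, using the librosa and SciPy libraries (not the authors' system; the file name, sampling rate, and codebook size are assumptions):

```python
# Minimal sketch of an MFCC + VQ pipeline (illustrative only).
import librosa
import numpy as np
from scipy.cluster.vq import kmeans2, vq

signal, sr = librosa.load("recitation.wav", sr=16000)    # hypothetical input file
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # 13 coefficients per frame

# Vector quantization: learn a small codebook over the frames, then map each
# frame to its nearest code word, reducing the representation's dimension.
frames = mfcc.T.astype(np.float64)
codebook, _ = kmeans2(frames, 32)   # 32-word codebook (assumed size)
codes, _ = vq(frames, codebook)
```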
A Personalized Metasearch Engine Based on Multi-Agent System
Meijia Wang, Qingshan Li, Yishuai Lin, Yingjian Li, and Boyu Zhou
Software Engineering Institute, Xidian University, China
Abstract: By providing unified access to multiple underlying search engines, a metasearch engine is an intuitive way to increase coverage of the WWW. Great progress has been made in this area, but previous studies ignore the perspectives of users. This paper proposes a personalization mechanism for a metasearch engine based on a multi-agent system to improve the precision ratio. The proposed mechanism obtains user interests from click-through data, schedules the appropriate underlying search engines according to an expertness model, and merges results based on the user interest distribution. Moreover, it also has the ability to provide personalized result recommendation. Experimental results show that the proposed personalization mechanism achieves better precision than the baseline, and that the proposed metasearch engine is feasible for providing useful search results more effectively.
Keywords: Metasearch engine, multi-agent system, personalized search.
Received October 29, 2016; accepted August 26, 2018
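The paper's exact merging rule is not reproduced here; the following hypothetical sketch shows one simple way an interest-weighted merge of ranked lists could look (the engine names, weights, and reciprocal-rank scoring are all invented for illustration):

```python
# Hypothetical interest-weighted merging of ranked result lists.
def merge_results(ranked_lists, interest_weights):
    scores = {}
    for engine, results in ranked_lists.items():
        weight = interest_weights.get(engine, 0.0)
        for rank, url in enumerate(results, start=1):
            # Reciprocal-rank score, weighted by the user's interest in this engine.
            scores[url] = scores.get(url, 0.0) + weight / rank
    return sorted(scores, key=scores.get, reverse=True)

merged = merge_results(
    {"engineA": ["u1", "u2"], "engineB": ["u2", "u3"]},
    {"engineA": 0.7, "engineB": 0.3},
)
print(merged)
```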
Parameter Optimization of Single Sample Virtually Expanded Method
Xiaoxia Zhao1, Wenjun Meng1, Jinhu Su1, and Yuxuan Chen2
1School of Mechanical Engineering,
2Design and Research Institute, North China Municipal Engineering Design and Research Institute Co. Ltd., China
Abstract: For single-sample experiments (sample number n=1), the relationship among parameters virtually expanded from n=1 to n=13 is derived in this paper, and big-sample data are obtained by the Bootstrap method. In place of existing methods, an improved particle swarm optimization based on Minimax is put forward. When this method is applied to the parameter optimization, the lower confidence limit approaches that of the Semiempirical Evaluation Method with greater speed and higher precision. In this way, the most suitable augmented parameters for virtual expansion from n=1 to n=13 are obtained, providing a better virtual augmentation method for expanding the sample from n=1 to n=13.
Keywords: Bootstrap method, semiempirical evaluation method, single sample, augmented parameter, minimax, particle swarm optimization (PSO).
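A minimal sketch of the Bootstrap step described above, assuming a hypothetical virtually expanded sample of n=13 (the virtual expansion rule itself is the paper's contribution and is not reproduced here):

```python
# Minimal Bootstrap sketch: resample with replacement and study the
# distribution of a statistic (here, the mean).
import numpy as np

def bootstrap_means(sample, n_resamples=1000):
    """Draw resamples with replacement and return their means."""
    sample = np.asarray(sample, dtype=float)
    rng = np.random.default_rng(0)
    resamples = rng.choice(sample, size=(n_resamples, sample.size), replace=True)
    return resamples.mean(axis=1)

# Hypothetical virtually expanded sample (n=13); placeholder values only.
expanded = np.array([4.1, 4.3, 4.2, 4.0, 4.4, 4.2, 4.1, 4.3, 4.2, 4.2, 4.1, 4.3, 4.2])
means = bootstrap_means(expanded)
lower_limit = np.percentile(means, 5)   # a simple lower confidence limit
print(lower_limit)
```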
Classifying Sentiment of Dialectal Arabic Reviews: A Semi-Supervised Approach
Omar Al-Harbi
Computer and Information Department, Jazan University, Saudi Arabia
Abstract: Arab Internet users tend to use dialectal words to express how they feel about products, services, and places. Although Arabic dialects are derived from the formal Arabic language, they differ from it in several aspects. Arabic sentiment analysis has recently attracted a great deal of researchers' attention: a considerable amount of research has been conducted on Modern Standard Arabic (MSA), but little work has focused on dialectal Arabic. The presence of dialect in Arabic texts makes Arabic sentiment analysis a challenging issue, because dialect usually does not follow specific rules in its writing or speaking system. In this paper, we implement a semi-supervised approach for sentiment polarity classification of dialectal reviews in the presence of MSA. We combined a dialectal sentiment lexicon with four classification learning algorithms, namely Support Vector Machines (SVM), Naïve Bayes (NB), Random Forest, and K-Nearest Neighbor (K-NN). To select the features with which the classifiers perform best, we used three feature evaluation methods, namely Correlation-based Feature Selection, Principal Components Analysis, and SVM Feature Evaluation. In the experiment, we applied the approach to a manually collected data set. The experimental results show that the approach yielded its highest classification accuracy, 92.3%, with the SVM algorithm.
Keywords: Arabic sentiment analysis, opinion mining, dialectal sentiment analysis, dialectal lexicon, dialectal Arabic processing.
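A minimal sketch of the SVM polarity classifier named in the abstract, using scikit-learn (not the author's setup; the tiny inline dataset and TF-IDF features are assumptions):

```python
# Minimal SVM text-polarity sketch (illustrative; toy data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

reviews = ["service excellent", "very bad place", "good product", "terrible service"]
labels = ["pos", "neg", "pos", "neg"]

# TF-IDF features feeding a linear-kernel SVM.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(reviews, labels)
print(model.predict(["excellent place"]))
```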
A Proactive Caching Scheme Based on Content Concentration in Content-Centric Networks
Xi Luo1,2 and Ying An1
1Institute of Information Security and Big Data, Central South University, China
2Department of Information Technology, Hunan Police Academy, China
Abstract: Content-Centric Networking (CCN) provides a new networking paradigm to overcome the challenges of the current Internet and better support content-oriented services. In-network caching is one of the core technologies for optimizing content distribution and has been attracting ever-increasing attention. In current CCN caching schemes, the caching decision is made only during the content response process, after the content matching the user request has been found, so they lack the capacity to push content proactively. This paper borrows the idea of molecular diffusion and introduces the concept of content concentration into the caching decision to analytically model user demands for different contents in different network regions; a proactive caching scheme based on content concentration is then proposed. In this scheme, content replicas can be proactively pushed from high-concentration regions to low-concentration ones, realizing fast deployment of content cache replicas and dynamic adjustment of cache locations. Furthermore, a probabilistic content placement method is also implemented by synthetically considering content popularity. The simulation results show that this scheme can effectively improve the cache hit ratio and shorten the delay involved in content search and response.
Keywords: Content-centric networking, diffusion theory, proactive caching, content popularity.
Received October 13, 2016; accepted November 27, 2017
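The concentration model itself is the paper's contribution and is not reproduced here; the following hypothetical sketch only illustrates the general idea of probabilistic, popularity-weighted cache placement mentioned in the abstract:

```python
# Hypothetical popularity-weighted probabilistic cache placement.
import random

def maybe_cache(content_id, popularity, max_popularity, cache):
    # Cache with probability proportional to the content's popularity.
    p = popularity / max_popularity
    if random.random() < p:
        cache.add(content_id)

cache = set()
maybe_cache("video42", popularity=80, max_popularity=100, cache=cache)
print(cache)
```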
A New Approach to Improve Association Rules for Big Data in Cloud Environment
Djilali Dahmani, Sidi Ahmed Rahal, and Ghalem Belalem
Department of Computer Science, University of Science and Technology, Algeria
Abstract: The technique of association rules is very useful in data mining, but it generates a huge number of rules, so manual post-processing is required to target only the interesting ones. Several researchers suggest integrating users' knowledge by using ontology and rule patterns, then automatically selecting the interesting rules after generating all possible rules. However, business data are now growing extremely fast, and many companies have already opted for Big Data systems deployed in cloud environments, making the process of generating association rules very hard. To deal with this issue, we propose an approach that uses ontology with rule patterns to integrate users' knowledge early, in the preprocessing step, before searching for or generating any rule. Thus, only the interesting rules that respect the rule patterns are generated. This approach reduces execution time and minimizes the cost of post-processing, especially for Big Data. To confirm the performance results, experiments are carried out on Not Only Structured Query Language (NoSQL) databases distributed in a cloud environment.
Keywords: Big data, association rules, rule patterns, ontology, cloud computing, NoSQL.
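As a hypothetical illustration of pre-filtering candidates by rule patterns before any rule is generated (the pattern representation here is invented, not the paper's):

```python
# Hypothetical pattern-based pre-filtering: candidate itemsets that cannot
# satisfy any user rule pattern are discarded before rule generation.
def matches_pattern(itemset, patterns):
    # A pattern here is a set of ontology concepts the rule must involve.
    return any(p <= itemset for p in patterns)

patterns = [{"product", "region"}]   # hypothetical user-supplied rule pattern
candidates = [{"product", "region", "month"}, {"store", "day"}]
kept = [c for c in candidates if matches_pattern(c, patterns)]
print(kept)   # only the first candidate survives
```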
PLDL: A Novel Method for Label Distribution Learning
Venkatanareshbabu Kuppili, Mainak Biswas, and Damodar Edla
Computer Science and Engineering, National Institute of Technology Goa, India
Abstract: The nature, volume, and orientation of data have changed greatly in the last few years. The changed situation has beckoned data scientists to modify traditional algorithms and innovate new methods for processing this new type of high-volume, extremely complex data. One of the challenges is label ambiguity in the data, where the distribution of the significance of the labels matters. In this paper, a new method named Probabilistic Label Distribution Learning (PLDL) is proposed for computing the degree of belongingness. It is based on a proposed new Label Probability Density Function (LPDF) derived from the Parzen estimate. The LPDF has been used in Algorithm Adaptation K-Nearest Neighbors (AA-KNN) for Label Distribution Learning (LDL). Probability density estimators are used to estimate this ambiguity for each and every label. The overall degree of belongingness of unseen instances has been evaluated on various real datasets. Comparative performance evaluation of the proposed PLDL, in terms of prediction accuracy, has been made against Algorithm Adaptation KNN, the multilayer perceptron, the Levenberg-Marquardt neural network, and the layer recurrent neural network for Label Distribution Learning. It has been observed that the increase in prediction accuracy for the proposed PLDL is highly statistically significant for most of the real datasets when compared with the standard algorithms for LDL.
Keywords: Multi-label classification, data mining, label distribution learning, probability density function.
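A minimal sketch of a Parzen-window (Gaussian kernel) per-label density estimate normalized into a label distribution, illustrating the idea behind an LPDF (not the authors' PLDL code; the bandwidth and sample values are assumptions):

```python
# Minimal Parzen-window label-distribution sketch.
import numpy as np

def parzen_density(x, samples, h):
    # Gaussian kernel density estimate at point x from 1-D samples.
    u = (x - np.asarray(samples, dtype=float)) / h
    return np.mean(np.exp(-0.5 * u**2) / (h * np.sqrt(2 * np.pi)))

def label_distribution(x, samples_per_label, h=0.5):
    densities = np.array([parzen_density(x, s, h) for s in samples_per_label])
    return densities / densities.sum()   # degrees of belongingness sum to 1

# Two labels, each with its own observed samples (placeholder values).
dist = label_distribution(1.0, [[0.8, 1.1, 0.9], [2.5, 2.8]], h=0.5)
print(dist)
```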
A Comprehensive Study of Modern and High Speed TCP-Variant in Linux Kernel: TCP CUBIC
Abrar Khan, Umar Shoaib, and Muhammad Sarfraz
Department of Computer Science, Faculty of Computing and Information Technology, University of Gujrat, Pakistan
Abstract: The Transmission Control Protocol (TCP) is without doubt the most widely used congestion control protocol, designed for highly reliable, end-to-end communication over the Internet. However, TCP in its standard form is not suitable for modern, high-speed networks; various TCP variants address this issue. CUBIC is a modern TCP variant designed for high-speed, scalable networks, and it has been adopted as the default congestion control algorithm in the Linux kernel. This survey paper contains a detailed discussion of TCP CUBIC and directions for further improvement. It describes the CUBIC design architecture with the pseudo code of the algorithm, TCP support in the Linux kernel and the implementation of CUBIC, and a Network Simulator 2 and Network Simulator 3 based study of CUBIC along with its class diagram. Finally, the performance of CUBIC is evaluated in both wired and wireless environments, in terms of goodput and intra-protocol fairness, alongside TCP NewReno and TCP Compound. The simulation results demonstrate that CUBIC is very suitable for wired, high-speed networks but that its performance degrades in wireless and low-speed networks.
Keywords: TCP, congestion control, Linux kernel, CUBIC, NewReno, Compound.
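For reference, CUBIC's well-known window-growth rule (the standard formulation from RFC 8312 with its default constants, not this paper's pseudo code):

```python
# CUBIC window growth per RFC 8312: after a loss, the congestion window grows
# along a cubic curve anchored at the window size W_max reached before the
# loss. C = 0.4 and beta = 0.7 are the RFC default constants.
def cubic_window(t, w_max, c=0.4, beta=0.7):
    # K is the time needed to grow back to w_max after the multiplicative decrease.
    k = ((w_max * (1 - beta)) / c) ** (1.0 / 3.0)
    return c * (t - k) ** 3 + w_max

# Concave growth toward w_max, then convex probing beyond it.
for t in (0.0, 1.0, 2.0, 3.0):
    print(t, round(cubic_window(t, w_max=100.0), 1))
```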
Block Size Analysis for Discrete Wavelet Watermarking and Embedding a Vector Image as a Watermark
Hayri Sever1, Ahmet Şenol1, and Ersin Elbaşı2
1Department of Computer Engineering, Çankaya University, 06790 Etimesgut, Ankara, Turkey
2College of Engineering and Technology, American University of the Middle East, Kuwait
Abstract: As telecommunication and computer technologies proliferate, most data are stored and transferred in digital format. Content owners, therefore, are searching for new technologies to protect copyrighted products in digital form. Image watermarking emerged as a technique for protecting image copyrights. Early studies on image watermarking used the pixel domain, whereas modern watermarking methods convert a pixel-based image to another domain and embed a watermark in the transform domain. This study uses the Block Discrete Wavelet Transform (BDWT) as the transform domain for embedding and extracting watermarks, and it consists of two parts. The first part investigates the effect of dividing an image into non-overlapping blocks and transforming each image block to the DWT domain independently; the effect of block size on watermark success is then analyzed. The second part investigates embedding a vector image logo as a watermark. Vector images consist of geometric objects such as lines, circles, and splines; unlike pixel-based images, they do not lose quality when scaled. Vector watermarks deteriorate very easily if the watermarked image is processed, for example by compression or filtering, so special care must be taken when the embedded watermark is a vector image, such as adjusting the watermark strength or distributing the watermark data across the image, and the relative importance of the watermark data must be taken into account. To the best of our knowledge, this study is the first to use a vector image as a watermark embedded in a host image.
Keywords: Watermarking, DWT, block, vector, SVG.
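A minimal sketch of additive watermark embedding in the DWT domain of a single image block, using the PyWavelets library (illustrative only; the wavelet, strength alpha, and block size are assumptions, not the authors' BDWT settings):

```python
# Minimal block-DWT additive embedding sketch (not the paper's method).
import numpy as np
import pywt

def embed_block(block, watermark_bits, alpha=2.0):
    # One-level 2-D DWT of the block; embed into the approximation band cA.
    cA, (cH, cV, cD) = pywt.dwt2(block.astype(float), "haar")
    for i, bit in enumerate(watermark_bits):
        cA.flat[i] += alpha if bit else -alpha   # additive +/- alpha embedding
    return pywt.idwt2((cA, (cH, cV, cD)), "haar")

block = np.random.randint(0, 256, (8, 8))
watermarked = embed_block(block, [1, 0, 1, 1])
```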
Tracking Recurring Concepts from Evolving Data Streams using Ensemble Method
Yange Sun1,2, Zhihai Wang2, Jidong Yuan2, and Wei Zhang2
1School of Computer and Information Technology, Xinyang Normal University, China
2School of Computer and Information Technology, Beijing Jiaotong University, China
Abstract: Ensemble models are the most widely used methods for classifying evolving data streams. However, most existing data stream ensemble classification algorithms do not consider the issue of recurring concepts, which commonly exist in real-world applications. Motivated by this challenge, an Ensemble with internal Change Detection (ECD) is proposed to enhance performance by exploiting recurring concepts. This is done by maintaining a pool of classifiers, dynamically adding and removing classifiers in response to the change detector. The algorithm adopts a two-window change detection model that uses the Jensen-Shannon divergence to measure the distance between the distributions of old and recent data. When a change is detected, the repository of stored historical concepts is checked for reuse. Experimental results on both synthetic and real-world data streams demonstrate that the proposed algorithm not only outperforms state-of-the-art methods on standard evaluation metrics, but also adapts well to different types of concept drift scenarios, especially when concepts reappear.
Keywords: Data streams, ensemble classification, change detection, recurring concept, Jensen-Shannon divergence.
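For reference, the standard Jensen-Shannon divergence between two discrete distributions, as used by the two-window detector described above (a textbook formulation, not the authors' code):

```python
# Jensen-Shannon divergence: JSD(P, Q) = 0.5*KL(P||M) + 0.5*KL(Q||M),
# with M = (P + Q) / 2. With log base 2 it is bounded in [0, 1].
import numpy as np

def js_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Distributions over the old and recent windows (placeholder histograms).
print(js_divergence([0.5, 0.5], [0.9, 0.1]))
```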
The Influence of Data Classification Methods on Predictive Accuracy of Kernel Density Estimation Hotspot Maps
Nenad Milic1, Brankica Popovic2, Sasa Mijalkovic1, and Darko Marinkovic1
1Department of Criminalistics, University of Criminal Investigation and Police Studies, Serbia
2Department of Informatics and Computer Science, University of Criminal Investigation and Police Studies, Serbia
Abstract: When it comes to hot spot identification, spatial analysis techniques come to the fore. One such technique, which has gained great popularity among crime analysts, is Kernel Density Estimation (KDE). Small variations in KDE parameters can give different outputs and hence affect the predictive accuracy of the hotspot map. The influence these parameters have on KDE hotspot output has sparked much research, mostly analyzing the influence of cell size and bandwidth size. Yet the influence of the classification method applied to the calculated cell values, including the choice of threshold value, on KDE hotspot predictive accuracy has remained neglected. The objective of this research was to assess the influence of different classification methods on KDE predictive accuracy. In each KDE computation, the calculated cell values were divided into five thematic classes, using the three most common (default) classification methods provided by the Environmental Systems Research Institute (ESRI) Geographical Information System (ArcGIS) (equal interval, quantile, and natural breaks classification) and incremental multiples of the grid cells' mean. Based upon calculated hit rates, predictive accuracy indices, and recapture rate indices, and taking into account the necessity that the mapping output satisfy operational requirements as well as statistical rules, this research suggests that the incremental mean approach, with a hotspot threshold of three or more multiples of the grid cells' mean, should be used.
Keywords: Crime mapping, hot spot, kernel density, classification methods.
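A minimal sketch of the incremental-mean classification described above: cell values are binned by multiples of the grid cells' mean, with the hotspot class at three or more multiples (the synthetic cell values are placeholders, not crime data):

```python
# Minimal incremental-mean thematic classification of KDE cell values.
import numpy as np

cell_values = np.random.gamma(shape=1.0, scale=2.0, size=10000)  # placeholder KDE surface
mean = cell_values.mean()

# Five thematic classes bounded at 1x, 2x, 3x, and 4x the mean.
bounds = [1 * mean, 2 * mean, 3 * mean, 4 * mean]
classes = np.digitize(cell_values, bounds)

# Hotspot threshold: cells at 3x the mean or above.
hotspot_cells = cell_values[cell_values >= 3 * mean]
print(len(hotspot_cells), "hotspot cells")
```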
Correlation Dependencies between Variables in Feature Selection on Boolean Symbolic Objects
Djamal Ziani
College of Business and Administration, Al Yamamah University, Saudi Arabia
Abstract: Feature selection is an important process in data analysis and data mining. The increasing size, complexity, and multi-valued nature of data necessitate the use of Symbolic Data Analysis (SDA), which analyzes symbolic objects instead of classical tables. Symbolic objects are created by applying abstraction or generalization techniques to individuals; they are representations of concepts or clusters. To improve the description of these objects and to eliminate incoherencies and over-generalization, using dependencies between variables is crucial in SDA. This study shows how correlation dependencies between variables can be processed on Boolean Symbolic Objects (BSOs) in feature selection. A new feature selection criterion that considers the dependencies between variables is presented, along with a method for dealing with the computational complexity.
Keywords: Feature selection, dependencies, symbolic data analysis, discrimination criteria.
Towards Achieving Optimal Performance using Stacked Generalization Algorithm: A Case Study of Clinical Diagnosis of Malaria Fever
Abiodun Oguntimilehin1, Olusola Adetunmbi2, and Innocent Osho3
1Department of Computer Science, Afe Babalola University, Ado-Ekiti, Nigeria
2Department of Computer Science, Federal University of Technology, Akure, Nigeria
3Department of Animal Production and Health, Federal University of Technology, Akure, Nigeria
Abstract: The birth of data mining has been a blessing to all fields of endeavour, and numerous data mining algorithms are available today. One of the major problems in mining data is selecting the appropriate algorithm or model for the job at hand, which has led to many comparison experiments by researchers. Stacked generalization is one method of combining multiple models to obtain better accuracy, and many researchers over the years have found it to be effective. This study investigates how optimal performance can be achieved using the stacked generalization algorithm. Six data mining algorithms (PART, REP Tree, J48, Random Tree, RIDOR, and JRIP), arranged in two different orders, were used as base learners with two different meta learners (Random Forest and NNGE) independently, and the results were compared in terms of classification accuracy. The study shows that the order of arrangement of the base learners and the choice of meta learner can affect the accuracy of the stacked generalization method; NNGE outperforms Random Forest as a meta learner, and, unlike Random Forest, its performance is independent of the order of arrangement of the base learners. Malaria fever datasets collected from reputable hospitals in Ado-Ekiti, Ekiti State, Nigeria were purposefully used for this study, because malaria is one of the major diseases of the tropical region of Africa, killing almost a million people yearly; a more accurate malaria fever diagnosis model is therefore also proposed as a result of this study.
Keywords: Data mining, ensemble learning, stacked generalization, malaria, diagnosis.
Received October 13, 2016; accepted October 11, 2017
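A minimal sketch of stacked generalization with scikit-learn (the Weka learners named above, such as JRIP and NNGE, have no direct scikit-learn counterparts, so a decision tree and a nearest-neighbour learner stand in; the dataset is a placeholder):

```python
# Minimal stacked-generalization sketch: base learners feed a meta learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # placeholder data

base_learners = [
    ("tree", DecisionTreeClassifier(random_state=0)),  # stand-in for J48
    ("knn", KNeighborsClassifier()),                   # NNGE is nearest-neighbour based
]
meta = RandomForestClassifier(random_state=0)          # one of the paper's meta learners
stack = StackingClassifier(estimators=base_learners, final_estimator=meta)
stack.fit(X, y)
print(stack.score(X, y))
```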
Optimal Dual Cameras Setup for Motion Recognition in Salat Activity
Nor Azrini Jaafar1, Nor Azman Ismail1, Kamarul Azmi Jasmi2, and Yusman Azimi Yusoff1
1Faculty of Computing, Universiti Teknologi Malaysia, Malaysia
2Faculty of Islamic Civilization, Universiti Teknologi Malaysia, Malaysia
Abstract: Motion recognition has received significant attention in recent years in the area of computer vision, since it has a wide range of potential applications. A wide variety of algorithms and techniques have been proposed for developing human motion recognition systems. This paper investigates the optimal dual-sensor setup for motion recognition in salat activity, a multisensor perspective that has remained unexplored. Existing works in the field are able to recognise a few salat movements, but not from the multisensor perspective, which is important for better recognition and analytic results. This research proposes a solution relevant to one of the fundamental activities required of every Muslim: salat. Carrying out salat with the right actions not only helps strengthen our relationship with Allah Subhanahu Wa Taʿala (SWT), but also enables the formation of a positive personality, mental well-being, and physical health. Firstly, this research identified the best position setup for a dual sensor. Then, a Hidden Markov Model was used to classify all movements in the salat activity, with the data trained before the testing phase. This study leads to a new way of learning salat activity that can be further explored and developed, and it contributes a new form of interaction in human-computer interaction. The outcome of this research will be very useful in validating the salat movements of every Muslim.
Keywords: Motion recognition, salat activity, multisensor, hidden Markov model, human-computer interaction.
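A hypothetical sketch of HMM-based movement modelling with the hmmlearn library (the feature arrays are random placeholders, not the study's sensor data, and the state count is an assumption):

```python
# Hypothetical HMM sketch: one model per movement class, trained on that
# class's feature sequences; a new sequence is scored by log-likelihood.
import numpy as np
from hmmlearn import hmm

train_seq = np.random.rand(200, 6)    # placeholder dual-sensor feature frames
model = hmm.GaussianHMM(n_components=4, n_iter=50)
model.fit(train_seq)

# In a full classifier, the class whose model scores highest wins.
test_seq = np.random.rand(40, 6)
print(model.score(test_seq))          # log-likelihood under this class's model
```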
Exploitation of ICMP Time Exceeded Packets for a Large-Scale Router Delay Analysis
Ali Gezer1 and Gary Warner2
1Electronic and Telecommunication Technology, Kayseri University, Turkey
2Computer Science, University of Alabama at Birmingham, USA
Abstract: Internet Control Message Protocol Time-Exceeded (ICMP-TE) packets are a particular protocol message used to signal the inaccessibility of nodes in terms of hop-count limitations. With the Internet of Things (IoT) concept taking more space in our daily life, the accessibility, or in some cases inaccessibility, of hosts should be analysed more carefully. ICMP time exceeded packets might be a tool in the hands of an attacker, an indicator of compromise for a possible IoT botnet attack, or a tool for delay measurement. In this study, by exploiting ICMP time exceeded packets, we analyse the Round Trip Time (RTT) delays of randomly distributed IP routers around the globe. We conduct a comprehensive delay analysis study considering the delay results of more than one million time exceeded packets received in return for the corresponding ICMP requests. To show that ICMP time exceeded packets might also be a signature of a possible IoT botnet attack, we carry out a controlled experiment with Mirai IoT botnet scanning and exhibit the indicators that differentiate these two possible usages.
Keywords: ICMP time exceeded packet, IoT botnet, Mirai botnet, RTT delay, performance analysis, quality of service.
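A minimal sketch of a TTL-limited probe that elicits an ICMP Time-Exceeded reply from an intermediate router and measures its RTT, using Scapy (requires raw-socket privileges; the target address and TTL are placeholders):

```python
# Minimal TTL-limited RTT probe with Scapy (run as root; illustrative only).
from scapy.all import IP, ICMP, sr1

probe = IP(dst="198.51.100.10", ttl=5) / ICMP()   # expires at the 5th hop
reply = sr1(probe, timeout=2, verbose=0)
if reply is not None and reply.type == 11:        # 11 = ICMP Time Exceeded
    rtt_ms = (reply.time - probe.sent_time) * 1000
    print(reply.src, round(rtt_ms, 1), "ms")      # the router that replied, and its RTT
```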
PatHT: An Efficient Method of Classification over Evolving Data Streams
Meng Han, Jian Ding, and Juan Li
School of Computer Science and Engineering, North Minzu University, China
Abstract: Some existing classifiers need frequent updates to adapt to concept changes in data streams. To solve this problem, an adaptive method, the Pattern-based Hoeffding Tree (PatHT), is proposed for processing evolving data streams. A key issue in training a classification decision tree is improving the efficiency of choosing an optimal splitting attribute; therefore, frequent patterns are used. PatHT discovers constraint-based closed frequent patterns with incremental updates and builds an adaptive, incrementally updated tree based on the frequent pattern set. It uses a sliding window to avoid concept drift when mining patterns, and a concept drift detector to deal with concept changes while training on examples. We tested the performance of PatHT against several known algorithms using real and synthetic data streams with different widths of concept change. The experimental results show that our approach outperforms traditional classification models.
Keywords: Data mining, decision tree, data stream classification, closed pattern mining, concept drift.
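For reference, the Hoeffding bound that Hoeffding-tree learners use to decide when enough examples have been seen to pick a splitting attribute (a standard formulation; the parameter values below are illustrative):

```python
# Hoeffding bound: after n examples, the true mean of a metric with range R
# differs from the observed mean by at most epsilon with probability 1 - delta.
import math

def hoeffding_bound(value_range, delta, n):
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Split when the observed gain gap between the two best attributes exceeds epsilon.
epsilon = hoeffding_bound(value_range=1.0, delta=1e-7, n=500)
print(epsilon)
```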
FBMT: Fuzzy Based Merkle Technique for Detecting and Mitigating Malicious Nodes in Sensor Networks
Ranjeeth Kumar Sundararajan1 and Umamakeswari Arumugam2
1Department of Computer Science and Engineering, Srinivasa Ramanujan Centre, SASTRA Deemed University, India
2Department of Computer Science and Engineering, School of Computing, SASTRA Deemed University, India
Abstract: Wireless sensor networks are prone to many vulnerabilities because of their unattended deployment environments. Intrusion is one of the serious issues in wireless networks; since wireless networks are resource-constrained, devising a security mechanism to counter intrusion is a challenging task. This paper focuses on building a light-weight Intrusion Detection System (IDS) to counter routing attacks by identifying malicious nodes at the earliest point. The proposed scheme, namely the Fuzzy Based Merkle Technique, applies fuzzy logic to identify malicious nodes, builds a light-weight intrusion detection system, and adapts the Merkle tree approach for building the network. The proposed scheme identifies malicious nodes with lower energy consumption and less communication overhead than the existing Merkle technique. Network Simulator 2 is used to simulate the IDS, and the results are verified.
Keywords: Wireless sensor networks, intrusion detection systems, fuzzy logic, Merkle tree.
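A minimal sketch of Merkle root computation over node reports, illustrating the Merkle tree approach named above (the leaf encoding is an assumption, not the paper's scheme):

```python
# Minimal Merkle root sketch: hash leaves, then hash pairs level by level.
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"node1:reading", b"node2:reading", b"node3:reading"])
print(root.hex())
```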
An Ontology Alignment Hybrid Method Based on Decision Rules
Mohamed Biniz1 and Mohamed Fakir2
1Department of Mathematics and Computer Science, Sultan Moulay Slimane University, Morocco
2Department of Informatics, Sultan Moulay Slimane University, Morocco
Abstract: In this paper, we propose a hybrid approach based on the extraction of decision rules to refine the alignment results produced by three alignment strategies. The approach has two phases: a training phase, which uses structural similarity, element similarity, instance-based similarity, and the C4.5 algorithm to extract decision rules, and an evaluation phase, which refines the alignment discovered by the three alignment strategies using the extracted decision rules. The approach is compared with the best systems in the OAEI 2016 benchmark: the Framework for Ontology Alignment and Mapping (FOAM), A Dynamic Multistrategy Ontology Alignment Framework (RiMOM), AgreementMakerLight, and Yet Another Matcher-Biomedical Ontologies (YAM-BIO). The proposed method gives good recall, precision, and F-measure, and the experimental results show that it is effective.
Keywords: Decision rules, alignment, ontology, similarity, similarity flooding.
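As a hypothetical illustration of one ingredient above, here is a simple element (label) similarity using Python's difflib (the measure actually used by the paper is not specified here):

```python
# Hypothetical element-similarity measure between two ontology labels.
from difflib import SequenceMatcher

def element_similarity(label_a: str, label_b: str) -> float:
    # Ratio of matching character blocks, case-insensitive.
    return SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()

print(element_similarity("hasAuthor", "has_author"))   # close to 1.0
```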
Efficient Multiple Pattern Matching Algorithm Based on BMH: MP-BMH
Akhtar Rasool1, Gulfishan Firdose Ahmed2, Raju Barskar3, and Nilay Khare1
1Department of Computer Science and Engineering, Maulana Azad National Institute of Technology, India
2Department of Computer Science, College of Agriculture, India
3Department of Computer Science and Engineering, University Institute of Technology RGPV, India
Abstract: String matching plays a key role in a wide range of applications in information computing. Due to its importance, a large number of different algorithms have been developed over the last 50 years. Various standard single and multiple pattern string matching algorithms exist, such as Boyer-Moore (BM), Boyer-Moore-Horspool (BMH), and Aho-Corasick. Multiple pattern string matching is very useful in real-world applications, but the standard benchmark multiple pattern algorithm, Aho-Corasick, and many other multiple pattern string matching algorithms are not memory- and time-efficient for long patterns and large pattern sets on large text data sets, because they use DFA-like constructions and require a full scan of the text data. Many currently popular string matching tools are based on these algorithms. Using the bad-character principle of BMH, a method for multiple pattern string matching is developed, from which a time- and memory-efficient exact multiple pattern string matching algorithm, Multiple Pattern BMH (MP-BMH), is proposed. Unlike Aho-Corasick, the proposed MP-BMH algorithm skips unnecessary character comparisons in the text while searching, like the BMH algorithm, and also skips characters in patterns by exploiting the similarity between patterns. Along with the shifts while searching, the algorithm provides shrewd switches among the patterns for efficient matching. In this paper, the MP-BMH algorithm is described together with its time, memory, and experimental analysis.
Keywords: String matching, multiple pattern string matching, Boyer-Moore (BM), BMH, MP-BMH.
Received January 8, 2017; accepted January 16, 2018
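For reference, the single-pattern BMH bad-character rule that MP-BMH builds on (a standard BMH implementation, not MP-BMH itself):

```python
# Standard Boyer-Moore-Horspool search: on each window, the shift is taken
# from the last character of the current text window.
def bmh_search(text, pattern):
    m, n = len(pattern), len(text)
    # Default shift is m; characters occurring in pattern[0..m-2] shift less.
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = 0
    positions = []
    while i <= n - m:
        if text[i:i + m] == pattern:
            positions.append(i)
        i += shift.get(text[i + m - 1], m)
    return positions

print(bmh_search("abracadabra", "abra"))   # [0, 7]
```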