Pairwise Sequence Alignment using Bio-Database Compression by Improved Fine Tuned Enhanced Suffix Array
Kunthavai A1, Vasantharathna S2, and Thirumurugan S3
1,3Department of Computer Science & Engineering / IT, Coimbatore Institute of Technology, India.
2Department of Electrical & Electronics Engineering, Coimbatore Institute of Technology, India.
Abstract: Sequence alignment is a bioinformatics application that determines the degree of similarity between nucleotide sequences which is assumed to have same ancestral relationships. This sequence alignment method reads query sequence from the user and makes an alignment against large and genomic sequence data sets and locate targets that are similar to an input query sequence. Existing accurate algorithm, such as Smith-Waterman and FASTA are computationally very expensive, which limits their use in practice. The existing search tools, such as BLAST and WU-BLAST, employ heuristics to improve the speed of such searches. However, such heuristics can sometimes miss targets, in which many cases are undesirable. Considering the rapid growth of database sizes, this problem demands ever-growing computation resources, and remains as a computational challenge. Most common sequence alignment algorithms like BLAST, WU-BLAST, and SCT searches a given query sequence against set of database sequences. In this paper BioDBMPHF Tool has been developed to find pair wise local sequence alignment by preprocessing the database. Preprocessing is done by means of finding Longest Common Substring (LCS) from the database of sequences that have the highest local similarity with a given query sequence and reduces the size of the database based on frequent common subsequence. In this BioDBMPHF Tool fine-tuned enhanced suffix array is constructed and used to find LCS. Experimental results show that HashIndexalgorithm reduces the time and space complexity to access LCS. Time complexity to find LCS of the HashIndexalgorithm is O (2 + γ) where ‘γ’ is the time taken to access the pattern. Space complexity of fine-tuned enhanced suffix array is 5n bytes per character for reduced enhanced Lcp table and to store bucket table it requires 32 bytes. Data mining technique is used to cross validate the result. It is proved that the developed BioDBMPHF Tool effectively compresses the database and obtains same results compared to that traditional algorithm in approximately half the time taken by them thereby reducing the time complexity.
Keywords: Sequence alignment, enhanced suffix array, compression, minimum perfect hash function, data mining
Recevied October 25, 2012; accepted January 1, 2013