Approximate string matching in DNA sequences
"Approximate String Matching in DNA Sequences"
Cheng Lok Lam
for the degree of Master of Philosophy at The University of Hong Kong
in August 2003
DNA (Deoxyribonucleic acid) sequences hold the code of life of living organisms. Approximate string matching on DNA sequences is very important in research in the fields of medicine and health. Although the approximate string matching problem has been studied extensively in the field of computer science, many solutions cannot be applied directly on DNA data. This is because DNA data is a very large data set (GenBank had recorded more than 20Gbp of DNA sequences at June 2002).
A method based on suffix tree (or suffix array) and a partitioning of the query string into sub-queries was proposed recently. The method has been shown to be efficient for a long query with a large error bound for the main memory model.
Due to the large data volume of DNA, in many instances, the suffix tree (or suffix array) is larger than the main memory and must be stored in external memory. In our study, the technique is extended to external memory model. It is shown that the method using suffix array performs better than suffix tree for the external memory based model, and an algorithm is proposed for building suffix array efficiently in external memory.
A novel auxiliary data structure is also proposed. The data structure greatly improves the efficiency of suffix array in the approximate string matching problem in the external memory model.
The parallel approximate matching problem in DNA sequences is also investigated. Two novel parallel algorithms for PC cluster are proposed. Experimental results show that when the error rate is small a partitioning of the suffix array over the machines in the cluster is a more efficient approach. Conversely, partitioning the data over the machines is a better approach, if the error rate is large.
School:The University of Hong Kong
School Location:China - Hong Kong SAR
Source Type:Master's Thesis
Keywords:dna data processing bioinformatics database searching computer algorithms
Date of Publication:01/01/2004