Approximate string matching in DNA sequences

by Cheng, Lok-lam

Abstract (Summary)
(Uncorrected OCR) Abstract of thesis entitled

"Approximate String Matching in DNA Sequences"

Submitted by

Cheng Lok Lam

for the degree of Master of Philosophy at The University of Hong Kong

in August 2003

DNA (Deoxyribonucleic acid) sequences hold the code of life of living organisms. Approximate string matching on DNA sequences is very important in research in the fields of medicine and health. Although the approximate string matching problem has been studied extensively in the field of computer science, many solutions cannot be applied directly on DNA data. This is because DNA data is a very large data set (GenBank had recorded more than 20Gbp of DNA sequences at June 2002).

A method based on suffix tree (or suffix array) and a partitioning of the query string into sub-queries was proposed recently. The method has been shown to be efficient for a long query with a large error bound for the main memory model.

Due to the large data volume of DNA, in many instances, the suffix tree (or suffix array) is larger than the main memory and must be stored in external memory. In our study, the technique is extended to external memory model. It is shown that the method using suffix array performs better than suffix tree for the external memory based model, and an algorithm is proposed for building suffix array efficiently in external memory.

A novel auxiliary data structure is also proposed. The data structure greatly improves the efficiency of suffix array in the approximate string matching problem in the external memory model.

The parallel approximate matching problem in DNA sequences is also investigated. Two novel parallel algorithms for PC cluster are proposed. Experimental results show that when the error rate is small a partitioning of the suffix array over the machines in the cluster is a more efficient approach. Conversely, partitioning the data over the machines is a better approach, if the error rate is large.

Bibliographical Information:


School:The University of Hong Kong

School Location:China - Hong Kong SAR

Source Type:Master's Thesis

Keywords:dna data processing bioinformatics database searching computer algorithms


Date of Publication:01/01/2004

© 2009 All Rights Reserved.