Algorithms for string matching with applications in molecular biology

by Holloway, James Lee

Abstract (Summary)
As the volume of genetic sequence data increases due to improved sequencing

techniques and increased interest, the computational tools available to analyze

the data are becoming inadequate. This thesis seeks to improve a few of the computational

methods available to access and analyze data in the genetic sequence

databases. The first two results are parallel algorithms based on previously known

sequential algorithms. The third result is a new approach, based on assumptions

that we believe make sense in the biological context of the problem, to approximating

an NP complete problem. The final result is a fundamentally new approach

to approximate string matching using the divide and conquer paradigm instead

of the dynamic programming approach that has been used almost exclusively in

the past.

Dynamic programming algorithms to measure the distance between sequences

have been known since at least 1972. Recently there has been interest

in developing parallel algorithms to measure the distance between two sequences.

We have developed an optimal parallel algorithm to find the edit distance, a metric

frequently used to measure distance, between two sequences.

It is often interesting to find the substrings of length k that appear most

frequently in a given string. We give a simple sequential algorithm to solve this problem and an efficient parallel version of the algorithm. The parallel algorithm

uses an efficient novel parallel bucket sort.

When sequencing a large segment of DNA, the original DNA sequence is

reconstructed using the results of sequencing fragments, that may or may not

contain errors, of many copies of the original DNA. New algorithms are given to

solve the problem of reconstructing the original DNA sequence with and without

errors introduced into the fragments. A program based on this algorithm is used

to reconstruct the human beta globin region (HUMHBB) when given a set of 300

to 500 mers drawn randomly from the HUMHBB region.

Approximate string matching is used in a biological context to model the

steps of evolution. While such evolution may proceed base by base using the

change, insert, or delete operators, there is also evidence that whole genes may

be moved or inverted. We introduce a new problem, the string to string rearrangement

problem, that allows movement and inversion of substrings. We give

a divide and conquer algorithm for finding a rearrangement of one string within


Bibliographical Information:

Advisor:Cull, Paul; Rudd, Walter; D'Ambrosio, Bruce; Bell, Christopher

School:Oregon State University

School Location:USA - Oregon

Source Type:Master's Thesis

Keywords:nucleotide sequence data processing amino acid


Date of Publication:05/07/2009

© 2009 All Rights Reserved.