Evaluation of protein sequence classification patterns
Classification of protein sequences is one of the foundations of bioinformatics, as new proteins are sequenced every day. Each protein sequence represents a protein of a certain family, and its function can sometimes be predicted through sequence classification. Several approaches to sequence classification exist today; this work considers pattern-based approaches. A pattern is an expression that represents a certain protein family and that the sequences belonging to that family are expected to match. PROSITE is a pattern collection that is well known in bioinformatics and therefore plays an important part in this project, together with the MAMA pattern collection. Evaluation of patterns today focuses on accuracy, i.e. sensitivity and specificity, but in this thesis information content is also considered. The originally planned experiment, which looked for a relationship between accuracy and information content, found no clear connection. This led to the conclusion that information content might not be a suitable measure for evaluating patterns. The second experiment concerned the fact that the same sequences are sometimes used for both training and testing, which probably gives misleadingly high accuracy values. This suggested the hypothesis that evaluating on a test set independent of the training set yields lower accuracy values, which a number of tests confirmed. The last experiment, which concerned a new system for evaluating whole pattern collections, is presented with results showing that MAMA performs better than PROSITE according to this system.
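To make the accuracy measures concrete, the following is a minimal Python sketch, not the evaluation code used in the thesis. It assumes a simplified subset of the PROSITE pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition) and the common definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the thesis may define specificity differently. The function names, the example pattern, and the toy sequences are all hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ED} (forbidden residues) becomes a negated character class.
        element = element.replace("{", "[^").replace("}", "]")
        # x(2) or x(2,4) (repetition) becomes a regex quantifier.
        element = element.replace("(", "{").replace(")", "}")
        # x stands for any residue.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the sequence ends.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def evaluate(pattern: str, family: list[str], non_family: list[str]) -> tuple[float, float]:
    """Return (sensitivity, specificity) of a pattern on labelled sequences."""
    regex = re.compile(prosite_to_regex(pattern))
    tp = sum(1 for seq in family if regex.search(seq))      # family members matched
    fn = len(family) - tp                                    # family members missed
    fp = sum(1 for seq in non_family if regex.search(seq))  # non-members matched
    tn = len(non_family) - fp                                # non-members rejected
    sensitivity = tp / (tp + fn) if family else 0.0
    specificity = tn / (tn + fp) if non_family else 0.0
    return sensitivity, specificity


# Toy data, invented for illustration only.
PATTERN = "[AC]-x-V-x(2)-{ED}"
FAMILY = ["MAGVKLT", "CKVAAQ", "GGGGGG"]
NON_FAMILY = ["AKVLLE", "AKVLLT", "PPPPPP", "WWWWWW"]

if __name__ == "__main__":
    sens, spec = evaluate(PATTERN, FAMILY, NON_FAMILY)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

If `FAMILY` were also the set from which the pattern was derived, the sensitivity computed this way would be the kind of optimistic estimate the second experiment warns about; an independent test set would replace `FAMILY` and `NON_FAMILY` with sequences not used to build the pattern.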
School: Högskolan i Skövde
Source Type: Master's Thesis
Keywords: performance protein sequence classification patterns
Date of Publication: 02/06/2008