Increasing the speed and efficiency of search in FBI/CODIS DNA database : through multivariate statistical clustering approach and development of a similarity ranking scheme /

by Yadav, Puneet.

Abstract (Summary)
A new method has been developed to create and maintain a tree-structured search index for multidimensional data using naturally occurring patterns and clusters within the data, thereby allowing efficient search and retrieval strategies to be implemented in a database. The method was applied to a DNA database developed by the FBI for forensic use. A set of 10,000 DNA/STR profiles was generated for sixteen loci, based on the STR allele probability distribution for Caucasians. The resulting allele distribution was analyzed using multivariate statistical analysis; specifically, Principal Component Analysis (PCA) was employed to detect clustering patterns among the profiles. The analysis revealed that certain locus pairs (such as D13S317 and D16S539) yielded well-separated, distinct clusters. Members of each distinct cluster were studied further to determine the attributes that distinguished them from members of all other clusters. PCA of a real DNA/STR dataset showed similar clustering patterns. To rank the profiles returned by a search by their similarity to the target profile, a new Similarity Index (SI) parameter was developed. The Similarity Index was successfully tested on a small (126) and a large (1026) dataset. In addition, a Shuffling Index was developed to study the sensitivity of the Similarity Index to the weights assigned to its sub-parameters. Results show that the similarity ranking of profiles remains stable over a wide range of weights.
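The profile-generation and ranking steps summarized in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the locus names (`LOCUS_A`, `LOCUS_B`), the allele frequencies, and the `match_score` function are hypothetical. In particular, `match_score` is a simple weighted shared-allele fraction used as a stand-in; the abstract does not define the thesis's Similarity Index, and this is not it.

```python
import random

# Hypothetical allele frequencies for two made-up loci. These are placeholder
# values, not real population data and not taken from the thesis.
ALLELE_FREQS = {
    "LOCUS_A": {8: 0.10, 9: 0.08, 10: 0.05, 11: 0.32, 12: 0.30, 13: 0.10, 14: 0.05},
    "LOCUS_B": {9: 0.10, 10: 0.06, 11: 0.27, 12: 0.34, 13: 0.16, 14: 0.07},
}


def generate_profile(rng):
    """Draw two alleles per locus, independently, from the frequency table."""
    profile = {}
    for locus, freqs in ALLELE_FREQS.items():
        alleles = rng.choices(list(freqs), weights=list(freqs.values()), k=2)
        profile[locus] = tuple(sorted(alleles))
    return profile


def match_score(a, b, weights=None):
    """Weighted fraction of shared alleles, in [0, 1].

    Illustrative stand-in only; NOT the Similarity Index from the thesis.
    """
    weights = weights or {locus: 1.0 for locus in ALLELE_FREQS}
    total = sum(weights.values())
    score = 0.0
    for locus, w in weights.items():
        # Multiset intersection of the two allele pairs (0, 1, or 2 shared).
        shared = sum(min(a[locus].count(x), b[locus].count(x)) for x in set(a[locus]))
        score += w * shared / 2
    return score / total


rng = random.Random(0)  # fixed seed so the run is repeatable
profiles = [generate_profile(rng) for _ in range(1000)]
target = profiles[0]

# Rank all profiles by decreasing similarity to the target profile.
ranked = sorted(profiles, key=lambda p: match_score(target, p), reverse=True)
```

Passing a different `weights` dictionary to `match_score` and comparing the resulting rankings is one simple way to probe how sensitive a ranking is to the choice of weights, which is the kind of question the Shuffling Index is described as addressing.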
Bibliographical Information:


School: The University of Tennessee at Chattanooga

School Location: USA - Tennessee

Source Type: Master's Thesis



Date of Publication:

© 2009 All Rights Reserved.