Employ machine learning to unveil encrypted molecular patterns within proteomic and genomic profiles to assist in personalized medical diagnosis.

by Carvalho, Paulo Costa

Abstract (Summary)
Motivation: Employ machine learning to unveil encrypted molecular patterns within proteomic and genomic profiles to assist in personalized medical diagnosis.

Results and conclusions:

1. Proteomic profile studies: Patients with Hodgkin's disease (HD), a rare type of lymphoma, had their serum proteomic profile compared to that of control subjects (CS) in order to search for differentially expressed protein patterns. Initially, a serum protein 1D gel analysis revealed two over-expressed proteins (~26 and 18 kDa) in HD patients (p < 0.01). To search further for discriminatory patterns, serum mass spectra from 30 CS and 30 HD patients were obtained by electrospray mass spectrometry. A support vector machine (SVM) approach correctly classified all spectra as either controls or Hodgkin's disease patients under leave-one-out cross-validation. Subsequently, a new algorithm named "maximum divergence analysis" (MDA) was employed to track biomarkers in the multi-charged spectra data. Two differentially expressed peaks were able to correctly classify 97% of all subjects. To our knowledge, this was the first time SVM was applied to ESI multi-charged spectra for medical diagnosis. A new approach for resolving multi-class problems, called the ellipsoid clustering machine (ECM), was then used to define a CS domain in a feature space. This method is advantageous when dealing with heterogeneous sets because it efficiently defines a pattern, is able to generalize, and is applicable to multi-class problems. All CS and HD patients were correctly classified under leave-one-out cross-validation using the ECM model. The elliptical boundaries could be a geometrical definition of a Hodgkin's disease-free / control serum standard. It is hoped that, by adding new biomarkers to the model, it could be used to diagnose various types of cancer.

2. Genetic profile analysis: In this study, an improved hypertension risk evaluation method combining one's renin-angiotensin-aldosterone system (RAAS) genomic profile with pertinent clinical data is demonstrated. The most relevant clinical features are chosen by querying a "pre-computed for a given genetic profile" feature subset database. Disease risk is evaluated by classifying the patient's data with a support vector machine model, then measuring the Euclidean distance to the decision hyperplane. To create this database, a new hybrid feature selection / ranking method was used to generate feature subsets from information acquired from Brazilian hypertension patients. The application of feature selection to RAAS haplotypes ascertained their association with hypertension and elucidated distinct polymorphism patterns for different ethnic groups.

3. Distributed computing for future studies: To carry out faster feature selection and classification studies, grid computing should be employed. Most distributed computing / grid solutions have complex installation procedures requiring specialist support, or have limitations regarding operating systems. In this work, we demonstrate Squid, a new multi-platform, open-source program designed to "keep things simple" while offering high-end computing power for large-scale applications. Squid also has an efficient fault tolerance and crash recovery system against data loss, being able to re-route jobs upon node failure and to recover even if the master node fails.
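The abstract mentions two generic building blocks: validating an SVM classifier by leave-one-out cross-validation, and scoring risk as the distance of a sample to the SVM decision hyperplane. The sketch below illustrates both on invented toy data. It is not the thesis code: the function names, the toy points, the linear kernel, and the plain subgradient-descent trainer are all assumptions made for illustration, and the thesis does not state which kernel or solver was used.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=2000):
    """Fit a linear SVM (labels in {-1, +1}) by batch subgradient descent
    on the L2-regularised hinge loss. Hypothetical stand-in trainer."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:  # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def signed_distance(w, b, x):
    """Signed Euclidean distance from x to the hyperplane w.x + b = 0.
    The sign gives the predicted class; the magnitude can serve as a score."""
    return (dot(w, x) + b) / math.sqrt(dot(w, w))


def predict(w, b, x):
    """Class label (+1 or -1) from the side of the hyperplane x falls on."""
    return 1 if dot(w, x) + b >= 0 else -1


def leave_one_out_accuracy(X, y):
    """Train on all samples but one, test on the held-out sample, repeat."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        w, b = train_linear_svm(X_train, y_train)
        hits += predict(w, b, X[i]) == y[i]
    return hits / len(X)


# Invented toy data: two well-separated groups in a 2-D feature space.
X_DEMO = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
          [4.0, 4.0], [5.0, 4.0], [4.0, 5.0], [5.0, 5.0]]
Y_DEMO = [-1, -1, -1, -1, 1, 1, 1, 1]

if __name__ == "__main__":
    print("leave-one-out accuracy:", leave_one_out_accuracy(X_DEMO, Y_DEMO))
    w, b = train_linear_svm(X_DEMO, Y_DEMO)
    print("signed distance of [3.0, 3.5]:", signed_distance(w, b, [3.0, 3.5]))
```

In this reading, a sample far from the hyperplane on the "disease" side would receive a higher risk score than one lying close to the boundary. How the thesis maps distance to risk is not specified in the abstract.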
This document abstract is also available in Portuguese.
Bibliographical Information:

Advisor: Gilberto Barbosa Domont; Wim Maurits Sylvain Degrave

School: Faculdades Oswaldo Cruz

School Location: Brazil

Source Type: Master's Thesis

Keywords: Proteomic profile studies; Genetic analysis

Date of Publication: 11/24/2005
