Information-Theoretic Variable Selection and Network Inference from Microarray Data
Abstract (Summary)
Statisticians are accustomed to modeling interactions between variables on the basis of
observed data. In many emerging fields, such as bioinformatics, they are confronted with
datasets having thousands of variables, a lot of noise, non-linear dependencies and only
tens of samples. Detecting functional relationships under such uncertainty constitutes a
major challenge.
Our work focuses on variable selection and network inference from datasets having
many variables and few samples (high variable-to-sample ratio), such as microarray data.
Variable selection is the area of machine learning whose objective is to select, among a
set of input variables, those that lead to the best predictive model. Applying variable
selection methods to gene expression data can, for example, improve cancer diagnosis and
prognosis by identifying a new molecular signature of the disease. Network inference
consists of representing the dependencies between the variables of a dataset by a graph.
Hence, when applied to microarray data, network inference can reverse-engineer the
transcriptional regulatory network of a cell with a view to discovering new drug targets
to cure diseases.
In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset
Information for Variable Elimination), a new feature selection method, and MRNET (Minimum
Redundancy NETwork), a new network inference algorithm. Both tools rely on
the computation of mutual information, an information-theoretic measure of dependency.
More precisely, MASSIVE and MRNET approximate the mutual information between a subset of
variables and a target variable by combining mutual information terms computed between
sub-subsets of variables and the target. These approximations make it possible to estimate
a series of low-variate densities instead of one large multivariate density. Low-variate
densities are well suited to datasets with a high variable-to-sample ratio, since they are
computationally cheap and do not require a large number of samples to be estimated
accurately. Numerous experimental results show the competitiveness of these new
approaches. Finally, this thesis has produced freely available source code for MASSIVE and
an open-source R and Bioconductor package for network inference.
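
As an illustration of the mutual information on which both tools rely, the following minimal Python sketch computes a plug-in (empirical frequency) estimate of the mutual information between two discrete variables, then scores each input variable by its bivariate mutual information with a target. It is not the thesis's implementation: the function names `mutual_information` and `pairwise_relevance` are hypothetical, the plug-in estimator is an assumption (the abstract does not say which estimator is used), and the data are assumed to be already discretized. The point is only that each term needs a two-variable frequency table, which is the kind of low-variate estimate the abstract refers to.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences.

    Assumption: empirical frequencies stand in for the true probabilities.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    count_x = Counter(xs)            # marginal counts of X
    count_y = Counter(ys)            # marginal counts of Y
    count_xy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def pairwise_relevance(columns, target):
    """Score each variable (one column each) by its bivariate MI with the target.

    Hypothetical helper: only two-variable tables are ever estimated.
    """
    return [mutual_information(col, target) for col in columns]


# Toy data: the first variable copies the target, the second is independent of it.
target = [0, 0, 1, 1]
columns = [[0, 0, 1, 1], [0, 1, 0, 1]]
scores = pairwise_relevance(columns, target)
```

On this toy input the first variable receives a score of one bit and the second a score of zero. With only tens of samples, plug-in estimates of this kind are biased, so the sketch shows the quantity being computed rather than a recommended estimator.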
Bibliographical Information:
Advisor:Rossi Fabrice; Verleysen Michel; Gardner Timothy; Lenaerts Tom; Cardinal Jean; Bontempi Gianluca
School:Université libre de Bruxelles
School Location:Belgium
Source Type:Master's Thesis
Keywords:microarray analysis information theory variable selection network inference
ISBN:
Date of Publication:12/16/2008