Information-Theoretic Variable Selection and Network Inference from Microarray Data

by Meyer, Patrick E

Abstract (Summary)
Statisticians are used to modeling interactions between variables on the basis of observed data. In many emerging fields, such as bioinformatics, they are confronted with datasets that have thousands of variables, a lot of noise, non-linear dependencies and only tens of samples. Detecting functional relationships in data that contain this much uncertainty is a major challenge. Our work focuses on variable selection and network inference from datasets with many variables and few samples (a high variable-to-sample ratio), such as microarray data. Variable selection is the area of machine learning whose objective is to select, among a set of input variables, those that lead to the best predictive model. Applying variable selection methods to gene expression data makes it possible, for example, to improve cancer diagnosis and prognosis by identifying a new molecular signature of the disease. Network inference consists of representing the dependencies between the variables of a dataset as a graph. Hence, when applied to microarray data, network inference can reverse-engineer the transcriptional regulatory network of a cell, with a view to discovering new drug targets to cure diseases.
In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset Information for Variable Elimination), a new feature selection method, and MRNET (Minimum Redundancy NETwork), a new network inference algorithm. Both tools rely on the computation of mutual information, an information-theoretic measure of dependency. More precisely, MASSIVE and MRNET approximate the mutual information between a subset of variables and a target variable by combining mutual information terms computed between sub-subsets of variables and the target. These approximations make it possible to estimate a series of low-variate densities instead of one large multivariate density. Low-variate densities are well suited to datasets with a high variable-to-sample ratio, since they are rather cheap to compute and do not require a large number of samples to be estimated accurately. Numerous experimental results show the competitiveness of these new approaches. Finally, our thesis has led to freely available source code for MASSIVE and to an open-source R and Bioconductor package for network inference.
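To make the idea of low-variate approximations concrete, the following is a minimal sketch of a well-known scheme of this kind, minimum-redundancy maximum-relevance (mRMR) forward selection, which only ever estimates mutual information between pairs of variables. It is an illustration under stated assumptions, not the thesis's MASSIVE or MRNET implementation: the function names, the plug-in estimator on discrete data and the toy dataset are all hypothetical choices made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in (empirical) estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def mrmr_select(features, target, k):
    """Greedy forward selection scoring each candidate by
    relevance I(X;Y) minus its mean redundancy I(X;S) with selected variables.
    `features` maps a (hypothetical) variable name to a list of discrete values."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, float("-inf")
        for name, col in features.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(
                    mutual_information(col, features[s]) for s in selected
                ) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:  # ties keep the earliest candidate
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Toy data (assumed for illustration): b duplicates a, c is independent of a,
# and the target is determined jointly by a and c.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = list(a)
c = [0, 1, 0, 1, 0, 1, 0, 1]
y = [ai + 2 * ci for ai, ci in zip(a, c)]
features = {"a": a, "b": b, "c": c}

print(mrmr_select(features, y, 2))  # ['a', 'c']
```

In this toy case all three variables are equally relevant on their own, but the redundancy term penalizes the duplicate `b` once `a` is chosen, so `c` is picked second. Only one- and two-variable distributions are ever estimated, which is the property that matters when samples are scarce.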
Bibliographical Information:

Advisor: Rossi Fabrice; Verleysen Michel; Gardner Timothy; Lenaerts Tom; Cardinal Jean; Bontempi Gianluca

School: Université libre de Bruxelles

School Location: Belgium

Source Type: Master's Thesis

Keywords: microarray analysis; information theory; variable selection; network inference

Date of Publication: 12/16/2008

© 2009 All Rights Reserved.