An analytical framework for theoretical analyses in binary classifier ensembles and a study of issues in cluster validation for genomic data
Abstract (Summary)
Classification and clustering subsume a large number of pattern recognition tasks. The contribution of this work is two-fold. The first part relates to classification, more specifically to classifier ensembles (multiple classifier systems) for binary (two-class) classification problems. In the second part, we explore some of the issues in cluster validation as it relates to genomic data.

Classifier ensembles have proved to be promising and useful in various applications. The basic idea is to build a team of classifiers and combine their outputs in order to obtain a more "robust" classification, as opposed to relying on the output of a single classifier. The outputs of the classifiers can be combined in a number of ways; majority voting is a simple yet useful combination scheme. Our contribution in the area of multiple classifier systems includes the formulation of the problem of computing the upper and lower bounds of majority voting accuracy for an ensemble of binary classifiers as a linear program (LP). The resulting analytical framework can be used to perform a variety of analyses related to voting. Diversity and complementarity are considered desirable properties in an ensemble of classifiers; however, there is no widely accepted characterization of these concepts, which makes an objective evaluation difficult. Many of the measures defined in the literature are formulated in terms of correct/incorrect classifications; these are referred to as error-diversity measures. We show that the analytical framework mentioned above can be used effectively to evaluate error-diversity measures, and we explore whether there is a useful relationship between the selected diversity measures and the ensemble accuracy.

Next, we explore some of the issues in cluster validation in the context of microarray data. Clustering is often an important first step in the analysis of genomic data, and cluster validation is an important step in cluster analysis. We assess the suitability of standard cluster validation techniques for microarray data. An important goal in clustering genomic data is often to group genes based on underlying biologically relevant criteria such as function. It is thus of interest to compare a clustering result with an external clustering, for instance by comparing the grouping of genes obtained by applying a clustering algorithm to microarray data against a reference grouping, such as one based on biological functions derived from the existing biological literature. We examine some of the recent cluster validity measures proposed in the literature that may be suitable for this purpose. We propose a measure of the distance between two membership matrices and suggest when it could be a suitable choice for this purpose. We compute the measure using standard network flow algorithms. We also formulate related theoretical problems as network flow problems.
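
To make the linear-programming idea from the first part of the abstract concrete, the following is a minimal sketch of one way such a bound problem could be posed. It is not taken from the thesis: the notation, and the assumption that the only known quantities are the individual classifier accuracies, are illustrative choices, and the thesis may use different or additional constraints.

```latex
% Hypothetical notation (not from the thesis):
%   n    number of binary classifiers, assumed odd
%   s    a correctness pattern in {0,1}^n (s_i = 1 if classifier i is correct)
%   x_s  probability that pattern s occurs (the LP variables)
%   p_i  assumed known individual accuracy of classifier i
\begin{align*}
\max \text{ (or } \min\text{)} \quad
  & \sum_{s \,:\, \sum_i s_i > n/2} x_s
  && \text{(majority voting accuracy)} \\
\text{subject to} \quad
  & \sum_{s \,:\, s_i = 1} x_s = p_i, \quad i = 1, \dots, n,
  && \text{(individual accuracies)} \\
  & \sum_{s \in \{0,1\}^n} x_s = 1,
  && \text{(probabilities sum to one)} \\
  & x_s \ge 0 \quad \text{for all } s \in \{0,1\}^n.
\end{align*}
```

Under these assumptions, with n = 3 and p_1 = p_2 = p_3 = 0.6, the maximum is 0.9 (probability 0.3 on each of the three patterns with exactly two classifiers correct and 0.1 on the pattern with none correct) and the minimum is 0.4 (probability 0.4 on the all-correct pattern and 0.2 on each of the three patterns with exactly one classifier correct).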
School Location: USA - Pennsylvania
Source Type: Master's Thesis
Date of Publication: