Class discovery via feature selection in unsupervised settings
MetadataShow full item record
Identifying genes linked to the appearance of certain types of cancers and their phenotypes is a well-known and challenging problem in bioinformatics. Discovering marker genes which, upon genetic mutation, drive the proliferation of different types and subtypes of cancer is critical for the development of advanced tests and therapies that will specifically identify, target, and treat certain cancers. Therefore, it is crucial to find methods that are successful in recovering "cancer-critical genes" from the (usually much larger) set of all genes in the human genome. We approach this problem in the statistical context as a feature (or variable) selection problem for clustering, in the case where the number of important features is typically small (or rare) and the signal of each important feature is typically minimal (or weak). Genetic datasets typically consist of hundreds of samples (n) each with tens of thousands gene-level measurements (p), resulting in the well-known statistical "large p small n" problem. The class or cluster identification is based on the clinical information associated with the type or subtype of the cancer (either known or unknown) for each individual. We discuss and develop novel feature ranking methods, which complement and build upon current methods in the field. These ranking methods are used to select features which contain the most significant information for clustering. Retaining only a small set of useful features based on this ranking aids in both a reduction in data dimensionality, as well as the identification of a set of genes that are crucial in understanding cancer subtypes. In this paper, we present an outline of cutting-edge feature selection methods, and provide a detailed explanation of our own contributions to the field. We explain both the practical properties and theoretical advantages of the new tools that we have developed. Additionally, we explore a well-developed case study applying these new feature selection methods to different levels of genetic data to explore their practical implementation within the field of bioinformatics.