
dc.contributor.author | Curtis, Jessica | en_US
dc.date.accessioned | 2016-02-22T19:25:49Z
dc.date.available | 2016-02-22T19:25:49Z
dc.date.issued | 2016
dc.identifier.uri | https://hdl.handle.net/2144/14548
dc.description.abstract | Identifying genes linked to the appearance of certain types of cancers and their phenotypes is a well-known and challenging problem in bioinformatics. Discovering marker genes which, upon genetic mutation, drive the proliferation of different types and subtypes of cancer is critical for the development of advanced tests and therapies that will specifically identify, target, and treat certain cancers. Therefore, it is crucial to find methods that are successful in recovering "cancer-critical genes" from the (usually much larger) set of all genes in the human genome. We approach this problem in the statistical context as a feature (or variable) selection problem for clustering, in the case where the number of important features is typically small (or rare) and the signal of each important feature is typically minimal (or weak). Genetic datasets typically consist of hundreds of samples (n), each with tens of thousands of gene-level measurements (p), resulting in the well-known statistical "large p small n" problem. The class or cluster identification is based on the clinical information associated with the type or subtype of the cancer (either known or unknown) for each individual. We discuss and develop novel feature ranking methods, which complement and build upon current methods in the field. These ranking methods are used to select the features that contain the most significant information for clustering. Retaining only a small set of useful features based on this ranking both reduces data dimensionality and identifies a set of genes that are crucial in understanding cancer subtypes. In this paper, we present an outline of cutting-edge feature selection methods, and provide a detailed explanation of our own contributions to the field. We explain both the practical properties and theoretical advantages of the new tools that we have developed. Additionally, we present a well-developed case study that applies these new feature selection methods to different levels of genetic data to explore their practical implementation within the field of bioinformatics. | en_US
dc.language.iso | en_US
dc.subject | Statistics | en_US
dc.subject | Class discovery | en_US
dc.subject | Clustering | en_US
dc.subject | Feature selection | en_US
dc.subject | Gene expression | en_US
dc.subject | Higher criticism | en_US
dc.subject | Unsupervised | en_US
dc.title | Class discovery via feature selection in unsupervised settings | en_US
dc.type | Thesis/Dissertation | en_US
dc.date.updated | 2016-02-13T02:22:16Z
etd.degree.name | Doctor of Philosophy | en_US
etd.degree.level | doctoral | en_US
etd.degree.discipline | Mathematics & Statistics | en_US
etd.degree.grantor | Boston University | en_US

