Discovering biological connections between experimental conditions based on common patterns of differential gene expression
Gower, Adam C.
Similarities between patterns of differential gene expression can be used to establish connections between the experimental and biological conditions that give rise to them. The growing volume of gene expression data in repositories such as NCBI's Gene Expression Omnibus (GEO) presents an opportunity to identify such similarities on a large scale across a diverse collection of datasets. In this work, I have developed a pattern-based approach, named openSESAME, to identify datasets enriched in samples displaying coordinate differential expression of a query signature. Importantly, openSESAME performs this search without knowledge of the experimental groups in the datasets being searched, which allows it to identify perturbations of gene expression due to attributes that may not have been recorded.
First, I demonstrated the utility of openSESAME using two gene expression signatures to query a set of more than 75,000 human expression profiles obtained from GEO. A query using a signature of estradiol treatment identified experiments in which estrogen signaling was perturbed and also discriminated between estrogen receptor-positive and -negative breast cancers. A second query using a signature of silencing of the transcription factor p63 (a key regulator of epidermal differentiation) identified datasets related to stratified squamous epithelia or epidermal diseases such as melanoma.
Next, to improve the utility of openSESAME, I expanded the collection of profiles to include samples from mouse and rat, and automatically translated expression signatures for cross-species queries. Furthermore, I processed the annotation associated with these samples in GEO, extracting informative words and phrases as well as continuous (e.g., age) and categorical (e.g., disease state) variables. I also recorded sample-specific dates and quality metrics to assess whether batch effects or outliers are affecting individual query results.
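
The idea of scoring samples for a query signature without using experimental group labels can be illustrated with a minimal sketch. This is not the openSESAME implementation, whose scoring method is not given in this abstract. It assumes a simple stand-in: each gene is standardized across all samples of a dataset, and each sample's score is the mean z-score of the signature's up-regulated genes minus that of its down-regulated genes. The function names, gene names, and expression values are all hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one gene's expression values across the samples of a dataset."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A gene that does not vary carries no information about any sample.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def score_samples(expression, up_genes, down_genes):
    """Score every sample for coordinate expression of a signature.

    expression maps gene name -> list of values, one per sample.
    No experimental group labels are used (assumed stand-in scoring rule).
    """
    z = {gene: zscores(vals) for gene, vals in expression.items()}
    n_samples = len(next(iter(expression.values())))
    up = [g for g in up_genes if g in z]
    down = [g for g in down_genes if g in z]
    scores = []
    for i in range(n_samples):
        up_mean = mean([z[g][i] for g in up]) if up else 0.0
        down_mean = mean([z[g][i] for g in down]) if down else 0.0
        scores.append(up_mean - down_mean)
    return scores


# Hypothetical toy dataset: four samples, four genes.
expression = {
    "GENE_A": [1.0, 1.2, 5.0, 5.1],
    "GENE_B": [2.0, 2.1, 6.0, 6.2],
    "GENE_C": [7.0, 6.8, 2.0, 2.2],
    "GENE_D": [3.0, 3.1, 2.9, 3.0],
}
scores = score_samples(
    expression, up_genes=["GENE_A", "GENE_B"], down_genes=["GENE_C"]
)
```

In this toy dataset the last two samples receive positive scores and the first two negative scores, so the samples separate into two groups even though no group labels were supplied.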
Finally, I used openSESAME to query this repository with over 800 gene expression signatures from the Broad Institute's Molecular Signatures Database (MSigDB). I then used the scores of the association of each signature with each sample in the repository to build a network of the relatedness of these signatures to each other. This "constellation" of signatures can be used to determine the relationship of a query signature to other biological and experimental perturbations.
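
As a rough illustration of how per-sample scores could be turned into a network of related signatures, the sketch below connects two signatures when their score vectors across the same samples are strongly correlated. The abstract does not state which relatedness measure or cutoff was used, so the Pearson correlation and the 0.8 threshold are assumptions, and the signature names and scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # A constant score vector has no defined correlation; treat as unrelated.
        return 0.0
    return cov / sqrt(vx * vy)


def build_network(signature_scores, threshold=0.8):
    """Return edges {(sig_a, sig_b): r} for pairs with |r| >= threshold.

    signature_scores maps signature name -> list of scores, one per sample,
    with samples in the same order for every signature.
    The threshold is an assumed cutoff, not a value from the thesis.
    """
    edges = {}
    for a, b in combinations(sorted(signature_scores), 2):
        r = pearson(signature_scores[a], signature_scores[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Hypothetical scores of three signatures across the same five samples.
signature_scores = {
    "SIG_ESTRADIOL": [2.0, 1.8, -1.9, -2.1, 0.1],
    "SIG_ER_POSITIVE": [1.7, 2.1, -2.0, -1.6, -0.2],
    "SIG_UNRELATED": [0.3, -0.4, 0.5, -0.2, -0.1],
}
edges = build_network(signature_scores)
```

Here only the two hypothetical estrogen-related signatures end up connected. A new query signature could be placed in such a network by scoring the same samples and correlating its scores with those of the existing signatures.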
Thesis (Ph.D.)--Boston University
PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at email@example.com. Thank you.