Show simple item record

dc.contributor.advisorJohnson, W. Evanen_US
dc.contributor.authorZhang, Yuqingen_US
dc.date.accessioned2020-07-16T18:29:13Z
dc.date.available2020-07-16T18:29:13Z
dc.date.issued2020
dc.identifier.urihttps://hdl.handle.net/2144/41301
dc.description.abstractHeterogeneity describes any variability across different datasets. In genomic studies which profile gene expression levels, the presence of heterogeneity is ubiquitous, and may bring challenges to the integrative analysis of multiple datasets. Thus, many efforts are needed to understand and address the impact of heterogeneity. In this dissertation, I have developed novel statistical models and computational software for this purpose. I derived reference-batch ComBat and ComBat-Seq, two improved models based on the state-of-the-art method, ComBat, for addressing one particular type of heterogeneity known as the “batch effects”. I showed their benefits compared to the existing methods in several data types and situations, and implemented these models in publicly available software. Then, I created systematic simulations to explore the impact of common study heterogeneity on the independent validation of genomic prediction models, showing that the most identifiable sources of heterogeneity are not the primary ones affecting the validation of genomic predictors. Finally, I adapted a solution using cross-study ensemble learning to train predictors with generalizable independent performance, to address the unwanted impact of batch effects on prediction. I compared this new framework with the traditional approach for batch correction, showing that cross-study learning may provide a more robust-performing model in independent validation. Results in this dissertation provide insights and guidelines for working with heterogeneous gene expression profiling datasets in practice, and encourage further investigation on understanding and addressing heterogeneity in genomic studiesen_US
dc.language.isoen_US
dc.rightsAttribution 4.0 Internationalen_US
dc.rights.urihttp://creativecommons.org/licenses/by/4.0/
dc.subjectBioinformaticsen_US
dc.titleStatistical and computational methods for addressing heterogeneity in genomic dataen_US
dc.typeThesis/Dissertationen_US
dc.date.updated2020-07-16T01:03:42Z
etd.degree.nameDoctor of Philosophyen_US
etd.degree.leveldoctoralen_US
etd.degree.disciplineBioinformatics GRSen_US
etd.degree.grantorBoston Universityen_US


This item appears in the following Collection(s)

Show simple item record

Attribution 4.0 International
Except where otherwise noted, this item's license is described as Attribution 4.0 International