dc.contributor.author: Choi, Seung Hoan [en_US]
dc.date.accessioned: 2016-08-17T16:36:01Z
dc.date.available: 2016-08-17T16:36:01Z
dc.date.issued: 2016
dc.identifier.uri: https://hdl.handle.net/2144/17727
dc.description.abstract: Recent next-generation sequencing methods provide a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Because of this feature of RNA sequencing (RNA-seq) data, appropriate statistical inference methods are required. Although Negative Binomial (NB) regression has been generally accepted for the analysis of RNA-seq data, its appropriateness for genetic studies has not been exhaustively evaluated. Additionally, adjusting for covariates that have an unknown relationship with the expression of a gene has not been extensively evaluated in RNA-seq studies using the NB framework. Finally, the dependence structures in RNA-seq data may violate the assumptions of some multiple testing correction methods. In this dissertation, we suggest an alternative regression method, evaluate the effect of covariates, and compare various multiple testing correction methods. We conduct simulation studies and apply these methods to a real data set. First, we suggest Firth's logistic regression for detecting differentially expressed genes in RNA-seq data. We also recommend a data-adaptive method that estimates a recalibrated distribution of test statistics. With the data-adaptive method, Firth's logistic regression exhibits an appropriately controlled Type-I error rate and shows power comparable to NB regression in simulation studies. Next, we evaluate the effect of disease-associated covariates where the relationship between the covariate and gene expression is unknown. Although the power of both NB and Firth's logistic regression decreases as disease-associated covariates are added to a model, Type-I error rates are well controlled in Firth's logistic regression if the relationship between a covariate and disease is not strong. Finally, we compare multiple testing correction methods that control family-wise error rates or false discovery rates. The evaluation reveals that an understanding of study designs, RNA-seq data, and the consequences of applying specific regression and multiple testing correction methods is essential for controlling family-wise error rates or false discovery rates. We believe our statistical investigations will enrich gene expression studies and influence related statistical methods. [en_US]
dc.language.iso: en_US
dc.rights: Attribution 4.0 International [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Biostatistics [en_US]
dc.subject: Covariates [en_US]
dc.subject: Data adaptive [en_US]
dc.subject: Differential expression [en_US]
dc.subject: Firth's logistic regression [en_US]
dc.subject: Multiple testing correction [en_US]
dc.subject: RNA-Seq [en_US]
dc.title: Evaluation of statistical methods, modeling, and multiple testing in RNA-seq studies [en_US]
dc.type: Thesis/Dissertation [en_US]
dc.date.updated: 2016-08-12T01:28:54Z
etd.degree.name: Doctor of Philosophy [en_US]
etd.degree.level: doctoral [en_US]
etd.degree.discipline: Biostatistics [en_US]
etd.degree.grantor: Boston University [en_US]


Except where otherwise noted, this item's license is described as Attribution 4.0 International