Methods for correlated observations with applications to genetic association studies
Abstract
Correlation is commonly present in genetic association studies and may yield incorrect inference when ignored. Hence, developing methods for properly analyzing correlated data is crucial. However, there is a lack of analytical tools to answer certain questions because existing methods are not applicable when some model assumptions are violated. In this thesis, we propose three methods for correlated phenotypes, particularly correlation arising from family data.
We first develop an iterated weighted linear mixed effects (IWLME) method to account for heteroscedasticity. We compare the performance of IWLME with that of five other methods in simulation studies. When methods that ignore heteroscedasticity are applied, the presence of heteroscedasticity results in lower power, but not in excessive type I error. When heteroscedasticity is present, meta-analysis, linear mixed effects (LME) models in GENetic EStimation and Inference in Structured samples (GENESIS), weighted LME, and IWLME provide more precise estimates of the effect size, with smaller bias and mean squared error, than LME and generalized estimating equations (GEE). In an epigenome-wide association study, more CpGs reach the significance threshold with IWLME than with LME.
We then explore R2 statistics in LME, defining R2 as the proportion of the variance in the response that is predictable from the fixed effect variables. We review six existing R2 estimators and extend them to estimate partial R2. We propose three R2/partial R2 estimators based on our R2 definition and variance decomposition. We compare the performance of the methods in simulation studies. Our proposed R2 estimators have the smallest mean squared error, low bias, and no or only a small percentage of negative estimates when the true R2/partial R2 is modest or higher (>2%).
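As an illustration of the R2 definition above, one common way to write such a variance decomposition for an LME is sketched below. The notation (x_i, z_i, beta, b, and the sigma terms) is assumed here for illustration and is not taken from the thesis; the decomposition also assumes that the fixed effect part, the random effect part, and the residual are mutually uncorrelated.

```latex
% Assumed notation (illustrative only): fixed effects x_i, random effects z_i
\[
y_i = x_i^{\top}\beta + z_i^{\top} b + \varepsilon_i, \qquad
R^2 = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_b^2 + \sigma_e^2},
\]
\[
\sigma_f^2 = \operatorname{Var}\!\left(x_i^{\top}\beta\right), \quad
\sigma_b^2 = \operatorname{Var}\!\left(z_i^{\top} b\right), \quad
\sigma_e^2 = \operatorname{Var}(\varepsilon_i).
\]
```

Under this sketch, R2 is the share of the total response variance attributable to the fixed effect component alone.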
Finally, we propose a Firth bias-corrected generalized estimating equations (FBC-GEE) approach to address separation in correlated binary data, a common occurrence in association analyses of rare genetic variants. We compare GEE, FBC-GEE, Firth logistic regression, and Scalable and Accurate Implementation of GEneralized mixed model (SAIGE) in simulation studies. FBC-GEE helps reduce type I error inflation compared with GEE.
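To illustrate why separation matters, the following minimal sketch implements ordinary Firth logistic regression for independent observations, one of the comparison methods named above. It is not the FBC-GEE method itself. The function name `firth_logistic`, the toy data, and the step cap are assumptions made for illustration. In the toy data all three carriers of a rare variant are cases, so the ordinary maximum likelihood estimate of the variant effect is infinite, while the Firth-penalized estimate remains finite.

```python
import numpy as np


def firth_logistic(X, y, max_iter=100, tol=1e-8):
    """Firth-penalized logistic regression via Fisher scoring (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        info = X.T @ (X * w[:, None])          # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        # Diagonal of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth's modified score: X'(y - p + h * (1/2 - p))
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        # Cap very large steps for numerical stability (an ad hoc choice)
        largest = np.max(np.abs(step))
        if largest > 5.0:
            step *= 5.0 / largest
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy data (hypothetical): 3 carriers, all cases; 17 non-carriers, 6 of them cases.
genotype = np.array([1] * 3 + [0] * 17)
y = np.array([1] * 3 + [1] * 6 + [0] * 11)
X = np.column_stack([np.ones(len(y)), genotype])

beta = firth_logistic(X, y)
print("intercept, log odds ratio:", beta)
```

For a single binary predictor, the Firth estimate coincides with adding 0.5 to each cell of the 2x2 table, which gives a quick check on the result.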
With these projects, we develop new methodologies and improve the understanding of the performance of available methods for genetics studies with family data.