Methods for drawing causal inference from electronic health record data
Embargo Date
2026-09-10
Abstract
Data collected in routine clinical practice, such as electronic health records (EHRs), enable clinical researchers to answer questions that cannot be addressed by randomized controlled trials (RCTs) and complement safety and efficacy data from RCTs in the post-approval phase. The use of EHRs in this way has grown tremendously over the past two decades. Although EHRs contain valuable information, there are several complexities inherent to these data because they are not collected specifically for research purposes.
Robins’ generalized methods (g-methods), such as the parametric g-formula and inverse probability weighting (IPW), can estimate causal effects from EHR data while addressing time-varying confounding and treatment-confounder feedback. However, the additional impact on causal inference of informed presence, which occurs when the timing and frequency of clinic visits depend on an individual’s prognosis, warrants further investigation. Therefore, the overarching goal of this dissertation is to address methodological issues that arise when applying g-methods to draw causal inference from EHR data.
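For orientation, the two g-methods named above can be written out for the simplest time-varying setting. The notation here is an assumption made for illustration and is not taken from the dissertation: two treatment decisions A_0 and A_1, confounders L_0 and L_1 measured before each decision, and an end-of-follow-up outcome Y, with the usual identifiability conditions (exchangeability, positivity, consistency) taken to hold.

```latex
% g-formula: standardize the outcome mean over the confounder history
% under the fixed treatment strategy (a_0, a_1)
\mathbb{E}\left[Y^{a_0,a_1}\right]
  = \sum_{l_0}\sum_{l_1}
    \mathbb{E}\left[Y \mid A_0=a_0,\, L_0=l_0,\, A_1=a_1,\, L_1=l_1\right]
    \Pr\left(L_1=l_1 \mid A_0=a_0,\, L_0=l_0\right)
    \Pr\left(L_0=l_0\right)

% IPW: weight each individual by the inverse probability of the
% treatment history actually received, given past covariates
W = \frac{1}{\Pr\left(A_0 \mid L_0\right)\,
             \Pr\left(A_1 \mid A_0,\, L_0,\, L_1\right)}
```

Because L_1 enters the g-formula only through its distribution given prior treatment, and enters the weights only through the treatment model, neither expression conditions on a covariate that is itself affected by earlier treatment. This is what lets both approaches handle treatment-confounder feedback.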
We first utilized real EHR data to answer a novel clinical question: What is the causal effect of hepatitis C virus (HCV) treatment on renal function? We used the target trial approach to design the study and the parametric g-formula to estimate the causal effect. We combined multiple analytical approaches to address time-varying confounding, treatment-confounder feedback and informed presence.
To expand on these challenges, we next investigated how discretization of a continuous time scale, which is required to implement g-methods, affects causal effect estimation with the parametric g-formula. We designed a rigorous simulation study to quantify properties of discretized estimators. In addition, we proposed two data-adaptive methods to reduce bias due to discretization, and we applied these methods to a real EHR database.
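As a rough illustration of what discretization involves, the sketch below maps irregularly timed clinic measurements onto a fixed grid of intervals. The function name `discretize_visits`, the last-value-carried-forward rule, and the example data are all hypothetical choices made for this illustration. They are not the dissertation's procedure or its proposed data-adaptive methods.

```python
import math

def discretize_visits(visits, width, horizon):
    """Map (time, value) measurements onto intervals of a fixed width.

    Interval k covers [k * width, (k + 1) * width). Each interval keeps the
    last value measured within it; intervals with no visit carry forward the
    most recent earlier value (None if nothing has been measured yet).
    """
    n_intervals = math.ceil(horizon / width)
    grid = [None] * n_intervals
    for time, value in sorted(visits):
        k = int(time // width)
        if 0 <= k < n_intervals:
            grid[k] = value  # a later visit in the same interval overwrites
    last = None
    for k in range(n_intervals):
        if grid[k] is None:
            grid[k] = last  # carry the previous measurement forward
        else:
            last = grid[k]
    return grid

# Hypothetical lab values measured on days 3, 10, and 45 of a 90-day follow-up
visits = [(3, 90.0), (10, 85.0), (45, 70.0)]
coarse = discretize_visits(visits, width=30, horizon=90)  # 3 intervals
fine = discretize_visits(visits, width=7, horizon=90)     # 13 intervals
print(coarse)
print(fine)
```

With 30-day intervals the day-3 measurement is overwritten by the day-10 measurement and never enters the analysis, whereas 7-day intervals retain both. This kind of information loss is one way the choice of grid can influence a discretized estimator.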
Lastly, we focused on confidence interval (CI) estimation for causal parameters in the context of EHR data. Typically, CIs around causal parameters are generated with the percentile bootstrap. This approach can be computationally expensive, and its accuracy can vary. We compared candidate bootstrap CI methods in the context of an IPW causal estimator and proposed practical, evidence-based guidance for selecting an appropriate bootstrap CI method.
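A minimal sketch of the percentile bootstrap around an IPW estimate may make the computational cost concrete. Everything in it is assumed for illustration: a single time point, one binary confounder, simulated data, and the hypothetical function names `ipw_effect` and `percentile_bootstrap_ci`. It does not reproduce the estimators or the CI methods compared in the dissertation.

```python
import numpy as np

def ipw_effect(L, A, Y):
    """IPW estimate of E[Y^1] - E[Y^0] with one binary confounder L.

    The propensity score P(A = 1 | L) is estimated by the treated
    proportion within each level of L (a saturated model).
    """
    ps = np.empty(len(L), dtype=float)
    for level in (0, 1):
        mask = L == level
        ps[mask] = A[mask].mean()
    w = np.where(A == 1, 1.0 / ps, 1.0 / (1.0 - ps))
    # Normalized (Hajek) weighted outcome means in each treatment arm
    mu1 = np.sum(w * A * Y) / np.sum(w * A)
    mu0 = np.sum(w * (1 - A) * Y) / np.sum(w * (1 - A))
    return mu1 - mu0

def percentile_bootstrap_ci(L, A, Y, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample individuals with replacement,
    re-estimate the weights and the effect in each resample, and take
    empirical quantiles of the resampled estimates."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        estimates[b] = ipw_effect(L[idx], A[idx], Y[idx])
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Simulated data with a true treatment effect of 1.0 (an assumption)
rng = np.random.default_rng(42)
n = 2000
L = rng.binomial(1, 0.5, size=n)
A = rng.binomial(1, np.where(L == 1, 0.7, 0.3))
Y = 1.0 * A + 2.0 * L + rng.normal(size=n)

point = ipw_effect(L, A, Y)
ci_low, ci_high = percentile_bootstrap_ci(L, A, Y)
print(point, ci_low, ci_high)
```

Each bootstrap replicate refits the treatment model and recomputes the weights, so the cost scales with the number of replicates times the cost of one full analysis. With longitudinal EHR data and time-varying weights, that single analysis is far more expensive than in this toy setting.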
Considering the expanded application of causal inference methods to EHR data sources, this dissertation provides an essential perspective and functional solutions to problems that arise from the complexities innate to this source of data.