Measures of discrimination, reclassification, and calibration for risk prediction models: an exploration in their interrelationships and practical utility and improvement in their estimation
Public health practice and quality of medical care rely heavily on the accuracy, precision, and robustness of risk prediction models. Health care providers use risk prediction models to assess a patient’s risk of developing an event during a specified time frame given the patient’s specific characteristics, and subsequently recommend a course of treatment or preventative action. In public health research, risk prediction models are often constructed with common statistical modeling techniques, such as logistic regression for binary outcomes or Cox proportional hazards regression for time-to-event outcomes, and the performance of the model is assessed through internal validation, external validation, or some combination of the two. Model validation requires statistical and clinical significance as well as satisfactory baseline performance, or improvement, in model calibration and discrimination: calibration quantifies how close predictions are to observed outcomes, while discrimination quantifies the model’s ability to distinguish correctly between events and nonevents. Measures for evaluating these qualities include (but are not limited to) the Brier score, calibration-in-the-large, proportion of explained variation (R²), sensitivity and specificity, area under the receiver operating characteristic curve (AUC), discrimination slope, net reclassification index (NRI), integrated discrimination improvement (IDI), and decision-analytic measures such as net benefit and relative utility. Several interrelationships exist among these measures under certain assumptions, and their estimation and interpretation are an active area of research. The first two parts of this thesis focus on studying the empirical distributions and improving confidence interval (CI) estimation of ∆AUC, NRI, and IDI for both binary event data and time-to-event data.
Through data simulation and the comparison of several CI types derived with bootstrapping techniques, we make recommendations for proper estimation in future work and apply our recommendations to real-life Framingham Heart Study data. The third part of this thesis summarizes the many interrelationships and possible redundancies among the measures listed, extends theoretical formulas assuming normal variables for ∆AUC, NRI, and IDI from nested models to non-nested models and to the Brier score, and explores, through simulation, the impact of varying discrimination and calibration assumptions on Yates’ and Sanders’ decomposed versions of the Brier score. Overall conclusions and future directions are presented at the end.
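As a rough illustration of the measures named above, the following Python sketch computes the Brier score, AUC, discrimination slope, IDI, and category-free NRI for binary outcomes, and attaches a percentile bootstrap CI to ∆AUC. It is not the thesis's own procedure: the function names, the simulated risk scores, and the choice of a percentile interval are assumptions made for illustration, and the bootstrap resamples fixed predictions rather than refitting each model.

```python
import numpy as np

def auc(y, p):
    """AUC as the Mann-Whitney statistic; tied pairs count one half."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def brier(y, p):
    """Mean squared difference between predicted risk and outcome."""
    return np.mean((p - y) ** 2)

def discrimination_slope(y, p):
    """Mean predicted risk in events minus mean predicted risk in nonevents."""
    return p[y == 1].mean() - p[y == 0].mean()

def idi(y, p_old, p_new):
    """IDI: change in discrimination slope from the old to the new model."""
    return discrimination_slope(y, p_new) - discrimination_slope(y, p_old)

def continuous_nri(y, p_old, p_new):
    """Category-free NRI: net upward movement in events plus
    net downward movement in nonevents."""
    d = np.sign(p_new - p_old)
    return d[y == 1].mean() - d[y == 0].mean()

def percentile_ci(stat, y, p_old, p_new, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI (an assumed CI type, one of several possible).
    Simplification: predictions are resampled, models are not refit."""
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        yb = y[idx]
        if yb.min() == yb.max():  # skip resamples lacking events or nonevents
            continue
        reps.append(stat(yb, p_old[idx], p_new[idx]))
    return np.quantile(reps, [alpha / 2, 1 - alpha / 2])

# Simulated example (hypothetical data, not Framingham Heart Study data).
rng = np.random.default_rng(1)
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)
p_new = 1 / (1 + np.exp(-(-1 + x1 + x2)))  # risk score using both predictors
p_old = 1 / (1 + np.exp(-(-1 + x1)))       # risk score ignoring x2
y = rng.binomial(1, p_new)

delta_auc = auc(y, p_new) - auc(y, p_old)
ci = percentile_ci(lambda yy, po, pn: auc(yy, pn) - auc(yy, po), y, p_old, p_new)
print("Brier (old, new):", brier(y, p_old), brier(y, p_new))
print("Delta AUC:", delta_auc, "95% percentile CI:", ci)
print("IDI:", idi(y, p_old, p_new), "NRI(>0):", continuous_nri(y, p_old, p_new))
```

Any of the other statistics can be passed to `percentile_ci` in place of the ∆AUC lambda, for example `percentile_ci(idi, y, p_old, p_new)`.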