Impact of new variables on discrimination of risk prediction models

Date
2012
Authors
Demler, Olga V.
Embargo Date
Indefinite
Abstract
Risk prediction models for binary outcomes (such as the Framingham Risk Score for cardiovascular disease or the Gail Model for 5-year risk of breast cancer) have become standard tools for health practitioners and policy makers. Rapid scientific progress in genetics and biochemistry has led to numerous new variables being proposed as candidates to improve existing models. The quality of a risk prediction model is usually measured by the area under the receiver operating characteristic curve (AUC), and the increase in AUC is used to evaluate how much a newly added variable contributes to model performance. However, the following paradox has often been reported in the literature: the new predictor is statistically significant in the multivariable model but does not lead to a statistically significant change in the AUC. In the first part of this thesis we prove that this paradox does not hold when the data are normally distributed. We demonstrate that in this setting statistical significance of the new predictor(s) is always equivalent to statistical significance of the increase in the AUC. In the second part, we show rigorously that the DeLong test, which is typically used to compare two AUCs, is invalid for nested models for any distribution of the data and for a general class of risk prediction models, including logistic regression. This invalidity is the likely explanation for the paradox and results in the DeLong test being overly conservative. In the third part of the thesis we focus on understanding which statistical properties of a new predictor are beneficial for model performance. Using multivariate normal data we prove that, contrary to common wisdom, new variables uncorrelated with the old risk score are not always the strongest contributors to discrimination, while negatively correlated ones are always beneficial. We also show that a new predictor with a very high multiple R-squared when linearly regressed on the old predictors can also be beneficial for a risk prediction model. All results are illustrated using real-life Framingham data, and conclusions and future directions are presented at the end.
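The correlation results for multivariate normal data can be explored numerically. The following is a minimal sketch, not the thesis's own code. It assumes two predictors that are bivariate normal in both events and non-events, with unit variances and a common correlation `rho`, and it uses the standard closed form for the AUC of the linear discriminant score under that model, AUC = Phi(sqrt(M^2 / 2)), where M^2 is the squared Mahalanobis distance between the group means. The function names and the standardized mean differences 0.8 and 0.3 are hypothetical, chosen only for illustration.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def auc_old_model(d_old):
    """AUC of a single normal predictor with standardized mean difference d_old."""
    return normal_cdf(abs(d_old) / math.sqrt(2.0))


def auc_new_model(d_old, d_new, rho):
    """AUC of the linear discriminant built from two correlated normal predictors.

    Assumes unit variances and the same correlation rho in events and non-events
    (an illustrative assumption, not a statement about the thesis's data).
    """
    # Squared Mahalanobis distance between the two group means.
    m2 = (d_old ** 2 - 2.0 * rho * d_old * d_new + d_new ** 2) / (1.0 - rho ** 2)
    return normal_cdf(math.sqrt(m2 / 2.0))


if __name__ == "__main__":
    d_old, d_new = 0.8, 0.3  # hypothetical effect sizes
    print(f"old model AUC: {auc_old_model(d_old):.4f}")
    for rho in (-0.5, 0.0, 0.5, 0.9):
        auc = auc_new_model(d_old, d_new, rho)
        print(f"rho = {rho:+.1f}: new model AUC = {auc:.4f}")
```

Under these assumptions and with these illustrative values, the AUC at `rho = -0.5` and at `rho = 0.9` both exceed the AUC at `rho = 0`, which is consistent with the claims that an uncorrelated predictor is not always the strongest contributor and that a predictor highly correlated with the existing ones can still help.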
Description
Thesis (Ph.D.)--Boston University
PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at open-help@bu.edu. Thank you.