Using clinical models at the bedside: Overcoming the conundrum of missing data

Fletcher, Sarah

INTRODUCTION: Clinical prediction models require complete data for estimating individual risk, yet complete data is often not available in clinical settings. We investigated four imputation methods applied to a lung cancer risk prediction model intended for use at the bedside.

METHODS: The TREAT risk prediction model was developed using a study population of 492 individuals with known or suspected lung cancer being evaluated as candidates for lung surgery at Vanderbilt Medical Center. Using this cohort, we artificially induced missing diagnostic information in 264 patients with complete data. Simulations investigated model behavior when a low impact predictor (OR near 1), Forced Expiratory Volume in one second (FEV₁), was missing; when a high impact predictor (OR>3), ¹⁸F-fluoro-deoxyglucose positron emission tomography (FDG-PET), was missing; and when both predictors were missing simultaneously. The four imputation methods were 1) median imputation (MedI), 2) subgroup mean imputation, 3) multiple imputation (MI), and 4) condensed predictor imputation (CPI). Model behavior was measured by the mean risk difference between estimated risk and the actual outcome (MRD) and the mean square error of the risk difference (MSE).
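The simplest of the four methods, median imputation, and the two performance measures can be sketched as follows. This is a minimal illustration and not the TREAT implementation: the function names are hypothetical, the toy values are invented for demonstration, and the sign convention (estimated risk minus observed outcome) is an assumption, since the abstract does not state the direction of the difference.

```python
import numpy as np

def median_impute(values):
    """Replace missing (NaN) entries with the median of the observed entries."""
    values = np.asarray(values, dtype=float)
    filled = values.copy()
    filled[np.isnan(filled)] = np.nanmedian(values)
    return filled

def mean_risk_difference(risk, outcome):
    """MRD: mean of (estimated risk - observed outcome); sign convention assumed."""
    return float(np.mean(np.asarray(risk, dtype=float) - np.asarray(outcome, dtype=float)))

def mean_squared_error(risk, outcome):
    """MSE: mean of the squared risk differences."""
    diff = np.asarray(risk, dtype=float) - np.asarray(outcome, dtype=float)
    return float(np.mean(diff ** 2))

# Toy data, invented for illustration only (not study data).
fev1 = [70.0, np.nan, 90.0, 80.0]      # one artificially missing value
fev1_filled = median_impute(fev1)      # NaN is replaced by the observed median
risk = [0.5, 0.5]                      # hypothetical estimated risks
outcome = [1, 0]                       # hypothetical observed outcomes (1 = cancer)
mrd = mean_risk_difference(risk, outcome)
mse = mean_squared_error(risk, outcome)
```

In a simulation of the kind described, the imputed predictor would be passed to the fitted model to obtain the estimated risks before MRD and MSE are computed.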

RESULTS: For FEV₁, the four techniques yielded similar results (MRD: -0.005, -0.002, -0.003, and -0.004 for MedI, subgroup mean, MI, and CPI, respectively). For the high impact variable FDG-PET, MRDs favored the computationally intensive MI (MedI: -0.048, MI: -0.005). CPI (MRD: -0.015) may be an alternative to MI because it is less computationally intensive. MI was the best option when both FEV₁ and FDG-PET were missing (MRD=-0.008; MSE=0.125²), but its performance was similar to that of CPI (MRD=-0.021; MSE=0.127²).

CONCLUSIONS: Prediction models are important tools used frequently to inform clinical decisions. However, prediction models cannot be applied when a patient’s clinical data is missing. Using the population median for missing data is computationally and statistically efficient for low impact predictors, but high impact predictors require more sophisticated imputation methods for use in clinical practice.