Beware the Little Foxes that Spoil the Vines: Small Inconsistencies in Clinical Data Can Distort Machine Learning Findings
Author(s): Abdolvahab Khademi, Mark S. Tuttle, Qing Zeng-Treitler, Stuart J. Nelson
It is well known that Electronic Health Record (EHR) data contain inconsistencies and inaccuracies, yet their effect on predictive model performance and on the identification of risk/benefit factors is often neglected. This study investigates how varying levels of random and non-random binary differences, often referred to as "noise," affect common modeling approaches, including logistic regression, support vector machines, and gradient boosting. Using curated data from the All of Us database, we simulated different noise levels to mimic real-world variability. Across all models and noise types, increased noise consistently reduced classification accuracy. More importantly, noise diminished the variance of variable impact scores while leaving their means unchanged, suggesting a muted ability to identify key predictors. These findings imply that even modest noise levels can obscure meaningful signals, and measures such as accuracy and hazard ratios may therefore be misleading in noisy data contexts. The consistency of effects across models and noise mechanisms suggests the issue stems from inherent data variability rather than model brittleness, with broad implications for EHR data analyses.
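The noise-injection setup described above can be illustrated with a minimal sketch. This is not the authors' code: synthetic binary predictors stand in for the curated All of Us data, only one of the three model families (logistic regression) is shown, and the flip rates are arbitrary example values. The sketch randomly flips a fraction of predictor bits and measures how held-out classification accuracy changes.

```python
# Illustrative sketch only: random bit-flip "noise" applied to binary
# predictors, with logistic regression accuracy measured at each noise level.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n, p = 4000, 10
X = rng.integers(0, 2, size=(n, p))           # binary predictors (synthetic)
logit = X @ np.linspace(2.0, 0.2, p) - 5.5    # a few strong true predictors
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

def accuracy_with_noise(flip_rate):
    """Flip each predictor bit with probability flip_rate, then fit/score."""
    Xn = X.copy()
    mask = rng.random(X.shape) < flip_rate
    Xn[mask] = 1 - Xn[mask]
    Xtr, Xte, ytr, yte = train_test_split(
        Xn, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return accuracy_score(yte, model.predict(Xte))

# Accuracy at increasing noise levels (0%, 10%, 30% of bits flipped).
accs = {rate: accuracy_with_noise(rate) for rate in (0.0, 0.1, 0.3)}
```

Under this setup the clean-data accuracy exceeds the noisy-data accuracy, mirroring the paper's finding that increased noise consistently degrades classification performance; the same loop could wrap any other estimator (e.g. gradient boosting) and record per-variable importance scores instead of accuracy.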