Machine Learning for Missing Data Imputation in Healthcare Research: A Systematic Review of Methods and Applications
Author(s)
Tiphaine Porte, MSc, Morgane Swital, PhD, Nathanael Sedmak, MSc, Clara Bouvard, MSc, Arthur Gougeon, MSc, Flavien Roux, MSc, Frédéric Mistretta, MSc, Audrey Lajoinie, PharmD, PhD.
RCTs, Lyon, France.
OBJECTIVES: Missing data is a critical issue and a potential source of bias in clinical research, particularly in real-world data (RWD) studies where loss to follow-up and incomplete data are common. Imputing missing data is a significant challenge as it directly affects the validity and reliability of clinical analyses. This literature review aimed to provide an overview of machine learning (ML) imputation methods applied in healthcare and report their performance.
METHODS: A literature review was conducted on MEDLINE to identify studies published since 2020 on ML-based imputation methods in studies conducted on RWD. Titles and abstracts [Ti/Abs] were screened, followed by full-text review for inclusion.
RESULTS: Of 166 articles initially retrieved, 7 were included. The main therapeutic areas were oncology (n = 2) and cardiovascular disease (n = 2). Data sources included clinical registries (n = 3), healthcare administrative databases (n = 1) and connected medical devices (n = 2). Imputation was primarily used to maximize data use by avoiding case deletion in analyses. The applied methods included multiple imputation by chained equations (MICE; n = 2), random forest-based methods (n = 3), k-nearest neighbor (kNN) imputation (n = 2) and advanced techniques such as Bayesian networks or deep learning (n = 2). In the studies that compared ML-based imputation methods with simple techniques (e.g., mean imputation) or with no imputation (n = 3), ML-based imputation improved predictive accuracy. For example, one study found that simple imputation yielded a root mean square error (RMSE) of 2.9266, whereas kNN imputation yielded an RMSE of 0.769. However, ML-based imputation methods also presented limitations, particularly when the proportion of missing data was high.
CONCLUSIONS: Using machine learning (ML) methods for missing data imputation is a promising approach to improving the performance and robustness of predictive models in healthcare. However, the reviewed studies highlight remaining challenges, particularly in cases of high missingness. This warrants cautious interpretation and further methodological refinement.
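ILLUSTRATION: The kind of comparison reported above (simple imputation versus kNN imputation, scored by RMSE on the values that were masked) can be sketched as follows. This is a minimal sketch on a small synthetic dataset and not the procedure or data of any reviewed study: the variables `x` and `y_true`, the masked positions, the choice of k = 2 and all function names are assumptions made purely for illustration, and the resulting RMSE values are not comparable to those reported in the abstract.

```python
import math


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def knn_impute(features, values, k=2):
    """Replace each missing value (None) with the mean of the k observed
    records whose feature is closest (absolute distance) to the record's own."""
    donors = [(f, v) for f, v in zip(features, values) if v is not None]
    imputed = []
    for f, v in zip(features, values):
        if v is None:
            nearest = sorted(donors, key=lambda d: abs(d[0] - f))[:k]
            v = sum(d[1] for d in nearest) / len(nearest)
        imputed.append(v)
    return imputed


def rmse(truth, imputed, missing_idx):
    """Root mean square error computed only over the masked positions."""
    squared = [(truth[i] - imputed[i]) ** 2 for i in missing_idx]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical synthetic data: an outcome that depends linearly on one
# fully observed covariate (assumption, chosen so the example is deterministic).
x = [float(i) for i in range(20)]
y_true = [2.0 * xi + 1.0 for xi in x]

# Mask a few known values so that imputation error can be measured.
missing_idx = [3, 8, 15]
y_obs = [None if i in missing_idx else v for i, v in enumerate(y_true)]

mean_rmse = rmse(y_true, mean_impute(y_obs), missing_idx)
knn_rmse = rmse(y_true, knn_impute(x, y_obs, k=2), missing_idx)

print(f"mean imputation RMSE: {mean_rmse:.3f}")
print(f"kNN imputation RMSE:  {knn_rmse:.3f}")
```

In this toy setting kNN benefits because the covariate is strongly informative and each masked record has observed neighbours on both sides; with a high proportion of missing values the pool of usable donors shrinks, which is consistent with the limitation noted in the results.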
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR141
Topic
Epidemiology & Public Health, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas