A Comparative Analysis of Missing Value Imputation Techniques: Spline vs Markov Chain Monte Carlo & Machine Learning Algorithms
Speaker(s)
Paul Choudhury S1, Dutta Majumdar A2, Sil A1, Dutta S1, Mahon R3
1PharmaQuant Insights Pvt. Ltd., Kolkata, West Bengal, India, 2PharmaQuant Insights Pvt. Ltd., Kolkata, WB, India, 3University of Galway, Galway, Ireland
OBJECTIVES: Addressing missing data is crucial in healthcare research. While machine learning (ML) algorithms offer solutions, they often require large, high-quality datasets, which may not always be available. Markov chain Monte Carlo (MCMC) imputation is effective but struggles with high-dimensional data. We propose a novel imputation approach using spline models, which are known for their flexibility in capturing complex, non-linear relationships, and compare its performance with established methods.
METHODS: Data from the NCCTG database (NLCD) and the Veterans' Administration Lung Cancer Study (VACD) were selected for this analysis. The variables 'age' from NLCD and 'diagtime' from VACD were chosen for imputation. We randomly removed 30% of the data from each dataset to create a test set. The remaining 70% was used to train several imputation models: random forest (RF), decision tree (DT), support vector machine (SVM), gradient boosting model (GBM), and linear regression (LR). Additionally, multiple imputation using MCMC was performed on the missing data. Natural spline models (NSM) and regression spline models (RSM) were also fitted to the training data. Because RSM often exhibits high variance at the extremes of the predictor, which widens confidence intervals, especially in small samples, NSM was fitted with hyperparameter tuning to refine the shape of the spline. The root-mean-square error (RMSE) of the imputed values was calculated to compare the accuracy of the imputation techniques.
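The hold-out evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the predictor and target are hypothetical stand-ins, the knots are placed at training-set quantiles by assumption, no hyperparameter tuning is performed, and only a linear regression baseline is included. The natural cubic spline basis uses the standard truncated power construction, which constrains the fit to be linear beyond the boundary knots.

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Natural cubic spline basis (truncated power form).

    The fitted curve is cubic between knots and linear beyond the
    boundary knots. K knots produce K basis columns.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    last = knots[-1]

    def d(k):
        # Scaled difference of truncated cubics for knot k
        return (np.maximum(x - knots[k], 0.0) ** 3
                - np.maximum(x - last, 0.0) ** 3) / (last - knots[k])

    cols = [np.ones_like(x), x]
    d_penultimate = d(len(knots) - 2)
    for k in range(len(knots) - 2):
        cols.append(d(k) - d_penultimate)
    return np.column_stack(cols)

def fit_predict(basis_train, y_train, basis_test):
    """Ordinary least squares fit on the training basis, then predict."""
    coef, *_ = np.linalg.lstsq(basis_train, y_train, rcond=None)
    return basis_test @ coef

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and imputed values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic stand-in data (assumption): one predictor, one non-linear target.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 60.0 + 0.5 * (x - 5.0) ** 2 + rng.normal(0.0, 1.0, size=n)

# Mask 30% of the target values as "missing"; train on the remaining 70%.
missing = np.zeros(n, dtype=bool)
missing[rng.choice(n, size=int(0.3 * n), replace=False)] = True
x_train, y_train = x[~missing], y[~missing]
x_test, y_test = x[missing], y[missing]

# Knots at training-set quantiles (hypothetical choice, not tuned).
knots = np.quantile(x_train, [0.0, 0.25, 0.5, 0.75, 1.0])

# Impute the masked values with the natural spline model.
ns_pred = fit_predict(natural_spline_basis(x_train, knots), y_train,
                      natural_spline_basis(x_test, knots))

# Impute the same values with a linear regression baseline.
lin_train = np.column_stack([np.ones_like(x_train), x_train])
lin_test = np.column_stack([np.ones_like(x_test), x_test])
lin_pred = fit_predict(lin_train, y_train, lin_test)

rmse_ns = rmse(y_test, ns_pred)
rmse_lin = rmse(y_test, lin_pred)
print(f"Natural spline RMSE: {rmse_ns:.2f}")
print(f"Linear regression RMSE: {rmse_lin:.2f}")
```

In practice the number and position of the knots would be tuned, for example by cross-validation on the training portion, and the same masked values would be scored for every competing method so that the RMSEs are directly comparable.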
RESULTS: NSM had the lowest RMSE of all models on both datasets. For the NLCD dataset, the RMSEs were: NSM, 8.1; RSM, 8.6; MCMC (five seeds), 11.1 to 12.6; RF, 8.7; DT, 8.9; LR, 8.5; SVM, 8.6; GBM, 8.7. For the VACD dataset, the RMSEs were: NSM, 4.5; RSM, 5.3; MCMC (five seeds), 4.8 to 5.6; RF, 5.1; DT, 4.6; LR, 4.8; SVM, 5.1; GBM, 4.8.
CONCLUSIONS: The results suggest that natural spline models outperform the other methods in terms of RMSE on both datasets, indicating their potential as a viable alternative for imputing missing values.
Code
MSR172
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Missing Data
Disease
No Additional Disease & Conditions/Specialized Treatment Areas