RECOMMENDATIONS AND LIMITATIONS WHEN USING MACHINE LEARNING IN RARE DISEASE IDENTIFICATION
Author(s)
ABSTRACT WITHDRAWN
OBJECTIVES Imbalance between the relative sizes of patient subgroups is a major challenge when identifying patients with rare diseases or novel cancer subtypes. This research aims to determine whether standard machine learning (ML) techniques, and modifications of these techniques, can aid the diagnosis of rare diseases.

METHODS A pragmatic literature review was conducted to find ML methods for addressing patient subgroup imbalances in the medical literature. Identified methods were applied to synthetic gene expression data to determine their relative efficacy across a range of effect sizes. The analysis classified moderately imbalanced subgroups and compared ML algorithms with and without the Synthetic Minority Over-sampling Technique (SMOTE), a data pre-processing technique.

RESULTS The pragmatic literature review identified 1,154 publications, 19 of which were ultimately found to be relevant. Eighteen publications used a pre-processing method to address patient subgroup imbalances or adapted the ML algorithm itself to account for different subgroup sizes. When applied to the synthetic gene expression data, four classifiers (random forest [RF], gradient boosting [GB], support vector machine [SVM] and K-nearest neighbour [KNN]) all showed predictive capability under moderate patient class imbalance, with RF and GB generally performing better. With GB, area under the ROC curve (AUC) values quickly converged to 1 across effect sizes, regardless of whether the patient subgroups were balanced, imbalanced or rebalanced with SMOTE. SMOTE improved accuracy for RF and SVM. For KNN, however, SMOTE had a negative impact on the mean AUC: the SMOTE analysis had a consistent mean AUC of 0.50 across effect sizes, meaning that it had no predictive capability.

CONCLUSIONS When applied to minority patient subgroup identification, the success of ML methods varies greatly. SMOTE can be used before classification and may improve the predictive accuracy of some classifiers but reduce the effectiveness of others.
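
The abstract does not describe how SMOTE or the AUC calculation was implemented. As a rough illustration only, the sketch below shows the core idea of SMOTE (new minority-class points are interpolated between a minority sample and one of its nearest minority-class neighbours) and a pairwise definition of AUC, in plain Python. The function names, the 90/10 class split, the five simulated genes, the effect size of 1.0 and k = 5 are all hypothetical choices made for this example, not values taken from the study.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def auc(scores_pos, scores_neg):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical synthetic "gene expression" data: the minority class is shifted
# by effect_size in every gene. These values are assumptions for illustration.
rng = random.Random(42)
n_genes, effect_size = 5, 1.0
majority = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(90)]
minority = [[rng.gauss(effect_size, 1.0) for _ in range(n_genes)] for _ in range(10)]

# Oversample the minority class until both classes are the same size.
minority_balanced = minority + smote(minority, n_new=len(majority) - len(minority))
print(len(majority), len(minority_balanced))
```

Under this definition, a classifier whose scores cannot separate the two classes gives an AUC of 0.5, which is the sense in which a mean AUC of 0.50 indicates no predictive capability. In practice, oversampling would be applied to the training split only, so that synthetic points do not leak into the evaluation data.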
Conference/Value in Health Info
2020-05, ISPOR 2020, Orlando, FL, USA
Value in Health, Volume 23, Issue 5, S1 (May 2020)
Code
PRO67
Topic
Medical Technologies, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Diagnostics & Imaging
Disease
No Specific Disease, Rare and Orphan Diseases