RECOMMENDATIONS AND LIMITATIONS WHEN USING MACHINE LEARNING IN RARE DISEASE IDENTIFICATION

Author(s)

ABSTRACT WITHDRAWN

OBJECTIVES

Imbalance between the relative sizes of patient subgroups is a major challenge when identifying patients with rare diseases or novel cancer subtypes. This research aims to determine whether standard machine learning (ML) techniques, and modifications of these techniques, can aid the diagnosis of rare diseases.

METHODS

A pragmatic literature review was conducted to find ML methods for addressing patient subgroup imbalances in the medical literature. Identified methods were applied to synthetic gene expression data to determine their relative efficacy across a range of effect sizes. The analysis classified moderately imbalanced subgroups and compared ML algorithms both with and without the application of Synthetic Minority Over-sampling Technique (SMOTE), a data pre-processing technique.
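
To illustrate the pre-processing step named above, the following is a minimal sketch of the core idea behind SMOTE: new minority-class samples are created by interpolating between an existing minority sample and one of its nearest minority-class neighbours. It is not the implementation used in this study; the function name `smote`, its parameters (`n_new`, `k`, `seed`) and the toy data are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples built from a list of minority-class
    feature vectors (hypothetical helper, not the study's code)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class with two features (made-up values)
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote(minority, n_new=6, k=2)
```

Because each synthetic point lies on a line segment between two real minority samples, oversampling of this kind adds no information beyond what the minority class already contains; it only changes the class proportions seen by the classifier.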

RESULTS

The pragmatic literature review identified 1,154 publications, of which 19 were ultimately found to be relevant. Eighteen publications used a pre-processing method to address patient subgroup imbalances or adapted the ML algorithm itself to account for different subgroup sizes. When applied to the synthetic gene expression data, all four classifiers (random forest [RF], gradient boosting [GB], support vector machine [SVM] and K-nearest neighbour [KNN]) had predictive capability for moderate patient class imbalances, with RF and GB generally performing better. With GB, area under the ROC curve (AUC) values quickly converged to 1 across effect sizes, regardless of whether the patient subgroups were balanced or imbalanced, or whether SMOTE was used. SMOTE improved accuracy for RF and SVM. For KNN, however, SMOTE had a negative impact on the mean AUC: with SMOTE, the mean AUC was consistently 0.50 across effect sizes, meaning that the classifier had no predictive capability.
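
As a reading aid for the AUC values reported above, the sketch below computes AUC as the probability that a randomly chosen minority-class (positive) sample receives a higher score than a randomly chosen majority-class (negative) sample, with ties counted as one half. The abstract does not state how AUC was calculated, so the function `auc` and the toy scores are assumptions for illustration.

```python
def auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores.
    Labels are 1 (positive) or 0 (negative); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Imbalanced toy labels: 8 negatives, 2 positives (made-up data)
labels = [0] * 8 + [1] * 2

# Every positive is scored above every negative
perfect = auc([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.3, 0.2, 0.9, 0.8], labels)

# Every sample receives the same score, so the ranking carries no information
uninformative = auc([0.5] * 10, labels)
```

Here `perfect` is 1.0 and `uninformative` is 0.5, the two ends of the range discussed in the results: an AUC of 1 means the classes are ranked perfectly, and an AUC of 0.50 means the scores do no better than chance at separating them.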

CONCLUSIONS

When applied to minority patient subgroup identification, the success of ML methods varies greatly. SMOTE can be used before classification and may improve the predictive accuracy of some classifiers but reduce the effectiveness of others.

Conference/Value in Health Info

2020-05, ISPOR 2020, Orlando, FL, USA

Value in Health, Volume 23, Issue 5, S1 (May 2020)

Code

PRO67

Topic

Medical Technologies, Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Diagnostics & Imaging

Disease

No Specific Disease, Rare and Orphan Diseases
