Improving Efficiency in Evidence Synthesis: Performance Assessment of Machine Learning Classifiers for Automated Screening in Systematic Literature Reviews
Author(s)
Shreya Baswankar, MSc Biostats1, Boopathi Kangusamy, MSc Statistics2, Mohd Kashif Siddiqui, MBA, MPH, PharmD3, Raja Rajeeswari C, MSc Biostats3, Jatin Gupta, MPharm3.
1Biostats, ICMR-National Institute of Epidemiology, Chennai, India, 2School of Public Health Division, ICMR-National Institute of Epidemiology, Chennai, India, 3EBM Health Consultants, New Delhi, India.
OBJECTIVES: Systematic literature reviews (SLRs) are important for informing clinical decisions; however, they are time-consuming and labour-intensive, particularly during the title and abstract (Ti/Ab) screening step. This study aimed to evaluate the performance of machine learning (ML)-based classification algorithms in automating Ti/Ab screening and to assess their agreement with expert human reviewers.
METHODS: Human reviewer decisions from a pre-existing SLR served as the gold standard. After text pre-processing, three classifiers were trained: Naïve Bayes (NB) with Term Frequency-Inverse Document Frequency (TF-IDF) features, and Support Vector Machine (SVM) and Logistic Regression (LR) with Word2Vec embeddings. Class imbalance was addressed using the Synthetic Minority Over-sampling Technique (SMOTE). Classifier performance was optimized using a prediction threshold of 0.9 and evaluated using standard metrics, precision-recall (PR) curves, and 10-fold cross-validation. Agreement between each ML classifier and the human reviewer was assessed using Gwet's AC1.
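A minimal sketch of the NB arm of this workflow is given below, assuming scikit-learn and imbalanced-learn; the data loader, hyperparameters, and the way the 0.9 threshold is applied are illustrative assumptions, not details reported in the abstract.

# Illustrative sketch only: loader name, hyperparameters, and threshold handling are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn Pipeline so SMOTE is applied only to training folds

# texts: title + abstract strings; labels: 1 = included, 0 = excluded (human gold standard)
texts, labels = load_screening_data()  # hypothetical loader

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("smote", SMOTE(random_state=42)),   # oversample the minority (included) class
    ("nb", MultinomialNB()),
])

# 10-fold cross-validated predicted probabilities for the "included" class
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
proba = cross_val_predict(pipeline, texts, labels, cv=cv, method="predict_proba")[:, 1]

# The abstract reports a prediction threshold of 0.9; applying it to the inclusion
# probability is an assumption made here for illustration.
predictions = (proba >= 0.9).astype(int)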
RESULTS: Out of 1,610 identified records, the human reviewer included 91 studies and excluded 1,519. NB demonstrated the lowest misclassification rate for the included group (12.1%) and obtained the highest evaluation metrics (sensitivity 88%, specificity 98%, precision 67%, balanced accuracy 93%, F1 score 0.749), outperforming SVM and LR. All three classifiers showed near-perfect agreement with the human reviewer, with NB achieving the highest agreement (Gwet's AC1 = 0.91).
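The headline metrics and Gwet's AC1 can be recomputed from a 2x2 confusion matrix; the sketch below shows the definitions used, with placeholder counts for illustration rather than the study's actual confusion matrix.

# Illustrative sketch: metric definitions only; the counts below are placeholders,
# not figures reported by the study.
def screening_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, precision, balanced accuracy, and F1 for the included class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, balanced_accuracy, f1

def gwet_ac1(tp, fp, fn, tn):
    """Gwet's AC1 chance-corrected agreement for two raters and two categories."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # average marginal proportion of "included" across classifier and human reviewer
    pi = ((tp + fn) / n + (tp + fp) / n) / 2
    expected = 2 * pi * (1 - pi)
    return (observed - expected) / (1 - expected)

# Placeholder example (arbitrary counts, for illustration only)
print(screening_metrics(tp=80, fp=40, fn=11, tn=1479))
print(gwet_ac1(tp=80, fp=40, fn=11, tn=1479))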
CONCLUSIONS: This study demonstrates that ML classifiers can achieve near-perfect agreement with human experts. These models can form the engine of a human-in-the-loop system, substantially reducing manual workload by safely eliminating the majority of irrelevant citations. The future of evidence synthesis in health economics and outcomes research (HEOR) will involve integrating these validated models into dynamic, active learning workflows and benchmarking them against emerging large language models. This will not only accelerate SLR timelines for health technology assessment (HTA) submissions but also unlock the potential to efficiently screen and synthesize evidence from vast real-world data sources.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR129
Topic
Health Technology Assessment, Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas