Evaluation of Machine Learning Assisted Title and Abstract Screening in 5 Clinical Systematic Literature Reviews

Author(s)

Jose Marcano Belisario, BSc, MPH, PhD1, Michaela Lunan, BA, MA, PhD1, Emma Hawe, BSc, MSc1, Gene Farrelly, MBA2;
1RTI Health Solutions, Value and Access - Evidence Synthesis and Statistics, Manchester, United Kingdom, 2RTI Health Solutions, Research Technology, Research Triangle Park, NC, USA
OBJECTIVES: Many systematic literature review (SLR) platforms offer built-in machine learning-assisted screening, which calculates the probability that each citation will be advanced at the title and abstract stage. These advancement probabilities could reduce the manual screening workload if a threshold for making bulk exclusions could be identified; the objective of this project was to assess whether an appropriate threshold can be determined.
METHODS: Screening decisions from 5 SLRs were compared with the advancement probabilities generated by Nested Knowledge across 4 training scenarios, in which 20% (T1), 30% (T2), 40% (T3), or 50% (T4) of citations were randomly selected for training. These SLRs covered systemic lupus erythematosus (SLE), non-small cell lung cancer, breast cancer, amyotrophic lateral sclerosis, and allergic rhinoconjunctivitis (ARC). For each scenario, the cross-validation metrics obtained after training the machine learning algorithm were recorded: recall, area under the curve (AUC), precision, F1 score, and accuracy.
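As a concrete illustration of the metrics listed above, the sketch below computes recall, precision, F1 score, accuracy, and AUC for one cross-validation fold from human screening decisions and model advancement probabilities. This is a hypothetical pure-Python helper for readers, not the Nested Knowledge implementation; the 0.5 decision threshold is an assumption.

```python
def screening_metrics(y_true, y_prob, threshold=0.5):
    """Cross-validation metrics for one fold of title/abstract screening.

    y_true: 1 = citation advanced by human reviewers, 0 = excluded.
    y_prob: model advancement probabilities.
    Illustrative helper only -- not the Nested Knowledge implementation.
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    recall = tp / (tp + fn) if tp + fn else 0.0          # included citations found
    precision = tp / (tp + fp) if tp + fp else 0.0       # advanced citations that were correct
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)

    # AUC via the rank-sum (Mann-Whitney U) formulation: the fraction of
    # (included, excluded) pairs ranked correctly by the probabilities.
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")

    return {"recall": recall, "precision": precision, "f1": f1,
            "accuracy": accuracy, "auc": auc}
```

Running the helper per training scenario (T1 through T4) would reproduce the kind of metric-versus-training-size comparison reported in the results.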
RESULTS: Across scenarios, recall ranged from 0.67 (ARC T1) to 0.92 (SLE T3 and SLE T4) and generally increased with larger training sets (correlation = 0.62). AUC, precision, F1 score, and accuracy were not strongly correlated with training set size (correlations of 0.43, 0.23, 0.32, and 0.29, respectively). Across all SLRs and scenarios, low advancement probabilities (0.00 to 0.15) were assigned to at least 1 article that had been included by the human reviewers. Reasons included records with no abstract; trial registry records that differed in formatting from the standard journal entries in the training set; and standard journal entries that covered aspects of the SLR question not captured in the training set.
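The threshold check behind this finding can be sketched as follows: given a candidate bulk-exclusion cutoff, list every human-included citation whose advancement probability falls at or below it. The function and citation tuple layout are assumptions for illustration, not part of any screening platform; the 0.15 default mirrors the upper bound reported above.

```python
def missed_inclusions(citations, threshold=0.15):
    """Find human-included citations that a bulk exclusion would discard.

    citations: iterable of (citation_id, included_by_humans, advancement_prob).
    An empty result would support the candidate threshold; a non-empty one
    (as observed in all 5 SLRs here) argues against bulk exclusion at it.
    Illustrative helper with an assumed data layout.
    """
    return [cid for cid, included, prob in citations
            if included and prob <= threshold]
```

For example, a trial registry record with no abstract might receive a probability of 0.05 despite being included by reviewers, so it would appear in the returned list and flag the threshold as unsafe.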
CONCLUSIONS: This project provided practical examples of the impact that cross-validation measures have on the reliability of SLR findings. It also identified the complexity of SLR questions and the representativeness of training sets as factors that can influence advancement probabilities, and thus decisions about thresholds for bulk exclusions.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR46

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
