Large Language Model and Contextual Prioritization of Titles and Abstracts for Systematic Review Screening
Author(s)
Riaz Qureshi1; Eitan Agai, BA, MSc2
1Denver, CO, USA, 2PICO Portal, New York, NY, USA
OBJECTIVES: Major advances have been made in using machine learning (ML) to reorder records for screening in systematic reviews (SRs). It remains unclear, however, whether large language models (LLMs) can usefully prioritize records by context without a period of training on human decisions. Our objective was to assess the sensitivity of two approaches to contextual prioritization of records for screening, using two SR case studies.
METHODS: We used the full set of titles/abstracts retrieved for two SRs. SR1 had strict eligibility criteria, 48 includes, and 12,103 initial records; SR2 had broad eligibility criteria, 84 includes, and 9,054 initial records. Approach #1 used ChatGPT-4.0 as the LLM, prompted with the eligibility criteria in full and all titles/abstracts to identify the records most likely to be included. Approach #2 used embeddings to score the contextual similarity of each title/abstract to the inclusion and exclusion criteria separately, combined and normalized the two scores, and ranked the records by relevance to the question.
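The abstract does not report implementation details for either approach. The Python sketch below shows one plausible reading, assuming a sentence-transformers embedding model, cosine similarity, an inclusion-minus-exclusion combined score, and min-max normalization; the prompt template, model name, and function names are all illustrative, not the authors' actual code.

import numpy as np
from sentence_transformers import SentenceTransformer

# Approach #1 (assumed prompt shape; the actual prompt is not reported):
PROMPT_TEMPLATE = (
    "Given the full eligibility criteria below, identify the titles/abstracts "
    "most likely to be included in the review.\n\n"
    "Criteria:\n{criteria}\n\nRecords:\n{records}"
)

# Approach #2: embed records and criteria, score similarity to the inclusion
# and exclusion criteria separately, combine and normalize, then rank.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def rank_records(records, inclusion_criteria, exclusion_criteria):
    rec = model.encode(records, normalize_embeddings=True)
    inc = model.encode([inclusion_criteria], normalize_embeddings=True)
    exc = model.encode([exclusion_criteria], normalize_embeddings=True)
    # With unit-norm embeddings, the dot product equals cosine similarity.
    combined = (rec @ inc.T - rec @ exc.T).ravel()
    # Min-max normalize to [0, 1]; the top third of the range is score > 0.66.
    score = (combined - combined.min()) / (combined.max() - combined.min())
    order = np.argsort(-score)  # descending relevance
    return [(records[i], float(score[i])) for i in order]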
RESULTS: Approach #1 returned 82 and 613 likely includes for SR1 and SR2, respectively. Taking the records from Approach #2 with a normalized score in the top third of the range (>0.66) yielded 1,641 and 489 likely includes. These predictions represented between 1% and 14% of the original records. Against the final included studies, Approach #1 had sensitivities of 0.23 (SR1) and 0.32 (SR2), whereas Approach #2 had sensitivities of 0.75 and 0.60.
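For orientation, sensitivity here is the fraction of the final includes captured among the predicted includes, and the screening burden is the share of all records a reviewer would read. A minimal check against the reported numbers (counts taken from the abstract; implied include counts are rounded):

# Screening fractions implied by the reported counts.
reported = {
    "SR1": {"records": 12103, "includes": 48,
            "approach1_preds": 82, "approach2_preds": 1641},
    "SR2": {"records": 9054, "includes": 84,
            "approach1_preds": 613, "approach2_preds": 489},
}
for sr, d in reported.items():
    for label in ("approach1_preds", "approach2_preds"):
        share = d[label] / d["records"]  # fraction of records screened
        print(f"{sr} {label}: {share:.1%} of records")
# 82/12103 is about 0.7% and 1641/12103 about 13.6%, matching the reported
# 1%-14% range. Sensitivity = captured includes / total includes; e.g.,
# Approach #2 on SR1 implies 0.75 * 48 = 36 of the 48 final includes.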
CONCLUSIONS: Both approaches show considerable potential to improve SR efficiency, particularly Approach #2. After screening between 1% and 14% of all records, a reviewer may already have identified between 23% and 75% of the final includes. This set of initially predicted includes may also make a better ML training set than a purely random initial selection of the same size, which may contain fewer positives for training the algorithm.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, Canada
Value in Health, Volume 28, Issue S1
Code
MSR39
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas