Automating PICOS Criteria Assessment in Systematic Reviews With LLMs: Insights From Two Case Studies
Author(s)
Theodora Oikonomidi, PhD1, Lenon Mendes Pereira, PhD2, Ketevan Rtveladze, MSc3, Ines Guerra, MSc3.
1IQVIA, Athens, Greece, 2IQVIA, Chicago, IL, USA, 3IQVIA, London, United Kingdom.
OBJECTIVES: Systematic study selection for inclusion in systematic literature reviews (SLRs), based on prespecified population, intervention, comparison, outcome, and study design (PICOS) criteria, is fundamental to evidence synthesis. However, this process is labour-intensive and time-consuming. This study evaluates the performance of a large language model (LLM) in assessing whether scientific article abstracts meet PICOS criteria.
METHODS: We developed a prompt to identify PICOS elements in abstracts and tested its accuracy in classifying abstracts for inclusion or exclusion, comparing results against decisions made by two human reviewers (considered the ground truth). Two previously completed economic SLRs in haematology (n = 833 abstracts screened) and in oncology (n = 1,712) served as test cases. Discrepancies between LLM and human decisions were analysed by a subject matter expert (SME).
RESULTS: In the haematology SLR, the LLM correctly included 80% (n=37) of the 46 abstracts included by human reviewers. Notably, none of the nine abstracts missed by the LLM were ultimately included at full-text review. In the oncology SLR, the LLM correctly included 67% (n=173) of the 258 abstracts included by human reviewers; of the 85 missed abstracts, only 19 were later included at full-text review. Analysis of the LLM’s exclusion rationale revealed that, in all but three cases, incorrect exclusions were due to misclassification of study design. This suggests that refining the PICOS criteria in the prompt to better describe economic/resource use study designs could enhance performance. The LLM demonstrated good specificity, incorrectly including only 14 and 83 abstracts that human reviewers had excluded in the haematology and oncology SLRs, respectively.
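The reported counts can be turned into sensitivity and specificity with a short calculation. The sketch below is illustrative only and is not the authors' analysis code. It assumes that every screened abstract received a human include/exclude decision, so that the number of human-excluded abstracts equals the number screened minus the number included; the abstract does not state this explicitly. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Counts reported for one SLR (field names are illustrative)."""
    screened: int             # abstracts screened
    human_included: int       # abstracts included by human reviewers
    llm_true_includes: int    # human-included abstracts the LLM also included
    llm_false_includes: int   # human-excluded abstracts the LLM included

    @property
    def human_excluded(self) -> int:
        # Assumption: every screened abstract was either included or
        # excluded by the human reviewers.
        return self.screened - self.human_included

    @property
    def missed(self) -> int:
        # Human-included abstracts that the LLM excluded.
        return self.human_included - self.llm_true_includes

    @property
    def sensitivity(self) -> float:
        return self.llm_true_includes / self.human_included

    @property
    def specificity(self) -> float:
        true_excludes = self.human_excluded - self.llm_false_includes
        return true_excludes / self.human_excluded


# Counts taken from the METHODS and RESULTS sections.
haematology = ScreeningCounts(833, 46, 37, 14)
oncology = ScreeningCounts(1712, 258, 173, 83)

for name, slr in [("haematology", haematology), ("oncology", oncology)]:
    print(
        f"{name}: sensitivity={slr.sensitivity:.1%}, "
        f"specificity={slr.specificity:.1%}, missed={slr.missed}"
    )
```

Under that assumption, the calculation gives a specificity of roughly 98% for the haematology SLR and 94% for the oncology SLR, alongside the reported sensitivities of 80% and 67%.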
CONCLUSIONS: LLM performance in identifying PICOS elements and making inclusion decisions at the abstract level is promising. Importantly, missed inclusions might be minimised by further elaborating PICOS criteria. Future work should focus on developing and prospectively validating integrated workflows that incorporate LLMs alongside human reviewers in the SLR process.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR42
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas