Automating PICOS Criteria Assessment in Systematic Reviews With LLMs: Insights From Two Case Studies

Author(s)

Theodora Oikonomidi, PhD¹, Lenon Mendes Pereira, PhD², Ketevan Rtveladze, MSc³, Ines Guerra, MSc³.
¹IQVIA, Athens, Greece; ²IQVIA, Chicago, IL, USA; ³IQVIA, London, United Kingdom.
OBJECTIVES: Selecting studies for inclusion in systematic literature reviews (SLRs) against prespecified population, intervention, comparison, outcome, and study design (PICOS) criteria is fundamental to evidence synthesis. However, this process is labour-intensive and time-consuming. This study evaluates the performance of a large language model (LLM) in assessing whether scientific article abstracts meet PICOS criteria.
METHODS: We developed a prompt to identify PICOS elements in abstracts and tested its accuracy in classifying abstracts for inclusion or exclusion, comparing results against decisions made by two human reviewers (considered the ground truth). Two previously completed economic SLRs in haematology (n = 833 abstracts screened) and in oncology (n = 1,712) served as test cases. Discrepancies between LLM and human decisions were analysed by a subject matter expert (SME).
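The comparison step described above can be illustrated with a minimal sketch. It assumes each abstract carries one human decision and one LLM decision, stored as "include" or "exclude"; the function name, the data layout, and the abstract IDs are hypothetical and are not taken from the study.

```python
from collections import Counter

def compare_decisions(human, llm):
    """Tally agreement between human (ground truth) and LLM decisions.

    Both arguments are dicts mapping an abstract ID to "include" or
    "exclude". Returns the tally and the IDs on which the two disagree,
    which would be the cases passed to a subject matter expert.
    """
    counts = Counter()
    discrepancies = []
    for abstract_id, human_decision in human.items():
        llm_decision = llm[abstract_id]
        counts[(human_decision, llm_decision)] += 1
        if human_decision != llm_decision:
            discrepancies.append(abstract_id)
    return counts, discrepancies

# Hypothetical toy data, for illustration only.
human = {"A1": "include", "A2": "exclude", "A3": "include", "A4": "exclude"}
llm = {"A1": "include", "A2": "include", "A3": "exclude", "A4": "exclude"}

counts, discrepancies = compare_decisions(human, llm)
```

Here `counts[("include", "exclude")]` holds the abstracts the LLM missed and `counts[("exclude", "include")]` those it included against the human decision.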
RESULTS: In the haematology SLR, the LLM correctly included 80% (n = 37) of the 46 abstracts included by human reviewers. Notably, none of the nine abstracts missed by the LLM were ultimately included at full-text review. In the oncology SLR, the LLM correctly included 67% (n = 173) of the 258 abstracts included by human reviewers; of the missed abstracts, only 19 were later included at full-text review. Analysis of the LLM’s exclusion rationale revealed that, in all but three cases, incorrect exclusions were due to misclassification of study design. This suggests that refining the PICOS criteria in the prompt to better describe economic/resource use study designs could enhance performance. The LLM demonstrated good specificity: it included only 14 and 83 abstracts that human reviewers had excluded in the haematology and oncology SLRs, respectively.
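As a rough check, the reported counts can be turned into sensitivity and specificity estimates. The sketch below assumes that every screened abstract received a human include/exclude decision, so the number of human-excluded abstracts is the number screened minus the number included; the abstract does not state this denominator, and the function name is hypothetical.

```python
def screening_metrics(n_screened, human_included, llm_true_pos, llm_false_pos):
    """Derive sensitivity and specificity from abstract-level counts.

    Assumption: human_excluded = n_screened - human_included, i.e. every
    screened abstract has a human decision.
    """
    human_excluded = n_screened - human_included
    false_neg = human_included - llm_true_pos   # included by humans, missed by LLM
    true_neg = human_excluded - llm_false_pos   # excluded by both
    return {
        "sensitivity": llm_true_pos / human_included,
        "specificity": true_neg / human_excluded,
        "false_negatives": false_neg,
    }

# Counts as reported in the abstract.
haem = screening_metrics(n_screened=833, human_included=46,
                         llm_true_pos=37, llm_false_pos=14)
onc = screening_metrics(n_screened=1712, human_included=258,
                        llm_true_pos=173, llm_false_pos=83)

for name, m in [("haematology", haem), ("oncology", onc)]:
    print(f"{name}: sensitivity={m['sensitivity']:.1%}, "
          f"specificity={m['specificity']:.1%}, "
          f"missed={m['false_negatives']}")
```

Under that assumption, specificity comes out at roughly 98% for haematology and 94% for oncology, alongside the reported sensitivities of 80% and 67%.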
CONCLUSIONS: LLM performance in identifying PICOS elements and making inclusion decisions at the abstract level is promising. Importantly, missed inclusions might be minimised by further elaborating PICOS criteria. Future work should focus on developing and prospectively validating integrated workflows that incorporate LLMs alongside human reviewers in the SLR process.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR42

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
