CAN ARTIFICIAL INTELLIGENCE ASSISTED SYSTEMATIC LITERATURE REVIEW SUPPORT RIGOROUS EVIDENCE GENERATION? A CASE STUDY ON PREDICTORS OF COPD EXACERBATIONS

Author(s)

Ákos Bernard Józwiak, PhD1, Judit Józwiak-Hagymásy, MSc1, Agnes Nagy, MSc2, Judit Tittmann, MD2, Sándor Kovács, BA, MBA, MSc1, Przemyslaw Kardas, PhD, MD3, Job FM van Boven, PhD4, Irene Mommers, PhD4, Attila Imre, PharmD5, Tamas Agh, MSc, PhD, MD6;
1SYREON Research Institute, Budapest, Hungary, 2University of Pecs, Pecs, Hungary, 3Medical University of Lodz, Lodz, Poland, 4University Medical Center Groningen, University of Groningen, Groningen, Netherlands, 5Center for Health Technology Assessment, Semmelweis University & Syreon Research Institute, Budapest, Hungary, 6Center for HTA and Pharmacoeconomic Research, University of Pecs & Syreon Research Institute, Budapest, Hungary
OBJECTIVES: To evaluate whether artificial intelligence (AI)-assisted systematic literature review (SLR) workflows can support rigorous and reliable evidence generation, using AI-assisted screening and data extraction in a case study on predictors of chronic obstructive pulmonary disease (COPD) exacerbations.
METHODS: A literature search was conducted in MEDLINE and Embase. Search hits were de-duplicated in the Systematic Review Accelerator, followed by title and abstract screening assisted by a generative AI pipeline developed in KNIME. Screening was performed using a zero-shot large language model (LLM) (meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8) accessed via DeepInfra. Records deemed potentially relevant were subjected to AI-assisted full-text screening and data extraction in a single step using two parallel pipelines. In the first pipeline, full texts were converted to machine-readable text with Mistral OCR and processed using the Qwen/Qwen3-235B-A22B-Instruct-2507 LLM via DeepInfra in KNIME. The second pipeline used OpenAI ChatGPT (version 5). AI outputs were validated by a human reviewer.
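The zero-shot title and abstract screening step could be sketched as follows. This is a minimal illustration, not the authors' actual KNIME pipeline: only the model name and the DeepInfra provider come from the abstract, while the prompt wording, the INCLUDE/EXCLUDE protocol, and the `DEEPINFRA_API_KEY` environment variable are illustrative assumptions. DeepInfra exposes an OpenAI-compatible chat-completions endpoint, which is called here with the standard library only.

```python
# Hypothetical sketch of zero-shot title/abstract screening via
# DeepInfra's OpenAI-compatible endpoint (not the authors' pipeline).
import json
import os
import urllib.request

MODEL = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
ENDPOINT = "https://api.deepinfra.com/v1/openai/chat/completions"


def build_prompt(title: str, abstract: str) -> str:
    # Zero-shot: the instruction carries the inclusion criteria;
    # no labelled example records are supplied.
    return (
        "You are screening records for a systematic review on predictors of "
        "chronic obstructive pulmonary disease (COPD) exacerbations. Reply "
        "with exactly one word, INCLUDE or EXCLUDE.\n\n"
        f"Title: {title}\nAbstract: {abstract}"
    )


def parse_decision(reply: str) -> str:
    # Conservative normalisation: anything other than a clear INCLUDE
    # is treated as EXCLUDE (a human reviewer validates a sample anyway).
    return "INCLUDE" if reply.strip().upper().startswith("INCLUDE") else "EXCLUDE"


def screen_record(title: str, abstract: str) -> str:
    # One chat-completion call per record; temperature 0 for repeatability.
    body = json.dumps({
        "model": MODEL,
        "temperature": 0,
        "messages": [{"role": "user", "content": build_prompt(title, abstract)}],
    }).encode()
    req = urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['DEEPINFRA_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    return parse_decision(reply)
```

Mapping the model output to a strict binary label makes the step auditable: every record receives exactly one decision that can later be compared against human judgments, as in the 30-record validation sample reported below.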
RESULTS: Of 8,520 identified records, 1,492 original studies and 80 reviews were flagged as potentially relevant after title and abstract screening. In a 30-record validation sample, the accuracy of AI-assisted title and abstract screening was 96.7%. Full-text screening and data extraction were limited to the reviews. Although both AI models extracted data from all 80 reviews, human validation excluded 38 records: while these reviews focused on COPD and referenced exacerbations, the predictors they reported related to outcomes other than COPD exacerbations. The Qwen pipeline extracted 802 predictors and the ChatGPT pipeline 546. After human validation, 110 predictors were identified as relevant; following refinement and harmonisation of terminology, 88 predictors were retained for the narrative synthesis.
CONCLUSIONS: Our findings demonstrate that AI can meaningfully support SLRs; however, human validation remains essential. AI-assisted SLR workflows showed high accuracy in title and abstract screening and efficiently processed large volumes of evidence, while AI-assisted data extraction required substantial human validation.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR71

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Respiratory-Related Disorders (Allergy, Asthma, Smoking, Other Respiratory)
