Streamlining Systematic Review Feasibility Assessments With Large Language Models: A Novel AI-Driven Workflow
Author(s)
Disher T
EVERSANA, West Porters Lake, NS, Canada
OBJECTIVES: The integration of artificial intelligence (AI) into systematic literature review workflows has predominantly focused on automated screening and data extraction. Following these stages, a detailed feasibility assessment of evidence synthesis should be performed to identify sources of heterogeneity in outcome definitions, inclusion criteria, and other trial design characteristics. We propose a structured workflow employing large language models (LLMs) for this task.
METHODS: This study was based on two systematic literature reviews. For each review, we used GPT-3.5, through a combination of iterative chat conversations and API calls, to assess the feasibility of synthesizing outcomes and inclusion/exclusion criteria. The primary tasks were: (1) de-duplication of identical or near-identical outcome definitions; (2) identification of key areas of clinically important variability in outcome definitions and inclusion/exclusion criteria; and (3) extraction of criteria into structured forms that capture variability in these components. Steps 1 and 2 were refined iteratively in a chat interface, while step 3 was conducted via API calls. All assessments and outputs were corroborated against gold-standard documents and verified by human experts.
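As an illustration of step 3, the sketch below shows how eligibility criteria might be extracted into a structured form via an API call. It is a minimal example assuming the OpenAI Python client (openai>=1.0) and GPT-3.5; the prompt wording and form fields (population, min_age, max_age, prior_therapy_required, disease_severity) are hypothetical and are not the extraction form used in the study.

```python
# Minimal sketch of step 3: structured extraction of eligibility criteria
# via an LLM API call. Assumes the OpenAI Python client (openai>=1.0);
# the prompt and form fields below are illustrative, not the study's form.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical structured form capturing common sources of variability.
FORM_FIELDS = {
    "population": "free-text description of the enrolled population",
    "min_age": "minimum age in years, or null if not stated",
    "max_age": "maximum age in years, or null if not stated",
    "prior_therapy_required": "true/false/null",
    "disease_severity": "free-text severity restriction, or null",
}

def extract_criteria(criteria_text: str) -> dict:
    """Ask the model to map raw eligibility text onto the form fields."""
    prompt = (
        "Extract the following trial eligibility criteria into a JSON "
        f"object with exactly these fields: {json.dumps(FORM_FIELDS)}\n\n"
        f"Criteria:\n{criteria_text}\n\n"
        "Respond with JSON only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # favor stable, reproducible extraction output
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    example = (
        "Adults aged 18-75 with moderate-to-severe disease who have "
        "failed at least one prior systemic therapy."
    )
    print(extract_criteria(example))
```

In a workflow of this kind, each extracted record would then be checked against the source publication by a human reviewer, mirroring the expert verification step described above.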
RESULTS: The LLM workflow substantially reduced the manual effort required for review, data sheet development, and extraction in feasibility assessments. Although human intervention and review remained necessary, agreement across tasks was generally high (~90%). Errors in outcome definitions and inclusion criteria were typically straightforward to identify and correct, yielding substantial time savings. Evaluating exclusion criteria, however, posed greater challenges and required more input from clinical experts to develop an initial structured data extraction form. Subsequent extraction showed unacceptable error rates, necessitating a narrower focus on a more tractable sub-task of exclusion feasibility.
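For context on how the reported agreement might be quantified, the sketch below computes simple field-level percent agreement between LLM extractions and human-verified values. The records and the metric are hypothetical; the abstract does not specify how the ~90% figure was calculated.

```python
# Minimal sketch: field-level percent agreement between LLM-extracted
# and human-verified structured forms. The records below are hypothetical;
# the abstract does not state the exact agreement metric used.
def percent_agreement(llm_records, human_records):
    """Share of (record, field) pairs where LLM and human values match."""
    matches = total = 0
    for llm, human in zip(llm_records, human_records):
        for field, truth in human.items():
            total += 1
            matches += llm.get(field) == truth
    return matches / total if total else 0.0

llm = [{"min_age": 18, "max_age": 75, "prior_therapy_required": True}]
human = [{"min_age": 18, "max_age": 75, "prior_therapy_required": False}]
print(f"Agreement: {percent_agreement(llm, human):.0%}")  # Agreement: 67%
```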
CONCLUSIONS: Large language models can likely streamline the evidence synthesis feasibility assessment process with minimal risk, provided experts are involved at every stage. Evaluating exclusion criteria may be more complex owing to greater variability in language and the difficulty of interpreting the implications of individual criteria.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
HTA239
Topic
Study Approaches
Topic Subcategory
Meta-Analysis & Indirect Comparisons
Disease
No Additional Disease & Conditions/Specialized Treatment Areas