AI-Assisted Systematic Literature Screening According to NICE, UK, and CDA (Canada) Position Papers: An HTA Submission Use Case
Author(s)
Dilip Makhija, MS1, Marvin Rock, DrPH, MPH1, Chong H Kim, MPH, MS, PhD1, Mirko von Hein, M.Sc.2, Rajdeep Kaur, PhD3, Sumeet Attri, M.Pharm3, Barinder Singh, RPh3.
1Gilead Sciences, Inc., Foster City, CA, USA, 2Gilead Sciences, London, United Kingdom, 3Pharmacoevidence, Mohali, India.
OBJECTIVES: This study aimed to implement a hybrid review framework that integrates AI-assisted screening with expert human oversight to conduct multiple systematic literature reviews (SLRs) in primary biliary cholangitis (PBC). The approach aligns with guidance in the NICE (UK) and CDA (Canada) position papers. The reviews covered seven key domains: epidemiology, treatment patterns, treatment guidelines, economic evaluation, healthcare resource utilization and related costs, health utility values, and humanistic outcomes.
METHODS: A comprehensive literature search was conducted across EMBASE®, MEDLINE®, NHS EED, CENTRAL, and CDSR from inception through 2024. Inclusion/exclusion criteria were guided by the PICOS framework. A parallel dual-review and quality control process was used for data collection, in which each citation was screened independently by a human reviewer and by a GPT-4-based AI tool, both applying the same predefined inclusion and exclusion criteria. A human subject matter expert resolved conflicts between the human and AI screening decisions.
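The parallel screening step described in the methods can be illustrated with a minimal sketch. It assumes each reviewer's decisions are stored as a mapping from citation ID to "include" or "exclude"; the function name `compare_decisions`, the citation IDs, and the decisions themselves are hypothetical and are not taken from the study. The sketch computes simple percent agreement and lists the citations that would be passed to the subject matter expert.

```python
from typing import Dict, List, Tuple


def compare_decisions(
    human: Dict[str, str], ai: Dict[str, str]
) -> Tuple[float, List[str]]:
    """Return percent agreement and the citation IDs needing adjudication.

    Both arguments map a citation ID to a screening decision
    ("include" or "exclude"). This is an illustrative helper, not the
    tool used in the study.
    """
    if human.keys() != ai.keys():
        raise ValueError("Both reviewers must screen the same citations")
    if not human:
        raise ValueError("No citations to compare")
    # A conflict is any citation where the two decisions differ.
    conflicts = sorted(cid for cid in human if human[cid] != ai[cid])
    agreement = 100.0 * (len(human) - len(conflicts)) / len(human)
    return agreement, conflicts


# Hypothetical decisions for four citations (not study data).
human = {"c1": "include", "c2": "exclude", "c3": "exclude", "c4": "include"}
ai = {"c1": "include", "c2": "exclude", "c3": "include", "c4": "include"}

agreement, conflicts = compare_decisions(human, ai)
print(f"Agreement: {agreement:.2f}%")
print("Send to subject matter expert:", conflicts)
```

In this toy example the two reviewers disagree on one of four citations, so only that citation would go to adjudication.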
RESULTS: A total of 6,309 citations were screened across the burden domains of PBC using the AI-human hybrid framework. In this review process, the overall average agreement between AI and human reviewers was 94.97% (median: 95.24%), ranging from 89.01% in humanistic burden review to 99.59% in economic evaluation review. High concordance was also observed in other domains: 97.1% in health utility values, 97.9% in healthcare resource utilization and related costs, 95.2% in treatment patterns, 94.1% in treatment guidelines, and 91.8% in epidemiology. NICE’s Evidence Assessment Group evaluated the SLR methodology and results and considered them appropriate.
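As a rough check on the summary figures, the sketch below recomputes the mean and median from the seven domain-level agreement rates reported above. It assumes the overall average is an unweighted mean across domains, which the abstract does not state. Because five of the seven rates are reported to one decimal place only, the recomputed values differ slightly from the reported 94.97% and 95.24%.

```python
from statistics import mean, median

# Domain-level agreement rates (%) as reported in the abstract.
reported = {
    "humanistic outcomes": 89.01,
    "economic evaluation": 99.59,
    "health utility values": 97.1,
    "healthcare resource utilization and costs": 97.9,
    "treatment patterns": 95.2,
    "treatment guidelines": 94.1,
    "epidemiology": 91.8,
}

# Unweighted summary across domains (an assumption; weighting by the
# number of citations per domain would need counts the abstract omits).
avg = mean(reported.values())
med = median(reported.values())
lowest = min(reported, key=reported.get)
highest = max(reported, key=reported.get)

print(f"Mean agreement:   {avg:.2f}%")
print(f"Median agreement: {med:.2f}%")
print(f"Lowest:  {lowest} ({reported[lowest]}%)")
print(f"Highest: {highest} ({reported[highest]}%)")
```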
CONCLUSIONS: This study represents an AI-assisted SLR submitted to an HTA body (NICE, UK). The average agreement between human- and AI-screened citations approached the ideal level of 95%, well above the approximately 80% inter-reviewer agreement typically observed in conventional dual-human SLRs. This high agreement rate supports the feasibility and rigor of integrating AI into HTA workflows, offering a scalable, timely, and cost-effective model for future submissions.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR21
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Rare & Orphan Diseases, Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)