Performance Evaluation of AI-Assisted Systematic Literature Review for Studies on Burden of HPV-Associated Head and Neck Cancers

Author(s)

Dong Wang, Ph.D.¹, Surabhi Datta, Ph.D.², Yi-Ling Huang, Ph.D.¹, Kyeryoung Lee, Ph.D.², Chris Liston, PharmD, MBA², Yi Zheng, Ph.D.¹, Jun Zhang, MSPH, MD³, Majid Rastergar-mojarad, Ph.D.², Nicole Cossrow, MPH, PhD¹.
¹Merck & Co., Inc., Rahway, NJ, USA, ²IMO Health, Rosemont, IL, USA, ³MSD R&D (China) Co., Ltd., Beijing, China.

OBJECTIVES: Recent developments in artificial intelligence (AI) can efficiently address the growing need for systematic literature reviews (SLRs) in health research. We have developed a large language model (LLM)-based platform to assist SLR development, named Intelligent Systematic LiterAture Review (ISLAR). The objective was to evaluate ISLAR's performance in three tasks: 1) abstract screening, 2) full-text screening, and 3) data element extraction for studies on the burden of HPV-associated head and neck squamous cell cancers.
METHODS: Domain experts created a review protocol for studies on the burden of HPV-associated head and neck cancers. After identifying related literature, a gold standard corpus was developed by randomly selecting and annotating 103 articles. First, each abstract was annotated as relevant or irrelevant according to the protocol. Second, the experts screened the articles with relevant abstracts using the information in the full text. Third, a predefined set of data elements, such as sample size, cancer type, and HPV genotype, was annotated in the relevant articles. The ISLAR platform was then used to replicate these tasks for the same articles. Accuracy, sensitivity, and F1 scores were calculated for abstract screening, full-text screening, and data extraction using human review as the gold standard.
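The three metrics named in the methods follow standard definitions. The sketch below is a minimal illustration, assuming binary relevant/irrelevant labels with the human annotation as the gold standard. The function names and the toy labels are hypothetical; they are not taken from ISLAR or from the study data, and the abstract does not state how the metrics were aggregated for data extraction.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating True as 'relevant'."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    return tp, fp, fn, tn


def screening_metrics(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, sensitivity (recall), and F1 against the gold standard."""
    tp, fp, fn, tn = confusion_counts(gold, pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity: share of truly relevant items that the system kept.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and sensitivity,
    # which simplifies to 2*tp / (2*tp + fp + fn).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "f1": f1}


# Hypothetical toy labels (not study data): True = relevant.
toy_gold = [True, True, True, False, False, False, True, False]
toy_pred = [True, True, False, True, False, False, True, True]

if __name__ == "__main__":
    for name, value in screening_metrics(toy_gold, toy_pred).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions, a screener tuned toward sensitivity accepts more false positives, which tends to lower accuracy and F1 while keeping missed relevant articles rare.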
RESULTS: In abstract screening, ISLAR achieved an accuracy of 73.8%, a sensitivity of 90.1%, and an F1 score of 77.3%. In full-text screening, ISLAR achieved an accuracy of 85.7%, a sensitivity of 91.7%, and an F1 score of 91.7%. In data element extraction, ISLAR achieved an accuracy of 80.8%, a sensitivity of 94.6%, and an F1 score of 86.6%.
CONCLUSIONS: ISLAR performed well in full-text screening and data element extraction, with F1 scores of 91.7% and 86.6%, respectively. In the abstract screening step, we prioritized high sensitivity to help ensure that no relevant articles were excluded because of the limited detail available in abstracts. The results demonstrate substantial potential for ISLAR's role in AI-assisted SLR development.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR165

Topic

Epidemiology & Public Health, Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Infectious Disease (non-vaccine)

Presentation (CTI)