Evaluating the Performance of a Supervised Artificial Intelligence Model in Screening Abstracts for Oncology-Focused Targeted Literature Reviews

Author(s)

Ayushman Ghosh, PhD1, Tushar Pyne, PhD1, Aniket Das, PhD1, Dipen Patel, PhD2, Jishna Das, MPharm3, Ashmita Chatterjee, MS1, Mansi Pawar, PharmD4, Manikanta Dasari, MPharm5, Varun Ektare, MPH4, Murtuza Bharmal, MS, PhD6.
1Indence Health, Kolkata, India, 2AstraZeneca, Gaithersburg, MD, USA, 3Indence Health, Bengaluru, India, 4Indence Health, Mumbai, India, 5Indence Health, Chandigarh, India, 6AstraZeneca, Boston, MA, USA.
OBJECTIVES: To evaluate the performance of a supervised artificial intelligence (AI) model in screening abstracts for targeted literature reviews (TLRs) across 28 oncology indications and assess its accuracy using decision match rate, recall, precision, and F-score.
METHODS: The study used the Nested Knowledge platform with a supervised machine learning model to screen abstracts from PubMed and Embase. Each TLR was based on unique PICOS (Population, Intervention, Comparator, Outcome, Study design) criteria, and searches yielded 587-3,653 hits per review. The AI model was trained on 10% of abstracts labeled with human decisions; additional inclusions were added to the training set when fewer than 10 inclusions were present. The AI then screened the remaining 90% of abstracts. Performance was evaluated using four metrics: decision match rate (proportion of AI decisions matching human decisions), recall (proportion of human inclusions correctly identified by AI), precision (proportion of AI inclusions also included by humans), and F-score (harmonic mean of recall and precision).
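The four metrics above can be computed directly from paired human and AI include/exclude decisions. The sketch below is illustrative only (it is not the authors' Nested Knowledge pipeline); decisions are represented as booleans, with True meaning "include".

```python
# Illustrative sketch: abstract-screening metrics from paired decisions.
# human[i] / ai[i]: True = include, False = exclude, for abstract i.

def screening_metrics(human, ai):
    """Return (decision match rate, recall, precision, F-score)."""
    assert len(human) == len(ai) and len(human) > 0
    n = len(human)
    matches = sum(h == a for h, a in zip(human, ai))  # agreements, either way
    tp = sum(h and a for h, a in zip(human, ai))      # both chose "include"
    human_inc = sum(human)                            # human inclusions
    ai_inc = sum(ai)                                  # AI inclusions
    recall = tp / human_inc if human_inc else 0.0
    precision = tp / ai_inc if ai_inc else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return matches / n, recall, precision, f_score


# Toy example: 8 abstracts, humans include 3, AI includes 3, 2 overlap.
human = [True, True, True, False, False, False, False, False]
ai    = [True, True, False, True, False, False, False, False]
match_rate, recall, precision, f_score = screening_metrics(human, ai)
# match_rate = 0.75; recall, precision, and F-score all equal 2/3 here.
```

Note that when precision and recall are equal, the F-score (their harmonic mean) equals that common value, as in the toy example.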
RESULTS: Decision match rates ranged from 52% to 87%, recall ranged from 17% to 98%, and precision ranged from 8% to 59%. F-scores ranged from 11% to 73%, reflecting wide variability in performance across reviews. Simpler PICOS criteria achieved higher accuracy, while more complex PICOS criteria required increased human intervention, reducing time savings. A limitation of the supervised learning approach was that the AI did not provide reasons for exclusions, making discrepancies harder for reviewers to resolve.
CONCLUSIONS: The diversity of the 28 TLRs, with varied PICOS criteria, abstract counts, and numbers of inclusions, provided unique insights into factors affecting AI performance. The AI model demonstrated moderate accuracy in screening abstracts, with performance influenced by PICOS complexity. While useful for scaling TLRs, limitations in transparency and reliance on human training highlight areas for improvement. Future research should explore unsupervised machine learning models and compare multiple large language models to enhance accuracy and scalability.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR95

Topic

Health Technology Assessment, Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Oncology
