Training Artificial Intelligence for Literature Reviews: Can a Classifier Match a Human Reviewer?
Speaker(s)
Metcalf T1, Dodd O1, Peatman J1, Kiessling J1, O'Donovan P2, Heron L3
1Adelphi Values PROVE™, Bollington, UK, 2Adelphi Values PROVE™, Limerick, Ireland, 3Adelphi Values PROVE™, Bollington, UK
INTRODUCTION: Artificial intelligence (AI) is increasingly recognised as a tool to improve screening efficiency during literature reviews. AI classifiers offer an alternative to per-review AI screening by performing binary classification of publications in response to a set question, independent of review type; because they are not restricted to a single use setting, they can be applied across multiple reviews, with greater potential for accuracy from a larger, more specific training set. While an abundance of literature exists on AI screening, evidence evaluating trained AI classifiers and their comparability with a human reviewer is lacking.
OBJECTIVES: To demonstrate the comparability of four independent AI classifiers with human reviewer decisions in a real-world data set.
METHODS: Four classifiers were independently trained using DistillerSR to categorise abstracts based on elderly populations, paediatric populations, case reports, and randomised controlled trials (RCTs). Each classifier was trained on ≥1,000 abstracts until either an F1 score of ≥0.8 was achieved or 2,000 abstracts had been screened. Each classifier was then applied in a ‘test project’ using 2,245 abstracts from a systematic literature review. Classifier responses were compared with those of a human reviewer, with matched responses reported as a decision match percentage.
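For illustration, a minimal Python sketch of how the F1 stopping criterion and the decision match percentage can be computed from a classifier-versus-reviewer confusion matrix; the counts used here are hypothetical, not the study's data:

# Hypothetical confusion-matrix counts for one classifier against the human
# reviewer on a 2,245-abstract test project:
# tp = both include, fp = classifier includes / reviewer excludes,
# fn = classifier excludes / reviewer includes, tn = both exclude.
tp, fp, fn, tn = 190, 90, 2, 1963

precision = tp / (tp + fp)
recall = tp / (tp + fn)  # recall is the same quantity as sensitivity
f1 = 2 * precision * recall / (precision + recall)

# Decision match: the proportion of abstracts on which the classifier's
# decision agrees with the human reviewer's.
decision_match = (tp + tn) / (tp + fp + fn + tn)

print(f"F1 = {f1:.3f}")  # training continued until F1 >= 0.8 was reached
print(f"Decision match = {decision_match:.1%}")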
RESULTS: Across the four classifiers, a decision match with the human reviewer of >94% was achieved: elderly (99.6%), RCT (98.2%), case report (95.1%), and paediatric (94.5%). The classifiers displayed greater sensitivity than specificity, i.e. a tendency to be ‘over-inclusive’, ensuring that no relevant references were excluded.
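Using the same hypothetical counts as above, the reported trade-off can be made concrete: an over-inclusive classifier misses few reviewer-included references (high sensitivity) at the cost of retaining some irrelevant ones (lower specificity):

tp, fp, fn, tn = 190, 90, 2, 1963  # hypothetical counts, as above

# Sensitivity: share of reviewer-included abstracts the classifier also included.
sensitivity = tp / (tp + fn)
# Specificity: share of reviewer-excluded abstracts the classifier also excluded.
specificity = tn / (tn + fp)

print(f"Sensitivity = {sensitivity:.1%}")  # ~99%: almost no relevant reference missed
print(f"Specificity = {specificity:.1%}")  # lower: some irrelevant abstracts kept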
CONCLUSIONS: With a collective decision match with human reviewers of >94%, all four AI classifiers surpassed the 91.7% decision match rate reported in previous literature. This improved consistency with human reviewers suggests that the approach taken to train the AI classifiers was effective and that these classifiers are appropriate to support abstract screening in literature reviews.
Code
MSR222
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas