Training Artificial Intelligence for Literature Reviews: Can a Classifier Match a Human Reviewer?
Speaker(s)
Metcalf T1, Dodd O1, Peatman J1, Kiessling J1, O'Donovan P2, Heron L3
1Adelphi Values PROVE™, Bollington, UK, 2Adelphi Values PROVE™, Limerick, Ireland, 3Adelphi Values PROVE™, Bollington, UK
INTRODUCTION: Artificial intelligence (AI) is increasingly recognised as a tool to improve screening efficiency during literature reviews. AI classifiers offer an alternative to per-review AI screening by performing binary classification of publications in response to a set question, independent of review type; because they are not restricted to a single use setting, they can be applied across multiple reviews, with greater potential for accuracy from a larger, more specific training set. While an abundance of literature exists on AI screening, evidence evaluating trained AI classifiers and their comparability with a human reviewer is lacking.
OBJECTIVES: To demonstrate the comparability of four independent AI classifiers with human reviewer decisions in a real-world data set.
METHODS: Four classifiers were independently trained using DistillerSR to categorise abstracts based on elderly populations, paediatric populations, case reports, and randomised controlled trials (RCTs). Each classifier was trained on ≥1,000 abstracts until either an F1 score of ≥0.8 was achieved or 2,000 abstracts had been screened. Each classifier was then applied in a ‘test project’ using 2,245 abstracts from a systematic literature review. Classifier responses were compared with those of a human reviewer, with matched responses reported as a decision match percentage.
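For illustration, a minimal Python sketch of how the F1 stopping criterion and the decision match percentage can be computed from a classifier-versus-reviewer confusion matrix; the counts used here are hypothetical, not the study's data:

# Hypothetical confusion-matrix counts for one classifier against the human
# reviewer on a 2,245-abstract test project:
# tp = both include, fp = classifier includes / reviewer excludes,
# fn = classifier excludes / reviewer includes, tn = both exclude.
tp, fp, fn, tn = 190, 90, 2, 1963

precision = tp / (tp + fp)
recall = tp / (tp + fn)  # recall is the same quantity as sensitivity
f1 = 2 * precision * recall / (precision + recall)

# Decision match: the proportion of abstracts on which the classifier's
# decision agrees with the human reviewer's.
decision_match = (tp + tn) / (tp + fp + fn + tn)

print(f"F1 = {f1:.3f}")  # training continued until F1 >= 0.8 was reached
print(f"Decision match = {decision_match:.1%}")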
RESULTS: Across the four classifiers, a decision match with the human reviewer of >94% was achieved: elderly (99.6%), RCT (98.2%), case report (95.1%), and paediatric (94.5%). The classifiers displayed greater sensitivity than specificity, i.e. a tendency to be ‘over-inclusive’, ensuring that no relevant references were excluded.
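Using the same hypothetical counts as above, the reported trade-off can be made concrete: an over-inclusive classifier misses few reviewer-included references (high sensitivity) at the cost of retaining some irrelevant ones (lower specificity):

tp, fp, fn, tn = 190, 90, 2, 1963  # hypothetical counts, as above

# Sensitivity: share of reviewer-included abstracts the classifier also included.
sensitivity = tp / (tp + fn)
# Specificity: share of reviewer-excluded abstracts the classifier also excluded.
specificity = tn / (tn + fp)

print(f"Sensitivity = {sensitivity:.1%}")  # ~99%: almost no relevant reference missed
print(f"Specificity = {specificity:.1%}")  # lower: some irrelevant abstracts kept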
CONCLUSIONS: With a collective decision match with human reviewers of >94%, all four AI classifiers surpassed the 91.7% decision match rate reported in previous literature. This improved consistency with human reviewers suggests that the approach taken to train the AI classifiers was effective and that these classifiers are appropriate to support abstract screening in literature reviews.
Code
MSR222
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas