Inter-Reviewer Reliability of Literature Screening and Data Extraction for Human and Machine-Assisted Systematic Reviews
Author(s)
Hanegraaf P1, Dagne AW2, Mosselman JJ3, de Jong R3, Abogunrin S4, Queiros L5, Lane M6, Postma M7, Boersma C8, van der Schans J9
1Pitts, Zeist, Netherlands, 2Health-Ecore, Zeist, Utrecht, Netherlands, 3Pitts, Zeist, UT, Netherlands, 4Evidera, Basel, BS, Switzerland, 5F. Hoffmann-La Roche, Basel, Switzerland, 6F. Hoffmann-La Roche, Basel, BS, Switzerland, 7Health-Ecore, Zeist, UT, Netherlands, 8University of Groningen, Department of Health Sciences, UMCG; Open University, Heerlen, Department of Management Sciences and Health-Ecore Ltd, Zeist, The Netherlands, Groningen, Netherlands, 9Health-Ecore, Groningen, GR, Netherlands
OBJECTIVES: Machine learning can be used for either fully automated or assisted screening and eligibility assessment, as well as to support data extraction, and has shown promising potential in recent years.
METHODS: We performed a review of systematic literature reviews (SLRs) of randomized controlled trials. Data on inter-reviewer reliability (IRR), measured as Cohen’s kappa, were extracted for abstract/title screening, full-text screening, and data extraction. In the second part of the study, we surveyed authors of SLRs on the IRR they expect in machine learning assisted and human-performed SLRs.
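As an illustration of the IRR measure used here, the sketch below computes Cohen’s kappa for two reviewers from the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name and the include/exclude decisions are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: share of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n**2

    if expected == 1.0:
        raise ValueError("Kappa is undefined when both raters use a single identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical title/abstract screening decisions for ten records (illustrative only).
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "exclude", "include", "exclude", "include", "exclude"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 2))  # 0.52
```

In this made-up example the reviewers agree on 8 of 10 records (p_o = 0.80), but because both exclude most records the chance agreement is already p_e = 0.58, so kappa is considerably lower than the raw agreement rate.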
RESULTS: Most studies in our SLR did not report on the IRR. In total, 45 eligible articles were included. The average reported Cohen’s kappa was 0.82 (SD = 0.11, n = 12) for abstract screening, 0.77 (SD = 0.18, n = 14) for full-text screening, 0.86 (SD = 0.07, n = 15) for the whole screening process, and 0.88 (SD = 0.08, n = 16) for data extraction. The survey (n = 37) showed overlapping expected Cohen’s kappa values, ranging from approximately 0.6 to 0.9, for human-performed and machine learning assisted SLRs. In general, authors expected a higher IRR on average for machine learning assisted SLRs than for human-performed SLRs in both screening and data extraction.
CONCLUSIONS: Human-performed SLRs likely show moderate agreement between reviewers, while authors expect machine learning assisted SLRs to perform better. At a minimum, strong agreement between reviewers in machine learning assisted SLRs is recommended to ensure overall acceptance of machine learning in SLRs.
Conference/Value in Health Info
Value in Health, Volume 26, Issue 11, S2 (December 2023)
Code
MSR27
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas