Artificial Intelligence (AI)-Based Screening: Exploration of Differences in Two Health Technology Assessment (HTA)-Compliant Systematic Literature Reviews (SLRs)
Author(s)
Cichewicz A1, Slim M2, Deshpande S3
1Evidera, Waltham, MA, USA, 2Evidera, Montreal, QC, Canada, 3Evidera, London, LON, UK
OBJECTIVES: Recent AI advances have led to the development of several web-based tools to expedite abstract screening in SLRs. We previously established a relationship between training set volume, performance, and time savings; however, variations in performance across topics and models may have implications for the minimum standards to be considered by HTA bodies. We assessed the performance of AI models in predicting title/abstract screening decisions using two SLRs previously submitted to HTA bodies.
METHODS: DistillerAI and Robot Screener were employed to update two clinical SLRs in psoriasis (PsO) and endometrial cancer (EC). Initial decisions (PsO, n=4000; EC, n=3319 records) by human reviewers were used to train each model. Records identified following the SLR update (PsO, n=1123; EC, n=568) were screened by both models and compared to human decisions. We calculated recall and the inter-rater reliability (IRR) between AI and human reviewers when applying prediction thresholds of ≤0.2 for excludes and >0.8 for includes.
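The threshold-based screening and metric calculations described above can be sketched as follows. This is an illustrative reconstruction, not the tools' actual implementation: the function names and example data are hypothetical, and IRR is assumed here to mean simple percent agreement between AI and human decisions on the records the model classified (i.e., those falling inside the ≤0.2/>0.8 thresholds).

```python
def screen_with_thresholds(scores, low=0.2, high=0.8):
    """Map AI prediction scores to decisions.

    Scores <= low become 'exclude', scores > high become 'include';
    anything in between falls outside the thresholds (None) and is
    left for human screening.
    """
    decisions = []
    for s in scores:
        if s <= low:
            decisions.append("exclude")
        elif s > high:
            decisions.append("include")
        else:
            decisions.append(None)
    return decisions


def recall_and_agreement(ai_decisions, human_decisions):
    """Compute recall and percent agreement against human decisions.

    Recall: human-included records the AI also included, divided by all
    human includes. Agreement: matching decisions divided by the number
    of records the AI actually screened (decision is not None).
    """
    true_includes = sum(1 for h in human_decisions if h == "include")
    found = sum(1 for a, h in zip(ai_decisions, human_decisions)
                if a == "include" and h == "include")
    screened = [(a, h) for a, h in zip(ai_decisions, human_decisions)
                if a is not None]
    agree = sum(1 for a, h in screened if a == h)
    recall = found / true_includes if true_includes else float("nan")
    agreement = agree / len(screened) if screened else float("nan")
    return recall, agreement


# Illustrative example (made-up scores and human decisions)
scores = [0.05, 0.95, 0.5, 0.1, 0.9, 0.7]
human = ["exclude", "include", "include", "exclude", "include", "exclude"]
ai = screen_with_thresholds(scores)
recall, agreement = recall_and_agreement(ai, human)
```

Records with mid-range scores (0.5 and 0.7 above) fall outside the thresholds, which mirrors how the proportion of records screened by each model can differ even on the same record set.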
RESULTS: The models differed in their eligibility predictions between SLRs, with more records falling outside the preset thresholds for DistillerAI, resulting in a smaller proportion of records screened than with Robot Screener (PsO: 63% vs. 84%; EC: 80% vs. 92%, respectively). The IRR and recall were higher for PsO than for EC. For the PsO SLR, the IRR/recall for DistillerAI and Robot Screener were 96.9%/0.98 and 93.1%/0.85, respectively. For the EC SLR, the IRR/recall were 96.2%/0.5 and 95.8%/0.27, respectively.
CONCLUSIONS: Despite the high agreement rates between AI and human screeners, this study demonstrates that AI models’ performance may vary when employed to update different clinical efficacy/safety SLRs. Specifically, lower recall rates decrease confidence in AI to identify all relevant studies for inclusion. Future uptake by HTA bodies should consider the complexity of study selection criteria and set minimally acceptable model performance metrics.
Conference/Value in Health Info
Value in Health, Volume 26, Issue 11, S2 (December 2023)
Code
MSR163
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas