Artificial Intelligence (AI)-Based Screening: Exploration of Differences in Two Health Technology Assessment (HTA)-Compliant Systematic Literature Reviews (SLRs)

Speaker(s)

Cichewicz A1, Slim M2, Deshpande S3
1Evidera, Waltham, MA, USA, 2Evidera, Montreal, QC, Canada, 3Evidera, London, LON, UK

OBJECTIVES: Recent AI advances have led to the development of several web-based tools to expedite abstract screening in SLRs. We previously established a relationship between training set volume, performance, and time savings; however, variation in performance across topics and models may have implications for the minimum standards to be considered by health technology assessment (HTA) bodies. We assessed the performance of AI models in predicting title/abstract screening decisions using two SLRs previously submitted to HTA bodies.

METHODS: DistillerAI and Robot Screener were employed to update two clinical SLRs in psoriasis (PsO) and endometrial cancer (EC). Initial decisions (PsO, n=4000; EC, n=3319 records) by human reviewers were used to train each model. Records identified following the SLR update (PsO, n=1123; EC, n=568) were screened by both models and compared to human decisions. We calculated recall and the inter-rater reliability (IRR) between AI and human reviewers, applying prediction thresholds of ≤0.2 for excludes and >0.8 for includes.
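The threshold logic and metrics described above can be sketched as follows. This is an illustrative sketch with hypothetical data, not the tools' actual implementation: the function names, the use of simple percent agreement as the IRR measure, and the example scores are all assumptions for demonstration.

```python
def apply_thresholds(score, low=0.2, high=0.8):
    """Map a model's inclusion probability to a decision.
    Scores <= low are auto-excluded, scores > high auto-included;
    scores in between fall outside the preset thresholds and are
    left for human screening (returned as None)."""
    if score <= low:
        return "exclude"
    if score > high:
        return "include"
    return None  # not auto-screened


def recall(ai, human):
    """Proportion of human 'include' decisions that the model also
    marked as includes, among records the model predicted on."""
    pairs = [(a, h) for a, h in zip(ai, human) if a is not None]
    human_includes = [a for a, h in pairs if h == "include"]
    return sum(a == "include" for a in human_includes) / len(human_includes)


def agreement(ai, human):
    """Percent agreement between AI and human decisions on the
    auto-screened records (one simple IRR measure; assumed here)."""
    pairs = [(a, h) for a, h in zip(ai, human) if a is not None]
    return sum(a == h for a, h in pairs) / len(pairs)


# Hypothetical example: model scores and human decisions for 6 records.
scores = [0.05, 0.95, 0.50, 0.90, 0.10, 0.85]
human = ["exclude", "include", "include", "include", "exclude", "exclude"]
ai = [apply_thresholds(s) for s in scores]

# The record scoring 0.50 falls between the thresholds, so the model
# auto-screens 5 of 6 records; metrics are computed on those 5.
screened = sum(a is not None for a in ai) / len(ai)
```

In this toy example, the middle record is deferred to humans, mirroring how a model that leaves more records between the thresholds auto-screens a smaller proportion of the update.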

RESULTS: The models' eligibility predictions differed between SLRs: more records fell outside the preset thresholds with DistillerAI than with Robot Screener, resulting in a smaller proportion of records auto-screened (PsO: 63% vs. 84%; EC: 80% vs. 92%, respectively). IRR and recall were higher for PsO than for EC. For the PsO SLR, IRR/recall were 96.9%/0.98 for DistillerAI and 93.1%/0.85 for Robot Screener. For the EC SLR, IRR/recall were 96.2%/0.50 and 95.8%/0.27, respectively.

CONCLUSIONS: Despite high agreement rates between AI and human screeners, this study demonstrates that AI model performance may vary when employed to update different clinical efficacy/safety SLRs. In particular, low recall rates decrease confidence in AI's ability to identify all relevant studies for inclusion. Future uptake by HTA bodies should account for the complexity of study selection criteria and set minimally acceptable model performance metrics.

Code

MSR163

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas