Can Artificial Intelligence (AI) Replace a Human Reviewer in Systematic Literature Review (SLR)? Validation of the LiveSTART™ Tool
Author(s)
Liu J1, Jafar R2, Girard LA3, Thorlund K4, Forsythe A5
1Cytel Inc., Toronto, ON, Canada, 2Cytel Inc., Vancouver, BC, Canada, 3Cytel Inc., Montreal, QC, Canada, 4McMaster University, Hamilton, ON, Canada, 5Cytel Inc., Waltham, MA, USA
OBJECTIVES: SLRs are labor-intensive and time-consuming, yet they are required for regulatory submissions and health technology assessments (HTA). The new PRISMA guidelines (Page et al. 2020) allow the use of automated tools in screening. We developed the LiveSTART™ AI tool, which uses transfer learning to perform the title and abstract (TiAb) review stage of the SLR process.
METHODS: LiveSTART™ uses deep learning (a 12-layer neural network) to identify text relevant to population, intervention/comparator, outcome, and study design (PICOS), and then hierarchically predicts publication acceptance based on the given inclusion/exclusion criteria. LiveSTART™ comprises 4 functions: 1) de-duplicating records by grouping abstracts with the same or similar content; 2) providing a probability of inclusion for each PICOS criterion; 3) predicting the inclusion of each publication by comparing its abstract to the inclusion/exclusion criteria; and 4) predicting the reason for rejection based on PICOS according to a pre-specified hierarchy. LiveSTART™ was trained on 59 SLR datasets containing 65,328 publications, all of which were manually annotated by two independent reviewers, with discrepancies resolved by a third, senior reviewer.
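The hierarchical rejection step described above (functions 2 to 4) can be illustrated with a minimal sketch. This is not the LiveSTART™ implementation: the order of the hierarchy, the 0.5 threshold, and the names `PICOS_HIERARCHY`, `rejection_reason`, and `predict_inclusion` are assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, Optional, Sequence

# Hypothetical hierarchy order and threshold; the abstract specifies neither.
PICOS_HIERARCHY = ("population", "intervention_comparator", "outcome", "study_design")
DEFAULT_THRESHOLD = 0.5


def rejection_reason(
    probs: Dict[str, float],
    hierarchy: Sequence[str] = PICOS_HIERARCHY,
    threshold: float = DEFAULT_THRESHOLD,
) -> Optional[str]:
    """Return the first criterion, in hierarchy order, whose inclusion
    probability falls below the threshold, or None if every criterion passes."""
    for criterion in hierarchy:
        if probs[criterion] < threshold:
            return criterion
    return None


def predict_inclusion(
    probs: Dict[str, float],
    hierarchy: Sequence[str] = PICOS_HIERARCHY,
    threshold: float = DEFAULT_THRESHOLD,
) -> bool:
    """A publication is included only if no criterion triggers a rejection."""
    return rejection_reason(probs, hierarchy, threshold) is None


# Example with made-up per-criterion probabilities for one abstract.
example = {
    "population": 0.93,
    "intervention_comparator": 0.41,
    "outcome": 0.22,
    "study_design": 0.88,
}
print(rejection_reason(example))   # first failing criterion in hierarchy order
print(predict_inclusion(example))
```

Because only the first failing criterion in the hierarchy is reported, each exclusion carries a single, traceable reason, and a change in SLR scope can in principle be handled by re-evaluating the stored per-criterion probabilities against the revised criteria.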
RESULTS: The 59 datasets covered 17 oncology and 6 non-oncology indications and comprised 47 clinical, 6 economic, and 6 health-related quality-of-life SLRs. In validation, LiveSTART™ achieved accuracy = 0.92, precision = 0.91, recall = 0.86, F1-score = 0.89, and AUC = 0.91 relative to the decisions of two independent reviewers and a third verifier. LiveSTART™ reviews 1,000 publications in ≈12.5 minutes and requires no dataset preparation beyond that needed for manual review. Hierarchical rejection by PICOS criteria provides traceability and allows flexibility when the SLR scope changes.
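As a quick arithmetic check on the figures above, the sketch below recomputes the F1-score as the harmonic mean of precision and recall and converts the reported throughput into seconds per publication. The function names are illustrative only. Recomputing from the rounded precision and recall gives roughly 0.88, which is consistent with the reported 0.89 if the unrounded inputs were slightly higher.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def seconds_per_publication(n_publications: int, minutes: float) -> float:
    """Average screening time per publication, in seconds."""
    return minutes * 60 / n_publications


# Values as reported in the abstract (rounded to two decimals there).
f1 = f1_from_precision_recall(0.91, 0.86)
per_pub = seconds_per_publication(1000, 12.5)

print(f"F1 from rounded inputs: {f1:.3f}")
print(f"Seconds per publication: {per_pub:.2f}")
```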
CONCLUSIONS: By combining a unique algorithm, rigorous training on broad datasets, and highly reliable and transparent output, LiveSTART™ AI paired with a single reviewer could potentially yield accuracy comparable to that of dual manual review, with significant time savings. However, adoption by regulatory and HTA authorities will be required.
Conference/Value in Health Info
Value in Health, Volume 25, Issue 12S (December 2022)
Code
MSR74
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas