Evaluating the Performance of Claude 3.5 Sonnet in Data Extraction Automation for Systematic Literature Reviews (SLRs)
Author(s)
Cuthbert Chow, MDS, Ellen Kasireddy, MHSc, Mir-Masoud Pourrahmat, MSc, Jean-Paul Collet, MD, PhD, Mir Sohail Fazeli, MD, PhD
Evidinno Outcomes Research Inc, Vancouver, BC, Canada
OBJECTIVES: To evaluate the performance of Claude 3.5 Sonnet in automating data extraction for SLRs.
METHODS: A custom model was developed using the large language model Claude 3.5 Sonnet, employing a multi-stage processing approach for data extraction. The model's performance was tested on its ability to extract data from 14 studies across two SLRs in dermatology and oncology; we intend to extend testing to 50 studies across various disease areas. Performance was benchmarked against extractions conducted independently by two senior human reviewers and reconciled between them. False positives were defined as incorrectly extracted data points; false negatives represented missed data points. Performance metrics included accuracy (total correct predictions across all classes), precision (proportion of extracted data points that were true positives), sensitivity (proportion of reference data points correctly captured), and F1 score (harmonic mean of precision and sensitivity).
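For clarity, the Python sketch below (not part of the abstract) shows how these metrics can be computed from counts of true positives, false positives, and false negatives. Treating accuracy as correct extractions over all compared data points (i.e., with no true-negative class) is an assumption inferred from the reported figures, not a definition stated in the abstract, and the illustrative counts are hypothetical.

from dataclasses import dataclass

@dataclass
class ExtractionCounts:
    tp: int  # correctly extracted data points
    fp: int  # incorrect data points (false positives)
    fn: int  # missed data points (false negatives)

def metrics(c: ExtractionCounts) -> dict:
    # Precision: proportion of extracted data points that are correct.
    precision = c.tp / (c.tp + c.fp)
    # Sensitivity (recall): proportion of reference data points captured.
    sensitivity = c.tp / (c.tp + c.fn)
    # Accuracy here assumes no true-negative class, so it reduces to
    # correct extractions over all compared data points (an assumption).
    accuracy = c.tp / (c.tp + c.fp + c.fn)
    # F1: harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "f1": f1}

# Illustrative counts only; they roughly mirror the aggregate proportions
# in the RESULTS section but are not the study's actual data.
print(metrics(ExtractionCounts(tp=762, fp=99, fn=140)))

Note that metrics computed from pooled counts in this way will not exactly reproduce the study-level averages reported below, since averaging across studies and pooling counts are different aggregations.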
RESULTS: The model demonstrated an average accuracy of 76.2%, average precision of 89.2%, and average sensitivity of 85.1%, with an F1 score of 86.5%, reflecting strong alignment with human extraction. Compared with human reviewers, 76.2% of extracted data points were true positives, 9.9% were false positives, and 14.0% were false negatives. The model performed best in extracting study design characteristics (accuracy: 92.2%; precision: 98.1%; sensitivity: 93.9%; F1: 96.0%) and baseline participant characteristics (accuracy: 90.7%; precision: 96.9%; sensitivity: 93.3%; F1: 95.0%). Performance for intervention characteristics was strong but affected by a higher proportion of missed data points (accuracy: 83.7%; precision: 94.3%; sensitivity: 88.2%; F1: 91.2%). Outcome data exhibited the lowest performance, driven by a higher rate of false negatives (accuracy: 71.6%; precision: 85.0%; sensitivity: 83.7%; F1: 85.6%).
CONCLUSIONS: The proposed workflow for automated data extraction shows promising performance, particularly in extracting study design and baseline participant characteristics, indicating its potential to complement human reviewers. Lower performance in extracting outcomes, driven by false negatives, underscores the need for targeted improvements through modified prompts and output schema.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
P21
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas