Calibrating Large Language Model Probabilities to Achieve Target Recall in Systematic Review Screening

Author(s)

Seye Abogunrin, MPH, MSc, MD1, Roberto Rey Sieiro, BSc, MSc2, Marie Lane, BSc3.
1Global Access Evidence Leader, F. Hoffmann-La Roche Ltd, Basel, Switzerland; 2Roche Farma, S.A., Madrid, Spain; 3F. Hoffmann-La Roche Ltd, Basel, Switzerland.
OBJECTIVES: Recent literature shows that generative artificial intelligence can accelerate evidence synthesis for evidence-based medicine. Deploying large language models (LLMs) robustly in systematic literature reviews (SLRs) requires predictable performance and control over key metrics, notably recall. This study evaluates a workflow that transforms GPT-4o, a general-purpose LLM, into a tunable classifier for title and abstract (TIAB) screening capable of meeting user-specified recall targets. Raw LLM outputs are often unreliable indicators of correctness and exhibit significant overconfidence; our workflow transforms the underlying probability distributions into reliable, tunable scores, which are essential for recall-sensitive tasks in SLRs.
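To make the starting point concrete, the sketch below shows one way to obtain token-level log probabilities for a binary Include/Exclude decision from GPT-4o via the OpenAI Python SDK. The prompt wording, decoding parameters, and helper name are illustrative assumptions; the abstract does not specify the exact prompting setup.

    # Hypothetical sketch: obtaining token-level log probabilities for an
    # Include/Exclude screening decision from GPT-4o (OpenAI Python SDK v1).
    # Prompt wording and parameters are illustrative assumptions.
    import math
    from openai import OpenAI

    client = OpenAI()

    def raw_include_probability(title_abstract: str) -> float:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You screen records for a systematic review. "
                            "Answer with exactly one word: Include or Exclude."},
                {"role": "user", "content": title_abstract},
            ],
            max_tokens=1,
            logprobs=True,      # return log probabilities for sampled tokens
            top_logprobs=5,     # and the top alternatives at each position
        )
        # Collect log probabilities of the two decision tokens at the first
        # (only) output position; an alternative that never surfaces keeps
        # probability ~0 via -inf.
        logps = {"Include": -math.inf, "Exclude": -math.inf}
        for alt in response.choices[0].logprobs.content[0].top_logprobs:
            if alt.token in logps:
                logps[alt.token] = alt.logprob
        # Renormalise over the two classes to obtain a raw include-probability.
        p_inc = math.exp(logps["Include"])
        p_exc = math.exp(logps["Exclude"])
        if p_inc + p_exc == 0.0:
            return 0.5  # neither token surfaced; fall back to an uninformative score
        return p_inc / (p_inc + p_exc)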
METHODS: A multi-stage methodology was employed. First, class probabilities for "Include" and "Exclude" decisions were derived from the LLM's token-level log probabilities. Retrospective results from a previously completed TIAB screening exercise were partitioned into training (20%), validation (20%), and test (60%) sets. Raw probabilities from the training set were used to fit an isotonic regression calibration model. The validation set was then used to select a decision threshold for the desired recall target. Finally, the calibrated model and selected threshold were evaluated on the withheld test set.
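A minimal sketch of the calibration, threshold-selection, and evaluation steps follows, assuming NumPy arrays of raw scores and 0/1 relevance labels for each split and scikit-learn's IsotonicRegression; the array names, the recall target, and the selection rule are illustrative assumptions rather than the authors' published code.

    # Minimal sketch of calibration, threshold selection, and evaluation.
    # Array names, the recall target, and the selection rule are
    # illustrative assumptions, not the authors' published code.
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def fit_calibrator(raw_train: np.ndarray, y_train: np.ndarray) -> IsotonicRegression:
        """Fit an isotonic (monotone, non-parametric) map from raw scores
        to calibrated probabilities on the training split."""
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(raw_train, y_train)
        return iso

    def select_threshold(cal_val: np.ndarray, y_val: np.ndarray,
                         target_recall: float = 0.83) -> float:
        """Return the highest validation threshold whose recall meets the
        target recall."""
        n_pos = max(int((y_val == 1).sum()), 1)
        for t in np.sort(np.unique(cal_val))[::-1]:
            recall = int(((cal_val >= t) & (y_val == 1)).sum()) / n_pos
            if recall >= target_recall:
                return float(t)
        return 0.0  # fall back to including everything

    def evaluate(cal_test: np.ndarray, y_test: np.ndarray, threshold: float):
        """Recall and precision of the thresholded classifier on the test split."""
        pred = cal_test >= threshold
        tp = int((pred & (y_test == 1)).sum())
        recall = tp / max(int((y_test == 1).sum()), 1)
        precision = tp / max(int(pred.sum()), 1)
        return recall, precision

    # Usage (raw_* scores from the step above, y_* as 0/1 labels):
    # iso = fit_calibrator(raw_train, y_train)
    # t = select_threshold(iso.predict(raw_val), y_val, target_recall=0.83)
    # print(evaluate(iso.predict(raw_test), y_test, t))

Sweeping candidate thresholds from high to low and stopping at the first that meets the recall target returns the highest such threshold, which favours precision among all thresholds achieving the target; lowering the threshold admits more records and can only raise recall at precision's expense.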
RESULTS: The calibration workflow produced reliable probability scores. A decision threshold targeting high recall was selected on the basis of validation-set performance. Applying the calibrated model and threshold to the test set achieved 83% recall while maintaining 56% precision, demonstrating that the model can be tuned for recall-sensitive tasks, a trade-off often acceptable during TIAB screening.
CONCLUSIONS: Raw LLM outputs, insufficient on their own for reliable classification, can be transformed via a robust calibration workflow. The method provides a "control knob" that enables practitioners to deploy LLM classifiers meeting specific performance requirements. This approach could be highly effective for automating SLRs in evidence-based medicine, where maximizing recall is often critical to avoid missing relevant studies, enhancing both the integrity and the speed of the review process.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

SA18

Topic

Health Technology Assessment, Methodological & Statistical Research, Study Approaches

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
