ROBERTA on the Job: A Proof-of-Concept Study of a Customized GPT Tool for Risk-of-Bias Assessment of Randomized Clinical Trials

Author(s)

Eric Manalastas, MSc, PhD¹, Juliette C Thompson, BSc², Aditi Hombali, MPT², David A Scott, MSc².
¹Systematic Reviewer, Visible Analytics Ltd, Sheffield, United Kingdom, ²Visible Analytics Ltd, Oxford, United Kingdom.

OBJECTIVES: Risk-of-bias (ROB) assessment is an essential element of systematic reviews. However, instruments such as the Cochrane revised tool for randomised trials (RoB 2) are challenging to use, time-consuming (mean duration per assessment: 28 minutes), and marked by low interrater reliability. Customised GPT-based tools could assist with ROB assessments and improve efficiency and consistency. Our goal was to assess the feasibility of using a simple GPT-based tool for ROB assessment.
METHODS: Using a proof-of-concept design, we developed and tested ROBERTA, a customised GPT-based tool designed to perform ROB assessments aligned with Cochrane standards. The tool was built on OpenAI’s GPT architecture and customised using official guidance documents. We evaluated the tool’s performance on three criteria: speed (time to produce a ROB assessment), accuracy (agreement with a gold-standard set of ROB ratings made by trained human assessors), and test-retest reliability (consistency of the tool’s own ROB assessments across time points).
RESULTS: Evaluation for speed indicated extremely rapid performance. In tests using 20 clinical trial publications, average time per ROB assessment was less than half a minute (mean: 25.2 seconds, SD: 4.7). Evaluation for accuracy showed variation by level of risk: for domains rated by human assessors as ‘low risk’ or ‘some concerns’, 100% agreement was observed. However, considerable disagreement was observed in domains rated as ‘high risk’, with ROBERTA ratings being more lenient than those of human assessors. Evaluation for test-retest reliability indicated acceptable performance over one week (Cohen’s kappa = 1.00; perfect consistency) and over two weeks (Cohen’s kappa = 0.54; moderate consistency).
CONCLUSIONS: Preliminary evaluations suggest that a tool such as ROBERTA can support rapid ROB assessments with a fair degree of accuracy and acceptable test-retest reliability, complementing human review. Further testing is needed to enhance the utility of GPT-based tools and to realise the potential of AI to facilitate systematic reviews.
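
ILLUSTRATION: The abstract does not describe how the agreement statistics were computed. The sketch below shows one conventional way to obtain the two quantities reported in RESULTS: Cohen’s kappa between two sets of ratings (for example, two runs of the tool on the same domains) and percentage agreement stratified by the human reference rating. The function names and the example ratings are hypothetical and are not study data.

```python
from collections import Counter, defaultdict


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of items on which the two ratings match.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rating set's marginal frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both sets use a single identical category; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_by_reference_level(reference, tool):
    """Proportion of exact matches, grouped by the reference (human) rating."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ref, pred in zip(reference, tool):
        totals[ref] += 1
        hits[ref] += int(ref == pred)
    return {level: hits[level] / totals[level] for level in totals}


# Hypothetical ratings for six domains, for illustration only.
run_week_0 = ["low", "some concerns", "high", "low", "some concerns", "low"]
run_week_1 = list(run_week_0)  # identical repeat run
run_week_2 = ["low", "low", "high", "low", "some concerns", "some concerns"]

kappa_one_week = cohen_kappa(run_week_0, run_week_1)
kappa_two_weeks = cohen_kappa(run_week_0, run_week_2)

# Hypothetical human reference ratings versus tool ratings.
human = ["low", "low", "some concerns", "high", "high"]
tool = ["low", "low", "some concerns", "some concerns", "high"]
per_level = agreement_by_reference_level(human, tool)
```

Stratifying agreement by the human rating, rather than reporting a single pooled figure, is what would expose a pattern like the one described in RESULTS, where agreement is complete for lower-risk ratings but drops for ‘high risk’ domains.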

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR183

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

Presentation (CTI)