AI-ASSISTED RISK OF BIAS ASSESSMENT IN OBSERVATIONAL STUDIES: ISOLATING JUDGMENT USING A PASSAGE-BASED VALIDATION APPROACH

Author(s)

Eitan Agai, BA, MSc¹, Karen A. Robinson, MSc, PhD², Alon Agai³.
¹Founder & CEO, PICO Portal, Inc., St Petersburg, FL, USA, ²Johns Hopkins University, Baltimore, MD, USA, ³Head of Partnerships, PICO Portal, Inc., Brooklyn, NY, USA.

Presentation Documents

ISPOR26_Agai_MSR29_POSTER_FINAL.pdf

OBJECTIVES: To evaluate whether a large language model (LLM) can reproduce expert risk of bias (RoB) judgments when the evidence is held constant, by isolating and validating the assessment step rather than evidence identification.
METHODS: We conducted a passage-based validation study using 128 observational studies from a systematic review of per- and polyfluoroalkyl substances (PFAS) and health outcomes. Two experienced reviewers assessed RoB across nine Navigation Guide domains and identified the text passages (evidence) supporting each judgment. These human-selected passages, together with the corresponding Navigation Guide domain questions and guidance, were provided to an LLM. The model was not tasked with locating evidence or reviewing full-text articles. For each passage-domain pair, the LLM generated a structured RoB rating on a five-level scale. Model outputs were compared with human-adjudicated ratings using exact agreement, acceptable agreement, percent agreement, and weighted Cohen’s kappa, overall and by domain.
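As an illustration of the comparison step only, the agreement statistics named above could be computed along the following lines. This is a minimal sketch, not the authors' analysis code. Several details are assumptions that the abstract does not state: the five rating levels are coded as integers 0 to 4, "acceptable agreement" is taken to mean ratings within one level of each other, and the kappa uses linear weights. The function names and the example ratings are hypothetical.

```python
from collections import Counter

LEVELS = 5  # five-level RoB scale, coded 0..4 (this coding is an assumption)


def exact_agreement(human, model):
    """Share of passage-domain pairs where both ratings are identical."""
    return sum(h == m for h, m in zip(human, model)) / len(human)


def acceptable_agreement(human, model, tolerance=1):
    """Share of pairs whose ratings differ by at most `tolerance` levels.

    The one-level tolerance is an assumed definition, not taken from the abstract.
    """
    return sum(abs(h - m) <= tolerance for h, m in zip(human, model)) / len(human)


def weighted_kappa(human, model, levels=LEVELS):
    """Cohen's kappa with linear disagreement weights (weighting is an assumption)."""
    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) ratings
    human_marginal = Counter(human)
    model_marginal = Counter(model)
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(levels):
        for j in range(levels):
            weight = abs(i - j) / (levels - 1)  # 0 on the diagonal, 1 at the extremes
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += (
                weight * (human_marginal[i] / n) * (model_marginal[j] / n)
            )
    if expected_disagreement == 0:
        return float("nan")  # undefined when both raters use one identical level
    return 1.0 - observed_disagreement / expected_disagreement


if __name__ == "__main__":
    # Made-up ratings for five passage-domain pairs, for illustration only.
    human_ratings = [0, 1, 2, 3, 4]
    model_ratings = [0, 2, 2, 3, 3]
    print("exact:", exact_agreement(human_ratings, model_ratings))
    print("acceptable:", acceptable_agreement(human_ratings, model_ratings))
    print("weighted kappa:", weighted_kappa(human_ratings, model_ratings))
```

Running the same functions on the ratings for a single domain would give the by-domain figures.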
RESULTS: Agreement between LLM-generated and human-adjudicated RoB ratings varied by domain. Concordance was highest in domains relying on explicit reporting, with agreement ranging from 95.1% to 100%, and lower in domains requiring contextual judgment, with agreement ranging from 76.3% to 89%. Most discrepancies reflected partial or acceptable agreement rather than opposing assessments, indicating partial alignment in judgment when evidence was shared.
CONCLUSIONS: Separating evidence identification from assessment enables targeted evaluation of AI judgment quality and provides a clearer foundation for responsible integration of AI into evidence synthesis. When restricted to human-curated evidence passages within observational studies, LLMs demonstrated reasoning and assessment similar to that of a human RoB assessor.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR29

Topic

Methodological & Statistical Research

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

Presentation (CTI)