AI-ASSISTED RISK OF BIAS ASSESSMENT IN OBSERVATIONAL STUDIES: ISOLATING JUDGMENT USING A PASSAGE-BASED VALIDATION APPROACH
Author(s)
Eitan Agai, BA, MSc1, Karen A. Robinson, MSc, PhD2;
1PICO Portal, Inc., Founder & CEO, St Petersburg, FL, USA, 2Johns Hopkins University, Baltimore, MD, USA
1PICO Portal, Inc., Founder & CEO, St Petersburg, FL, USA, 2Johns Hopkins University, Baltimore, MD, USA
OBJECTIVES: To evaluate whether a large language model (LLM) can reproduce expert risk of bias (RoB) judgments when the evidence is held constant, by isolating and validating the assessment step rather than evidence identification.
METHODS: We conducted a passage-based validation study using 128 observational studies from a systematic review of per- and polyfluoroalkyl substances (PFAS) and health outcomes. Two experienced reviewers assessed RoB across nine Navigation Guide domains and identified the text passages (evidence) supporting each judgment. These human-selected passages, together with the corresponding Navigation Guide domain questions and guidance, were provided to an LLM. The model was not tasked with locating evidence or reviewing full-text articles. For each passage -domain pair, the LLM generated a structured RoB rating on a five-level scale. Model outputs were compared with human-adjudicated ratings using exact agreement, acceptable agreement, percent agreement, and weighted Cohen’s kappa, overall and by domain.
RESULTS: Agreement between LLM-generated and human-adjudicated RoB ratings varied by domain. Concordance was highest in domains relying on explicit reporting, with agreement ranging from 95.1% to 100%, and lower in domains requiring contextual judgement, with agreement ranging from 76.3 to 89%. Most discrepancies reflected partial or acceptable agreement rather than opposing assessments, indicating partial alignment in judgment when evidence was shared.
CONCLUSIONS: Separating evidence identification from assessment enables targeted evaluation of AI judgment quality and provides a clearer foundation for responsible integration of AI into evidence synthesis. When restricted to human-curated evidence passages within observational studies, LLMs demonstrated reasoning and assessment similar to that of a human RoB assessor.
METHODS: We conducted a passage-based validation study using 128 observational studies from a systematic review of per- and polyfluoroalkyl substances (PFAS) and health outcomes. Two experienced reviewers assessed RoB across nine Navigation Guide domains and identified the text passages (evidence) supporting each judgment. These human-selected passages, together with the corresponding Navigation Guide domain questions and guidance, were provided to an LLM. The model was not tasked with locating evidence or reviewing full-text articles. For each passage -domain pair, the LLM generated a structured RoB rating on a five-level scale. Model outputs were compared with human-adjudicated ratings using exact agreement, acceptable agreement, percent agreement, and weighted Cohen’s kappa, overall and by domain.
RESULTS: Agreement between LLM-generated and human-adjudicated RoB ratings varied by domain. Concordance was highest in domains relying on explicit reporting, with agreement ranging from 95.1% to 100%, and lower in domains requiring contextual judgement, with agreement ranging from 76.3 to 89%. Most discrepancies reflected partial or acceptable agreement rather than opposing assessments, indicating partial alignment in judgment when evidence was shared.
CONCLUSIONS: Separating evidence identification from assessment enables targeted evaluation of AI judgment quality and provides a clearer foundation for responsible integration of AI into evidence synthesis. When restricted to human-curated evidence passages within observational studies, LLMs demonstrated reasoning and assessment similar to that of a human RoB assessor.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR29
Topic
Methodological & Statistical Research
Disease
No Additional Disease & Conditions/Specialized Treatment Areas