NEW FAILURE MODES, OLD STANDARDS: WHY ARTIFICIAL INTELLIGENCE (AI) DEMANDS DIFFERENT VALIDATION APPROACHES IN EVIDENCE SYNTHESIS
Author(s)
Priccila Zuchinali, PhD1, Kassandra Schaible, MPH2, Allie Cichewicz, MSc3
1Thermo Fisher Scientific, Ottawa, ON, Canada, 2Thermo Fisher Scientific, Pittsburgh, PA, USA, 3Independent Consultant, Boston, MA, USA
OBJECTIVES: AI has been rapidly integrated into literature reviews across screening, data extraction, risk-of-bias assessment, and narrative synthesis tasks. Current research focuses on validating AI capabilities against a human benchmark, without considering how AI errors differ from human errors. With increasing AI adoption and proficiency, integrating AI into human workflows is frequently discussed, yet validation processes still rely on standards designed for human reviewers.
METHODS: This work highlights why such standards are inadequate and outlines AI-specific validation considerations for evidence synthesis workflows.
RESULTS: AI errors (failures) often stem from biased training data or algorithmic constraints and tend to be systematic, widespread, and less noticeable, whereas human errors more commonly reflect contextual misinterpretation and flawed judgment and occur more inconsistently across tasks. These differences affect both data validation strategies and perceptions of AI reliability. Depending on the model used, AI may incorrectly include or exclude studies with borderline eligibility during screening, extractions may incorporate fabricated study details, and risk-of-bias assessments may be overly generalized. When synthesizing evidence, AI prioritizes smooth, confident language, which can mask uncertainty or differences between studies. Understanding these failure modes and how they differ from traditional human error is key to successful AI integration. Effective validation of AI-assisted reviews therefore requires task-specific safeguards: frequent checkpoints within AI systems for ongoing model validation, feedback loops designed for each process or task, monitoring of patterns in model decision-making for consistency, and credibility checks on AI-generated outputs.
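To make one of these safeguards concrete, the sketch below illustrates what "monitoring patterns in model decision-making for consistency" might look like in practice during screening. It is not from the abstract itself: the classify_abstract function, the five repeated runs, and the 0.8 agreement threshold are all illustrative assumptions, not a published protocol.

```python
# Minimal illustrative sketch (not part of the original abstract): a
# consistency checkpoint for AI-assisted screening. Re-runs the model on
# each record and flags unstable decisions, such as borderline-eligibility
# cases, for human review.
from collections import Counter

def consistency_checkpoint(abstracts, classify_abstract, n_runs=5, threshold=0.8):
    """Flag records where repeated model runs disagree.

    abstracts: dict mapping record_id -> abstract text
    classify_abstract: hypothetical callable returning "include" or "exclude"
    """
    flagged = []
    for record_id, text in abstracts.items():
        decisions = [classify_abstract(text) for _ in range(n_runs)]
        top_label, top_count = Counter(decisions).most_common(1)[0]
        agreement = top_count / n_runs
        if agreement < threshold:
            # Unstable decision pattern: route to a human reviewer rather
            # than accepting the model's majority vote.
            flagged.append((record_id, top_label, agreement))
    return flagged
```

In this sketch, low agreement across repeated runs serves as a proxy for borderline eligibility; the same checkpoint pattern could be adapted for extraction or risk-of-bias tasks by swapping in the relevant model call and decision labels.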
CONCLUSIONS: Effective use of AI in evidence synthesis requires more than stronger AI systems; it demands a clear understanding of AI's limitations, treating it as a high-throughput instrument with unique failure modes. The future of human-AI collaboration depends on developing task-specific checks, controls, and careful prompt design to catch errors and ensure results are reliable.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR84
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas