NEW FAILURE MODES, OLD STANDARDS: WHY ARTIFICIAL INTELLIGENCE (AI) DEMANDS DIFFERENT VALIDATION APPROACHES IN EVIDENCE SYNTHESIS
Author(s)
Priccila Zuchinali, PhD1, Kassandra Schaible, MPH2, Allie Cichewicz, MSc3
1Thermo Fisher Scientific, Ottawa, ON, Canada, 2Thermo Fisher Scientific, Pittsburgh, PA, USA, 3Independent Consultant, Boston, MA, USA
OBJECTIVES: AI has been rapidly integrated into literature reviews across screening, data extraction, risk-of-bias assessment, and narrative synthesis tasks. Current research focuses on validating AI capabilities against a human benchmark, without considering how AI errors differ from human errors. With increasing AI adoption and proficiency, integrating AI into human workflows is frequently discussed, yet validation processes still rely on standards designed for human reviewers.
METHODS: This work highlights why such standards are inadequate and outlines AI-specific validation considerations for evidence synthesis workflows.
RESULTS: AI errors (failures) often stem from biased training data or algorithmic constraints and tend to be systematic, widespread, and less noticeable, whereas human errors more commonly reflect contextual misinterpretation and flawed judgment and occur more inconsistently across tasks. These differences affect both data validation strategies and perceptions of AI reliability. Depending on the model used, AI may incorrectly include or exclude studies with borderline eligibility during screening, extractions may incorporate fabricated study details, and risk-of-bias assessments may be overly generalized. When synthesizing evidence, AI prioritizes smooth, confident language, which can mask uncertainty or differences between studies. Understanding these failure modes and how they differ from traditional human error is key to successful AI integration. Effective validation of AI-assisted reviews therefore requires task-specific safeguards: frequent checkpoints within AI systems for ongoing model validation, feedback loops designed for each process or task, monitoring of patterns in model decision-making for consistency, and credibility checks on AI-generated outputs.
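To make one of these safeguards concrete, the sketch below illustrates what "monitoring patterns in model decision-making for consistency" might look like in practice during screening. It is not from the abstract itself: the classify_abstract function, the five repeated runs, and the 0.8 agreement threshold are all illustrative assumptions, not a published protocol.

```python
# Minimal illustrative sketch (not part of the original abstract): a
# consistency checkpoint for AI-assisted screening. Re-runs the model on
# each record and flags unstable decisions, such as borderline-eligibility
# cases, for human review.
from collections import Counter

def consistency_checkpoint(abstracts, classify_abstract, n_runs=5, threshold=0.8):
    """Flag records where repeated model runs disagree.

    abstracts: dict mapping record_id -> abstract text
    classify_abstract: hypothetical callable returning "include" or "exclude"
    """
    flagged = []
    for record_id, text in abstracts.items():
        decisions = [classify_abstract(text) for _ in range(n_runs)]
        top_label, top_count = Counter(decisions).most_common(1)[0]
        agreement = top_count / n_runs
        if agreement < threshold:
            # Unstable decision pattern: route to a human reviewer rather
            # than accepting the model's majority vote.
            flagged.append((record_id, top_label, agreement))
    return flagged
```

In this sketch, low agreement across repeated runs serves as a proxy for borderline eligibility; the same checkpoint pattern could be adapted for extraction or risk-of-bias tasks by swapping in the relevant model call and decision labels.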
CONCLUSIONS: Effective use of AI in evidence synthesis requires more than stronger AI systems; it demands a clear understanding of AI's limitations, treating it as a high-throughput instrument with unique failure modes. The future of human-AI collaboration depends on developing task-specific checks, controls, and careful prompt design to catch errors and ensure results are reliable.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR84
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas