The Quantification of Efficacy and Accuracy for Screening and Data Extraction in Perinatal Mood and Anxiety Disorders Studies: A Comparative Analysis of Traditional vs. GenAI-Powered Approaches
Author(s)
Kyeryoung Lee, PhD1, Hunki Paek, PhD1, Emma McNeill, MS2, Brian Christman, MS2, Sabrina Alam, MPH2, Lizheng Shi, PhD3, Xiaoyan Wang, PhD1
1IMO Health, Rosemont, IL, USA, 2University of Mississippi Medical Center, Jackson, MS, USA, 3Tulane University, New Orleans, LA, USA
OBJECTIVES: Article screening and data extraction are the most time-consuming and error-prone steps in manual systematic literature reviews (SLRs). Artificial intelligence (AI)-enabled automation can significantly reduce time, improve accuracy, and increase comprehensiveness beyond what resource constraints otherwise allow. We quantified the efficacy and accuracy of automation versus manual curation, using perinatal mood and anxiety disorders (PMAD) as a use case.
METHODS: We developed gold standards for 26 manually curated articles and compared them with outputs from an automated SLR system. Screening criteria included studies on screening, prevention, or treatment of PMAD in peri- or postpartum (≤1 year) females that reported economic evaluation outcomes (e.g., cost, ICER, QALY). Data extraction focused on study details, model parameters, baseline characteristics, and evaluation outcomes. We measured time reduction, error correction through automation, and system performance.
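As a minimal sketch of how extraction performance could be scored against a gold standard (the category names, element encoding, and exact-match rule below are illustrative assumptions, not the system's actual implementation):

# Sketch: scoring AI-extracted data elements against a manually curated
# gold standard, per data category. All names and values are hypothetical.

def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Precision, recall, and F1 for one set of extracted elements."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical per-article extractions, keyed by data category.
gold = {
    "study_details": {"country=USA", "design=Markov model"},
    "model_parameters": {"discount_rate=3%", "time_horizon=1y"},
}
predicted = {
    "study_details": {"country=USA", "design=Markov model"},
    "model_parameters": {"discount_rate=3%"},
}

for category in gold:
    p, r, f1 = precision_recall_f1(predicted.get(category, set()), gold[category])
    print(f"{category}: P={p:.3f} R={r:.3f} F1={f1:.3f}")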
RESULTS: AI screening took <1 min/article versus ~300 min/article for manual screening, achieving 96.2% abstract-level and 100% full-text accuracy. The AI system predicted 25 of 26 gold-standard articles as “Relevant”. The one article it excluded used only the term “mother” in the abstract, which human reviewers, but not the AI system, interpreted as “postpartum”. When the 26 full-text articles were re-evaluated, the system predicted this article as “Relevant” because the term “postpartum” appeared in the full text. Data abstraction with AI took 10-30 min/article and captured 100-200 data elements (in some cases over 400), versus ~300 min/article manually capturing 20-30 elements. The AI achieved an overall F1-score of 0.993, broken down by category as study details (F1=0.993), model parameters (F1=0.998), evaluation outcomes (F1=0.998), and baseline characteristics (F1=0.984). Notably, the AI corrected 14 human errors across 6 articles and added 36 missed data elements across 8 articles.
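For reference, the F1-scores above are the standard harmonic mean of precision (P) and recall (R) over extracted data elements, where TP, FP, and FN denote true positives, false positives, and false negatives:

F_1 = \frac{2PR}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}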
CONCLUSIONS: Our AI system significantly reduced time and errors while enhancing accuracy and comprehensiveness in screening and in data extraction from full texts and tables. This approach holds promise for advancing SLRs and health economics and outcomes research.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR49
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Mental Health (including addiction)