The Quantification of Efficacy and Accuracy for Screening and Data Extraction in Perinatal Mood and Anxiety Disorders Studies: A Comparative Analysis of Traditional vs. GenAI-Powered Approaches
Author(s)
Kyeryoung Lee, PhD1, Hunki Paek, PhD1, Emma McNeill, MS2, Brian Christman, MS2, Sabrina Alam, MPH2, Lizheng Shi, PhD3, Xiaoyan Wang, PhD1
1IMO Health, Rosemont, IL, USA, 2University of Mississippi Medical Center, Jackson, MS, USA, 3Tulane University, New Orleans, LA, USA
OBJECTIVES: Article screening and data extraction are the most time-consuming and error-prone steps in manual systematic literature reviews (SLRs). Artificial intelligence (AI)-enabled automation can significantly reduce time, improve accuracy, and increase comprehensiveness beyond what resource constraints otherwise allow. We quantified the efficacy and accuracy of automation versus manual curation, using perinatal mood and anxiety disorders (PMAD) as a use case.
METHODS: We developed gold standards for 26 manually curated articles and compared them with outputs from an automated SLR system. Screening criteria included studies on screening, prevention, or treatment of PMAD in peri- or postpartum (≤1 year) females that reported economic evaluation outcomes (e.g., cost, ICER, QALY). Data extraction focused on study details, model parameters, baseline characteristics, and evaluation outcomes. We measured time reduction, error correction through automation, and system performance.
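As a minimal sketch of how extraction performance could be scored against a gold standard (the category names, element encoding, and exact-match rule below are illustrative assumptions, not the system's actual implementation):

# Sketch: scoring AI-extracted data elements against a manually curated
# gold standard, per data category. All names and values are hypothetical.

def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Precision, recall, and F1 for one set of extracted elements."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical per-article extractions, keyed by data category.
gold = {
    "study_details": {"country=USA", "design=Markov model"},
    "model_parameters": {"discount_rate=3%", "time_horizon=1y"},
}
predicted = {
    "study_details": {"country=USA", "design=Markov model"},
    "model_parameters": {"discount_rate=3%"},
}

for category in gold:
    p, r, f1 = precision_recall_f1(predicted.get(category, set()), gold[category])
    print(f"{category}: P={p:.3f} R={r:.3f} F1={f1:.3f}")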
RESULTS: AI screening took <1 min/article versus ~300 min/article for manual screening, achieving 96.2% abstract-level and 100% full-text accuracy. The AI system predicted 25 of 26 gold-standard articles as “Relevant”. The one article it excluded used only the term “mother” in the abstract, which human reviewers, but not the AI system, interpreted as “postpartum”. When the 26 full-text articles were re-evaluated, the system predicted this article as “Relevant” because the term “postpartum” appeared in the full text. Data abstraction with AI took 10-30 min/article and captured 100-200 data elements (in some cases over 400), versus ~300 min/article manually capturing 20-30 elements. The AI achieved an overall F1-score of 0.993, broken down by category as study details (F1=0.993), model parameters (F1=0.998), evaluation outcomes (F1=0.998), and baseline characteristics (F1=0.984). Notably, the AI corrected 14 human errors across 6 articles and added 36 missed data elements across 8 articles.
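For reference, the F1-scores above are the standard harmonic mean of precision (P) and recall (R) over extracted data elements, where TP, FP, and FN denote true positives, false positives, and false negatives:

F_1 = \frac{2PR}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}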
CONCLUSIONS: Our AI system significantly reduced time and errors while enhancing accuracy and comprehensiveness in screening and in data extraction from full texts and tables. This approach holds promise for advancing SLRs and health economics and outcomes research.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR49
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Mental Health (including addiction)