COMPARISON OF AI-ASSISTED AND TRADITIONAL ANALYTIC WORKFLOWS IN HEALTH ECONOMICS AND OUTCOMES RESEARCH USING HEALTH SURVEY DATA

Author(s)

Alysha M. McGovern, MBA1, Joseph Yeb, BS1, Harshini Mashruwala, MS1, Praveen Kumar Potukuchi, PhD1, Amy Bolton, MPH, MS1, Hamid Zarei, PhD2.
1Boston Scientific, Marlborough, MA, USA, 2University of Louisville, Louisville, KY, USA.
OBJECTIVES: Although artificial intelligence (AI) is increasingly used in analytic workflows, direct comparisons with traditional human-only analyses are limited. This study compared the efficiency and accuracy of AI-assisted and traditional analytic workflows using publicly available health survey data.
METHODS: Four analysts independently completed identical descriptive analyses on a 5% sample of the 2024 Behavioral Risk Factor Surveillance System dataset using traditional (Stata 19; StataCorp) and AI-assisted (ChatGPT GPT-5; OpenAI) workflows. A two-sequence crossover design and ≥48-hour washout period were used to minimize learning and order effects. For both workflows, analysts used standard templates and were permitted only limited, predefined modifications to simulate routine applied analyses. AI interactions followed a standardized TRACI (Task-Role-Audience-Create-Intent) prompting framework and a stateless control prompt to prevent personalization. Analysts recorded total task completion time (minutes) for each workflow. Accuracy was assessed by two blinded validators through concordance with pre-generated reference outputs.
RESULTS: Stata-based and AI-assisted workflows had similar mean completion times (47.8 and 50.3 minutes, respectively), with Stata faster for two analysts and ChatGPT faster for the other two. In descriptive table construction (N, %), both workflows correctly implemented all required variables, including collapsed variable creation. Stata outputs showed near-complete concordance with validator results (99.3%). ChatGPT outputs consistently reproduced correct absolute counts but showed lower overall concordance (75.0%) owing to small percentage discrepancies (median absolute difference=0.2%) arising from the non-exclusion of refused, unknown, or missing responses from denominators. Excluding denominator-related discrepancies, ChatGPT concordance was 99.2%. Computational or omission errors were rare (<1.0%) in both workflows. Chi-square tests yielded consistent statistical results across workflows, with most results matching validator outputs (Stata 87.5%, ChatGPT 93.8%); remaining discrepancies were attributable to denominator handling.
CONCLUSIONS: AI-assisted analyses demonstrated completion times and accuracy comparable to those of traditional Stata-based analyses. Most observed discrepancies reflected denominator-handling rules rather than computational errors, underscoring the importance of explicit analytical specifications and independent validation for AI-assisted workflows.
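The denominator-handling discrepancy described in the RESULTS can be sketched in a few lines. This is a minimal illustration with hypothetical counts (not data from the study): absolute counts are identical under both conventions, but the reported percentage differs depending on whether refused/unknown/missing responses are excluded from the denominator, as survey conventions (and Stata's default handling of missing values in tabulations) typically require.

```python
# Hypothetical response counts for a single survey item.
responses = {"Yes": 480, "No": 500, "Refused/Unknown/Missing": 20}

# Convention 1: exclude refused/unknown/missing from the denominator
# (the validator / Stata-style convention described in the abstract).
valid_total = responses["Yes"] + responses["No"]

# Convention 2: include all responses in the denominator
# (the behavior attributed to some ChatGPT outputs).
full_total = sum(responses.values())

pct_excluding = 100 * responses["Yes"] / valid_total
pct_including = 100 * responses["Yes"] / full_total

# Counts match either way; only the percentages diverge.
print(f"Yes count: {responses['Yes']}")
print(f"Yes % (missing excluded): {pct_excluding:.1f}%")
print(f"Yes % (missing included): {pct_including:.1f}%")
```

With these hypothetical counts the two conventions differ by roughly one percentage point, which is the kind of small, systematic discrepancy the abstract attributes to denominator handling rather than computational error.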

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR25

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
