Comparison of Generative AI and Manual Data Programming in a Lupus Health Productivity Loss Study
Author(s)
Tiange Tang, MPH1; Catherine Mak, MSc1; Feng Zeng2
1Biogen, Cambridge, MA, USA, 2Biogen, Value Evidence Strategy Lead, Cambridge, MA, USA
OBJECTIVES: Generative artificial intelligence (AI) is an emerging tool for data programming in real-world evidence research. This study aimed to replicate a human-led evaluation of health productivity losses due to systemic lupus erythematosus in a U.S. commercially insured population using AI-generated code.
METHODS: Data from January 1, 2016, to December 31, 2022, were extracted from the IBM® MarketScan® Commercial & Medicare Claims and Health and Productivity Management (HPM) databases. The AI replication process included four steps: (1) researchers completed all tasks using SQL and R, including coding and visualization of results; (2) the human-written code was divided into tasks, with corresponding prompts created for ChatGPT-4; (3) after the prompts were input to ChatGPT-4, the ChatGPT-generated code was tested against the original human results; (4) if ChatGPT-4 could not generate correct code for a task after 10 prompt attempts, human intervention was introduced to complete it. The outcomes measured were code generation success, replication accuracy, efficiency (number of commands used), and number of revisions.
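As a minimal sketch of the equivalence test in step (3), a check in R might look like the following (illustrative only; the data frames and the enrolid and days_absent columns are hypothetical stand-ins for the study outputs, not the study's actual code):

# Toy stand-ins for the human-written reference output and the
# ChatGPT-4-generated output being compared.
human_result <- data.frame(enrolid = c(1, 2, 3), days_absent = c(4, 0, 12))
ai_result    <- data.frame(enrolid = c(3, 1, 2), days_absent = c(12, 4, 0))

# Align row order on patient ID before comparing values.
human_sorted <- human_result[order(human_result$enrolid), ]
ai_sorted    <- ai_result[order(ai_result$enrolid), ]

# all.equal() returns TRUE or a character vector of differences;
# check.attributes = FALSE ignores the differing row names.
replicated <- isTRUE(all.equal(human_sorted, ai_sorted, check.attributes = FALSE))

if (replicated) {
  message("Replication accurate: AI output matches the human reference.")
} else {
  message("Mismatch: revise the prompt (up to 10 attempts) and retest.")
}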
RESULTS: Seventy-five tasks were generated, and ChatGPT-4 created code for each. Of these, 77.3% were completed without revision, while 18.7% required fewer than 10 prompt revisions to achieve accurate results. The remaining 4% of tasks, such as calculating Charlson Comorbidity Index scores using International Classification of Diseases (ICD)-9/10 coding, needed human intervention.
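For context, a heavily abbreviated base-R sketch of the kind of task that needed human intervention follows; only a few Charlson categories and weights are shown, the claims data are hypothetical, and a full implementation requires the complete ICD-9/ICD-10 category mapping (e.g., Quan et al.):

# Hypothetical claims: one row per patient-diagnosis (ICD-10, no decimals).
claims <- data.frame(
  enrolid = c(1, 1, 2, 3),
  dx      = c("I21", "E112", "I509", "C50")
)

# Abbreviated prefix-to-category map with standard Charlson weights.
cci_map <- data.frame(
  prefix   = c("I21", "I50", "E11", "C50"),
  category = c("mi", "chf", "diabetes", "cancer"),
  weight   = c(1, 1, 1, 2)
)

matched <- merge(claims, cci_map, by = NULL)                  # cross join
matched <- matched[startsWith(matched$dx, matched$prefix), ]  # prefix match
# Count each Charlson category at most once per patient before summing.
matched <- unique(matched[, c("enrolid", "category", "weight")])
cci <- aggregate(weight ~ enrolid, data = matched, FUN = sum)
names(cci)[2] <- "cci_score"  # patients with no mapped diagnosis (score 0) are absent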
CONCLUSIONS: ChatGPT-4 can replicate simple data tasks, such as patient selection, within an acceptable number of prompt iterations. However, human intervention currently remains necessary for more complex coding tasks.
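A minimal sketch of a "simple" patient-selection task of the sort ChatGPT-4 handled well, on toy data with assumed MarketScan-style column names (a real algorithm would typically exclude drug-induced lupus, M32.0, and require more than one qualifying claim):

claims <- data.frame(
  enrolid = c(1, 1, 2, 3),
  svcdate = as.Date(c("2016-03-01", "2019-07-15", "2015-11-20", "2021-02-09")),
  dx      = c("M329", "M3210", "M329", "J45")
)

# Keep enrollees with at least one SLE claim (ICD-10 M32.x) in the window.
in_window <- claims$svcdate >= as.Date("2016-01-01") &
  claims$svcdate <= as.Date("2022-12-31")
sle_patients <- unique(claims$enrolid[startsWith(claims$dx, "M32") & in_window])
sle_patients  # patient 2 falls outside the window; patient 3 has no SLE code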
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, Canada
Value in Health, Volume 28, Issue S1
Code
MSR75
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)