REAL-WORLD DATA LARGE LANGUAGE MODEL ASSISTIVE SQL CODING SYSTEM
Author(s)
Vladimir Turzhitsky, MS, PhD1, Varun Kumar Nomula, MS1, Yezhou Sun, MS1, Tesfagabir Meharizghi, MS2, Henry Wang, MS2, Aude Genevay, PhD2, Shinan Zhang, MS2, Tim Shear, MS2, Andy Mitchell, AA2;
1Merck & Co. Inc, Rahway, NJ, USA, 2Amazon Web Services, Seattle, WA, USA
OBJECTIVES: To develop and evaluate a Large Language Model (LLM)-enabled text-to-SQL assistive programming system that accelerates and standardizes SQL generation for real-world data (RWD) analysis, and to characterize its methodological components and early performance on representative RWD tasks.
METHODS: We implemented a web-based assistant that integrates foundation LLMs with retrieval-augmented generation (RAG). The system embeds database-specific metadata (table structures, variable descriptions, DDL statements, example rows) and retrieves verified few-shot “Golden Examples” based on semantic similarity to the user prompt. Prompts combine user intent, metadata, and examples to produce SQL plus an explanation, which users can review, edit, and execute in the interface. Sessions preserve context for iterative refinement. The Golden Examples are stored in an Amazon OpenSearch Serverless vector database and surfaced via AWS-based retrieval. Access is governed through Merck’s Real-World Data Exchange (RWDEx) with single sign-on and role-based permissions. Preliminary performance was assessed on a 40-question benchmark derived from the DE-SynPUF Medicare claims dataset using Anthropic’s Claude 3.5 Sonnet. Early production deployment includes multiple frequently used commercial claims and EHR datasets, with ongoing collection of initial user case studies.
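The retrieval-and-prompt-assembly flow described above can be sketched as follows. This is a minimal illustrative sketch, not the production system: the function and class names (embed, GoldenExample, retrieve, build_prompt) are assumptions, the toy hash embedding stands in for a real embedding model, and the in-memory similarity search stands in for the OpenSearch Serverless k-NN index.

```python
import math
import zlib
from dataclasses import dataclass

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy bag-of-words hash embedding; a stand-in for a real embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

@dataclass
class GoldenExample:
    """A verified question/SQL pair, analogous to the 'Golden Examples' above."""
    question: str
    sql: str

def retrieve(examples: list[GoldenExample], user_prompt: str, k: int = 2) -> list[GoldenExample]:
    """Return the k examples most semantically similar to the user prompt."""
    q = embed(user_prompt)
    ranked = sorted(examples, key=lambda ex: cosine(q, embed(ex.question)), reverse=True)
    return ranked[:k]

def build_prompt(user_prompt: str, metadata_ddl: str, examples: list[GoldenExample]) -> str:
    """Combine user intent, database metadata, and few-shot examples into one prompt."""
    shots = "\n\n".join(f"-- Q: {ex.question}\n{ex.sql}" for ex in examples)
    return (
        "You are a SQL assistant for real-world data analysis.\n"
        f"Schema:\n{metadata_ddl}\n\n"
        f"Verified examples:\n{shots}\n\n"
        f"User request: {user_prompt}\n"
        "Return SQL and a short explanation."
    )
```

The generated prompt would then be sent to the LLM; the returned SQL and explanation are surfaced in the interface for review, editing, and execution.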
RESULTS: On the DE-SynPUF benchmark, first-attempt SQL generation accuracy was 82.5% and increased to 97.5% within two attempts. Accuracy by difficulty was: easy (N=5) 80% first attempt, 100% within two; medium (N=14) 86% first attempt, 100% within two; hard (N=21) 81% first attempt, 95% within two. Initial user case studies demonstrate feasible integration into RWD workflows for claims and EHR use cases; a crossover study to quantify efficiency gains (e.g., time-to-correct query) is planned.
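The per-difficulty and overall figures above are mutually consistent, which a short arithmetic check makes explicit. Note that the per-difficulty correct counts used here (4/5, 12/14, 17/21 on the first attempt; 5/5, 14/14, 20/21 within two) are inferred by back-calculating from the rounded percentages reported in the abstract, not stated directly.

```python
# Consistency check of the reported benchmark accuracies.
# Correct counts per difficulty are inferred from the rounded percentages.
totals = {"easy": 5, "medium": 14, "hard": 21}
first_attempt = {"easy": 4, "medium": 12, "hard": 17}   # ~80%, ~86%, ~81%
within_two = {"easy": 5, "medium": 14, "hard": 20}      # 100%, 100%, ~95%

n = sum(totals.values())                     # 40 questions
acc_first = sum(first_attempt.values()) / n  # 33/40 = 0.825
acc_two = sum(within_two.values()) / n       # 39/40 = 0.975
print(f"{acc_first:.1%}, {acc_two:.1%}")     # 82.5%, 97.5%
```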
CONCLUSIONS: An LLM-driven, RAG-enhanced text-to-SQL assistant can reliably generate executable SQL for RWD tasks and support iterative query refinement. Early results indicate high accuracy across diverse question types. Future work will expand benchmarking, characterize error modes, compare models, and quantify efficiency and usability in controlled studies.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR148
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas