Validating the Utility of Synthetic Data Generation for Clinical Research

Speaker(s)

Wilson A¹, Krikov S², Crockett D³
¹Parexel International, Waltham, MA, USA, ²Parexel International, Lexington, MA, USA, ³Intermountain Health, Salt Lake City, UT, USA

OBJECTIVES: Clinical research often requires collaboration and data sharing. Collaboration can accelerate research and improve findings, but sharing sensitive data such as patient information raises privacy concerns. These challenges can negate any potential time savings and can even be entirely prohibitive. One solution to the data-sharing problem comes from the emerging field of synthetic data generation (SDG).

The current study explores two promising SDG methods, one open-source and one proprietary, and evaluates them on a specific causal effect estimation task.

METHODS: We established an evaluation framework that assesses synthetic data quality by comparing target causal effect estimates across different estimation methods. Synthetic data was considered successful if it preserved both the effect relationships and the confounding structures necessary for accurate causal inference.
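
As an illustration of how such a comparison could be scored, the sketch below takes an odds ratio with a confidence interval from the original data and from a synthetic dataset and reports three simple agreement measures. The `Estimate` class and the three measures (log odds ratio difference, confidence interval overlap, and direction agreement) are assumptions chosen for illustration; the abstract does not state which comparison metrics the study used.

```python
import math
from dataclasses import dataclass


@dataclass
class Estimate:
    """An odds ratio with its confidence interval (hypothetical container)."""
    odds_ratio: float
    ci_low: float
    ci_high: float


def log_or_difference(original: Estimate, synthetic: Estimate) -> float:
    """Difference in log odds ratios (synthetic minus original); 0 means identical."""
    return math.log(synthetic.odds_ratio) - math.log(original.odds_ratio)


def ci_overlap(original: Estimate, synthetic: Estimate) -> float:
    """Fraction of the original CI, on the log scale, covered by the synthetic CI."""
    low = max(math.log(original.ci_low), math.log(synthetic.ci_low))
    high = min(math.log(original.ci_high), math.log(synthetic.ci_high))
    width = math.log(original.ci_high) - math.log(original.ci_low)
    return max(0.0, high - low) / width


def same_direction(original: Estimate, synthetic: Estimate) -> bool:
    """True when both odds ratios lie on the same side of 1 (the null value)."""
    return (original.odds_ratio - 1.0) * (synthetic.odds_ratio - 1.0) > 0
```

Applying these functions to each estimation method in turn shows whether a synthetic dataset tracks the original only for the unadjusted estimate or also after confounder adjustment, which is the distinction the framework is meant to capture.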

We estimated the target effect of medication exposure on death within 90 days using three approaches: a crude odds ratio, propensity score adjustment, and targeted maximum likelihood estimation (TMLE). We then compared these estimates across the original and synthetic datasets.
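
The simplest of these estimates, the crude odds ratio, can be sketched as follows. This is a minimal illustration, not the study's code: the function name, the `(exposed, died)` record layout, and the Wald confidence interval are assumptions, and the propensity score and TMLE estimates are not shown.

```python
import math


def crude_odds_ratio(records, z=1.96):
    """Unadjusted odds ratio of death for exposed vs. unexposed patients.

    `records` is an iterable of (exposed, died) boolean pairs, a hypothetical
    layout. Returns (odds_ratio, ci_low, ci_high) using a Wald interval on the
    log scale; z=1.96 gives an approximate 95% interval.
    """
    a = b = c = d = 0  # cells of the 2x2 exposure-by-outcome table
    for exposed, died in records:
        if exposed and died:
            a += 1
        elif exposed:
            b += 1
        elif died:
            c += 1
        else:
            d += 1
    if 0 in (a, b, c, d):
        raise ValueError("odds ratio is undefined when a table cell is empty")

    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, ci_low, ci_high
```

Running the same function on the original dataset and on each synthetic dataset yields directly comparable estimates. Because the crude odds ratio ignores confounders, agreement on it alone does not show that confounding structure was preserved, which is why the adjusted estimates are compared as well.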

{Figure 1 will depict study flow}

RESULTS: The results {illustrated in Figure 2} indicate that advanced SDG methods can yield accurate causal estimates and maintain confounding structures in a kidney disease progression case study.

CONCLUSIONS: Synthetic data offers a pragmatic balance between data utility and privacy protection. It also enables broader data accessibility and collaboration while allowing for the inclusion of rare or underrepresented conditions in research, enhancing the scope and depth of studies.

The effectiveness of synthetic data relies heavily on the selected generation method. Each method presents a trade-off between complexity, realism, and computational efficiency, influencing how closely the synthetic data mirrors the original dataset's information and relationships. As such, selecting the appropriate synthetic data generation technique is crucial for achieving accurate and meaningful research outcomes in clinical studies.

Code

RWD39

Topic

Methodological & Statistical Research, Real World Data & Information Systems

Topic Subcategory

Data Protection, Integrity, & Quality Assurance

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, Urinary/Kidney Disorders