WITHDRAWN Systematic Evaluation of Synthetic Panel Data Quality With an Application to Kidney Disease

Author(s)

ABSTRACT WITHDRAWN

OBJECTIVES:

The clinical trial landscape is evolving, and its early incorporation of real-world data has the potential to transform healthcare. These developments bring the challenge of protecting patient privacy while preserving the information content of the data. Synthetic data generation (SDG) is an emerging technology that aims to address this challenge. SDG promises realistic, representative, and shareable data that retain the full learning potential of the original (parent) data. We generated a synthetic cohort from real-world data on patients with focal segmental glomerulosclerosis (FSGS).

METHODS:

We used generative adversarial networks (GANs), a type of unsupervised deep learning algorithm, to generate synthetic FSGS patients and their disease trajectories over time. Using ICD-9/10 codes, we identified 1,289 patients within a large tertiary healthcare system whose records could provide longitudinal values for the synthetic cohort. We generated synthetic patient data from electronic medical record (EMR) data, including laboratory test values, patient-reported health state utility values (HSUVs), and other baseline characteristics. Finally, we assessed the quality of the generated data with statistical tests of trend similarity.
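The abstract does not name the trend-similarity tests that were used. As one hedged illustration, a trend comparison could start by reducing each patient's longitudinal series to a least-squares slope and comparing the slope distributions of the two cohorts. The function names, the toy values, and the choice of a per-patient slope are all assumptions made for this sketch, not the authors' method.

```python
from statistics import mean

def ols_slope(times, values):
    """Least-squares slope of values against times for one patient."""
    t_bar, v_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    return sxy / sxx

def cohort_slopes(cohort):
    """Map each patient's (times, values) series to its slope."""
    return [ols_slope(times, values) for times, values in cohort]

# Hypothetical toy data: (months since baseline, GFR) per patient.
real_cohort = [
    ([0, 6, 12, 18], [60.0, 57.0, 54.0, 51.0]),
    ([0, 6, 12], [45.0, 44.0, 40.0]),
]
synthetic_cohort = [
    ([0, 6, 12, 18], [62.0, 58.0, 55.0, 53.0]),
    ([0, 6, 12], [47.0, 44.0, 42.0]),
]

real_slopes = cohort_slopes(real_cohort)
synthetic_slopes = cohort_slopes(synthetic_cohort)

# Absolute difference between the mean GFR slopes of the two cohorts.
slope_gap = abs(mean(real_slopes) - mean(synthetic_slopes))
```

In practice the two slope lists would feed a two-sample test rather than a simple difference of means, and irregular visit schedules may call for a mixed-effects model instead of per-patient fits.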

RESULTS:

Clinical attributes showed a strong relationship between analyte trajectories and outcomes. Glomerular filtration rate (GFR) decreased over time among patients who died during the observation period. Synthetic data were indistinguishable from the original data both in statistical tests and in machine learning algorithms used to predict disease progression. GFR and albumin time-series predictions were statistically indistinguishable from real patients’ data and equally useful in predicting outcomes.
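To make "statistically indistinguishable" concrete, the sketch below compares a real and a synthetic sample of one variable using a two-sample Kolmogorov-Smirnov statistic with a permutation p-value. This is an assumed stand-in, since the abstract does not state which tests were applied. The simulated GFR values, sample sizes, and function names are hypothetical.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def permutation_p_value(a, b, n_permutations=500, seed=0):
    """Share of label shufflings whose KS statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical GFR samples: one synthetic set drawn from the same
# distribution as the "real" set, one drawn from a shifted distribution.
rng = random.Random(42)
real_gfr = [rng.gauss(55.0, 12.0) for _ in range(100)]
matched_synthetic = [rng.gauss(55.0, 12.0) for _ in range(100)]
shifted_synthetic = [rng.gauss(70.0, 12.0) for _ in range(100)]

p_matched = permutation_p_value(real_gfr, matched_synthetic)
p_shifted = permutation_p_value(real_gfr, shifted_synthetic)
```

A large p-value only means the test failed to detect a difference. It does not establish equivalence, so a quality assessment would normally pair such tests with effect sizes or a formal equivalence margin.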

CONCLUSIONS:

We demonstrated that synthetic cohorts retain the statistical distributions of the parent dataset while reducing the probability of patient identification to zero. Applying statistical tests to evaluate deep learning algorithms provides a novel perspective on synthetic data generation and lays the groundwork for establishing best practices in synthetic data quality assessment.

Conference/Value in Health Info

2022-11, ISPOR Europe 2022, Vienna, Austria

Value in Health, Volume 25, Issue 12S (December 2022)

Code

MSR48

Topic

Methodological & Statistical Research, Real World Data & Information Systems

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Data Protection, Integrity, & Quality Assurance, Reproducibility & Replicability

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
