WITHDRAWN Systematic Evaluation of Synthetic Panel Data Quality With an Application to Kidney Disease
Author(s)
ABSTRACT WITHDRAWN
OBJECTIVES:
The dynamic and evolving clinical trial landscape, with its nascent incorporation of real-world data, has the potential to transform healthcare. These developments bring the challenge of protecting patient privacy while retaining the information content of the data. Synthetic data generation (SDG) is an emerging technology that addresses this challenge. SDG promises realistic, representative, and shareable data that retain all the potential learning of the original (parent) data. We generated a synthetic cohort from real-world data on patients with focal segmental glomerulosclerosis (FSGS).
METHODS:
We used generative adversarial networks (GANs), a type of unsupervised deep learning algorithm, to generate synthetic FSGS patients and their disease trajectories over time. Within a large tertiary healthcare system, we used ICD-9/10 codes to identify 1,289 patients who could provide longitudinal values for the synthetic cohort. We simulated synthetic patient data from electronic medical record (EMR) data, including laboratory test values, patient-reported health state utility values (HSUVs), and other baseline characteristics. Finally, we assessed the quality of the generated data with statistical tests of trend similarity.
RESULTS:
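The abstract does not say which statistical tests were used to assess trend similarity. As one possible reading, the sketch below summarizes each patient's trajectory by a least-squares slope and compares the real and synthetic slope distributions with a two-sample Kolmogorov-Smirnov statistic and a permutation p-value. The choice of test, the helper names (`slope`, `ks_statistic`, `permutation_p_value`, `toy_cohort`), and the toy numbers are all assumptions made for illustration. None of them come from the study.

```python
import random
from bisect import bisect_right


def slope(times, values):
    """Ordinary least-squares slope of values against times for one patient."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Share of random relabelings whose KS statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def toy_cohort(n_patients, mean_slope, seed):
    """Invented GFR-like trajectories (hypothetical numbers, not study data)."""
    rng = random.Random(seed)
    times = [0, 1, 2, 3, 4]
    cohort = []
    for _ in range(n_patients):
        s = rng.gauss(mean_slope, 1.0)
        start = rng.gauss(60.0, 10.0)
        values = [start + s * t + rng.gauss(0.0, 2.0) for t in times]
        cohort.append((times, values))
    return cohort


# Stand-ins for the parent and synthetic cohorts (both simulated here).
real = toy_cohort(200, -3.0, seed=1)
synthetic = toy_cohort(200, -3.0, seed=2)

real_slopes = [slope(t, v) for t, v in real]
synthetic_slopes = [slope(t, v) for t, v in synthetic]

d = ks_statistic(real_slopes, synthetic_slopes)
p = permutation_p_value(real_slopes, synthetic_slopes)
print(f"KS statistic = {d:.3f}, permutation p-value = {p:.3f}")
```

A large p-value in a check like this means only that the test did not detect a difference between the two slope distributions. It does not by itself establish that the distributions are identical.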
Clinical attributes showed a strong relationship between analyte trajectories and outcomes. Glomerular filtration rate (GFR) decreased over time among patients who died during the observation period. Synthetic data were indistinguishable from the original data both in statistical tests and in machine learning algorithms predicting disease progression. GFR and albumin time-series predictions were statistically indistinguishable from real patients’ data and equally useful in predicting outcomes.
CONCLUSIONS:
We demonstrated that synthetic cohorts retain the statistical distributions of the parent dataset while reducing the probability of patient identification to zero. Applying statistical tests to evaluate deep learning algorithms provides a novel perspective on synthetic data generation and lays the groundwork for establishing best practices in synthetic data quality assessment.
Conference/Value in Health Info
2022-11, ISPOR Europe 2022, Vienna, Austria
Value in Health, Volume 25, Issue 12S (December 2022)
Code
MSR48
Topic
Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Data Protection, Integrity, & Quality Assurance, Reproducibility & Replicability
Disease
No Additional Disease & Conditions/Specialized Treatment Areas