Enabling Data Privacy in Health Economic Microsimulation Models by Using Subject-Level Synthetic Data
Author(s)
Ivkovic M, Grand T
Novo Nordisk A/S, Copenhagen, Denmark
Objectives: Health economic microsimulation models, which directly use subject-level data to inform reimbursement decisions, are increasingly replacing the more traditional cohort models that use summary data only. Sharing such microsimulation models with, for example, health technology assessment (HTA) bodies requires careful attention to data privacy and security to protect personal information. This study aimed to explore the potential of an alternative solution that minimizes privacy risk: synthetic subject-level data, which mimics the actual data but holds no sensitive subject information.
Methods: We synthesized key variables from a large phase III trial in people with obesity by fitting classification and regression tree (CART) models to the observed data and using them to generate a synthetic dataset, one variable at a time. Using the publicly available R package synthpop, we sought to identify the benefits and limitations of synthesizing data with the CART machine learning method. Particular attention was given to measures of similarity between observed and synthetic data.
Results: We compared univariate and multivariate properties of the data before and after synthesis. Histograms and summary statistics showed little to no difference between the univariate distributions of synthetic and actual variables and, crucially, tail behaviour was preserved. The multivariate correlation structure of the synthetic data closely resembled that of the original data, to the point that visual representations of the two were nearly indistinguishable. Furthermore, the properties of CART ensure that key subgroups are captured, which also preserved the main subgroup interactions.
Conclusions: In this case study, a synthetic dataset with a very high degree of similarity to the actual subject-level data was constructed using machine learning techniques. This suggests that synthetic data may be a promising alternative in health economic microsimulation models to address data privacy and protection concerns. Future work will include validating the use of subject-level synthetic vs observed data in a health economic microsimulation model.
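The sequential CART synthesis described under Methods can be illustrated with a minimal sketch. This is not the authors' code and not the synthpop implementation (synthpop is an R package); it is a small pure-Python approximation of the general idea, assuming numeric variables only. The first variable is resampled from its observed values, and each later variable is drawn from the observed values in the tree leaf that the synthetic record falls into. All names (`fit_tree`, `draw`, `synthesize`, `toy_data`, `pearson`) and the toy variables `age` and `bmi` are hypothetical, and the data are randomly generated and unrelated to the trial.

```python
import random
import statistics


def sse(values):
    """Sum of squared deviations from the mean (regression impurity)."""
    if not values:
        return 0.0
    mean = statistics.fmean(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(X, y, min_leaf=5):
    """Fit a small regression tree; each leaf keeps its observed y values."""
    best = None  # (impurity after split, feature index, threshold)
    if len(y) >= 2 * min_leaf:
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X})[:-1]:
                left = [yi for row, yi in zip(X, y) if row[j] <= t]
                right = [yi for row, yi in zip(X, y) if row[j] > t]
                if min(len(left), len(right)) < min_leaf:
                    continue
                score = sse(left) + sse(right)
                if best is None or score < best[0]:
                    best = (score, j, t)
    if best is None or best[0] >= sse(y):
        return {"donors": list(y)}  # leaf: pool of observed donor values
    _, j, t = best
    lo = [(row, yi) for row, yi in zip(X, y) if row[j] <= t]
    hi = [(row, yi) for row, yi in zip(X, y) if row[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([r for r, _ in lo], [v for _, v in lo], min_leaf),
        "right": fit_tree([r for r, _ in hi], [v for _, v in hi], min_leaf),
    }


def draw(tree, row, rng):
    """Route one record to its leaf and sample an observed donor value."""
    while "donors" not in tree:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return rng.choice(tree["donors"])


def synthesize(data, n_synth, seed=0, min_leaf=5):
    """Synthesize column k conditional on synthetic columns 0..k-1."""
    rng = random.Random(seed)
    synth = [[] for _ in range(n_synth)]
    for k in range(len(data[0])):
        # With k == 0 there are no predictors, so the tree is a single
        # leaf and the draw is a plain resample of the observed column.
        tree = fit_tree([row[:k] for row in data],
                        [row[k] for row in data], min_leaf)
        for record in synth:
            record.append(draw(tree, record, rng))
    return synth


def toy_data(n=300, seed=1):
    """Hypothetical two-variable dataset (age, bmi); not trial data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 70)
        rows.append([age, 30 + 0.2 * (age - 45) + rng.gauss(0, 3)])
    return rows


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


observed = toy_data()
synthetic = synthesize(observed, n_synth=300)
for k, name in enumerate(["age", "bmi"]):
    print(name,
          "observed mean:", round(statistics.fmean(r[k] for r in observed), 2),
          "synthetic mean:", round(statistics.fmean(r[k] for r in synthetic), 2))
print("observed correlation:", round(pearson(*zip(*observed)), 3))
print("synthetic correlation:", round(pearson(*zip(*synthetic)), 3))
```

The printed means and correlations are a crude stand-in for the similarity measures mentioned in the abstract. Drawing from the observed donor values in a leaf, rather than predicting the leaf mean, is what allows the synthetic marginal distributions to keep their spread and tails in this sketch.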
Conference/Value in Health Info
2021-11, ISPOR Europe 2021, Copenhagen, Denmark
Value in Health, Volume 24, Issue 12, S2 (December 2021)
Acceptance Code
P69
Topic
Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Data Protection, Integrity, & Quality Assurance
Disease
No Specific Disease