A Record-Level Risk Metric for Partially Synthetic Clinical Data Using Inverse-Square Similarity Decay
Author(s)
Ade Adeoye, MSc1, Lucy Mosquera, BSc, MSc2.
1Aetion Inc, Edmonton, AB, Canada, 2Sr Director of Data Science, Aetion, Ottawa, ON, Canada.
OBJECTIVES: Partial synthesis of clinical trial data is an innovative privacy-enhancing technology in which quasi-identifiers (QIs) are synthesized to mitigate re-identification risk while preserving data utility. Responsible sharing of these datasets requires quantitative assessment of re-identification risk. Traditional approaches such as k-anonymity provide group-level guarantees but fail to quantify record-level risk. We propose an interpretable record-level risk metric that leverages pairwise record comparison, incorporates structured QI similarity, and models the decline in attacker confidence as ambiguity increases.
METHODS: Given a real dataset R and its synthetic equivalent S, we compute an N×N matrix of pairwise distances over a set of QIs. For each real record $r_i$, we identify its true synthetic record in S. Ties occur when multiple synthetic records lie at the same distance from $r_i$. The record-level similarity score is defined as $\text{Similarity}_i = (1 - d_{i,\text{scaled}}) \times (1/T_i^2)$, where $d_{i,\text{scaled}}$ is the scaled distance and $T_i \geq 1$ is the number of synthetic records tied at that distance. The factor $1/T_i^2$ reflects the attacker’s decreasing confidence in identifying the correct record as ambiguity increases, modeled as a function of uncertainty that can be interpreted statistically through the variance or entropy of the set of possible synthetic matches. Risk is obtained as $\text{Risk}_i = (1/k_{\text{real},i}) \times \text{Similarity}_i \times P(A)$, where $P(A)$ represents the probability of an attack.
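The computation described in METHODS can be sketched in a few lines of Python. This is a minimal illustration under several assumptions that the abstract does not state: QIs are numeric, distance is Euclidean and scaled by the largest observed distance, row i of S is the true synthetic counterpart of row i of R, and k_real,i is taken to be the number of real records sharing record i's QI values. The function name `record_level_risk` and the toy data are hypothetical.

```python
import numpy as np

def record_level_risk(real, synth, p_attack=1.0, tol=1e-12):
    """Hypothetical sketch of the record-level risk score.

    real, synth: (N, q) numeric QI arrays; row i of synth is ASSUMED to be
    the true synthetic counterpart of row i of real.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)

    # N x N Euclidean distances: entry (i, j) compares real i with synthetic j.
    diff = real[:, None, :] - synth[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # ASSUMPTION: scale to [0, 1] by the largest observed distance.
    max_dist = dist.max()
    scaled = dist / max_dist if max_dist > 0 else dist

    # Scaled distance from each real record to its true synthetic record.
    d_true = np.diag(scaled)

    # T_i: synthetic records at the same distance as the true match (>= 1,
    # because the true match always counts itself).
    ties = (np.abs(scaled - d_true[:, None]) <= tol).sum(axis=1)

    # Similarity_i = (1 - d_i_scaled) * (1 / T_i^2)
    similarity = (1.0 - d_true) / ties ** 2

    # ASSUMPTION: k_real_i = number of real records with identical QI values.
    k_real = (real[:, None, :] == real[None, :, :]).all(axis=2).sum(axis=1)

    # Risk_i = (1 / k_real_i) * Similarity_i * P(A)
    risk = similarity * p_attack / k_real
    return risk, similarity, ties, k_real


# Toy QIs (age, sex code); values are made up for illustration only.
real = [[30, 1], [30, 1], [45, 0], [60, 1]]
synth = [[31, 1], [29, 1], [45, 0], [58, 1]]
risk, similarity, ties, k_real = record_level_risk(real, synth, p_attack=0.5)
print(ties, k_real, risk)
```

In this toy example the first two real records share their QI values and are each equidistant from two synthetic records, so both their equivalence-class size and their tie count are 2, and their risk is reduced accordingly. The third record is unique and reproduced exactly, so it receives the highest score.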
RESULTS: We apply this method to CDISC SDTM-formatted clinical trial datasets. It yields a distribution of record-level risk scores that aligns well with intuitive notions of disclosure risk. Records with unique QI combinations, small tie sizes, and small distances are assigned higher risks. The inverse-square component sharply penalizes cases with larger tie sizes, consistent with conservative disclosure standards.
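To illustrate how sharply the inverse-square term attenuates the score, the following worked arithmetic holds everything else fixed and varies only the tie size. The input values (scaled distance 0.1, k_real,i = 1, P(A) = 1) are hypothetical and are not results from the study.

```latex
\begin{align*}
\text{Risk}_i &= \frac{1}{k_{\text{real},i}} \times (1 - d_{i,\text{scaled}}) \times \frac{1}{T_i^2} \times P(A) \\
T_i = 1:\quad \text{Risk}_i &= 1 \times 0.9 \times 1 \times 1 = 0.9 \\
T_i = 2:\quad \text{Risk}_i &= 1 \times 0.9 \times \tfrac{1}{4} \times 1 = 0.225 \\
T_i = 5:\quad \text{Risk}_i &= 1 \times 0.9 \times \tfrac{1}{25} \times 1 = 0.036
\end{align*}
```

Under these assumed inputs, doubling the tie size from 1 to 2 cuts the score to a quarter, whereas a plain 1/T_i weighting would only halve it.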
CONCLUSIONS: This work presents a framework for quantifying record-level disclosure risk in partially synthetic clinical datasets. By combining distance, rank-based uncertainty, and inverse-square attenuation of tie confidence, this method offers a more granular, privacy-conscious assessment of synthetic outputs than k-anonymity-based measures.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
RWD6
Topic
Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Data Protection, Integrity, & Quality Assurance
Disease
No Additional Disease & Conditions/Specialized Treatment Areas