A Record-Level Risk Metric for Partially Synthetic Clinical Data Using Inverse-Square Similarity Decay

Author(s)

Ade Adeoye, MSc¹, Lucy Mosquera, BSc, MSc².
¹Aetion Inc, Edmonton, AB, Canada, ²Sr Director of Data Science, Aetion, Ottawa, ON, Canada.
OBJECTIVES: Partial synthesis of clinical trial data is an innovative privacy-enhancing technology in which quasi-identifiers (QIs) are synthesized to mitigate re-identification risk while preserving data utility. Responsible sharing of these datasets requires quantitative assessment of re-identification risk. Traditional approaches such as k-anonymity provide group-level guarantees but fail to quantify record-level risk. We propose an interpretable record-level risk metric that leverages pairwise record comparison, incorporates structured QI similarity, and models the decline in attacker confidence as ambiguity increases.
METHODS: Given a real dataset $R$ and its synthetic equivalent $S$, we compute an $N \times N$ matrix of pairwise distances over a set of QIs. For each real record $r_i$, we identify its true synthetic record in $S$. Ties occur when multiple synthetic records share the same distance. The record-level similarity score is defined as $\text{Similarity}_i = (1 - d_{i,\text{scaled}}) \times (1/T_i^2)$, where $d_{i,\text{scaled}}$ is the scaled distance and $T_i \geq 1$ is the number of tied synthetic records at equal distance. The factor $1/T_i^2$ reflects the attacker's decreasing confidence in identifying the correct record as ambiguity increases, modeled as a function of uncertainty interpreted statistically through variance or entropy in the set of possible synthetic matches. Risk is obtained as $\text{Risk}_i = (1/k_{\text{real},i}) \times \text{Similarity}_i \times P(A)$, where $P(A)$ represents the probability of attack.
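The calculation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details the abstract leaves open are filled in here as assumptions: the distance is taken to be the fraction of QIs that differ (a scaled Hamming distance), row i of the synthetic array is taken to be the true counterpart of real record i, T_i counts the synthetic records at the same distance from r_i as that counterpart (including the counterpart itself), and k_real,i is taken to be the number of real records sharing record i's QI values. The function name `record_level_risk` and the parameter `p_attack` are hypothetical.

```python
import numpy as np


def record_level_risk(real, synth, p_attack=1.0):
    """Illustrative record-level risk for integer-coded QI arrays of shape (N, Q).

    Assumption: row i of `synth` is the synthetic counterpart of row i of `real`.
    """
    real = np.asarray(real)
    synth = np.asarray(synth)
    n = real.shape[0]

    # Assumed distance: share of QIs that differ, so values already lie in [0, 1].
    dist = (real[:, None, :] != synth[None, :, :]).mean(axis=2)  # N x N matrix

    # Scaled distance between each real record and its true synthetic record.
    d_true = dist[np.arange(n), np.arange(n)]

    # T_i: synthetic records at the same distance as the true one (always >= 1).
    ties = np.isclose(dist, d_true[:, None]).sum(axis=1)

    # Similarity_i = (1 - d_i_scaled) * (1 / T_i^2)
    similarity = (1.0 - d_true) / ties**2

    # Assumed k_real_i: size of record i's QI equivalence class in the real data.
    _, inverse, counts = np.unique(
        real, axis=0, return_inverse=True, return_counts=True
    )
    k_real = counts[inverse.reshape(-1)]

    # Risk_i = (1 / k_real_i) * Similarity_i * P(A)
    return similarity * p_attack / k_real


real = np.array([[0, 1], [0, 1], [2, 3]])
synth = np.array([[0, 1], [0, 1], [2, 3]])
risk = record_level_risk(real, synth, p_attack=1.0)
print(risk)
```

In this toy input the first two real records share their QI values and each has two synthetic records at distance 0, so the inverse-square term and the 1/k term together reduce their risk to 1/4 × 1/2 = 0.125, while the third record, which is unique and has a single exact match, receives a risk of 1.0.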
RESULTS: We apply this method to CDISC SDTM-formatted clinical trial datasets. It yields a distribution of record-level risk scores that aligns favourably with intuitive notions of disclosure risk. Records with unique QI combinations, small tie sizes, and smaller distances are assigned higher risks. The inverse-square component sharply penalizes cases with larger tie sizes, consistent with conservative disclosure standards.
CONCLUSIONS: This work presents a useful framework for quantifying record-level disclosure risk in partially synthetic clinical datasets. By combining distance, rank-based uncertainty, and inverse-square attenuation of tie confidence, this method offers a more granular, privacy-conscious assessment of synthetic outputs than k-anonymity-based measures.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

RWD6

Topic

Methodological & Statistical Research, Real World Data & Information Systems

Topic Subcategory

Data Protection, Integrity, & Quality Assurance

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
