A Record-Level Risk Metric for Partially Synthetic Clinical Data Using Inverse-Square Similarity Decay

Author(s)

Ade Adeoye, MSc¹, Lucy Mosquera, BSc, MSc².
¹Aetion Inc, Edmonton, AB, Canada, ²Sr Director of Data Science, Aetion, Ottawa, ON, Canada.

Presentation Documents

ISPOREurope25_ADEOYE_RWD6_POSTER.pdf

OBJECTIVES: Partial synthesis of clinical trial data is an innovative privacy-enhancing technology in which quasi-identifiers (QIs) are synthesized to mitigate re-identification risk while preserving data utility. Responsible sharing of these datasets requires quantitative assessment of re-identification risk. Traditional approaches such as k-anonymity provide group-level guarantees but fail to quantify record-level risk. We propose an interpretable record-level risk metric that leverages pairwise record comparison, incorporates structured QI similarity, and models the decline in attacker confidence as ambiguity increases.
METHODS: Given a real dataset R and its synthetic equivalent S, we compute an N×N matrix of pairwise distances over a set of QIs. For each real record r_i, we identify the true synthetic record in S. Ties occur when multiple synthetic records lie at the same distance from r_i. The record-level similarity score is defined as: Similarity_i = (1 − d_i,scaled) × (1/T_i²), where d_i,scaled is the scaled distance and T_i ≥ 1 is the number of tied synthetic records at equal distance. The factor 1/T_i² reflects the attacker’s decreasing confidence in identifying the correct record as ambiguity increases, modeled as a function of uncertainty that can be interpreted statistically through the variance or entropy of the set of possible synthetic matches. Risk is obtained as: Risk_i = (1/k_real,i) × Similarity_i × P(A), where P(A) represents the probability of attack.
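The computation described in METHODS can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the distance function, so the sketch uses the mean absolute difference over range-scaled, numerically coded QIs; it assumes row i of S is the true synthetic counterpart of row i of R; and it reads k_real,i as the number of real records sharing r_i's exact QI combination. The function name record_level_risk and the toy data are hypothetical.

```python
import numpy as np

def record_level_risk(real_qi, synth_qi, p_attack=1.0):
    """Sketch of Risk_i = (1/k_real,i) * Similarity_i * P(A).

    Assumptions (not stated in the abstract): QIs are numerically coded,
    row i of synth_qi is the true counterpart of row i of real_qi, and the
    distance is the mean absolute difference over range-scaled QIs.
    """
    real_qi = np.asarray(real_qi, dtype=float)
    synth_qi = np.asarray(synth_qi, dtype=float)
    n = real_qi.shape[0]

    # Scale every QI to [0, 1] using the combined range of R and S.
    combined = np.vstack([real_qi, synth_qi])
    lo = combined.min(axis=0)
    span = combined.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns contribute zero distance
    r = (real_qi - lo) / span
    s = (synth_qi - lo) / span

    # N x N matrix of scaled pairwise distances between real and synthetic rows.
    dist = np.abs(r[:, None, :] - s[None, :, :]).mean(axis=2)

    # Distance from each real record to its true synthetic record.
    d_true = dist[np.arange(n), np.arange(n)]

    # T_i: synthetic records tied at that same distance (always >= 1).
    ties = np.isclose(dist, d_true[:, None]).sum(axis=1)

    similarity = (1.0 - d_true) / ties**2

    # k_real,i: size of r_i's equivalence class on the QIs (assumed meaning).
    _, inverse, counts = np.unique(
        real_qi, axis=0, return_inverse=True, return_counts=True
    )
    k_real = counts[np.asarray(inverse).reshape(-1)]

    return similarity / k_real * p_attack

# Toy data: columns are (age, sex code); values are illustrative only.
real = [[30, 1], [30, 1], [45, 0], [60, 1]]
synth = [[30, 1], [45, 0], [45, 0], [50, 1]]
risk = record_level_risk(real, synth, p_attack=1.0)
print(risk)
```

In this toy example the fourth real record is unique on its QIs and has no ties, so it receives the highest score, while the second record sits far from its synthetic counterpart, is tied with another synthetic record, and shares its QI combination with another real record, so it receives the lowest.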
RESULTS: We apply this method to CDISC SDTM-formatted clinical trial datasets. It yields a distribution of record-level risk scores that aligns favourably with intuitive notions of disclosure risk. Records with unique QI combinations, small tie sizes, and smaller distances are assigned higher risks. The inverse-square component sharply penalizes cases with larger tie sizes, consistent with conservative disclosure standards.
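To make the effect of the inverse-square component concrete, the short sketch below tabulates the 1/T² factor for a few tie sizes. The comparison against a linear 1/T factor (a uniform guess among T tied candidates) is added here purely for illustration and is not part of the abstract; the function name tie_attenuation is hypothetical.

```python
def tie_attenuation(t: int) -> float:
    """Inverse-square confidence factor 1/T^2 for a tie size T >= 1."""
    if t < 1:
        raise ValueError("tie size must be at least 1")
    return 1.0 / t**2

# Compare with a linear 1/T factor (illustrative baseline, not from the abstract).
for t in (1, 2, 3, 5, 10):
    print(f"T={t:2d}  1/T={1.0 / t:.4f}  1/T^2={tie_attenuation(t):.4f}")
```

With two tied candidates the similarity is already cut to a quarter of its untied value, and with ten it falls to one hundredth.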
CONCLUSIONS: This work presents a framework for quantifying record-level disclosure risk in partially synthetic clinical datasets. By combining distance, rank-based uncertainty, and inverse-square attenuation of tie confidence, the method offers a more granular, privacy-conscious assessment of synthetic outputs than k-anonymity-based measures.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

RWD6

Topic

Methodological & Statistical Research, Real World Data & Information Systems

Topic Subcategory

Data Protection, Integrity, & Quality Assurance

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

Presentation (CTI)