Evaluating a Privacy-Preserving Record Linkage (PPRL) Solution to Link De-Identified Patient Records in RWD Using Default Matching Methods and Machine Learning Methods
Nguyen N1, Connolly T2, Overcash J3, Hubbard A3, Sudaria T4
1Veradigm, Raleigh, NC, USA, 2Veradigm, Marshall, VA, USA, 3Veradigm Health, Raleigh, NC, USA, 4Veradigm Health, Union City, CA, USA
OBJECTIVES: PPRL allows de-identified patient records to be linked in RWE studies without sharing PHI/PII. A commercially available PPRL solution was evaluated using its out-of-the-box rules to match patients. Outputs of the PPRL solution were then used to train custom ML models.
METHODS: Twenty-four matching algorithms from the solution were evaluated alone and in paired combinations (a match required both algorithm A and algorithm B to match) for 10,000 patients from Veradigm’s Ambulatory EHR dataset. The algorithms used differing rules involving first name, last name, address, ZIP, date of birth, gender, email, phone number, and social security number (SSN). The PPRL solution creates an encrypted token for each patient under each algorithm, and these tokens were used to match patients. The matches were evaluated by a human annotator for correctness. The encrypted tokens were then used as features to train logistic regression and random forest models (Scikit-learn). Precision, recall, and F1 scores were evaluated.
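The paired-combination rule and the evaluation metrics can be illustrated with a minimal sketch. It assumes that two records match under an algorithm when their encrypted tokens are identical and non-missing; the record IDs, algorithm names, token strings, and annotated truth set below are all hypothetical and are not taken from the PPRL solution or the study data.

```python
from itertools import combinations

# Hypothetical tokens: record ID -> {algorithm name -> encrypted token}.
# None stands for a token that could not be built (e.g. SSN not filled).
tokens = {
    "r1": {"alg_name_dob": "a91f", "alg_ssn": "77c2"},
    "r2": {"alg_name_dob": "a91f", "alg_ssn": "77c2"},
    "r3": {"alg_name_dob": "a91f", "alg_ssn": None},
    "r4": {"alg_name_dob": "5be0", "alg_ssn": "10d4"},
}


def paired_matches(tokens, alg_a, alg_b):
    """Return record pairs whose tokens agree under BOTH algorithms."""
    matches = set()
    for left, right in combinations(sorted(tokens), 2):
        agree = all(
            tokens[left].get(alg) is not None
            and tokens[left].get(alg) == tokens[right].get(alg)
            for alg in (alg_a, alg_b)
        )
        if agree:
            matches.add((left, right))
    return matches


def precision_recall_f1(predicted, truth):
    """Score predicted pairs against annotated true pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotation: r1, r2, and r3 are the same patient.
truth = {("r1", "r2"), ("r1", "r3"), ("r2", "r3")}
predicted = paired_matches(tokens, "alg_name_dob", "alg_ssn")
precision, recall, f1 = precision_recall_f1(predicted, truth)
print(predicted, precision, recall, f1)
```

In this toy data the missing SSN token on one record costs recall without hurting precision, which is the kind of trade-off a low fill rate would be expected to produce.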
RESULTS: The PPRL solution’s matching algorithms performed better in pairs than on their own. Paired algorithms based on SSN and ZIP performed well (precision 1.0, recall 0.73, F1 score 0.85); however, SSN had a very low fill rate (6.8%). The paired algorithms with the best F1 score used first name, last name, date of birth, and gender (precision 0.93, recall 0.82, F1 score 0.88). Logistic regression achieved precision 0.95, recall 0.92, and F1 0.93. Random forest achieved precision 0.94, recall 0.93, and F1 0.93.
CONCLUSIONS: Algorithms used in pairs performed better than algorithms used on their own. Additional data elements, such as SSN, address, and ZIP, improve linking precision but are not always available in the data. Encrypted tokens from commercially available PPRL solutions can be used to engineer features for models that outperform the solution’s default out-of-the-box matching.
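One way to turn encrypted tokens into model features is sketched below. The abstract does not describe the feature construction, so this is an assumption: each candidate record pair becomes a binary vector with one entry per algorithm, set to 1 when both records carry the same non-missing token. The algorithm names, token values, labels, and the `rec` helper are hypothetical; `LogisticRegression` and `RandomForestClassifier` are the Scikit-learn estimators for the two model types named in the methods.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical algorithm names; the evaluated solution exposes 24 of them.
ALGORITHMS = ["alg_1", "alg_2", "alg_3"]


def rec(*token_values):
    """Build a hypothetical record: algorithm name -> encrypted token."""
    return dict(zip(ALGORITHMS, token_values))


def agreement_features(left, right):
    """Return 1 per algorithm when both tokens are present and equal, else 0."""
    return [
        int(left.get(alg) is not None and left.get(alg) == right.get(alg))
        for alg in ALGORITHMS
    ]


# Hypothetical candidate pairs with an annotated label (1 = same patient).
base = rec("t1", "u1", "v1")
pairs = [
    (base, rec("t1", "u1", "v1"), 1),
    (base, rec("t1", "u1", "v9"), 1),
    (base, rec("t1", "u9", "v1"), 1),
    (base, rec("t9", "u1", "v1"), 1),
    (base, rec("t1", "u9", "v9"), 0),
    (base, rec("t9", "u1", None), 0),
    (base, rec(None, "u9", "v1"), 0),
    (base, rec("t9", "u9", "v9"), 0),
]

X = [agreement_features(left, right) for left, right, _ in pairs]
y = [label for _, _, label in pairs]

# Fit both model types on the token-agreement features.
log_reg = LogisticRegression().fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

print(log_reg.predict(X))
print(forest.predict(X))
```

Unlike a fixed pair of algorithms, a model trained on such vectors can weigh agreement across all algorithms at once, which is one plausible reason the trained models could recover matches that a single pairing misses. A real evaluation would score the models on held-out annotated pairs, not on the training pairs as printed here.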
Conference/Value in Health Info
Real World Data & Information Systems, Study Approaches
Data Protection, Integrity, & Quality Assurance, Electronic Medical & Health Records
No Additional Disease & Conditions/Specialized Treatment Areas