Evaluating a Privacy Preserving Record Linkage (PPRL) Solution to Link De-Identified Patient Records in RWD Using Default Matching Methods and Machine Learning Methods

Author(s)

Nguyen N¹, Connolly T², Overcash J³, Hubbard A³, Sudaria T⁴
¹Veradigm, Raleigh, NC, USA, ²Veradigm, Marshall, VA, USA, ³Veradigm Health, Raleigh, NC, USA, ⁴Veradigm Health, Union City, CA, USA

OBJECTIVES: PPRL allows de-identified patient records to be linked in RWE studies without sharing PHI/PII. A commercially available PPRL solution was evaluated using its out-of-the-box rules to match patients. Output of the PPRL solution was then used to train custom ML models.

METHODS: Twenty-four matching algorithms from the solution were evaluated alone and in paired combinations (a pair matches only when both algorithm A and algorithm B match) on 10,000 patients from Veradigm’s Ambulatory EHR dataset. The algorithms used differing rules involving first name, last name, address, ZIP code, date of birth, gender, email, phone number, and social security number (SSN). The PPRL solution creates an encrypted token for each patient under each algorithm; these tokens were then used to match patients. The matches were evaluated for correctness by a human annotator. The encrypted tokens were then used as features to train logistic regression and random forest models (scikit-learn). Precision, recall, and F1 score were evaluated.
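The paired-matching rule and the use of tokens as model features can be sketched as follows. This is a minimal illustration, not the evaluated solution's implementation: the algorithm names (`name_dob`, `ssn_zip`, `email`), the token strings, and the choice to encode each candidate record pair as one binary token-agreement indicator per algorithm are all assumptions, since the abstract does not describe the token format or the feature encoding.

```python
from itertools import combinations

# Hypothetical tokens: tokens[record_id][algorithm_name] -> encrypted token string.
# None stands for a token that could not be built (for example, a missing SSN).
tokens = {
    "r1": {"name_dob": "a91f", "ssn_zip": "77c0", "email": "e001"},
    "r2": {"name_dob": "a91f", "ssn_zip": "77c0", "email": "e002"},
    "r3": {"name_dob": "a91f", "ssn_zip": None, "email": "e001"},
    "r4": {"name_dob": "b222", "ssn_zip": None, "email": "e003"},
}
# Hypothetical algorithm names; the evaluated solution provides 24 algorithms.
algorithms = ["name_dob", "ssn_zip", "email"]


def agrees(rec_x, rec_y, algorithm):
    """Return 1 if both records carry the same non-missing token, else 0."""
    tx = tokens[rec_x][algorithm]
    ty = tokens[rec_y][algorithm]
    return int(tx is not None and tx == ty)


def paired_match(rec_x, rec_y, alg_a, alg_b):
    """Paired rule: declare a match only if both algorithms agree."""
    return bool(agrees(rec_x, rec_y, alg_a) and agrees(rec_x, rec_y, alg_b))


def feature_vector(rec_x, rec_y):
    """Assumed encoding: one binary agreement feature per algorithm."""
    return [agrees(rec_x, rec_y, a) for a in algorithms]


# Every unordered pair of records is a candidate link.
record_pairs = list(combinations(sorted(tokens), 2))
X = [feature_vector(x, y) for x, y in record_pairs]

for pair, row in zip(record_pairs, X):
    print(pair, row)
```

Under these assumptions, `X` together with the human annotator's match/non-match labels could be passed to scikit-learn's `LogisticRegression` or `RandomForestClassifier`. A learned model can weigh partial agreement across many algorithms, whereas a paired rule requires two specific algorithms to agree exactly.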

RESULTS: The PPRL solution’s matching algorithms performed better in pairs than on their own. Paired algorithms based on SSN and ZIP code performed well (precision 1.0, recall 0.73, F1 score 0.85); however, SSN had a very low fill rate (6.8%). The paired algorithms with the best F1 score used first name, last name, date of birth, and gender (precision 0.93, recall 0.82, F1 score 0.88). Logistic regression achieved precision 0.95, recall 0.92, and F1 score 0.93. Random forest achieved precision 0.94, recall 0.93, and F1 score 0.93.
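The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the rounded precision and recall values reported above; the dictionary labels and the helper name `f1_from` are illustrative only and do not come from the study.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as stated in the results above.
reported = {
    "SSN + ZIP pair": (1.00, 0.73, 0.85),
    "Name + DOB + gender pair": (0.93, 0.82, 0.88),
    "Logistic regression": (0.95, 0.92, 0.93),
    "Random forest": (0.94, 0.93, 0.93),
}

for label, (p, r, f1_reported) in reported.items():
    f1 = f1_from(p, r)
    print(f"{label}: recomputed {f1:.3f}, reported {f1_reported:.2f}")
```

Recomputing from the rounded inputs gives values within about 0.01 of each reported F1 score. The small gaps for the two paired rules are consistent with the reported scores having been calculated from unrounded precision and recall, although the abstract does not state this.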

CONCLUSIONS: Algorithms used in pairs performed better than algorithms used on their own. Additional data, such as SSN, address, and ZIP code, improve linking precision but are not always available in the data. Encrypted tokens from commercially available PPRL solutions can be used to engineer features for models that outperform the solution’s default out-of-the-box matching.

Conference/Value in Health Info

2022-05, ISPOR 2022, Washington, DC, USA

Value in Health, Volume 25, Issue 6, S1 (June 2022)

Code

RWD103

Topic

Real World Data & Information Systems, Study Approaches

Topic Subcategory

Data Protection, Integrity, & Quality Assurance, Electronic Medical & Health Records

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
