PROBABILISTIC LINKING OF CLAIMS AND ELECTRONIC HEALTH RECORD (EHR) ENCOUNTERS USING THE FELLEGI-SUNTER STATISTICAL FRAMEWORK
Author(s)
Stacey Long¹, Benjamin Nikolai, BS²
¹OMNY Health, Chief Strategy Officer, Atlanta, GA, USA; ²OMNY Health, Atlanta, GA, USA
OBJECTIVES: To develop and validate a robust, automated probabilistic method for linking individual encounter records between claims and EHR systems to establish a comprehensive view across clinical care delivery and billing records.
METHODS: This retrospective study used OMNY Health electronic health records (EHR; 2017-2025) combined with a linked open claims data source. We used the Fellegi-Sunter model to calculate the match probability for encounter record pairs across the two data sets. The model evaluated multiple overlapping features: date comparisons (exact matches and similarity within 1-, 3-, 5-, or 7-day windows); code set overlaps (degree of overlap, categorized as exact, >50%, or any, for procedures, diagnoses, and provider taxonomies); and setting-of-care comparisons (outpatient, inpatient, emergency department). Match weights and non-match weights were aggregated against a prior probability to generate a final match probability. The model was tested on approximately 6 million records from 100,000 randomly selected patients within large health systems (LHS) and specialty practice networks (SPN).
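The scoring step described in the methods can be illustrated with a minimal sketch. It is not the study's implementation: the m and u probabilities, the prior of 0.01, the use of Jaccard overlap to define "degree of overlap", and all function and variable names are assumptions chosen for illustration. Each feature is reduced to a comparison level, the level's log(m/u) weight is added to the prior log-odds, and the total is converted back to a probability.

```python
import math

# Hypothetical (m, u) pairs per comparison level:
#   m = P(level | records are a true match), u = P(level | records are not a match).
# These values are illustrative assumptions, not estimates from the study.
DATE_MU = {
    "exact": (0.70, 0.01), "within_1": (0.12, 0.02), "within_3": (0.08, 0.04),
    "within_5": (0.04, 0.04), "within_7": (0.03, 0.04), "none": (0.03, 0.85),
}
CODE_MU = {"exact": (0.50, 0.01), ">50%": (0.25, 0.04),
           "any": (0.15, 0.15), "none": (0.10, 0.80)}
SETTING_MU = {"same": (0.95, 0.40), "different": (0.05, 0.60)}

PARAMS = {"date": DATE_MU, "procedures": CODE_MU, "diagnoses": CODE_MU,
          "taxonomy": CODE_MU, "setting": SETTING_MU}


def date_level(days_apart):
    """Map a date difference in days to the narrowest matching window."""
    d = abs(days_apart)
    if d == 0:
        return "exact"
    for window in (1, 3, 5, 7):
        if d <= window:
            return f"within_{window}"
    return "none"


def overlap_level(a, b):
    """Classify code set overlap as exact, >50%, any, or none.

    Overlap is measured as Jaccard similarity (an assumption; the abstract
    does not state how the degree of overlap was computed).
    """
    a, b = set(a), set(b)
    if not a or not b:
        return "none"
    if a == b:
        return "exact"
    shared = len(a & b)
    if shared / len(a | b) > 0.5:
        return ">50%"
    return "any" if shared else "none"


def match_probability(levels, params=PARAMS, prior=0.01):
    """Add each feature's log(m/u) weight to the prior log-odds, then
    convert the total log-odds to a probability."""
    log_odds = math.log(prior / (1 - prior))
    for feature, level in levels.items():
        m, u = params[feature][level]
        log_odds += math.log(m / u)
    return 1 / (1 + math.exp(-log_odds))


# Usage with illustrative codes for one candidate EHR/claims encounter pair.
pair = {
    "date": date_level(0),
    "procedures": overlap_level({"99213"}, {"99213"}),
    "diagnoses": overlap_level({"E11.9", "I10", "E78.5"}, {"E11.9", "I10"}),
    "taxonomy": overlap_level({"207Q00000X"}, {"207Q00000X"}),
    "setting": "same",
}
print(match_probability(pair))
```

Working in log-odds keeps the contribution of each feature additive, so the weight behind any individual match decision can be inspected feature by feature.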
RESULTS: An EHR encounter and a billing record were considered a match when their match probability exceeded 90%, and the match rate of claims to clinical encounters was recorded at this threshold. Performance varied by clinical care setting: 89.25% of ambulatory records matched at the 90% probability threshold. Hospital-based inpatient and outpatient care performed similarly, with 45.94% and 43.55% of records matching at the threshold, respectively. Only 29.90% of records from emergency department settings matched. Higher match rates were observed in EHR data originating from LHS than in data originating from SPN.
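The per-setting match rates reported above amount to thresholding each record's match probability and tabulating by care setting. A minimal sketch follows, assuming each record carries one best-candidate probability; the function name, the input layout, and the toy scores are hypothetical and are not the study's data.

```python
from collections import defaultdict

THRESHOLD = 0.90  # a record counts as matched when its probability exceeds this


def match_rate_by_setting(scored, threshold=THRESHOLD):
    """Return the share of records per care setting whose match probability
    is strictly greater than the threshold.

    `scored` is an iterable of (setting, probability) tuples.
    """
    totals = defaultdict(int)
    matched = defaultdict(int)
    for setting, prob in scored:
        totals[setting] += 1
        if prob > threshold:
            matched[setting] += 1
    return {s: matched[s] / totals[s] for s in totals}


# Toy scores for illustration only.
scored = [
    ("ambulatory", 0.97), ("ambulatory", 0.95), ("ambulatory", 0.40),
    ("emergency", 0.91), ("emergency", 0.90),
]
rates = match_rate_by_setting(scored)
```

The strict comparison mirrors the ">90%" criterion, so a record scoring exactly 0.90 is not counted as a match.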
CONCLUSIONS: The Fellegi-Sunter framework provides a statistically rigorous and transparent method for joining disparate healthcare datasets. By moving beyond simple "black box" or single-variable matches, this holistic approach reduces false positives and provides researchers with a scalable, defensible tool for combining datasets to support real-world evidence generation.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR144
Topic
Methodological & Statistical Research
Topic Subcategory
Missing Data
Disease
No Additional Disease & Conditions/Specialized Treatment Areas