PROBABILISTIC RECORD MATCHING USING MACHINE LEARNING TECHNIQUES

Author(s)

Ross JM¹, Fermin Cota RN¹, Zaric GS²
¹Next Level Analytics Inc., London, ON, Canada, ²Western University, London, ON, Canada

OBJECTIVES: Record matching is a common problem in database studies and in many practical applications. Many records are incomplete or contain errors which makes the matching problem difficult. In addition, the scale of the problem, where batches of several million records must be reconciled against databases containing tens of million records, requires efficient matching algorithms. METHODS: We consider a problem of matching records containing first name, last name, SSN, Medicaid ID, and address. We developed a matching algorithm that works as follows: First, “fingerprints,” which convert records into bitsets of length 128 were generated. Second, fingerprints were compared across tables using approximate nearest neighbour methods. Third, the records for which fingerprints were nearest neighbours were compared using Levenshtein and Jaro-Winkler distance measures. Decision thresholds for matching were set using the EM algorithm. To evaluate the algorithm, we created test sets in which a batch file of 200,000 records must be reconciled against a master database containing 1,000,000 records. We systematically varied the rates of matches, errors, and missing records. We evaluated our algorithm on the basis of a confusion matrix, adjusted for indeterminate decisions. This is a single metric, we called a “quality score”. RESULTS: The fingerprint blocking method was able to retain 99.5% of possible matches, while maintaining at most 50 comparisons per batch record, representing a reduction of 99.995% in the number of comparisons compared to a naïve search strategy. As more errors were introduced into the dataset, the rate of retention decreased, while the 50 comparisons per batch record remained constant. CONCLUSIONS: Our algorithm significantly reduced the number of comparisons while retaining almost all matches. The string distance measures were significantly more effective at identifying matching records than deterministic decision rules. To efficiently match error prone records, a record blocking stage is required.

Conference/Value in Health Info

2016-05, ISPOR 2016, Washington DC, USA

Value in Health, Vol. 19, No. 3 (May 2016)

Code

PRM65

Topic

Real World Data & Information Systems

Topic Subcategory

Reproducibility & Replicability

Disease

Multiple Diseases

Explore Related HEOR by Topic

Real-World Data

Presentation