PROBABILISTIC RECORD MATCHING USING MACHINE LEARNING TECHNIQUES
Author(s)
Ross JM1, Fermin Cota RN1, Zaric GS2
1Next Level Analytics Inc., London, ON, Canada, 2Western University, London, ON, Canada
OBJECTIVES: Record matching is a common problem in database studies and in many practical applications. Many records are incomplete or contain errors which makes the matching problem difficult. In addition, the scale of the problem, where batches of several million records must be reconciled against databases containing tens of million records, requires efficient matching algorithms. METHODS: We consider a problem of matching records containing first name, last name, SSN, Medicaid ID, and address. We developed a matching algorithm that works as follows: First, “fingerprints,” which convert records into bitsets of length 128 were generated. Second, fingerprints were compared across tables using approximate nearest neighbour methods. Third, the records for which fingerprints were nearest neighbours were compared using Levenshtein and Jaro-Winkler distance measures. Decision thresholds for matching were set using the EM algorithm. To evaluate the algorithm, we created test sets in which a batch file of 200,000 records must be reconciled against a master database containing 1,000,000 records. We systematically varied the rates of matches, errors, and missing records. We evaluated our algorithm on the basis of a confusion matrix, adjusted for indeterminate decisions. This is a single metric, we called a “quality score”. RESULTS: The fingerprint blocking method was able to retain 99.5% of possible matches, while maintaining at most 50 comparisons per batch record, representing a reduction of 99.995% in the number of comparisons compared to a naïve search strategy. As more errors were introduced into the dataset, the rate of retention decreased, while the 50 comparisons per batch record remained constant. CONCLUSIONS: Our algorithm significantly reduced the number of comparisons while retaining almost all matches. The string distance measures were significantly more effective at identifying matching records than deterministic decision rules. To efficiently match error prone records, a record blocking stage is required.
Conference/Value in Health Info
2016-05, ISPOR 2016, Washington DC, USA
Value in Health, Vol. 19, No. 3 (May 2016)
Code
PRM65
Topic
Real World Data & Information Systems
Topic Subcategory
Reproducibility & Replicability
Disease
Multiple Diseases