Comparing Techniques for Mining HbA1c Data in Ambulatory EHRs for Use in RWE Research


Morgan P1, Overcash J2, Sudaria T3, Nguyen N2
1Veradigm Health, Raleigh, NC, USA, 2Veradigm, Raleigh, NC, USA, 3Veradigm Health, Union City, CA, USA

OBJECTIVES: Hemoglobin A1c (HbA1c) values are used regularly in real-world evidence (RWE) studies but are not consistently present in a usable structured form. This study compares two methods for mining HbA1c lab data from free-text EHR fields for use in RWE research.

METHODS: Possible terms with HbA1c data were identified from ambulatory EHR data (N=69,832 possible HbA1c results), first using simple regex logic and second using a keyword-neighbors regex model (Python, pandas and string) combined with clinical knowledge to capture more terms involving HbA1c. Spelling and punctuation errors and related names (e.g., hgb, eAG, mean plasma glucose) were accounted for. Clinicians reviewed the list of possible terms to determine which were valid HbA1c data. Using clinical expertise and regular expressions, inclusion criteria (e.g., "aic" in the presence of %) and exclusion criteria (e.g., waived, ‘not taken’) were added to improve model accuracy. The model was adapted to reconcile the alias terms with their corresponding result values and units. The model then used the units and lab result values to calculate and standardize HbA1c as a percentage. If a result/unit combination could not be standardized, the lab name was excluded.
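The contrast between the two matching approaches, and the standardization step, can be illustrated with a minimal Python sketch. This is not the authors' model: the patterns, the function names (`simple_match`, `broad_match`, `to_percent`), and the choice of unit conversions are assumptions made for illustration. The conversions shown are the commonly published IFCC-to-NGSP master equation and the ADAG estimated-average-glucose relationship; the abstract does not state which conversions were actually used.

```python
import re

# All patterns below are hypothetical examples, not the study's term list.
# Unambiguous aliases: "a1c" with an optional hemoglobin prefix and loose separators.
STRICT = re.compile(
    r"\b(?:hb|hgb|hemoglobin|haemoglobin)?[\s._-]*a[\s._-]*1[\s._-]*c\b", re.I
)
# Common misspellings of "a1c" that are only trusted when the unit is a percentage.
AMBIGUOUS = re.compile(r"\ba[il]c\b", re.I)
# Related names that report average glucose rather than HbA1c itself.
RELATED = re.compile(
    r"\b(?:eag|estimated average glucose|mean plasma glucose)\b", re.I
)
# Exclusion phrases indicating that no usable result was recorded.
EXCLUDE = re.compile(r"\b(?:waived|not taken)\b", re.I)


def simple_match(name: str) -> bool:
    """Conservative baseline: accept only the exact name 'HbA1c' (case-insensitive)."""
    return re.fullmatch(r"hba1c", name.strip(), re.I) is not None


def broad_match(name: str, unit: str = "") -> bool:
    """Alias matching with one inclusion rule and one exclusion rule."""
    if EXCLUDE.search(name):
        return False
    if STRICT.search(name) or RELATED.search(name):
        return True
    # Inclusion rule: an ambiguous spelling counts only in the presence of '%'.
    return bool(AMBIGUOUS.search(name)) and "%" in unit


def to_percent(value: float, unit: str, name: str = ""):
    """Standardize a result to HbA1c percent; return None if not convertible."""
    u = unit.strip().lower()
    if u in {"%", "percent"}:
        return value
    if u == "mmol/mol":
        # IFCC (mmol/mol) to NGSP (%) master equation.
        return round(0.09148 * value + 2.152, 1)
    if u == "mg/dl" and RELATED.search(name):
        # ADAG relationship: eAG (mg/dL) = 28.7 * A1c - 46.7, solved for A1c.
        return round((value + 46.7) / 28.7, 1)
    return None
```

In this sketch, a lab name such as "Hgb A1c" is missed by `simple_match` but accepted by `broad_match`, and a result whose unit cannot be converted returns `None`, mirroring the rule that unstandardizable result/unit combinations are dropped.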

RESULTS: The keyword-neighbors model (53,408 HbA1c results) identified 192% more HbA1c results than the simple regex model (18,309 HbA1c results) and had a recall of 98%. Nineteen distinct false-negative terms were identified, but adding further regex logic to capture them would degrade the keyword-neighbors model's performance while yielding only a minimal increase in lab results.
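The reported percentage follows directly from the two result counts. The short sketch below reproduces that arithmetic and states the standard definition of recall; the function names are illustrative, and the true-positive and false-negative counts behind the 98% recall are not given in the abstract, so none are assumed here.

```python
def relative_increase_pct(new: int, baseline: int) -> float:
    """Percentage by which `new` exceeds `baseline`."""
    return (new - baseline) / baseline * 100


def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN), the share of true HbA1c results that were found."""
    return true_positives / (true_positives + false_negatives)


# Counts reported in the abstract: 53,408 (keyword neighbors) vs 18,309 (simple regex).
increase = relative_increase_pct(53408, 18309)
print(round(increase))  # 192
```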

CONCLUSIONS: Simple regex logic and result ranges are conservative: they miss HbA1c aliases, non-standard values, and outlier values while still allowing false positives. The use of keyword neighbors and clinical knowledge greatly increased the percentage of HbA1c labs identified, thereby increasing the amount of data available for RWE research.

Conference/Value in Health Info

2022-05, ISPOR 2022, Washington, DC, USA




Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches

Topic Subcategory

Data Protection, Integrity, & Quality Assurance, Electronic Medical & Health Records


Diabetes/Endocrine/Metabolic Disorders
