PATIENT JOURNEY FOUNDATIONAL MODEL FOR SCALABLE IMPUTATION OF MISSING UNITS OF MEASUREMENT (UOM) IN ELECTRONIC HEALTH RECORDS DATA
Author(s)
Ehsan Alipour, MD, MPH, PhD1, Wilson Lau, PhD1, Youngwon Kim, PhD1, Sihang Zeng, MS1, Anand Oka, PhD1, Jay Nanduri, MS MBA2.
1Truveta Inc, Bellevue, WA, USA, 2Truveta, Issaquah, WA, USA.
OBJECTIVES: Missing units of measurement (UoM) for laboratory results, observations, and measurements create significant data quality challenges in electronic health record (EHR) data and can impair downstream use of the data for clinical work, quality control, and research. We developed a transformer-based patient journey foundation model to automatically impute UoM and evaluated its accuracy and potential impact on EHR data completeness.
METHODS: The patient journey is represented as three parallel sequences containing the event names, the event values (when applicable), and the age of the patient at the time of each event. These sequences were tokenized using a custom tokenizer and passed through a generative pre-trained transformer (GPT) model with 150 million parameters, augmented with a prediction head that outputs a UoM for each input token.
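The three-sequence representation described above can be illustrated with a minimal sketch. It is not the authors' tokenizer or model: the `Event` class, the `NO_VALUE` placeholder, the string formatting of values and ages, and the ordering by age are all assumptions made for illustration. It only shows how one patient journey might be laid out as aligned name, value, and age sequences with a per-position UoM target.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    """One patient-journey event (hypothetical structure, not from the source)."""
    name: str                   # condition or laboratory test concept name
    value: Optional[float]      # numeric result; None when the event has no value
    age_years: float            # patient age at the time of the event
    uom: Optional[str] = None   # ground-truth unit; None when missing or not applicable


NO_VALUE = "[NO_VALUE]"  # assumed placeholder for events without a value


def build_parallel_sequences(
    events: List[Event],
) -> Tuple[List[str], List[str], List[str], List[Optional[str]]]:
    """Return aligned (names, values, ages, uom_targets) lists, ordered by age."""
    ordered = sorted(events, key=lambda e: e.age_years)  # stable sort keeps ties in input order
    names = [e.name for e in ordered]
    values = [NO_VALUE if e.value is None else f"{e.value:g}" for e in ordered]
    ages = [f"{e.age_years:.1f}" for e in ordered]
    # A None target marks a position with no ground-truth unit, which a
    # training loop would skip when computing the loss (assumption).
    targets = [e.uom for e in ordered]
    return names, values, ages, targets


journey = [
    Event("Hemoglobin", 13.5, 61.2, "g/dL"),
    Event("Type 2 diabetes", None, 60.0),
    Event("Glucose", 5.4, 61.2, None),
]
names, values, ages, targets = build_parallel_sequences(journey)
```

In this layout, position i of every list refers to the same event, so a per-token prediction head can be trained against `targets[i]` wherever a ground-truth unit exists.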
RESULTS: The dataset consisted of 142,390 patients with cancer sampled from Truveta Data who had at least 1 year of patient journey data. The dataset was split into training and validation sets, with 17,799 patients in the validation set. All concepts belonging to the conditions and laboratory tests categories were included. In total, 404 unique UoMs were identified in the data, of which 135 occurred more than 100 times in the validation set. Of all laboratory tests, 19% did not have a ground-truth UoM. The model achieved an accuracy of 97% on the validation set in identifying the correct UoM. It achieved an F1 score of 0.7 or higher for 88% of UoMs that appeared at least 100 times in the validation set.
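The two headline metrics above (overall accuracy, and the share of frequent UoMs reaching an F1 threshold) can be computed along the following lines. This is a sketch under stated assumptions, not the authors' evaluation code: the function names are hypothetical, F1 is computed one-vs-rest per unit, and positions without a ground-truth UoM are assumed to have been excluded beforehand.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Tuple[Dict[str, float], Counter]:
    """One-vs-rest F1 per true label, plus each label's support."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted unit receives a false positive
            fn[t] += 1  # true unit receives a false negative
    support = Counter(y_true)
    scores = {}
    for label in support:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores, support


def summarize(y_true: List[str], y_pred: List[str],
              min_support: int = 100, f1_threshold: float = 0.7) -> Tuple[float, float]:
    """Return (accuracy, share of labels with support >= min_support and F1 >= f1_threshold)."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    scores, support = per_class_f1(y_true, y_pred)
    frequent = [label for label, n in support.items() if n >= min_support]
    share = (sum(scores[label] >= f1_threshold for label in frequent) / len(frequent)
             if frequent else float("nan"))
    return accuracy, share


# Toy example with a lowered support cutoff so both units count as "frequent".
y_true = ["mg/dL"] * 4 + ["g/dL"] * 2
y_pred = ["mg/dL", "mg/dL", "mg/dL", "g/dL", "g/dL", "mg/dL"]
scores, support = per_class_f1(y_true, y_pred)
accuracy, share = summarize(y_true, y_pred, min_support=2)
```

Reporting per-unit F1 alongside accuracy matters because accuracy is dominated by the most common units, whereas the per-unit view shows how well rarer units are recovered.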
CONCLUSIONS: Our foundation model can use information about the laboratory test, its value, its timing, and the surrounding events in the patient journey to impute the UoM of lab values with high accuracy. The model can significantly improve EHR data quality, reduce the need for expensive and time-consuming manual data cleaning, and enhance the reliability of clinical research, real-world evidence generation, and downstream AI applications.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
RWD140
Topic
Real World Data & Information Systems
Topic Subcategory
Data Protection, Integrity, & Quality Assurance
Disease
No Additional Disease & Conditions/Specialized Treatment Areas, STA: Personalized & Precision Medicine