A Hierarchical Algorithm to Identify Pregnancy Start and Duration Using Structured EHR Data

Author(s)

Sunny Guin, PhD, Katie Brown, PhD, MSN, RN, Amy Sullivan, MS, Esther Kim, PhD, Nadia Tabatabaeepour, MPH, Katherine Gilbert, MPH, Jordan Swartz, MD, Sarah Platt, MS, Emily Webber, PhD.
Truveta, Seattle, WA, USA.
OBJECTIVES: Accurate identification of pregnancy start dates and durations is critical for real-world studies evaluating treatment exposures, healthcare utilization, and maternal-infant outcomes. However, estimating pregnancy start from structured electronic health record (EHR) data is challenging. We developed a hierarchical algorithm to estimate last menstrual period (LMP) and pregnancy duration using structured EHR data from Truveta, a multi-health system platform. The approach supports comprehensive cohort development for health economics and outcomes research (HEOR), including linked infant outcomes.
METHODS: Truveta data provides complete, timely, representative, de-identified EHR data comprising over 120 million patients from US health systems. We used structured diagnosis and procedure records with deterministic mother-infant linkages. Women aged 12-55 with pregnancy-related codes were included. Two primary methods were used. The gestational age method used gestational age codes (specifically ICD-10-CM Z3A.xx and SNOMED CT equivalents) to obtain weeks of gestation and estimated LMP by counting backward from the code dates. The outcome-based method used codes for live births, stillbirths, abortions, and ectopic/molar pregnancies. The algorithm prioritized four LMP estimation approaches derived from the two methods.
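The backward count from a gestational age code can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `weeks_from_z3a` and `estimate_lmp` are hypothetical, and the sketch assumes that only Z3A codes stating an exact week (Z3A.08 through Z3A.42) are used, while unspecified or open-ended codes (Z3A.00, Z3A.01, Z3A.49) are skipped.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Matches ICD-10-CM gestational age codes of the form Z3A.NN.
_Z3A_PATTERN = re.compile(r"^Z3A\.(\d{2})$")


def weeks_from_z3a(code: str) -> Optional[int]:
    """Return completed weeks of gestation for an exact-week Z3A code.

    Assumption for this sketch: only Z3A.08 to Z3A.42 carry an exact week.
    Z3A.00 (unspecified), Z3A.01 (less than 8 weeks), and Z3A.49
    (greater than 42 weeks) return None.
    """
    match = _Z3A_PATTERN.match(code.strip().upper())
    if match is None:
        return None
    weeks = int(match.group(1))
    return weeks if 8 <= weeks <= 42 else None


def estimate_lmp(code: str, code_date: date) -> Optional[date]:
    """Estimate LMP by counting backward from the date the code was recorded."""
    weeks = weeks_from_z3a(code)
    if weeks is None:
        return None
    return code_date - timedelta(weeks=weeks)


if __name__ == "__main__":
    # A 20-week code recorded on 2024-06-01 implies an LMP 140 days earlier.
    print(estimate_lmp("Z3A.20", date(2024, 6, 1)))
```

In a hierarchical design such as the one described, a function like this would sit at one level, with outcome-based estimates used when no usable gestational age code is available; the ordering of those levels is not specified in this abstract.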
RESULTS: Among 6.4 million women with pregnancy-related codes, 3.1 million pregnancy start dates were estimated using biologically plausible durations. Approximately 3.5 million women had outcome codes (2.9 million live births; 0.6 million non-live births), with 1.26 million linked to infant records. SNOMED CT codes contributed an additional 0.5 million episodes. The resulting dataset includes pregnancy timing, outcome type, and infant linkage, enabling longitudinal analyses of care, treatment effects, and child outcomes.
CONCLUSIONS: This multi-step algorithm supports robust pregnancy episode construction using structured EHR data. Its hierarchical design improves completeness and precision, particularly when combined with mother-infant linkage. Future work includes clinician-reviewed validation of estimated pregnancy start dates and durations using note-based gold standards. This methodology enables scalable, reproducible cohort creation for regulatory-grade research, including pregnancy post-authorization safety studies (PASS) and drug safety evaluations.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

RWD2

Topic

Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches

Topic Subcategory

Reproducibility & Replicability

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, Reproductive & Sexual Health
