CONTRASTIVE TRAJECTORY DISTILLATION: A NOVEL METHOD FOR OPTIMIZING LARGE LANGUAGE MODEL EMBEDDINGS IN MATERNAL HEALTH PREDICTION

Author(s)

Samuel Weiss, BS, Robert Martorano, BS, Ian J. Hooley, BS.
Pomelo Care, New York, NY, USA.

OBJECTIVES: Electronic Health Records (EHR) contain rich longitudinal signals, yet traditional tabular machine learning often fails to capture temporal dependencies within sparse diagnosis histories. We propose Contrastive Trajectory Distillation (CTD), a framework leveraging Large Language Models (LLMs) to transform longitudinal patient histories into high-fidelity predictive embeddings for maternal health outcomes.
METHODS: We developed a "mutual prediction" framework between two LLM agents (Gemini 2.5 Pro): a Predictor that forecasts clinical trajectories from past ICD-10 history, and an Inferrer that summarizes actual future outcomes. Using a reflection mechanism, prompts were iteratively optimized to maximize a contrastive objective: the Predictor’s output had to be mathematically closer to the patient's actual future than to the future of a randomly selected contrastive patient. We validated this framework on maternal episodes (N=3,000), training CatBoost models to compare 3,072-dimensional optimized embeddings against traditional dummy-coded ICD-10 features across 9 binary maternity complications and total cost.
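The contrastive check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state which distance measure was used, so cosine similarity is an assumption, and the function names (`cosine`, `contrastive_pass`, `contrastive_pass_rate`) are hypothetical. The inputs stand in for embeddings of the Predictor's forecast and the Inferrer's summary of the actual future.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_pass(predicted, actual, contrast):
    """True if the forecast is closer to the patient's own future
    than to the future of a contrastive patient."""
    return cosine(predicted, actual) > cosine(predicted, contrast)


def contrastive_pass_rate(predicted, actual, seed=0):
    """Fraction of patients whose forecast passes the check against
    one randomly drawn contrastive patient (never the patient itself)."""
    rng = random.Random(seed)
    n = len(predicted)
    passes = 0
    for i in range(n):
        j = rng.choice([k for k in range(n) if k != i])
        passes += contrastive_pass(predicted[i], actual[i], actual[j])
    return passes / n


# Toy vectors for illustration only; these are not study data.
toy_predicted = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_actual = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
rate = contrastive_pass_rate(toy_predicted, toy_actual)
```

Under this reading, a prompt-optimization loop would keep a candidate prompt only if it raises the pass rate on held-out patients; the actual selection rule used in the reflection mechanism is not specified in the abstract.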
RESULTS: Optimized embeddings consistently outperformed traditional features across all tasks. The method showed the greatest lift in hard-to-predict conditions, improving AUC-ROC for Preeclampsia (0.67 vs 0.58; +15.8%), Hypertension Spectrum (0.73 vs 0.66; +10.7%), and Gestational Hypertension (0.71 vs 0.67; +6.4%). For healthcare utilization, the embeddings improved Future Healthcare Cost prediction R² by 4.2% (0.40 vs 0.38) and reduced mean absolute error (MAE) by 3.5%. Furthermore, the optimization process improved the embeddings' ability to distinguish correct patient trajectories (contrastive pass rate) from 63% using raw codes to 80% using optimized prompts.
CONCLUSIONS: CTD effectively bridges the gap between unstructured clinical reasoning and structured risk prediction. By distilling LLM knowledge into fixed-dimensional embeddings, this framework captures semantic and temporal patterns that traditional coding misses. This offers a scalable, interpretable approach for the early identification of high-risk pregnancies without the latency and cost of direct LLM inference.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR70

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Reproductive & Sexual Health

Presentation (CTI)