CONTRASTIVE TRAJECTORY DISTILLATION: A NOVEL METHOD FOR OPTIMIZING LARGE LANGUAGE MODEL EMBEDDINGS IN MATERNAL HEALTH PREDICTION
Author(s)
Samuel Weiss, BS, Robert Martorano, BS, Ian J. Hooley, BS.
Pomelo Care, New York, NY, USA.
OBJECTIVES: Electronic Health Records (EHR) contain rich longitudinal signals, yet traditional tabular machine learning often fails to capture temporal dependencies within sparse diagnosis histories. We propose Contrastive Trajectory Distillation (CTD), a framework leveraging Large Language Models (LLMs) to transform longitudinal patient histories into high-fidelity predictive embeddings for maternal health outcomes.
METHODS: We developed a "mutual prediction" framework between two LLM agents (Gemini 2.5 Pro): a Predictor that forecasts clinical trajectories from past ICD-10 history, and an Inferrer that summarizes actual future outcomes. Using a reflection mechanism, prompts were iteratively optimized to maximize a contrastive objective, ensuring that the Predictor's output was mathematically closer to the patient's actual future than to that of a randomly selected contrastive patient. We validated this framework on maternal episodes (N=3,000), training CatBoost models to compare 3,072-dimensional optimized embeddings against traditional dummy-coded ICD-10 features across 9 binary maternity complications and total cost.
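The contrastive check described above can be illustrated with a minimal sketch. The abstract says only that the Predictor's output should be "mathematically closer" to the patient's actual future than to a contrastive patient's; the code below assumes both texts have already been embedded as vectors and that closeness is measured by cosine similarity. The function names and toy vectors are hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_pass(predicted, actual_future, contrast_future):
    """True if the prediction is closer to the patient's own future
    than to the contrastive patient's future (assumed metric: cosine)."""
    own = cosine_similarity(predicted, actual_future)
    other = cosine_similarity(predicted, contrast_future)
    return own > other


def contrastive_pass_rate(predicted, futures, seed=0):
    """Fraction of patients whose prediction passes the contrastive check
    against one randomly drawn other patient."""
    rng = random.Random(seed)
    n = len(predicted)
    passes = 0
    for i in range(n):
        # Draw a contrastive patient different from patient i.
        j = rng.choice([k for k in range(n) if k != i])
        if contrastive_pass(predicted[i], futures[i], futures[j]):
            passes += 1
    return passes / n


# Toy 3-dimensional stand-ins for the 3,072-dimensional embeddings.
predicted = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
futures = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
rate = contrastive_pass_rate(predicted, futures)
print(f"contrastive pass rate: {rate:.2f}")
```

In a prompt-optimization loop of the kind the abstract describes, a rate like this would serve as the score that the reflection step tries to raise; how the authors aggregate or sample contrastive patients is not stated.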
RESULTS: Optimized embeddings consistently outperformed traditional features across all tasks. The method showed the greatest lift in hard-to-predict conditions, improving AUC-ROC for Preeclampsia (0.67 vs 0.58; +15.8%), Hypertension Spectrum (0.73 vs 0.66; +10.7%), and Gestational Hypertension (0.71 vs 0.67; +6.4%). For healthcare utilization, the embeddings improved Future Healthcare Cost prediction R² by 4.2% (0.40 vs 0.38) and reduced mean absolute error (MAE) by 3.5%. Furthermore, the optimization process improved the embeddings' ability to distinguish correct patient trajectories (contrastive pass rate) from 63% using raw codes to 80% using optimized prompts.
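The percentage gains quoted above read as relative improvements over the baseline score, although the abstract does not state the formula. The short sketch below assumes the usual definition, (new - baseline) / baseline, and applies it to the two-decimal values printed in the text; the helper name is hypothetical.

```python
def relative_lift(new, baseline):
    """Percent improvement of `new` over `baseline` (assumed definition)."""
    return (new - baseline) / baseline * 100.0


# (embedding score, baseline score) exactly as printed, rounded to 2 decimals.
reported = {
    "Preeclampsia AUC-ROC": (0.67, 0.58),
    "Hypertension Spectrum AUC-ROC": (0.73, 0.66),
    "Gestational Hypertension AUC-ROC": (0.71, 0.67),
    "Future Healthcare Cost R2": (0.40, 0.38),
}

lifts = {
    name: round(relative_lift(new, base), 1)
    for name, (new, base) in reported.items()
}

for name, lift in lifts.items():
    print(f"{name}: +{lift}%")
```

Recomputing from the rounded scores gives values close to, but not identical with, the quoted percentages. That would be consistent with the percentages having been calculated from unrounded metrics, though the abstract does not say so.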
CONCLUSIONS: CTD effectively bridges the gap between unstructured clinical reasoning and structured risk prediction. By distilling LLM knowledge into fixed-dimensional embeddings, this framework captures semantic and temporal patterns that traditional coding misses. This offers a scalable, interpretable approach for the early identification of high-risk pregnancies without the latency and cost of direct LLM inference.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR70
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Reproductive & Sexual Health