MACHINE LEARNING-BASED PREDICTION OF PULMONARY ARTERIAL HYPERTENSION USING REAL-WORLD EHR AND CLAIMS DATA
Author(s)
Vikash Kumar Verma, MBA, PharmD1, Louis Brooks Jr, MS2, Marissa Seligman, PharmD3, Abhimanyu Roy, MBA4, Abhinav Nayyar, MBA, MBBS5, Ankitkumar Arora, MPharm6, Anuj Gupta, MSc7, Kavita Karayat, Other8, Vishan Khatavkar, MBA9, Aakash Singh, Other10, Srishti Motila, Other7, Pankaj Bhardwaj, MBA, RPh9, Riddhi Markan, BA, MSc11, Gargi Mahashay, BTech5.
1Optum Lifesciences, Boston, MA, USA, 2Optum, Bloomsbury, NJ, USA, 3Optum, Winchester, MA, USA, 4Optum, Gurgaon, India, 5Optum Life Sciences, Gurugram, India, 6Optum Global Solutions, Gurgaon, India, 7Optum Lifesciences, Noida, India, 8Optum Lifesciences, Noida, India, 9Optum Lifesciences, Gurugram, India, 10Optum Lifesciences, Gurugram, India, 11Optum Global Solutions, Gurugram, India.
OBJECTIVES: Pulmonary arterial hypertension (PAH) is a rare, progressive condition frequently diagnosed late due to nonspecific symptoms and reliance on invasive testing, resulting in delayed treatment and poor outcomes. This study applied machine learning (ML) techniques to real‑world clinical and claims data to predict PAH onset earlier, identify key risk factors, and support proactive clinical intervention aimed at improving survival and reducing healthcare burden.
METHODS: A retrospective analysis was conducted using Optum® Market Clarity data (January 2020 to June 2025). Patients with confirmed PAH (n=1,026) were identified, with the index date defined as the initial PAH diagnosis. Continuous enrollment for four years pre‑index was required to capture baseline comorbidities, diagnostic activity, and healthcare utilization. A matched control group without PAH was constructed using propensity score matching on age, gender, race, and Charlson Comorbidity Index. Data were split 80:20 into training and testing sets. Logistic Regression, Random Forest, and XGBoost models were developed to predict PAH risk. Performance was evaluated using F1 scores and area under the curve (AUC). Logistic regression was further used to estimate odds ratios (ORs) for significant predictors.
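As an illustration of the evaluation step described above, the following is a minimal sketch of an 80:20 split and of the two reported metrics, written in plain Python. It is not the authors' pipeline: the function names, the random seed, the 0.5 classification threshold, and the toy labels and scores are all assumptions made for the example, and AUC is taken here to mean the area under the ROC curve.

```python
import random


def split_80_20(records, seed=0):
    """Shuffle a copy of the records and split it 80% train / 20% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = PAH case, 0 = matched control.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]        # predicted risk from any model
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

train, test = split_80_20(range(10))
print(len(train), len(test))
print(f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

On this toy input the F1 score is 2/3 and the AUC is 8/9. F1 depends on the chosen threshold, whereas AUC depends only on how the scores rank cases against controls, which is one reason the two metrics can order models differently.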
RESULTS: ML models demonstrated strong predictive performance, with F1 scores of 77% (Logistic Regression), 79% (XGBoost), and 79% (Random Forest). Corresponding AUC values were 0.83, 0.81, and 0.82, respectively. Logistic regression identified several statistically significant predictors of PAH onset, including ECG abnormalities (OR=3.91), non-ST elevation myocardial infarction (OR=3.56), and dyspnea (OR=3.51), underscoring the contribution of cardiovascular and respiratory symptoms to earlier risk stratification.
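For readers less familiar with how odds ratios come out of a logistic regression, the sketch below shows the standard relationship OR = exp(beta), with a Wald 95% confidence interval of exp(beta ± 1.96 × SE). The three OR values are taken from the results above; the standard error is a purely hypothetical placeholder, since the abstract reports neither standard errors nor confidence intervals, and the function name is an assumption for illustration.

```python
import math


def odds_ratio_from_coef(beta, se, z=1.96):
    """Convert a logistic regression coefficient and its standard error
    into (odds ratio, lower bound, upper bound) of a Wald interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Odds ratios as reported in the abstract.
reported_ors = {
    "ECG abnormalities": 3.91,
    "Non-ST elevation myocardial infarction": 3.56,
    "Dyspnea": 3.51,
}

# HYPOTHETICAL standard error, used only to show the interval arithmetic.
HYPOTHETICAL_SE = 0.20

for predictor, or_value in reported_ors.items():
    beta = math.log(or_value)  # coefficient implied by the reported OR
    est, lower, upper = odds_ratio_from_coef(beta, HYPOTHETICAL_SE)
    print(f"{predictor}: beta={beta:.3f}, OR={est:.2f}, "
          f"illustrative 95% CI=({lower:.2f}, {upper:.2f})")
```

An OR of 3.91 means the odds of a subsequent PAH diagnosis are roughly four times higher when the predictor is present, holding the other model covariates fixed; it is not a fourfold increase in probability.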
CONCLUSIONS: ML‑powered models applied to real‑world data show strong potential for early PAH identification, highlighting key predictors that may aid in risk stratification before clinical deterioration occurs. Early detection tools embedded into clinical workflows could reduce diagnostic delays and improve outcomes. Future work will focus on external validation, clinical usability, and integration into decision‑support systems.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR48
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Cardiovascular Disorders (including MI, Stroke, Circulatory), SDC: Rare & Orphan Diseases