MACHINE LEARNING-BASED PREDICTION OF PULMONARY ARTERIAL HYPERTENSION USING REAL-WORLD EHR AND CLAIMS DATA

Author(s)

Vikash Kumar Verma, MBA, PharmD1, Louis Brooks Jr, MS2, Marissa Seligman, PharmD3, Abhimanyu Roy, MBA4, Abhinav Nayyar, MBA, MBBS5, Ankitkumar Arora, MPharm6, Anuj Gupta, MSc7, Kavita Karayat, Other8, Vishan Khatavkar, MBA9, Aakash Singh, Other10, Srishti Motila, Other7, Pankaj Bhardwaj, MBA, RPh9, Riddhi Markan, BA, MSc11, Gargi Mahashay, BTech5.
1Optum Lifesciences, Boston, MA, USA, 2Optum, Bloomsbury, NJ, USA, 3Optum, Winchester, MA, USA, 4Optum, Gurgaon, India, 5Optum Life Sciences, Gurugram, India, 6Optum Global Solutions, Gurgaon, India, 7Optum Lifesciences, Noida, India, 8Optum Lifesciences, Noida, India, 9Optum Lifesciences, Gurugram, India, 10Optum Lifesciences, Gurugram, India, 11Optum Global Solutions, Gurugram, India.
OBJECTIVES: Pulmonary arterial hypertension (PAH) is a rare, progressive condition frequently diagnosed late due to nonspecific symptoms and reliance on invasive testing, resulting in delayed treatment and poor outcomes. This study applied machine learning (ML) techniques to real‑world clinical and claims data to predict PAH onset earlier, identify key risk factors, and support proactive clinical intervention aimed at improving survival and reducing healthcare burden.
METHODS: A retrospective analysis was conducted using Optum® Market Clarity data (January 2020 to June 2025). Patients with confirmed PAH (n=1,026) were identified, with the index date defined as the date of the initial PAH diagnosis. Continuous enrollment for four years pre‑index was required to capture baseline comorbidities, diagnostic activity, and healthcare utilization. A matched control group without PAH was constructed using propensity score matching on age, gender, race, and Charlson Comorbidity Index. Data were split 80:20 into training and testing sets. Logistic Regression, Random Forest, and XGBoost models were developed to predict PAH risk. Performance was evaluated using F1 scores and area under the receiver operating characteristic curve (AUC). Logistic regression was also used to estimate odds ratios (ORs) for significant predictors.
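The matching and splitting steps described in the methods can be illustrated with a minimal sketch. It assumes that propensity scores have already been estimated (for example, from age, gender, race, and Charlson Comorbidity Index) and uses greedy 1:1 nearest-neighbour matching without replacement and a caliper of 0.05. The abstract does not state the matching ratio, the matching algorithm, or a caliper, so these choices, the function names `greedy_match` and `split_80_20`, and the toy scores are all hypothetical.

```python
import random


def greedy_match(case_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    case_scores and control_scores map a patient id to a propensity
    score estimated elsewhere (assumption: scores are precomputed).
    Returns a dict {case_id: control_id} for cases matched within the caliper.
    """
    available = dict(control_scores)  # controls not yet used
    pairs = {}
    for case_id in sorted(case_scores):  # fixed order keeps the result deterministic
        if not available:
            break
        score = case_scores[case_id]
        # Closest remaining control; ties are broken by control id.
        best = min(available, key=lambda c: (abs(available[c] - score), c))
        if abs(available[best] - score) <= caliper:
            pairs[case_id] = best
            del available[best]  # each control is used at most once
    return pairs


def split_80_20(ids, seed=0):
    """Shuffle ids reproducibly and split them 80:20 into train and test."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)
    cut = int(0.8 * len(ids))
    return ids[:cut], ids[cut:]


# Toy propensity scores (illustrative values only, not study data).
cases = {"c1": 0.30, "c2": 0.62, "c3": 0.95}
controls = {"k1": 0.29, "k2": 0.33, "k3": 0.60, "k4": 0.10}

pairs = greedy_match(cases, controls)
train_ids, test_ids = split_80_20([f"p{i}" for i in range(10)])
```

In this toy example, `c3` stays unmatched because no control lies within the caliper. Greedy matching is order-dependent, so an optimal-matching routine may be preferable when the control pool is small.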
RESULTS: ML models demonstrated strong predictive performance, with F1 scores of 77% (Logistic Regression), 79% (XGBoost), and 79% (Random Forest). The corresponding AUC values were 0.83, 0.81, and 0.82, respectively. Logistic regression identified several statistically significant predictors of PAH onset, including ECG abnormalities (OR=3.91), non-ST elevation myocardial infarction (OR=3.56), and dyspnea (OR=3.51), underscoring the contribution of cardiovascular and respiratory findings to earlier risk stratification.
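For readers less familiar with the reported metrics, the sketch below shows how an F1 score, an AUC, and an odds ratio can be computed in plain Python. It is not the study's code: the labels and scores are toy values, the 0.5 classification threshold is an assumption, and the helper names `f1_score`, `roc_auc`, and `odds_ratio` are hypothetical. The odds ratio helper relies on the standard relationship that a logistic regression coefficient β corresponds to OR = exp(β).

```python
import math


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = PAH)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def odds_ratio(beta):
    """Convert a logistic regression coefficient to an odds ratio."""
    return math.exp(beta)


# Toy data (illustrative only): true labels and predicted risk scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
beta_ecg = math.log(3.91)  # coefficient implied by the reported OR of 3.91
```

F1 depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two metrics can order models differently, as they do in the results above.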
CONCLUSIONS: ML‑powered models applied to real‑world data show strong potential for early PAH identification, highlighting key predictors that may aid in risk stratification before clinical deterioration occurs. Early detection tools embedded into clinical workflows could reduce diagnostic delays and improve outcomes. Future work will focus on external validation, clinical usability, and integration into decision‑support systems.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR48

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Cardiovascular Disorders (including MI, Stroke, Circulatory), SDC: Rare & Orphan Diseases
