Developing a Feature Selection Workflow for Variable-Rich Data: A Case Study Utilizing Claims Data to Build Classifiers for the Prediction of Opioid Use Disorder Among Persons Authorized to Purchase Medical Cannabis

Author(s)

Allen M. Smith, PharmD, Horacio Gomez-Acevedo, PhD, Corey J. Hayes, MPH, PharmD, PhD, Melody Greer, PhD, Bradley C. Martin, RPh, PharmD, PhD;
University of Arkansas for Medical Sciences, Little Rock, AR, USA
OBJECTIVES: In predictive analytics, high-dimensional (variable-rich) data promote overfitting and make prediction tasks more difficult because of data sparsity and increased computational complexity. A feature selection workflow was developed and applied to high-dimensional data to select features for opioid use disorder (OUD) risk prediction.
METHODS: Arkansas claims between November 2018 and December 2023 were linked to medical cannabis authorization data. Features were derived from demographics, healthcare expenditure and utilization characteristics, acute and chronic comorbidities grouped by Clinical Classifications Software Refined (CCSR), and prescription characteristics grouped by First Databank (FDB) therapeutic class. Features were labeled “prognostic” of OUD if supported by prior literature. First, CCSR- and FDB-derived features with <30 observations were combined into larger, clinically relevant groupings. Acute and chronic features were then evaluated separately using Spearman correlation matrices and similarity score-derived dendrograms to identify clinically relevant feature groupings, which were either combined with other features or reduced to a single dimension by principal component analysis (PCA). Random forest-derived feature importance scores (FIS) and Cox proportional hazards-derived p-values were then calculated and visually inspected to determine a cut-off point retaining ≥60% of prognostic features. Features retained by either strategy were included in the final feature space.
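The correlation-based grouping and PCA step in METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the function names, and the use of connected components on |Spearman rho| as a stand-in for dendrogram inspection are all assumptions. The 0.2 threshold echoes the Spearman value reported in RESULTS, but the abstract does not specify how it was combined with the similarity score.

```python
import numpy as np

def average_ranks(x):
    """Ranks (1-based) with ties sharing their average rank, as Spearman requires."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest rank held by each unique value
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman_matrix(X):
    """Spearman correlation = Pearson correlation of the column-wise ranks."""
    ranks = np.column_stack([average_ranks(X[:, j]) for j in range(X.shape[1])])
    return np.corrcoef(ranks, rowvar=False)

def correlated_groups(rho, threshold):
    """Hypothetical grouping rule: link features whose |rho| exceeds the threshold
    and return each linked set with more than one member."""
    unseen = set(range(rho.shape[0]))
    groups = []
    while unseen:
        stack, group = [unseen.pop()], []
        while stack:
            i = stack.pop()
            group.append(i)
            linked = {j for j in unseen if abs(rho[i, j]) > threshold}
            unseen -= linked
            stack.extend(linked)
        if len(group) > 1:
            groups.append(sorted(group))
    return sorted(groups)

def first_principal_component(X):
    """Project standardized columns onto their first principal component.
    Assumes no column is constant (non-zero standard deviation)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s[0] ** 2 / np.sum(s ** 2)      # share of variance on component 1
    return Z @ Vt[0], explained

# Toy data (assumption): three features driven by one latent factor, two unrelated.
rng = np.random.default_rng(0)
n = 2000
latent = rng.normal(size=n)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=n),
    latent + 0.1 * rng.normal(size=n),
    2 * latent + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])

groups = correlated_groups(spearman_matrix(X), threshold=0.2)
scores, explained = first_principal_component(X[:, groups[0]])
print(groups, round(float(explained), 3))
```

In this sketch each identified group is replaced by a single component score; in the workflow described above, whether a group was combined or sent to PCA was a clinical judgment rather than an automatic rule.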
RESULTS: A total of 569 features were initially derived. Collapsing features with <30 observations into larger groupings reduced the feature count to 458. Spearman correlation values >0.2 and similarity scores <0.95 identified 24 groupings for feature combination and 24 groupings for PCA, further reducing the count to 344. The random forest approach preserved 26 (60.47%) prognostic features (FIS value range: 0.0026-0.0432), while the Cox approach preserved 32 (74.42%) prognostic features (p-value range: 0.6786-0.0452), yielding a final feature space of 174 features.
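The cut-off step can be sketched in a few lines. The abstract describes visual inspection of the scores; the function below instead picks the strictest cut-off that still keeps at least 60% of prognostic features, which is an assumption made for illustration. The feature names and score values are hypothetical toy data.

```python
import math

def strictest_cutoff(scores, prognostic, min_share=0.60, higher_is_better=True):
    """Return (cutoff, retained features) for the strictest cut-off that keeps
    at least `min_share` of the prognostic features."""
    sign = 1.0 if higher_is_better else -1.0
    # Prognostic scores ordered from most to least favorable.
    ranked = sorted((sign * scores[f] for f in prognostic if f in scores), reverse=True)
    if not ranked:
        raise ValueError("no prognostic feature has a score")
    needed = math.ceil(min_share * len(ranked))
    threshold = ranked[needed - 1]
    retained = {f for f, s in scores.items() if sign * s >= threshold}
    return sign * threshold, retained

# Hypothetical toy scores: importance (higher is better) and p-values (lower is better).
rf_importance = {"a": 0.05, "b": 0.04, "c": 0.03, "d": 0.02, "e": 0.01, "f": 0.001, "g": 0.0005}
cox_p_value = {"a": 0.90, "b": 0.01, "c": 0.20, "d": 0.03, "e": 0.04, "f": 0.10, "g": 0.95}
prognostic = {"a", "c", "e", "f"}

rf_cut, rf_kept = strictest_cutoff(rf_importance, prognostic)
cox_cut, cox_kept = strictest_cutoff(cox_p_value, prognostic, higher_is_better=False)

# A feature enters the final space if either strategy retains it.
final_features = rf_kept | cox_kept
print(rf_cut, cox_cut, sorted(final_features))
```

The reported percentages (26 = 60.47%, 32 = 74.42%) imply 43 prognostic features, and `math.ceil(0.60 * 43)` is 26, the number the random forest approach preserved.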
CONCLUSIONS: A feature selection workflow that combines clinical expertise with a comprehensive, sequential dimensionality reduction approach is an effective way to reduce dimensionality while maintaining a clinically meaningful feature space.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR2

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)
