Developing a Feature Selection Workflow for Variable-Rich Data: A Case Study Utilizing Claims Data to Build Classifiers for the Prediction of Opioid Use Disorder Among Persons Authorized to Purchase Medical Cannabis
Author(s)
Allen M. Smith, PharmD, Horacio Gomez-Acevedo, PhD, Corey J. Hayes, MPH, PharmD, PhD, Melody Greer, PhD, Bradley C. Martin, RPh, PharmD, PhD
University of Arkansas for Medical Sciences, Little Rock, AR, USA
OBJECTIVES: Predictive models trained on high-dimensional (variable-rich) data are prone to overfitting, and prediction tasks become more difficult because of data sparsity and increased computational complexity. A feature selection workflow was developed and applied to high-dimensional data to select features for opioid use disorder (OUD) risk prediction.
METHODS: Arkansas claims from November 2018 through December 2023 were linked to medical cannabis authorization data. Features were derived from demographics, healthcare expenditure and utilization characteristics, acute and chronic comorbidities grouped by Clinical Classifications Software Refined (CCSR), and prescription characteristics grouped by First Databank (FDB) therapeutic class. Features were labeled “prognostic” of OUD if supported by prior literature. First, CCSR- and FDB-derived features with <30 observations were combined into larger, clinically relevant groupings. Acute and chronic features were then evaluated separately using Spearman correlation matrices and similarity score-derived dendrograms to identify clinically relevant feature groupings, which were either combined with other features or reduced to a single dimension by principal component analysis (PCA). Random forest-derived feature importance scores (FIS) and Cox proportional hazards-derived p-values were then calculated and visually inspected to determine a cut-off point retaining ≥60% of prognostic features. Features retained by either strategy were included in the final feature space.
RESULTS: A total of 569 features were initially derived. Collapsing features with <30 observations into larger groupings reduced the feature count to 458. Spearman correlation values >0.2 and similarity scores <0.95 identified 24 groupings for feature combination and 24 groupings for PCA, reducing the feature count to 344. The random forest approach preserved 26 (60.47%) prognostic features (FIS value range: 0.0026-0.0432), while the Cox approach preserved 32 (74.42%) prognostic features (p-value range: 0.6786-0.0452), yielding a final feature space of 174 features.
CONCLUSIONS: A feature selection workflow that combines clinical expertise with a comprehensive, sequential dimensionality reduction approach is an effective way to reduce dimensionality while maintaining a clinically meaningful feature space.
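ILLUSTRATIVE SKETCH: The final selection step described in METHODS (choose a cut-off that retains ≥60% of prognostic features under each strategy, then keep features retained by either strategy) can be sketched in Python as below. This is a minimal sketch, not the authors' implementation: the abstract states that cut-offs were determined by visual inspection, whereas this example assumes the strictest cut-off meeting the 60% target is chosen automatically. The function names, feature names, scores, and p-values are all hypothetical.

```python
from math import ceil


def cutoff_retaining(scores, prognostic, min_share=0.60, higher_is_better=True):
    """Strictest cut-off that still keeps >= min_share of the prognostic features.

    Assumption: the source chose cut-offs by visual inspection; this helper
    only automates the ">= 60% of prognostic features" constraint.
    """
    # Rank the prognostic features from strongest to weakest evidence.
    ranked = sorted((scores[f] for f in prognostic), reverse=higher_is_better)
    needed = ceil(min_share * len(ranked))  # minimum count to reach the share
    return ranked[needed - 1]


def select(scores, cutoff, higher_is_better=True):
    """Keep every feature (prognostic or not) that passes the cut-off."""
    if higher_is_better:
        return {f for f, s in scores.items() if s >= cutoff}
    return {f for f, s in scores.items() if s <= cutoff}


# Hypothetical toy inputs: importance scores (higher is better) and
# p-values (lower is better) for five made-up features.
fis = {"a": 0.05, "b": 0.03, "c": 0.01, "d": 0.002, "e": 0.04}
pvals = {"a": 0.20, "b": 0.01, "c": 0.03, "d": 0.70, "e": 0.50}
prognostic = {"a", "c", "d"}  # features flagged from prior literature

rf_cut = cutoff_retaining(fis, prognostic)
rf_keep = select(fis, rf_cut)

cox_cut = cutoff_retaining(pvals, prognostic, higher_is_better=False)
cox_keep = select(pvals, cox_cut, higher_is_better=False)

# Features retained by either strategy form the final feature space.
final_features = rf_keep | cox_keep
print(sorted(final_features))
```

Taking the union rather than the intersection means a feature only needs to pass one of the two criteria, which favors retaining prognostic features over maximal reduction. As a consistency check on the reported figures, 26 and 32 retained features corresponding to 60.47% and 74.42% both imply 43 prognostic features in total, and 26 is the smallest count that reaches 60% of 43.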
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR2
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)