Comparisons of Encoding Techniques for Categorical Features in Linear Regression Models
Author(s)
Sun W, Cai Y, Liu Y
IQVIA, Philadelphia, PA, USA
OBJECTIVES: Healthcare research data contain many types of categorical variables, such as race, country, and zip code, whose number of levels ranges from low to high. These categorical features must be encoded in numeric form before machine learning algorithms can be applied, so it is critical to find an encoding method that is suitable for both coefficient estimation and prediction. In this study, we investigate three commonly used encoding methods and compare their performance in coefficient estimation, feature selection, and prediction in linear regression analysis.
METHODS: We compare label encoding, one-hot encoding, and target leave-one-out encoding for categorical variables with a low (n_level=5) and a high (n_level=50) number of levels, under balanced and unbalanced synthetic data designs. We apply three machine learning algorithms (ordinary least squares (OLS), Bayesian ridge, and logistic regression) to datasets from regression and binary classification settings.
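As an illustration of the three encodings compared above, the following is a minimal pure-Python sketch, not the authors' implementation. The function names, the toy data, and the fallback to the global mean for a level that occurs only once are assumptions made for this example; the abstract does not specify these details.

```python
from collections import defaultdict


def label_encode(categories):
    """Map each distinct level to an integer, using sorted level order."""
    mapping = {level: i for i, level in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories]


def one_hot_encode(categories):
    """Return one 0/1 indicator column per distinct level (sorted order)."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]


def target_loo_encode(categories, targets):
    """Replace each row's level with the mean target of the OTHER rows
    that share the same level (leave-one-out target encoding)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            # Exclude the current row's own target from its level mean.
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            # Assumption: a level seen only once falls back to the global mean.
            encoded.append(global_mean)
    return encoded


# Hypothetical toy data: one categorical feature and a continuous outcome.
race = ["A", "B", "A", "C", "B", "A"]
outcome = [1.0, 2.0, 3.0, 4.0, 6.0, 5.0]

labels = label_encode(race)
one_hot = one_hot_encode(race)
loo = target_loo_encode(race, outcome)
```

Label encoding yields a single column but imposes an arbitrary ordering on the levels, one-hot encoding yields one column per level (50 columns when n_level=50), and leave-one-out target encoding yields a single column whose values depend on the outcome of the other rows.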
RESULTS: In the low-level settings (n_level=5) with continuous outcomes, all three methods identify the true important features, and target encoding achieves the smallest mean absolute error (MAE) for both coefficient estimation and prediction. For binary classification, label encoding fails to detect the true features, and its prediction accuracy is around 50%. In the OLS regression scenario, the coefficients derived from one-hot encoding shift far from the true values, especially in the unbalanced settings. In the high-level settings (n_level=50), target encoding outperforms the other two methods, with the smallest prediction MAE and stable coefficient estimation and feature selection. One-hot encoding has a relatively low prediction MAE; however, it cannot identify the true important features in the unbalanced settings, and its coefficient estimates are not stable. Label encoding is unable to identify the true important features and has the largest prediction MAE.
CONCLUSIONS: Target leave-one-out encoding outperforms the other traditional methods in both coefficient estimation and prediction performance, for categorical variables with both low and high numbers of levels.
Conference/Value in Health Info
Value in Health, Volume 25, Issue 6, S1 (June 2022)
Code
MSR14
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas