Development of an Algorithm to Identify the Type of Diabetes in the French Administrative Health Care Database “Système National Des Données De Santé” (SNDS)
Author(s)
Bretin O1, Casarotto E1, Bessou A1, Maurel F1, Serusclat P2, Joubert M3, Fagherazzi G4, Berteau C5, Pouyet A6, Maillard C7
1IQVIA France, Courbevoie, France, 2Groupe Hospitalier Mutualiste Les Portes du Sud, Venissieux, France, 3CHU de Caen, Caen, France, 4Paris south Paris Saclay University, Villejuif, France, 5Roche Diabetes Care France, Meylan, 38, France, 6TIMKL, Montbonnot Saint Martin, France, 7IQVIA Opérations France, La Défense, France
OBJECTIVES: The French administrative health care database (SNDS), which covers 99% of the French population, is a powerful tool for epidemiological and pharmacoeconomic studies on diabetes. However, its lack of clinical information makes it difficult to identify the type of diabetes accurately. The objective was to develop an accurate machine learning algorithm to determine the type of diabetes in the SNDS and to validate it through linkage with primary care clinical data.
METHODS: Electronic medical records (EMR) from a network of French general practitioners (GPs) were probabilistically linked with the SNDS. This linkage made it possible to build a population of patients with diabetes whose type of diabetes was retrieved from GP consultations. About 200 predictors were derived from SNDS data to help discriminate between type 1 diabetes (T1D) and type 2 diabetes (T2D). Several machine learning algorithms (penalized logistic regressions, random forest, XGBoost) were trained and tuned using 10-fold cross-validation on the training set. The best model was selected for its ability to predict T1D on the test set, as measured by the F1-score. Its performance was benchmarked against previously published algorithms applied to the test set.
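As an illustration of the 10-fold cross-validation step, the minimal sketch below splits a strongly imbalanced set of labels into 10 folds while keeping the class mix roughly constant in each fold. Stratification, the function name `stratified_folds`, the seed, and the toy labels are all assumptions made for illustration; the abstract does not describe how the folds were constructed.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Return k lists of indices, each with roughly the same class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy labels: 1 = T1D (4%), 0 = T2D. Hypothetical data, not the study cohort.
labels = [1] * 40 + [0] * 960
folds = stratified_folds(labels, k=10)

for held_out in folds:
    held_out_set = set(held_out)
    train_idx = [i for i in range(len(labels)) if i not in held_out_set]
    # A candidate model would be fitted on train_idx and scored on held_out here.
```

With a minority class of about 4%, unstratified folds could by chance contain very few T1D cases, which would make a per-fold F1-score for T1D unstable; this is the motivation for the stratified split assumed here.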
RESULTS: A cohort of 40,774 people with diabetes was assembled, including 39,122 (95.9%) with T2D and 1,652 (4.1%) with T1D. A LASSO-penalized regression achieved the best performance (F1: 0.79 (T1D); precision: 84.6% (T1D), 98.9% (T2D); sensitivity: 73.8% (T1D), 99.4% (T2D)), outperforming Charbonnel’s decision tree (F1: 0.66) and Fuentes’s best retrained logistic regression (F1: 0.59).
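The F1-score is the harmonic mean of precision and sensitivity (recall), so the reported T1D figures can be cross-checked against each other. The short sketch below does this with the values given above; the function name `f1_from_precision_recall` is an illustrative choice and not part of the authors' code.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported T1D values for the LASSO model, expressed as fractions.
t1d_f1 = f1_from_precision_recall(0.846, 0.738)
print(round(t1d_f1, 2))  # 0.79
```

The result agrees with the reported F1 of 0.79 for T1D.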
CONCLUSIONS: Through an innovative linkage between the SNDS and EMR, we developed a high-performance classification model that outperforms existing published algorithms for identifying the type of diabetes in a large medico-administrative database. It can be reused by the scientific community to conduct epidemiological and pharmacoeconomic studies on each type of diabetes in the French population.
Conference/Value in Health Info
Value in Health, Volume 26, Issue 11, S2 (December 2023)
Code
MSR2
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Electronic Medical & Health Records
Disease
Diabetes/Endocrine/Metabolic Disorders (including obesity), No Additional Disease & Conditions/Specialized Treatment Areas