Scaling Responsible Healthcare Data Access With AI-Driven Sensitive Patient Variable Detection
Author(s)
Shane O'Meachair, PhD¹, Lucy Mosquera, BSc, MSc².
¹Aetion, Barcelona, Spain; ²Aetion, Ottawa, ON, Canada.
OBJECTIVES: Secondary use of health record data requires rigorous de-identification or anonymisation. A key step in this process is to identify which variables in the dataset are direct identifiers or indirect (also known as quasi-) identifiers. This is often a laborious process that requires significant expertise. The aim of this work is to leverage AI to identify variables in a dataset that could lead to patient re-identification. We also evaluate a method to assess confidence in these AI classifications.
METHODS: A pre-trained large language model (LLM) is used to classify each variable in a data dictionary containing variable names and brief descriptions. In-context learning is used to provide explanations and examples of variable types, leveraging established privacy guidelines (e.g., ISO/IEC 27559:2022). Each variable is assigned to one of three categories: Direct Identifier, Quasi-Identifier, or Other. To assess confidence in the outputs, we combine Sample Consistency with Monte Carlo Temperature (SC-MCT) to measure agreement across predictions at different temperature levels. Four datasets are assessed, including clinical trial and insurance claims data.
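The agreement measure described in METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `sc_mct`, the fixed temperature grid, and the rule-based `stub_classify` (a stand-in for a real LLM call) are all assumptions introduced for illustration. The abstract does not state how temperatures are chosen or how votes are aggregated; here a majority vote is taken over a fixed grid, and the consistency score is the share of votes that match the majority label.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

CATEGORIES = ("Direct Identifier", "Quasi-Identifier", "Other")

# Signature of a classifier: (variable name, description, temperature) -> category.
Classifier = Callable[[str, str, float], str]


def sc_mct(
    classify: Classifier,
    name: str,
    description: str,
    temperatures: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> Tuple[str, float]:
    """Classify one variable once per temperature and return
    (majority label, fraction of predictions agreeing with it)."""
    votes = Counter(classify(name, description, t) for t in temperatures)
    label, count = votes.most_common(1)[0]
    return label, count / len(temperatures)


def stub_classify(name: str, description: str, temperature: float) -> str:
    """Hypothetical stand-in for an LLM call; a real version would send a
    prompt with guideline-based explanations and examples to a model."""
    if name == "patient_name":
        return "Direct Identifier"
    if name == "zip3":
        # Simulate a prediction that flips at higher temperatures.
        return "Quasi-Identifier" if temperature < 0.7 else "Other"
    return "Other"


if __name__ == "__main__":
    for var, desc in [
        ("patient_name", "Full name of the patient"),
        ("zip3", "First three digits of postal code"),
    ]:
        print(var, sc_mct(stub_classify, var, desc))
```

A consistency score of 1.0 means every temperature produced the same label; lower scores indicate disagreement across temperatures.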
RESULTS: The LLM-based classifier achieves greater than 90% accuracy across all four datasets (range: 92%-100%). The majority of predictions have 100% agreement across temperature values. SC-MCT gives equal or improved accuracy compared with a fixed temperature value. In three of the four datasets, all misclassified variables have a sample consistency score below 100%. Prediction errors with high consistency were associated with very brief or ambiguous variable descriptions.
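One way to read the consistency finding is as a review rule: send any variable whose sample consistency is below 100% to a human reviewer, then check whether that rule would have caught every misclassification. The sketch below assumes this rule; it is not described in the abstract. The `Prediction` record, the `review_summary` helper, and the four toy rows are hypothetical and are not the study's data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Prediction:
    variable: str
    predicted: str      # majority label across temperatures
    truth: str          # expert-assigned reference label
    consistency: float  # share of predictions agreeing with the majority


def review_summary(preds: List[Prediction]) -> Dict[str, object]:
    """Summarize accuracy and how well a 'consistency < 100%' rule
    would have surfaced the misclassified variables."""
    errors = [p for p in preds if p.predicted != p.truth]
    return {
        "accuracy": (len(preds) - len(errors)) / len(preds),
        "flagged_for_review": [p.variable for p in preds if p.consistency < 1.0],
        "all_errors_flagged": all(p.consistency < 1.0 for p in errors),
        # Errors made with full agreement are the ones the rule would miss.
        "confident_errors": [p.variable for p in errors if p.consistency == 1.0],
    }


# Toy, made-up rows for illustration only.
TOY = [
    Prediction("patient_name", "Direct Identifier", "Direct Identifier", 1.0),
    Prediction("birth_year", "Quasi-Identifier", "Quasi-Identifier", 1.0),
    Prediction("site_code", "Other", "Quasi-Identifier", 0.6),
    Prediction("lab_value", "Other", "Other", 0.8),
]

summary = review_summary(TOY)
print(summary)
```

In this toy case the single error (`site_code`) is flagged, but so is a correct prediction (`lab_value`), which shows why a consistency score is a useful triage signal rather than a calibrated probability of error.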
CONCLUSIONS: An LLM-based classifier can give highly accurate results when classifying personally identifiable variables in health datasets. This can greatly increase the accessibility of privacy-enhancing technologies and facilitate safe sharing of healthcare data for important research purposes. Because of the risk of hallucination and error in LLMs, we also assessed confidence in the model outputs. SC-MCT does not provide well-calibrated uncertainty estimates but improves overall classification accuracy.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
RWD171
Topic
Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Data Protection, Integrity, & Quality Assurance, Health & Insurance Records Systems
Disease
No Additional Disease & Conditions/Specialized Treatment Areas