Natural Language Processing (NLP)-Based Detection of Transgender and Gender Non-Conforming Patients in Electronic Health Record (EHR)-Derived Data
Author(s)
Hooley I, Maignan K, Ngai D, Ackerman B
Flatiron Health, New York, NY, USA
OBJECTIVES: Transgender and gender non-conforming (GNC) individuals are known to have poorer healthcare outcomes than their cisgender peers. However, studying this population in EHR-derived data is challenging because gender identity indicators (e.g., ICD codes, identity status drop-downs) are not reliably populated. We sought to address this by developing an NLP-based approach to detect transgender and GNC patients in a real-world dataset, benchmarking its performance against the use of ICD codes.
METHODS: We applied a four-step NLP framework to identify transgender and GNC patients: (1) locate more potential cases (n_NLP) than ICD codes alone (n_codes), (2) iteratively test combinations of phrases that detect transgender and GNC patients, (3) validate with manual chart abstraction, and (4) assess the model’s positive predictive value (PPV). We applied the framework to classify transgender and GNC status among 2.6 million patients in the nationwide Flatiron Health de-identified EHR-derived database.
RESULTS: Three iteration cycles produced an optimized NLP algorithm using 15 phrases. NLP classified more patients than ICD codes alone (n_NLP = 7,624, n_codes = 159, n_NLP ∩ n_codes = 84). Internal validation chart audits on a random subset estimated a PPV of 39% for NLP [95% CI 32-46%; n = 208], 61% for ICD codes [95% CI 43-76%; n = 38], and 100% for both combined [95% CI 82-100%; n = 19].
CONCLUSIONS: Using ICD codes to detect transgender and GNC patients may yield a higher PPV but can result in a non-representative sample because the codes are under-utilized. Our NLP-based approach could detect a larger, more representative sample of patients in EHR-derived datasets, albeit at the expense of a lower PPV. Further research should explore the sensitivity of this approach. Overall, NLP approaches can improve the detection of, and real-world research on, underserved populations such as transgender and GNC patients.
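The phrase-matching, cohort-overlap, and PPV steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the phrase list is hypothetical (the abstract does not name its 15 phrases), the function names are invented for this example, and the exact (Clopper-Pearson) interval is an assumed choice because the abstract does not say how its confidence intervals were computed.

```python
import math
import re

# Hypothetical phrase list for illustration; the abstract does not name its 15 phrases.
ILLUSTRATIVE_PHRASES = ["transgender", "gender dysphoria", "gender non-conforming"]
PHRASE_PATTERN = re.compile(
    "|".join(re.escape(p) for p in ILLUSTRATIVE_PHRASES), re.IGNORECASE
)


def flag_note(text: str) -> bool:
    """Return True if any illustrative phrase occurs in the note text."""
    return PHRASE_PATTERN.search(text) is not None


def overlap_counts(n_a: int, n_b: int, n_both: int) -> dict:
    """Split two overlapping cohorts into a-only, b-only, both, and union counts."""
    return {
        "a_only": n_a - n_both,
        "b_only": n_b - n_both,
        "both": n_both,
        "union": n_a + n_b - n_both,
    }


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(k: int, n: int, alpha: float = 0.05) -> tuple:
    """Clopper-Pearson (exact) interval for k successes in n trials, by bisection."""

    def bisect(is_above_root) -> float:
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if is_above_root(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    # Lower bound: smallest p with P(X >= k) >= alpha / 2.
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) >= alpha / 2)
    # Upper bound: largest p with P(X <= k) >= alpha / 2.
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) < alpha / 2)
    return lower, upper


# Cohort sizes reported in the abstract: NLP = 7,624, ICD codes = 159, both = 84.
print(overlap_counts(7624, 159, 84))

# A PPV of 100% with n = 19 implies 19 confirmed cases out of 19 audited charts.
low, high = exact_ci(19, 19)
print(f"PPV = {19 / 19:.0%}, 95% CI {low:.0%}-{high:.0%}")
```

Only the 19-of-19 audit is used here because it is the one case where the count of confirmed charts follows directly from the reported percentage and sample size. Under the exact-interval assumption its lower bound comes out near 82%, in line with the reported interval. The overlap arithmetic also implies 7,540 patients flagged by NLP only and 75 by ICD codes only.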
Conference/Value in Health Info
2021-05, ISPOR 2021, Montreal, Canada
Value in Health, Volume 24, Issue 5, S1 (May 2021)
Code
PNS98
Topic
Health Policy & Regulatory, Methodological & Statistical Research
Topic Subcategory
Confounding, Selection Bias Correction, Causal Inference, Health Disparities & Equity
Disease
No Specific Disease