Natural Language Processing (NLP)-Based Detection of Transgender and Gender Non-Conforming Patients in Electronic Health Record (EHR)-Derived Data


Hooley I, Maignan K, Ngai D, Ackerman B
Flatiron Health, New York, NY, USA


Transgender and gender non-conforming (GNC) individuals are known to have inferior healthcare outcomes compared to their cisgender peers. However, studying this population in EHR-derived data is challenging as gender identity indicators (e.g., ICD codes, identity status drop-downs) are not reliably populated. We sought to remedy this by developing an NLP-based approach to detect transgender and GNC patients in a real-world dataset, benchmarking model performance against the use of ICD codes.


We used an NLP framework to identify transgender and GNC patients. The framework had four steps: locate more potential cases (nNLP) than ICD codes alone (ncodes), iteratively test combinations of phrases that detect transgender and GNC patients, validate flagged cases with manual chart abstraction, and assess the model's positive predictive value (PPV). We applied the framework to predict transgender and GNC status among 2.6 million patients in the nationwide Flatiron Health de-identified EHR-derived database.
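
The phrase-matching step and the comparison against ICD codes can be illustrated with a minimal sketch. The phrase list, the patient IDs, the note text, and the `flag_by_phrases` helper are all hypothetical; the abstract does not list the phrases used in the study or describe its implementation.

```python
import re

# Hypothetical phrases for illustration only; the study's own phrase
# list is not given in the abstract.
PHRASES = ["transgender", "gender dysphoria", "gender non-conforming"]
PATTERN = re.compile("|".join(re.escape(p) for p in PHRASES), re.IGNORECASE)


def flag_by_phrases(notes):
    """Return the IDs of patients with at least one note matching any phrase."""
    return {
        pid
        for pid, texts in notes.items()
        if any(PATTERN.search(text) for text in texts)
    }


# Toy data: patient ID -> list of free-text notes (invented).
notes = {
    "p1": ["Patient identifies as transgender."],
    "p2": ["Follow-up for hypertension."],
    "p3": ["History of Gender Dysphoria noted."],
}
# Toy set of patients who carry a relevant ICD code (invented).
code_flagged = {"p3", "p4"}

nlp_flagged = flag_by_phrases(notes)   # analogue of nNLP
both = nlp_flagged & code_flagged      # analogue of nNLP ∩ ncodes
print(len(nlp_flagged), len(code_flagged), len(both))
```

A plain phrase match like this does not handle negation or context such as family history, which is one reason flagged charts still need manual review before a PPV can be estimated.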


Three iteration cycles produced an NLP algorithm that uses 15 phrases. NLP flagged more patients than ICD codes alone (nNLP = 7,624, ncodes = 159, nNLP ∩ ncodes = 84). Internal validation by chart audit on a random subset estimated a PPV of 39% for NLP [95% CI 32-46%; n = 208], 61% for ICD codes [95% CI 43-76%; n = 38], and 100% for both combined [95% CI 82-100%; n = 19].
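
A PPV and its confidence interval can be computed from chart-audit counts as in the sketch below. The abstract does not say which interval method was used; the exact (Clopper-Pearson) binomial interval here is an assumption, and the function names are hypothetical. The counts in the usage line (19 confirmed of 19 audited) follow from the reported 100% PPV with n = 19.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target, iters=100):
    """Bisection on [0, 1] for f(p) = target, where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def ppv_exact_ci(true_pos, audited, alpha=0.05):
    """PPV with an exact binomial interval (assumed method, not the authors')."""
    ppv = true_pos / audited
    # Lower bound: the p at which P(X >= true_pos) equals alpha / 2.
    lower = 0.0 if true_pos == 0 else _solve_decreasing(
        lambda p: binom_cdf(true_pos - 1, audited, p), 1 - alpha / 2
    )
    # Upper bound: the p at which P(X <= true_pos) equals alpha / 2.
    upper = 1.0 if true_pos == audited else _solve_decreasing(
        lambda p: binom_cdf(true_pos, audited, p), alpha / 2
    )
    return ppv, lower, upper


ppv, lower, upper = ppv_exact_ci(19, 19)
print(f"PPV = {ppv:.0%}, 95% CI {lower:.0%}-{upper:.0%}")
```

Under this assumed method, the lower bound for 19 of 19 comes out near 82%, in line with the interval reported for the combined group.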


Using ICD codes to detect transgender and GNC patients may yield a higher PPV but can result in a non-representative sample due to code under-utilization. Our NLP-based approach could detect a larger, more representative sample of patients in EHR-derived datasets, albeit at the expense of lower PPV. Further research should explore the sensitivity of this approach. Overall, NLP approaches can help improve the detection of, and real-world research on, underserved populations such as transgender and GNC patients.

Conference/Value in Health Info

2021-05, ISPOR 2021, Montreal, Canada

Value in Health, Volume 24, Issue 5, S1 (May 2021)




Health Policy & Regulatory, Methodological & Statistical Research

Topic Subcategory

Confounding, Selection Bias Correction, Causal Inference, Health Disparities & Equity


No Specific Disease
