IDENTIFYING SEXUAL BEHAVIOR FACTORS FOR INDIVIDUALS WITH HEPATITIS C VIRUS USING A LARGE LANGUAGE MODEL-BASED NATURAL LANGUAGE PROCESSING

Author(s)

Pilar Hernández Con, MD, MSCE¹, Daniel Paredes Pardo, MS², Chanakan Jenjai, PharmD, MS³, Ashley Stultz, PharmD¹, Shunhua YAN, MEd¹, Danielle Nelson, MD, MPH⁴, Jungjun Bae, MS¹, Khoa Nguyen, PharmD⁵, Yonghui Wu, PhD⁶, Haesuk Park, PhD¹;
¹University of Florida, Department of Pharmaceutical Outcomes and Policy, College of Pharmacy, Gainesville, FL, USA, ²University of Florida, Department of Health Outcomes and Biomedical Informatics, College of Medicine, Gainesville, FL, USA, ³University of Florida College of Pharmacy, Department of Pharmaceutical Outcomes and Policy, Gainesville, FL, USA, ⁴University of Florida, Gainesville, FL, USA, ⁵University of Florida, Department of Pharmacotherapy & Translational Research, Gainesville, FL, USA, ⁶University of Florida, Department of Health Outcomes & Biomedical Informatics, College of Medicine, Gainesville, FL, USA

OBJECTIVES: Hepatitis C virus (HCV) infection remains a public health concern in the U.S., with sexual behaviors reported as potential transmission routes in about one in four cases. Structured medical records often provide incomplete information on these behaviors. We aimed to evaluate the use of natural language processing to extract sexual behavior factors from unstructured clinical narratives.
METHODS: We analyzed unstructured clinical notes from the University of Florida Health electronic health records for individuals ≥18 years who were tested at least once for HCV between January 2016 and July 2023. We developed a list of keywords to identify sexual behavior factors, including sexual orientation/gender identity (e.g., men who have sex with men [MSM], same-sex relationships not MSM, transgender people) and high-risk sexual behaviors (e.g., anal sex, sex for compensation). Sentences containing these keywords were extracted for annotation. Annotation guidelines were developed and iteratively refined during training sessions, improving inter-annotator agreement from 66.1% to 90.4%. A GatorTron-based large language model (LLM) was trained on 70% of the annotated sentences, validated on 10%, and tested on 20%. Performance of concept extraction was evaluated using precision (positive predictive value), recall (sensitivity), and F1-score (the harmonic mean of precision and recall; a high F1-score indicates that the model balances precision and recall well).
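The keyword-based sentence extraction and the 70/10/20 split described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the keyword list, the naive sentence splitter, the random seed, and the function names `extract_candidate_sentences` and `split_70_10_20` are all assumptions made for the example.

```python
import random
import re

# Hypothetical keyword list; the study's actual list is not given in the abstract.
KEYWORDS = ["men who have sex with men", "msm", "same-sex", "transgender", "anal sex"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b", re.IGNORECASE
)


def split_sentences(note):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]


def extract_candidate_sentences(notes):
    """Keep only sentences that contain at least one keyword."""
    return [s for note in notes for s in split_sentences(note) if PATTERN.search(s)]


def split_70_10_20(items, seed=0):
    """Shuffle and split into train (70%), validation (10%), and test (remainder)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 7 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    notes = ["Patient is transgender. Denies fever.", "Reports MSM contact. No cough."]
    candidates = extract_candidate_sentences(notes)
    print(candidates)
    train, val, test = split_70_10_20(candidates)
    print(len(train), len(val), len(test))
```

In practice a clinical sentence splitter and a split stratified by concept would likely be preferable, given how few sentences some concepts have; the abstract does not say which approach was used.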
RESULTS: There were 6,092,972 clinical notes from 15,048 individuals tested for HCV. After annotation, we identified 76 sentences containing at least one concept for MSM, 231 sentences for transgender, 50 sentences for same-sex, and 314 sentences for high-risk sexual behaviors. Our model achieved robust performance for MSM (Precision=0.722, Recall=0.867, F1-score=0.788), transgender (Precision=0.915, Recall=0.915, F1-score=0.915), same-sex (Precision=0.800, Recall=1.00, F1-score=0.889), and high-risk sexual behaviors (Precision=0.844, Recall=0.794, F1-score=0.818).
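As a consistency check, the F1-scores can be recomputed from the reported precision and recall using the harmonic mean, F1 = 2PR / (P + R). The sketch below uses only the values reported above; the function name `f1_score` and the dictionary layout are assumptions for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the RESULTS section.
reported = {
    "MSM": (0.722, 0.867, 0.788),
    "transgender": (0.915, 0.915, 0.915),
    "same-sex": (0.800, 1.00, 0.889),
    "high-risk sexual behaviors": (0.844, 0.794, 0.818),
}

if __name__ == "__main__":
    for concept, (p, r, f1) in reported.items():
        print(f"{concept}: computed F1={f1_score(p, r):.3f}, reported F1={f1:.3f}")
```

The recomputed values agree with the reported F1-scores to three decimal places.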
CONCLUSIONS: Our findings suggest that a GatorTron-based LLM can extract concepts related to sexual behavior factors from clinical narratives of individuals tested for HCV with high precision and recall (F1-scores ranging from 0.788 to 0.915).

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

P20

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Infectious Disease (non-vaccine), STA: Personalized & Precision Medicine

Presentation (CTI)