Real-Time Identification of Infectious Disease Symptoms Through AI-Powered Text Mining From Anonymized Electronic Health Records

Speaker(s)

Yoshihara H1, Maeda H2, Hagiwara Y3, Sato D4, Kitajima K5, Iwata A5, Van de Velde N3, Igarashi A5
1University of Tokyo, Tokyo, Japan, 2University of Tokyo, Tokyo, Japan, 3Moderna, Inc., Cambridge, MA, USA, 4M3 eES Company, Tokyo, Japan, 5M3, Inc., Tokyo, Japan

OBJECTIVES: Although the COVID-19 pandemic has subsided, COVID-19 remains an infectious disease that requires caution. Meanwhile, Japan experienced an influenza outbreak during the winter of 2023, raising concerns about the simultaneous spread of multiple infectious diseases. In this study, we aimed to develop an AI-powered natural language processing algorithm that extracts epidemiological information, such as the predominant symptoms for each virus strain and vaccination data, in real time from unstructured text in electronic health record impression fields, which has traditionally been difficult to analyze. Such extraction would support public health measures and the formulation of diagnostic and treatment plans in clinical settings.

METHODS: We obtained 500 anonymized impression-field texts from the JAMDAS database, collected from 354 individuals who visited four outpatient fever clinics. The extraction items comprised infection-related symptoms (temperature, general, head, respiratory, throat, gastrointestinal, and sensory) and vaccination information. We developed an algorithm that combines in-context learning with an open-source large language model (LLM) and rule-based processing. The F1 score, sensitivity, and specificity were evaluated on 482 entries, excluding the 18 used as in-context learning examples.
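The three evaluation metrics named above can be computed per extraction item from binary gold and predicted labels. The following is a minimal sketch under that assumption; the function and class names (`evaluate`, `BinaryMetrics`) are illustrative and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    f1: float
    sensitivity: float
    specificity: float


def _safe_div(numerator: float, denominator: float) -> float:
    # Return 0.0 when the denominator is zero instead of raising.
    return numerator / denominator if denominator else 0.0


def evaluate(gold: Sequence[bool], pred: Sequence[bool]) -> BinaryMetrics:
    """Compute F1, sensitivity, and specificity for one extraction item.

    Assumption: each entry is labeled as a binary outcome (item present or absent).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)

    return BinaryMetrics(
        f1=_safe_div(2 * tp, 2 * tp + fp + fn),
        sensitivity=_safe_div(tp, tp + fn),  # true positive rate
        specificity=_safe_div(tn, tn + fp),  # true negative rate
    )


if __name__ == "__main__":
    # Toy labels for illustration only, not study data.
    gold = [True, True, True, False, False, False]
    pred = [True, True, False, True, False, False]
    print(evaluate(gold, pred))
```

Items with non-binary values, such as a numeric temperature, would first need a rule for what counts as a correct extraction; that rule is not specified in the abstract.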

RESULTS: The LLM (Llama 3 70B with few-shot prompting) demonstrated high accuracy, particularly for items expressed with a high degree of freedom, such as onset dates and vaccination information. By selectively applying the LLM or rule-based processing depending on the extraction item, we achieved F1 scores of >0.8 for all items and >0.9 for most items.
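Selecting between the LLM and rule-based processing per extraction item could be organized as a simple dispatch, sketched below. Everything here is an assumption made for illustration: the assignment of temperature to a regular-expression rule, the pattern itself, the few-shot examples, and the `llm` callable (a stand-in for whatever interface serves the model). The abstract does not state which items were handled by which method.

```python
import re
from typing import Callable, Optional, Union

# Hypothetical in-context examples: (impression text, expected answer).
FEW_SHOT_EXAMPLES = [
    ("Fever since 3 days ago. Received influenza vaccine in November.",
     "influenza vaccine, November"),
    ("Cough and sore throat. No vaccination history noted.",
     "none"),
]

# Assumed pattern: a two-digit temperature followed by a degree marker.
TEMP_PATTERN = re.compile(r"(\d{2}(?:\.\d)?)\s*(?:°C|℃|度)")


def extract_temperature(text: str) -> Optional[float]:
    """Rule-based extraction of a body temperature in degrees Celsius."""
    match = TEMP_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    # Discard values outside a plausible body-temperature range.
    return value if 34.0 <= value <= 43.0 else None


def build_few_shot_prompt(item: str, text: str) -> str:
    """Assemble a few-shot prompt from the examples plus the new text."""
    parts = [f"Extract the following item from the clinical note: {item}"]
    for example_text, answer in FEW_SHOT_EXAMPLES:
        parts.append(f"Note: {example_text}\nAnswer: {answer}")
    parts.append(f"Note: {text}\nAnswer:")
    return "\n\n".join(parts)


# Items handled by rules; all other items fall through to the LLM.
RULE_BASED_ITEMS = {"temperature": extract_temperature}


def extract(item: str, text: str,
            llm: Callable[[str], str]) -> Union[str, float, None]:
    """Use the rule for the item if one exists, otherwise prompt the LLM."""
    rule = RULE_BASED_ITEMS.get(item)
    if rule is not None:
        return rule(text)
    return llm(build_few_shot_prompt(item, text))
```

Passing the model as a callable keeps the dispatch testable without a running LLM, and moving an item between the two methods only requires editing `RULE_BASED_ITEMS`.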

CONCLUSIONS: We demonstrated that combining an LLM with rule-based processing makes it possible to extract epidemiological information related to infectious diseases with high accuracy from unstructured text in electronic health record impression fields.

Code

EPH179

Topic

Epidemiology & Public Health, Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Disease Classification & Coding, Public Health

Disease

Infectious Disease (non-vaccine), No Additional Disease & Conditions/Specialized Treatment Areas