Extraction of Pulmonary Function Test Results from Unstructured Clinical Notes using a Retrieval-Augmented Generation Approach on a Major Cloud Platform
Author(s)
Vikas Kumar, MSc, MD, Yan Wang, PhD, Lawrence Rasouliyan, MPH.
OMNY Health, Atlanta, GA, USA.
OBJECTIVES: To measure the accuracy, hallucination rate, and latency with which large language models (LLMs) can extract the ratio of forced expiratory volume in one second to forced vital capacity (FEV1/FVC) from relevant free-text clinical notes in electronic health records.
METHODS: Fifty note excerpts containing “FEV1/FVC” from the OMNY Health real-world data platform were randomly sampled and classified as “simple” (S; one FEV1/FVC score present), “no value” (NV; no scores present), or “complex” (C; multiple scores present). Two LLMs [Gemini-1.5-Flash (Flash) and Gemini-1.5-Pro (Pro)] available in Google BigQuery ML Studio were used to query the notes with the following prompt: “Extract the actual FEV1/FVC ratio from the following text. Text: <NOTE>.” The maximum output tokens, temperature, and top-P parameters were set to 4, 0.0, and 0, respectively. Results for the S and NV categories were annotated for accuracy and hallucinations by a team of domain experts following a standardized protocol. Results for the C category were evaluated qualitatively, and latency was measured for each model.
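For illustration only, the snippet below is a minimal sketch of how such an extraction could be issued from Python against BigQuery's ML.GENERATE_TEXT function with the stated prompt and decoding parameters. All project, dataset, table, column, and remote-model names are hypothetical placeholders rather than the actual OMNY Health resources, and the latency readout assumes the client library's QueryJob.slot_millis property.

# Minimal sketch: FEV1/FVC extraction via BigQuery ML.GENERATE_TEXT (hypothetical resource names).
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

sql = """
SELECT
  note_id,
  ml_generate_text_llm_result AS fev1_fvc_extraction
FROM ML.GENERATE_TEXT(
  MODEL `my-project.my_dataset.gemini_flash_remote_model`,   -- hypothetical remote Gemini model
  (
    SELECT
      note_id,
      CONCAT('Extract the actual FEV1/FVC ratio from the following text. Text: ',
             note_text) AS prompt
    FROM `my-project.my_dataset.pft_note_excerpts`            -- hypothetical excerpt table
  ),
  STRUCT(
    4    AS max_output_tokens,   -- parameters as stated in the abstract
    0.0  AS temperature,
    0.0  AS top_p,
    TRUE AS flatten_json_output  -- return plain text rather than raw JSON
  )
)
"""

job = client.query(sql)
for row in job.result():
    print(row.note_id, row.fev1_fvc_extraction)

# Slot-milliseconds, the latency proxy reported in the results, are exposed on the finished job.
print("slot ms:", job.slot_millis)

Capping output at 4 tokens and setting temperature to 0.0 constrains the models to a short, deterministic numeric answer, which is consistent with the single-value extraction task described above.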
RESULTS: The S, NV, and C categories comprised 30, 9, and 11 excerpts, respectively. For the S category, Flash and Pro models were 90.0% and 73.3% accurate, respectively. For the NV category, no hallucinations were observed. For the C category, the Flash model reported one of the FEV1/FVC values, while the Pro model recognized the presence of multiple values. Flash and Pro models used 6,210 and 23,576 slot milliseconds, respectively.
CONCLUSIONS: LLMs can extract FEV1/FVC values with 90% accuracy from simple note excerpts in an automated fashion at scale. The Flash model was more accurate on simple excerpts, while the Pro model was more verbose (limiting its utility when the maximum output token limit is low) yet more cognizant of multiple FEV1/FVC values. Potential future directions include exploring additional metrics beyond accuracy and latency, such as robustness across note complexity levels and model interpretability, while accounting for cost.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR63
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Respiratory-Related Disorders (Allergy, Asthma, Smoking, Other Respiratory)