From Unstructured Text to Enriched Data: Evaluating Large Language Models for Clinical Information Extraction From Echocardiogram Reports
Author(s)
Yu-Tung Huang, BPharm1, Sandy Hsu, MS1, Chia-Chang Wen, PhD2, Chih-Fan Yeh, MD, PhD3, Wen Li Kuan, RPh, MS1, Fang-Ju (Irene) Lin, RPh, PhD4.
1Graduate Institute of Clinical Pharmacy & School of Pharmacy, College of Medicine, National Taiwan University, Taipei, Taiwan, 2Information Technology Office, National Taiwan University Hospital, Taipei, Taiwan, 3Division of Cardiology, Department of Internal Medicine and Cardiovascular Center, National Taiwan University Hospital, Taipei, Taiwan, 4Department of Pharmacy, National Taiwan University Hospital, Taipei, Taiwan.
OBJECTIVES: Unstructured echocardiogram reports contain valuable diagnostic information, but their free-text format limits scalability for real-world evidence generation. While large language models (LLMs) show promise in clinical text processing, their performance in extracting cardiac features across models and prompts remains underexplored. This study therefore aimed to evaluate the performance of LLMs in extracting cardiac features from echocardiogram reports across different models and prompt formats.
METHODS: Three open-source LLMs (from small to large: LLaMA3.1:8B, Mistral-Small:24B, and LLaMA3.3:70B) were tested in a zero-shot setting to identify ten predefined cardiac features. A reference dataset with 30 positive and 20 negative cases per feature was curated from echocardiogram reports at National Taiwan University Hospital. Two prompt formats were compared: a shorter prompt with one classification question per feature and a longer prompt including follow-up questions on severity, segment, and mechanism. Performance was evaluated using accuracy, precision, recall, and F1-score.
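To make the zero-shot setup concrete, the following is a minimal sketch, assuming the three models are served locally through Ollama's REST API; the model tags, endpoint, prompt wording, and example feature are illustrative assumptions, not the study's exact protocol.

import json
import requests

# Assumed local Ollama endpoint; the abstract does not state the serving stack.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Ollama-style tags for the three models named in the abstract (assumed spellings).
MODELS = ["llama3.1:8b", "mistral-small:24b", "llama3.3:70b"]

# One illustrative cardiac feature; the study evaluates ten predefined features.
FEATURE = "chamber dilatation"

# Shorter format: one yes/no classification question per feature.
SHORT_PROMPT = (
    "You are reading an echocardiogram report.\n"
    "Report:\n{report}\n\n"
    f"Question: Does the report describe {FEATURE}? Answer 'yes' or 'no' only."
)

# Longer format: classification plus follow-up questions on severity, segment,
# and mechanism (hypothetical wording).
LONG_PROMPT = SHORT_PROMPT + (
    "\nIf yes, also state the severity, the affected segment or chamber, "
    "and the likely mechanism, each on its own line."
)

def classify(model: str, prompt: str, report: str) -> str:
    """Send one zero-shot prompt to a locally served model and return its reply."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt.format(report=report), "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

if __name__ == "__main__":
    # The report text below is a made-up stand-in, not study data.
    demo_report = "Left ventricle is dilated with mildly reduced systolic function."
    for model in MODELS:
        print(model, "->", classify(model, SHORT_PROMPT, demo_report))

Setting "stream": False returns the full completion as a single JSON object, which simplifies parsing the yes/no answer for downstream scoring.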
RESULTS: The small model (LLaMA3.1:8B) exhibited greater variability in accuracy (range: 0.70-1.00), whereas the large model (LLaMA3.3:70B) consistently scored above 0.90 for all cardiac features. Among the ten features, chamber dilatation (accuracy: 0.70; F1-score: 0.79) and leaflet or cusp abnormalities (accuracy: 0.76; F1-score: 0.75) showed the lowest performance. In LLaMA3.1:8B, the longer prompt yielded more features with accuracy and F1-score below 0.90 than the shorter prompt did. The largest decline was observed in chamber dilatation, with accuracy dropping from 0.92 to 0.70 and F1-score from 0.93 to 0.79. In contrast, performance remained stable across prompt formats in the mid-scale (Mistral-Small:24B) and large-scale (LLaMA3.3:70B) models.
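For reference, the reported scores follow the standard definitions of these metrics, computed per feature over the 50-case reference set (TP/TN/FP/FN denote true/false positives and negatives); the abstract does not restate them, so the formulas below are the conventional ones:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

Because each feature's reference set is imbalanced (30 positive vs. 20 negative cases), accuracy and F1-score can diverge, which is why both are reported.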
CONCLUSIONS: Open-source LLMs demonstrated strong utility in extracting information from unstructured echocardiogram reports. However, smaller models were more affected by prompt complexity and by variability in clinical descriptions. These results underscore the importance of tailoring LLM and prompt strategies to facilitate data enrichment for real-world evidence.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR113
Topic
Medical Technologies, Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Cardiovascular Disorders (including MI, Stroke, Circulatory), No Additional Disease & Conditions/Specialized Treatment Areas