Leveraging Large Language Models and EMR Data to Identify Undiagnosed Rare Diseases: A Hybrid AI Approach
Author(s)
Sandy Balkin, PhD
SVP Strategy & Analytics, Royalty Pharma, New York, NY, USA.
OBJECTIVES: To evaluate how large language models (LLMs) and our electronic medical record (EMR) data, provided by NextGen EMR, can be leveraged to identify patients with undiagnosed rare diseases, and to assess the effectiveness of hybrid AI approaches in extracting clinical phenotypes and prioritizing high-risk cases for follow-up.
METHODS: Four foundational LLMs were used to analyze structured EMR data from patients with genetically confirmed rare diseases to identify characteristic phenotype patterns. These patterns were then applied to the broader patient population, enabling LLMs and machine learning algorithms to screen for individuals with similar profiles and flag potential undiagnosed rare disease cases for further review.
RESULTS: LLM-driven analysis of structured EMR data identified characteristic phenotypes of genetically confirmed rare diseases. Applying these patterns to the full patient population flagged additional high-risk individuals, improving sensitivity and specificity over rule-based methods and enabling earlier identification for genetic evaluation.
CONCLUSIONS: LLM-based analysis of structured EMR data enables more accurate and scalable screening for rare diseases. Incorporating unstructured clinical data in the future could further enhance identification and support earlier diagnosis.
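ILLUSTRATIVE SKETCH: The screening workflow described in METHODS (derive a phenotype pattern from confirmed cases, apply it to the broader population, flag candidates, then measure sensitivity and specificity) can be sketched roughly as follows. This is not the study's method: a simple frequency-enrichment score stands in for the LLM and machine learning steps, and the function names, placeholder phenotype codes, patient IDs, and thresholds are all hypothetical, chosen only to make the flow concrete.

```python
from collections import Counter


def phenotype_profile(confirmed, population, min_ratio=2.0):
    """Return codes enriched in confirmed cases relative to the population.

    Stand-in for the pattern-extraction step (assumption: not the LLM approach).
    """
    n_conf, n_pop = len(confirmed), len(population)
    conf_counts = Counter(c for codes in confirmed.values() for c in set(codes))
    pop_counts = Counter(c for codes in population.values() for c in set(codes))
    profile = {}
    for code, count in conf_counts.items():
        conf_rate = count / n_conf
        # Add-one smoothing so a code absent from the population cannot divide by zero.
        pop_rate = (pop_counts[code] + 1) / (n_pop + 1)
        ratio = conf_rate / pop_rate
        if ratio >= min_ratio:
            profile[code] = ratio
    return profile


def score(codes, profile):
    """Sum the enrichment ratios of the profile codes a patient carries."""
    return sum(profile.get(c, 0.0) for c in set(codes))


def flag_patients(population, profile, threshold):
    """Return IDs of patients whose score meets the review threshold."""
    return {pid for pid, codes in population.items()
            if score(codes, profile) >= threshold}


def sensitivity_specificity(flagged, true_cases, all_ids):
    """Compute sensitivity and specificity against a labeled reference set."""
    negatives = all_ids - true_cases
    tp = len(flagged & true_cases)
    fn = len(true_cases - flagged)
    fp = len(flagged & negatives)
    tn = len(negatives - flagged)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: patient ID -> structured phenotype codes.
confirmed = {
    "c1": ["PHENO_A", "PHENO_B", "PHENO_C"],
    "c2": ["PHENO_A", "PHENO_B"],
    "c3": ["PHENO_A", "PHENO_D"],
}
population = {
    "p1": ["PHENO_A", "PHENO_B"],
    "p2": ["PHENO_D"],
    "p3": ["PHENO_E"],
    "p4": ["PHENO_D", "PHENO_E"],
    "p5": ["PHENO_A", "PHENO_B", "PHENO_C"],
    "p6": ["PHENO_E"],
}

profile = phenotype_profile(confirmed, population)
flagged = flag_patients(population, profile, threshold=2.0)

# Hypothetical reference labels, e.g. from later genetic evaluation.
true_cases = {"p1"}
sens, spec = sensitivity_specificity(flagged, true_cases, set(population))
print(sorted(flagged), sens, spec)
```

In a comparison like the one reported in RESULTS, the same sensitivity and specificity calculation would be run once for the rule-based flags and once for the hybrid approach's flags, against the same reference labels.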
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
RWD117
Topic
Methodological & Statistical Research, Patient-Centered Research, Real World Data & Information Systems
Disease
Rare & Orphan Diseases