Large Language Models and Case Reports: An Innovative Approach to Real-World Data for Rare Disease Natural History Analysis

Speaker(s)

Paek H1, Lee K2, Huang LC2, Annan A2, Rastergar-mojarad M1, Wang X2
1IMO Health, Rosemont, IL, USA, 2IMO Health, Rosemont, IL, USA

OBJECTIVES: Clinical trials for rare diseases face unique challenges, including small patient populations and a limited understanding of the natural history of the disease, which complicates the selection of clinical trial endpoints. Case reports often contain rich narratives with detailed clinical observations of individual patients, yet they remain underutilized as a source of real-world data (RWD). We aimed to develop a system that uses large language models (LLMs) to extract comprehensive clinical features of rare diseases from case reports and structure them into a computable format.

METHODS: We selected two use cases, Fabry disease and Immunoglobulin A nephropathy (IGAN), and collected full-text case reports from PubMed. Using 20 abstracts from each disease group, we developed an LLM-based case report processing system that extracts all clinical features described in a case report. We then evaluated the system both quantitatively and qualitatively on 50 case reports for each disease.
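As an illustration of what a "computable format" for extracted features could look like, the following is a minimal Python sketch. It is not the authors' schema: the class names, field names, and the `summarize_feature_counts` helper are hypothetical, and the abstract does not describe the actual data model. The sketch only shows how feature-value pairs could be stored per report and how per-report feature counts (mean and range) could be derived from them.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ClinicalFeature:
    """One extracted feature-value pair (hypothetical structure)."""
    category: str  # e.g. "demographics", "laboratory test result"
    name: str      # feature name as extracted from the narrative
    value: str     # value kept as text; units and typing are left out here


@dataclass
class CaseReport:
    """A single case report with its extracted features (hypothetical)."""
    report_id: str
    disease: str
    features: list[ClinicalFeature] = field(default_factory=list)


def summarize_feature_counts(reports: list[CaseReport]) -> dict[str, float]:
    """Return the mean, minimum, and maximum number of features per report."""
    counts = [len(report.features) for report in reports]
    return {"mean": mean(counts), "min": min(counts), "max": max(counts)}
```

Storing every value as text keeps the sketch short; a real pipeline would likely normalize units and map feature names to a controlled vocabulary, which the abstract does not detail.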

RESULTS: Our system extracted an average of 286 clinical features and corresponding values per report for Fabry disease (range: 129 to 452) and an average of 94 per report for IGAN (range: 67 to 127). These clinical features include patient demographics, disease characteristics such as diagnosis and genetic information, laboratory test results, comorbidities, treatment history, and outcomes. The system achieved precision, recall, and F1 scores of 0.9956, 0.9966, and 0.9961, respectively, for Fabry disease and 0.9835, 0.9736, and 0.9785, respectively, for IGAN. We also visualized the geographical location of each rare disease case using the first author’s affiliation.
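The reported F1 scores can be checked against the reported precision and recall. The short Python sketch below assumes that F1 is the harmonic mean of the pooled precision and recall; the abstract does not state how the scores were aggregated, so this is a consistency check under that assumption rather than a description of the authors' evaluation code. The function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the RESULTS paragraph
reported = {
    "Fabry disease": (0.9956, 0.9966, 0.9961),
    "IGAN": (0.9835, 0.9736, 0.9785),
}

for disease, (precision, recall, reported_f1) in reported.items():
    computed = round(f1_score(precision, recall), 4)
    print(f"{disease}: computed F1 = {computed}, reported F1 = {reported_f1}")
```

Under this assumption, the computed values round to the reported F1 scores for both diseases.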

CONCLUSIONS: Our study validates the potential of case reports as a source of RWD and demonstrates that LLMs can effectively extract clinical data from them. This approach can strengthen the generation of real-world evidence and improve our understanding of the natural history of rare diseases.

Code

RWD51

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Rare & Orphan Diseases