Using Large Language Models To Extract PD-L1 Testing Details From Electronic Health Records

Author(s)

Cohen AB, Waskom M, Adamson B, Kelly J, Amster G
Flatiron Health, New York, NY, USA

OBJECTIVES: The suitability of artificial intelligence (AI) and large language models (LLMs) for assisting in the curation of real-world data (RWD) from electronic health records (EHRs) for research is uncertain. PD-L1 biomarker testing guides cancer treatment decisions, but results are hard to access because lab reports are unstructured and require clinical expertise to interpret. Additionally, results vary by cancer type, and documentation patterns have changed over time. This study explored the ability of LLMs to rapidly identify PD-L1 biomarker details in the EHR.

METHODS: We applied open-source LLMs (Llama-2-7B and Mistral-v0.1-7B) to extract seven biomarker details relating to PD-L1 testing from the US nationwide Flatiron Health EHR-derived database: collection, receipt, and report dates; cell type; percent staining; combined positive score; and staining intensity. Two approaches were used: "zero-shot" experiments (no fine-tuning) exploring a range of prompts, and fine-tuning on manually curated answers from 500/1000/1500 documents. In both cases, we validated performance against 250 human-abstracted answers spanning >15 cancer types. We additionally compared performance on percent staining to a deep learning model (LSTM) baseline trained on >10,000 examples.
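To illustrate the zero-shot approach described above, a prompt asking the model to emit the seven details as structured output might be assembled as follows. This is a minimal sketch: the field names, prompt wording, and validation logic are hypothetical, not the study's actual prompts or schema.

```python
import json

# The seven PD-L1 testing details targeted for extraction (per the abstract);
# the exact field names here are illustrative assumptions.
FIELDS = [
    "collection_date", "receipt_date", "report_date",
    "cell_type", "percent_staining", "combined_positive_score",
    "staining_intensity",
]


def build_zero_shot_prompt(document_text: str) -> str:
    """Assemble a zero-shot prompt asking an LLM to return the PD-L1
    details as JSON. Wording is illustrative only."""
    schema = {field: None for field in FIELDS}
    return (
        "Extract the following PD-L1 testing details from the pathology "
        "report below. Respond only with JSON matching this schema, using "
        "null for any detail not documented:\n"
        f"{json.dumps(schema)}\n\n"
        f"Report:\n{document_text}"
    )


def parse_response(raw: str):
    """Validate a model response against the expected structure.
    Zero-shot outputs were frequently invalid in the study, so
    downstream code must tolerate malformed or off-schema output."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or set(parsed) != set(FIELDS):
        return None
    return parsed
```

The strict schema check reflects the abstract's observation that, unlike fine-tuned outputs, zero-shot outputs often failed to conform to the desired RWD structure and had to be rejected.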

RESULTS: We successfully used LLMs to extract biomarker testing details from EHR documents. Fine-tuned outputs consistently conformed to the desired RWD structure; in contrast, zero-shot outputs were frequently invalid and exhibited hallucination. Fine-tuning performance improved with additional training examples. F1 scores ranged from 0.8–0.95, and date accuracy (within 15 days) ranged from 0.85–0.9. Fine-tuned LLMs exceeded the performance of the deep learning model baseline (∆F1 = 0.05) despite being trained on far fewer examples.
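The metrics reported above are standard and can be computed directly; a minimal sketch (the functions and data are illustrative, not the study's evaluation code):

```python
from datetime import date


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def date_accuracy(predicted, truth, tolerance_days: int = 15) -> float:
    """Fraction of predicted dates falling within a tolerance window
    of the abstracted truth, matching the 'within 15 days' criterion
    used in the abstract."""
    hits = sum(
        abs((p - t).days) <= tolerance_days
        for p, t in zip(predicted, truth)
    )
    return hits / len(truth)


# Example with made-up counts: 90 true positives, 10 false positives,
# 10 false negatives gives precision = recall = 0.9, so F1 = 0.9.
example_f1 = f1_score(tp=90, fp=10, fn=10)
```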

CONCLUSIONS: LLMs fine-tuned with high-quality labeled data accurately extracted complex PD-L1 test details from the EHR despite considerable variability across cancer types, documentation styles, and time. In contrast, zero-shot prompt extraction was not effective at the model scale examined here. Validation required high-quality labels produced by experts with access to the source EHR.

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)

Code

MSR64

Topic

Methodological & Statistical Research, Real World Data & Information Systems

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Distributed Data & Research Networks, Health & Insurance Records Systems

Disease

Drugs, Oncology, Personalized & Precision Medicine
