Leveraging Large Language Models for Medical Code List Creation: An Example Using the Charlson Comorbidity Index

Author(s)

Andreas Ochs, PhD1, Meda R. Sandu, PhD1, Michael Duxbury, BSc1, Bagmeet Behera, PhD2, Christian Henderson, BSc1, Mark Danese, MSc, PhD3, George Kafatos, PhD1.
1Amgen Ltd, Uxbridge, United Kingdom, 2Amgen GmbH, Berlin, Germany, 3Outcomes Insights, Inc, Calabasas, CA, USA.
OBJECTIVES: Accurate medical code lists are vital for studies using real-world data. Different approaches to creating code lists exist: comprehensive code lists are best created by reviewing every code description in the full medical vocabulary, but this is extremely resource-intensive. Large language models (LLMs) have been shown to diagnose patients successfully from physician notes and could potentially be used to generate comprehensive code lists efficiently. We evaluated the accuracy of LLM-generated code lists by comparing them against the published code lists for the 17 conditions of the Charlson Comorbidity Index (CCI), a widely used comorbidity scoring system.
METHODS: An LLM (ChatGPT o3 mini) was used to generate a relevance score for each of the 16,287 ICD-10-WHO codes with respect to each CCI condition. This score reflects the LLM’s assessment of how closely the code description aligns with the clinical concept of the condition. Code lists were derived by applying a range of relevance score thresholds. The resulting LLM-generated code lists were compared with the validated CCI code lists published by Quan et al. (2005). Classification metrics, including sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV), were calculated at each threshold to assess the LLM’s performance.
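The threshold step described above can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the score range, example codes and function name are assumptions rather than details from the abstract.

```python
def derive_code_list(relevance_scores: dict[str, float], threshold: float) -> set[str]:
    """Keep only codes whose LLM-assigned relevance score exceeds the threshold."""
    return {code for code, score in relevance_scores.items() if score > threshold}


# Hypothetical relevance scores for one CCI condition (e.g. myocardial infarction);
# real scores would cover all 16,287 ICD-10-WHO codes.
scores = {"I21.0": 9.5, "I21.9": 9.0, "I25.2": 7.0, "J45.0": 0.0}

for threshold in (0, 2, 5, 8):  # a range of thresholds, as in METHODS
    print(threshold, sorted(derive_code_list(scores, threshold)))
```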
RESULTS: At the strictest threshold (relevance score = 0), which excludes all codes the LLM determined to be completely irrelevant, only 1 false-negative code was identified across all conditions, while a larger number of false positives was found, resulting in >99.9% sensitivity, 93.5% specificity, >99.9% NPV and 6.1% PPV.
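The reported metrics follow from a standard confusion matrix comparing each LLM-generated list with the corresponding Quan et al. (2005) list over the full vocabulary. A hedged sketch, assuming both lists are represented as sets of ICD-10 codes:

```python
def classification_metrics(llm_list: set[str], reference: set[str], vocabulary: set[str]) -> dict[str, float]:
    tp = len(llm_list & reference)               # codes correctly included
    fp = len(llm_list - reference)               # codes included in error
    fn = len(reference - llm_list)               # reference codes missed
    tn = len(vocabulary - llm_list - reference)  # codes correctly excluded
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
    }
```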
CONCLUSIONS: These early results indicate that LLMs could potentially be used to identify the large number of codes that are irrelevant to a specific condition. This approach can be combined with manual code review to improve the accuracy and efficiency of code list generation. Ongoing work on model comparisons and prompt engineering could further increase accuracy.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

RWD118

Topic

Real World Data & Information Systems, Study Approaches

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
