Leveraging Large-Language Models for Medical Code List Creation: An Example Using the Charlson Comorbidity Index
Author(s)
Andreas Ochs, PhD1, Meda R. Sandu, PhD1, Michael Duxbury, BSc1, Bagmeet Behera, PhD2, Christian Henderson, BSc1, Mark Danese, MSc, PhD3, George Kafatos, PhD1.
1Amgen Ltd, Uxbridge, United Kingdom, 2Amgen GmbH, Berlin, Germany, 3Outcomes Insights, Inc, Calabasas, CA, USA.
OBJECTIVES: Accurate medical code lists are vital for studies using real-world data. Several approaches to creating code lists exist; the most comprehensive lists come from reviewing the description of every code in the full medical vocabulary, but this is extremely resource-intensive. Large language models (LLMs) have been shown to successfully diagnose patients based on physician notes and could potentially be used to generate comprehensive code lists efficiently. We evaluated the accuracy of LLM-generated code lists by comparing them against published code lists for the conditions of the Charlson Comorbidity Index (CCI), a widely used comorbidity scoring system covering 17 conditions.
METHODS: An LLM (ChatGPT o3 mini) was used to generate a relevance score for each of the 16,287 ICD-10-WHO codes with respect to a given CCI condition. This score reflects the LLM’s assessment of how closely each code description aligns with the clinical concept of the condition. Code lists were derived by applying a range of relevance score thresholds. The resulting LLM-generated code lists were compared with the validated CCI code lists published by Quan et al (2005). Classification metrics, including sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), were calculated at different thresholds to assess the LLM’s performance.
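The thresholding and comparison step described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `classification_metrics`, the numeric score scale, the rule that a code is included when its score is strictly greater than the threshold, and the toy scores are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a metric is undefined.
    return numerator / denominator if denominator else float("nan")


def classification_metrics(scores: dict, reference: set, threshold: float) -> Metrics:
    """Compare a thresholded code list against a reference code list.

    scores:    code -> relevance score (assumed: 0 means completely irrelevant)
    reference: codes in the validated list for one condition
    A code is included in the generated list when score > threshold (assumption).
    """
    tp = fp = tn = fn = 0
    for code, score in scores.items():
        predicted = score > threshold
        actual = code in reference
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return Metrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


# Toy data for illustration only; these scores are invented, not study output.
toy_scores = {"I21": 9, "I22": 7, "I25.2": 3, "J45": 1, "E11": 0, "K29": 0}
toy_reference = {"I21", "I22", "I25.2"}

for t in (0, 5):
    print(t, classification_metrics(toy_scores, toy_reference, t))
```

In this toy example, raising the threshold trades sensitivity for PPV: fewer irrelevant codes are kept, but a relevant code with a low score is lost.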
RESULTS: At the strictest threshold (relevance score=0), which excludes all codes the LLM determined to be completely irrelevant, only 1 code was identified as a false negative across all conditions. A larger number of false positives was found, resulting in >99.9% sensitivity, 93.5% specificity, >99.9% NPV, and 6.1% PPV.
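The combination of high specificity and low PPV follows from how rare relevant codes are. As a rough worked check, assuming the reported metrics are pooled over all code-condition pairs, and writing π (a symbol introduced here, not in the abstract) for the fraction of pairs that are truly relevant:

```latex
\mathrm{PPV}
  = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}
  = \frac{Se \,\pi}{Se \,\pi + (1 - Sp)(1 - \pi)}

% With Se \approx 1 and Sp = 0.935:
\mathrm{PPV} \approx \frac{\pi}{\pi + 0.065\,(1 - \pi)} = 0.061
\quad\Longrightarrow\quad
\pi \approx \frac{0.061 \times 0.065}{1 - 0.061 \times 0.935} \approx 0.004
```

Under these assumptions, and subject to rounding in the reported figures, only about 0.4% of code-condition pairs are relevant, so even a 6.5% false positive rate among the irrelevant pairs outnumbers the true positives many times over.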
CONCLUSIONS: These early results indicate that LLMs could potentially be used to identify the large number of codes that are irrelevant to a specific condition. This approach can be combined with manual code review to improve accuracy and efficiency in code list generation. Ongoing work on model comparison and prompt engineering could further increase accuracy.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
RWD118
Topic
Real World Data & Information Systems, Study Approaches
Disease
No Additional Disease & Conditions/Specialized Treatment Areas