Evaluating Large Language Models’ (LLM) Performance in Content Generation for Global Value Dossiers (GVD)

Speaker(s)

Walters J1, Rtveladze K2, Xu W3, Green N4, Joseph J1, Matev K5, Gallinaro J1, Guerra I1
1IQVIA, London, LON, UK, 2IQVIA, London, LON, UK, 3IQVIA, Amsterdam, Netherlands, 4IQVIA, London, UK, 5IQVIA, Sofia, Bulgaria

OBJECTIVES: LLMs have the capability to revolutionize content generation, potentially approaching human proficiency. This study investigated the practicality and effectiveness of LLMs in generating content for GVDs.

METHODS: GVD Subject Matter Experts (SMEs) used an LLM with Retrieval Augmented Generation (RAG) to draft Product Profile (PP), Clinical, and Economic chapters for four drugs covering diverse therapeutic areas. European Medicines Agency labels were used as the source for PP content, and journal publications as the source for Clinical and Economic content. Performance was evaluated across several dimensions, including relevance, completeness, accuracy, language quality, and overall quality, using a five-point Likert scale.
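As an illustration of the RAG step described above, the following is a minimal sketch only. The abstract does not describe the retriever, prompts, or LLM actually used, so the token-overlap scoring, the function names (`tokenize`, `retrieve`, `build_prompt`), and the placeholder passages are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages sharing the most tokens with the query.

    Hypothetical stand-in for a real retriever (for example an
    embedding-based search); the study's retriever is not specified.
    """
    query_counts = Counter(tokenize(query))

    def overlap(passage: str) -> int:
        passage_counts = Counter(tokenize(passage))
        # Counter & Counter keeps the minimum count of each shared token.
        return sum((query_counts & passage_counts).values())

    # sorted() is stable, so tied passages keep their original order.
    return sorted(passages, key=overlap, reverse=True)[:k]


def build_prompt(instruction: str, retrieved: list[str]) -> str:
    """Combine a drafting instruction with numbered source passages."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(retrieved, start=1))
    return f"{instruction}\n\nUse only the sources below.\n{sources}"


# Placeholder passages, not taken from any real label or publication.
passages = [
    "Drug X is indicated for condition A in adults.",
    "The trial enrolled patients across several sites.",
    "The economic model used a lifetime horizon.",
]
top = retrieve("What is Drug X indicated for?", passages, k=1)
prompt = build_prompt("Draft the indication subsection.", top)
```

The resulting `prompt` string would then be passed to whichever LLM is in use. Restricting the model to retrieved source passages is what ties the drafted text to the label or publication, which is also why thin source material limits output quality.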

RESULTS: Overall, SMEs agreed that the generated content was relevant and accurate (92% strongly or somewhat agreed). The responses were largely complete (75% somewhat agreed) but could be improved further through prompt editing and RAG improvements. All chapters contained instances of overly simple language, repetition across subsections, and some hallucinations; however, SMEs concurred that the responses were mostly well written (92% strongly or somewhat agreed). SMEs agreed that they would incorporate the generated responses into projects (92% somewhat agreed), albeit with manual checking and editing. Other notable findings included: 1) limited information within publications reduced the quality of responses, but quality improved when the LLM was given supplementary material, where available; 2) performance varied for Economic responses because model design, and hence the content reported, differed between publications; 3) multiple indications within a drug label led to incomplete or incorrect information in the PP section.
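The "strongly or somewhat agreed" percentages above are shares of ratings in the top two Likert categories. A minimal sketch of that calculation follows; the numeric coding (1 to 5), the label wording, the function name `agreement_share`, and the example ratings are assumptions for illustration and are not study data.

```python
# Assumed coding of the five-point scale; the abstract does not give the labels.
LIKERT_LABELS = {
    1: "strongly disagree",
    2: "somewhat disagree",
    3: "neither agree nor disagree",
    4: "somewhat agree",
    5: "strongly agree",
}


def agreement_share(ratings: list[int]) -> float:
    """Return the percentage of ratings that are 4 or 5 (top-two-box)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    invalid = [r for r in ratings if r not in LIKERT_LABELS]
    if invalid:
        raise ValueError(f"ratings outside the 1-5 scale: {invalid}")
    agree = sum(1 for r in ratings if r >= 4)
    return 100 * agree / len(ratings)


# Hypothetical ratings for one dimension, for illustration only.
example_ratings = [5, 4, 4, 2, 3]
share = agreement_share(example_ratings)  # 3 of 5 ratings agree -> 60.0
```

Reporting the top two categories together is a common way to summarise Likert data, but it hides the split between "strongly" and "somewhat", which is why some figures above are quoted for "somewhat agreed" alone.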

CONCLUSIONS: This study demonstrated that LLMs can significantly aid in GVD content generation. However, performance is contingent on the level of detail provided in the instructions to the LLM and within the source material. SME review is essential to ensure accuracy and completeness of the generated output.

Code

MSR110

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas