Optimizing GPT-4 Prompts for Population Eligibility Screening in COA Landscape Reviews of Qualitative Studies
Speaker(s)
Thorlund K1, Jafar R2, Nourizade M3, Hudgens S4
1McMaster University, Hamilton, ON, Canada, 2COA-AI, Tucson, AZ, USA, 3BioSpark AI Technologies Inc., Vancouver, BC, Canada, 4Clinical Outcomes Solutions, Tucson, AZ, USA
OBJECTIVES: To examine whether extensive prompt engineering with GPT-4 (OpenAI) can achieve satisfactory accuracy in automated abstract screening for Clinical Outcome Assessment (COA) landscape reviews, and to identify which prompt engineering strategies work well and which do not.
METHODS: We iterated through several prompt engineering approaches and evaluated response accuracy on a title-and-abstract dataset collected from 17 landscape reviews (n=200 citations) across oncology, rheumatology, dermatology, and rare diseases. We limited the study to evaluating eligibility by population (i.e., disease).
We started with a "Foundational Prompt" emphasizing task-specific details and step-by-step analysis, using instructions such as “Act as a ...”, “Your task is ...”, and “Think step by step”. We then modified the prompt to incorporate advanced prompt engineering techniques, including few-shot, emotional, multi-step, and chain-of-thought prompts.
RESULTS: The foundational prompt established a simple framework that yielded an accuracy of 0.89. The "Two-Shot Prompt", which introduced positive and negative examples, caused a notable 20% drop in accuracy. Emotional instructions in the "Emotional Prompts", intended to evoke empathy, yielded no improvement. The "Multi-Step Prompt" produced a lower accuracy of 0.85. While the "One-Shot Chain of Thought (CoT)" prompt yielded the best-performing accuracy of 0.92, the "Few-Shot CoT" showed a slight setback in accuracy, at 0.91.
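A minimal sketch of how the foundational and one-shot chain-of-thought prompt variants could be assembled; the prompt wording, the `build_prompt` helper, and the worked example are illustrative assumptions, not the authors' verbatim prompts.

```python
# Hypothetical prompt templates; phrasing is an assumption, not the study's
# actual wording. The foundational prompt combines "Act as a ...",
# "Your task is ...", and "Think step by step" instructions.
FOUNDATIONAL = (
    "Act as a systematic reviewer screening citations for a Clinical "
    "Outcome Assessment (COA) landscape review. Your task is to decide "
    "whether the study population matches the target disease: {disease}. "
    "Think step by step, then answer INCLUDE or EXCLUDE."
)

# One-shot chain-of-thought exemplar (hypothetical content): a single worked
# example that demonstrates reasoning before the final answer.
COT_EXEMPLAR = (
    "Example:\n"
    "Title: Patient-reported outcomes in metastatic breast cancer\n"
    "Reasoning: The population is adults with metastatic breast cancer, "
    "which matches the target disease, so the citation is eligible.\n"
    "Answer: INCLUDE"
)

def build_prompt(title: str, abstract: str, disease: str,
                 one_shot_cot: bool = False) -> str:
    """Assemble a screening prompt for one citation. The CoT variant
    prepends the worked example so the model outputs its reasoning
    process rather than only a final answer."""
    parts = [FOUNDATIONAL.format(disease=disease)]
    if one_shot_cot:
        parts.append(COT_EXEMPLAR)
    parts.append(f"Title: {title}\nAbstract: {abstract}")
    return "\n\n".join(parts)
```

The assembled string would then be sent as a single message to the model; the screening decision is parsed from the final INCLUDE/EXCLUDE line of the response.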
CONCLUSIONS: The experiments revealed that effective prompts use clear, concise, and simple instructions. The decline in accuracy after adding naive examples and emotional instructions underscored the need for conciseness and caution against overfitting. The one-shot chain-of-thought (CoT) approach emerged as the most effective, highlighting the model's inherent ability to generate sound reasoning and the value of asking the model to output its reasoning process rather than only a final answer.
Code
CO66
Topic
Clinical Outcomes
Topic Subcategory
Clinical Outcomes Assessment
Disease
Biologics & Biosimilars, Musculoskeletal Disorders (Arthritis, Bone Disorders, Osteoporosis, Other Musculoskeletal), No Additional Disease & Conditions/Specialized Treatment Areas, Oncology, Rare & Orphan Diseases