Optimizing GPT-4 Prompts for Population Eligibility Screening in COA Landscape Reviews of Qualitative Studies
Speaker(s)
Thorlund K1, Jafar R2, Nourizade M3, Hudgens S4
1McMaster University, Hamilton, ON, Canada, 2COA-AI, Tucson, AZ, USA, 3BioSpark AI Technologies Inc., Vancouver, BC, Canada, 4Clinical Outcomes Solutions, Tucson, AZ, USA
OBJECTIVES: To examine whether extensive prompt engineering with GPT-4 (OpenAI) can achieve satisfactory accuracy in automated abstract screening for Clinical Outcome Assessment (COA) landscape reviews, and to identify which prompt engineering strategies work well and which do not.
METHODS: We iterated through several prompt engineering approaches and evaluated response accuracy on a title-and-abstract dataset collected from 17 landscape reviews (n=200 citations) across oncology, rheumatology, dermatology, and rare diseases. We limited the study to evaluating eligibility by population (i.e., disease).
We started with a "Foundational Prompt" emphasizing task-specific details and step-by-step analysis, using instructions such as “Act as a ...”, “Your task is ...”, and “Think step by step”. We then modified the prompt to incorporate advanced prompt engineering techniques, including few-shot, emotional, multi-step, and chain-of-thought prompts.
RESULTS: The foundational prompt established a simple framework that yielded an accuracy of 0.89. The "Two-Shot Prompt", which introduced positive and negative examples, caused a notable 20% drop in accuracy. Emotional instructions in the "Emotional Prompts", intended to evoke empathy, yielded no improvement. The "Multi-Step Prompt" produced a lower accuracy of 0.85. While the "One-Shot Chain of Thought (CoT)" prompt yielded the best-performing accuracy of 0.92, the "Few-Shot CoT" showed a slight setback in accuracy, at 0.91.
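A minimal sketch of how the foundational and one-shot chain-of-thought prompt variants could be assembled; the prompt wording, the `build_prompt` helper, and the worked example are illustrative assumptions, not the authors' verbatim prompts.

```python
# Hypothetical prompt templates; phrasing is an assumption, not the study's
# actual wording. The foundational prompt combines "Act as a ...",
# "Your task is ...", and "Think step by step" instructions.
FOUNDATIONAL = (
    "Act as a systematic reviewer screening citations for a Clinical "
    "Outcome Assessment (COA) landscape review. Your task is to decide "
    "whether the study population matches the target disease: {disease}. "
    "Think step by step, then answer INCLUDE or EXCLUDE."
)

# One-shot chain-of-thought exemplar (hypothetical content): a single worked
# example that demonstrates reasoning before the final answer.
COT_EXEMPLAR = (
    "Example:\n"
    "Title: Patient-reported outcomes in metastatic breast cancer\n"
    "Reasoning: The population is adults with metastatic breast cancer, "
    "which matches the target disease, so the citation is eligible.\n"
    "Answer: INCLUDE"
)

def build_prompt(title: str, abstract: str, disease: str,
                 one_shot_cot: bool = False) -> str:
    """Assemble a screening prompt for one citation. The CoT variant
    prepends the worked example so the model outputs its reasoning
    process rather than only a final answer."""
    parts = [FOUNDATIONAL.format(disease=disease)]
    if one_shot_cot:
        parts.append(COT_EXEMPLAR)
    parts.append(f"Title: {title}\nAbstract: {abstract}")
    return "\n\n".join(parts)
```

The assembled string would then be sent as a single message to the model; the screening decision is parsed from the final INCLUDE/EXCLUDE line of the response.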
CONCLUSIONS: The experiments revealed that effective prompts use clear, concise, and simple instructions. The decline in accuracy after adding naive examples and emotional instructions underscored the need for conciseness and caution against overfitting. The one-shot chain-of-thought (CoT) approach emerged as the most effective, highlighting the model's inherent ability to generate sound reasoning and the value of asking the model to output its reasoning process rather than only a final answer.
Code
CO66
Topic
Clinical Outcomes
Topic Subcategory
Clinical Outcomes Assessment
Disease
Biologics & Biosimilars, Musculoskeletal Disorders (Arthritis, Bone Disorders, Osteoporosis, Other Musculoskeletal), No Additional Disease & Conditions/Specialized Treatment Areas, Oncology, Rare & Orphan Diseases