Screening Articles in a Qualitative Literature Review Using Large Language Models: A Comparison of GPT Versus Open Source, Trained Models Against Expert Researcher Screening

Author(s)

Hudgens S1, Lloyd-Price L2, Jafar R3, Nourizade M4, Burbridge C5, Thorlund K6
1Clinical Outcomes Solutions, Tucson, AZ, USA, 2Clinical Outcomes Solutions Ltd, Folkestone, Kent, UK, 3COA-AI, Tucson, AZ, USA, 4BioSpark AI Technologies Inc., Vancouver, BC, Canada, 5Clinical Outcomes Solutions, Ltd., Folkestone, Kent, UK, 6McMaster University, Hamilton, ON, Canada

OBJECTIVES: We aimed to assess the performance of two AI models for literature screening to identify relevant qualitative research that can be used to develop Clinical Outcome Assessment (COA) conceptual models. We also compared the run-time of the two models.

METHODS: We manually curated a title/abstract screening dataset (n=1,300 study references) spanning 17 landscape reviews across oncology, rheumatology, dermatology, and rare diseases. Each citation was annotated for eligibility (Y/N) by population, study design (qualitative), and reporting of concepts (how patients feel or function). We then compared the accuracy of two AI models at predicting the screening decisions of expert researchers: Generative Pre-trained Transformer 4 (GPT-4, OpenAI) prompts and a fine-tuned SciFive biomedical large language model (LLM). We used 70% of the data for training and 30% for testing.
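For illustration only, the sketch below shows one way the prompt-based arm could be set up with the OpenAI Python client. The prompt wording, model identifier, and output format are assumptions made for this sketch, not the authors' actual prompts.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical screening prompt covering the three annotated criteria
# (population, qualitative study design, concept reporting).
PROMPT = (
    "You are screening citations for a qualitative literature review.\n"
    "Answer Y or N for each criterion.\n"
    "1. Population matches the target disease area.\n"
    "2. Study design is qualitative.\n"
    "3. Reports concepts describing how patients feel or function.\n\n"
    "Title: {title}\nAbstract: {abstract}\n"
    "Reply in the form: population=<Y/N>; design=<Y/N>; concepts=<Y/N>"
)

def screen_citation(title: str, abstract: str) -> str:
    """Return the model's Y/N judgements for a single citation."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        temperature=0,  # deterministic output for reproducible screening
        messages=[{"role": "user", "content": PROMPT.format(title=title, abstract=abstract)}],
    )
    return response.choices[0].message.content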

RESULTS: Both models performed well for assessing relevance by population, with F1-scores of 0.92 for GPT-4 and 0.83 for SciFive (precision 0.92 and 0.93, respectively). For concept reporting, the fine-tuned SciFive model outperformed GPT-4, with an F1-score of 0.88 and precision of 0.92 versus 0.81 and 0.79 for GPT-4. For eligibility by study design, SciFive again showed higher precision (0.90 versus 0.76) but a lower F1-score (0.81 versus 0.86). For overall eligibility, the customized SciFive model outperformed the GPT-4 model on precision, with an F1-score of 0.84 and precision of 0.92 versus 0.85 and 0.82 for GPT-4. Lastly, the GPT-4 prompts took 10-30 minutes to screen 100 abstracts, whereas the customized SciFive model took 1-2 minutes on a computer with a Quadro RTX 8000 GPU.
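For reference, precision and F1 are computed from each model's include/exclude predictions against the expert decisions. The snippet below is a minimal, illustrative calculation with scikit-learn using made-up labels, not the study data.

from sklearn.metrics import precision_score, f1_score

# Toy labels only (1 = include, 0 = exclude); the study data are not reproduced here.
expert_decisions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_predictions = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Precision = TP / (TP + FP); F1 = harmonic mean of precision and recall.
print("precision:", precision_score(expert_decisions, model_predictions))
print("F1-score:", f1_score(expert_decisions, model_predictions))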

CONCLUSIONS: Both AI models are promising. The customized SciFive (T5-based) model appears slightly more accurate and runs substantially faster than the GPT-4 model.

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)

Code

CO83

Topic

Clinical Outcomes, Study Approaches

Topic Subcategory

Clinical Outcomes Assessment, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, Oncology, Rare & Orphan Diseases, Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)
