Screening Articles in a Qualitative Literature Review Using Large Language Models: A Comparison of GPT Versus Open Source, Trained Models Against Expert Researcher Screening

Author(s)

Hudgens S1, Lloyd-Price L2, Jafar R3, Nourizade M4, Burbridge C5, Thorlund K6
1Clinical Outcomes Solutions, Tucson, AZ, USA, 2Clinical Outcomes Solutions Ltd, Folkestone, Kent, UK, 3COA-AI, Tucson, AZ, USA, 4BioSpark AI Technologies Inc., Vancouver, BC, Canada, 5Clinical Outcomes Solutions, Ltd., Folkestone, Kent, UK, 6McMaster University, Hamilton, ON, Canada

OBJECTIVES: We aimed to assess the performance of two AI models for literature screening to identify relevant qualitative research that can be used to develop Clinical Outcome Assessment (COA) conceptual models. We also compared the run-time of the two models.

METHODS: We manually curated a title/abstract screening dataset (n=1,300 study references) spanning 17 landscape reviews across oncology, rheumatology, dermatology, and rare diseases. Each citation was annotated for eligibility (Y/N) by population, study design (qualitative), and reporting of concepts (how patients feel or function). We then compared the accuracy of two AI models at predicting the screening decisions of expert researchers: Generative Pre-trained Transformer 4 (GPT-4, OpenAI) prompts and a fine-tuned SciFive biomedical large language model (LLM). We used 70% of the data for training and 30% for testing.
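For illustration only, the sketch below shows one way the prompt-based arm could be set up with the OpenAI Python client. The prompt wording, model identifier, and output format are assumptions made for this sketch, not the authors' actual prompts.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical screening prompt covering the three annotated criteria
# (population, qualitative study design, concept reporting).
PROMPT = (
    "You are screening citations for a qualitative literature review.\n"
    "Answer Y or N for each criterion.\n"
    "1. Population matches the target disease area.\n"
    "2. Study design is qualitative.\n"
    "3. Reports concepts describing how patients feel or function.\n\n"
    "Title: {title}\nAbstract: {abstract}\n"
    "Reply in the form: population=<Y/N>; design=<Y/N>; concepts=<Y/N>"
)

def screen_citation(title: str, abstract: str) -> str:
    """Return the model's Y/N judgements for a single citation."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        temperature=0,  # deterministic output for reproducible screening
        messages=[{"role": "user", "content": PROMPT.format(title=title, abstract=abstract)}],
    )
    return response.choices[0].message.content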

RESULTS: Both models performed well for assessing relevance by population, with F1-scores of 0.92 for GPT-4 and 0.83 for SciFive (precision 0.92 and 0.93, respectively). For concept reporting, the fine-tuned SciFive model outperformed GPT-4, with an F1-score of 0.88 and precision of 0.92 versus 0.81 and 0.79 for GPT-4. For eligibility by study design, SciFive again showed higher precision (0.90 versus 0.76) but a lower F1-score (0.81 versus 0.86). For overall eligibility, the customized SciFive model outperformed the GPT-4 model on precision, with an F1-score of 0.84 and precision of 0.92 versus 0.85 and 0.82 for GPT-4. Lastly, the GPT-4 prompts took 10-30 minutes to screen 100 abstracts, whereas the customized SciFive model took 1-2 minutes on a computer with a Quadro RTX 8000 GPU.
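For reference, precision and F1 are computed from each model's include/exclude predictions against the expert decisions. The snippet below is a minimal, illustrative calculation with scikit-learn using made-up labels, not the study data.

from sklearn.metrics import precision_score, f1_score

# Toy labels only (1 = include, 0 = exclude); the study data are not reproduced here.
expert_decisions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_predictions = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Precision = TP / (TP + FP); F1 = harmonic mean of precision and recall.
print("precision:", precision_score(expert_decisions, model_predictions))
print("F1-score:", f1_score(expert_decisions, model_predictions))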

CONCLUSIONS: Both AI models are promising. The customized SciFive (T5-based) model appears slightly more accurate and runs substantially faster than the GPT-4 model.

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)

Code

CO83

Topic

Clinical Outcomes, Study Approaches

Topic Subcategory

Clinical Outcomes Assessment, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, Oncology, Rare & Orphan Diseases, Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)
