Large Language Model (LLM) as Catalyst for Rapid and Efficient Critical Appraisal of Epidemiological Research
Author(s)
Gautamjeet Singh Mangat, MSc1, Astha Jain, MSc1, Sugandh Sharma, MSc2, Sangeeta Budhia, PhD3.
1Parexel, Mohali, India, 2Parexel, Chandigarh, India, 3Parexel, London, United Kingdom.
OBJECTIVES: The STROBE checklist, a comprehensive tool for evaluating epidemiological studies, typically requires substantial review time. This study compared human review with LLM-assisted human review using the checklist to evaluate whether an LLM can efficiently reduce the workload of human reviewers.
METHODS: A random selection of epidemiology studies (n=20) was identified. Through an iterative process, standardized prompts were created specifically for the STROBE checklist, including detailed instructions for each of its 22 items. These prompts were designed to elicit comprehensive analyses from the LLM, rather than simple yes/no answers. The prompt package included appraisal guidance, a response template, and the study documents for quality assessment. Outcomes measured were accuracy (LLM-human alignment), completeness (thoroughness of item appraisal), and time efficiency.
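The prompt package described above (appraisal guidance, a response template, and the study document) could be assembled along these lines. This is a minimal illustrative sketch; the item wording, template fields, and function names are assumptions for illustration, not the authors' actual prompts.

```python
# Hypothetical sketch of assembling a per-study STROBE appraisal prompt.
# Item descriptions and the response template below are illustrative
# placeholders, not the standardized prompts used in the study.

STROBE_ITEMS = {
    1: "Title and abstract: indicate the study design with a commonly used term.",
    11: "Quantitative variables: explain how they were handled in the analyses.",
    12: "Statistical methods: describe methods for subgroups, confounding, etc.",
    # ... remaining items of the 22-item checklist would be listed here
}

RESPONSE_TEMPLATE = (
    "For each item, state (a) whether it is addressed, "
    "(b) the supporting passage from the study, and (c) a brief rationale."
)


def build_prompt(study_text: str, items: dict) -> str:
    """Combine appraisal guidance, the response template, and the study text."""
    guidance = "\n".join(f"Item {n}: {desc}" for n, desc in sorted(items.items()))
    return (
        "You are critically appraising an epidemiological study "
        "using the STROBE checklist.\n"
        "Provide a comprehensive analysis per item, not simple yes/no answers.\n\n"
        f"{guidance}\n\n{RESPONSE_TEMPLATE}\n\n--- STUDY ---\n{study_text}"
    )


prompt = build_prompt("Full text of the study document goes here...", STROBE_ITEMS)
```

In practice, one prompt of this shape would be generated per study and submitted to the LLM, with the reviewer then checking the structured per-item responses against the paper.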
RESULTS: The average time to complete the STROBE checklist was substantially reduced with LLM assistance compared to human review alone. LLM-assisted human review took an average of 14 minutes per study, while human reviewers alone required 27 minutes per study, a 48% reduction. LLM-assisted human review and the expert reviewer alone achieved an overall agreement rate of 88% across all checklist items. The LLM consistently struggled to accurately appraise the generalizability item, which required human judgement. The LLM also performed inconsistently on items 11 (handling of quantitative variables) and 12 (statistical methods for subgroups, confounding, etc.), providing partial responses that required additional prompting. However, for most items where agreement was observed, the LLM provided comprehensive and detailed responses.
CONCLUSIONS: LLMs appear able to accelerate the critical appraisal process while maintaining high quality. We recommend deploying LLMs as initial reviewers, followed by targeted human involvement for items requiring specialized judgement. This strategy saves time while focusing human expertise on the nuanced aspects of the literature review process.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
EPH156
Topic
Epidemiology & Public Health, Health Technology Assessment, Methodological & Statistical Research
Disease
Oncology