DEVELOPMENT AND EVALUATION OF CELLS, AN ENSEMBLE OF OPEN-WEIGHT LARGE LANGUAGE MODELS FOR SYSTEMATIC LITERATURE REVIEW

Author(s)

Richard F. Pollock, MA, MSc;
Covalence Research Ltd, Director, Harpenden, United Kingdom
OBJECTIVES: The Covalence Ensemble Large Language Model (LLM) Systematic Literature Review (SLR) Platform (CELLS) is a web application that orchestrates an ensemble of locally hosted open-weight LLMs to conduct automated screening of study titles and abstracts in the context of an SLR, with a focus on reproducibility, inter-model agreement, chain-of-thought documentation, and human-in-the-loop decision making. The present analysis details key methodological aspects of CELLS and presents an evaluation of its performance versus a human reviewer.
METHODS: CELLS was used to run six open-weight LLMs with between 14 and 120 billion parameters on a corpus of 801 study titles and abstracts, screened against five inclusion criteria; a human reviewer independently performed the same task. Each LLM was prompted with identical context for each study and required to emit schema-validated responses for each criterion, with the schema compiled automatically from human-defined screening criteria specified in the web interface. For every combination of study, LLM, and criterion, CELLS recorded the raw LLM output, including the chain-of-thought, and the schema-validated screening decision, enabling per-model performance auditing, generation of inter-LLM agreement statistics (e.g., Fleiss' kappa or Krippendorff's alpha), and application of pre-specified or post hoc consensus rules (e.g., majority vote or unanimity) to generate final screening recommendations.
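The consensus step described above can be sketched as follows. This is a minimal illustration, not the CELLS implementation: it assumes each of the six LLMs emits a boolean include/exclude decision per criterion, and shows a majority-vote and a unanimity rule alongside a crude proportion-of-agreement statistic (the published platform uses formal measures such as Fleiss' kappa or Krippendorff's alpha). All function and variable names are hypothetical.

```python
# Illustrative consensus rules over per-model screening decisions.
# Names are hypothetical; this is not the CELLS API.
from collections import Counter


def consensus(decisions: list[bool], rule: str = "majority") -> bool:
    """Combine per-model include/exclude votes for one criterion."""
    if rule == "unanimity":
        return all(decisions)          # include only if every model agrees
    votes = Counter(decisions)
    return votes[True] > votes[False]  # simple majority; ties -> exclude


def proportion_agreement(decisions: list[bool]) -> float:
    """Fraction of models in the modal category (crude agreement proxy)."""
    return Counter(decisions).most_common(1)[0][1] / len(decisions)


# Example: six models vote on one criterion for one study.
votes = [True, True, True, True, False, True]
print(consensus(votes, "majority"))           # True
print(consensus(votes, "unanimity"))          # False
print(round(proportion_agreement(votes), 2))  # 0.83
```

Studies whose consensus decision is "include", or whose agreement statistic falls below a pre-specified threshold, would then be flagged for human full-text review, matching the human-in-the-loop workflow described above.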
RESULTS: Of the 801 study records retrieved, agreement between CELLS and the human reviewer on the need for full-text screening was 97.2% (779/801). CELLS recommended full-text screening (or reported low inter-LLM agreement) for 45 studies, versus 23 for the human reviewer, with all 23 human-recommended studies among the 45 recommended by CELLS. Had CELLS made screening decisions without human audit, the human screening workload would have been reduced by 94.4% (from 801 to 45 studies), with no studies wrongly excluded.
CONCLUSIONS: Agreement between CELLS and human screening decisions was high, with no studies incorrectly excluded by CELLS, paving the way for substantial human workload efficiencies in literature screening.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR89

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
