ELEVATE-GENAI FRAMEWORK-BASED INTERIM EVALUATION OF A GENAI-ENABLED PLATFORM FOR LITERATURE SCREENING OF A CLASS II MEDICAL DEVICE
Author(s)
Viji Queen, PharmD1, Revanth M, B.E.2, George Alisha, B.E.2, Swathirajan C R, Ph.D2, Angeline Babitha Dhas, B.E.3, Vince Salerno, PharmD, RPh3.
1MadeAi, Nagercoil, India, 2MadeAi, Nagercoil, India, 3MadeAi, Cambridge, MA, USA.
OBJECTIVES: This research presents an interim evaluation of a GenAI-enabled literature review platform using the ELEVATE-GenAI framework, based on title-and-abstract screening of a Class II medical device golden dataset. The use of a regulated medical device dataset represents a novel application of the ELEVATE-GenAI framework for assessing screening-stage performance in a regulated evidence synthesis context.
METHODS: A golden dataset comprising 2,302 records, curated and adjudicated by subject matter experts, was screened at the title-and-abstract level within the MadeAi platform. Screening outputs were evaluated against selected ELEVATE-GenAI domains, including performance, comprehensiveness, and factuality. Task-specific accuracy metrics (recall, precision, and F1 score) were calculated separately for inclusion and exclusion decisions, using the human-adjudicated dataset as the reference standard. Comprehensiveness was assessed as overall article relevance relative to the golden dataset. Factuality was evaluated through concordance between AI-generated exclusion decisions and human primary exclusion rationales, with discrepancies reviewed through expert oversight.
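The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration of the standard precision/recall/F1 formulas and a simple agreement-rate concordance measure, not the platform's actual evaluation code; all counts in the usage lines are hypothetical, not the study's confusion matrix.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion-matrix counts,
    treating the human-adjudicated labels as the reference standard."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def concordance(matching_rationales: int, total_exclusions: int) -> float:
    """Share of AI exclusion decisions whose rationale matches the human
    primary exclusion reason (the factuality check described above)."""
    return matching_rationales / total_exclusions if total_exclusions else 0.0

# Illustrative counts only; computed separately for inclusion and exclusion decisions.
incl_precision, incl_recall, incl_f1 = precision_recall_f1(tp=100, fp=24, fn=20)
factuality = concordance(matching_rationales=84, total_exclusions=100)
```

Because the two decision classes are scored separately, a screening tool can look conservative on exclusions (high exclusion recall) while still being assessed on how much relevant evidence it retains (inclusion recall).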
RESULTS: For inclusion decisions, MadeAi achieved a recall of 83%, precision of 80.7%, and an F1 score of 81.9%, indicating strong sensitivity for identifying relevant evidence. Exclusion performance demonstrated conservative screening behavior, prioritizing the retention of relevant studies, with recall and precision each at 96% and an F1 score of 95.9%. Overall article relevance was 93%, supporting screening-stage comprehensiveness. Concordance between AI exclusion decisions and human primary exclusion reasons was 84%, supporting the factuality of the exclusion logic.
CONCLUSIONS: This interim application of the ELEVATE-GenAI framework demonstrates that domain-specific reporting across performance, comprehensiveness, and factuality provides a standardized and interpretable pathway for evaluating GenAI-enabled literature screening. By moving beyond accuracy metrics to include concordance of exclusion reasoning, the framework supports the transparency and reliability expected within regulated evidence synthesis workflows.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
SA52
Topic
Study Approaches
Topic Subcategory
Literature Review & Synthesis
Disease
STA: Multiple/Other Specialized Treatments