ELEVATE-GENAI FRAMEWORK-BASED INTERIM EVALUATION OF A GENAI-ENABLED PLATFORM FOR LITERATURE SCREENING OF A CLASS II MEDICAL DEVICE
Author(s)
Viji Queen, PharmD1, Revanth M, B.E.2, George Alisha, B.E.2, Swathirajan C R, Ph.D2, Angeline Babitha Dhas, B.E.3, Vince Salerno, PharmD, RPh3.
1MadeAi, Nagercoil, India, 2MadeAi, Nagercoil, India, 3MadeAi, Cambridge, MA, USA.
OBJECTIVES: This research presents an interim evaluation of a GenAI-enabled literature review platform using the ELEVATE-GenAI framework, based on title-and-abstract screening of a Class II medical device golden dataset. The use of a regulated medical device dataset represents a novel application of the ELEVATE-GenAI framework for assessing screening-stage performance in a regulated evidence synthesis context.
METHODS: A golden dataset comprising 2,302 records, curated and adjudicated by subject matter experts, was screened at the title-and-abstract level within the MadeAi platform. Screening outputs were evaluated against selected ELEVATE-GenAI domains, including performance, comprehensiveness, and factuality. Task-specific accuracy metrics (recall, precision, and F1 score) were calculated separately for inclusion and exclusion decisions, using the human-adjudicated dataset as the reference standard. Comprehensiveness was assessed as overall article relevance relative to the golden dataset. Factuality was evaluated through concordance between AI-generated exclusion decisions and human primary exclusion rationales, with discrepancies reviewed through expert oversight.
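The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration of the standard precision/recall/F1 formulas and a simple agreement-rate concordance measure, not the platform's actual evaluation code; all counts in the usage lines are hypothetical, not the study's confusion matrix.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion-matrix counts,
    treating the human-adjudicated labels as the reference standard."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def concordance(matching_rationales: int, total_exclusions: int) -> float:
    """Share of AI exclusion decisions whose rationale matches the human
    primary exclusion reason (the factuality check described above)."""
    return matching_rationales / total_exclusions if total_exclusions else 0.0

# Illustrative counts only; computed separately for inclusion and exclusion decisions.
incl_precision, incl_recall, incl_f1 = precision_recall_f1(tp=100, fp=24, fn=20)
factuality = concordance(matching_rationales=84, total_exclusions=100)
```

Because the two decision classes are scored separately, a screening tool can look conservative on exclusions (high exclusion recall) while still being assessed on how much relevant evidence it retains (inclusion recall).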
RESULTS: For inclusion decisions, MadeAi achieved a recall of 83%, precision of 80.7%, and an F1 score of 81.9%, indicating strong sensitivity for identifying relevant evidence. Exclusion performance demonstrated conservative screening behavior, prioritizing the retention of relevant studies, with recall and precision each at 96% and an F1 score of 95.9%. Overall article relevance was 93%, supporting screening-stage comprehensiveness. Concordance between AI exclusion decisions and human primary exclusion reasons was 84%, supporting the factuality of the exclusion logic.
CONCLUSIONS: This interim application of the ELEVATE-GenAI framework demonstrates that domain-specific reporting across performance, comprehensiveness, and factuality provides a standardized and interpretable pathway for evaluating GenAI-enabled literature screening. By moving beyond accuracy metrics to include concordance of exclusion reasoning, the framework supports the transparency and reliability expected within regulated evidence synthesis workflows.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
SA52
Topic
Study Approaches
Topic Subcategory
Literature Review & Synthesis
Disease
STA: Multiple/Other Specialized Treatments