GenAI Goes to ISPOR: Exploratory, Descriptive Analysis of Generative AI Performance for Summarizing and Synthesizing ISPOR Research Abstracts as Sources
Author(s)
Cynthia D. Morrow, MA, PhD1, Ellen R. Thiel, MPH1, Jennifer P. Wisdom, PhD, MPH1, Blaine P. Reeder, PhD2;
1Knowledge Resolution, Traverse City, MI, USA, 2University of Missouri, Columbia, MO, USA
OBJECTIVES: Generative AI (genAI) has the potential to streamline the summarization and synthesis of research findings for evidence generation to support safety and efficacy requirements in the life sciences. However, there is a need to assess genAI's accuracy and traceability.
METHODS: A random sample of abstracts (N=30) from the June 2023 issue of Value in Health was first uploaded individually to two publicly available genAI tools (ChatGPT v2, Claude 3.5 Sonnet). The tools were prompted to summarize the methods and results of each abstract and to isolate the specific abstract text used to generate each summary (traceability). The whole sample was then uploaded, and the tools were prompted to count how many abstracts used electronic health record (EHR) or claims data. Time to output was recorded. Summaries were independently scored using a rubric, with 1 point assigned for accurate reporting of data type, study population, methodology, and primary research findings.
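To illustrate how such a per-abstract workflow could be scripted, the minimal Python sketch below covers the prompting and timing steps only; the call_model() wrapper, prompt wording, and record fields are hypothetical and do not represent the tools' actual interfaces or the authors' exact prompts.

import time

SUMMARY_PROMPT = (
    "Summarize the methods and results of the following abstract, then quote "
    "the exact abstract text used for each summary statement (traceability).\n\n"
    "ABSTRACT:\n{abstract}"
)

def call_model(tool_name: str, prompt: str) -> str:
    """Hypothetical wrapper around a genAI tool's API; replace with a real client call."""
    raise NotImplementedError

def summarize_abstracts(abstracts: list[str], tool_name: str) -> list[dict]:
    """Prompt one tool once per abstract, recording the output and the time to output."""
    records = []
    for abstract in abstracts:
        start = time.perf_counter()
        output = call_model(tool_name, SUMMARY_PROMPT.format(abstract=abstract))
        elapsed = time.perf_counter() - start
        records.append({"tool": tool_name, "output": output, "seconds": elapsed})
    return records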
RESULTS: Overall rubric scores for ChatGPT and Claude were 96% and 95%, respectively. The tools were highly accurate in describing the study population (97% for ChatGPT and 100% for Claude), methodology (97% and 100%), and primary research findings (100% for both). The rubric criterion with the lowest accuracy was data type: 89% and 79% for ChatGPT and Claude, respectively. Mean (SD) time for output generation was 7.4 (2.2) seconds for ChatGPT and 4.5 (0.7) seconds for Claude. Claude was especially limited in its ability to isolate source text. Manual review of the sample found that abstracts utilized EHR data (N=7), claims data (N=12), or other/multiple sources (N=11). ChatGPT reported that no abstracts utilized EHR or claims data; Claude provided inaccurate counts (2 for EHR, 13 for claims).
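As a worked illustration of how the reported figures aggregate (per-criterion accuracy, the overall rubric score, and mean (SD) time to output), a short sketch follows; the criterion field names and data layout are assumptions, not the authors' scoring instrument.

from statistics import mean, stdev

CRITERIA = ["data_type", "study_population", "methodology", "primary_findings"]

def rubric_percentages(scored: list[dict]) -> dict:
    """Per-criterion and overall accuracy (%) across all scored summaries.
    Each item in `scored` is assumed to map criterion name -> 0 or 1."""
    n = len(scored)
    per_criterion = {c: 100 * sum(s[c] for s in scored) / n for c in CRITERIA}
    overall = 100 * sum(s[c] for s in scored for c in CRITERIA) / (n * len(CRITERIA))
    return {"per_criterion": per_criterion, "overall": overall}

def time_summary(seconds: list[float]) -> tuple[float, float]:
    """Mean and sample SD of time-to-output, matching the Mean (SD) reporting."""
    return mean(seconds), stdev(seconds)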
CONCLUSIONS: GenAI shows promise for rapidly summarizing research findings. Improvements in traceability and in accuracy when synthesizing across multiple sources will benefit life sciences use cases.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR38
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas