Application of Generative Artificial Intelligence for Extracting Structured Data from Unstructured Bladder Cancer Pathology Reports

Author(s)

Jennifer Ken-Opurum, PhD¹, Sidharth Singh, MBA², P. Pranav, MBA³, Rahul Bhonsle, MTech², Shekhar Thumake, MBA², Heather Marino, MLA¹, Luke Dunlap, MS¹;
¹Axtria, Berkeley Heights, NJ, USA, ²Axtria, Pune, India, ³Axtria, Bengaluru, India

OBJECTIVES: Free-text clinical notes contain rich patient medical histories but require manual, time-intensive abstraction to capture data in usable formats. Tools such as natural language processing can accelerate this work but are limited to specific data types. Our objective was to assess the accuracy and performance of generative artificial intelligence (GenAI) in extracting clinical concepts from unstructured pathology reports and converting them to structured, usable data.
METHODS: Seventy-nine bladder cancer pathology reports were obtained from the Clinical Data Sharing Alliance repository. The analysis framework followed the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) Implementation Guide version 3.4. We used OpenAI GPT-4o, a multimodal generative pretrained transformer, as the GenAI tool for automated data processing. Prompts were refined iteratively to extract five variables of pathological data from the unstructured reports: tumor location, tumor laterality, tumor direction, tumor results original response, and units. Extracted data were transformed and mapped to CDISC SDTM 3.4 standards. Accuracy was evaluated at the pathology-report level.
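As a minimal sketch of the mapping step, the snippet below parses a JSON reply and renames its fields to SDTM-style variable names. It is illustrative only and is not the authors' pipeline: the JSON field names, the helper `map_to_sdtm`, and the sample reply are hypothetical, and the target names (`TULOC`, `TULAT`, `TUDIR`, `TRORRES`, `TRORRESU`) are assumed correspondences, since the abstract does not name the SDTM variables it used. No model call is made; the reply is hard-coded.

```python
import json

# Assumed mapping from the five extracted concepts to SDTM-style variable
# names. The abstract does not state which variables were populated.
FIELD_TO_SDTM = {
    "tumor_location": "TULOC",
    "tumor_laterality": "TULAT",
    "tumor_direction": "TUDIR",
    "result_original": "TRORRES",
    "result_units": "TRORRESU",
}


def map_to_sdtm(model_output: str) -> dict:
    """Parse a JSON reply and rename its keys to the assumed SDTM names.

    Fields that are missing or null in the reply become empty strings,
    so every record carries the same five columns.
    """
    extracted = json.loads(model_output)
    return {
        sdtm: str(extracted.get(field) or "").strip()
        for field, sdtm in FIELD_TO_SDTM.items()
    }


# Hypothetical model reply for a single report (one value per variable).
reply = (
    '{"tumor_location": "bladder, posterior wall", '
    '"tumor_laterality": null, '
    '"tumor_direction": "posterior", '
    '"result_original": "2.5", '
    '"result_units": "cm"}'
)
record = map_to_sdtm(reply)
print(record)
```

Keeping one value per variable mirrors the single-value prompt design described in the abstract; a report with several tumors would need a list of such records instead.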
RESULTS: Seventy-two of the 79 pathology reports (91%) were extracted with 100% accuracy; that is, for each of these reports, all available variables were accurately extracted and mapped per SDTM specifications. For the remaining seven pathology reports, accuracy ranged from 40% to 80%. In these cases, extraction challenges were attributed to complex multi-tumor descriptions, non-standard anatomical terminology, and/or inclusion of multiple specimen collections. By design, our prompts were engineered to extract only one value per variable and therefore failed to capture the more complicated pathologies that exist in real-world practice settings. Future iterations of this work will consider such complexities and allow for enriched reporting informed by clinical expertise.
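Report-level accuracy can be illustrated as the share of a report's available variables that were extracted correctly. The sketch below assumes exact string matching against a manually abstracted reference and treats a `None` reference value as "not available"; the function `report_accuracy` and the sample values are hypothetical, and the abstract does not describe the scoring rule in this detail.

```python
def report_accuracy(extracted: dict, reference: dict) -> float:
    """Fraction of available reference variables that were extracted exactly."""
    # Variables with no reference value are excluded from the denominator.
    available = [name for name, value in reference.items() if value is not None]
    if not available:
        raise ValueError("reference has no available variables")
    correct = sum(extracted.get(name) == reference[name] for name in available)
    return correct / len(available)


# Hypothetical reference abstraction with all five variables available.
reference = {
    "location": "bladder",
    "laterality": "left",
    "direction": "posterior",
    "result": "2.5",
    "units": "cm",
}

# One mismatch (laterality) out of five variables.
four_of_five = dict(reference, laterality="right")
# Three mismatches out of five variables.
two_of_five = dict(reference, laterality="right", direction="anterior", result="3.0")

print(report_accuracy(reference, reference))     # 1.0
print(report_accuracy(four_of_five, reference))  # 0.8
print(report_accuracy(two_of_five, reference))   # 0.4
```

Under these assumptions, a report with all five variables available scores 40% when two are correct and 80% when four are correct, which matches the range reported for the seven imperfect reports.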
CONCLUSIONS: This study demonstrates the feasibility and effectiveness of GenAI for automated extraction and standardization of bladder cancer pathology data. The high accuracy rates suggest that this approach could substantially streamline the conversion of unstructured pathology reports into SDTM-compliant datasets.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR139

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology

Presentation (CTI)