Application of Generative Artificial Intelligence for Extracting Structured Data from Unstructured Bladder Cancer Pathology Reports
Author(s)
Jennifer Ken-Opurum, PhD1, Sidharth Singh, MBA2, P. Pranav, MBA3, Rahul Bhonsle, MTech2, Shekhar Thumake, MBA2, Heather Marino, MLA1, Luke Dunlap, MS1
1Axtria, Berkeley Heights, NJ, USA, 2Axtria, Pune, India, 3Axtria, Bengaluru, India
OBJECTIVES: Free-text clinical notes contain rich patient medical histories but require manual, time-intensive abstraction to capture data in usable formats. Natural language processing tools can accelerate this work but are limited to specific data types. Our objective was to assess the accuracy and performance of generative artificial intelligence (GenAI) in extracting clinical concepts from unstructured pathology reports and converting them to structured, usable data.
METHODS: Seventy-nine bladder cancer pathology reports were obtained from the Clinical Data Sharing Alliance repository. The analysis framework followed the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) Implementation Guide version 3.4. We used OpenAI GPT-4o, a multimodal generative pretrained transformer, as the GenAI tool for automated data processing. Prompts were refined iteratively to extract relevant pathological data from the unstructured reports, covering five variables: tumor location, tumor laterality, tumor direction, tumor results original response, and units. Extracted data were transformed and mapped to CDISC SDTM 3.4 standards. Accuracy was evaluated at the pathology-report level.
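As a rough illustration of the extract-then-map step described above, the sketch below takes a JSON string of the kind a GenAI model might be prompted to return and converts it to a row keyed by SDTM-style variable names. The abstract does not publish the prompts, the output schema, or the exact SDTM variables used; the JSON field names, the function `to_sdtm_row`, and the mapping to TULOC, TULAT, TUDIR, TRORRES, and TRORRESU are all assumptions made for illustration, and no model call is shown.

```python
import json

# Assumed mapping from the five abstract variables to SDTM-style names.
# The abstract does not state which SDTM variables were used; these are illustrative.
SDTM_NAMES = {
    "tumor_location": "TULOC",
    "tumor_laterality": "TULAT",
    "tumor_direction": "TUDIR",
    "result_original": "TRORRES",
    "result_units": "TRORRESU",
}


def to_sdtm_row(model_output: str) -> dict:
    """Parse a (hypothetical) JSON model response into an SDTM-style row.

    One value per variable is enforced, mirroring the single-value design
    described in the abstract; a list of values is rejected.
    """
    raw = json.loads(model_output)
    row = {}
    for field, sdtm_name in SDTM_NAMES.items():
        value = raw.get(field)
        if isinstance(value, list):
            raise ValueError(f"{field}: expected one value, got {len(value)}")
        # Missing or empty values are kept as None rather than guessed.
        row[sdtm_name] = None if value in (None, "") else str(value).strip()
    return row


# Made-up example response, not taken from any real pathology report.
example = (
    '{"tumor_location": "Bladder", "tumor_laterality": null, '
    '"tumor_direction": "Posterior", "result_original": "2.5", '
    '"result_units": "cm"}'
)
row = to_sdtm_row(example)
```

Rejecting lists at this step makes the single-value constraint explicit, so reports describing several tumors or specimens would be flagged rather than silently truncated.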
RESULTS: Seventy-two of the 79 pathology reports (91%) were extracted with 100% accuracy; that is, for each of these reports, all available variables were accurately extracted and mapped per SDTM specifications. For the remaining seven pathology reports, accuracy ranged from 40% to 80%. In these cases, extraction challenges were attributed to complex multi-tumor descriptions, non-standard anatomical terminology, and/or inclusion of multiple specimen collections. By design, our prompts were engineered to extract only one value per variable and therefore failed to capture the more complicated pathologies that exist in real-world practice settings. Future iterations of this work will consider such complexities and allow for enriched reporting informed by clinical expertise.
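The report-level accuracy described above can be sketched as follows: each report is scored as the share of its available variables that match a reference abstraction, and the reports are then summarized by how many reached 100%. The function names and the dictionary layout are assumptions for illustration; the abstract does not describe the scoring code. With five variables available, accuracies of 40% to 80% would correspond to two to four correct variables.

```python
def report_accuracy(extracted: dict, reference: dict) -> float:
    """Fraction of available reference variables that were extracted correctly."""
    # Only variables present in the reference abstraction are scored.
    correct = sum(extracted.get(name) == value for name, value in reference.items())
    return correct / len(reference)


def summarize(accuracies: list) -> tuple:
    """Return (fully accurate reports, total reports, percent fully accurate)."""
    fully_accurate = sum(acc == 1.0 for acc in accuracies)
    total = len(accuracies)
    return fully_accurate, total, round(100 * fully_accurate / total, 1)


# Made-up report in which two of five variables match the reference.
reference = {"TULOC": "Bladder", "TULAT": "Left", "TUDIR": "Posterior",
             "TRORRES": "2.5", "TRORRESU": "cm"}
extracted = {"TULOC": "Bladder", "TULAT": "Right", "TUDIR": "Anterior",
             "TRORRES": "3.0", "TRORRESU": "cm"}
single = report_accuracy(extracted, reference)

# Counts from the abstract: 72 fully accurate reports and 7 partially accurate ones.
# The individual values for the 7 reports are placeholders within the reported range.
summary = summarize([1.0] * 72 + [0.4, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8])
```

Under these counts, 72 of 79 reports is about 91.1% fully accurate at the report level.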
CONCLUSIONS: This study demonstrates the feasibility and effectiveness of GenAI for automated extraction and standardization of bladder cancer pathology data. The high accuracy rates suggest this approach could significantly streamline the conversion of unstructured pathology reports into SDTM-compliant datasets.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR139
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology