AI-Powered Document Classifier for Identifying Documents Containing Radiation Therapy Information in Medical Records
Author(s)
Sandeep Giri, BS1, Shubh Tripathi, Other2.
1CloudxLab, Beaverton, OR, USA, 2CloudxLab, Beaverton, OR, USA.
OBJECTIVES: To develop an AI-powered document classification system that accurately identifies medical records containing critical radiation therapy information—particularly in oncology contexts—to streamline research workflows and reduce manual effort in large-scale document processing.
METHODS: To process large volumes of oncology-related medical records, scanned images and PDFs were first converted to text using an OCR tool. The text was then embedded using PubMedBERT, a language model specialized for biomedical text. These embeddings served as input to a custom-built neural network classifier. A key design choice was to optimize the model for 99% recall, so that nearly every relevant document is captured, even at the cost of including some non-relevant ones. This high-recall setting is critical in medical research, where missing a document containing essential information could lead to incomplete analysis or missed insights. We used a balanced dataset of 7512 instances in total: 4807 for training, 1202 for validation (a held-out set evaluated during training), and 1503 for final testing.
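One common way to reach a fixed recall target is to choose the decision threshold on the validation set after training. The abstract does not say how the 99% recall setting was obtained, so the following is only a minimal sketch of that general technique. The function names and the toy scores are hypothetical, and the classifier is assumed to output one relevance score per document.

```python
import math


def threshold_for_recall(scores, labels, target_recall=0.99):
    """Return the highest threshold whose recall is at least target_recall.

    A document counts as predicted-relevant when score >= threshold.
    `scores` are classifier outputs; `labels` are 1 (relevant) or 0.
    Hypothetical helper, not the authors' published procedure.
    """
    positives = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of relevant documents that must be captured.
    needed = max(math.ceil(target_recall * len(positives)), 1)
    # The score of the `needed`-th best positive is the cut-off.
    return positives[needed - 1]


def precision_recall(scores, labels, threshold):
    """Compute (precision, recall) at the given threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy validation scores (illustrative only, not study data).
    val_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
    val_labels = [1, 1, 1, 1, 0, 0]
    t = threshold_for_recall(val_scores, val_labels, target_recall=1.0)
    print(t, precision_recall(val_scores, val_labels, t))
```

Lowering the threshold to capture the lowest-scoring relevant documents also admits more non-relevant ones, which is the recall-for-precision trade described above.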
RESULTS: The system achieved a recall of 99% and a precision of 80% on both the validation and test sets. The high recall means that nearly all documents containing radiation therapy information are captured, while the 80% precision means that about one in five flagged documents is not relevant, an acceptable trade-off for a screening step. These results indicate that the model can reliably filter useful documents from large medical corpora.
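To make these figures concrete, the sketch below derives the approximate confusion-matrix counts they imply for the 1503-document test set. It assumes the test split is exactly half relevant and half non-relevant, which the abstract suggests ("balanced") but does not state per split. The function name is hypothetical, and the outputs are back-of-the-envelope estimates, not reported results.

```python
def implied_counts(n_docs, positive_fraction, recall, precision):
    """Derive approximate confusion-matrix counts from summary metrics.

    Uses recall = TP / (TP + FN) and precision = TP / (TP + FP).
    Counts are fractional because the inputs are rounded percentages.
    """
    positives = n_docs * positive_fraction
    negatives = n_docs - positives
    tp = recall * positives
    fn = positives - tp
    fp = tp * (1.0 - precision) / precision
    tn = negatives - fp
    return {
        "tp": tp,
        "fn": fn,
        "fp": fp,
        "tn": tn,
        "flagged_fraction": (tp + fp) / n_docs,
    }


if __name__ == "__main__":
    # Assumption: 50% of the 1503 test documents are relevant.
    counts = implied_counts(1503, 0.5, recall=0.99, precision=0.80)
    for name, value in counts.items():
        print(f"{name}: {value:.3f}")
```

Under that assumption, roughly 744 relevant documents are captured, about 8 are missed, and about 186 non-relevant documents are flagged, so a little under two thirds of the test set is forwarded. In a real corpus, where relevant documents are likely far rarer than 50%, precision at the same threshold would generally be lower.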
CONCLUSIONS: This system spares researchers from manually sifting through millions of documents to find relevant information. By automating the identification of critical content with high recall, the classifier ensures that very few relevant documents are missed while significantly reducing time and effort. The filtered documents can then be forwarded for deeper AI-driven analysis or research, enabling focused exploration of treatment patterns, outcomes, and other clinically significant variables. This approach shows how AI can be applied to address real-world inefficiencies in clinical data management.
Conference/Value in Health Info
2025-09, ISPOR Real-World Evidence Summit 2025, Tokyo, Japan
Value in Health Regional, Volume 49S (September 2025)
Code
RWD17
Topic Subcategory
Data Protection, Integrity, & Quality Assurance
Disease
SDC: Oncology