Time-Dependent Profiling of Distinct Stages Prior to Breast Cancer Onset Using Free-Text Diagnosis Names


Lorenzo R1, Holmes B1, Green F2, Loving J1
1Syapse, San Francisco, CA, USA, 2Syapse, San Diego, CA, USA

OBJECTIVES: Early detection of breast cancer (BC) is crucial to patient outcomes, so modeling the patient journey prior to BC diagnosis is an important task. Patient diagnoses are often available as free text, which is difficult to represent for predictive analytics. We introduce the use of sentence transformers, paired with a novel association step based on unsupervised clustering, to yield highly relevant patient journey representations.

METHODS: We generated a vocabulary of 9,915 diagnoses from patient visits occurring up to one year before a BC diagnosis, inclusive of the BC diagnosis visit. We used the Biomed-Roberta sentence transformer to vectorize these diagnoses. We clustered the resulting vectors, using silhouette scoring to choose the optimal number of clusters, and computed the cluster centroids. The centroids were then clustered again to group similar concepts into clinically relevant categories.
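As an illustration of how silhouette scoring can select a cluster count, the following is a minimal sketch, not the authors' implementation. The abstract does not name the clustering algorithm, so k-means with a deterministic farthest-first initialization is an assumption here, as are the function names and the toy 2-D points, which stand in for the diagnosis embeddings.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def farthest_first(points, k):
    """Deterministic seeding: repeatedly pick the point farthest from the chosen seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means (an assumed algorithm); returns (labels, centroids)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = sum(dist(p, q) for q in own) / (len(own) - 1)  # cohesion
        b = min(sum(dist(p, q) for q in other) / len(other)  # separation
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, candidates):
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)[0]) for k in candidates}
    return max(scores, key=scores.get), scores


# Toy stand-in for diagnosis embeddings: three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
k, scores = best_k(points, range(2, 6))
print(k, round(scores[k], 2))
```

The centroids returned by `kmeans` could themselves be passed through the same routine to form a second, coarser level of grouping, which is one way to read the two-stage clustering described above.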

Each patient was assigned to one of two time points, either 6 months or 3 weeks before BC diagnosis, by randomized, equally weighted assignment. Diagnoses up to a year prior were vectorized. We trained an XGBoost model on these vectors to classify the two groups (75/25 train/test split).
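The cohort assignment and data split can be sketched as follows. This is a hedged illustration only: "equally weighted" is read here as an independent 50/50 draw per patient, and the patient identifiers, the group labels `"6m"` and `"3w"`, and both function names are hypothetical. The classifier itself is omitted.

```python
import random


def assign_time_points(patient_ids, seed=0):
    """Label each patient '6m' or '3w' with equal probability (assumed reading)."""
    rng = random.Random(seed)
    return {pid: rng.choice(["6m", "3w"]) for pid in patient_ids}


def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of the items and hold out test_fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical patient identifiers, for illustration only.
patient_ids = [f"patient-{i}" for i in range(1000)]
groups = assign_time_points(patient_ids)
train_ids, test_ids = train_test_split(patient_ids)
print(len(train_ids), len(test_ids))
```

Splitting at the patient level, rather than at the diagnosis level, keeps every record for a given patient on one side of the split.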

RESULTS: Expert review established cluster quality and confirmed that all breast cancer diagnoses fell in a single cluster. In the BC diagnosis cluster, all members were breast-related, and 228/237 were breast cancers. The non-BC members were breast deformities or genetic susceptibility to BC. The maximum silhouette score was 0.87. XGBoost classified 23,521 patients as 6-month or 3-week with an accuracy of 75% and an F-score of 0.73. Clusters relevant to BC diagnosis included limb pain and nausea.
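For readers less familiar with the reported metrics, the sketch below shows how accuracy and an F-score are computed from binary confusion counts. It assumes the F-score is F1 (the abstract does not say), and the counts are invented for illustration; they are not the study's data.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (F1, assumed)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up confusion counts, treating one group as the positive class.
tp, fp, fn, tn = 30, 10, 10, 50
acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(acc, f1)
```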

CONCLUSIONS: We showed a signal that separates patients at critical time points prior to BC diagnosis. This signal was found using the relative position of patient diagnoses in vector space; we have demonstrated that valuable insights into patient status and progress can be found using unsupervised clustering. This work, while early, establishes a technique that we are developing towards early-prediction capabilities.

Conference/Value in Health Info

2023-05, ISPOR 2023, Boston, MA, USA

Value in Health, Volume 26, Issue 6, S2 (June 2023)




Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Electronic Medical & Health Records


No Additional Disease & Conditions/Specialized Treatment Areas
