Landscape of Natural Language Processing (NLP) Capabilities at Clinical Sites: Insights From a Real-World (RW) Gastric Cancer (GC) Study
Author(s)
Spencer Jones, PhD1, Lucy Turner, BSc2, Aisha Rashid, BSc2, Julia Gallinaro, PhD3, Marina Borges, MSc4, Karina Vitanova, PhD3, Elizabeth Eldridge, MPH5, Merce Conill, MSc6, Valeria Saglimbene, PhD7, Ines Guerra, MSc3.
1AstraZeneca, Zürich, Switzerland, 2AstraZeneca, Baar, Switzerland, 3IQVIA, London, United Kingdom, 4IQVIA, Oeiras, Portugal, 5IQVIA, Durham, NC, USA, 6IQVIA, Barcelona, Spain, 7IQVIA, Milan, Italy.
OBJECTIVES: Medical notes contain valuable clinical information, yet they are often underutilized in RW evidence generation due to the cost and complexity of manual curation. NLP offers solutions for information extraction; however, its adoption across clinical sites is unclear. The study aimed to assess the current landscape of NLP capabilities for research purposes across sites participating in an RW GC study.
METHODS: A feasibility questionnaire (FQ) was developed to capture information on sites’ NLP capabilities, including technical details of NLP (e.g. type of model, validation procedure and metrics), regulatory compliance and quality assurance processes in place. Participating sites were selected based on their expertise in GC treatment, with many sites belonging to the Oncology Evidence Network. The FQ was sent to 27 sites across six countries. Follow-up interviews were conducted to clarify responses.
RESULTS: Of the 17 responding sites, nine reported having NLP capabilities (two in France, one in Italy, one in Germany, two in the United Kingdom, one in Switzerland and two in Canada). Among these, four sites had already extracted study-relevant variables using NLP (sites in France, Italy and Germany). Three of the four also indicated capacity to extract additional variables. The two sites in the United Kingdom had prior NLP experience but lacked reusable algorithms. One Canadian site had piloted NLP internally; the other Canadian site and the Swiss site provided limited details. The NLP approaches varied and included rule-based, machine learning-based and deep learning-based algorithms, either developed or fine-tuned in-house or provided as commercially available software. Of the four sites with NLP-derived variables, three had data quality assurance processes, and two confirmed regulatory compliance (e.g. General Data Protection Regulation).
CONCLUSIONS: NLP adoption for variable extraction in clinical settings remains limited. While roughly half of responding sites (9 of 17) have explored NLP internally or in past studies, only a subset have validated, reusable algorithms readily available for research purposes.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
RWD113
Topic
Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches
Topic Subcategory
Health & Insurance Records Systems
Disease
No Additional Disease & Conditions/Specialized Treatment Areas