An AI-Powered Tool to Identify and Assess Fit-for-Use Registries for Drug Development and Evaluation
Author(s)
Ghinwa Y. Hayek, MPH1, Boris Kopin, MSc1, Sonia Zebachi, PhD1, Gaëtan Pinon, MSc1, Basile Ferry, MSc1, Elisabeth Bakker, MSc2, Alexandre Macquin, MSc1, Sieta de Vries, PhD2, Peter GM Mol, PharmD, PhD2, Billy Amzal, MBA, MPH, MSc, PhD1.
1Quinten Health, Paris, France, 2Department of Clinical Pharmacy and Pharmacology, University of Groningen, University Medical Center Groningen, Groningen, Netherlands.
OBJECTIVES: Selecting appropriate real-world data (RWD) sources, particularly registries, is a primary challenge for academia, industry, regulators, and health technology assessment (HTA) bodies, as successful submissions rely on data quality and relevance. Identifying an RWD source is often lengthy and complex because of discoverability and accessibility issues, notably the multiplicity of data catalogues and the unavailability of metadata. Natural language processing techniques can help address these challenges. In the context of the public-private More-EUROPA consortium, we propose an AI-powered tool to support the identification and selection of fit-for-use registries across the drug development lifecycle.
METHODS: The tool was developed around three key pillars: 1) centralisation of data sources; 2) assessment of available metadata; 3) identification of relevant sources. For pillar 1, registries were extracted from the HMA-EMA catalogues of real-world data sources, observational studies in ClinicalTrials.gov, and published literature (PubMed and Semantic Scholar). For pillar 2, metadata were normalised and converged into a "common metadata model" based on PICOTS (population, intervention, comparator, outcome, time, setting). A Large Language Model (LLM) was applied to extract key information from unstructured publication data. For pillar 3, a machine learning algorithm was developed to de-duplicate and identify registries across the four data sources of pillar 1.
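As an illustration of pillars 2 and 3, the sketch below shows one way a PICOTS-based record and a simple de-duplication step could look. It is not the consortium's implementation: the `RegistryRecord` fields beyond the six PICOTS elements, the helper names, the example registry names, and the 0.9 similarity threshold are all assumptions, and plain string similarity from the Python standard library stands in for the machine learning algorithm described above.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class RegistryRecord:
    """Hypothetical 'common metadata model' entry built around PICOTS."""
    name: str
    source: str  # e.g. "hma-ema", "clinicaltrials.gov", "pubmed", "semantic-scholar"
    population: str = ""
    intervention: str = ""
    comparator: str = ""
    outcome: str = ""
    time: str = ""
    setting: str = ""


def normalise(name: str) -> str:
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def same_registry(a: RegistryRecord, b: RegistryRecord, threshold: float = 0.9) -> bool:
    """Treat two records as the same registry if their normalised names are similar enough."""
    ratio = SequenceMatcher(None, normalise(a.name), normalise(b.name)).ratio()
    return ratio >= threshold


def deduplicate(records: list[RegistryRecord], threshold: float = 0.9) -> list[list[RegistryRecord]]:
    """Greedily group records; each record joins the first cluster whose seed it matches."""
    clusters: list[list[RegistryRecord]] = []
    for record in records:
        for cluster in clusters:
            if same_registry(record, cluster[0], threshold):
                cluster.append(record)
                break
        else:
            clusters.append([record])
    return clusters


# Made-up records for illustration only; they do not come from the tool.
records = [
    RegistryRecord("Example Arthritis Registry", "hma-ema", population="adults with arthritis"),
    RegistryRecord("EXAMPLE arthritis registry.", "pubmed"),
    RegistryRecord("Sample Diabetes Cohort", "clinicaltrials.gov"),
]
clusters = deduplicate(records)
```

Here the first two records differ only in case and punctuation, so they fall into one cluster, while the third forms its own. A production system would likely also compare acronyms, countries, and PICOTS fields rather than names alone.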
RESULTS: The current AI-powered tool includes 245 registries extracted from the EMA catalogues, 8,300 observational studies, and 12,000 unique registries identified from 220,000 publications. During beta testing with consortium members, including regulators, HTA agencies, industry, and researchers, the trained LLM demonstrated consistent and accurate extraction of PICOTS-related metadata from publications.
CONCLUSIONS: The AI-powered tool is a comprehensive platform for identifying registries suited to diverse research questions across the drug development lifecycle. Future development phases will be co-designed with users, including regulators, HTA bodies, and industry, to ensure its practical adoption for evaluation purposes.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
RWD16
Topic
Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches
Topic Subcategory
Reproducibility & Replicability
Disease
No Additional Disease & Conditions/Specialized Treatment Areas