Prediction of Query Outcomes in the Setting of Observational Studies
Author(s)
Simone Schena, MSc¹, Alessandra Mignani, MSc¹, Giulio Mazzarelli, MSc¹, Lucia Simoni, PhD¹, Alessandra Ori, MSc, EMBA¹, Fabio Ferri, MSc¹, Duccio Urbinati, MSc, PharmD².
¹IQVIA Solutions Italy S.r.l., Modena, Italy; ²IQVIA Solutions Italy S.r.l., Milan, Italy.
OBJECTIVES: Managing queries in observational studies remains challenging due to the need to balance research staff effort with database cleanliness. Moreover, unclear or unacceptable justifications for data discrepancies can lead to query rejections, delaying study completion. This research investigates the extent to which the approval or rejection of a query can be predicted by analyzing its textual content and contextual information.
METHODS: Data from multiple observational studies, both multi-country and local, were analyzed to capture variability in query patterns. The analysis comprised data harmonization across datasets, classification of queries using predefined descriptors, and feature extraction from query text through linguistic analysis. Extracted features included text metrics, action word indicators (presence or absence of terms such as "check" and "verify"), and content indicators (references to ranges or dates). Machine learning models (e.g., logistic regression, classification tree, and random forest) are being developed with a training-test split to assess how well the approval or rejection of a query can be predicted and to identify key factors to consider when formulating queries.
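The feature extraction step could look roughly like the following Python sketch. It is an illustration only, not the authors' pipeline: the function name `extract_features`, the word list, the two text metrics, and the regular expressions for dates and ranges are all assumptions, since the abstract does not specify how these features were computed.

```python
import re

# Assumed word list: these are the action words named in the abstract;
# the authors' actual list is not given.
ACTION_WORDS = ("check", "verify", "update", "amend", "provide")

# Assumed patterns for content indicators (hypothetical, for illustration).
DATE_PATTERN = re.compile(
    r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"  # e.g. 12/03/2024 or 12-03-24
    r"|\b\d{4}-\d{2}-\d{2}\b"             # e.g. 2024-03-12
)
RANGE_PATTERN = re.compile(r"\b(range|between)\b", re.IGNORECASE)


def extract_features(query_text: str) -> dict:
    """Turn one query text into a flat dictionary of numeric features."""
    # Lowercased word-like tokens (runs of letters, digits, underscores).
    tokens = re.findall(r"\w+", query_text.lower())

    # Simple text metrics: character count and token count.
    features = {
        "n_chars": len(query_text),
        "n_words": len(tokens),
    }

    # Action word indicators: 1 if the word occurs as a token, else 0.
    for word in ACTION_WORDS:
        features[f"has_{word}"] = int(word in tokens)

    # Content indicators: does the text reference a date or a range?
    features["mentions_date"] = int(bool(DATE_PATTERN.search(query_text)))
    features["mentions_range"] = int(bool(RANGE_PATTERN.search(query_text)))
    return features


example = extract_features(
    "Please check the visit date 12/03/2024: value is out of range."
)
```

Each query becomes one row of 0/1 indicators and counts, which is the kind of tabular input that logistic regression, classification trees, and random forests accept directly.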
RESULTS: Of 28175 queries, 2119 (7.5%) were rejected, indicating substantial class imbalance. Among queries raised manually by the data manager (n=10376, 36.8%), 1362 (13.1%) were rejected. The most common action words in queries related to the entered information were “check” (46.4%, n=13060), “amend” (40.8%, n=11497), and “update” (38.2%, n=10765), whereas “provide” appeared in only 0.8% (n=229) of queries. Model results are currently under analysis.
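The reported percentages can be rechecked from the counts, and the same counts show how strongly a classifier would need to compensate for the imbalance. The Python sketch below is illustrative only: it assumes every non-rejected query counts as approved, it derives the non-manual subgroup by subtraction (the abstract does not describe that subgroup), and the "balanced" weighting formula n_samples / (n_classes × n_class) is a common heuristic, not a method the abstract says was used.

```python
# Counts as reported in the abstract.
total_queries = 28175
total_rejected = 2119
manual_queries = 10376
manual_rejected = 1362


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`."""
    return 100.0 * part / whole


# Rates stated in the abstract, recomputed from the counts.
overall_rejection_rate = pct(total_rejected, total_queries)
manual_share = pct(manual_queries, total_queries)
manual_rejection_rate = pct(manual_rejected, manual_queries)

# Derived by subtraction: queries not raised manually (an inferred subgroup).
other_queries = total_queries - manual_queries
other_rejected = total_rejected - manual_rejected
other_rejection_rate = pct(other_rejected, other_queries)


def balanced_weights(class_counts: dict) -> dict:
    """Class weights inversely proportional to class frequency:
    n_samples / (n_classes * n_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Assumption: every query that was not rejected is treated as approved.
weights = balanced_weights({
    "approved": total_queries - total_rejected,
    "rejected": total_rejected,
})

print(f"overall rejection rate: {overall_rejection_rate:.1f}%")
print(f"manual rejection rate:  {manual_rejection_rate:.1f}%")
print(f"other rejection rate:   {other_rejection_rate:.1f}%")
print(f"class weights:          {weights}")
```

Under these assumptions the rejected class receives a much larger weight than the approved class, which is one simple way to keep a model from defaulting to "approved" for every query.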
CONCLUSIONS: Predicting query outcomes and identifying the factors that data managers should consider when formulating queries are crucial to avoid delays in study completion and to minimize the workload for research staff. This approach combines data management, statistics, and machine learning techniques, offering practical guidance for optimizing data management workflows and prioritizing review efforts in the real-world setting.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR171
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas