Training Models for Machine-Enabled Systematic Literature Reviews: Do Large Datasets Always Give Better Results?

Speaker(s)

Abogunrin S1, Batanova E2, Karthick S3, LeDrew N4, Mestre M5, Oliver G6, Queiros L7, Witzmann A8
1F. Hoffmann-La Roche, Basel, BS, Switzerland, 2F. Hoffmann-La Roche, Basel, Switzerland, 3CapeStart, Littleton, MA, USA, 4Evidence Partners, Ottawa, ON, Canada, 5DataQA, London, UK, 6CapeStart, Cambridge, MA, USA, 7F. Hoffmann-La Roche, Basel, Switzerland, 8F. Hoffmann-La Roche, Kaiseraugst, AG, Switzerland

OBJECTIVES: In supervised learning, larger training datasets are generally expected to yield artificial intelligence (AI) models that make more accurate predictions, as can be demonstrated by evaluating the final tuned model on a labeled test dataset. However, it is not clear what the minimum number of records needed to train such models is, or whether larger training datasets produce significantly better results. We investigated these questions using randomized controlled trial data from oncology patients as an example.

METHODS: Data from a retrospective, human-led systematic literature review (SLR) were processed in three SLR tools with AI capabilities. For each tool, three binary classification models were trained, with 50 records, 816 records (approximately 10% of the dataset sample), and 1631 records (approximately 20% of the dataset sample), giving 9 models in total. The same training datasets were used across the three tools. Each model then classified the records not used for its training as relevant or irrelevant. The automatic classifications were compared with the human classifications using confusion matrices, precision, recall, and F1 scores.
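As a minimal illustration of this evaluation protocol (not the proprietary pipelines of the three SLR tools), the sketch below trains a binary relevance classifier on a small labeled subset and scores its predictions on the held-out records using a confusion matrix, precision, recall, and F1. The TF-IDF plus logistic regression pipeline, the toy corpus, and the split size are illustrative assumptions, not details drawn from the study.

```python
# Sketch of the train/evaluate loop described above: fit on n_train labeled
# records, classify the remaining records, and compare against human labels.
# The pipeline and data below are placeholders, not the tools used in the study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.pipeline import make_pipeline

# Toy title/abstract records; in the study, labels came from a completed
# human-led SLR (1 = relevant, 0 = irrelevant).
records = [
    "randomized controlled trial of drug X in metastatic cancer",
    "phase III RCT comparing chemotherapy regimens in oncology patients",
    "case report of a rare dermatological condition",
    "narrative review of health economics methods",
] * 50
labels = [1, 1, 0, 0] * 50

n_train = 50  # on a real dataset, also try 816 and 1631 as in the abstract
X_train, y_train = records[:n_train], labels[:n_train]
X_test, y_test = records[n_train:], labels[n_train:]  # records not used for training

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare automatic classifications to the human classifications.
print(confusion_matrix(y_test, y_pred))
print(f"precision = {precision_score(y_test, y_pred):.2f}")
print(f"recall    = {recall_score(y_test, y_pred):.2f}")
print(f"F1        = {f1_score(y_test, y_pred):.2f}")  # F1 = 2PR / (P + R)
```

Repeating this for each training size on the same held-out pool reproduces the comparison reported in the RESULTS section.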

RESULTS: The dataset sample included 8816 records, of which 8766, 8000, and 7185 were used to test the 50-record, 816-record, and 1631-record models, respectively. Recall, precision, and F1 scores ranged from 0.58 to 0.87, 0.22 to 0.31, and 0.35 to 0.40, respectively, for the 50-record models; from 0.63 to 0.88, 0.18 to 0.43, and 0.30 to 0.51 for the 816-record models; and from 0.62 to 0.83, 0.22 to 0.51, and 0.35 to 0.56 for the 1631-record models. One 50-record model did not generalize sufficiently, and its results were excluded from the analysis.

CONCLUSIONS: Results were similar irrespective of the number of records used to train the AI models. Using a smaller number of training records would be advantageous, particularly for SLRs based on a limited number of articles. Further research should assess how best to select training datasets.

Code

MSR84

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas