Ontology-Based Text Mining in Scientific Literature
Author(s)
Witzmann A¹, Batanova E², Queiros L², Abogunrin S³
¹F. Hoffmann-La Roche, Kaiseraugst, AG, Switzerland; ²F. Hoffmann-La Roche, Basel, Switzerland; ³F. Hoffmann-La Roche Ltd., Basel, BS, Switzerland
Background: Text mining tools such as I2E (Linguamatics) match a user query, such as a set of keywords, against a document collection to find relevant documents. The objective was to determine whether automated classification of titles and abstracts (TIAB) can reduce the time experts spend reviewing TIAB for inclusion in systematic literature reviews (SLRs).

Methods: The data collection consisted of annotated articles across a therapeutic area. The PICOS (population, intervention, comparator, outcomes and study design) framework, which describes the study selection criteria, was the basis for the initial text mining queries used in this analysis. The queries were then refined with keywords and linguistic expressions drawn from a training sample of the data collection. The Linguamatics I2E system indexed the data collection, adding domain ontologies to allow fast querying. The resulting queries were used to classify each article in a held-back test sample of the data collection. The Excel outputs from the executed I2E queries were loaded into R to calculate recall, precision, and F-measure. Work saved over sampling at 95% recall (WSS@95) was used to estimate the human effort averted by the tool.

Results: The data collection covered five review topics, all in oncology: three clinical reviews, one economic review, and one utility review. The review datasets ranged in size from 288 to 9,123 articles. Recall ranged from 0.50 to 1.00, precision from 0.05 to 0.24, and F-measure from 0.10 to 0.38. The number of articles needing manual review was reduced for all review topics studied (WSS@95 ≥ 61%).

Conclusions: Automated document classification could be a valuable approach in supporting SLRs. While there seems to be a clear advantage to using text mining for reviewing TIAB, project teams should consider the additional effort required for document indexing and query definition.
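The evaluation metrics named in the Methods can be illustrated with a minimal sketch. The abstract reports that the metrics were computed in R and does not give the formulas, so the Python below is only an assumption-laden illustration: the function names, the toy labels, and the WSS formula (the commonly used definition, (TN + FN) / N minus (1 minus the recall target)) are not taken from the authors' analysis.

```python
from typing import NamedTuple, Sequence


class Counts(NamedTuple):
    tp: int  # relevant articles the query flagged
    fp: int  # irrelevant articles the query flagged
    fn: int  # relevant articles the query missed
    tn: int  # irrelevant articles the query left out


def confusion_counts(predicted: Sequence[bool], actual: Sequence[bool]) -> Counts:
    """Tally query decisions against the reviewers' include/exclude labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = [(bool(p), bool(a)) for p, a in zip(predicted, actual)]
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    return Counts(tp, fp, fn, tn)


def screening_metrics(c: Counts, target_recall: float = 0.95) -> dict:
    """Recall, precision, F-measure, and WSS at the given recall target."""
    n = c.tp + c.fp + c.fn + c.tn
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    # Assumed definition: fraction of articles not sent to manual review,
    # minus the fraction a random sample would save at the same recall target.
    wss = (c.tn + c.fn) / n - (1.0 - target_recall) if n else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "wss": wss}


# Hypothetical toy data: 10 articles, 2 truly relevant, 3 flagged by the query.
actual = [True, True] + [False] * 8
predicted = [True, True, True] + [False] * 7
metrics = screening_metrics(confusion_counts(predicted, actual))
print(metrics)
```

In this toy case the query flags three of ten articles, so seven are spared manual review. Low precision with high recall, as in the reported results, can still yield a large WSS because the metric rewards articles removed from the manual screening pile.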
Conference/Value in Health Info
2021-11, ISPOR Europe 2021, Copenhagen, Denmark
Value in Health, Volume 24, Issue 12, S2 (December 2021)
Code
POSB300
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Specific Disease