Ontology-Based Text Mining in Scientific Literature

Author(s)

Witzmann A¹, Batanova E², Queiros L², Abogunrin S³
¹F. Hoffmann-La Roche, Kaiseraugst, AG, Switzerland, ²F. Hoffmann-La Roche, Basel, Switzerland, ³F. Hoffmann-La Roche Ltd., Basel, BS, Switzerland

Background: Text mining tools such as I2E (Linguamatics) match a user query, for example a set of keywords, against a document collection to find relevant documents. The objective was to determine whether automated classification of titles and abstracts (TIAB) can reduce the time experts spend reviewing TIAB for inclusion in systematic literature reviews (SLRs).

Methods: The data collection consisted of annotated articles from a single therapeutic area. The PICOS (population, intervention, comparator, outcomes, and study design) framework, which describes the study selection criteria, was the basis for the initial text mining queries. The queries were then refined with keywords and linguistic expressions drawn from a training sample of the data collection. The Linguamatics I2E system indexed the data collection, adding domain ontologies to allow fast querying. The final queries were used to classify each article in a held-out test sample of the data collection. The Excel outputs of the executed I2E queries were loaded into R to calculate recall, precision, and F-measure. Work saved over sampling at 95% recall (WSS@95) was used to estimate the human effort averted by the tool.
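The evaluation metrics named above can be illustrated with a minimal sketch. It is written in Python for illustration only (the study itself used R), and it assumes the commonly used definition WSS@R = (TN + FN) / N - (1 - R); the abstract does not state which formula was applied. The class name, function names, and article counts below are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Screening outcome counts for one review topic (hypothetical structure)."""
    tp: int  # relevant articles the query flagged for inclusion
    fp: int  # irrelevant articles the query flagged for inclusion
    fn: int  # relevant articles the query missed
    tn: int  # irrelevant articles the query correctly excluded

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn


def recall(c: Confusion) -> float:
    """Share of truly relevant articles that the query retrieved."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def precision(c: Confusion) -> float:
    """Share of retrieved articles that are truly relevant."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def f_measure(c: Confusion) -> float:
    """Harmonic mean of precision and recall (F1)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def wss(c: Confusion, target_recall: float = 0.95) -> float:
    """Work saved over sampling, assuming WSS@R = (TN + FN) / N - (1 - R)."""
    if c.total == 0:
        raise ValueError("no articles to evaluate")
    return (c.tn + c.fn) / c.total - (1.0 - target_recall)


# Hypothetical counts for a 1,000-article test sample (not study data).
example = Confusion(tp=19, fp=181, fn=1, tn=799)

if __name__ == "__main__":
    print(f"recall    = {recall(example):.3f}")
    print(f"precision = {precision(example):.3f}")
    print(f"F-measure = {f_measure(example):.3f}")
    print(f"WSS@95    = {wss(example):.3f}")
```

In this sketch, articles the query excludes (TN + FN) are the ones a reviewer would not need to read, so WSS@95 expresses that saving relative to random sampling at the same 95% recall. A query can therefore show low precision and still yield a large reduction in screening effort.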

Results: The data collection covered five review topics, all in oncology: three clinical reviews, one economic review, and one utility review. The review datasets ranged in size from 288 to 9,123 articles. Across review topics, recall ranged from 0.50 to 1.00, precision from 0.05 to 0.24, and F-measure from 0.10 to 0.38. The number of articles needing manual review was reduced for all review topics studied (WSS@95 ≥ 61%).

Conclusions: Automated document classification could be a valuable approach in supporting SLRs. While there seems to be a clear advantage to using text mining for reviewing TIAB, project teams should consider the additional effort required for document indexing and query definition.

Conference/Value in Health Info

2021-11, ISPOR Europe 2021, Copenhagen, Denmark

Value in Health, Volume 24, Issue 12, S2 (December 2021)

Code

POSB300

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Specific Disease
