ACCURACY OF TEXT PROCESSING TOKENISERS FOR AUTOMATED IDENTIFICATION OF DISEASES AND INTERVENTIONS IN ABSTRACTS OF STUDIES ON HUMANISTIC AND ECONOMIC BURDEN OF DISEASE

Author(s)

Challen R¹, Martin A², Martin C²
¹Terminological Ltd, Hove, UK, ²Crystallise Ltd., London, UK

OBJECTIVES: To determine the sensitivity and specificity of text-processing software for classifying diseases and interventions in PubMed abstracts relevant to the humanistic or economic burden of disease.

METHODS: We developed an online database (www.heoro.com) of abstracts from over 100,000 studies on the humanistic and economic burden of disease, identified by a systematic search of PubMed. We manually indexed 10,000 abstracts to detailed ontologies of diseases and interventions, as well as to study types, PRO instruments and geographical setting. The disease and intervention ontologies were developed from MeSH terms and lists of licensed drugs from the US and UK, with new items added when identified from the abstracts. We used this training set to develop tokenisers that match the text, MeSH headings and metadata of the abstracts to relevant ontology items. We then assessed the initial accuracy of the tokeniser matching by expert evaluation of a sample of 150 abstracts from the unmoderated set, prior to further software refinements.

RESULTS: Compared with expert assessment, the tokeniser matching had a sensitivity of 95% for disease ontology items and 85% for intervention ontology items. The specificity, defined as matching to any ontology items that appeared in the text, MeSH headings or metadata of each abstract, was 89% for diseases and 91% for interventions. Matching was more accurate for drug terms than for non-pharmaceutical interventions, which tend to be described less consistently.

CONCLUSIONS: With an overall accuracy of around 90%, the initial tokeniser matching compared reasonably well with indexing of abstracts by less experienced scientists. Ongoing final expert checking and further software refinement will improve the specificity of the indexing to topics that were the focus of the research. As 90,000 abstracts could be indexed within hours, this method facilitates a streamlined approach to identifying relevant data for health economics and outcomes research.
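
The matching and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' software: the ontology entries, the function names (`tokenise`, `match_items`, `sensitivity_specificity`) and the example sentence are all hypothetical, and the metrics use the conventional confusion-matrix definitions, which may differ from the specificity definition given in the RESULTS.

```python
import re

# Hypothetical mini-ontology: item identifier -> surface forms (synonyms).
# The real disease and intervention ontologies are far larger.
ONTOLOGY = {
    "disease:asthma": ["asthma"],
    "disease:type 2 diabetes": ["type 2 diabetes", "t2dm"],
    "intervention:metformin": ["metformin"],
    "intervention:insulin": ["insulin"],
}


def tokenise(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_items(text, ontology=ONTOLOGY):
    """Return ontology items with a synonym appearing as a token run in text."""
    tokens = tokenise(text)
    matched = set()
    for item, synonyms in ontology.items():
        for synonym in synonyms:
            syn_tokens = tokenise(synonym)
            n = len(syn_tokens)
            # Slide a window of n tokens across the abstract.
            if any(tokens[i:i + n] == syn_tokens
                   for i in range(len(tokens) - n + 1)):
                matched.add(item)
                break
    return matched


def sensitivity_specificity(predicted, expert, candidates):
    """Conventional sensitivity and specificity over a set of candidate items.

    predicted  -- items assigned by the tokeniser
    expert     -- items assigned by the expert reviewer (reference standard)
    candidates -- all items that could have been assigned
    """
    tp = len(predicted & expert)
    fn = len(expert - predicted)
    fp = len(predicted - expert)
    tn = len(candidates - predicted - expert)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    abstract = "Cost of metformin in adults with T2DM and asthma."
    predicted = match_items(abstract)
    # Assumed expert judgement: asthma is mentioned but is not a study focus.
    expert = {"disease:type 2 diabetes", "intervention:metformin"}
    print(sensitivity_specificity(predicted, expert, set(ONTOLOGY)))
```

In this toy case the tokeniser finds every expert-assigned item but also matches an item that is only mentioned in passing, so sensitivity is 1.0 and specificity is 0.5. This is the kind of over-matching that the expert checking described in the CONCLUSIONS is meant to reduce.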

Conference/Value in Health Info

2016-10, ISPOR Europe 2016, Vienna, Austria

Value in Health, Vol. 19, No. 7 (November 2016)

Code

PRM58

Topic

Real World Data & Information Systems

Topic Subcategory

Reproducibility & Replicability

Disease

Multiple Diseases
