Application of Text Mining to the Development of a Geographic Search Filter to Facilitate Evidence Retrieval in Ovid Medline

Author(s)

Popoff E, Cheung A, Szabo S
Broadstreet HEOR, Vancouver, BC, Canada

OBJECTIVES : Text mining is a valuable technique for analyzing large unstructured datasets to identify meaningful patterns. Given the increasing volume of published research in bibliographic databases like MEDLINE, efficient retrieval of relevant evidence is crucial and represents an opportunity to integrate text mining tools. This study aimed to develop a geographic search filter for accurately identifying research from the United States (U.S.) in Ovid MEDLINE.

METHODS : U.S. and non-U.S. citations with a valid PubMed ID were collected from the bibliographies of evidence-based reviews by the U.S. Preventive Services Task Force. U.S. citations were defined as those with U.S.-based author affiliations and a U.S.-based publishing location and/or grant funding. Citations were stratified by U.S./non-U.S. status and randomly divided 3:1 into a training set, used to identify search terms for the filter, and a testing set, used for its validation. Punctuation and commonly occurring words such as conjunctions were removed. Using text mining, common one- and two-word terms in the title/abstract fields were identified, and their frequencies were compared between U.S. and non-U.S. citations. Analyses used the tidytext package in R.
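The steps described in the methods can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' tidytext code in R. The stop-word list, the toy citations, the add-0.5 smoothing, and the choice to measure frequency as the share of citations containing a term are all assumptions made for the example; the abstract does not specify them.

```python
import random
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; not the list used in the study).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "or", "for", "to", "with", "on", "among"}


def stratified_split(citations, ratio=0.75, seed=0):
    """Split citations 3:1 into training and testing sets within each U.S./non-U.S. stratum."""
    rng = random.Random(seed)
    train, test = [], []
    for is_us in (True, False):
        stratum = [c for c in citations if c["is_us"] == is_us]
        rng.shuffle(stratum)
        cut = round(len(stratum) * ratio)
        train.extend(stratum[:cut])
        test.extend(stratum[cut:])
    return train, test


def terms(text):
    """Return one- and two-word terms after dropping punctuation and stop words."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def frequency_ratios(citations, smoothing=0.5):
    """Share of U.S. citations containing each term divided by the share of non-U.S. citations.

    The smoothing constant (an assumption) avoids division by zero for terms
    that appear in only one group.
    """
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for c in citations:
        totals[c["is_us"]] += 1
        counts[c["is_us"]].update(set(terms(c["text"])))
    vocab = set(counts[True]) | set(counts[False])
    return {
        t: ((counts[True][t] + smoothing) / (totals[True] + smoothing))
        / ((counts[False][t] + smoothing) / (totals[False] + smoothing))
        for t in vocab
    }


# Invented toy citations, for illustration only.
toy = [
    {"text": "Outcomes among Medicare beneficiaries in Baltimore", "is_us": True},
    {"text": "Screening uptake among Americans", "is_us": True},
    {"text": "Statin use in Medicare beneficiaries", "is_us": True},
    {"text": "Diabetes prevention in older adults", "is_us": True},
    {"text": "Screening uptake in Japan", "is_us": False},
    {"text": "Statin use in French adults", "is_us": False},
    {"text": "Diabetes prevention in Japan", "is_us": False},
    {"text": "Hypertension outcomes in older adults", "is_us": False},
]

train, test = stratified_split(toy)
ratios = frequency_ratios(train)
```

A ratio above 1 marks a term that is more common in U.S. citations, and a ratio below 1 marks a term that is more common in non-U.S. citations. In this sketch, two-word terms are formed after stop words are removed, which may differ from how the study tokenized its text.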

RESULTS : A total of 21,915 citations were collected; 16,436 were assigned to the training set (N=5,902 U.S.; N=10,534 non-U.S.). Terms more common in U.S. citations (with the ratio of frequency in U.S. to non-U.S. citations in parentheses) included U.S. populations (“Americans” (15.5), “Medicare beneficiaries” (12.0)) and U.S. geographic terms (“Baltimore” (20.0)). Terms more common in non-U.S. citations were non-U.S. geographic terms (“Japan” (0.04), “French” (0.05)). A preliminary search filter was developed by combining terms related to U.S. citations in title/abstract fields.
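As an illustration of how U.S.-related terms might be combined into a single title/abstract search line, the sketch below keeps terms whose ratio meets a threshold and joins them with "or" under the Ovid `.ti,ab.` field suffix. The ratios are the five reported above. The threshold of 10 and the function name `build_ovid_filter` are hypothetical, and the abstract does not report the content of the actual preliminary filter.

```python
# Frequency ratios (U.S. to non-U.S.) as reported in the results.
reported = {
    "Americans": 15.5,
    "Medicare beneficiaries": 12.0,
    "Baltimore": 20.0,
    "Japan": 0.04,
    "French": 0.05,
}


def build_ovid_filter(ratios, min_ratio=10.0):
    """Join terms with ratio >= min_ratio into one Ovid title/abstract search line.

    min_ratio is a hypothetical cut-off chosen for this example.
    Returns an empty string when no term qualifies.
    """
    selected = sorted(t for t, r in ratios.items() if r >= min_ratio)
    if not selected:
        return ""
    # Quote multi-word terms so they are searched as phrases.
    quoted = [f'"{t}"' if " " in t else t for t in selected]
    return "(" + " or ".join(quoted) + ").ti,ab."


print(build_ovid_filter(reported))
# (Americans or Baltimore or "Medicare beneficiaries").ti,ab.
```

Raising the threshold would favor specificity at the cost of sensitivity, which is the trade-off the planned validation on the testing set is meant to assess.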

CONCLUSIONS : Development of this MEDLINE-based search filter will streamline the systematic identification of evidence from U.S. studies. Periodic updates will be necessary to reflect changes in MEDLINE’s controlled vocabulary. Future work will include validation of the filter using the testing set, refinement to improve sensitivity and specificity, and application of these methods to develop search strategies specific to other jurisdictions.

Conference/Value in Health Info

2020-11, ISPOR Europe 2020, Milan, Italy

Value in Health, Volume 23, Issue S2 (December 2020)

Code

PNS46

Topic

Epidemiology & Public Health, Methodological & Statistical Research, Organizational Practices

Topic Subcategory

Academic & Educational, Artificial Intelligence, Machine Learning, Predictive Analytics, Public Health

Disease

No Specific Disease
