Application of Text Mining to the Development of a Geographic Search Filter to Facilitate Evidence Retrieval in Ovid Medline
Author(s)
Popoff E, Cheung A, Szabo S
Broadstreet HEOR, Vancouver, BC, Canada
OBJECTIVES: Text mining is a valuable technique for analyzing large unstructured datasets to identify meaningful patterns. Given the increasing volume of published research in bibliographic databases such as MEDLINE, efficient retrieval of relevant evidence is crucial, and text mining tools offer an opportunity to support it. This study aimed to develop a geographic search filter that accurately identifies research from the United States (U.S.) in Ovid MEDLINE.
METHODS: U.S. and non-U.S. citations with a valid PubMed ID were collected from the bibliographies of evidence-based reviews by the U.S. Preventive Services Task Force. U.S. citations were defined as those with U.S.-based author affiliations and a U.S.-based publishing location and/or grant funding. Citations were partitioned by U.S./non-U.S. status and randomly divided 3:1 into a training set, used to identify search terms for the filter, and a testing set, used for its validation. Punctuation and commonly occurring words such as conjunctions were removed. Using text mining, common one- and two-word terms in the title/abstract fields were identified, and their frequencies were compared between U.S. and non-U.S. citations. Analyses used the tidytext package in R.
RESULTS: A total of 21,915 citations were collected; 16,436 were assigned to the training set (N=5,902 U.S.; N=10,534 non-U.S.). Terms common in U.S. citations (with the ratio of frequency in U.S. to non-U.S. citations in parentheses) included U.S. populations (“Americans” (15.5), “Medicare beneficiaries” (12.0)) and U.S. geographic terms (“Baltimore” (20.0)). Terms common in non-U.S. citations were non-U.S. geographic terms (“Japan” (0.04), “French” (0.05)). A preliminary search filter was developed by combining terms related to U.S. citations in the title/abstract fields.
CONCLUSIONS: Development of this MEDLINE-based search filter will streamline the systematic identification of evidence from U.S. studies. Periodic updates will be necessary to reflect changes in MEDLINE’s controlled vocabulary. Future work will include validation of the filter using the testing set, refinement to improve sensitivity/specificity, and application of these methods to develop search strategies specific to other jurisdictions.
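
The term-comparison step described in the methods can be illustrated with a minimal sketch. It is written in Python rather than the R tidytext package the authors used, and several details are assumptions not stated in the abstract: the stop-word list, the toy citations, the definition of "frequency" as the share of citations containing a term, and the small pseudo-count that avoids division by zero. All function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption: the abstract does not say which list was used).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "or", "to", "for", "with", "among", "on"}


def one_and_two_word_terms(text):
    """Return the set of one- and two-word terms in a title/abstract string."""
    # Lowercasing and keeping only letter runs removes punctuation.
    words = re.findall(r"[a-z]+", text.lower())
    unigrams = [w for w in words if w not in STOP_WORDS]
    # Keep a two-word term only if neither of its words is a stop word.
    bigrams = [
        f"{first} {second}"
        for first, second in zip(words, words[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    ]
    return set(unigrams + bigrams)


def citation_counts(citations):
    """Count how many citations contain each term."""
    counts = Counter()
    for text in citations:
        counts.update(one_and_two_word_terms(text))
    return counts


def frequency_ratios(us_citations, non_us_citations, pseudo_count=0.5):
    """Ratio of each term's frequency in U.S. citations to non-U.S. citations.

    pseudo_count is an assumed smoothing constant so that terms absent from
    one group still yield a finite ratio.
    """
    us_counts = citation_counts(us_citations)
    non_us_counts = citation_counts(non_us_citations)
    ratios = {}
    for term in set(us_counts) | set(non_us_counts):
        us_rate = (us_counts[term] + pseudo_count) / (len(us_citations) + pseudo_count)
        non_us_rate = (non_us_counts[term] + pseudo_count) / (len(non_us_citations) + pseudo_count)
        ratios[term] = us_rate / non_us_rate
    return ratios


# Toy training data (invented for illustration only, not from the study).
us_training = [
    "Medicare beneficiaries in Baltimore",
    "Screening among Medicare beneficiaries",
]
non_us_training = [
    "Screening in Japan",
    "A French cohort study",
]

ratios = frequency_ratios(us_training, non_us_training)
# Terms with a ratio well above 1 are candidates for a U.S. filter.
candidate_terms = sorted(term for term, ratio in ratios.items() if ratio > 2)
print(candidate_terms)
```

Ratios well above 1 flag terms that are more frequent in U.S. citations, and ratios well below 1 flag terms that are more frequent in non-U.S. citations, matching how the reported values for “Baltimore” and “Japan” are read. The cutoff of 2 is arbitrary; the abstract does not report the threshold used to select filter terms.
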
Conference/Value in Health Info
2020-11, ISPOR Europe 2020, Milan, Italy
Value in Health, Volume 23, Issue S2 (December 2020)
Code
PNS46
Topic
Epidemiology & Public Health, Methodological & Statistical Research, Organizational Practices
Topic Subcategory
Academic & Educational, Artificial Intelligence, Machine Learning, Predictive Analytics, Public Health
Disease
No Specific Disease