Measuring the Elephant in the Room: Quantifying Legal Barriers to AI Processing of Full-Text in Systematic Reviews

Author(s)

Artur Nowak, MSc¹, Marie Lane, BSc², Seye Abogunrin, MPH, MSc, MD².
¹Evidence Prime, Krakow, Poland, ²Roche, Basel, Switzerland.

Presentation Documents

MSR145.pdf

OBJECTIVES: Generative artificial intelligence (AI) tools hold great potential for expediting full-text screening and data extraction in systematic literature reviews. Yet their implementation still faces legal obstacles because most PDFs carry restrictive licences. We quantified the share of articles in recent Roche reviews whose licences permit commercial text-and-data mining, assessed agreement between key open-access (OA) metadata sources, and examined whether the Copyright Clearance Center (CCC) RightFind XML service addresses the remaining rights gaps.
METHODS: A corpus of 3,712 unique DOIs (deduplicated from 6,336 PDFs across 49 reviews) was matched against OpenAlex and PubMed Central (PMC). Licence strings were normalised and classified as commercially friendly (e.g., CC0, CC-BY, public-domain, MIT) or restricted. Where OpenAlex and PMC disagreed, the more permissive term was selected. Non-commercial clauses were treated as prohibitive for AI inference. Rights flags for the same DOIs were subsequently checked in the RightFind XML service.
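The normalisation, classification, and reconciliation steps described in the methods could be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the allow-list is assumed from the examples named in the abstract, the normalisation rules and the function names (`normalise`, `classify`, `reconcile`) are hypothetical, and reconciliation is simplified to compare licence classes rather than individual licence terms.

```python
from typing import Optional

# Assumed allow-list, taken from the examples named in the abstract.
COMMERCIAL_OK = {"cc0", "cc-by", "public-domain", "mit"}

# Assumed ordering of classes from least to most permissive.
RANK = {"unknown": 0, "restricted": 1, "commercial": 2}


def normalise(raw: Optional[str]) -> Optional[str]:
    """Lower-case a licence string, unify separators, drop version numbers."""
    if not raw:
        return None
    s = raw.strip().lower().replace("_", "-").replace(" ", "-")
    # Drop purely numeric parts such as "4.0" in "cc-by-4.0".
    parts = [p for p in s.split("-") if p and not p[0].isdigit()]
    return "-".join(parts) or None


def classify(raw: Optional[str]) -> str:
    """Map a raw licence string to 'commercial', 'restricted', or 'unknown'."""
    norm = normalise(raw)
    if norm is None:
        return "unknown"
    # Anything outside the allow-list, including NC variants, is restricted.
    return "commercial" if norm in COMMERCIAL_OK else "restricted"


def reconcile(openalex: Optional[str], pmc: Optional[str]) -> str:
    """When the two sources disagree, keep the more permissive class."""
    return max(classify(openalex), classify(pmc), key=RANK.__getitem__)


print(reconcile("cc-by-nc", "CC BY 4.0"))
print(reconcile(None, "cc-by-nc-nd"))
```

Under these assumptions, a record that OpenAlex labels `cc-by-nc` and PMC labels `CC BY 4.0` is treated as commercially usable, while a record with only a non-commercial licence stays restricted.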
RESULTS: OpenAlex held records for every DOI, but only 1,584 (43%) exposed a licence; 26 records marked as closed access paradoxically carried OA licences. Among 440 papers with dual metadata, 39 (9%) showed true licence disagreements; PMC was the more permissive source in 15 of these cases. The reconciled set identified 618 PDFs (17%) as commercially usable; 3,094 remained restricted or unknown. RightFind XML contained AI inference-friendly files for 1,332 papers, rescuing 794 otherwise-restricted studies and expanding the usable corpus to 38%. These findings mirror broader reports that OA metadata remain incomplete and inconsistent.
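The reported percentages can be recomputed from the counts given in the results. The short check below uses only figures stated in the abstract; the variable names and the `pct` helper are illustrative.

```python
# Counts as reported in the abstract.
total_dois = 3712
with_licence = 1584
dual_metadata = 440
disagreements = 39
native_usable = 618
restricted_or_unknown = 3094
rescued_by_rightfind = 794


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# The usable and restricted/unknown sets partition the corpus.
assert native_usable + restricted_or_unknown == total_dois

# Usable corpus after adding the studies rescued via RightFind XML.
usable_after = native_usable + rescued_by_rightfind

print(pct(with_licence, total_dois))      # share exposing a licence
print(pct(disagreements, dual_metadata))  # share with licence disagreement
print(pct(native_usable, total_dois))     # usable under native licences
print(pct(usable_after, total_dois))      # usable after RightFind XML
print(usable_after / native_usable)       # coverage multiplier
```

The four rounded shares come out at 43, 9, 17, and 38, matching the reported values, and the usable corpus grows from 618 to 1,412 papers, a factor of roughly 2.3.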
CONCLUSIONS: Fewer than one in six screened medical articles can lawfully feed commercial AI pipelines under their native licences. However, switching to the XML files provided by RightFind more than doubles compliant coverage. Multi-source, licence-aware ingestion pipelines, aligned with emerging CCC guidance on AI copyright, are essential for trustworthy, large-scale evidence synthesis. Wider adoption could unlock faster, cheaper health-technology assessments and better patient outcomes.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR145

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

Presentation (CTI)