Measuring the Elephant in the Room: Quantifying Legal Barriers to AI Processing of Full-Text in Systematic Reviews
Author(s)
Artur Nowak, MSc1, Marie Lane, BSc2, Seye Abogunrin, MPH, MSc, MD2.
1Evidence Prime, Krakow, Poland, 2Roche, Basel, Switzerland.
OBJECTIVES: Generative artificial intelligence (AI) tools hold great potential for expediting full-text screening and data extraction in systematic literature reviews. Yet their implementation still faces legal obstacles because most PDFs carry restrictive licences. We quantified the share of articles in recent Roche reviews with licences permitting commercial text-and-data mining, assessed agreement between key open-access (OA) metadata sources, and examined whether the Copyright Clearance Center (CCC) RightFind XML service addresses the remaining rights gaps.
METHODS: A corpus of 3,712 unique DOIs (deduplicated from 6,336 PDFs across 49 reviews) was matched against OpenAlex and PubMed Central (PMC). Licence strings were normalised and classified as commercially friendly (e.g., CC0, CC-BY, public-domain, MIT) or restricted. Where OpenAlex and PMC disagreed, the more permissive term was selected. Non-commercial clauses were treated as prohibitive for AI inference. Rights flags for the same DOIs were subsequently checked in the RightFind XML service.
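The normalise, classify and reconcile steps described in the METHODS can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the normalisation rules (lower-casing, unifying separators, dropping a trailing version number) and the allow-list are assumptions built only from the examples named in the abstract, and "more permissive" is simplified to "commercially friendly beats anything else".

```python
import re
from typing import Optional

# Assumed allow-list, taken from the examples in the abstract.
COMMERCIAL_OK = {"cc0", "cc-by", "public-domain", "mit"}


def normalise(licence: Optional[str]) -> Optional[str]:
    """Lower-case a raw licence string, unify separators, drop a trailing version."""
    if not licence:
        return None
    s = licence.strip().lower().replace("_", "-").replace(" ", "-")
    s = re.sub(r"-\d+(\.\d+)*$", "", s)  # e.g. "cc-by-4.0" -> "cc-by"
    return s or None


def is_commercial_ok(licence: Optional[str]) -> bool:
    """True only for licences on the allow-list; NC variants and unknowns fail."""
    return normalise(licence) in COMMERCIAL_OK


def reconcile(openalex: Optional[str], pmc: Optional[str]) -> Optional[str]:
    """Pick the more permissive of two source values (simplified, see lead-in)."""
    candidates = [normalise(openalex), normalise(pmc)]
    for c in candidates:
        if c in COMMERCIAL_OK:
            return c
    # Neither value is commercially friendly: keep the first known one, if any.
    return next((c for c in candidates if c is not None), None)
```

For example, `reconcile("cc-by-nc", "CC BY 4.0")` returns `"cc-by"`, so the record counts as commercially usable, whereas a record whose only licence is `"CC-BY-NC-ND 4.0"` stays restricted.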
RESULTS: OpenAlex held records for every DOI, but only 1,584 (43%) exposed a licence; 26 records marked as closed access paradoxically carried OA licences. Among 440 papers with dual metadata, 39 (9%) showed true licence disagreements; PMC was the more permissive source in 15 cases. The reconciled set identified 618 PDFs (17%) as commercially usable; 3,094 remained restricted or unknown. RightFind XML contained AI inference-friendly files for 1,332 papers, rescuing 794 otherwise-restricted studies and expanding the usable corpus to 38%. These findings mirror broader reports that OA metadata remain incomplete and inconsistent.
CONCLUSIONS: Fewer than one in six screened medical articles can lawfully feed commercial AI pipelines under their native licences. However, processing the XML files provided by RightFind instead more than doubles compliant coverage. Multi-source, licence-aware ingestion pipelines, aligned with emerging CCC guidance on AI copyright, are essential for trustworthy, large-scale evidence synthesis. Wider adoption could unlock faster, cheaper health-technology assessments and better patient outcomes.
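The coverage figures quoted in the RESULTS and CONCLUSIONS can be re-derived from the reported counts. The short Python check below is only an arithmetic illustration; the variable and function names are hypothetical, and the counts are copied from the abstract.

```python
# Counts as reported in the abstract.
TOTAL_DOIS = 3712      # unique DOIs in the corpus
WITH_LICENCE = 1584    # OpenAlex records exposing a licence
NATIVE_OK = 618        # commercially usable under native licences
RESCUED = 794          # otherwise-restricted studies covered by RightFind XML


def share(count: int, total: int = TOTAL_DOIS) -> float:
    """Return count as a fraction of the corpus."""
    return count / total


native_share = share(NATIVE_OK)                # about 0.166, reported as 17%
combined_share = share(NATIVE_OK + RESCUED)    # about 0.380, reported as 38%
uplift = combined_share / native_share         # about 2.28, i.e. more than double

print(f"licence exposed: {share(WITH_LICENCE):.0%}")
print(f"native coverage: {native_share:.0%}")
print(f"with RightFind XML: {combined_share:.0%} ({uplift:.2f}x)")
```

The native share of 618 / 3,712 sits just under one sixth, and the combined 1,412 usable articles give the 38% figure, which is consistent with the "more than doubles" claim.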
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR145
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas