Leveraging Artificial Intelligence for Thematic Analysis of Qualitative Transcripts: A Feasibility Study in Insurance Payer Interviews
Author(s)
Quinn Levin, BS, MBS, Alexa Klimchak, MA, Kathy Gooch, PhD.
Sarepta Therapeutics, Inc., Cambridge, MA, USA.
OBJECTIVES: Traditional qualitative research methods can be resource-intensive, costly, time-consuming, and prone to human error and biases. Recent advancements in artificial intelligence (AI) may overcome these limitations. Using semi-structured insurance payer interview transcripts as a case study, this analysis aimed to generate themes using Microsoft Copilot and compare them with human-generated themes.
METHODS: Transcripts from 10 payer interviews related to predictive budget impact modeling for direct-acting antivirals (DAAs) and other therapies were analyzed by two researchers and by Copilot. Key outcomes included comparisons of the quality/accuracy of theme generation (primary outcome), the effectiveness of prompt engineering, and the accuracy of direct quotes.
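To illustrate what the broad versus targeted prompting compared here might look like in practice, the sketch below builds two example prompts for a single transcript; the wording is hypothetical and does not reproduce the prompts used in this study.

```python
# Hypothetical illustration of broad vs. targeted prompts for thematic
# analysis of a payer interview transcript (not the study's actual prompts).

def build_prompts(transcript_text: str) -> dict[str, str]:
    """Return an example broad prompt and an example targeted prompt."""
    broad = (
        "Identify the main themes discussed in the following payer "
        "interview transcript, with one sentence per theme:\n\n"
        + transcript_text
    )
    targeted = (
        "From the following payer interview transcript, summarize what the "
        "respondent said about the accuracy of initial budget impact "
        "predictions for direct-acting antivirals (DAAs), and support each "
        "point with a verbatim quote:\n\n" + transcript_text
    )
    return {"broad": broad, "targeted": targeted}
```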
RESULTS: Among the sample themes analyzed, Copilot generally extracted broad topics but struggled to synthesize nuanced themes compared with human assessment (e.g., “accuracy of national drug spending predictions” versus “payers confirmed that plans spent less on DAA therapies vs. initial predictions”). Both broad and targeted prompts revealed some human-Copilot alignment; however, Copilot missed key insights, included less relevant ones, and inaccurately linked unrelated concepts (e.g., “[treatment] guidelines evolved...as prices decreased”). Copilot also occasionally extracted quotes inaccurately (misattributed or irrelevant to the mapped topic), generated incorrect information (11%-25% “hallucinations”), performed unbalanced analysis across interviewees, failed to account for respondents’ perspectives (e.g., experience with private versus public insurance), and struggled to move past its initial answers (e.g., failed to provide new quotes when asked).
CONCLUSIONS: Across varying prompts, Copilot identified superficial topics without deeper insight into underlying patterns and nuances, highlighting the continued importance of human oversight for in-depth, contextual understanding. Although AI may enable faster and more objective qualitative analysis, further research is needed before it can accurately support these assessments. Future studies should explore the use of Application Programming Interfaces, which enable Retrieval-Augmented Generation, facilitate workflow integration and scalability, and allow source-level parameter adjustments to reduce hallucinations, and should carefully consider prompt engineering.
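As a rough sketch of the API-based direction suggested above, the snippet below shows how a chat-completion call could combine a lowered temperature with a simple retrieval step that grounds the prompt in transcript excerpts. It assumes the OpenAI Python SDK as a stand-in (the study used the Copilot chat interface, not an API); the model name, retrieval logic, and prompt wording are illustrative assumptions, and a production pipeline would replace the keyword lookup with embedding-based retrieval.

```python
# Illustrative sketch only: an API-based workflow with a lowered temperature
# and a crude retrieval step standing in for Retrieval-Augmented Generation.
# Assumes the OpenAI Python SDK (openai>=1.0) and an OPENAI_API_KEY in the
# environment; not the tooling used in this study.
from openai import OpenAI

client = OpenAI()


def retrieve_excerpts(transcripts: list[str], keyword: str, k: int = 3) -> list[str]:
    """Naive keyword retrieval; a real pipeline would use embeddings/vector search."""
    hits = [t for t in transcripts if keyword.lower() in t.lower()]
    return hits[:k]


def extract_theme(transcripts: list[str], question: str, keyword: str) -> str:
    """Ask the model a thematic question grounded in retrieved transcript excerpts."""
    excerpts = retrieve_excerpts(transcripts, keyword)
    context = "\n\n---\n\n".join(excerpts)
    resp = client.chat.completions.create(
        model="gpt-4o",   # placeholder model name
        temperature=0.2,  # lower temperature to reduce output variability
        messages=[
            {
                "role": "system",
                "content": (
                    "You are assisting with qualitative thematic analysis. "
                    "Answer only from the provided transcript excerpts, quote "
                    "respondents verbatim, and say so if the excerpts do not "
                    "address the question."
                ),
            },
            {
                "role": "user",
                "content": f"Transcript excerpts:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return resp.choices[0].message.content
```

Whether such constraints would actually lower the hallucination rates observed here (11%-25%) would itself need to be evaluated.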
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR54
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
STA: Personalized & Precision Medicine