Enhancing Causal Discovery in Chronic Diseases: The MAGIC Framework Using Multiple LLMs
Author(s)
Jihee Kim, B.A1, Minseol Jang, PharmD2, Miryoung Kim, RPh, MCP, PhD3, Hyun Jin Han, MBA, MPH, PhD2, Kangjun Noh, B.S.1, Sumin Park, B.A1, Kyungwoo Song, PhD1, Hae Sun Suh, MA, MS, PhD4.
1Department of Statistics and Data Science, Yonsei University, Seoul, Korea, Republic of, 2Department of Regulatory Science, Graduate School, Kyung Hee University, Seoul, Korea, Republic of, 3Sunchon National University, Suncheon, Korea, Republic of, 4College of Pharmacy, Kyung Hee University, Seoul, Korea, Republic of.
OBJECTIVES: Understanding causal relationships among chronic diseases is essential for identifying associations and minimizing bias. Traditionally, construction of directed acyclic graphs (DAGs) has relied on expert knowledge and literature review, which limits scalability and can introduce bias. Recent advances in large language models (LLMs) make it possible to explore knowledge-informed DAG construction. This study aimed to evaluate the feasibility of LLM-based approaches and to introduce MAGIC (Multi-LLM Assisted Graph Inference and Correction), a novel framework that integrates statistical, clinical, and language-based feedback to improve DAG generation.
METHODS: The study consisted of two parts: development of a reference DAG and a comparative performance evaluation of causal discovery methods. The reference DAG was constructed through literature review and expert consensus. MAGIC combines (1) statistical metrics (phi coefficients, BDeu scores, disease duration) computed from individual-level data from the Korea National Health and Nutrition Examination Survey; (2) external clinical knowledge from publicly available sources to enrich disease-specific context; and (3) a consensus-based voting mechanism across multiple LLMs to reduce model-specific bias. The clinical plausibility and methodological validity of each method were reviewed by a panel of three clinical experts and three statisticians. Performance was assessed against the reference DAG using standard metrics: skeleton and orientation precision, recall, and F1-score, and Structural Hamming Distance (SHD).
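The voting and evaluation steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the voting rule, so a simple vote threshold is assumed, and the function names (`majority_vote`, `skeleton`, `prf`, `shd`) and the toy disease edges are hypothetical. SHD is computed here under one common convention in which a missing, extra, or reversed edge each counts once; other conventions exist.

```python
from collections import Counter


def majority_vote(proposals, min_votes):
    """Keep directed edges proposed by at least `min_votes` models (assumed rule)."""
    counts = Counter(edge for edges in proposals for edge in set(edges))
    return {edge for edge, n in counts.items() if n >= min_votes}


def skeleton(edges):
    """Drop edge direction: each edge becomes an unordered pair of nodes."""
    return {frozenset(edge) for edge in edges}


def prf(predicted, reference):
    """Precision, recall, and F1 of a predicted edge set against a reference set."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def shd(predicted, reference):
    """Count node pairs whose edge status differs (missing, extra, or reversed)."""
    distance = 0
    for pair in skeleton(predicted) | skeleton(reference):
        a, b = tuple(pair)  # assumes no self-loops, as in a DAG
        pred_state = ((a, b) in predicted, (b, a) in predicted)
        ref_state = ((a, b) in reference, (b, a) in reference)
        if pred_state != ref_state:
            distance += 1
    return distance


# Toy example: three hypothetical LLM outputs, each a set of (cause, effect) edges.
proposals = [
    {("obesity", "diabetes"), ("diabetes", "ckd")},
    {("obesity", "diabetes"), ("ckd", "diabetes")},
    {("obesity", "diabetes"), ("diabetes", "ckd"), ("hypertension", "ckd")},
]
reference = {("obesity", "diabetes"), ("diabetes", "ckd"), ("hypertension", "ckd")}

consensus = majority_vote(proposals, min_votes=2)
skeleton_scores = prf(skeleton(consensus), skeleton(reference))
orientation_scores = prf(consensus, reference)
distance = shd(consensus, reference)
```

In this toy case the consensus graph keeps the two edges that at least two models agree on and misses the hypertension edge, so skeleton precision is 1.0, recall is 2/3, and SHD is 1. Scoring orientation by exact match of directed edges is one of several possible definitions and is assumed here for simplicity.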
RESULTS: After five rounds of expert review and discussion, MAGIC was deemed clinically plausible and methodologically robust. Quantitatively, MAGIC achieved the best overall performance with skeleton precision 0.941, recall 0.640, F1-score 0.762; orientation precision 0.735, recall 0.500, F1-score 0.595; SHD 27 after the third iteration.
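As a quick consistency check, the reported F1-scores can be recomputed from the reported precision and recall values, since F1 is their harmonic mean. The helper name `f1_score` is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as reported in the abstract.
skeleton_f1 = f1_score(0.941, 0.640)
orientation_f1 = f1_score(0.735, 0.500)

print(round(skeleton_f1, 3), round(orientation_f1, 3))  # prints: 0.762 0.595
```

Both recomputed values match the reported F1-scores of 0.762 and 0.595 to three decimal places.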
CONCLUSIONS: MAGIC demonstrates the potential of LLM-guided, feedback-enhanced causal discovery for scalable and reliable causal graph construction. By integrating real-world data, clinical context, and multi-model consensus, this approach offers a reproducible and interpretable framework for complex chronic disease research and supports broader applications in healthcare and epidemiology.
Conference/Value in Health Info
2025-09, ISPOR Real-World Evidence Summit 2025, Tokyo, Japan
Value in Health Regional, Volume 49S (September 2025)
Code
RWD270
Topic Subcategory
Health & Insurance Records Systems
Disease
SDC: Diabetes/Endocrine/Metabolic Disorders (including obesity)