An Automated Framework for Assessing LLM Medical Reference Citations
Introduction to the Challenge of Medical LLM Referencing
The increasing integration of Large Language Models (LLMs) into healthcare necessitates a rigorous evaluation of their ability to provide accurate and verifiable medical information. A critical aspect of this is their capacity to cite relevant and supportive references. While current LLMs can generate responses accompanied by sources, the actual degree to which these cited references substantiate the claims made within the responses remains a significant, unresolved question. This gap in understanding poses a considerable risk, particularly in a field where information accuracy can have direct health implications. To address this challenge, an innovative automated framework has been developed to systematically assess the reliability of medical citations provided by LLMs.
Introducing SourceCheckup: An Automated Evaluation Framework
To bridge the gap between LLM-generated claims and their supporting evidence, we introduce SourceCheckup. This is an automated, agent-based pipeline meticulously designed to evaluate the relevance and supportiveness of sources cited in LLM responses. The framework is crucial for ensuring that the information provided by LLMs is not only accessible but also factually grounded in credible medical literature. Given the high cost and time investment associated with obtaining expert medical annotations, an automated solution like SourceCheckup is vital for enabling rapid and scalable evaluation of LLMs, especially in the rapidly evolving field of medicine.
Methodology: How SourceCheckup Operates
The SourceCheckup framework employs a multi-stage process to rigorously assess LLM citation practices. Initially, medical reference texts are used to generate a diverse set of medical questions. These questions are then posed to various LLMs, which are tasked with providing responses that include both the textual answer and corresponding URL sources. Following the generation of responses, the framework meticulously parses the LLM’s output to extract individual medical statements. Concurrently, the provided URL sources are downloaded and processed. The core of the evaluation lies in the Source Verification model, which analyzes each statement-source pair to determine whether the cited source genuinely supports the statement made by the LLM. This systematic approach allows for a granular assessment of citation accuracy.
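The stages described above can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the sentence-splitting heuristic and the naive keyword-overlap check standing in for the Source Verification model (which in the actual framework is an LLM-based judgment) are assumptions made for the sake of a runnable example.

```python
from dataclasses import dataclass


@dataclass
class VerificationResult:
    statement: str
    url: str
    supported: bool


def parse_statements(response_text: str) -> list[str]:
    # Illustrative parsing step: split an LLM response into
    # sentence-level medical statements.
    return [s.strip() for s in response_text.split(".") if s.strip()]


def verify(statement: str, source_text: str) -> bool:
    # Placeholder for the Source Verification model: a crude
    # keyword-overlap check stands in for an LLM-based judgment
    # of whether the source supports the statement.
    stmt_words = set(statement.lower().split())
    src_words = set(source_text.lower().split())
    overlap = len(stmt_words & src_words) / max(len(stmt_words), 1)
    return overlap > 0.5


def evaluate_response(
    response_text: str, sources: dict[str, str]
) -> list[VerificationResult]:
    # Pair every extracted statement with every downloaded source
    # and record whether the source appears to support it.
    results = []
    for statement in parse_statements(response_text):
        for url, source_text in sources.items():
            results.append(
                VerificationResult(statement, url, verify(statement, source_text))
            )
    return results
```

In the real pipeline each of these stages (question generation, response parsing, source downloading, verification) is considerably more involved, but the statement-by-statement, source-by-source structure of the evaluation is the key idea.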
Data Generation and Corpus Creation
A cornerstone of this research is the creation of a comprehensive dataset tailored for evaluating medical LLM citations. This dataset comprises approximately 58,000 statement-source pairs curated from over 800 reference documents. These documents were sourced from reputable medical information providers, including the Mayo Clinic, and also incorporated data from community platforms like Reddit, reflecting a broad spectrum of medical queries. The questions were generated to cover common medical inquiries, ensuring the dataset's relevance to real-world use cases. This extensive corpus serves as a robust benchmark for evaluating LLM performance in citing medical references.
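A single statement-source pair might be represented along the following lines. The field names and the example values here are hypothetical, chosen to illustrate the shape of such a record rather than the released dataset's actual schema.

```python
import json

# Hypothetical record for one statement-source pair; all field names
# and values are illustrative, not the dataset's actual format.
pair = {
    "question": "What are common symptoms of iron-deficiency anemia?",
    "statement": "Fatigue is a common symptom of iron-deficiency anemia.",
    "source_url": "https://www.mayoclinic.org/",
    "verdict": "supported",  # e.g. supported / unsupported / contradicted
}

# Records of this shape can be serialized one per line (JSONL) for
# large-scale evaluation runs.
record = json.dumps(pair)
```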
Evaluating Leading LLMs: Performance and Findings
The SourceCheckup framework was employed to evaluate seven prominent, commercially available LLMs. The evaluation focused on their ability to provide accurate and supportive medical references. A significant finding across these evaluations is the considerable gap between the claims made by LLMs and the evidence provided by their cited sources. Between 50% and 90% of LLM responses were not fully supported by their citations, and in some instances the cited sources even contradicted the claims. This issue persists even with advanced models, including GPT-4o equipped with web search capabilities. For such models, approximately 30% of individual statements were unsupported, and nearly half of the complete responses failed to be fully substantiated by their provided references.
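The distinction between the two headline numbers above (per-statement support versus fully supported responses) can be made concrete with a small aggregation sketch. This is not the paper's code; it simply shows, under the assumption that each response reduces to a list of per-statement verdicts, why the response-level failure rate is necessarily at least as high as the statement-level one.

```python
def support_rates(responses: list[list[bool]]) -> tuple[float, float]:
    """Compute statement-level and response-level support rates.

    Each inner list holds per-statement support verdicts for one
    response. A response counts as fully supported only if every
    one of its statements is supported.
    """
    statements = [v for resp in responses for v in resp]
    statement_rate = sum(statements) / len(statements)
    response_rate = sum(all(resp) for resp in responses) / len(responses)
    return statement_rate, response_rate
```

Because a single unsupported statement sinks the whole response, a model with 70% statement-level support can easily have well under 50% of its responses fully supported, which matches the pattern reported for GPT-4o with web search.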
Validation with Medical Experts
To ensure the reliability and accuracy of the automated SourceCheckup framework, its performance was rigorously validated against the assessments of human medical experts. A panel of three US-licensed medical doctors independently reviewed a subset of the statement-source pairs. The results demonstrated a high degree of agreement between the automated framework and the expert consensus, with an 89% concordance rate. In fact, SourceCheckup exhibited a higher agreement rate than any pair of medical experts alone. This expert validation provides strong evidence for the robustness and trustworthiness of the automated evaluation pipeline.
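The agreement figures above can be understood through a simple concordance computation. The sketch below is an assumption about how such a comparison could be set up (majority vote over three experts, then fraction of items where two label sequences match); the paper's exact aggregation procedure may differ.

```python
from collections import Counter


def majority_vote(labels: list[str]) -> str:
    # Consensus label across raters; with three raters and binary
    # labels, a strict majority always exists.
    return Counter(labels).most_common(1)[0][0]


def agreement(a: list[str], b: list[str]) -> float:
    # Fraction of items on which two label sequences agree.
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Comparing the framework's labels against the expert consensus, and each expert pair against one another, yields the kind of concordance rates reported (89% framework-versus-consensus, higher than any single expert pair).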
Analysis of Source URL Validity
Beyond the content of the citations, the validity of the URLs themselves was also a critical aspect of the evaluation. Models without direct web access capabilities exhibited lower rates of producing valid URLs, ranging from 40% to 70%. While Retrieval Augmented Generation (RAG)-enabled models, such as GPT-4o and Gemini Ultra 1.0, demonstrated superior performance in generating valid URLs, they did not entirely escape the issue of citation inaccuracy. The analysis also revealed that the majority of cited URLs originated from US-based websites, and the rates of content behind paywalls or subscription models were generally low across the evaluated models.
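A URL-validity check of this kind typically has two layers: a structural check (is this a plausible HTTP/HTTPS URL at all?) and a liveness check (does the URL actually resolve?). The sketch below, using only Python's standard library, is one plausible way to implement both; the exact criteria the framework uses are not specified here, so the status-code threshold is an assumption.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen
from urllib.error import URLError


def is_well_formed(url: str) -> bool:
    # Structural check only: requires an http(s) scheme and a host.
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def resolves(url: str, timeout: float = 5.0) -> bool:
    # Liveness check: treat any response with status < 400 as valid.
    # Requires network connectivity; paywalled pages typically still
    # resolve, which is why paywall rates are tracked separately.
    if not is_well_formed(url):
        return False
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status < 400
    except (URLError, ValueError):
        return False
```

Models without web access hallucinate URLs that fail one or both checks, which is consistent with their lower (40-70%) valid-URL rates.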
Implications for Clinical Medicine and Future Research
The findings from this research underscore significant limitations in the current capabilities of LLMs to produce trustworthy medical references. This has profound implications for their potential adoption and use in clinical medicine, where accuracy and verifiability are paramount. The study highlights the urgent need for advancements in LLM training and fine-tuning to improve their source attribution and faithfulness to cited evidence. Furthermore, the open-sourcing of the SourceCheckup framework and its associated dataset is intended to foster continued research and development in this critical area. By providing a standardized tool and benchmark, the research community can collectively work towards enhancing the reliability of LLMs in healthcare settings, ultimately contributing to safer and more effective AI-driven medical support.
Addressing Ambiguity and Future Directions
The research acknowledges the inherent complexities in evaluating source verification, including the potential for ambiguity in statement-source relationships and the subjective nature of medical claims. While the framework focuses on grounding statements in verifiable sources rather than assessing the absolute truth of a claim, it provides a crucial layer of verifiable support. Future work may explore more nuanced methods for handling multi-source support and the aggregation of information. The limitations of the automated pipeline, such as potential errors in each stage, are acknowledged, motivating further refinement and validation. The ultimate goal is to ensure that LLMs can serve as reliable tools for clinicians and patients, backed by verifiable and accurate medical references.
AI Summary
This article details the development and application of SourceCheckup, an automated framework for evaluating the citation practices of Large Language Models (LLMs) in the medical domain. The framework addresses the critical need for LLMs to support their responses to health-related queries with credible and accurate references. It highlights that while LLMs can provide sources, the extent to which these sources actually support the generated claims has remained unclear. SourceCheckup is presented as an agent-based pipeline that assesses the relevance and supportiveness of cited sources.
The research involved evaluating seven popular LLMs using a dataset comprising 800 medical questions and approximately 58,000 statement-source pairs derived from common medical queries. A significant finding is that a substantial percentage of LLM responses, ranging from 50% to 90%, are not adequately supported by their cited references, with some even being contradicted. Notably, even advanced models like GPT-4o, when equipped with web search capabilities, still exhibit unsupported statements in about 30% of cases and fully unsupported responses in nearly half of the instances. The study emphasizes that independent assessments by medical doctors corroborate these findings, underscoring the current limitations of LLMs in generating trustworthy medical references.
The paper also discusses the methodology behind SourceCheckup, including question generation, response parsing, URL source parsing, and the source verification process. It details the metrics used for evaluation, such as source URL validity, statement-level support, and response-level support. The framework's accuracy was validated against a panel of medical experts, showing high agreement. Furthermore, the research contributes by open-sourcing the dataset of medical questions and expert annotations to facilitate future research in this critical area.
The implications for the adoption of LLMs in clinical medicine are profound, highlighting the necessity for improved reliability and trustworthiness in AI-generated medical information.