AI in the Pediatric ER: LLMs Show Promising Diagnostic Capabilities in Pilot Study

Introduction to AI in Pediatric Emergency Care

The landscape of emergency medicine, particularly within the specialized realm of the Pediatric Emergency Department (PED), is defined by inherent challenges: high patient volumes, time-sensitive decision-making, and the complexity of diagnosing a wide spectrum of pediatric conditions. In this demanding environment, advanced technologies hold significant promise for enhancing patient care and supporting clinical workflows. Among these emerging technologies, Large Language Models (LLMs) have garnered considerable attention for their capacity to process vast amounts of information and assist in complex cognitive tasks. However, their actual effectiveness in supporting the diagnostic process in a real-world clinical setting such as a PED remains under investigation, with existing studies presenting a mixed picture of their impact on clinical reasoning and diagnostic accuracy.

Pilot Study Design and Methodology

To address this gap, a pilot study published in Frontiers in Digital Health rigorously assessed the diagnostic efficacy of several prominent LLM-based chatbots in realistic PED scenarios and explored their potential utility as diagnosis-making assistants for pediatric emergencies. The research team evaluated five LLMs: ChatGPT-4o, Gemini 1.5 Pro, Gemini 1.5 Flash, Llama-3-8B, and ChatGPT-4o mini. These models were tested against a panel of 23 physicians: 10 experienced PED physicians, 6 PED residents in their final year of training, and 7 Emergency Medicine (EM) residents in their final year.

All participants, both LLMs and physicians, were presented with 80 real-practice pediatric clinical cases sourced directly from the PED of a tertiary care children's hospital and categorized into three levels of diagnostic complexity: low difficulty, difficult, and highly difficult. For each case vignette, participants were required to provide a primary diagnosis along with two differential diagnoses, and the accuracy of these outputs was compared against the definitive diagnoses established at patient discharge. To ensure a standardized and objective assessment, two independent experts evaluated all responses on a five-level accuracy scale designed to accommodate varying degrees of diagnostic precision, thereby avoiding undue penalization of incomplete yet clinically relevant answers. Each LLM and each physician received an overall score out of a possible 80, calculated as the sum of points awarded across all 80 cases, reflecting cumulative diagnostic performance.
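To make the scoring arithmetic concrete, the sketch below models how per-case ratings could roll up into a total out of 80. The exact point values attached to each of the five accuracy levels are not reproduced here, so both the level names and the point mapping in this snippet are assumptions, chosen only so that fractional totals such as 72.5 or 62.75 are arithmetically possible.

```python
# Minimal sketch of the cumulative scoring scheme. The five level names and
# their point values are ASSUMPTIONS for illustration; the study's actual
# rubric is not reproduced here.
LEVEL_POINTS = {
    "incorrect": 0.00,            # no clinically relevant answer
    "partially_relevant": 0.25,   # clinically relevant but incomplete
    "in_differentials": 0.50,     # correct diagnosis listed as a differential
    "near_match": 0.75,           # primary diagnosis close to the final one
    "exact_match": 1.00,          # primary diagnosis matches discharge diagnosis
}

def total_score(case_ratings: list[str]) -> float:
    """Sum per-case points across all rated vignettes (maximum 80 for 80 cases)."""
    return sum(LEVEL_POINTS[rating] for rating in case_ratings)

# Example: 70 exact matches, 5 differential-only hits, 5 misses -> 72.5
print(total_score(["exact_match"] * 70 + ["in_differentials"] * 5 + ["incorrect"] * 5))
```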

Key Findings: LLM Performance in Diagnostic Scenarios

The pilot study yielded compelling insights into the diagnostic capabilities of the evaluated LLMs. Among the chatbots, ChatGPT-4o performed best, achieving a total score of 72.5, followed closely by Gemini 1.5 Pro at 62.75. Notably, ChatGPT-4o's performance was statistically superior (p < 0.05) to that of the experienced PED physicians, whose median score was 61.88. In contrast, the EM residents scored markedly lower at 43.75, significantly underperforming (p < 0.01) both pediatric physician groups and the LLMs. A consistent trend across the LLMs was an inverse relationship between diagnostic performance and case difficulty; even so, ChatGPT-4o answered the majority of the highly difficult cases correctly, underscoring its robust diagnostic capacity in complex scenarios.
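The article does not specify which statistical test produced the reported p-values. As a purely illustrative sketch, a nonparametric rank-sum comparison over per-case points is one common way such model-versus-group differences are tested; the snippet below uses randomly generated placeholder data, not the study's, solely so it runs end to end.

```python
# Illustrative significance test, NOT the study's actual analysis. The data
# below are random placeholders standing in for per-case points.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
levels = [0.0, 0.25, 0.5, 0.75, 1.0]            # assumed per-case point values
llm_points = rng.choice(levels, size=80)        # e.g., one value per case for an LLM
physician_points = rng.choice(levels, size=80)  # e.g., pooled physician ratings

# One-sided Mann-Whitney U: do the LLM's per-case points tend to be higher?
stat, p = mannwhitneyu(llm_points, physician_points, alternative="greater")
print(f"U = {stat:.1f}, p = {p:.4f}")           # compare against p < 0.05
```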

Implications for Clinical Decision Support

The findings suggest that advanced LLMs, specifically ChatGPT-4o and Gemini 1.5 Pro, could serve as valuable tools for emergency department physicians, offering meaningful support in clinical decision-making, particularly in challenging diagnostic situations. However, the study strongly emphasizes that these AI tools should function as assistants to, rather than replacements for, the physician's own judgment and expertise. Integrating LLMs into clinical practice will require clear, shared protocols that outline how AI chatbots and human healthcare professionals should collaborate; such protocols are crucial for maximizing the benefits of AI while mitigating risk and ensuring patient safety. The study also found that EM residents performed less effectively than both experienced PED physicians and the leading LLMs, suggesting that LLMs might offer particular value in augmenting the diagnostic capabilities of physicians with less specialized pediatric experience.

Limitations and Future Directions

While the results are promising, the study acknowledges several limitations inherent in evaluating rapidly evolving AI technologies. The swift advancement of LLMs means that performance benchmarks can quickly become outdated. Furthermore, LLMs are trained on data up to a fixed cutoff and, without periodic fine-tuning or updates, may lack the most current medical information. The study also noted the potential lack of reproducibility in LLM responses: the same case presented multiple times can yield different outputs, a factor not deeply explored in this research. To address this, future work should use repeated sampling to quantify the consistency and stability of LLM-generated outputs (a minimal sketch of such a check appears below). Exploring sequential inputs or follow-up questions could also more realistically simulate clinical conversations and offer deeper insight into their impact on diagnostic reasoning.

Additional limitations include the uneven number of cases across difficulty levels, with fewer cases in the most difficult category, and the non-homogeneous group sizes among physicians, which could affect the reliability of estimates for smaller subgroups. The difference in how performance was summarized for physicians (median score) versus LLMs (absolute score) also complicates direct comparisons. Despite these constraints, the study provides a foundational understanding of LLM diagnostic performance in a critical healthcare setting, paving the way for future research and potential clinical integration.
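As one way to operationalize the repeated-sampling suggestion above, the sketch below queries a model several times with the same vignette and reports how often the answers agree with the modal primary diagnosis. Here `query_llm` is a hypothetical stand-in for whatever client call a given deployment exposes; it is not an API from the study.

```python
from collections import Counter
from typing import Callable

def consistency_rate(
    query_llm: Callable[[str], str],  # hypothetical client call: vignette -> primary diagnosis
    vignette: str,
    n_samples: int = 10,
) -> float:
    """Ask the model the same question n times and return the fraction of runs
    that agree with the most common (modal) primary diagnosis."""
    answers = [query_llm(vignette).strip().lower() for _ in range(n_samples)]
    _, modal_count = Counter(answers).most_common(1)[0]
    return modal_count / n_samples

# Example with a deterministic stub; a real run would pass an actual LLM client.
print(consistency_rate(lambda v: "acute appendicitis", "fever and RLQ pain", 5))  # -> 1.0
```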

Conclusion: The Future Role of AI in Pediatric Diagnostics

In conclusion, this pilot study from Frontiers in Digital Health underscores the critical importance of evaluating the diagnostic performance of various LLMs, particularly within the demanding context of the Pediatric Emergency Department. The findings suggest that leading LLMs, such as ChatGPT-4o and Gemini 1.5 Pro, exhibit a diagnostic efficacy that is comparable to, and in some instances superior to, that of experienced pediatricians, especially when dealing with complex cases. This high level of accuracy positions these LLMs as potentially invaluable tools for supporting PED physicians in navigating and resolving the most challenging pediatric emergency cases. Moreover, they could prove to be exceptionally useful for EM physicians encountering pediatric cases across all difficulty levels. Nevertheless, the study strongly reiterates that AI-driven tools should always function as adjuncts to, and never as replacements for, human clinical judgment. The future integration of LLMs in pediatric emergency care hinges on developing robust frameworks for collaboration and ensuring that these powerful technologies augment, rather than diminish, the critical role of the healthcare professional.

