Abstract
Background
Large language models (LLMs) coupled with retrieval-augmented generation (RAG) can deliver point-of-care, guideline-anchored answers for complex oncologic toxicities. When iteratively refined with domain experts, these systems may rival—or surpass—individual specialist performance.
Objectives
To present and benchmark a purpose-built module of Code Red (https://chatbot.codigorojo.tech/), an educational, medicine-wide generative-AI project. The module is designed specifically to improve the clinical management of toxicities from immune effector cell therapies, particularly chimeric antigen receptor T-cell (CAR-T) therapy, such as cytokine release syndrome (CRS) and immune effector cell-associated neurotoxicity syndrome (ICANS). It aims to provide near–real-time, reference-backed recommendations, iteratively refined with domain experts, addressing current deviations from guideline-based care and the resulting heterogeneity in real-world practice.
Methods
Three simulated CAR-T toxicity cases were constructed to span varying grades and scenarios. Each case was independently answered by three CAR-T–experienced hematologist-oncologists (three expert responses per case) and by Code Red. An external LLM (ChatGPT o3) served as a blinded adjudicator, applying a seven-item rubric: clinical accuracy/guideline concordance (45%), safety and risk mitigation (15%), completeness/contextualization (10%), actionability and clarity (10%), reference quality (10%), transparency about uncertainty (5%), and form/communication efficiency (5%). Each item was scored 0–10, scores were standardized across cases and raters, and the standardized scores were combined into a single weighted composite to select the "winner" per case. Code Red uses a RAG pipeline over a curated corpus of CAR-T toxicity guidelines and primary literature, plus rule-based safeguards for dosing and citation integrity.
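As a point of reference, the weighted composite is a straightforward calculation; the sketch below illustrates how seven 0–10 item scores combine under the stated weights. It is a minimal, illustrative sketch only: the variable names and example scores are hypothetical, and the standardization step across cases and raters is omitted.

```python
# Illustrative sketch of the weighted composite used for adjudication.
# Weights follow the rubric stated in Methods; the item names, the function,
# and the example scores are hypothetical placeholders, not study data,
# and the cross-case/rater standardization step is omitted.

RUBRIC_WEIGHTS = {
    "clinical_accuracy_guideline_concordance": 0.45,
    "safety_risk_mitigation": 0.15,
    "completeness_contextualization": 0.10,
    "actionability_clarity": 0.10,
    "reference_quality": 0.10,
    "transparency_about_uncertainty": 0.05,
    "form_communication_efficiency": 0.05,
}

def weighted_composite(item_scores: dict) -> float:
    """Combine 0-10 item scores into a single 0-10 weighted composite."""
    assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * item_scores[item] for item, w in RUBRIC_WEIGHTS.items())

# Hypothetical response scored on all seven items:
example = {
    "clinical_accuracy_guideline_concordance": 9,
    "safety_risk_mitigation": 8,
    "completeness_contextualization": 9,
    "actionability_clarity": 9,
    "reference_quality": 7,
    "transparency_about_uncertainty": 6,
    "form_communication_efficiency": 8,
}
print(round(weighted_composite(example), 2))  # -> 8.45
```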
Results
Using the seven-item, 0–10 rubric (weights 45/15/10/10/10/5/5%), the standardized composite scores were Code Red 8.8, Expert 1 8.0, Expert 2 7.6, and Expert 3 5.3. Code Red exceeded the top individual expert by 0.8 points (8% of the 10-point scale) and the aggregated expert mean (≈7.0; median 7.6; range 5.3–8.0) by 1.8 points (≈26% relative gain).
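These aggregate comparisons follow directly from the composite scores quoted above; the short sketch below simply recomputes the expert mean and the reported margins.

```python
# Recomputing the aggregate comparisons directly from the composite scores
# reported above (0-10 scale); no data beyond the quoted figures are used.
code_red = 8.8
experts = [8.0, 7.6, 5.3]  # Experts 1-3

expert_mean = sum(experts) / len(experts)                       # ~6.97, reported as ~7.0
gain_vs_top_expert = code_red - max(experts)                    # 0.8 points (8% of the scale)
relative_gain_vs_mean = (code_red - expert_mean) / expert_mean  # ~0.26, i.e. ~26%

print(round(expert_mean, 2), round(gain_vs_top_expert, 2), round(relative_gain_vs_mean, 2))
# -> 6.97 0.8 0.26
```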
Criterion-wise, Code Red scored 10/10 in clinical accuracy/guideline concordance and in safety/risk mitigation, and ≥9/10 in completeness, actionability, and form/communication efficiency; transparency about uncertainty was moderate (6/10). Averaged across the three experts, per-criterion scores were 8.0 for accuracy, 8.0 for safety, 7.7 for completeness, 8.3 for actionability, 3.3 for transparency, and 7.0 for form/communication efficiency.
After weighting, the blinded adjudicator ranked Code Red's response first in every case. Its margin was driven by a consistently protocol-level presentation, with explicit monitoring schedules, predefined intervention thresholds (e.g., tocilizumab after 24 h of persistent fever), steroid regimens, ICU escalation criteria, and tertiary options (anakinra/siltuximab), delivered in concise, highly actionable language.
Conclusions
A lean, expert-guided RAG system (Code Red) can outperform individual CAR-T specialists on simulated toxicity management scenarios while delivering rapid guidance. Ongoing improvement will rely on continuous user feedback and automated literature surveillance to preserve patient safety and guideline fidelity.