Human and AI collaboration failures and model performance gaps in cardiac surgery: a blinded two-phase evaluation of five large language models

European Heart Journal - Digital Health

12 January 2026
Organised by: ESC Journals

Abstract

Background

Large language models (LLMs) have demonstrated strong general medical capabilities, but most existing studies rely on simplified scenarios or multiple-choice formats that emphasize factual recall and constrained response structures, rather than the open-ended, multi-step reasoning needed in real clinical care. Although newer models claim improved reasoning, few have been tested in settings requiring the integration of interacting variables, longitudinal information, and context-specific decision logic. Prior research has also focused primarily on output accuracy, with limited attention to how clinicians engage with model responses, including whether they can recognize subtle yet serious errors or fully utilize accurate and relevant insights.

Purpose

To evaluate the clinical performance of LLMs in complex cardiac surgery scenarios and to assess patterns of human–AI collaboration.

Methods

A panel of senior cardiac surgeons independently developed 15 high-fidelity cardiac surgery scenarios, each paired with a clinically relevant open-ended reasoning task, expert-curated reference answers, and a 10-dimensional weighted evaluation framework. Five LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, and Llama3-OpenBioLLM-70B) were prompted using a multi-agent strategy to generate free-text, open-ended clinical responses. A separate group of senior surgeons conducted a blinded two-phase evaluation to assess model output and evaluator judgment shifts: in the first round, they rated LLMs independently; in the second, they were shown the reference answers and invited to revise their ratings, with changes being optional.
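As an illustration of how a weighted, multi-dimensional framework can yield a single normalized score, the following is a minimal sketch only. The abstract does not specify the scoring formula, the weights, or the rating scale, so everything here is an assumption: ratings are treated as binary (affirmative or negative) per dimension, the score is the weighted share of affirmative ratings, and the function name `normalized_score`, the four dimension keys, and the weight values are hypothetical.

```python
from typing import Mapping


def normalized_score(ratings: Mapping[str, bool],
                     weights: Mapping[str, float]) -> float:
    """Return the weighted share of affirmative ratings, in [0, 1].

    Assumption (not from the abstract): each dimension is rated
    affirmative (True) or negative (False), and the score is
    sum(weights of affirmative dimensions) / sum(all weights).
    """
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same dimensions")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    earned = sum(weights[dim] for dim, ok in ratings.items() if ok)
    return earned / total


# Hypothetical weights and ratings, for illustration only.
example_weights = {
    "scenario_comprehension": 1.0,
    "patient_safety": 3.0,
    "hallucination_avoidance": 2.0,
    "clinical_efficiency": 2.0,
}
example_ratings = {
    "scenario_comprehension": True,
    "patient_safety": False,
    "hallucination_avoidance": True,
    "clinical_efficiency": True,
}
print(normalized_score(example_ratings, example_weights))  # 5.0 / 8.0 = 0.625
```

Under this assumed scheme, a negative rating on a heavily weighted dimension lowers the score more than the same rating on a lightly weighted one, which is the general purpose of weighting.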

Results

LLM performance varied across scenarios, but relative rankings remained stable. Median normalized scores were highest for O1 (0.896), followed by O3-mini-high (0.854), DeepSeek-R1 (0.792), GPT-4 (0.667), and Llama3-OpenBioLLM-70B (0.521). Across evaluation dimensions, scenario comprehension scored highest (0.920), while patient safety (0.507), hallucination avoidance (0.549), and clinical efficiency (0.597) were lowest across models. Second-round normalized scores declined for four LLMs, with 7.57% of ratings revised from affirmative to negative and only 2.59% from negative to affirmative. Among the five highest-weighted evaluation dimensions, 10.16% of second-round ratings were revised from affirmative to negative.
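The revision percentages above can be read as shares of all paired first-round and second-round ratings. The sketch below shows one way such shares might be tallied; it assumes binary ratings and uses a hypothetical function name `revision_rates` and made-up input pairs, and it is not the authors' analysis code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def revision_rates(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Tally how ratings changed between two rounds.

    Each pair is (first_round, second_round), with True meaning an
    affirmative rating. Returns each category as a fraction of all pairs.
    """
    counts: Counter = Counter()
    n = 0
    for first, second in pairs:
        n += 1
        if first and not second:
            counts["affirmative_to_negative"] += 1
        elif not first and second:
            counts["negative_to_affirmative"] += 1
        else:
            counts["unchanged"] += 1
    if n == 0:
        raise ValueError("at least one rating pair is required")
    keys = ("affirmative_to_negative", "negative_to_affirmative", "unchanged")
    return {key: counts[key] / n for key in keys}


# Made-up pairs for illustration; not data from the study.
demo_pairs = [(True, False), (False, True), (True, True), (False, False)]
print(revision_rates(demo_pairs))
```

A larger share of affirmative-to-negative revisions than of negative-to-affirmative revisions, as reported above, is the pattern the abstract describes as overacceptance on first review.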

Conclusion

The reasoning-optimized proprietary LLM achieved the best performance in cardiac surgery tasks, but all models showed consistent deficits in key clinical dimensions. The most frequent human–LLM failure mode was overacceptance, where flawed outputs were not identified on first review. These findings suggest that performance limitations combined with overacceptance of model outputs may pose greater risks than inaccuracy alone. Future studies should go beyond accuracy and consider how model responses affect clinical decision making, especially in complex, time-pressured settings.

Graphical abstract-Study Design

Graphical abstract-Main Results

Contributors

M Leon

Author

Stanford University School of Medicine, Palo Alto, United States of America

R B Feng

Author

H He

Author