Improving large language models accuracy for aortic stenosis treatment via Heart Team simulation: a prompt design analysis
European Heart Journal - Digital Health

Abstract
Large language models (LLMs) have shown potential in clinical decision support, but the influence of prompt design on their performance, particularly in complex cardiology decision-making, is not well understood.
We retrospectively reviewed 231 patients evaluated by our Heart Team for severe aortic stenosis, with treatment options including surgical aortic valve replacement, transcatheter aortic valve implantation, or medical therapy. We tested multiple prompt-design strategies using zero-shot (0-shot), Chain-of-Thought (CoT), and Tree-of-Thought (ToT) prompting, combined with few-shot prompting, free/guided-thinking, and self-consistency. Patient data were condensed into standardized vignettes and submitted to GPT-4o (version 2024-05-13, OpenAI) 40 times per patient under each prompt (147 840 total queries). The primary endpoint was mean accuracy; secondary endpoints included sensitivity, specificity, area under the curve (AUC), and treatment invasiveness. Guided-thinking-ToT achieved the highest accuracy (94.04%, 95% CI 90.87–97.21), significantly outperforming few-shot-ToT (87.16%, 95% CI 82.68–91.63) and few-shot-CoT (85.32%, 95% CI 80.59–90.06;
Prompt design significantly impacts LLM performance in clinical decision-making for severe aortic stenosis. Tree-of-Thought prompting markedly improved accuracy and aligned recommendations with expert decisions, though LLMs tended toward conservative treatment approaches.
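The repeated-query design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the treatment labels `SAVR`, `TAVI`, and `medical` are shorthand chosen here, the helper names are hypothetical, majority voting is only one common way to implement self-consistency, and the normal-approximation interval is an assumption that may differ from the method used in the study. The query count check assumes 16 prompt configurations, a figure inferred from 147 840 / (231 × 40) rather than stated in the abstract.

```python
from collections import Counter
import math


def majority_vote(answers):
    """Self-consistency by majority vote: return the most frequent answer
    among repeated queries for one patient (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def accuracy_with_ci(predictions, references, z=1.96):
    """Accuracy against the reference decisions, with a normal-approximation
    95% confidence interval (an assumed method, clipped to [0, 1])."""
    n = len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    acc = correct / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


def total_queries(n_patients, runs_per_patient, n_prompts):
    """Number of model calls for a full factorial run."""
    return n_patients * runs_per_patient * n_prompts


if __name__ == "__main__":
    # Hypothetical repeated answers for a single vignette.
    runs = ["TAVI", "TAVI", "SAVR", "TAVI", "medical"]
    print(majority_vote(runs))

    # 231 patients x 40 runs x 16 prompts (16 is inferred, see lead-in).
    print(total_queries(231, 40, 16))
```

With 231 patients and 40 runs each, every prompt configuration accounts for 9 240 queries, so the reported total is consistent with 16 configurations under this assumption.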
Contributors

Authors: Dorian Garin, Stéphane Cook, Charlie Ferry, Wesley Bennar, Mario Togni, Pascal Meier, Peter Wenaweser, Serban Puricel, Diego Arroyo
