Clinical performance and readability evaluation of large language models for patient communication in heart failure and cardiomyopathies
European Heart Journal - Digital Health

Abstract
Large language models (LLMs) are increasingly used by patients seeking cardiovascular health information through digital platforms. However, their accuracy and suitability for providing guidance on heart failure and cardiomyopathy remain inadequately evaluated.
This study systematically benchmarked six state-of-the-art LLMs in generating responses to patient-oriented heart failure and cardiomyopathy queries, focusing on clinical appropriateness and comprehensibility.
We queried six prominent LLMs (OpenAI GPT-4o, DeepSeek Chat, Gemini 2.5 Pro, Anthropic Claude 3.7 Sonnet, Perplexity Sonar Pro, and xAI Grok-3) via standardized APIs with 50 curated questions covering disease understanding, diagnosis, treatment, prognosis, and lifestyle concerns. A web-based evaluation platform randomized and blinded the responses for assessment by twelve reviewers (3 cardiologists, 3 medical students, 6 AI auto-graders). Responses were rated on a 1-5 Likert scale across nine domains: appropriateness, comprehensibility, completeness, conciseness, confabulation avoidance, readability, educational value, actionability, and tone/empathy. Reviewers also chose their preferred model for each question.
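The randomized, blinded presentation described above could be sketched roughly as follows. This is a minimal illustration and not the study's platform: the function name `blind_responses`, the letter labels, and the per-question seed are all assumptions introduced here.

```python
import random
import string


def blind_responses(responses, seed):
    """Shuffle one question's responses and hide which model wrote each.

    responses: dict mapping model name -> response text (hypothetical input shape).
    Returns (blinded, key). `blinded` is a list of (label, text) pairs in display
    order; `key` maps each label back to its model and is withheld from reviewers.
    """
    rng = random.Random(seed)           # seeded so the order can be reproduced later
    models = sorted(responses)          # fixed starting order before shuffling
    rng.shuffle(models)
    labels = string.ascii_uppercase[:len(models)]
    blinded = [(label, responses[model]) for label, model in zip(labels, models)]
    key = dict(zip(labels, models))
    return blinded, key


# Example with placeholder model names and texts (not study data).
example = {"model_a": "Answer one.", "model_b": "Answer two.", "model_c": "Answer three."}
blinded, key = blind_responses(example, seed=1)
```

Reviewers would see only the labels and texts in `blinded`; ratings are joined back to models through `key` after scoring is complete.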
Linguistic complexity and output length varied substantially. Gemini provided the most readable responses (Flesch-Kincaid Grade 11.9±1.8) but was the most verbose (668.7±116.1 words), while Claude generated the shortest responses (226.9±38.9 words) with higher complexity (Flesch-Kincaid Grade 35.2±20.5). Across 2,700 ratings, Gemini received the highest composite mean rating (4.55±0.02), excelling in completeness and factual reliability, followed by xAI Grok (4.41±0.02), OpenAI GPT-4o (4.26±0.02), DeepSeek (4.20±0.02), Claude (4.15±0.02), and Perplexity (4.00±0.02). Confabulation avoidance scored consistently high across all models (4.49±0.02), while conciseness scored lowest (3.81±0.05). In line with these ratings, evaluators selected Gemini as their preferred model in 43.7% of cases, followed by xAI Grok (30.3%) and OpenAI GPT-4o (11.7%). Rating tendencies varied by evaluator group: auto-graders gave the highest average scores (mean 4.58±0.60), followed by students (4.10±0.88), while experts were more conservative (3.79±0.93), with stricter grading that placed their scores closer to the neutral midpoint.
[Figure: Key findings. Strengths and weaknesses of LLMs.]
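For readers unfamiliar with the readability metric reported above, the Flesch-Kincaid Grade Level is defined as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula; the abstract does not say which tool the study used, and the syllable counter here is a rough vowel-group heuristic assumed for illustration, so its values will differ from those of dedicated readability software.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups, dropping a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level using simple sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Short sentences of one-syllable words score low; long technical words score high.
simple = flesch_kincaid_grade("The cat sat. The dog ran.")
technical = flesch_kincaid_grade("Cardiomyopathy impairs ventricular contraction.")
```

Because the score grows with both sentence length and syllables per word, dense medical vocabulary can push a short response to a high grade level, which is one way a concise answer can still read as complex.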




