Clinical performance and readability evaluation of large language models for patient communication in heart failure and cardiomyopathies

European Heart Journal - Digital Health

12 January 2026

Abstract

Background

Large language models (LLMs) are increasingly used by patients seeking cardiovascular health information through digital platforms. However, their accuracy and suitability for providing guidance on heart failure and cardiomyopathy remain inadequately evaluated.

Purpose

This study systematically benchmarked six state-of-the-art LLMs in generating responses to patient-oriented heart failure and cardiomyopathy queries, focusing on clinical appropriateness and comprehensibility.

Methods

We tested six prominent LLMs (OpenAI GPT-4o, DeepSeek Chat, Gemini 2.5 Pro, Anthropic Claude 3.7 Sonnet, Perplexity Sonar Pro, and xAI Grok-3) on 50 curated questions covering disease understanding, diagnosis, treatment, prognosis, and lifestyle concerns. All models were queried via standardized API interfaces. A web-based evaluation platform randomized and blinded the responses for assessment by twelve reviewers (3 cardiologists, 3 medical students, 6 AI auto-graders). Responses were rated on a 1-5 Likert scale across nine domains: appropriateness, comprehensibility, completeness, conciseness, confabulation avoidance, readability, educational value, actionability, and tone/empathy. Reviewers also chose their preferred model for each question.
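The abstract does not describe how the platform randomized and blinded responses. One plausible approach, shown here as a minimal sketch, is to shuffle the six responses for each question and replace model names with neutral labels, keeping the label-to-model key away from reviewers. The function name `blind_responses`, the label format, and the seeding scheme are illustrative assumptions, not the study's implementation.

```python
import random


def blind_responses(responses, seed):
    """Shuffle one question's responses and hide which model wrote each.

    responses: dict mapping model name -> response text (hypothetical input shape).
    seed: value used to make the shuffle reproducible for auditing.
    Returns (blinded, key):
      blinded: list of (neutral_label, text) in randomized order, shown to reviewers
      key: dict neutral_label -> model name, kept hidden until unblinding
    """
    rng = random.Random(seed)
    models = list(responses)
    rng.shuffle(models)  # randomize presentation order

    # Neutral labels "Response A", "Response B", ... carry no model identity.
    labels = [f"Response {chr(ord('A') + i)}" for i in range(len(models))]

    blinded = [(label, responses[model]) for label, model in zip(labels, models)]
    key = dict(zip(labels, models))
    return blinded, key


if __name__ == "__main__":
    # Placeholder texts only; real responses would come from the model APIs.
    example = {
        "GPT-4o": "text 1",
        "DeepSeek Chat": "text 2",
        "Gemini 2.5 Pro": "text 3",
        "Claude 3.7 Sonnet": "text 4",
        "Sonar Pro": "text 5",
        "Grok-3": "text 6",
    }
    blinded, key = blind_responses(example, seed=1)
    for label, text in blinded:
        print(label, "->", text)
```

Seeding per question (for example with the question index) would give each question its own order while keeping the assignment reproducible when ratings are later unblinded.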

Results

Linguistic complexity and output length varied substantially. Gemini provided the most readable responses (Flesch-Kincaid Grade 11.9±1.8) but was the most verbose (668.7±116.1 words), while Claude generated the shortest responses (226.9±38.9 words) with higher complexity (Flesch-Kincaid Grade 35.2±20.5). Across 2,700 ratings, Gemini received the highest composite mean rating (4.55±0.02), excelling in completeness and factual reliability, followed by xAI Grok (4.41±0.02), OpenAI GPT-4o (4.26±0.02), DeepSeek (4.20±0.02), Claude (4.15±0.02), and Perplexity (4.00±0.02). Confabulation avoidance scored consistently high across all models (4.49±0.02), while conciseness scored lowest (3.81±0.05). In line with these ratings, evaluators selected Gemini as their preferred model in 43.7% of cases, followed by xAI (30.3%) and OpenAI (11.7%). Rating tendencies varied by evaluator group: auto-graders gave the highest average scores (mean 4.58±0.60), followed by students (4.10±0.88), while experts were more conservative (3.79±0.93), reflecting stricter grading with scores closer to neutral.

Discussion

Gemini achieved the highest overall performance across appropriateness, completeness, and actionability, suggesting strong potential for patient-facing cardiovascular communication. All LLMs were largely accurate in avoiding medical misinformation, although readability and comprehensiveness varied between models. Major factual errors and hallucinations were rare in our blinded evaluation but not entirely absent. Differences in grading strictness between experts and other raters further emphasize the need for careful validation of chatbot outputs in clinical settings. LLMs hold promise for enhancing patient education but should be deployed with oversight and model-specific awareness.
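For context on the readability figures, the Flesch-Kincaid Grade Level is conventionally computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, with higher values indicating harder text. The abstract does not state which tool produced its scores, so the following is only a minimal sketch of the standard formula. The syllable counter is a crude vowel-group heuristic introduced here for illustration, and dedicated readability libraries will give somewhat different values.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, drop a likely silent final 'e'.

    This heuristic is an assumption of this sketch, not a linguistic standard.
    """
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and n > 1:
        n -= 1
    return max(n, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level from the standard formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    simple = "The cat sat on the mat."
    dense = "Cardiomyopathy necessitates individualized pharmacological management."
    print(round(flesch_kincaid_grade(simple), 2))
    print(round(flesch_kincaid_grade(dense), 2))
```

Because the formula weights words per sentence and syllables per word, long sentences and polysyllabic medical terms both push the grade upward, and short samples or unusual sentence segmentation can produce extreme values.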

Key findings.

Strengths and Weaknesses of LLMs.