How well do large language models interpret ECGs? A comparative benchmark of ChatGPT-4o, Claude, and Gemini using standardised teaching cases
European Heart Journal - Digital Health

Abstract
Large language models (LLMs) such as ChatGPT (OpenAI), Claude (Anthropic), and Gemini (Google DeepMind) are increasingly explored for clinical decision support. However, their ability to accurately interpret electrocardiograms (ECGs)—a critical diagnostic tool in cardiology—remains underexamined. This study compared the diagnostic performance of three leading LLMs in interpreting core ECG parameters using validated teaching cases.
The aim was to benchmark and compare the diagnostic accuracy, consistency, and parameter-specific performance of ChatGPT, Claude, and Gemini across a standardised set of ECG interpretation prompts.
Seventy ECG images were sampled from six volumes of "Podrid's Real World ECGs", encompassing rhythm, rate, axis, intervals, and morphological abnormalities. Each ECG case was paired with a standardised prompt and submitted to each LLM in a new chat instance. LLM responses were scored against the corresponding textbook diagnosis on a standard 7-point rubric (1–7) by two independent reviewers blinded to model identity. Discrepancies were resolved by consensus. Repeated-measures ANOVA tested for differences in overall performance, within each textbook volume, and across ECG features. Graphical comparisons were generated in Excel.
Overall, there were no significant performance differences across models: Gemini (6.09 ± 0.33), ChatGPT (6.04 ± 0.33), and Claude (5.41 ± 0.39); p = 0.08. Subgroup analysis showed that ChatGPT significantly outperformed Gemini (mean difference [MD]: 2.4; p = 0.0355) and Claude (MD: 4.2; p = 0.0012) in Book 6. Performance varied significantly by ECG feature (F(6, 1242) = 58.03, p < 0.001), with no interaction between model and feature (F(12, 1242) = 0.52, p = 0.639), indicating shared weaknesses in features such as axis determination and bundle branch block interpretation. Standard deviations often exceeded 0.5, suggesting response variability.
Current general-purpose LLMs demonstrate moderate but variable accuracy in ECG interpretation, with no model achieving consistently superior performance. Their shared struggles with specific ECG parameters highlight a need for cardiology-specific model fine-tuning. Until such refinement occurs, LLMs should be used cautiously for ECG interpretation in educational or clinical settings.
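For readers unfamiliar with the repeated-measures ANOVA named in the methods, the following is a minimal sketch of the one-way case, in which each ECG case is scored once per model. It is not the authors' analysis code: the function name `rm_anova_oneway` and the example scores are hypothetical, and the feature-level and interaction tests reported above would need a two-way design that this sketch does not cover.

```python
from statistics import mean


def rm_anova_oneway(scores):
    """One-way repeated-measures ANOVA (hypothetical helper, illustration only).

    scores: list of rows, one row per ECG case, one rubric score per model.
    Returns (F statistic, df for models, df for error).
    """
    n = len(scores)       # number of ECG cases (the repeated "subjects")
    k = len(scores[0])    # number of models (the within-case conditions)
    if n < 2 or k < 2:
        raise ValueError("need at least two cases and two models")
    if any(len(row) != k for row in scores):
        raise ValueError("every case needs one score per model")

    grand = mean([x for row in scores for x in row])
    model_means = [mean([row[j] for row in scores]) for j in range(k)]
    case_means = [mean(row) for row in scores]

    # Partition total variability into model, case, and residual parts.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_models = n * sum((m - grand) ** 2 for m in model_means)
    ss_cases = k * sum((m - grand) ** 2 for m in case_means)
    ss_error = ss_total - ss_models - ss_cases

    df_models = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_models / df_models) / (ss_error / df_error)
    return f_stat, df_models, df_error


# Made-up rubric scores for four cases and three models (not study data).
example_scores = [
    [6, 6, 5],
    [7, 6, 6],
    [5, 6, 4],
    [6, 7, 5],
]
f_stat, df_models, df_error = rm_anova_oneway(example_scores)
print(f"F({df_models}, {df_error}) = {f_stat:.2f}")
```

The sketch stops at the F statistic. A p-value could be obtained from the F distribution, for example with `scipy.stats.f.sf(f_stat, df_models, df_error)`, and a library routine such as `AnovaRM` in statsmodels would be the more usual choice for a full analysis.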


