How well do large language models interpret ECGs? A comparative benchmark of ChatGPT-4o, Claude, and Gemini using standardised teaching cases

European Heart Journal - Digital Health

12 January 2026

Abstract

Background

Large language models (LLMs) such as ChatGPT (OpenAI), Claude (Anthropic), and Gemini (Google DeepMind) are increasingly explored for clinical decision support. However, their ability to accurately interpret electrocardiograms (ECGs)—a critical diagnostic tool in cardiology—remains underexamined. This study compared the diagnostic performance of three leading LLMs in interpreting core ECG parameters using validated teaching cases.

Aim

To benchmark and compare the diagnostic accuracy, consistency, and parameter-specific performance of ChatGPT, Claude, and Gemini across a standardised set of ECG interpretation prompts.

Methods

Seventy ECG images were sampled from six volumes of "Podrid's Real World ECGs", encompassing rhythm, rate, axis, intervals, and morphological abnormalities. Each ECG case was paired with a standardised prompt and submitted to each LLM in a new chat instance. LLM responses were scored against the corresponding textbook diagnosis using a standard 7-point rubric (1–7), with two independent reviewers blinded to model identity. Discrepancies were resolved by consensus. Repeated-measures ANOVA tested differences in overall performance, per textbook, and across ECG features. Graphical comparisons were generated in Excel.
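As an illustration of the statistical test named above, the following is a minimal sketch of a one-way repeated-measures ANOVA, in which each ECG case acts as the repeated "subject" and each model is a condition. It is not the authors' analysis code: the function name `rm_anova_f` and the toy scores are hypothetical, and the study's reported analysis also included an ECG-feature factor that this one-way sketch omits.

```python
from typing import List, Tuple


def rm_anova_f(scores: List[List[float]]) -> Tuple[float, int, int]:
    """One-way repeated-measures ANOVA.

    scores[i][j] is the rubric score for case i under condition (model) j.
    Returns (F statistic, df for conditions, df for error).
    """
    n = len(scores)          # number of cases (repeated "subjects")
    k = len(scores[0])       # number of conditions (models)
    grand = sum(sum(row) for row in scores) / (n * k)

    # Between-condition variability (what we want to test).
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)

    # Between-case variability (removed from the error term,
    # which is the point of a repeated-measures design).
    case_means = [sum(row) / k for row in scores]
    ss_case = k * sum((m - grand) ** 2 for m in case_means)

    # Residual variability after removing condition and case effects.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_cond - ss_case

    df_cond = k - 1
    df_err = (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_err / df_err)
    return f_stat, df_cond, df_err


# Hypothetical toy data: 4 cases scored for 3 models on a 1-7 rubric.
toy_scores = [
    [6, 6, 5],
    [7, 6, 6],
    [5, 6, 4],
    [6, 7, 5],
]
f_stat, df_cond, df_err = rm_anova_f(toy_scores)
print(f"F({df_cond},{df_err}) = {f_stat:.2f}")
```

The F statistic would then be compared against the F distribution with the returned degrees of freedom to obtain a p-value; a statistics package would normally handle that step, along with sphericity checks.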

Results

Overall, there were no significant performance differences across models: Gemini (6.09 ± 0.33), ChatGPT (6.04 ± 0.33), and Claude (5.41 ± 0.39); p = 0.08. In subgroup analysis, ChatGPT significantly outperformed Gemini (mean difference [MD]: 2.4; p = 0.0355) and Claude (MD: 4.2; p = 0.0012) in Book 6. Performance varied significantly by ECG feature (F(6,1242) = 58.03, p < 0.001), with no interaction between model and feature (F(12,1242) = 0.52, p = 0.639), indicating that the models share weaknesses in features such as axis determination and bundle branch block interpretation. Standard deviations often exceeded 0.5, suggesting response variability.

Conclusion

Current general-purpose LLMs demonstrate moderate but variable accuracy in ECG interpretation, with no model achieving consistently superior performance. Their shared struggles with specific ECG parameters highlight a need for cardiology-specific model fine-tuning. Until such refinement occurs, LLMs should be used cautiously for ECG interpretation in educational or clinical settings.

Contributors


H Kamalanathan

A Wilson-Smith

D Downes

P Sahai

K Kaleeny

J Millhouse
