Are AI-generated electrocardiograms clinically accurate? Benchmarking accuracy of AI-generated ECGs: a multiplatform performance study of public LLMs

European Heart Journal - Digital Health

12 January 2026
Organised by: ESC Journals

Abstract

Background

The use of generative AI to simulate electrocardiograms (ECGs) is expanding in medical education and digital cardiology. However, the diagnostic accuracy of ECGs produced by publicly accessible AI platforms has not been systematically evaluated. This study assessed whether synthetic ECGs generated by general-purpose AI services can accurately represent pre-specified arrhythmias.

Purpose

To evaluate the diagnostic accuracy and interpretability of ECGs generated by three widely available public AI services when prompted to simulate specific cardiac rhythms.

Methods

Bard, Bing Image Creator, and DALL-E were each prompted to generate ECG strips for ten common cardiac rhythms: sinus rhythm, sinus tachycardia, sinus bradycardia, atrial fibrillation, atrial flutter, ventricular tachycardia, ventricular fibrillation, complete heart block, supraventricular tachycardia, and asystole. Each platform produced four ECGs per rhythm (n=120). After excluding duplicates and non-ECG outputs (n=7), 113 ECGs remained. Three blinded physicians, including a cardiologist, independently reviewed each image and attempted to diagnose the rhythm. Discrepancies were resolved via adjudication. Accuracy was defined as agreement between the prompted rhythm and final expert consensus. Results were stratified by platform and rhythm.
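The accuracy definition above (agreement between the prompted rhythm and the adjudicated consensus, stratified by platform or rhythm) can be illustrated with a minimal Python sketch. The record layout, the field names (`platform`, `prompted`, `consensus`), and the toy records are assumptions made for illustration only; they are not the study's data or its analysis code.

```python
from collections import defaultdict


def stratified_accuracy(records, key):
    """Return {group: fraction of records whose consensus matches the prompt}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        # An image counts as accurate only if the final expert consensus
        # equals the rhythm that was requested in the prompt.
        if rec["consensus"] == rec["prompted"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical toy records, NOT study data.
records = [
    {"platform": "A", "prompted": "atrial flutter", "consensus": "atrial flutter"},
    {"platform": "A", "prompted": "asystole", "consensus": "uninterpretable"},
    {"platform": "B", "prompted": "sinus rhythm", "consensus": "sinus rhythm"},
    {"platform": "B", "prompted": "sinus rhythm", "consensus": "sinus tachycardia"},
]

by_platform = stratified_accuracy(records, "platform")
by_rhythm = stratified_accuracy(records, "prompted")
```

Passing a different `key` reuses the same function for both stratifications described in the Methods.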

Results

Only 37 of 113 ECGs (32.7%) accurately matched the intended rhythm. Additionally, 25.1% were uninterpretable due to graphical artefacts, physiologically implausible tracings, or distorted morphology. Bard produced the highest proportion of correct ECGs (84.5%) but primarily retrieved existing online images. Bing and DALL-E achieved rhythm-matched outputs in only 12.5% and 10% of cases, respectively. Atrial flutter (58.3%) and atrial fibrillation (50%) were the most accurately generated rhythms.
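As a quick arithmetic check, the headline figures can be reproduced from the counts stated in the Methods and Results. The variable names below are illustrative; only the counts (three platforms, ten rhythms, four ECGs per rhythm, seven exclusions, 37 matches) come from the abstract.

```python
# Counts taken from the abstract; variable names are illustrative.
platforms, rhythms, per_rhythm = 3, 10, 4

generated = platforms * rhythms * per_rhythm  # total images requested
excluded = 7                                  # duplicates and non-ECG outputs
analysed = generated - excluded               # images sent for review

matched = 37                                  # consensus agreed with the prompt
overall_pct = round(100 * matched / analysed, 1)

print(generated, analysed, overall_pct)
```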

Conclusion

Synthetic ECGs generated by public AI tools demonstrate poor and inconsistent diagnostic accuracy. While Bard produced more rhythm-matched images, these were often retrieved rather than generated. These findings highlight the current limitations of publicly available generative AI for ECG simulation and support the need for domain-specific models before integration into clinical education.

Contributors

H Kamalanathan

Author

Prince of Wales Hospital, Sydney, Australia

A Vuong

Author

S Bacchi

Author

S Evans

Author

J Kovoor

Author

A Gupta

Author

D Downes

Author

P Sahai

Author
