Are AI-generated electrocardiograms clinically accurate? Benchmarking accuracy of AI-generated ECGs: a multiplatform performance study of public LLMs
European Heart Journal - Digital Health

Abstract
The use of generative AI to simulate electrocardiograms (ECGs) is expanding in medical education and digital cardiology. However, the diagnostic accuracy of ECGs produced by publicly accessible AI platforms has not been systematically evaluated. This study assessed whether synthetic ECGs generated by general-purpose AI services can accurately represent pre-specified arrhythmias.
The aim was to evaluate the diagnostic accuracy and interpretability of ECGs generated by three widely available public AI services when prompted to simulate specific cardiac rhythms.
Bard, Bing Image Creator, and DALL-E were each prompted to generate ECG strips for ten common cardiac rhythms: sinus rhythm, sinus tachycardia, sinus bradycardia, atrial fibrillation, atrial flutter, ventricular tachycardia, ventricular fibrillation, complete heart block, supraventricular tachycardia, and asystole. Each platform produced four ECGs per rhythm (n=120). After excluding duplicates and non-ECG outputs (n=7), 113 ECGs remained. Three blinded physicians, including a cardiologist, independently reviewed each image and attempted to diagnose the rhythm. Discrepancies were resolved via adjudication. Accuracy was defined as agreement between the prompted rhythm and final expert consensus. Results were stratified by platform and rhythm.
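The accuracy definition above (agreement between the prompted rhythm and the adjudicated consensus, stratified by platform and by rhythm) can be illustrated with a minimal sketch. The record layout, the function name `stratified_accuracy`, and the toy records are hypothetical and are not the study's data or analysis code.

```python
from collections import defaultdict


def stratified_accuracy(records, key):
    """Group records by `key` and return {stratum: (n_correct, n_total, accuracy)}.

    A record counts as correct when the prompted rhythm equals the
    adjudicated expert consensus label.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_correct, n_total]
    for rec in records:
        stratum = rec[key]
        counts[stratum][1] += 1
        if rec["prompted"] == rec["consensus"]:
            counts[stratum][0] += 1
    return {s: (c, n, c / n) for s, (c, n) in counts.items()}


# Hypothetical toy records for illustration only (not study data).
records = [
    {"platform": "Bard", "prompted": "atrial flutter", "consensus": "atrial flutter"},
    {"platform": "Bard", "prompted": "asystole", "consensus": "asystole"},
    {"platform": "DALL-E", "prompted": "atrial flutter", "consensus": "uninterpretable"},
    {"platform": "DALL-E", "prompted": "sinus rhythm", "consensus": "sinus rhythm"},
    {"platform": "Bing Image Creator", "prompted": "asystole",
     "consensus": "ventricular fibrillation"},
]

by_platform = stratified_accuracy(records, "platform")
by_rhythm = stratified_accuracy(records, "prompted")
```

In this sketch an uninterpretable tracing is simply a consensus label that never equals a prompted rhythm, so it counts as incorrect; whether the study handled such cases the same way is an assumption.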
Only 37 of 113 ECGs (32.7%) accurately matched the intended rhythm. Additionally, 25.1% were uninterpretable due to graphical artefacts, physiologically implausible tracings, or distorted morphology. Bard produced the highest proportion of correct ECGs (84.5%) but primarily retrieved existing online images. Bing and DALL-E achieved rhythm-matched outputs in only 12.5% and 10% of cases, respectively. Atrial flutter (58.3%) and atrial fibrillation (50%) were the most accurately generated rhythms.
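As a quick arithmetic check, the reported sample size and overall accuracy can be reproduced from the counts given in the abstract. The variable names are illustrative; only the numbers 3, 10, 4, 7, and 37 come from the text.

```python
# Counts taken from the abstract.
platforms = 3          # Bard, Bing Image Creator, DALL-E
rhythms = 10           # ten prompted cardiac rhythms
per_rhythm = 4         # ECGs requested per rhythm per platform
excluded = 7           # duplicates and non-ECG outputs
correct = 37           # ECGs matching the prompted rhythm

generated = platforms * rhythms * per_rhythm   # total outputs requested
analysed = generated - excluded                # outputs reviewed by physicians
overall_pct = round(100 * correct / analysed, 1)

print(generated, analysed, overall_pct)
```

The per-platform percentages are not recomputed here because the abstract does not report how the seven exclusions were distributed across platforms.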
Synthetic ECGs generated by public AI tools demonstrate poor and inconsistent diagnostic accuracy. While Bard produced more rhythm-matched images, these were often retrieved rather than generated. These findings highlight the current limitations of publicly available generative AI for ECG simulation and support the need for domain-specific models before integration into clinical education.
Contributors
Authors: A Vuong, S Bacchi, S Evans, J Kovoor, A Gupta, V Premkumar, A Wilson-Smith, D Downes, K Kaleeny, P Sahai, J Millhouse