Challenges in the application of ECG foundation models to out-of-domain populations: an evaluation in professional athletes
European Heart Journal - Digital Health

Abstract
Comprehensive cardiovascular screening is crucial to safeguarding athletes' health, and ECG screening is a key part of this process. AI support could be essential, with recent ECG foundation models providing scalable, low-input tools for rapid, cost-effective interpretation, enabling large-scale, long-term screening. These models are pretrained and fine-tuned primarily in clinical contexts; hence, it is crucial to evaluate their performance in out-of-domain populations such as athletes, particularly given the limited availability of athlete-specific data for fine-tuning. Ensuring robustness in these settings is vital to safely distinguish normal adaptations from true abnormalities.
This study evaluates the out-of-domain generalization of a recently developed ECG foundation model for ECG interpretation, using a dataset of athletes' ECGs.
A cohort of 43 professional athletes underwent 12-lead ECG screening. A board-certified cardiologist annotated each athlete's ECG to provide reference labels for evaluating model performance. We applied a recently published ECG foundation model, ECG-FM [2]. The model had been pre-trained on 873,632 ECGs sourced from PhysioNet2021 [3] (n = 85,955) and MIMIC-IV-ECG [4] (n = 787,677), then fine-tuned on the labelled PhysioNet2021 dataset to recognize various cardiac conditions. It was applied to the athlete ECGs without additional fine-tuning or adaptation and was blind to the cardiologist's annotations. For evaluation, per-class accuracy was calculated. Sensitivity, specificity, and area under the precision-recall curve (AUCPR) were computed for classes with at least one positive case (required for sensitivity) and at least one negative case (required for specificity). Additionally, per-patient accuracy and overall Hamming loss were calculated. A permutation test (1,000 label shuffles) assessed whether the observed per-patient accuracy exceeded chance.
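The evaluation metrics described above can be sketched as follows. This is a minimal illustration, not the study's code: it assumes binary label matrices of shape (patients, classes), the function names are hypothetical, and the permutation scheme (shuffling which patient each reference label row belongs to) is one plausible reading of "label shuffles" rather than the confirmed procedure. AUCPR is omitted because it requires the model's continuous scores.

```python
import numpy as np

def per_class_metrics(y_true, y_pred):
    """Per-class accuracy, sensitivity, specificity for binary (n_patients, n_classes) arrays."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0)
    tn = (~y_true & ~y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    pos, neg = tp + fn, tn + fp
    accuracy = (tp + tn) / y_true.shape[0]
    # Sensitivity is undefined (NaN) without positive cases; specificity without negatives.
    sensitivity = np.where(pos > 0, tp / np.maximum(pos, 1), np.nan)
    specificity = np.where(neg > 0, tn / np.maximum(neg, 1), np.nan)
    return accuracy, sensitivity, specificity

def per_patient_accuracy(y_true, y_pred):
    """Fraction of class labels predicted correctly for each patient."""
    return (np.asarray(y_true) == np.asarray(y_pred)).mean(axis=1)

def hamming_loss(y_true, y_pred):
    """Fraction of all individual patient-class labels that are wrong."""
    return (np.asarray(y_true) != np.asarray(y_pred)).mean()

def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """One-sided p-value: is mean per-patient accuracy higher than under shuffled labels?

    Assumption: reference label rows are permuted across patients.
    """
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    observed = per_patient_accuracy(y_true, y_pred).mean()
    exceed = 0
    for _ in range(n_permutations):
        shuffled = y_true[rng.permutation(y_true.shape[0])]
        if per_patient_accuracy(shuffled, y_pred).mean() >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (exceed + 1) / (n_permutations + 1)
```

With 1,000 shuffles and the add-one correction used here, the smallest attainable p-value is 1/1001, so a reported p < 0.001 would correspond to no shuffle matching the observed accuracy.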
The model achieved a Hamming loss of 0.108, meaning that about 11% of all individual label predictions were incorrect. Mean per-patient accuracy was 0.892 (SD: 0.059, p < 0.001; see Figure 1). Across classes, specificity was high (mean: 0.97, SD: 0.09) and accuracy was solid (mean: 0.89, SD: 0.18), whereas sensitivity (mean: 0.33, SD: 0.36) and AUCPR (mean: 0.68, SD: 0.29) varied considerably. Bundle branch block and sinus rhythm were challenging to detect, while sinus bradycardia was identified reliably (see Table 1).
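The reported Hamming loss and mean per-patient accuracy are two views of the same quantity. As a brief sketch, assuming every patient is scored on the same set of C classes (the symbols N, C, y, and ŷ are introduced here for illustration), with N patients, reference labels y and predictions ŷ:

```latex
\[
\mathrm{HL} = \frac{1}{NC}\sum_{i=1}^{N}\sum_{j=1}^{C}\mathbb{1}\!\left[y_{ij}\neq\hat{y}_{ij}\right],
\qquad
a_i = \frac{1}{C}\sum_{j=1}^{C}\mathbb{1}\!\left[y_{ij}=\hat{y}_{ij}\right]
\]
\[
\frac{1}{N}\sum_{i=1}^{N} a_i \;=\; 1-\mathrm{HL} \;=\; 1-0.108 \;=\; 0.892
\]
```

Under that assumption, the mean per-patient accuracy of 0.892 follows directly from the Hamming loss of 0.108.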
Despite being trained on clinical data, the ECG foundation model performed well on athlete ECGs without fine-tuning, achieving high specificity and accuracy and detecting many athlete heart adaptations. Lower sensitivity for some athlete-specific abnormalities highlights the need for refinement, but overall, this successful pilot paves the way for the development of AI-supported cardiovascular risk screening.


