Beyond black-box electrocardiogram analysis: saliency-guided deep learning for differential diagnosis of tachyarrhythmias

European Heart Journal - Digital Health

12 January 2026

Abstract

Background

Deep learning (DL) has revolutionised the way we interpret electrocardiogram (ECG) data. However, explainability remains a major barrier to trust and adoption in several domains, such as the differential diagnosis of complex arrhythmias, which often relies on black-box DL models. Most open-source solutions cannot illustrate their decision-making process, which is crucial both for clinical translation and for research into feature relevance.

Purpose

We aim to evaluate the diagnostic performance and explainability of a saliency-map-enabled DL model for ECG tachyarrhythmia classification. Additionally, we explore how visual explainability informs model interpretation and reliability, and we examine its scope and limitations.

Methods

A pre-developed, open-source convolutional neural network was adapted to classify atrial fibrillation (AFib), atrial flutter (AFlu), sinus tachycardia (TachSR), supraventricular tachycardia (SVT), and ventricular tachycardia (VT), using 12-lead ECGs from a large registry of patients from the MIMIC-IV database. The model was re-trained on a balanced subset (approximately 1000 samples per class, with equal negatives) and evaluated on an independent, non-overlapping subset, using sensitivity, specificity, F1-score, and AUC as metrics. Gradient-weighted class activation mapping (Grad-CAM) was integrated into the model at the last convolutional layer to visualise temporal attention (leads II, V1), followed by a clinically oriented, qualitative analysis of key false negatives and false positives.
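As an illustration of the Grad-CAM step described above, the following is a minimal sketch of the standard Grad-CAM computation for a one-dimensional (temporal) feature map. It is not the authors' implementation: the function name `grad_cam_1d` and the toy numbers are hypothetical, and in practice the activations and gradients would be taken from the last convolutional layer of the trained network via a DL framework's automatic differentiation. The resulting map has the temporal resolution of the feature map and would normally be upsampled to the ECG length before being overlaid on a lead.

```python
def grad_cam_1d(activations, gradients):
    """Standard Grad-CAM for a 1D feature map.

    activations: K channels x T time steps (list of lists) from the chosen
                 convolutional layer for one ECG.
    gradients:   same shape; gradient of the target class score with
                 respect to each activation.
    Returns a list of T values scaled to [0, 1].
    """
    n_channels = len(activations)
    n_steps = len(activations[0])

    # Channel weights: average each channel's gradient over time.
    weights = [sum(channel_grad) / n_steps for channel_grad in gradients]

    # Weighted sum of the feature maps at each time step, followed by ReLU,
    # so only regions that support the target class are kept.
    cam = [
        max(0.0, sum(weights[k] * activations[k][t] for k in range(n_channels)))
        for t in range(n_steps)
    ]

    # Scale to [0, 1] for display; leave an all-zero map unchanged.
    peak = max(cam)
    return [value / peak for value in cam] if peak > 0 else cam


# Hypothetical toy input: 2 channels, 4 time steps (illustrative values only).
toy_activations = [[0.0, 1.0, 2.0, 1.0],
                   [1.0, 1.0, 0.0, 0.0]]
toy_gradients = [[0.5, 0.5, 0.5, 0.5],
                 [-1.0, -1.0, -1.0, -1.0]]
toy_map = grad_cam_1d(toy_activations, toy_gradients)
```

In this toy case the first channel has a positive weight and the second a negative one, so the map is highest at the time steps where the first channel is active and the second is not.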

Results

The model achieved high specificity across all classes (≥88%), with class-wise sensitivity ranging from 54% (AFlu) to 87% (SVT). F1-score was highest for SVT (0.91), followed by TachSR (0.83). AUC values exceeded 0.88 in all classes. Grad-CAM maps reliably identified regions of diagnostic relevance, such as flutter waves in AFlu, but also highlighted failure cases. In many misclassified instances, attention revealed signs of underfitting to atypical cases, such as pacing-related wide QRS complexes misclassified as VT, and a notable susceptibility to noise, which led the model to overidentify non-existent features (e.g., extra P waves resulting in atrial flutter overdiagnosis). Grad-CAM overlays (Figures 1, 2) illustrate these examples with detailed interpretation.
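For readers less familiar with the class-wise metrics reported above, the sketch below shows how sensitivity, specificity and F1-score can be derived from one-vs-rest confusion counts. It assumes hard label predictions and a one-vs-rest evaluation, which is an assumption rather than a detail confirmed by the abstract; the function name `one_vs_rest_metrics` and the toy labels are hypothetical and do not reproduce study data. AUC is omitted because it requires continuous model scores rather than hard labels.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity and F1 for one class against all others."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # correctly detected
        elif truth == positive:
            fn += 1  # missed case (false negative)
        elif pred == positive:
            fp += 1  # overdiagnosis (false positive)
        else:
            tn += 1  # correctly rejected

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Hypothetical toy labels (illustrative only, not study data).
toy_true = ["AFlu", "AFlu", "AFlu", "AFlu", "VT", "VT", "SVT", "SVT"]
toy_pred = ["AFlu", "AFlu", "VT", "SVT", "VT", "AFlu", "SVT", "SVT"]
aflu_metrics = one_vs_rest_metrics(toy_true, toy_pred, positive="AFlu")
```

Here two of the four true AFlu recordings are detected and one of the four non-AFlu recordings is wrongly labelled AFlu, which gives a sensitivity of 0.5 and a specificity of 0.75.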

Conclusions

Gradient-based activation mapping equips DL models with visual tools for auditing ECG-based arrhythmia classification, offering transparency into "where" the model focuses, which is valuable for uncovering potential features of interest. However, these explainability techniques are not flawless: Grad-CAM reveals attention locations but not the underlying logic ("why"), and its sensitivity to model architecture and to the choice of target layer may limit robustness. Despite these caveats, such visual tools are promising adjuncts for model explanation, improvement and even feature discovery in arrhythmias that challenge human interpretation.

Interpretation of false negative cases

Interpretation of false positive cases