PulseFormer: curriculum learning guided bimodal co-attention and graph-augmented hierarchical ECG strip classification
European Heart Journal - Digital Health

Abstract
Printed or scanned ECG strips remain crucial in low-resource settings where digital waveform data are unavailable. Most deep learning approaches depend on raw signal inputs and often struggle to model rhythm-specific features or account for the hierarchical structure of ECG abnormalities. We hypothesised that combining ECG strip images with rhythm-sensitive Continuous Wavelet Transform (CWT) maps, alongside curriculum learning and graph-based label refinement, could improve multi-label ECG classification from visual data alone.
We aimed to develop PulseFormer, a hierarchical ECG image classification model that uses bimodal co-attention, graph-refined label embeddings, and curriculum learning to predict 18 diagnostic classes from ECG and wavelet-derived inputs.
We used PTB-XL, a large public dataset of 12-lead ECG signals recorded at 100 Hz over 10 seconds. Signals were plotted to simulate printed strips, preprocessed with a Hough transform to remove gridlines, and binarised to enhance waveform visibility. CWT images were generated from these simulated ECG plots.
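To illustrate how a rhythm-sensitive CWT map can be obtained from a single trace, the following minimal sketch computes a scalogram in pure Python. The Ricker (Mexican hat) wavelet, the `widths` values, the zero padding, and the function names are assumptions made for this example; the abstract does not state which wavelet, scales, or library were used.

```python
import math

def ricker(half_len, width):
    """Sample a Ricker (Mexican hat) wavelet on 2 * half_len + 1 points."""
    amp = 2.0 / (math.sqrt(3.0 * width) * math.pi ** 0.25)
    samples = []
    for i in range(-half_len, half_len + 1):
        x = i / width
        samples.append(amp * (1.0 - x * x) * math.exp(-0.5 * x * x))
    return samples

def cwt_scalogram(signal, widths):
    """Return one row of wavelet coefficients per width (rows x time)."""
    n = len(signal)
    rows = []
    for width in widths:
        half_len = int(5 * width)           # support of about +/- 5 widths (assumed)
        wavelet = ricker(half_len, width)
        row = []
        for k in range(n):
            acc = 0.0
            for j, w in enumerate(wavelet):
                idx = k + j - half_len      # samples outside the signal count as zero
                if 0 <= idx < n:
                    acc += signal[idx] * w
            row.append(acc)
        rows.append(row)
    return rows

# Toy "lead": a flat baseline with one sharp, QRS-like spike at sample 50.
lead = [0.0] * 101
lead[50] = 1.0
scalogram = cwt_scalogram(lead, widths=[1, 2, 4])
```

Each row of `scalogram` corresponds to one scale; rendering the rows as an image gives the kind of time-scale map that is paired with the ECG strip image.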
PulseFormer was trained on 17,144 ECG–CWT image pairs and tested on 2,241. The ECG and CWT modalities were processed through separate CNN stems and Vision Transformer encoders (ViT-Small, initialised with ImageNet weights), then fused via bimodal co-attention. Curriculum learning was applied by sequentially introducing increasingly fine-grained classification tasks (first normal vs abnormal, then superclass, and finally subclass labels) to encourage faster convergence and better feature-space separation. Auxiliary subclass classification heads were attached at the ViT level to improve gradient flow and early feature optimisation, with the aim of better subclass discrimination.
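The staged introduction of tasks can be sketched as a simple epoch-based schedule. The stage boundaries `(0, 6, 12)`, the task names, and the choice to keep earlier tasks active once later ones start are assumptions for illustration only; the abstract does not give the actual schedule or loss weighting.

```python
STAGES = ("binary", "superclass", "subclass")

def active_tasks(epoch, stage_starts=(0, 6, 12)):
    """Return the tasks trained at `epoch` (0-indexed); earlier tasks stay active."""
    return [task for task, start in zip(STAGES, stage_starts) if epoch >= start]

def total_loss(task_losses, epoch, stage_starts=(0, 6, 12)):
    """Sum the losses of the tasks the curriculum has introduced so far."""
    return sum(task_losses[task] for task in active_tasks(epoch, stage_starts))

# Hypothetical per-task loss values for one training step.
losses = {"binary": 0.5, "superclass": 0.25, "subclass": 1.0}
```

With these hypothetical boundaries, `total_loss(losses, 0)` uses only the normal vs abnormal loss, while from epoch 12 onward all three heads contribute.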
Label embeddings were refined using a Graph Convolutional Network (GCN) built on predefined superclass–subclass relationships. Subclass outputs were softly gated by superclass probabilities, allowing controlled multi-label predictions. Grid-searched thresholds were used to calibrate the outputs, as seen in Figure 1. The model was trained using the Adam optimiser (learning rate 1e-4) with weight decay and cosine annealing. A weighted focal binary cross-entropy loss and oversampling addressed severe class imbalance. Training ran for up to 36 epochs with early stopping on a single NVIDIA RTX 4060 GPU.
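Two of the components above, soft gating and the weighted focal loss, can be sketched in a few lines. Multiplicative gating, the `pos_weight` and `gamma` defaults, and the toy hierarchy are assumptions for this example; the abstract does not specify the exact gating function or loss hyperparameters.

```python
import math

def gate_subclasses(sub_probs, super_probs, parent_of):
    """Multiply each subclass probability by its parent superclass probability."""
    return [p * super_probs[parent_of[i]] for i, p in enumerate(sub_probs)]

def focal_bce(prob, label, pos_weight=1.0, gamma=2.0, eps=1e-7):
    """Weighted focal binary cross-entropy for one probability and one 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)        # avoid log(0)
    p_t = prob if label == 1 else 1.0 - prob     # probability given to the true class
    weight = pos_weight if label == 1 else 1.0   # up-weight rare positives
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical toy hierarchy: subclasses 0 and 1 belong to superclass 0,
# subclass 2 belongs to superclass 1.
parent_of = [0, 0, 1]
gated = gate_subclasses([0.8, 0.6, 0.9], [0.5, 0.1], parent_of)
```

In this toy case the third subclass has a high raw score (0.9) but is suppressed to about 0.09 because its parent superclass is unlikely. The focal term `(1 - p_t) ** gamma` shrinks the loss for confidently correct predictions, so training focuses on hard and rare examples.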
PulseFormer achieved a macro F1 of 0.68, an AUC of 0.92, and an overall accuracy of 88%. Stage-wise results were: normal vs abnormal (accuracy 0.84, F1 0.82, AUC 0.90), superclass (accuracy 0.85, F1 0.71, AUC 0.84), and subclass (accuracy 0.93, F1 0.65, AUC 0.93), as seen in Figure 2.
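Macro F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. The sketch below shows that computation together with a per-class threshold grid search of the kind mentioned in the methods. The grid values, toy probabilities, and function names are hypothetical, and selecting thresholds by F1 is an assumption, since the abstract does not name the selection criterion.

```python
def f1_score(preds, labels):
    """F1 for one class from binary predictions and binary labels."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def binarise(probs, threshold):
    """Turn probabilities into 0/1 predictions at the given threshold."""
    return [int(p >= threshold) for p in probs]

def best_threshold(probs, labels, grid):
    """Pick the grid value that maximises F1 for one class (first wins on ties)."""
    return max(grid, key=lambda t: f1_score(binarise(probs, t), labels))

def macro_f1(prob_cols, label_cols, thresholds):
    """Unweighted mean of per-class F1 scores, one threshold per class."""
    scores = [
        f1_score(binarise(probs, t), labels)
        for probs, labels, t in zip(prob_cols, label_cols, thresholds)
    ]
    return sum(scores) / len(scores)

# Toy data: two classes, four samples each (one list per class).
grid = [0.3, 0.5, 0.7]
prob_cols = [[0.1, 0.4, 0.6, 0.9], [0.2, 0.8, 0.4, 0.1]]
label_cols = [[0, 0, 1, 1], [0, 1, 1, 0]]
thresholds = [best_threshold(p, y, grid) for p, y in zip(prob_cols, label_cols)]
score = macro_f1(prob_cols, label_cols, thresholds)
```

In practice thresholds would normally be chosen on a validation split and then applied unchanged to the test set; tuning and scoring on the same data, as in this toy example, gives optimistic values.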
PulseFormer enables accurate, structured, multi-label ECG classification from image-based inputs and covers the highest number of diagnostic classes among recent works. Its design supports scalable deployment in environments lacking raw waveform access, while promoting generalisable, hierarchy-aware diagnostic prediction.
Figure 1: PulseFormer architecture
Figure 2: Class-wise results
