Imputation noise propagation on sub-groups in longitudinal synthetic data generation - a stratified analysis of fidelity for univariate and multivariate imputation

European Heart Journal - Digital Health

12 January 2026
Organised by: ESC Journals

Abstract

Introduction

With cardiovascular disease (CVD) being the leading cause of death globally [1], prevention is crucial to reduce its prevalence. Research and development on prevention require access to sensitive data, a challenge that synthetic data may help to address.

Cross-sectional synthetic data can be generated with, for instance, conditional tabular generative adversarial networks (CTGAN) [2]. Methods have been extended to longitudinal data, but mainly for short-term in-patient data [3,4]. The choice of imputation method is important to avoid skewing minority subgroups, especially for repeated measurements.

Purpose

To evaluate stratified fidelity for one univariate and one multivariate imputation method, each applied to a data set used to train CTGAN, and to assess the effect on subgroups.

Methods

The research was performed within a study investigating the effect of a CVD intervention with pictorial risk communication for healthy individuals. Longitudinal data (cross-sections from baseline, 3-year and 6-year follow-up) were included in this data set, resulting in 2612 samples. 102 variables were selected, consisting of 29 time-varying variables measured at each of the three cross-sections and 15 variables measured only at baseline (29 × 3 + 15 = 102). The variables were selected from multiple domains such as clinical risk markers, lifestyle and ultrasound.

Most variables had <5% missing values, 11 had 6-11% missing and 1 had 33% missing.

The original data was copied. Copy one (uni) was imputed with median and mode for numerical and categorical variables respectively. Copy two (multi) was imputed with mean values of the two nearest neighbours in the data set (kNN). See fig. 1 for flow chart.
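The two imputation strategies described above can be sketched in plain Python as follows. This is a minimal illustration, not the study's implementation: the column names (`age`, `sbp`, `smoker`), the toy records, the unscaled Euclidean distance over jointly observed columns and the restriction of the kNN variant to numerical columns are all assumptions made for the example.

```python
import math
from statistics import median, mode

def impute_univariate(rows, numeric_cols, categorical_cols):
    """Fill gaps with the column median (numerical) or mode (categorical)."""
    filled = [dict(r) for r in rows]
    for col in numeric_cols + categorical_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = median(observed) if col in numeric_cols else mode(observed)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled

def impute_knn(rows, numeric_cols, k=2):
    """Fill numerical gaps with the mean of the k nearest rows that observe the column."""
    filled = [dict(r) for r in rows]
    for i, r in enumerate(rows):
        for col in numeric_cols:
            if r[col] is not None:
                continue
            candidates = []
            for j, other in enumerate(rows):
                if j == i or other[col] is None:
                    continue
                # Distance uses only columns observed in both rows (assumption).
                shared = [c for c in numeric_cols
                          if c != col and r[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((r[c] - other[c]) ** 2 for c in shared))
                candidates.append((dist, other[col]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [value for _, value in candidates[:k]]
            if nearest:
                filled[i][col] = sum(nearest) / len(nearest)
    return filled

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 40, "sbp": 120.0, "smoker": "no"},
    {"age": 41, "sbp": 122.0, "smoker": "no"},
    {"age": 60, "sbp": 150.0, "smoker": "yes"},
    {"age": 61, "sbp": 152.0, "smoker": None},
    {"age": 62, "sbp": 154.0, "smoker": "yes"},
    {"age": 42, "sbp": None, "smoker": "no"},
]

uni = impute_univariate(rows, ["age", "sbp"], ["smoker"])
multi = impute_knn(rows, ["age", "sbp"], k=2)
```

In this toy cohort the 42-year-old receives the cohort-wide median blood pressure under uni, but the mean of the two closest records under multi, which shows how the two copies can diverge for the same missing cell.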

A CTGAN model was trained to generate longitudinal synthetic data using an 80/20% split for train/test on each data set. A stratified analysis for test and synthetic data was performed for age groups: 40- (7%), 50- (26%) and 60-year-olds (67%).

Results

Results for the Kolmogorov-Smirnov statistic (KS), total variation distance (TVD), Pearson correlation (PS) and contingency similarity (CS) are displayed in table 1 for uni and multi, on all samples and stratified by age. In each row of the sub-tables, the better value is highlighted wherever the difference exceeds 0.05.
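As an illustration of what a stratified comparison of this kind involves, the sketch below computes the two-sample KS statistic and the TVD on all samples and per age group. The records, the column names (`age_group`, `ldl`, `smoker`) and the helper `by_group` are hypothetical, and the abstract does not state whether the reported scores are raw distances or complements (1 minus the distance), so raw distances are assumed here.

```python
from bisect import bisect_right
from collections import Counter

def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real, synth = sorted(real), sorted(synth)
    points = sorted(set(real) | set(synth))
    return max(
        abs(bisect_right(real, x) / len(real) - bisect_right(synth, x) / len(synth))
        for x in points
    )

def tvd(real, synth):
    """Total variation distance between two categorical frequency distributions."""
    p, q = Counter(real), Counter(synth)
    return 0.5 * sum(abs(p[c] / len(real) - q[c] / len(synth)) for c in set(p) | set(q))

def by_group(metric, real_rows, synth_rows, column, group="age_group"):
    """Apply a metric on all rows and within each group (groups must be non-empty)."""
    result = {"all": metric([r[column] for r in real_rows],
                            [s[column] for s in synth_rows])}
    for g in sorted({r[group] for r in real_rows}):
        result[g] = metric([r[column] for r in real_rows if r[group] == g],
                           [s[column] for s in synth_rows if s[group] == g])
    return result

# Hypothetical test and synthetic records; only the 40-year-olds differ.
test_rows = [
    {"age_group": 40, "ldl": 1.0, "smoker": "yes"},
    {"age_group": 40, "ldl": 2.0, "smoker": "no"},
    {"age_group": 60, "ldl": 1.0, "smoker": "no"},
    {"age_group": 60, "ldl": 2.0, "smoker": "no"},
    {"age_group": 60, "ldl": 3.0, "smoker": "yes"},
    {"age_group": 60, "ldl": 4.0, "smoker": "yes"},
]
synth_rows = [
    {"age_group": 40, "ldl": 2.0, "smoker": "no"},
    {"age_group": 40, "ldl": 3.0, "smoker": "no"},
    {"age_group": 60, "ldl": 1.0, "smoker": "no"},
    {"age_group": 60, "ldl": 2.0, "smoker": "no"},
    {"age_group": 60, "ldl": 3.0, "smoker": "yes"},
    {"age_group": 60, "ldl": 4.0, "smoker": "yes"},
]

ks_scores = by_group(ks_statistic, test_rows, synth_rows, "ldl")
tvd_scores = by_group(tvd, test_rows, synth_rows, "smoker")
```

In this toy example the mismatch sits entirely in the smallest group, yet the score on all samples stays small because the majority group dominates it. This is the kind of effect a stratified analysis is meant to expose.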

For all samples, multi seems superior. The 40-year-olds seem to benefit most from univariate imputation, despite making up only 7% of the cohort. The opposite holds for 50- and 60-year-olds, with overall better results from multi imputation. Most results for the smaller subgroups show lower fidelity than those for the largest, illustrating the general challenge of capturing a range of characteristics for subgroups in a skewed cohort.

Conclusions

Multi imputation generally outperforms uni imputation; however, the smallest subgroup seems to benefit most from uni imputation. The reason for this may be the imbalance, which may result in the kNN method drawing from other subgroups. A conditional approach may be interesting for future studies.
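The suspected mechanism can be illustrated with a small sketch. The abstract does not specify what a conditional approach would look like; restricting the neighbour search to records from the same age group is only one possible reading, and the records, column names (`age_group`, `bmi`, `sbp`) and the helper `knn_mean` are hypothetical.

```python
def knn_mean(target, donors, feature, value, k=2):
    """Mean of `value` over the k donors closest to the target on `feature`."""
    ranked = sorted(donors, key=lambda d: abs(d[feature] - target[feature]))
    nearest = ranked[:k]
    return sum(d[value] for d in nearest) / len(nearest)

# Hypothetical imbalanced cohort: two 40-year-olds, four 60-year-olds.
cohort = [
    {"age_group": 40, "bmi": 31.0, "sbp": 118.0},
    {"age_group": 40, "bmi": 24.0, "sbp": 116.0},
    {"age_group": 60, "bmi": 29.5, "sbp": 150.0},
    {"age_group": 60, "bmi": 30.5, "sbp": 148.0},
    {"age_group": 60, "bmi": 27.0, "sbp": 146.0},
    {"age_group": 60, "bmi": 26.0, "sbp": 144.0},
]

# A 40-year-old with missing blood pressure.
target = {"age_group": 40, "bmi": 30.0, "sbp": None}

# Pooled search: both nearest neighbours come from the majority group.
pooled = knn_mean(target, cohort, "bmi", "sbp")

# Conditional search (assumed reading): donors limited to the same age group.
same_group = [d for d in cohort if d["age_group"] == target["age_group"]]
conditional = knn_mean(target, same_group, "bmi", "sbp")
```

Here the pooled search fills the gap with a value typical of the 60-year-olds, while the conditional search stays within the minority group at the cost of a much smaller donor pool.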
