The effect of disease phenotyping strategies when reusing electronic health record data: a case study in ischemic heart disease

European Heart Journal - Digital Health

12 January 2026
Organised by: ESC Journals

Abstract

Introduction

Creating large datasets for retrospective research and predictive modeling by reusing electronic health record (EHR) data is technically relatively straightforward. However, without a clear phenotyping strategy, this often yields flawed datasets due to missing context [1] or incomplete registration [2], as diagnoses, comorbidities, or risk factors might not be consistently registered in structured codes. An alternative to relying on structured codes alone is to include additional data sources, such as medications or lab values. For example, if a patient has a prescription for insulin, we may conclude they have diabetes. The choice of disease phenotyping strategy could therefore affect study populations, and thus study results and reproducibility. Combining multiple strategies could lead to more robust phenotyping, make the data more fit for purpose, and improve the quality of downstream data analysis.
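The insulin example can be made concrete with a minimal sketch of two rule-based strategies and their combination. The `Patient` record, its field names, and the code lists below are hypothetical and chosen for illustration only (ICD-10 block E10-E14 for diabetes mellitus, ATC group A10A for insulins); they are not the strategies used in this study.

```python
from dataclasses import dataclass, field

# Illustrative code lists (assumptions, not the study's definitions).
DIABETES_ICD10_PREFIXES = ("E10", "E11", "E12", "E13", "E14")  # ICD-10: diabetes mellitus
INSULIN_ATC_PREFIX = "A10A"  # ATC: insulins and analogues


@dataclass
class Patient:
    """Hypothetical, simplified patient record."""
    patient_id: str
    diagnosis_codes: list[str] = field(default_factory=list)   # ICD-10 codes
    medication_codes: list[str] = field(default_factory=list)  # ATC codes


def diabetes_by_diagnosis(p: Patient) -> bool:
    """Strategy 1: a structured diabetes diagnosis code is present."""
    return any(c.startswith(DIABETES_ICD10_PREFIXES) for c in p.diagnosis_codes)


def diabetes_by_medication(p: Patient) -> bool:
    """Strategy 2: an insulin prescription is present."""
    return any(c.startswith(INSULIN_ATC_PREFIX) for c in p.medication_codes)


def diabetes_combined(p: Patient) -> bool:
    """Combined strategy: either source is sufficient."""
    return diabetes_by_diagnosis(p) or diabetes_by_medication(p)


# A patient with a diagnosis code, and one with only an insulin prescription.
coded = Patient("p1", diagnosis_codes=["E11.9"])
uncoded = Patient("p2", medication_codes=["A10AB01"])

for p in (coded, uncoded):
    print(p.patient_id, diabetes_by_diagnosis(p), diabetes_combined(p))
```

In this sketch the second patient is missed by the diagnosis-code strategy but picked up by the combined one, which is the kind of difference in study population the abstract refers to.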

Purpose

We aim to exemplify the effect of different phenotyping strategies for ischemic heart disease (IHD) and two of its common comorbidities, diabetes and hypertension, on potential study populations and downstream data analysis.

Methods

We collected retrospective EHR data of patients referred with chest pain complaints to the cardiology department of a tertiary hospital in The Netherlands. We designed multiple phenotyping strategies to determine IHD, hypertension and diabetes from clinical diagnosis codes, billing codes, medication, lab tests, procedures, or clinical measurements. We extracted patient groups for each strategy and compared their characteristics. Lastly, we applied logistic regression (LR) to assess the effect of the strategies on predictive modeling of IHD based on age, gender, hypertension, and diabetes.
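As a rough illustration of the last step, the sketch below fits a logistic regression to the same predictors under two different outcome definitions. Everything in it is an assumption: the cohort is synthetic, the simulation parameters are arbitrary, only the outcome label (not the predictors or the population) changes between strategies, and the model is fitted with plain gradient descent rather than whatever software the authors used.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean log-loss.

    Returns [intercept, coef_1, ..., coef_k].
    """
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def simulate_cohort(n=300, seed=0):
    """Synthetic toy cohort with arbitrary parameters; not study data."""
    rng = random.Random(seed)
    X, y_dx, y_all = [], [], []
    for _ in range(n):
        age = rng.gauss(0.0, 1.0)                  # standardized age
        male = float(rng.random() < 0.5)
        hypertension = float(rng.random() < 0.4)
        diabetes = float(rng.random() < 0.2)
        risk = sigmoid(-1.0 + 0.5 * age + 0.3 * male
                       + 0.6 * hypertension + 0.8 * diabetes)
        has_ihd = rng.random() < risk
        # Assumption: diagnosis codes capture only part of the IHD cases.
        found_by_dx = has_ihd and rng.random() < 0.6
        X.append([age, male, hypertension, diabetes])
        y_dx.append(int(found_by_dx))
        y_all.append(int(has_ihd))
    return X, y_dx, y_all


X, y_dx, y_all = simulate_cohort()
names = ["intercept", "age", "male", "hypertension", "diabetes"]
for label, y in (("diagnosis codes only", y_dx), ("all strategies", y_all)):
    coefs = fit_logistic_regression(X, y)
    print(label, {name: round(c, 2) for name, c in zip(names, coefs)})
```

Comparing the two printed coefficient sets shows how the outcome definition alone can shift a fitted model; the direction and size of the shift in this toy example say nothing about the study's own results.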

Results

A total of 5191 patients were included. Applying the different phenotyping strategies led to differing IHD patient groups: 31% of IHD patients were exclusively identifiable with billing codes (Figure 1). Moreover, if IHD had been identified from diagnosis or billing codes only, 13% of IHD patients would have been missed. For diabetes and hypertension, this would be 44% and 74%, respectively. LR analyses showed higher estimated coefficients for the population found by diagnosis codes than for the population found by combining all strategies, illustrating the potential impact on predictive modeling (Figure 2).
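Overlap figures of this kind can be derived with simple set operations once each strategy has produced a set of patient identifiers. The sketch below is a minimal illustration with made-up identifiers and strategy names; it assumes the denominator is the union of patients found by any strategy, which the abstract does not state explicitly.

```python
def exclusive_share(strategy_hits: dict[str, set[str]], strategy: str) -> float:
    """Share of all identified patients found by `strategy` and by no other strategy."""
    everyone = set().union(*strategy_hits.values())
    others = set().union(
        *(ids for name, ids in strategy_hits.items() if name != strategy)
    )
    return len(strategy_hits[strategy] - others) / len(everyone)


def missed_share(strategy_hits: dict[str, set[str]], restricted_to: list[str]) -> float:
    """Share of all identified patients missed when only `restricted_to` strategies are used."""
    everyone = set().union(*strategy_hits.values())
    kept = set().union(*(strategy_hits[name] for name in restricted_to))
    return len(everyone - kept) / len(everyone)


# Toy example with hypothetical patient identifiers (at least one patient is required).
hits = {
    "diagnosis": {"a", "b"},
    "billing": {"b", "c", "d"},
    "medication": {"d", "e"},
}

print(exclusive_share(hits, "billing"))               # only patient "c" is billing-exclusive
print(missed_share(hits, ["diagnosis", "billing"]))   # only patient "e" is missed
```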

Conclusion

Our case study highlights the importance of carefully considering phenotyping strategies when reusing EHR data for IHD-related studies, as this may affect downstream analysis. It is important to fit the phenotyping to the research purpose, and to address trade-offs between specificity (including only relevant patients) and sensitivity (including all relevant patients). We recommend that phenotyping strategies be designed in multidisciplinary teams and clearly reported in EHR-based studies and prediction models. In future work, we aim to further refine our strategies for different patient populations, compare them between hospitals, and incorporate additional EHR data such as free text.
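The trade-off between sensitivity and specificity can be quantified when a reference standard is available. The sketch below assumes such a standard exists (for example from manual chart review, which is not part of this abstract) and uses made-up identifiers; it applies the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP).

```python
def sensitivity_specificity(predicted: set[str], reference: set[str], cohort: set[str]):
    """Return (sensitivity, specificity) of a phenotyping strategy.

    `predicted` and `reference` are assumed to be subsets of `cohort`.
    """
    tp = len(predicted & reference)           # correctly identified patients
    fn = len(reference - predicted)           # relevant patients that were missed
    fp = len(predicted - reference)           # patients included incorrectly
    tn = len(cohort - predicted - reference)  # correctly excluded patients
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical cohort: a narrow strategy versus a broad, combined one.
cohort = {"a", "b", "c", "d", "e", "f"}
reference = {"a", "b", "c"}
narrow = {"a"}
broad = {"a", "b", "c", "d"}

print("narrow:", sensitivity_specificity(narrow, reference, cohort))
print("broad: ", sensitivity_specificity(broad, reference, cohort))
```

In this toy example the narrow strategy includes no irrelevant patients but misses two relevant ones, while the broad strategy finds all relevant patients at the cost of one incorrect inclusion.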

Figure 1. Patients selected per IHD strategy

Figure 2. Comparison of LR analyses

Contributors

C G Allaart

Author

University Medical Center Utrecht, Utrecht, The Netherlands

A Uijl

Author
