The effect of disease phenotyping strategies when reusing electronic health record data: a case study in ischemic heart disease
European Heart Journal - Digital Health

Abstract
Creating large datasets for retrospective research and predictive modelling by reusing electronic health record (EHR) data is technically relatively straightforward. However, without a clear phenotyping strategy, this often yields flawed datasets due to missing context [1] or incomplete registration [2], as diagnoses, comorbidities or risk factors might not be consistently registered in structured codes. An alternative to relying on structured codes alone is to include more data sources, such as medications or lab values. For example, if a patient has a prescription for insulin, we may infer that they have diabetes. The choice of disease phenotyping strategy could affect study populations and thus study results and reproducibility. Combining multiple strategies could lead to more robust phenotyping, make the data more fit for purpose, and improve the quality of downstream data analysis.
We aim to exemplify the effect of different phenotyping strategies for ischemic heart disease (IHD) and two of its common comorbidities, diabetes and hypertension, on potential study populations and downstream data analysis.
We collected retrospective EHR data of patients referred with chest pain complaints to the cardiology department of a tertiary hospital in the Netherlands. We designed multiple phenotyping strategies to determine IHD, hypertension and diabetes from clinical diagnosis codes, billing codes, medication, lab tests, procedures, or clinical measurements. We extracted patient groups for each strategy and compared their characteristics. Lastly, we applied logistic regression (LR) to assess the effect of the strategies on predictive modelling of IHD based on age, gender, hypertension, and diabetes.
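As an illustration of what a set of per-source phenotyping strategies can look like, the following minimal Python sketch defines three strategies for diabetes and extracts a patient group for each. The record layout, the code lists (ICD-10 E10 to E14, ATC group A10) and the HbA1c cut-off of 48 mmol/mol are assumptions chosen for the example; they are not the definitions used in this study.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified patient record; real EHR extracts are far richer.
@dataclass
class Patient:
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)   # ICD-10 codes (assumed coding system)
    medication_codes: set = field(default_factory=set)  # ATC codes (assumed coding system)
    hba1c_mmol_mol: list = field(default_factory=list)  # HbA1c lab results

# Assumed code lists and threshold, for illustration only.
DIABETES_ICD10 = {"E10", "E11", "E12", "E13", "E14"}
DIABETES_ATC_PREFIX = "A10"   # ATC group "drugs used in diabetes"
HBA1C_THRESHOLD = 48          # mmol/mol

def diabetes_by_diagnosis(p: Patient) -> bool:
    """Strategy 1: a structured diagnosis code for diabetes is present."""
    return any(code[:3] in DIABETES_ICD10 for code in p.diagnosis_codes)

def diabetes_by_medication(p: Patient) -> bool:
    """Strategy 2: a glucose-lowering drug (e.g. insulin) was prescribed."""
    return any(code.startswith(DIABETES_ATC_PREFIX) for code in p.medication_codes)

def diabetes_by_lab(p: Patient) -> bool:
    """Strategy 3: at least one HbA1c value at or above the threshold."""
    return any(value >= HBA1C_THRESHOLD for value in p.hba1c_mmol_mol)

STRATEGIES = {
    "diagnosis": diabetes_by_diagnosis,
    "medication": diabetes_by_medication,
    "lab": diabetes_by_lab,
}

def extract_groups(patients, strategies):
    """Return, per strategy, the set of patient IDs that the strategy selects."""
    return {
        name: {p.patient_id for p in patients if rule(p)}
        for name, rule in strategies.items()
    }

# Made-up toy cohort, not study data.
cohort = [
    Patient("p1", diagnosis_codes={"E11.9"}),
    Patient("p2", medication_codes={"A10AB01"}),        # insulin, no diagnosis code
    Patient("p3", hba1c_mmol_mol=[41, 53]),             # elevated lab value only
    Patient("p4", diagnosis_codes={"I25.1"}, hba1c_mmol_mol=[38]),
]

groups = extract_groups(cohort, STRATEGIES)
combined = set().union(*groups.values())  # patients found by any strategy
```

In this toy cohort each strategy finds a different patient, so a study that relied on diagnosis codes alone would select only one of the three patients found by the combined strategy.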
A total of 5191 patients were included. The different phenotyping strategies yielded differing IHD patient groups: 31% of IHD patients were identifiable exclusively through billing codes (Figure 1). Moreover, if IHD had been identified from diagnosis or billing codes only, 13% of IHD patients would not have been included. For diabetes and hypertension, this would be 44% and 74%, respectively. LR analyses showed higher estimated coefficients for the population found by diagnosis codes than for the population found by combining all strategies, illustrating the potential impact on predictive modelling (Figure 2).
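Percentages of this kind can be derived from the per-strategy patient groups with simple set arithmetic. The sketch below is one possible way to do so, assuming that the denominator is every patient identified by at least one strategy; the strategy names and patient IDs are made up and do not reproduce the study's figures.

```python
def all_identified(groups):
    """Union of patient IDs over every strategy."""
    return set().union(*groups.values())

def exclusive_share(groups, strategy):
    """Fraction of all identified patients found ONLY by the given strategy."""
    total = all_identified(groups)
    if not total:
        return 0.0
    others = set().union(
        *(ids for name, ids in groups.items() if name != strategy)
    )
    return len(groups[strategy] - others) / len(total)

def missed_share(groups, used_strategies):
    """Fraction of all identified patients missed when only some strategies are used."""
    total = all_identified(groups)
    if not total:
        return 0.0
    found = set().union(*(groups[name] for name in used_strategies))
    return len(total - found) / len(total)

# Made-up toy groups (patient IDs per hypothetical strategy), not study data.
toy_groups = {
    "diagnosis": {1, 2, 3, 4},
    "billing": {3, 4, 5, 6, 7},
    "medication": {7, 8},
    "procedures": {2, 9, 10},
}

only_billing = exclusive_share(toy_groups, "billing")
missed = missed_share(toy_groups, ["diagnosis", "billing"])
```

Here patients 5 and 6 are found by billing codes only (2 of 10 identified patients), and patients 8, 9 and 10 would be missed if only diagnosis and billing codes were used (3 of 10).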
Our case study highlights the importance of carefully considering phenotyping strategies when reusing EHR data for IHD-related studies, as this may affect downstream analysis. It is important to fit the phenotyping to the research purpose, and to address trade-offs between specificity (including only relevant patients) and sensitivity (including all relevant patients). We recommend that phenotyping strategies be designed in multidisciplinary teams and clearly reported for EHR-based studies and prediction models. For future work, we aim to further refine our strategies for different patient populations, compare them between hospitals, and incorporate additional EHR data such as free text.
Figure 1: Patients selected per IHD strategy
Figure 2: Comparison of LR analyses
Contributors

M J Boonstra
Author

M E Grunewald
Author

L M Overmars
Author

A Uijl
Author

R W M Vernooij
Author

M C H De Groot
Author

J H S M Venhuizen
Author

W W Van Solinge
Author

F W Asselbergs
Author

S Haijtema
Author
