Generative pre-trained transformer reinforces historical gender bias in diagnosing women’s cardiovascular symptoms

European Heart Journal - Digital Health

7 November 2025


Abstract

Aims

Large language models (LLMs) such as GPT are increasingly used to generate clinical teaching cases and support diagnostic reasoning. However, biases in their training data may skew the portrayal and interpretation of cardiovascular symptoms in women, potentially leading to delayed or inaccurate diagnoses. We assessed GPT-4o’s and GPT-4’s gender representation in simulated cardiovascular cases and GPT-4o’s diagnostic performance across genders using real patient notes.

Methods and results

First, GPT-4o and GPT-4 were each prompted to generate 15 000 simulated cases spanning 15 cardiovascular conditions with known gender prevalence differences. The gender distributions of the generated cases were compared with U.S. prevalence data from large national datasets (Centers for Disease Control and Prevention and National Inpatient Sample) using FDR-corrected χ² tests, which showed a significant deviation (P < 0.0001). In 14 GPT-4-generated conditions (93%), male patients were overrepresented compared to females by a mean of 30% (SD 8.6%). Second, 50 de-identified cardiovascular patient notes were extracted from the MIMIC-IV-Note database. Patient gender was systematically swapped in each note, and GPT-4o was asked to produce differential diagnoses for each version (10 000 total prompts). Model outputs were scored against the actual discharge diagnoses, and accuracy scores were compared across genders using FDR-corrected Mann–Whitney U tests, revealing significant differences in diagnostic accuracy in 11 cases (22%). Female patients received lower accuracy scores than males for key conditions such as coronary artery disease (P < 0.01), abdominal aortic aneurysm (P < 1.0 × 10⁻⁹), and atrial fibrillation (P < 0.01).
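The first analysis step can be illustrated with a minimal sketch. It assumes a two-category (female/male) χ² goodness-of-fit test per condition and the Benjamini–Hochberg procedure for the FDR correction; the abstract does not state which FDR method or software was used, so both are assumptions. The function names and the counts in the usage lines are hypothetical and are not study data.

```python
import math


def chi2_gof_binary(observed_female, n_cases, expected_female_share):
    """Chi-square goodness-of-fit test (1 degree of freedom) for a female/male split.

    Compares the number of female cases a model generated with the number
    expected from a reference prevalence share.
    """
    expected_f = n_cases * expected_female_share
    expected_m = n_cases - expected_f
    observed_m = n_cases - observed_female
    stat = ((observed_female - expected_f) ** 2 / expected_f
            + (observed_m - expected_m) ** 2 / expected_m)
    # Survival function of a chi-square variable with 1 df:
    # P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical illustration only: (female cases generated, total cases,
# reference female share) for three made-up conditions.
example_conditions = [(400, 1000, 0.50), (480, 1000, 0.50), (300, 1000, 0.45)]
raw_p = [chi2_gof_binary(f, n, share)[1] for f, n, share in example_conditions]
adjusted_p = benjamini_hochberg(raw_p)
```

Each entry of `adjusted_p` would then be compared with the chosen significance threshold to decide whether a condition's generated gender split departs from the reference prevalence.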

Conclusion

GPT-4o underrepresented women in simulated cardiovascular scenarios and less accurately diagnosed female patients with critical conditions. These biases risk reinforcing historical disparities in cardiovascular care. Future efforts should focus on bias detection and mitigation.