Generative pre-trained transformer reinforces historical gender bias in diagnosing women’s cardiovascular symptoms
European Heart Journal - Digital Health

Abstract
Large language models (LLMs) such as GPT are increasingly used to generate clinical teaching cases and support diagnostic reasoning. However, biases in their training data may skew the portrayal and interpretation of cardiovascular symptoms in women, potentially leading to delayed or inaccurate diagnoses. We assessed GPT-4o’s and GPT-4’s gender representation in simulated cardiovascular cases and GPT-4o’s diagnostic performance across genders using real patient notes.
First, GPT-4o and GPT-4 were each prompted to generate 15 000 simulated cases spanning 15 cardiovascular conditions with known gender prevalence differences. The models’ gender distributions were compared to U.S. prevalence data from large national datasets (Centers for Disease Control and Prevention and National Inpatient Sample) using FDR-corrected χ² tests, finding a significant deviation (
GPT-4o underrepresented women in simulated cardiovascular scenarios and less accurately diagnosed female patients with critical conditions. These biases risk reinforcing historical disparities in cardiovascular care. Future efforts should focus on bias detection and mitigation.
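
As an illustration of the comparison described in the abstract, the following is a minimal sketch of per-condition χ² goodness-of-fit tests followed by a false discovery rate correction. It is not the authors' code. The abstract does not name the FDR procedure, so Benjamini-Hochberg is assumed here. The condition names, counts, and reference shares are hypothetical placeholders, not study data, and the function names are invented for this example.

```python
import math


def chi2_gof_binary(observed_female, total, expected_female_share):
    """Chi-square goodness-of-fit test for a two-category (female/male) split.

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    expected_f = total * expected_female_share
    expected_m = total - expected_f
    observed_m = total - observed_female
    stat = ((observed_female - expected_f) ** 2 / expected_f
            + (observed_m - expected_m) ** 2 / expected_m)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)), so no external library is needed.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the abstract only says "FDR-corrected"; BH is one common choice.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical inputs, NOT study data:
# (condition, female cases generated, total cases, reference female share)
cases = [
    ("condition_a", 300, 1000, 0.45),
    ("condition_b", 480, 1000, 0.50),
    ("condition_c", 520, 1000, 0.50),
]

raw_p = [chi2_gof_binary(f, n, share)[1] for _, f, n, share in cases]
adj_p = benjamini_hochberg(raw_p)

for (name, _, _, _), p, q in zip(cases, raw_p, adj_p):
    print(f"{name}: p={p:.3g}, adjusted p={q:.3g}, significant={q < 0.05}")
```

In this sketch, a condition is flagged when its adjusted p-value falls below 0.05, meaning the share of women in the generated cases differs from the reference prevalence by more than chance would explain after accounting for the number of conditions tested.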
Contributors

Katherine Krieger
Author

Irbaz Hameed
Author

Giorgio Quer
Author

Charles Mack
Author

Marco Savic
Author

Polina Mantaj
Author

Aina Hirofuji
Author

Alexander Gregg
Author

Giovanni Soletti
Author

Camilla S Rossi
Author

Mohamed Rahouma
Author

Mario Gaudino
Author
NewYork-Presbyterian Hospital/Weill Cornell Medical Centre, New York, United States of America
