Large language models approach clinician performance in ESC cardiovascular risk stratification: a vignette-based benchmark study

European Heart Journal - Digital Health

20 May 2026

Abstract

Aims

Guideline-based cardiovascular risk stratification requires three distinct competencies: extracting risk factor data from clinical text, computing a validated risk score, and applying guideline-defined thresholds to assign a final risk category. We evaluated contemporary large language models (LLMs) on each of these tasks within the European Society of Cardiology (ESC) SCORE2 framework and compared LLM performance against a pooled individual clinician benchmark to contextualize findings against real-world human reproducibility.
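The third competency, applying guideline-defined thresholds, can be illustrated with a minimal sketch. The function name `esc_risk_category` is hypothetical, and the age-specific cut-offs are assumed from the 2021 ESC prevention guideline for apparently healthy people rather than taken from this abstract; they should be checked against the guideline before any use. The sketch does not handle SCORE2 applicability (for example, patients who are classified directly without a score).

```python
def esc_risk_category(age_years: float, score2_percent: float) -> str:
    """Map a 10-year SCORE2 estimate (%) to a three-class risk category.

    Cut-offs are an assumption based on the 2021 ESC prevention guideline;
    they are not stated in the abstract itself.
    """
    if age_years < 50:
        high, very_high = 2.5, 7.5
    elif age_years < 70:
        high, very_high = 5.0, 10.0
    else:
        high, very_high = 7.5, 15.0

    if score2_percent >= very_high:
        return "very high"
    if score2_percent >= high:
        return "high"
    return "low-to-moderate"
```

Under these assumed cut-offs, the same 3% estimate falls in a different category at age 45 than at age 60, which is why a correct score can still yield a wrong final class if the rule is misapplied.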

Methods and results

Eleven LLMs were evaluated using 30 simulated outpatient clinical vignettes presented in both Portuguese and English. For each vignette, models extracted cardiovascular risk factors, determined SCORE2 applicability, generated 10-year risk estimates where appropriate, and assigned a final three-class ESC risk category. A committee of three cardiologists established the reference standard; eight independent clinicians provided an individual-level human benchmark. Traditional risk-factor extraction was near-perfect across all models (micro-F1 0.97–0.99). Agreement with expert-assigned final risk categories was moderate and variable (best: GPT-4o, quadratic-weighted κw 0.69, 95% CI 0.44–0.84), with 10 of 11 models more often underestimating than overestimating risk. To isolate the source of classification error, post hoc deterministic recalculation of SCORE2 was performed using model-extracted variables in eligible vignettes; this markedly improved agreement across all models (κw 0.85–0.90), demonstrating that extraction was largely intact and computational execution was the primary failure mode. The pooled individual clinician benchmark showed moderate agreement with the reference standard (κw 0.52, 95% CI 0.28–0.67), indicating that the best-performing LLMs matched or exceeded the average individual clinician on this guideline-based task. Performance was broadly consistent across Portuguese and English.
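For readers unfamiliar with the agreement metric, the following is a minimal sketch of quadratic-weighted Cohen's kappa for an ordinal three-class outcome. It is a generic textbook formulation, not the authors' analysis code; the function name and the integer coding of categories (0, 1, 2 in increasing risk order) are assumptions made for illustration, and confidence intervals are not computed.

```python
def quadratic_weighted_kappa(reference, predicted, n_classes=3):
    """Quadratic-weighted Cohen's kappa for ordinal labels coded 0..n_classes-1."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    n = len(reference)
    if n == 0:
        raise ValueError("at least one rated case is required")

    # Observed confusion matrix: rows = reference class, columns = predicted class.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for r, p in zip(reference, predicted):
        observed[r][p] += 1

    ref_marginal = [sum(row) for row in observed]
    pred_marginal = [sum(observed[i][j] for i in range(n_classes))
                     for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreement weight grows with the squared distance between classes.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * ref_marginal[i] * pred_marginal[j] / n

    if weighted_expected == 0:
        # Both raters used a single identical class; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

Because the weights are quadratic, a two-category miss (for example, low-to-moderate assigned where very high was expected) is penalised four times as heavily as a one-category miss.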

Conclusion

Contemporary LLMs reliably extract cardiovascular risk information from clinical text, and the best-performing systems achieved agreement within the range of average individual clinicians on this structured task. Their principal limitation lies in downstream computation and rule application.