Rapid improvement in ability of AI to reason using clinical guidelines

European Heart Journal - Digital Health

12 January 2026
Organised by: ESC Journals

Abstract

Background

Large language models (LLMs) have seen rapid adoption, with over 100 million users engaging with ChatGPT within two months of its 2022 launch. Despite their potential, LLMs have struggled with clinical reasoning tasks. Recent advancements in 2024 introduced more powerful models like OpenAI’s o1-preview, optimized for reasoning through reinforcement learning. This study examines whether the latest LLMs have significantly improved their ability to apply clinical reasoning using up-to-date clinical guidelines.

Methods

We developed 201 clinical scenarios based on 43 guidelines from the National Institute for Health and Care Excellence (NICE), covering conditions such as coronary artery disease, heart failure, chronic kidney disease, hypertension, diabetes, and more. Seventeen LLMs from OpenAI, Anthropic, and Google were evaluated. Prompt engineering techniques were employed, including chain-of-thought (CoT), multi-shot learning, and retrieval-augmented generation (RAG). LLMs were instructed to recommend medications for each clinical scenario strictly according to NICE guidelines, and their outputs were assessed using an automated system. In total, the LLMs generated 118,992 medication recommendations.
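The abstract does not describe how the automated system scored the recommendations. As an illustration only, the sketch below shows one common way to score a model's medication list against a guideline-derived reference list: treat both as sets, count true positives, false positives, and false negatives per scenario, and pool the counts into a micro-averaged F1 score. The function names, the lower-casing normalisation, and the choice of micro-averaging are assumptions, not the authors' method.

```python
from typing import Iterable, Sequence, Tuple


def normalise(drugs: Iterable[str]) -> set:
    """Lower-case and trim drug names so 'Ramipril ' matches 'ramipril'.

    Assumption: exact string matching after normalisation; a real system
    would likely also map brand names and synonyms.
    """
    return {d.strip().lower() for d in drugs if d.strip()}


def confusion_counts(recommended: Iterable[str],
                     reference: Iterable[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one scenario."""
    rec, ref = normalise(recommended), normalise(reference)
    tp = len(rec & ref)   # recommended and in the reference list
    fp = len(rec - ref)   # recommended but not in the reference list
    fn = len(ref - rec)   # in the reference list but not recommended
    return tp, fp, fn


def micro_f1(pairs: Sequence[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Pool counts over all scenarios, then compute a single F1 score."""
    tp = fp = fn = 0
    for recommended, reference in pairs:
        t, f, n = confusion_counts(recommended, reference)
        tp, fp, fn = tp + t, fp + f, fn + n
    if tp == 0:
        return 0.0  # no correct recommendations: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical scenarios for illustration; not taken from the study
    # and not clinical advice.
    scenarios = [
        (["Ramipril", "atorvastatin"], ["ramipril", "atorvastatin", "aspirin"]),
        (["amlodipine", "ibuprofen"], ["amlodipine"]),
    ]
    print(f"micro-averaged F1 = {micro_f1(scenarios):.3f}")
```

Under this scheme a missed guideline drug lowers recall and an unwarranted drug lowers precision, so F1 penalises both omissions and inappropriate additions.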

Results

Models released in 2024 demonstrated significant improvements over their predecessors. OpenAI’s o1-preview achieved the highest accuracy, with an F1 score of 73.0%, a step change from the highest-performing models of 2023 (GPT-4 Turbo: F1 score 45.7%, P < 0.001) and 2022 (GPT-3.5 Turbo: F1 score 32.1%, P < 0.001). Supplying the latest guideline recommendations in real time through RAG improved the performance of all of the most recent LLMs. However, even top-performing LLMs were less accurate in scenarios involving multimorbidity and occasionally suggested dangerous treatments.

Conclusion

There was a step change in the ability of AI to reason using clinical guidelines in 2024. Nonetheless, challenges persist, particularly in managing complex cases with multiple comorbidities, and guard rails are still needed to prevent inappropriate recommendations. Continued advancements and implementation of safety measures are essential for reliable clinical application.

Contributors

S Khattak

Author

University Hospitals Birmingham NHS Foundation Trust, Birmingham, United Kingdom of Great Britain & Northern Ireland

J T Townend

Author

Queen Elizabeth Hospital Birmingham, Birmingham, United Kingdom of Great Britain & Northern Ireland

N K Khan

Author

Queen Elizabeth Hospital Birmingham, Birmingham, United Kingdom of Great Britain & Northern Ireland
