Automated full-text screening and accelerated reviews using large language models with context-aware agents: an exploratory analysis in biomarker research

European Heart Journal - Digital Health

5 May 2026
ESC Journals Research Methodology HEART FAILURE Chronic Heart Failure

Abstract

Aims

Artificial intelligence (AI) tools utilizing large language models (LLMs) can accelerate scientific literature reviews by automating title, abstract, and full-text-based screenings of relevant patient populations and biomarkers. We developed an AI-based tool to automate and improve full-text screening performance using LLMs to accurately identify relevant publications that meet complex criteria.

Methods and results

We conducted a literature review using the Population, Intervention (biomarkers), Comparison, Outcome framework to define our inclusion and exclusion criteria, focusing on biomarkers in heart failure with reduced ejection fraction (HFrEF). An AI-based full-text screening tool was created to process 5405 selected publications by combining multi-level and task-oriented retrieval-augmented generation (RAG) with agent-based methods. Ground truth standards were established to evaluate performance metrics for both the tool and human reviewers. Intra-LLM reliability was assessed by rerunning screenings on a batch of publications. Among the public and private domain models evaluated, LLaMA 3.3 70B was selected for its superior accuracy (82%), precision (71%), and recall (100%) in screening 49 manuscripts. During the training phase, which was based on several hundred manuscripts, performance metrics improved significantly. Validation results showed a sensitivity of 91.4%, a specificity of 53.2%, a false positive rate of 46.8%, and a false negative rate of 8.6%. The LLM outperformed human reviewers in F1 score and inter-rater reliability, achieving 100% consistency across multiple runs, with each run consisting of multiple LLMs on 1000 documents.
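The reported metrics are related by their standard definitions: the false positive rate is the complement of specificity (53.2% + 46.8% = 100%), and the false negative rate is the complement of sensitivity (91.4% + 8.6% = 100%). The following is a minimal sketch of how such screening metrics and a simple run-to-run consistency measure could be computed from include/exclude decisions. The names `ConfusionCounts`, `screening_metrics`, and `run_consistency` are hypothetical, the counts are illustrative round numbers and not data from the study, and the consistency definition (identical decision in every run) is an assumption, since the abstract does not specify how consistency was measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Screening decisions compared against a ground truth standard."""
    tp: int  # relevant publications correctly included
    fp: int  # irrelevant publications wrongly included
    tn: int  # irrelevant publications correctly excluded
    fn: int  # relevant publications wrongly excluded


def screening_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Standard binary-classification metrics for a screening task."""
    sensitivity = c.tp / (c.tp + c.fn)  # also called recall
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1.0 - specificity,
        "false_negative_rate": 1.0 - sensitivity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


def run_consistency(runs: list[list[bool]]) -> float:
    """Fraction of documents given the same decision in every run.

    Assumed definition: each inner list holds one run's include/exclude
    decisions, in the same document order for all runs.
    """
    n_docs = len(runs[0])
    if any(len(run) != n_docs for run in runs):
        raise ValueError("all runs must cover the same documents")
    unanimous = sum(1 for decisions in zip(*runs) if len(set(decisions)) == 1)
    return unanimous / n_docs


# Illustrative counts only (not taken from the study).
example_counts = ConfusionCounts(tp=40, fp=20, tn=30, fn=10)
example_metrics = screening_metrics(example_counts)

# Three hypothetical reruns over four documents; the last document differs once.
example_runs = [
    [True, False, True, True],
    [True, False, True, False],
    [True, False, True, True],
]
example_consistency = run_consistency(example_runs)

if __name__ == "__main__":
    for name, value in example_metrics.items():
        print(f"{name}: {value:.3f}")
    print(f"run consistency: {example_consistency:.2f}")
```

In a screening setting, high sensitivity paired with moderate specificity, as reported here, means few relevant publications are missed at the cost of more irrelevant ones being passed on for human review.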

Conclusion

Our study demonstrated that an AI tool can reduce labour-intensive effort while maintaining accuracy in literature reviews, with greater inter-rater agreement than human reviewers.
