Automated full-text screening and accelerated reviews using large language models with context-aware agents: an exploratory analysis in biomarker research

European Heart Journal - Digital Health

5 May 2026


Abstract

Aims

Artificial intelligence (AI) tools based on large language models (LLMs) can accelerate scientific literature reviews by automating title, abstract, and full-text screening for relevant patient populations and biomarkers. We developed an AI-based tool that uses LLMs to automate full-text screening and improve its performance, accurately identifying relevant publications that meet complex criteria.

Methods and results

We conducted a literature review using the Population, Intervention (biomarkers), Comparison, Outcome (PICO) framework to define our inclusion and exclusion criteria, focusing on biomarkers in heart failure with reduced ejection fraction (HFrEF). We built an AI-based full-text screening tool to process 5405 selected publications; it combines multi-level and task-oriented retrieval-augmented generation (RAG) with agent-based methods. Ground truth standards were established to evaluate the performance of both the tool and human reviewers. Intra-LLM reliability was assessed by rerunning screenings on a batch of publications. Among the public and private domain models compared, LLaMA 3.3 70B was selected for its superior accuracy (82%), precision (71%), and recall (100%) when screening 49 manuscripts. During the training phase, which was based on several hundred manuscripts, performance metrics improved significantly. Validation results showed a sensitivity of 91.4%, a specificity of 53.2%, a false positive rate of 46.8%, and a false negative rate of 8.6%. The LLM outperformed human reviewers in F1 score and inter-rater reliability, achieving 100% consistency across multiple runs, with each run consisting of multiple LLMs on 1000 documents.
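For readers less familiar with screening metrics, the following minimal sketch shows how the reported quantities relate to one another. It is not the authors' evaluation code. The names `ConfusionCounts`, `screening_metrics`, and `run_consistency` are hypothetical, and the counts in the usage note are illustrative only, not study data. The sketch also assumes that run-to-run consistency means the share of documents receiving the same include/exclude decision in every run, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Screening decisions compared against a ground-truth label (hypothetical helper)."""
    tp: int  # relevant publications that were included
    fp: int  # irrelevant publications that were included
    tn: int  # irrelevant publications that were excluded
    fn: int  # relevant publications that were excluded


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a class is empty.
    return numerator / denominator if denominator else 0.0


def screening_metrics(c: ConfusionCounts) -> dict[str, float]:
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # also called recall
    specificity = _ratio(c.tn, c.tn + c.fp)
    precision = _ratio(c.tp, c.tp + c.fp)
    # F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN).
    f1 = _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1.0 - specificity,
        "false_negative_rate": 1.0 - sensitivity,
        "precision": precision,
        "f1": f1,
    }


def run_consistency(runs: list[list[bool]]) -> float:
    """Fraction of documents with an identical decision in every run.

    Each inner list holds one run's include/exclude decisions, in the same
    document order. This definition is an assumption made for illustration.
    """
    per_document = list(zip(*runs))
    agreeing = sum(1 for decisions in per_document if len(set(decisions)) == 1)
    return _ratio(agreeing, len(per_document))
```

With illustrative counts such as `ConfusionCounts(tp=8, fp=4, tn=6, fn=2)`, `screening_metrics` gives a sensitivity of 0.8 and a specificity of 0.6. The false positive and false negative rates are the complements of specificity and sensitivity, which matches the reported validation figures (53.2% + 46.8% and 91.4% + 8.6% each sum to 100%).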

Conclusion

Our study demonstrated that an AI tool can reduce labour-intensive effort while maintaining accuracy in literature reviews, with greater inter-rater agreement than human reviewers.