Large language model (LLM)-based agentic artificial intelligence tool streamlines research processes in biomarker studies: a proof of concept

European Heart Journal - Digital Health

12 January 2026

Abstract

Background/Introduction

AI tools utilizing large language models (LLMs) can significantly accelerate literature reviews by automating repetitive tasks and analyses. However, initial evaluations have been limited to title and abstract screenings.

Purpose

This study evaluates the full-text screening performance of an agentic AI tool leveraging LLM technology to accurately identify relevant publications for a systematic review of circulating biomarkers in heart failure with reduced ejection fraction (HFrEF).

Methods

Within the iCARE4CVD public-private partnership, we developed a knowledge model combined with an agentic AI tool that screened the full text of 5523 publications based on predefined selection criteria. The inclusion and exclusion criteria were decomposed into 136 specific tasks, each addressed by an individual LLM agent using a Retrieval-Augmented Generation (RAG) approach. This process involved segmenting the full text into manageable chunks, vectorizing them, and using RAG to identify the most relevant segments for analysis by the LLM agents. Results were aggregated, and unusual responses were automatically validated by a critique LLM agent. The aggregated response then informed the final inclusion or exclusion decision. We evaluated five LLMs on privacy, openness, and effectiveness (precision and recall) to select the most accurate model. The AI tool was trained and validated against human-reviewed papers, arbitrated by a senior reviewer, with 197 papers used for training and 97 for validation (Fig 1). Performance metrics included sensitivity, specificity, false positive and false negative rates, and Cohen’s κ to measure agreement between LLM and human reviewers.
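The chunk, vectorize, and retrieve step described above can be illustrated with a minimal sketch. It is not the study's implementation: the embedding model, chunk size, and retrieval settings are not reported in this abstract, so the sketch assumes simple bag-of-words vectors with cosine similarity, and the function names (chunk_text, vectorize, cosine, retrieve) and the parameters size, overlap, and k are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=50, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative assumptions)."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def vectorize(text):
    """Bag-of-words term counts, a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the screening question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks and screening question, for illustration only.
chunks = [
    "patients with heart failure and reduced ejection fraction were enrolled",
    "the weather was sunny",
    "plasma NT-proBNP biomarker levels were measured",
]
top = retrieve("Was a circulating biomarker measured?", chunks, k=1)
```

In a pipeline of the kind described, the retrieved chunks would be passed to the LLM agent responsible for the corresponding screening task; that step is omitted here because it depends on the chosen model.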

Results

Sensitivity and specificity both improved across the two training batches. In batch 1, sensitivity was 77.8% and specificity was 62.5%; in batch 2, these improved to 81% and 79%, respectively. The model settings were then updated to prioritize minimizing the false negative rate. In the validation phase, sensitivity rose to 91% while specificity fell to 53%, corresponding to a false positive rate of 46.8% and a false negative rate of 8.6% (Fig 2). Notably, inter-rater agreement was higher for the AI tool (κ = 0.38) than for human reviewers (κ = 0.23), suggesting that the tool screened for relevant publications more consistently.
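For readers less familiar with these metrics, the sketch below shows how they are conventionally derived from a 2x2 table of tool decisions against reference decisions. The counts are hypothetical and are not the study's data; the function name screening_metrics is likewise an assumption. Note that the false negative rate equals 1 minus sensitivity and the false positive rate equals 1 minus specificity, which matches the reported validation figures up to rounding.

```python
def screening_metrics(tp, fp, fn, tn):
    """Compute standard screening metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # relevant papers correctly included
    specificity = tn / (tn + fp)  # irrelevant papers correctly excluded
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "false_negative_rate": 1 - sensitivity,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only (not taken from the study).
m = screening_metrics(tp=40, fp=20, fn=10, tn=30)
```

With these illustrative counts, sensitivity is 0.8, specificity is 0.6, and κ is 0.4.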

Conclusion(s)

Our study demonstrated the potential of the AI tool to reduce labor-intensive effort while maintaining accuracy in literature reviews. Its reliability is further supported by greater inter-rater agreement compared to human reviewers. These findings suggest that LLM-based AI tools can significantly accelerate systematic reviews and enhance research efficiency in medicine.

Figure 1

Figure 2