Assessing the efficacy of large language models in data extraction for a meta-analysis on the role of glucagon-like peptide-1 receptor agonists in heart failure

European Heart Journal - Digital Health

12 January 2026
Organised by: ESC Journals

Abstract

Background

Data extraction is a critical and time-intensive step in conducting systematic reviews and meta-analyses. As the volume of published biomedical literature increases, the integration of artificial intelligence (AI) tools, particularly large language models (LLMs), has emerged as a potential solution to accelerate this process. However, their reliability and accuracy in extracting complex data from clinical studies remain largely untested.

Purpose

This study aimed to evaluate the accuracy, quality, and time efficiency of three LLMs (ChatGPT, DeepSeek, and LeChat) in extracting key data from eligible studies for a systematic review and meta-analysis assessing the effects of glucagon-like peptide-1 receptor agonists in patients with heart failure.

Methods

A set of 25 studies previously deemed eligible for inclusion in the meta-analysis of this systematic review was used for the evaluation. The three LLMs were prompted to extract data from each article using only publicly available information from the web. An extraction was considered accurate if the model returned one correct baseline characteristic and one correct result, each matching the data selected by the human investigators. The time taken per data extraction task was also recorded for each model.
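The accuracy criterion above can be expressed as a small grading rule. The following is a minimal sketch, not the authors' actual procedure: the `Extraction` type, the `grade` function, and the "no extraction" label are hypothetical names introduced for illustration, and the "partially correct" category follows the definition given in the Results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Extraction:
    """One model's output for one study (hypothetical structure)."""
    baseline_correct: bool  # baseline characteristic matches the human-selected value
    result_correct: bool    # result matches the human-selected value


def grade(extraction: Optional[Extraction]) -> str:
    """Grade a single extraction against the human reference.

    None means the model returned no usable data for the study.
    """
    if extraction is None:
        return "no extraction"
    hits = int(extraction.baseline_correct) + int(extraction.result_correct)
    if hits == 2:
        return "correct"            # both elements match
    if hits == 1:
        return "partially correct"  # one element matches, the other is missing or wrong
    return "incorrect"              # neither element matches
```

For example, `grade(Extraction(baseline_correct=True, result_correct=False))` falls into the "partially correct" category.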

Results

ChatGPT successfully extracted valid data from 10 of the 25 studies. Of those extractions, 90% were fully correct and 10% were partially correct, meaning that one valid element was captured while the other was missing or incorrect. DeepSeek extracted data from only 1 study, and that extraction was incorrect. LeChat extracted data from 20 studies; however, only 30% of its extractions were fully correct, 40% were partially correct, and 30% were incorrect. Regarding time efficiency, ChatGPT required an average of 17.87 seconds per extraction, DeepSeek 30.20 seconds, and LeChat 17.63 seconds.
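The reported percentages can be converted back into study counts to compare the models on a common denominator. The sketch below is an illustrative calculation and not part of the original analysis; it assumes that each percentage is relative to the number of studies for which that model returned data, and that ChatGPT had no incorrect extractions because its two reported shares sum to 100%. The names `reported` and `summarize` are hypothetical.

```python
TOTAL_STUDIES = 25  # eligible studies given to every model

# Figures as reported in the Results; shares are fractions of "extracted".
reported = {
    "ChatGPT":  {"extracted": 10, "correct": 0.90, "partial": 0.10, "incorrect": 0.00},
    "DeepSeek": {"extracted": 1,  "correct": 0.00, "partial": 0.00, "incorrect": 1.00},
    "LeChat":   {"extracted": 20, "correct": 0.30, "partial": 0.40, "incorrect": 0.30},
}


def summarize(reported: dict, total: int) -> dict:
    """Convert per-model shares into counts and rates over all studies."""
    summary = {}
    for model, r in reported.items():
        n = r["extracted"]
        correct = round(r["correct"] * n)
        partial = round(r["partial"] * n)
        incorrect = round(r["incorrect"] * n)
        summary[model] = {
            "coverage": n / total,              # share of studies with any extraction
            "correct": correct,                 # fully correct extractions (count)
            "partial": partial,
            "incorrect": incorrect,
            "correct_of_all": correct / total,  # fully correct as a share of all studies
        }
    return summary


summary = summarize(reported, TOTAL_STUDIES)
for model, s in summary.items():
    print(f"{model}: {s['correct']} fully correct of {TOTAL_STUDIES} "
          f"({s['correct_of_all']:.0%}), coverage {s['coverage']:.0%}")
```

Under these assumptions, ChatGPT yields 9 fully correct extractions out of 25 studies (36%) and LeChat 6 out of 25 (24%), which shows that LeChat's higher coverage did not translate into more fully correct extractions.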

Conclusions

This study highlights the potential of LLMs to expedite data extraction for systematic reviews and meta-analyses. While LeChat extracted data from more studies, its data quality was inconsistent. ChatGPT provided a more balanced performance, combining reliable accuracy with comparable speed. These findings suggest that although AI tools can significantly reduce the time burden of data extraction, human oversight remains essential to ensure data validity. Further model optimization may enhance the future role and applicability of AI in biomedical research.
