CARMINA: optimizing low-parameter language models for high-quality cardiovascular research assistance

European Heart Journal - Digital Health

12 January 2026

Abstract

Introduction

Large language models (LLMs) and their use in chatbots have demonstrated impressive capabilities in biomedical contexts [1]; however, hallucinations, privacy issues, and substantial computational requirements limit their widespread implementation in resource-constrained environments. Current approaches either sacrifice performance for efficiency, demand prohibitive computational resources, or charge for every word processed.

Purpose

We have developed and validated CARMINA (Cardiovascular And Research-driven Molecular Insight with Novel Assistant), a specialized biomedical assistant powered by smaller, resource-efficient, and open-source language models. We hypothesized that carefully optimized Retrieval-Augmented Generation (RAG) systems using models with fewer parameters (≤7B) could achieve performance comparable to larger models while maintaining or even improving factual accuracy and scientific rigor in cardiovascular research applications.

Methods

We constructed a comprehensive biomedical RAG system using four different language models: llama3.1:7b, gemma2:2b, qwen2:7b, and phi3:3.8b [2–5]. The models were coupled with a MongoDB vector database containing 650,000 indexed PubMed cardiology-related abstracts and the GTE-large embedding model [6]. We optimized the system through prompt engineering to reduce hallucinations and provide source citations. For benchmarking, we developed a questionnaire of ~250 questions extracted from scientific abstracts using llama3.1. The questions were tailored to assess the groundedness, relevance, and context-independence [7,8] of the answers provided by CARMINA. Model responses were systematically evaluated by an independent language model (llama3.1:7b) for accuracy, completeness, reference quality, and clarity, while varying the number of retrieved context documents (1–5 papers).
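The retrieval and prompting steps described above can be illustrated with a minimal sketch. It is not the CARMINA implementation: the abstract does not specify the similarity metric, the prompt wording, or the database queries, so the sketch assumes cosine similarity over an in-memory list (standing in for the MongoDB vector database), uses toy three-dimensional vectors (standing in for GTE-large embeddings), and relies on hypothetical helper names (`cosine`, `retrieve`, `build_prompt`) and placeholder abstracts.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, corpus, k):
    """Return the k documents most similar to the query (assumed metric: cosine)."""
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(query_vec, doc["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, docs):
    """Assemble a grounded prompt that asks for citations and allows abstention."""
    context = "\n".join(
        f"[{i}] (PMID {doc['pmid']}) {doc['text']}"
        for i, doc in enumerate(docs, start=1)
    )
    return (
        "Answer using only the numbered abstracts below and cite them as [n]. "
        "If they do not contain the answer, reply \"I don't know\".\n\n"
        f"Abstracts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Placeholder corpus: the PMIDs, texts, and vectors are illustrative, not real data.
corpus = [
    {"pmid": "0000001", "text": "Placeholder abstract A.", "embedding": [1.0, 0.0, 0.0]},
    {"pmid": "0000002", "text": "Placeholder abstract B.", "embedding": [0.0, 1.0, 0.0]},
    {"pmid": "0000003", "text": "Placeholder abstract C.", "embedding": [0.7, 0.7, 0.0]},
]

query_vec = [1.0, 0.1, 0.0]  # toy query embedding
top_docs = retrieve(query_vec, corpus, k=2)  # k was varied from 1 to 5 in the benchmark
prompt = build_prompt("Placeholder question?", top_docs)
print(prompt)
```

The resulting prompt string would then be passed to one of the local models. The parameter `k` corresponds to the number of retrieved context documents that was varied during benchmarking.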

Results

Our benchmarking demonstrated that qwen2:7b was the most consistent model across all evaluation metrics [Figure 1]. All models acknowledged their lack of information by answering "I don't know" whenever needed, and provided relevant references for their responses. The optimized RAG architecture significantly reduced hallucination rates compared to standard implementations. Furthermore, the use of larger open-source models did not substantially improve performance.

Conclusion

CARMINA shows that small language models, when equipped with specialized RAG workflows and optimization techniques, can provide reliable research assistance that even surpasses that of non-specialized larger models. This approach offers a solution for resource-limited environments while maintaining scientific accuracy and guaranteeing privacy. In future work, we plan to address the limitations of automated benchmarking methodologies and the inherent risks associated with using LLMs as evaluators [9,10].