Cost-aware prediction (CAP): an LLM-supported machine learning pipeline for interpreting heart failure mortality predictions

European Heart Journal - Digital Health

12 January 2026

Abstract

Background

Machine learning (ML) models for clinical prediction are often developed independently of downstream value, interpretability, and decision-making needs. As a result, model outputs may be difficult to act upon in practice, especially when trade-offs between various clinical aspects are not explicitly addressed.

Purpose

This paper presents a cost-aware prediction (CAP) framework that supports 1-year mortality prediction for heart failure (HF) patients, with the aim of identifying those eligible for home care based on high predicted mortality risk. By combining cost-benefit analysis with large language model (LLM) agents, the framework facilitates clinical interpretability and helps stakeholders navigate real-world trade-offs in ML-based decision support.

Methods

First, we developed an ML model predicting 1-year mortality using electronic health records covering all patients with a first in-hospital HF diagnosis between 2017 and 2023 (N=30,021, 22% mortality) in Region Västra Götaland, Sweden (total population 1.8 million). Second, we introduced clinical impact projection (CIP) curves to visualise two important cost dimensions: quality of life and healthcare provider expenses. These costs were further divided into treatment and error costs, to assess the consequences of both correct and incorrect predictions. Finally, we developed four LLM agents to generate patient-specific descriptions explaining the certainty of the prediction, cost-benefit considerations, and risk reduction strategies. The system was evaluated by clinicians for its decision support value.
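As a rough illustration of how a cost curve over decision thresholds can be split into treatment and error components, the sketch below uses toy data. The abstract does not give the authors' cost definitions, so everything here is an assumption: the function name `cost_curve`, the unit costs `c_treat`, `c_fp` and `c_fn`, and the choice to count true positives as treatment cost and false positives plus false negatives as error cost are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostPoint:
    threshold: float
    treatment_cost: float  # cost attributed to correctly flagged patients (assumed)
    error_cost: float      # cost attributed to false positives and false negatives (assumed)

    @property
    def total(self) -> float:
        return self.treatment_cost + self.error_cost


def cost_curve(y_true, y_prob, thresholds, c_treat=1.0, c_fp=1.0, c_fn=5.0):
    """Return one CostPoint per threshold; unit costs are hypothetical placeholders."""
    points = []
    for t in thresholds:
        tp = fp = fn = 0
        for y, p in zip(y_true, y_prob):
            flagged = p >= t  # patient is predicted high-risk at this threshold
            if flagged and y == 1:
                tp += 1
            elif flagged and y == 0:
                fp += 1
            elif not flagged and y == 1:
                fn += 1
        points.append(CostPoint(t, tp * c_treat, fp * c_fp + fn * c_fn))
    return points


# Toy outcomes (1 = died within one year) and predicted risks.
y_true = [1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.1, 0.3]

curve = cost_curve(y_true, y_prob, thresholds=[0.25, 0.5, 0.75])
for pt in curve:
    print(pt.threshold, pt.treatment_cost, pt.error_cost, pt.total)
```

Plotting `treatment_cost` and `error_cost` against `threshold`, once per cost dimension, would give a population-level view of how the cost composition shifts as the decision threshold moves.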

Results

For the ML predictive modelling, the eXtreme gradient boosting model achieved the best performance, with an area under the receiver operating characteristic curve (AUROC) of 0.804 (95% confidence interval (CI) 0.792-0.816), an area under the precision-recall curve (AUPRC) of 0.529 (95% CI 0.502-0.558), and a Brier score of 0.135 (95% CI 0.130-0.140), as shown in Figure 1. The CIP cost curves provided a population-level overview of cost composition across decision thresholds. LLM-generated cost-benefit descriptions enabled analysis of predictions at the individual patient level. An illustrative example can be found in Figure 2. The system was generally well received in the clinician evaluation. However, clinical feedback emphasised the need to strengthen technical accuracy for speculative tasks.
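For readers less familiar with the reported metrics, the following minimal sketch computes AUROC and the Brier score on toy data and attaches a percentile bootstrap interval. The abstract does not state how the confidence intervals were obtained, so the bootstrap is an assumption, and the helper names `auroc`, `brier` and `bootstrap_ci` are hypothetical.

```python
import random


def auroc(y_true, y_prob):
    """Probability that a random positive is ranked above a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def bootstrap_ci(metric, y_true, y_prob, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (assumed method; not taken from the abstract)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_prob[i] for i in idx])
        if value is not None:  # skip resamples where the metric is undefined
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Toy outcomes (1 = died within one year) and predicted risks.
y_true = [1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.1, 0.3]

auc_value = auroc(y_true, y_prob)
brier_value = brier(y_true, y_prob)
auc_lo, auc_hi = bootstrap_ci(auroc, y_true, y_prob)
print(auc_value, brier_value, auc_lo, auc_hi)
```

The pairwise AUROC loop is quadratic in the number of patients, which is fine for illustration; for a cohort of roughly 30,000 patients a rank-based implementation would be the more practical choice.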

Conclusions

An ML algorithm was developed to predict 1-year mortality risk. Further, the CAP framework uses LLM agents to integrate the ML classifier's outputs with cost-benefit analysis, supporting clinical decision-making through more transparent and interpretable communication. Although clinical feedback was positive, future work is needed on fine-tuning the LLM narratives for optimal accuracy and usefulness.