Scoping Meta-Review of Methods Used to Assess Artificial Intelligence-Based Medical Devices for Heart Failure

Artificial intelligence and machine learning (AI/ML) are playing increasingly important roles, permeating the field of medical devices (MDs). This rapid progress has not yet been matched by the Health Technology Assessment (HTA) process, which still needs to define a common methodology for assessing AI/ML-based MDs. To collect existing evidence from the literature about the methods used to assess AI-based MDs, with a specific focus on those used for the management of heart failure (HF), the International Federation of Medical and Biological Engineering (IFMBE) conducted a scoping meta-review. This manuscript presents the results of this search, which covered the period from January 1974 to October 2022. After careful independent screening, 21 reviews, mainly conducted in North America and Europe, were retained and included. Among the findings were that deep learning is the most commonly utilised method and that electronic health records and registries are among the most prevalent sources of data for AI/ML algorithms. Out of the 21 included reviews, 19 focused on risk prediction and/or the early diagnosis of HF. Furthermore, 10 reviews provided evidence of the impact on the incidence/progression of HF, and 13 on admissions/readmissions or the length of stay. From an HTA perspective, the main areas requiring improvement are the quality assessment of studies on AI/ML (included in 11 out of 21 reviews) and their data sources, as well as the definition of the criteria used to assess the selection of the most appropriate AI/ML algorithm.


Introduction
Heart failure (HF) is a multi-faceted and life-threatening syndrome and is one of the leading causes of mortality and hospitalisation. According to statistics, 64.3 million people suffered from HF globally in 2017 [1,2]. The most recent definition describes HF as a clinical syndrome with symptoms and/or signs caused by a structural and/or functional cardiac abnormality, corroborated by elevated natriuretic peptide levels and/or objective evidence of pulmonary or systemic congestion [3]. HF patients usually undergo numerous diagnostic tests, procedures, and therapies that generate a large amount of data. These data have been used in recent decades to train algorithms and develop artificial intelligence and machine learning (AI/ML) applications for different purposes, ranging from the identification of risk factors for incident HF to disease classification, early diagnosis, early detection of decompensation, risk stratification, management, and the organisation of health services, among others [4–7].
The widespread use of AI/ML solutions is expected to drastically change the domain of medicine and healthcare systems, especially after the World Health Organisation (WHO) included AI/ML-based medical devices (MDs) in the definition of "Health technology" [8,9]. As the implementation, adoption, and use of AI/ML MDs in healthcare settings are crucial from several points of view (legal, ethical, social, economic, and organisational aspects), their worth must be assessed using approaches, even if modified, that are similar to those used to assess the value of other medical innovations.
Some barriers to the implementation of accurate AI/ML MDs for HF are already known. For instance, it is often difficult to identify a population with high enough event rates to demonstrate the effects of the solution [4]. Such a problem of representativeness in trials is usually regarded as a methodological issue in both the quality of the data collected and the study design. This issue, when identified, is often partially addressed by meticulously specifying the inclusion criteria for participant selection [10]. On top of methodological issues, another problem associated with AI solutions in the safety-critical setting of HF is the lack of accurate confidence intervals for predictions. This is a potentially serious issue that can affect the trustworthiness and robustness of AI solutions, as well as the generalisability of the results [4,11]. There are ongoing research efforts to maximise the accuracy of the confidence intervals of AI/ML MD solutions [12] and optimise these solutions. However, an agreed common approach to assessing and judging the quality of these systems is still missing.
For health technologies, this process is normally overseen by Health Technology Assessment (HTA) [13] principles and criteria. At the international level, healthcare experts are trying to map and identify the key challenges (e.g., regulatory, ethical, etc.) involved in assessing AI in the real world, and to reach a consensus on the HTA methods and frameworks used to assess the quality of AI applications [14]. However, although specific HTA frameworks for diagnostic technologies, medical and surgical interventions, and screening technologies are publicly available, frameworks for telemedicine or mobile health [15,16] have only recently been developed (e.g., the MAST-AI (Model for Assessing the value of Artificial Intelligence in medical imaging) [17]). Moreover, even if HTA agencies, such as the National Institute for Health and Care Excellence (NICE) in the UK, have started defining standard frameworks for digital technologies [16], there is currently no agreement on a common HTA framework for the specific assessment of different types of AI/ML-based MDs.
The aim of this manuscript is therefore to explore and systematise the methods available in the literature that have been used by healthcare professionals to assess the quality of AI-based MDs, specifically those related to heart failure (HF).

Methods
The International Federation of Medical and Biological Engineering (IFMBE) created a multidisciplinary working group to discuss potential methods for assessing AI/ML-based medical devices. The group was composed of 15 expert professionals in biomedical engineering, human factors, health economics, and HTA, each with more than two years of experience. A series of focus groups were organised to establish the research question and the inclusion and exclusion criteria, as well as to agree on the definitions reported in Box 1.

AI/ML-based medical devices (MDs)
Medical device software that includes AI/ML algorithms.

Artificial Intelligence (AI)
AI is broadly defined as the science and engineering of making intelligent machines, especially intelligent computer programs [18].

Health technology
Health technology is an intervention developed to prevent, diagnose, or treat medical conditions; promote health; provide rehabilitation; and organise healthcare delivery. The intervention can be a test, device, medicine, vaccine, procedure, program, or system [13].

Health Technology Assessment (HTA)
HTA is a multidisciplinary process that uses explicit methods to determine the value of a specific health technology at different points in its lifecycle. The purpose is to inform decision making to promote an equitable, efficient, and high-quality health system [13].

Heart failure (HF)
A clinical syndrome with symptoms and/or signs caused by a structural and/or functional cardiac abnormality, corroborated by elevated natriuretic peptide levels and/or objective evidence of pulmonary or systemic congestion [3]. For the purposes of this review, the definition of heart failure also encompasses acute coronary syndromes and atrial fibrillation.

HTA framework
A methodological framework for the production and sharing of HTA information based on a standardised set of HTA questions (the ontology) that allows users to define their specific research questions within a hierarchical structure. (Definition adapted from the EUnetHTA Core Model® [19]).

Machine learning (ML)
ML, a branch of artificial intelligence (AI) and computer science, focuses on developing systems that can learn and adapt without following explicit instructions, imitating the way humans learn. Such systems gradually improve their accuracy by using algorithms and statistical models to analyse and draw inferences from patterns in data [20].

Medical device software (MDSW)
Medical device software is software that is intended to be used alone or in combination for a purpose specified in the definition of a "medical device" in the medical devices regulation (Article 2(1) of Regulation (EU) 2017/745-MDR) or in vitro diagnostic medical devices regulation (Article 2(2) of Regulation (EU) 2017/746) [21].
The defined research question is as follows: "What are the methods used and evidence collected to assess AI/ML-based medical devices for heart failure, and what are their strengths and limitations?" A meta-review [22] of systematic reviews, scoping reviews, and meta-analyses was conducted, focusing on the AI/ML algorithms developed for and used in the management of adult patients with heart failure, with particular attention to the HTA techniques and methods used, if any. The meta-review was conducted in line with the extended version of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA-ScR) guidelines [23,24].
Embase and Scopus were searched for relevant literature. In addition, grey literature published on major HTA agencies' websites was included. The literature search covered the period from January 1974 to October 2022. The detailed search string is reported in Appendices A.1 and A.2.

PICO and Eligibility Criteria
The PICO (Population, Intervention, Comparator, Outcomes) elements used in our review were as follows: Population-patients affected by HF; Intervention-AI/ML-based MDs; Comparator-traditional methods used in clinical practice and conventional statistical methods; Outcomes-accuracy, effectiveness, and organisational outcomes such as admissions/readmissions and/or impact on the length of stay (LOS).
The inclusion and exclusion criteria were based on the publication type and topic. Only studies reporting on AI/ML methods applied to the prediction of HF risk and the monitoring and management of the disease were included. No limitation was placed on the setting of their use (e.g., inpatient, outpatient, community). In addition, only systematic reviews, scoping reviews, or meta-analyses were considered for inclusion. All other publication types, as well as all publications out of our scope, were excluded.

Identification and Screening
The titles and abstracts of the retrieved articles and their full texts were screened by two researchers independently. Any conflict between the reviewers was resolved by the involvement of a third independent reviewer.

Data Extraction and Analysis
For the data extraction, an ad hoc table was created to collect data on both the review and the AI/ML methodology, including data sources (literature database), quality assessment, comparison of results, and clinical and organisational endpoints.
To ensure the traceability of the data/evidence, we extracted from each selected study the literature search engines used and the countries of the items included in the review.
The AI/ML algorithms were categorised based on their function or form, using the framework adopted by Graili et al. [25] and proposed by Brownlee [26]. This framework includes more than 60 algorithms and divides them into 12 types, including deep learning, ensemble models, neural networks, regularisation, rule systems, regression, Bayesian, decision trees, dimensionality reduction, instance-based, and clustering.
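To make the categorisation step concrete, a minimal sketch of how reported algorithm names can be mapped onto such Brownlee-style categories is shown below. The category names follow the list above; the keyword lists and the `categorise` function are our own illustrative assumptions, not part of the Graili/Brownlee framework itself.

```python
# Illustrative sketch only: maps a reported algorithm name onto one of the
# Brownlee-style categories used for tabulation. The keyword lists are
# hypothetical examples, not taken from the framework.
CATEGORY_KEYWORDS = {
    "deep learning": ("cnn", "lstm", "deep"),
    "neural networks": ("mlp", "perceptron", "artificial neural"),
    "ensemble": ("random forest", "boosting", "xgboost", "bagging"),
    "regression": ("logistic regression", "linear regression", "cox"),
    "decision trees": ("decision tree", "cart"),
    "bayesian": ("naive bayes", "bayesian network"),
    "instance-based": ("k-nearest", "knn"),
    "dimensionality reduction": ("pca", "principal component"),
    "clustering": ("k-means", "hierarchical clustering"),
}

def categorise(algorithm_name: str) -> str:
    """Return the first matching category, or 'other' if none matches."""
    name = algorithm_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return "other"
```

For example, `categorise("Random Forest")` returns `"ensemble"`, while an unmapped name such as "support vector machine" falls into the catch-all `"other"` bucket, mirroring how algorithms outside the listed types are handled in review tabulations.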
Many guidelines have been proposed for reporting trials that evaluate AI-driven technologies (i.e., TRIPOD-AI [27], STARD-AI [28], SPIRIT-AI [11], CONSORT-AI [29], and DECIDE-AI [30]). They differ in many aspects, including the stage of development of the technology and the study design. We recorded information regarding which guidelines, if any, had been adopted in the selected papers. Adoption of such guidelines was considered a proxy for quality, in terms of the attention paid by the authors to the appraisal of the studies.
Since the EUnetHTA guidelines [31] mention the importance of identifying the appropriate comparator(s) in assessments, we also considered whether the studies included in our review explicitly defined the comparators.
Finally, we developed a comprehensive overview and synthesis of the evidence, without focusing on each study included in each review.

Results
The scoping review identified 524 potentially relevant papers. After removing duplicates, 456 underwent title and abstract selection; 365 articles were excluded, as the items did not match the eligibility criteria. A full-text review was conducted for 84 articles; 61 articles were excluded, as they were not related to AI or CHF or did not meet our inclusion criteria. Overall, 21 reviews met our inclusion criteria and were included in our analysis. The PRISMA flow diagram (Figure 1) reports this process and the reasons for exclusion at each stage.

Selected Papers
As shown in Table 1, out of the 21 studies, 4 reported the results of a meta-analysis, 11 were systematic reviews, 1 was a scoping review, and 5 were narrative reviews. The reviews included in the meta-review discussed and summarised data from a mean of 49 studies, ranging from the five articles presented by Grün et al. [32] to the 122 studies presented by Bazoukis et al. [33]. The earliest records were published in 2018, whereas the latest items were published in 2022. The data sources of the selected reviews were quite diverse. The most frequently adopted sources for item identification and selection were Medline (in 11 out of 21 studies), PubMed (n = 8), the Cochrane Library (n = 6), Embase (n = 5), and Web of Science (n = 5). In terms of geographical representation (Figure 2), the selected articles included studies conducted mainly in North America (seven in the US and two in Canada) and Europe (mainly in the United Kingdom (n = 5), Germany (n = 4), and the Netherlands (n = 4)). Only a few studies were conducted in Asia (China (n = 3) and Korea (n = 2)) and Australia (n = 3).

Clinical Aspects
We adopted quite a broad definition of HF (as reported in Box 1), and some studies covered more than one clinical indication. Sixteen out of 21 papers (Figure 3) focused on (congestive) heart failure. Five studies also considered atrial fibrillation, whereas four articles included acute coronary syndromes. Quite common (as indicated by the Others category in Figure 3) was the inclusion under the HF umbrella of other clinical conditions, such as cardiovascular diseases (CVD), coronary artery disease (CAD), ischemic heart diseases, stroke, and valvular heart diseases. In terms of clinical application, the majority of cases focused on risk prediction and/or early diagnosis (n = 19). The classification of HF was considered in only seven cases, and the prognosis was included in only five cases. In terms of the clinical setting making use of AI/ML algorithms, in half of the cases, both inpatients and outpatients were considered. Ten studies did not report this information.

AI/ML Algorithms
The datasets used as the basis of the AI/ML algorithms varied (Figure 4, Table A1). Among the databases, electronic health records (EHR, n = 16) and registries (n = 14) were the preferred ones. Among the studies, retrospective cohorts (n = 11), prospective cohorts (n = 10), and randomised controlled trials (RCT, n = 10) were the most common data sources. In three cases, other data sources were adopted (electrocardiogram datasets, claims data). In four of the selected papers, some or all of the AI/ML datasets were not clearly specified.
To categorise the AI algorithms used, the same framework reported in [25] was adopted. According to our scoping review (Figure 4, Table A2), deep learning was the most common type of algorithm used (reported in 18 papers), followed by neural networks (n = 16), ensemble models (n = 15), and regression techniques (n = 14).
Finally, in terms of the quality assessment of the studies included in the selected reviews, in 11 cases [33–37,39,41–44,47], the quality assessment activity was clearly mentioned, but no homogeneity emerged in terms of the adopted methods. Ad hoc approaches or adaptations of available tools were also employed [34,41,42]. No preference emerged towards TRIPOD-AI or any other reporting guideline. In five reviews, the risk of bias was addressed using validated tools, such as PROBAST (Prediction model Risk Of Bias ASsessment Tool) [35,46] or QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies 2) [36,39,43]. Other reported scales included AI-TREE, CHARMS, AHA, PROGRESS, and the Critical Appraisal Skills Programme (CASP) checklist.

Performance, Effectiveness, and Safety
Great attention was paid to the assessment of the test performance or accuracy of the AI/ML algorithms. In 11 reviews, the AUC/ROC or sensitivity/specificity were explicitly reported. In addition, the benefit-risk profile was investigated, with more than half of the selected reviews (n = 12) paying great attention to the impact on mortality. The impact on the incidence or progression of HF was assessed in 10 reviews, whereas the impact on admissions/readmissions or the LOS was reported in 13 cases. In four publications, it was unclear how efficacy or effectiveness had been investigated. Finally, the HTA requires a comparative assessment. EUnetHTA specifies that the comparator should be the alternative intervention(s) against which the intervention under assessment should be compared. As shown in Figure 5, current clinical practice was included as a comparator in six cases (21 per cent). Comparisons with other AI/ML algorithms or with the results of conventional statistical methods were more common.
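For readers less familiar with the metrics named above, the sketch below shows how sensitivity, specificity, and the ROC AUC are computed from binary labels and predictions; the AUC uses the Mann–Whitney (pairwise ranking) formulation. This is a minimal, self-contained illustration and is not drawn from any of the included reviews.

```python
# Minimal sketch of the performance metrics most often reported in the
# included reviews: sensitivity, specificity, and the ROC AUC.
# Pure-Python, for illustration only (no handling of degenerate inputs).

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, y_score):
    """ROC AUC as the probability that a random positive case is scored
    above a random negative case (Mann-Whitney U formulation)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    pairs = [(p, n) for p in pos for n in neg]
    wins = sum(1 for p, n in pairs if p > n)
    ties = sum(1 for p, n in pairs if p == n)
    return (wins + 0.5 * ties) / len(pairs)
```

A perfectly ranking model yields an AUC of 1.0, while a model scoring all cases identically yields 0.5, the chance level against which the reviewed algorithms are typically benchmarked.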

Discussion
The main finding of our scoping review is that, when it comes to AI/ML models for HF developed for different purposes (e.g., the identification of risk factors, disease classification, early diagnosis, etc.), there is a heterogeneous set of approaches adopted by developers, in terms of not only algorithm design but also the evidence generated and reported on the characteristics and performance of these algorithms. Such heterogeneity became clear when considering the aims of the selected studies, the methods and data sources of the AI/ML algorithms, and the identification of the comparators.
Our analysis mainly focused on the current types of evidence available in the assessment of AI/ML-based MDs.
We did not discuss the details of the accuracy of the AI/ML algorithms or the likelihood of the transition of AI/ML algorithms into clinical practice. In the same way, we did not investigate the factors affecting the choice of quality assessment scales. Various quality assessment scales are available for appraising the quality of clinical research related to AI/ML technologies, but there seems to be no unified standard for choosing among them.

Key Findings
The HTA is based on the available evidence for assessing a health technology in comparison with current clinical practice [13]. Considering the evidence collected on AI-based MDs for HF, some common gaps have emerged. They are:
• Generalisability and representativeness. The majority of the retrieved systematic reviews mainly considered cases in developed countries (Figure 2), with an associated risk of discrimination and a lack of representativeness. From an HTA perspective, this has consequences for the generalisability of both the trial results to other geographical areas and the performance of AI/ML algorithms to other populations of patients not included in the data sources employed to develop the algorithms. This limits not only the recommendations an HTA can provide but also their transferability to other settings [53].
• Quality of available evidence. Guidelines for reporting trials that evaluate interventions are increasingly used when it comes to modelling the impact of AI-driven technologies. For instance, TRIPOD-AI [27] was developed for prediction model studies, STARD-AI [28] for diagnostic accuracy studies, and SPIRIT-AI [11] and CONSORT-AI [29] for randomised controlled studies. Recently, the Developmental and Exploratory Clinical Investigations of DEcision support systems driven by Artificial Intelligence (DECIDE-AI) approach [30] was proposed. This approach aims to improve the reporting of early-stage clinical evaluations of AI-based technologies, independently of the study design chosen. In the reviews on AI/ML algorithms for HF, we encountered both a lack of attention to and variability in the reporting of quality assessment, as well as a lack of agreement on which criteria/scales should be adopted to investigate quality. This is in line with the observation by Shazad et al. [54], who highlighted that the quality of reporting of randomised controlled trials in AI is suboptimal. It is also in line with the finding of Plana et al. [55], who reported high variability in adherence to reporting standards. At the same time, available tools adapted to AI are not yet fully able to capture the peculiarities of AI/ML algorithms and trials. As an immediate consequence, practitioners should interpret the findings of studies regarding AI/ML algorithms for HF with caution.
• AI/ML methods. Different models are currently being developed to manage HF, but no guidelines are available for assessors to investigate in detail the reliability of each algorithm and capture the added value of one AI/ML model in comparison with others. Given the long list of methods currently used, as shown in Figure 4, the HTA is able neither to select the most appropriate comparators nor to conduct a comparative assessment of AI/ML algorithms.
• Comparative evidence. Only a small proportion of studies evaluated AI/ML algorithms without conducting any kind of comparison, which is a promising result (Figure 5). However, the preferred comparator was not current clinical practice, as required by the HTA, but rather other AI/ML models or other statistical methods. As with any expected disruptive technology, the choice of the comparator is not easy. AI/ML is not just a new active principle or MD; it promises to be a new paradigm, able to redefine clinical pathways. In this case, direct or indirect comparisons with current clinical practice are even more important and necessary.
• Data sources. Last but not least, the data at the core of AI/ML algorithms are crucial. They are usually real-world data/evidence (RWD/RWE), which are becoming increasingly relevant for the HTA and decision makers. While investigating the complexity of AI for the HTA, Alami et al. [14] mentioned not only data quality and representativeness but also fragmented and unstructured data coming from different sources. This adds complexity to a scenario where the role of RWD/RWE and issues such as real-world data availability, governance, and quality are not fully addressed [56].

Strengths and Limitations
This study systematically analysed and synthesised data from multiple studies, providing a more robust and reliable analysis compared to a single study. We investigated whether AI/ML research and development is relevant to the HTA. In a way, our analysis complements the analysis carried out by Sharma et al. [57], which demonstrates a dissonance between research and practice. The HTA plays an intermediate role in the flow from research to clinical practice. This study, therefore, provides an overview of the different methods used to evaluate artificial intelligence-based medical devices for heart failure. It covers various aspects, such as data collection, analysis, and the evaluation of existing articles, providing readers with a comprehensive understanding of the topic and suggesting that the lack of relevant evidence for the HTA could impact market access and adoption. The insights gleaned from this study are highly applicable to medical professionals and researchers involved in the development and evaluation of AI-based MDs for heart failure and can potentially serve as a reference for those seeking to improve their evaluation methods.
However, due to the relatively new nature of AI-based MDs for HF, the data available for review are limited. This limitation affects the completeness of the analysis and the conclusions drawn from the study. The quality of the data extracted from the included articles may vary. The inclusion of low-quality studies may affect the overall conclusions drawn from the review, as comprehensive quality control criteria were not established. In addition, the methods used to evaluate artificial intelligence-based medical devices for heart failure may vary depending on the device, data, and intended use. Therefore, our study does not cover all areas of the HTA. Ethical and legal aspects are completely outside our scope. Finally, to make the selection process as objective as possible, the lack of standardisation of methods and reporting needs to be better investigated.

Further Development
This scoping review allowed us to identify the challenges posed by AI/ML-based MDs for a specific clinical condition (HF) and from the perspective of the HTA. The situation is rapidly evolving, as demonstrated by the significant increase in the number of studies on AI/ML algorithms and their implementation in clinical practice [58]. Therefore, although our analysis will require an update in three to five years, it is a useful starting point for investigating more aspects in detail, such as the quality and representativeness of data sources, as well as the criteria used to select the comparators.

Conclusions
In our scoping meta-review of the methods used to assess artificial intelligence (AI)-based medical devices (MDs) for heart failure (HF), we uncovered critical insights into the dynamic landscape of AI applications in healthcare. Our analysis emphasised the heterogeneity in the approaches taken by developers, highlighting the diversity in AI/ML models designed for various HF management purposes. This diversity extends beyond algorithms to encompass evidence generation and reporting, signifying the evolving nature of this field.
We identified key challenges that warrant attention in the evaluation of AI-based MDs for HF. Notably, the limited generalisability of the evidence due to a predominant focus on developed countries poses a barrier to making recommendations applicable to diverse healthcare settings. Additionally, the absence of standardised quality assessment practices for AI/ML in clinical research raises concerns about result interpretation. It is crucial to develop and agree on reporting standards and assessment tools tailored to the unique features of AI/ML technologies.
The proliferation of AI/ML methods presents both promise and complexity.The absence of guidelines for assessing reliability and value in these methods complicates comparative assessments, hindering the ability to select the appropriate comparators and conduct thorough evaluations.Moreover, our findings reveal a shift in comparison practices, with AI/ML algorithms often benchmarked against each other or other statistical methods rather than current clinical practice.This departure from standard health technology assessment (HTA) practices underscores the need for comprehensive comparative evidence.
Real-world data/evidence (RWD/RWE) emerged as a vital consideration, with its use becoming increasingly relevant to the HTA and decision makers. However, the challenges associated with RWD/RWE, such as data quality, representativeness, and fragmentation, amplify the complexity of AI/ML evaluation. Addressing these challenges will be pivotal for harnessing the potential of AI in HF management.
In conclusion, our meta-review bridges the gap between AI/ML research and clinical practice, offering a comprehensive overview of AI-based MDs for HF evaluation methods.Although our study has strengths, including systematic analysis and an emphasis on the HTA's intermediate role, it also has limitations due to the evolving nature of AI applications and the variability in the data and methods used.
Looking forward, we recommend revisiting this analysis in three to five years to check the progress and the emerging challenges.Future research should delve deeper into aspects like data quality, representativeness, and the criteria used to select the appropriate comparators.
In summary, AI-based MDs hold promise for enhancing HF management, but assessing them poses multi-faceted challenges.Our meta-review underscores the need for standardised evaluation practices, greater attention to data quality, and the pursuit of comprehensive comparative evidence.As AI/ML technologies continue to evolve, so too must our evaluation methods to ensure their safe and effective integration into clinical practice.

Figure 1. Screening and selection of papers included in the scoping meta-review.

Figure 2. Countries for which at least one study was included in the articles considered in the scoping review.

Table 1. Summary of papers included in the scoping meta-review.