Convolutional Neural Networks for Fully Automated Diagnosis of Cardiac Amyloidosis by Cardiac Magnetic Resonance Imaging

Aims: We tested the hypothesis that artificial intelligence (AI)-powered algorithms applied to cardiac magnetic resonance (CMR) images can detect the patterns of cardiac amyloidosis (CA). Readers in CMR centers with a low volume of referrals for the detection of myocardial storage diseases, or a low volume of CMRs in general, may overlook CA. In light of the growing prevalence of the disease and emerging therapeutic options, there is an urgent need to avoid misdiagnoses. Methods and Results: Using CMR data from 502 patients (CA: n = 82), we trained convolutional neural networks (CNNs) to automatically diagnose patients with CA. We compared the diagnostic accuracy of different state-of-the-art deep learning techniques on common CMR imaging protocols in detecting imaging patterns associated with CA. In a 10-fold cross-validated evaluation, the best-performing fine-tuned CNN achieved an average ROC AUC score of 0.96, corresponding to 94% sensitivity and 90% specificity. Conclusions: Applying AI to CMR to diagnose CA may set a remarkable milestone in the attempt to establish a fully computational diagnostic path for CA, in order to support the complex diagnostic work-up, which requires profound knowledge from experts in different disciplines.


Introduction
Amyloidosis is a complex, multisystemic disease that is caused by the deposition of misfolded protein fragments in the extracellular space of tissues [1,2]. Cardiac amyloidosis (CA) is associated with substantial morbidity and mortality. The increased use of cardiac magnetic resonance imaging (CMR) in cardiology has revealed a previously unrecognized prevalence of CA, which has emerged from a "rare" disease that was often only diagnosed post mortem, to a condition of significant clinical relevance that every cardiologist is confronted with. An autopsy study could demonstrate the presence of CA in 25% of elderly people (≥85 years) [3]. Further studies showed that 14% of patients undergoing transcatheter aortic valve implantation, 13% of heart failure patients with preserved ejection fraction (HFpEF), and 8% of severe aortic stenosis patients suffer from concomitant CA [4][5][6].
The two predominant amyloid proteins found in the heart are transthyretin (TTR) and immunoglobulin light chains (AL). The expansion of the extracellular space due to amyloid deposition causes diastolic dysfunction of the left ventricle (LV). Eventually, affected patients develop severe heart failure (HF) and face a dismal prognosis [7].
A comprehensive algorithm for the diagnostic work-up of CA has recently been published [8]. It includes CMR as one baseline diagnostic modality. However, the signs of CA may be nonspecific in CMR scans, and CMR may even appear unremarkable although CA is present [9]. Furthermore, readers in CMR centers with a low volume of referrals for the detection of myocardial storage diseases or a low volume of cardiac CMRs in general may overlook nonspecific or rare signs of CA. In light of the high prevalence of the disease and emerging therapeutic options [10], we feel that there is an urgent need to avoid missed CA diagnoses. We therefore used convolutional neural networks (CNNs) to develop a fully automated algorithm for the diagnosis of CA using CMR.

Study Population
We enrolled consecutive adult patients between August 2010 and August 2018 who underwent a complete CMR study at our tertiary care center at the Vienna General Hospital. Our center is located at the Medical University of Vienna and has a high-volume cardiac catheterization unit and a high-volume cardiac transplantation program. Moreover, we are part of the European Reference Network for Amyloidosis and a national referral center for patients with heart failure and preserved ejection fraction (HFpEF). Patients underwent clinical and laboratory assessment, electrocardiogram (ECG), transthoracic echocardiography, CMR, and, if any suspicion of CA was present, 99mTc-DPD bone scintigraphy, as well as blood and urine tests for the detection of pathological light chains. The pre-CMR suspicion of CA was raised when patients presented with LV hypertrophy, in particular those with interventricular septum thickness ≥15 mm and shortness of breath. In case of suspicion of AL-CA, myocardial biopsy was performed. In case of suspicion of TTR-CA, endomyocardial biopsy (EMB) was performed until 2016, when the paper by Gillmore et al. [8] on the diagnostic algorithm of CA was published. Thereafter, only AL-CA cases and TTR-CA with presence of monoclonal protein underwent myocardial biopsy. All patients provided written informed consent. The study was approved by the Ethics Committee of the Medical University of Vienna (EK no. 796/2010).

Imaging Protocols and Data Preparation
Cardiac Magnetic Resonance Imaging
CMR examinations were performed on a 1.5-T scanner (MAGNETOM Avanto; Siemens Healthcare GmbH, Erlangen, Germany), following standard protocols that included late gadolinium enhancement imaging (0.1 mmol/kg gadobutrol (Gadovist; Bayer Vital GmbH, Leverkusen, Germany)) if estimated glomerular filtration rate was ≥30 mL/min/1.73 m² [11]. At the time of insertion of the intravenous cannula, blood was drawn for hematocrit and serum creatinine measurement. For analysis of late gadolinium enhancement (LGE) images, two independent reviewers judged whether a typical pattern for CA was present or not. Electrocardiographically triggered modified look-locker inversion recovery (MOLLI) using a 5(3)3 prototype (5 acquisition heartbeats followed by 3 recovery heartbeats and a further 3 acquisition heartbeats) was applied for precontrast T1 mapping. This method generates an inline, pixel-based T1 map by acquiring a series of images over several heartbeats with shifted T1 times, inline motion correction, and inline calculation of the T1 relaxation curve within 1 breath hold. T1 sequence parameters were as follows: starting inversion time 120 ms, inversion time increment 80 ms, reconstructed matrix size 256 × 218, and measured matrix size 256 × 144 (phase-encoding resolution 66% and phase-encoding field of view 85%). T1 maps were created both before and 15 min after contrast agent application. For postcontrast T1 mapping, a 4(1)3(1)2 prototype was used. T1 values from a midcavity short-axis slice and a midcavity 4-chamber view were averaged for assessment of the entire LV myocardium. For extracellular volume (ECV) calculation, the following formula was used [12]:

$$\mathrm{ECV} = (1 - \mathrm{hematocrit}) \times \frac{1/T1_{\mathrm{myo\,post}} - 1/T1_{\mathrm{myo\,pre}}}{1/T1_{\mathrm{blood\,post}} - 1/T1_{\mathrm{blood\,pre}}} \qquad (1)$$

where T1 myo pre and T1 blood pre indicate the native T1 times of myocardium and blood, and T1 myo post and T1 blood post indicate the T1 times of myocardium and blood 15 min after gadobutrol application.
The local reference range for normal MOLLI-ECV values is 25.4 ± 2.7%, derived from 36 healthy sex-matched controls [13].
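The ECV calculation can be illustrated with a short sketch. It assumes the commonly used hematocrit-corrected formulation, in which ECV is (1 − hematocrit) times the ratio of the pre-to-post-contrast change in 1/T1 of myocardium to that of blood. The function name, argument names, and the example values are hypothetical and are not taken from the study data.

```python
def extracellular_volume(t1_myo_pre, t1_myo_post,
                         t1_blood_pre, t1_blood_post, hematocrit):
    """Return ECV as a fraction (multiply by 100 for percent).

    T1 times must share one unit (e.g., ms); hematocrit is a fraction in (0, 1).
    Assumes the standard hematocrit-corrected ECV formulation.
    """
    if not 0.0 < hematocrit < 1.0:
        raise ValueError("hematocrit must be a fraction between 0 and 1")
    # Change in relaxation rate R1 = 1/T1 caused by the contrast agent.
    delta_r1_myo = 1.0 / t1_myo_post - 1.0 / t1_myo_pre
    delta_r1_blood = 1.0 / t1_blood_post - 1.0 / t1_blood_pre
    return (1.0 - hematocrit) * delta_r1_myo / delta_r1_blood


# Illustrative (made-up) values in ms, not patient data.
ecv = extracellular_volume(
    t1_myo_pre=1000.0, t1_myo_post=500.0,
    t1_blood_pre=1500.0, t1_blood_post=300.0,
    hematocrit=0.40,
)
print(f"ECV = {100 * ecv:.1f}%")
```

A value computed this way can then be compared against the local reference range for MOLLI-ECV.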

Experimental Setting
To assess the performance of CNNs for fully automated CA diagnosis, we compared three different modeling techniques. We refer to them throughout this paper as: from scratch, feature extraction, and fine-tuning (Supplementary Figures S1-S3). From scratch is a standard deep convolutional pipeline based on progressive image downsampling [13]. Feature extraction and fine-tuning are two transfer learning techniques [14]. For both pipelines, we used a VGG16 [15] CNN architecture pretrained on the ImageNet dataset. In the case of feature extraction, the activations of the last convolutional feature map of the pretrained VGG16 network were used as input features for a logistic regression classifier. For fine-tuning, we retrained the last four convolutional layers of the pretrained VGG16 network, while keeping the weights of all other, lower-level layers intact.
All our computational results were obtained on a 2 × Intel Xeon CPU server (12 cores each, base frequency 2.2 GHz) with 10 × NVIDIA GTX 1080 Ti GPUs (11 GB GDDR5 each) and 8 × 32 GB of RAM. For the deep learning image classification pipelines, we used the Keras Python library (version 2.1.6) with TensorFlow-GPU (version 1.8) as a backend for GPU utilization. Additionally, all our statistical and machine learning experiments were performed with the open-source scikit-learn Python package (version 0.20).
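The fine-tuning setup described above could be sketched roughly as follows. This is not the study's code: it uses the current `tensorflow.keras` API rather than Keras 2.1.6, and the classification head (global average pooling plus a single sigmoid unit), the momentum value of 0.9, and the function name `build_finetuned_vgg16` are assumptions made for illustration. Weight decay is omitted for brevity.

```python
from tensorflow import keras


def build_finetuned_vgg16(weights="imagenet", n_trainable_conv=4,
                          learning_rate=0.001):
    """VGG16 base with only its last few convolutional layers trainable."""
    base = keras.applications.VGG16(
        weights=weights, include_top=False, input_shape=(224, 224, 3)
    )
    # Collect convolutional layers in network order and unfreeze only the last ones.
    conv_layers = [l for l in base.layers if isinstance(l, keras.layers.Conv2D)]
    unfrozen = {l.name for l in conv_layers[-n_trainable_conv:]}
    for layer in base.layers:
        layer.trainable = layer.name in unfrozen

    # Assumed binary classification head (CA vs. non-CA); not specified in the text.
    inputs = keras.Input(shape=(224, 224, 3))
    x = base(inputs)
    x = keras.layers.GlobalAveragePooling2D()(x)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)

    model.compile(
        # Momentum value is an assumption; the text only states SGD with momentum.
        optimizer=keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9),
        loss="binary_crossentropy",
        metrics=[keras.metrics.AUC(name="roc_auc")],
    )
    return model
```

Passing `weights=None` builds the same architecture with random initialization, which is convenient for checking which layers are trainable without downloading the ImageNet weights.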

Data Preprocessing
For a fair comparison of all models, heart images were preprocessed with the same data preparation pipeline. All images corresponding to a specific imaging modality were extracted from DICOM files. We applied the following transformations to all images: (i) histogram equalization to improve contrast, (ii) image resizing (target resolution 224 × 224) to standardize input for pretrained networks, and (iii) Gaussian smoothing to prevent aliasing due to downscaling. Moreover, each image was represented in three RGB channels (the original grayscale images were duplicated three times, once per channel). During training, we computed the mean pixel values for each channel of the training set and subtracted them from all images, both in the training and validation datasets; in our case, the mean pixel value was the same for all three channels (per training set of images). These data preparation steps ensured that all models received similarly prepared inputs, and therefore facilitated a fair comparison of the different methods.
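Parts of this pipeline can be sketched in NumPy. The sketch below covers histogram equalization, duplication of the grayscale image into three channels, and subtraction of the training-set channel means; resizing to 224 × 224 and Gaussian smoothing are deliberately left out, since they would normally be delegated to an image-processing library. All function names are hypothetical, and 8-bit grayscale input is an assumption.

```python
import numpy as np


def equalize_histogram(image, n_levels=256):
    """Histogram-equalize a 2-D uint8 grayscale image using its CDF."""
    hist = np.bincount(image.ravel(), minlength=n_levels)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                # CDF value at the darkest present level
    denom = max(int(cdf[-1] - cdf_min), 1)   # guard against constant images
    lut = np.round((cdf - cdf_min) / denom * (n_levels - 1))
    lut = np.clip(lut, 0, n_levels - 1).astype(np.uint8)
    return lut[image]


def to_three_channels(image):
    """Duplicate a grayscale image into three identical channels (H, W, 3)."""
    return np.repeat(image[..., np.newaxis], 3, axis=-1).astype(np.float32)


def subtract_training_mean(train, val):
    """Center both sets with per-channel means computed on the training set only."""
    channel_mean = train.mean(axis=(0, 1, 2))
    return train - channel_mean, val - channel_mean, channel_mean


# Toy example on random data (not CMR images).
rng = np.random.default_rng(0)
img = rng.integers(50, 100, size=(8, 8), dtype=np.uint8)
eq = equalize_histogram(img)
rgb = to_three_channels(eq)
train = np.stack([rgb, 0.5 * rgb])
val = np.stack([rgb])
train_c, val_c, channel_mean = subtract_training_mean(train, val)
```

Because the three channels are copies of one grayscale image, the three channel means coincide, which matches the observation in the text.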

Statistical Analysis of Convolutional Neural Network Performance
To assess each model's performance and its statistical variance, we employed 10-fold cross-validation (CV) separately for each deep learning technique and imaging protocol. To prevent information leakage between training and validation, each CV fold of imaging data was split by patient, so that no patient contributed images to both the training and validation sets. Moreover, splits were generated in a stratified fashion, preserving the class sample ratio. Since our classification problem is an imbalanced one, with more negative samples than positives, the area under the receiver-operating characteristic curve (ROC AUC) score was chosen as our performance measure. For each averaged ROC curve, we also reported diagnostic accuracy in terms of sensitivity and specificity. These were determined by extracting operating points from the ROC curves. We used Youden's J statistic [16] to determine optimal operating points. In all our experiments, we reported and compared two classification scores per model: image classification and patient classification. In the case of image classification, each image was treated as an independent measurement, i.e., two images of the same patient were classified independently. For patient classification, the predictions for all images of a patient were averaged, i.e., patient classification was treated as average voting.
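The three ingredients of this evaluation (patient-level stratified folds, average voting, and the Youden operating point) can be sketched in plain Python. This is an illustrative reimplementation under simple assumptions, not the scikit-learn code used in the study; all function names are hypothetical, and folds are assigned round-robin within each class.

```python
import random
from collections import defaultdict


def stratified_patient_folds(patient_labels, n_folds=10, seed=0):
    """Map each patient ID to one fold so that class ratios are preserved.

    patient_labels: dict of patient_id -> 0/1. Splitting by patient (not by
    image) keeps all images of a patient on the same side of the split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pid, label in sorted(patient_labels.items()):
        by_class[label].append(pid)
    fold_of = {}
    for label in sorted(by_class):
        pids = by_class[label]
        rng.shuffle(pids)
        for i, pid in enumerate(pids):
            fold_of[pid] = i % n_folds
    return fold_of


def patient_scores(image_scores, image_patient_ids):
    """Average voting: mean image-level score per patient."""
    sums, counts = defaultdict(float), defaultdict(int)
    for score, pid in zip(image_scores, image_patient_ids):
        sums[pid] += score
        counts[pid] += 1
    return {pid: sums[pid] / counts[pid] for pid in sums}


def youden_operating_point(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing J = sens + spec - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < thr)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, thr, sens, spec)
    return best[1], best[2], best[3]
```

In this sketch a sample is called positive when its score is greater than or equal to the threshold; ties in J are resolved in favor of the lowest threshold.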
To prevent overfitting and ensure model generalization, in each CV fold the training data were further split into training and development sets in an 80/20 ratio. All CNN models were then trained for a maximum of 1000 epochs on the reduced training set using a stochastic gradient descent optimizer with momentum and weight decay [17], i.e., L2 regularization. Furthermore, the training process was regularized with early stopping [18] on the ROC AUC score on the development set: training continued as long as the ROC AUC score on the development set kept improving. To ensure gradual parameter updates for fine-tuned CNNs, we set the learning rate to 0.001; for from-scratch CNNs, the learning rate was set to 0.01.
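The early-stopping rule can be sketched as a framework-independent loop. The callables `train_one_epoch` and `evaluate_dev_auc` are hypothetical placeholders for one pass over the reduced training set and for scoring the development set; the `patience` parameter is an assumption, since the text only states that training continued while the development ROC AUC kept improving (which corresponds to `patience=1` here).

```python
def train_with_early_stopping(train_one_epoch, evaluate_dev_auc,
                              max_epochs=1000, patience=1):
    """Train until the development-set ROC AUC stops improving.

    Returns (best_epoch, best_auc). A real pipeline would also checkpoint the
    model weights at the best epoch; that step is omitted in this sketch.
    """
    best_auc = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        auc = evaluate_dev_auc()
        if auc > best_auc:
            best_auc, best_epoch = auc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_auc


# Toy run with a scripted sequence of development AUC values (made up).
scripted_aucs = iter([0.70, 0.80, 0.85, 0.84, 0.90])
epochs_run = []
best_epoch, best_auc = train_with_early_stopping(
    train_one_epoch=lambda: epochs_run.append(1),
    evaluate_dev_auc=lambda: next(scripted_aucs),
)
```

With `patience=1`, the toy run stops at the first epoch whose development AUC fails to improve, so the later value of 0.90 is never reached; a larger patience would trade longer training for robustness to such dips.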
Our experimental setting thus allowed fair and statistical comparison of all models considered. For all analyses, a p-value < 0.05 was considered statistically significant.

Clinical Characteristics of Study Participants
The detailed clinical baseline characteristics for the 502 consecutively registered patients are displayed in Table 1. In brief, 82 (16.3%) were diagnosed with CA-associated HF (positive cases), and the remaining 420 with unrelated HF types (negative cases). Among the negatives, the predominant condition was HFpEF (n = 163), 107 patients were diagnosed with ischemic cardiomyopathy, 53 were diagnosed with hypertrophic and other cardiomyopathies, 44 patients had valvular heart disease, 30 patients suffered from cardiac sarcoidosis, 19 patients had HF condition linked with congenital heart disease, including muscular dystrophies, and the remaining 4 patients were diagnosed with rare HF conditions, such as pericardial disease (n = 3) and left atrial myxoma (n = 1). CA patients were predominantly male (65.8% of CA patients and 44.9% of controls, p = 0.003) and older

Cardiac Magnetic Resonance Imaging-Based Diagnostic Ability of the Convolutional Neural Network
In Table 2, we report average ROC AUC scores of a 10-fold cross-validation for image and patient classifications for all three imaging protocols and the three convolutional architectures. In what follows, we group these results according to the (1) respective imaging protocol, (2) deep learning technique, and (3) effect of using multiple images vs. a single image for CA prediction, and analyze the effect on the diagnostic accuracy for each group.

The Effect of Imaging Protocol on Diagnostic Accuracy
As expected, the imaging protocol had an important effect on the diagnostic accuracy for all considered deep learning techniques and prediction protocols. Independent of the deep learning technique, LGE-trained models achieved the best diagnostic performance (Figure 1). The absolute best performance was observed with the fine-tuning deep learning technique, with an ROC AUC score of 0.96, resulting in 94% sensitivity and 90% specificity. Second best was a fine-tuned model trained on MOLLI images, with a best ROC AUC score of 0.93 and a diagnostic accuracy of 91% sensitivity and 82% specificity. The detailed performance of MOLLI image classification is depicted in Figure 2. CINE images were the hardest to classify (ROC AUC 0.89-0.91 for all deep learning techniques), as exemplified in Figure 3. The best diagnostic accuracy was achieved with a fine-tuned model, with 85% sensitivity and 86% specificity.

The Effect of the Deep Learning Technique on Diagnostic Accuracy
All three modeling techniques (feature extraction, from scratch, and fine-tuning) achieved high ROC AUC scores (0.89-0.96) for patient classification. The best diagnostic accuracy, in terms of sensitivity and specificity, was always obtained with the fine-tuning transfer technique (Table 2). From scratch and fine-tuning had comparable mean ROC AUC scores for all imaging protocols; however, the fine-tuning technique had a better diagnostic accuracy (sensitivity range 0.85-0.94 vs. 0.84-0.91). While the performance of the feature extraction technique, in terms of the mean ROC AUC score, stayed on par with the two other techniques, this technique recorded the lowest diagnostic accuracy for all imaging protocols (sensitivity range 0.77-0.97). For instance, LGE and MOLLI imaging protocols showed the closest performance in terms of the mean ROC AUC score. While both the feature extraction and fine-tuning techniques achieved comparable ROC AUC scores of 0.92 and 0.93, respectively, the fine-tuning model had a better operating point of 91%, 82% (sensitivity, specificity) vs. 85%, 86% for feature extraction.

The Effect of Prediction Protocol on Diagnostic Accuracy
In all cases, patient classification outperformed image prediction. In particular, it boosted the ROC AUC score for all imaging protocols of the feature extraction technique by 8-10%, by 4-5% for the "from-scratch" technique, and by 2-3% for the fine-tuning technique.

Discussion
Although LGE CMR imaging represents a real alternative to myocardial biopsy for diagnosing CA with an excellent diagnostic accuracy [19], readers in CMR centers with a low volume of referrals for the detection of myocardial storage diseases or a low volume of cardiac CMRs in general may overlook nonspecific or rare signs of CA. In light of the high prevalence of the disease and emerging therapeutic options [10], we feel that there is an urgent need to avoid missed diagnoses of CA. Herein, inspired by the hugely successful applications of state-of-the-art deep learning techniques in image understanding [13] and particularly transfer learning techniques in the medical imaging domain [14], we used CNNs to develop a fully automated algorithm for the diagnosis of CA using CMR. We were able to achieve highly accurate (average ROC AUC scores 0.90-0.96) fully automated CA prediction models validated on a cohort of 502 patients (n = 82 positive CA patients with EMB ground-truth labels).

Cardiac Magnetic Resonance Imaging for the Diagnosis of Cardiac Amyloidosis
According to Chacko et al. [20] and Fontana et al. [21], CMR should always be used if there is a suspicion of CA, because morphological changes in CMR are clearly visible. Indeed, after the administration of an extrinsic gadolinium-based contrast agent, CMR imaging can reveal characteristic LGE patterns alongside other morphological features, such as myocardial thickening, atrial dilatation, and pericardial and/or pleural effusions. It is furthermore possible to visually determine CA-specific gadolinium kinetics, such as faster washout of gadolinium from the myocardium and blood pool compared with that of nonamyloid control subjects [22]. In addition, T1 mapping methods allow one to measure these abnormal gadolinium kinetics [23]. In fact, Gillmore et al. published a comprehensive algorithm for the diagnostic work-up of CA, which includes CMR as one baseline diagnostic modality [8]. Previously, Austin et al. [24] concluded that LGE-CMR was the most accurate noninvasive predictor of EMB-positive CA, with sensitivity, specificity, and positive and negative predictive values of 88%, 95%, 93%, and 90%, respectively. Similarly, Bhatti et al. [25,26] proposed a CMR pattern, which they applied to 251 CA patients (63 ± 10 years, 36% females), and achieved a sensitivity and NPV of 100%, and an ROC AUC score of 0.9.
However, there is evidence that in certain cases, CMR may not be enough to establish a reliable CA diagnosis. For the diagnosis of TTR CA, the most meaningful test presently is DPD bone scintigraphy. At the same time, in AL CA, DPD bone scintigraphy is not reliable and may frequently be completely normal. Furthermore, the signs of AL CA may be nonspecific on CMR scans, and CMR may even appear unremarkable although CA is present [9]. This is important to notice, as AL CA seems to be at least as frequent as TTR but affects younger patients and is characterized by a significantly higher morbidity and mortality than TTR CA [10]. Importantly, however, underlying conditions such as plasma cell dyscrasia (i.e., multiple myeloma) make AL CA an effectively treatable disease today [10]. Thus, particularly for AL CA patients, the diagnosis based on CMR findings is highly relevant.

Role and Contribution of AI for CA Diagnosis
Recently, AI has been successfully used to automate the diagnosis of CA from different data modalities. Goto et al. [27] have proposed two CNN-based prediction models for ECG (CA, n = 130) and echocardiography (CA, n = 70) data, achieving 0.85 ROC AUC and 0.91 ROC AUC scores, respectively. Our group [28] built a gradient-boosted tree prediction model for routinely available lab parameters (CA, n = 121) and achieved an ROC AUC score of 0.86 on the test set. Martini et al. [29] trained a CNN-based prediction model for LGE CMR images (CA, n = 107) and achieved an ROC AUC score of 0.982 on the test set. Our work can be directly compared to that of Martini et al., with some notable differences. First, we considered a larger patient cohort (n = 502 vs. n = 206) and a more realistic CA prevalence in a specialized center of 16% vs. 52%; second, on top of the LGE MRI, we also applied a CNN-based model to other imaging protocols, namely T1 mapping and raw CINE images. Our results confirm that an AI prediction model does not require any advanced knowledge of the disease and can potentially be agnostic of a specific imaging protocol.
These recent results demonstrate a remarkable milestone in an attempt to establish a fully computational diagnostic path for the diagnosis of CA to support the complex diagnostic work-up requiring a profound knowledge of experts from different disciplines. Comparing the performance of AI models on different data modalities, we can see that those that process CMR images are the ones that give the best diagnostic accuracy overall.
In their review, Slomka et al. [30] hypothesized that, due to the precise delineation of myocardial contours in LGE images, fully automatic feature extraction with deep learning techniques should be relatively easy. Our results, as well as those of Martini et al. [29], confirm this hypothesis. Therefore, we believe that at this point, it is almost inevitable that AI will be tightly incorporated into routine CA diagnostic practice. However, problems intrinsic and extrinsic to AI will most likely slow down its adoption at cardiac imaging centers. One of the biggest concerns about AI models is their lack of interpretability. Some of the noninvasive diagnostic algorithms rely on a list of radiomic features, such as shape features and the number of connected voxels that share the same intensity, which are widely accepted in clinical practice across the world. Current AI models do not necessarily form their predictions based on these accepted radiomic features. Hopefully, research on explainable AI (XAI) [31] may soon find an answer by translating the low-level patterns recognized by cryptic AI models into the language of accepted radiomic features. Extrinsic to AI, data uniformity, as well as the lack of standardization of data acquisition pipelines, are among the biggest challenges for reusable clinical prediction models [32]. To accelerate the successful adoption of AI in cardiac clinical practice, solving these intrinsic and extrinsic challenges should be prioritized next.

Limitations
AI algorithms are known to be data hungry and require significantly more positive samples compared to traditional statistical clinical prediction models [33]. Therefore, to increase the sample size of CA positives, we did not focus on the development of separate patient profiles for AL and ATTR (61% of all CA patients in our dataset). However, there is evidence that transmural and subendocardial LGE patterns may differentiate AL from TTR [10]. In addition, we observed differences in extracellular volume in our dataset (Supplementary Table S1). Therefore, it is our priority for future work to collect representative sample sizes for MR images of both CA types and develop their patient profiles using CNNs. In addition, we would like to test whether adding T2 time values as a marker would improve the diagnostic ability of the algorithm.
We are aware that this was a single-center study, and therefore, our developed algorithm may not generalize well to the general population. For example, patients with renal impairment (GFR < 30 mL/min/1.73 m²) did not undergo CMR imaging. Patients with CA had elevated Troponin T levels (p < 0.001), which is a well-known hallmark of CA [34]. Furthermore, because most CA patients were in advanced HF stages, our current algorithm would most likely fail to identify individuals with early or preclinical disease, which is a clear limitation of this study. However, our findings may fuel future research attempting to perform early diagnosis of CA.
Another limitation of the study is a gender mismatch between patients and controls. While it is known that males have a higher incidence of amyloidosis, which is also reflected in our cohort, we had slightly more female patients among controls.
Lastly, in our control patient cohort, we did not perform EMB for the exclusion of CA. However, control patients had alternative diagnoses with congruence between clinical presentation and imaging.
Notwithstanding such limitations, we firmly believe that prediction systems become more accurate if we are able to increase the available sample size. What we need for further validation studies are more publicly available cardiac datasets, similar to what is happening in the adjacent medical domains [35].

Conclusions
We demonstrate here that an automated classification of CA patients by CMR images using state-of-the-art CNNs is possible and akin to human experts (ROC AUC 0.96 for LGE CMR). This result likely contributes to the establishment of fully computational diagnostic approaches operating on CMR images for CA. With a future perspective, we firmly believe that our experience and guidelines for algorithmic construction of computational and noninvasive diagnostic tools will support the less experienced CMR centers with a low volume of CA. Our hope is that in future clinical practice, we will be able to avoid EMB altogether and establish an accurate diagnosis of CA with noninvasive techniques, such as CMR imaging, in the earliest stages of this rare disease.

Funding:
This research was partially supported by a national research grant from Austrian Society of Cardiology ("Reducing costs of segmentation labeling in cardiac MRI using explainable AI" to A.A., A.S.).

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the local ethics committee (EK #796/2010).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data underlying this article cannot be shared publicly for privacy reasons of individuals that participated in the study. The data will be shared on reasonable request to the corresponding author.