1. Introduction
Cardiac magnetic resonance (CMR) is currently the gold standard noninvasive imaging tool for the evaluation of heart function [
1]. The foremost step in CMR analysis is precise segmentation. In addition to clinical experience, accurate parameter quantification is also essential for clinical decision making and risk stratification [
2]. Unfortunately, manual segmentation is a time-consuming process which also introduces intra- or interoperator variability. Although automatic segmentation has advanced rapidly in recent years, human-level quality control (QC) of automatic segmentation is still mandatory for clinical purposes. This is particularly important for the segmentation of small objects, such as apical slices in CMR imaging or small nodules in lung CT scans [
3]. More importantly, low-quality and inaccurate segmentation is hard to detect and may lead to dramatic consequences. However, it is impractical to implement manual QC into an automatic pipeline. This highlights the urgent need for the development of an automatic QC system.
The purpose of the automatic segmentation tools is to lessen the burden for doctors and researchers, as well as to improve the stability and quality of segmentation. Additionally, these tools should provide a quality score for each automatic segmentation. Therefore, the absence of quality control is a limitation of automatic segmentation software, along with high expenses, patent restrictions, etc.
In the past decade, there has been an increasing amount of research on medical image QC systems. As a result, several automatic QC systems have been developed for medical image segmentation. Kohlberger developed a classifier to predict the error in segmentation methods without ground truth (GT) [
4]; Alba developed an automatic QC system based on the RF-based detector, a 3D-SPASM segmentation algorithm, and an anatomically driven classifier [
5]. Valindria proposed a reverse classification accuracy (RCA) method to predict the performance of a segmentation model on new data without GT [
6], and the RCA method was further adopted by Robinson for cardiac segmentation [
7]. More recently, deep learning (DL)-based prediction models have gained popularity. Fournel developed a DL-based automatic segmentation DSC prediction method and achieved good performances with various types of disease [
8]. Li proposed a pixel-level and image-level quality assessment system [
9]. All of the previous works have exhibited the importance of the QC system.
Radiomics has been developed for over a decade and has demonstrated impressive performances in various tasks. Additionally, radiomics features are calculated using precise equations, providing a significant advantage in terms of explainability. Furthermore, the computation load for radiomics methods is also significantly lower than that of DL methods. More importantly, CMR is an excellent imaging modality for radiomics analysis. This technique allows for the extraction of 2D features from single slice images as well as 3D features from reconstructed 3D images made up of 2D image stacks. The applications of radiomics have facilitated cardiovascular disease phenotyping [
10], differential diagnosis [
11], and prognosis prediction [
12]. However, we observed that few studies have used radiomics methods as tools for QC. Maffei proposed a radiomics-based QC system for cardiac CT segmentation, but their models were qualitative (they only predicted whether a segmentation was clinically acceptable or not), and the number of features used in their study was very large (a total of 78,000 for 25 substructures) [
13]. Although not in the field of cardiovascular disease, Sunoqrot showed that radiomics information could help to generate a quantitative quality score and facilitate the QC of prostate segmentation on T2 images [
14]; Wootton and Sakai also used radiomics to detect errors in radiation therapy [
15,
16].
Based on current understanding and previous studies of radiomics, we hypothesize that radiomics features can aid in the quality control of the segmentation of CMR images [
17]. The objective of this study is to establish a compatible pipeline that could incorporate DL-based segmentation and radiomics-based QC for CMR short-axis cine images. Our proposed pipeline has the following functions: (a) localizing, cropping, and segmenting the whole heart using DL; (b) detecting low-quality segmentations based on radiomics features (qualitative QC); (c) predicting DSC scores for automatic segmentation (quantitative QC); and (d) visualizing and analyzing the results.
2. Methods
Figure 1 shows the flowchart of this study.
2.1. Study Population
In this study, data were obtained from a private dataset and two publicly available external datasets. The private dataset was obtained from August 2017 to December 2021 from Shanghai Renji Hospital (Renji-2021) and included hypertrophic cardiomyopathy (HCM), dilated cardiomyopathy (DCM), hypertensive heart disease (HHD), and healthy controls (HC). The first external dataset containing HCM, DCM, and HC was obtained from the 2017 Automated Cardiac Diagnosis Challenge (ACDC-2017) training dataset [
18]. The second external dataset containing HCM, HHD, DCM, and HC was obtained from the 2020 M&M challenge (M&M-2020). The detailed inclusion and exclusion criteria for the Renji-2021 dataset were as follows.
The HCM inclusion criteria were (a) the genetic determination of an HCM mutation; (b) left ventricle hypertrophy (LVH) > 15 mm in the absence of known causes of hypertrophy [
19]; and (c) hypertrophy in a recognizable pattern, i.e., apical-variant HCM.
The HHD inclusion criteria were (a) electrocardiograph (ECG) demonstration of a hypertrophic LV (maximal LV wall thickness > 11 mm or LV mass to body surface area > 115 g/m² for men or >95 g/m² for women) in the absence of other cardiac or systemic diseases [
20] or (b) a diagnosis of arterial hypertension [
21].
DCM is defined by the presence of left ventricular (LV) dilation and systolic dysfunction in the absence of abnormal loading conditions or coronary artery disease sufficient to cause LV systolic impairment [
22].
The HC group consisted of healthy volunteers who demonstrated normal cardiac dimensions and volumes, normal cardiac function, and the absence of late gadolinium enhancement. None of the control subjects had a history of known cardiac disease, including cardiac surgery and interventions.
Exclusion criteria for all subjects were an established diagnosis of Fabry disease, cardiac amyloidosis, severe valvular disease, aortic stenosis, iron deposition, evidence of inflammatory processes in the myocardium or pericardium, history of ST-segment elevation myocardial infarction, and subjects experiencing activity of sufficient duration, intensity, and frequency to explain the abnormal LV wall thickness.
2.2. Image Acquisition
Renji-2021 dataset: The CMR examinations were performed with a 3T MRI scanner (Ingenia, Philips). A balanced steady-state free precession (SSFP) sequence with breath hold was used for cine imaging acquisition. The typical parameters were as follows: slice thickness = 6–8 mm, gap between slices = 6–10 mm, in-plane resolution = 0.8–1.2 mm × 0.8–1.2 mm, and number of cardiac phases = 30.
ACDC-2017 dataset: The acquisitions were obtained using a 1.5T scanner (Area, Siemens) and a 3T scanner (Trio Tim, Siemens). A conventional SSFP sequence with breath hold was used for cine imaging acquisition. The typical parameters were as follows: slice thickness = 5–8 mm, gap between slices = 5–10 mm, in-plane resolution = 1.37–1.68 mm × 1.37–1.68 mm, and number of cardiac phases = 28–40.
M&M-2020 dataset: In this dataset, CMR images were acquired with scanners from different vendors (Siemens, Philips, GE and Canon) with both 1.5T and 3.0T magnetic fields. The parameters were as follows: typical slice thickness = 9.2–10 mm, typical gap between slices = 10 mm, in-plane resolution = 0.85–1.45 mm × 0.85–1.45 mm, number of slices = 10–12, and number of frames = 25–30.
For all datasets, short-axis cine images covered LV and RV from the base to the apex.
2.3. Data Preparation
For all datasets, only end-diastole (ED) short-axis cine images were included in this study.
For the Renji-2021 dataset, images were exported in digital imaging and communications in medicine (DICOM) format. First, patients’ information was de-identified. An experienced cardiologist (C1, 4 years of CMR experience) manually removed images of insufficient quality. Selected ED images were then converted to .nii format (the same format as the ACDC-2017 and M&M-2020 datasets).
2.3.1. Manual Segmentation
Manual segmentation results from the ACDC-2017 and M&M datasets were available from the ACDC challenge website
https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html (accessed on 1 January 2023) and M&M challenge website
https://www.ub.edu/mnms/ (accessed on 1 June 2023). The open-source software ITK-SNAP (version 3.8.0) was used to manually segment the Renji-2021 dataset [
23]. The LV endocardium and epicardium were delineated as previously described [
11,
18]. To obtain the right ventricular cavity, the RV wall was not included. All images were segmented by C1 and verified by a second experienced cardiologist (C2, 5 years of CMR experience). After manual segmentation, the Renji-2021 dataset was combined with the ACDC-2017 dataset.
2.3.2. Pre-Processing
Original images and masks were processed with the following steps. First, the in-plane resolutions of the images and masks were resampled to 1.0 mm × 1.0 mm using SimpleITK [
24]; second, contrast-limited adaptive histogram equalization was applied to all images with the scikit-image library [
25]; third, the pixels’ gray-level value underwent min–max normalization.
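The resampling and normalization steps can be illustrated with a minimal sketch. It is not the authors' code: `scipy.ndimage.zoom` stands in for the SimpleITK resampler, the CLAHE step is omitted, and the function names and the example spacing of 1.25 mm are assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage


def resample_in_plane(image, spacing, new_spacing=(1.0, 1.0), order=1):
    """Resample a 2D array from `spacing` (mm/pixel) to `new_spacing`.

    Use order=1 (linear) for images and order=0 (nearest) for label masks.
    """
    zoom = (spacing[0] / new_spacing[0], spacing[1] / new_spacing[1])
    return ndimage.zoom(image, zoom, order=order)


def min_max_normalize(image):
    """Rescale gray levels linearly to the range [0, 1]."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


# Synthetic 100 x 80 slice with an assumed spacing of 1.25 mm x 1.25 mm.
rng = np.random.default_rng(0)
image = rng.random((100, 80)) * 1000.0
resampled = resample_in_plane(image, spacing=(1.25, 1.25))
normalized = min_max_normalize(resampled)
```

Masks would be passed through the same resampling function with `order=0` so that label values are not blended.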
2.3.3. Data Partition
The private Renji-2021 dataset and the ACDC-2017 dataset were combined and then randomly divided into three parts: a training dataset (60% of data, N = 350), a validation dataset (20% of data, N = 115), and an internal testing dataset (20% of data, N = 115). Within each part, patients with different types of disease were evenly distributed. The M&M-2020 dataset was independently used as an external testing dataset.
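A disease-stratified 60/20/20 partition of this kind could be sketched as follows. The function name, the seed, and the synthetic subject list are hypothetical; the paper does not state how the random split was implemented.

```python
import random
from collections import defaultdict


def stratified_split(subject_ids, labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split subjects into train/val/test while keeping disease proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sid, lab in zip(subject_ids, labels):
        by_label[lab].append(sid)

    train, val, test = [], [], []
    for lab in sorted(by_label):
        ids = by_label[lab]
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]  # remainder goes to the test part
    return train, val, test


# Synthetic example: 25 subjects for each of four groups.
groups = ["HCM", "DCM", "HHD", "HC"]
subject_ids = [f"{g}_{i:03d}" for g in groups for i in range(25)]
labels = [g for g in groups for _ in range(25)]
train_ids, val_ids, test_ids = stratified_split(subject_ids, labels)
```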
2.4. Machine Learning Scheme
2.4.1. DL Scheme
The proposed deep learning pipeline included a localization-cropping module and a segmentation module: (a) whole heart segmentation was obtained from a UNet-like model (Localization UNet, L-UNet); (b) the center of mass (CoM) was calculated at the slice-level for each image, and the weighted CoM was defined at the subject-level according to Equations (
1)–(
3); (c) images were cropped around the weighted CoM and an optimal size was chosen; and (d) anatomical structure segmentation was obtained from UNet-like models (Segmentation UNets, S-Unets) with cropped images. Models were trained with the training dataset. The validation dataset was utilized to monitor the model’s performance. The internal and external testing datasets were used for the model assessment.
where W represents the pixel number in slice j, and X and Y represent the location of the CoM for slice j on the x-axis and y-axis, respectively.
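Since the equations themselves are not reproduced here, the following is only a plausible reading of the weighted CoM: each slice-level CoM is weighted by the number of segmented pixels in that slice. The pixel-count weighting, the function names, and the clamped 160 × 160 crop are assumptions for illustration.

```python
import numpy as np


def slice_com(mask2d):
    """Return (x, y, pixel_count) for one binary slice, or None if empty."""
    ys, xs = np.nonzero(mask2d)
    if xs.size == 0:
        return None
    return xs.mean(), ys.mean(), xs.size


def weighted_com(masks):
    """Subject-level CoM: slice CoMs averaged with pixel counts as weights."""
    total_w = total_x = total_y = 0.0
    for mask2d in masks:
        result = slice_com(mask2d)
        if result is None:  # slices without any segmented pixel are skipped
            continue
        x, y, w = result
        total_w += w
        total_x += w * x
        total_y += w * y
    if total_w == 0:
        raise ValueError("no segmented pixels in any slice")
    return total_x / total_w, total_y / total_w


def crop_around(image, cx, cy, size=160):
    """Crop a size x size window centred on (cx, cy), clamped to the image.

    Assumes the image is at least `size` pixels in both dimensions.
    """
    half = size // 2
    h, w = image.shape
    x0 = min(max(int(round(cx)) - half, 0), w - size)
    y0 = min(max(int(round(cy)) - half, 0), h - size)
    return image[y0:y0 + size, x0:x0 + size]


# Two synthetic whole-heart masks: 400 pixels and 100 pixels.
masks = np.zeros((2, 200, 200), dtype=np.uint8)
masks[0, 90:110, 90:110] = 1
masks[1, 100:110, 120:130] = 1
cx, cy = weighted_com(masks)
patch = crop_around(np.zeros((200, 200)), cx, cy)
```

With this weighting, small apical masks pull the crop centre less than large mid-ventricular masks.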
Model Structure and Experiment Settings
An 18-layer UNet backbone was used in both L-UNet and S-UNets but with different input shapes. A detailed model structure is shown in
Figure 2. For L-UNet, after preprocessing, images maintained their height to width ratios and were resized to 256 × 256 pixels as inputs. For S-UNets, the inputs were 160 × 160 pixel precropped images, and four S-Unet model structures were included: (i) UNet; (ii) Attention UNet (A-UNet); (iii) Residual UNet (R-UNet); and (iv) Residual Attention UNet (RA-UNet).
Both L-UNet and S-UNets were trained for 200 epochs with an Adam optimizer with a 1
initial learning rate. The learning rate scheduler monitored the validation loss and reduced the learning rate by a factor of 0.2 with a patience of 7 epochs. The minimal learning rate was set to
; the loss functions used were DICE, cross entropy (CE), and the DICE + CE loss function (Equations (
4) and (
5)). Data augmentation was only used in S-UNets and included random rotation, translation, flipping, and elastic transformation using the Albumentations library [
26].
y represents the true label of pixel i, while p represents the predicted value of pixel i.
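As an illustration of these loss terms, the sketch below implements a common binary form of the Dice loss, the cross-entropy loss, and their sum. The exact formulation used in the study (smoothing term, multi-class handling, relative weighting) is not given here, so every such detail in the code is an assumption.

```python
import numpy as np


def dice_loss(y, p, eps=1e-7):
    """Soft Dice loss for binary labels y and predicted probabilities p."""
    y = np.asarray(y, dtype=np.float64).ravel()
    p = np.asarray(p, dtype=np.float64).ravel()
    intersection = np.sum(y * p)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(y) + np.sum(p) + eps)


def cross_entropy_loss(y, p, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    y = np.asarray(y, dtype=np.float64).ravel()
    p = np.clip(np.asarray(p, dtype=np.float64).ravel(), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def dice_ce_loss(y, p):
    """Unweighted sum of the two terms (equal weighting is an assumption)."""
    return dice_loss(y, p) + cross_entropy_loss(y, p)


y_true = np.array([1, 1, 0, 0])
perfect = dice_ce_loss(y_true, np.array([1.0, 1.0, 0.0, 0.0]))
inverted = dice_ce_loss(y_true, np.array([0.0, 0.0, 1.0, 1.0]))
```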
DL Model Evaluation
The performance of the DL models was evaluated based on the dice similarity coefficient (DSC), intersection over union (IoU), precision, and recall in the testing dataset (Equations (
6) and (
7)).
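For reference, the four metrics can be computed from binary masks using their standard definitions, as in the sketch below. The function name and the convention of returning 1.0 when a denominator is zero are choices made for this example.

```python
import numpy as np


def segmentation_metrics(gt, pred):
    """DSC, IoU, precision, and recall for two binary masks."""
    gt = np.asarray(gt).astype(bool)
    pred = np.asarray(pred).astype(bool)
    tp = np.sum(gt & pred)
    fp = np.sum(~gt & pred)
    fn = np.sum(gt & ~pred)

    def ratio(num, den):
        # Convention for this sketch: empty denominators count as perfect.
        return float(num) / float(den) if den > 0 else 1.0

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


metrics = segmentation_metrics([1, 1, 1, 0], [1, 1, 0, 1])
```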
2.4.2. Segmentation Quality Definition
For 2D segmentation, a DSC < 0.7 was defined as bad quality while a DSC ≥ 0.7 was defined as good quality, as in previous research [
7]. For 3D segmentation, the cutoff value was set to 0.85. Therefore, for each anatomical structure, each 2D segmentation or reconstructed 3D segmentation was classified into good- or bad-quality groups according to its actual DSC value.
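The labeling rule amounts to a threshold on the actual DSC that depends on the dimensionality. A minimal sketch, with hypothetical names, is:

```python
# Cutoffs taken from the text: 0.7 for 2D and 0.85 for 3D segmentations.
DSC_CUTOFF = {"2D": 0.70, "3D": 0.85}


def quality_label(dsc, dim):
    """Return 'good' if the DSC reaches the cutoff for `dim`, else 'bad'."""
    return "good" if dsc >= DSC_CUTOFF[dim] else "bad"


labels = [quality_label(0.70, "2D"), quality_label(0.69, "2D"),
          quality_label(0.84, "3D"), quality_label(0.85, "3D")]
```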
2.4.3. Radiomics Scheme
To generate a robust radiomics dataset for further analysis, segmentations spanning a wide range of DSC values were needed. Therefore, eight sets of model weights (the best model weights and seven suboptimal model weights) saved during the model training process were selected from each S-UNet and used to generate a new segmentation dataset based on the training and validation images. In theory, this yielded 32 segmentations (4 S-UNets × 8 sets of weights) of varied quality at both the 2D slice level and the 3D subject level. This dataset was used for radiomics feature extraction, feature selection, and model development. The testing data were used for the model performance evaluation, as described previously.
Definition of Suboptimal Models
During the segmentation model development stage, the validation loss was recorded after each epoch. Whenever the validation loss decreased compared to previous epochs, the model weights were saved to the local server. After training, there were therefore many saved model weights with different validation losses. The model weights with the lowest validation loss were named the best model, while the other saved model weights were named suboptimal models. By using suboptimal models, we were able to generate low-quality segmentations and use these to enlarge our radiomics dataset and improve the model’s generalization ability.
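The checkpointing logic can be sketched as below, assuming that "decreased compared to previous epochs" means a new minimum over all earlier epochs. The function name is hypothetical, and the way the seven suboptimal weights were chosen among the saved ones is not specified in the text.

```python
def select_checkpoints(val_losses):
    """Return (best_epoch, suboptimal_epochs) from per-epoch validation losses.

    A checkpoint is "saved" whenever the loss falls below every earlier value.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    saved = []
    best_so_far = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_so_far:
            best_so_far = loss
            saved.append(epoch)
    # The last saved checkpoint has the lowest loss; the rest are suboptimal.
    return saved[-1], saved[:-1]


best_epoch, suboptimal_epochs = select_checkpoints(
    [0.9, 0.7, 0.8, 0.5, 0.6, 0.4]
)
```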
Feature Extraction and Feature Selection
Radiomics features were extracted at the 2D and 3D levels from the original images. Before feature extraction, the images underwent normalization (normalization scale of 0 to 256) and discretization (with a bin width of 16). For 3D images, the z-axis spatial resolution was additionally resampled to 1.0 mm to achieve isotropic voxels. The radiomics features of the different anatomical structures (RVC, MYO, and LVC) were extracted separately with the PyRadiomics library [
27]. Thereafter, we had six radiomics groups, namely RVC-2D, RVC-3D, MYO-2D, MYO-3D, LVC-2D, and LVC-3D.
After feature extraction, within each feature group, the Pearson correlation coefficient (r) was calculated between features. Features with r > 0.8 were defined as being highly correlated and were removed [
28].
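One way to implement such a correlation filter is a greedy pass that keeps a feature only if it is not highly correlated with any feature already kept. The greedy order and the use of the absolute correlation are assumptions of this sketch; the text does not say which member of a correlated pair was removed.

```python
import numpy as np


def drop_correlated(features, threshold=0.8):
    """Return column indices kept after removing highly correlated features."""
    corr = np.abs(np.corrcoef(features, rowvar=False))
    kept = []
    for j in range(features.shape[1]):
        # Keep column j only if it is weakly correlated with all kept columns.
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept


rng = np.random.default_rng(0)
a = rng.normal(size=200)
c = rng.normal(size=200)
# Column 1 is almost a linear copy of column 0; column 2 is independent.
X = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=200), c])
kept_columns = drop_correlated(X)
```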
For the regression task, feature selection was performed based on the highest mutual information dependency with DSC values [
29]; for the classification task, feature selection was performed based on the analysis of variance (ANOVA) F-value between the segmentation quality (good [1] or bad [0]) and feature values. To improve the explainability of our proposed models, and considering the number of subjects in this study, we selected at most 12 features for model development, which was similar to previous studies [
30].
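Both selection criteria are available in scikit-learn, so the step can be sketched with `SelectKBest`. The synthetic feature matrix and the fixed `random_state` are assumptions for reproducibility; whether the authors used `SelectKBest` itself is not stated.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import (
    SelectKBest,
    f_classif,
    mutual_info_regression,
)

# Synthetic data: 30 features, of which only feature 0 drives the DSC.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
dsc = np.clip(0.75 + 0.1 * X[:, 0] + 0.01 * rng.normal(size=200), 0.0, 1.0)
quality = (dsc >= 0.7).astype(int)  # good = 1, bad = 0

# Regression task: rank features by mutual information with the DSC.
reg_selector = SelectKBest(partial(mutual_info_regression, random_state=0), k=12)
X_reg = reg_selector.fit_transform(X, dsc)

# Classification task: rank features by ANOVA F-value against the label.
clf_selector = SelectKBest(f_classif, k=12)
X_clf = clf_selector.fit_transform(X, quality)

reg_idx = reg_selector.get_support(indices=True)
clf_idx = clf_selector.get_support(indices=True)
```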
Regression Model Development and Evaluation
Although various DSC values were present in our radiomics dataset, to generate a balanced model, weights were calculated for good- and bad-quality groups when appropriate. Five types of regression model were tested: (1) random forest regressor (RFR), (2) gradient boost regressor (GBR), (3) K nearest neighbor regressor (KNNR), (4) linear regression regressor (LRR), and (5) multilayer perceptron regressor (MLPR). Support vector machine methods were abandoned due to their long computation times. All regressors were trained with five-fold cross validation and a grid search to determine the best combination of parameters. The final regression model was developed with all training and validation datasets. The model performance was evaluated in the testing dataset using the mean absolute error (MAE). The coefficient of determination (R²) of the predictions is also reported. The models with the best performance were selected.
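The training procedure for one of the regressors can be sketched with scikit-learn as follows. The parameter grid, the synthetic data, and the omission of sample weights are simplifications chosen for this example and do not reflect the grid actually searched.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for 12 selected radiomics features and actual DSC values.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
dsc = np.clip(
    0.8 + 0.1 * X[:, 0] - 0.05 * X[:, 1] ** 2 + 0.01 * rng.normal(size=300),
    0.0,
    1.0,
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, dsc, test_size=0.2, random_state=0
)

# Five-fold cross-validated grid search scored by (negative) MAE.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_dev, y_dev)  # refit=True retrains on all development data

pred = search.predict(X_test)
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
```

Because `refit=True` is the default, the best parameter combination is retrained on the full development set, which mirrors developing the final model on the combined training and validation data.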
Classification Model Development
We further developed a series of classification models based on the radiomics features to evaluate the value of the radiomics features for segmentation quality classification. The regressors’ performance was re-evaluated in the images predicted as being of good quality. Two classifiers were selected: (1) the random forest classifier (RFC) and (2) the gradient boost classifier (GBC). Since we aimed to develop a QC system, the classification performance was mainly evaluated with the negative detection rate (NDR), as shown in Equation (
8).
TN = true negative; FN = false negative. Negative results represent bad-quality segmentation predictions.
Distribution Pattern among Disease Types, Segmentation Quality, and MAE
To explore how disease type and segmentation quality varied with the MAE, density plots were generated. Bland–Altman analyses were used to show the agreement between actual DSC scores and predicted DSC scores.
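The quantities behind a Bland–Altman plot are straightforward to compute. In the sketch below the difference is taken as actual minus predicted DSC; this sign convention, the function name, and the example values are assumptions.

```python
import numpy as np


def bland_altman(actual, predicted):
    """Return per-sample means and differences, the bias, and the 1.96 SD limits."""
    actual = np.asarray(actual, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    mean = (actual + predicted) / 2.0  # x-axis of the plot
    diff = actual - predicted          # y-axis of the plot
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return mean, diff, bias, (bias - 1.96 * sd, bias + 1.96 * sd)


actual_dsc = [0.90, 0.80, 0.00]
predicted_dsc = [0.88, 0.85, 0.40]
mean, diff, bias, limits = bland_altman(actual_dsc, predicted_dsc)
```

For the third sample, where the actual DSC is 0 and the predicted DSC is 0.40, the point lands at (0.20, −0.40), so its difference is exactly −2 times its mean.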
2.5. Post Hoc Analysis
To verify the usefulness of the classification models for the QC system performance, we compared the MAE values of all segmentations with those of the predicted good-quality segmentations.
With the advent of the segment anything model (SAM) [
31], we chose some of our images as inputs and tested its segmentation ability with the “every” mode
https://segment-anything.com/demo (accessed on 10 April 2023).
2.6. Statistical Analysis
SPSS (version 26) and Python (version 3.7.10) were used for the statistical analysis. The model performance was assessed using the MAE (between the predicted DSC and actual DSC) as previously described.
To compare means, Student’s t-tests were conducted as appropriate. Class weights were calculated when appropriate using the scikit-learn library [
32].
4. Discussion
In this study, we developed an analysis platform that incorporates DL-based automatic segmentation and radiomics-based QC for short-axis CMR cine images. To achieve this, we first developed a localization and segmentation pipeline using UNet models. Thereafter, we developed a two-stage radiomics-based QC system for automatic segmentations. Our hypothesis, that radiomics features could facilitate the QC of automatic segmentations, was validated through experiments. By using RF classifiers and GB regressors, our methods exhibited a high mal-segmentation detection rate and accurate DSC estimation in most situations.
4.1. Discussion Regarding Model Performance
4.1.1. DL Model Performance
As shown in
Table 4, our DSC scores for 2D segmentations were 0.863, 0.940, and 0.872 for RVC, LVC, and MYO, respectively. These results demonstrate that our model structure is suitable for the segmentation task. However, as depicted in
Figure 3, our four S-Unets exhibited a decreased segmentation performance on certain apical and basal slices. This phenomenon was also observed in the recent SAM segmentations (
Figure 10). During our analysis, we observed that S-Unets with various modifications (residual, attention parts) exhibited slightly varying performances on ambiguous regions of interest (ROIs). For instance, S-Unets equipped with attention structures tended to segment the right ventricular cavity more accurately in basal slices, while those with residual parts displayed better performance levels on smaller ROIs (
Figure 3). Moreover, the variability of low-quality segmentations offered unique samples for our radiomics dataset. This variability partially explains why we incorporated modified model structures in our analysis pipeline.
4.1.2. QC Performance on Our Dataset
The regression models performed well on all subgroups in the training dataset with better performances seen in the 2D subgroups, as characterized by the R² values (
Figure 4). However, this difference could be partially attributed to the larger sample size used in the 2D segmentations. We further utilized RF classifiers and found that, in the 2D subgroups, all NDRs were above 85% (
Figure 6), indicating that the radiomic-feature-based classification models could effectively identify low-quality segmentations.
One notable exception was the 3D-RVC group, which exhibited an NDR of only 16.7%. Upon examining the density plots for the MAE (
Figure 7), we observed that as the MAE increased, the proportion of predicted bad-quality segmentations also increased for all segmentation results (with the exception of the 3D-RVC group). The distribution pattern of 2D MAE was relatively balanced across the various disease groups (
Figure 8). With the failure of our model in the RVC-3D subgroup, we checked the radiomics features included in
Supplementary Table S5, and found that most of them belong to the shape feature family. Due to the underlying pathology, the shape variance of the RVC is much higher than that of the LVC or MYO (the RVC shape is more sensitive to external changes, such as myocardial hypertrophy or hemodynamic changes). Additionally, the segmentation models failed to segment many RVC apical slices. After reconstruction, this could lead to instability of the 3D-RVC features.
By comparing two Bland–Altman plots for each subgroup, we showed the prediction results for all segmentations and for the predicted good-quality segmentations. Most excluded segmentation instances were located outside of the ±1.96 SD interval, as shown in
Figure 9. We also noticed a group of outlying points in the Bland–Altman plots, especially in the RVC-2D subgroup. For an image with an actual DSC equal to 0 and a predicted DSC equal to D (meaning that the segmentation model segmented some irrelevant area as an ROI), the x-axis location of that point is 0.5D and the y-axis location is −D. This explains why all of those points are located on the line ‘y = −2x’. We also examined our segmentation performance retrospectively and found that apical slices were hard to segment in some subjects. In the Bland–Altman plots, those low-quality segmentations were obvious in our 2D-RVC group. Fortunately, our classification model successfully detected most of these segmentations.
4.2. QC Performance Compared with Previous Methods
We compared our methods with a previous RCA method [
7] and a DL-based method [
8]. Robinson et al. performed experiments on several datasets with 2D segmentations for the RVC, LVC, and MYO subgroups. Their MAE values were 0.030–0.146, 0.020–0.082, and 0.044–0.268, respectively, whereas our MAE values for RVC, LVC, and MYO were 0.060, 0.021, and 0.032. One obvious drawback of the RCA method is the need for a reference dataset. A larger reference dataset could lead to overestimation of the segmentations’ quality. For the 3D segmentation QC performance, we compared our results with previous DL-based QC results. The MAE for MYO was 0.017 ± 0.016 (ours) vs. 0.016 ± 0.028 (DL); for LVC, it was 0.011 ± 0.020 (ours) vs. 0.012 ± 0.017 (DL). These comparison results show that our models perform well for automatic QC regarding the LVC and MYO structures. However, experiments were not performed on the RVC subgroup.
4.3. QC with the Mature Segmentation Model
In contrast to diagnostic applications, a QC system should prioritize the detection of bad-quality segmentations, rather than improving the classification accuracy. As a result, we chose the NDR as our primary criterion for the classification evaluation. Additionally, we observed that deep learning segmentation models for cardiac segmentation are currently well-developed. The reported average DSC is 0.85–0.97 for the ACDC-2017 dataset and M&M challenge dataset with various DL model structures [
3,
33,
34,
35]. As previously mentioned, mal-segmentations at the slice level are primarily distributed in the apical or basal regions of the heart. However, these low-quality segmentations have little effect on the 3D DSC prediction. This partially explains why the 3D DSC is higher than the 2D DSC (refer to
Table 4). However, it is important to note that the absence of apical or basal slice segmentations can significantly impact the 3D radiomics features, such as the maximal long axis length, particularly in the shape feature group. This phenomenon was also observed in our experiment (
Supplementary Tables S7 and S8). Meanwhile, although coarse borders may not significantly impact the 2D or 3D DSC, certain radiomics features are sensitive to edges, as demonstrated in previous studies [
36,
37,
38]. The characteristics exhibited by radiomics-based quality control systems provide a new evaluation perspective compared to previous methods. This makes radiomics an ideal method for automatic segmentation evaluation and quality control. As shown in
Figure 10, the segmentation of apical slices remains challenging, which is why a quality control system is necessary, even with the use of large models, such as SAM.
4.4. Technical Innovations and Clinical Insights
To the best of our knowledge, this is the first study to utilize radiomics as a quality control tool for automatic CMR segmentations. Our findings demonstrate that radiomics techniques yield accurate DSC predictions. In addition, our method is capable of efficiently detecting mal-segmentations, with NDR values greater than 85% in all 2D groups (90.0% [RVC], 93.0% [LVC], and 85.5% [MYO]). As the comparison with previous methods showed, our models exhibited strong QC performance for both 2D and 3D segmentations.
Our method is also computationally friendly. In our case, with an NVIDIA RTX 3090 GPU and an AMD 3900X CPU, the training time for the localization model and segmentation model was less than 5 h. We also tested the training process on an NVIDIA RTX 3060 (12 GB memory), which was also capable of carrying it out. More importantly, the UNets used in this study have been examined in various tasks with numerous variants, so each center can develop its own dedicated models.
In this study, we aimed to provide a new perspective for QC, and we tested the feasibility of our proposed method. Once the quality control pipeline is built, the operator only needs to decide the ED frame, and the computation time is <10 s for each instance, which is 60 times faster than that of the RCA method, as previously reported [
7]. With its short computation time, our method has potential for real-time DSC prediction in a clinical scenario. Timely detection of low-quality segmentations could save time for researchers and reduce the human workload.
This method has another obvious advantage compared with the RCA method: we do not need to select a reference dataset as in the RCA method [
7]. Additionally, we do not need to test the reproducibility of selected features. The manually derived ROIs were only used for calculating the DSCs of automatic segmentations. The radiomics features of manual segmentations were not extracted or analyzed.
4.5. Limitations
This study had several limitations. Firstly, while we did include the ACDC-2017 and M&M-2020 datasets, the majority of data for our training and validation datasets (nearly 90%) were derived from a single center (Renji Hospital). Secondly, from a practical perspective, most dedicated segmentation models have shown great results. To address this, we included suboptimal models for the radiomics training dataset generation. However, during the testing phase, we only used segmentations from the optimal S-UNets to evaluate the model performance for both regression and classification. Thirdly, radiomic features can only be extracted from images that have specified ROIs. Therefore, our method is not applicable to images that lack segmentations. However, the missing information may be reflected in the 3D radiomics characteristics. Fourthly, the contours of different structures are clearer in the ED phase than in the ES phase; therefore, only the ED phase was selected in this study.