Cardiac Disease Classification Using Two-Dimensional Thickness and Few-Shot Learning Based on Magnetic Resonance Imaging Image Segmentation

Cardiac cine magnetic resonance imaging (MRI) is a widely used technique for the noninvasive assessment of cardiac function. Deep neural networks have achieved considerable progress in overcoming various challenges in cine MRI analysis. However, deep learning models are difficult to apply to classification because only limited cine MRI data are available. To overcome this problem, handcrafted features derived from cine image settings are typically combined with other clinical features in a classical machine learning approach so that the model fits the MRI device settings and image parameters required in the analysis. In this study, a novel method was proposed for classifying heart disease (cardiomyopathy patient groups) using only segmented output maps. In the encoder-decoder network, the fully convolutional EfficientNetB5-UNet was modified to perform the semantic segmentation of the MRI image slices. A two-dimensional (2D) thickness algorithm was used to combine the segmentation outputs into 2D representation images of the end-diastole (ED) and end-systole (ES) cardiac volumes. The thickness images were subsequently classified using a few-shot model with an adaptive subspace classifier. Model performance was verified on the 2017 MICCAI (Medical Image Computing and Computer-Assisted Intervention) challenge dataset. High segmentation performance was achieved: the average Dice coefficients were 96.24% (ED) and 89.92% (ES) for the left ventricle (LV), 92.90% (ED) and 86.92% (ES) for the right ventricle (RV), and 88.90% (ED) and 90.48% (ES) for the myocardium. An accuracy of 92% was achieved in the classification of various cardiomyopathy groups without clinical features. The proposed method offers a novel, rapid analysis approach for heart disease diagnosis, especially for cardiomyopathy conditions, using cine MRI based on segmented output maps.


Introduction
Heart diseases, such as coronary, arrhythmia, congenital, and muscle disease (cardiomyopathy) [1], are the leading cause of death worldwide. Cardiomyopathy has attracted considerable research attention because it is typically associated with patients with heart failure [2]. Several cardiomyopathy types, such as dilated cardiomyopathy (DCM), hypertrophic cardiomyopathy (HCM), heart failure due to myocardial infarctions (MINFs), and right ventricular abnormalities, have been reported [3]. Early identification of cardiac dysfunction and rapid diagnosis of heart disease can be achieved by evaluating cardiac parameters, such as end-systolic (ES) volume, end-diastolic (ED) volume, ejection fraction, and stroke volume, using magnetic resonance imaging (MRI) [4]. Computer-aided automated classification and segmentation solutions have alleviated the time-consuming nature of manual diagnostics and its interobserver variability [5,6]. Moreover, these problems have spurred the development of deep neural network (DNN) methods for cardiomyopathy diagnosis.
The output of deep-learning-based cine MRI segmentation can be used as the basis for obtaining diagnostically relevant derived features, such as ventricular volume [7], myocardial mass, and ejection fraction (EF). These derived features are clinical parameters or clinical indices. These values are generally evaluated twice, at the end of the diastolic and systolic phases (ED and ES, respectively). The accuracy of the calculated clinical parameters depends on the accurate depiction of the contours of the underlying cardiac structures. Isensee et al. [8] used UNet segmentation by combining 2D and 3D modeling to obtain derived features for diagnosis. Khened et al. (2019) [9] proposed a 2D multiscale fully convolutional network architecture based on residual DenseNet to derive the features of cardiac physiological parameters and handcrafted features. UNet outperforms most other methods; however, the derived features in this technique are obtained by acquiring the slice number and instrument setting, which increases classification complexity.
The classification of cine MRI based on derived features is highly reliable. Classical machine learning methods, such as random forest [10] and multilayer perceptron (MLP) [8], have been used for disease classification. Khened et al. [11] proposed an ensemble of multiple models for improving classification performance. The small amount of data renders conventional machine learning more suitable than deep learning. Deep learning was also used by Isensee et al. [8] with augmentation and MLP for classification; however, this method included derived information. Within deep learning, few-shot learning is highly preferable in areas with limited data.
The few-shot model was introduced to generalize a new class with a limited number of labeled examples for each class. The meta-learning paradigm was used to achieve accurate DNN models with less training data [12]. Few-shot learning has been implemented through several approaches such as transfer learning [13], metric learning [12,14,15], data augmentation [16], and model optimization [17]. These methods typically work well within the same domain; however, their performance deteriorates in cross-domain applications [18]. Chen et al. [19] proved that feature improvement using a deeper encoder can minimize the variation in scores between methods. The metric learning-based approach exhibits stable performance and can be flexibly used to study the similarities and differences between classes. This phenomenon is more advantageous when the dataset is in the same domain; however, the baseline model [13] is preferable in cross-domain cases.
The adaptive subspace [15] approach is a subspace-based dynamic classifier that has been developed to address data ambiguity with class representations based on subspace mechanisms. This model outperforms previous metric learning methods such as matching networks [12] and prototypical networks [14]. The subspace mechanism can be easily customized and, because it supports backpropagation, can optimize both the feature representation in the encoder network and the class representation in the subspace classifier. Moreover, episodic learning can be used freely. Randomly permuting the samples of each class for a given number of shots and queries produces many possible combinations that can be drawn as episodes.
The main contributions of this paper are summarized as follows: (1) Several lightweight encoders were ensembled using a block-inverted residual network in the UNet architecture for automatic cine MRI segmentation optimization. (2) We proposed a novel 2D thickness algorithm to decode the segmentation outputs into 2D representation images of the ED and ES cardiac volumes, which serve as the input for classifying cardiomyopathy patient groups without using clinical features. (3) A few-shot model with an adaptive subspace classifier was proposed, and various encoder few-shot models were investigated for deep-learning-based cardiac disease group classification. Unlike the general few-shot learning mechanism, the same classes and domain datasets are provided in the few-shot learning mechanisms in the training, validation, and testing phases. (4) The ensemble method was used to obtain an even distribution of the class representations from this few-shot mechanism [20]. The source code for the developed models is available at https://github.com/bowoadi/cine_MRI_segmentation_classification (accessed on 6 July 2022).

Dataset
In this study, the MICCAI 2017 automated cardiac diagnosis challenge (ACDC) dataset was used [21]. The dataset consisted of the cardiac cine image series of one cardiac cycle at various short-axis slice positions, from basal to apical, collected from the examination of 150 patients at the University Hospital of Dijon (France). Two MRI scanners with distinct magnetic strengths (1.5 T and 3.0 T) were used to acquire the short-axis cine MRI slices. The obtained slices covered the LV and RV from the base (top slice) to the apex (bottom slice), with a slice thickness of 5-8 mm, an inter-slice gap of 5 mm or 10 mm, and a spatial resolution of 1.37-1.68 mm/px. Each cardiac cycle was covered by 28 to 40 frames in the time dimension. The patients were grouped under five cardiac conditions, namely dilated cardiomyopathy (DCM), hypertrophic cardiomyopathy (HCM), heart failure due to MINFs, abnormal RV (ARV), and healthy or normal patients (NOR); each group contained 30 patients. The ground truth for segmentation had four labels, namely background (0), RV (1), myocardium (2), and left ventricle (3). Information on disease class and ground truth segmentation was provided for 100 patients, whereas the data pertaining to the other 50 patients were used as test data and submitted in the post-competition phase to obtain the test scores.

Proposed Method
The proposed architecture consisted of a 2D segmentation model based on a U-Net-based encoder-decoder structure and a classification model based on few-shot learning (Figure 1). The proposed 2D thickness algorithm connected the two tasks by converting the output maps of the 2D segmentation network into a 2D image that represents the thickness of the short-axis slices for each patient. In this experiment, a lightweight model based on an inverted residual network [22] was used for the segmentation and classification encoders. EfficientNet, which is built from inverted residual blocks, was used as the encoder with pre-trained weights. This choice is intended to strengthen the feature representation for both segmentation and few-shot classification. EfficientNet [23] performance was compared with that of MobileNetV3 [24], which also has an inverted residual network architecture base [25].

Segmentation Model and 2D Thickness Algorithm
The segmentation model was inspired by the lightweight encoder-decoder MobileNetV3-UNet [26]. The lightweight model based on an inverted residual network exhibits high generalization capability while being faster and lighter, and its architecture is optimized through network architecture search, which improves the accuracy of feature representation. UNet is a fast, generalized semantic segmentation model, particularly for handling biomedical images [27]. UNet captures global and local features through skip connections at each stage; therefore, the feature representation of the encoder at each stage is essential for improving performance. The MobileNetV3-UNet model was modified by replacing the encoder with EfficientNet [23] and subsequently fine-tuning the models from EfficientNetB0 to EfficientNetB5.

Ronneberger et al. first proposed the UNet architecture [27], a fully convolutional neural network that classifies each pixel using an encoder-decoder pattern. As shown in Figure 2, the input image passes through a series of blocks marked in red, which form the contraction path, or encoder. The encoder can be replaced with state-of-the-art encoder variants, such as EfficientNet [23] and MobileNetV3 [24]. Typically, downsampling is performed by a convolution with a stride of two, starting directly at the first block; the red arrows therefore illustrate the connections between blocks after each downsampling phase. In each downsampling phase, the height H and width W of the feature map are halved, and C0 to C5 are the channel sizes extracted from each block, which depend on the encoder variant used. Feature maps are sent from the encoder to the decoder through skip connections; therefore, the resolutions of the corresponding encoder and decoder stages should be the same. The lowest-resolution encoder block, also called the bottleneck, ends the contraction path and starts the expansion path. The expansion path, or decoder, marked in blue, applies the opposite operations in reverse order. Each decoder block consists of a 3 × 3 convolution followed by batch normalization and ReLU, performed two times. The upsampling block (marked with blue arrows) gradually increases the spatial dimensions, doubling the H and W of the feature map. At the final level of the expansion path, a 1 × 1 convolution projects the output channels to the number of segmented classes. Figure 3 displays the interaction of the segmentation network with other networks.
The ensemble method involves combining the pixel-wise classification results from more than a single semantic segmentation model. The first semantic segmentation model is trained with a given composition of training, validation, and testing input sets. Each of the other semantic segmentation models receives the same composition of training, validation, and testing input sets as the first. Every semantic segmentation model produces a segmentation output probability. The segmentation output probabilities from these single models are combined and averaged. The softmax function is used to determine the class for each pixel from the averaged segmentation output probability.
Several preprocessing steps were applied because of variations in the image frame size of each patient. The steps were initiated by applying auto-zero padding to maintain image proportions. In auto-zero padding, zero pixels are added to the vertical or horizontal sides in a balanced manner. The obtained square-shaped image was resized to 256 × 256. Finally, the grayscale image was duplicated in three channels. Random augmentation [26] was applied to expand the diversity of images to achieve robustness against complex images. The input size of the segmentation model was 256 × 256 × 3.
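The preprocessing steps described above (balanced zero padding to a square, resizing to 256 × 256, and channel duplication) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and nearest-neighbour resizing is an assumption made to keep the example dependency-free, since the interpolation method is not stated in the text.

```python
import numpy as np

def pad_to_square(image):
    """Zero-pad a 2D grayscale slice to a square, splitting the padding evenly."""
    height, width = image.shape
    size = max(height, width)
    pad_h, pad_w = size - height, size - width
    pad = ((pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2))
    return np.pad(image, pad, mode="constant", constant_values=0)

def resize_nearest(image, size):
    """Nearest-neighbour resize to size x size (assumed interpolation)."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows][:, cols]

def preprocess_slice(image, size=256):
    """Pad, resize, and duplicate the grayscale slice into three channels."""
    square = pad_to_square(image)
    resized = resize_nearest(square, size)
    return np.stack([resized] * 3, axis=-1)
```

For a 200 × 150 slice, `pad_to_square` adds 25 zero columns on each side, and `preprocess_slice` returns an array of shape 256 × 256 × 3.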
The segmentation model produces an output map $O$ with dimensions of 256 × 256 × 4, expressed as $H \times W \times C$, where $C = [c_0, c_1, c_2, c_3]$ represents the background, right ventricle (RV), myocardium (MYO), and left ventricle (LV), respectively. These output maps are used in the 2D thickness algorithm (Algorithm 1) to develop the 2D representation images of the ED and ES cardiac volumes. For all patients, the segmentation output maps were regrouped for the ED and ES frames, expressed as $P = \{(p_{0,0}, p_{1,0}), \ldots, (p_{0,n-1}, p_{1,n-1})\}$, where $n$ is the number of patients. Because the number of short-axis slices in a frame $p_f$ varies, only $N$ slices were considered, where $N$ was obtained from the least number of slices in each frame ($N = 6$). Merging was applied by filling the actual pixels of slice $k$ with values from $s_k$, where $S$ is a set of $N$ unique values. The merging of the labels of the RV, MYO, and LV forms the red-green-blue (RGB) channels. The background label was not used because the background was automatically created when the three non-background channels were merged. The resulting RGB images $I$ of the ED and ES frames were center-cropped, resized to 64 × 64 × 3, and joined vertically for each patient. The final size after post-processing was 128 × 64 × 3.
The post-processing results were used as inputs for the few-shot classification model. The use of the few-shot mechanism in this study differed slightly from its general use. In the general setting, the model learns the similarities and differences between classes, and the classes in the support and query samples differ across the training, validation, and testing phases. In this study, the model was trained, validated, and tested by using the same five classes with an episodic mechanism. The prediction output (class representation) for each class in the testing phase was selected through majority voting.
Algorithm 1 (fragment). 2D thickness algorithm
  M ← empty array with shape of n × 128 × 64 × 3
  for i = 0 to n do
    for f = 0 to 1 do
      I ← blank image with shape of 256 × 256 × 3
      for k = 0 to N do
        ...
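The 2D thickness algorithm can be sketched in NumPy as below. This is a hedged reading of the description rather than the authors' code. Several details are assumptions: per-slice values are accumulated additively (suggested by the fact that the six-slice values reported in the classification results, 14, 26, 38, 49, 59, and 69, sum to 255), the center crop is 128 pixels wide, and resizing uses nearest-neighbour sampling. All function names are hypothetical.

```python
import numpy as np

S_SIX = (14, 26, 38, 49, 59, 69)          # six-slice values reported in the paper
LABEL_TO_CHANNEL = {1: 0, 2: 1, 3: 2}     # RV -> R, MYO -> G, LV -> B

def thickness_image(label_maps, s_values=S_SIX):
    """label_maps: (N, H, W) integer labels 0..3; returns an (H, W, 3) uint8 image."""
    n_slices, height, width = label_maps.shape
    image = np.zeros((height, width, 3), dtype=np.int64)
    for k in range(n_slices):
        for label, channel in LABEL_TO_CHANNEL.items():
            # Assumption: slice contributions are summed, so overlap reads as thickness.
            image[..., channel] += (label_maps[k] == label).astype(np.int64) * s_values[k]
    return np.clip(image, 0, 255).astype(np.uint8)

def center_crop(image, size):
    top = (image.shape[0] - size) // 2
    left = (image.shape[1] - size) // 2
    return image[top:top + size, left:left + size]

def resize_nearest(image, size):
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows][:, cols]

def patient_image(ed_maps, es_maps, crop=128, out=64):
    """Stack the ED (top) and ES (bottom) thickness images into 128 x 64 x 3."""
    halves = [resize_nearest(center_crop(thickness_image(m), crop), out)
              for m in (ed_maps, es_maps)]
    return np.concatenate(halves, axis=0)
```

Under the additive assumption, a pixel labelled LV in all six slices reaches 255 in the blue channel, while one labelled only in the first three slices reaches 14 + 26 + 38 = 78, so brightness encodes how many slices (and which ones) contain the structure.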

Few-Shot Learning Model
The few-shot learning method was used as the classification model to handle limited data. Few-shot terminology is used for training multiple samples in each iteration of the meta-learning mode. In the training process, an episode $T_i$ consisted of two types of sets, namely the support set $S = \{(x_{1,1}, c_{1,1}), \ldots, (x_{N,K}, c_{N,K})\}$ and the query set $Q = \{q_1, \ldots, q_{N \times M}\}$. For each iteration, the sizes of $S$ and $Q$ were limited based on N-way, which denotes the number of classes, and K-shot or M-query, which denote the number of samples in each class. The few-shot model has two parts, namely the encoder and the classifier.
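The construction of an episode can be illustrated with a short sketch. The function name, the data layout (a dictionary mapping each class to its samples), and the sample identifiers in the usage note are hypothetical; only the N-way, K-shot, and M-query structure comes from the text.

```python
import random

def sample_episode(samples_by_class, n_way, k_shot, m_query, rng=None):
    """Draw one episode: a support set of N*K and a query set of N*M labelled samples."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(samples_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw disjoint support and query samples for this class.
        picked = rng.sample(samples_by_class[label], k_shot + m_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query
```

With five classes, five shots, and five queries per class, each call returns 25 support samples and 25 query samples, and no sample appears in both sets.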
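A dynamic classifier based on an adaptive subspace [15] was used in this study. Classification was performed based on the subspace, using the shortest distance between a data point and its projection onto the subspace. With $f_\theta$ denoting the encoder with parameters $\theta$ and $q$ a query sample, the subspace classifier calculation, following [15], is formulated as follows:
$$d_c(q) = \lVert (I - M_c)(f_\theta(q) - \mu_c) \rVert^2 \qquad (1)$$
where $M_c = P_c P_c^T$ and $\mu_c$ denote the projection onto the subspace of class $c$ and the offset between the data point and the subspace, respectively, and $P_c$ is the truncated matrix of $B_c$, an orthogonal basis for the linear subspace of class $c$. The probability of a query belonging to class $c$ is determined using the softmax function as follows:
$$p_{c,q} = \frac{\exp(-d_c(q))}{\sum_{c'} \exp(-d_{c'}(q))} \qquad (2)$$
Backpropagation through singular value decomposition can be used to minimize the negative log of $p_{c,q}$.
The projection metric on Grassmannian geometry was used as a discriminative method to maximize the margin between two subspaces $P_i$ and $P_j$ during training and is formulated as follows:
$$\delta_p^2(P_i, P_j) = \lVert P_i P_i^T - P_j P_j^T \rVert_F^2 = 2n - 2 \lVert P_i^T P_j \rVert_F^2 \qquad (3)$$
where $n$ is the subspace dimensionality. The projection metric was maximized by minimizing $\lVert P_i^T P_j \rVert_F^2$, which leads to the loss function in Equation (4); subsequently, $L_t$ can be used to update $\theta$:
$$L_t = -\frac{1}{NM} \sum_{q \in Q} \log p_{c_q,q} + \lambda \sum_{i \neq j} \lVert P_i^T P_j \rVert_F^2 \qquad (4)$$
where $c_q$ is the true class of query $q$.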
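To make the subspace classifier concrete, the following NumPy sketch computes, for each class, a truncated orthogonal basis from the centered support features via singular value decomposition, then the squared distance of each query to its projection onto that subspace, and finally a softmax over negative distances. It follows the general adaptive subspace idea of [15] under stated assumptions: the class mean is used as the offset $\mu_c$, the subspace dimension `dim` is a free parameter, and the function names are hypothetical. It covers inference only and omits the discriminative loss term.

```python
import numpy as np

def subspace_distances(support, query, dim):
    """support: (N, K, D) encoded support features; query: (Q, D).

    Returns a (Q, N) array of squared distances to each class subspace.
    """
    distances = []
    for class_feats in support:
        mu = class_feats.mean(axis=0)                  # offset (class mean, assumed)
        centered = (class_feats - mu).T                # (D, K)
        u, _, _ = np.linalg.svd(centered, full_matrices=False)
        basis = u[:, :dim]                             # truncated orthogonal basis P_c
        diff = query - mu                              # (Q, D)
        residual = diff - diff @ basis @ basis.T       # (I - P_c P_c^T)(f(q) - mu_c)
        distances.append((residual ** 2).sum(axis=1))
    return np.stack(distances, axis=1)

def class_probabilities(distances):
    """Softmax over negative distances, computed in a numerically stable way."""
    logits = -distances
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)
```

A query lying exactly in a class subspace has zero distance to that class and therefore receives the highest probability for it.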

Segmentation Training and Testing Scenario
Segmentation model training was conducted on an i7 processor with an RTX 2060 SUPER using a batch size of four for model generalization [28]. A five-fold cross-validation mechanism was used for model training, keeping the weights at the best validation score for each fold within the first 150 epochs. The data in each fold were split by patient, that is, 80 patients for training and 20 for validation, with each class represented equally. Each patient had a different number of short-axis slices; in total, 1902 images were acquired. The images, preprocessed with auto-zero padding and random augmentation, were used as the model input with dimensions of 256 × 256 × 3. The grayscale image was copied into three channels to obtain a superior result with the pre-trained ImageNet model. The Jaccard loss $L_{jaccard}$, formulated as Equation (5), was used as the loss function along with the Adam optimizer with a learning rate of $1 \times 10^{-3}$:
$$L_{jaccard} = 1 - \frac{|G \cap P|}{|G \cup P|} \qquad (5)$$
where $G$ is the ground truth and $P$ is the output of the segmentation model. Finally, the segmentation output was post-processed using hole filling and largest-contour selection, and an operation was applied to remove the double contour of the RV because it tends to be noisy.
The 2D segmentation model was tested in two stages: a local test conducted with the validation set, and an online test based on the competition leaderboard, conducted with the test set. Both the validation and test sets were preprocessed but not augmented. For each fold of the MobileNetV3-UNet and EfficientNet-UNet models, local tests were conducted using the Dice coefficient as the evaluation metric. The online test was performed by submitting the final segmentation results, obtained as the average of the segmentation output maps from the five fold models. Before submission, the final segmentation result was first restored to its original size and reassembled into a nii file for each patient. The model was evaluated using several metrics according to the competition leaderboard.
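The Jaccard loss used for training and the Dice coefficient used in the local tests can be sketched as follows for a single class mask. This is a generic soft formulation, not the authors' code: the function names and the small smoothing constant `eps` (added to avoid division by zero on empty masks) are assumptions.

```python
import numpy as np

def jaccard_loss(ground_truth, prediction, eps=1e-7):
    """1 - |G ∩ P| / |G ∪ P| for masks or probabilities in [0, 1]."""
    intersection = np.sum(ground_truth * prediction)
    union = np.sum(ground_truth) + np.sum(prediction) - intersection
    return 1.0 - (intersection + eps) / (union + eps)

def dice_coefficient(ground_truth, prediction, eps=1e-7):
    """2 |G ∩ P| / (|G| + |P|) for masks or probabilities in [0, 1]."""
    intersection = np.sum(ground_truth * prediction)
    total = np.sum(ground_truth) + np.sum(prediction)
    return (2.0 * intersection + eps) / (total + eps)
```

For two masks that each contain two pixels and share one, the intersection is 1 and the union is 3, so the Jaccard loss is about 0.667 and the Dice coefficient is 0.5.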

Few-Shot Classification Training and Testing Scenarios
A minimal dataset was used for the few-shot model. Images from the post-processed segmentation outputs of 100 patients were randomly split into class-balanced sets of 50 images each for the training and validation phases. During the testing phase, 50 images were randomly sampled. The settings used were five-way for all phases, five-shot for the support and query sets, and 100 episodes for each epoch. Each episode consisted of 25 support samples and 25 query samples. Training ran for ten epochs with a learning rate of $1 \times 10^{-3}$, the Adam optimizer, and a $\lambda$ of 0.03. The few-shot model ran on an i7 processor with an RTX 2060 SUPER.
The validation support set was used as the support set in the testing phase, and 25 queries for each episode were selected randomly from the test data. The prediction results were aggregated by majority voting per patient ID. The training-testing process was repeated five times to obtain stable models, and an ensemble was then performed. Experiments were conducted to investigate the effects of using various encoders, augmentation, dropout, and 1-5 shot settings. The best experimental model was considered the final model and submitted to the post-2017-MICCAI-challenge testing phase for diagnosis. The experimental results were evaluated against the validation set based on accuracy.
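Because a test patient can appear as a query in many episodes, the per-episode predictions have to be reduced to one label per patient. A minimal sketch of majority voting by patient ID is shown below; the function name and the input format (pairs of patient ID and predicted class) are hypothetical, and ties are resolved in favour of the label seen first, which is an assumption rather than a documented rule.

```python
from collections import Counter, defaultdict

def vote_per_patient(predictions):
    """predictions: iterable of (patient_id, predicted_class) pairs."""
    by_patient = defaultdict(list)
    for patient_id, label in predictions:
        by_patient[patient_id].append(label)
    # most_common(1) returns the most frequent label; on ties, the first one seen.
    return {pid: Counter(labels).most_common(1)[0][0]
            for pid, labels in by_patient.items()}
```

The same function can be reused for the model ensemble by feeding it the per-patient labels of all five trained models.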

Segmentation Results
The proposed model was evaluated in two stages, namely local and online. In the local tests, five-fold cross-validation was used to test MobileNetV3-UNet and EfficientNet-UNet on the validation set using Dice coefficients, as presented in Table 1. Although EfficientNetB3 and EfficientNetB5 obtained the highest scores, the Dice coefficients of MobileNetV3-UNet and EfficientNet-UNet differed only slightly. However, the standard deviation of EfficientNetB5-UNet was lower by 0.0024 than that of MobileNetV3-UNet. Because the difference in Dice coefficients was insignificant, EfficientNetB5 was used for online testing. MobileNetV3 was additionally used, and two ensembles of the models were created. First, Ensemble B0-B5 was generated from the average of the output maps of EfficientNetB0 to EfficientNetB5. Next, the V3B5 ensemble was generated from the average of the output maps of MobileNetV3 and EfficientNetB5. Table 2 compares the outcomes of the proposed method with those of several previous methods in terms of the Dice coefficient. The test results on the leaderboard revealed that EfficientNetB5-UNet outperformed MobileNetV3-UNet for the LV, RV, and MYO. Although ensembling did not outperform several methods in the post-2017-MICCAI-challenge testing phase, it improved the average Dice coefficient. Therefore, Ensemble B0-B5, with an average Dice coefficient of 90.8%, was used to perform the next task (few-shot classification).
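The ensembles described here average the per-pixel class probabilities of several models before assigning a label. A minimal sketch, assuming each model returns an H × W × C array of class probabilities and that the final label is the most probable class of the averaged map (the function name is hypothetical):

```python
import numpy as np

def ensemble_segmentation(prob_maps):
    """prob_maps: list of (H, W, C) per-model class probabilities.

    Returns an (H, W) label map taken from the averaged probabilities.
    """
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return mean_prob.argmax(axis=-1)
```

Averaging lets a confident model outvote an uncertain one: if one model gives class 1 a probability of 0.4 and another gives it 0.9, the averaged 0.65 selects class 1 even though the first model alone preferred the background.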

Classification Results
In the proposed classification method, 2D images are processed as inputs. Figure 4 displays the 2D images obtained by merging the output maps from the segmentation model by using the 2D thickness algorithm. Figure 4g was acquired by combining the first six short-axis slices from the basal slice (Figure 4a-f). In the case of six slices, the combination ran from basal to apical. Experiments were conducted to improve the performance of the few-shot model under five scenarios. The first scenario involved determining the effect of using distinct encoders. In the second scenario, we examined the effects of dropout and augmentation. The third scenario was to identify the best number of shots. The fourth was to observe the effect of various numbers of short-axis slices in the 2D thickness images, and the final scenario was to determine the best few-shot model on the test leaderboard.

Figure 4. Segmentation output maps (a-f) of short-axis slices 1-6 of each of the ED (top) and ES (bottom) frames, processed by the 2D thickness algorithm to produce 2D images (g) for few-shot model input.
Three types of encoders (CNN, MobileNetV3, and EfficientNet) were used in the first scenario. A standard CNN with four convolutional layers (Conv4) was used to determine the feature extraction capability of a simple convolutional model. MobileNetV3 and EfficientNet are deep and lightweight models based on an inverted residual network with excellent feature representation capabilities [31]. The effects of using pre-trained ImageNet weights and non-pre-trained weights for the deep models were investigated. Table 3 reveals that EfficientNetB1 with pre-trained ImageNet weights exhibits a higher average accuracy of 68.68% and a lower standard deviation of 0.6883%. The results revealed that deep and lightweight models based on pre-trained models could improve model generalizability, especially in the case of deep models.
In the second scenario, dropout and augmentation were applied. The training results are not presented because the entire training phase exhibited 100% accuracy. However, optimization was applied by adding dropout and augmentation to the input images. The same augmentation as in segmentation was applied, with an additional coarse dropout that randomly removed 4-8 pixels. A dropout of 0.5 was applied at the end of the encoder before entering the classifier. Table 4 lists the effects of augmentation and dropout. Typically, augmentation improves accuracy, whereas dropout degrades accuracy. However, the application of both augmentation and dropout can increase accuracy. EfficientNetB1 with pre-training, dropout, and augmentation resulted in an average accuracy of 78.23%, with the smallest standard deviation.
In the third scenario, the maximum possible number of shots was five because each class had only twenty data points (divided into five for the support set and five for the query set, with the remaining split between training and validation). Settings with fewer than five shots were also evaluated. Table 5 details the optimal number of shots between one and five. A small number of shots can degrade model performance. However, a stable model can be obtained with the five-shot setting.
In the fourth scenario, the smallest number of short-axis slices from the basal to the apical side in the dataset was six. Therefore, the 2D thickness images were rendered with between one and six slices. Examples of the results of the 2D thickness algorithm are displayed in Figure 5: (a) one-slice, (b) two-slice, (c) three-slice, (d) four-slice, (e) five-slice, and (f) six-slice. Using the 2D thickness algorithm, the $S$ pixel values inserted into the slices were 255 for one-slice; (85, 170) for two-slice; (21, 87, 147) for three-slice; (16, 51, 85, 103) for four-slice; (15, 29, 51, 67, 93) for five-slice; and (14, 26, 38, 49, 59, 69) for six-slice.
Different numbers of slices (between one and six) were tested using the best model (EfficientNetB1 with pre-training, dropout, and augmentation), as presented in Table 6. The results revealed that a higher number of slices improved classification performance; the six-slice experiment outperformed the experiments with fewer slices because of its richer features.
In the final scenario, because the post-2017-MICCAI-challenge testing phase for diagnosis allowed only limited testing, several tests were performed using various accounts. Table 7 lists the test leaderboard scores. Five individual experiments were conducted, and an ensemble was applied to the experiments. In the ensemble model, majority voting was used, so that the most frequently predicted class was selected. Each individual model obtained an accuracy score between 78% and 86%, which indicated instability: the few-shot model determines the label by matching against the support set, and the support set samples were randomly selected. Therefore, an ensemble was created from models with completely random samples. Based on the results, the ensemble model achieved 92% accuracy for diagnosis and could compete with state-of-the-art models.
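The voting step of the ensemble can be sketched as follows. This is a minimal Python illustration of majority voting over per-model predictions. The function name `majority_vote`, the tie-breaking rule, and the toy predictions are assumptions, since the text does not specify how ties were resolved or list the individual predictions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label per sample across models.

    predictions: list of per-model label lists, all of equal length.
    ASSUMPTION: ties go to the label predicted first in model order
    (Counter keeps first-seen order among equal counts).
    """
    voted = []
    for sample_votes in zip(*predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical predictions from five models for three test samples.
toy_predictions = [
    ["DCM", "NOR", "MINF"],
    ["DCM", "ARV", "MINF"],
    ["MINF", "NOR", "DCM"],
    ["DCM", "NOR", "MINF"],
    ["DCM", "ARV", "DCM"],
]
ensemble_labels = majority_vote(toy_predictions)
print(ensemble_labels)  # ['DCM', 'NOR', 'MINF']
```

Because each model sees a different random support set, individual errors tend to differ between models, so the vote can be more accurate than any single model.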

Discussion
Segmentation and classification of cine MRI data are challenging tasks. A U-Net-based encoder-decoder model was used for segmentation to create output maps, which were then classified using a few-shot model. Several studies have been conducted on cine MRI segmentation (Table 8), and the advantages and disadvantages of each method are listed in the table. In general, UNet-based frameworks are the most effective for this segmentation task and achieve average Dice scores of more than 80%. Moreover, the proposed method renders segmentation lighter with UNet-EfficientNetB5 for data per slice, but its segmentation performance does not surpass that of previous methods.
Table 8. Strengths and weaknesses of the proposed method compared to previous methods.

| Method | Stage | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Proposed | Segmentation | Segmentation becomes lighter with UNet-EfficientNetB5 for data per slice. | The segmentation performance has not outperformed the previous method. |
| Proposed | Classification | Only considers slices and does not depend on the parameter settings of the tool or the number of slices obtained. Results can be compared across numbers of slices. | This approach is only suitable for morphological problems. Uncertainty is high because it depends on the training in each episode; this is handled by the ensemble. |
| Mahendra Khened [9] | Segmentation | Segmentation uses DenseNet, which is suitable for limited data. | Segmentation using a 2D UNet with dense blocks does not outperform the ensemble of 2D and 3D DMR-UNet. |
| Mahendra Khened [9] | Classification | Classification becomes faster with Random Forest. | The classification focuses almost exclusively on end-diastole and end-systole features. |
| | Segmentation | | The input images of the 3D UNet have a large slice gap, so pooling and upscaling operations are carried out only in the short-axis plane. Moreover, the 3D network involves a smaller number of feature maps. |
| | Classification | Performs ensemble classification by combining MLP and Random Forest. | The ensemble method does not outperform a single Random Forest. |
| Irem Cetin [32] | Segmentation | The training data were manually segmented to produce accurate results. | A large number of computations were performed manually. The method tends to overfit; to prevent overfitting, the most discriminative features were selected and an SVM was used for classification. |
| Irem Cetin [32] | Classification | Classification uses Support Vector Machines, which are suitable for limited data. | The classification method does not outperform other methods. |
| Jelmer M Wolterink [10] | Segmentation | The network was designed to contain a number of convolutional layers with increasing levels of dilation to produce high-resolution feature maps. | The convolutional neural network does not have an encoder-decoder architecture. |

For the cine MRI classification methods presented in Table 8, most studies use non-deep-learning methods that include clinical features. In the proposed deep learning method, only slices are considered, and model performance does not depend on the parameter settings of the tool or the number of slices obtained. Furthermore, results can be compared across different numbers of slices with a small dataset. The few-shot model overcomes the small-dataset problem, especially in image-based tasks. The few-shot learning approach with an adaptive subspace classifier was modified to suit the classification task by training on 100 heart disease data samples. The meta-learning paradigm was applied using episodic mechanisms. A task domain with the same classes was used for each phase (training, validation, and testing). The encoder plays a critical role in extracting image features. Therefore, using a pre-trained lightweight and deep model based on an inverted residual network improves accuracy. The use of augmentation and dropout increased model robustness. To utilize the few-shot model, an ensemble was generated to obtain an even distribution of class representations from the models.
Figure 6 displays the confusion matrix of the results in the test leaderboard. The performance of the model in recognizing the class pairs ARV-NOR and DCM-MINF was highly biased, and prediction errors still occurred in these classes. Although the accuracy was not the highest on the leaderboard, this model could effectively perform image-based classification.
The application of six short-axis slices plotted for each cardiac condition for various heights, weights, and available numbers of slices (H/W/S) in the samples is displayed in Figure 7. The DCM condition with small H/W/S ratios is depicted in Figure 7a.
By plotting the ED and ES in alignment, as displayed in Figure 7a, we can observe the cardiac enlargement and ineffective blood pumping in the ES state [33]. The HCM condition is displayed in Figure 7b. The H/W/S ratios differ considerably in this case, and the HCM condition can be observed by plotting the thickness of the heart muscle in comparison with that in other conditions [34]. Figure 7c details the conditions of patients with MINF. A comparison of ED and ES revealed a left ventricular EF of <40% [35]; furthermore, some myocardial segments with abnormal contractions could be observed. This condition is similar to that of DCM, but no enlargement of the heart was observed. Therefore, the padding method is advantageous because it preserves the proportions of the heart. Normal cardiac conditions are displayed in Figure 7d. Normal cardiac plotting is similar to that of ARV because the RV becomes thicker with layer accumulation, especially when the number of slices is small. Figure 7e displays the ARV condition. The right ventricular volume is higher than usual [36]. Furthermore, the EF of the RV was lower than 40% because it remained enlarged even in the ES state.
Figure 7. Six-slice ED-ES 2D thickness plots for (a) DCM, (b) HCM, (c) MINF, (d) normal, and (e) ARV conditions. Sample labels give height (H), weight (W), and number of slices (S). Panels (a)-(c): H191, W97, S11; H182, W106, S8; H170, W80, S13; H165, W42, S7; H165, W76, S12; H170, W70, S6. Panels (d)-(e): H179, W93, S14; H155, W74, S6; H186, W76, S18; H170, W103, S6.
These results revealed that ED-ES plotting with a 2D thickness algorithm can be used to visualize cardiac conditions with various heights, weights, and slices. Additionally, the proposed method can handle the use of various MRI instruments and settings for classifying cardiac conditions. These results provide numerous opportunities for rapid and straightforward cardiac screening. In the future, larger sample sizes and more classes of cardiac abnormalities should be tested using this approach. Furthermore, 2D thickness and meta-learning experiments on long-axis data represent a novel direction for verifying the detection of various abnormal heart conditions.

Conclusions
A novel segmentation and classification model was proposed for heart disease diagnosis using a cine MRI dataset. The encoder-decoder segmentation network EfficientNetB5-UNet was used to perform the semantic segmentation of MRI image slices. A 2D thickness algorithm that combined the segmentation outputs was proposed to develop 2D representation images of the ED and ES cardiac volumes. The average Dice coefficients of segmentation for the LV were 96.24% (ED) and 89.92% (ES); the values for the RV were 92.90% (ED) and 86.92% (ES), whereas those for MYO were 88.90% (ED) and 90.48% (ES). Subsequently, the 2D thickness images obtained from the model with the best segmentation result were used for classification to overcome the data shortage problem. The ensemble approach addresses the high uncertainty of the prediction results. By using six-slice 2D thickness image classification, the model could classify heart diseases in the ACDC dataset with an accuracy score of 92%. These results were consistent with those of other studies in which derivatives and other clinical features were used. This image-based classification can be used as a rapid scanning method for diagnosing heart disease.