
Combining UNet 3+ and Transformer for Left Ventricle Segmentation via Signed Distance and Focal Loss

School of Artificial Intelligence, Chongqing University of Technology, Chongqing 401120, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2022, 12(18), 9208; https://doi.org/10.3390/app12189208
Submission received: 30 August 2022 / Revised: 9 September 2022 / Accepted: 11 September 2022 / Published: 14 September 2022
(This article belongs to the Special Issue Recent Advances in Machine Learning and Computational Intelligence)

Abstract

Left ventricle (LV) segmentation of cardiac magnetic resonance (MR) images is essential for evaluating cardiac function parameters and diagnosing cardiovascular diseases (CVDs). Accurate LV segmentation remains a challenge because of the large differences in cardiac structure across subjects. In this work, a network based on an encoder–decoder architecture for automatic LV segmentation of short-axis cardiac MR images is proposed. It combines UNet 3+ and a Transformer to jointly predict segmentation masks and signed distance maps (SDM). UNet 3+ extracts coarse-grained semantics and fine-grained details at full scale, while the Transformer extracts global features from cardiac MR images, alleviating the low segmentation accuracy caused by blurred LV edge information. Meanwhile, the SDM provides a shape-aware representation for segmentation. The performance of the proposed network is validated on the 2018 MICCAI Left Ventricle Segmentation Challenge dataset. Five-fold cross-validation on 145 clinical subjects yields an average dice metric, Jaccard coefficient, accuracy, and positive predictive value of 0.908, 0.834, 0.979, and 0.903, respectively, outperforming other mainstream methods.

1. Introduction

The World Health Organization (WHO) reported that in 2019 almost 17.9 million people died of cardiovascular diseases (CVDs), accounting for 32% of deaths worldwide [1]. Early diagnosis of CVD can help improve cardiac function and reduce patient mortality [2]. Cardiovascular magnetic resonance (MR) imaging is non-invasive and has become the most commonly used technique for evaluating cardiovascular structure and function. Left ventricle (LV) segmentation, a key step in the treatment of CVD, provides a visual aid during the diagnosis of CVD. Most CVDs affect the physiological shape of the LV, and LV dysfunction underlies many heart diseases, such as ventricular hypertrophy and myocardial infarction, making examination of the LV an important prerequisite for determining whether the heart is diseased. LV segmentation accurately delineates the boundaries of the LV on cardiac MR images so that physicians can better assess clinical parameters such as ventricular volume, ejection fraction, LV mass, and stroke volume [3,4].
In early clinical work, medical images were usually annotated by multiple experts to mitigate the subjective bias caused by a particular expert's experience level or possible oversight of subtle symptoms [5]. However, for most professional clinicians, manual segmentation is a cumbersome and time-consuming task; in general, it takes a clinician about 20 min to segment a patient's cardiac MR image. Moreover, as the LV structure shown in Figure 1 illustrates, the intensity and shape similarity between the LV and other organs, boundary inaccuracies, and the inherent noise of cardiac MR imaging all pose obstacles to LV segmentation [6,7,8].
Many segmentation algorithms developed over the past decades face difficulties that make them hard to apply in clinical settings. Most are traditional methods based on machine learning [9,10,11], such as thresholding [12], clustering [13], active contours [14,15], and split–merge; Figure 2 illustrates the thresholding, clustering, and active contour methods. Most of these are semi-automatic methods that rely heavily on the initialization step [16,17,18,19] and therefore fail to achieve the desired results. Meanwhile, with the growing availability of vast training data and powerful computing hardware, medical image segmentation algorithms based on deep learning are becoming increasingly prevalent. In particular, Convolutional Neural Networks (CNNs) have achieved excellent results in various computer vision (CV) fields, such as image segmentation [20], object detection [21], and image classification [22]. Following this trend, CNN-based LV segmentation models such as Densely Connected Convolutional Networks (DenseNet) [23] and Fully Convolutional Neural Networks (FCN) [24] have been proposed and have achieved good results in clinical trials.
Although these networks are representative, CNN-based approaches still have limitations in capturing global information and recovering weak texture details. To overcome this, some studies have suggested Transformer-based designs that use the self-attention mechanism to construct contextual representations. Unlike traditional CNNs, Transformers can model global contextual information by relating spatially distant pixels [25]. However, both models have drawbacks: CNNs are insensitive to global features, while Transformers incur high computation costs and struggle to capture regional features, so the applicability of both needs further improvement.
To solve these problems, this study proposes a fast, automated method for cardiac LV segmentation to facilitate the diagnosis of CVD. The network makes two main contributions: (1) The backbone combines UNet 3+ [26] and a Transformer, efficiently acquiring low-level spatial features and high-level semantic information while also modeling the global context. (2) A shared backbone network jointly predicts segmentation masks and signed distance maps (SDM), learning complementary representations of the segmentation target from different perspectives.

1.1. Traditional Segmentation Methods

Methods for LV segmentation can broadly be divided into traditional segmentation methods and deep learning methods. The early traditional methods have obvious drawbacks: several of these algorithms obtain accurate segmentation results only when the pixel intensities of the LV and other tissues contrast strongly. For example, Goshtasby et al. [27] used a threshold-based segmentation method in 1995 to extract the LV contour. It adaptively determines the grayscale threshold from the global or local grayscale histogram of the image, so it yields good segmentation results only when the grayscale of the target region differs significantly from the background. Since its appearance, the K-means clustering algorithm has been extensively applied in image analysis and data mining; however, because many regions in cardiac MR images are similar or even connected to the LV, K-means clustering cannot achieve the expected results and needs further improvement. In 2006, Katouzian et al. [12] employed the split–merge idea to segment the LV, successfully extracting the epicardium and endocardium on the condition that both are manually annotated on the first slice; this manual step clearly increases the complexity of clinical application. On the whole, these traditional methods rely on manual design, which runs counter to the goal of automatic segmentation.

1.2. Deep Learning

Unlike traditional segmentation methods, deep learning methods train on large-scale data to find the intrinsic patterns and representation levels in images and obtain more representative feature information [28]. They describe image features without relying on manual feature extraction, addressing the limitations of traditional segmentation methods. In 2015, Long et al. [24] proposed the fully convolutional network (FCN), which recovers the feature map to the original image size via transposed convolution, thus achieving end-to-end image segmentation. In 2016, Tran [7] segmented both the left and right ventricles using an FCN; however, recovery of the LV contour was poor because only a single upsampling step was used. Ronneberger et al. [29] proposed the U-Net network based on the FCN. U-Net effectively integrates low-resolution and high-resolution information to learn better feature representations and improve generalization; however, its utilization of feature maps is poor, and it does not handle object boundary segmentation well. SegNet [30] introduced a decoder that performs nonlinear upsampling using the max-pooling indices received from the corresponding encoder; although this avoids learning the upsampling and improves the precision of boundary localization, the segmentation accuracy is not high enough to satisfy real-time requirements. More generally, despite the excellent representation capability of all these networks, CNN-based approaches have difficulty learning global semantic information because of the inherent locality of convolution operations. As a result, they usually yield weak segmentation performance, particularly for target structures that exhibit large inter-patient variations in size, shape, and texture.

1.3. Transformers

The Transformer was first proposed by Vaswani et al. [31] for machine translation. It has since been introduced as a new model for image recognition [32], semantic segmentation [33], and many other computer vision tasks [34,35]. In contrast to previous CNN-based approaches, the Transformer is powerful at modeling global contexts and demonstrates superior transferability to downstream tasks under large-scale pretraining. However, the Transformer concentrates on modeling the global context at all stages and therefore generates low-resolution features; because these features lack detailed localization information that cannot be efficiently recovered by direct upsampling to full resolution, the resulting segmentations are coarse. To make models focus on both regional and global features in segmentation tasks, recent studies have tended to combine CNNs and Transformers, such as TransUNet [36] and TransFuse [37], which yield satisfactory segmentation performance. These works show that combined models have great potential in the field of CV.

2. Method

The network proposed in this paper aims to automatically segment the LV in cardiac MR images, reducing tedious manual segmentation and improving the efficiency of disease diagnosis. The new medical image segmentation framework is shown in Figure 3; the backbone uses a combined architecture of UNet 3+ and Transformer. The network takes cardiac MR images as input and predicts both pixel probability maps and SDM. A loss function composed of two main components is designed to train the segmentation network: one for the pixel probability map and one for the SDM.

2.1. Segmentation Network

Feature Extraction—First, feature maps are generated from the input images using the encoder structure of UNet 3+ as a feature extractor. Given an input image $x \in \mathbb{R}^{H \times W \times C}$ with spatial resolution $H \times W$ and $C$ channels, the preprocessed image $x$ is fed into the network, and the encoder applies a series of convolutional blocks to model pixel-level contextual representations, progressively downsampling the features to $\frac{H}{16} \times \frac{W}{16}$.
Transformer—To extract global features, the Transformer module is incorporated into the encoder design. Because the Transformer is not as efficient as UNet 3+ at capturing regional features, patch embedding is applied to patches generated from the UNet 3+ feature maps rather than from the raw images. The patches $x_p$ are then mapped into a $K$-dimensional embedding space by a trainable linear projection. To preserve location information, specific position embeddings $E_{pos}$ are added to the patch embeddings when encoding the patch spatial information:
$$z_0 = \left[ x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E \right] + E_{pos},$$
where $E \in \mathbb{R}^{(P^2 \cdot C) \times K}$ is the patch embedding projection, $E_{pos} \in \mathbb{R}^{N \times K}$ denotes the position embedding, and $x_p^1, \dots, x_p^N$ are the image patches.
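For illustration, a minimal PyTorch sketch of this patch-embedding step follows. The strided-convolution implementation (equivalent to flattening each patch and multiplying by $E$), the module name, and the zero-initialized position embedding are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split a feature map into P x P patches, project each to a K-dim token,
    and add a learnable position embedding (the z_0 computation above)."""
    def __init__(self, in_channels: int, patch_size: int, embed_dim: int, num_patches: int):
        super().__init__()
        # A conv with kernel = stride = P is equivalent to flattening each
        # P x P patch and multiplying by the projection matrix E.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # E_pos: one learnable K-dim vector per patch position.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, K, H/P, W/P) -> (B, N, K), with N = HW / P^2
        tokens = self.proj(x).flatten(2).transpose(1, 2)
        return tokens + self.pos_embed
```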
A stack of Transformer blocks, each consisting of a multi-head self-attention (MSA) layer and a multi-layer perceptron (MLP) block, is then used to learn the long-range context representation. Layer normalization (LN) is applied before each sub-layer and a residual connection after each. The output of the $i$-th layer can be expressed as:
$$z_i' = \mathrm{MSA}(\mathrm{LN}(z_{i-1})) + z_{i-1},$$

$$z_i = \mathrm{MLP}(\mathrm{LN}(z_i')) + z_i',$$
where $\mathrm{LN}(\cdot)$ represents the layer normalization operator, $i \in \{1, 2, \dots, L\}$ with $L$ the number of Transformer layers, and $z_i$ denotes the image representation output by the $i$-th Transformer layer.
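The two equations above describe a standard pre-norm Transformer block; a minimal PyTorch sketch follows, in which the head count and the 4x MLP expansion ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One pre-norm Transformer block: MSA and MLP sub-layers, each preceded
    by LayerNorm and wrapped in a residual connection, as in the equations above."""
    def __init__(self, embed_dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm1(z)
        z = self.attn(h, h, h, need_weights=False)[0] + z   # z'_i = MSA(LN(z_{i-1})) + z_{i-1}
        return self.mlp(self.norm2(z)) + z                  # z_i  = MLP(LN(z'_i)) + z'_i
```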
Decoder—To generate segmentation masks and SDM in the raw image space, the UNet 3+ decoder is used to perform feature upsampling. Since the output of the Transformer is sequential data, it must first be recovered to spatial order: the encoded feature representation $z_L \in \mathbb{R}^{N \times K}$ is reshaped into $\mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times K}$, where $P$ is the size of each patch, and the channel dimension is then reduced by a $3 \times 3$ convolution block. In addition, each decoder layer combines the feature maps of all encoder stages. The full-scale deep supervision proposed by UNet 3+ is used to learn hierarchical representations from the full-scale aggregated feature maps, and the output of each decoder stage is supervised by the ground truth (GT). To achieve deep supervision, a $3 \times 3$ convolution block, bilinear upsampling, and a sigmoid function are appended to the last layer of each decoder stage. To generate the SDM, the network adds an SDM head at the last decoder stage, consisting of a convolution block and a tanh activation.
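The sketch below illustrates the sequence-to-feature-map reshape and the two output heads just described; channel counts, function and module names, and upsampling details are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tokens_to_feature_map(z: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Recover spatial order: (B, N, K) -> (B, K, H/P, W/P), with N = h * w."""
    b, n, k = z.shape
    assert n == h * w
    return z.transpose(1, 2).reshape(b, k, h, w)

class SegHead(nn.Module):
    """Deep-supervision head: 3x3 conv, bilinear upsampling to full
    resolution, then sigmoid to obtain a pixel probability map."""
    def __init__(self, in_channels: int, scale: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(self.conv(x), scale_factor=self.scale,
                          mode="bilinear", align_corners=False)
        return torch.sigmoid(x)

class SDMHead(nn.Module):
    """SDM head at the last decoder stage: conv block + tanh, so the
    predicted signed distances lie in [-1, 1]."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.conv(x))
```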

2.2. Loss Function

Based on the above architecture design, the segmentation network generates pixel-level segmentation maps and SDM. The following function converts the GT into an SDM, which assigns each pixel a value indicating its signed distance to the nearest boundary of the target object:
$$
D(a) =
\begin{cases}
0, & a \in S \\
-\inf_{b \in S} \| a - b \|_2, & a \in C_{in} \\
+\inf_{b \in S} \| a - b \|_2, & a \in C_{out}
\end{cases}
$$

where $\|a - b\|_2$ is the Euclidean distance between pixels $a$ and $b$, $S$ represents the boundary of the target object, and $C_{in}$ and $C_{out}$ represent the regions inside and outside the target object, respectively. Typically, the SDM takes negative values inside the target and positive values outside it, with the absolute value indicating the distance from the point to the nearest point on the target object's surface.
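In practice, this transform can be computed with a Euclidean distance transform; a sketch follows, assuming SciPy is available. The normalization to [-1, 1] (to match the tanh output of the SDM head) and the handling of degenerate masks are our assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask: np.ndarray) -> np.ndarray:
    """Convert a binary GT mask into an SDM: negative inside the object,
    positive outside, approximately zero on the boundary."""
    inside = mask.astype(bool)
    if not inside.any() or inside.all():
        # Degenerate mask with no boundary: return an all-zero map (assumption).
        return np.zeros(mask.shape, dtype=np.float64)
    dist_out = distance_transform_edt(~inside)  # distance of outside pixels to the object
    dist_in = distance_transform_edt(inside)    # distance of inside pixels to the background
    sdm = dist_out - dist_in                    # < 0 inside, > 0 outside
    # Normalize to [-1, 1] so the target matches the tanh range of the SDM head.
    return sdm / np.abs(sdm).max()
```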
During network training, for the regression task branch, an $L_2$ loss is used between the SDM output by the network, $P_a$, and the transformed GT map $D(Y)$:

$$L_{sdm}(P_a, Y) = \| P_a - D(Y) \|_2,$$

where $Y$ denotes the GT map.
For the segmentation task branch, the combination of dice loss and focal loss, $L_{seg}$, is applied as the loss function between the segmentation mask and the GT at each decoder output, and the average over all decoder outputs is taken as the final segmentation loss.
$$L_{seg}(P_b, Y) = L_{Dice}(P_b, Y) + L_{FL}(P_b, Y),$$

where $L_{Dice}$ denotes the dice loss, $L_{FL}$ denotes the focal loss, and $P_b$ and $Y$ denote the predicted segmentation map and the label, respectively.
The final loss is defined as:
$$L = L_{sdm}(P_a, Y) + L_{seg}(P_b, Y).$$
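A sketch of the combined loss is given below, assuming sigmoid-probability predictions of shape (B, 1, H, W), binary masks, and a focal-loss focusing parameter gamma = 2 (the text does not specify gamma, so this is an assumption).

```python
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft dice loss on probability maps."""
    inter = (pred * target).sum(dim=(-2, -1))
    denom = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()

def focal_loss(pred: torch.Tensor, target: torch.Tensor,
               gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Binary focal loss: cross-entropy down-weighted for easy pixels."""
    pred = pred.clamp(eps, 1.0 - eps)
    p_t = torch.where(target > 0.5, pred, 1.0 - pred)  # probability of the true class
    return (-((1.0 - p_t) ** gamma) * p_t.log()).mean()

def total_loss(seg_preds, sdm_pred, mask, sdm_gt):
    """L = L_seg + L_sdm, with L_seg averaged over the deep-supervision outputs."""
    l_seg = sum(dice_loss(p, mask) + focal_loss(p, mask) for p in seg_preds) / len(seg_preds)
    l_sdm = F.mse_loss(sdm_pred, sdm_gt)  # L2 loss between predicted SDM and D(Y)
    return l_seg + l_sdm
```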

3. Experiments

3.1. Datasets

The dataset from MICCAI 2018 is used to train and evaluate the proposed model [38]. It contains 2900 short-axis cardiac MR images from 145 subjects at three hospitals attached to two healthcare centers (London Healthcare Center and St. Joseph's Healthcare). The study subjects range in age from 16 to 97, with a mean of 58.9. The pixel spacing of the MR images ranges from 0.6836 mm/pixel to 2.0833 mm/pixel, with a mode of 1.5625 mm/pixel. Pathologies such as myocardial hypertrophy, LV dysfunction, atrial septal defect, regional wall motion abnormalities, and mildly enlarged LV are present. Twenty frames are acquired for each subject over the entire cardiac cycle. Following the standard AHA prescription, in each frame the LV is divided into equal thirds (basal, mid-cavity, and apical) perpendicular to the long axis of the heart. Before the GT is manually annotated, all cardiac MR images undergo landmark labeling, rotation, ROI cropping, and resizing; after preprocessing, all images are cropped and resized to 80 × 80 and normalized. Two experienced cardiac radiologists (A. Islam and M. Bhaduri) manually contour the epicardial and endocardial boundaries and double-check the results. The labels are approved by practicing physicians.

3.2. Implementation Details

In the experiments, the backbone of the main framework is the combination of UNet 3+ and Transformer. The network is implemented in PyTorch (1.11.0) and runs on an Intel(R) Core(TM) i9-10850K CPU with an NVIDIA GeForce RTX 3080 Ti GPU. All training and test images are uniformly resized to 80 × 80. In the training stage, random rotation and flipping are applied as data augmentation. The model is evaluated by five-fold cross-validation: the dataset is split into five groups of 29 subjects each, one group (580 images) is used for testing while the remaining four groups (2320 images) form the training set, and the final result is the average over the five folds. The network is trained end-to-end with the Adam optimizer, using a weight decay of $1 \times 10^{-5}$ and an initial learning rate of $2 \times 10^{-4}$, for 100 epochs with a batch size of 16. At test time, the output of the segmentation task branch is used as the segmentation result.
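The training loop implied by these settings might look as follows; build_model and train_loader are hypothetical placeholders, and total_loss refers to the loss sketch given earlier.

```python
import torch
from torch.optim import Adam

model = build_model()  # hypothetical constructor for the UNet 3+ / Transformer network
optimizer = Adam(model.parameters(), lr=2e-4, weight_decay=1e-5)

for epoch in range(100):
    model.train()
    # train_loader yields batches of 16 augmented images with their GT masks and SDMs
    for images, masks, sdm_gts in train_loader:
        seg_preds, sdm_pred = model(images)  # deep-supervision outputs + SDM prediction
        loss = total_loss(seg_preds, sdm_pred, masks, sdm_gts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```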

3.3. Evaluation Metric

The goal of the network is to accurately segment the LV from cardiac MR images. For an objective evaluation of the proposed model, the region-based dice metric (DM) and Jaccard coefficient (JC) are employed as metrics, explained in detail as follows.
Dice Metric—DM calculates the overlap between the manual segmentation area and the automatic segmentation contour area obtained using the proposed method. DM lies in the [0, 1] range. The better the match between manual and predicted segmentation, the larger the DM value is. DM is defined as:
$$DM(A, B) = \frac{2|A \cap B|}{|A| + |B|},$$
where A and B represent the area of manual and automatic contour, respectively.
Jaccard Coefficient—The Jaccard coefficient, also called the Intersection over Union (IoU), measures the overlap between the manual segmentation area and the automatic segmentation contour area obtained using the proposed method. Like DM, the JC lies in the [0, 1] range; the larger the JC, the higher the similarity. The formula for JC is as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|},$$
where A and B represent the area of manual and automatic contour, respectively.
In addition, accuracy (ACC) and positive predictive value (PPV) are used to evaluate the pixel classification results. They are defined as:

$$ACC = \frac{TP + TN}{TP + FN + FP + TN},$$

$$PPV = \frac{TP}{TP + FP},$$

where $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively.
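All four metrics can be computed from the per-pixel confusion counts; a NumPy sketch follows (the function name and dict layout are illustrative).

```python
import numpy as np

def lv_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """DM, JC, ACC, and PPV from binary predicted and GT masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return {
        "DM":  2.0 * tp / (2.0 * tp + fp + fn),  # 2|A ∩ B| / (|A| + |B|)
        "JC":  tp / (tp + fp + fn),              # |A ∩ B| / |A ∪ B|
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "PPV": tp / (tp + fp),
    }
```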

4. Experimental Results

4.1. Performance of the Network

Figure 4 displays the predicted segmentation masks, GT, and contours for four subjects' cardiac MR images from the MICCAI 2018 dataset. It is clear from Figure 4a–d that the proposed method accurately segments the LV in cardiac MR images. In Figure 4d, the automated segmentation contours (red curves) almost overlap the GT (green curves), indicating that the model can accurately segment LVs of diverse shapes. In summary, the model shows great potential for highly accurate cardiac MR image segmentation and may thus provide a visual aid to clinicians for qualitative diagnosis.

4.2. Performance Comparison

On the MICCAI 2018 test set, the proposed method is compared with other prevalent segmentation methods, namely FCN, Conv–Deconv [39], U-Net, Indices-JSQ [40], and SegNet, to evaluate its effectiveness. As shown in Table 1, the DM and JC of the proposed method reach 0.908 and 0.834, respectively, an improvement over the other segmentation methods. These results suggest that the model determines the class of each pixel more accurately and thereby achieves higher segmentation performance.

4.3. Ablation Studies

In this section, ablation studies are performed on each component of the suggested method. As shown in Table 2, as the proposed modules are added to the UNet 3+ baseline, the model performance gradually improves. When the Transformer is integrated into the UNet 3+ baseline, the DM reaches 0.907, a 1.1% improvement over the baseline (0.896). This is because the Transformer compensates for the inability of UNet 3+ to model the global context, while the combination also avoids the loss of low-resolution detail features that occurs when the Transformer is used alone.
To further study the impact of the loss function, an ablation study is conducted for the segmentation loss ($L_{seg}$) and the SDM loss ($L_{sdm}$). Adding a regression head to the end of the segmentation network and applying the SDM loss further raises the dice metric to 0.908, a slight increase of 0.1% over the model trained with $L_{seg}$ only. This result suggests that joint SDM training implicitly forces the model to learn shape information, compared with traditional training using only segmentation masks.

5. Conclusions

In this study, a network for automatic LV segmentation from cardiac MR images is proposed, providing an effective aid for physicians diagnosing CVDs. The proposed method was extensively evaluated on cardiac MR image data from 145 subjects, and the DM and JC reached 0.908 and 0.834, respectively, on the MICCAI 2018 test set. Ablation experiments verified that both the proposed module and the loss function improve segmentation accuracy. The method also outperforms current mainstream methods in the comparison experiments, suggesting that it is an effective automatic LV segmentation method that can reduce the workload of radiologists during clinical diagnosis. In future research, more ways of applying Transformers to medical image segmentation networks will be explored to provide better techniques for LV segmentation.

Author Contributions

Conceptualization, X.H.; methodology, Z.L. and X.H.; resources, Z.L. and X.H.; data curation, Z.L. and X.H.; writing—original draft preparation, X.H.; writing—review and editing, Z.L., X.H. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded in part by the Natural Science Foundation of Chongqing, China (Grant Nos. cstc2019jcyj-msxmX0487, cstc2021jcyj-msxmX0605); in part by the National Natural Science Foundation of China (Grant Nos. 61971078, 61501070); in part by the Science and Technology Foundation of Chongqing Education Commission (Grant Nos. KJQN202001137, CQUT20181124); and in part by the Scientific Research Foundation of Chongqing University of Technology (2020ZDZ026).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. WHO. WHO Fact-Sheets Cardiovascular Diseases (CVDs); WHO: Geneva, Switzerland, 2021.
  2. Shaaf, Z.F.; Jamil, M.M.A.; Ambar, R.; Alattab, A.A.; Yahya, A.A.; Asiri, Y. Automatic Left Ventricle Segmentation from Short-Axis Cardiac MRI Images Based on Fully Convolutional Neural Network. Diagnostics 2022, 12, 414.
  3. Gessert, N.; Schlaefer, A. Left Ventricle Quantification Using Direct Regression with Segmentation Regularization and Ensembles of Pretrained 2D and 3D CNNs. arXiv 2019, arXiv:1908.04181.
  4. Tavakoli, V.; Amini, A.A. A survey of shaped-based registration and segmentation techniques for cardiac images. Comput. Vis. Image Underst. 2013, 117, 966–989.
  5. Petitjean, C.; Dacher, J.N. A review of segmentation methods in short axis cardiac MR images. Med. Image Anal. 2011, 15, 169–184.
  6. Dakua, S.P. Towards Left Ventricle Segmentation From Magnetic Resonance Images. IEEE Sens. J. 2017, 17, 5971–5981.
  7. Tran, P.V. A Fully Convolutional Neural Network for Cardiac Segmentation in Short-Axis MRI. arXiv 2016, arXiv:1604.00494.
  8. Xue, W.; Lum, A.; Mercado, A.; Landis, M.; Warrington, J.; Li, S. Full Quantification of Left Ventricle via Deep Multitask Learning Network Respecting Intra- and Inter-Task Relatedness. In Lecture Notes in Computer Science, Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Quebec City, QC, Canada, 11–13 September 2017; Springer: Cham, Switzerland, 2017.
  9. Wernick, M.N.; Yang, Y.; Brankov, J.G.; Yourganov, G.; Strother, S.C. Machine Learning in Medical Imaging. IEEE Signal Process. Mag. 2010, 27, 25–38.
  10. Yabrin, A.; Amin, B.S.; Mir, A.H. A Comparative Study on Left and Right Endocardium Segmentation using Gradient Vector Field and Adaptive Diffusion Flow Algorithms. Int. J. Bio-Sci. Bio-Technol. 2016, 8, 105–120.
  11. Li, W.; Ma, Y.; Zhan, K.; Ma, Y. Automatic Left Ventricle Segmentation in Cardiac MRI via Level Set and Fuzzy C-Means. In Proceedings of the International Conference on Recent Advances in Engineering & Computational Sciences, Chandigarh, India, 21–22 December 2015.
  12. Katouzian, A.; Prakash, A.; Konofagou, E. A new automated technique for left-and right-ventricular segmentation in magnetic resonance imaging. In Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society, New York, NY, USA, 30 August–3 September 2006; Volume 2006, pp. 3074–3077.
  13. Lynch, M.; Ghita, O.; Whelan, P.F. Automatic segmentation of the left ventricle cavity and myocardium in MRI data. Comput. Biol. Med. 2006, 36, 389–407.
  14. Zhang, Z.; Duan, C.; Lin, T.; Zhou, S.; Wang, Y.; Gao, X. GVFOM: A novel external force for active contour based image segmentation. Inf. Sci. 2020, 506, 1–18.
  15. Wu, Y.; Wang, Y.; Jia, Y. Segmentation of the left ventricle in cardiac cine MRI using a shape-constrained snake model. Comput. Vis. Image Underst. 2013, 117, 990–1003.
  16. Chakraborty, A.; Staib, L.H.; Duncan, J.S. Deformable boundary finding in medical images by integrating gradient and region information. IEEE Trans. Med. Imaging 1996, 15, 859–870.
  17. Lynch, M.; Ghita, O.; Whelan, P.F. Segmentation of the left ventricle of the heart in 3-D+t MRI data using an optimized nonrigid temporal model. IEEE Trans. Med. Imaging 2008, 27, 195–203.
  18. Wang, Y.; Zhang, Y.; Wen, Z.; Tian, B.; Kao, E.; Liu, X.; Xuan, W.; Ordovas, K.; Saloner, D.; Liu, J. Deep learning based fully automatic segmentation of the left ventricular endocardium and epicardium from cardiac cine MRI. Quant. Imaging Med. Surg. 2021, 11, 1600–1612.
  19. Xijing, Z.; Qian, W.; Ting, L. A novel approach for left ventricle segmentation in tagged MRI. Comput. Electr. Eng. 2021, 95, 107416.
  20. Haque, I.R.I.; Neubert, J. Deep learning approaches to biomedical image segmentation. Inform. Med. Unlocked 2020, 18, 100297.
  21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  22. Baloglu, U.B.; Talo, M.; Yildirim, O.; Tan, R.S.; Acharya, U.R. Classification of myocardial infarction with multi-lead ECG signals and deep CNN. Pattern Recognit. Lett. 2019, 122, 23–30.
  23. Huang, G.; Liu, Z.; Laurens, V.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
  24. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
  25. Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in Medical Imaging: A Survey. arXiv 2022, arXiv:2201.09873.
  26. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020.
  27. Goshtasby, A.; Turner, D.A. Segmentation of cardiac cine MR images for extraction of right and left ventricular chambers. IEEE Trans. Med. Imaging 1995, 14, 56–64.
  28. Wang, Z.; Peng, Y.; Li, D.; Guo, Y.; Zhang, B. MMNet: A multi-scale deep learning network for the left ventricular segmentation of cardiac MRI images. Appl. Intell. 2021, 52, 5225–5240.
  29. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597.
  30. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
  32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  33. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021.
  34. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872.
  35. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image Transformer. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 4055–4064.
  36. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
  37. Zhang, Y.; Liu, H.; Hu, Q. TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation. In Lecture Notes in Computer Science, Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; Springer: Cham, Switzerland, 2021.
  38. Xue, W.; Islam, A.; Bhaduri, M.; Li, S. Direct Multitype Cardiac Indices Estimation via Joint Representation and Regression Learning. IEEE Trans. Med. Imaging 2017, 36, 2057–2067.
  39. Noh, H.; Hong, S.; Han, B. Learning Deconvolution Network for Semantic Segmentation. arXiv 2015, arXiv:1505.04366.
  40. Du, X.; Tang, R.; Yin, S.; Zhang, Y.; Li, S. Direct Segmentation-Based Full Quantification for Left Ventricle via Deep Multi-Task Regression Learning Network. IEEE J. Biomed. Health Inform. 2019, 23, 942–948.
  41. Du, X.; Yin, S.; Tang, R.; Zhang, Y.; Li, S. Cardiac-DeepIED: Automatic Pixel-Level Deep Segmentation for Cardiac Bi-Ventricle Using Improved End-to-End Encoder-Decoder Network. IEEE J. Transl. Eng. Health Med. 2019, 7, 1900110.
Figure 1. Short-axis cardiac MR anatomy image.
Figure 2. Traditional segmentation methods. (a) Thresholding. Automated contours (green) and manual contours (red). (b) Clustering. (c) Active contours.
Figure 3. The proposed LV segmentation model is based on an encoder–decoder architecture. The network outputs pixel probability maps and SDM.
Figure 4. Segmentation results for four subjects. The arrows point to places where the proposed method can be seen to almost overlap with the GT manually delineated by the experts, indicating better segmentation. (a) Cardiac MR image. (b) Results of segmentation using the suggested method. (c) GT. (d) The segmentation contours obtained by the proposed method are marked by red curves, and the corresponding GT manually delineated by experts is marked by green curves.
Table 1. The comparison of the suggested method with several popular segmentation methods.

Methods                 DM      JC      ACC     PPV
FCN                     0.873   0.778   0.972   0.878
SegNet                  0.843   0.728   0.964   0.845
U-Net                   0.887   0.800   0.975   0.887
Conv-Deconv             0.846   0.733   0.965   0.840
Indices-JSQ             0.870   —1      —1      —1
Cardiac-DeepIED [41]    0.890   0.801   0.976   0.891
Our Method              0.908   0.834   0.979   0.903

1 Indicates that the results are not provided.
Table 2. Ablation experiments of the model on the MICCAI 2018 test set.

UNet 3+    Transformer    L_seg    L_sdm    DM
✓          ×              ✓        ×        0.896
✓          ✓              ✓        ×        0.907
✓          ✓              ✓        ✓        0.908
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
