Attention Mechanism Trained with Small Datasets for Biomedical Image Segmentation

Weng, Weihao; Zhu, Xin; Jing, Lei; Dong, Mianxiong

doi:10.3390/electronics12030682

Open AccessArticle

Attention Mechanism Trained with Small Datasets for Biomedical Image Segmentation

¹

Graduate School of Computer Science and Engineering, The University of Aizu, Aizuwakamatsu 965-8580, Japan

²

Department of Information and Electronic Engineering, Muroran Institute of Technology, Muroran 050-0071, Japan

^*

Author to whom correspondence should be addressed.

Electronics 2023, 12(3), 682; https://doi.org/10.3390/electronics12030682

Submission received: 18 December 2022 / Revised: 23 January 2023 / Accepted: 28 January 2023 / Published: 29 January 2023

(This article belongs to the Section Computer Science & Engineering)

Download

Browse Figures

Versions Notes

Abstract

:

The understanding of long-range pixel–pixel dependencies plays a vital role in image segmentation. The use of a CNN plus an attention mechanism still has room for improvement, since existing transformer-based architectures require many thousands of annotated training samples to model long-range spatial dependencies. This paper presents a smooth attention branch (SAB), a novel architecture that simplifies the understanding of long-range pixel–pixel dependencies for biomedical image segmentation in small datasets. The SAB is essentially a modified attention operation that implements a subnetwork via reshaped feature maps instead of directly calculating a softmax value over the attention score for each input. The SAB fuses multilayer attentive feature maps to learn visual attention in multilevel features. We also introduce position blurring and inner cropping specifically for small-scale datasets to prevent overfitting. Furthermore, we redesign the skip pathway for the reduction of the semantic gap between every captured feature of the contracting and expansive path. We evaluate the architecture of U-Net with the SAB (SAB-Net) by comparing it with the original U-Net and widely used transformer-based models across multiple biomedical image segmentation tasks related to the Brain MRI, Heart MRI, Liver CT, Spleen CT, and Colonoscopy datasets. Our training set was made of random 100 images of the original training set, since our goal was to adopt attention mechanisms for biomedical image segmentation tasks with small-scale labeled data. An ablation study conducted on the brain MRI test set demonstrated that every proposed method achieved an improvement in biomedical image segmentation. Integrating the proposed methods helped the resulting models consistently achieve outstanding performance on the above five biomedical segmentation tasks. In particular, the proposed method with U-Net improved its segmentation performance over that of the original U-Net by 13.76% on the Brain MRI dataset. We proposed several novel methods to address the need for modeling long-range pixel–pixel dependencies in small-scale biomedical image segmentation. The experimental results illustrated that each method could improve the medical image segmentation accuracy to various degrees. Moreover, SAB-Net, which integrated all proposed methods, consistently achieved outstanding performance on the five biomedical segmentation tasks.

Keywords:

segmentation; attention mechanism; convolutional neural networks

1. Introduction

Convolutional neural networks (CNNs), such as VggNet [1], ResNet [2], and DenseNet [3], are a class of neural networks. CNNs replace matrix multiplication with convolution operations in at least one neural network layer. Existing works have made CNN-based systems highly efficient in automated semantic segmentation [2,3,4]. Semantic segmentation is a class of classification at the pixel level. It is used to predict categorical labels for every pixel in an image, which promises a foundation for pathological study and helps physicians make more accurate clinical diagnoses [5]. Biomedical image segmentation is currently a critical and challenging task. However, the small amount of available labeled data limits the success of CNNs in segmenting biomedical images. In addition, federal and edging learning are required for algorithms that can feasibly be trained with a small dataset from a single hospital [6]. Ronneberger et al. proposed U-Net [7], a relatively shallow structure with shortcut connections, which is suitable for small-scale biomedical image segmentation tasks. U-Net copies the output features derived from convolutions in the encoder and concatenates them with the input features of the corresponding decoder. These shortcut connections promise high-level semantics while recovering low-level features and fusing multiscale features that are essential for biomedical image segmentation [8]. Shortcut connections are currently the fundamental components of most methods that have been developed for biomedical image analysis [9]. Furthermore, MultiResUNet [10] passes the feature maps that are output by the contracting path through stacked convolutional layers instead of simply copying the feature maps. Similarly to MultiResUNet, UNet++ [11] fuses multiscale features from a chain of convolutional layers. BSEResUNet [12] and SAR-UNet [13] redesign the skip pathway with the squeezing-and-exciting method. Multiscale attention-guided U-Net [14] gradually enriches the skip pathway with self-attentional feature maps. This paper proposes the smooth attention branch (SAB), a novel architecture that simplifies the understanding of long-range dependencies for biomedical image segmentation.

1.1. Related Feature Fusion Works

Feature fusion is used to merge feature maps from different branches or layers via simple operations, such as summation or concatenation, and it is an omnipresent part of modern network architectures [15,16,17]. The main idea behind these schemes is to reuse features from previous layers. Feature fusion can reduce the number of trainable parameters in the resulting model and can work for model regularization, since feature maps with different characteristics are jointly trained.

1.1.1. Multilayer Feature Fusion

Encoder–decoder networks have disadvantages in that the valuable features that are captured are gradually lost through the contracting path because of pooling, in addition to the global information that indicates the relationships between the parts and the whole. Position-sensitive tasks, such as semantic segmentation and object detection, require those captured features and that global information. Previous works [18,19] recovered high-resolution representations from previous layers to raise the representation resolution from the successive convolutional output of a classifier or classifier-like network. In the context of biomedical image segmentation, U-Net [7] is the de facto multilayer feature fusion module that compensates for the knowledge of location losses in features derived from deep layers and the shortage of semantics obtained from shallow layers. CNNs can study heterogeneous information and improve their robustness through feature fusion of the output from the deep and shallow layers through a top-down path. UNet++ [11] and dilated densely connected U-Net [20] used the dense block idea from DenseNet [21]; adding intermediate convolutional blocks and densifying the cross-layer connections between blocks would cause the concatenated results of each upsampling block to be more semantically similar than the feature maps obtained with the original U-Net architecture. Neural architecture search (NAS)-UNet [22] designed an NAS [23] space that covered all candidate connections across layers to obtain a laterally repeatable U-Net structure that shared the same dimension between its input and output feature maps. Similar structures can also be seen when detecting objects. Feature pyramid networks (FPNs) [24] have top-down architectures with skip paths to capture features with rich semantics at every scale. This approach has improved generic feature extractors in several applications. Furthermore, EfficientNet [25] utilized a weighted bidirectional FPN to connect multilayer feature maps. M2Det [26] stacked FPNs at multiple levels. Feature fusion in the same dimensions from various consecutively connected hourglass-based CNNs was used to capture multiscale features for predicting objects.

1.1.2. Multibranch Feature Fusion

The authors of [27] showed that only a small fragment of the theoretical receptive field (TRF) is the effective receptive field (ERF). With more stacked convolutional layers, relative to the TRF, the occupied region of the ERF is gradually reduced at a rate of

O (1 / \sqrt{n})

. Convolutions with small kernels, such as

1 \times 1

or

3 \times 3

, are more effective than larger kernels with the same amount of computational resources. When a CNN requires the operation of dense prediction, such as in semantic segmentation, a large kernel (and ERF) plays a key factor when the CNN needs to simultaneously perform the classification and localization tasks. Following this concept, inception-like modules [4,28] have parallel convolutions with different filters (followed by concatenation) and capture different features with

1 \times 1

,

3 \times 3

, and

5 \times 5

filters, which makes the output feature obtain different receptive fields at the same depth. Similar structures can also be seen in the global convolutional network [29], where a CNN implements large kernels to build dense cross-layer connections and pixel-level classifiers to strengthen the ability of the CNN to operate on different attention modules.

It is noted that a larger kernel has more parameters and requires a greater computational cost. To this end, atrous convolution was proposed to allow a kernel to enlarge the receptive field while keeping the parameter size and computational complexity unchanged. Yu et al. utilized dilated convolutions to systematically aggregate multiscale contextual information [30]. Moreover, Lv et al. proposed a transformer-based attention-guided U-Net with atrous convolution [31]. This proved that an attention module could guide a model to produce a better segmentation boundary, while atrous convolution could increase the receptive field with fewer parameters.

1.2. Related Attention Mechanisms

Attention mechanisms [32] have proven to be some of the most important concepts in the CNN field, and they are motivated by a common-sense intuition that people should pay more attention to the small but critical parts of every large amount of data. CNNs, which have the property of intrinsic locality, cannot model the long-range pixel–pixel dependencies that are present in images [33]. To address the problem of lacking the ability to model long-range pixel–pixel dependencies in biomedical image segmentation, previous studies proposed combining CNNs with attention mechanism and achieved significant biomedical image segmentation performance [34,35,36].

An attention mechanism that simulated the human brain’s ability to categorize information was first proposed for natural language processing by [37], where an encoder was employed in a sequence-to-sequence prediction model to automatically pick every useful word or subsequence in a source language for translation into a target language. Visual attention mechanisms [38] are widely applied in computer vision tasks, including neural classification [39], detection [40], segmentation [41], and video processing [42]. A self-attention mechanism computes localization as the weighted sum of every feature at a position. We categorized existing self-attention-based methods into spatial and channel attention approaches. Spatial self-attention is position-sensitive, and it weighs every pixel. Spatial attention pays no attention to the information interactions between channels. Meanwhile, spatial attention considers every channel’s output features to the same degree. Similarly, channel attention ignores the information interactions in the spatial dimension. We closely reviewed related attention mechanisms in visual recognition tasks [43] in terms of two aspects: spatial attention and channel attention. We will also discuss some works related to multiscale fusion.

1.2.1. Spatial Attention

Spatial attention focuses on spatial locations. Therefore, it learns to weigh each pixel, and it makes a

h \times w

2D matrix for the weights. Jaderberg et al. proposed a differentiable spatial attention layer that can be inserted into existing CNNs, giving neural networks the ability to study the locations of object areas based on the input features [33]. Hu et al. showed that it is possible to use stacked self-attentions to constitute a network with full attention mechanisms by limiting the attentional operation to a local area [44].

1.2.2. Channel Attention

Channel attention exploits the interchannel relationships of features. Therefore, it learns to weigh each channel, and the form used for the weights is a one-dimensional vector. Each channel map containing high-level semantics can be considered as related to a specific class, and different semantic relations are associated with each other. Fu et al. built a cross-channel interaction-based self-attentional network to explicitly model the long-range pixel–pixel dependencies between channels [45]. Hu et al. introduced the squeezing-and-exciting method, which performs dynamic channel-wise feature recalibration [46].

However, the use of a CNN plus an attention mechanism still has room for improvement, as few past works have studied this aspect in the segmentation of biomedical images. Further, the use of a CNN plus an attention mechanism has been shown to work efficiently only when large-scale training data are available [47]. When a CNN suffers from overfitting, attention mechanisms may worsen the overfitting problem. This becomes problematic when adopting attention mechanisms for biomedical image segmentation tasks with a small-scale training set.

To this end, instead of focusing on the relationship between input pixels, we performed self-attention on the inputs with high-level semantics. We (1) used convolutions on reshaped feature maps, (2) introduced position blurring as a regularization process, (3) employed inner cropping to exploit the interdependencies between patches at different positions, and (4) redesigned skip connections to lessen the semantic gaps between features of the contracting and expansive paths. Finally, we combined these steps and describe the overall architecture as a smooth attention branch (SAB).

2. Methods

Figure 1 shows the architecture of U-Net with an SAB (SAB-Net). When a CNN suffers from overfitting, attention mechanisms may worsen the overfitting problem. This becomes problematic when adopting attention mechanisms for biomedical image segmentation tasks with a small-scale training set. To overcome this problem, we extend the existing self-attentions by developing an attention mechanism to encode long-range pixel–pixel dependencies from reshaped feature maps instead of directly applying softmax functions to all possible positions. In this work, we refer to traditional self-attentions that directly apply softmax functions to all possible positions as rough attentions. We refer to the proposed attention mechanism that can prevent overfitting while training on an insufficient dataset as smooth attention. This section first describes the SAB architecture, a regularization method, and a patchwise training strategy.

2.1. Network Architecture

In this work, we use

x_{o} \in R^{c_{in} \times h \times w}

,

\forall o \in N

to denote every input feature map, while the height is h, the width is w, the channels are

c_{in}

, and

N

is a whole location lattice. Through the contracting path, which is also known as encoding, convolutional layers output lower-spatial-resolution feature maps with high-level semantics for image recognition. Through the expansive path, which is also known as decoding, convolutional layers output higher-spatial-resolution feature maps with higher precision for localization and image recovery.

Contracting paths consist of the repeated application of two

3 \times 3

convolutions with strides of

1 \times 1

, followed by the most commonly used activation function, the rectified linear unit (ReLU), and a

2 \times 2

max-pooling operation with a stride of

2 \times 2

for downsampling. The corresponding expansive paths consist of the repeated application of transposed convolution with a stride of

2 \times 2

for upsampling, and the upsampled features concatenate the corresponding features copied from the contracting path; two

3 \times 3

convolutions with strides of

1 \times 1

follow, both of which are followed by a ReLU function. Let us consider a contracting path with a corresponding expansive path as a stage and the so-called “bottleneck” as an independent stage. Then, SAB-Net consists of five stages and classifies every pixel with a

1 \times 1

convolution with a stride of

1 \times 1

at the end. In the first and fourth stages without the SAB, the output features of the contracting path are first copied and concatenated with the corresponding output features derived from the expansive path. Then, the concatenated features constitute the new input for the successive convolutional layers. We chose this design since (1) the first stage has the highest resolution. SAB-Net maintains a high resolution instead of recovering high-resolution information from low-resolution information and, accordingly, generates a reliable highest-resolution representation with possibly high spatial sensitivity. (2) The fourth and fifth stages have the lowest and second-lowest resolutions, where one pixel represents a very large area of the input image. Implementing the SAB in the fourth and fifth stages runs counter to the goals of focusing on local patterns in place of the global context. (3) According to the SAB-Net architecture (Figure 1), features from two stages are needed. Thus, the SAB is implemented in the second stage. Further details are described in the following Section 2.2.

2.2. Smooth Attention Branch

SAB-Net is a multilevel attention network for the proposed model that simultaneously exploits higher-level semantic and spatial information. As depicted in the SAB-Net architecture (Figure 1), SAB-Net consists of two major components. By definition, according to [37], attention is a weighted average of values,

A t t = \sum_{i} α_{i} v_{i}

, where v refers to values and

\sum_{i} α_{i} = 1

. We start from a restriction in which

α

is a one-hot vector; then, the attention operation becomes the same as retrieval from a group of values v by index

α

. When we remove this restriction, the attention operation is computed based on the probability vector

α

, as in “proportional retrieval”. Bahdanau et al. computed the weighted probabilities

α

over each annotation

v_{i}

in [37]:

e_{i j} = a (s_{i}, h_{j}), α_{i j} = \frac{exp (e_{i j})}{\sum_{k} exp (e_{i k})},

(1)

in which

h_{j}

is derived from the contracting path and

s_{i}

is derived from the expansive path. However, when the contracting and expansive paths are of lengths m and n, respectively, a network must operate

m \cdot n

times to compute all of the attention scores

e_{i j}

. Vaswani et al. modified the attention operation to address this problem [48]. The modified attention operation first projects s and h into a common space. Then, the dot product (or any similarity measure) constitutes the attention score, making the formulation

e_{i j} = f (s_{i}) g {(h_{j})}^{T} .

(2)

The modified attention operation obtains the projection vectors by only executing the computation

g (h_{j}) m

times and

f (s_{i}) n

times, and it computes

e_{i j}

effectively with a matrix.

2.2.1. Convolutions within SABs

Inspired by [37], we project the output feature maps of the second stage to queries

q_{o}

, keys

k_{o} \in R^{c_{in} \times 1 \times h}

, and values

v_{o} \in R^{c_{in} \times 2 \times h}

. This is followed by a matrix-multiplication operation between the transpose of

q_{o}

and

k_{o}

. Now, the modified attention operation projects the feature maps

q_{o}^{T} k_{l} \in R^{c_{in} \times h \times h}

to the successive convolutional layers

C o n v (\cdot)

.

{(q_{o}^{T} k_{l})}^{'} = C o n v (q_{o}^{T} k_{l})

(3)

Unlike in convolutions, the modified attention operation calculates the values v with the index computed by

softmax {(q_{o}^{T} k_{l})}^{'}

and extracts a feature from the pixel with the highest score based on

q_{o}^{T}

against a set of

k_{l}

. The computation of such global affinities is still time-consuming. This becomes problematic when using modified attention operations for large feature maps with constrained memory. We decompose the problem into height-axis and width-axis subproblems that can be separately solved to simultaneously reduce computational costs and optimize memory consumption. We first perform the height-axis attention operation with positional contraction along the width-axis attention operation. We denote the updated self-attention applied to every feature map x as:

y_{o} = \sum_{w = 1}^{W} {softmax}_{l} {(q_{o}^{T} k_{l})}^{'} (v_{l})

(4)

where

{softmax}_{l}

indicates the softmax function applied to all suitable candidate

l = (i, w)

positions around the location

o = (i, j)

.

2.2.2. Position Blurring

For the positional contraction of the images, learning with a small-scale training set is difficult [47]. To further prevent overfitting, we produce two position-blurred feature maps and then concatenate them with the original feature map. Specifically, we change the value into a random number every two pixels. Two position-blurred feature maps change to different values, and thus, convolutions can only generate features based on different local information. The underlying hypotheses behind position-blurred feature maps are that (1) they cause the system to escape local minimum traps and (2) they increase the amount of training data, which should improve the segmentation performance of the CNN. We write the updated self-attention with position-blurred feature maps along the width axis as:

y_{o} = \sum_{w = 1}^{W} {softmax}_{l} ({(q_{o}^{T} k_{l})}^{'} + t_{1} {(q_{o}^{T} k_{l})}^{'} + t_{2} {(q_{o}^{T} k_{l})}^{'}) (v_{l})

(5)

2.2.3. Inner Cropping

Recently developed CNNs for medical image segmentation have consisted of a large number of parameters, making traditional data augmentation techniques insufficient. Inspired by cropping [49], which was proposed to avoid overfitting while increasing the amount of training data, we propose an inner cropping technique. The self-attention with inner cropping performed on the width axis is then computed as:

y_{o} = \sum_{w = 1}^{W} {softmax}_{l} (\sum_{p = 1}^{P} ({(q_{o}^{T} k_{l})}^{'} + t_{1} {(q_{o}^{T} k_{l})}^{'} + t_{2} {(q_{o}^{T} k_{l})}^{'})) (v_{l})

(6)

where p denotes the patches cropped from input feature maps. Given an input feature map xs, we project it to queries

q_{o}

, keys

k_{o} \in R^{c_{g} \times 1 \times h}

, and values

v_{o} \in R^{c_{g} \times 2 \times h}

. In this work,

c_{g} = c_{in} / 4

. Patchwise training prevents the CNN from extracting any representations or dependencies for interpatch pixels. Self-attention operates on the patches to learn proper local features.

As depicted in Figure 1, we combined every proposed self-attention method, designed SAB-Net, and mainly implemented the SAB in the second stage. The first SAB extracts features from the output of the first convolution and then inputs them into the second convolution. Next, the output of the second convolution is obtained and then used as input for the second SAB, which redesigns the skip pathway. It has been observed that when output features are semantically similar [11], optimization problems are easier to solve for the optimizer. Then, the output features of the second SAB are concatenated with the upsampled output features from a series of downsampling and upsampling operations. We also fuse the output features from the third contracting path and the downsampled features within SAB-Net for the purpose of reducing the semantic dissimilarity between the features of SAB-Net and the expansive path before fusion.

2.3. Datasets

We evaluated all models on five biomedical segmentation tasks that are described below to demonstrate the generalization of SAB-Net [50,51,52,53].

Brain MRI dataset. The Brain MRI dataset described in [50] includes 3064 T1c magnetic resonance images collected from 233 patients with three brain tumor categories: 708 meningioma images, 1426 glioma images, and 930 pituitary tumor images. It is used for the evaluation of the segmentation of tumors from other lesions.

Heart MRI dataset. Segmenting the left atrium is one of the main practical problems when guiding atrial fibrillation ablation, quantifying liver fibrosis, and processing biophysics. The Heart MRI dataset reported in [51] contains MRI images covering the entire heart obtained from 30 patients, and it is used for the evaluation of left atrium segmentation.

Liver CT dataset. Liver metastases refer to cancerous tumors that have spread to the liver and started in another part of the body. Liver metastases are more common than primary liver cancers. The Liver CT dataset introduced in [52] includes 40 contrast-enhanced CT (CECT) images of three types of diseases, from primary to secondary liver tumors and metastases, and it is used for the evaluation of the segmentation of liver tumors from other lesions.

Spleen CT dataset. The Spleen CT dataset used in [51] contains 40 CECT images of randomly chosen scans of patients undergoing chemotherapy treatment for liver metastases. We used the Spleen CT dataset for the evaluation of the segmentation of spleen CT images.

Colonoscopy dataset. Colorectal cancer is mainly derived from adenomatous polyps developing in glandular tissues of the colonic mucosa, and it is the gold standard for colorectal polyp and cancer diagnosis. The Colonoscopy dataset introduced in [53] includes 612 polyp images with corresponding annotations, it is used for the evaluation of colorectal cancer segmentation.

2.4. Implementation Details

In each experiment, the cases were first randomly divided into training (approximately 50%), validation (approximately 25%), and test (approximately 25%) sets. This guaranteed that the images in the three sets were from different cases. The numbers of selected positive and negative cases corresponded to their distributions in each dataset to guarantee that the learned features were meaningful. Then, we chose 100 images from the training set. The chosen images were resized to

128 \times 128

. All methods were fine-tuned with the Dice loss:

L_{d i c e} = - \frac{1}{N} \sum_{k = 1}^{N} (1 - \frac{2 y_{k} \cdot t_{k}}{y_{k}^{2} + t_{k}^{2}}),

(7)

where the segmentation result is denoted as

y_{k}

, the ground truth is denoted as

t_{k}

, and the batch size is denoted as N. The networks were developed by running on the PyTorch framework [54] and were trained on a GeForce RTX 3080 with the He-Normal initializer, the adaptive moment estimation (ADAM) optimizer, and batch normalization for 100 epochs. The quantitative segmentation performance was evaluated with the sensitivity (TPR%), specificity (TNR%), Dice coefficient (Dice%), and 95% Hausdorff distance (HD95).

3. Results

For comparison, we use three widely used models as baselines, including the original U-Net, Att-UNet [55], which employs spatial attention, and Focus-UNet [56], which combines spatial attention and channel attention. The next subsections first present a series of ablation experiments carried out on the Brain MRI task. Then, we summarize the comparisons between the proposed method and the convolutional and transformer-based baselines on brain MRI images, heart MRI images, liver CT images, spleen CT images, and Colonoscopy datasets.

3.1. Results Obtained on the Brain MRI Dataset

We employed the SAB on the original U-Net to model long-range pixel–pixel dependencies for a comprehensive understanding of the feature map. To verify the performance of the proposed self-attention mechanism and other techniques, we performed experiments with various settings, as shown in Table 1. AB represents U-Net with traditional attention modules, SAB-1, 2, and 3 are the methods employing one, two, and three of the proposed operations, and SAB denotes the method employing all four proposed operations. We found that the SAB significantly improved the segmentation performance. We also observed that the convolutional operation (Conv) conducted on the reshaped feature maps was the most critical operation in comparison with the results of AB. Therefore, SAB-1, SAB-2, and SAB-3 should include Conv to maintain their performance. All variants that implemented two of the proposed operations (SAB-2) outperformed U-Net. Compared with U-Net, employing Conv and Plus yielded a Dice score of 59.23% and an 8.98% improvement. Additionally, the variants employing three of the proposed operations (SAB-3) outperformed U-Net by 7.10% to 12.03%. Furthermore, when we integrated all of the proposed operations, the performance improved to 61.83%, and this approach, therefore, achieved significantly improved application performance on the segmentation of brain MRI images over that of AB by 13.76%. We tested the SAB with inner cropping of

c_{in} / 2

and

c_{in} / 4

, and we observed that the SAB with inner cropping of

c_{in} / 2

achieved the highest Dice scores on the Brain MRI dataset. Meanwhile, the SAB with inner cropping of

c_{in} / 4

outperformed the SAB with inner cropping of

c_{in} / 2

in terms of the Dice scores on the other three datasets. We argue that a higher inner cropping rate can help U-Net to increase the diversity among the features and, thus, improve the segmentation performance. We only show the SAB with inner cropping of

c_{in} / 4

in Table 2, Table 3, Table 4 and Table 5. Meanwhile, the original U-Net outperformed AB-UNet in our experiments. Therefore, we compared SAB-Net with the original U-Net in Table 2, Table 3, Table 4 and Table 5. Advanced attention-guided models, Att-UNet and Focus-UNet, were used for comparison between the proposed SAB-UNet and the self-attention-based U-Net.

Table 2 shows the segmentation performance of U-Net, UNet++, Att-UNet, Focus-UNet, and our suggested SAB-Net architecture. We observed that SAB-Net produced the best results with dominant advantages by comparing SAB-Net with recently proposed CNNs on the Brain MRI test set. In particular, SAB-Net outperformed UNet++, a state-of-the-art framework for segmenting biomedical images, by 9.53% in terms of Dice scores. Moreover, SAB-Net outperformed two powerful transformer-based networks by significant margins. As discussed above, transformer-based networks may encounter overfitting problems when insufficient training data are provided. Att-UNet yielded better improvements in segmentation performance in comparison with U-Net and UNet++. Focus-UNet was modified from Att-UNet via integration with a channel attention module. It performed well on the training set (which is not listed here), but did not work well on the test set.

Figure 2 shows the qualitative results obtained on the sample test images from the Brain MRI dataset. U-Net had a rather high misclassification rate. Some details and object boundaries were precisely obtained with Att-UNet, but the central areas were misclassified as gliomas. In contrast, most of the categories that were misclassified by other methods were correctly classified by SAB-Net. This supports our claim that integrating all of the proposed operations lets the network learn context information. Thus, the proposed methods enhanced the discrimination of details and improved the semantic consistencies for the segmentation tasks involving this Brain MRI dataset.

3.2. Results Obtained on the Heart MRI Dataset

Table 4 demonstrates the successful application of SAB-Net to ventricular segmentation and the superior performance compared with that of existing approaches based on the validation results obtained on the Heart MRI dataset. SAB-Net’s average Dice value exceeded that of U-Net, the baseline, by 35.32%. We argue that the degradation problem caused U-Net to have very low Dice scores on the Heart MRI dataset. When increasing the network capacity with depth, the network performance became unsurprisingly saturated and was then quickly degraded [2]. This was also verified by our experiments, as shown in Table 4. Figure 3 illustrates an example of qualitative results obtained by U-Net, Att-UNet, and SAB-Net for the Heart MRI images.

3.3. Results Obtained on the Spleen and Liver CT Datasets

As shown in Table 3 and Table 6, SAB-Net’s segmentation of the overall CT images was more accurate than that of U-Net on small-scale datasets. In terms of the Dice coefficient, SAB-Net achieved average improvements of 23.60% and 5.86% over U-Net on the Spleen and Liver CT datasets, respectively. SAB-Net also outperformed UNet++, Att-UNet, and Focus-UNet for the tasks of spleen and liver CT image segmentation in terms of the Dice coefficient. Specifically, SAB-Net yielded average improvements of 2.68% and 5.68% over the second-best algorithms on the Spleen (Att-UNet) and Liver (UNet++) CT datasets. By evaluating U-Net in comparison with the UNet++ and Att-UNet architectures across the spleen and liver CT image segmentation tasks, we observed that the learning of long-range pixel–pixel dependencies played a critical role that affected the ability of the utilized model to segment spleen CT images, while bridging the semantic gap improved the segmentation performance achieved on the liver CT images.

Figure 4 and Figure 5 show the results of a qualitative comparison of U-Net and Att-UNet with the proposed method, respectively. SAB-Net could capture higher semantic similarities and long-range relationships. Figure 5 demonstrates that some categories misclassified by U-Net and Att-UNet were correctly classified by our proposed method.

3.4. Results Obtained on the Colonoscopy Dataset

As shown by the experimental results in Table 5, to evaluate the generalizability of SAB-Net concerning clinical applications other than those with CT and MRI images, we also conducted experimental studies on the colonoscopy images. In terms of the Dice coefficient, SAB-Net improved its segmentation performance over that of the original U-Net by 9.40% and over that of the second-best algorithm, Att-UNet, by 1.61% on the Colonoscopy dataset, even though the gains in segmentation accuracy were minor for colonoscopy images compared with those achieved for the MRI and CT images. Figure 6 shows a qualitative segmentation example from the endoscopy dataset to visualize the effect of the proposed SAB-Net.

4. Discussion

Our experimental results demonstrate that the SAB outperformed existing approaches across various biomedical image segmentation tasks when trained with small datasets. In this section, we summarize the benefits of the SAB in our experiments, especially in terms of resolving the degradation problem. That is, the segmentation performance was usually saturated and was quickly degraded when the number of stacked convolutional layers increased. According to some studies [57,58], this kind of degradation is not caused by overfitting, and increasing the convolutional layers to a suitable depth brings a higher training loss, which was supported by our experiments (as shown in Figure 7). This degradation process indicates that it is not similarly easy for every CNN to achieve an optimal condition. The SAB addresses this issue by fusing multiscale features and modeling long-range pixel–pixel dependencies.

4.1. Residual Learning Framework and Comparison with the U-Net Family

We found that the degradation problem was alleviated when the features learned from the contracting and expansive paths were semantically similar. To explain the reasoning for this phenomenon, we first review the residual learning framework (ResNet) and its relationship with shortcut connections within the U-Net family.

ResNet [2] addresses the degradation problem by introducing shortcut connections (residual connections) between layers to skip layers in the forward step of an input. As ResNet has gained increasing popularity in the research community, the practices and theories based on shortcut connections have achieved notable progress. ResNeXt [59] exploits the effects of shortcut connections through the split–transform–merge paradigm; i.e., output feature maps from different layers are integrated by jointly adding them. DenseNet [21] adopts shortcut connections to all subsequent layers to prevent direct summation, but preserves the features in preceding layers. Despite some small differences between ResNet and its two variants, i.e., ResNeXt and DenseNet, they all use identity mappings. UNet++ outperformed the original U-Net in our experiments. Instead of directly inputting fast-forward output feature maps acquired from the contracting to the expansive path, UNet++ and the SAB fused high-level semantics with low-level semantics. Our results that were obtained on five biomedical image datasets confirmed the previous perspective that multilevel semantics [10,60] enable networks to learn the fine-grained details of foreground objects more.

The major contribution of U-Net is arguably the implementation of the shortcut connections between layers in the contracting and expansive paths. These shortcut connections facilitate the direct flow of gradients to the earlier layers without any degradation. Let us denote the desired underlying mapping as

H (x)

. With one downsampling operation and one upsampling operation, U-Net recasts the original mapping to

F (x) + x

, which is similar to a residual connection. However, when U-Net becomes deeper, the formulation becomes

F_{2} (F_{1} (x)) + x

, which reduces the expected effect of the residual connection. Figure 7 shows typical examples of the progress of the training and validation losses on the training set of the Heart MRI dataset. A U-Net with a three-layer deep contracting path and a three-layer expansive path (shallow-UNet) is easier to optimize than a U-Net with a four-layer deep contracting path and a four-layer expansive path. Differently from the plain shortcut connections in the original U-Net, the second SAB shown in Figure 1 first recasted the above mapping,

F_{2} (F_{1} (x)) + x

, into

F_{2} (F_{1} (x)) + F_{1} (x) + x

. Then, by fusing the upsampled feature maps within the shortcut connections, the new mapping is expressed as

F_{2} (F_{1} (x)) + F_{2} (x) + F_{1} (x) + x

. Our experimental results show that it is simpler for the optimizer to optimize the residual mapping than the original semantically dissimilar mapping.

4.2. Comparison with Recent Self-Attentions

Because of the vanishing gradient problem, learning long-term relationships with a CNN is difficult [61]. Adding attention modules to shortcut connections is beneficial for solving the vanishing gradient problem [62]. The SAB leverages self-attention to encode long-range pixel–pixel dependencies. We compared U-Net with SAB-Net and two self-attentions, Att-UNet and Focus-UNet. We found that not every attention mechanism improved the segmentation results. When we analyzed the segmentation results predicted by U-Net and Att-UNet, it appeared that Att-UNet often misclassified areas that were also misclassified by U-Net. To overcome this problem, we extended the existing self-attentions by introducing a subnetwork in the SAB to encode long-range pixel–pixel dependencies from reshaped feature maps instead of directly applying softmax functions to all possible positions.

4.3. Position Blurring and Inner Cropping

Furthermore, position blurring and inner cropping techniques are used in the SAB specifically for small-scale datasets. The underlying hypothesis behind these two techniques is that a self-attention can more effectively model long-range pixel–pixel dependencies in small-scale biomedical tasks when we reduce the strong correlations among different hidden units in the self-attention. This is different from the classic attention mechanisms, which rely on the quality of the input images. As the backgrounds of images are usually scattered, modeling the long-range pixel–pixel dependencies between the background pixels of an image helps prevent a CNN from misclassifying pixels and, therefore, leads to a reduction in the true negative rate (TNR; classifying the background as negative and the segmentation masks as positive). Regarding the TNR, SAB-Net consistently outperformed Att-UNet, except in the Brain MRI segmentation case. This supports the notion that the SAB gives a promising direction for segmenting biomedical images. As shown in Table 2, Table 3, Table 4 and Table 5, based on the proposed attention module modifications, the SAB is easier to optimize and has good generalization performance on small-scale biomedical image segmentation tasks.

4.4. Limitations and Future Work

From Table 2, Table 3, Table 4 and Table 5, the experimental values of the TNR show the effectiveness of the proposed SAB-Net in identifying images without diseases. This means that a small-scale training set is enough to train SAB-Net to segment healthy human organs. However, the experimental values of the TPR show that a small-scale training set becomes problematic when adopting SAB-Net for the segmentation of images with diseases. As shown in Figure 8, a disease may appear in different parts of the left atrium, which means that SAB-Net may not learn similar circumstances in images of the left atrium with insufficient training data. Under these circumstances, SAB-Net can only segment images based on knowledge for the segmentation of a healthy left atrium. At the same time, a test left atrium image from a different patient from that in the training data can be provided. Therefore, with limited training data, SAB-Net can identify images without diseases well, but it is hard for it to identify images with diseases. The predicted contours are not similar when compared to the ground truth, which causes a low Dice coefficient. Since we used the Dice loss, which originates from the Dice coefficient, and left atrium image used for validation was also provided by a patient who was different from that in the training data, and the validation loss for 100 epochs was still in the 50–60% range. This indicates that large-scale and diverse training data may still be the key to the success of training with biomedical image datasets. Attention should be paid to federal learning because local data may be not typical enough to get a generalized model.

5. Conclusions

This paper presents an SAB for small-scale biomedical image segmentation. The SAB can model the long-range pixel–pixel dependencies in an image by using the proposed self-attention. Specifically, we feed reshaped feature maps to a subnetwork instead of directly applying softmax functions to all possible positions to prevent overfitting. We also introduce techniques such as position blurring, inner cropping, and redesigning the skip pathway. Thus, the proposed method can capture long-range pixel–pixel dependencies more effectively than other approaches and learn better feature representations for biomedical image segmentation. Our experiments were carried out by training models on 100 random images. SAB-Net consistently achieved outstanding performance on five biomedical segmentation tasks, i.e., the Brain MRI, Heart MRI, Liver CT, Spleen CT, and Colonoscopy datasets. We expect that the proposed method can also be used in edging and federal training with limited data from a single medical institution. In addition, one of the challenges left for the future is the extension of the proposed method to large-scale biomedical image segmentation and the enhancement of the model’s robustness. This will enable the generalizability of the proposed methods.

Author Contributions

W.W., X.Z., L.J. and M.D. designed the experiments. W.W. conducted the experiments. W.W., X.Z., L.J. and M.D., interpreted the data and drafted the manuscript. W.W. directed the research and gave initial input. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Competitive Research Fund of The University of Aizu, 2022-P-12.

Institutional Review Board Statement

Institutional Review Board approval was waived for this study because only open databases were used.

Informed Consent Statement

Patient consent was waived for this study because only open databases were used.

Data Availability Statement

All datasets are publicly available. We have five benchmark datasets. Our code and trained models are available at https://github.com/on1kou95/Smooth-Attention-Branch.

Conflicts of Interest

The authors declare no conflict of interest.

References

Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
Liu, X.; Song, L.; Liu, S.; Zhang, Y. A review of deep-learning-based medical image segmentation methods. Sustainability 2021, 13, 1224. [Google Scholar] [CrossRef]
Lian, Z.; Yang, Q.; Wang, W.; Zeng, Q.; Alazab, M.; Zhao, H.; Su, C. DEEP-FEL: Decentralized, Efficient and Privacy-Enhanced Federated Edge Learning for Healthcare Cyber Physical Systems. IEEE Trans. Netw. Sci. Eng. 2022, 9, 3558–3569. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
Drozdzal, M.; Vorontsov, E.; Chartrand, G.; Kadoury, S.; Pal, C. The importance of skip connections in biomedical image segmentation. In Deep Learning and Data Labeling for Medical Applications; Springer: Berlin/Heidelberg, Germany, 2016; pp. 179–187. [Google Scholar]
Gegundez-Arias, M.E.; Marin-Santos, D.; Perez-Borrero, I.; Vasallo-Vazquez, M.J. A new deep learning method for blood vessel segmentation in retinal images based on convolutional kernels and modified U-Net model. Comput. Methods Programs Biomed. 2021, 205, 106081. [Google Scholar] [CrossRef]
Ibtehaz, N.; Rahman, M.S. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020, 121, 74–87. [Google Scholar] [CrossRef]
Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar]
Li, D.; Rahardja, S. BSEResU-Net: An attention-based before-activation residual U-Net for retinal vessel segmentation. Comput. Methods Programs Biomed. 2021, 205, 106070. [Google Scholar] [CrossRef]
Wang, J.; Lv, P.; Wang, H.; Shi, C. SAR-U-Net: Squeeze-and-excitation block and atrous spatial pyramid pooling based residual U-Net for automatic liver segmentation in Computed Tomography. Comput. Methods Programs Biomed. 2021, 208, 106268. [Google Scholar] [CrossRef]
Cui, H.; Yuwen, C.; Jiang, L.; Xia, Y.; Zhang, Y. Multiscale attention guided U-Net architecture for cardiac segmentation in short-axis MRI images. Comput. Methods Programs Biomed. 2021, 206, 106142. [Google Scholar] [CrossRef]
Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 354–370. [Google Scholar]
Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [Green Version]
Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7103–7112. [Google Scholar]
Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499. [Google Scholar]
Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
Cao, X.; Chen, H.; Li, Y.; Peng, Y.; Wang, S.; Cheng, L. Dilated densely connected U-Net with uncertainty focus loss for 3D ABUS mass segmentation. Comput. Methods Programs Biomed. 2021, 209, 106313. [Google Scholar] [CrossRef] [PubMed]
Jégou, S.; Drozdzal, M.; Vazquez, D.; Romero, A.; Bengio, Y. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 11–19. [Google Scholar]
Weng, Y.; Zhou, T.; Li, Y.; Qiu, X. NAS-Unet: Neural Architecture Search for Medical Image Segmentation. IEEE Access 2019, 7, 44247–44257. [Google Scholar] [CrossRef]
Zoph, B.; Le, Q.V. Neural architecture search with reinforcement learning. arXiv 2016, arXiv:1611.01578. [Google Scholar]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114. [Google Scholar]
Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H. M2det: A single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January 27–1 February 2019; Volume 33, pp. 9259–9266. [Google Scholar]
Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4898–4906. [Google Scholar]
Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
Peng, C.; Zhang, X.; Yu, G.; Luo, G.; Sun, J. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4353–4361. [Google Scholar]
Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
Lv, Y.; Ma, H.; Li, J.; Liu, S. Attention guided U-Net with atrous convolution for accurate retinal vessels segmentation. IEEE Access 2020, 8, 32826–32839. [Google Scholar] [CrossRef]
Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
Zhang, Y.; Liu, H.; Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; pp. 14–24. [Google Scholar]
Karimi, D.; Vasylechko, S.D.; Gholipour, A. Convolution-free medical image segmentation using transformers. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; pp. 78–88. [Google Scholar]
Wang, W.; Chen, C.; Ding, M.; Yu, H.; Zha, S.; Li, J. Transbts: Multimodal brain tumor segmentation using transformer. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; pp. 109–119. [Google Scholar]
Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
Cheng, J.; Dong, L.; Lapata, M. Long short-term memory-networks for machine reading. arXiv 2016, arXiv:1601.06733. [Google Scholar]
Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
Zhao, H.; Jia, J.; Koltun, V. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10076–10085. [Google Scholar]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
Ji, Z.; Xiong, K.; Pang, Y.; Li, X. Video summarization with attention-based encoder–decoder networks. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1709–1717. [Google Scholar] [CrossRef] [Green Version]
Fukui, H.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. Attention branch network: Learning of attention mechanism for visual explanation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10705–10714. [Google Scholar]
Hu, H.; Zhang, Z.; Xie, Z.; Lin, S. Local relation networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3464–3473. [Google Scholar]
Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48. [Google Scholar] [CrossRef]
Cheng, J.; Huang, W.; Cao, S.; Yang, R.; Yang, W.; Yun, Z.; Wang, Z.; Feng, Q. Enhanced performance of brain tumor classification via tumor region augmentation and partition. PLoS ONE 2015, 10, e0140381. [Google Scholar] [CrossRef] [PubMed]
Simpson, A.L.; Antonelli, M.; Bakas, S.; Bilello, M.; Farahani, K.; Van Ginneken, B.; Kopp-Schneider, A.; Landman, B.A.; Litjens, G.; Menze, B.; et al. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv 2019, arXiv:1902.09063. [Google Scholar]
Bilic, P.; Christ, P.F.; Vorontsov, E.; Chlebus, G.; Chen, H.; Dou, Q.; Fu, C.W.; Han, X.; Heng, P.A.; Hesser, J.; et al. The liver tumor segmentation benchmark (lits). arXiv 2019, arXiv:1901.04056. [Google Scholar] [CrossRef]
Bernal, J.; Tajkbaksh, N.; Sánchez, F.J.; Matuszewski, B.J.; Chen, H.; Yu, L.; Angermann, Q.; Romain, O.; Rustad, B.; Balasingham, I.; et al. Comparative validation of polyp detection methods in video colonoscopy: Results from the MICCAI 2015 endoscopic vision challenge. IEEE Trans. Med Imaging 2017, 36, 1231–1249. [Google Scholar] [CrossRef]
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
Yeung, M.; Sala, E.; Schönlieb, C.B.; Rundo, L. Focus U-Net: A novel dual attention-gated CNN for polyp segmentation during colonoscopy. Comput. Biol. Med. 2021, 137, 104815. [Google Scholar] [CrossRef]
Goodfellow, I.; Warde-Farley, D.; Mirza, M.; Courville, A.; Bengio, Y. Maxout networks. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1319–1327. [Google Scholar]
Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway networks. arXiv 2015, arXiv:1505.00387. [Google Scholar]
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
Jiang, J.; Hu, Y.C.; Liu, C.J.; Halpenny, D.; Hellmann, M.D.; Deasy, J.O.; Mageras, G.; Veeraraghavan, H. Multiple resolution residually connected feature streams for automatic lung tumor segmentation from CT images. IEEE Trans. Med. Imaging 2018, 38, 134–144. [Google Scholar] [CrossRef]
Amini, M.H. Optimization, Learning, and Control for Interdependent Complex Networks; Springer: Berlin/Heidelberg, Germany, 2020; Volume 1123. [Google Scholar]
Shan, D.; Zhang, X.; Shi, W.; Li, L. Neural Architecture Search for a Highly Efficient Network with Random Skip Connections. Appl. Sci. 2020, 10, 3712. [Google Scholar] [CrossRef]

Figure 1. (a) U-Net; (b) an example of an SAB network. Each blue box corresponds to a multichannel feature map. The size is provided on top of the associated box. B, N, W, H, and G represent the batch size, number of channels, width, height, and group, respectively. (c) The details of the SAB.

Figure 2. Qualitative results obtained for the brain MRI images. From left to right: ground truth, U-Net, Att-UNet, and SAB-Net.

Figure 3. Qualitative results obtained for the heart MRI images. From left to right: ground truth, U-Net, Att-UNet, and SAB-Net.

Figure 4. Qualitative results obtained for the spleen CT images. From left to right: ground truth, U-Net, Att-UNet, and SAB-Net.

Figure 5. Qualitative results obtained for the liver CT images. From left to right: ground truth, U-Net, Att-UNet, and SAB-Net.

Figure 6. Qualitative results obtained for the colonoscopy images. From left to right: ground truth, U-Net, Att-UNet, and SAB-Net.

Figure 7. (a) Progress of the training procedure and (b) the validation loss with the number of epochs when conducting training on the Heart MRI dataset. Green: U-Net. Blue: shallow-UNet. Red: SAB-Net.

Figure 8. Examples of misclassified heart images. From left to right: original image, ground truth, and SAB-Net.

Table 1. Ablation study conducted on the Brain MRI test set. Conv denotes convolutions within SABs, PB denotes position blurring, Crop denotes inner cropping, and Plus denotes the redesigned skip pathway. SAB-1, 2, and 3 denote the methods employing one, two, and three of the proposed operations.

	Conv	PB	Crop	Plus	Dice
U-Net					54.35%
AB-UNet					49.85%
SAB-1-Net	✓				54.95%
SAB-1-Net		✓			48.08%
SAB-1-Net			$c_{in} / 2$		49.83%
SAB-1-Net			$c_{in} / 4$		49.86%
SAB-1-Net				✓	53.51%
SAB-2-Net	✓	✓			56.41%
SAB-2-Net	✓		$c_{in} / 2$		54.28%
SAB-2-Net	✓		$c_{in} / 4$		55.17%
SAB-2-Net	✓			✓	59.23%
SAB-3-Net	✓	✓	$c_{in} / 2$		57.40%
SAB-3-Net	✓	✓	$c_{in} / 4$		58.21%
SAB-3-Net	✓	✓		✓	60.89%
SAB-3-Net	✓		$c_{in} / 2$	✓	58.81%
SAB-3-Net	✓		$c_{in} / 4$	✓	60.80%
SAB-Net	✓	✓	$c_{in} / 2$	✓	62.37%
SAB-Net	✓	✓	$c_{in} / 4$	✓	61.83%

Table 2. Segmentation results obtained on the Brain MRI dataset.

	Dice	TPR	TNR	HD95
U-Net	54.35%	45.23%	99.80%	9.22
UNet++	56.45%	48.35%	99.76%	8.06
Att-UNet	58.06%	46.87%	99.86%	8.35
Focus-UNet	48.59%	37.34%	99.85%	9.43
SAB-Net	61.83%	55.15%	99.78%	8.06

Table 3. Segmentation results obtained on the Liver CT dataset.

	Dice	TPR	TNR	HD95
U-Net	73.94%	80.71%	98.24%	24.28
UNet++	74.06%	90.25%	97.50%	28.33
Att-UNet	72.82%	83.74%	97.84%	27.39
Focus-UNet	68.03%	87.79%	96.72%	39.92
SAB-Net	78.27%	82.63%	98.67%	21.21

Table 4. Segmentation results obtained on the Heart MRI dataset.

	Dice	TPR	TNR	HD95
U-Net	42.36%	27.46%	99.99%	5.60
UNet++	50.36%	37.14%	99.96%	5.00
Att-UNet	53.15%	50.20%	99.88%	7.00
Focus-UNet	35.19%	27.42%	99.91%	14.46
SAB-Net	57.32%	58.40%	99.89%	5.10

Table 5. Segmentation results obtained on the Colonoscopy dataset.

	Dice	TPR	TNR	HD95
U-Net	55.88%	42.11%	99.15%	8.77
UNet++	58.78%	46.28%	98.90%	8.00
Att-UNet	60.16%	50.82%	98.21%	6.40
Focus-UNet	46.10%	38.36%	97.24%	8.06
SAB-Net	61.13%	50.27%	98.60%	7.07

Table 6. Segmentation results obtained on the Spleen CT dataset.

	Dice	TPR	TNR	HD95
U-Net	60.12%	56.72%	99.93%	4.99
UNet++	68.93%	59.93%	99.97%	6.32
Att-UNet	72.37%	61.77%	99.98%	5.74
Focus-UNet	68.13%	55.16%	99.97%	7.01
SAB-Net	74.31%	63.05%	99.98%	4.24

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Weng, W.; Zhu, X.; Jing, L.; Dong, M. Attention Mechanism Trained with Small Datasets for Biomedical Image Segmentation. Electronics 2023, 12, 682. https://doi.org/10.3390/electronics12030682

AMA Style

Weng W, Zhu X, Jing L, Dong M. Attention Mechanism Trained with Small Datasets for Biomedical Image Segmentation. Electronics. 2023; 12(3):682. https://doi.org/10.3390/electronics12030682

Chicago/Turabian Style

Weng, Weihao, Xin Zhu, Lei Jing, and Mianxiong Dong. 2023. "Attention Mechanism Trained with Small Datasets for Biomedical Image Segmentation" Electronics 12, no. 3: 682. https://doi.org/10.3390/electronics12030682

APA Style

Weng, W., Zhu, X., Jing, L., & Dong, M. (2023). Attention Mechanism Trained with Small Datasets for Biomedical Image Segmentation. Electronics, 12(3), 682. https://doi.org/10.3390/electronics12030682

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Attention Mechanism Trained with Small Datasets for Biomedical Image Segmentation

Abstract

1. Introduction

1.1. Related Feature Fusion Works

1.1.1. Multilayer Feature Fusion

1.1.2. Multibranch Feature Fusion

1.2. Related Attention Mechanisms

1.2.1. Spatial Attention

1.2.2. Channel Attention

2. Methods

2.1. Network Architecture

2.2. Smooth Attention Branch

2.2.1. Convolutions within SABs

2.2.2. Position Blurring

2.2.3. Inner Cropping

2.3. Datasets

2.4. Implementation Details

3. Results

3.1. Results Obtained on the Brain MRI Dataset

3.2. Results Obtained on the Heart MRI Dataset

3.3. Results Obtained on the Spleen and Liver CT Datasets

3.4. Results Obtained on the Colonoscopy Dataset

4. Discussion

4.1. Residual Learning Framework and Comparison with the U-Net Family

4.2. Comparison with Recent Self-Attentions

4.3. Position Blurring and Inner Cropping

4.4. Limitations and Future Work

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI