Recently, with advances in computational techniques such as convolutional neural networks (CNNs) and the availability of hardware acceleration devices, deep learning-based techniques have been successfully used for various computer vision applications, including image detection/segmentation [
12,
13], image classification [
14,
15,
16,
17,
18], and image generation [
19]. For medical image segmentation, previous studies have used deep learning-based techniques for breast, brain stroke, and thyroid nodule segmentation. One of the most significant studies on medical image segmentation was carried out by Ronneberger et al. [
12]. In that study, Ronneberger et al. proposed an encoder-decoder neural network architecture for the image segmentation problem, namely, the UNet network. This architecture first uses a conventional fully convolutional network (FCN) as an encoder to learn useful information from an input image. The encoder is formed by stacking several convolution blocks (convolution layers followed by max pooling) for the purpose of image feature extraction. As a result, the outputs of the encoder are image feature maps at various levels of abstraction, in which each level represents the image information at a specific image scale. Subsequently, a decoder network decodes the output of the encoder and reconstructs the target image. For this purpose, the decoder uses transposed convolution layers to expand the size of the feature maps and convolution layers to learn the mapping function that reconstructs the target image. At every scale, the output feature maps of the encoder and the corresponding feature maps of the decoder are combined to capture both low-level and high-level image features, which helps to enhance the learning ability of the network. This architecture was found to work well in many medical image segmentation applications. To simplify the training of the UNet network and increase its depth, Khanna et al. [
20] used a residual UNet network architecture, i.e., a UNet network with residual connections. They showed that the residual UNet outperformed state-of-the-art methods on the retinal vessel segmentation problem. Recently, Zhou et al. [
21] proposed an enhancement to UNet for medical image segmentation, namely, the UNet++ network. The UNet++ network can be seen as a nesting of multiple UNet networks. Through experiments, the authors showed that their nested architecture outperformed the conventional UNet network. However, UNet++ does not exploit the information from every feature scale. To address this limitation, Huang et al. [
22] proposed a similar but more efficient network architecture, namely, UNet 3+, which exploits full-scale information using full-scale skip connections and deep supervision. By using the full-scale skip connections, the UNet 3+ network can incorporate both low-level and high-level features. As a result, the UNet 3+ network achieves higher segmentation accuracy than the UNet++ network. Instead of nesting UNet networks as in UNet++ or UNet 3+, Baccouche et al. [
23] proposed an enhanced version of the UNet network by connecting two UNet networks to form a so-called Connected-UNets network. In the Connected-UNets architecture, the output of the first UNet network is further refined by a second UNet network. The authors demonstrated the effectiveness of Connected-UNets in their experiments on the breast nodule segmentation problem. However, the use of two deep networks makes the Connected-UNets network very deep and may cause the vanishing gradient problem, which hinders the network in segmenting small objects. In a more recent study, Lin et al. [
24] attached transformer blocks to the encoder and decoder paths of the UNet network to model the long-range contextual information in input images. As a result, they obtained better segmentation accuracy than the UNet network and variants such as UNet++. Lu et al. [
25] simplified the UNet network by taking advantage of full-scale fusion, Ghost modules, and the unification of channel numbers. They showed that their network, namely, Half-UNet, achieved segmentation performance similar to that of the UNet and its variants while requiring fewer network parameters and floating-point operations.
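To make the encoder-decoder structure described above concrete, the following minimal sketch traces feature-map shapes through a UNet-style network and shows where the skip connections concatenate encoder and decoder features. It is an illustration under stated assumptions, not the implementation used in any of the cited studies: the function name `unet_shapes`, the base channel count of 64, the depth of 4, size-preserving convolutions, 2x2 max pooling, and 2x transposed convolutions are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def unet_shapes(
    in_shape: Shape, base_channels: int = 64, depth: int = 4
) -> Tuple[List[Shape], Shape, List[Dict[str, Shape]]]:
    """Trace feature-map shapes through a UNet-style encoder-decoder.

    Assumptions (illustrative only): convolutions preserve spatial size,
    pooling halves it, and transposed convolutions double it.
    """
    _, h, w = in_shape
    if h % (2 ** depth) or w % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    # Encoder: each convolution block is followed by 2x2 max pooling.
    encoder: List[Shape] = []
    ch = base_channels
    for _ in range(depth):
        encoder.append((ch, h, w))  # block output, kept for the skip connection
        h, w = h // 2, w // 2       # max pooling halves the resolution
        ch *= 2
    bottleneck: Shape = (ch, h, w)

    # Decoder: upsample, concatenate the same-scale encoder map, convolve.
    decoder: List[Dict[str, Shape]] = []
    for skip in reversed(encoder):
        ch //= 2
        h, w = h * 2, w * 2         # transposed convolution doubles the resolution
        up: Shape = (ch, h, w)
        merged: Shape = (up[0] + skip[0], h, w)  # channel-wise concatenation
        decoder.append({"up": up, "skip": skip, "merged": merged, "out": (ch, h, w)})
    return encoder, bottleneck, decoder


if __name__ == "__main__":
    enc, bott, dec = unet_shapes((1, 256, 256))
    for stage in dec:
        print(stage["skip"], "+", stage["up"], "->", stage["merged"])
```

In this sketch each decoder stage receives only the encoder map of the same scale. Full-scale skip connections, as described for UNet 3+, would instead feed every decoder stage with resized maps from several scales.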
In a study by Vakanski et al. [
26], the performance of the UNet network was enhanced by combining the input image with a saliency map through an attention-based module. In contrast to other studies that segment objects using a single input image, Vakanski et al. [
26] used an additional input, i.e., a saliency map obtained from the input image using handcrafted segmentation methods. This saliency map can therefore be used as an approximation of the segmentation of the target object. Although the saliency map contains segmentation errors, it can serve as expert prior knowledge about the object and its characteristics. Using the saliency map, they designed an attention module that incorporates the information in the saliency map into the feature maps of the UNet network to enhance the segmentation performance of a conventional UNet network. Through experiments, they showed that the saliency map and the attention mechanism are effective in enhancing segmentation performance compared to the conventional UNet network. However, this method does not clearly specify how the saliency maps are obtained, and it depends mainly on expert knowledge when designing the system.
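As a rough illustration of how a prior map can modulate feature maps, the sketch below downsamples a map with values in [0, 1] to the resolution of a feature map and reweights every channel by (1 + map value). This gating rule, the average pooling step, and the names `avg_pool_2x2` and `saliency_gate` are assumptions made for the example; the attention module of Vakanski et al. is not specified here and may differ substantially.

```python
from typing import List

Map2D = List[List[float]]


def avg_pool_2x2(m: Map2D) -> Map2D:
    """Halve the resolution of a 2D map by averaging each 2x2 block."""
    return [
        [
            (m[i][j] + m[i][j + 1] + m[i + 1][j] + m[i + 1][j + 1]) / 4.0
            for j in range(0, len(m[0]) - 1, 2)
        ]
        for i in range(0, len(m) - 1, 2)
    ]


def saliency_gate(features: List[Map2D], saliency: Map2D) -> List[Map2D]:
    """Reweight each channel by (1 + saliency); an assumed gating rule.

    Regions with a zero prior keep their original features, and regions
    with a high prior are amplified by up to a factor of two.
    """
    h, w = len(saliency), len(saliency[0])
    for channel in features:
        if len(channel) != h or any(len(row) != w for row in channel):
            raise ValueError("feature map and saliency map sizes differ")
    return [
        [[channel[i][j] * (1.0 + saliency[i][j]) for j in range(w)] for i in range(h)]
        for channel in features
    ]


if __name__ == "__main__":
    # A 4x4 prior map whose top-left quadrant marks the suspected object.
    prior = [
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    small_prior = avg_pool_2x2(prior)        # match a 2x2 feature map
    feats = [[[2.0, 2.0], [2.0, 2.0]]]       # one channel, 2x2
    print(saliency_gate(feats, small_prior))
```

The additive form keeps the original features wherever the prior is zero, so an inaccurate prior map can attenuate nothing and only emphasizes regions; a purely multiplicative gate would instead suppress everything outside the prior.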
Although these UNet-based network architectures have been successfully applied to various medical image segmentation problems, their segmentation performance is still limited. In addition, the most important requirement for a medical image processing system is consistently high performance. Therefore, in this study, we propose an enhancement of the UNet network for the medical image segmentation problem and apply it to thyroid nodule segmentation to enhance the segmentation performance. Our proposed method has four novel elements compared to previous studies: