Medical Image Segmentation with Learning Semantic and Global Contextual Representation

Abstract Automatic medical image segmentation is an essential step toward accurate disease diagnosis and the design of follow-up treatment. This assistive method facilitates the cancer detection process and provides a benchmark to highlight the affected area. The U-Net model has become the standard design choice. Although the symmetric structure of the U-Net model enables the network to encode rich semantic representations, the intrinsic locality of its CNN layers limits the network's capability to model long-range contextual dependencies. On the other hand, the multi-head attention mechanism of sequence-to-sequence Transformer models enables them to effectively model global contextual dependencies. However, the lack of low-level information stemming from the Transformer architecture limits its ability to capture local representations. In this paper, we propose a model with two parallel encoders, where the first path uses a CNN module to capture the local semantic representation while the second path deploys a Transformer module to extract the long-range contextual representation. Next, by adaptively fusing these two feature maps, we encode both representations into a single representative tensor to be further processed by the decoder block. An experimental study demonstrates that our design provides rich and generic representation features that are highly efficient for a fine-grained semantic segmentation task.


Introduction
Medical images are used in the diagnosis of various diseases in the field of health and medicine. Today, these images are typically analyzed by Computer-Aided Diagnosis (CAD) systems. More specifically, doctors and pathologists utilize CAD systems to precisely interpret medical images, make an accurate diagnosis, and apply the appropriate treatment method to patients [1,2]. The advantages of using CAD systems include reducing the cost, time, and human error involved in analyzing medical images. CAD systems are used to perform tasks including image segmentation, classification, and detection. Medical image segmentation methods seek to divide images into meaningfully different areas, such as disease-prone and healthy regions, so that medical professionals can focus on disease areas with great precision. Yet, segmentation of medical images is a challenging task due to factors such as the inherent noise in these images, low contrast, the presence of multiple similar tissues, varying lesion sizes, color shift, complex geometry, and non-uniform lighting across different laboratories.
Medical image segmentation has a wide range of applications, such as the segmentation of skin lesions and lung cancer. Skin lesion segmentation separates areas of skin that are likely to be infected by cancer from healthy areas. In such applications, early diagnosis of the disease is crucial because the disease can be treated in the early stages and prevented from spreading to other parts of the body. According to the study [3], if such diseases, which are caused by the unusual growth of melanocytes, are diagnosed early, the five-year relative survival rate is 92%. Lung cancer is one of the most dangerous types of cancer, killing many people worldwide every year. According to statistics [4], the mortality rate of this cancer is 40%, and it causes the death of more than one million people annually. Figure 1 shows examples of medical images and the corresponding segmentation maps. Although Convolutional Neural Network (CNN) methods are highly effective in segmentation tasks, they are not capable of effectively modeling long-range semantic dependencies, owing to the characteristics of convolutional operations and the restricted receptive field size of convolution layers, even when dilated/atrous sampling techniques are utilized [5]. These deficiencies reduce network performance, especially when dealing with images that have complex structures, such as highly detailed medical images with similar textures. To address the restricted receptive field of typical CNNs, several studies have been conducted [5][6][7][8][9][10][11][12]. Among the proposed methods, Transformer-based architectures that utilize the self-attention mechanism have achieved the highest ability to model long-range semantic dependencies and global contexts. Recently, several studies have adapted Transformers to image recognition applications [13,14], especially in the field of medical image segmentation [15][16][17].
All the aforementioned methods lack a distinctive mechanism to adaptively integrate the local and global contextual representations. More specifically, these methods can model long-range semantic dependencies and global contexts well but they perform more weakly in local information modeling than CNN models. Therefore, a mechanism is needed to model the global contextual features derived from the Transformer module along with the local semantic CNN representation.
In this paper, we present a two-stream pipeline network to tackle this limitation of the state-of-the-art (SOTA) methods. First, we extract local semantic information using a CNN module. Next, we employ a Transformer module to model long-range contextual representations.
Unlike the approach proposed by [15], which merely concatenates the local and global features, our proposed model adaptively fuses these feature maps and highlights the important regions by employing a spatial attention module. Our empirical findings from extensive experiments confirm that the proposed method not only provides a strong semantic segmentation map but can also pay more attention to the overlapping boundary area. The key contributions of the paper are as follows:
• Using a Transformer model at the network bottleneck to generate a complementary representation for the CNN features;
• Proposing a spatial attention mechanism to adaptively scale the important regions inside the given feature map;
• An end-to-end design for coupling CNN and Transformer models.
The remainder of this paper is organized as follows: Section 2 presents the related work in more detail, and the proposed method and the experimental results are discussed in Sections 3 and 4, respectively. Finally, Section 5 presents the conclusion.

Handcrafted Approaches
Handcrafted feature-based methods utilize the information present in the image itself and are typically used by traditional machine learning approaches, such as Support Vector Machines, for computer vision tasks. Several handcrafted feature-based approaches have been proposed in the medical image segmentation domain, using techniques such as histogram thresholding [18][19][20], unsupervised color-based methods [21][22][23], region-merging-based approaches [24][25][26], active contour methods [27][28][29], and morphological operations [30,31]. In retinal blood vessel segmentation applications, Zhang et al. [32] applied denoising, normalization, and artifact elimination to retina images and utilized mathematical morphology operations to segment the input images. Furthermore, using the segmentation results, they employed a binary random forest classifier to classify the images into lesion and non-lesion areas. Fraz et al. [33] observed the shift in branching pattern, diameter, and tortuosity of retinal blood vessel morphology to segment blood vessels in retinal images. Lam et al. [34] proposed a multi-concavity modeling approach to segment healthy and unhealthy pixels in retinal images; the authors employed differentiable concavity measures to account for bright lesions in the input images.
In skin lesion segmentation applications, Riaz et al. [27] proposed an active contour-based method to segment melanoma areas in dermoscopy images by calculating the Kullback-Leibler divergence between the skin and lesion regions. Then, they used local binary pattern features to extract the periphery of the melanoma area. Pereira et al. [19] utilized a histogram- and clustering-based approach for skin lesion segmentation. They found an optimal region of interest (ROI) as a medium between the ROI with the highest gradient in the orthogonal direction of its boundary line and another ROI with a smaller gradient and larger area. Ashour et al. [22] addressed the skin lesion segmentation problem by proposing a genetic algorithm (GA) based approach, which reduces the indeterminacy of the input dermoscopy images by using the neutrosophic set (NS) operation. Then, they applied the k-means clustering algorithm to segment the skin lesion regions. In lung segmentation applications, Hu et al. [35] proposed an approach to identify the lungs in pulmonary X-ray CT images as follows. First, they used a gray-level thresholding technique to extract the lung region from the CT images. Then, they identified the anterior and posterior junctions of the lungs to separate the left and right lungs. Finally, the segmentation result was obtained by applying a sequence of morphological operations to smooth the irregular boundaries. In another study, Mansoor et al. [36] presented a two-step method for pathological lung image segmentation as follows. First, they utilized a fuzzy connectedness (FC) algorithm to conduct initial lung parenchyma extraction, alongside rib-cage information to estimate the lung volume by comparing the volume differences between the rib cage and the FC output. Next, they identified abnormal imaging patterns that might have been omitted during the first stage of the algorithm by employing texture-based features.
Although several handcrafted feature-based approaches have been proposed to tackle the medical image segmentation problem, they extract features heuristically and therefore do not produce accurate results. More specifically, they typically fail in situations with fuzzy lesion borders, the presence of multiple similar tissues, hair artifacts, low contrast, and patient-specific properties that may change tissue colors.

Deep Learning Approaches
Deep learning approaches have grown rapidly and are now the most prominent methods for medical image segmentation. The Fully Convolutional Network (FCN) [37] is one of the first methods introduced for image segmentation, which works based on deep convolutional and deconvolutional layers. In these networks, the weights of the kernels used for convolution operations are learned by the network, and, after proper model training, these networks are able to extract discriminative features to segment input images. U-Net [38] extends the FCN idea to medical image semantic segmentation applications. The U-Net architecture is a symmetric U-shaped design and consists of two main paths: the encoder path, which is responsible for reducing the dimensionality of the input images and extracting feature maps, and the decoder path, which is responsible for producing the segmentation map by applying a series of up-convolutional layers. This architecture also utilizes a series of skip connections for integrating deep and shallow features acquired from the encoder and decoder paths at different scales. Other successful CNN-based architectures, such as 3D U-Net [39], UNet++ [40], SegNet [41], hourglass [42], and DeepLab [5], have also been introduced in recent years and are used in several medical image segmentation applications. Some recent CNN-based approaches are reviewed in the following.
Liu et al. [43] utilized edge prediction-based auxiliary information to segment lesion areas in dermoscopic images. The proposed method employs a cross-connection layer module and creates multi-scale features to improve the network performance. Tong et al. [44] extended the original U-Net model by adding a triple attention mechanism. The first attention module computes contextual information to select regions, while the second and third attention modules apply spatial and channel attention to capture correlations between features. This triple attention mechanism allows the network to concentrate on the more relevant regions. Kim et al. [45] proposed a four-region segmentation technique to separate different parts of the lung in chest X-ray images and applied an ensemble strategy with five diverse models to quantify COVID-19 pneumonia.
CNN-based methods are highly efficient in segmentation tasks but are not able to effectively model long-range semantic dependencies. Several methods have been introduced to address this problem, among which Transformer-based architectures have been the most efficient. These models use the self-attention mechanism and are highly capable of modeling long-range semantic dependencies and global contexts. Liu et al. [46] used two patch-based strategies for medical image segmentation based on a vision transformer. More specifically, they used a patch-based contrastive module to improve the feature representation by enforcing locality conditions. Moreover, they eliminated artifacts from the patch splitting by employing a 3D window/shifted-window multi-head self-attention module. Meng et al. [47] utilized the global information of CT images to recognize morphologic margins of liver tumors and used a multi-scale feature fusion network to segment tumor areas. Xu et al. [48] combined a Transformer module as an encoder into the U-Net model to balance the accuracy and efficiency of the Transformer block. Furthermore, using special skip-connections, they passed all multi-scale feature maps, created in the transformer and convolutional blocks, into the decoder to integrate the spatial information of the input data into the model.
The main limitation of CNN networks is their inability to capture global contextual representations, as they model only local representations. On the other hand, Transformers, unlike CNN models, are highly capable of capturing long-range connectivity but are less effective at reconstructing local information. To benefit from both architectural designs, we combine these two networks with an extra attention mechanism to perform a fine-grained semantic segmentation task. In the next section, we present our proposed method in a comprehensive manner.

The Proposed Model
The Transformer architecture is designed in such a way that patch-wise training is faster than feeding the entire image into the network. However, with a patch-wise training strategy, the network cannot learn dependencies between inter-patch pixels. This is not a suitable mechanism for medical image segmentation tasks, since in medical images there are semantic dependencies between different pixels of the image.
To address this issue, we propose a two-branch network including a Transformer branch that analyzes image patches and a CNN branch that operates on the original resolution of the input image. This two-branch structure increases the network's overall understanding of the images by effectively distilling the local semantic information derived from the CNN module and the long-range contextual representation of the Transformer model. Figure 2 depicts the architecture of our suggested hybrid network. The Transformer branch divides each input image into 16 patches of size I/4 × I/4, where I denotes the spatial dimension of the original image, and feeds each patch to the network. Next, based on the location of each patch, the patch-level outputs are re-sampled to reconstruct an image-level feature map. Furthermore, in the CNN branch, a seminal U-Net encoder is incorporated to model the local semantic representation. Given that the CNN branch emphasizes finer details while the Transformer branch concentrates on high-level information, our approach improves the network's performance. To effectively combine these two feature maps, we propose to include a bi-directional ConvLSTM module in the bottleneck of the network to adaptively generate the aggregated feature map for the decoding path. We argue that the suggested architecture is capable of learning both local and global characteristics of the input image, which is critical for the segmentation task. In the next subsections, we explain each part in more detail.
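The patch splitting and position-aware reassembly described above can be sketched as follows. This is a minimal illustration rather than the actual implementation; the function names are our own, and `grid=4` yields the 16 patches of size I/4 × I/4 used in our design:

```python
import torch

def split_into_patches(x, grid=4):
    """Split (B, C, H, W) into (B, grid*grid, C, H//grid, W//grid) patches."""
    B, C, H, W = x.shape
    ph, pw = H // grid, W // grid
    # unfold along H then W: (B, C, grid, grid, ph, pw)
    patches = x.unfold(2, ph, ph).unfold(3, pw, pw)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, grid * grid, C, ph, pw)

def reassemble_patches(patches, grid=4):
    """Inverse of split_into_patches: restore the image-level feature map."""
    B, N, C, ph, pw = patches.shape
    patches = patches.reshape(B, grid, grid, C, ph, pw).permute(0, 3, 1, 4, 2, 5)
    return patches.reshape(B, C, grid * ph, grid * pw)
```

Because each patch is re-placed according to its grid position, the reassembled tensor matches the original spatial layout exactly.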

Local Semantic Representation
The first branch of the proposed method utilizes the CNN encoder to capture local semantic representation. The local features extracted by the CNN module contain rich and generic information for modeling semantic dependency among local pixels, which is crucial for the segmentation task. To this end, given the input image x and the CNN encoder module E parametrized with θ, we produce the semantic representation:

F_local = E(x; θ). (1)

In our design, the CNN encoder module can follow any well-known structure; hence, we utilize the Xception encoder [49] to produce a better fine-grained representation. The Xception model was initially proposed for the object classification task and exhibited excellent performance on several challenging benchmarks; it has since been applied to segmentation tasks with considerable success. Due to the nature of the inception module incorporated in this CNN structure, it is an ideal network for multi-scale object description. Given these characteristics, along with the literature reporting the advantages of the Xception model for feature representation, we utilized it as the encoder of our network.
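The interface of the encoder E(x; θ) can be sketched with a toy two-layer convolutional stand-in; the actual model uses an Xception backbone, so the class below only illustrates the input/output contract:

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Minimal stand-in for the encoder E(x; θ); the real design uses Xception."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.features(x)  # F_local = E(x; θ)
```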

Global Contextual Representation
To predict the pixel-wise label of an image x ∈ R^(H×W×C), with C as the number of channels and a spatial resolution of H × W, we first split x into a series of flattened 2D patches x_p^i ∈ R^(P²·C) (i = 1, . . . , N), where each input image yields N = (H × W)/P² patches of size P × P. Then, we use a linear projection to map the vectorized patches x_p into a latent D-dimensional embedding space. Using the patch embedding equation below, we ensure that the positional information is preserved:

z_0 = [x_p^1 E; x_p^2 E; . . . ; x_p^N E] + E_pos, (2)

where the patch embedding projection is represented by E ∈ R^((P²·C)×D) and the position embedding is indicated by E_pos ∈ R^(N×D). After obtaining the embedding sequence, we feed it forward through a stack of Transformer blocks, each made up of a multi-headed self-attention (MSA) module and a multi-layer perceptron (MLP) [13]. Equations (3) and (4) depict these two sub-blocks:

z'_i = MSA(Norm(z_{i−1})) + z_{i−1}, (3)
z_i = MLP(Norm(z'_i)) + z'_i, (4)

where the layer normalization is denoted by Norm and the block index is represented by i. The MLP consists of two linear layers, and the MSA block consists of n parallel self-attention (SA) heads. The Transformer module produces a global contextual representation corresponding to each patch. To reconstruct the image-level representation, we resample the output feature maps using the location of each patch.
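The patch embedding and one Transformer block, i.e., Equations (3) and (4), can be sketched in PyTorch as follows; the class names and dimensions are illustrative, not the exact configuration used in our experiments:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch embedding: project flattened patches with E and add E_pos."""
    def __init__(self, n_patches, patch_dim, dim):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                    # E in R^{(P^2 C) x D}
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))  # E_pos in R^{N x D}

    def forward(self, x_p):          # x_p: (B, N, P^2 * C) flattened patches
        return self.proj(x_p) + self.pos

class TransformerBlock(nn.Module):
    """One block of Eqs. (3)-(4): pre-norm MSA and MLP with residual connections."""
    def __init__(self, dim, heads, mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.norm1(z)
        z = self.msa(h, h, h, need_weights=False)[0] + z   # Eq. (3)
        return self.mlp(self.norm2(z)) + z                 # Eq. (4)
```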

ConvLSTM Module
Standard LSTM uses full connections in state-to-state and input-to-state transitions, and therefore does not consider spatial correlation, which is the central limitation of this method. Shi et al. proposed ConvLSTM [50] to address this problem. The ConvLSTM uses convolution operations in the input-to-state and state-to-state transitions. From a mathematical aspect, the ConvLSTM comprises three controlling gates: an input gate i_t, an output gate o_t, and a forget gate f_t to access, update, and clear the memory cell C_t. We formally define the ConvLSTM as follows:

i_t = σ(W_xi ∗ X_t + W_hi ∗ H_{t−1} + b_i),
f_t = σ(W_xf ∗ X_t + W_hf ∗ H_{t−1} + b_f),
o_t = σ(W_xo ∗ X_t + W_ho ∗ H_{t−1} + b_o),
C_t = f_t • C_{t−1} + i_t • tanh(W_xc ∗ X_t + W_hc ∗ H_{t−1} + b_c),
H_t = o_t • tanh(C_t),

where • and ∗ mark the Hadamard product and the convolution operation, respectively. X_t states the input tensor, H_t denotes the hidden state tensor, and C_t the memory cell tensor; W_x∗ marks an input-state 2D convolution kernel, and W_h∗ a hidden-state 2D convolution kernel. b_i, b_f, b_o, and b_c denote the bias terms.
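A minimal ConvLSTM cell implementing the gate computations above can be sketched as follows; the class name is ours, and a single convolution computes all four gate pre-activations at once, which is a common implementation shortcut:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM step: convolutional gates i, f, o and candidate g."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # one convolution produces all four gate pre-activations at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                              # H_{t-1}, C_{t-1}
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)             # C_t (Hadamard products)
        h = o * torch.tanh(c)                     # H_t
        return h, c
```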
In our architecture, we employed BConvLSTM [51], which fuses the features from both paths into a single representative tensor. More specifically, BConvLSTM comprises two ConvLSTMs: one processes the input data in the forward direction and the other in the backward direction. A standard ConvLSTM merely processes forward-direction dependencies, whereas BConvLSTM decides on the current input with respect to the data dependencies in both directions. A study by Cui et al. [52] has shown that considering both forward and backward temporal perspectives improves the predictive performance of the model. We can consider the BConvLSTM as two separate standard ConvLSTMs; therefore, we need two sets of parameters for the forward and backward states. The BConvLSTM output can be modeled as follows:

Y_t = tanh(W_y^f ∗ H_t^f + W_y^b ∗ H_t^b + b),

where the forward hidden state tensor is denoted by H_t^f, the backward hidden state tensor by H_t^b, the final spatio-temporal output by Y_t ∈ R^(F_l×W_l×H_l), and the bias term by b. Furthermore, we used the hyperbolic tangent tanh to non-linearly integrate the outputs of the forward and backward states.
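The fusion of the forward and backward hidden states can be sketched as a pair of convolutions followed by tanh; the layer and class names below are hypothetical:

```python
import torch
import torch.nn as nn

class BiFuse(nn.Module):
    """Fuse forward/backward states: Y_t = tanh(W_f * H_fwd + W_b * H_bwd + b)."""
    def __init__(self, hid_ch, out_ch):
        super().__init__()
        self.w_fwd = nn.Conv2d(hid_ch, out_ch, 3, padding=1, bias=False)
        self.w_bwd = nn.Conv2d(hid_ch, out_ch, 3, padding=1)  # carries the bias b

    def forward(self, h_fwd, h_bwd):
        return torch.tanh(self.w_fwd(h_fwd) + self.w_bwd(h_bwd))
```

The tanh bounds the fused output to (−1, 1), giving the decoder a normalized aggregated feature map.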

Decoder
The last module incorporated in our design is the CNN decoding block. Our decoder follows the regular U-Net decoder with five deconvolutional blocks to gradually decode and upsample the encoded feature to the segmentation map.
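One of the five up-convolutional decoder steps can be sketched as follows; skip connections are omitted here for brevity, and the class name is ours:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One U-Net-style decoder step: 2x upsampling followed by convolutions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(self.up(x))  # doubles spatial resolution
```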

Experimental Results
We evaluated our proposed method on different datasets with different applications. Initially, we used three datasets, ISIC 2017 [53], ISIC 2018 [54], and PH² [55], to report the performance of the proposed method on the skin lesion segmentation task. Then, we used a lung dataset to evaluate the performance of the proposed method on the lung area segmentation task. For the implementation, we trained the network from scratch for all datasets using the PyTorch framework in Python 3. Our experiments were performed on the same machine, with an NVIDIA RTX 3090 GPU and a batch size of eight, without any data augmentation. We utilized the Adam optimizer with an initial learning rate of 1 × 10⁻³ and a decay rate of 1 × 10⁻⁴, training the network for up to 100 epochs. We terminated the model training process when the validation performance did not change for 10 consecutive epochs. To have a stable starting point for the network, we initialized the model weights from a standard normal distribution.
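The optimizer configuration and early-stopping criterion above can be sketched as follows; the one-layer placeholder model and the helper name are our own, and reading the decay rate as Adam's weight decay is an assumption:

```python
import torch
import torch.nn as nn

# Hyper-parameters as reported; the model here is only a placeholder, and
# interpreting the decay rate 1e-4 as Adam's weight decay is our assumption.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def should_stop(val_history, patience=10):
    """Early stopping: halt once the validation metric has not improved
    during the last `patience` epochs (helper name is our own)."""
    if len(val_history) <= patience:
        return False
    best_before = min(val_history[:-patience])
    return min(val_history[-patience:]) >= best_before
```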
In the following, we describe the metrics we used to evaluate our model's performance. Additionally, the specifications of the datasets used and the results obtained in the model evaluation stage for each of these datasets are explained. Furthermore, the performance of the proposed model on each of the datasets has been compared with other state-of-the-art methods in the literature.

Evaluation Metrics
In order to evaluate our proposed model's performance from different aspects, we used the well-known metrics of accuracy (ACC), specificity (SP), sensitivity (SE), and Dice (DSC) score. Each of these metrics examines a specific capability of the proposed model. Denoting true/false positives and negatives by TP, FP, TN, and FN, the metrics are defined as follows. Accuracy is the percentage of correct predictions: ACC = (TP + TN)/(TP + TN + FP + FN). Specificity is the proportion of actual negatives that are correctly identified by the model: SP = TN/(TN + FP). Sensitivity is the proportion of actual positives that are correctly identified by the model: SE = TP/(TP + FN). The F1 score, also known as the Dice score (DSC), is the harmonic mean of precision and recall: DSC = 2TP/(2TP + FP + FN).
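These four metrics can be computed directly from pixel-wise confusion counts; a plain-Python sketch over binary prediction and ground-truth masks (function name is ours):

```python
def seg_metrics(pred, gt):
    """ACC, SP, SE and Dice from flattened binary prediction/ground-truth masks."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gt))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sp = tn / (tn + fp)            # specificity: correctly identified negatives
    se = tp / (tp + fn)            # sensitivity: correctly identified positives
    dsc = 2 * tp / (2 * tp + fp + fn)  # Dice / F1 score
    return acc, sp, se, dsc
```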

ISIC 2017 Dataset
One of the most well-known datasets in the field of dermoscopic image segmentation for skin cancer diagnosis is the International Skin Imaging Collaboration (ISIC) 2017 dataset. Researchers gathered this dataset by taking 2000 dermoscopic image samples using a skin-surface reflection elimination technique that captures images of the skin surface in deep detail [53]. To prepare this dataset for the skin lesion segmentation, lesion localization, and skin disease classification tasks, each of the image samples was annotated by clinical experts using a semi-automated or manual process. The purpose of this research is image segmentation. For this purpose, similar to the research conducted in [56], we first randomly separated the dataset into three sets: a training set, a validation set, and a testing set, containing 1250, 150, and 600 images, respectively. Besides, to reduce network load and speed up the network training process, we resized all images' spatial dimensions to 256 × 256 pixels in the pre-processing stage. Table 1 depicts the evaluation results of our proposed model on the ISIC 2017 dataset. The results illustrate that the proposed method outperformed the state-of-the-art approaches on all metrics, except MCGU-Net on the sensitivity metric. Some of the semantic segmentation results of the proposed method on the ISIC 2017 dataset are shown in Figure 3. The segmentation results illustrate that our model accurately separates the lesion area from the healthy parts of the skin.

ISIC 2018 Dataset
To conduct further research on the tasks of skin lesion segmentation, lesion localization and skin disease classification and to improve melanoma diagnosis, the ISIC 2018 database [54] has been created by an international collaboration. This dataset comprises 2594 dermoscopic image samples, each of which has been annotated by clinical experts using a semi-automated or manual process similar to the ISIC 2017. In the pre-processing section, we randomly split the dataset into three sets: a training set with 1815 samples, a validation set with 259 samples, and a testing set with 520 samples. Similar to the ISIC 2017 pre-processing stage, to reduce network load and speed up the network training process, we resized all images' spatial dimensions from 2016 × 3024 pixels to 256 × 256 pixels. Table 2 presents the comparison results of the proposed method against the SOTA approaches.
The results indicate that our model outperformed the seven previous works based on the DSC and SP metrics. To further analyze the segmentation performance of the proposed method, we provide Figure 4 to illustrate some segmentation maps obtained by our proposed network. It can be observed that the generated segmentation masks are quite precise in both object detection and boundary separation from the background.

PH² Dataset
The PH² is another popular dataset in the field of skin lesion analysis, prepared by the dermatology service of Pedro Hispano Hospital, Matosinhos, Portugal. This dataset includes 200 dermoscopic images of skin lesion regions, gathered for further research on the classification and segmentation of cancerous regions in dermoscopic images. In the pre-processing stage, we followed the same procedure as a previous work [56] and randomly split the dataset into two subsets: 100 samples as a training set and 100 samples as a validation and testing set.
To validate the performance of the proposed method, we provide Table 3 to quantitatively compare the obtained results with the SOTA approaches. Our results suggest that the proposed approach outperforms the SOTA methods in all metrics, excluding FAT-Net on the SE metric. To illustrate the effectiveness of our network in segmenting skin lesions, we provide some visual examples of the segmentation maps generated by our network in Figure 5.

Lung Segmentation Dataset
This dataset was provided by the National Cancer Institute (NIH) and used in the Lung Nodule Analysis (LUNA) competition at the Kaggle Data Science Bowl in 2017 [64] to encourage researchers and scientists to develop lung cancer detection algorithms. This dataset includes lung computerized tomography (CT) scan images provided with annotations for lung segmentation. In Table 4, we provide the comparison results of the proposed method against its counterparts to quantitatively analyze the obtained results. As can be seen from the table, the suggested network slightly outperformed the SOTA approaches in all metrics, except for R2U-Net and MCGU-Net on the SE and SP metrics, respectively. We have also provided Figure 6 to visualize some segmentation results on 2D scans of the lung dataset.

Ablation Study
To analyze the proposed structure in more detail, we conducted an ablation study. Throughout the ablation study, we used the ISIC 2018 dataset. To begin, we defined the baseline model as a seminal U-Net model without incorporating any of the proposed modules. Next, by inserting a transformer path we created the two-stream network where the CNN module learns local semantic representation while the Transformer module tries to encode the long-range contextual dependency. The resulting features of these two-streams were then fused using a concatenation operation followed by the convolutional layer. In the third setting, we replaced the concatenation operation with a one-directional ConvLSTM. For the last setting, we used the bi-directional ConvLSTM module to learn rich and generic representations. Results are presented in Table 5. It can be noticed that, by inserting each of the proposed modules, the entire model performance steadily improves and reaches the highest performance (in our experiments) using the combination of CNN+Transformer+Bidirectional ConvLSTM modules. It should also be noted that while these modules increase the performance at the same time they also increase the number of parameters to be trained. Thus, there is a trade-off between performance and computation complexity.

Discussion
In this work, we compared the performance of our suggested network with SOTA approaches, e.g., MCGU-Net, FAT-Net, and MedT. As the main benchmark, we used three skin lesion segmentation datasets, which contain diverse and challenging samples. As can be seen from Figures 3-5, the annotation provided by the dermatologist (the ground-truth mask) already contains noisy labelling at the object boundary. Hence, boundary segmentation always comes with uncertainty. Our predicted results produce a comparatively better segmentation mask than the original noisy annotation, which indeed reveals the effectiveness of our approach in precise boundary separation. This might explain the importance of both local semantic and global dependency features for addressing such noise in the annotation mask.

Conclusions
This paper proposes a two-stream network, where in the first stream a CNN module is incorporated to model the local semantic representation, while the second stream utilizes a Transformer module to model long-range contextual dependency. To adaptively combine the generated features, it further applies a bi-directional ConvLSTM module to model both local and global interactions. Through several experiments on skin lesion segmentation datasets, with overall accuracies of 0.957 on ISIC 2017, 0.949 on ISIC 2018, and 0.971 on PH², we have demonstrated that the proposed architecture is highly capable of learning rich and generic representations, which is crucial for the segmentation task. Furthermore, the experimental results on the lung segmentation dataset, with an overall accuracy of 0.997, show that our method extends to other medical segmentation tasks. One drawback of our approach is its high number of parameters; hence, for clinical application, future research should consider parameter reduction techniques in order to deploy the model in real-world scenarios.