AWEU-Net: An Attention-Aware Weight Excitation U-Net for Lung Nodule Segmentation

Lung cancer is a deadly disease that causes millions of deaths every year around the world. Accurate lung nodule detection and segmentation in computed tomography (CT) images is the most important step in diagnosing lung cancer at an early stage. Most existing systems are semi-automated and require manual selection of the lung and nodule regions to perform the segmentation task. To address these challenges, we propose a fully automated end-to-end lung nodule detection and segmentation system based on a deep learning approach. In this paper, we use an Optimized Faster R-CNN, a state-of-the-art detection model, to detect the lung nodule regions in CT scans. Furthermore, we propose an attention-aware weight excitation U-Net, called AWEU-Net, for lung nodule segmentation and boundary detection. To achieve more accurate nodule segmentation, AWEU-Net includes position attention-aware weight excitation (PAWE) and channel attention-aware weight excitation (CAWE) blocks that highlight the best-aligned spatial and channel features in the input feature maps. The experimental results demonstrate that our proposed model yields Dice scores of 89.79% and 90.35%, and intersection over union (IoU) scores of 82.34% and 83.21%, on the publicly available LUNA16 and LIDC-IDRI datasets, respectively.


Introduction
According to the World Health Organization (WHO), lung cancer is the leading cause of cancer deaths, with 1.80 million deaths in 2020 [1]. The estimated number of new cases is expected to rise to 2.89 million, with deaths projected to reach 2.45 million worldwide by 2030 [2]. Many of these deaths could be avoided by an early diagnosis and a prompt treatment plan. The National Lung Screening Trial (NLST) showed that lung cancer mortality is reduced by 20%, emphasising the significance of nodule detection and assessment [3]. Currently, the most efficient investigation to discover pulmonary nodules is based on computed tomography (CT) imaging, which generates hundreds of images of the lung within a second in a single scan. It is a very difficult and tedious job for radiologists to detect the nodules in these images manually. However, computer-aided diagnosis (CAD) systems have assisted radiologists in the automated diagnosis of lung cancer and pulmonary diseases in recent years. These CAD systems mainly depend on the detection and segmentation of various pulmonary structures. They consist of two subsystems: computer-aided detection (CADe) and computer-aided segmentation (CASe). In a lung cancer screening CAD system, the CADe subsystem identifies the region of interest containing the lung nodule and the CASe subsystem segments the nodule region. The detection and segmentation of lung nodules are always challenging tasks because of their heterogeneity in CT images. However, automated analysis is necessary to measure the properties of the lung nodule for identifying the malignancy of the tumour. A lung nodule segmentation system can determine the malignancy by analysing nodule size, shape and change rate [4]. It should be noted that the use of an accurate lung nodule screening CAD system can accelerate the entire diagnosis and radiotherapy process, allowing patients to receive the required radiation or photon therapy (shown in Figure 1) on the same day.
arXiv:2110.05144v1 [eess.IV] 11 Oct 2021
Figure 1. A proton therapy plan for lung tumour treatment. The red tumour area is segmented by radiologists (Image from the Seattle Cancer Care Alliance Proton Therapy Center [5]).
To address these premises, in this article we focus on developing a fully automated end-to-end CAD system based on current state-of-the-art deep learning models. Many researchers have already handled the lung nodule detection and segmentation problem with Convolutional Neural Networks (CNNs) and achieved promising results [6][7][8][9][10][11][12]. CNNs can learn complex features to detect and segment the nodule accurately. However, existing methods are mostly semi-automated and commonly use architectures inspired by U-Net [13]. This article proposes AWEU-Net, an attention-aware weight excitation U-Net for lung nodule segmentation. AWEU-Net precisely learns low-level lung nodule-related features using the proposed PAWE and CAWE blocks, which enhance the position- and channel-based nodule feature representations. The main contributions of this research can be summarized as follows:
• We present a fully automated end-to-end lung nodule detection and segmentation system.
• We adopt the Faster R-CNN [14], a state-of-the-art detection model, improve it and name it the "Optimized Faster R-CNN" for reliable lung nodule detection.
• We propose AWEU-Net, an efficient fully automated lung nodule segmentation model.
• We propose the PAWE and CAWE mechanisms to discover the correlation between the position and channel features and enhance the model's ability to distinguish between lung nodule and normal tissue feature representations.
• We assess the AWEU-Net model on two public datasets, LUNA16 and LIDC-IDRI, and demonstrate that its performance surpasses the current state-of-the-art models.
This article is organized as follows. Section 2 reviews the existing lung nodule segmentation systems based on classical computer vision and deep learning techniques. The proposed system workflow and model architecture are explained in Section 3. The experimental results are presented in Section 4. Finally, Section 5 concludes the article with some future lines of this research.

Related work
During the last decade, several lung nodule detection and segmentation systems based on classical computer vision and deep learning techniques were presented. The most common lung nodule segmentation techniques are discussed in this section and summarized in Table 1.

Traditional computer vision-based approaches
In the field of lung nodule analysis, many computer vision methods have been used, such as region growing [15], active contours [16], level sets [17], graph cuts [18], adaptive thresholding [19], Gaussian mixture model (GMM) fuzzy C-means [20], and region-based fast marching [21].

Table 1. Summary of the most common lung nodule segmentation techniques.
Ref. | Method | Dataset | Pre-processing | Post-processing
Traditional computer vision-based:
[16] | Active contours | LIDC-IDRI | Thresholding & morphological operations | Markov random field
[17] | Level sets | LIDC-IDRI | Statistical intensity | Region condition
[18] | Graph cuts | PRIVATE | Gaussian smoothing | -
[19] | Adaptive thresholding | LIDC-IDRI | Histogram equalization & noise filtering | Morphological operations
[20] | GMM fuzzy C-means | LIDC-IDRI & GHGZMCPLA | Non-local mean filter & Gaussian pyramid | Random walker
[21] | Region-based fast marching | LIDC-IDRI | Convex hulls | Mean threshold
Deep learning-based:
[6] | U-Net | LIDC-IDRI | Nodule ROI selection | -
[7] | iW-Net | LIDC-IDRI | Nodule ROI selection | -
[8] | U-Det | | |

For instance, a contrast-based region growing method and a fuzzy connectivity map of the object of interest were used in [15] to segment various types of pulmonary nodules. This method did not perform well with irregular nodules because the merging criterion of the region growing technique needs accurate settings. Geometric active contours with a marker-controlled watershed as well as a Markov random field (MRF) were used in [16] to segment the lung nodule. This method depends on manually selecting a region of interest (ROI) in the nodule region. [17] used a shape prior hypothesis along with level sets that iteratively minimize an energy function to segment juxtapleural nodules. The precision of this method also depends on the selection of the ROIs. Graph cuts with the expectation-maximization (EM) algorithm were proposed in [18] for lung segmentation on chest CT images. This algorithm has a high computational cost because of the Gaussian mixture model (GMM) training and the creation of the corresponding graph. [19] used adaptive thresholding along with the watershed transform to detect the nodules.
This approach mainly relies on several image pre- and post-processing procedures. [20] combined GMM prior knowledge with the conventional fuzzy C-means method to improve the robustness of pulmonary nodule segmentation. The major disadvantage of fuzzy C-means algorithms is that they are sensitive to noise, outliers and initial cluster selection. A region-based approach was introduced in [21] using the fast marching method, which gives a precise segmentation of the nodule and can properly handle juxtapleural and juxtavascular nodules. All the above-mentioned traditional approaches are semi-automated or depend on several image pre- and post-processing methods.

Deep learning-based approaches
Recently, many researchers have developed various deep learning-based systems for lung nodule detection and segmentation. [6] presented a simple U-Net for lung nodule segmentation utilizing half of the convolutional layers normally used in the original U-Net [13]. The model uses multiple window widths and window centres to enhance the nodule features, improving performance over the original U-Net by 2% in Dice Similarity Coefficient (DSC). [7] combined two U-Net models (named iW-Net) that can be used with and without user interaction. In the first instance, the user selects the nodule ROI and the corresponding end-points to produce a weight map that improves the prediction of the model. The architecture is designed by considering the expected round shape of the nodules, and the loss function also combines the weight map with the model output features. [12] presented a multiple resolution residual network (MRRN), a modification of ResNet [22] based on the U-Net model. A slightly transformed version of U-Net called U-Det was presented in [8], where several hidden layers filter the residual blocks located within the encoder and decoder. U-Det also applies the Mish activation function. An end-to-end 3D deep CNN called NoduleNet was presented in [9] for joint detection and segmentation of pulmonary nodules. NoduleNet uses a U-Net-like model to detect the nodule and then runs a segmentation refinement on the Volume of Interest (VOI) surrounding the detected nodule, gradually up-sampling the segmented volumes and integrating them with low-level features. NoduleNet addresses the loss of resolution inside the VOI by duplicating the pooling layers and image convolutions. [10] presented a dual-branch residual network (DB-ResNet) that achieved results similar to [6]. The major differences between [10] and [6] are the use of residual layers, two slightly modified pooling layers and the convolutional blocks of ResNet [22].

Proposed methodology
The main workflow of the proposed method is illustrated in Figure 2. As a pre-processing stage, we extract the CT volumes slice by slice, converting the slices of the original CT scans (".dcm" files) into images. The Optimized Faster R-CNN method is then used to detect the nodule ROI. The detected nodule ROI is fed as input to the segmentation task, where the proposed AWEU-Net precisely segments the nodule, as detailed below.
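The slice-conversion step above can be sketched as follows, assuming each slice has already been read from its ".dcm" file into a Hounsfield-unit array (e.g. with pydicom); the lung-window bounds are illustrative assumptions, since the paper does not specify them:

```python
import numpy as np

def window_slice(hu, lo=-1000.0, hi=400.0):
    """Clip a Hounsfield-unit CT slice to a lung window and scale to 8-bit.

    `lo` and `hi` are assumed window bounds, not values from the paper.
    """
    hu = np.clip(hu.astype(np.float32), lo, hi)
    return ((hu - lo) / (hi - lo) * 255.0).astype(np.uint8)
```

The resulting 8-bit images are what the detection model receives as input.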

Nodule detection model
In the nodule detection model, we use the Optimized Faster R-CNN, based on the original Faster R-CNN [14], to automatically detect lung nodules in the lung images, as shown in Figure 2-(c). The model is a two-stage network with three main blocks, the backbone network, the Region Proposal Network (RPN), and the box head, as shown in Figure 3. We use ResNet50 [22] as the backbone network to extract feature maps from the input image. The feature maps are then fed into the RPN to perform boundary regression and classification, which determines whether a candidate frame belongs to the background or to an object. The positions and scores of the candidate frames produced by the RPN are sent to the box head, where the final regression and classification of the object are performed. Finally, the prediction gives the bounding box of the target (nodule) with its classification score.

Nodule segmentation model
We crop the ROI based on the detection box suggested by the nodule detection model introduced above. We resize the ROI to 224 × 224 and feed it into the proposed nodule segmentation model, an attention-aware weight excitation U-Net, AWEU-Net (Figure 4). The network is based on U-Net [13], a well-known deep learning model for medical image segmentation. The AWEU-Net model learns to segment the input sub-images by determining the boundaries of the nodule region to discriminate between normal and abnormal tissues. One of the main contributions of this article is the PAWE block in the AWEU-Net model, which captures the contextual positional features of the input image. We also propose a CAWE block to enhance the channel-wise feature maps coming from each layer of the AWEU-Net model. The details of PAWE and CAWE are discussed in Section 3.3.1 and Section 3.3.2, respectively.
The AWEU-Net architecture is composed of two successive networks: an encoder and a decoder. The encoder consists of a sequence of PAWE blocks and max-pooling layers. Each encoder layer is composed of a PAWE block with a 3 × 3 convolutional layer followed by a ReLU activation function. Four down-sampling blocks with 2 × 2 max-pooling and a stride of 2 are used after each encoder block. The decoder consists of a sequence of up-convolutions and concatenations with the corresponding high-resolution features from the CAWE blocks, which provide a high-resolution output segmentation map. The features coming from the encoder layers are also upsampled (Up1, Up3 and Up5) and concatenated with the CAWE blocks. The features coming from the CAWE blocks are likewise upsampled (Up2, Up4, and Up6) and concatenated with the corresponding decoder layers. The decoder network consists of four layers similar to the encoder. Each layer also consists of a PAWE block with a 3 × 3 convolutional layer followed by a ReLU and a 2 × 2 up-convolution layer. After each decoder layer, the feature maps are upsampled to the same size as the CAWE block output to keep them consistent for concatenation. This mechanism enhances the positional and channel attention-based features learned in the encoder phase and utilises them for reconstruction in the decoder network. The final output layer of the model applies a 1 × 1 convolutional layer to map the final 64-dimensional feature vector to the number of targeted segmentation classes. In our case, the segmentation classes are background and lung nodule (two classes).

Position attention-aware weight excitation (PAWE)
The PAWE block consists of two sub-blocks: the position attention block (PAB) and the weight excitation block (WEB). To describe the proposed PAWE block, let the input feature map be Y ∈ R^(C×H×W), where C, H and W are the channel, height and width, respectively (see Figure 5).
In the PAB block, Y is fed into three convolutional layers that produce three new feature maps. The first two convolutional layers produce A^p, B^p ∈ R^(C/8×H×W), where the superscript p stands for PAB. Then, the A^p and B^p feature maps are reshaped into R^(C/8×(H×W)). A matrix multiplication is performed between the transpose of A^p and B^p, and a spatial attention map D^p ∈ R^((H×W)×(H×W)) is produced by using a softmax function:

s^p_(i,j) = exp(A^p_i · B^p_j) / Σ_(i=1..H×W) exp(A^p_i · B^p_j),   (1)

where s^p_(i,j) indicates the impact of the i-th position on the j-th position. The softmax map D^p thus learns the relationship between every pair of spatial positions in the input feature maps. In addition, the output of the third convolutional layer, C^p ∈ R^(C×H×W), is reshaped to R^(C×(H×W)) and multiplied by a permuted order of the spatial attention map D^p of (1). The result is reshaped back to R^(C×H×W) to provide the final feature map of the PAB block, F:

F_j = α^p Σ_(i=1..H×W) (C^p_i s^p_(i,j)) + Y_j,   (2)

where α^p is initialised to 0, as explained in [25]. The resulting feature F at every position is a weighted sum of all the neighbours of the original features. Note that all p superscripts refer to position.
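A minimal PyTorch sketch of the PAB described above (layer sizes follow the text; variable names are illustrative). With α^p initialised to 0, the block starts as an identity mapping and learns how much position attention to mix in:

```python
import torch
import torch.nn as nn

class PAB(nn.Module):
    """Position attention block: spatial (H*W x H*W) self-attention."""
    def __init__(self, channels):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels // 8, kernel_size=1)  # A^p
        self.conv_b = nn.Conv2d(channels, channels // 8, kernel_size=1)  # B^p
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=1)       # C^p
        self.alpha = nn.Parameter(torch.zeros(1))  # alpha^p, initialised to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, y):
        n, c, h, w = y.shape
        a = self.conv_a(y).view(n, -1, h * w).permute(0, 2, 1)  # (N, HW, C/8)
        b = self.conv_b(y).view(n, -1, h * w)                   # (N, C/8, HW)
        d = self.softmax(torch.bmm(a, b))                       # (N, HW, HW)
        cp = self.conv_c(y).view(n, -1, h * w)                  # (N, C, HW)
        f = torch.bmm(cp, d.permute(0, 2, 1)).view(n, c, h, w)
        return self.alpha * f + y  # weighted sum plus residual
```

Because the attention map is (H×W) × (H×W), this block is applied on the modest feature-map sizes found inside the network rather than on full-resolution images.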
In the WEB, a sub-network for location-based weight excitation (LWE) proposed in [26] is used. The LWE provides fine-grained weight-wise attention during back-propagation. The WEB, shown in Figure 5, is defined as:

W_(WEB,j) = FC2(Re2(FC1(Re1(AP(W_j))))),   (3)

where W_(WEB,j) denotes the excited weights of the j-th output channel, the average pooling layer AP averages the values of each H × W plane, Re1 and Re2 are two ReLU activation functions, and FC1 and FC2 are two fully connected layers. The output feature of the WEB is reshaped and multiplied by the input feature map. Finally, an element-wise sum operation is performed between the feature maps from the PAB and the WEB to produce the final PAWE features:

Ŷ = F + F^WEB.   (4)

This process generates a global contextual description and aggregates the context according to a spatially weighted attention map, creating relevant weighted features that produce common weight excitation and enhance the intra-class semantic coherence.
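The WEB can be sketched as a small excitation sub-network. This simplified version applies AP, FC1/Re1 and FC2/Re2 to the input feature map and multiplies the result back channel-wise, whereas the full LWE of [26] also excites the convolution weights themselves during back-propagation, which is omitted here; the reduction ratio is an assumption:

```python
import torch
import torch.nn as nn

class WEB(nn.Module):
    """Simplified weight excitation block (AP -> FC1 -> Re1 -> FC2 -> Re2)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # The reduction ratio is an assumption; the paper does not state it.
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        self.re1 = nn.ReLU()
        self.re2 = nn.ReLU()

    def forward(self, y):
        n, c, h, w = y.shape
        s = y.mean(dim=(2, 3))            # AP: average over each H x W plane
        s = self.re2(self.fc2(self.re1(self.fc1(s))))
        return y * s.view(n, c, 1, 1)     # reshape and multiply into the input
```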

Channel attention-aware weight excitation (CAWE)
Like PAWE, the proposed CAWE block includes two sub-blocks, a channel attention block (CAB) and a weight excitation block (WEB). In the CAB block, the input Y ∈ R^(C×H×W) is reshaped into Y^c_1 ∈ R^((H×W)×C) and reshaped and permuted into Y^c_2 ∈ R^(C×(H×W)), where the superscript c stands for CAB. Afterwards, a matrix multiplication between Y^c_2 and Y^c_1 is performed, and the channel attention map E^c ∈ R^(C×C) is defined as:

e^c_(i,j) = exp(Y^c_(2,i) · Y^c_(1,j)) / Σ_(i=1..C) exp(Y^c_(2,i) · Y^c_(1,j)),   (5)

where e^c_(i,j) measures the impact of the i-th channel on the j-th channel. A multiplication between E^c and a third reshaped version of the input feature map, Y^c_3 ∈ R^(C×(H×W)), is then performed. Consequently, the final channel attention feature map can be defined as:

F^c_j = α^c Σ_(i=1..C) (e^c_(i,j) Y^c_(3,i)) + Y_j,   (6)

where α^c quantifies the weight of the channel attention map relative to the input feature map Y. The final WEB sub-network feature map is obtained as in Equation (3).
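Analogously to the PAB, the CAB described above computes a C × C attention map over channels; a minimal PyTorch sketch (with α^c initialised to 0, so the block starts as an identity mapping):

```python
import torch
import torch.nn as nn

class CAB(nn.Module):
    """Channel attention block: C x C self-attention over channels."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # alpha^c, initialised to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, y):
        n, c, h, w = y.shape
        y2 = y.view(n, c, -1)                 # Y^c_2: (N, C, HW)
        y1 = y2.permute(0, 2, 1)              # Y^c_1: (N, HW, C)
        e = self.softmax(torch.bmm(y2, y1))   # E^c: (N, C, C) channel attention
        y3 = y.view(n, c, -1)                 # Y^c_3: (N, C, HW)
        out = torch.bmm(e, y3).view(n, c, h, w)
        return self.alpha * out + y
```

Unlike the PAB, no 1 × 1 convolutions are needed here, since the attention is computed directly from reshaped copies of the input.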
Finally, an element-wise sum operation is performed between the CAB and WEB output feature maps to produce the final CAWE features:

Ŷ = F^c + F^WEB.   (7)

This process emphasizes class-dependent feature maps using weight-excited versions of the features of all the channels, boosting the feature differences among the classes. Note that all c superscripts refer to channel.

Datasets
In this work, we used two publicly available datasets:
• The Lung Image Database Consortium image collection (LIDC-IDRI) [27] consists of 1018 CT scans of 1010 patients from seven different organisations. Each CT scan was analysed by four radiologists, who individually identified the nodules and manually segmented the region of every nodule with a diameter larger than three millimetres. Each CT scan can include one or more nodule regions, giving 5066 segmented masks in total. Looking closely at the dataset, many nodules are very small and do not satisfy the malignancy index. Therefore, we used a 20 mm diameter threshold to exclude all tiny nodules from our dataset. Afterwards, we split our final dataset, which contains 2044 nodule masks in total, into train, validation and test sets of 70%, 10% and 20%, respectively.
• The LUng Nodule Analysis 2016 (LUNA16) dataset [28] is derived from the LIDC-IDRI dataset [27]. It contains 888 CT scans from LIDC-IDRI, prepared for the grand challenge, with round annotation masks for all the segmented nodules. The LUNA16 challenge dataset contains 1186 nodule annotations, from which we obtained 2300 nodule masks after pre-processing. We split the dataset into train, validation and test sets in the same way as the LIDC-IDRI dataset.
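The 70/10/20 split can be sketched as follows; the random seed and whether splitting is done per scan or per nodule mask are assumptions, as the paper does not specify them:

```python
import random

def split_dataset(ids, seed=42):
    """Shuffle case IDs and split them 70% / 10% / 20% (train / val / test).

    The seed value is an illustrative assumption.
    """
    ids = sorted(ids)                 # deterministic base order
    random.Random(seed).shuffle(ids)
    n_train = int(0.7 * len(ids))
    n_val = int(0.1 * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```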

Model Implementation
We individually trained the nodule detection and segmentation models with the PyTorch framework [29]. To train the detection model, the Stochastic Gradient Descent (SGD) [30] optimizer with a learning rate of 0.002 was used, together with the Binary Cross-Entropy (BCE) and L1-norm loss functions and a batch size of 4. For the segmentation model, the Adam [31] optimizer with a learning rate of 0.0002 was used, together with the BCE and IoU loss functions and a batch size of 4. Note that data augmentation was applied during training of both the detection and segmentation models to increase the size of the training dataset. We augmented the datasets by random rotation, horizontal and vertical flipping, and elastic transforms. Finally, all the experiments were carried out on an NVIDIA GeForce GTX 1080 GPU with 8 GB of memory, taking about 10-15 hours to train 100 epochs for each model.
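The segmentation training objective (BCE plus IoU loss) can be sketched as below; this soft-IoU formulation is one common choice, assumed here since the paper does not give its exact definition:

```python
import torch
import torch.nn as nn

bce_loss = nn.BCEWithLogitsLoss()

def iou_loss(logits, target, eps=1e-6):
    """Soft IoU loss on sigmoid probabilities for binary masks (N, 1, H, W)."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

def seg_loss(logits, target):
    """Combined segmentation loss, paired with Adam (lr = 0.0002) in the paper."""
    return bce_loss(logits, target) + iou_loss(logits, target)
```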

Model Evaluation
Two different procedures were used on both datasets to evaluate the proposed detection and segmentation models. For pixel-level evaluation, the segmentation model provides a pixel-wise output of the class probabilities for every pixel in the input nodule ROIs. The output is converted into a binary segmentation map using a threshold value. As pixel-level evaluation metrics, accuracy (ACC), sensitivity (SEN) and specificity (SPE) are calculated to evaluate the performance of the segmentation model. We also plot a receiver operating characteristic (ROC) curve to calculate the area under the curve (AUC). For object-level evaluation, we used the segmentation output to calculate the Dice coefficient (DSC) and intersection over union (IoU), assessing the ability of the algorithm to precisely segment the boundaries of the nodule. Note that in our case there is no "true negative" class, since there is no "object" corresponding to the absence of nodules. Therefore, we also plot the precision-recall (PR) curve instead of the ROC curve to compare against the number of ground-truth objects and find the correlation.
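The object-level metrics can be computed directly from the binary masks; a short sketch:

```python
import numpy as np

def dsc_iou(pred, gt):
    """Dice coefficient and IoU between two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dsc = 2.0 * inter / (pred.sum() + gt.sum())
    iou = inter / np.logical_or(pred, gt).sum()
    return float(dsc), float(iou)
```

Both scores equal 1.0 for a perfect match, and DSC is always at least as large as IoU on the same pair of masks.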

Nodule Detection
To detect the nodules in the input CT images, we evaluated different state-of-the-art deep learning detection models: R-CNN [32], Fast R-CNN [33], the original Faster R-CNN [14] and the Optimized Faster R-CNN. The aforementioned detection models were trained and tested on the LIDC-IDRI and LUNA16 datasets, using the data splits discussed in Section 4.1. We used all the default parameters of their original papers to train the R-CNN [32], Fast R-CNN [33] and original Faster R-CNN [14] models. We fine-tuned the parameters of the original Faster R-CNN to achieve the highest performance and named the result the Optimized Faster R-CNN. The best combination for this model is a learning rate of 0.001, a step size of 70000, a gamma of 0.1, and a dropout ratio of 0.5. The model used a pre-trained ResNet50 model to extract the features, with a batch size of 64. Finally, we compared the average precision (AP) of the detections, as shown in Table 2, to select the best detection model among those tested. The Optimized Faster R-CNN model yields the best results, with the highest AP on both datasets. In turn, the R-CNN, Fast R-CNN and original Faster R-CNN models did not properly detect all nodules in the input CT images. Therefore, we selected the Optimized Faster R-CNN model to detect the nodules in CT images. Some examples of lung nodule detection using the Optimized Faster R-CNN are shown in Figure 7. As shown, the Optimized Faster R-CNN model is able to detect the nodule regions, even for small nodules.

Nodule Segmentation
The proposed lung nodule segmentation model is compared to state-of-the-art approaches and evaluated in terms of quantitative and qualitative results. For the quantitative study, we used ACC, SEN and SPE for pixel-level and DSC and IoU for object-level performance, as shown in Table 3. We compared AWEU-Net to six different lung nodule segmentation models on both datasets: PSPNet [34], MANet [35], PAN [36], FPN [37], DeeplabV3 [38], and U-Net [13]. As shown in Table 3, AWEU-Net outperforms all the tested models in terms of the ACC, SPE, DSC, and IoU metrics on the LUNA16 dataset. AWEU-Net yields ACC, SPE, DSC, and IoU scores of 91.32%, 93.46%, 89.79%, 82.32%, and 89.88%, respectively, which are 1.18%, 1.47%, 0.97%, 1.8%, and 0.93% points higher than the scores of the second-best method (i.e., U-Net). In turn, DeeplabV3 achieved a SEN score of 93.01%, which is 1.32% points higher than AWEU-Net. However, the proposed segmentation model provides a comparable SEN score of 91.69%. In addition, using the test sets of the LUNA16 and LIDC-IDRI datasets, box plots of the DSC and IoU scores of the six models and AWEU-Net were drawn to demonstrate the segmentation ability of AWEU-Net, as shown in Figure 8. On both datasets, the proposed AWEU-Net yields the highest DSC and IoU mean scores and the lowest standard deviation, with only two outliers, compared to the other six segmentation models, which present many outliers with lower means and higher standard deviations. Furthermore, to assess the predicted probabilities of the binary segmented masks, the ROC and PR curves were constructed as shown in Figure 10. On the LUNA16 test set, the proposed AWEU-Net yields the highest AUC and PR values, 97.10% and 96.66% respectively, among the seven segmentation models tested.
On the other hand, AWEU-Net outperforms all the tested models in terms of all the evaluation metrics on the LIDC-IDRI dataset. The proposed model yields ACC, SEN, SPE, DSC, and IoU scores of 94.66%, 90.84%, 96.41%, 90.35%, and 83.21%, respectively, improving the corresponding scores of the original U-Net by 0.3%, 1.16%, 0.06%, 0.48%, and 1.21%. Again, the box plots of the DSC and IoU scores on the LIDC-IDRI dataset comparing the models' performance are displayed in Figure 9. Likewise, the proposed AWEU-Net obtains the highest DSC and IoU mean scores and a small standard deviation, with only one outlier. The AUC values of the ROC and PR curves achieved by the proposed model on the LIDC-IDRI test set are 91.58% and 82.02%, respectively, as shown in Figure 11.
Finally, a qualitative comparison of the segmentation results of AWEU-Net and the six segmentation models is shown in Figure 12. Segmentation results are presented for input nodule ROIs of CT images with a variety of difficulty levels: illumination variations, and irregular shapes and boundaries of the nodule regions. As shown in Figure 12, four examples from the two datasets, along with the ground truth and the predicted masks of the six tested models, were compared to the proposed AWEU-Net model. AWEU-Net provides segmentation results very close to the ground truth, with an average similarity of > 86% (True Positive (TP)). Our segmentation method also provides the lowest degrees of False Negatives (FN) and False Positives (FP) compared to the rest of the models. The AWEU-Net model yields regular borders compared to PSPNet, MANet and FPN, since our model strives for higher accuracy on the nodule region boundaries. The segmentations produced by the six tested models may differ significantly from the ground truth in some cases, e.g., the second example of the LUNA16 dataset.

Conclusions
This article proposed a reliable system for lung nodule detection and segmentation. The system contains two deep learning models. Firstly, the Optimized Faster R-CNN model [14], trained with lung CT scan images, was used to detect the nodule region in a CT image as an initial step. Secondly, a segmentation model, AWEU-Net, was proposed to segment the nodule boundaries of the detected nodule region. The proposed AWEU-Net includes PAWE and CAWE blocks to improve the segmentation performance. Compared to the state-of-the-art models, the proposed AWEU-Net yields the best segmentation accuracy, with DSC scores of 89.79% and 90.35%, and IoU scores of 82.34% and 83.21%, on the LUNA16 and LIDC-IDRI datasets, respectively. In future work, we will develop a comprehensive end-to-end nodule segmentation system that will also be able to classify and grade the nodule malignancy.