Improved Complementary Pulmonary Nodule Segmentation Model Based on Multi-Feature Fusion

Accurate segmentation of lung nodules from pulmonary computed tomography (CT) slices plays a vital role in the analysis and diagnosis of lung cancer. Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in the automatic segmentation of lung nodules. However, they are still challenged by the large diversity of segmentation targets and the small inter-class variances between a nodule and its surrounding tissues. To tackle this issue, we propose a features complementary network modeled on the process of clinical diagnosis, which makes full use of the complementarity and facilitation among the lung nodule location information, global coarse area, and edge information. Specifically, we first consider the importance of the global features of nodules in segmentation and propose a cross-scale weighted high-level feature decoder module. Then, we develop a low-level feature decoder module for edge feature refinement. Finally, we construct a complementary module that makes these sources of information complement and promote each other. Furthermore, we weight pixels located at the nodule edge in the loss function and add an edge supervision to the deep supervision, both of which emphasize the importance of edges in segmentation. The experimental results demonstrate that our model achieves robust pulmonary nodule segmentation and more accurate edge segmentation.


Introduction
Lung cancer remains the most fatal type of cancer worldwide [1], and among Chinese women (most of whom are non-smokers), the incidence and mortality of lung cancer rank first in the world [2]. Early screening of lung cancer is key to improving the survival chances of patients. In clinical practice, the most widely used predictors for assessing the probability of malignancy and tumor progression are the size, shape, and growth rate of the nodule [3]; therefore, accurate segmentation of lung nodules is essential in the diagnosis of lung cancer. However, with the popularization of pulmonary computed tomography (CT), manual CT annotation has become increasingly unrealistic because it is too strenuous [4]. Therefore, it is necessary to develop computer-aided diagnosis (CAD) systems for lung nodule segmentation to avoid laborious manual annotation in clinical practice, which also objectively ensures the consistency of nodule diagnosis [5].
As shown in Figure 1, lung nodules have high variability in size, shape, and intensity, as well as similar visual characteristics between nodules and their surroundings. Although traditional semi-automatic segmentation methods based on image processing improve the repeatability of annotation, they generally suffer from poor adaptability and low segmentation accuracy on heterogeneous nodules. In addition, these methods often require user interaction and prior information.

Figure 1. Examples of pulmonary nodules with large variation between nodules and small variation between nodules and their surrounding tissues in CT patches. (a) Nodules are highly variable in size, shape, and intensity. (b) Nodules have similar visual characteristics to their surrounding lung parenchyma, lung wall, blood vessels, etc.
With the development of convolutional neural networks (CNNs) in computer vision [6][7][8], their application in medical image segmentation has become a research hotspot [9][10][11][12]. Although CNN-based methods have achieved great improvements in segmenting lung tumors compared with traditional approaches [13][14][15], the segmentation of heterogeneous nodules still requires further attention for the following reasons: (1) Large variation between nodules and small inter-class variances between nodules and their surrounding tissues. Nodules come in different types, sizes, locations, etc. (Figure 1a). The intensity may be heterogeneous even within the same nodule (e.g., calcific and non-calcific tissues in partially calcified nodules). The intensities of juxta-pleural, juxta-vascular, and ground-glass opacity (GGO) nodules are indistinguishable from their surrounding lung wall, blood vessels, and lung parenchyma, respectively (Figure 1b). These characteristics hinder their accurate identification. (2) No comprehensive analysis of the factors in clinical CT images. Although the location, region, and edge of nodules are three key factors for the diagnosis of lung nodules in practice, researchers have usually neglected to consider all of these elements together in segmentation.
As mentioned above, our motivation is to improve CNN approaches to accurately and robustly segment heterogeneous nodules, especially hard-to-segment nodules. We are inspired by the clinical diagnosis process for a pulmonary nodule: clinicians first roughly locate a suspicious nodule, then extract the coarse nodule area and accurate nodule edge information according to the local manifestations of the nodule, and finally identify the nodule by combining these three factors for further diagnosis and treatment planning. Therefore, we build a general segmentation network that combines lung nodule location, coarse region, and edge information. Notably, Fan et al. [16] proposed Inf-Net, which segments COVID-19 infection regions. Although Inf-Net performs well, it has not been applied to the nodule segmentation task, and it does not comprehensively consider the factors present in clinical CT images.
To this end, we aim to construct a general lung nodule segmentation network, namely a features complementary network, which combines the location, coarse area, and edge of lung nodules to achieve coarse-to-fine segmentation. The main contributions of this study are summarized as follows: (1) We propose a novel end-to-end lung nodule segmentation guidance network that fully integrates the global context and spatial information from features at different scales, leveraging the complementary information extracted at both small and large scales. (2) By assigning more weight to the pixels located at edges and explicitly modeling edges in deep supervision, the location and coarse area are complemented with edge information, which effectively boosts the accuracy and robustness of the lung nodule segmentation model. (3) Experimental results illustrate that the proposed model outperforms other CNN methods in lung nodule segmentation with high accuracy and robustness.

Related Work
The lung nodule segmentation techniques include traditional image processing-based methods and machine learning-based methods. Traditional techniques include morphology, region growing, level set, and graph cut methods. Machine learning methods can be divided into traditional machine learning methods and deep learning methods, both of which convert segmentation into pixel-classification tasks.
In the morphology method [17], the attached vascular components were first separated by an opening operation, followed by a connected-component analysis to retain the nodule volume. The region growing method [18] performed adaptive sphericity-oriented contrast region growing to distinguish nodules from the lung wall. In the active contour model method, the images were represented as level-set functions [19]. A pulmonary nodule segmentation study [20] adopted a graph-cut method based on graph theory. However, a common shortcoming of these methods is that a method that performs well on a certain type of nodule often performs poorly on another. In addition to weak generalization, they often require user interaction, prior information, and so on, which depends on human experience.
Recently, the application of machine learning in the segmentation of lung nodules has been ubiquitous. Gonçalves et al. [21] proposed a multi-scale segmentation process for lung nodules based on the Hessian strategy. Mukhopadhyay et al. [22] constructed a two-stage segmentation framework for pulmonary nodules based on internal texture and external attachment features. In addition, a segmentation method that can extract solid and non-solid components in GGO nodules was proposed [23]. Although these traditional machine learning algorithms have achieved excellent accuracy in nodule segmentation, they have encountered some drawbacks, including but not limited to relying highly on manually defined features, being time-consuming, and having weak generalization, which hinder the further development of lung nodule segmentation schemes.
In recent years, deep learning technology has developed rapidly, and CNNs have been widely used for lung nodule segmentation with promising results. Jiang et al. [24] improved the full-resolution residual neural network (FRRN) and designed two lung nodule segmentation networks that combined features of all levels through residual streams. Wang et al. [25] proposed a centrally focused convolutional neural network with a two-branch structure for lung nodule segmentation, which combines 3D and multi-scale 2D features. A parallel structure was applied by Cao et al. [26], who also devised a weighted sampling strategy based on nodule boundaries. Similarly, a multi-view CNN with a three-branch structure was adopted by Wang et al. [27], fed with a set of multi-scale 2D patches from three orthogonal directions: axial, coronal, and sagittal views. Although the parallel structure can effectively integrate multiple features, its model complexity is relatively high, which requires more run-time to reach convergence and increases the risk of overfitting. In particular, Hu et al. [28] paralleled a hybrid attention mechanism with a densely connected convolutional network to segment glioblastoma tumors in entire lung CT scans. Their approach is more suitable for larger tumors (glioblastoma diameter range 40-90 mm). In addition to the aforementioned multi-branch parallel architectures, researchers have also designed other structures. For example, the methods of [29,30] are CNN-based and incorporate other techniques in the pre-processing stage. A hybrid deep learning model [29] applied an adaptive median filter in the pre-processing stage and then used a U-Net-based architecture to segment the lung tumor.
In the 2D-3D cascaded CNN framework [30], the CT scan volume was also pre-processed by the maximum intensity projection technique, and the pulmonary nodules were then segmented by a U-Net that integrated residual blocks and squeeze-and-excitation blocks. In particular, Song et al. [31] introduced a Faster-CNN model into a generative adversarial network to automatically segment various types of pulmonary nodules. Ni et al. [13] designed a coarse-to-fine two-stage segmentation algorithm for pulmonary nodules, which included two multi-scale U-Nets, one used for localization and the other for refined segmentation. Zhao et al. [14] proposed an improved pyramid deconvolution neural network that fused low-level fine-grained features to finely segment lung nodules in CT slices. Huang et al. [32] proposed a system for the fully automatic segmentation of lung nodules directly from raw thoracic CT scans. Although [14,32] improved the segmentation accuracy by fusing all low-level features, the computational burden was increased because low-level features were integrated equally, while high-level features were not fully reused.
Furthermore, deep learning methods focusing on multiple scales have also been applied to lung nodule segmentation. For example, Maqsood et al. [33] proposed a U-Net-based segmentation framework that integrates dense deep blocks and dense Atrous blocks. Shi et al. [34] presented a multi-scale residual U-Net lung nodule segmentation model (MCA-ResUNet), which applies Atrous Spatial Pyramid Pooling (ASPP) as a bridging module and adds three Layer-crossed Context Attention (LCA) mechanisms guided by adjacent smaller scales. A semi-supervised three-view segmentation network with detection branches was proposed by Sun et al. [35], in which three parallel dilated convolutions performed multi-scale feature extraction in the detection and classification modules. Based on the encoder-decoder model, Wang et al. [36] changed the skip connections to multiple long and short skip connections, and added a global attention unit and a boundary loss to segment difficult-to-segment (DTS) nodules. Through these skip connections, each convolutional block of the decoder can access the feature maps of every previous layer at the same level to aggregate multi-scale semantic information. Yang et al. [37] used a ResNet structure to improve 3D U-Net, focusing on deep supervision to guide the network to extract multi-scale features rather than fusing features at different scales; specifically, side deep supervision is added to each layer of the decoder. Considering the complementarity between the nodule patch and the global CT, Wang et al. [38] proposed a dual-branch multi-granularity scale-aware network (MGSA-Net), which unifies the representation of global- and patch-level image features in one framework. The deep scale-aware module (DSAM) in the global branch extracts the concealed multi-scale contextual information at different stages through three parallel branches. Ni et al. [13] constructed a two-stage network for lung nodule segmentation and classification.
A 3D multi-scale U-Net (MU-Net) was employed in the first stage to locate nodules. In the second stage, a 2.5D multi-scale separable U-Net (MSU-Net) adopted a multi-branch separable convolutional input layer to extract features at different scales from any input image scale, refining the output of MU-Net. The models of Wang et al. [38] and Ni et al. [13] are both improvements on U-Net. The studies by Yang et al. [37] and Ni et al. [13] are 3D networks that exploit the continuity of information between CT slices. In general, 3D networks have more parameters than 2D networks, which can easily lead to overfitting and slow convergence. In particular, when there are not enough labeled samples during training, the performance of a 3D network can be worse than that of a 2D network. Zhu et al. [39] added a High-Resolution network with Multi-scale Progressive Fusion (HR-MPF) to the encoder part of the High-Resolution Network (HRNet) and proposed a Progressive Decoding Module (PDM) for the decoder part. In addition, a loss function with an edge consistency constraint was designed for the segmentation loss.
Note that there are two common shortcomings in all of the above-mentioned studies that use CT patches for lung nodule segmentation. First, tumor-centered CT patches were used uniformly, which means the location of the tumor in the CT patch was fixed. The segmentation performance was therefore likely to be biased, as it did not account for the varying position of the tumor in the raw CT slice. The other issue was the use of a fixed-size square CT patch for feature extraction, which may conflict with the large variation in the sizes and shapes of lung nodules.
Compared to previously developed CNNs, our model differs in the following ways: (1) Our model multi-scales the input 2D CT patches and further aggregates high-level multi-scale features with cross-scale weighting to extract global features containing rich location and semantic information, while only one network is involved. (2) Considering that low-level and high-level features are different, the model does not integrate them equally; instead, it extracts them separately in a manner that ensures they complement each other, which reduces the computational complexity. (3) The edge information is explicitly modeled to preserve the nodule boundaries. In particular, the nodule location information is introduced into the edge information to strengthen the edge features. (4) The tumor location and size in the CT patches of the dataset are not fixed, which improves the robustness of the model.
The remainder of this paper proceeds as follows. Section 3 describes the proposed method in detail. The datasets and experimental details are presented in Section 4. Section 5 presents a comparison between the qualitative and quantitative experimental results. Finally, we discuss some potential improvements and draw conclusions in Section 6.

Figure 2 illustrates the architecture of the proposed network, which includes four major parts: the backbone, HDM, LDM, and CM. Our model uses a pre-trained Res2Net50 as the backbone and takes CT patches of three scales to capture coarse multi-scale features. To enhance the representation ability of the model and adapt it to the segmentation task, we replace the 7 × 7 convolution in layer 1 with three consecutive 3 × 3 convolution and ReLU layers and remove the last pooling layer and fully connected layer. As such, layer 1 and layer 2, with low-level features, contain rich edge information, while layer 3, layer 4, and layer 5, with high-level features, embrace strong semantic information. Next, the HDM takes the high-level coarse multi-scale features as input and acquires refined multi-scale features with rich spatial information while suppressing irrelevant background noise; it extracts the nodule location and the coarse area of nodules of different sizes. Meanwhile, the high-resolution low-level edge information is fed into the LDM to obtain initial edge information while reducing memory consumption. Finally, the CM performs complementation of the location, coarse area, and edge of a lung nodule. Concretely, the CM supplements the initial edge information with location information through location fusion (LF), and supplements the coarse nodule area with refined edge information via edge fusion (EF).
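The data flow described above can be sketched at the shape level. The following Python sketch is illustrative only: the backbone stub, the channel counts, and the stand-in fusion operations (element-wise multiplication for the HDM and addition for the LDM and CM) are assumptions for demonstration, not the actual layers of the network.

```python
import numpy as np

def upsample(f, size):
    """Nearest-neighbour resize, a stand-in for bilinear up-sampling."""
    c, h, w = f.shape
    H, W = size
    ys = np.arange(H) * h // H
    xs = np.arange(W) * w // W
    return f[:, ys][:, :, xs]

def backbone(x):
    """Hypothetical stand-in for the modified Res2Net50: five feature
    levels, each at half the resolution of the previous one. The channel
    count (8) is illustrative only."""
    h, w = x.shape
    return [np.random.rand(8, h >> i, w >> i) for i in range(1, 6)]

def forward(ct_patch):
    f1, f2, f3, f4, f5 = backbone(ct_patch)
    # HDM: aggregate high-level features (f3-f5) into a coarse nodule area
    size_g = f3.shape[1:]
    f_g = f3 * upsample(f4, size_g) * upsample(f5, size_g)
    # LDM: fuse low-level features (f1, f2) into an initial edge map
    size_e = f1.shape[1:]
    f_e = f1 + upsample(f2, size_e)
    # CM: edge information refines the coarse area (addition as a stand-in)
    return upsample(f_g, size_e) + f_e

pred = forward(np.zeros((96, 96)))
```

The sketch only verifies that the three modules can be composed at consistent spatial resolutions; the prediction inherits the resolution of the low-level edge features.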

High-Level Feature Decoder Module (HDM)
Researchers apply convolution kernels of different sizes to obtain multi-size receptive fields, which are designed to be superior to a shared fixed size. Here, we design an MF block to capture more spatial information on nodules of different sizes through four cascade branches {b_m, m = 0, . . . , 3}, inspired by the receptive field block (RFB) [40]. As shown in Figure 3, each branch consists of a standard and a dilated convolutional layer. As the convolution kernel size and the atrous convolution dilation rate of the four branches increase from 1 to 3, 5, and 7, the receptive fields become 1, 9, 15, and 21, respectively. To be specific, every branch first applies a 1 × 1 convolutional layer to reduce the number of channels to 32. To further reduce the number of parameters and add deeper non-linear layers, for {b_m, m ≥ 1}, we replace the (2m + 1) × (2m + 1) convolutional layer with a 1 × (2m + 1) and a (2m + 1) × 1 convolutional layer, followed by a 3 × 3 convolutional layer with a (2m + 1) dilation rate, as widely used in DeepLab [41]. We then concatenate the outputs of the above four branches and send the result to a 3 × 3 convolutional layer to reduce the number of channels from 4 × 32 to 32. Finally, a shortcut from the input of the MF block is added element-wise. To extract richer high-level semantic features and reserve more spatial information, we add three MF blocks to the HDM. In particular, the MF blocks added in layer 4 and layer 5 can further compensate for the loss of spatial information and capture a more accurate position of a nodule, which can be used as a feature supplement for the LE block.
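The quoted receptive fields follow from simple composition: a (2m + 1)-sized convolution followed by a 3 × 3 convolution with dilation rate (2m + 1) enlarges the receptive field by 2(2m + 1), while branch b_0 is a single 1 × 1 convolution. A quick check of this arithmetic:

```python
def branch_receptive_field(m):
    """Effective receptive field of branch b_m in the MF block.

    Branch 0 is a single 1x1 convolution. For m >= 1, a (2m+1)x(2m+1)
    convolution (factorised as 1x(2m+1) and (2m+1)x1) is followed by a
    3x3 convolution with dilation rate (2m+1), which adds 2*(2m+1)."""
    k = 2 * m + 1
    if m == 0:
        return 1
    return k + 2 * k  # k from the standard conv, 2k from the dilated 3x3

fields = [branch_receptive_field(m) for m in range(4)]
# fields == [1, 9, 15, 21], matching the values quoted in the text
```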

Figure 3. The architecture of the multi-receptive field (MF) block; "d" denotes dilation rate. It contains four branches that can extract features from different scales to retain more accurate spatial location information at high levels.
Many segmentation tasks consider all high- and low-level features of the backbone equally and aggregate them uniformly [42]. However, compared to high-level features, low-level features have higher resolution and contain weaker semantic information, which requires more computational cost while contributing less to the segmentation results [43]. For these reasons, we designed an MD block, as shown in Figure 4, which progressively integrates only the three high-level features. Specifically, for a 2D CT image, we first extract two low-level features { f_i , i = 1, 2 } and three high-level features { f_i , i = 3, 4, 5 } through the five convolution layers of Res2Net. We then apply the MD block to gradually integrate the high-level features and generate the coarse area of pulmonary nodules f_g, which is further supplemented by location and edge information in the subsequent CM. We set l = 3 and L = 5, and the MD block operation is defined as follows:

f_g = Conv( f'_3 ⊙ Up( f'_4 ; f'_3 ) ⊙ Up( f'_5 ; f'_3 ) )

where f_g is the initial aggregated global coarse nodule area, ⊙ denotes the concatenation operation, and f_i and f'_i are the multi-scale context feature output by the MF block and its corresponding updated feature, respectively. For the deepest feature i = L, we set f'_i = f_i. The updated feature map f'_i, i ∈ [l, . . . , L − 1], is obtained by multiplying the original feature with the remaining deeper feature maps:

f'_i = f_i ⊗ ∏_{k=i+1}^{L} Conv( Up( f_k ; f_i ) )

where Up( f_k ; f_i ) is the up-sampling operation that resizes f_k to the same size as f_i by bilinear interpolation, Conv is a 3 × 3 convolutional layer with BN, and ⊗ denotes element-wise multiplication. Finally, we obtain the progressively aggregated feature map with two 3 × 3 and one 1 × 1 convolutional layers, which gives the coarse nodule area. The aggregation method of the MD block fully reuses the high-level global features through weighted cross-scale integration.
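The cross-scale aggregation can be sketched in numpy as follows. This is a simplification for illustration: the convolutions and the final concatenation of the real MD block are omitted (summation at the finest scale stands in for them), so the sketch shows only the multiplicative reuse of deeper features.

```python
import numpy as np

def upsample(f, size):
    """Nearest-neighbour stand-in for bilinear Up(f_k; f_i)."""
    h, w = f.shape
    H, W = size
    return f[(np.arange(H) * h // H)[:, None], (np.arange(W) * w // W)]

def md_aggregate(feats):
    """Cross-scale aggregation over high-level features f_3..f_5.

    updated[i] = feats[i] * prod_{k>i} upsample(feats[k]); the deepest
    feature is kept unchanged. Summation at the finest scale replaces
    the concatenation + convolutions of the real block for brevity."""
    L = len(feats)
    updated = list(feats)
    for i in range(L - 1):
        for k in range(i + 1, L):
            updated[i] = updated[i] * upsample(feats[k], feats[i].shape)
    size = feats[0].shape
    return sum(upsample(u, size) for u in updated)

f3, f4, f5 = np.ones((32, 32)), np.ones((16, 16)), np.ones((8, 8))
f_g = md_aggregate([f3, f4, f5])
```

With all-ones inputs, each updated feature stays at one and the aggregate at the finest scale is simply their count, which makes the shape bookkeeping easy to verify.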


Low-Level Feature Decoder Module (LDM)
It is known that in the down-sampling feature-extraction stage (Res2Net50 in this paper), low-level feature maps retain significant high-resolution edge information [43]. Meanwhile, many researchers have pointed out that edge information can be used as an a priori constraint, which can effectively improve segmentation performance [10,16,44]. The edge of pulmonary nodules is also one of the key pieces of information that clinicians pay attention to in clinical diagnosis. Therefore, we must ensure that the lower-level features ( f_1 and f_2 in our model) retain enough edge information. We input these edge features into the proposed LE block (Figure 5) to yield a complete edge feature map f_E with moderate resolution. Specifically, layer 1 extracts local edge features f_1, while layer 2 captures more abstract global edge features f_2; the two low-level edge features complement and enhance each other. The workflow is as follows: the two shallow features { f_1 , f_2 } are first sent to a set of filters that capture robust edge feature maps, and they are then fused by element-wise operations to produce an original edge feature map f_E, where ⊕ and ⊗ are the element-wise addition and multiplication operations, respectively. Unlike ⊕, which emphasizes complementary features, ⊗ emphasizes the enhancement of common features.
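The exact fusion equation of the LE block is not reproduced above; the sketch below assumes one plausible combination of the two paths, using ⊕ (addition) for complementary responses and ⊗ (multiplication) for responses the two levels agree on, and omitting the block's convolutional filters.

```python
import numpy as np

def upsample(f, size):
    """Nearest-neighbour stand-in for bilinear up-sampling."""
    h, w = f.shape
    H, W = size
    return f[(np.arange(H) * h // H)[:, None], (np.arange(W) * w // W)]

def le_block(f1, f2):
    """Hypothetical LE-block fusion of two low-level feature maps.

    f2 is resized to f1's resolution; element-wise addition keeps
    complementary edge responses, element-wise multiplication keeps
    mutually enhanced common responses, and the two are combined."""
    f2_up = upsample(f2, f1.shape)
    complementary = f1 + f2_up   # "+" : complementary features
    common = f1 * f2_up          # "*" : enhanced common features
    return complementary + common  # fused original edge map f_E

f_e = le_block(np.ones((64, 64)), np.ones((32, 32)))
```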

Figure 5. The architecture of the low-level edge (LE) block. It extracts complete edge information, which effectively constrains and guides the lung nodule segmentation, and further refines the coarse lung nodule area.

Complementary Module (CM)
This module includes location fusion (LF) and edge fusion (EF) (see the upper right of Figure 2), which aim to complement and enhance the location and edge information while extracting enhanced edge features and explicitly modeling the enhanced edge information. More prominent edge features can be obtained by introducing high-level semantic information or location information into the local edge information [45]. Inspired by [46,47], we take f_4 and f_5 obtained from the MF block as more accurate location information and combine them with the original edge information f_E to obtain the final edge guidance information f̂_E, which can effectively constrain the edge in segmentation. After obtaining the final edge guidance information and the coarse nodule area, we utilize the former to further refine the latter and achieve fine segmentation. Specifically, the LF block first adjusts the size of f_5 to be consistent with f_4 through a set of convolutions and up-sampling, and the result is then dotted with f_4 to obtain an accurate location feature map. The final edge feature map f̂_E is obtained by adding the original boundary information f_E point-by-point to explicitly learn the edge representation of the lung nodule. Finally, we use the EF block to combine the final nodule edge f̂_E with the coarse area of the pulmonary nodule f_g through convolution, up-sampling, and addition operations to obtain the final nodule segmentation prediction map P_s. The location fusion (LF) and edge fusion (EF) operations are expressed as follows:

f̂_E = f_E ⊕ Up( Conv( Up( Convs( f_5 ) ; f_4 ) ⊗ f_4 ) ; f_E )
P_s = δ( f_g ⊕ Up( Conv( f̂_E ) ; f_g ) )

where Convs is a set of convolutional operations that aims to capture features with rich detailed information, and Conv is a convolutional operation with a ReLU activation function that can change the number of channels.
Up( * ; f ) is the up-sampling operation, which resizes * to the same size as f by bilinear interpolation; δ represents the sigmoid function; ⊗ and ⊕ denote element-wise multiplication and summation, respectively.
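The LF and EF operations described above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the authors' code: the module and argument names, channel sizes, and the exact placement of convolutions are assumptions based on the textual description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationFusion(nn.Module):
    """Sketch of the LF block: fuse high-level location features f4 and f5
    with the original edge features fE (channel sizes are illustrative)."""
    def __init__(self, c5, c4, ce):
        super().__init__()
        # "Convs": a set of convolutions capturing detailed information
        self.convs = nn.Sequential(
            nn.Conv2d(c5, c4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c4, c4, 3, padding=1), nn.ReLU(inplace=True))
        # 1x1 conv + ReLU changes the channel count to match fE
        self.align = nn.Sequential(nn.Conv2d(c4, ce, 1), nn.ReLU(inplace=True))

    def forward(self, f4, f5, fE):
        # resize f5 to f4's spatial size, then refine with convolutions
        f5 = self.convs(F.interpolate(f5, size=f4.shape[2:],
                                      mode='bilinear', align_corners=False))
        loc = torch.sigmoid(f5) * f4                 # element-wise product (⊗)
        loc = F.interpolate(self.align(loc), size=fE.shape[2:],
                            mode='bilinear', align_corners=False)
        return loc + fE                              # point-by-point sum (⊕)

class EdgeFusion(nn.Module):
    """Sketch of the EF block: refine the coarse nodule area fg with the
    final edge map via convolution, up-sampling, and addition."""
    def __init__(self, cg, ce):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(cg, ce, 3, padding=1),
                                  nn.ReLU(inplace=True))

    def forward(self, fg, fE_hat):
        fg = F.interpolate(self.conv(fg), size=fE_hat.shape[2:],
                           mode='bilinear', align_corners=False)
        return fg + fE_hat                           # final prediction map Ps
```

A usage example with toy feature maps: `LocationFusion(64, 32, 16)` fuses a 12×12 `f5`, a 24×24 `f4`, and a 48×48 `fE` into a 48×48 edge guidance map.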
Here, to explicitly model the enhanced edge features, we add an extra edge supervision that measures the difference between the final predicted edge map f̂ E and the edge map G e generated from the groundtruth (GT). We use the standard binary cross entropy (BCE) loss L edge as an edge constraint: L edge = −(1/M) ∑ i [G e (i) log P e (i) + (1 − G e (i)) log(1 − P e (i))], where P e is the predicted edge probability map derived from f̂ E , M denotes the total number of pixels, and G e is the edge groundtruth map, which is obtained by calculating the gradient of the groundtruth map G s . Equation (6) is the edge supervision that we add to supervise the edge feature map.

Loss Function
Inspired by reference [48], we consider that pixels located at edges contain more texture information than other pixels, so we pay more attention to edges during segmentation. In this paper, each pixel is assigned a weight W. An edge pixel corresponds to a larger W, while a non-edge pixel corresponds to a smaller one, so W can be used as an indicator of pixel importance. W i denotes the weight map, which is derived from the ground truth map G s , and is calculated as follows: where A i denotes the area surrounding pixel i, and α and β are the threshold and intensity parameters, respectively, which are hyperparameters. Here, we empirically set α = 1 and β = 5. In summary, we extract pixels located at edges through average pooling and subtraction operations. W i has the same size as the groundtruth map G s . When the pixel in question is located at an edge, it is assigned a large weight, and vice versa. Figure 6 visualizes the weight distribution of nodules using this weighting strategy, which explicitly weights edge pixels more heavily in the segmentation loss.
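A plausible sketch of this edge-weighting scheme, following the boundary-aware weighting popularized by [48]: local average pooling followed by subtraction from the mask highlights pixels near the GT boundary. The pooling radius `k` is an assumption, as the paper does not state the neighbourhood size.

```python
import torch
import torch.nn.functional as F

def edge_weight_map(gt, alpha=1.0, beta=5.0, k=15):
    """Sketch of the pixel weight map W derived from the GT mask.

    gt: binary groundtruth mask of shape (B, 1, H, W).
    alpha, beta: the paper's threshold/intensity hyperparameters (1 and 5).
    k: assumed pooling radius (hypothetical; not given in the paper).
    Interior and background pixels get weight ~alpha; pixels whose local
    average differs from their own value (i.e. near the boundary) get
    weights up to alpha + beta.
    """
    local_mean = F.avg_pool2d(gt, kernel_size=2 * k + 1, stride=1, padding=k)
    return alpha + beta * torch.abs(local_mean - gt)
```

With a 64×64 mask containing a 16×16 square, pixels far from the square keep weight 1.0, while pixels on the square's boundary receive weights above 1.0.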

Figure 6. Visualization of weighted pixels located at edges from different nodule samples. The nodule pixel weight distribution under the weighted strategy for pixels located at edges, where red is the high weight and blue is the low weight.
Based on the above analysis, to obtain better segmentation performance, in addition to the edge loss L edge for edge supervision proposed in Equation (6), we propose the joint segmentation loss L ω seg = L ω BCE + L ω IOU , which is used for deep segmentation supervision.
where L ω BCE and L ω IOU are the weighted BCE and weighted IOU losses, respectively. L ω BCE is a pixel-level loss defined as L ω BCE = −∑ i W i [G s (i) log P s (i) + (1 − G s (i)) log(1 − P s (i))] / ∑ i W i , where P s is the predicted global map. In contrast to the standard BCE loss, the weighted BCE loss L ω BCE assigns a larger coefficient to pixels located at edges to increase their contribution to the loss. Considering that L ω BCE ignores the global structure of the image because it calculates the loss of each pixel independently, we introduce the weighted IOU loss L ω IOU to make the network pay more attention to the global structure. L ω IOU is an image-level loss that is widely used in segmentation and object detection. It is designed to optimize the global structure rather than focusing on a single pixel, and can therefore be used as a complement to L ω BCE . Similar to L ω BCE , the weighted IOU loss L ω IOU focuses on pixels located at edges through the same weighting method: L ω IOU = 1 − ∑ i W i P s (i) G s (i) / ∑ i W i [P s (i) + G s (i) − P s (i) G s (i)]. As shown in Equation (8), the segmentation loss function includes both local and global losses, which complement each other and provide effective supervision for accurate lung nodule segmentation.
The total loss function of the proposed network consists of two parts: one part tackles the most common segmentation supervision presented in Equation (8), and the other focuses on the edge supervision described in Equation (6), which plays a crucial role in medical segmentation. Therefore, the total hybrid loss is defined as: L total = L ω seg + L edge .
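The hybrid loss above can be sketched as follows. This is a hedged reconstruction in the style of the structure loss of [48], not the authors' implementation: the pooling radius `k`, the smoothing constant, and the equal weighting of the two loss terms are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_seg_loss(pred, gt, beta=5.0, k=15):
    """Sketch of L^w_seg = weighted BCE + weighted IOU.

    pred: raw logits of shape (B, 1, H, W); gt: binary mask, same shape.
    Edge pixels are up-weighted via average pooling + subtraction
    (beta and the radius k follow the weighting sketch above).
    """
    w = 1 + beta * torch.abs(F.avg_pool2d(gt, 2 * k + 1, stride=1, padding=k) - gt)
    # weighted BCE: per-pixel BCE, normalized by the total weight
    bce = F.binary_cross_entropy_with_logits(pred, gt, reduction='none')
    wbce = (w * bce).sum(dim=(2, 3)) / w.sum(dim=(2, 3))
    # weighted IOU: image-level overlap with the same pixel weights
    p = torch.sigmoid(pred)
    inter = (w * p * gt).sum(dim=(2, 3))
    union = (w * (p + gt)).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)   # +1: smoothing (assumed)
    return (wbce + wiou).mean()

def total_loss(pred_seg, gt_seg, pred_edge, gt_edge):
    """Total hybrid loss: Eq. (8) segmentation term plus the Eq. (6)
    BCE edge supervision (equal weighting assumed)."""
    return weighted_seg_loss(pred_seg, gt_seg) + \
           F.binary_cross_entropy_with_logits(pred_edge, gt_edge)
```

Both terms are non-negative, so the total loss is bounded below by zero, which makes training behaviour easy to sanity-check.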

Datasets
To evaluate the performance of the proposed network, we conducted experiments on two datasets. One is a public benchmark dataset: the LUng Nodule Analysis 2016 (LUNA16) dataset [49], and the other is an independently collected dataset from the Fudan University Shanghai Cancer Center (FUSCC).
LUNA dataset: There are 888 CT scans and 1186 GT nodules in LUNA16, which excludes slices thicker than 2.5 mm obtained from the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) [50]. The LIDC-IDRI database contains annotations collected through a two-phase annotation process by four experienced radiologists. Among all the marked lesions, only nodules ≥ 3 mm accepted by at least three out of four radiologists constitute the LUNA16 dataset. In other words, annotations that do not conform to the reference standard (non-nodules, nodules < 3 mm, and nodules annotated by only one or two radiologists) are referred to as irrelevant findings.
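The LUNA16 reference standard above amounts to a simple filter over the LIDC-IDRI annotations; a minimal sketch (the function and argument names are hypothetical):

```python
def luna16_inclusion(diameter_mm: float, n_radiologists_accepting: int) -> bool:
    """Hypothetical filter reflecting the LUNA16 reference standard:
    keep only nodules >= 3 mm accepted by at least 3 of the 4 radiologists.
    Everything else (non-nodules, nodules < 3 mm, nodules accepted by only
    one or two readers) counts as an irrelevant finding."""
    return diameter_mm >= 3.0 and n_radiologists_accepting >= 3
```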
FUSCC dataset: The second dataset contains 1134 CT slices of nodules from 89 subjects with single nodules admitted to the Fudan University Shanghai Cancer Center. All nodules are randomly assigned to four board-certified radiologists for labeling and then verified and corrected by an experienced radiologist (10+ years of experience). Generally, each nodule spans several to dozens of slices in the raw CT volume, and we take each slice as a sample.
Dice similarity coefficient (DSC) and Jaccard Index (JA) are common evaluation criteria used to calculate the overlap ratio between the segmentation result (S) and groundtruth (GT). Both range from 0 to 1, where 1 means perfect overlap [51]. They are calculated as DSC = 2TP/(2TP + FP + FN) and JA = TP/(TP + FP + FN), where TP, FP, and FN represent the number of true positives, false positives, and false negatives, respectively; equivalently, DSC = 2|S ∩ GT|/(|S| + |GT|) and JA = |S ∩ GT|/|S ∪ GT|, where | · | refers to the number of pixels in a given region, and ∩ and ∪ denote taking the intersection and union, respectively. Hausdorff distance (HD) is formulated as HD(S, GT) = max{ max p min t d(p, t), max t min p d(p, t) }, where p and t are pixels on S and GT, respectively. As suggested in [52], we use the Hausdorff distance (95%) (HD95) to eliminate the adverse effects of outliers. S-measure (Sm) computes the structural similarity between the segmentation result and ground truth, defined as Sm = α · S o + (1 − α) · S r , where α is the balance coefficient between the object-aware similarity S o and region-aware similarity S r , which is set to 0.5. E-measure (Em) jointly evaluates the local and global similarities between the binarized prediction and ground truth, defined as Em = (1/M) ∑ i ∅(i), where i and M denote each pixel and the total number of pixels in the GT, respectively, and ∅ indicates the enhanced alignment matrix. MAE reflects the pixel-wise error between S and GT, denoted as MAE = (1/M) ∑ i |S(i) − GT(i)| [53].
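The overlap metrics above are straightforward to compute from binary masks; a minimal NumPy sketch using the TP/FP/FN formulation given in the text (HD95, Sm, and Em require more machinery and are omitted):

```python
import numpy as np

def seg_metrics(pred, gt):
    """Compute DSC, JA (Jaccard), and MAE between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # true positives
    fp = np.logical_and(pred, ~gt).sum()      # false positives
    fn = np.logical_and(~pred, gt).sum()      # false negatives
    dsc = 2 * tp / (2 * tp + fp + fn)
    ja = tp / (tp + fp + fn)
    mae = np.abs(pred.astype(float) - gt.astype(float)).mean()
    return dsc, ja, mae
```

For example, masks `[[1,1],[0,0]]` and `[[1,0],[1,0]]` share one pixel, giving DSC = 0.5, JA = 1/3, and MAE = 0.5.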

Implementation Details
We divide the two datasets into training and testing sets at the patient level with the same ratio of 9:1 before the experiments. At the beginning of training, we first resize all patches or regions of interest (ROIs) to 96 × 96 and then apply multi-scale resampling to the input patches. Specifically, the network applies bilinear interpolation to resample each input patch with three ratios of 0.75, 1, and 1.25 to obtain three scaled images; in other words, our model is trained with a multi-scale strategy. Note that, in order to obtain different locations of the same nodule within patches, we perform five cropping operations around the same nodule on the same CT slice, and the size of the patch is not a fixed square.
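The multi-scale resampling step can be sketched as follows (the function name is illustrative; the ratios match those stated above):

```python
import torch
import torch.nn.functional as F

def multi_scale(patch, ratios=(0.75, 1.0, 1.25)):
    """Resample an input patch (B, C, H, W) at three ratios by bilinear
    interpolation, as in the multi-scale training strategy described above."""
    h, w = patch.shape[2:]
    return [F.interpolate(patch, size=(int(h * r), int(w * r)),
                          mode='bilinear', align_corners=False)
            for r in ratios]
```

A 96 × 96 patch thus yields three inputs of size 72 × 72, 96 × 96, and 120 × 120.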
The entire framework is implemented with Python 3.6 and PyTorch-GPU 1.4.0 on an experimental platform consisting of an Ubuntu 18.04 operating system with an NVIDIA GeForce GTX 1080 Ti graphics card and 32 GB of memory. We adopt the Adam optimizer for training with an initial learning rate of 1 × 10−4, which decays by 10% every 30 epochs. To avoid overtraining, if the performance stabilizes, training is stopped after ten extra epochs. We found that our model converges after approximately 50 epochs with a batch size of four; consequently, we set the upper limit of the training period to 60 epochs. The performance of our approach was evaluated using MATLAB 2018a.
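A minimal sketch of this optimization setup, interpreting "decays by 10% every 30 epochs" as a StepLR schedule with `gamma=0.9` (an assumption; the model below is a one-layer placeholder for the actual network):

```python
import torch

# Placeholder model standing in for the proposed segmentation network
model = torch.nn.Conv2d(1, 1, 3, padding=1)

# Adam at 1e-4, decayed by 10% every 30 epochs, 60-epoch upper limit
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.9)

for epoch in range(60):
    # ... train one epoch here with batch size 4, then:
    optimizer.step()      # placeholder step (no gradients in this sketch)
    scheduler.step()
```

After 60 epochs the learning rate has decayed twice, to 1e-4 × 0.9² = 8.1e-5.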

Quantitative Analysis
To verify the efficiency of the proposed model, we performed a quantitative comparison with SOTA methods on the LUNA16 and FUSCC test datasets, namely residual U-Net (RUN) [54], Huang et al. [32], U-Net++ [55], U-Net [42], CE-Net [56], and Attention U-Net [57]. Table 1 reports the overall segmentation performance of all methods on both datasets based on multiple indicators. We observe that our proposed model outperforms almost all methods on all evaluation metrics on the LUNA16 dataset. In addition, when tested on the independent FUSCC dataset, the similarly good results of our proposed model reaffirm its competitiveness in the segmentation of different types of pulmonary nodules. OUR-Net obtains significantly better DSC, JA, and MAE than U-Net and U-Net++ on all datasets. CE-Net, which was proposed to capture more high-level and spatial information, is also very effective for segmentation. Notably, although Attention U-Net obtains the highest SE value (0.907), its SP and DSC values are relatively lower (0.982 and 0.772, respectively); taken together, these three metrics illustrate that Attention U-Net is more prone to mis-segmentation than our method, which is not conducive to clinical diagnosis. Our improvement over U-Net, U-Net++, and Attention U-Net is statistically significant (p-value < 0.05). The superiority of our network in lung nodule segmentation may be attributed to the complementarity of the sufficiently refined features and the weighted cross-scale fused features. Table 1. Performance comparison of lung nodule segmentation on the LUNA16 and FUSCC test datasets. DSC, JA, and HD95 are presented as mean ± standard deviation (SD) (95% confidence interval). The best two results are indicated in red and blue. ↑: larger is better, ↓: smaller is better.

Ablation Studies
To verify the effectiveness of each component in our model, we conducted a series of ablation experiments on the FUSCC test dataset, as shown in Table 2. We observed that the results of almost all evaluation metrics increased as the components were added sequentially. In particular, we demonstrate the large advantage of the proposed MF block by comparing row (a) with row (b): its application boosts the DSC value by a substantial 44% (from 0.459 to 0.659), and the HD95 value drops from 34.703 to 26.286. This indicates that MF blocks can extract features of nodules of different sizes by combining atrous convolutions with different atrous rates. We also compare row (b) with row (c): for example, JA increases from 0.491 to 0.654 (by 33%), HD95 decreases from 26.286 to 8.389, and MAE decreases from 0.101 to 0.076, which further supports that our proposed MD block is beneficial for pulmonary nodule segmentation. Finally, we demonstrate the effectiveness of the proposed weighting strategy for pixels located at edges by comparing the fusion result with and without the edge weighting (row (d) and row (e)) in CM, where (w/o) represents our model without the edge weighting strategy. With the full model, the metrics in Table 2 are all further improved, which confirms that the proposed components are effective in learning pulmonary nodule features. Moreover, we also make a visual analysis of the feature learning ability of each component in our model, as shown in Figure 7. As can be observed, the segmentation edges gradually approach the ground-truths, and the final segmentation edges are closest to the ground-truths, which indicates that OUR-Net is effective in locating nodule edges. Table 2. Ablation analysis of our network on the FUSCC dataset. Dice similarity coefficient (DSC), Jaccard Index (JA), and Hausdorff distance (95%) (HD95) in the test results are displayed in the form of mean ± standard deviation (SD) (95% confidence interval), and the best results are reported in bold.
(w/o W): with CM and without the loss strategy for weighted pixels located at edges; *: with CM module and the loss strategy for weighted pixels located at edges. ↑: larger is better, ↓: smaller is better. In addition, in order to explore the complementary performance of location information in our network, we sequentially added the location information f 3 , f 4 , and f 5 obtained from layer 3, layer 4, and layer 5 to CM on the FUSCC test set, as shown in Table 3. It can be seen that the addition of pulmonary nodule location information significantly boosts the segmentation performance, where the combination of f 4 and f 5 obtains the optimal values of 0.868, 0.767, and 5.354 on the DSC, JA, and HD metrics, respectively. Its performance is better than that of the single f 5 or the combination of f 3 , f 4 , and f 5 , which demonstrates that f 4 + f 5 jointly provides sufficient and the most balanced location information. Figure 8 displays representative nodule segmentation edges from the FUSCC and LUNA testing sets (F1-F5 and L1-L5, respectively) to visually compare our approach to other approaches, including U-Net, U-Net++, and Attention U-Net. Specifically, for spiculated nodules (F1) found in the FUSCC dataset, U-Net excessively segments the surrounding tissues.
When segmenting the juxta-pleural nodules (F2), it is difficult to distinguish nodules from surrounding tissues of the same intensity using U-Net and Attention U-Net, and U-Net++ excessively segments the nearby pleura of similar intensity. Regarding GGO nodules (F3), owing to the low contrast, Attention U-Net excessively segments the lung parenchyma. When segmenting isolated nodules (F4), U-Net falsely segments nearby tissues. In cavitary nodule (F5) segmentation, U-Net++ segments nodules only partially owing to low contrast, while U-Net and Attention U-Net struggle to distinguish nodules from complex surroundings. In contrast, our network maintains strong robustness when segmenting these types of nodules. For simplicity, we analyze only typical nodule segmentation results on the LUNA dataset. When U-Net and Attention U-Net attempt to segment calcified nodules (L2), they cannot distinguish the background. When segmenting GGO nodules with cavity structures (L4), U-Net and U-Net++ only partially recognize the nodules owing to the low intensity contrast; Attention U-Net is slightly aggressive in segmentation, while U-Net++ is slightly conservative. From the above qualitative comparison, it can be observed that the challenging nodules are mostly juxta-pleural nodules, juxta-vascular nodules, and nodules with heterogeneous intensities. Figure 9 further visualizes the segmentation results of these types of nodules from the LUNA and FUSCC datasets using our method. Notice that the segmentation results of our model have a large overlap with the ground-truth, which shows that it consistently obtains the most accurate segmentation edges. The strong robustness of our method may benefit from the combination of the proposed components, which aid the refinement and fusion of complementary information between cross-scale features.
Visual qualitative experiments show that our approach is effective in the segmentation of various types of lung nodules.

Unexpectedly, we found a wrong GT label for a juxta-vascular nodule during the comparative examination of the test results (as shown in Figure 9, L13). Fortunately, our proposed network still accurately segmented it without being affected by the erroneous GT label, which demonstrates the robustness and consistency of our model in pulmonary nodule segmentation.

Discussion
With the development of computer hardware technology and deep learning algorithms, more and more convolutional neural networks are designed for the automated analysis of medical images. Although deep learning models have achieved marvelous results in various medical image tasks, precise lung nodule segmentation remains a challenging task owing to the diversity of lung nodules, their blurry edges, and the small inter-class variances between nodules and their surrounding tissues. Inspired by the process of clinical diagnosis, we design a model in which the location, region, and edge information complement each other.
The experimental results show that the proposed network can segment pulmonary nodules effectively and robustly. Our MF block focuses on pulmonary nodule areas with rich details and yields locations with rich spatial information. The MD block generates coarse nodule areas by weighted cross-scale feature fusion and suppresses irrelevant information. In addition, our CM refines the nodule edge by making the location, region, and edge information complement each other, achieving accurate segmentation of pulmonary nodules. These components are plug-and-play and can be flexibly and effectively combined with other networks to improve performance. It is worth noting that OUR-Net achieved inspiring performance in pulmonary nodule segmentation on CT patches without any pre- or post-processing tricks.
However, there are still some limitations in our work. Although our approach achieved good results in lung nodule segmentation, it applies only to 2D images. Additionally, nodule segmentation is only one part of lung cancer analysis and diagnosis. In the future, we will consider introducing the correlated inter-slice information of the 3D volume into the model to obtain better segmentation results. Moreover, we will integrate the segmentation and classification of pulmonary nodules into a unified framework, taking advantage of the correlation between the tasks to study a detection model in which segmentation and classification promote each other, so that the framework can be better applied to clinical analysis.

Conclusions
In this study, we propose a novel scheme for lung nodule segmentation, which extracts high- and low-level local features in different ways and complements them with each other. Our proposed network is trained in an end-to-end manner to obtain effective and robust segmentation performance for different types of pulmonary nodules. Our core idea is to imitate the nodule determination process used in clinical diagnosis, in which nodule location, coarse nodule area, and nodule edge complement each other, so that the network can learn multi-scale features with high consistency. Compared to several classic lung nodule segmentation methods, our method demonstrates excellent performance (DSC = 0.835 ± 0.002 for LUNA, and DSC = 0.868 ± 0.001 for FUSCC). In particular, our model exhibits great potential in segmenting challenging nodules, such as juxta-pleural nodules, juxta-vascular nodules, and nodules with heterogeneous intensity.

Institutional Review Board Statement:
This study was conducted in accordance with the Declaration of Helsinki and was approved by the local ethics committee. The requirement for informed consent was waived due to the anonymous and retrospective nature of this work.
Data Availability Statement: Data sharing is not applicable.