A Novel Pulmonary Nodule Detection Model Based on Multi-Step Cascaded Networks

Pulmonary nodule detection in chest computed tomography (CT) is of great significance for the early diagnosis of lung cancer, and many computer-assisted detection methods have therefore been proposed. However, these methods still fail to provide convincing results because nodules are easily confused with calcifications, vessels, or other benign lumps. In this paper, we propose a novel deep convolutional neural network (DCNN) framework for detecting pulmonary nodules in chest CT images. The framework consists of three cascaded networks. First, a U-Net integrating an inception structure and dense skip connections is proposed to segment the lung parenchyma region from the chest CT image. The inception structure replaces the first convolution layer for better feature extraction over multiple receptive fields, while the dense skip connections reuse these features and propagate them through the network. Second, a modified U-Net in which all convolution layers are replaced by dilated convolutions is proposed to detect "suspicious nodules" in the image. The dilated convolution enlarges the receptive field to improve the network's ability to learn global information from the image. Third, a modified U-Net adopting multi-scale pooling and multi-resolution convolution connections is proposed to find the true pulmonary nodules among the multiple candidate regions. During detection, the result of each step is used as the input of the next, following a "coarse-to-fine" detection process. Moreover, the pixel-wise loss, perceptual loss and dice loss are used together to replace the cross-entropy loss, addressing the imbalanced distribution of positive and negative samples. We apply our method to two public datasets to evaluate its ability in pulmonary nodule detection.
Experimental results illustrate that the proposed method outperforms the state-of-the-art methods with respect to accuracy, sensitivity and specificity.


Introduction
Lung cancer is one of the most lethal diseases, with only about a 16% 5-year survival rate [1,2]. With the development of modern medical techniques, researchers have shown that the survival rate could reach 54% on average if lung cancer is diagnosed at an early stage [3]. Therefore, early detection of pulmonary nodules plays a critical role in the early diagnosis of lung cancer [4] and in computer-assisted diagnosis systems (CADs) [5]. Recently, pulmonary nodule detection has typically been performed on chest CT scans, and many automated detection methods have been proposed that process and analyze the chest CT images [6]. These methods can generally be categorized into two types: (1) detection based on hand-crafted features, and (2) detection based on deep learning. The comparison methods discussed above, including our proposed method, are summarized in Table 1, and we will discuss these methods in detail in our work. The main contributions of this work are listed as follows:
1. We propose a uniform framework with three hierarchical U-Net-like [17] networks following the "coarse-to-fine" manner: (a) parenchyma region detection, (b) nodule candidate detection, and (c) true nodule determination;
2. we apply the inception structure and dense connection in the U-Net-like network for better segmentation of lung parenchyma regions;
3. we leverage dilated convolutions instead of conventional ones in the U-Net-like network so that small-size tissues, including nodules, can be detected without omission;
4. we adapt a multi-scale pooling strategy and a multi-resolution convolution block in the U-Net-like network to differentiate true nodules of multiple sizes from the candidates in a complicated environment;
5. we modify the dice loss to account for the imbalanced distribution of positive and negative samples. The proposed dice loss, together with the pixel-wise loss and the perceptual loss, is used to train the proposed cascaded network.
The rest of this paper is organized as follows: in Section 2 we describe the three sub-networks for lung parenchyma region segmentation, nodule candidate detection and true nodule determination in detail, respectively. The joint loss and the training strategy are discussed in Section 3. The experimental results are presented and discussed in Section 4, and finally we conclude our work in Section 5.

Inception-Dense U-Net for Lung Parenchyma Segmentation
Given that pulmonary nodules only occur in the lung parenchyma, it is necessary to precisely segment the parenchyma region from the CT image to avoid interference from outside-lung organs and tissues, such as sternums, during nodule candidate detection. As shown in Figure 2, we propose a modified U-Net network to find the lung parenchyma mask of the input CT image. An inception [18] module is used to replace the first convolution layer of the original U-Net to extract features using different receptive fields. Moreover, dense connections between all convolution layers and de-convolution layers are used instead of the simple skip connections between each convolution layer and its corresponding de-convolution layer. We introduce the inception module and the dense connection in detail in the following parts.

2.1.1. Inception Structure
Figure 3 shows the inception block used in our proposed segmentation network. The input layer is filtered by convolution kernels of different sizes in parallel, and the results from these channels are concatenated together. Therefore, more comprehensive features of the outside-lung organs, the lung parenchyma and the boundary of the pulmonary lobes can be extracted from the image, leading to better segmentation of the lung parenchyma. Before being connected to the next layer, a 1 × 1 convolution layer is applied to reduce the dimensionality of the feature maps, preserving only the necessary information with as few parameters as possible. The inception module is applied only as the first convolution layer as a good tradeoff between multi-scale feature separation and computational cost.
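As a rough illustration of this design (not the authors' exact configuration: the branch kernel sizes and channel counts below are assumptions), an inception-style block with parallel kernels followed by a 1 × 1 dimensionality-reduction layer could be sketched in PyTorch as:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel convolutions with different receptive fields, concatenated,
    then reduced by a 1x1 convolution (kernel sizes are illustrative)."""
    def __init__(self, in_ch, branch_ch=16, out_ch=32):
        super().__init__()
        # Three parallel branches with different kernel sizes; padding keeps
        # the spatial size unchanged in every branch.
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        # 1x1 convolution reduces the concatenated channels.
        self.reduce = nn.Conv2d(3 * branch_ch, out_ch, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return self.reduce(feats)

block = InceptionBlock(in_ch=1)
y = block(torch.randn(1, 1, 64, 64))
print(tuple(y.shape))  # (1, 32, 64, 64)
```

The 1 × 1 reduction keeps the output channel count fixed regardless of how many parallel branches are used, which is what makes it cheap to widen the block.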

Dense Connection
During the forward transmission of the U-Net, image features are easily lost, and the incomplete features will result in incorrect segmentation. Inspired by the Dense-Net proposed in [19], we propose to connect every convolution layer to all the other convolution layers in the network. Therefore, the feature maps from all previous layers are used as the input to the current layer, and the feature map from the current layer is used as the input to all subsequent layers. This avoids the problem of feature vanishing and improves the segmentation performance as a result.
As shown in Figure 4, the 4 convolution layers are connected in the dense manner. The relation between these layers can be mathematically represented as follows:

X_n = H_n([↓_m(X_1), ↓_m(X_2), …, X_{n−1}])

where X_n is the feature map from the n-th layer, H_n(·) denotes the convolution operation of the n-th layer, and ↓_m represents the m-times down-sampling operation applied so that each earlier feature map matches the resolution of the current layer. The input of the n-th layer is thus the concatenation of the features from the previous n − 1 layers after down-sampling. The dense connection between the de-convolution layers follows the opposite way of the convolution ones. Figure 5 shows the input CT image together with the lung parenchyma mask detected by the first proposed U-Net-like network. The mask fits the ground truth well, so the outside-lung tissues, even those with similar sizes and shapes to pulmonary nodules, are filtered out from the input of the next stage.
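A minimal sketch of this dense connection pattern follows; the channel counts, the use of max-pooling for the down-sampling ↓_m, and the growth rate are all assumptions for illustration, not the paper's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseEncoder(nn.Module):
    """Each layer receives the concatenation of ALL previous feature maps,
    down-sampled to the current resolution: X_n = H_n([down(X_1), ..., X_{n-1}])."""
    def __init__(self, in_ch=1, growth=8, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Conv2d(ch, growth, kernel_size=3, padding=1))
            ch += growth  # every previous output is re-used as input

    def forward(self, x):
        feats = [x]
        for conv in self.layers:
            # down-sample all stored maps to the current spatial size
            size = feats[-1].shape[-2:]
            pooled = [F.adaptive_max_pool2d(f, size) for f in feats]
            out = torch.relu(conv(torch.cat(pooled, dim=1)))
            feats.append(F.max_pool2d(out, 2))  # resolution halves per layer
        return feats[-1]

enc = DenseEncoder()
y = enc(torch.randn(1, 1, 64, 64))
print(tuple(y.shape))  # (1, 8, 4, 4)
```

Because every feature map is kept and re-concatenated, no intermediate feature can "vanish" from the forward path, which is the property the text relies on.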

Dilated-Convolution U-Net for Nodule Candidate Detection
Taking the image with the outside-lung regions filtered out as input, we propose another U-Net-like network to detect nodule candidates. The structure of the detection network is shown in Figure 6, where every convolution layer is replaced by a convolution block consisting of an inception module and dilated convolutions, while every de-convolution layer is followed by dilated convolutions to form the de-convolution block. Figure 6. The structure of the proposed sub-network for nodule candidate region detection.
The traditional U-Net uses regular convolutions with pooling layers of stride larger than 1 to increase the receptive field for global feature learning. However, this operation decreases the size of the feature map, resulting in degradation of resolution, loss of important information, and low detection accuracy after up-sampling back to the input size. The dilated convolution [20] solves this problem by introducing a parameter called the "dilation rate" to represent the distance between every two non-zero weighted parameters in the convolution kernel. The sizes of the dilated-convolution kernel and the feature map after convolution can be calculated as follows:

n = k + (k − 1)(d − 1)
o = ⌊(i + 2p − n) / s⌋ + 1

where k represents the kernel size, p denotes the zero-padding size, s is the convolution stride, d is the dilation rate, and i, n and o are the sizes of the input feature map, the new kernel after dilation and the output feature map, respectively. Thus, the convolution operation can cover a larger range of information without any pooling operation. Figure 7 shows the general structure of the convolution block used in Figure 6. The inception module is used to extract features at multiple perceptual scales as discussed in Section 2.1.1. After that, dilated convolution is used instead of the pooling operation to maintain the receptive field, so the output feature map preserves the differences between nodule candidates and other tissues as much as possible. Figure 7. The general structure of the convolution block used in the nodule candidate detection sub-network.
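The two size formulas can be checked numerically; the helper below is a direct transcription of them, with symbol names following the text:

```python
def dilated_kernel_size(k, d):
    """Effective size n of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

def output_size(i, k, p, s, d):
    """Output feature-map size o for input size i, padding p, stride s."""
    n = dilated_kernel_size(k, d)
    return (i + 2 * p - n) // s + 1

# A 3x3 kernel with dilation rate 2 behaves like a 5x5 kernel:
print(dilated_kernel_size(3, 2))    # 5
# With padding 2 and stride 1, the spatial size is preserved:
print(output_size(64, 3, 2, 1, 2))  # 64
```

This is the key property the sub-network exploits: the receptive field grows with the dilation rate while the feature map keeps its full resolution.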
Conversely, the structure of the de-convolution block is shown in Figure 8. After the concatenation of the de-convolution results and the corresponding convolution results, two dilated convolutions are applied in succession. Similar to those used in the convolution steps, the dilated convolutions in the de-convolution block preserve more perceptual information than traditional up-sampling. Figure 9 shows the visualization of the output of the second proposed U-Net-like network, where nodule candidates are detected comprehensively without missing the true nodules.

Multi-Resolution Feature Concatenation and Multi-Scale Pooling CNN for Pulmonary Nodule Determination
The lung parenchyma and the nodule candidate regions are detected by the two U-Net-like networks discussed in Sections 2.1 and 2.2, respectively. In this part, we propose a third network to differentiate pulmonary nodules from non-nodule candidates, which can be considered an expansion of the second network. Figure 10 shows the structure of the proposed classification network, where the multi-resolution feature concatenation module replaces the conventional convolution layers and multi-scale pooling replaces max-pooling. Therefore, features of small nodule candidates can be extracted more clearly for better classification of true pulmonary nodules. Figure 10. The structure of the proposed sub-network for nodule determination.

Multi-Resolution Convolution Block
Inspired by the HRNet proposed in [21], as shown in Figure 11, we propose a multi-resolution feature concatenation module to extract more detailed features of small-size nodule candidates. The input layer is processed by convolution kernels with 3 different strides in parallel:
1. stride equal to 1 in both directions;
2. stride equal to 1 in the horizontal direction and 2 in the vertical direction;
3. stride equal to 2 in the horizontal direction and 1 in the vertical direction.
This generates feature maps at 3 different resolutions:
1. a feature map with the same high resolution as the input layer;
2. a feature map with 1/2 resolution in the horizontal direction and the same resolution in the vertical direction as the input layer;
3. a feature map with 1/2 resolution in the vertical direction and the same resolution in the horizontal direction as the input layer.
Feature maps in these 3 routes are processed by normal convolution and pooling in parallel. After that, the feature map with 1/2 resolution in the horizontal direction is processed by a de-convolution kernel with a fractional stride of 1/2 (i.e., 2× up-sampling) in the horizontal direction, while the feature map with 1/2 resolution in the vertical direction is processed by a de-convolution kernel with a fractional stride of 1/2 in the vertical direction. Finally, the feature maps of the 3 routes are concatenated together as the input of the following network layer. Since the feature maps are calculated at both the full resolution and the half resolution in each direction, more features of the small-scale targets in the image can be preserved. Moreover, convolution and de-convolution with different strides are used to replace down-sampling and up-sampling based on simple interpolation, preserving more detailed features.
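A minimal sketch of this three-route block follows; the kernel and channel sizes are assumptions, and the "stride 1/2" de-convolutions are realized here as transposed convolutions with stride 2 along the reduced direction:

```python
import torch
import torch.nn as nn

class MultiResolutionBlock(nn.Module):
    """Three parallel routes: full resolution, half resolution vertically,
    and half resolution horizontally; the reduced routes are restored by
    transposed convolutions (the 'fractional stride 1/2' de-convolutions)
    and all three routes are concatenated. Sizes are illustrative."""
    def __init__(self, in_ch, ch=8):
        super().__init__()
        self.full = nn.Conv2d(in_ch, ch, 3, stride=1, padding=1)
        self.half_v = nn.Conv2d(in_ch, ch, 3, stride=(2, 1), padding=1)  # 1/2 vertical
        self.half_h = nn.Conv2d(in_ch, ch, 3, stride=(1, 2), padding=1)  # 1/2 horizontal
        # Transposed convolutions up-sample the reduced direction by 2x.
        self.up_v = nn.ConvTranspose2d(ch, ch, 3, stride=(2, 1), padding=1,
                                       output_padding=(1, 0))
        self.up_h = nn.ConvTranspose2d(ch, ch, 3, stride=(1, 2), padding=1,
                                       output_padding=(0, 1))

    def forward(self, x):
        a = self.full(x)
        b = self.up_v(self.half_v(x))
        c = self.up_h(self.half_h(x))
        return torch.cat([a, b, c], dim=1)  # all routes back at input resolution

block = MultiResolutionBlock(in_ch=1)
y = block(torch.randn(1, 1, 32, 32))
print(tuple(y.shape))  # (1, 24, 32, 32)
```

Using learned (transposed) convolutions rather than interpolation for the up-sampling is what lets the half-resolution routes contribute detailed features back to the full-resolution output.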

Multi-Scale Pooling Strategy
Between every pair of multi-resolution convolution blocks, we propose a multi-scale pooling block to replace the conventional max-pooling layer. The structure of the proposed pooling block is shown in Figure 12. The input layer is sent to 4 parallel scaling-convolution layers whose kernel sizes are 2 × 2, 3 × 3, 4 × 4 and 5 × 5, all with stride 2. These parallel convolution layers work similarly to the inception structure [18], extracting features with different receptive fields, so both global, large-scale and local, small-scale features can be obtained simultaneously. To maintain the dimensionality of the feature maps and reduce the computational complexity, a 1 × 1 convolution layer is added after every scaling-convolution layer. Letting the number of output channels of each scaling-convolution layer be m, the 1 × 1 convolution layer reduces the number of channels to m/4. Therefore, the dimensionality of the final output after concatenating these convolution layers is still m, providing a comprehensive feature representation with a relatively low-dimensional feature map.
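The channel bookkeeping above can be sketched as follows; the padding values are assumptions chosen so that all four branches produce the same spatial size:

```python
import torch
import torch.nn as nn

class MultiScalePooling(nn.Module):
    """Four parallel stride-2 'scaling' convolutions with kernels 2..5, each
    followed by a 1x1 convolution reducing its channels to m/4, so the
    concatenated output has m channels at half resolution."""
    def __init__(self, m):
        super().__init__()
        assert m % 4 == 0
        # (kernel, padding) pairs: paddings 0,1,1,2 keep all branch outputs
        # the same size after the stride-2 convolution.
        specs = [(2, 0), (3, 1), (4, 1), (5, 2)]
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(m, m, k, stride=2, padding=p),
                          nn.Conv2d(m, m // 4, 1))
            for k, p in specs
        )

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)

pool = MultiScalePooling(m=16)
y = pool(torch.randn(1, 16, 32, 32))
print(tuple(y.shape))  # (1, 16, 16, 16)
```

Note how the 1 × 1 reductions make the concatenated output carry exactly m channels, so the block is a drop-in replacement for a stride-2 max-pooling layer.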

Loss Function and Training Strategy
The three sub-networks are cascaded as one end-to-end network. We train all the parameters of the unified network at once for each group of training samples, avoiding the conflicts of optimizing the three sub-networks separately.

Joint Loss of the Network
The pixel-wise loss, perceptual loss and dice loss are used together as the objective of the detection network for parameter optimization. The joint loss function is mathematically defined as follows:

L_j = µ_1 L_pix + µ_2 L_perc + µ_3 L_d

where L_j is the joint loss, L_pix, L_perc and L_d represent the pixel-wise loss, the perceptual loss and the dice loss, respectively, and µ_1, µ_2 and µ_3 are the loss normalization coefficients. The pixel-wise loss follows the traditional mean square error (MSE) loss, which can be expressed as follows:

L_pix = (1 / (W·H)) Σ_{w=1}^{W} Σ_{h=1}^{H} (F(y)_{w,h} − x_{w,h})²

where y is the input chest CT image, F(y) is the nodule detection result, x is the ground truth nodule map, and W and H are the width and height of the image, respectively. The MSE loss constrains the network so that the detected nodule map is as close to the ground truth nodule map as possible at the pixel level. However, since nodules usually occupy small regions of the chest CT image, it is possible for the nodules to be wrongly detected while the MSE loss remains low. To solve this problem, we make use of the visual perception of the CT image with nodules in it, which helps ensure that the nodules are detected in the "right" regions. For the perceptual loss, the VGG-19 network [22] is widely regarded as a good model of how human beings observe a given image. Specifically, in our work, the VGG-19 processes both the image with detected pulmonary nodules and the ground truth image with true pulmonary nodules, extracting perceptual features of the pulmonary nodules and other tissues in the chest CT image. Mathematically, the perceptual loss is defined as the Euclidean distance between the 12-th layer outputs of the VGG-19 network for the generated image and the ground truth image:

L_perc = (1 / (W_i·H_i)) ‖φ_i(F(y)) − φ_i(x)‖²

where W_i and H_i denote the width and height of the convolution output of layer i of the VGG-19 network, and φ_i represents the i-th layer used for feature extraction.
We set i to 12 empirically. For the dice loss, the traditional dice coefficient [23] measures the similarity between the nodule-detected image F(y) and the ground truth image x as follows:

Dice(F(y), x) = 2·|F(y) ∩ x| / (|F(y)| + |x|)

where |·| counts the number of non-zero pixels in the image, and the traditional dice loss is 1 − Dice. However, the traditional dice loss does not work when the ground truth is a negative sample, i.e., when there are no nodules in the image, because |x| is zero and the dice coefficient is constantly 0. To solve this problem, for positive samples we use the same definition of the dice loss as the traditional one, while for negative samples we define the dice loss as the L1 norm of the network output. The modified dice loss function is expressed as follows:

L_d = 1 − 2·|F(y) ∩ x| / (|F(y)| + |x|)   for positive samples,
L_d = ‖F(y)‖₁                              for negative samples,

where all the symbols are the same as those in Equation (6). The process of calculating the joint loss is shown in Figure 13. The MSE loss, perceptual loss and dice loss are all taken into consideration to optimize the parameters of the whole network. By minimizing the joint loss, the network learns the differences with respect to pixel distribution, boundary shape and human perception simultaneously. Furthermore, the modified dice loss makes it possible to use negative samples for training, enlarging the training database and leading to better pulmonary nodule detection.
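A numerical sketch of the modified dice loss follows; binary (thresholded) masks are assumed, and the perceptual term is omitted since it requires a pre-trained VGG-19:

```python
import numpy as np

def modified_dice_loss(pred, gt):
    """Dice loss for positive samples; L1 norm of the output for negative
    samples (i.e., when the ground-truth mask contains no nodule)."""
    if gt.sum() == 0:                      # negative sample: no nodule present
        return float(np.abs(pred).sum())   # penalize any detected pixels
    inter = np.count_nonzero(np.logical_and(pred > 0, gt > 0))
    dice = 2.0 * inter / (np.count_nonzero(pred) + np.count_nonzero(gt))
    return 1.0 - dice

gt = np.zeros((8, 8)); gt[2:4, 2:4] = 1           # a 2x2 "nodule"
pred = np.zeros((8, 8)); pred[2:4, 2:4] = 1
print(modified_dice_loss(pred, gt))                # 0.0 (perfect overlap)
print(modified_dice_loss(pred, np.zeros((8, 8))))  # 4.0 (false positives on a negative sample)
```

The negative-sample branch is what keeps the loss informative on nodule-free slices: any spurious detection is penalized directly instead of yielding a constant dice term.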

Training Strategy
The pair-wise 2D slices of the detected nodule map F(y) from the cascaded network and the ground truth nodule map x are fed to the pre-trained VGG-19 network for extracting features and calculating the perceptual loss L perc . Together with the pixel-wise loss L pix and the dice loss L d , the objective loss L j is computed according to Equation (3). Instead of optimizing every single sub-network while fixing the parameters of the other two sub-networks, which may result in contradictory adjustments and difficulty in convergence, the loss is back-propagated to update the weights of all the parameters in the three sub-networks as the parameters of one end-to-end network, increasing the efficiency of the training process.
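The single-backward-pass strategy can be sketched as follows; tiny convolutional stand-ins replace the three real sub-networks and a plain MSE objective stands in for the joint loss, since the point here is only that one backward pass updates all three stages together:

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the three cascaded sub-networks (NOT the real architectures).
net1 = nn.Conv2d(1, 4, 3, padding=1)   # parenchyma segmentation stand-in
net2 = nn.Conv2d(4, 4, 3, padding=1)   # nodule candidate detection stand-in
net3 = nn.Conv2d(4, 1, 3, padding=1)   # nodule determination stand-in

# One optimizer over ALL parameters, instead of optimizing each sub-network
# separately with the other two frozen.
params = list(net1.parameters()) + list(net2.parameters()) + list(net3.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(2, 1, 32, 32)       # batch of CT slices (random stand-in)
target = torch.rand(2, 1, 32, 32)   # ground-truth nodule maps (random stand-in)

opt.zero_grad()
out = net3(net2(net1(x)))           # cascade: each stage feeds the next
loss = nn.functional.mse_loss(out, target)
loss.backward()                     # gradients flow through all three stages
opt.step()

# every sub-network received gradients from the single joint backward pass
print(all(p.grad is not None for p in params))  # True
```

Compared with stage-wise training, this avoids the contradictory adjustments the text describes: the gradient seen by the first sub-network already accounts for how the later stages consume its output.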

Experimental Dataset
To evaluate the performance of our proposed method, we use the public LUNA16 dataset [16] and the dataset from the ALIBABA Cloud TianChi Medical Competition for training and testing.
The LUNA16 dataset contains 888 chest CT scans and 1186 pulmonary nodules. Each scan consists of slices with a size of 512 × 512 and a slice thickness of less than 2.5 mm. Pulmonary nodules in each scan were annotated by four experienced radiologists in a two-phase procedure. Each radiologist annotated the lesions they observed as non-nodule, nodule smaller than 3 mm, or nodule larger than 3 mm. The reference standard of the LUNA16 challenge includes all nodules larger than 3 mm accepted by at least 3 of the 4 radiologists.
The TianChi dataset contains 800 cases, with nodules labeled by radiologists in the same annotation format as LUNA16. The maximum slice thickness of all scans is 2 mm. The nodule size distribution is as follows: 50% of the nodules range from 5 mm to 10 mm, and the others from 10 mm to 30 mm. Details can be found at https://tianchi.aliyun.com/competition/entrance/231601.

Parameter Setting
Some key parameters for the convolution and pooling layers of the three detection sub-networks are shown in Table 2, Table 3 and Table 4, respectively. For the hyper-parameters, the regularization coefficient is set to 10^4, the initial learning rate is set to 0.001, and µ_1, µ_2 and µ_3 are set to 0.9, 0.9 and 0.999, respectively. The batch size for training is set to 40, and the total number of epochs is 100.

Table 3. Some key parameters in the nodule candidate detection sub-network (m represents the number of channels of the former layer); e.g., the dilated convolution layer Di-conv1 uses a 3 × 3 × m kernel with stride = 1 and dilation rate = 2.

Table 4. Some key parameters in the nodule determination sub-network (m and c represent the numbers of channels of the corresponding former layers), covering the multi-resolution convolution and multi-scale pooling modules.

All the training and testing were carried out under the PyTorch framework on an Intel Core i7-4790K 4.0 GHz PC with 16 GB RAM and an NVIDIA TITAN Xp GPU with 12 GB RAM.

Experiment Implementation
We compare our method with the three-dimensional fully convolutional neural network (3D-FCN) [12], the multi-resolution CNN (MR-CNN) [13], the three-dimensional U-Net (3D-UNET) [14], the progressive resolution network with the hierarchical saliency network (PRN-HSN) [15], the DCNN pulmonary nodule detection network [24], nodule detection with contrast limited adaptive histogram equalization (CLAHE-SVM) [25] and the Mask R-CNN-based pulmonary nodule detection network (Mask-RCNN) [26]. The testing chest CT images are processed by the proposed method, as well as by these state-of-the-art methods, to produce pulmonary nodule detection results. Then the detection accuracy [27], sensitivity [27], specificity [27] and the area under the precision-recall (PR) curve (AUC) are calculated as follows to evaluate all the methods quantitatively:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

where TP denotes the number of true positive samples, i.e., nodule samples recognized as nodules, TN represents non-nodule samples recognized as non-nodules, and FP and FN represent non-nodule samples recognized as nodules and nodule samples recognized as non-nodules, respectively. The precision-recall curve is generated by calculating the precision at every given recall. Figure 14(b1-b5) shows several example results of the parenchyma segmentation sub-network. Table 5 shows the performance of the proposed parenchyma region segmentation sub-network, including precision, sensitivity, specificity and dice value. The average dice value of the proposed method is 0.8636, and the precision, sensitivity and specificity are 0.8792, 0.8878, and 0.9590, respectively. The proposed method thus achieves convincing performance on both datasets. The high dice value and sensitivity further prove that the segmented parenchyma mask is very close to the ground truth, which guarantees the performance of the following nodule candidate detection.
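The evaluation metrics defined above can be computed directly from the confusion counts; a minimal helper (the counts in the example are made up for illustration):

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from the confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall: fraction of true nodules found
    specificity = tn / (tn + fp)   # fraction of non-nodules correctly rejected
    return accuracy, sensitivity, specificity

print(detection_metrics(tp=90, tn=80, fp=20, fn=10))  # (0.85, 0.9, 0.8)
```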
Figure 14(c1-c5) shows the results of applying the nodule candidate detection sub-network to the same input images as in Figure 14(a1-a5). In addition, Figure 14(d1-d5) shows the results of determining the "true nodules" from the nodule candidates in Figure 14(c1-c5), respectively. Table 6 shows the performance of the proposed nodule candidate detection sub-network with respect to sensitivity and specificity. The proposed method achieves a very high sensitivity with an acceptable specificity, which demonstrates that the sub-network can detect most of the true nodules accurately with only slight over-estimation of other tiny tissues, which can be further refined by the following nodule determination sub-network. Figure 14. The results of detecting parenchyma regions, nodule candidates and determining nodules by the proposed cascaded network on five different chest CT images. Table 5. Quantitative results of detecting parenchyma regions from input chest CT images by the proposed network.

Method                              Precision   Sensitivity   Specificity   Dice
Inception-dense U-Net sub-network   0.8792      0.8878        0.9590        0.8636

Table 6. Quantitative results of detecting nodule candidates from input chest CT images by the proposed network.

Method                                  Sensitivity   Specificity
Dilated-convolution U-Net sub-network   0.9692        0.9078

Figure 15 shows the results of detecting pulmonary nodules using different loss functions, including: (1) MSE loss only, (2) MSE-perceptual loss, and (3) MSE-perceptual-dice loss. Using only the MSE loss results in failure to detect nodules with small sizes and irregular shapes. Without the dice loss, the MSE-perceptual loss cannot differentiate nodule candidates of very small sizes, because the number of pixels belonging to nodule candidates is much smaller than that belonging to the background region. In contrast, combining the MSE, perceptual and dice losses can find both large and small nodule candidates, achieving a good balance between the sizes and shapes of the candidates in identification. Furthermore, training the three sub-networks as a uniform framework enhances the inherent relation and cooperation of the three sub-networks in nodule detection and classification. The accuracy, sensitivity and specificity results shown in Table 7 confirm the above observations.

To further evaluate the effect of the dense block in our proposed cascaded networks, we integrate the dense block into different sub-networks. Table 8 shows the results of detecting pulmonary nodules and the running time of using the dense block in: (1) the lung parenchyma segmentation sub-network only, (2) both the lung parenchyma segmentation and the nodule candidate detection sub-networks, (3) both the lung parenchyma segmentation and the pulmonary nodule determination sub-networks, and (4) all three sub-networks. Using the dense block in the nodule candidate detection sub-network and the pulmonary nodule determination sub-network does not improve the detection accuracy, sensitivity and specificity to an obvious extent, but increases the average running time from 1.3411 s to 3.2354 s, 3.1563 s and 8.1823 s, respectively. This is mainly because in the lung parenchyma segmentation sub-network, there are many useful image features that need to be preserved by the dense connection to distinguish the lung regions from the complex chest environment. In the other two cascaded sub-networks, the inputs are the parenchyma mask and the nodule regions, which are relatively simple with fewer details to preserve, so using dense connections there only increases the computational burden with no obvious improvement in detection performance. Figure 16 shows the changes in nodule detection accuracy and loss value over the training process of the proposed networks. Please note that the loss value is normalized to the range of 0 to 1. It demonstrates that the proposed method converges after about 400 iterations. Figures 17 and 18 illustrate the performance of pulmonary nodule detection by different methods on two example images from the LUNA16 dataset.
In addition, Figures 19 and 20 illustrate the performance of pulmonary nodule detection by different methods on two example images from the TianChi dataset. As marked by the green circles in Figures 18b,c and 19c, the 3D-FCN and MR-CNN directly detect the nodule candidates from the original CT image without pre-processing, resulting in the incorrect determination of non-nodule tissues outside the lung as nodules, since the outside-lung organs are not filtered out of the nodule candidates. The 3D-UNET and PRN-HSN add a lung parenchyma region segmentation stage before detecting the inside-lung nodule candidates, so they perform better than 3D-FCN and MR-CNN in decreasing the over-estimation rate. However, they still suffer from unsatisfactory results for the following reasons: (1) the lung parenchyma segmentation is generated by simple thresholding with morphological operations, so the near-edge regions are lost, as shown by the yellow circle in Figure 20d,e; (2) the convolution kernel used in the nodule candidate detection of 3D-UNET has too small a receptive field to learn global features from the image, so it is likely to confuse some small tissues with true nodules of small sizes, as shown by the green circles in Figures 18d and 20d; and (3) the hierarchical saliency network (HSN) in PRN-HSN for nodule candidate classification omits information at different resolutions, so a small-size nodule within a weakened, low-resolution region cannot be correctly recognized, as shown by the yellow circle in Figure 18e. The DCNN method simply applies the Faster R-CNN method to provide good performance with low computational cost, but it may omit the nodules on the parenchymal edge, as shown by the yellow circles in Figure 20f. The CLAHE-SVM method adds contrast-enhancement pre-processing before nodule detection, leading to better performance in detecting nodules in low-contrast regions.
However, it easily over-enhances small-size tissues and over-estimates them as nodules, as shown by the green circles in Figures 19g and 20g. The detection is also implemented over the whole image, so the nodule on the parenchyma edge may be under-estimated, as shown by the yellow circle in Figure 20g. The Mask-RCNN method performs better than the above methods because of the good performance of Mask R-CNN in object detection. However, its performance is not stable for small-size tissues and irregular-shape nodules, as shown by the green circles in Figures 17h and 19h and the yellow circle in Figure 19h. The proposed method takes advantage of a series of U-Net-like networks to perform nodule detection in a "coarse-to-fine" order of inside-lung region detection, nodule candidate detection and nodule determination. The U-Net is modified by embedding the inception structure, replacing convolution and pooling with dilated convolution, and adopting multi-scale pooling and multi-resolution convolution connections, according to the different requirements of the three stages. Moreover, it makes use of the MSE loss and the VGG-19-based perceptual loss as complements to the dice loss to optimize the whole framework. Therefore, as shown in Figures 17i, 18i, 19i and 20i, the proposed framework provides superior performance in pulmonary nodule detection with low over-estimation of non-nodule tissues. Tables 9 and 10 list the AUC values and the average accuracies, sensitivities and specificities of these methods. The proposed method provides the highest accuracy, sensitivity, specificity and AUC, which confirms our qualitative observations. The running time of implementing the different methods on the testing data is shown in Table 11. Our proposed method takes 1.3411 s on average to generate the nodule detection results for a CT scan image.
The running time is slightly longer than those of MR-CNN, PRN-HSN, DCNN and CLAHE-SVM, but it is acceptable considering that our method provides 2.49%, 1.12%, 2.01% and 1.30% higher detection accuracy, respectively. The proposed method performs effectively in most cases of pulmonary nodule detection. However, when a nodule is on the edge of the lung parenchyma and its intensity level is very close to that of the parenchyma, our method cannot distinguish the nodule from the outside-lung region. When an inside-lung tissue has a medium size and a shape similar to a nodule, our method is likely to over-estimate it as a nodule. Both cases are shown in Figure 23.

Conclusions
In this paper, we proposed a novel framework for detecting pulmonary nodules from chest CT images, consisting of three consecutive U-Net-like networks. An inception structure is used to replace the first convolution layer of the U-Net to segment the lung parenchyma region. Then another U-Net-like network is proposed by leveraging dilated convolution to replace all the convolution layers, detecting small tissues as nodule candidates. Finally, the third U-Net-like network is proposed by adopting multi-scale pooling and multi-resolution convolution connections to determine the true nodules. Moreover, the three sub-networks are integrated and optimized with a fused loss function consisting of the MSE loss, perceptual loss and dice loss. Experimental results demonstrate that the proposed method provides pulmonary nodule detection performance superior to the state-of-the-art methods on the public LUNA16 dataset and the TianChi competition dataset.