An Improved Algorithm for Detecting Pneumonia Based on YOLOv3

Abstract: Pneumonia is a disease that develops rapidly and seriously threatens the survival and health of human beings. At present, the computer-aided diagnosis (CAD) of pneumonia is mostly based on binary classification algorithms that cannot provide doctors with location information. To solve this problem, this study proposes an end-to-end, highly efficient algorithm for the detection of pneumonia based on a convolutional neural network, named Pneumonia Yolo (PYolo). This algorithm is an improved version of the Yolov3 algorithm for X-ray image data of the lungs. Dilated convolution and an attention mechanism are used to improve the detection of pneumonia lesions. In addition, double K-means is used to generate anchor boxes to improve the localization accuracy. The algorithm obtained a mean average precision (mAP) of 46.84 on the X-ray image dataset provided by the Radiological Society of North America (RSNA), surpassing other detection algorithms. Thus, this study proposes an improved algorithm that can provide doctors with location information on lesions for the detection of pneumonia.


Introduction
In recent years, the number of people suffering from pneumonia worldwide has increased year by year. In particular, the incidence of pneumonia in infants has increased significantly, which seriously threatens human survival and health [1]. At present, most computer-aided diagnosis (CAD) algorithms for the lungs are, in essence, image classifiers. Although such algorithms are simple to implement and achieve high accuracy, their output lacks the location of the lesion tissue, so they offer doctors limited reference value. The algorithm proposed in this study benefits from accurately labeled datasets [2] and the rapid development of convolutional neural network (CNN)-based object detection algorithms, which enable it to identify pneumonia and locate pneumonia tissue at the same time and thus provide doctors with more reference information.
The CNN is a kind of artificial neural network with a deep structure and convolution computation. It is capable of representation learning: through the convolution operation, it captures local spatial correlations in the input data and obtains translation invariance with respect to the input. Girshick et al. [3] first proposed training a CNN with back propagation to extract features and using a support vector machine (SVM) as the classifier, which resulted in the region-based CNN (RCNN) object detection algorithm. However, the RCNN must pre-select regions where objects may exist using selective search, and it separates the feature extraction phase from the classification phase. Such an algorithm is not end-to-end, is difficult to implement, and has low computational efficiency. Girshick [4] and Ren et al. [5] proposed an end-to-end detection algorithm that integrates object region selection, feature extraction, and classification using a feature extraction network and a region proposal network (RPN). However, this is a two-stage detection algorithm that has high hardware requirements and has difficulty achieving real-time detection in the training and testing phases. Redmon et al. [6] introduced an end-to-end real-time object detection algorithm called Yolo that uses a CNN to perform feature extraction, classification, and localization of the object. It is a one-stage detection algorithm with a high detection speed but unsatisfactory localization performance. The Yolov2 algorithm was proposed by Redmon and Farhadi [7]. It uses K-means clustering on dataset bounding boxes to obtain anchor scales as prior knowledge, which improves object localization. In addition, random scaling of the input was proposed to improve the generalization of the algorithm to different scales.
The Yolov3 algorithm, subsequently proposed by Redmon and Farhadi [8], uses a feature pyramid network (FPN) [9] to reduce the missed detections of small objects that occur in Yolov2.
The studies above concern CNN-based object detection algorithms. We now turn to CAD systems.
At present, most CAD systems for the lung rely on image classification algorithms. Such a system takes a whole image as input, extracts features using a feature extractor, and finally obtains a predicted label for the image from the classifier. It cannot provide doctors with accurate location information on lesions; therefore, its usefulness is limited. For example, Varshni et al. [10] used a CNN to extract features and an SVM as a classifier to detect pneumonia from an input image. Setio et al. [11] combined multiple pulmonary nodule detection algorithms with a CNN to construct a CAD system that achieved detection sensitivities of 85.4% and 90.1% at 1 and 4 false positives per scan, respectively, on the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset [12], which is a small-scale dataset. Rajpurkar et al. [13] proposed a large-scale pneumonia detection dataset and a network with 121 convolutional layers to detect pneumonia. In essence, these studies classify X-ray images; therefore, the resulting CAD systems can only provide category information, not location information, as a reference. However, it is difficult for the naked eye to distinguish pneumonia lesions from normal tissue in X-ray images, and it is therefore important to construct CAD systems that can provide pneumonia lesion location information. Two difficulties are encountered in constructing such a CAD system: 1) judging the presence of objects in the image and 2) accurately locating the objects. The currently available CNN-based object detection algorithms can be used to construct end-to-end CAD systems that provide location information, but the algorithms need to be adapted to X-ray image datasets.
In this study, Yolov3 was improved by analyzing the advantages and disadvantages of the existing algorithms and taking into account the characteristics of the pneumonia dataset (see Figure 2). The small differences between lesion and non-lesion tissue were believed to make lesions harder to detect, whereas enlarging the receptive field of the algorithm so that it could use global information in the image was expected to improve its recognition ability. Therefore, multi-branch dilated convolution [14,15] was added to Yolov3. In the Yolov3 object detection algorithm, multiple down-sampling operations result in the loss of semantic and spatial information, which makes the differences between lesion and non-lesion smaller in the feature space and weakens the recognition ability of the algorithm. Although the fusion of low-level and high-level features can solve this problem to a certain extent, using an attention mechanism to suppress the output of inaccurate semantic information in low-level features further improves the performance of the algorithm. The use of double K-means enables the algorithm to generate anchor boxes of different scales for different input images.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 3 of 16
Figure 2. Experimental data used in this study. The area within the blue box shows a pneumonia lesion. In images (a) and (b), there is no significant difference between the pneumonia lesion and non-lesion tissue. Image (c) shows that the features of the left lesion and the right non-lesion are similar.
This study describes the proposed detection algorithm for pneumonia, Pneumonia Yolo (PYolo). PYolo uses multi-branch dilated convolution to enlarge the receptive field and an attention mechanism to suppress the output of inaccurate semantic information in low-level features, which enhances the ability of the algorithm to detect lesions. In addition, this paper proposes the use of double K-means to generate anchor boxes to improve the localization accuracy.

Materials
Earlier research on attention focused on the analysis of brain imaging and is not covered in detail here. With the vigorous growth of deep learning, constructing CNNs with an attention mechanism has become an important topic. On the one hand, a neural network can learn the attention mechanism autonomously; on the other hand, the attention mechanism can in turn help us understand the world as seen by the neural network. In recent years, most research combining deep learning and visual attention [16][17][18] has focused on using masks to form the attention mechanism. A mask acts as attention by using weights predicted by the neural network to extract the key features from the image. Through training, the network learns which areas need attention in each new image. This is the purpose of the attention mechanism in deep learning.
In the field of semantic segmentation, the neural network architecture is generally a fully convolutional network (FCN) [19]. Like a traditional CNN, the FCN first performs convolution operations on the image and then performs pooling operations to reduce the image size and increase the receptive field. However, since semantic segmentation produces a pixel-wise output, the smaller feature map obtained after pooling is up-sampled to the original image size for prediction (generally by bilinear interpolation). There are thus two key operations in the segmentation algorithm: pooling, which reduces the image size and increases the receptive field, and up-sampling, which increases the image size. In the process of reducing the image size, the FCN loses some of the spatial and semantic information in the image, which is not conducive to segmentation. Therefore, a convolution operation that can obtain a large receptive field without pooling was introduced, i.e., dilated convolution [20]. Dilated convolution introduces a parameter called the dilation rate, which defines the sampling distance of the kernel. The larger the dilation rate, the larger the sampling distance and the larger the receptive field of the kernel.
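The relation between dilation rate and sampling distance can be illustrated with a minimal sketch. It assumes the standard relation for the span of a dilated kernel, k + (k - 1)(d - 1), and uses a one-dimensional, unpadded convolution for brevity; the function names are illustrative and do not come from the paper.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of input positions spanned by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Unpadded 1-D dilated correlation written in plain Python."""
    span = effective_kernel_size(len(kernel), dilation)
    output = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are placed `dilation` positions apart.
        output.append(sum(w * signal[start + i * dilation]
                          for i, w in enumerate(kernel)))
    return output


# A 3-tap kernel spans 3, 5, and 7 input positions for dilation rates 1, 2, and 3.
spans = [effective_kernel_size(3, d) for d in (1, 2, 3)]
dense = dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=1)
sparse = dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2)
```

With a dilation rate of 2, the same three weights cover five input positions, which is how the receptive field grows without pooling.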

Methods
In the following sections, we describe the detection process of PYolo in general and then introduce each of the three algorithm improvements proposed in this paper: 1) location pre-processing, 2) MaskFPN, and 3) dilated convolution.

PYolo Detection Process
As shown in Figure 3a, in the location pre-processing stage, PYolo uses double K-means to produce the anchor boxes of a lesion. As shown in Figure 3b, PYolo uses DarkNet53 to extract features, MaskFPN to fuse features of different levels, and a multi-branch convolution module to obtain information from multiple receptive fields. Unlike Yolov3, PYolo performs detection only on the features output by this module. The input image size of DarkNet53 is 416 × 416 pixels, and the output features are {F1, F2, F3} with sizes of {13 × 13, 26 × 26, 52 × 52}, respectively. In the experiment, the input image was scaled to 416 × 416 pixels in the pre-processing stage. The difference between MaskFPN and FPN is that MaskFPN uses the information in the high-level features as prior knowledge to generate a weight map and then multiplies the weight map with the low-level features to suppress the output of inaccurate semantic information in the low-level features. By contrast, FPN combines high-level and low-level features directly, without addressing the problem of inaccurate semantic information in the low-level features.
In an object detection algorithm, the ratio of positive to negative samples is critical to performance. As shown in Figure 4, PYolo, like Yolov3, associates image regions with feature points by dividing the image into grid cells. In the training phase, the ground-truth bounding box is mapped to the corresponding coordinates on the feature map by dividing by the stride; in the detection phase, the predicted bounding box on the feature map is mapped to the corresponding coordinates on the original image by multiplying by the stride. The dimensions of the output features of PYolo are [S, S, A * (B + Conf + Cls)], where S × S is the number of grid cells, B is the predicted bounding box, Conf is the confidence of the output object, Cls is the class of the dataset, and A is the number of scales for each anchor. Positive and negative samples were selected according to the Intersection over Union (IOU) between the anchors and the ground-truth bounding boxes. Anchors whose IOU with any ground-truth box is greater than 0.5 were included as training samples. Among these, anchors whose center point falls in the same grid cell as the center point of the ground-truth bounding box were designated as positive samples, and the other anchors were designated as negative samples.
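The coordinate mapping and the output depth described above can be sketched as follows. The sketch assumes boxes in (cx, cy, w, h) form, four box parameters, one confidence value, three anchor scales, and a single class; these values and all function names are illustrative assumptions rather than settings confirmed by the paper.

```python
INPUT_SIZE = 416  # network input size stated in the text


def stride_for(grid_size, input_size=INPUT_SIZE):
    """Stride of a feature map with grid_size x grid_size cells."""
    return input_size // grid_size


def image_to_feature(box, stride):
    """Training phase: map an image-space box (cx, cy, w, h) onto the feature map."""
    return tuple(v / stride for v in box)


def feature_to_image(box, stride):
    """Detection phase: map a feature-space box back to image pixels."""
    return tuple(v * stride for v in box)


def responsible_cell(box, stride):
    """Grid cell (column, row) that contains the box centre."""
    return int(box[0] // stride), int(box[1] // stride)


def output_depth(num_anchor_scales, num_classes, box_params=4, conf_params=1):
    """Channel depth A * (B + Conf + Cls) of one output feature map."""
    return num_anchor_scales * (box_params + conf_params + num_classes)


strides = [stride_for(s) for s in (13, 26, 52)]
gt_box = (208.0, 104.0, 96.0, 64.0)  # hypothetical ground-truth box in pixels
on_feature_map = image_to_feature(gt_box, strides[0])
cell = responsible_cell(gt_box, strides[0])
depth = output_depth(num_anchor_scales=3, num_classes=1)  # assumed A and Cls
```

For the 13 × 13 feature map the stride is 32, so the hypothetical box centre (208, 104) falls in grid cell (6, 3).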

There is still an imbalance between positive and negative samples in the screened sample set. To overcome this imbalance [21], a hyper-parameter λ = 200 is introduced in the loss function to strengthen the learning intensity for negative samples and accelerate the convergence of the model. The localization loss also differs from that in Yolov3: smooth L1 loss was adopted because it is smoother than the alternatives. The loss functions of the model are as follows:

$$L_{total} = L_{loc} + L_{cls} + L_{pos} + \lambda L_{neg} \qquad (5)$$
where $L_{loc}$, $L_{cls}$, $L_{pos}$, and $L_{neg}$ represent the localization loss, classification loss, positive sample loss, and negative sample loss, respectively, and g, p, C, X, M, and Y refer to the actual coordinates, predicted coordinates, probability of the actual class, probability of the predicted class, actual set of positive and negative samples, and predicted set of positive and negative samples, respectively. Equation (5) is the overall loss function of the algorithm.
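The combination in Equation (5) can be illustrated with a minimal sketch. It assumes the common definition of smooth L1 loss with a transition point of 1 and treats the classification, positive sample, and negative sample losses as already computed numbers, since only their weighted sum is specified here; the function names and the example values are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one coordinate: quadratic near zero, linear elsewhere."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def localization_loss(pred_box, gt_box):
    """Sum of smooth L1 losses over the coordinates of one box."""
    return sum(smooth_l1(p, g) for p, g in zip(pred_box, gt_box))


def total_loss(l_loc, l_cls, l_pos, l_neg, lam=200.0):
    """Equation (5): L_total = L_loc + L_cls + L_pos + lambda * L_neg."""
    return l_loc + l_cls + l_pos + lam * l_neg


# Hypothetical predicted and ground-truth boxes in feature-map units.
l_loc = localization_loss((6.0, 3.5, 3.0, 4.0), (6.5, 3.5, 3.0, 2.0))
# Hypothetical values for the remaining loss terms.
loss = total_loss(l_loc, l_cls=0.5, l_pos=0.25, l_neg=0.01)
```

In this example the coordinate with an error of 0.5 contributes 0.125 and the coordinate with an error of 2 contributes 1.5, which shows how the loss grows linearly rather than quadratically for large errors.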
Figure 4. In the left image, the red and green boxes are ground-truth boxes. In the right feature map, the red box is a ground-truth box, and the yellow box is an anchor. Anchors whose center point falls in the same grid cell as the center point of a ground-truth bounding box were designated as positive samples, and other anchors were designated as negative samples. If the Intersection over Union (IOU) of the anchor and the ground truth is greater than the threshold but their center points do not fall in the same grid cell, the anchor is regarded as a negative sample.
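The selection rule can be sketched as follows. The sketch assumes boxes in (cx, cy, w, h) form in image pixels and labels anchors below the IOU threshold as "ignored", which is one reading of the statement that only anchors above the threshold are included as training samples; the function names and example boxes are hypothetical.

```python
def iou_centered(a, b):
    """IOU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def grid_cell(box, stride):
    """Grid cell (column, row) that contains the box centre."""
    return int(box[0] // stride), int(box[1] // stride)


def label_anchor(anchor, gt_boxes, stride, iou_threshold=0.5):
    """Return 'positive', 'negative', or 'ignored' for one anchor."""
    label = "ignored"
    for gt in gt_boxes:
        if iou_centered(anchor, gt) > iou_threshold:
            if grid_cell(anchor, stride) == grid_cell(gt, stride):
                return "positive"
            label = "negative"  # overlaps enough, but the centres are in different cells
    return label


ground_truth = [(120.0, 100.0, 80.0, 80.0)]  # hypothetical lesion box
same_cell = label_anchor((124.0, 100.0, 80.0, 80.0), ground_truth, stride=32)
other_cell = label_anchor((132.0, 100.0, 80.0, 80.0), ground_truth, stride=32)
far_away = label_anchor((300.0, 300.0, 80.0, 80.0), ground_truth, stride=32)
```

The second anchor overlaps the ground truth with an IOU of about 0.74, yet it is labeled negative because its center lies in the neighboring grid cell.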

Double K-Means
An anchor box is a preset bounding box size. Regressing from anchor boxes helps to improve the localization accuracy of the algorithm. Yolov3 uses K-means to generate its anchor boxes. The algorithm proposed in this study instead uses double K-means, a method designed specifically for the pneumonia dataset, to generate the anchor boxes for lesions. The method consists of two phases. In the first phase, K-means is used to generate the lung anchor boxes with which the algorithm locates the lung. In this study, lung anchor boxes with three scales, {[78, 136], [129, 207], [163, 256]}, were generated. In the second phase, K-means is used again to generate one scaling ratio for each lung anchor box; therefore, PYolo obtains anchor boxes of three scales for lesions. Figure 5 shows the steps involved in generating the scaling ratio: the lesion bounding boxes and lung bounding boxes are grouped into three clusters through K-means clustering, and the mean IOU between the two kinds of bounding box in each cluster is calculated to obtain the scaling ratio.
Figure 5. Schematic of scaling ratio generation. The black dots represent the cluster centers; mIOU is the average value of the IOU between the black point and all the red points in each cluster, and this value is taken as the scaling ratio.
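One possible reading of the two phases is sketched below; it is not the authors' implementation. The sketch assumes that boxes are clustered by width and height with 1 - IOU as the distance (as is common for Yolo anchors), that each lesion box joins the cluster of its closest lung anchor, and that the lesion anchor is the lung anchor with both sides multiplied by the scaling ratio. The box lists are synthetic toy data invented for the example, chosen so that the cluster centers land on the reported lung anchors.

```python
def wh_iou(a, b):
    """IOU of two (w, h) boxes aligned at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def nearest(box, centers):
    """Index of the center with the highest IOU (smallest 1 - IOU distance)."""
    return max(range(len(centers)), key=lambda i: wh_iou(box, centers[i]))


def kmeans_wh(boxes, k, max_iters=100):
    """Lloyd's K-means on (w, h) pairs with a deterministic initialization."""
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            clusters[nearest(box, centers)].append(box)
        updated = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def double_kmeans(lung_boxes, lesion_boxes, k=3):
    """Phase 1: lung anchors. Phase 2: one mean-IOU scaling ratio per lung anchor."""
    lung_anchors = kmeans_wh(lung_boxes, k)
    ratios = []
    for i, anchor in enumerate(lung_anchors):
        members = [b for b in lesion_boxes if nearest(b, lung_anchors) == i]
        ratios.append(sum(wh_iou(b, anchor) for b in members) / len(members) if members else 1.0)
    lesion_anchors = [(w * r, h * r) for (w, h), r in zip(lung_anchors, ratios)]
    return lung_anchors, ratios, lesion_anchors


# Synthetic (w, h) boxes, invented for illustration only.
toy_lungs = [(80, 140), (76, 132), (130, 205), (128, 209), (160, 250), (166, 262)]
toy_lesions = [(60, 100), (100, 180), (140, 230), (150, 250)]
lung_anchors, ratios, lesion_anchors = double_kmeans(toy_lungs, toy_lesions)
```

Because each ratio is a mean IOU, it lies between 0 and 1, so every lesion anchor in this sketch is no larger than the lung anchor it is derived from.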

MaskFPN for Suppression of Information
MaskFPN is an improved form of the attention mechanism. In the study by Hu et al. [22], the channel-wise weighting obtained by global pooling was too coarse. The MaskFPN proposed in this paper instead generates pixel-wise weights, assigning a weight to each pixel of the low-level feature map to suppress the output of inaccurate semantic information in low-level features. As shown in Figure 6, MaskFPN multiplies a set of feature maps C = {C1, C2, C3, C4} element-wise with a set of weight maps W = {W1, W2, W3, W4}, where Ci and Wi have the same dimensions, i ∈ {1, 2, 3, 4}, and the weight value of each pixel in a weight map lies in the interval [0, 1]. For a pixel with a small weight value, the information at the corresponding position in the low-level feature is strongly suppressed; for a pixel with a large weight value, it is only weakly suppressed. In PYolo, each MaskFPN consists of two convolutional layers, a batch normalization layer, a leaky ReLU activation function, and a squeezing function. Figure 7 shows the specific flow of MaskFPN.

The feature map F1 passes through the first convolution layer, batch normalization, leaky ReLU, and the second convolution layer to generate the feature and weight maps. Then, each weight value in the weight map is converted into a value in the [0, 1] range through convolution and the squeezing function. The product of the weight map and feature map F2 is combined with the high-level features to generate F4.
Table 1 presents the specific parameter settings of MaskFPN. The combination of kernel size, stride, padding, and dilation rate in MaskFPN does not reduce the resolution of the feature map; it only transforms the feature information along the channel dimension.
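The pixel-wise suppression step can be illustrated with a minimal sketch. It assumes a sigmoid squeezing function, a single-channel 2 × 2 feature map stored as nested lists, and hypothetical mask logits; the convolution layers that would produce the logits and the final fusion with the high-level features are omitted, and the function names are illustrative.

```python
import math


def squeeze(z):
    """Sigmoid squeezing function that maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def apply_mask(low_level, mask_logits):
    """Weight every pixel of a low-level feature map by its squeezed mask value."""
    return [[f * squeeze(z) for f, z in zip(f_row, z_row)]
            for f_row, z_row in zip(low_level, mask_logits)]


low_level = [[2.0, 2.0], [2.0, 2.0]]        # hypothetical low-level feature map
mask_logits = [[-20.0, 0.0], [0.0, 20.0]]   # hypothetical output of the mask branch
masked = apply_mask(low_level, mask_logits)
```

The pixel with a strongly negative logit is almost removed, the pixels with a logit of zero are halved, and the pixel with a strongly positive logit passes through nearly unchanged.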

Dilated Convolution for Capturing Information in Multiple Receptive Fields
Humans usually rely on related feature information to infer objects that are difficult to recognize. In the dataset used in this study, there is no significant difference between pneumonia lesions and non-lesion tissue, so two parallel dilated convolution layers were introduced for the PYolo algorithm to capture global information and increase the prediction ability of the algorithm. Dilated convolution expands the receptive field of the kernel by inserting zeros between its elements. Its advantage is that the kernel obtains a larger field of view without down-sampling. When the features of the pneumonia lesion are not obvious, they are easily lost during down-sampling, which decreases the accuracy of the algorithm. The algorithm therefore uses dilated convolution to enlarge its receptive field and avoid the loss of semantic and spatial information caused by down-sampling. The dilation rate is usually selected so that the resolution of the feature map is not reduced and multiple receptive fields are covered. A kernel with a smaller receptive field obtains local feature information, while a kernel with a larger receptive field obtains global feature information [23]. However, too large a dilation rate can degrade the performance of the kernel, because its receptive field becomes too large to capture the local dependencies in the image. In addition, too many dilated convolution branches increase the number of dot products to be computed, which slows the forward propagation of the algorithm. The proposed algorithm uses four convolution branches with different dilation rates. Table 2 shows the parameter settings of the convolution layers.
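The principle of not reducing the resolution of the feature map can be checked with the standard convolution output-size formula. The sketch below assumes 3 × 3 kernels with a stride of 1 and hypothetical dilation rates of 1 to 4; the actual branch settings are those of Table 2, and the function names are illustrative.

```python
def conv_output_size(size, kernel=3, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def same_padding(kernel, dilation):
    """Padding that preserves resolution for a stride of 1 and an odd kernel size."""
    return dilation * (kernel - 1) // 2


# Output sizes for a 13 x 13 feature map with hypothetical dilation rates.
preserved = [conv_output_size(13, padding=same_padding(3, d), dilation=d)
             for d in (1, 2, 3, 4)]
unpadded = [conv_output_size(13, padding=0, dilation=d) for d in (1, 2, 3, 4)]
```

With padding equal to the dilation rate, every branch keeps the 13 × 13 resolution, so the branch outputs can be combined without resizing.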

Experimental Data
The dataset selected for the experiments consists of a total of 6000 chest X-ray images provided by the Radiological Society of North America (RSNA). Of these, 600 images were randomly selected as the test set, and the remaining 5400 images were used as the training set. The training set was augmented to 10,800 images using horizontal flipping only. The proportions of images with pneumonia lesions in the 10,800-image training set and the 600-image test set were 0.65 and 0.70, respectively. Each input image was originally a single-channel grayscale image, which was converted into a three-channel image with a resolution of 1024 × 1024 pixels during the image pre-processing phase. The bounding boxes of the lesions and lungs were labeled as (x, y, w, h), where x and y are the coordinates of the upper left corner of the object, and w and h are its width and height. The lung bounding boxes were manually marked by the author of this paper.
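Horizontal flipping must also be applied to the labels. A minimal sketch, assuming the (x, y, w, h) convention above with (x, y) as the upper left corner and an image stored as a list of pixel rows; the function names and the example box are hypothetical.

```python
def hflip_image(image):
    """Mirror an image, given as a list of pixel rows, left to right."""
    return [row[::-1] for row in image]


def hflip_box(box, image_width):
    """Mirror a box (x, y, w, h) whose (x, y) is the upper left corner."""
    x, y, w, h = box
    # The old right edge (x + w) becomes the new left edge, measured from the right.
    return (image_width - x - w, y, w, h)


flipped_rows = hflip_image([[1, 2, 3], [4, 5, 6]])
flipped_box = hflip_box((100, 200, 300, 400), image_width=1024)
restored_box = hflip_box(flipped_box, image_width=1024)
```

Only the x coordinate changes; flipping twice restores the original box.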

Experimental Settings
In this study, experiments were run with PyTorch 0.4, which was developed by Facebook in the United States. To increase the generalization ability of the model, the algorithm was initialized with pre-trained Yolov3 weights. The initial learning rate was 0.005 with a polynomial decay schedule, the optimizer was SGD [24][25][26] with the momentum set to 0.0005 and a weight decay of 0.0005, and the activation function was leaky ReLU [27]. BatchNorm [28] was used to mitigate vanishing gradients during the training phase and accelerate the convergence of the model. The batch size was set to six.
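A polynomial decay schedule can be sketched as follows. The initial learning rate of 0.005 is taken from the text; the power of 0.9, the final learning rate of 0, and the total number of steps are assumptions made for illustration, since they are not stated here.

```python
def poly_lr(step, total_steps, base_lr=0.005, power=0.9, end_lr=0.0):
    """Polynomial decay: base_lr at step 0, end_lr at total_steps."""
    step = min(max(step, 0), total_steps)
    remaining = 1.0 - step / total_steps
    return (base_lr - end_lr) * remaining ** power + end_lr


# Learning rate at a few points of a hypothetical 1000-step run.
schedule = [poly_lr(s, total_steps=1000) for s in (0, 250, 500, 750, 1000)]
```

The learning rate starts at 0.005 and decreases monotonically to the final value as training proceeds.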

Performance Indicators
The accuracy index of this study was defined as the average precision (AP) calculated from precision and recall.
Precision indicates the percentage of predicted positive samples that are actually positive. Predicted positive samples come from two sources: true positives (TP), i.e., positive samples predicted as positive, and false positives (FP), i.e., negative samples predicted as positive. Therefore, precision is calculated as P = TP/(TP + FP).
Recall refers to the percentage of all positive samples that are predicted correctly. The positive samples comprise the true positives (TP) and the false negatives (FN), i.e., positive samples predicted as negative. Recall is calculated as R = TP/(TP + FN).
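The two definitions and their combination into AP can be sketched as follows. The text does not state how the precision-recall curve is integrated, so the sketch assumes the all-point interpolation used in common detection benchmarks; the detections in the example are hypothetical.

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(detections, num_ground_truth):
    """Area under the interpolated precision-recall curve.

    detections: list of (confidence, is_true_positive) pairs.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    curve = []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        curve.append((recall(tp, num_ground_truth - tp), precision(tp, fp)))
    ap, previous_recall = 0.0, 0.0
    for i, (r, _) in enumerate(curve):
        # Interpolated precision: best precision at this recall or any higher recall.
        best_precision = max(p for _, p in curve[i:])
        ap += (r - previous_recall) * best_precision
        previous_recall = r
    return ap


# Three hypothetical detections evaluated against two ground-truth lesions.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
```

In this example the first half of the recall range is covered at a precision of 1 and the second half at a precision of 2/3, giving an AP of 5/6.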

Ablation Experiment
To verify the effectiveness of the three improvements proposed in this study, Yolov3 was used as the baseline, and each improvement was combined with Yolov3 in the ablation experiment. Figure 8a,b shows the effects of K-means and double K-means on detection effectiveness. The plotted value is the IOU between the predicted bounding boxes retained after non-maximum suppression (NMS) and the ground-truth bounding boxes, divided by the number of predicted bounding boxes. Panel (a) shows that the IOU increases gradually with the number of iterations. However, at the beginning of training, the IOU between the lesion bounding box predicted by Yolov3 and the ground-truth bounding box was very low. According to our analysis, in the early phase of training, the pulmonary bounding box predicted by Yolov3 was inaccurate, so the anchor box of the lesion was not accurate enough; therefore, the IOU of the predicted lesion bounding box was low. Panel (b) shows that, during training, the loss of Yolov3 in locating the lesion is higher than that in locating the lungs. The lung features are relatively obvious and easier for PYolo to locate, whereas the pneumonia lesions resemble normal lung tissue, which increases the learning difficulty for PYolo.
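The quantity plotted in Figure 8 can be approximated with the following sketch. It assumes greedy NMS with an IOU threshold of 0.5, boxes in the (x, y, w, h) upper-left-corner convention of the dataset, and matching of each retained prediction to its best-overlapping ground-truth box; these choices, the function names, and the example boxes are assumptions, not details given in the paper.

```python
def iou(a, b):
    """IOU of two boxes given as (x, y, w, h) with (x, y) the upper left corner."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (score, box) pairs; returns the retained detections."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


def mean_iou_after_nms(detections, gt_boxes, iou_threshold=0.5):
    """Sum of best IOUs of retained predictions divided by their number."""
    kept = nms(detections, iou_threshold)
    if not kept or not gt_boxes:
        return 0.0
    return sum(max(iou(box, gt) for gt in gt_boxes) for _, box in kept) / len(kept)


predictions = [
    (0.9, (100, 100, 200, 200)),
    (0.8, (110, 110, 200, 200)),  # near-duplicate of the first box
    (0.7, (600, 600, 100, 100)),  # false alarm far from the lesion
]
retained = nms(predictions)
score = mean_iou_after_nms(predictions, gt_boxes=[(100, 100, 200, 200)])
```

The near-duplicate box is suppressed, and the remaining false alarm halves the mean IOU, which shows why inaccurate early predictions keep the curve low.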

To make the evaluation more reliable, the algorithm was tested at three different IOU thresholds, and the mAP in Table 3 is the average of the results at thresholds of {0.4, 0.5, 0.6}. The '@' symbol indicates the IOU threshold at which the accuracy of the algorithm is tested. The AP of Yolov3 with double K-means is higher than that with K-means at each of the three thresholds. The size of the anchor box generated by double K-means varies with the input data, which gives it greater flexibility. However, as can be seen from Figure 8, the performance of the double K-means proposed in this paper depends on the accuracy of the algorithm in locating the lung.
Figure 9 and Table 4 show the performance of sigmoid, tanh, and softmax as the squeezing function for the last layer of MaskFPN. Table 4 shows that MaskFPN with a sigmoid function gave the largest improvement in the detection performance of Yolov3 relative to FPN, whereas MaskFPN with a softmax function performed worse than FPN. Figure 9 also shows that the Yolov3 algorithm converged less well overall with the softmax function than with the tanh and sigmoid functions. According to the analysis of this study, the softmax function operates over a high-dimensional space, so each probability score output by MaskFPN is relatively low and the mask values in the weight map are small. This greatly reduces the information from the low-level features and results in insufficient feature expression ability after fusion.
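The explanation for the softmax result can be illustrated numerically. The sketch below is not the MaskFPN code: it assumes, purely for illustration, that the squeezing function is applied across the positions of a hypothetical 52 × 52 feature map with synthetic logits. Because softmax outputs must sum to 1, their mean is 1/N, whereas sigmoid outputs do not shrink as N grows.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Hypothetical 52 x 52 feature map, flattened; synthetic logits in [-1, 1].
logits = [math.sin(i) for i in range(52 * 52)]

softmax_mask = softmax(logits)
sigmoid_mask = sigmoid(logits)

# Compare the typical magnitude of the mask values under each function.
print(f"softmax mean = {sum(softmax_mask) / len(softmax_mask):.6f}")
print(f"sigmoid mean = {sum(sigmoid_mask) / len(sigmoid_mask):.6f}")
```

Under these assumptions, multiplying low-level features by the softmax mask scales them down by roughly three orders of magnitude, which matches the loss of low-level information described above.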

Double K-Means
Table 5 shows the detection performance of Yolov3 with dilated convolution branches. As seen from the data in the table, the detection performance of the algorithm gradually improves as the expansion rate increases.
The mAP of Yolov3 increased by 2.20% when the dilation rates were {1, 3, 6, 12}, compared with a dilation rate of 1 alone. Adding convolution branches with different dilation rates allows the algorithm to obtain information from more receptive fields. Because of hardware limitations, it was not possible in this study to explore additional dilated convolution branches in the detection algorithm. It has been suggested [15], however, that excessive dilation prevents the kernel from capturing local spatial image correlation, so the kernel effectively degenerates to a 1 × 1 size, which prevents further improvement of the detection performance of the algorithm.
Table 6 shows the effect of double K-means, MaskFPN and dilated convolution on Yolov3 with λ = 200. The mAP of the algorithm improves when double K-means, MaskFPN and dilated convolution are used alone or together with Darknet53. The parameter settings of MaskFPN and dilated convolution are the same as in Tables 1 and 2.
We also evaluated the effect of the hyper-parameter λ, which controls the negative samples in the training phase and is important in overcoming the imbalance between positive and negative samples. Table 7 shows the detection accuracy of PYolo for different values of λ. We controlled the learning intensity of the negative samples by setting different values of λ: the larger the value of λ, the greater the loss of the negative samples and the greater the learning intensity of the algorithm on the negative samples. As Table 7 shows, when λ = 50, the AP of the algorithm was still very low after 200k iterations because many negative samples were predicted as positive samples, resulting in low accuracy. When λ = 200, the algorithm reached its highest accuracy after 90k iterations. When λ = 250, the prediction accuracy started to decrease after 90k iterations because the λ value was too large and the gradient direction was dominated by negative samples, resulting in poor learning of the positive samples.
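The trade-off controlled by λ can be illustrated with a toy calculation. The exact loss used by PYolo is not restated in this section, so the sketch assumes a generic form, total = positive loss + λ × negative loss, with made-up loss values; the function name and the numbers are hypothetical and serve only to show how the share of the loss coming from negative samples grows with λ.

```python
def negative_loss_share(pos_loss: float, neg_loss: float, lam: float) -> float:
    """Fraction of the total loss due to negatives, assuming total = pos + lam * neg."""
    weighted_neg = lam * neg_loss
    return weighted_neg / (pos_loss + weighted_neg)


# Hypothetical per-batch values: a few positives with a large loss and
# many easy negatives whose averaged loss is small.
pos_loss, neg_loss = 1.0, 0.01

# The lambda values are those discussed in the text for Table 7.
for lam in (50, 200, 250):
    share = negative_loss_share(pos_loss, neg_loss, lam)
    print(f"lambda = {lam}: negatives contribute {share:.2f} of the total loss")
```

Under this assumed form, a small λ leaves the negatives under-weighted, which is consistent with many negatives being predicted as positives, while a large λ lets the negatives dominate the gradient.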

Comparison of Detection Performance of Different Algorithms
As Table 8 shows, the average precision of PYolo at each IOU threshold was higher than that of the other algorithms. Faster RCNN is a two-stage algorithm, while SSD, Yolov3 and PYolo are one-stage algorithms. Faster RCNN uses the RPN, which controls the proportion of positive and negative samples well; therefore, the average precision of Faster RCNN was higher than that of SSD and Yolov3. PYolo improves on Yolov3 in feature fusion. Although its mean AP (mAP) was higher than that of Faster RCNN, PYolo was not able to avoid the imbalance of positive and negative samples. Figure 10 shows the detection results of the different algorithms. The localization accuracy of PYolo was higher than that of SSD and Yolov3, and it had a slight advantage over Faster RCNN. However, for the last image, all four algorithms exhibited false detections and missed detections.
The paired t test results show that PYolo has a significant improvement over Yolov3 in detection performance. Table 9 shows the test results for Yolov3 and PYolo for 600 images. During the testing phase, the 600 images were divided into 10 groups and the detection accuracy of the algorithms was tested on each group. The test statistic t was calculated to be 2.687, and the critical values for 9 degrees of freedom, t0.05(9) = 2.262 and t0.01(9) = 3.250, were obtained from a table of t values. Because 2.262 < |t| = 2.687 < 3.250, the p-value lies between 0.01 and 0.05. According to this p-value range, the detection performance of PYolo was significantly better than that of Yolov3 at the 0.05 level.
In order to compare the performance of PYolo with other pneumonia classification algorithms, the location information of each image detection result was ignored, and the accuracy of the classification results was determined. A prediction was judged to be correct in two cases: the confidence of at least one predicted bounding box was greater than or equal to 0.3 and the true label of the image was pneumonia, or the confidence of all predicted bounding boxes was less than 0.3 and the true label of the image was a normal lung. Table 10 shows the detection results for the different algorithms. The evaluation index was the ratio of the number of correctly classified images to the number of classified images. As Table 10 shows, CheXNet had the highest classification accuracy because it uses the pre-trained model provided in the study [13], but this model cannot provide location information. The classification accuracy of PYolo was 81.0, which was higher than that of the remaining algorithms. The feature extraction network of PYolo and Yolov3 is DarkNet53, which is deeper than the VGG16 used by Faster RCNN, SSD and CNN + SVM.
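The classification criterion described above can be sketched as follows. The 0.3 confidence threshold and the accuracy definition come from the text; the function names, the data layout (one list of box confidences per image) and the sample values are hypothetical.

```python
from typing import List, Sequence

CONFIDENCE_THRESHOLD = 0.3  # threshold stated in the text


def predict_pneumonia(confidences: Sequence[float],
                      threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """An image is classified as pneumonia if any predicted box reaches the threshold.

    An image with no predicted boxes is classified as a normal lung.
    """
    return any(c >= threshold for c in confidences)


def classification_accuracy(detections: List[Sequence[float]],
                            labels: List[bool]) -> float:
    """Ratio of correctly classified images to the number of classified images."""
    correct = sum(
        predict_pneumonia(confs) == is_pneumonia
        for confs, is_pneumonia in zip(detections, labels)
    )
    return correct / len(labels)


# Hypothetical detector outputs: one list of box confidences per image.
detections = [[0.72, 0.10], [0.05], [], [0.31]]
labels = [True, False, False, False]  # True = pneumonia, False = normal lung
print(classification_accuracy(detections, labels))
```

In this made-up sample the last image is a false positive, since a single box with confidence 0.31 is enough to label a normal lung as pneumonia.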

Conclusions
This study first analyzes the problem of small characteristic differences in X-ray images of pneumonia lesions and proposes an improved end-to-end pneumonia detection algorithm based on Yolov3. The three main improvements offered by the proposed algorithm are as follows: the use of MaskFPN to generate pixel weights that suppress the output of inaccurate semantic information in low-level features, the use of dilated convolution to enhance the detection of pneumonia lesions, and the generation of a lesion anchor box with double K-means. The comparison experiment on the RSNA dataset showed that MaskFPN, dilated convolution and double K-means improved the ability to detect pneumonia lesions. We also demonstrated how to configure the parameters for MaskFPN and dilated convolution to facilitate further development of the algorithm.