Fusion High-Resolution Network for Diagnosing ChestX-ray Images

Abstract: The application of deep convolutional neural networks (CNNs) in the field of medical image processing has attracted extensive attention and demonstrated remarkable progress. An increasing number of deep learning methods have been devoted to classifying ChestX-ray (CXR) images, and most of the existing methods are based on classic pretrained models trained on global ChestX-ray images. In this paper, we diagnose ChestX-ray images using our proposed Fusion High-Resolution Network (FHRNet). The FHRNet concatenates the global average pooling layers of the global and local feature extractors; it consists of three branch convolutional neural networks and is fine-tuned for thorax disease classification. Compared with other available methods, our experimental results show that the proposed model yields better disease classification performance on the ChestX-ray 14 dataset, according to the receiver operating characteristic curve and area-under-the-curve score. An ablation study further confirmed the effectiveness of the global and local branch networks in improving the classification accuracy of thorax diseases.


Introduction
ChestX-rays (CXRs) are often included in routine physical examinations. Because it is rapid, simple and economical, X-ray photography has become the most popular method for performing chest examinations [1]. A ChestX-ray can clearly record gross lesions of the lungs, including pneumonia, masses and nodules. In current medical practice, however, CXR images are mainly interpreted by radiologists through manual reading. A senior radiologist needs at least 10 min to read a patient's ChestX-ray image and make a diagnosis, and different doctors can make inconsistent diagnoses of the same image, which means that the results are affected by the radiologist's cognitive ability, subjective experience, fatigue and other factors [2]. Computer-aided diagnosis (CAD) can overcome the deficiencies of radiologists.
The classic pretrained models, e.g., AlexNet [38], VGGNet [39], ResNet [40] and DenseNet [41], all take a CXR image resized to 224 × 224 × 3 as the input. The model encodes the image into C feature maps of size S × S and outputs them to the transition layer. Each feature map is reduced to 1 × 1 × D by the transition layer and then transformed into a D-dimensional feature vector by the sampling layer. A sigmoid function is then applied to the output of the fully connected layer to produce probability scores for the 14 thorax diseases.
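The classification head described above can be illustrated with a small sketch. It assumes that the reduction of each feature map is a global average pooling step, and it uses toy sizes (`C = 8`, `S = 7`) and random weights that are purely illustrative and not taken from any of the cited models.

```python
import math
import random


def global_average_pool(feature_maps):
    """Reduce C feature maps of size S x S to a C-dimensional vector."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(feature_maps, weights, biases):
    """Fully connected layer followed by an element-wise sigmoid."""
    pooled = global_average_pool(feature_maps)
    return [
        sigmoid(sum(w * v for w, v in zip(row, pooled)) + b)
        for row, b in zip(weights, biases)
    ]


# Toy sizes and random parameters (assumptions for illustration only).
rng = random.Random(0)
C, S, NUM_DISEASES = 8, 7, 14
feature_maps = [[[rng.random() for _ in range(S)] for _ in range(S)] for _ in range(C)]
weights = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(NUM_DISEASES)]
biases = [0.0] * NUM_DISEASES

probs = classify(feature_maps, weights, biases)  # one independent score per disease
```

Because each score passes through its own sigmoid, the 14 outputs do not need to sum to one, which suits a multilabel setting in which one image can show several diseases.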
Medical disease diagnosis, however, often requires finding abnormal disease information in a few dozen pixels of an image with millions of pixels in order to make an accurate judgement. Artificially downsampling the image, or discarding pixels, can lose disease information and cause missed diagnoses and misdiagnoses, potentially delaying the patient's treatment.
Electronics 2020, 9, 190
In this paper, to take full advantage of neural network architectures and fuse image representation features, we adopt a fusion convolutional neural network and introduce the classification layer into a high-resolution network (HRNet) to improve the classification of CXR images. An illustration of HRNet is provided in Figure 2. Specifically, four high-resolution feature maps are first fed into a bottleneck, and the number of output channels is increased to 64, 128, 256 and 512. The high-resolution representations are then downsampled by a 3 × 3 convolution layer with stride 2, which results in 128 channels. Then, all the channels are merged into the second-level high-resolution representations, and this process is conducted twice to obtain 256 channels at the low resolution. Finally, the 512 channels are transformed into 1024 channels through one 1 × 1 convolution, which is followed by a global average pooling operation. The output 1024-dimensional representation is fed into the classifier [42].
In summary, our contributions in this work are as follows. First, we propose the fusion high-resolution network as a feature extractor, which produces competitive results compared with those of other advanced methods. Second, we introduce a fusion CNN that diagnoses ChestX-ray images by combining local and global cues. The FHRNet improves the performance of thorax disease classification by reducing the impact of noise and highlighting lung regions. Third, we conduct a comparative experiment based on the ChestX-ray 14 dataset. The classification results show that the FHRNet model achieves better performance than other available approaches.
Figure 2. The FHRNet is composed of four parallel high-to-low resolution subnetworks that repeatedly exchange information across multiresolution subnetworks. The vertical and horizontal directions correspond to the scale of the feature maps and the depth of the network, respectively.

Dataset
Wang et al. [25] released the ChestX-ray 14 dataset in October 2017, and it is the largest available ChestX-ray dataset by far. The ChestX-ray14 dataset includes 112,120 CXR images, involving 30,805 patients. The pixel size of every CXR image is 1024 × 1024, and all images are saved in PNG format, with an 8-bit greyscale value. Every image is labelled with 14 different thorax diseases, with features extracted from radiologist reports. The ground truth data are mined and labelled through natural language processing (NLP) from patient diagnostic reports, and the label accuracy is estimated to be greater than ninety percent. Among the 112,120 ChestX-ray images, 51,708 images contained one or more diseases, and the remaining 60,412 images were considered normal and labelled "No Finding". An image example is shown in Figure 3.

The ChestX-ray 14 dataset supports multilabel classification and is large enough for deep learning; therefore, it was used to evaluate and validate the FHRNet model. In this experiment, we divided the whole dataset into a training set (75,708 images in total), a validation set (10,816 images in total) and a test set (25,596 images in total), at the hospital scale. All images from the same patient appeared in only one of the training, validation and test sets.
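A patient-wise split of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_by_patient`, the fractions (applied here to patients rather than images) and the synthetic records are all assumptions.

```python
import random
from collections import defaultdict


def split_by_patient(records, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split (image_name, patient_id) records so that no patient spans two sets.

    Returns three lists of image names: training, validation and test.
    """
    by_patient = defaultdict(list)
    for image_name, patient_id in records:
        by_patient[patient_id].append(image_name)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    groups = (
        patients[:n_train],
        patients[n_train:n_train + n_val],
        patients[n_train + n_val:],
    )
    return [[img for p in group for img in by_patient[p]] for group in groups]


# Synthetic example: 100 hypothetical patients with one to three images each.
records = [(f"img_{p}_{k}.png", p) for p in range(100) for k in range(1 + p % 3)]
train, val, test = split_by_patient(records)
```

Splitting on patient identifiers rather than on images avoids near-duplicate images of one patient leaking from the training set into the evaluation sets.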

Network Framework
As shown in Figure 4, the proposed FHRNet has three branches: the local feature extractor, the global feature extractor and the feature fusion module. The local and global feature extractors are disease classification networks that output disease classification probabilities from the corresponding images. The global feature extractor takes the whole image as input, whereas the input image of the local feature extractor is a small lung region that is cropped using a mask inferred from the global feature extractor. Two HRNets were adapted to obtain the distinguishing features of the local lung region and the whole image, respectively.

Each HRNet is followed by a global average pooling layer, a fully connected layer, a sigmoid layer and a loss function. The feature fusion module concatenates the outputs of the global average pooling layers of the two feature extractors and is then fine-tuned to make the final classification prediction.
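The cropping step for the local branch can be illustrated with a short sketch. It assumes that the mask is a binary grid of the same size as the image and that the local region is the tight bounding box of the non-zero mask entries; both the helper names and this bounding-box rule are assumptions, since the text does not specify how the crop is derived from the mask.

```python
def mask_bounding_box(mask):
    """Return (top, bottom, left, right) of the non-zero region; bottom/right are exclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_local_region(image, mask):
    """Crop the image to the bounding box of the mask."""
    top, bottom, left, right = mask_bounding_box(mask)
    return [row[left:right] for row in image[top:bottom]]


# Toy 4 x 4 image and mask (illustrative values only).
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
local = crop_local_region(image, mask)
```

In practice the cropped region would be resized to the input resolution of the local feature extractor before being passed to it.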

Network Structure
Building a deep learning model that classifies CXR images from multibranch images usually takes three steps: feature extraction, feature fusion and classification prediction. These steps are described below.
Feature extraction from multibranch images. Determining how to better extract features from multiview medical images is one of the main research topics in the field of medical image processing based on deep learning methods [43]. Although a variety of useful hand-crafted features, for example HOG, LBP and SIFT, have been extracted from multiview medical images, classification predictions based on these features can suffer from incompatibility problems; that is, the extracted features cannot effectively classify and predict specific organs or diseases. Feature extraction based on a CNN addresses these problems. With the continuous development of attention mechanisms, feature extraction from multiview medical images has become increasingly effective [44].
Taking the feature extraction network $f$ as an example, HRNet can be used to extract features. Suppose that the network can be expressed as follows:
$$f(x, \theta) = W_L\, a_{L-1}\big(W_{L-1} \cdots a_1(W_1 x + b_1) \cdots + b_{L-1}\big) + b_L,$$
in which $\theta := \{W_1, b_1, \ldots, W_{L-1}, b_{L-1}, W_L, b_L\}$ are the parameters of network $f$, $a_l$ ($1 \le l < L$) represents the activation function of the $l$th layer and $x$ represents the input of the network $f$. $f(x, \theta)$ represents an output that is not processed by the activation function of the last layer [45]. The overall output of the network is as follows:
$$O = A(f(x, \theta)),$$
in which $A$ represents the activation function of feature extraction network $f$ [46]. As shown in Figure 2, the input of feature extraction network $f$ includes the global input image $x_g$ and the local input image $x_l$, and the $i$th local input image is represented as $x_l^i = m^i \odot x_g^i$. Therefore, according to the definition of the feature network, the global features and local features can be expressed as follows:
$$O_g = A(f_g(x_g, \theta)), \qquad O_l = A(f_l(x_g \odot m, \theta)).$$
Feature fusion from multibranch images. To use the images of different branches for classification prediction, it is necessary to construct unified fusion features to share the features of different branches. After different deep neural networks extract the features from different branch images, the shared fusion features can be obtained by directly concatenating the features from the three branches, in which $w_i$ ($1 \le i \le 2$) represents the weight of the feature that is extracted from the $i$th network in the fusion feature [47]. It is not difficult to see that the features extracted from the three branches will result in feature redundancy. An attention mechanism can be used to reduce feature redundancy. That is, adding a random mask after the last activation layer and removing redundant features can increase the classification accuracy.
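The weighted concatenation and the random mask mentioned above can be sketched as follows. This is only one possible reading: the function names, the default weights and the dropout-like interpretation of the "random mask" (`keep_prob`) are assumptions, not details given in the text.

```python
import random


def fuse(global_features, local_features, w_global=1.0, w_local=1.0):
    """Concatenate the weighted global and local feature vectors."""
    return [w_global * v for v in global_features] + [w_local * v for v in local_features]


def apply_random_mask(features, keep_prob, rng):
    """Zero each feature independently with probability 1 - keep_prob (assumed reading)."""
    return [v if rng.random() < keep_prob else 0.0 for v in features]


fused = fuse([1.0, 2.0], [3.0], w_global=0.5, w_local=2.0)  # [0.5, 1.0, 6.0]
masked = apply_random_mask(fused, keep_prob=0.8, rng=random.Random(0))
```

With two 1024-dimensional pooled vectors, as produced by the HRNet heads described earlier, the fused vector would have 2048 entries.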
Classification prediction. At present, the prediction of lung disease is a multiclassification task that usually adopts the softmax classification function. The classification function is expressed as
$$p_i = \frac{\exp\big((WO)_i\big)}{\sum_{k=1}^{14} \exp\big((WO)_k\big)},$$
in which $O$ represents the fusion feature, $W$ represents the mapping matrix that is used to map the high-dimensional fusion feature to a low-dimensional probability distribution representing the disease information and $p_i$ ($1 \le i \le 14$) represents the probability of identifying the $i$th disease [48].
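A minimal sketch of this classification step is given below. The names `softmax` and `predict` and the toy matrix are illustrative assumptions; the maximum is subtracted before exponentiation only for numerical stability, which does not change the result.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(fusion_feature, W):
    """Map the fusion feature O to class probabilities p = softmax(W O)."""
    logits = [sum(w * o for w, o in zip(row, fusion_feature)) for row in W]
    return softmax(logits)


# Toy example: a 3-dimensional fusion feature mapped to 14 classes.
O = [0.2, -0.1, 0.4]
W = [[0.1 * (i + 1), -0.05 * i, 0.02 * i] for i in range(14)]
p = predict(O, W)
```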
To dynamically determine the weights of the three features and further improve the prediction accuracy, global and local consistency classification methods can be used. That is, three classifiers for the global, local and fusion features are trained and alternately optimised for classification prediction, in which $p_i^j$ represents the probability that the $j$th network predicts the $i$th disease. According to the mechanism of global-local consistency, the probabilities of patients suffering from the 14 diseases are $p_1^1 p_1^2,\ p_2^1 p_2^2,\ \ldots,\ p_{14}^1 p_{14}^2$. Because each factor lies between 0 and 1, the final diagnosis probability is small, and it is processed by a logarithmic function to make it useful to doctors.
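The global-local consistency score can be sketched in a few lines. The function name and the small `eps` floor that guards against taking the logarithm of zero are assumptions; the product and the logarithm follow the description above.

```python
import math


def consistency_scores(p_global, p_local, eps=1e-12):
    """Return log(p_global[i] * p_local[i]) for every disease i."""
    return [
        math.log(max(pg, eps)) + math.log(max(pl, eps))
        for pg, pl in zip(p_global, p_local)
    ]


# Toy probabilities for three diseases (illustrative values only).
scores = consistency_scores([0.9, 0.5, 0.1], [0.8, 0.5, 0.2])
```

Summing the two logarithms is equivalent to taking the logarithm of the product, and it keeps the ranking of the diseases unchanged because the logarithm is monotonic.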

Experimental Setting
In all pretrained models, input images are expected to be normalised in the same way, i.e., as minibatches of three-channel RGB images (3 × H × W), in which H and W are expected to be no less than 224. All images in the ChestX-ray 14 dataset are 1024 × 1024, with an 8-bit greyscale value. We split the dataset into the training set (78,468 images of 21,528 patients), the validation set (11,219 images of 3090 patients) and the test set (22,433 images of 6187 patients), with no patient appearing in more than one set. We converted these greyscale images to three-channel RGB images, cropped them to a 224 × 224 resolution at the centre and then normalised these images by the means ([0.485, 0.456, 0.406]) and standard deviations ([0.229, 0.224, 0.225]).
We trained the model with the Adam optimiser and set the initial learning rate and batch size to $1.0 \times 10^{-4}$ and 32, respectively. We completed the training procedure after 50 epochs. After each epoch, we evaluated the model on the validation and test sets and saved the model with the best classification performance. For multiclass classification, we used the receiver operating characteristic (ROC) curve and area-under-the-curve (AUC) score to assess the classification performance. The model weights associated with the best AUC scores, based on the validation set, were saved and used to extract representative features. In our experiment, we plotted the ROC curve for each thorax disease and calculated the AUC scores for the 14 diseases to evaluate the classification performance.
The FHRNet was implemented with the PyTorch 1.0 framework in Python 3.6 on an Ubuntu 16.04 server. The model was trained, validated and tested on an 8-core CPU and four TITAN V GPUs.
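The preprocessing steps described above can be sketched in plain Python. This is an illustration under stated assumptions: pixel values are scaled to [0, 1] by dividing by 255 before normalisation, the greyscale channel is simply replicated three times, and the function names and the synthetic image are hypothetical.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def centre_crop(image, size):
    """Crop a size x size window from the centre of a 2-D image."""
    h, w = len(image), len(image[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def preprocess(grey_image, size=224):
    """Return a 3 x size x size nested list: crop, replicate channel, normalise."""
    cropped = centre_crop(grey_image, size)
    return [
        [[(pixel / 255.0 - m) / s for pixel in row] for row in cropped]
        for m, s in zip(MEAN, STD)
    ]


# Synthetic 1024 x 1024 8-bit greyscale image (illustrative values only).
grey = [[(r + c) % 256 for c in range(1024)] for r in range(1024)]
tensor = preprocess(grey)
```

In a PyTorch pipeline the same steps would normally be expressed with library transforms; the nested lists here only make the arithmetic explicit.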

Results
The classification results for the existing methods and the FHRNet based on the ChestX-ray 14 dataset are presented in terms of the AUC scores in Table 1. The obtained ROC curves of the FHRNet for each of the 14 thorax diseases are shown in Figure 5.
Table 1. The area-under-the-curve (AUC) scores of existing methods and the FHRNet based on the ChestX-ray 14 dataset. The scores that displayed a relative increase are marked in bold.

Based on the published works of other researchers, including Wang [25], Yao [49] and Gundel [50], we recorded the AUC scores they obtained and compared them with those of the FHRNet on the ChestX-ray 14 dataset. We found that the FHRNet method achieved the expected effect and provided a superior classification performance. A numerical comparison of the results for the 14 classes of thorax diseases and the average AUC of each method is shown in Table 1. Compared with the three existing methods, the proposed method increased the average AUC by 8.98% (from 0.7451 to 0.812). Notably, for "Mass", the relative increase in the AUC score reached 19.3% (from 0.6933 to 0.827).
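The percentages quoted above are relative increases over the baseline AUC, which can be checked with a short calculation. The helper name `relative_increase` is an assumption introduced for illustration.

```python
def relative_increase(baseline, new):
    """Relative increase of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


average_gain = round(relative_increase(0.7451, 0.812), 2)  # 8.98
mass_gain = round(relative_increase(0.6933, 0.827), 1)     # 19.3
```

In absolute terms, these correspond to gains of 0.0669 and 0.1337 AUC points, respectively.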
From Table 1, a horizontal comparison shows that the existing methods and our model obtained different classification effects, even for the same thorax disease. Among the 14 diseases, 10 thorax diseases had the best average AUC scores with the FHRNet model: "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass", "Pneumothorax", "Consolidation", "Emphysema", "Fibrosis" and "Hernia". Table 1 also shows that the FHRNet model achieved the best average AUC score.
A vertical comparison shows that the existing methods and our model obtain different classification effects for the 14 thorax diseases. The most accurately identified thorax disease was "Hernia", with an AUC score of 0.916, and the least accurately identified disease was "Pneumonia", with an AUC score of 0.703.
We also plotted the ROC curves of the FHRNet for each of the 14 thorax diseases, as shown in Figure 5. We can observe that the ROC curve of "Pneumonia" was flatter than that of "Hernia", which means that the classification of "Pneumonia" was not as good as that of "Hernia".
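The AUC score used throughout this comparison can be computed directly from labels and predicted scores. The sketch below uses the pairwise-ranking definition (the probability that a randomly chosen positive image scores higher than a randomly chosen negative one, with ties counted as one half); the function name and the toy data are assumptions, and this quadratic-time version is meant for illustration rather than for a dataset of this size.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example for a single disease: four images, two of them positive.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

For a multilabel problem such as this one, the function would be applied once per disease column, and the 14 resulting values averaged to obtain the mean AUC.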

Discussion
The experimental results show that the proposed FHRNet provides excellent disease classification performance. Our method obtains satisfactory results because two significant structures are introduced: (1) a high-resolution network is adopted as a feature extractor to exchange image representation features and (2) the local and global branches of the ChestX-ray images are introduced to obtain the most useful features. To illustrate the effectiveness of the local and global branches in our method, we conducted a further ablation study, which yielded correspondingly different AUC scores. The results of the ablation study of the local and global branches are shown in Table 2.
We developed a three-branch convolutional neural network for diagnosing CXR images in this study. The fusion branch used two high-resolution networks to adaptively concentrate on pathologically abnormal regions, which improved the classification accuracy. The model made effective use of the fusion features extracted from both local lung region images and entire ChestX-ray images. If the fusion branch were eliminated, the performance of the FHRNet model would degrade. With reasonable confidence, we conclude that the fusion branch plays an important role in the FHRNet model. Among the existing methods that were trained only on the ChestX-ray 14 dataset, the FHRNet achieved good AUC scores for the 14 thorax diseases.

Conclusions
In this work, an innovative architecture, termed the FHRNet, was applied to classify 14 thorax diseases and diagnose ChestX-ray images. Unlike most previous networks, the FHRNet consists of four parallel high-to-low resolution subnetworks and repeatedly exchanges information via multiscale fusion processes. Two HRNets were trained as the local and global feature extraction branches, and the feature fusion module concatenated their features and was fine-tuned for the final prediction. Our experimental results for the ChestX-ray 14 dataset demonstrated the effectiveness and accuracy of the FHRNet model. Additional ablation studies showed that the local and global feature extraction branches affect the classification performance and improve the classification effect after fusion.
In our future work, we will focus on the pixel-level segmentation of the lung region, from CXR images, to further improve the classification performance. Then, we will train the model by using more than 180,000 images from the PLCO dataset [51] as extra training data for applying the model in computer-aided diagnosis.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: