Multiple Feature Integration for Classiﬁcation of Thoracic Disease in Chest Radiography

Abstract: The accurate localization and classification of lung abnormalities from radiological images are important for clinical diagnosis and treatment strategies. However, multilabel classification, wherein medical images are interpreted to point out multiple existing or suspected pathologies, presents practical constraints. Building a highly precise classification model typically requires a huge number of images manually annotated with labels and finding masks, which are expensive to acquire in practice. To address this intrinsically weakly supervised learning problem, we present the integration of different features extracted from shallow handcrafted techniques and a pretrained deep CNN model. The model consists of two main approaches: a localization approach that concentrates adaptively on the pathologically abnormal regions utilizing the pretrained DenseNet-121, and a classification approach that integrates four types of local features extracted from SIFT, GIST, LBP, and HOG, respectively, with deep convolutional CNN features. We demonstrate that our approaches efficiently leverage interdependencies among target annotations and establish state-of-the-art classification results for 14 thoracic diseases in comparison with current reference baselines on the publicly available ChestX-ray14 dataset.


Introduction
Lung diseases are the leading health-related cause of death worldwide. There is a vital need for screening, early detection, and personalized therapies for lung cancer due to the possibility of contracting it from simple thorax illnesses. Historically, chronic obstructive pulmonary disease, emphysema, chronic bronchitis, and pneumonia have been major causes of lung cancer [1]. Pneumonia affects approximately 450 million people, resulting in 4 million deaths per year [2]. Because it is low-cost and easily accessible, chest radiography, colloquially called chest X-ray (CXR), has become a common technique, making up nearly 45% of all radiological studies and covering the diagnosis of a wealth of pathologies. An enormous number of chest radiological images have been produced globally, which are currently analyzed through visual examination on a slice-by-slice basis. Meanwhile, each X-ray scan can comprise dozens of patterns corresponding to hundreds of potential lung diseases, resulting in difficulty of interpretation, a high disagreement rate between radiologists, and unnecessary follow-up procedures. Dealing with such a large amount of data demands a high degree of expertise and concentration, which motivates us to exploit the integration of both handcrafted features and deep features for the 14 thoracic disease classification tasks. To sum up, the main contributions of this study are as follows.

• We utilize the efficient pretrained DenseNet-121 to visualize the class activation map (CAM) of pathological abnormalities from the ChestX-ray14 dataset.
• We extract different types of shallow and deep features and select the coded features to generalize the best feature combination by extensive experiments.
• We compare the classification results from different classifiers trained in a supervised manner.
The remainder of this work is organized as follows. Section 2 briefly reviews related work on CXR disease classification. In Section 3, we describe our proposed approaches for CAM visualization and multiple feature integration for the classification tasks in detail. Section 4 introduces the ChestX-ray14 dataset and summarizes our experimental results. The work concludes in Section 5 with a short discussion and future work.

Related Works
Deep learning techniques have led to profound breakthroughs in various computer vision applications, such as the classification of natural and medical images [7][8][9][10][11][12][13][14][15][16][17]. This success has prompted many researchers to adopt deep CNNs for the diagnosis of thoracic diseases on CXR images. TUNA-Net [36] presented unsupervised domain adaptation for distinguishing pneumonia patients from normal patients and achieved an AUC of 96.3%. Moreover, a CNN with attention feedback (CONAF) was presented by the authors of [37]. They first extracted saliency maps from a repository of over 430,000 X-ray images to see whether localization errors occurred and back-propagated when necessary. Then, a recurrent attention model learned to observe a short sequence of smaller image portions to further improve the localization of abnormal regions. In addition, generative adversarial networks (GANs) [38] were applied to create artificial images based on a modestly sized labeled dataset. A CNN was then trained to detect pathologies among five classes of CXR images with a substantial improvement in classification.
There have been many studies attempting to achieve outstanding results of both localization and classification tasks using deep learning applied to the ChestX-ray14 dataset. A unified weakly supervised multilabel classification framework was proposed in [4]. Their objective was first to check if one or more pathologies were present in each X-ray image of the ChestX-ray8 dataset and then localize them using the activation and weights extracted from the DCNN. After employing different pretrained models, e.g., the AlexNet [7], GoogleNet [39], VGGNet-16 [8], and ResNet-50 [40], disregarding the fully connected layers and the soft-max classification layers, they inserted a transition layer, a global pooling layer, a prediction layer, and a loss layer. This allowed the combination of deep activations from the transition layer and the weights of the prediction inner-product layer, enabling them to find the plausible spatial locations.
Because there can be multiple pathological patterns on each chest radiograph, the authors of [41] developed an approach to further leverage the interdependencies among target labels in predicting the 14 pathological patterns and achieved better performance. They verified that their carefully designed LSTM model, even without pretraining, outperformed a baseline model that ignored the label dependencies by a large margin. Similarly, the authors of [42] experimented with a set of deep learning models and presented a cascaded deep neural network that performed better than the baseline transfer-learning model when dealing with the imbalanced diagnosis of the 14 pathologies. Their proposed approach could model the complex dependency between class labels and could generate the training strategy of boosting methods by considering the loss functions. In addition, the authors of [43] designed CheXNet, a 121-layer CNN using dense connections and batch normalization that takes input CXR images and outputs the probability of pneumonia, along with a heat-map that localizes the image features most indicative of pneumonia. They found that their model exceeded both the average radiologist's performance on pneumonia detection and the best published results on all 14 diseases at that time.
More recently, different deep learning-based methods have been developed to tackle the ChestX-ray14 problem. The authors of [44] proposed the ChestNet model to address the effective diagnosis of the 14 thorax diseases. Their model consisted of two main branches: the first was a classification branch, which served as a unified feature extraction network pretrained with the ResNet-152 model [40] to avoid the complexity of handling handcrafted features; the second was an attention branch, which explored the correlation between class labels, allowing the model to find the locations of abnormal regions. Their model was shown to outperform three previously state-of-the-art deep learning models on ChestX-ray14 using the official patient-wise split without extra training data. A unified model that jointly performed disease identification and localization with the limited localized annotations of the ChestX-ray14 dataset was proposed and achieved good results [37]. The text-image embedding network (TieNet) was proposed to extract distinctive image and text representations [45]. The authors first used TieNet to classify ChestX-ray14 images based on both the image features and the text of the corresponding reports. Later, TieNet was transformed into a CXR reporting system that could output a disease classification and a preliminary report. Meanwhile, the authors of [46] provided a comparison of different deep learning model settings (ResNet-38 and ResNet-101). They achieved the best overall results with the optimized ResNet-38-large-meta architecture trained with CXRs and incorporated non-image data (i.e., view position, age, and gender).
From the existing reports on CXR, we can conclude that transferring features extracted from pretrained networks is preferable. However, the use of shallow handcrafted features from CXR images, or the integration of those conventional features with transferred deep features extracted from CNNs, has not been considered. On natural images, combining different shallow features [47] and integrating features from different layers of pretrained CNNs [48] have resulted in much higher classification accuracy, but to the best of our knowledge, no such method has been applied to CXR images. Accordingly, in this work, we attempt the multiple feature integration of both shallow handcrafted and deep features for ChestX-ray14 image classification.


Proposed Approach
In this section, we present our proposed framework of multiple feature integration for the ChestX-ray14 classification. The overall approach, illustrated in Figure 1, mainly consists of (1) feature extraction using both shallow handcrafted descriptors and a pretrained deep CNN, along with localization of pathological regions via CAM; (2) appropriate feature integration determined through extensive experiments; and (3) classification of the 14 thoracic diseases using different classifiers.

Shallow Feature Extraction
To effectively combine complementary local features, we use four different types of handcrafted feature descriptors to extract image information from different aspects. SIFT [23] extracts structural information from image patches; GIST [49] obtains scale and orientation information from different parts of the image as an envelope of the image; LBP [24] enables texture information extraction; and HOG [25] counts the occurrences of gradient orientations in localized portions of an image, making it highly suitable for object detection problems.
• SIFT: We decompose our SIFT algorithm into four stages. First, the feature point detection step, based on the property of scale invariance, finds features under various image sizes. Because some key point features found in the scale space are poorly localized, a subpixel localization step then refines their positions while removing any poor features. Next, the gradient orientations of sample points within a region form an orientation histogram that assigns each key point a dominant orientation. Finally, a descriptor is computed from the local gradient histograms around each key point.
• HOG: HOG decomposes an image into small squares in a dense manner, computes the histogram of oriented gradients, normalizes the obtained results in a block-wise pattern, concatenates the 3 × 3 grid cells, and returns the HOG descriptor at each grid location.
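As an illustration of the texture-descriptor stage, the basic 8-neighbor LBP operator can be sketched in a few lines of NumPy. This is a didactic re-implementation with illustrative function names, not the exact variant used in the study; production work would typically rely on a library such as scikit-image.

```python
import numpy as np

def lbp_8neighbor(img):
    """Basic 8-neighbor LBP: compare each pixel with its ring of
    neighbors and pack the comparison bits into a 0-255 code."""
    img = np.asarray(img, dtype=np.float64)
    c = img[1:-1, 1:-1]                      # center pixels (borders dropped)
    # offsets of the 8 neighbors, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.uint8) << bit
    return code

def lbp_histogram(img, bins=256):
    """Normalized histogram of LBP codes, usable as a texture feature vector."""
    codes = lbp_8neighbor(img)
    hist = np.bincount(codes.ravel(), minlength=bins).astype(np.float64)
    return hist / hist.sum()
```

The resulting 256-bin histogram is the per-image LBP feature that later enters the integration stage.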

Deep Feature Extraction
To generate the CAM [50] of the 14 pathological regions and extract deep features, we use the pretrained DenseNet-121 [44], which was trained for pneumonia detection on our targeted ChestX-ray14 dataset. The architecture of DenseNet-121 is shown in Figure 2. Compared with other pretrained CNNs, DenseNet-121 improves the flow of information and gradients through the network: each layer obtains collective knowledge from all previous layers, passes its own feature maps to all subsequent layers, and concatenates them along the depth dimension. Thus, the network can be thinner and more compact, the training error propagates more directly to earlier layers, and greater computational efficiency is obtained. The network can also learn more diversified and richer feature patterns, since the classifier in DenseNet uses features of all complexity levels, giving smoother decision boundaries when training data are insufficient.
The X-ray input image is encoded by a densely connected CNN, similar to the DenseNet in [43]. At the first stage, we resize the X-ray image to a 224 × 224 grid to feed into the pretrained DenseNet-121. The weights of DenseNet-121 are initialized with a model pretrained on ImageNet [7], using Adam with standard parameters [51]. Next, we freeze all the weights of the lower convolutional layers, replace the final fully connected layer with a fully connected layer with a 14-dimensional output, and treat the DenseNet-121 as a fixed feature extractor. At the second stage, we fine-tune the weights of all layers by continuing the back-propagation. Each training iteration optimizes the binary cross-entropy loss over the 14 labels,

L = −Σ_{c=1}^{14} [Y_c log P_c + (1 − Y_c) log(1 − P_c)],

where Y is the ground truth vector, in which each element is binary (1 and 0 represent the existence and nonexistence of the corresponding disease, respectively), and P is the vector of predicted label probabilities. In this study, the batch size is 16; the momentum is 0.9 (as an optimizer parameter); the initial learning rate is 0.001, which decays by a factor of 10 whenever the validation loss plateaus; and the maximum number of iterations is 50,000.
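The multilabel cross-entropy objective can be written out in NumPy as follows. This is a didactic sketch of the standard formulation, with an illustrative function name, not the actual training code of this study.

```python
import numpy as np

def multilabel_bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy summed over the 14 disease labels.

    y_true: binary ground-truth vector (1 = disease present, 0 = absent).
    y_pred: predicted probability for each disease, in (0, 1).
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    # clip to avoid log(0) at the boundaries
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

A perfect prediction drives the loss toward zero, while a confident wrong prediction is penalized heavily, which is what the fine-tuning stage minimizes per batch.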

Feature Integration
After extracting all the features, we obtain the full feature set. To use the best feature integration efficiently, we first evaluate the performance of each descriptor. Features with a high classification performance are selected for the subsequent combinations; a feature is kept if it improves the classification accuracy and discarded otherwise. Finally, we obtain the best combination of features, denoted as F_s. All extracted features F_i are first normalized by F_i^{L2} = F_i / ||F_i||_2, where ||F_i||_2 = √(|f_1|² + |f_2|² + ⋯ + |f_d|²), and are then concatenated into the final feature representation F_s = {F_i^{L2}, i = 1, 2, …, N}, where N is the number of selected features.
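The normalization-and-concatenation step above can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def integrate_features(feature_list):
    """L2-normalize each selected feature vector, then concatenate them
    into the final representation F_s (descriptor order is preserved)."""
    normalized = []
    for f in feature_list:
        f = np.asarray(f, dtype=np.float64).ravel()
        norm = np.linalg.norm(f)            # sqrt(|f1|^2 + ... + |fd|^2)
        normalized.append(f / norm if norm > 0 else f)
    return np.concatenate(normalized)
```

Per-descriptor normalization keeps any single descriptor with a large dynamic range (e.g., raw HOG vs. CNN activations) from dominating the concatenated vector.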


Supervised Learning Classifiers
At the final stage, we select a collection of classifiers in a supervised manner to quantitatively evaluate the representative capability of the integrated features, as follows.

• Gaussian discriminant analysis (GDA) [52]: based on the generative learning algorithm property, we learn the model P(y), distributed according to a Bernoulli distribution, and P(x|y = k), where k is one of the 14 classes, distributed according to a multivariate normal distribution; P(y|x) can then be expressed as a sigmoid function.
• K-nearest neighbor (KNN) [53]: we initialize K = 30 as the number of neighbors used to capture and locate similar data points. We gradually increase the value of K so that the KNN predictions, based on majority voting and averaging, become more stable and accurate.
• Naïve Bayes [54]: based on Bayes' theorem, which calculates the posterior probability P(c|x) from P(x), P(c), and P(x|c), Naïve Bayes assumes that the effect of a predictor x on a given class c is independent of the values of the other predictors, called class conditional independence.
• Support vector machine (SVM) [55]: we apply the SVM algorithm to find an optimal hyperplane acting as a decision boundary in N-dimensional space that can distinctly classify our feature points. We maximize the margin of the classifier, where the support vectors determine the position and orientation of the hyperplane. The tuning parameters, including the kernel, regularization, and gamma, are carefully chosen.

• Adaptive boosting (AdaBoost) [56]: we sequentially add a set of weak classifiers and train them on the weighted training data. First, we initialize the weight of each data point, fit weak classifiers to the dataset, and compute the weight of each weak classifier. After 100 iterations, we obtain the final prediction with the updated weight of each classifier by F(x) = sign(Σ_{n=1}^{N} W_n f_n(x)), where f_n is the n-th weak classifier and W_n is the corresponding weight.
• Random forest [57]: this comprises multiple random decision trees. A random sample from the original dataset forms each tree. A subset of K features presented at each tree node d is randomly selected to generate the best split. We then split the node into daughter nodes and repeat these steps n times to create n trees.

• Extreme learning machine (ELM) [58]: the ELM includes an input layer, a hidden layer, and an output layer. We set a specific number of hidden neurons; randomize the weights and biases between the input and hidden layers; calculate the weights between the hidden and output layers by the Moore-Penrose pseudoinverse, with a sigmoid activation function; and fit the results with the least-squares method.
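As an illustration of the last classifier above, a minimal NumPy sketch of ELM training follows: random hidden-layer weights, a sigmoid activation, and output weights obtained in closed form via the Moore-Penrose pseudoinverse. The hidden-layer size and function names are illustrative, not the settings used in our experiments.

```python
import numpy as np

def train_elm(X, Y, n_hidden=64, seed=0):
    """Train a single-hidden-layer ELM.

    X: (n_samples, n_features) inputs; Y: (n_samples, n_classes) one-hot labels.
    Input weights/biases are random and never updated; only the output
    weights beta are fitted, by a least-squares pseudoinverse solution.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights
    b = rng.standard_normal(n_hidden)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # sigmoid hidden activations
    beta = np.linalg.pinv(H) @ Y                      # Moore-Penrose least squares
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                                   # class scores; argmax = label
```

Because only beta is solved for, training is a single linear-algebra step, which is why ELM is fast compared with iteratively trained classifiers.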

ChestX-Ray14 Dataset and Preprocessing
To verify the efficacy of our proposed approach, we conduct experiments on the publicly available ChestX-ray14 dataset recently introduced in [4]. With a total of 112,120 X-ray images acquired from 30,805 unique patients with 14 disease labels, it is the largest collection of front-view chest radiographs to date. Each image is marked with a single label or multiple labels based on the radiology reports, with 90% accuracy. Furthermore, 984 labeled bounding boxes (B-Boxes) over 880 images are provided by board-certified radiologists. Thus, we select these 880 images as "annotated" for testing CAM visualization and the remaining 111,240 "unannotated" images for training the DenseNet-121 model. We show the complex and diverse distribution of 10,000 sampled images by plotting t-distributed stochastic neighbor embedding (t-SNE) [59] and conducting a principal component analysis (PCA) [60] (Figure 3). Before inputting images into the DenseNet-121 model (Figure 2), we downscale the original 1024 × 1024 PNG images to 224 × 224, and we normalize them into the range [−1, 1] based on the means and standard deviations of the images. We also augment the training and validation data with batch augmentation and random horizontal flipping. In contrast, to extract features from different perspectives with our proposed shallow descriptors (Figure 4), we keep the original size of the images and do not apply any data augmentation techniques.
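The image preprocessing for the deep branch can be sketched as follows. Note that the nearest-neighbor resizing and the clipped mean/std standardization are simplifying assumptions made for illustration; the exact resizing filter and [−1, 1] normalization used in the study are not specified.

```python
import numpy as np

def preprocess(img, size=224):
    """Downscale a square grayscale CXR to size x size (nearest neighbor)
    and squash intensities into [-1, 1] via standardization plus clipping."""
    img = np.asarray(img, dtype=np.float64)
    # pick one source row/column index per output pixel (nearest neighbor)
    idx = (np.arange(size) * img.shape[0] / size).astype(int)
    small = img[np.ix_(idx, idx)]
    small = (small - small.mean()) / (small.std() + 1e-8)  # zero mean, unit std
    return np.clip(small, -1.0, 1.0)                        # clamp into [-1, 1]
```

The shallow-descriptor branch deliberately skips this step, operating on the full-resolution, unaugmented images as stated above.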
We used Python 3.6 for (i) both handcrafted and deep feature extraction and (ii) implementation of the pretrained DenseNet-121 model for CAM visualization and the classification of the 14 thoracic diseases, with the TensorFlow 1.8.0 deep learning framework and CUDA 9 and cuDNN 7.5 dependencies. For the latter part, we adopted the different classifiers implemented in Matlab 2018b, with 10-fold cross-validation applied to each classifier. The total computation time for our experiments was 143.9 h on a system with an i7-4770K 4-core CPU, 32 GB of memory, and a GeForce GTX 1070 GPU.

CAM Visualization
After extracting the activation weights from the final convolutional layer, we can generate the disease heat-maps for each pathology. We obtain the feature map M_c of the most salient features by summing the weighted feature maps,

M_c = Σ_k W_{c,k} F_k,

where F_k is the k-th feature map and W_{c,k} is the weight of the final convolutional layer connecting feature map k to pathology c. We are able to localize pathologies using CAM by highlighting the pathological regions of the X-ray images that are important for a specific disease classification. Despite the small number of annotated bounding boxes (984 instances) compared to the entire dataset, this is sufficient to achieve a rational estimate of the disease localization performance of our proposed framework. Figure 5 shows several CAM visualization examples.
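The CAM computation reduces to a weighted sum over the final convolutional feature maps; a minimal sketch follows, where the array shapes and the rescaling to [0, 1] for heat-map display are our assumptions.

```python
import numpy as np

def class_activation_map(feature_maps, weights, cls):
    """CAM for pathology `cls`: M_c = sum_k W[c, k] * F[k].

    feature_maps: array of shape (K, H, W) from the last conv layer.
    weights:      array of shape (C, K) from the prediction layer.
    """
    cam = np.tensordot(weights[cls], feature_maps, axes=1)  # -> (H, W)
    # rescale to [0, 1] so the map can be overlaid as a heat-map
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

Upsampling the resulting map back to the input resolution and thresholding it yields the pathology localization compared against the annotated B-Boxes.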

Classification Results
Tables 1 and 2 show the obtained classification accuracies and F1-scores, respectively. As previously mentioned, four types of shallow local features (SIFT, GIST, LBP, and HOG) are designed to describe image patches from different perspectives. We divide the dataset as 80% for

CAM Visualization
After extracting the activation weights from the final convolutional layer, we can generate the disease heat-maps for each pathology. We obtain the feature map M c of the most salient features by summing up associated weights as follows, where F k is the k th feature map and W c,k is the weight of the final convolutional layer at the feature map k leading to pathology c. We are able to localize pathologies using CAM by highlighting the pathological regions of the X-ray images that are important for performing a specific disease classification. Despite the small number of annotated bounding boxes (984 instances) compared to the entire dataset, it is sufficient to achieve a rational estimate on the disease localization performance of our proposed framework. Figure 5 shows

CAM Visualization
After extracting the activation weights from the final convolutional layer, we can generate the disease heat-maps for each pathology. We obtain the feature map of the most salient features by summing up associated weights as follows, where is the ℎ feature map and , is the weight of the final convolutional layer at the feature map k leading to pathology . We are able to localize pathologies using CAM by highlighting the pathological regions of the X-ray images that are important for performing a specific disease classification. Despite the small number of annotated bounding boxes (984 instances) compared to the entire dataset, it is sufficient to achieve a rational estimate on the disease localization performance of our proposed framework. Figure 5 shows several CAM visualization examples.

Classification Results
Tables 1 and 2 show the obtained classification accuracies and F1-scores, respectively. As previously mentioned, four types of shallow local features (SIFT, GIST, LBP, and HOG) are designed to describe image patches from different perspectives. We divide the dataset as 80% for training and 20% for testing. For the first integration strategy, we can see that the classification accuracy of the shallow feature integration is higher than that of each single feature. Furthermore, from the experiments we observe that the classification accuracies keep increasing from Conv1 to Conv5, because the features of the shallow layers of deep CNN models are typically basic and shared, preventing the differentiation of the information in X-ray images. Therefore, we disregard the integration of Conv1 to Conv4 with our proposed handcrafted features. The results of the Conv5 features, integrated with either each single feature descriptor or all the conventional features, are superior to those obtained in the first integration strategy. Regarding the performance of the supervised classifiers at the last stage of our approach, ELM works best among all the single and integrated features, followed by AdaBoost. Figure 6 summarizes the classification results achieved by the seven supervised classifiers.
To accurately compare our proposed approach with the reference baseline model in the classification of all 14 pathologies, we divided the dataset as follows: 70% for training, 10% for validation, and 20% for testing for the pretrained DenseNet-121; the preprocessing step is described in Section 4.1. The feature integration approach reaches 84.62% classification accuracy, which is higher than the 80.97% accuracy of the pretrained DenseNet-121 model. This experiment reveals the effectiveness of the feature integration strategy in extracting representative and discriminative features to describe X-ray images. Table 3 indicates that our pretrained DenseNet-121 model achieved very competitive accuracies in the diagnosis of the 14 thorax diseases. Note that the authors of [61] trained their model with 180,000 images from the PLCO dataset [62] as extra training data.

Conclusions and Future Work
The early diagnosis and treatment of lung diseases are essential to prevent deaths worldwide. In this study, we proposed a novel framework that integrates multiple shallow handcrafted and deep features. After conducting comprehensive experiments, representative and discriminative features were obtained to distinguish the 14 pathologies of the public ChestX-ray14 dataset. We were also able to generate disease heat-maps despite the limited number of annotated bounding boxes in the dataset.