Hybrid Learning of Hand-Crafted and Deep-Activated Features Using Particle Swarm Optimization and Optimized Support Vector Machine for Tuberculosis Screening

Abstract: Tuberculosis (TB) is a leading infectious killer, especially for people with Human Immunodeficiency Virus (HIV) and Acquired Immunodeficiency Syndrome (AIDS). Early diagnosis of TB is crucial for disease treatment and control. Radiology is a fundamental diagnostic tool used to screen or triage TB. Automated chest x-ray analysis can facilitate and expedite TB screening with fast and accurate reports of radiological findings; it can rapidly screen large populations and alleviate the shortage of skilled experts in remote areas. We describe a hybrid feature-learning algorithm for automatic screening of TB in chest x-rays. The algorithm first segmented the lung regions using the DeepLabv3+ model. Then, six sets of hand-crafted features (statistical textures, local binary pattern, GIST, histogram of oriented gradients (HOG), pyramid histogram of oriented gradients and bag of visual words (BoVW)) and nine sets of deep-activated features (from AlexNet, GoogLeNet, InceptionV3, XceptionNet, ResNet-50, SqueezeNet, ShuffleNet, MobileNet and DenseNet) were extracted. The dominant features of each feature set were selected using particle swarm optimization and then separately input to an optimized support vector machine classifier to label x-rays as ‘normal’ or ‘TB’. GIST, HOG and BoVW from the hand-crafted features, and MobileNet and DenseNet from the deep-activated features, performed better than the others. Finally, we combined these five best-performing feature sets to build a hybrid-learning algorithm. Using the Montgomery County (MC) and Shenzhen datasets, we found that the hybrid features of GIST, HOG, BoVW, MobileNet and DenseNet performed best, achieving an accuracy of 92.5% for the MC dataset and 95.5% for the Shenzhen dataset.


Introduction
Infectious diseases are threats to global health. One of them is tuberculosis (TB), a serious contagious disease that is easily transmitted. TB is spread through the air when a TB-infected person coughs, sneezes or spits. It is caused by Mycobacterium tuberculosis. The World Health Organization (WHO) estimated that around 11 million people worldwide were ill with TB in 2019 and that 1.5 million died. Early and accurate detection of TB is essential to control disease progression and to prevent onward transmission. The WHO recommends the chest x-ray as an essential tool to end TB. Chest x-rays are non-invasive, fast, affordable, highly sensitive and widely available in urban areas.

• Second, we optimized an SVM classifier using a Bayesian algorithm.
• Third, we compared the classifications from hand-crafted and deep-activated features using the optimized SVM classifier.
• Fourth, we combined the selected hand-crafted and deep-activated features to generalize the feature set in extensive experiments.
To our knowledge, this is the first approach to predict TB using a hybrid feature set that combines selected hand-crafted and deep-activated features. By using the hybrid feature set, we improved prediction performance compared to the individual methods and to state-of-the-art methods.
The remainder of this paper has four sections. Section 2 presents the datasets. Section 3 describes the steps in our method: lung segmentation, feature extraction, feature selection and classification. Section 4 describes and discusses our experimental results. Section 5 concludes and suggests future work.

Dataset Description
This study used two public datasets, Montgomery (MC) and Shenzhen, published in Jaeger et al. [23]. The MC dataset has 138 frontal chest x-rays, of which 80 are normal and 58 show TB; they were collected in Montgomery County, Maryland, USA. The image sizes are 4892 × 4020 or 4020 × 4892. This dataset also has ground truth lung masks for every image and radiological reports describing the lesions. Figure 1 shows example chest x-rays from the MC dataset. The Shenzhen images were collected in the Guangdong Hospital, Shenzhen, China; the dataset consists of 326 normal images and 336 images containing TB lesions. The image resolutions vary but are approximately 3000 × 3000 pixels. Figure 2 shows some chest x-rays in the Shenzhen dataset.
Appl. Sci. 2020, 10, x 4 of 22

Methodology
We aimed to implement an automatic TB screening system using chest x-rays, which could facilitate and expedite TB diagnosis and treatment. Figure 3 depicts the data flow of our algorithm, which hybridizes shallow and deep features for TB classification. It has three main steps: (i) lung segmentation supplemented by preprocessing, (ii) feature extraction, selection and concatenation, and (iii) classification of normal and TB. Each step is presented in detail in the following subsections.

Preprocessing
All images were resized to a common size of 512 × 512 pixels to reduce the processing time. Image enhancement strongly influences lung segmentation and classification results [3]. Contrast limited adaptive histogram equalization (CLAHE) was used to improve image quality and contrast [24]. The effect of image enhancement is shown in Figure 4.
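To make the contrast-limiting idea concrete, the sketch below equalizes a single 8-bit tile with a clipped histogram, which is the building block of CLAHE. It is a simplified illustration, not the authors' implementation: full CLAHE also divides the image into tiles and interpolates between neighbouring tile mappings, and the function name `clipped_equalize` and the `clip_limit` value are assumptions made here.

```python
import numpy as np

def clipped_equalize(tile: np.ndarray, clip_limit: float = 2.0) -> np.ndarray:
    """Equalize one 8-bit grayscale tile using a clipped histogram.

    Simplified building block of CLAHE: no tiling, no interpolation.
    clip_limit is a multiple of the mean bin height (assumed value).
    """
    hist = np.bincount(tile.ravel(), minlength=256).astype(np.float64)
    # Clip every bin at the limit and spread the clipped excess evenly
    # over all 256 bins, so the total pixel count is preserved.
    limit = clip_limit * tile.size / 256.0
    excess = np.sum(np.maximum(hist - limit, 0.0))
    hist = np.minimum(hist, limit) + excess / 256.0
    # Build the intensity mapping from the cumulative distribution.
    cdf = np.cumsum(hist)
    denom = max(cdf[-1] - cdf[0], 1e-12)  # guard against a flat CDF
    lut = np.round(255.0 * (cdf - cdf[0]) / denom).astype(np.uint8)
    return lut[tile]

# Low-contrast synthetic tile with intensities confined to [90, 130).
rng = np.random.default_rng(0)
img = rng.integers(90, 130, size=(64, 64), dtype=np.uint8)
out = clipped_equalize(img)
print(img.min(), img.max(), "->", out.min(), out.max())
```

Clipping bounds the slope of the mapping, which limits noise amplification in nearly uniform regions compared with plain histogram equalization.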

Lung Segmentation
Lung segmentation is an essential subtask for most disease detection using chest x-rays. Accurate extraction of lung regions affects the performance of subsequent processes because it defines the region of interest (ROI) in which lung abnormalities are searched. Gordienko et al. [25] investigated the impact of lung segmentation and bone shadow removal on lung nodule detection and showed that better accuracy was obtained using the segmented lungs with bone shadows removed. Chest x-rays in our study contain regions other than the lungs, which are irrelevant for TB detection. To reduce the risk that these irrelevant regions mislead the final results, we decided to segment the lungs. Processing only the lung regions focuses further processing on the useful regions, thereby improving the algorithm's performance and lowering the computational time. Previously, we evaluated and compared different lung segmentation methods [26], especially deep semantic ones such as the fully convolutional network (FCN) [27], SegNet [28], U-Net [29] and DeepLabv3+ [30]. The segmentation performance was evaluated at the pixel level by comparing the predicted mask generated by each algorithm with the ground truth mask. Three evaluation metrics, namely intersection over union (IoU), accuracy and Dice similarity coefficient, were measured. We found that DeepLabv3+ [30] with an XceptionNet [31] backbone yielded better segmentation than the other methods, achieving IoUs of 95.1% for the MC dataset and 92.7% for the Shenzhen dataset [26]. Based on those results, we employed it to segment the lung regions here. The data flow for lung segmentation is depicted in Figure 5. DeepLabv3+ consists of two parts: an encoder and a decoder. The encoder downsamples the input images and extracts rich semantic information via atrous spatial pyramid pooling (ASPP) for classification of lung or non-lung pixels. The encoder module employs XceptionNet as a backbone network.
XceptionNet is a CNN used for image classification. Its architecture is a linear stack of depth-wise separable convolutions with residual connections, forming the feature extractor. It has a depth of 126 and comprises three components: entry, middle and exit flow. There are a total of 36 convolutional layers to extract the features: 8 in the entry, 24 in the middle and 4 in the exit components [31]. Using the last feature map of XceptionNet, ASPP applied four parallel atrous convolutions with different rates to explore the image-level features at multiple scales. Here we used four different rates of 1, 6, 12 and 18 for atrous convolution. The feature maps extracted by the atrous convolutions were then pooled into a 1 × 1 convolutional feature map and fed to the decoder module. The decoder module reconstructs the semantic labels by concatenating low-level and high-level encoder features, followed by upsampling, and generates the mask for the lung regions. We superimposed the lung mask generated by DeepLabv3+ on the original chest x-rays to retrieve the segmented lung region. Finally, a morphological gradient operation was used to correct and refine the boundaries of the segmented lungs [32].
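The three pixel-level metrics named above can be computed directly from a predicted mask and a ground truth mask. The following is a minimal sketch, assuming both masks are binary arrays of the same shape; the function name `mask_metrics` and the toy masks are illustrative and not taken from the paper.

```python
import numpy as np

def mask_metrics(pred, truth):
    """Return (IoU, Dice, pixel accuracy) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    inter = np.logical_and(pred, truth).sum()   # pixels marked lung in both
    union = np.logical_or(pred, truth).sum()    # pixels marked lung in either
    total = pred.sum() + truth.sum()
    iou = inter / union if union else 1.0       # two empty masks agree fully
    dice = 2 * inter / total if total else 1.0
    acc = (pred == truth).mean()                # fraction of matching pixels
    return float(iou), float(dice), float(acc)

# Toy example: a 2 x 2 ground truth region and a 2 x 3 prediction covering it.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True
iou, dice, acc = mask_metrics(pred, truth)
print(iou, dice, acc)
```

In this toy case the intersection is 4 pixels and the union is 6, so the Dice coefficient (0.8) is higher than the IoU (about 0.67) for the same overlap, which is worth keeping in mind when comparing reported figures.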

Feature Extraction
Features play a vital role in medical image analysis; they represent the interesting parts of an image as compact coded attributes. Here, features that aid TB identification were extracted and used as input to classify x-rays as normal or TB. The segmented lung regions generated as described in Section 3.2 were used as input to feature extraction. Ideally, the TB lesion regions would serve as the ROI. However, since there is no labeled dataset that indicates the exact area of TB regions, we could not segment small TB lesions and use them as ROIs. The remaining options are the entire image or the segmented lung regions. As TB lesions appear only in the lung region, we used the segmented lung regions as ROIs and extracted the features from them. Two types of features were extracted: hand-crafted (shallow) features and deep-activated features from CNNs. An overview of the applied feature extractors follows.

Hand-Crafted Features
• Statistical textural features: Statistical textural features result from the quantitative analysis of the pixel intensities in the grayscale image using different arrangements. Intensity histograms, first-order statistical textures, gray-level co-occurrence matrices (GLCM) and gray-level run-length matrices (GLRLM) are used as the feature descriptors to extract the statistical textural features. We extracted eight first-order statistical features [33], 88 GLCM features, which encoded 22 different features in four directions [34], and 44 GLRLM features, which encoded 11 different features in four directions [35], for a total of 140 textural features.
• Local binary pattern (LBP) features: An LBP is a texture histogram that describes a texture based on differences between a central pixel and its neighbors. LBP produces a binary pattern by thresholding the neighborhood against the value of the central pixel: a neighbor is 1 when it is greater than or equal to the central pixel, and 0 when it is less. The frequency of the binary patterns found in the image is then collected as a histogram [36]. With an 8-pixel neighborhood, 256 features are obtained.
• GIST features: GIST is a feature descriptor that applies image filtering to develop a low-level feature set including intensity, color, motion, and orientation, based on the gradients, orientations, and scales of the image [37]. GIST captures these features to identify the salient image locations that differ significantly from their neighbors [14]. First, GIST convolves a given input image with 32 Gabor filters at four different scales and eight different orientations to generate a total of 32 feature maps. Each of these feature maps is then split into 16 sub-regions with a 4 × 4 square grid, and the feature values within each sub-region are averaged. The averaged values from the 16 sub-regions are concatenated for the 32 different feature maps, resulting in a total of 512 GIST descriptors for a given image.

• Histogram of oriented gradients (HOG) features: A HOG descriptor, introduced by Dalal and Triggs [38], counts gradient orientation occurrences in localized image regions. HOG measures the first-order image gradient pooled in overlapping orientation bins, and gives a compressed and encoded version of an image. It involves computing gradients, creating cell histograms, and generating and normalizing the descriptor blocks. Given an image, HOG first divides the image into small connected regions called cells. It then computes the gradient orientations over each cell and builds a histogram of these orientations, giving the probability of a gradient with a specific orientation in a given patch. Adjacent cells are grouped into small blocks. The features are extracted over these blocks, in a repetitive fashion, to preserve information about local structures, and the block-wise features are finally integrated into a feature vector. We used a cell size of [16 × 16].
• Pyramid histogram of oriented gradients (PHOG) features: PHOG [39] represents an image by its spatial layout and local shape. First, PHOG tiles the image into sub-regions at multiple pyramid-style resolutions. In each sub-region, the histogram of oriented gradients is applied as a local shape descriptor using the distribution of edge directions. We extracted a total of 168 PHOG features from each image.

• Bag of visual words (BoVW) features: BoVW is a technique adapted from information retrieval to computer vision applications [40]. Unlike text, images do not contain words, so this method creates a bag of features extracted from the images across the classes, using a custom feature descriptor, and constructs a visual vocabulary. First, speeded-up robust features (SURF) [41] are used as feature descriptors to detect interesting key points. Then, k-means clustering [42] is used to generate a visual vocabulary by reducing the dimensions of the features. The center of each cluster corresponds to a feature, or visual word. We extracted 500 BoVW features, using 500 clusters.
A summary of the feature descriptors, along with the number of extracted features, is in Table 1.
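As an illustration of one of the descriptors above, the following sketch computes the basic 8-neighbor LBP histogram with the thresholding rule described in the text (a neighbor contributes 1 when it is greater than or equal to the central pixel). It is a minimal version under stated assumptions: border pixels are skipped, the bit ordering of the neighbors is an arbitrary choice made here, and the function name `lbp_histogram` is hypothetical.

```python
import numpy as np

def lbp_histogram(img):
    """Normalized 256-bin histogram of basic 8-neighbor LBP codes.

    Border pixels are skipped, so the image must be at least 3 x 3.
    """
    img = np.asarray(img, dtype=np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Neighbor offsets (dy, dx), clockwise from the top-left corner;
    # this ordering is an assumption and only permutes the bins.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit where the neighbor is >= the central pixel.
        codes |= (neighbor >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256)
    return hist / hist.sum()

# A flat patch: every neighbor equals the center, so every code is 255.
flat = lbp_histogram(np.full((5, 5), 7))
# A single bright pixel surrounded by darker ones: its code is 0.
spike = lbp_histogram(np.array([[0, 0, 0], [0, 9, 0], [0, 0, 0]]))
print(flat[255], spike[0])
```

The 256 bins correspond to the 2^8 possible binary patterns of an 8-pixel neighborhood, which matches the feature count given for LBP.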

Deep-Activated Features from Pre-trained CNNs
Pre-trained CNNs are convolutional networks that have been trained on large datasets and can classify 1000 or more natural objects. There are two ways to use pre-trained CNNs in a specific task, in our case medical image classification: transfer learning, with the fine-tuned network acting as the classifier, or as feature extractors combined with supervised machine learning classifiers. When memory and computational resources are limited, using them as feature descriptors is a good choice. Here, we used nine different pre-trained CNNs, AlexNet [43], GoogLeNet [44], InceptionV3 [45], XceptionNet [31], ResNet-50 [46], SqueezeNet [47], ShuffleNet [48], MobileNet [49] and DenseNet [50], as feature descriptors to extract high-level deep-activated features. We resized the segmented lung image to the input size of each CNN before feeding it to the network and extracting the features. The output of the fully connected layer, which is the last layer before the softmax classification layer of each CNN, is retrieved and returns 1000 deep-activated features from each CNN, as listed in Table 1. A pre-trained CNN constructs a hierarchical representation of input images. The early layers extract low-level features. The deeper layers extract high-level features, constructed from those of earlier layers. Figure 6 displays an example of activated feature maps of three different pooling layers, 'pool1', 'pool2' and 'pool3', of DenseNet. The pooling operation condenses the feature maps from the convolution layers by highlighting the activated spatial locations, so that the features become more abstract in deeper layers of the CNN. Overlaying these activation maps on the original image reveals the features the CNN has learned.

Feature Selection
The extracted features include many noisy and irrelevant features, and using them directly may result in poor classification. Selecting the discriminant features prior to classification is therefore of paramount importance in supervised machine learning. For feature selection we used particle swarm optimization (PSO), a population-based metaheuristic inspired by bird flocking or fish schooling and first described by Kennedy and Eberhart [51]. It has been successfully used in global search problems: it is easy to implement, has a reasonable computation time and searches globally. A PSO flowchart is illustrated in Figure 7. In PSO, each particle has three attributes: position, velocity, and fitness. The position of each particle is a potential solution. The fitness determines the movement of each particle.
The particles move through the solution space, and their fitness values are evaluated. The direction and distance of the particle movement is determined by the velocity. The personal best position of each particle and the global best position among all particles are tracked to update individual positions. The personal best position means the best position and fitness found for particle, where t is the current iteration index, denotes an inertia weight, and are acceleration coefficients, and and are random numbers between 0 and 1. The features are encoded as the particle swarm here. Pseudocode for feature selection using PSO is given in Pseudo Code 1 (Algorithm 1). We input the training dataset, the population size, the maximum number of iteration, and objective function. First, particles positions and velocities were randomly initialized and the fitness of each particle was computed using the objective function. The particle with the highest fitness value are considered as the best particle and so, its feature elements Initialize velocity and position of each particle Compute fitness of each particle Let denote that the population of n particles is X = [X 1 , X 2 , . . . X n ] in the potential solution in a D-dimensional space. The position, X i , and velocity, V i , of particle, i, are: (1) The particles move through the solution space, and their fitness values are evaluated. The direction and distance of the particle movement is determined by the velocity. The personal best position of each particle and the global best position among all particles are tracked to update individual positions. The personal best position means the best position and fitness found for particle, i: Pbest i = [p i1 , p i2 , . . . , p iD ]. The global best position, Gbest , is the best position and fitness for all particles in the swarm. 
The velocity and position of the particle are updated: where t is the current iteration index, ω denotes an inertia weight, c 1 and c 2 are acceleration coefficients, and r 1 and r 2 . are random numbers between 0 and 1. The features are encoded as the particle swarm here. Pseudocode for feature selection using PSO is given in Algorithm 1. We input the training dataset, the population size, the maximum number of iteration, and objective function. First, particles positions and velocities were randomly initialized and the fitness of each particle was computed using the objective function. The particle with the highest fitness value are considered as the best particle and so, its feature elements will be selected. As our main purpose of optimization is to obtain the higher classification accuracy, we directly used the value of the classification accuracy as the fitness value. The personal and global best positions are tracked to iteratively update the position and velocity of each particle and find the best set of features until a satisfactory fitness or the maximum number of iterations is reached.  (3) and (4), respectively End Set t ← t+1 Until terminal criteria met finalSet ← finalSet U save(particles) End
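The PSO procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: decoding a position into a feature subset by taking its k largest entries, clamping positions to [0, 1], the nearest-centroid classifier used as a stand-in fitness function, the toy data, and all names and parameter values (`pso_select`, `top_k`, `w=0.7`, `c1=c2=1.5`) are hypothetical choices made for the example.

```python
import random

def nearest_centroid_accuracy(X_train, y_train, X_val, y_val, cols):
    """Stand-in fitness (assumption): validation accuracy of a
    nearest-centroid classifier restricted to the feature columns `cols`."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    correct = 0
    for x, y in zip(X_val, y_val):
        pred = min(centroids, key=lambda lab: sum(
            (x[c] - m) ** 2 for c, m in zip(cols, centroids[lab])))
        correct += pred == y
    return correct / len(y_val)

def top_k(position, k):
    """Decode a position into a subset: indices of its k largest entries."""
    return sorted(range(len(position)), key=lambda d: position[d], reverse=True)[:k]

def pso_select(fitness, dim, k, n_particles=10, max_iter=20,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    # Random initial positions and velocities for every particle.
    X = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pbest_fit = [fitness(top_k(x, k)) for x in X]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(max_iter):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            for d in range(dim):
                # Velocity update, Equation (3).
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (pbest[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                # Position update, Equation (4), clamped to [0, 1] (assumption).
                X[i][d] = min(1.0, max(0.0, X[i][d] + V[i][d]))
            fit = fitness(top_k(X[i], k))
            if fit > pbest_fit[i]:          # track the personal best
                pbest[i], pbest_fit[i] = X[i][:], fit
                if fit > gbest_fit:         # track the global best
                    gbest, gbest_fit = X[i][:], fit
    return sorted(top_k(gbest, k)), gbest_fit

# Toy data: features 0 and 1 carry the class signal, features 2..5 are noise.
data_rng = random.Random(1)
def make_sample(label):
    shift = 2.0 if label == 1 else 0.0
    return ([data_rng.gauss(shift, 0.5), data_rng.gauss(shift, 0.5)]
            + [data_rng.gauss(0.0, 1.0) for _ in range(4)])

labels = [0, 1] * 30
data = [make_sample(y) for y in labels]
X_train, y_train, X_val, y_val = data[:40], labels[:40], data[40:], labels[40:]
fitness = lambda cols: nearest_centroid_accuracy(X_train, y_train, X_val, y_val, cols)

selected, best_fit = pso_select(fitness, dim=6, k=3)  # keep 50% of 6 features
print(selected, best_fit)
```

As in the text, the fitness is a classification accuracy; replacing the stand-in with a cross-validated SVM accuracy would bring the sketch closer to the method described, at a higher computational cost per particle.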

Classification
An SVM is a supervised learning algorithm used for classification and regression, originally described by Cortes and Vapnik [52]. An SVM classifies data points of different classes by searching for the best hyperplane separating them. We start with a training dataset, $X$, comprising $I$ training samples, $X = \{x_1, x_2, \ldots, x_I\}$, and target labels, $y = \{y_1, y_2, \ldots, y_I\}$, where $y_i \in \{-1, +1\}$, $i = 1, 2, \ldots, I$. The linear decision hyperplane, $f(x)$, is defined as:

$$f(x) = w^T \cdot x + b \tag{5}$$

where $w$ is the weight vector, which gives the direction of the hyperplane, and the bias, $b$, gives its position in the space. To find the best hyperplane for the binary classification of TB and normal, candidate decision surfaces were normalized so that the value of the decision hyperplane $f(x)$ for the support vectors is $w^T \cdot x + b = +1$ for the TB class and $w^T \cdot x + b = -1$ for the normal class. The best hyperplane is the one with the largest margin between the two classes, and maximizing the margin is equivalent to minimizing $\|w\|^2$. Therefore, the best separating hyperplane is defined by:

$$\text{Minimize: } \|w\|^2 \tag{6}$$

$$\text{subject to: } y_i \left(w^T \cdot x_i + b\right) \geq 1, \quad i = 1, 2, \ldots, I \tag{7}$$

In a real application, the training dataset is usually not linearly separable, because some data points fall inside the margin or on the wrong side of the decision hyperplane. Let $\xi = \{\xi_1, \xi_2, \ldots, \xi_I\}$ denote the vector of errors for the $I$ training samples. The decision hyperplane for data that are not linearly separable is:

$$\text{Minimize: } \|w\|^2 + C \sum_{i=1}^{I} \xi_i \tag{8}$$

$$\text{subject to: } y_i \left(w^T \cdot x_i + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, I \tag{9}$$

where $\xi_i = 0$ indicates a point on or outside the margin on the correct side, $0 < \xi_i < 1$ a point inside the margin but on the correct side of the hyperplane, and $\xi_i > 1$ a point on the wrong side of the hyperplane. $C$ is a penalty parameter that controls the cost of points falling inside or on the other side of the margin. When the data are not linearly separable, the SVM maps the feature space to a higher dimension using a mapping function $\phi$, evaluated through the 'kernel trick': $K(x_j, x_k) = \langle \phi(x_j), \phi(x_k) \rangle$.

Two types of mapping functions, local (Gaussian radial basis function) and global (linear or polynomial), are commonly used for SVMs and are shown in Equations (10) to (13):

$$\text{Linear: } K(x_j, x_k) = x_j \cdot x_k \tag{10}$$

$$\text{Gaussian: } K(x_j, x_k) = \exp\left(-\gamma \|x_j - x_k\|^2\right) \tag{11}$$

$$\text{Quadratic polynomial: } K(x_j, x_k) = \left(x_j \cdot x_k + 1\right)^2 \tag{12}$$

$$\text{Cubic polynomial: } K(x_j, x_k) = \left(x_j \cdot x_k + 1\right)^3 \tag{13}$$

where $\gamma$ is the kernel scale parameter. Using the mapping functions, a non-linear decision surface is defined mathematically as:

$$f(x) = \sum_{i=1}^{S_v} \alpha_i y_i K(x_i, x) + b \tag{14}$$

where $S_v$ denotes the number of support vectors, $\alpha_i$ and $y_i$ represent the Lagrange multipliers and target labels associated with the support vectors, respectively, and $x_j$ and $x_k$ represent the observations $j$ and $k$ in the training set $X$ [53].

The performance of an SVM relies heavily on the choice of its parameters. Optimal values of the SVM parameters were searched using the Bayesian algorithm [54], as shown in Figure 8. We first defined the initial parameter search space and then used the Bayesian method to search iteratively for the optimal values until the stopping criterion was reached or the validation accuracy no longer changed.
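To make the kernel functions and the non-linear decision surface concrete, here is a small Python sketch. It is an illustration only: the Gaussian kernel is written in the common form exp(-γ‖x_j − x_k‖²), which is an assumption about the exact form intended, and the support vectors, multipliers and bias in the usage lines are hand-picked hypothetical values rather than the output of a trained SVM.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def linear_kernel(xj, xk):
    return dot(xj, xk)

def gaussian_kernel(xj, xk, gamma=1.0):
    # Assumed form: exp(-gamma * squared Euclidean distance).
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(xj, xk)))

def quadratic_kernel(xj, xk):
    return (dot(xj, xk) + 1) ** 2

def cubic_kernel(xj, xk):
    return (dot(xj, xk) + 1) ** 3

def decision_value(x, support_vectors, alphas, labels, bias, kernel):
    """Kernel decision surface: sum_i alpha_i * y_i * K(x_i, x) + bias."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + bias

def classify(x, support_vectors, alphas, labels, bias, kernel):
    """+1 corresponds to the TB class and -1 to the normal class."""
    value = decision_value(x, support_vectors, alphas, labels, bias, kernel)
    return 1 if value >= 0 else -1

# Hypothetical hand-picked values, for illustration only.
svs = [[1.0, 1.0], [-1.0, -1.0]]
alphas = [0.5, 0.5]
ys = [1, -1]
value = decision_value([2.0, 0.0], svs, alphas, ys, 0.0, linear_kernel)
label = classify([2.0, 0.0], svs, alphas, ys, 0.0, linear_kernel)
print(value, label)
```

Passing the kernel as a function argument keeps the decision rule identical across the four kernels, which mirrors how the kernel function is treated as one of the parameters to be searched.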

Evaluation Metrics
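We used three metrics to evaluate classification performance: accuracy and F1 score (F1), formulated in Equations (15) and (16), and area under the curve (AUC).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

$$F1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{16}$$

• True Positive (TP) refers to the number of TB cases correctly classified as TB.
• True Negative (TN) refers to the number of normal cases correctly classified as normal.
• False Positive (FP) refers to the number of normal cases incorrectly classified as TB.
• False Negative (FN) refers to the number of TB cases incorrectly classified as normal.

Additionally, we rated the classifier using the Kappa Index [55], which takes all elements in the confusion matrix into account, whereas accuracy counts only those on the main diagonal. The Kappa Index was computed as:

$$\kappa = \frac{p_o - p_e}{1 - p_e} \tag{17}$$

where $p_o$ is the observed agreement (the accuracy) and $p_e$ is the agreement expected by chance, computed from the row and column totals of the confusion matrix. The Kappa values were interpreted according to Table 2, as defined by Landis and Koch [56].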
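The metrics above can be computed from the four confusion-matrix counts, as in the following Python sketch. It assumes the Kappa Index is Cohen's kappa for a 2×2 confusion matrix; the function name `binary_metrics` and the counts in the usage lines are hypothetical and are not results from this study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, F1, kappa) from binary confusion-matrix counts.

    Kappa is Cohen's kappa (assumption): observed agreement corrected
    for the agreement expected by chance from the marginal totals.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Chance agreement: P(both say TB) + P(both say normal).
    p_tb = ((tp + fp) / total) * ((tp + fn) / total)
    p_normal = ((tn + fn) / total) * ((tn + fp) / total)
    p_e = p_tb + p_normal
    kappa = (accuracy - p_e) / (1 - p_e)
    return accuracy, f1, kappa

# Hypothetical counts for 100 test x-rays (illustration only).
accuracy, f1, kappa = binary_metrics(tp=45, tn=40, fp=10, fn=5)
print(round(accuracy, 3), round(f1, 3), round(kappa, 3))
```

In this example the accuracy is higher than kappa because half of the agreement would be expected by chance given the marginal totals, which is the effect the Kappa Index is meant to expose.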

Experimental Results and Discussion
Experiments were run in MATLAB R2019b on a 9th Generation Core i7 CPU at 3.0 GHz and an Nvidia T1660Ti GPU under Windows 10. Two public datasets, MC and Shenzhen, were used. The datasets were randomly split into training (70%) and testing (30%) sets. Since the MC dataset is small, the training sets from both datasets were merged into a combined training set, which was used to develop and select algorithms. The testing sets were used to assess the performance. First, we segmented the lung regions using DeepLabv3+ with the XceptionNet backbone. Segmentation performance is described in our previous study [25]. Example lung segmentations are shown in Figure 9.
Once the lung regions were retrieved, we extracted six sets of hand-crafted features (statistical textures, LBP, GIST, HOG, PHOG, and BoVW) and nine sets of deep-activated features from nine different pre-trained CNNs: AlexNet, GoogLeNet, InceptionV3, XceptionNet, ResNet-50, SqueezeNet, ShuffleNet, MobileNet and DenseNet. Then, we used a PSO-based feature selection algorithm to select the dominant features from each feature set. From each feature set, we selected 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% (all features) of the features; 11 selected percentages multiplied by 15 feature sets gave a total of 165 tests. The performance of each selected feature subset was assessed using a linear SVM. The accuracy of the different feature sets at each selected percentage is plotted in Figure 10 for hand-crafted features and in Figure 11 for deep-activated features. We found that selecting a small number of features, up to 20% of the total, delivered poor accuracy. Conversely, selecting a large number of features, from 80% to 100% of the total, caused a drop in accuracy. Selecting 30-70% of the features provided better accuracy. Thus, we selected 50% of the features, the middle of this range, from each feature set. Each selected feature subset was separately fed to an SVM classifier to predict TB.

To obtain a robust and effective SVM, it is crucial to select suitable parameters, e.g., the penalty parameter (C), the kernel function and the kernel scale. First, we defined the parameter search spaces: C = [0.001, 1000], kernel functions = {Linear, Gaussian, Quadratic, Cubic}, and kernel scale, γ = [0.001, 1000], and used a Bayesian algorithm to find the optimal parameters for each feature set. The optimal parameters for each feature set are listed in Table 3. We trained 15 SVM classifiers with the parameters in Table 3. Once the optimized SVM classifiers were trained, we used them to identify TB. The classification metrics were F1, accuracy, AUC, and Kappa Index. Tables 4 and 5 list the F1, accuracy and AUC of the 15 different methods for the MC and Shenzhen datasets, and Figures 12 and 13 plot the performance of each method for the two datasets. We found that GIST, HOG and BoVW from the hand-crafted features, and MobileNet and DenseNet from the pre-trained CNNs, performed better than the other methods on both datasets, achieving over 90% F1, accuracy, and AUC with an excellent Kappa (over 80%).

To improve prediction, we combined the five best-performing feature subsets, GIST, HOG, BoVW, MobileNet and DenseNet, and built a hybrid feature set that contained local and global texture features and high-level deep-activated features. The hybrid feature set contained the selected 50% of the features of the five feature sets: 256 GIST features, 5400 HOG features, 250 BoVW features, 500 MobileNet features and 500 DenseNet features, which were fed as input to the SVM classifier. The performance of the SVM classifier using the hybrid feature set is given in Table 4 for the MC dataset and Table 5 for the Shenzhen dataset. Its Kappa indices are plotted in Figures 12 and 13. It achieved favorable performance, with 93.3% F1, 92.7% accuracy and 99.5% AUC for MC, and 95.4% F1, 95.5% accuracy and 99.5% AUC for Shenzhen. Its Kappa was 'excellent' for both datasets. The hybrid feature set marginally improved the prediction compared to the individual best-performing feature sets on the Shenzhen dataset, while matching the best prediction made by the HOG and DenseNet feature sets on the MC dataset.

We compared our method with previous studies in Table 6. Our method surpassed the existing studies for the MC dataset, with accuracy 92.7% and AUC 99.5%, and obtained results comparable with the state of the art for the Shenzhen dataset, with accuracy 95.5% and AUC 99.5%. From Table 6, we found that the methods performed better on the Shenzhen dataset than on the MC dataset. The same pattern is seen in Tables 4 and 5 of our study, and also in related works [7,8,10,11,17,18,21,22]. The lower performance on the MC dataset is probably attributable to its limited size, only 138 x-rays, and therefore the narrower range of samples available for training. The Shenzhen dataset is larger, with over 300 samples for each of the normal and TB classes. Another factor that may impair the classification accuracy is the unbalanced data distribution: the MC dataset is unbalanced, with a smaller number of TB cases, while the Shenzhen dataset is balanced, with almost 50% TB cases.

It is also noteworthy that most of the hand-crafted features used here have already been studied. However, in those studies they were input directly to the classifier, whereas our method first filtered out the noisy and irrelevant features and selected the dominant ones. For deep-activated features, few pre-trained CNNs were used in previous studies, whereas we studied a wide variety of CNN structures. We also selected the important deep-activated features before feeding them to the classifier. Rajaraman et al. [22] used hand-crafted feature-based classifiers and pre-trained CNNs separately and combined their results via majority voting to make a final decision. Here, we made alternative use of these features: our method first selected the important features and then hybridized the best-performing shallow hand-crafted features and high-level deep features, so that the classifier had better information on which to base the prediction. We attribute the improvement over previous studies that used the same feature sets to the feature selection. Using PSO feature selection and hybridizing the hand-crafted and deep-activated features were the keys to improved prediction. The code supporting the findings of this study is available from the corresponding authors upon request.
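The construction of the hybrid feature set described above amounts to concatenating the five selected feature subsets for each x-ray. The Python sketch below illustrates this with the subset sizes reported in the text; the function name `build_hybrid_vector`, the concatenation order, and the zero-valued placeholder vectors are assumptions made for the example and do not represent real features.

```python
# Selected-subset sizes as reported in the text (50% of each feature set).
SUBSET_SIZES = {"GIST": 256, "HOG": 5400, "BoVW": 250,
                "MobileNet": 500, "DenseNet": 500}

def build_hybrid_vector(selected_features, order=tuple(SUBSET_SIZES)):
    """Concatenate the selected feature subsets of one image in a fixed order."""
    hybrid = []
    for name in order:
        vector = selected_features[name]
        if len(vector) != SUBSET_SIZES[name]:
            raise ValueError(f"{name}: expected {SUBSET_SIZES[name]} features, "
                             f"got {len(vector)}")
        hybrid.extend(vector)
    return hybrid

# Placeholder vectors (all zeros) standing in for one image's selected features.
placeholder = {name: [0.0] * size for name, size in SUBSET_SIZES.items()}
hybrid = build_hybrid_vector(placeholder)
print(len(hybrid))  # 256 + 5400 + 250 + 500 + 500 = 6906
```

Because HOG contributes 5400 of the 6906 entries, feature scaling before classification would likely matter in practice; the text does not state whether or how the subsets were normalized.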

Conclusions
We present a technique for learning with a hybrid of hand-crafted features and deep-activated features from pre-trained CNNs, with the help of a PSO algorithm and an optimized SVM classifier. We first preprocessed the images using CLAHE and then segmented the lung regions using a deep semantic segmentation technique, DeepLabv3+ with XceptionNet as the backbone. From the segmented lung regions, we extracted six sets of hand-crafted features using statistical textures, LBP, GIST, HOG, PHOG, and BoVW feature descriptors. Then, we used nine pre-trained CNNs as feature descriptors to retrieve the deep-activated features. To select the important features from each feature set, a PSO-based feature selection algorithm selected different fractions of the features.
The selected 50% of the features of each set were input to the optimized SVM classifier. Of the 15 feature sets tested, the GIST, HOG, BoVW, MobileNet, and DenseNet features performed better than the rest. To build a classifier that learnt from both local and global hand-crafted features, as well as high-level deep-activated features, we combined the five best-performing feature sets into a hybrid feature set and input it to the classifier to identify TB. The SVM classifier using the hybrid feature set obtained 92.7% accuracy and 99.5% AUC for the MC dataset, and 95.5% accuracy and 99.5% AUC for the Shenzhen dataset, and achieved an excellent Kappa Index for both datasets. Our results surpassed those of previous studies for the MC dataset and matched them for the Shenzhen dataset. Using a PSO feature selection method and a hybrid feature set was key to improved prediction.
Here, we used a PSO algorithm to rank the importance of the features and to select different fractions of them. A drawback of this approach is that finding the optimal number of features requires exhaustive, time-consuming tests. Therefore, we will work on a feature selection algorithm that finds the optimal number of features automatically. In addition, testing the robustness of existing algorithms or developing new ones requires a large number of images. As the acquisition and labelling of medical images are expensive, we have a great interest in using synthetic images generated by generative adversarial networks to train deep learning models, which require large numbers of training images. In this way, we could develop an even more robust TB classifier.