Lung X-ray Segmentation using Deep Convolutional Neural Networks on Contrast-enhanced Binarized Images

Abstract: Automatically locating the lung regions effectively and efficiently in digital chest X-ray (CXR) images is important in computer-aided diagnosis. In this paper, we propose an adaptive pre-processing approach for segmenting the lung regions from CXR images using convolutional neural network-based (CNN-based) architectures. The approach comprises three steps. First, a contrast enhancement method specifically designed for CXR images is adopted. Second, adaptive image binarization is applied to the CXR images to separate the image foreground and background. Third, CNN-based architectures are trained on the binarized images for image segmentation. The experimental results show that the proposed pre-processing approach is applicable to and effective for various CNN-based architectures and can achieve segmentation accuracy comparable to that of state-of-the-art methods, while accelerating model training by up to 20.74% and reducing the storage space for CXR image datasets by 94.6% on average.


Introduction
Detecting the lung boundary in chest X-ray (CXR) images has been extensively utilized in the diagnosis of lung health [1]. An ENT (ear, nose, and throat) radiologist is trained to instinctively recognize pulmonary disease based on particular differences that occur within the lung regions [2]. For example, shape irregularity, size measurement, and total lung volume provide clues for serious diseases such as cardiomegaly, pneumothorax, pneumoconiosis, or emphysema. This subjective approach relies on the condition and the experience of the radiologist.
The impact of air pollution on human health is well documented. The probability that a person suffers from a pulmonary disease increases as the air pollution level increases. Therefore, more patients will need an X-ray checkup, which adds to the workload of ENT radiologists and may increase the likelihood of diagnostic errors.
Several studies [3] have shown that computer-aided diagnosis (CAD) systems can indicate the distinctive features of particular respiratory diseases more accurately, reduce radiologist workload, and make remote diagnostics possible. For instance, the National Library of Medicine, in collaboration with the Indiana University School of Medicine [4], is developing a CAD system for screening tuberculosis patients in less developed areas that lack radiologists and equipment. A robust CAD system can help improve organ segmentation in many aspects.
1. The confined-region-based histogram equalization method is applied to CXR images to increase the difference (contrast) between the lungs and their surrounding regions (both bony structures and other soft tissues), which is shown to increase accuracy in the experimental results.
2. The grayscale CXR images are transformed into binary images using the adaptive binarization method, which reduces storage space usage by 94.6% with only a slight drop in prediction accuracy (1.1%).
3. We verify and compare the performance of the proposed method on the lung segmentation task using various convolutional-neural-network-based models that are actively adopted for semantic segmentation, especially for lung segmentation [14], including Fully Convolutional Networks (FCNs) [11], U-net [12], and SegNet [13], trained on the pre-processed CXR datasets.
The experimental results revealed that the proposed pre-processing steps make the model training process 20.74% faster while maintaining segmentation accuracy comparable to that of state-of-the-art methods.
To briefly sum up, we have made three primary contributions. (1) The confined-region-based histogram equalization method we adopt can improve segmentation accuracy. (2) The proposed method can expedite the model training process (20.74% faster). (3) It can substantially save storage space with only a slight drop in prediction accuracy (1.1%). The flowchart of the proposed method is shown in Figure 1. The rest of the paper is organized as follows. In Section 2, the related work will be discussed. The proposed method is described in detail in Section 3. Section 4 introduces the experimental environment and explains the test results. Section 5 concludes the paper.

Related Work
Our review covers the four lines of the literature most relevant to our problem: contrast enhancement, image binarization, lung segmentation, and convolutional neural networks.

CXR Contrast Enhancement
Image enhancement can be an essential component of accurate segmentation, especially for images with low visual quality, such as X-ray images. Existing work on image contrast enhancement broadly falls into two categories: histogram equalization (HE) and gamma correction. HE works by reassigning pixel values so that the image histogram matches the uniform distribution, which enhances the contrast of the input image. Ravia et al. [15] presented an HE technique for bone fracture. Contrast Limited Adaptive Histogram Equalization (CLAHE) locally processes all the small regions of the image, where the contrast is enhanced through adaptive HE while the chance of noise amplification is reduced. Ahmed et al. [16] proposed an image enhancement algorithm for dental X-ray images based on the adaptive HE technique. Gamma correction is a non-linear contrast enhancement technique that is applied to each pixel independently and modifies the dynamic range of the image. Mustapha et al. [17] proposed an approach to shift and modify the gamma value based on an adaptive factor.

Image Binarization
Image binarization aims to convert a grayscale image to its binary version. For example, scanned electronic documents can be binarized for further use by separating texts and other information from the background. There are two main approaches for image binarization, which are local and global image binarization methods. For the local binarization method, the threshold is determined according to properties of local regions in the image, generally working well on low-quality images. Niblack [18] proposed to calculate the mean and standard deviation of pixels in a sliding window manner to determine the threshold. Sauvola's approach [19] extends Niblack's work [18], which addresses the issue of black noise using the range of intensities of the image. Unlike local image binarization methods, which usually are more time-consuming and computationally expensive, global image binarization only determines one global threshold. If pixel values are more than the threshold, they are classified as foreground. Otherwise, they are background. Otsu's method [20] finds the threshold that maximizes the between-class variance, which is equivalent to minimizing the within-class variance. Ridler et al. [21] proposed to calculate the threshold by iteratively dividing the pixel histogram into two classes.
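As an illustration of global thresholding, the following minimal Python sketch searches for the threshold that maximizes the between-class variance, in the spirit of Otsu's method [20]. It is not taken from the cited work: the function name `otsu_threshold`, the flat list of integer intensities, and the convention that values strictly above the threshold are foreground are all assumptions made for this example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Assumed convention: values <= t are background, values > t are foreground.
    """
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of background pixels for the current t
    sum_bg = 0  # intensity sum of the background pixels
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue  # no background pixels yet
        if w_fg == 0:
            break     # no foreground pixels left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to the constant factor 1 / total**2.
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 (foreground) or 0 (background)."""
    return [1 if v > threshold else 0 for v in pixels]
```

For a clearly bimodal input, any threshold between the two modes yields the same split; the sketch returns the smallest such value.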

Lung Segmentation
A large body of work on image segmentation for chest X-ray analysis has been proposed over the last few decades. Lung segmentation approaches can be roughly classified into three categories [22]. First, rule-based segmentation schemes, which are also parametric learning algorithms, apply a sequence of steps and rules such as thresholding [23], edge detection [24,25], geometrical model fitting [25], region growing [24], and morphological operations [15]. Lihua et al. [15] proposed to replace edge detection in lung segmentation with the first derivative of the horizontal and/or vertical image profiles. However, these methods are mostly heuristic and do not generate accurate results. Therefore, they are often used as an initialization step in more robust segmentation algorithms [26]. Second, pixel classification-based schemes exploit general classifiers, such as Markov random field modeling or various types of neural networks, to extract lung regions. They are supervised-learning-based methods that classify pixels into lung and non-lung regions using a set of lung masks [11,12,27-31]. Suzuki et al. [32] proposed to utilize massive training artificial neural networks to suppress the contrast of ribs and clavicles in chest radiographs while maintaining the visibility of nodules and lung vessels. Third, deformable model-based schemes have been widely applied to medical image analysis because of their flexibility in shape and size. For example, Active Shape Models (ASMs) are deformable statistical models of the shape of objects that contain a set of landmark points [33]. ASMs have been successfully applied to lung region segmentation [34-36] and achieve fair accuracy, although their results are often inaccurate around the clavicles and rib cage. A number of studies [37-39] have proposed ways to address this issue.
Active Appearance Models (AAMs) [27] utilize the multi-scale filter bank of Gaussian derivatives and k-nearest neighbor classifiers. The major difference between AAMs and ASMs is that AAMs consider all object pixels with a combination of shapes and appearances, while ASMs consider border representation. In addition, hybrid approaches that combine prior schemes to produce better results were also discussed and proposed. For instance, Ginneken et al. [27] integrated deformation-based (active shape model, active appearance model), and pixel classification methods for better performance using the majority rule. Coppini et al. [40] exploited a closed fuzzy-curve algorithm for emphysema detection. The fuzzy-membership functions are determined by Kohonen networks to model lung boundaries. Candemir et al. [4] proposed a lung segmentation method that specifically analyzes input using a content-based image retrieval approach for determining features by SIFT-flow registration to extract fine details.

Common Convolutional Neural Network Models for Segmentation
Current state-of-the-art neural-network-based object detection methods generally include two parts: bounding box proposals and semantic segmentation. Bounding box proposal methods generate potential bounding boxes in an image and run a classifier on those proposed boxes. Redmon et al. [30] proposed a single regression method that directly deals with image pixels to generate bounding box coordinates and category probabilities. Liu et al. [31] used a small convolution filter to predict object classes and offsets at bounding box locations with detections at different scales. Semantic segmentation methods assign a pre-defined class to each pixel. Consequently, the prediction accuracy for medical image segmentation using semantic-based methods could, in general, be higher than that of bounding box-based methods, since a bounding box may contain pixels that do not correspond to any referred object. Shelhamer et al. [11] proposed Fully Convolutional Networks (FCNs), whose architecture consists of only convolution layers without any fully-connected layers. FCNs have several variants, such as FCN-32, FCN-16, and FCN-8, whose outputs are predictions upsampled by factors of 32, 16, and 8, respectively. Ronneberger et al. [12] proposed the U-net model, which allows the network to propagate context information to higher resolution layers. Badrinarayanan et al. [13] presented a trainable architecture that consists of an encoder network and a corresponding decoder network followed by a pixel-wise classification layer. There are also hybrid methods that combine the prior schemes. For example, Howard et al. [41] utilized depthwise separable convolutions to build lightweight deep neural networks.

Contrast Enhancement with Confined-Region-based HE
Radiographic examination uses high-kilovoltage techniques, such as X-rays or gamma rays, to check the internal structure of a component. It requires penetration through all tissues, which decreases the attenuation differences between them and is therefore likely to produce low-contrast X-ray images. We may obtain more accurate segmentation results if these images are first enhanced to have better contrast. Conventionally, applying HE can often improve image contrast; however, HE uniformly stretches out the intensity range of the image, which may cause under- or over-enhancement. Therefore, we propose using confined-region-based HE to better differentiate the lungs from the surrounding regions.
Let $I$ be the $b$-bit input image and $I(p) \in [0, 2^b - 1]$ be the intensity of the input image at pixel $p$. The image histogram $H$ is computed as

$$H(l) = \sum_{p} \mathbb{1}\left(I(p) = l\right),$$

where $l \in [0, 2^b - 1]$ and $\mathbb{1}$ is the indicator function defined as

$$\mathbb{1}(x) = \begin{cases} 1, & \text{if } x \text{ is true}, \\ 0, & \text{otherwise}. \end{cases}$$

Generally, a CXR image has a dark background and a bright foreground, where $SH_L$ represents the histogram of the background with dark features of soft tissues, and $SH_U$ represents the foreground with bright features of bone structures. To enlarge the difference between the background and foreground, we define the confined-region cumulative distribution function $CDF_{LU}$ with lower and upper bounds, $L$ and $U$, as

$$CDF_{LU}(l) = \frac{1}{W} \sum_{i=L}^{l} H(i), \quad L \le l \le U,$$

where $W = \sum_{i=L}^{U} H(i)$. The transformation function $T$ of HE is then defined based on the confined-region cumulative distribution function $CDF_{LU}$. In our method, we specify $L$ and $U$ as $L = SH_L^{\max}$ and $U = SH_U^{\max}$, where $SH_L^{\max}$ and $SH_U^{\max}$ represent the peak bin values of $SH_L$ and $SH_U$, as shown in Figure 2. Finally, the output image $I_o$ after our confined-region-based HE is obtained as $I_o(p) = T(I(p))$.
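The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the exact implementation: the transformation is assumed to be $T(l) = \lfloor (2^b - 1) \cdot CDF_{LU}(l) \rfloor$ inside $[L, U]$, intensities below $L$ are assumed to map to 0 and those above $U$ to $2^b - 1$, and the boundary between $SH_L$ and $SH_U$ is passed in as a hypothetical `split` level. The function names are also illustrative.

```python
def histogram(pixels, bits=8):
    """Count how many pixels take each intensity level in [0, 2^bits - 1]."""
    hist = [0] * (1 << bits)
    for v in pixels:
        hist[v] += 1
    return hist


def peak_bins(hist, split):
    """Return the peak bins (L, U) of the lower and upper sub-histograms.

    Assumption: levels below `split` form SH_L, the remaining levels SH_U.
    """
    lower = max(range(0, split), key=lambda l: hist[l])
    upper = max(range(split, len(hist)), key=lambda l: hist[l])
    return lower, upper


def confined_region_he(pixels, lower, upper, bits=8):
    """Equalize intensities using the CDF confined to [lower, upper]."""
    max_level = (1 << bits) - 1
    hist = histogram(pixels, bits)
    w = sum(hist[lower:upper + 1])  # W = sum of H(i) for i in [L, U]
    if w == 0:
        raise ValueError("no pixels fall inside the confined region")

    mapping = [0] * (max_level + 1)
    running = 0
    for level in range(max_level + 1):
        if level < lower:
            mapping[level] = 0          # assumed: clamp to darkest level
        elif level > upper:
            mapping[level] = max_level  # assumed: clamp to brightest level
        else:
            running += hist[level]
            # Assumed transformation: T(l) = floor((2^b - 1) * CDF_LU(l)).
            mapping[level] = (max_level * running) // w
    return [mapping[v] for v in pixels]
```

Confining the CDF to $[L, U]$ spends the whole output range on the intensities between the two histogram peaks, which is where the lung boundary contrast lies.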

Image Binarization
After applying our confined-region-based HE to the input image, we can quantize its intensity range to reduce the storage size. To provide a more flexible method for image quantization, we adopt an image thresholding approach based on iterative selection [42] to find the thresholds that quantize the intensity range of the input image into different numbers of levels. With the initial cluster centers assigned, we can classify pixels into different groups. Based on observed CXR features, the first two cluster centers are empirically initialized as one of the four corners and the center of the image. We consider the chosen corner pixel a background pixel with intensity $T_0$, whereas the center pixel is regarded as a foreground pixel with intensity $T_S$, where $T_0 \le T_S$. The remaining cluster centers $\{T_1, T_2, \ldots, T_{S-1}\}$ are selected evenly between $T_0$ and $T_S$, and each cluster center $T_l$ corresponds to its cluster $C_l$. Next, each image pixel $M_i$ is assigned to the class whose center has the shortest distance to the pixel, where $D_{ij}$ denotes the distance between pixel $M_i$ and cluster center $T_j$. Here, $i$ is a pixel index and $j \in \{0, 1, \ldots, S\}$, so that $M_i \in C_k$ when $D_{ik} = \min_{\forall j}(D_{ij})$. After all the pixels are classified, we update each cluster center $T_j$ from the pixels assigned to its cluster $C_j$, and $T_j$ is iteratively updated until it converges. With the cluster centers $\{T_0, T_1, T_2, \ldots, T_S\}$, we can then quantize the original intensity range. According to our experimental results, we choose to binarize CXR images to reduce data storage usage with only a slight drop in prediction accuracy (1.1%).
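A minimal Python sketch of this iterative clustering is given below. It assumes that the distance is the absolute intensity difference $|M_i - T_j|$ and that each center is updated to the mean intensity of its cluster, as in standard iterative selection; the function name and arguments are illustrative rather than the original implementation.

```python
def iterative_quantize(pixels, t_background, t_foreground, clusters=2, max_iter=100):
    """Cluster pixel intensities into `clusters` levels (clusters = S + 1 >= 2).

    t_background: intensity of the chosen corner pixel (T_0).
    t_foreground: intensity of the image-center pixel (T_S).
    Returns the converged cluster centers and one cluster label per pixel.
    """
    s = clusters - 1
    # Initial centers spaced evenly between T_0 and T_S.
    centers = [t_background + (t_foreground - t_background) * j / s
               for j in range(clusters)]

    def nearest(value):
        # Assumed distance: absolute intensity difference |M_i - T_j|.
        return min(range(clusters), key=lambda j: abs(value - centers[j]))

    for _ in range(max_iter):
        groups = [[] for _ in range(clusters)]
        for m in pixels:
            groups[nearest(m)].append(m)
        # Assumed update: mean intensity of each cluster (empty clusters keep their center).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break  # converged
        centers = updated

    labels = [nearest(m) for m in pixels]
    return centers, labels
```

With `clusters=2` the labels form the binarized image; larger values such as 16 or 256 give coarser or finer quantization levels.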

Image Segmentation based on Deep Neural Networks
Finally, after applying contrast enhancement and image binarization to the CXR images, we choose three state-of-the-art deep-neural-network-based models that are often used for semantic segmentation, namely FCN, U-net, and SegNet [14], to assess the practicality of the proposed method. Note that we train these models from scratch on our pre-processed CXR images for lung X-ray segmentation.
In Figure 3, we show the general architectures of FCN [11], U-net [12], and SegNet [13]. An FCN model [11], which consists of only convolutional, pooling, and transposed convolution layers, transforms the input image into pixel categories. Instead of using fully connected layers, the model uses encoder-like layers to extract features from the input image and transforms these features back to the size of the input image through the transposed convolution layer. For a pixel at a given location in the input image, the output is the predicted segmentation label of the pixel at that location. The U-net architecture proposed in Reference [12] is derived from the FCN architecture by adding a full decoder. U-net differs from FCN in that it replaces the transposed convolutional layers with upsampling operations to increase the resolution of the output. Additionally, U-net adds skip-connections that concatenate low-level features from the encoder part with high-level features from the decoder part, which supplements the global information with local information. SegNet [13] is a convolutional encoder-decoder architecture proposed for semantic pixel-wise segmentation, whose architecture is similar to that of U-net. The differences lie in two aspects. First, the original SegNet does not have skip-connections. Second, it uses unpooling layers to upsample the resolution of the feature maps and the output. The general loss function for a lung segmentation task is defined by binary cross entropy as

$$\mathcal{L} = -\sum_{p} \left[ S_{gt}(p) \log \hat{S}(p) + \left(1 - S_{gt}(p)\right) \log\left(1 - \hat{S}(p)\right) \right],$$

where $S_{gt}(p) \in \{0, 1\}$ is the ground truth segmentation label of the pixel $p$ and $\hat{S}(p)$ is the predicted probability of $p$ belonging to the lung regions.

Figure 3. General architectures of (a) FCN [11], (b) U-net [12], and (c) SegNet [13].
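For concreteness, the binary cross entropy over a set of pixels can be computed as in the following Python sketch. Summing over pixels (rather than averaging, as many deep learning frameworks do) and clipping the probabilities with a small `eps` are assumptions made for this example; a mean differs from the sum only by a constant factor.

```python
import math


def binary_cross_entropy(ground_truth, predicted, eps=1e-7):
    """Sum of per-pixel binary cross entropy terms.

    ground_truth: iterable of labels in {0, 1} (1 = lung).
    predicted: iterable of predicted lung probabilities in [0, 1].
    eps: clipping constant that keeps log() finite (an assumption of this sketch).
    """
    total = 0.0
    for s_gt, s_hat in zip(ground_truth, predicted):
        s_hat = min(max(s_hat, eps), 1.0 - eps)
        total += s_gt * math.log(s_hat) + (1 - s_gt) * math.log(1.0 - s_hat)
    return -total
```

An uninformative prediction of 0.5 for every pixel costs $\log 2$ per pixel, while confident correct predictions drive the loss toward zero.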

Chest X-ray Datasets
To verify our method, we collected three different CXR datasets for the experiment:
1. The Japanese Society of Radiological Technology (JSRT) dataset, which contains manually annotated segmentation labels of lung fields, heart, and clavicles. The JSRT dataset contains 154 nodule-containing digital CXR images (100 malignant cases, 54 benign cases) and 93 normal digital images [43]. The images are grayscale with a bit depth of 12. The size of the images is 2048 × 2048. Both the vertical and horizontal pixel spacing is 0.175 mm.
2. The Montgomery dataset, which the Department of Health and Human Services of Maryland collected over many years under Montgomery County's Tuberculosis Control scheme. The dataset consists of 58 digital CXR images with manifestations of tuberculosis and 80 normal digital CXR images [44]. The X-ray images are 12-bit grayscale images, and their size is 4020 × 4892 with 0.0875 mm pixel resolution.
3. A dataset from a private clinic in India, which includes 397 chest X-rays with resolutions of 2446 × 2010, 1772 × 1430, and 2010 × 1572. They are all 12-bit grayscale images. The vertical and horizontal pixel spacing are both 0.175 mm.
Here, we randomly split each dataset into training, validation, and testing sets, with 620 images for training, 69 images for validation, and 69 images for testing [27,44]. To enlarge the dataset, we applied random cropping for augmentation as recommended in Reference [45]. Note that all the images are grayscale with 12-bit depth and are resized to 320 × 320 for training and testing. The experiment was run on a computer with an Intel® Core™ i7-7700 4.20 GHz CPU, 16 GB of RAM, and an Nvidia GeForce RTX 2080 Ti with 11 GB of VRAM.

Object Evaluation
To fairly compare the performance of the above-mentioned models, the measurement metrics used are the Jaccard Similarity coefficient ($\Omega_{JS}$), Dice's coefficient ($\Omega_{DS}$), and Mean Absolute Error (MAE). The Jaccard Similarity coefficient, also known as the Jaccard Index, measures the similarity and diversity of sample sets and is defined as

$$\Omega_{JS} = \frac{|TP|}{|FP| + |TP| + |FN|},$$

where $|TP|$, $|FP|$, and $|FN|$ are the numbers of true positives, false positives, and false negatives. Dice's coefficient also quantifies similarity like $\Omega_{JS}$ but places a different weight on true positives:

$$\Omega_{DS} = \frac{2|TP|}{|FP| + 2|TP| + |FN|}.$$
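The metrics can be computed from a pair of binary masks as in the following Python sketch. The MAE here is assumed to be the mean absolute difference between the ground truth and the prediction per pixel, which the text does not spell out, and returning 1.0 when both masks are empty is a convention chosen for this example. All function names are illustrative.

```python
def confusion_counts(ground_truth, predicted):
    """Return (TP, FP, FN) for two flat binary masks of equal length."""
    pairs = list(zip(ground_truth, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    return tp, fp, fn


def jaccard(tp, fp, fn):
    """Jaccard similarity: |TP| / (|FP| + |TP| + |FN|)."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # assumed convention for empty masks


def dice(tp, fp, fn):
    """Dice coefficient: 2|TP| / (|FP| + 2|TP| + |FN|)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # assumed convention for empty masks


def mean_absolute_error(ground_truth, predicted):
    """Assumed definition: average absolute per-pixel difference."""
    return sum(abs(g - p) for g, p in zip(ground_truth, predicted)) / len(ground_truth)
```

For any non-empty pair of masks the two overlap scores are related by $\Omega_{DS} = 2\Omega_{JS} / (1 + \Omega_{JS})$, so they always rank models in the same order.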
Each model is trained and tested on both the original chest X-ray (OCXR) and enhanced chest X-ray (ECXR) datasets for a more detailed comparison. The ECXR dataset is generated using the contrast enhancement method described in Section 3.1. In the experiment, pixels are classified into $k$ groups for testing, where $k \in \{2, 16, 256\}$. Table 1 shows that the proposed contrast enhancement increases $\Omega_{DS}$ overall by 20% on the U-net model, 15% on the FCN-8 model, 20% on the FCN-32 model, and 15% on the SegNet model. Moreover, for a given network model and dataset (OCXR or ECXR), different values of $k$ yield similar average accuracy in terms of $\Omega_{JS}$, $\Omega_{DS}$, and MAE. Therefore, to save storage space, we can binarize the images used for lung segmentation, which also reduces the time the models spend accessing images.

Table 1. The segmentation accuracy (measured using the Jaccard, Dice, and MAE metrics) of the often-used segmentation models using different pixel clusters.


Convergence Rate
As previously noted, the proposed method uses binarized CXR images for training and testing. Table 2 summarizes the number of iterations required for the training of different lung segmentation models to converge. In the experiment, we compare the number of iterations needed for convergence on unprocessed and processed CXR image datasets. To be specific, we generate the ECXR dataset by applying our confined-region-based HE to the OCXR dataset. We binarize the OCXR images to produce the BOCXR dataset. Finally, the BECXR dataset is obtained by binarizing the ECXR dataset. The results show that the ECXR dataset is easier for the often-used segmentation models to train on. Comparing training on the OCXR and ECXR datasets, training on the ECXR dataset converges 11.07% faster than on the OCXR dataset. Using the image binarization approach based on pixel clustering, we accelerate training by 7.02% when comparing the BOCXR dataset to the OCXR dataset. If we binarize both the OCXR and ECXR datasets (BOCXR vs. BECXR), training on the BECXR dataset converges 14.75% faster than on the BOCXR dataset. Moreover, our image binarization approach also expedites training on the enhanced dataset (ECXR): using the BECXR dataset speeds up the convergence of the segmentation models by 10.88% on average compared with using the ECXR dataset. Table 2 shows in detail all the comparisons of convergence rates among the different segmentation models. In summary, applying both our image enhancement method and binarization to the OCXR dataset makes training converge 20.74% faster on average. Figure 4 demonstrates accurate segmentation results obtained using the U-net model trained on the BECXR dataset, which is the OCXR dataset processed by the proposed pre-processing approach.

Table 2. The comparisons of the convergence rates (measured using the numbers of iterations needed for the training of the models to converge) using different pre-processing approaches. The second to fifth columns list the total iterations for convergence. The sixth to tenth columns list the iteration reduction percentages of "A" versus "B" (e.g., original chest X-ray (OCXR) vs. enhanced chest X-ray (ECXR)). The last row shows the average with respect to each column.

Figure 4. An example of segmentation results. The red and green contours represent the expert annotation and the segmentation estimated by the U-net model [12], respectively. Note that the contrast of the figure is enhanced for display.
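The iteration reduction percentages can be read as relative reductions with respect to the baseline dataset "A". The Python sketch below assumes this interpretation, which the text does not state as an explicit formula, and the iteration counts in the usage note are hypothetical values chosen only to show the arithmetic.

```python
def iteration_reduction(iters_a, iters_b):
    """Percentage by which dataset B reduces the iterations needed on dataset A.

    Assumed formula: 100 * (A - B) / A.
    """
    if iters_a <= 0:
        raise ValueError("baseline iteration count must be positive")
    return 100.0 * (iters_a - iters_b) / iters_a
```

For example, with hypothetical counts of 1000 iterations on "A" and 800 on "B", `iteration_reduction(1000, 800)` gives a 20% reduction. Under this formula, reductions measured against different baselines do not simply add up when the two pre-processing steps are combined.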

Conclusions
In this work, we have made two primary contributions. First, we propose an effective pre-processing approach that can save storage space for image datasets. Second, we greatly expedite the model training process in lung X-ray segmentation based on CNN-based architectures using the proposed method. More specifically, using the proposed contrast enhancement and image binarization steps, we demonstrate that it can help the training converge faster and take less storage space for data with only a slight drop in prediction accuracy (1.1%). We test our approach using four often-used CNN-based segmentation models with the OCXR, ECXR, BOCXR, and BECXR datasets to verify the effectiveness of our proposed pre-processing approach. Experimental results showed that using the dataset (BECXR) processed by the proposed method can help the training converge 20.74% faster as well as decrease 94.6% of the storage space usage on average compared to using the original dataset (OCXR).

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.