Lung Field Segmentation in Chest X-ray Images Using Superpixel Resizing and Encoder–Decoder Segmentation Networks

Lung segmentation of chest X-ray (CXR) images is a fundamental step in many diagnostic applications. Most lung field segmentation methods reduce the image size to speed up subsequent processing, and the low-resolution result is then upsampled to the original high resolution. However, object boundaries become blurred during the downsampling and upsampling steps, and this blurring must be alleviated. In this paper, we incorporate lung field segmentation into a superpixel resizing framework to achieve this goal. The superpixel resizing framework upsamples the segmentation results based on superpixel boundary information obtained during the downsampling process. With this method, not only can the computation time of high-resolution medical image segmentation be reduced, but the quality of the segmentation results can also be preserved. We evaluate the proposed method on the JSRT, LIDC-IDRI, and ANH datasets. The experimental results show that the proposed superpixel resizing framework outperforms traditional image resizing methods. Furthermore, by combining the segmentation network with the superpixel resizing framework, the proposed method achieves better results with an average runtime of 4.6 s on CPU and 0.02 s on GPU.


Introduction
Chest X-ray (CXR) is the most common imaging technique used for lung diagnosis and treatment, especially for COVID-19. Lung segmentation of CXR images is a fundamental step in many diagnostic applications involving the detection, recognition, and analysis of anatomical structures in computer-aided diagnosis systems. However, manual identification of lung fields is time-consuming and error-prone. Thus, accurate automatic segmentation of lung fields has received attention from researchers as an essential preprocessing step in automatically analyzing chest radiographs.
CXR images have high resolutions. For example, the resolution of the public Japanese Society of Radiological Technology (JSRT) dataset [1], which is widely used to evaluate the performance of CXR lung segmentation methods, is 2048 × 2048. Therefore, most CXR lung field segmentation studies, especially deep learning-based methods, downsample images to 128 × 128 or 256 × 256 through linear interpolation to reduce the computation time [2][3][4]. However, during downsampling, the gray-level information of several pixels in the high-resolution image is merged to form the gray-level information of a single pixel in the downsampled low-resolution image. The resulting loss of boundary information is unavoidable and causes blurred or missing boundaries in the low-resolution image. On the other hand, most lung segmentation methods rely on large gray-value contrasts between the lung fields and surrounding tissues. As a result, segmentation quality degrades for images with blurred boundaries.
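The gray-level merging described above can be illustrated with a toy example (illustrative only, not part of the proposed method): averaging 2 × 2 blocks, as linear downsampling effectively does, turns a sharp two-level edge into intermediate gray values, so the exact edge position can no longer be recovered.

```python
def downsample_2x_mean(img):
    """Downsample a 2D image by averaging each 2x2 block, mimicking
    the merging of gray levels during linear downsampling."""
    h, w = len(img), len(img[0])
    return [[(img[2*r][2*c] + img[2*r][2*c+1] +
              img[2*r+1][2*c] + img[2*r+1][2*c+1]) / 4.0
             for c in range(w // 2)]
            for r in range(h // 2)]

# A 4x4 image with a sharp vertical edge after column 0.
sharp = [[0, 255, 255, 255]] * 4
small = downsample_2x_mean(sharp)
# The edge blocks mix 0 and 255 into the intermediate value 127.5,
# so the original edge position is lost.
```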
Additionally, high-resolution images are necessary for practical medical applications. Thus, the segmentation methods based on downsampling preprocessing need to upsample the segmented results to the original high-resolution images. However, the upsampled results obtained from pixels of the low-resolution images will contain artifacts. Without sufficient information, the boundaries of segmented tissues are hard to correctly recover during upsampling. As a result, the quality of the upsampled segmentation results is worse than that of the results processed from the original high-resolution images.
Solving the blurred-boundary problem in downsampling and the artifact problem in upsampling is necessary for deep learning-based segmentation of high-resolution medical images. Most existing stand-alone methods, such as bilinear, bicubic, and nearest-neighbor interpolation, treat downsampling and upsampling as independent operations rather than as coupled steps, and therefore cannot solve the problems mentioned above. To alleviate these problems, this study employs a superpixel resizing framework to reduce information loss during downsampling and to reconstruct the boundaries of foreground segmentation results during upsampling.
This paper proposes a lung field segmentation method that combines an encoder-decoder segmentation network with a superpixel resizing framework based on ultrafast superpixel extraction via quantization (USEQ) [5]. This method reduces the computational time of high-resolution medical image segmentation while preserving the quality of the segmentation results. In the experiments, three datasets are used to demonstrate the segmentation performance of the proposed method. Furthermore, we evaluate the performance of USEQ resizing and bicubic interpolation in the downsampling and upsampling steps. Lung field segmentation using the USEQ superpixel resizing framework significantly outperforms other stand-alone resizing methods.
The remainder of this paper is organized as follows: Section 2 presents related work. Section 3 describes the datasets and critical components of our proposed method. We present our experimental results as well as an analysis in Section 4. Finally, Section 5 concludes this paper.

Related Work
In this section, we briefly revisit recent works on lung field segmentation and superpixel algorithms.

Lung Field Segmentation
Over the past decades, several lung field segmentation methods have been proposed. Hu et al. [6] proposed a three-step process of identifying the lungs in three-dimensional pulmonary X-ray CT (computed tomography) images. Additionally, Wang et al. [7] also used a three-step approach to segmenting lungs with severe interstitial lung disease (ILD) in thoracic CT. Alternatively, a fuzzy-based automatic lung segmentation technique from CT images was proposed [8]. This system needs no prior assumption of images. Sluimer et al. [9] proposed a refined segmentation-by-registration scheme based on an atlas to segment the pathological lungs in CT. Chama et al. [10] introduced an improved lung field segmentation in CT using mean shift clustering. Ibragimov et al. [11] used a supervised landmark-based segmentation in CXR lung field segmentation. Similarly, Yang et al. [4] proposed a computationally efficient method of lung field segmentation using structured random forests to detect lung boundaries from CXR images. Their approach is highly computationally efficient; it promotes a fast and practical procedure of lung field segmentation.
Deformable model-based methods adopt the internal force from the object shape and the external force from the image appearance to guide lung segmentation. Back in 2006, Van Ginneken et al. [12] compared three methods for segmenting the lung fields in CXRs, including the active shape model (ASM), the active appearance model (AAM), and pixel classification. A hierarchical deformable approach based on shape and appearance models originated from the work of Shao et al. [13]. Similarly, a learnable MGRF (Markov-Gibbs random field) was introduced by Soliman et al. to accurately segment pathological and healthy lungs for reliable computer-aided disease diagnostics [14]. Their module integrates two visual appearance sub-models with an adaptive lung shape sub-model. A fully automated approach to segmenting lungs with high-density pathologies has also been introduced [15]; it utilizes a novel robust active shape model matching method to roughly segment the lung outline. Hu and Li [16] proposed an automatic segmentation method for lung fields in CXRs based on an improved Snake model. Bosdelekidis and Ioakeimidis [17] introduced a deformation-tolerant procedure based on approximating rib cage seed points for lung field segmentation.
Deep learning is the state of the art in semantic image segmentation [18][19][20][21][22]. Novikov et al. [3] proposed convolutional neural network (CNN) architectures for automated multi-class segmentation of the lungs, clavicles, and heart. Most deep learning segmentation algorithms adopt an encoder-decoder architecture, e.g., U-Net and SegNet [23,24]. U-Net is an encoder-decoder network that has served as the baseline architecture for most CXR segmentation models, and many studies have modified its structure. For example, Wang [2] used a U-Net to segment multiple anatomical structures in CXRs. Arora et al. [25] proposed a modified UNet++ framework, and Yahyatabar et al. [26] offered a Dense-Unet inspired by DenseNet and U-Net for lung segmentation. Moreover, Wang et al. [27] proposed a cascaded learning framework for the automated detection of pneumoconiosis, including a machine learning-based pixel classifier for lung field segmentation, Cycle-Consistent Adversarial Networks (CycleGAN) for generating large lung field images for training, and a CNN-based image classifier.

Superpixels
A superpixel is a group of perceptually similar pixels. Superpixels represent image regions and adhere to intensity edges for segmentation purposes. There are three main desirable properties for superpixel extraction algorithms [28]: (1) superpixels should accurately adhere to image boundaries and should consist of perceptually similar pixels; (2) superpixels should be computationally efficient as they are used in preprocessing and postprocessing steps; (3) superpixels should improve speed and segmentation quality.
Superpixel algorithms are divided into two categories: graph-based and gradient-ascent-based methods. Graph-based methods treat each pixel as a node in a graph, with edge weights between two nodes proportional to the similarity between neighboring pixels. The superpixels are generated by minimizing a cost function defined over the graph [29][30][31][32]. Gradient-ascent-based methods [28,[33][34][35][36] start from a rough initial clustering of pixels and apply gradient ascent, taking steps proportional to the gradient to approach a local maximum of the objective function. The clusters are then iteratively refined until some convergence criterion is met to form superpixels.
Compared to the superpixel algorithms mentioned earlier, the USEQ algorithm achieves better and more competitive performance regarding boundary recall, segmentation error, and achievable segmentation accuracy. It is much faster than other methods because it does not use an iterative optimization process; instead, it performs spatial and color quantization in advance to represent pixels and superpixels. Unlike iterative approaches, it aggregates pixels into spatially and visually consistent superpixels using maximum a posteriori (MAP) estimation at the pixel and region levels. Motivated by these works, we propose an approach that combines the USEQ superpixel resizing framework and an encoder-decoder-based segmentation network in a unified manner.

Datasets and Preprocessing
All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethical Committee of China Medical University Hospital, Taichung, Taiwan (CMUH106-REC2-040 (FR)). The datasets used in this study include two public datasets, JSRT and the Lung Image Database Consortium Image Collection (LIDC-IDRI) [37], and a non-public dataset collected from An Nan Hospital (ANH). These three datasets contain 247, 33, and 58 images, respectively. JSRT images have a fixed resolution of 2048 × 2048, whereas the resolutions of LIDC-IDRI and ANH images vary; their average resolutions are 2700 × 2640 and 2705 × 3307, respectively. The resolution distribution of all images is shown in Figure 1.

The original image pixels are stored in 12 bits with 4096 gray levels. The file format of the LIDC-IDRI and ANH datasets is Digital Imaging and Communications in Medicine (DICOM), while headers are not used in the JSRT dataset. Therefore, the original images are mapped to 8 bits and stored in PNG format at their actual sizes. In most cases, deep learning algorithms perform better when trained on more data. Therefore, this work augments the images by generating randomly rotated images with a maximum rotation of ±10 degrees for each original image. Examples of augmented images from the JSRT dataset are shown in Figure 2. The work also uses manual reference segmentations drawn by medical experts for each image. The segmentation masks are labeled with values of 0 and 1, corresponding to the background and lung fields. A ground truth example from the LIDC-IDRI dataset is shown in Figure 3.


Overview of the Lung Field Segmentation
The proposed lung field segmentation method combines an encoder-decoder segmentation network and the USEQ superpixel resizing framework [38] to obtain high-quality segmentation results. The architecture of the method is shown in Figure 4. First, the input image is downsampled using the downsampling interpolation function to obtain a low-resolution image, which reduces the computation time of the subsequent segmentation network. Next, the downsampled low-resolution image is processed through the encoder-decoder segmentation network to segment the lung fields. Then, the proposed upsampling interpolation function upsamples the segmentation results based on the superpixel boundary information obtained during the downsampling process; the stored superpixel boundary information is used to recover the high-resolution segmentation results. Finally, post-processing is applied to correct the segmentation results.
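The four-stage pipeline above can be sketched at a purely structural level as follows. All function names here are hypothetical placeholders, not the authors' code; each stage is stubbed with a trivial implementation to show only the data flow between the stages.

```python
def segment_lung_fields(image, downsample, segment, upsample, postprocess):
    """Orchestrate the pipeline: superpixel-aware downsampling with
    boundary bookkeeping, low-resolution segmentation, boundary-guided
    upsampling, then mask cleanup."""
    low_res, boundaries = downsample(image)      # superpixel resizing framework
    low_mask = segment(low_res)                  # encoder-decoder network
    high_mask = upsample(low_mask, boundaries)   # boundary-guided upsampling
    return postprocess(high_mask)                # keep lungs, fill holes

# Minimal stand-ins, used only to demonstrate the data flow.
result = segment_lung_fields(
    image=[[0, 1], [1, 0]],
    downsample=lambda img: (img, None),
    segment=lambda img: img,
    upsample=lambda mask, b: mask,
    postprocess=lambda mask: mask,
)
```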

USEQ Superpixel Extraction
The USEQ algorithm consists of four computationally efficient steps for generating superpixels. First, it employs spatial quantization to generate the initial superpixels based on pixel locations. Second, the color space of each pixel is also quantized to obtain the dominant color within each initial superpixel. In the spatial quantization, the USEQ algorithm calculates the initial width and height of a superpixel and then defines the spatial relationship between pixels and superpixels. The initial width w_sp and height h_sp of each superpixel are computed as follows:

w_sp = W / √δ (1)

h_sp = H / √δ (2)

where W and H represent the image width and height, respectively, and δ denotes the target number of superpixels. Pixels belonging to superpixel sp_i are defined by the grid cell that contains them:

sp_i = { p_k | ⌊x_k / w_sp⌋ + ⌊y_k / h_sp⌋ · √δ = i } (3)

where p_k = [x_k, y_k]^T is the position of the k-th pixel in an image. The spatial neighbor relationship e(sp_i, sp_j) between superpixels sp_i and sp_j is defined as follows:

e(sp_i, sp_j) = { 1 if sp_i and sp_j are neighboring grid cells; 0 otherwise } (4)

This enables the algorithm to build a spatial neighbor relationship between superpixels. Third, after the spatial and color quantizations, a non-iterative maximum a posteriori (MAP) pixel label assignment uses both quantization results to reassign pixel labels for better boundary adherence. Finally, a MAP estimation-based neighborhood refinement merges small adjacent, visually similar superpixels to obtain superpixels with more regular and compact shapes. Figure 5 shows the flowchart of the USEQ superpixel extraction method, and an example of the USEQ result is shown in Figure 6.
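As a rough sketch of the spatial-quantization step, assuming a regular √δ × √δ grid initialization (our reading of the grid layout and of the neighbor relation e(sp_i, sp_j) in Equation (4); the exact formulas in [5] may differ):

```python
import math

def initial_superpixels(W, H, delta):
    """Assign each pixel to an initial superpixel on a regular grid.
    delta is the target number of superpixels (assumed a perfect square)."""
    n = int(math.isqrt(delta))          # grid cells per side
    w_sp, h_sp = W / n, H / n           # initial superpixel width and height
    return [[min(int(y // h_sp), n - 1) * n + min(int(x // w_sp), n - 1)
             for x in range(W)]
            for y in range(H)]

def are_neighbors(i, j, n):
    """Spatial neighbor relation: 1 if the two grid cells are 4-connected
    neighbors, 0 otherwise (cf. Equation (4))."""
    yi, xi = divmod(i, n)
    yj, xj = divmod(j, n)
    return 1 if abs(yi - yj) + abs(xi - xj) == 1 else 0
```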

USEQ Superpixel Resizing Framework
The superpixel resizing framework is mainly composed of the downsampling interpolation function F_D(·) and the upsampling interpolation function F_U(·). Let I be the input image, and let I_D and I_U be the downsampled and upsampled images of I. The image matrix I is composed of a homogeneous matrix H of homogeneous regions and a boundary matrix B of the boundaries of objects as follows:

I = H + B (5)

Here, the USEQ superpixel extraction separates the image matrix I into the homogeneous matrix H and the boundary matrix B. To obtain I_D, the downsampling interpolation function F_D(·) is applied to I as follows:

I_D = F_D(I) (6)

To recover the high-resolution image I_U, the upsampling interpolation function F_U(·) is applied to I_D, and I_U is represented as follows:

I_U = F_U(I_D) (7)

To obtain high-quality upsampled results that are similar to the original images, the distance function D(I_U, I) between I_U and I should be minimized as follows:

min D(I_U, I) (8)

Substituting Equations (5)-(7) into Equation (8), D(I_U, I) is then derived as follows:

D(I_U, I) = D(F_U(F_D(H + B)), H + B) (9)

For a superpixel sp_i, we classify the pixels in sp_i into homogeneous and boundary pixels. The boundary set sp_i^B contains the pixels in sp_i that are spatially connected to pixels in a neighboring superpixel sp_j, where i ≠ j, as follows:

sp_i^B = { p_k | d(p_k, p_l) ≤ 1, p_k ∈ sp_i, p_l ∈ sp_j, i ≠ j } (10)

where d(p_k, p_l) = ||p_k − p_l|| and p_k = [x_k, y_k]^T is the 2D image position of the k-th pixel p_k in I. The homogeneous set sp_i^H is then defined as the pixels that are in sp_i but not in the boundary set sp_i^B:

sp_i^H = { p_k | p_k ∈ sp_i, p_k ∉ sp_i^B } (11)

With sp_i^H and sp_i^B, H_M(p_k, sp_i) is then defined as follows:

H_M(p_k, sp_i) = { 1 if p_k ∈ sp_i^H; 0 otherwise } (12)

which indicates the pixels in the homogeneous region of sp_i. Similarly, B_M(p_k, sp_i) is defined as follows:

B_M(p_k, sp_i) = { 1 if p_k ∈ sp_i^B; 0 otherwise } (13)

which indicates the pixels on the boundaries of superpixel sp_i.
Therefore, the downsampling interpolation function F_D(·) is designed to map the colors of the pixels of sp_i to a single low-resolution pixel p_i^D as follows:

p_i^D = F_D(sp_i) (14)

Using Equation (14), the color value I(p_i^D) of the pixel p_i^D is obtained from the homogeneous pixels:

I(p_i^D) = (1 / |sp_i^H|) Σ_{p_k ∈ sp_i^H} I(p_k) (15)

Because sp_i^H contains the homogeneous pixels of sp_i, the obtained I(p_i^D) is visually similar to the colors of the pixels of sp_i.
The boundary between sp_i and sp_j is retained between p_i^D and p_j^D in the downsampled image, which means that the downsampling interpolation function F_D(·) effectively preserves object boundaries during downsampling. In this way, we can obtain a high-quality, low-resolution image containing clear boundaries of segmented objects and thus avoid degradation of the segmentation results. Here, sp_i^B of each superpixel is reserved as boundary information and used to recover the high-resolution segmentation results during upsampling.
To obtain the high-resolution segmentation results, the upsampling interpolation function F_U(·) is designed based on sp_i^B, which preserves the boundary information of the superpixels in an image. Because the boundary matrix B stores the boundary information during downsampling, it complements the missing boundary information for image upsampling. Thus, F_U(·) is designed to map the colors of the pixels p_i^D of I_D to the pixels in I_U as follows:

I(p_k^U) = F_U(I(p_i^D), p_k), p_k ∈ sp_i (16)

where I(p_k^U) is the color of p_k, which belongs to the superpixel sp_i of the image. In Equation (16), pixels of the same superpixel in the upsampled image have consistent colors. Moreover, the colors of pixels across boundaries differ based on the superpixel information. Thus, the upsampled image maintains the original boundaries of the segmented objects. In addition, the time complexity of the image upsampling step is very low, because only pixel value assignment is performed based on the superpixels.
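A minimal sketch of the resizing idea, under the simplifying assumptions that each superpixel maps to one low-resolution pixel whose value is the mean over the homogeneous pixels, and that upsampling assigns every pixel the value of its superpixel (our reading of Equations (14) and (16); the helper names are ours):

```python
def superpixel_downsample(img, labels, n_sp, boundary):
    """Collapse each superpixel to one value: the mean over its
    homogeneous (non-boundary) pixels."""
    sums = [0.0] * n_sp
    counts = [0] * n_sp
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            if not boundary[y][x]:          # homogeneous pixels only
                sums[labels[y][x]] += v
                counts[labels[y][x]] += 1
    return [sums[i] / counts[i] if counts[i] else 0.0 for i in range(n_sp)]

def superpixel_upsample(low, labels):
    """Recover a full-resolution image by giving every pixel the value
    of its superpixel, so the stored boundaries stay sharp."""
    return [[low[labels[y][x]] for x in range(len(labels[0]))]
            for y in range(len(labels))]
```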

Encoder-Decoder Segmentation Networks
The encoder-decoder segmentation network consists of five encoder layers with corresponding decoder layers. The network architecture is shown in Figure 7. Each convolutional layer is followed by batch normalization and ReLU (rectified linear unit) nonlinearity. These are followed by max pooling with a 2 × 2 window and a stride of two, without overlapping. The resulting feature map is thus subsampled by a factor of two, which enables translation invariance for robust classification. Although max pooling and subsampling achieve translation invariance, they cause a loss of resolution in the feature maps. In this work, we use SegNet [24] to address this problem by storing the location of the maximum feature value in each pooling window and passing it to the decoder.
The decoder network uses the stored pooling indices to upsample the encoder's feature maps. Each convolutional layer in the decoder is preceded by upsampling with the mapped pooling indices from the encoder and followed by batch normalization and ReLU nonlinearity. The feature map from the final decoder layer is fed to a softmax classifier for pixel-wise classification. The output of the softmax classifier is an N-channel probability map, where N is the number of classes. In this study, there are two classes: background and lung fields.
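The SegNet-style index passing can be illustrated on a single-channel map in pure Python (a simplified sketch, not the actual SegNet implementation): max pooling records the argmax location in each 2 × 2 window, and the decoder scatters values back to exactly those locations.

```python
def max_pool_with_indices(fmap):
    """2x2 max pooling that also stores the (row, col) of each maximum,
    as SegNet's encoder does."""
    h, w = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for r in range(0, h, 2):
        prow, irow = [], []
        for c in range(0, w, 2):
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            val, pos = max(window)
            prow.append(val)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices

def max_unpool(pooled, indices, h, w):
    """SegNet-style unpooling: place each pooled value at its stored
    location, zeros elsewhere."""
    out = [[0] * w for _ in range(h)]
    for prow, irow in zip(pooled, indices):
        for val, (r, c) in zip(prow, irow):
            out[r][c] = val
    return out
```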

Post-Processing
The softmax classifier at the end of the decoder network classifies each pixel of the image as background or lung field. The purpose of this study is to segment the lungs. Because the lung fields are full of air, the pixel intensity inside them is very low. Therefore, dark areas are classified as lung fields, and white areas are classified as background. However, due to anatomy, dark areas outside the lungs, such as the stomach, are also found in the body cavity and are likewise classified as lung fields. On the other hand, due to certain diseases, some white areas inside the lung fields are classified as background.
For these reasons, post-processing is used to correct the segmentation results. Only the two largest segmented regions, corresponding to the left and right lungs, are considered lung fields; other small segmented areas are discarded. Similarly, holes inside the lungs are filled. Two examples are shown in Figure 8. The entire lung field segmentation algorithm is summarized in Algorithm 1.
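The post-processing rule above can be sketched as follows (our own implementation sketch, assuming 4-connectivity; the paper does not specify the connected-component or hole-filling routines used):

```python
from collections import deque

def postprocess(mask):
    """Keep only the two largest foreground components (left and right
    lungs) and fill background holes enclosed by them."""
    h, w = len(mask), len(mask[0])

    def flood(seeds, value, grid):
        """Collect all cells equal to `value` reachable from `seeds`."""
        comp, q, seen = [], deque(seeds), set(seeds)
        while q:
            y, x = q.popleft()
            comp.append((y, x))
            for ny, nx in ((y+1, x), (y-1, x), (y, x+1), (y, x-1)):
                if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen \
                        and grid[ny][nx] == value:
                    seen.add((ny, nx))
                    q.append((ny, nx))
        return comp

    # 1. Collect foreground components and keep the two largest.
    visited, comps = set(), []
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 1 and (y, x) not in visited:
                comp = flood([(y, x)], 1, mask)
                visited.update(comp)
                comps.append(comp)
    keep = set(p for comp in sorted(comps, key=len)[-2:] for p in comp)
    out = [[1 if (y, x) in keep else 0 for x in range(w)] for y in range(h)]

    # 2. Fill holes: background not reachable from the border is a hole.
    border = [(y, x) for y in range(h) for x in range(w)
              if (y in (0, h - 1) or x in (0, w - 1)) and out[y][x] == 0]
    outside = set(flood(border, 0, out)) if border else set()
    for y in range(h):
        for x in range(w):
            if out[y][x] == 0 and (y, x) not in outside:
                out[y][x] = 1
    return out
```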


Algorithm 1. Lung Field Segmentation
Input: Given a set of CXR images X and a set of ground truth masks Y, with I ∈ X and M ∈ Y.
Output: O, the segmentation results.
1. Decompose I into a homogeneous matrix H of homogeneous regions and a boundary matrix B of the boundaries of superpixels using superpixel extraction.
2. Downsample I to obtain the downsampled image I_D using Equation (14).
3. Downsample M to obtain the downsampled mask M_D.
4. Store the superpixel label information for each pixel of I.
5. Training phase:
  5.1 Input a set of I_D and a set of M_D to the encoder-decoder segmentation network to train the model.
6. Prediction phase:
  6.1 Input I_D to the encoder-decoder segmentation network to predict the low-resolution segmentation results O_D.
  6.2 Upsample O_D to obtain the high-resolution segmentation results O using Equation (16).

Datasets and Model Training
The datasets used in this experiment include JSRT, LIDC-IDRI, and ANH, which consist of 247, 33, and 58 images, respectively. JSRT and LIDC-IDRI are public datasets, while ANH is a non-public dataset. In the experiments, we used these three datasets to train and build five segmentation network models, naming each model after its dataset. Eighty percent of each dataset is used for training and the remaining twenty percent for testing.
After the data augmentation process, the JSRT model is trained on 2370 JSRT images, the LIDC model on 1870 LIDC-IDRI images, and the ANH model on 836 ANH images. We also combine training data from the LIDC-IDRI and JSRT datasets to train the LIDC_JSRT hybrid model. In the same way, all three datasets are combined to train the LIDC_JSRT_ANH hybrid model. We conducted all experiments on a computer with an Intel(R) Xeon(R) E5-2630 v3 @ 2.40 GHz CPU and a GeForce GTX Titan X GPU with 12 GB of memory. The batch size is set to 4 and the maximum number of iterations to 40,000. The Adam optimizer with a learning rate of 0.0005 is used to train the network parameters. Table 1 shows the number of images used for the training and testing of each model.
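A minimal sketch of the 80/20 split described above, assuming a fixed random seed for reproducibility (the paper does not specify how the split was randomized, so the seed and function name are our assumptions):

```python
import random

def split_dataset(items, train_frac=0.8, seed=0):
    # Shuffle deterministically, then cut at the requested fraction.
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]
```

For the 247 JSRT images this yields 197 training and 50 test images; augmentation is then presumably applied to the training portion before the model sees its 2370 training samples.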

Performance Comparison of Superpixel and Bicubic Interpolations
The goal of the experiments is not only to segment the lung fields but also to demonstrate the performance of the USEQ superpixel resizing framework. As mentioned before, image boundaries become blurred after the downsampling and upsampling steps. Here, the peak signal-to-noise ratio (PSNR) [39] is used to compare USEQ superpixel interpolation with other interpolation algorithms used in the downsampling and upsampling steps.
Given an image I of size W × H with C channels, I is downsampled to an image I_D of a specified size and then upsampled back to the original resolution, yielding the image I_U. The channel count C can be ignored here because CXR images are single-channel. The PSNR is defined via the mean squared error (MSE). For a single channel, the MSE is defined as:

MSE = (1/(W·H)) · ∑_{i=1}^{W} ∑_{j=1}^{H} [I(i, j) − I_U(i, j)]²

The PSNR is then defined as:

PSNR = 20·log_10(MAX_I) − 10·log_10(MSE) (18)

where MAX_I is the maximum possible pixel value of the image. A higher PSNR means better visual quality and less information loss. We compare USEQ superpixel interpolation with nearest-neighbor interpolation [40], bilinear interpolation [41], and bicubic interpolation [42] at different downsampling rates (i.e., 0.125, 0.25, and 0.5). Figure 9 shows the average PSNR evaluation results for each dataset. As the figures show, nearest-neighbor interpolation is the worst, while bilinear and bicubic interpolations are moderate, with bicubic slightly better. The proposed USEQ superpixel interpolation outperforms the other interpolations because it considers boundary information during downsampling and upsampling.
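The PSNR of Equation (18) can be computed directly. The small round-trip demo below uses nearest-neighbor resizing via array slicing and `np.repeat` purely as an illustration; the helper name is ours, not the paper's.

```python
import numpy as np

def psnr(original, restored, max_val=255.0):
    # PSNR = 20*log10(MAX_I) - 10*log10(MSE); higher means less loss.
    diff = original.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 20 * np.log10(max_val) - 10 * np.log10(mse)

# Round trip at rate 0.5 with nearest-neighbor down/upsampling.
img = np.tile(np.arange(0, 256, 2, dtype=np.float64), (128, 1))  # 128x128 ramp
down = img[::2, ::2]
up = np.repeat(np.repeat(down, 2, axis=0), 2, axis=1)
quality = psnr(img, up)
```

On this horizontal ramp, every vertical pixel pair is identical, so the only error comes from the dropped columns, making the loss easy to reason about.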
We combine the segmentation network with the different interpolation algorithms, USEQ superpixel interpolation and bicubic interpolation, to evaluate the segmentation results. The segmentation network outputs are upsampled to their original resolution using the same interpolation algorithm used in the downsampling. Figure 10 shows the segmentation results for the JSRT, LIDC, and ANH models using USEQ superpixel interpolation and bicubic interpolation. Although the results are broadly similar, there are some differences in the details.
The contours of the segmentation results are drawn on the original images to show their adherence to the lung field boundaries. Figure 11 compares the boundary adherence of USEQ superpixel interpolation and bicubic interpolation. Figure 11a,c,e show the results for LIDC, JSRT, and ANH; Figure 11b,d,f are their zoomed-in versions. The blue contours are the results of USEQ, the red contours are bicubic, and the green contours are the ground truth. The zoomed-in parts in Figure 11b,d,f clearly show that the contours obtained with USEQ superpixel interpolation follow the lung boundaries better than those obtained with bicubic interpolation.
Four metrics are also employed to measure the quantitative impact of USEQ superpixel interpolation and bicubic interpolation on the segmentation results. These metrics include the Dice similarity coefficient (DSC), sensitivity, specificity, and modified Hausdorff distance (MHD), defined as follows [10]:

DSC = 2·TP/(2·TP + FP + FN)

Sensitivity = TP/(TP + FN)

Specificity = TN/(TN + FP)

where TP (true positives) represents correctly classified lung pixels, FP (false positives) represents pixels classified as lung that are background, FN (false negatives) represents pixels classified as background that are part of the lung, and TN (true negatives) represents correctly classified background pixels. The MHD calculates the average distance between the boundary of the segmentation result and that of the ground truth, defined as follows:

MHD(S_seg, S_gold) = (1/|S_gold|) · ∑_{q ∈ S_gold} min_{p ∈ S_seg} d(p, q)

where S_seg and S_gold represent the boundaries of the segmentation result and the ground truth, respectively.
Here, |S_gold| is the total number of points on the ground-truth boundary, p and q are points on the boundaries S_seg and S_gold, respectively, and d(p, q) is the distance from a point p on boundary S_seg to a point q on boundary S_gold. The average metric scores for each dataset are shown in Table 2. According to these results, USEQ superpixel interpolation outperforms bicubic interpolation: its average DSC, sensitivity, and specificity scores are all greater than 97%, and its average MHD is less than 2, while the average MHD of bicubic interpolation is about 4. Therefore, USEQ superpixel interpolation has better boundary adherence than bicubic interpolation.
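Given boolean masks and boundary point lists, the four metrics can be computed as below. This is a sketch under the definitions above; the function names and the (N, 2) point-array convention are ours.

```python
import numpy as np

def overlap_metrics(pred, gt):
    # pred, gt: boolean masks of equal shape (True = lung).
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    dsc = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return dsc, sensitivity, specificity

def mhd(seg_pts, gold_pts):
    # Average, over ground-truth boundary points, of the distance to the
    # nearest point on the segmentation boundary.
    d = np.linalg.norm(gold_pts[:, None, :] - seg_pts[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

The pairwise-distance matrix makes `mhd` O(|S_seg|·|S_gold|) in memory, which is fine for lung contours of a few thousand points; a KD-tree would scale better for denser boundaries.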

Cross-Dataset Generalization
To test the generalization of the segmentation models, the five trained models were tested on datasets that did not appear during their training. The cross-dataset results for the DSC metric are shown in Figure 12. Compared with testing on their own datasets, the segmentation performance on unseen datasets decreases slightly. This is because the three datasets differ in image size and gray-level range, and especially in aspect ratio (width/height): the aspect ratio of JSRT is 1, that of LIDC-IDRI is about 1.02, but that of ANH is 0.82. In Figure 12, except for the ANH model, the DSC scores of the other four models are approximately 90% or higher. One possible cause of the ANH model's weaker generalization is the difference in X-ray imaging machines.
Nevertheless, the LIDC_JSRT and LIDC_JSRT_ANH models trained on combined datasets achieve excellent results, showing that increasing data diversity can enhance model generalization and improve performance.

Comparison with other Lung Segmentation Methods
The Jaccard index (Ω) [3,4,11-13,43,44] and the mean boundary distance (MBD) are also calculated as additional metrics to compare the proposed method with other lung segmentation methods. The Jaccard index is computed as:

Ω = TP/(TP + FP + FN)

MBD measures the average distance between the boundary S of the segmentation result and the boundary T of the ground truth, defined as follows [4]:

MBD(S, T) = (1/2)·[(1/|S|)·∑_i d(s_i, T) + (1/|T|)·∑_j d(t_j, S)]

where s_i and t_j are the points on boundaries S and T, respectively, and d(s_i, T) is the minimum distance from point s_i on boundary S to boundary T, defined as follows:

d(s_i, T) = min_j ‖s_i − t_j‖ (25)

Comparisons are made only on the JSRT dataset, as most studies use it, as listed in Table 3. The proposed method outperforms other methods, such as SEDUCM [4], SIFT-Flow [43], and MISCP [44], in the Jaccard index, DSC, and MBD metrics. It also excels in the variance of the Jaccard index and DSC metrics. In machine learning, variance describes how much the prediction for a given point varies between model implementations, while bias measures the overall gap between the model's predictions and the ground truth. In some cases, low variance does not guarantee low bias. However, the Jaccard index measures the overlap between model predictions and ground truth in semantic segmentation, so the higher the value of Ω, the less biased the model's predictions are. The bias-variance trade-off in our method is minimized compared to the other methods; that is, our method has the most consistent prediction results.
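The two additional metrics can be sketched as follows, with MBD in the symmetric two-direction form given above. Boundaries are (N, 2) point arrays; the function names are illustrative.

```python
import numpy as np

def jaccard(pred, gt):
    # Omega = TP / (TP + FP + FN) on boolean masks.
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return tp / (tp + fp + fn)

def mbd(S, T):
    # Mean boundary distance: average the nearest-point distance in both
    # directions between boundaries S and T, then halve the sum.
    d = np.linalg.norm(S[:, None, :] - T[None, :, :], axis=2)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

Unlike the one-sided MHD, averaging both directions makes MBD symmetric, so over- and under-segmentation of the boundary are penalized equally.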
Although some methods in the literature slightly outperform the proposed method in terms of computational time, our method is still comparable given the different computing power of the machines used. The computational time of each method was measured on images of size 256 × 256. The proposed method achieves an average time of 4.6 s on CPU and 0.02 s on GPU. According to the results in Table 3, our method outperforms the other methods on Ω, DSC, and MBD, and considering the differences in computing power, its speed remains competitive.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethical Committee of China Medical University Hospital, Taichung, Taiwan (CMUH106-REC2-040 (FR)).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
For detailed information on image data availability for research purposes, please contact the corresponding authors.