Lung Segmentation in CT Images: A Residual U-Net Approach on a Cross-Cohort Dataset

: Lung cancer is one of the most common causes of cancer-related mortality, and since the majority of cases are diagnosed when the tumor is in an advanced stage, the 5-year survival rate is dismally low. Nevertheless, the chances of survival can increase if the tumor is identiﬁed early on, which can be achieved through screening with computed tomography (CT). The clinical evaluation of CT images is a very time-consuming task and computed-aided diagnosis systems can help reduce this burden. The segmentation of the lungs is usually the ﬁrst step taken in image analysis automatic models of the thorax. However, this task is very challenging since the lungs present high variability in shape and size. Moreover, the co-occurrence of other respiratory comorbidities alongside lung cancer is frequent, and each pathology can present its own scope of CT imaging appearances. This work investigated the development of a deep learning model, whose architecture consists of the combination of two structures, a U-Net and a ResNet34. The proposed model was designed on a cross-cohort dataset and it achieved a mean dice similarity coefﬁcient (DSC) higher than 0.93 for the 4 different cohorts tested. The segmentation masks were qualitatively evaluated by two experienced radiologists to identify the main limitations of the developed model, despite the good overall performance obtained. The performance per pathology was assessed, and the results conﬁrmed a small degradation for consolidation and pneumocystis pneumonia cases, with a DSC of 0.9015 ± 0.2140 and 0.8750 ± 0.1290, respectively. This work represents a relevant assessment of the lung segmentation model, taking into consideration the pathological cases that can be found in the clinical routine, since a global assessment could not detail the fragilities of the model.


Introduction
Respiratory diseases are the leading causes of death worldwide, and among the most common causes are asthma, chronic obstructive pulmonary disease (COPD), acute respiratory infections, tuberculosis, and lung cancer, contributing to the global burden of respiratory diseases [1,2]. Lung cancer is one of the most common causes of cancer-related mortality. In 2020, approximately 2.2 million people were diagnosed with lung cancer and about 1.79 million individuals died from this condition [3]. The majority of patients are diagnosed in advanced stages, resulting in small chances of 5-year survival rate of around 3.9%. When lung cancer is identified in early stages, which can be achieved through screening, these probabilities increase up to approximately 54% [4]. Among the available imaging modalities for screening, computed tomography (CT) has shown the highest reduction in cancer mortality [3].
Clinical assessment of CT images is a time-consuming task that is prone to discrepancies as a result of the subjective physicians interpretation, and computer-aided diagnosis (CAD) systems can help reduce this burden [4,5] . In automatic image analysis systems of the thorax, the segmentation of the lungs usually constitutes the first stage of processing, in order to reduce the computational cost [6,7] and to regularize the prediction task of the CAD by eliminating unnecessary information, or allow feature extraction from a region of interest for machine learning approaches [8][9][10][11]. This task is very challenging given that the lungs present high variability in shape, size, and volume. Moreover, the presence of abnormalities in the lung parenchyma, such as consolidations and cavities, makes segmentation even more difficult, leading to inaccurate delineations [6]. Current lung segmentation methodologies are able to segment lungs, exhibiting characteristics that are within a specific spectrum of patterns [6], but fail in the cases that present pathologies with complex patterns, for which they were not trained [6], as a consequence of a lack of diverse training data [7].
DL approaches, applied to medical imaging tasks, have been gaining popularity in recent years [5] and are preferred to other techniques, such as traditional imaging processing [12]. U-Net-based approaches have shown promising results on the segmentation tasks of medical images [13]. Skourt et al. [14] proposed a U-Net-based model composed of a contracting path similar to the one of U-Net (two convolutions followed by rectified linear unit (ReLU) activation and max-pooling, repeated four times) and an expansive path in which the upsampled layer is concatenated with a cropped corresponding feature map of the first path. Images from the Lung Image Database Consortium's Image Database Resource Initiative (LIDC-IDRI) dataset were manually segmented to generate the ground truth and later used to train and test the model. The average dice similarity coefficient (DSC) achieved was 0.9502. Shaziya et al. [15] also presented a U-Net model, with a contracting path, formed by three blocks of convolutional layers, ReLU and max-pooling, and an expansive path, formed by three blocks, two of them with two convolutions, concatenation with the corresponding feature map of the contracting path, and upsampling. The last one is similar to the previous blocks, except that it presents three convolutions, instead of two, and following the upsampling, there were two more convolutions and a dropout layer, followed by the output layer. The input images, with dimensions of 128 × 128, were resized to 32 × 32, in order to reduce computational time. Data augmentation was performed by rotation using the available training samples, to increase the number of images used to develop the model, and an accuracy of 0.9678 is achieved. Yoo et al. [16] presented 2D and 3D U-Net models for the segmentation of the lungs as one region and separately. The 2D model has an input dimension of 512 × 512 × 1 and it is formed by 4 encoders and 4 decoders, in which bilinear interpolation is used for the upsampling step. On the other hand, the 3D model has an input dimension of 512 × 512 × 8, 3 downsampling steps, and 3 upsampling steps with trilinear interpolation. For both models, softmax function is used in the output layer and cross entropy is used as a loss function. Two types of models were trained, one for the segmentation of the whole lung region and another for the separated segmentation of each lung. Concerning the latter, each ground truth was separated into two additional masks, each containing only one of the lungs. Afterwards, each of these masks was flipped horizontally and used as training for the segmentation of the opposite lung. The University Hospitals of Geneva's Interstitial Lung Disease (HUG-ILD) dataset was used as external validation. The 2D model presents a DSC of 0.9840 and 0.9840 for the whole segmentation and for the separated segmentation, respectively. Khanna et al. [17] introduced a residual U-Net for the lung segmentation in CT images. Initially, to improve the number of available training images, data augmentation was performed via flips, rotation, zooming, and shifting. The residual U-Net architecture is a combination of ResNet and U-Net architectures; hence, it presents an encoder path and a decoder path, each one with four stages.
The DSC was used as a loss function. In order to improve the model accuracy, a connected component algorithm was applied to remove non-lung regions. The Lung Nodule Analysis 2016 (LUNA16) and the Vessel Segmentation in the Lung 2012 (VESSEL12) datasets were used for training, whereas the model evaluation was performed with the HUG-ILD dataset. Two different architectures, ResNet34 and ResNet50, were implemented alongside the U-Net. It was verified that the latter presents slightly better results, achieving an average DSC of 0.9868.
Lung segmentation can be a very difficult task due to the influence of other pathologies that produce imaging changes. The presence of other respiratory comorbidities alongside lung cancer is frequent. Lung cancer and COPD, both predominantly caused by cigarette smoking, are closely linked, and each condition presents its own range of characteristic imaging features [18]. For this reason, apart from lung-cancer-specific patterns, it is imperative that the segmentation models are likewise able to identify features particular to other pulmonary conditions. The main goal of this work is the development of a deep-learning-based model for lung segmentation in CT images that must be robust on a cross-cohort dataset and capable of coping with the numerous and variable imaging appearances, derived from different pathologies with physiological heterogeneities.

Material and Methods
This section presents the multiple datasets used in the current study-the data selected for both training and performance evaluation of the model-and gives a detailed description of each dataset. It also describes the preprocessing steps taken and the segmentation model developed. The Lung CT Segmentation Challenge (LCTSC) 2017 [19] dataset was part of a competition in which the goal was the development of algorithms for the segmentation of several organs at risk in CT images for radiation treatment planning. The data was collected from 3 different institutions, making a total of 60 CT scans. The dataset is divided into 2 subsets: 1 contains 36 scans that are intended to be used for training (36-LCTSC), and the other subset contains 24 scans intended to be used for the assessment of the developed models (24-LCTSC). The number of slices along the z-axis per scan varies between 103 and 279 and their axial resolution is 512 × 512. The slice spacing is of 1.02 ± 0.11 mm and the slice thickness is of 2.65 ± 0.38 mm. The ground truth of the original images contains the delineation of five anatomical structures: esophagus, heart, left and right lungs, and spinal cord. Given that the lungs are the only organs of interest for this work, a binary ground truth containing solely the pulmonary regions was generated for each slice, using the information regarding these organs extracted from the DICOM RSTRUCT file.

Lung Nodule Analysis 2016
The Lung Nodule Analysis 2016 (LUNA16) [20] dataset was also part of a competition and it was developed to provide a large set for the comparison and evaluation of CAD systems designed for the detection of pulmonary nodules. This database contains 888 CT scans with annotations from another public dataset, LIDC-IDRI, and lung masks are available for each one of them. The scans were divided in 10 folders, intended to be used into a 10-fold cross-validation manner. For this work, only two of them were used, being randomly chosen from all the available folders. The number of slices varies between 103 and 733. The slice spacing is of 0.69 ± 0.09 mm and the slice thickness is of 1.60 ± 0.74 mm.

University Hospitals of Geneva-Interstitial Lung Disease
Motivated by the low availability of public collections of Interstitial Lung Disease (ILD) cases, the University Hospitals of Geneva Interstitial Lung Disease (HUG-ILD) database [21] was created to provide a public platform of interstitial lung disease (ILD) cases for the development and evaluation of CAD systems. This collection comprises the 13 most common histological diagnosis of ILD, including conditions such as groundglass, emphysema, fibrosis, consolidations, reticulation, and micronodules. The HUG-ILD database provides 112 CT scans with their respective binary lung segmentation masks. The number of slices along the z-axis per scan varies between 14 and 60 and the spacing between slices is within the range of 10-15 mm. Each slice is a matrix with 512 × 512 dimension. The slice spacing is of 0.70 ± 0.10 mm and the slice thickness is of 1.00 ± 0.00 mm.

Vessel Segmentation in the Lung 2012
Similar to the LCTSC dataset, the Vessel Segmentation in the Lung 2012 (VESSEL12) [22] dataset was part of a competition and its goal is to serve as a mean of comparison for (semi) automatic models for the vessel segmentation in lung CT scans. This database contains 10 patients with CT scans and the correspondent binary lung masks and comprises cases of alveolar inflammation, diffuse interstitial lung disease, and emphysema. Three extra scans are available as well, but were not intended to be used as part of an evaluation set. All thirteen scans have corresponding binary lung masks. The number of slices along the z-axis , varies between 355 and 534 and their resolution is of 512 × 512. The slice spacing is of 0.74 ± 0.09 mm and the slice thickness is of 0.88 ± 0.15 mm.

University Hospital Center of São João
A private dataset of 141 patients with lung cancer was collected in the University Hospital Center of São João (CHUSJ) and contains severe cases of this pathology. Semantic features were annotated for each scan and lung binary masks were generated for 27 of the available scans. The number of slices of these scans varies between 61 and 281 and their resolution is 512 × 512. The slice spacing is of 0.71 ± 0.08 mm and the slice thickness is of 3.07 ± 0.38 mm.

Summary of Cross-Cohort Dataset
In total, five datasets were used. Table 1 describes the number of scans used from each dataset. These datasets allow the development and test of the segmentation model in cross-cohorts, ensuring the heterogeneity of the data fundamental for a good generalization of the learning model.

Pre-Processing
When using data obtained from multiple sources, whether they are different institutions, different CT equipment, patients, or acquisition protocols, there is a high probability that the CT scans present variations, such as in image resolution, the field of view, and slice spacing and thickness. Each pixel of the 2D slices is represented by a value corresponding to the X-ray attenuation and it is expressed in Hounsfield Units (HU). Thus, the images were submitted to a HU min-max normalization, in which values between −1000 HU (HU min ) and 400 HU (HU max ) are rescaled into a range of [0, 1]. Then, as a second step of the preprocessing phase, the input images were resized to a dimension of 128 × 128 via bilinear interpolation, in order to reduce the computational cost. Regarding voxel spacing, the images were not rescaled in the z-direction.

Learning Model
Based on the model proposed by Khanna et al. [17], a hybrid architecture structure, depicted in Figure 1, was designed, consisting of the combination of the U-Net and ResNet34 architectures. This structure demonstrated better performance when compared to other simpler DL structures, such as U-Net, and when using LUNA16 dataset [17]. Following the typical structure of a U-Net, the model is composed of an encoder path and a decoder path, each one formed by five stages. First, the input images were submitted to 7 × 7 convolution, batch normalization (BN), and max-pooling, similarly to the first block of the ResNet34; likewise, each of the remaining 4 blocks that follow comprises residual units. Each unit contains a convolutional layer, followed by BN and parametric rectified linear unit (PReLU) activation, and a second convolutional layer followed by BN. In the end, in each unit, the input, the designated shortcut connection (SC), is added to the output of the unit and then submitted to a PReLU activation to produce a final result. Because max-pooling is not performed, the first convolution of the first residual unit of each residual block (RB), from RB2 to RB4, is applied with stride two, in order to reduce the dimension. Moreover, the SC of this unit is also submitted to a convolutional layer, so that the dimensions are in conformity. The blocks from RB1 to RB4 contain 3, 4, 6, and 3 residual units, respectively. Concerning the decoder path, the first four stages comprise upsampling via 2D transpose convolutions, followed by concatenation with the output of the corresponding block of the encoder path. Thereafter, the set of operations, including convolution, BN, and rectified linear unit (ReLU) activation, is performed two times. At the final stage, the output of the previous block is submitted to upsampling, followed by convolution, BN, and ReLU. At last, 1 × 1 convolution and sigmoid activation function are performed, producing the final segmentation result, a probability mask.

Training
The loss function used in the training process was based on the DSC , due to the fact that it has proven to be a useful measurement for this type of tasks [23]. The DSC is given by Equation (1), in which DSC is the dice similarity coefficient, X, corresponds to the ground truth mask, Y, is the predicted mask, X ∩ Y is the area of overlap of the two images, and X + Y is the total number of pixels of the two images.
This equation provides a measure of similarity between two images; thus, in order to measure the loss between the ground truth and the predicted mask, Equation (2) was used, in which DSC loss represents the loss and DSC corresponds to the dice similarity coefficient.
With the aim of finding the best hyper-parameters, experiments with different combinations of the optimizer, learning rate, and batch size were performed. Table 2 presents the values taken by each one of these hyper-parameters. The 36-LCTSC and the LUNA16 datasets were chosen to be used for training since these data are not intended to be used for evaluation ( Table 3). As for the validation set, 30% of the training set was chosen randomly to be used as validation data and fixed with a seed to ensure that this distribution was the same across all experiments. At last, regarding the evaluation set, taking into account the importance of the capability of a model to cope with the different lung heterogeneities and its generalization ability, 4 distinct datasets: 24-LCTSC, HUG-ILD, VESSEL12, and CHUSJ, comprising a variety of disease patterns and differences in CT imaging protocols, were used. All tests were performed in each dataset separately. A summary of the distribution of the data per training and test sets is represented in Table 3.

Evaluation
The evaluation metrics used were the DSC, given by Equation (1), Hausdorff distance (HD) and average symmetric surface distance (ASSD) [24]. The HD metric is given by Equation (3) The ASSD metric is given by Equation (4) The HD and ASSD are metrics that take into account pixel spacing and each scan can present its own value for this image property. Therefore, the preliminary results of these two metrics obtained for each image were multiplied by the correspondent image pixel spacing, in order to produce normalized metrics. The three metrics were determined between lung masks from the datasets (used as ground truth) and the segmentation mask produced by the model developed.
Additionally, a clinical assessment of the results was performed by two experienced radiologists. They performed a visual analysis of the 40 randomly selected cases and they evaluated and discussed the cases for the learning models that failed the segmentation.

Results and Discussion
This section includes the results obtained in the quantitative and qualitative evaluations for each test dataset, a clinical assessment performed by two radiologists, and the limitations found.

Performance Results
The set of hyper-parameters that lead to the best performance is as follows: Adam optimizer, the learning rate of 0.0001, and batch size of 8. The results obtained for each test dataset are presented in Table 4. Table 4. Mean and standard deviation (std) results of the three metrics: dice similarity coefficient (DSC), Hausdorff distance (HD), and average symmetric surface distance (ASSD) for each dataset. The maximum value of the range for HD and ASSD metrics was obtained by considering that the maximum distance between 2 distinct objects in an image of 128 × 128 corresponded to the diagonal of that image. The results from the three metrics in the analysis are consistent, showing better segmentations for the VESSEL12 dataset and worst results for HUG-ILD. However, the difference between the worst and the best results are very small, showing a good confidence that the segmentation model is robust to the great variability of the pathological cases used in the test sets. Overall, the model is able to generate good results and because the four test datasets present differences between them regarding acquisition protocols, pathologies included, and segmentation guidelines, the results for each one of them will be discussed individually.
With respect to the 24-LCTSC dataset, the model is generally able to correctly segment the pulmonary images (see the first row in Figure 2), but fails to identify their initial slices, which correspond to the base of the lung, leading to a decrease in the DSC (see the second row in Figure 2). Furthermore, for one of the patients, the masks produced by the model seem to be more accurate than the ground truth images, as the latter excludes part of the lung parenchyma. An example is shown in the third row in Figure 2. Therefore, even though this contributes to a lower DSC due to the discrepancy between them, the predicted mask is more precise. On the other hand, there are cases in which nodules are not included in the ground truth, and there are elements that the model does not include as well, giving rise to a higher DSC, although incorrectly classified. An example is depicted in Figure 2, fourth row.
Regarding the HUG-ILD dataset, the model successfully segments the majority of scans and in some cases, it does not exclude pulmonary regions presenting a higher density. The model was also assessed on this dataset on a pattern level, i.e., each pattern was  the examples are, respectively: good segmentation example; an example in which the model fails to segment the base of the lung; an example of a ground truth image that excludes part of the parenchyma of the lung; an example that excludes a nodule in its ground truth, and which is also misclassified by the model. evaluated individually, to gain a better understanding of its behavior, and the results are presented in Table 5. Table 5. Results obtained for the patterns included in the HUG-ILD dataset. The row "Other" refers to scans that presented more than one pattern. The results presented in Table 5 can be visually verified in Figure 3. For the specific cases of micronodules, bronchioectasis, emphysema, and some cases of fibrosis, the model is able to segment the entire lung area, as these are patterns which do not contain a higher contrast in tissues density (see respective examples on the first four rows of Figure 3). In general, those pathological cases showed a slightly better performance (Table 5).

Pattern
In contrast, for cases of macronodules, reticulation, consolidation, ground-glass, and pneumocystis carinii pneumonia, the model presents a difficulty in performing such tasks in the regions of higher density (see last five examples (rows) of Figure 3). Besides that, the scans from this dataset include the trachea and other respiratory structures (apart from the lungs) in their ground truth, elements that are not identified by the model, and thus contributing to a lower metric (see rows two-five in Figure 3). Once again the model fails to identify the slices corresponding to the base of the lung (see the second row in Figure 6).
Concerning the VESSEL12 dataset, the model is able to produce good segmentation masks (see Figure 4), in general, for all ten scans, which were translated on a higher DSC, since this dataset does not contain intricate patterns. Moreover, the model does not erroneously classify other darker structures that are present in some slices as lungs.
As for the CHUSJ dataset, the model demonstrates, once again, a difficulty in the segmentation of the base of the lung (see the first row of Figure 5). Nonetheless, what contributes most to the decrease in the DSC is the large masses of higher density that are present in the majority of the scans and which the model does not identify. The model correctly segments the surrounding pulmonary tissue and these masses are the only structures that are not included in its predicted masks. Examples of these scans are depicted in the last two rows of Figure 5.

Clinical Assessment
From the visual inspection, radiologists concluded that the model has a good overall performance, especially for healthy cases for which the model always set a correct segmentation. They tried to identify the physiological reasons that made the model fail and that can be taken into consideration in future work. The radiologists identified that the model tends to fail in areas of the lung that have different densities than expected.
In areas of higher density ("whiter" on the CT image) than the surrounding lung, resulting from involvement by interstitial lung disease or inflammatory/infectious pathol-     ogy, they are not recognized by the model as lung tissue (see first three rows in Figure 6), although they corresponded to areas of "diseased" lung, involved by interstitial pathology.
In the case of lung neoplasms, something similar happens: the whole healthy lung is properly recognized; however, only the area of the lung mass, which is denser ("whiter") than the remaining lung, is not properly recognized (see last two rows in Figure 5).
A similar situation occurs in areas of the lung that are even less dense than usual ("blacker" on the CT image) which may also correspond to areas of "diseased" lung, in this case, areas of "air-trapping", emphysema, etc. (see the last row in Figure 6).
Lung pathologies present specific patterns, which correspond to changes in the density of the radiological image. Density increases for pulmonary consolidation present in pneumonia or other inflammatory/infectious processes; areas of opacification in ground-glass present innumerable causes, such as COVID, neoplasm, lung masses, and nodules. On the other hand, the density decreases with air-trapping, which is often associated with small airway disease and emphysema.

Limitations
The main motivation of this work was the combination of multiple cohorts of patients to train the learning model with a large spectrum of the heterogeneities that can be found in the population and the assessment of the segmentation performance on the most frequent lung diseases. The merge of cohorts covers the great majority of the pathophysiological patterns that can be found in lung diseases, and for this reason, the learning model fails only in the most extreme cases. Despite the overall good results, a large number of extreme pathological cases must be used in the training set to allow an even better generalization of the segmentation model in the future.
Another limitation comes from the annotations of the datasets. Those annotations come from different projects and followed different segmentation guidelines. There is no consensus on the inclusion/exclusion of some structures, such as airways or tumor masses. Examples of airways inclusion are shown in the second, third, fourth, and fifth rows in Figure 3 that belong to the HUG-ILD dataset, and examples of airways exclusion are shown in the last row in Figure 2, the first row in Figure 4, and the last row in Figure 5, that belong to the LCTSC, VESSEL12, and CHUSJ datasets, respectively. Examples of tumor masses inclusion are shown in last two rows in Figure 5 that belong to the CHUSJ dataset, and examples of exclusion are shown in the last row in Figure 2 that belong to the LCSTC dataset. Ideal, a very objective protocol of segmentation should be followed for the entire dataset annotation (training and test set) in order to not create label noise, which is responsible for overall quantitative performance degradation.
The need for massive and well-annotated datasets in the medical field is still one of the biggest limitations for the broad and impactful use of AI as support decision systems in the clinical routine. Unsupervised and semi-supervised approaches could be strategies to be used to overcome the lack of labeled medical data [25].

Methods Discussion
Considering the results obtained with the proposed approach and analyzing the different challenges ahead to overcome, some methodology improvements should be explored in the future, aiming to enhance the segmentation robustness of such models. Performance effects caused by increasing network complexity have been discussed in related application scenarios [7], being suggested that it will often be insufficient to overcome some specific problems. In this direction, the idea is raised that the path to more robust lung segmentation models would more likely comprise the development of models capable of being invariant in the presence of specific factors (e.g., severe pathological lung regions) that have caused significant performance drops. A robust segmentation model must not be influenced by the lung status itself, as diseased regions should be included in the predicted masks for further classification pipelines. In a different perspective, diving beyond traditional data augmentation techniques would also be an interesting idea to explore; the possibility of (a) CT Image.
(b) Ground Truth. (c) Predicted Mask. imitating some high-level properties of challenging tissue regions would enable us to reproduce the intended characteristic in any training example, which could result in more heterogeneous training data.

Conclusions
The proposed model was able to produce good results for the 24-LCTSC, the VESSEL12, and the HUG-ILD datasets. Nevertheless, it also generated poorer segmentation masks for the CHUSJ data, which mostly contains images with big tumor masses of higher density, and for some cases of the HUG-ILD data that contained complex patterns with high contrast tissues-all features that are not present in the training data. Thus, this work demonstrated that having a representative training database is crucial to build a robust segmentation model that is able to cope with complex patterns.
Subsequently, taking into account the wide variety of these elements and the limitations mentioned above, in particular the lack of well-annotated datasets, future models could be developed by making use of data augmentation techniques that would mimic the missing imaging features. In addition, one could also explore the field of continuous learning that would possibly allow a model to continuously learn and improve its performance from the given data.