A multi-stage GAN for multi-organ chest X-ray image generation and segmentation

Multi-organ segmentation of X-ray images is of fundamental importance for computer aided diagnosis systems. However, the most advanced semantic segmentation methods rely on deep learning and require a huge amount of labeled images, which are rarely available due to both the high cost of human resources and the time required for labeling. In this paper, we present a novel multi-stage generation algorithm based on Generative Adversarial Networks (GANs) that can produce synthetic images along with their semantic labels and can be used for data augmentation. The main feature of the method is that, unlike other approaches, generation occurs in several stages, which simplifies the procedure and allows it to be used on very small datasets. The method has been evaluated on the segmentation of chest radiographic images, showing promising results. The multi-stage approach achieves state-of-the-art results and, when very few images are used to train the GANs, outperforms the corresponding single-stage approach.

In order to evaluate the performance of the proposed method, synthetic images have been used to train a segmentation network, subsequently applied to a popular benchmark for multi-organ chest segmentation, the Segmentation in Chest Radiographs (SCR) dataset [6]. The results obtained are very promising and exceed (to the best of our knowledge) those obtained by previous methods. Moreover, the quality of the produced segmentation has been confirmed by physicians. Finally, to demonstrate the capabilities of our approach, especially when little data is available, we compared it to two other methods, using only 10% of the images in the dataset. In particular, the multi-stage approach has been compared with a single-stage method, in which chest X-ray images and semantic label-maps are generated simultaneously, and with a two-stage method, where semantic label-maps are generated and then translated into X-ray images. The experimental results show that the proposed three-stage method outperforms the two-stage method, while the two-stage method outperforms the single-stage approach, confirming that splitting the generation procedure can be advantageous, particularly when few training images are available.
The paper is organized as follows. In Section 2, the related literature is reviewed. Section 3 presents a description of the proposed image generation method. Section 4 shows and discusses the experimental results. Finally, in Section 5, we draw some conclusions and describe future research.

Related works
In the following, recent works related to the topics addressed in this paper are briefly reviewed, namely regarding synthetic image generation, image-to-image translation, and segmentation of medical images.

Synthetic Image Generation
Methods for generating images are by no means new and can be classified into two main categories: model-based and learning-based approaches. A model-based method consists in formulating a model of the observed data and rendering the image with a dedicated engine. This approach has been widely adopted to generate images in many different domains [16,17,18]. Nonetheless, the design of specialized engines for data generation requires a deep knowledge of the specific domain. For this reason, in recent years, the learning-based approach has attracted increasing research interest. In this context, machine learning techniques are used to capture the intrinsic variability of a set of training images, so that the specific domain model is acquired implicitly from the data. Once the probability distribution that underlies the set of real images has been learned, the system can be used to generate new images that are likely to mimic the original ones. One of the most successful machine learning models for data generation is the Generative Adversarial Network (GAN) [12]. A GAN is composed of two networks: a generator G and a discriminator D. The former learns to generate data starting from a latent random variable $z \in \mathbb{R}^{Z}$, while the latter aims at distinguishing real data from generated ones. Training GANs is difficult, because it consists in a min-max game between two neural networks and convergence is not guaranteed. This problem is compounded in the generation of high resolution images, because the high resolution makes it easier to distinguish generated images from training images [19]. One of the most successful approaches to this problem is represented by Progressively Growing GANs (PGGANs) [13]. This model is based on a multi-stage approach that aims to simplify and stabilize the training and makes it possible to generate high resolution images.
More specifically, in a PGGAN, the training starts at low resolution, while new blocks are progressively introduced in the system to increase the resolution of the generation. The generator and discriminator grow symmetrically until the desired resolution is reached. Based upon PGGANs, many different approaches have been proposed. For instance, StyleGANs [20] maintain the same discriminator as PGGANs, but introduce a new generator which is able to control the style of the generated images at different levels of detail. In StyleGAN2 [21], an improved training scheme is introduced, which achieves the same goal (training starts by focusing on low resolution images and then progressively shifts the focus to higher and higher resolutions) without changing the network topology during training. In this way, the updated model shows improved results at the expense of longer training times and more computing resources.
In this paper, we use PGGANs in three different ways. For the single-stage method, a PGGAN simultaneously generates semantic label-maps and CXR images. For the two-stage method, only semantic label-maps are generated, while for the three-stage method we use a PGGAN to generate "dots" that correspond to different anatomical parts.
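For reference, the min-max game between G and D mentioned above is commonly written as in the original GAN formulation [12]. The symbols $p_{\text{data}}$ (distribution of real images) and $p_z$ (prior on the latent variable) do not appear in the text and are used here as standard notation; PGGANs are typically trained with variants of this loss, so this is only the basic form:

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_z(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator maximizes V by assigning high scores to real samples and low scores to generated ones, while the generator minimizes it by producing samples that D cannot tell apart from real data.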

Image-to-Image Translation
Recently, besides image generation, adversarial learning has also been employed for image-to-image translation, whose goal is to translate an input image from one domain to another. Many computer vision tasks, such as image super-resolution [22], image inpainting [23], and style transfer [24] can be cast into the image-to-image translation framework. Both unsupervised [25,26,27,28] and supervised approaches [29,13,30] can be used but, for the proposed application to CXR image generation, the unsupervised category is not relevant. Supervised training uses a set of pairs of corresponding images $\{(s_i, t_i)\}$, where $s_i$ is an image of the source domain and $t_i$ is the corresponding image in the target domain. In the original GAN framework, there is no explicit way of controlling what to generate, since the output depends only on the latent vector z. For this reason, in conditional GANs (cGANs) [31], an additional input c is introduced to guide the generation. In a cGAN, the generator can be defined accordingly as G(c, z). Pix2Pix [29] is a general approach for image-to-image translation and consists of a conditional GAN that operates in a supervised way. Pix2Pix uses a loss function that makes it possible to generate images that are plausible for the destination domain and are also credible translations of the input image. With respect to supervised image-to-image translation techniques, in addition to the aforementioned Pix2Pix, the most used models are CRN [30], Pix2PixHD [14], BicycleGAN [32], SIMS [33], and SPADE [34]. In particular, Pix2PixHD [14] improves upon Pix2Pix by employing a coarse-to-fine generator and discriminator, along with a feature-matching loss function, which makes it possible to translate images with higher resolution and quality.
For the image-to-image translation phase, we use the Pix2PixHD network. The single-stage method does not require a translation step, while for the two-stage method we use Pix2PixHD to obtain a CXR image from the label-map. Finally, in the three-stage method, Pix2PixHD is used in two steps: first for the translation from "dots" to semantic label-maps and then for the translation of label-maps into CXR images.

Medical Image Generation
In recent years, GANs have attracted the attention of medical researchers, their applications ranging from object detection [35,36,37] to registration [38,39,40], classification [41,42,43] and segmentation [44,45] of images. For instance, in [46], different GANs have been used for the synthesis of each class of liver lesions (cysts, metastases and hemangiomas). However, in the medical domain, the use of complex machine learning models is often limited by the difficulty of collecting large sets of data. In this context, GANs can be employed to generate synthetic data, realizing a form of data augmentation. In fact, the GAN generated data can be used to enlarge the available datasets and to improve the performance in different tasks. As an example, GAN generated images have been successfully used to improve the performance in classification problems, by combining real and synthetic images during the training of a classifier. In [47], Wasserstein GANs (WGANs) and InfoGANs have been combined to classify histopathological images, whereas in [41] WGAN and CatGAN generated images were used to improve the classification of dermoscopic images. Only in a few cases have GANs been used to generate chest radiographic images, as in [42], where images for cardiac abnormality classification were obtained with a semi-supervised architecture, or in [48], where GANs were used to generate low resolution (64 × 64) CXRs to diagnose Pneumonia. More related to this work, in [16], high-resolution synthetic images of the retina and the corresponding semantic label-maps have been generated. Moreover, synthesizing images has been proven to be an effective method for data augmentation, that can be used to improve performance in retinal vessel segmentation.
In this paper, chest X-ray images have been generated with the corresponding semantic label-maps (which correspond to different organs). We then used such images to train a segmentation network, with very promising results.

Organ Segmentation
X-rays are one of the most used techniques in medical diagnostics. The reasons are medical and economic, since they are cheap, non-invasive and fast examinations. Many diseases, such as pneumonia, tuberculosis, lung cancer, and heart failure are commonly diagnosed from CXR images. However, due to overlapping organs, low resolution and subtle anatomical shape and size variations, interpreting CXRs accurately remains challenging and requires highly qualified and trained personnel. Therefore, it is of a great clinical and scientific interest to develop computer-based systems that support the analysis of CXRs. In [49], a lung boundary detection system was proposed, building an anatomical atlas to be used in combination with graph-cut based image region refinement [50,51,52]. A method for lung field segmentation, based on joint shape and appearance sparse learning, was proposed in [53], while a technique for landmark detection was presented in [54]. Haar-like features and a random forest classifier were combined for the appearance of landmarks. Instead, a Gaussian distribution augmented by shape-based random forest classifiers was adopted for learning spatial relationships between landmarks. InvertedNet, an architecture able to segment the heart, clavicles and lungs, was introduced in [55]. This network employs a loss function based on the Dice Coefficient, Exponential Linear Units (ELUs) activation functions, and a model architecture that aims at containing the number of parameters. Moreover, the UNet [56] architecture has been widely used for lung segmentation, as in [57,58,59]. In the Structure Correcting Adversarial Network (SCAN) [60] a segmentation network and a critic network were jointly trained with an adversarial mechanism for organ segmentation in chest X-rays.

Chest X-Ray Generation
The main goal of this study is to prove that by dividing the generation problem into multiple simpler stages, the quality of the generated images improves, so that they can be more effectively employed as a form of data augmentation. More specifically, we compare three different generation approaches. The first method, described in Section 3.1, consists in generating chest X-ray images and the corresponding label-maps in a single stage. In the second approach, presented in Section 3.2, the generation procedure is divided into two stages, where the label-maps are initially generated and then translated into images. The third method, reported in Section 3.3, consists in a three stage approach, that starts by generating the position of the objects in the image, then the label-maps and, finally, the X-ray images. The images generated employing each of the three approaches are comparatively evaluated by training a segmentation network.
To increase the descriptive power of real images, especially with regard to the position of the various organs, standard data augmentation was applied beforehand. The original X-ray images, along with their corresponding masks, were augmented by applying random rotations in the interval [−2, 2] degrees, random horizontal, vertical and combined translations from −3% to +3% of the number of pixels, and adding Gaussian noise (only to the original images) with zero mean and variance between 0.01 and 0.03 × 255. For the generation of images, we essentially used two networks well known in the literature, namely PGGANs [13] and Pix2PixHD [14], whose details are given in the following. In particular, in Sections 3.1, 3.2, and 3.3, we describe the three different generation procedures, respectively the single-stage, two-stage and three-stage methods. Section 3.4 presents the semantic segmentation network that has been employed. Finally, some details on the training method are collected in Section 3.5.
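The augmentation ranges listed above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the use of nearest-neighbour sampling for both image and mask, the clipping to [0, 255], and the reading of the noise variance as a value drawn uniformly from [0.01, 0.03] × 255 on a 0-255 intensity scale are all assumptions made for illustration.

```python
import math
import random


def sample_params(height, width, rng):
    """Draw one set of augmentation parameters in the ranges given in the text."""
    return {
        "angle_deg": rng.uniform(-2.0, 2.0),            # rotation in [-2, 2] degrees
        "shift_y": rng.uniform(-0.03, 0.03) * height,   # vertical shift, +/-3% of pixels
        "shift_x": rng.uniform(-0.03, 0.03) * width,    # horizontal shift, +/-3% of pixels
        # Assumption: variance drawn from [0.01, 0.03] * 255 on a 0-255 scale.
        "noise_var": rng.uniform(0.01, 0.03) * 255.0,
    }


def warp_nearest(grid, angle_deg, shift_y, shift_x, fill=0):
    """Rotate about the centre and translate, using nearest-neighbour sampling."""
    h, w = len(grid), len(grid[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel that lands on (y, x).
            ty, tx = y - shift_y - cy, x - shift_x - cx
            sy = round(cos_a * ty - sin_a * tx + cy)
            sx = round(sin_a * ty + cos_a * tx + cx)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = grid[sy][sx]
    return out


def augment(image, mask, rng):
    """Apply the same geometric transform to image and mask; add noise to the image only."""
    h, w = len(image), len(image[0])
    p = sample_params(h, w, rng)
    geometry = (p["angle_deg"], p["shift_y"], p["shift_x"])
    image_aug = warp_nearest(image, *geometry)
    mask_aug = warp_nearest(mask, *geometry)  # identical geometry keeps labels aligned
    sigma = math.sqrt(p["noise_var"])
    image_aug = [
        [min(255.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
        for row in image_aug
    ]
    return image_aug, mask_aug, p


# Toy 16 x 16 example: a flat grey image and a mask with one labelled block.
rng = random.Random(0)
image = [[100.0] * 16 for _ in range(16)]
mask = [[3 if 4 <= y < 10 and 5 <= x < 11 else 0 for x in range(16)] for y in range(16)]
image_aug, mask_aug, params = augment(image, mask, rng)
```

Using nearest-neighbour sampling for the mask guarantees that no new label codes are created by interpolation; a real pipeline would probably use a smoother interpolation for the image itself.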

Single-stage method
This baseline approach consists in stacking X-ray images and labels into two different channels, which are simultaneously fed into the PGGAN. Therefore, the PGGAN is trained to generate pairs composed of an X-ray image and its corresponding label (see Figure 1).

Figure 1: The one-stage image generation scheme. The input of the network is a latent vector, while the PGGAN simultaneously produces the label-map and the X-ray image.

Two-stage method
In this approach, the generation procedure is divided into two steps. The first one consists in generating the labels through a PGGAN, while, in the second, the translation from the label to the corresponding chest X-ray image is carried out using Pix2PixHD (see Figure 2).

Figure 2: The two-stage image generation scheme. In the first step, the PGGAN takes a latent vector as input and produces the label-map. The generated label-map is then used as input to a Pix2PixHD module, which is trained to output the X-ray image.

Three-stage method
The three-stage method further subdivides the generation procedure. A first phase generates the position and type of the objects that will be generated later, regardless of their shape or appearance. This is obtained by generating label-maps that contain "dots" in correspondence with different anatomical parts (lungs, heart, clavicles). The dots can be considered as "seeds" from which, through the subsequent steps, the complete label-maps are realized (second phase). Finally, in the last step, chest X-ray images are generated from the label-maps. The exact procedure is as follows. Initially, label-maps containing "dots", with a specific value for each anatomical part, are created. The position of each "dot" center is given by the centroid of the corresponding labeled anatomical part. The label-maps generated in this phase have a low resolution (64 × 64), since a high level of detail is not necessary: the exact object shapes are not defined, only their centroid positions. This also significantly reduces the computational burden of this stage and speeds up the computation. The generated label-maps must subsequently be resized to the original image resolution required in the following stages of generation (nearest-neighbour interpolation is used to preserve the original label codes) and translated into labels, which are finally translated into images, using Pix2PixHD (see Figure 3).
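The construction of the "dot" label-maps and the subsequent nearest-neighbour resizing can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, each anatomical part is assumed to carry its own non-zero integer label (with 0 as background), and each dot is drawn as a single pixel because the text does not specify the dot size.

```python
def centroid_dots(label_map, out_size=64):
    """Reduce every labelled part to one pixel placed at its centroid.

    label_map: list of rows of integer labels, 0 = background.
    Returns an out_size x out_size map holding one dot per label.
    """
    height, width = len(label_map), len(label_map[0])
    sums = {}  # label -> [sum of rows, sum of columns, pixel count]
    for y, row in enumerate(label_map):
        for x, label in enumerate(row):
            if label:
                acc = sums.setdefault(label, [0, 0, 0])
                acc[0] += y
                acc[1] += x
                acc[2] += 1
    dots = [[0] * out_size for _ in range(out_size)]
    for label, (sum_y, sum_x, count) in sums.items():
        cy, cx = sum_y / count, sum_x / count
        # Map the full-resolution centroid onto the low-resolution grid.
        dot_y = min(out_size - 1, int(cy * out_size / height))
        dot_x = min(out_size - 1, int(cx * out_size / width))
        dots[dot_y][dot_x] = label
    return dots


def resize_nearest(label_map, out_h, out_w):
    """Nearest-neighbour resize, which never creates new label codes."""
    in_h, in_w = len(label_map), len(label_map[0])
    return [
        [label_map[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# Toy 8 x 8 label-map with two parts, reduced to a 4 x 4 dot map.
toy = [[0] * 8 for _ in range(8)]
for y in range(0, 2):
    for x in range(0, 2):
        toy[y][x] = 1          # part 1: top-left block
for y in range(4, 8):
    for x in range(4, 8):
        toy[y][x] = 2          # part 2: bottom-right block
toy_dots = centroid_dots(toy, out_size=4)
toy_upscaled = resize_nearest(toy_dots, 8, 8)
```

Interpolation schemes such as bilinear would blend neighbouring label values into codes that correspond to no anatomical part, which is why nearest-neighbour is the natural choice for label-maps.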

Segmentation Multiscale Attention Network
In this paper, the Segmentation Multiscale Attention Network (SMANet) [15] has been employed. The SMANet is composed of three main components: a ResNet encoder, a multi-scale attention module, and a convolutional decoder (see Figure 4). This architecture, initially proposed for scene text segmentation, is based on the Pyramid Scene Parsing Network (PSPNet) [11], a deep fully convolutional neural network with a ResNet [61] encoder. Dilated convolutions (i.e. atrous convolutions [62]) are used in the ResNet backbone to widen the receptive field of the neural network while avoiding an excessive reduction of the spatial resolution due to down-sampling. The most characteristic part of the PSPNet architecture is the pyramid pooling module (PSP), which is employed to capture features at different scales in the image. In the SMANet, the PSP module is replaced with a multi-scale attention mechanism to better focus on the relevant objects present in the image. Finally, a two-level convolutional decoder is added to the architecture to improve the recognition of small objects.

Training Details
The PGGAN architecture, proposed in [13], has been employed for image generation; the number of parameters has been modified to speed up learning and reduce overfitting. More specifically, the maximum number of feature maps for each layer has been reduced to 64. Furthermore, since the PGGAN was used to generate seeds and labels, obtaining only the semantic label-maps in both cases, the output image has only one channel instead of three. Training of the generation networks (PGGAN and Pix2PixHD) was stopped based on a visual examination of the samples generated during the training phase. The images generated in the various steps for all the methods have a resolution of 1024 × 1024, except in the case of the "dot" label-maps, which, as mentioned before, are generated at a 64 × 64 resolution.
The SMANet is implemented in TensorFlow. Random crops of 377 × 377 pixels have been employed during training, whereas a sliding window of the same size has been used for testing. The Adam optimizer [63], with an initial learning rate of $10^{-4}$ and a mini-batch of 17 examples, has been used to train the SMANet. All the experiments were carried out in a Linux environment on a single NVIDIA Tesla V100 SXM2 with 32 GB RAM.
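The sliding-window evaluation mentioned above can be sketched by enumerating the 377 × 377 windows needed to cover a test image. The text gives only the window size, so the stride, the 512 × 512 example input, the function names, and the choice to place the last window flush with the image border are assumptions for illustration.

```python
def window_starts(length, window, stride):
    """Start offsets of windows of size `window` that together cover [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window, stride))
    starts.append(length - window)  # last window ends exactly at the border
    return starts


def sliding_windows(height, width, window=377, stride=135):
    """Return (top, left, bottom, right) boxes covering the whole image."""
    return [
        (top, left, min(top + window, height), min(left + window, width))
        for top in window_starts(height, window, stride)
        for left in window_starts(width, window, stride)
    ]


# Hypothetical 512 x 512 test image: four overlapping windows are enough.
boxes = sliding_windows(512, 512)
```

How predictions are merged where windows overlap is not stated in the text; averaging the per-class scores over overlapping windows is one common choice.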

Experiments and results
In this section, after having described the dataset on which our new proposed method was tested, we evaluate the results obtained, both qualitatively -based on the judgment of three physicians -and quantitatively, comparing them with related approaches present in the literature.

Dataset
Chest radiographs are provided by the Japanese Society of Radiological Technology (JSRT) database [64]. The JSRT database comprises 247 CXRs and includes images with and without lung nodules. All images have a resolution of 2048 × 2048 pixels and a spatial resolution of 0.175 mm/pixel, with 12 bit gray levels. Segmentation supervisions for the JSRT database are available in the Segmentation in Chest Radiographs (SCR) dataset [6]. More precisely, this dataset provides chest X-ray supervisions which correspond to the pixel-level positions of the different anatomical parts. Such supervisions were produced by two observers who segmented five objects in each image: the two lungs, the heart and the two clavicles. The first observer was a medical student and his segmentation was used as the gold standard, while the second observer was a computer science student, specialized in medical imaging, and his segmentation was considered that of a human expert.
The SCR dataset comes with an official split, which is employed in this paper and consists of 124 images for learning and 123 for testing. We use two different experimental configurations. In the former, called FULL_DATASET, all the training images are exploited. More precisely, the PGGAN generation network is trained on the basis of 744 images, available in the SCR training set and obtained with the augmentation procedure described above. The SMANet is trained on 7500 synthetic images, generated by the PGGAN, and fine-tuned on the 744 images extracted from the SCR training set, while 2500 synthetic images are used for validation. For the second configuration, called TINY_DATASET, only 10% of the SCR training set is used and the PGGAN is trained on only 66 images (obtained both from SCR and with augmentation); the SMANet is trained exactly as above, except for the fine-tuning, which is carried out on 66 images.

Figure 3: The three-stage image generation scheme. In the first step, dots are generated from a latent vector. Then, Pix2PixHD translates dots into a label-map, and finally the label-map is translated into an X-ray image.

Quantitative results
Generated images have been employed to train a deep semantic segmentation network. The rationale behind the approach is that the performance of the network trained on the generated data reflects the data quality and variety. A good performance of the segmentation network indicates that the generated data successfully capture the true distribution of the real samples. To assess the segmentation results, some standard evaluation metrics have been used. The Jaccard Index, J, also called Intersection Over Union (IOU), measures the similarity between two finite sample sets (the predicted segmentation and the target mask in this case), and is defined as the size of their intersection divided by the size of their union. For binary classification, the Jaccard index can be written as:

$$J = \frac{TP}{TP + FP + FN}$$

where TP, FP, and FN denote the number of true positives, false positives and false negatives, respectively. The Dice Score, DSC, is instead defined as:

$$DSC = \frac{2 \times TP}{2 \times TP + FP + FN}$$

DSC is a quotient of similarity between sets and ranges between 0 and 1.
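As a worked example of the two metrics, the following sketch computes J and DSC from a pair of small binary masks. The function names and the toy masks are illustrative, and returning 1.0 when both masks are empty is a convention chosen here, not something stated in the text.

```python
def confusion_counts(pred, target):
    """Count true positives, false positives and false negatives for two binary masks."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def jaccard(tp, fp, fn):
    """J = TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # convention: two empty masks agree


def dice(tp, fp, fn):
    """DSC = 2 TP / (2 TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
target = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
tp, fp, fn = confusion_counts(pred, target)  # 3 true positives, 1 false positive, 1 false negative
j = jaccard(tp, fp, fn)                      # 3 / 5 = 0.6
d = dice(tp, fp, fn)                         # 6 / 8 = 0.75
```

The two scores are monotonically related through DSC = 2J / (1 + J), so they always rank segmentations of the same image in the same order; here 2 × 0.6 / 1.6 = 0.75.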
The experiments can be divided into two phases: first, we evaluate the generation procedure described in Section 3.3 using the FULL_DATASET, then, we compare this approach with the other two methods described in Sections 3.1 and 3.2 using the TINY_DATASET. The purpose of this latter experiment is to evaluate whether multi-stage generation methods are actually more effective in producing data suitable for semantic segmentation with a limited amount of data. In particular, in the experimental setup based on the FULL_DATASET, for the three-stage method, the generation network has been trained on all the SCR training images, to which the augmentation procedure described in Section 3 has been applied. Then, 10,000 synthetic images have been generated and used to train the semantic segmentation network. Moreover, we evaluated a fine-tuning of the network on the SCR real images after the pre-training on the generated images. The results, shown in Table 1, are compared with those obtained using only real images to train the semantic segmentation network, which can be considered as a baseline.
Next, the TINY_DATASET has been used in order to evaluate the performance of the methods with a very small dataset. More precisely, the following experimental setups, whose results are shown in Table 2, are considered:
• REAL: only real images are used for training the semantic segmentation network;
• SINGLE-STAGE: the segmentation network uses the images generated by the single-stage method (Synth 1 in the tables) for training, while real images are employed for fine-tuning (Finetune in the tables);
• TWO-STAGES: the images generated with the two-stage method are used to pre-train the segmentation network (Synth 2), while real images are used for fine-tuning;
• THREE-STAGES: the images generated with the three-stage method are used for training the segmentation network (Synth 3), while real images are employed for fine-tuning.
In this case, the PGGAN has been trained on 66 images, based on 11 images randomly chosen from the entire training set to which the augmentation described above has been applied.
In general, we can see that the best results are obtained with the three-stage method followed by fine-tuning. From Table 1, we observe a small improvement in results when fine-tuning a network previously trained with images generated using the three-stage method. Therefore, the three-stage method provides good synthetic data, but the advantage given by generated images is low when the training set is large. Conversely, when few training images are available, in the TINY_DATASET setup, multi-stage methods outperform the baseline (column REAL of Table 2) and this happens even without fine-tuning. Thus, in this case, the advantage provided by synthetic images is evident. Moreover, the three-stage method outperforms the two-stage approach, even with fine-tuning, which confirms our claim that splitting the generation procedure may provide a performance increase when few training images are available.
Finally, it is worth noting that fine-tuning improves the performance of the three-stage method, both in the FULL_DATASET and in the TINY_DATASET framework, which does not hold for the two-stage method. This behaviour may be explained by some complementary information that is captured from real images only with the three-stage method. Actually, we may argue that, in different phases of a multi-stage approach, different types of information can be captured: such a diversification seems to provide an advantage to the three-stage method, which develops some capability to model the data domain with more orthogonal information.
Figure 5: Examples of three-stage generated images based on the FULL_DATASET.

Comparison with other approaches
Table 3 shows our best results and the segmentation performance published by all recent methods, of which we are aware, on the SCR dataset. According to the results in the table, the three-stage method obtained the best performance score both for the lungs and the heart.
However, it is worth mentioning that Table 3 gives only a rough idea of the state-of-the-art, since a direct comparison between the proposed method and other approaches is not feasible: our primary focus is on image generation, whereas the compared approaches are mainly devoted to segmentation and report no results on small image datasets. Moreover, the previous methods used different partitions of the SCR dataset to obtain the training and the test set, such as 2-fold, 3-fold, 5-fold cross-validation or ad hoc splits, which are often not publicly available, while, in our experiments, we preferred to use the original partition provided with the SCR dataset. Finally, a variety of different image sizes have also been used, ranging from 256 × 256 to 400 × 400 and to 512 × 512, the resolution used in this work.

Qualitative results
In this section, some examples of images and corresponding segmentations, generated with the approaches described in Section 3, are qualitatively examined. We also report some comments from three physicians on the generated segmentations, to provide a medical assessment of the quality of our method. Figure 5 and Figure 6 display some examples (randomly chosen from all the generated images) of the label-maps and the corresponding chest X-ray images generated with the three methods described in Section 3, using the FULL_DATASET and the TINY_DATASET, respectively. We can observe that, with the single-stage and two-stage methods, the images tend to be more similar to those belonging to the training set. For example, in most of the generated images there are white rectangles, which resemble those present in the training images, used to cover the names of both the patient and the hospital. Instead, the three-stage method does not produce such artifacts, suggesting that it is less prone to overfitting.
Figure 6: Examples of generated images based on the TINY_DATASET. (a) Single-stage 10% generated images. (c) Three-stage 10% generated images.

Moreover, in order to clarify the limits of the three-stage method, we assessed the quality of the segmentation results based on three human experts, who were asked to check 20 chest X-ray images, along with the corresponding supervision and the segmentation obtained by the SMANet. Such images were chosen among those that can be considered difficult, at least based on the high error obtained by the segmentation algorithm. Figure 7 and Figure 8 show different examples of the images evaluated by the experts. The first column represents the chest X-ray image, while the second and the third columns, whose order was randomly exchanged during the presentation to the experts, represent the target segmentation and our prediction, respectively. The three physicians were asked to choose the best segmentation and to comment on their choice. Apart from a general agreement of all the doctors on the good quality of both the target segmentation and the segmentation provided by the three-stage method, surprisingly, they often chose the second one. For the examples in Figure 7, for instance, all the experts share the same opinion, preferring the segmentation obtained by the SMANet over the ground-truth segmentation. To report the results of the qualitative analysis, we numbered the target and predicted segmentation with 1 and 2, respectively, while doctors were assigned unordered pairs to obtain an unbiased result. Then, with respect to Figure 7(a), the comments reported by the experts were: 1) In segmentation 1, a fairly large part of the upper left ventricle is missing; 2) I choose the segmentation number 2 because the heart profile does not protrude to the left of the spine profile; 3) The best is No. 2, the other leaves out a piece of the left free edge of the heart, in the cranial area. Instead, for Figure 7(b), we obtained: 1) The second image is the best for the cardiac profile. For lung profiles, the second image is always better. The only flaw is that it leaks a bit on the right and left costophrenic sinuses. 2) Image 2 is the best, because the lower cardiac margin is lying down and does not protrude from the diaphragmatic dome. Image number 1 has a too flattened profile of the superior cardiac margin. 3) No. 2 for the cardiac profile more faithful to the real contours.
In contrast, the experts reported conflicting opinions or decided not to give a preference with respect to the examples in Figure 8. When they agreed, they generally found different reasons for choosing one segmentation over the other. With respect to Figure 8(a) the comments reported by the experts were: 1) I prefer not to indicate any options because the heart image is completely subverted; 2) Segmentation number 2 is better, even if it is complicated to read because there is a "bottle-shaped" heart. The only thing that can be improved in image 2 is that a small portion of the right side of the heart is lost; 3) No. 1 respects more what could be the real contours of the heart image. Instead, for Figure 8(b) we obtained: 1) I prefer No. 2 because the tip of the heart is well placed on the diaphragm and does not let us see that small wedge-shaped image that incorrectly insinuates itself between heart and diaphragm in image 1 and which has no [...]
These different evaluations, albeit limited by the small number of examined images, confirm the difficulty of segmenting CXRs, a difficulty that is likely to be more evident in the case of the images selected for our quality analysis, which were chosen based on the large error produced by the segmentation algorithm.

Conclusions
In this paper, we have proposed a multi-stage method based on GANs to generate chest X-ray images along with their multi-organ segmentation labels. Unlike existing image generation algorithms, in the proposed approach, generation occurs in three stages, starting with "dots", which represent anatomical parts, and initially involves low-resolution images. After the first step, the resolution is increased to translate "dots" into label-maps. We performed this step with Pix2PixHD, thus making the information grow and obtaining the labels for each anatomical part taken into consideration. Finally, Pix2PixHD is also used for translating the label-maps into the corresponding chest X-ray images. The usefulness of our method was demonstrated especially when there were few images in the training set, a setting that becomes tractable thanks to the multi-stage nature of the approach.
It is worth observing that our method can be employed for any type of images, not exclusively medical ones, while synthetic and real images can concur in solving the segmentation problem (being used for pre-training and for finetuning the segmentation network, respectively), with a significant increase in performance. As a matter of future research, the proposed approach will be extended to other, more complex domains, such as that of natural images.
Figure 8: Examples of segmented images for which doctors gave conflicting opinions. The first column represents the chest X-ray image, while the second and third columns are the target and our predicted segmentations, respectively. (b) NODULES015.