Deeply supervised UNet for semantic segmentation to assist dermatopathological assessment of Basal Cell Carcinoma (BCC)

Accurate and fast assessment of resection margins is an essential part of a dermatopathologist's clinical routine. In this work, we successfully develop a deep learning method to assist the pathologists by marking critical regions that have a high probability of exhibiting pathological features in Whole Slide Images (WSI). We focus on detecting Basal Cell Carcinoma (BCC) through semantic segmentation using several models based on the UNet architecture. The study includes 650 WSI with 3443 tissue sections in total. Two clinical dermatopathologists annotated the data, marking tumor tissues' exact location on 100 WSI. The rest of the data, with ground-truth section-wise labels, is used to further validate and test the models. We analyze two different encoders for the first part of the UNet network and two additional training strategies: a) deep supervision, b) linear combination of decoder outputs, and obtain some interpretations about what the network's decoder does in each case. The best model achieves over 96%, accuracy, sensitivity, and specificity on the test set.


Introduction
Basal Cell Carcinoma (BCC) is the most common malignant skin cancer with an increasing incidence by up to 10 % a year [1]. It can be locally destructive and is an essential source of morbidity for patients, mainly when located on the face. Thus, it must be adequately treated, despite its slow growth [2]. Although BCC can be effectively managed through surgical excision, determining the most suitable surgical margins is often not trivial. Complete removal of the pathological tissue is the key to a successful surgical treatment. Initially, the tumor is removed with a safety margin of surrounding tissue, and it is sent to a laboratory for analysis. If remaining tumor parts are detected in the margins, further surgery may be performed. For BCC, approximately 10 % of the tumors recur after the usual removal surgeries [3]. The resection margins' microscopic control can reduce the recurrence rate to 1 % [4,5].
As a part of the laboratory routine, each tissue biopsy is cut into several slices in the center of the probe and close to its edges. The pathologist can only be sure that the surgery was successful if no tumor is present in the latter ones. However, this task is time-consuming and error-prone, even for skilled pathologists. Examination under the microscope is currently the most common practice. Still, pathology laboratories can use scanners to digitalize the samples and obtain Whole Slide Images (WSI) to benefit from a Computer-Aided Diagnostic system. We aim to develop an artificial intelligence method that can support pathologists in providing fast, reliable, and reproducible decisions in this context. The first and most crucial step we do is to automatically highlight where the tumors are located with the highest probability. That also allows the pathologist to interpret the computer-based decisions better. To this end, we obtain a deep learning-based semantic segmentation of a WSI into two classes: Tumor and Normal, i.e., each pixel of tissue in the image is assigned to one of these labels. Following, we decide for each section, whether it contains tumors or not, based on the segmentation, and account for possible model errors.

Related work
Deep learning has shown great potential to address several problems in understanding, reconstructing, and reasoning about images. In particular, convolutional neural network approaches have been actively used for classification and segmentation tasks in a wide field of applications [6][7][8], ranging from robot vision and understanding to the support of critical medical tasks [9][10][11][12]. The nearly human-expert performance achieved in some medical imaging applications [13,14] has come to show the capabilities and potential of these algorithms.
Besides the success of deep learning methods in various histological imaging tasks [15][16][17], to the best of our knowledge, there are only a few works [18] on tumor recognition in dermatopathological microscope images. We argue that the automated diagnostic of microscopic images of the human skin can be incredibly challenging due to the large variety of relevant data. Dermatological features like the structure of the extracted tissue depend on several aspects, e.g., the body part or the patient's skin type. Therefore, the pursued deep learning model has to be more robust towards such changes than models that segment histological images of inner organs. An extensive database, such as the one we use for this study, is essential for reliable digital dermatopathology solutions.
Most recent semantic image segmentation methods are based on fully convolutional networks (FCN) [19]. In this work, we use the UNet [9] as base architecture, which has been successfully applied in a large variety of medical image analysis applications [10,20,21]. Recent work has also used a UNet to segment the epidermis from the skin slice [22]. While a previous study [18] obtained promising results in diagnosing a BCC subtype, we focus on the automatic detection of BCC in general, including several subtypes such as sclerodermiform, nodular, and superficial.

Data collection and data parts
The tissue biopsy samples were prepared in the "Dermatopathologie Duisburg" laboratory using standard protocols. After fixation in formalin and the usual paraffin embedding, the probe was cut into slices of ≈ 3 µm thickness, and standard hematoxylin and eosin (H&E) staining was applied. The samples were digitized using a "Hamamatsu NanoZoomer S360" scanner with a 20x lens. In total, this study includes 650 WSI annotated in different manners and used for training, validation, and testing. Two clinical dermatopathologists annotated 100 WSI, which were included in the Train and Validation I parts of the data, see Table 1. The annotations contain different interest regions of the tissue: Tumornest, indicates where tumor islands are; Stroma, highlights supporting tissue around the tumor islands; and Normal, contains parts of the tissue that look similar to tumors, such as hair follicles, actinic keratosis, or cysts. However, only the Tumornest annotations were done exhaustively, whereas the others were annotated only in some cases. For that reason, we only use Tumornest annotations for training, and leave the other types for data balancing. Some annotation examples are shown in Figure 1.
Due to the huge number of WSI, it was not feasible to create detailed annotations for all the slides. Two parts of the data, namely Validation II and Test only contain section-wise labels. That means that for  each section of tissue in the WSI, there is a label that indicates whether there is some tumor inside or not, see Figure 2. To speed up the section-wise annotation process, we first generated bounding boxes and assigned the label Normal or Tumor resulting from one of our models. The pathologists then efficiently corrected these labels by just changing those which were wrong, see Section 3.2. Table 1 contains the amount of WSI for each part of the data and the number of sections.

Model architecture
The neural network designs we use in this work follow the UNet architecture [9], a fully convolutional encoder-decoder network with skip-connections between the encoder blocks and their symmetric decoder blocks. We use two different encoders and a standard decoder, similar to the original one [9] with only minor changes. The input to the network is a patch x ∈ R 512×512×3 , and the output is a segmentation map Φ(x) ∈ [0, 1] 512×512×2 . The segmentation map contains two matrices of size 512 × 512, which correspond to the predicted pixel-wise probabilities for each of the two classes: Tumor and Normal. Since we use the Softmax activation to compute the segmentation map, this can be seen as only one matrix of size 512 × 512 with the Tumor probabilities.

Encoder
The first encoder follows nearly the same design as the original one [9] with 5 blocks, which successively down-sample the spatial resolution to increasingly catch higher-level features. Each of the blocks doubles the number of feature channels and halves the spatial resolution. It uses two 3 × 3 convolutions, where the first one uses a stride of 2. The second encoder is exactly the convolutional backbone of a ResNet34 [23], as can be observed in Figure 3, and also contains 5 blocks. Both encoders contain an initial block with a 7 × 7 convolution with 64 filters and a stride of 2, which decreases the input resolution by half. That allowed us to effectively enlarge the patches' size (512 × 512) to incorporate a larger context without a significant increase in computation and memory consumption.

Decoder
The decoder contains an expanding path that seeks to build a segmentation map from the encoded features, see Figure 3. It has the same number of blocks as the encoder. Each decoding block duplicates the spatial resolution while halving the number of feature channels. It performs a bilinear up-sampling and concatenates the result with the output of its symmetric block in the encoder (skip-connection). Next, it applies two 3 × 3 convolutions with the same number of filters. Unlike the original decoder proposed in [9], it has one extra block that does not obtain any skip-connection. The output of the last decoder block is passed to a 1 × 1 convolution that computes the final score-map.
Additionally, at each block of the decoder, we added an extra 1 × 1 convolution, i.e., a linear combination of the block's final feature maps, to produce an intermediate score-map. We used these maps in two settings: a) deep supervision and b) to merge them through a linear combination to produce the final output. Except for the 1 × 1 convolutions, we always use a batch-normalization layer [24] and a ReLU activation function right after each convolution in the whole network. Furthermore, we always use the corresponding padding to keep the same spatial size unchanged by the convolution's kernel. Residual Basic Block with downsampling ψ4(x) 32 16 Figure 3. UNet architecture with a ResNet-34 encoder. The output of the additional 1 × 1 convolution after Softmax is shown next to each decoder block.

Model training
For training the models, we used the Train set, which contains 85 annotated WSI. We extracted small patches of size 512 × 512 at 10x level of magnification, on which we performed the semantic segmentation. In almost all slides, the tumor-free tissue is dominant; therefore, it was necessary to balance the training data to avoid biases and improve the performance. We did this based on the amounts of pixels belonging to the 3 types of annotations: Tumornest (T), Stroma (S), and Normal (N). The re-sampling was done as depicted in Table 2. That means that some of the patches were used several times during a training epoch. That is the case for patches that contain tumor, are close to tumor areas (Stroma), or contain normal regions which look similar to tumors (Normal). Additionally, we did extra over-sampling for patches with high tumor density.
During training, all patches were extensively augmented using: random rotations, scaling, smoothing, color variations, and elastic deformations to increase the variety of the data effectively. Additionally, we used the Focal-Loss with γ = 2.0 [25], which has a similar effect to down-weighting the easy examples, to make their contribution to the total loss smaller. We did not include any re-sampling or augmentation for the validation dataset.
We trained the models using a maximum of 40 epochs, a batch-size of 64, lr = 5 · 10 −4 , and a scheduler to multiply the learning rate by 0.8 every 5 epochs. Optimization was performed with the Adam method [26] and all computations were done on 4 NVIDIA GeForce GTX 1080 Ti.

Deep supervision
The deep supervision strategy consists of forcing the decoder blocks' outputs to yield a meaningful segmentation map. We compare each of the decoder blocks' output with the corresponding down-sampled  Table 2. Patches distribution and pixel-wise unbalance before and after re-sampling the data. The pixel unbalance is computed as the ratio between the amounts of tumor-free and tumor pixels.
version of the target segmentation map and add the discrepancy to the total loss. This technique was originally introduced in [27], for obtaining transparency and robustness of the features extracted in the middle of the network and helping address the vanishing gradient problem. In this case, it allows gradient information to flow back directly from the loss to every block of the decoder. Some recent works [28,29] used a similar idea for training a UNet. Let x ∈ R 512×512×3 be an input patch and its corresponding target segmentation map y ∈ {0, 1} 512×512×2 . We define ψ (x) as the output of the -block of the decoder (after Softmax), see Figure 3. The contribution to the loss function from this single data point (x, y) is then defined as where k = 5 is the number of decoder blocks, Π : {0, 1} 512×512×2 → {0, 1} 32·2 ×32·2 ×2 is a down-sampling operator, and f l is the Focal-Loss. We also tried using different weights for each of the scales, but we did not observe any improvement. This strategy does not involve any change on the previously described architecture; it only needs the 1 × 1 convolution and the Softmax at the end of each block of the decoder. The final output is given by Φ(x) = ψ k−1 (x).

Linear merge
The second strategy merges the decoder outputs through a linear combination. We add a linear layer with weights w ∈ R k to the architecture that computes the final output, i.e., whereψ (x) is the score-map computed at the -block of the decoder, i.e., the same as ψ (x) but without the Softmax activation, and Γ l : R 32·2 ×32·2 ×2 → R 512×512×2 is a bilinear up-sampling operator. In this case, the Softmax is only applied after the linear combination. The weights w are trained together with the other parameters of the model. Note that the standard model without this strategy is a special case, since the model could learn to assign w k−1 = 1.0 and w = 0 for 0 ≤ < k.

Section-wise classification
The final classification task for each section of the WSI (see Figure 2) is to decide whether it contains tumor or not. First of all, we use the trained model to generate a heatmap by combining several patches' predictions and highlighting the tissue's parts with the highest probabilities to contain tumor. We extract the patches so that they cover the whole tissue area and have at least 256 pixels (50 %) overlapping at every border. Following, we apply a prediction threshold to obtain a binary mask and find connected regions. To account for some possible model errors and reduce the false positive rate, we filter out regions below a certain area threshold. If any predicted tumor area is left in the section, it is classified as Tumor; otherwise, it is classified as Normal. The prediction and area thresholds are selected for each model independently as part of the model selection.

Model selection
We trained our implementation of the original UNet and three other variants using a ResNet34 encoder. The first alternative does not employ any decoder strategy (ResNet34-UNet), the second one uses the deep supervision strategy (ResNet34-UNet + DS), and the third one uses the linear combination of the decoder outputs (ResNet34-UNet + Linear). During training, we evaluated the models on the Validation I part of the data. We used the Intersection over Union (IoU) metric to select the models from the best five training epochs for each setting. Afterward, we evaluated these models on the section-wise classification task in the Validation I and Validation II parts of the data to select the best model together with the prediction and area thresholds. To this end, we used a grid-search and the F β score with β = 1.5, to give higher importance to the recall/sensitivity. The performance of the best models and the selected thresholds are shown in Table 3. Additionally, Figure 4 presents the outputs of the models and the resulting masks after applying the corresponding thresholds for some patches from the Validation I data.

Setting
Prediction Threshold  Table 3. Results on the section-wise classification task for all sections from the Validation I and Validation II parts of the data. The values correspond to the best models and selected thresholds.

Results
Finally, we evaluated the selected models from the validation phase on the Test data. The results are depicted in Table 4 and Figure 5 shows the number of wrongly classified sections by each model. The ResNet34-UNet + DS achieved the best results, obtaining more than 96 % overall accuracy, sensitivity, and specificity. It wrongly classified only 74 (30 FP and 44 FN) out of 1962 sections. Figure 6 shows some correctly classified sections for different BCC subtypes and the corresponding heatmaps generated with this model.
In most false negative cases, the model detects the tumor but without enough confidence. Due to the selected prediction and area thresholds these sections are then classified as Normal. On the other hand, false positives are often due to a hair follicle or other skin structure that is identified as a tumor, see for example Figure 7. x y   Table 4. Results on the section-wise classification task for all sections from the Test part of the data.
The ResNet34-UNet + DS was the second-best in the model selection phase. We argue that the slightly different results on the test set are due to the fact that the validation data is biased. Initially, the Validation II part was larger, but we selected challenging slides, based on the pathologists' feedback, which were then fully annotated and included in the Training part. All in all, the classification error was reduced from 8.5 % (baseline UNet) to 3.8 %. The baseline UNet, which uses the standard encoder, has excellent sensitivity but quite a low specificity. In contrast, the models with the ResNet34 encoder are more balanced and exhibit better performance.

Discussion and interpretability
We now analyze in more detail the models trained with the two additional strategies, namely ResNet34-UNet + DS and ResNet34-UNet + Linear. In Figure 8, we show the decoder outputs for some patches from the Validation I data using these models.
In the deep supervision case, one can observe that, indeed, it is possible to guide the UNet to produce a meaningful segmentation already from the first block (ψ 0 ) of the decoder, see Figure 8. The difference between ψ 0 (x) and ψ 4 (x) is only the resolution and the number of details on the borders. We observed that already in ψ 0 (x), the model knows where the tumor is if there is any. Therefore, we separately evaluated the performance of ψ 0 and the other decoder blocks on the final classification task, using the same prediction and area threshold selected for the original model. The results are nearly the same as when using the whole model, see Table 5; in the case of ψ 0 , the difference is only 5 sections, whereas in ψ 1 it is only 1. Those sections were wrongly classified because the predicted tumor area was right at the limit.
Moreover, performing inference using the encoder followed by only the decoder's first block ψ 0 is much faster and boosts the heatmap generation process's speed. The original model takes approximately 30 s per WSI (several sections) on average on an NVIDIA GeForce GTX 1080 Ti. In contrast, the reduced one needs approximately only 20 s, which represents a reduction of 33 % of the time.  On the other hand, the setting that uses a linear merge of the decoder blocks' outputs has a fascinating behavior. In this case, the model has more freedom since we do not guide the decoder blocks to output meaningful segmentations. We only include the final linear combination in the loss function. In Figure 8, one can observe that in ψ 0 (x), the model identifies a rough approximation of the tumor's location, which is much larger than the final result. Following, in the subsequent blocks, it creates more details. The learned weights assigned to each block were w = [0.1590, −0.1645, 0.1603, −0.2963, 0.0036]. Even though the observed behavior makes sense and offers some interpretability, it seems that, at least for the final classification task, it does not bring any advantage. As we have seen, only one block of the decoder seems to be good enough. The linear merge strategy might be more effective if the aim is to obtain very exact borders.

Blind study
The results presented in this work were obtained as part of a blind study. At the moment of training the section-wise annotations for the Test data did not exist yet. We created an initial classification based on the ResNet34-UNet model, and the pathologists checked them and corrected the wrong ones. That allowed us to obtain the section-wise annotations of the Test data in a very time-efficient manner. The process was ResNet34-UNet + DS ResNet34-UNet + Linear x y

ResNet34-UNet + DS
ResNet34-UNet + Linear x y done using our experimental online platform 1 , which was created for reviewing and visualizing WSI, annotations, and model predictions 2 . Finally, after the wrong labels were corrected, we computed the final scores for all our models.

Conclusions
In this work, we used a UNet architecture with two different encoders and several training strategies for performing automatic detection of BCC on skin histological images. Training the network with deep supervision was the decisive factor for improving the final performance. After trained with this strategy, each decoder's block focused on obtaining a segmentation with more details than the previous one but did not add or remove any tumor content. We found out that the decoder's first block is enough for obtaining nearly the best results on the classification task. That implies a substantial speed improvement on the forward pass of the network and, therefore, on the heatmap generation process (33 % time reduction). We performed the final evaluation on a rather large Test dataset compared to the Training part of the data. Still, the best model obtained a 96.2 % accuracy and similar sensitivity and specificity on the section-wise classification task. These results are very promising and show the potential of deep learning methods to assist dermapatopathological assessment of BCC.