An Analysis of Loss Functions for Heavily Imbalanced Lesion Segmentation

Heavily imbalanced datasets are common in lesion segmentation. Specifically, the lesions usually comprise less than 5% of the whole image volume when dealing with brain MRI. A common solution when training with a limited dataset is the use of specific loss functions that rebalance the effect of background and foreground voxels. These approaches are usually evaluated running a single cross-validation split without taking into account other possible random aspects that might affect the true improvement of the final metric (i.e., random weight initialisation or random shuffling). Furthermore, the evolution of the effect of the loss on the heavily imbalanced class is usually not analysed during the training phase. In this work, we present an analysis of different common loss metrics during training on public datasets dealing with brain lesion segmentation in heavy imbalanced datasets. In order to limit the effect of hyperparameter tuning and architecture, we chose a 3D Unet architecture due to its ability to provide good performance on different segmentation applications. We evaluated this framework on two public datasets and we observed that weighted losses have a similar performance on average, even though heavily weighting the gradient of the foreground class gives better performance in terms of true positive segmentation.


Introduction
Magnetic resonance imaging (MRI) is one of the most common techniques using sensors to obtain imaging information from the brain.Furthermore, lesion segmentation is the most common image analysis task applied to brain MRI for different neurological pathologies, such as multiple sclerosis [1,2], Alzheimer's disease [3] or vascular cognitive impairment [4].Brain lesions, usually identified as white matter hyperintensities on T2-weighted images, represent a small percentage of the whole brain volume, can have different shapes and volumes and their spatial distribution can greatly vary between patients.Due to these issues, the accurate detection and segmentation of white matter hyperintensities is a challenging segmentation task.
Recently, deep learning approaches have rapidly become the state of the art for medical image analysis [5,6], including binary image segmentation [7].However, a severe data imbalance can bias the training of the models and produce unexpected results (i.e., reduce the accuracy of the lesion segmentation) [8,9].The most common solutions are the use of sampling techniques to increase the number of positive samples [10] and loss functions that give different weights to the foreground and background classes, either implicitly [11,12] or explicitly [13].While different losses have been presented for medical imaging [14], they are rarely analysed for datasets with heavy imbalances.Specifically, their effect on the underrepresented class is not properly understood, how the loss evolves through time per epoch is rarely studied, and the effect of randomness is usually excluded from the results, even though weight optimisation is inherently a stochastic process.To that end, multiple studies have been published on understanding prediction uncertainty through probabilistic frameworks and deep ensembles [15][16][17][18], suggesting the importance of reporting results with multiple runs and random seeds.Furthermore, losses are usually defined based on their value, but they are rarely designed taking into account their derivative and their effect on optimisation.
In this paper, we present an analysis of the most common losses used for imbalanced datasets, and we study their evolution during training focusing on the effect on the underrepresented class.Furthermore, we propose a new weighted loss function based on the gradient of the cross-entropy loss (as opposed to focusing only on the loss function alone).The rest of the paper is structured as follows: in Section 2, we present the framework and public datasets we used, followed by the results in Section 3, a short discussion in Section 4 and conclusions and future works in Section 5.

Public Datasets
For this analysis, we used two public datasets with manual annotations of brain lesions and a heavy imbalance.The statistics of the number of foreground and background voxels are summarised in Table 1 and examples of images are given in Figure 1.While the ratio between foreground and background voxels is bigger inside the training patches, they still represent less than 5% of the whole patch on average.Table 1.Statistics on the number of voxels for the foreground and background class for the analysed datasets.% (brain) refers to the lesion voxel percentage mean ± standard deviation for the whole brain, while % (batch) refers to the lesion voxel percentage for all the batches.Numbers in brackets represent the global percentage for the whole dataset and batches.2.1.1.White Matter Hyperintensities (WMH) Challenge 2017 [4] The WMH Segmentation Challenge was held during the MICCAI 2017 conference, and it provided a standardized assessment of automatic methods for the segmentation of WMH.The task for the challenge was defined as the segmentation of WMH of presumed vascular origin on brain MR images.A total of 60 labelled images were acquired from three different scanners with different sensors from three different vendors, in three different institutes.For each subject, a 3D T1-weighted and a 2D multi-slice FLAIR image were provided.Details on the acquisition and preprocessing steps for these images are summarised in the original challenge paper [4].In brief, all images were manually labelled for white matter hyperintensities by an expert, and these annotations were afterwards reviewed by a second expert who created the final mask.Other pathological regions were also labelled.However, these regions were not evaluated during the challenge and, thus, we ignored this label for our analysis.Finally, T1-weighted images were co-registered to the FLAIR space during pre-processing, and we used these pre-processed images for our experiments.

Dataset % (Brain) % (Batch) Mean Lesion Size Lesion Number
The parameters of each acquisition are summarised as follows: This database contains baseline and follow-up MR images of 20 multiple sclerosis (MS) patients.The images were acquired on a 1.5 T Philips MRI machine at the University Medical Centre Ljubljana (UMCL), and the data were anonymized [1].Each patient's MR acquisition contained a 2D T1-weighted (spin echo sequence, repetition time (TR) = 600 ms, echo time (TE) = 15 ms, flip angle (FA) = 90°, sampling of 0.9 × 0.9 × 3 mm with no interslice gap resulting in a 256 × 256 × 45 lattice), a 2D T2-weighted (spin echo sequence, TR = 4500 ms, TE = 100 ms, FA = 90°, sampling of 0.45 × 0.45 × 3 mm with no inter-slice gap, resulting in a 512 × 512 × 45 lattice), and a 2D FLAIR image (TR = 11,000, TE = 140, TI = 2800, FA = 90, sampling of 0.9 × 0.9 × 3 mm with no inter-slice gap, resulting in a 256 × 256 × 49 lattice).The median time between the baseline and follow-up studies was 311 days, ranging from 81 to 723 days with the interquartile range (IQR) of 223 days.For our analysis, we only focused on the FLAIR images that were skull stripped and bias corrected using N4, followed by coregistration to the follow-up space using rigid registration.To create a ground truth mask, an image neuroanalyst expert at our group labelled all positive activity (new and enlarging lesions) using the coregistered FLAIR images and their subtraction.Furthermore, 3 of the subjects were excluded due to a lack of positive activity after the labelling.Therefore, only 17 patients from the original dataset were used for the experiments.

Data Preparation
For both datasets we extracted patches of size 32 × 32 × 32 using a sliding window (with an overlap of 16 voxels) on the brain bounding box to minimise the number of voxels that did not belong to the brain.These patches were then sampled to decrease the imbalance between foreground and background voxels.Only patches containing lesion voxels were used for training in groups of 16.

Network Architecture
In the last few years, the 3D Unet architecture [19] has become the state of the art in volumetric medical image segmentation for different tasks and images [7].In order to train all the losses with a robust architecture, we decided to use a simple 3D Unet of 2,906,913 parameters using basic convolutional blocks (with 32, 64, 128 and 256 kernels of size 3 × 3 × 3), followed by ReLU activation, group norm (with groups of 8 channels) and max pooling of size 2 (see Figure 2).These hyperparameters were chosen empirically based on the best performance of our previous works on other private datasets for lesion segmentation [20,21].To guarantee a fair comparison, these hyperparameters were also shared for all the experiments (different seeds and loss functions).The weights were randomly initialised using the same seed for all the losses in a given experiment to reduce the effect of random initialisation.This seed was also used to ensure that the training batches per epoch were shuffled the same way for all the losses.

Loss Functions 2.3.1. Cross-Entropy
The cross-entropy loss (xent) is one of the most commonly used loss functions for deep neural networks.If we consider the distribution of the real labels (y i ) and the predicted ones ( ŷi ), the cross-entropy expresses the similarity of these two distributions in the following form: In our case, this equation represents the loss for a single voxel, and the final value is computed as the mean of the loss for all the voxels.By definition, all voxels have an equal weight; therefore, during training, the optimiser will tend to favour the majority class.

Focal Loss
The focal loss function was presented by Tsung-Yi Lin et al. [13] for dense object detection in the presence of an imbalance between positive and negative samples.Through the use of a weighting variable (α) and a modulating factor (γ), they extended the crossentropy loss to increase the importance of positive samples and reduce the effect of samples correctly classified with a high confidence.Following the notation for the cross-entropy loss, the focal loss for a prediction ŷi can be defined as: For our experiments, we used γ = 2 as suggested on the paper, and two different α values: α 1 = 0.25 (focal1) and α 2 = 0.75 (focal2).On the original paper, they suggest to decrease α as γ is increased, and they noted that α = 0.25 gave the best results with γ = 2.However, we also wanted to test the loss with a higher α value to give a higher weight to positive samples.

Generalised Dice Loss
The generalised Dice loss was presented by Sudre et al. [12] to balance the effect of positive and negative samples, based on a generalisation of the Dice similarity metric (DSC) [22].This loss function can be defined as: where Ŷ and Y represent the predicted and true segmentation for the N image voxels ( ŷi and y i ), and w bg and w f g are weights for the background and foreground class, respectively.In the special case where w bg = 0, the generalised Dice loss becomes the commonly used binary Dice loss originally presented by Milletari et al. [11]: where ϵ is a small smoothing constant to avoid numerical issues when dividing by 0. In our experiments, we used the generalised loss (gdsc) as defined on the original paper (w bg = 1/N 2 bg and w f g = 1/N 2 f g , where N bg and N f g refer to the number of background and foreground voxels), the binary Dice loss (dsc) (w bg = 0 and w f g = 1) and a combination of the binary Dice loss and cross-entropy with equal weights (mixed).

Weighted Gradient Loss
Similarly to the focal loss, we propose to introduce a weighting factor and a modular factor to the cross-entropy loss.However, instead of applying that factor on the loss function itself, we propose to apply it to the gradient of the cross-entropy when combined with the gradient of the sigmoid function as follows: where f i represents the output of the last layer before applying the final sigmoid activation.

Experimental Design
In order to counter the effect on random initialisation and shuffling, we developed a framework to train the same network with the same initialisation and batches by setting the same random seed for each loss for a given experiment.With that setup, we run 5 different 5-fold cross-validation experiments with the same training/testing splits.We then evaluate the models at the end of each epoch for a maximum of 25 epochs by analysing the Dice similarity coefficient (DSC = 2 * TP 2 * TP+FN+FP ); sensitivity, also known as true positive fraction or TP f for short (TP f = TP TP+FN ); and precision (P = TP TP+FP ).We compute all three metrics for the training patches, and the training and testing image segmentations reconstructed from patches.In brief, patches of 32 × 32 × 32 are uniformly sampled for each image with a sliding window of 16 × 16 × 16, and the final prediction for each voxel is averaged between all the overlapping results.For the binary metrics, we first binarise the predicted output of the network (with a threshold of 0.5).The goal of this setup is to analyse whether the difference in different executions are actually due to the training variance caused by randomisation or the losses themselves.

Implementation Details
The framework was implemented using Python 3.6.9and pytorch version 1.5.0 and the code is publicly available at https://github.com/marianocabezas/rethinking_dsc(accessed on 1 January 2024).The whole analysis was run on a NVIDIA DGX-1 server with 80 CPU cores, 504 GB of RAM and two Tesla V100-SXM2.

Results
To analyse the performance of the different trained models for each epoch and seed, we used the TP f and precision to illustrate the balance between over-and undersegmentation and the DSC to highlight the overall segmentation.Ideally, we would like to maximise all metrics, but in practise, increasing the TP f also increases the false positives, affecting the precision and DSC values.Table 2 summarises the best performance when the best epoch and random seed are known, while Figure 3 illustrates the curves for all folds and seeds for each dataset in terms of each of the three different metrics we analysed (plots for each fold are detailed in Figures 4 and 5).The curves are separated by the type of data being evaluated (input patches and whole image segmentations reconstructed from patch segmentations) and represent a summarised version of all the metrics for each different training seed setup.The mean metric value is represented with a solid coloured line, while error bands around that line are bounded by the minimum and maximum value per epoch for all the seeds.By using upper and lower bounds, we can represent stochastic variability that could lead to different conclusions if different seeds are used for each method.Furthermore, the error bands also illustrate how robust the models can be to the problems represented by each dataset.Observing the table and curves, there is a clear discrepancy between the performance on the training patches and the overall segmentation for the whole image.This difference is emphasised on the Dice-based losses.Furthermore, it is also clear that weighting the foreground class, either explicitly (focal and weighted loss) or implicitly (Dice based), improves the performance in terms of the final DSC metric.However, the larger the weight, the better the results.Furthermore, a high weight (higher than 0.9 for the weighted loss) has a direct impact on the TP f measure throughout the whole training process.
A closer analysis of the curves highlights the trade-off between the precision and TP f segmentation measures (also illustrated in the qualitative examples from Figures 4 and 5).As the training progresses, the TP f metric decreases slowly, while the precision improves (leading to an overall slightly better DSC value).Furthermore, if we observe the bands for the losses, there is a clear overlap between all of them for the DSC metric.This suggests that there is a large variability in the model performance depending on the chosen random seed and fold.In general, the observed training pattern is similar for the weighted losses (Dice based, focal and weighted) with an initial higher peak in terms of TP f and a slower decrease for the losses with a higher weight for the foreground class.While the behaviour is similar for both datasets, there is a bigger difference between losses for the longitudinal dataset, which also presents a higher imbalance between classes.In fact, the variability in the LIT dataset is also larger (wider error bands), leading to a larger overlap and an increased difficulty to separate between methods.Curves for the WMH dataset, on the other hand, have thinner error bands, and there is a clear separation between our proposed loss function and the rest of methods for the TP f and precision metrics on the first few epochs, even though the DSC values remain similar for the training and testing images.Nonetheless, the curves converge to similar results as the number of epochs trained increases.Finally, the curves on Figures 4 and 5 show the same trends observed on the general dataset curves, albeit with reduced error bands (fewer samples are analysed per fold).Analysing folds carefully (each row on the figures), the LIT shows larger differences between folds.Specifically, the first row shows overall higher TP f and DSC values for all methods when compared to the other folds.
To complement the analysis of the curves, qualitative results for epoch 25 (the last epoch) are provided in Figures 6 and 7 for the LIT and WMH dataset, respectively.While there seems to be a general agreement in terms of true positives (green colour), most losses and examples show an increase in false negatives (blue colour) when compared to the new proposal (as also evidenced by the numerical results).In particular, small lesions tend to be missed by losses that do not heavily weight positive predictions,with cross-entropy having the lowest detection.This is contrasted by a decrease in the number of false positives (red colour), also evidenced by the precision metric on the numerical results.From the Dicebased losses, mixed has the overall best results, with the dsc loss having a large number of false positives (especially on the LIT dataset).Comparing the two focal losses, the results are fairly similar with small differences in detection depending on the case, suggesting that the value of α might not be as important as expected.
Finally, we summarise the DSC, TP f and precision metrics on the training and testing images using a violin plot in Figure 8 to analyse the distribution of these metrics for each loss according to the random seeds.Once again, we can observe that the best performance (highest bound) is obtained with the heavily weighted gradient loss (new), even though the plots present a similar shape for DSC in both datasets for most of the losses and a large variability in general.However, we can also observe some important differences on the TP f distribution for both datasets.On the longitudinal dataset, the focal losses and the cross-entropy present a larger concentration of lower TP f values when compared to the other losses, while only the lightly weighted focal loss (focal1) and cross-entropy present a lower performance on the WMH dataset.These two losses also present the worst overall performance on the longitudinal dataset in terms of DSC (as illustrated by a longer and spread plot, concentrated on the lower bound).

The Effect of Confident Errors on the Loss Function
The Dice loss is one of the most commonly used techniques to address class imbalance due to its application as a metric for segmentation [23].Furthermore, it is usually believed that it is one of the best choices for heavily imbalanced datasets.However, the original metric was not developed for optimisation and was designed to address a set theory problems (specifically, to measure the overlap between two sets).As a consequence, there are certain unexpected side effects when pairing it with common activations for segmentation (sigmoid and softmax).One of these unexpected side effects is the effect of confidently wrong predictions (for a mathematical analysis through derivation, we refer the reader to Appendix A).This effect is evidenced by the metric curves for the Dice loss where we found the largest discrepancy between patch-based metrics (the ones during training) and image-based metrics (the final desired output).On the first step, the Dice loss favours the prediction of positive samples, leading to a high true TP f at the cost of a large number of false positive predictions (as evidenced by the low DSC value).Some of these erroneous predictions maintain a high confidence score throughout the whole process, leading to false positive predictions that are never corrected even with a large number of epochs.These findings align with another recent study where the effect of different loss functions on the logits of the network were analysed [9] and extends to the other two related losses (mixed and gdsc).
If we zoom out and analyse the other losses we explored in this study, we can observe that they can be easily derived from the cross-entropy loss.This loss in particular is ubiquitous in classification studies (including segmentation defined as pixel classification) due to its simplicity and its relationship to logistic regression.As opposed to the Dice loss, this function was derived to interact with sigmoid (and softmax in problems with multiple labels), leading to a linear gradient.In that sense, if we observe the curves for these losses (xent, focal1, focal2 and new), we can observe a smoother transition and increasing tendency when it comes to sensitivity over number of epochs and a higher mean precision.This suggests that false confident predictions are improved over time, as the weight updates are mostly linear with respect to the error on the prediction.Furthermore, the weights on the target class (lesions) are the most important trade-off between sensitivity and precision as observed in the differences between the focal1 and focal2 losses.Finally, even though cross-entropy is the most basic function, it is also highly reliable and can obtain similar results, even though it might have a higher number of epochs.
Overall, the positive detection of lesions voxels remains high after 25 epochs as evidenced by the qualitative examples in Figures 6 and 7.However, small differences in false positives and negatives are observed depending on the loss, with a common trade-off between them.

The Effect of Randomness during Training
To understand the effect of randomness during training, we use five random sets that affect the initial value of the layer's weights and the shuffling of the patches used for training at each epoch.To illustrate this effect, we plot the results of each epoch with a set of bands with the minimum, maximum and mean values over all seeds for each method in Section 3. One of the first conclusions that can be observed is the large overlap between all methods and metrics for image-wise results.As expected, this effect is made stronger on the testing images that were not seen during the training process.In that sense, while the mean curves can be easily distinguished between methods, it is also obvious that changing the random seed could lead to different conclusions if the lowest bound for the "baseline" methods is selected and the highest bound "seed" is used for a method we would like to highlight over the others.Furthermore, datasets with a low percentage of foreground voxels are more sensitive to that issue as exemplified by the curves on the LIT dataset shown in Figure 4.

The Discrepancy between Patch-Based Results and Image-Based Results
The most common way to analyse how the training process evolves is to monitor the training and validation losses.In the scenario where a model is trained on patches (due to memory constraints), these two curves represent the ability of the model to segment patches.Even though such measures are generally considered a good proxy for the real end goal (segmenting lesions on the whole brain), our experiments suggest that these values are not only over-optimistic but also highly misleading (especially when comparing methods).If we look at the patch-based results for both datasets, all methods have high DSC values over 0.7 (considered a good value for lesions) [24], with some losses reaching a value close to 1 (a perfect score).While this is not surprising because the model is trained with these same patches, the same metric drops drastically for the LIT dataset (the one with the lowest percentage of lesion voxels) and less so for the WMH dataset.Furthermore, our proposed loss function that obtains the lowest DSC on the training patches by a large margin obtains comparable results when evaluated on the image segmentation results reconstructed from patches as shown in Figure 3.
On a related note, while it might seem counterintuitive that the sensitivity is also affected when comparing results between patches and images, this phenomenon might be caused by the reconstruction of the final results through averaging.In particular, boundary lesion voxels might have a decreased score when considering different predictions for the same voxel in different relative positions within the patch.In fact, previous studies on CNNs have proven that while convolutions are inherently shift-invariant, pooling and padding can lead to encoding positional information during training [25,26].In that sense, both reconstructing image segmentations from patches or performing inference directly on the whole image would lead to different results when comparing metrics on images and patches.Moreover, if overlapping patches are used during training, the loss metric might be unrealistic image-wise, as some voxels would be counted as independent occurrences more than once.
To conclude, these results suggest a few possible explanations that have also been corroborated by previous studies focusing on the DSC metric and its role as a measure of overlap that provides a balanced overview in terms of true and false predictions.On one hand, it is well known that the DSC metric is sensitive to the size of the object being analysed [23,27].In our results, this is reflected by the difference between patch-wise and image-wise results but also by the fact that LIT results are a decimal point lower than those obtained with the WMH dataset.On the other hand, patch-wise results focus exclusively on patches with both foreground and background voxels (as background-only patches would not contribute to improve lesion segmentation and could be detrimental).In that sense, our analysis further emphasises the importance of not relying on loss values alone when evaluating model performance and to validate with metrics that are as close as possible to the desired end goal.

Conclusions
In this paper, we presented an evaluation of a set of common losses used for binary lesion segmentation on heavily imbalanced scenarios.We observed that training with sampled patches from images gives an unrealistic measure of the final model when applied to whole images.In other words, while some losses might provide a better performance on the training patches, that improvement is not reflected on the training images as a whole, due to the bias introduced during sampling.Furthermore, we have observed that random initialisation and shuffling can cause a large variance on the performance metrics, giving a range of different conclusions depending on which random seed and stopping epoch we take to evaluate the losses.However, most of the analysed losses have similarly optimal performance results (if the best stopping epoch and seed are known), even though they show clearly different trends throughout the training process depending on what the loss prioritises (e.g., an increase in positive predictions for our proposal).To conclude, heavily weighting the gradient of the foreground class gives the best results in terms of true positives, while still being able to obtain a competitive DSC metric on average, with an acceptable number of false positives.
Finally, we can express the third term as: If we now group all three terms together again, the final gradient with respect to Θ is given by: If we ignore the normalising factor to focus on the relationship between background and foreground voxels, we can simplify that equation as: (A20) Since ŷi = σ( f i (X, Θ)) as defined by Equation (A6), its derivative with respect to a given weight is given by: and so: Analysing that expectation, it becomes 0 when all the predictions have a value of either 0 or 1, independently of the real label.Furthermore, the maximum gradient for a voxel is given when its probability is 0.5.This is usually the case after initialisation.However, as the weights are updated and that probability moves away from 0.5, that gradient will decrease whether the voxel is well-classified or not.
Taking that into account, the gradient from Equation (A19) becomes 0 when all voxels (including background and foreground) have a prediction of 0 or 1.That means that extreme false positives and false negatives will not have an effect on updating the network parameters using gradient descent.Furthermore, in the extreme case where all the foreground voxels have been misclassified as background, they will not be taken into account while updating the network weights.In an extremely unbalanced dataset, this risk is important since the network might not be able to recover from that and miss all positive examples.Now, if we analyse Equation (A20) with respect to the expectations of ŷi , we can see how E i∈Y 1 [ ŷi ] represents the true positive rate for foreground voxels, while E i∈Y 0 [ ŷi ] can be regarded as a the rate of misclassified background voxels.In that sense, if we analyse the first term, the gradient of the background voxels increases as the true positives increase, while the gradient of the foreground voxels decreases as the false positive predictions decrease.On the other hand, the second term only depends on the number of positive and negative voxels.In an extremely unbalanced case, the gradient of the background voxels would dominate that second term since |Y 0 | ≫ |Y 1 |.Furthermore, in order for the weight of the first term (|Y 1 |(w 1 − w 0 )) to be larger than the weight of the second term (Nw 0 ), w 1 > Nw 0 |Y 1 | + w 0 .Finally, in the special case where w 0 = 0, the most common use of the Dice loss originally presented by Milletari et al. [11], the second term becomes 0 and the normalising factor is simplified to N ( Ŷ, Y) = 2 .In that case, the derivative (and loss function) are undefined when all predictions are set to 0 and all the voxels belong to the background.Furthermore, the issues from the first term and the derivative of the sigmoid still apply to that function.

WMH2017 1 .Figure 1 .
Figure 1.Example of slices of a randomly selected subject for each dataset.The first row depicts an example of a T1 slice, a FLAIR slice and the boundary of all the lesions for a WMH2017 subject, while the second row depicts an example of a baseline, follow-up image and the boundary of all the new lesions (not appearing on the baseline image) for a LIT longitudinal subject.

Figure 2 .
Figure 2. Scheme of the simple Unet architecture used for this analysis.

Figure 3 .
Figure 3. Band plots of the DSC, sensitivity and precision metric for the weights of each epoch for all the folds.The upper and lower bands represent the minimum and maximum values, while the middle line represents the mean for all random seeds and folds.The first three rows summarise the metrics for the LIT dataset, while the last three summarise the results for the WMH challenge dataset.

Figure 4 .
Figure 4. Band plots of the DSC and sensitivity metric for the weights of each epoch for each fold of the LIT dataset.The upper and lower bands represent the minimum and maximum values, while the middle line represents the mean for all the random seeds.

Figure 5 .
Figure 5. Band plots of the DSC and sensitivity metric for the weights of each epoch for each fold of the WMH dataset.The upper and lower bands represent the minimum and maximum values, while the middle line represents the mean for all the random seeds.

Figure 6 .
Figure 6.Qualitative examples of three subject from the LIT dataset where new lesions are segmented (those that appear only on the follow-up image on the right).Classified voxel are colour coded to represent true positives (green), false positives (red) and false negatives (blue).

Figure 7 .
Figure 7. Qualitative examples of three subject from the WMH dataset, where white matter hyperintensities are segmented.Classified voxel are colour coded to represent true positives (green), false positives (red) and false negatives (blue).

Figure 8 .
Figure 8. Violin plots for the DSC, sensitivity and precision metrics for all the folds and training seeds.The first row represents the results on the LIT dataset, while the second row represents the results on the WMH dataset.
of the background voxels and E i∈Y 1 the expectation of the foreground voxels.Continuing with the foreground term:

Table 2 .
Summary of the measures for the average of the best results for each fold.Patch and train refer to the patch-wise and image-wise measures for the training set, while test stands for the image-wise measures for the testing set.