Going to Extremes: Weakly Supervised Medical Image Segmentation

Medical image annotation is a major hurdle for developing precise and robust machine learning models. Annotation is expensive, time-consuming, and often requires expert knowledge, particularly in the medical field. Here, we suggest using minimal user interaction in the form of extreme point clicks to train a segmentation model which, in effect, can be used to speed up medical image annotation. An initial segmentation is generated based on the extreme points utilizing the random walker algorithm. This initial segmentation is then used as a noisy supervision signal to train a fully convolutional network that can segment the organ of interest, based on the provided user clicks. Through experimentation on several medical imaging datasets, we show that the predictions of the network can be refined using several rounds of training with the prediction from the same weakly annotated data. Further improvements are shown utilizing the clicked points within a custom-designed loss and attention mechanism. Our approach has the potential to speed up the process of generating new training datasets for the development of new machine learning and deep learning-based models for, but not exclusively, medical image analysis.


Introduction
A major bottleneck for the development of novel machine learning (ML) based models is the annotation of datasets that are useful to train such models. This is especially true for healthcare applications, where annotation typically needs to be performed by experts with clinical domain knowledge. This bottleneck inhibits our ability to integrate ML-based models into clinical workflows in order to increase their productivity. At the same time, there is a growing demand for ML methods to improve clinical image analysis workflows, driven by the growing number of medical images taken in routine clinical practice.
In particular, volumetric analysis has shown several advantages over 2D measurements for clinical applications (Devaraj et al., 2017), which, in turn, further increases the amount of data (a typical CT scan contains hundreds of slices) that needs to be annotated to train accurate 3D models. Apart from acquiring accurate measurements, volumetric segmentation is widely desirable for visualization, 3D printing, radiomics, radiation treatment planning, image-guided surgery, and registration. Despite the increasing need for 3D volumetric training data to train accurate and efficient ML models for medical imaging, the majority of annotation tools available today are constrained to performing the annotation in multiplanar reformatted views. The annotator needs to either use a virtual paint brush or draw boundaries around organs of interest, often on a slice-by-slice basis (Yushkevich et al., 2006). Classical techniques like 3D region growing or interpolation can speed up the annotation process by starting from seed points or sparsely annotated slices, but their usability is often limited to certain types of structures. Some tools allow the user to skip certain regions of the image by interpolating between slices or cross-sectional views, which can be helpful, but they often ignore the underlying image information. Hence, these approaches cannot always generalize to the varied use cases in medical imaging.
In this work, we propose to use only minimal user interaction in the form of extreme point clicks at the boundary of the object or organ of interest in order to train a deep learning (DL) based segmentation model. The proposed approach integrates an iterative training and refinement scheme to gradually improve the model's performance. Starting from user-defined extreme points along each dimension of a 3D medical image, an initial segmentation is produced based on the random walker (RW) algorithm (Grady, 2006). This segmentation is then used as a noisy supervisory signal to train a fully convolutional network (FCN) that can segment the organ of interest based on the provided user clicks. Furthermore, we propose several variations on the deep learning setup to make full use of the extreme point information provided by the user. For example, we integrate the point information into a novel point-based loss function and combine it with an attention mechanism to further guide the segmentations. Through large-scale experimentation, we show that the network's predictions can be iteratively refined using several rounds of training and prediction, always using the same weakly annotated point data as our only manually provided supervision signal.

Related work
Segmentation networks
Fully convolutional networks (FCNs) (Long et al., 2015) have established themselves as the state-of-the-art methods for medical image segmentation in recent years (Ronneberger et al., 2015; Milletari et al., 2016; Çiçek et al., 2016; Liu et al., 2018; Myronenko, 2018). However, a major drawback is that they are very data-hungry, limiting their application in healthcare where data annotation is very expensive. To reduce the cost of labeling, semiautomated/interactive and weakly supervised methods have been proposed in the literature (Guo et al., 2018; Tajbakhsh et al., 2020).

Interactive segmentation
The integration of semiautomated approaches has been an active area of development (An et al., 2017), typically utilizing classical methods such as graph cut (Boykov and Funka-Lea, 2006), random walks (Grady, 2006), active shape models (van Ginneken et al., 2003; Schwarz et al., 2007), and others (Dougherty, 2011). Machine learning methods have also been considered as a viable basis for interactive algorithms. In Wang et al. (2016), an online random forest is used in combination with conditional random fields and 4D graph cuts to segment the human placenta in fetal MRI scans within a minimally interactive framework. Recently, building on advances in deep learning, several new methods have been proposed. One popular form of interaction is user-drawn scribbles. In Amrehn et al. (2017), a user can iteratively add scribble hints as seed points to improve the segmentation result given by an FCN. In Wang et al. (2018), the DeepIGeoS algorithm leverages geodesic distance transforms and scribbles to allow interactive segmentation. An alternative method (Wang et al., 2018) uses image-specific fine-tuning and leverages both bounding boxes and scribble-based interaction. Can et al. (2018) propose to use scribbles with random walks (Grady, 2006) and FCN predictions to achieve semi-automated segmentation. Scribbles are also used to generate pixel-level maximum category likelihood via propagation to their neighborhood in Dias et al. (2019). Instead of scribbles, point clicks are another widely practiced form of interaction. In Sakinis et al. (2019), the authors represent the clicks as Gaussian kernels and place them in a separate input channel to an FCN to model user interactions via seed-point placing. Khan et al. (2019) extend the Gaussian kernel idea to a confidence map derived from extreme points that quantitatively encodes some priors. Majumder and Yao (2019) transform the positive and negative clicks into images based on superpixels and object proposals, so that image information can be utilized with clicks to generate a guidance map. In addition to scribbles and points, Ling et al. (2019) parameterize the segmentation boundary as polygons/splines, which are further modeled as a graph. Location shifts for each node are then predicted via Graph Convolutional Networks (GCN).
Weakly-supervised segmentation
Weak supervision significantly reduces the time needed for user annotation and is therefore an important research area for DL. One popular idea is to apply classical non-learning-based methods over a DL-generated feature map. For example, in Dias and Medeiros (2019), Monte Carlo region growing is triggered from confidence scores given by a network, and in Cerrone et al. (2019), random walks are performed over learnt edge weights. An "opposite" idea is to use classical unsupervised methods as an initial estimate for a further learning process. In Rajchl et al. (2017), an initial GrabCut segmentation is used for this purpose, and segmentation performance is then improved with several rounds of predictions using a CNN plus Dense CRF post-processing. Similarly, in Zhang et al. (2018), segmentation results based on k-means are used to train a deep segmentation network on cystic lung regions. Without proper supervision, such approaches might work well if the unsupervised techniques have good enough initial performance. However, completely unsupervised techniques might fail to generalize to organs where the boundary information is not as clear. One possible way to address this issue is to add a confidence network (Nie et al., 2018) to judge the quality of the additional information generated, so that unlabeled data can be included to adversarially train the segmentation network. More recently, Kervadec et al. (2019) introduced inequality constraints based on target-region size and image tags in the loss function of a CNN in order to train the network for weakly supervised segmentation. Instead of information extracted by classical methods, weakly-supervised or self-learning can also make use of readily available measurements or of non-experts' judgements. One example is the measurements acquired during evaluation of the RECIST criteria (Cai et al., 2018) in the hospital picture archiving and communication system (PACS). However, such measurements are typically constrained to 2D and might miss adequate constraints for more complex three-dimensional shapes. Non-expert annotations can be acquired by utilizing crowd-sourcing platforms. Rajchl et al. (2016) distribute super-pixel weak annotation tasks, collect such annotations from a crowd of non-expert raters, and use them as weak supervision for network training.

Contributions
This work follows our preliminary study presented in Roth et al. (2019), which investigated a 3D extension of Maninis et al. (2018) in a weakly supervised setting and built on random walker initialization from scribbles. In this work, we extend this approach and add the following contributions:
• We utilize a modern network architecture shown to be very efficient for medical image segmentation tasks, namely the architecture proposed in Myronenko (2018), and integrate the attention mechanism proposed by Oktay et al. (2018).
• We make proper use of the point channel information not just at the input level of the network, but throughout the network, namely in the new attention gates.
• We furthermore propose a novel loss function that integrates the extreme point locations to encourage the boundary of our model's predictions to align with the clicked points.
• We extend the experimentation to a new multi-organ dataset that shows the generalizability of our approach.

Method
The starting point for our framework is a set of user-provided clicks on the extreme points {e} that lie on the surface of the organ of interest. We follow the approach of Maninis et al. (2018) and assume that the users provide only the extreme points along each image dimension in a three-dimensional medical image. This information is then utilized at several places within the network and during our iterative training scheme. The overall proposed algorithm for weakly supervised segmentation from extreme points can be divided into the following steps, which are detailed below:
1. Extreme point selection
2. Initial segmentation from scribbles via random walker (RW) algorithm
3. Segmentation via deep fully convolutional network
4. Regularization using random walker algorithm
Steps 2, 3, and 4 will be iterated until convergence. Here, convergence is defined based on a hold-out validation set.
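The iteration over training and regularization can be sketched as a small driver loop. This is only an illustrative sketch: the callables `train_fcn`, `regularize`, and `validate` are hypothetical placeholders for the steps described in the following sections, and the stopping rule (stop as soon as the validation score no longer improves) is an assumption, since the text only states that convergence is judged on a hold-out validation set.

```python
def iterate_weak_supervision(initial_labels, train_fcn, regularize, validate, max_rounds=20):
    """Alternate FCN training and random walker regularization.

    initial_labels: pseudo labels from the first random walker run (step 2).
    train_fcn(labels) -> model          (step 3, hypothetical callable)
    regularize(model) -> new labels     (step 4, hypothetical callable)
    validate(model)   -> float score on a hold-out validation set
    """
    labels = initial_labels
    best_score, best_model = float("-inf"), None
    for _ in range(max_rounds):
        model = train_fcn(labels)
        score = validate(model)
        if score <= best_score:
            # Assumed convergence criterion: no further validation improvement.
            break
        best_score, best_model = score, model
        labels = regularize(model)
    return best_model, best_score
```

The default of 20 rounds mirrors the maximum number of iterations used in the experiments; any other convergence rule could be substituted without changing the structure of the loop.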

Step 1: Extreme point selection
Defining extreme points {e} on the surface of the organ will allow the extraction of a bounding box around the organ of interest.Additional padding is typically useful to allow the network to learn some contextual information around the organ of interest.
Bounding box selection significantly reduces the image content to be analyzed and simplifies the machine learning problem, as previous work on cascaded approaches showed (Roth et al., 2018). The computer vision literature has extensively studied bounding boxes and extreme points on objects (Maninis et al., 2018). Extreme points avoid a practical drawback of bounding box selection, in which the user often has to pick the corners of bounding boxes outside of the object of interest. This is particularly difficult to do for three-dimensional objects, where users typically have to traverse three multi-planar reformatted views (axial, coronal, sagittal) to accomplish the task. Recent studies also demonstrated the time savings achieved with extreme point selection instead of conventional bounding box selection (Maninis et al., 2018; Papadopoulos et al., 2017). At the same time, because they lie on the surface of the object, extreme points can provide the segmentation model with additional information, as can be seen in Table 1 of our experimental section, where we compare various ways of integrating the extreme point information into the model training. In the basic approach, we can model them together with the image intensities as an additional input channel. This extra channel G({e}) includes 3D Gaussians centered on each user-clicked point location e. This approach is similar to Maninis et al. (2018), but we have extended it to problems in 3D medical imaging. At the same time, we can utilize the point information to guide the loss function towards making predictions whose boundary aligns with the point locations (see Section 2.3) or use it as an additional signal to guide model attention mechanisms (see Section 2.3).
Figure 1 illustrates the different ways in which the extreme point information can be used by our proposed network architecture. We ask the user to click on six extreme points that describe the largest extent of the organ. Here, six click locations are shown after conversion to Gaussians in the extra input channel to the network, loss, and attention gates. These points are then used to compute a bounding box B automatically, including some padding p. In this study, we extract the extreme points automatically during training from a given ground truth mask. In order to simulate user interaction, we add some Gaussian noise to the x, y, z point locations at each DL training iteration as in Maninis et al. (2018).
After cropping the image based on B, we resize each bounding box region to a constant size $S = s_x \times s_y \times s_z$. In all our experiments we set $s_x = s_y = s_z = 128$ and choose $p = 20$ mm, which can include enough contextual information for typical applications of clinical CT scanning (see Section 3).
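As an illustration of this step, the following NumPy sketch simulates the six clicks from a binary mask, derives a padded bounding box, and builds a Gaussian point channel. Several details are assumptions made only for the example: the padding is expressed in voxels rather than millimetres, the Gaussians are combined with a voxel-wise maximum, and the values of `sigma` and the jitter `std` are arbitrary. Resizing the crop to the fixed network input size is omitted.

```python
import numpy as np

def extreme_points(mask):
    """Six extreme points (min and max voxel along each axis) of a 3D binary mask.
    If several voxels share an extreme coordinate, the first one found is used."""
    coords = np.argwhere(mask)  # (K, 3) indices of foreground voxels
    points = []
    for axis in range(3):
        points.append(coords[coords[:, axis].argmin()])
        points.append(coords[coords[:, axis].argmax()])
    return np.array(points)

def bounding_box(points, padding, shape):
    """Box around the points, grown by `padding` voxels and clipped to the volume.
    Returns (lo, hi) with hi exclusive, so volume[lo[0]:hi[0], ...] is the crop."""
    lo = np.maximum(points.min(axis=0) - padding, 0)
    hi = np.minimum(points.max(axis=0) + padding + 1, shape)
    return lo, hi

def point_channel(points, shape, sigma=3.0):
    """Extra input channel: voxel-wise maximum over 3D Gaussians centred on each point."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), axis=-1)
    channel = np.zeros(shape)
    for p in points:
        d2 = ((grid - p) ** 2).sum(axis=-1)
        channel = np.maximum(channel, np.exp(-d2 / (2.0 * sigma ** 2)))
    return channel

def jitter(points, shape, std=1.0, rng=None):
    """Simulate imprecise clicks by adding Gaussian noise to the point coordinates."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.rint(points + rng.normal(0.0, std, size=points.shape)).astype(int)
    return np.clip(noisy, 0, np.array(shape) - 1)
```

In a training loop one would typically call `jitter` anew at every iteration before building the point channel, so that the network sees slightly different click locations for the same case.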

Step 2: Initial segmentation from scribbles via random walker algorithm
In this step, we turn the generated scribbles into a probability map $\hat{Y}$ that can act as a pseudo-dense or "noisy" label map that can supervise a 3D deep network to learn the segmentation task. To achieve this goal, we select a set of foreground and background scribbles based on the initial set of extreme points {e} that serve as the input seeds for the random walker algorithm (Grady, 2006). The shortest path between each extreme point pair along each image axis is computed via the Dijkstra algorithm (Dijkstra, 1959). Here, we model the distance between neighboring voxels by their gradient magnitude

$$\|\nabla I\| = \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2 + \left(\frac{\partial I}{\partial z}\right)^2}, \quad (1)$$

where $I$ denotes the image intensity. The resulting path can be seen as an approximation of the geodesic distance between the two extreme points in each dimension (Wang et al., 2018) with respect to the content of the image. Figure 2 displays the foreground scribbles used as input seeds for the random walker algorithm and shows the ground truth surface information for reference. Note that this ground truth is not used to compute the scribbles (apart from simulating the extreme points). To increase the number of foreground seeds for the random walker, each path is also dilated with a 3D ball structure element of radius $r_{\mathrm{foreground}} = 2$. The background seeds are estimated as the dilated and inverted version of the input scribbles.
The amount of dilation needed for successful initialization depends on the size of the organ of interest. We typically dilate the scribbles with a ball structure element of radius $r_{\mathrm{background}} = 30$, which achieves good initial seeds for organs such as spleen and liver (see Fig. 4).
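The scribble between two opposing extreme points can be sketched with a plain Dijkstra search on the voxel grid. The following is a minimal illustration, not the implementation used in the study: it assumes face-connected neighbors, charges each step the gradient magnitude of the voxel being entered, and adds a small constant (`step_penalty`, a hypothetical parameter) so that paths stay short in flat regions. The subsequent dilation of the path is omitted.

```python
import heapq
import numpy as np

def gradient_magnitude(image):
    """Gradient magnitude of an N-D image (N >= 2) using central differences."""
    grads = np.gradient(image.astype(float))
    return np.sqrt(sum(g ** 2 for g in grads))

def shortest_path(cost, start, end, step_penalty=1e-6):
    """Dijkstra shortest path between two voxels on a face-connected grid.
    Moving into a voxel costs cost[voxel] + step_penalty."""
    start, end = tuple(start), tuple(end)
    offsets = []
    for axis in range(cost.ndim):
        for step in (-1, 1):
            off = [0] * cost.ndim
            off[axis] = step
            offsets.append(tuple(off))
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == end:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for off in offsets:
            nb = tuple(n + o for n, o in zip(node, off))
            if any(c < 0 or c >= s for c, s in zip(nb, cost.shape)):
                continue
            nd = d + float(cost[nb]) + step_penalty
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                prev[nb] = node
                heapq.heappush(heap, (nd, nb))
    # Walk back from the end point to recover the path.
    path = [end]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

A scribble for one axis would then be obtained as `shortest_path(gradient_magnitude(image), e_min, e_max)`, where `e_min` and `e_max` are the two extreme points along that axis.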
Random walker. Next, the random walker algorithm (Grady, 2006) is used to produce an initial prediction map $\hat{Y}$ based on the background $s_0$ and foreground $s_1$ scribbles mentioned above. The random walker essentially solves the diffusion equation between voxels by turning the scribbles $S = \{s_0, s_1\}$ into a source and sink. The 3D volume here is defined as a graph $G(E, V)$ with edges $e \in E$ and vertices $v \in V$. Each edge between two vertices $v_i$ and $v_j$ is referred to as $e_{ij}$, and a weight $w_{ij}$ can be assigned based on gradients of the image intensities. In addition, $d_i = \sum_j w_{ij}$ defines the degree of a given vertex. In order to get a probability $p(\omega|x)$ of whether each vertex $v_i$ belongs to the foreground $\omega_1$, we solve the diffusion equation. $L$ is the Laplacian matrix of the weighted image graph $G$ with each element of the matrix defined as:

$$L_{ij} = \begin{cases} d_i & \text{if } i = j, \\ -w_{ij} & \text{if } v_i \text{ and } v_j \text{ are adjacent}, \\ 0 & \text{otherwise}. \end{cases} \quad (2)$$

The weights between adjacent voxels can be defined as $w_{ij} = e^{-\beta |z_j - z_i|^2}$. This will make the diffusion between similar voxel intensities $z_i$ and $z_j$ easier and hence allow them to be assigned to the same class. Here, $\beta$ is a tunable hyperparameter that controls the amount of diffusion. We keep $\beta = 130$ in all our experiments. By separating the voxels marked by scribbles $S$ and unmarked voxels, the Laplacian matrix $L$ can be decomposed into blocks:

$$L = \begin{bmatrix} L_M & B \\ B^T & L_U \end{bmatrix}. \quad (3)$$

Here, $M$ corresponds to voxels marked by scribbles $S$ and $U$ to unmarked voxels. This can be formulated as a system of equations which can be analytically solved as:

$$L_U X = -B^T M, \quad (4)$$

where $M$ is made of elements $m^{\omega}_j$ which are 1 for marked voxels of $s_\omega$ for the given class $\omega$, and 0 otherwise. Solving Equ. 4 results in a probability for each voxel $p(\omega|x) = x^{\omega}_i$, resulting in our pseudo label $\hat{Y}$.
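To make the linear system concrete, the sketch below builds the weighted graph Laplacian of a tiny image with a dense matrix and solves for the foreground probability of the unmarked voxels, following the standard formulation of Grady (2006). It is a toy illustration under stated assumptions: intensities are taken to be normalized to [0, 1], the seed encoding (1 = foreground, 0 = background, -1 = unmarked) is a convention chosen for this example, and a dense solve is only feasible for very small grids (practical implementations use sparse solvers).

```python
import numpy as np

def random_walker_probability(image, seeds, beta=130.0):
    """Foreground probability per voxel.
    seeds: 1 = foreground seed, 0 = background seed, -1 = unmarked."""
    shape, n = image.shape, image.size
    z = image.ravel().astype(float)
    index = np.arange(n).reshape(shape)
    L = np.zeros((n, n))
    for axis in range(image.ndim):
        # Pairs (a, b) of face-adjacent voxels along this axis.
        a = np.take(index, np.arange(shape[axis] - 1), axis=axis).ravel()
        b = np.take(index, np.arange(1, shape[axis]), axis=axis).ravel()
        w = np.exp(-beta * (z[a] - z[b]) ** 2)  # edge weights
        np.add.at(L, (a, b), -w)
        np.add.at(L, (b, a), -w)
        np.add.at(L, (a, a), w)  # vertex degrees on the diagonal
        np.add.at(L, (b, b), w)
    s = seeds.ravel()
    marked = np.flatnonzero(s >= 0)
    unmarked = np.flatnonzero(s < 0)
    L_U = L[np.ix_(unmarked, unmarked)]
    B_T = L[np.ix_(unmarked, marked)]
    m = (s[marked] == 1).astype(float)  # 1 for foreground seeds, 0 for background
    x_U = np.linalg.solve(L_U, -B_T @ m)
    prob = np.empty(n)
    prob[marked] = m
    prob[unmarked] = x_U
    return prob.reshape(shape)
```

On a uniform image the solution reduces to a linear interpolation between the seeds, while a strong intensity edge makes the probability switch sharply at the edge, which is the behavior the pseudo label relies on.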

Step 3: Segmentation via deep fully convolutional network
Next, we can train a fully convolutional neural network to segment the given foreground class with P(X) = f(X), using pairs of images X and pseudo labels $\hat{Y}$. Our preferred network architecture follows the encoder-decoder network proposed in Myronenko (2018) (without the VAE part), using 3D convolutions throughout the network.
Encoder. The encoder uses residual blocks (He et al., 2016), where each block consists of two convolutions with normalization and ReLU, followed by an additive skip connection. Here, we use group normalization (GN) (Wu and He, 2018), which typically shows better performance than batch normalization (Ioffe and Szegedy, 2015) when the batch size is small (in our case, batch size 4). We adopt a standard FCN approach to progressively decrease the image dimensions by a factor of 2 and simultaneously increase the number of features by a factor of 2, as in Ronneberger et al. (2015). For downsizing, we use strided convolutions with a stride of 2. All convolutions are 3 × 3 × 3, with an initial filter number equal to 8 in the input layer of the network.
Decoder. The design of the decoder is identical to that of the encoder, but with a single residual block per spatial level of the network. Each decoder level starts with up-sampling, which involves reducing the number of features by a factor of 2 (using 1 × 1 × 1 convolutions) and doubling the spatial dimension using trilinear up-sampling. This is followed by adding or concatenating the features from the equivalent spatial level of the encoder. In this study we use addition due to the lower memory consumption of the resulting network. At the end of the decoder, the features have the same spatial size as the original image and a number of features equal to the initial input feature size. This is followed by a 1 × 1 × 1 convolution into one output channel and a final sigmoid activation, as we are assuming the binary segmentation case in this work.
Attention. We follow the approach of Oktay et al. (2018) to implement attention gates in the decoder part of our segmentation network. The attention gates help the model to focus on the structures of interest: they can encourage the model to suppress regions that are irrelevant to the segmentation task and highlight the regions most relevant to it (see Figure 1).
The attention gate can be further augmented by the point channel information available from extreme point selection.We propose to add the extreme point channels G({e}) at each level of the decoder to further guide the network to learn the relevant information.In practice, we downsample the initial input point channel to match the resolution of each decoder level and concatenate it with the gating features from the encoder path of the network in each attention gate.
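The multi-resolution point channel can be illustrated with a simple pooling pyramid. The downsampling operator is not specified in the text, so the average pooling used here is an assumption (max pooling or interpolation would be equally plausible), as are the function names; the concatenation with the gating features inside the attention gate depends on the deep learning framework and is not shown.

```python
import numpy as np

def average_pool2(volume):
    """Halve each dimension of a 3D volume by averaging non-overlapping 2 x 2 x 2 blocks."""
    d, h, w = (s // 2 for s in volume.shape)
    v = volume[: 2 * d, : 2 * h, : 2 * w]
    return v.reshape(d, 2, h, 2, w, 2).mean(axis=(1, 3, 5))

def point_channel_pyramid(channel, levels):
    """Point channel at full resolution followed by (levels - 1) coarser copies,
    one per decoder level."""
    pyramid = [channel]
    for _ in range(levels - 1):
        pyramid.append(average_pool2(pyramid[-1]))
    return pyramid
```

For a 128 × 128 × 128 input and, for example, four levels, this yields point channels of size 128, 64, 32, and 16 per dimension, each of which can be paired with the decoder features of matching resolution.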
Dice loss. The Dice loss (Milletari et al., 2016) is a popular objective function for segmentation tasks in medical imaging. Its properties allow it to automatically scale to unbalanced labeling problems. At the same time, it also naturally adapts, without any changes to the original formulation, to learning from probability maps:

$$\mathcal{L}_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i^2 + \sum_{i=1}^{N} \hat{y}_i^2}. \quad (5)$$

Here, $y_i$ is the predicted probability from our network f(X) and $\hat{y}_i$ is the weak label probability from our pseudo label map $\hat{Y}$ at voxel $i$.
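A soft Dice loss that accepts probability maps for both prediction and target can be written in a few lines. The sketch below uses the common form with squared terms in the denominator (Milletari et al., 2016); the small `eps` term is an assumption added only to avoid division by zero on empty inputs.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-7):
    """1 - Dice overlap between two probability maps of identical shape."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    numerator = 2.0 * np.sum(pred * target)
    denominator = np.sum(pred ** 2) + np.sum(target ** 2) + eps
    return 1.0 - numerator / denominator
```

Because both arguments may contain values strictly between 0 and 1, the same function can be used with the random walker probabilities as the target without thresholding them first.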

Point loss
The extreme points selected by the user for weak annotation can be used not only for generating initial scribbles but also in an additional loss function during the training of our deep neural network. This adds a constraint to the deep learning training that makes use of the extreme points the user has already selected. The new loss $\mathcal{L}_{points}$ penalizes the distance between the boundary of our model's predicted segmentation mask P = f(X) and the location of the extreme points. To compute it, we apply a Gaussian filter G(·) to our model's prediction P(X), which can be easily implemented using standard 3D convolutional operations with a constant $n \times n \times n$ kernel with each element being $1/n^3$. The resulting point distance loss between the filtered prediction G(P(X)) and the extreme points channel G({e}) (which includes a Gaussian kernel placed over each extreme point) is defined in Equ. 6, where $N$ is the number of voxels $i$ in the image and the values $g_i$ are taken from G(P(X)) and G({e}), respectively. The point loss computation is illustrated in Fig. 3. This results in a new total loss used for training:

$$\mathcal{L} = \mathcal{L}_{Dice} + \alpha \mathcal{L}_{points}. \quad (7)$$

Here, $\alpha$ is a hyperparameter weight that controls the influence of the point distance loss.
Point loss implementation. We implement the Gaussian filter G(·) using a set of standard 3D convolutions.
First, we use $B$ convolution operations to enhance the boundary of the prediction P(X):

$$G_0(P(X)) = \mathrm{conv}_B(\ldots(\mathrm{conv}_3(\mathrm{conv}_2(\mathrm{conv}_1(P(X)))))\ldots)$$
$$G_1(P(X)) = (G_0(P(X)) - 0.5)^2$$
$$G(P(X)) = e^{-G_1(P(X))} \quad (8)$$

Here, the convolutional kernel in each conv operation is set to be a constant $n \times n \times n$ kernel with each element being $1/n^3$. $B$ should be adjusted depending on the size of the input image and the extent of the organ of interest. In our setting, we use $B = 25$ to achieve a good boundary enhancement at the scale of the images and targeted organs.
The resulting point distance loss between the filtered prediction G(P(X)) and the extreme points channel G({e}) (which includes a Gaussian kernel placed over each extreme point) is then computed as in Equ. 6.
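The boundary enhancement described above (repeated smoothing, centering at 0.5, squaring, and exponentiating) can be sketched in NumPy as follows. The constant box kernel is applied separably along each axis, which is equivalent to a full n × n × n mean kernel for a 3D volume; the kernel size `n = 3` and the zero padding at the volume border are assumptions, since the text does not state them. The loss that compares this map with the point channel is not reproduced here.

```python
import numpy as np

def box_filter(volume, n=3):
    """Mean filter with a constant kernel (each element 1/n per axis, i.e. 1/n^3
    in 3D), applied separably with zero padding at the borders."""
    kernel = np.full(n, 1.0 / n)
    out = volume.astype(float)
    for axis in range(volume.ndim):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, out
        )
    return out

def boundary_map(prediction, num_convs=25, n=3):
    """Boundary enhancement of a probability map: repeated smoothing (G_0),
    squared distance from 0.5 (G_1), and exponential (G)."""
    g0 = prediction.astype(float)
    for _ in range(num_convs):
        g0 = box_filter(g0, n)
    g1 = (g0 - 0.5) ** 2
    return np.exp(-g1)
```

Since the smoothed prediction lies between 0 and 1, the resulting map ranges from exp(-0.25) (well inside or outside the object) to 1 (where the smoothed prediction crosses 0.5, i.e. near the boundary).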

Step 4: Regularization using random walker algorithm
We could stop learning after the above segmentation network f (X) is trained on the Ŷ pseudo labels for the first time.Nevertheless, we note that an additional regularization step by an additional random walker segmentation as mentioned above may be of great benefit to the convergence of our weakly-supervised segmentation approach.This approach is close in spirit to Rajchl et al. (2017), where a DenseCRF is used as the post-processing step during iterative refinement.
To increase the amount of regularization that the random walker can give to the predictions P(X) of the network, we define an area of uncertainty U(P(X)). The foreground and background in the prediction map can be defined as P(X) ≥ 0.5 and P(X) < 0.5, respectively. Here, we chose a ball structure element of radius $r_{\mathrm{randomwalker}} = 4$ to erode both the foreground and background regions in all our segmentation tasks to compute U, which in turn acts as the set of unmarked voxels in the random walker algorithm. This allows the random walker to generate new predictions around the foreground object's boundary that differ from the previous 3D network's predictions and, in turn, helps the next deep learning training iteration to learn new features from the same set of training images and to avoid getting stuck in a poor local minimum. In addition, we find that our weakly supervised segmentation framework becomes unstable without this step and does not converge as easily to a satisfactory result (see Figure 6).
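The construction of the uncertainty region can be sketched with a small NumPy-only binary erosion. This is an illustration under assumptions: the seed encoding (1 = foreground, 0 = background, -1 = uncertain) and the border handling (voxels outside the volume count as background) are choices made for this example, and a library routine for binary erosion would normally replace the hand-written `erode`.

```python
import numpy as np

def ball_offsets(radius, ndim=3):
    """Integer offsets inside a ball of the given radius."""
    r = np.arange(-radius, radius + 1)
    grid = np.stack(np.meshgrid(*([r] * ndim), indexing="ij"), axis=-1).reshape(-1, ndim)
    return grid[(grid ** 2).sum(axis=1) <= radius ** 2]

def erode(mask, radius, border_value=False):
    """Binary erosion with a ball structure element; voxels outside the volume
    are assumed to have `border_value`."""
    padded = np.pad(mask.astype(bool), radius, mode="constant", constant_values=border_value)
    out = np.ones(mask.shape, dtype=bool)
    for off in ball_offsets(radius, mask.ndim):
        slices = tuple(slice(radius + o, radius + o + s) for o, s in zip(off, mask.shape))
        out &= padded[slices]
    return out

def rw_seeds_from_prediction(prob, radius=4):
    """Seeds for the next random walker run: eroded foreground (1), eroded
    background (0), and the uncertain band in between (-1)."""
    foreground = erode(prob >= 0.5, radius, border_value=False)
    background = erode(prob < 0.5, radius, border_value=True)
    seeds = np.full(prob.shape, -1, dtype=int)
    seeds[background] = 0
    seeds[foreground] = 1
    return seeds
```

The band of `-1` voxels around the predicted boundary is where the random walker is free to move the contour according to the image intensities, which is what introduces new information into the next training round.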

Experiments & Results
Datasets. We utilize the training datasets (as they include ground truth annotations) from public resources, specifically, from the multi-organ (MO) segmentation study in Gibson et al. (2018), which provided annotations for abdominal CT data from previously published datasets: Roth et al. (2015) and BTCV (2015). Furthermore, we utilize data from the Medical Segmentation Decathlon (MSD) challenge (Simpson et al., 2019). From MO, we utilize the spleen, liver, pancreas, left kidney, and gallbladder segmentation masks, denoted as MO-Spleen, MO-Liver, MO-Pancreas, MO-L.Kidney, and MO-Gallbladder, respectively. From MSD, we include the spleen mask, denoted as MSD-Spleen. Qualitative results are shown in Fig. 4 for each segmentation task on example cases from the validation set. For MO, we use a constant data split of 81 training and 9 validation cases. For MSD, there are 32 training and 9 validation cases available.
Experiments. In all cases, we iterate our algorithm for a maximum of 20 iterations, as shown in Fig. 6. In Table 1, we compare training with and without using random walker (RW) regularization after each round of 3D FCN learning. In addition, by running the framework with RW regularization but without the extreme points channel, we quantify the benefit of modeling the extreme points as an extra input channel to the network versus only using the bounding box as in Rajchl et al. (2017). It can be further observed that the greatest changes occur after the initial random walker segmentation, in the first round of FCN training. While the average Dice score is not always enhanced by random walker regularization alone, it helps to incorporate enough "novelty" into our learning system to boost the overall Dice score in later iterations, as shown in Fig. 6. Furthermore, we show the average Dice scores on the validation set after convergence when utilizing the proposed point loss, point loss plus attention gates as in Oktay et al. (2018), and a setting that uses the point information as an additional guiding feature to the attention gates. The fully supervised case using Dice loss with the strong label ground truth masks is shown for reference. It can be observed that utilizing the point channel information in the point loss function and the attention gates generally improves the performance of the model. The addition of point loss and point attention works best in four out of six weakly supervised cases, while the addition of point loss alone showed an advantage in two out of the six tasks. Notice that the average Dice score in the MO-Gallbladder task even outperforms the fully supervised setting.

Implementation
The training and evaluation of the deep neural networks used in the proposed framework were implemented based on the NVIDIA Clara Train SDK using 4 NVIDIA Tesla V100 GPUs with 16 GB memory for each round of training. All models were trained using the deterministic training setup in TensorFlow with the same random seed initialization in order to guarantee comparable results between the different variations of training. For the random walker algorithm, we use the default parameters.
Analysis of point loss. An analysis of the impact of the point loss on our weakly supervised models' predictions is shown in Fig. 5.

Discussion
We provided a method for weakly-supervised 3D segmentation from extreme points. Asking the user to simply click on the surface of the organ in each spatial dimension can drastically reduce the cost of labeling. The point clicks can simultaneously identify the region of interest and simplify the 3D machine learning task.
The extreme points can also be used to create an initial noisy pseudo label based on the extreme points using the random walker algorithm.From our experiments, it can be observed that this initialization is relatively robust for six different tasks from medical image segmentation.
Occasionally, the random walker may lack robustness for organs with very diverse interior textures or highly concave curved shapes, for example, the pancreas (see MO-Pancreas task in Table 1). In this situation, the shortest path result might sometimes lie outside the organ. A boundary search algorithm might provide a better initial segmentation here. Still, the initial segmentation can be significantly enhanced by the first round of FCN training. In this study, we utilized one dataset (MSD-Spleen) as our development set and kept the hyperparameters of the full approach constant across different segmentation tasks and datasets. One might achieve better performance when optimizing the hyperparameters, especially for the initial random walker, based on the task at hand. In practice, we performed model selection for each round of training in our approach based on the pseudo labels $\hat{Y}$ alone. However, we do need a fully annotated validation set to practically evaluate the overall convergence of our iterative approach for it to be clinically useful. One could use the predictions of the first round of FCN training to build an ML-based annotation tool that could speed up the creation of such a hold-out "gold standard" validation dataset and reduce the amount of manual labeling and editing needed in total.
Previous work primarily used bounding box annotations for weakly supervised learning in 2D/3D medical imaging, such as Rajchl et al. (2017). We consider, however, that selecting extreme points on the surface of the organ is more natural than selecting corners of a bounding box outside the organ of interest and more efficient than adding scribbles within and around the organ (Wang et al., 2018; Can et al., 2018). This is consistent with recent findings in the computer vision literature (Papadopoulos et al., 2017). An application of the proposed approach to the 2D case would be straightforward.
We conducted a comprehensive ablation study of our proposed method in Table 1. Some of these settings are similar to previous work. For example, performing the network training without the extra point channel is equivalent to studies using bounding boxes alone, such as Rajchl et al. (2017). From Table 1, we can see that adding the point-click information in the loss and as an attention mechanism is nevertheless beneficial while not increasing the labeling cost.
In summary, we proposed a weakly-supervised 3D segmentation framework based on extreme point clicks. Experimentation on six datasets showed that the approach can achieve performance close to the fully supervised setting in four tasks and even outperforms the fully supervised training in one of them (MO-Gallbladder). In the future, an automatic proposal network could assist the user with the region of interest and extreme point selection to further reduce the manual burden of medical image annotation.
Figure 1: A high-level overview of our proposed network architecture. The network receives both an image channel input and a point channel input that represents the user-provided extreme points. The point channel is then used throughout the network to further guide the segmentation training, i.e., as an additional input to attention gates and in the loss function.

Figure 2: Examples of automatically created foreground "scribbles" (yellow) from extreme point clicks, modelled as 3D Gaussians in our network learning. We use a geodesic shortest path algorithm to compute a scribble based on the image information alone that connects two opposing extreme points across one of the three image dimensions. (a)-(f) show examples from MSD-Spleen, MO-Spleen, MO-Liver, MO-Pancreas, MO-L.Kidney, and MO-Gallbladder, respectively. The surface renderings show the ground truth segmentations for reference in red. Best viewed in color.

Figure 3: Visualization of the boundary enhancement map computed in Equ. 6-8 in the paper. In this example, we show (a) the ground truth overlaid on the image, (b) the boundary enhancement map b(P), and (c) the point channel G({e}) on the corresponding axial slice of the 3D volume used in the computation of the point loss $\mathcal{L}_{points}$ (see Equ. 9). The loss is minimized if the prediction's boundary b(P) aligns with the center of each clicked extreme point e in G({e}). Note that during training, we compute the boundary on the model's prediction P, but here we show it computed on the ground truth for illustration purposes.

Figure 4: Our results on six different segmentation tasks on example cases from the validation set. We show (a) the image after cropping based on extreme points, (b) overlaid (full) ground truth (used for evaluation only), (c) initial random walker prediction, (d) our final segmentation result produced by the weakly supervised segmentation scheme, and (e) the fully supervised result for reference. Specifically, we compare example cases for weak.sup.dextr3D (w RW) Dice + Point loss + Point Attn and fully sup.dextr3D Dice loss for (d) and (e), respectively. The probability maps are scaled between 0 and 1, and we show all non-zero probabilities.

Figure 5: The impact of adding the point loss and point attention to our weakly supervised models. We show the results of the weak.sup.dextr3D (w RW) Dice setting (top) and the weak.sup.dextr3D (w RW) Dice + Point loss + Point Attn setting (bottom). Examples from the MSD-Spleen (left) and MO-Pancreas (right) datasets are shown. The clicked extreme points are marked by yellow crosses. The predictions learned together with the point loss lie markedly closer to the clicked point locations. Best viewed in color.

Figure 6: Weakly supervised training from random walker initialization. For illustration, we only show the MO-Liver and MO-Pancreas segmentation tasks with the varying training settings as shown in our ablation study of Table 1 at each round of deep network training. While the performance of the MO-Liver models generally improves with the number of iterations, it can also be noticed that for MO-Pancreas, a poor initialization by the random walker can cause the models to degrade quickly. Notice that adding the point channel information results in a more stable training behavior.