Image segmentation and related tasks, such as object and scene segmentation, have a wide range of applications, including content-based image retrieval, medical diagnosis, autonomous driving, object detection, and face recognition.
While image segmentation is a generic problem, object segmentation is the problem of delineating the boundary of a specific type of object, such as a dog in an image or a liver in a CT scan. This problem is very important in medical imaging, where it is used for delineating tumors or other pathologies and for estimating the volume of a heart, a liver, or a swollen lymph node. Even though radiologists have been handling these medical imaging tasks, an increasing amount of research indicates that computer vision techniques have the potential to outperform radiologists in terms of speed and accuracy.
Generic image segmentation is usually a low-level task that finds the boundary of a region purely based on the intensity difference with the neighboring regions. Object segmentation is a high-level task that aims at finding the boundary of a specific object and uses the shape of the object to eliminate distractors and to project where the boundary should be in places where it is not visible.
This paper takes the CVNN approach even further, bringing it to the level of the state of the art in 3D liver segmentation, with the following contributions:
Related Work
The U-Net has been extended to 3D segmentation tasks: the 3D U-Net essentially replaces 2D convolutions with 3D convolutions and thus segments out 3D objects [4]. In principle, any variation of the 2D U-Net architectures [5,6] can be adapted for 3D tasks by using 3D convolutions instead. ComboNet combines 2D and 3D U-Net architectures in an end-to-end fashion, where the 2D portion takes a full-resolution input and the 3D portion takes a resized input to reduce computation [7]. The outputs of the two sub-networks are combined with a series of convolution layers.
Because of its state of the art results, researchers have found different ways to enhance the U-Net, introducing alterations to the architecture while maintaining the residual connections and its symmetric nature. For instance, the Attention U-Net added attention gating layers prior to each convolution block in the decoding part [5]. The authors claim that the Attention U-Net outperforms the U-Net; however, the ComboNet ablation study showed that the performance improvement achieved by the Attention U-Net might be case-specific, obtaining decreased performance compared to the U-Net [7].
Another state of the art U-Net variation is U-Net++ [6]. It replaces the residual connections with a series of nested residual connections and provides an improvement in accuracy while increasing the number of parameters. In principle, the encoding part of the U-Net can be replaced by a classification architecture without the fully connected layer(s) of the classification model. For instance, UResNet [8] combines the state of the art ResNet [9] architecture with the U-Net architecture. Another recent U-Net-based method is ObeliskNet [10]. It learns spatial filters and filter offsets in an end-to-end manner, obtaining a sparse model with few parameters.
One of the many recent examples of U-Net-based methods is nnU-Net [11]. It uses a 3D U-Net architecture with slight modifications and tailored training, as well as data augmentation and post-processing methods, to obtain state of the art results. The authors of nnU-Net also proposed modifications to the U-Net architecture itself, but the ablation study showed no significant improvement from the architectural changes.
The latest organ segmentation works use transformers as part of the U-Net to obtain state of the art segmentation performance. Transformers are attention-based models that have obtained state of the art results in many applications, from natural language processing to image classification and object detection [12,13,14]. For organ segmentation, SETR and UNETR replace the encoder part of the U-Net with a vision transformer [15,16]. CoTr places the vision transformer between the encoder and decoder parts of the U-Net [17]. Any of these developments could be used to replace the 3D U-Net in our model to further boost segmentation performance.
One of the issues with deep learning methods is that they need a large number of manually labeled images, which are almost always scarce in the medical imaging domain. Furthermore, annotations, especially segmentations, are often inaccurate, and the data are noisy. Although deep learning is known to handle noise well, it is not immune to overfitting, and once the lack of annotated data is taken into consideration, overfitting becomes a more significant issue. Researchers often exercise different techniques to avoid overfitting, such as data augmentation and early stopping. Thus, in order to maintain state of the art accuracy with a small data set, and with a relatively small number of parameters and computations, researchers are combining neural networks with level sets.
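Early stopping, mentioned above, can be sketched in a few lines. The patience value and the validation losses below are purely illustrative:

```python
# Minimal early-stopping sketch. The validation losses are hypothetical;
# a real training loop would compute them on a held-out validation set.
def train_with_early_stopping(val_losses, patience=3):
    """Return the epoch at which training stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:           # validation improved: reset patience
            best = loss
            wait = 0
        else:                     # no improvement this epoch
            wait += 1
            if wait >= patience:  # stop before overfitting sets in
                return epoch
    return len(val_losses) - 1

# Validation loss bottoms out at epoch 3; with patience 3 we stop at epoch 6.
stop = train_with_early_stopping([1.0, 0.8, 0.7, 0.65, 0.7, 0.72, 0.75])
```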
An approach that combined a NN with level sets to segment out the left ventricle of the heart from cardiac cine magnetic resonance (MR) images was proposed in [18]. It used a Deep Belief Network (DBN) [19] as a region of interest (ROI) detector, which yields a rectangular bounding box enclosing the object of interest. Then, within the ROI, Otsu's gray-scale histogram-based thresholding [20] is used to obtain an initial segmentation, which serves as a shape prior and/or initialization for the next stage. Finally, Otsu's segmentation is fed into a distance-regularized level set formulation, which yields the final segmentation [21].
Some recent works merge level sets with deep learning [22,23]. Level sets are combined with VGG16 [24] to segment out salient objects in [22]. The level set formulation of active contours is used along with optical flow for the task of moving object segmentation in [23]. Also, to segment out lung nodules, machine learning regression models are used in conjunction with level sets to obtain a better curve evolution velocity model at a given point in [25].
LevelSet R-CNN modified the Mask R-CNN [26] architecture to have three additional mask heads, of which one predicts a truncated signed distance transform, another predicts Chan-Vese features, and the last predicts the Chan-Vese hyper-parameters [27].
The Deep Implicit Statistical Shape Model (DISSM) method uses deep-learning-based implicit shape models, with an iterative refinement also based on deep learning, to segment out certain organs in 3D CT scans [28]. The method does not necessarily reduce the computational cost, but it definitely improves the segmentation quality.
A hybrid active contour and U-Net architecture was designed to segment out breast tumors in [29]. That method takes a radiologist annotation as initialization, whereas in our case a detection algorithm provides the initialization.
Our method differs from all these level set formulations. First, unlike [18], we use a U-Net CNN instead of a DBN, the U-Net serves as the shape model instead of distance- or length-based regularization, and we do not use Otsu's thresholding. One study [25] uses a least-squares-based regression method to model velocity; we use the CNN to replace the curvature term in an Euler-Lagrange equation. The researchers in [22] use VGG16 as a backbone to compute the initialization, along with upsampling and refining the upsampled level sets. Moreover, they use a level set function as a loss function that is minimized. In contrast, our formulation uses the level set produced by the CNN as a shape model instead of length-based regularization and combines it with the Chan-Vese intensity-based update to obtain a model with very few parameters. In their next paper, [23] minimize an Euler-Lagrange equation based on level sets produced by ResNet101 [9]; unlike us, their formulation does not use the output of the CNN to replace the curvature, only to estimate the Heaviside and the subsequent average intensities. LevelSet R-CNN [27] and the hybrid method of [29] use the CNN model to learn almost every parameter in the Chan-Vese and active contour formulations, respectively, whereas we estimate those hyper-parameters algebraically. VGG16, ResNet101, and Mask R-CNN are all very computationally intensive; in comparison, our CNN is very efficient in terms of computational complexity and small in terms of the number of parameters. Last but not least, our model is an RNN (Recurrent Neural Network): we iterate over the same input to improve the result, whereas none of the aforementioned algorithms use an RNN. However, we must mention that [23] works iteratively from one frame to the next, i.e., the segmentation of the frame at time $t$ is used as initialization for time $t+1$.
In this paper we show how to use the U-Net model as part of the Chan-Vese NN framework (see Section 2.3.1) to obtain state of the art 3D segmentation results with 140+ times fewer parameters than the original U-Net. In principle, most of the above methods could be used as part of our method to further improve results.
Chan-Vese Overview
The Chan-Vese Active contour [1] is aimed at minimizing the Mumford-Shah energy [30]:

$$E(c_1, c_2, C) = \int_{\Omega_1} (I(\mathbf{x}) - c_1)^2 \, d\mathbf{x} + \int_{\Omega_2} (I(\mathbf{x}) - c_2)^2 \, d\mathbf{x} + \mu \, \mathrm{Length}(C), \tag{1}$$
where $I$ denotes the image intensity, $C$ is the curve to be fitted, $\Omega_1$, $\Omega_2$ are the regions inside and outside the curve $C$, respectively, and $c_1$ and $c_2$ are the intensity averages of image $I$ inside and outside the curve $C$, respectively.
The Chan-Vese method takes a level set approach where the curve $C$ is represented as the $0$-level set of a surface $\phi(\mathbf{x})$, i.e., $C = \{\mathbf{x} : \phi(\mathbf{x}) = 0\}$. Usually $\phi$ is initialized as the signed Euclidean distance transform of $C$, i.e., $\phi(\mathbf{x}) > 0$ inside the curve $C$ and $\phi(\mathbf{x}) < 0$ outside, and the magnitude of $\phi(\mathbf{x})$ is the distance of the point $\mathbf{x}$ to the closest point on curve $C$. Then the energy (1) is extended to an energy of the level set function $\phi$:

$$E(c_1, c_2, \phi) = \int_{\Omega} (I - c_1)^2 H(\phi) \, d\mathbf{x} + \int_{\Omega} (I - c_2)^2 (1 - H(\phi)) \, d\mathbf{x} + \mu \int_{\Omega} |\nabla H(\phi)| \, d\mathbf{x}, \tag{2}$$
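For an initial curve that is a circle, the signed distance initialization has a closed form, $\phi(\mathbf{x}) = r - \|\mathbf{x} - \mathbf{c}\|$, which makes a small self-contained sketch possible. The grid size, center, and radius below are arbitrary illustrative choices:

```python
import numpy as np

# Signed distance initialization of phi for a circular initial curve C:
# positive inside, negative outside, |phi| = distance to the circle.
# For a circle of radius r centered at c, phi(x) = r - ||x - c|| exactly.
def circle_sdf(shape, center, radius):
    ys, xs = np.mgrid[:shape[0], :shape[1]]
    dist = np.sqrt((ys - center[0]) ** 2 + (xs - center[1]) ** 2)
    return radius - dist

phi = circle_sdf((64, 64), center=(32, 32), radius=10)
# phi > 0 inside the curve, phi < 0 outside, phi == 0 on the curve itself
```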
where $H(z)$ is the smoothed Heaviside function $H(z) = \frac{1}{2}\left(1 + \frac{2}{\pi}\arctan\left(\frac{z}{\epsilon}\right)\right)$ and $\delta(z) = H'(z)$ is its derivative. The parameter $\mu$ controls the curve length regularization $\mu \int_{\Omega} |\nabla H(\phi)| \, d\mathbf{x}$. When $\mu$ is small, the curve (segmentation) $C$ will have many small regions, while when $\mu$ is large, the curve $C$ will be smooth and the segmented regions will be large.
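Assuming the arctan regularization of the Heaviside used in the original Chan-Vese paper, the smoothed step and its derivative are easy to sketch in numpy; $\epsilon$ controls the amount of smoothing, and as $\epsilon \to 0$ the function approaches the unit step:

```python
import numpy as np

# Smoothed Heaviside H(z) = 0.5 * (1 + (2/pi) * arctan(z/eps)) and its
# derivative delta(z) = (1/pi) * eps / (eps^2 + z^2); both are the
# standard arctan regularization from the Chan-Vese formulation.
def heaviside(z, eps=1.0):
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(z / eps))

def delta(z, eps=1.0):
    return (1.0 / np.pi) * eps / (eps ** 2 + z ** 2)

z = np.linspace(-5, 5, 11)
h = heaviside(z, eps=0.1)  # near 0 for z < 0, near 1 for z > 0
```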
The energy is minimized alternately by updating $c_1, c_2$:

$$c_1 = \frac{\int_{\Omega} I \, H(\phi) \, d\mathbf{x}}{\int_{\Omega} H(\phi) \, d\mathbf{x}}, \qquad c_2 = \frac{\int_{\Omega} I \, (1 - H(\phi)) \, d\mathbf{x}}{\int_{\Omega} (1 - H(\phi)) \, d\mathbf{x}},$$

then updating $\phi$ by gradient descent:

$$\frac{\partial \phi}{\partial t} = \delta(\phi) \left[ \mu \, \kappa(\phi) - (I - c_1)^2 + (I - c_2)^2 \right],$$

where $\kappa(\phi) = \mathrm{div}\!\left(\frac{\nabla \phi}{|\nabla \phi|}\right)$ is the curvature of the level sets of $\phi$.
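The alternating updates above can be sketched in numpy on a 2D toy image. The image, the step size `dt`, and the choices of `mu` and `eps` below are illustrative assumptions; practical implementations add reinitialization of $\phi$ and convergence checks:

```python
import numpy as np

def heaviside(z, eps=1.0):
    # smoothed Heaviside: 0.5 * (1 + (2/pi) * arctan(z/eps))
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(z / eps))

def delta(z, eps=1.0):
    # derivative of the smoothed Heaviside
    return (1.0 / np.pi) * eps / (eps ** 2 + z ** 2)

def curvature(phi):
    # kappa = div(grad(phi)/|grad(phi)|), via central differences
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
    nyy, _ = np.gradient(gy / norm)
    _, nxx = np.gradient(gx / norm)
    return nxx + nyy

def chan_vese_step(I, phi, mu=0.2, dt=0.5, eps=1.0):
    H = heaviside(phi, eps)
    # region averages: c1 inside (phi > 0), c2 outside
    c1 = (I * H).sum() / (H.sum() + 1e-8)
    c2 = (I * (1 - H)).sum() / ((1 - H).sum() + 1e-8)
    # gradient-descent update on phi
    force = mu * curvature(phi) - (I - c1) ** 2 + (I - c2) ** 2
    return phi + dt * delta(phi, eps) * force

# toy image: bright disk of radius 12 on a dark background
ys, xs = np.mgrid[:64, :64]
I = (((ys - 32) ** 2 + (xs - 32) ** 2) < 12 ** 2).astype(float)
# initialize phi as the signed distance to a circle of radius 10
phi = 10.0 - np.sqrt((ys - 32.0) ** 2 + (xs - 32.0) ** 2)
for _ in range(50):
    phi = chan_vese_step(I, phi)
seg = phi > 0  # final segmentation mask
```

Note that the data term expands the contour where the pixel looks like the inside average and shrinks it where it looks like the outside average, while `mu` trades off against the curvature smoothing, exactly as described above.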