Article

FAS-UNet: A Novel FAS-Driven UNet to Learn Variational Image Segmentation

1 The School of Mathematics and Computational Science, Xiangtan University, Xiangtan 411105, China
2 Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan 411105, China
3 Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan 411105, China
4 Hunan National Applied Mathematics Center, Xiangtan 411105, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(21), 4055; https://doi.org/10.3390/math10214055
Submission received: 30 September 2022 / Revised: 22 October 2022 / Accepted: 24 October 2022 / Published: 1 November 2022
(This article belongs to the Special Issue Computational Intelligence: Theory and Applications)

Abstract:
Solving variational image segmentation problems with hidden physics is often expensive and requires different algorithms and manually tuned model parameters. Deep learning methods based on the UNet structure have achieved outstanding performance in many different medical image segmentation tasks, but designing such networks requires many parameters and much training data, which are not always available for practical problems. In this paper, inspired by the traditional multiphase convexity Mumford–Shah variational model and the full approximation scheme (FAS) for solving nonlinear systems, we propose a novel variational-model-informed network (FAS-UNet), which exploits model and algorithm priors to extract multiscale features. The proposed model-informed network integrates image data and mathematical models and implements them through learning a few convolution kernels. Based on the variational theory and the FAS algorithm, we first design a feature extraction sub-network (FAS-Solution module) to solve the model-driven nonlinear systems, where a skip-connection is employed to fuse the multiscale features. Secondly, we design a convolutional block to fuse the features extracted in the previous stage, resulting in the final segmentation probability map. Experimental results on three different medical image segmentation tasks show that the proposed FAS-UNet is very competitive with other state-of-the-art methods in qualitative, quantitative, and model complexity evaluations. Moreover, it may also be possible to train specialized network architectures that automatically satisfy some of the mathematical and physical laws in other image problems for better accuracy, faster training, and improved generalization.

1. Introduction

Image segmentation is one of the most important problems in computer vision and also is a difficult problem in the medical imaging community [1,2,3]. It has been widely used in many medical image processing fields such as the identification of cardiovascular diseases [4], the measurement of bone and tissue [5], and the extraction of suspicious lesions to aid radiologists. Therefore, image segmentation has a vital role in promoting medical image analysis and applications as a powerful image processing tool [5,6].
Deep learning (DL) has achieved great success in the field of medical image segmentation [5,7,8]. One of the most important reasons is that convolutional neural networks (CNNs) can effectively extract image features. Therefore, much current work involves designing network architectures with strong feature extraction ability, and many well-known CNN architectures have been proposed, such as UNet [9], V-Net [10], UNet++ [11], 3D UNet [12], Y-Net [13], Res-UNet [14], KiU-Net [15], DenseUNet [16], and nnU-Net [17]. More and more studies based on data-driven methods have been reported for medical image segmentation. Although UNet and its variants have achieved considerably impressive performance on many medical image segmentation datasets, they still suffer from two limitations. One is that most researchers have introduced more parameters to improve segmentation performance, but have tended to ignore the model's memory and computational overhead, which makes it difficult to transfer these algorithms to industrial applications [18]. The other is that these variants are designed mainly through the researchers' experience or experiments, without mathematical theoretical guidance for the network architecture (e.g., explainability and generalizability), which limits the application of these models and the improvement of task-driven medical image segmentation methods [19,20].
Recently, many works on image recognition and image reconstruction have focused on the interpretability of the network architecture. Inspired by several mathematical viewpoints, many related unrolled networks have been designed and successfully applied. He et al. [21] proposed the deep residual learning framework, which utilizes identity mappings to facilitate training; it is well known to be very similar to iterative methods for solving ordinary differential equations (ODEs), and it achieves promising performance on image recognition. Larsson et al. [22] employed the fractal idea to design the self-similar FractalNet, whose architecture also resembles the Runge–Kutta (RK) scheme in numerical calculations. Based on the nature of polynomials, Zhang et al. designed PolyNet [23] by improving ResNet to strengthen the expressive ability of the network, and Gomez et al. [24] proposed RevNet by using ideas from dynamical systems. Chen et al. [25] analyzed the process of solving ODEs and then proposed Neural ODEs, which further shows that mathematics and neural networks are strongly related. Meanwhile, He et al. designed a network architecture for the super-resolution task based on the forward Euler and RK methods for solving ODEs [18] and achieved good performance. Sun et al. [26] designed ADMM-Net through the alternating direction method to learn an image reconstruction problem. Inspired by a multigrid algorithm for solving inverse problems, He et al. [27] proposed a learnable classification network, denoted MgNet, to extract image features $\mathbf{u}$, which uses few parameters to achieve good performance on the CIFAR datasets. Alt et al. [28] analyzed the behavior and mathematical foundations of UNet and interpreted it as an approximation of continuous nonlinear partial differential equations (PDEs) by using full approximation schemes (FASs). Experimental evaluations showed that their architectures for the denoising and inpainting tasks save half of the trainable parameters and can thus outperform standard ones with the same model complexity.
Unfortunately, only a few studies based on model-driven techniques have been reported for the segmentation task. In this paper, we mainly focus on the explainable DL framework combining the advantages of the FAS and UNet for medical image segmentation.

1.1. Problem

H. Helmholtz proposed that the ill-posed problem of producing reliable perception from fuzzy signals can be solved through the process of "unconscious inference" (the Helmholtz Hypothesis) [29]. This theory implies that human vision is incomplete and that details are inferred by the unconscious mind to create a complete image. That is, our perception system integrates the fuzzy evidence received from the senses with its own model of the environment.
Let $p(\mathbf{u}\mid f;\alpha)$ be a probabilistic distribution for feature representations $\mathbf{u}$ of the source image $f$. The prior probability of $\mathbf{u}$ can be modeled as the multivariate normal distribution. In general, $\mathbf{u}$ can be extracted from a given image $f$ by optimizing the maximum a posteriori (MAP) estimation as
$$\arg\max_{\mathbf{u}}\;\log p(\mathbf{u}\mid f;\alpha),$$
where $\alpha$ is the environmental parameter in classical "unconscious inference" or the inverse problem, and this problem leads to the nonlinear system defined by
$$F(\mathbf{u};\alpha)=\mathbf{b},$$
where the nonlinear operator $F(\cdot\,;\alpha)$ is employed to generate the image $\mathbf{b}$, e.g., $\mathbf{b}=A^{\top}f$ is a deconvoluted image of $f$ in the image deblurring problem with a convolution operator $A$.
We consider that image segmentation refers to a composite process of feature extraction (6) and feature fusion segmentation. Here, the fusing process for the feature $\mathbf{u}$ is defined by
$$\mathbf{s}=S(\mathbf{u};\beta),$$
where $S(\cdot\,;\beta)$ denotes a fusing segmentation with a fixed conscious parameter $\beta$, and $\mathbf{s}$ is the segmentation result or probability map.
Such strongly interpretable segmentation models [30,31,32] are quite general: given well-predefined sparsity priors on the input image, they have the advantages of theoretical support and strong convergence. The total flowchart of classical variational segmentation can be summarized as shown in Figure 1a. However, they usually require expensive computations and also face the problem of selecting suitable regularizers $\phi(\cdot)$ and model parameters $(\alpha,\beta)$. Consequently, some reconstructed results are unsatisfactory.
It is well known that the solution $\mathbf{u}$ usually has a multiscale property, so a natural idea is to exploit multi-layer convolution and a multigrid architecture, which can describe multiscale features, to learn $\mathbf{u}$. Based on the above facts, we propose a two-stage segmentation framework that learns the feature $\mathbf{u}$ in Stage 1 and the segmentation $\mathbf{s}$ in Stage 2, as shown in Figure 1b.

1.2. Contributions

In this work, we focus on analyzing the feature extraction inverse problem (2) and the feature fusion segmentation (3) to design an explainable deep learning network. It is well known that the unrolled iterations of a classical solution algorithm can be considered as the layers of a neural network, so we propose a novel FAS-driven UNet (FAS-UNet), which integrates image data and a multiscale algorithm for solving the nonlinear inverse problem (7). The major difference from MgNet is that MgNet is not a U-shaped architecture and is used only for image classification, so its output cannot be converted into a segmentation prediction of the input image. Moreover, the proposed network is inspired by the traditional multiphase convexity Mumford–Shah variational model [30] and the FAS algorithm for solving nonlinear systems [33], and it exploits the model and algorithm prior information to extract image features. Indeed, the goal of our work is to show that, under some assumptions about the operators, it is possible to interpret the smoothing operations of the FAS and the geometric feature extraction operations of the variational model as the layers of a CNN, which in turn provide fairly specialized network architectures that allow us to solve the standard nonlinear system (7) for a specific choice of the parameters involved.
Our main contributions are summarized as follows:
  • We propose a novel variational-model-informed two-stage image segmentation network (FAS-UNet), where an explainable and lightweight sub-network for feature extraction is designed by combining the traditional multiphase convexity Mumford–Shah variational model and FAS algorithm for solving nonlinear systems. To the best of our knowledge, it is the first unrolled architecture designed based on model and algorithm priors in the image segmentation community.
  • The proposed model-informed network integrates image data and mathematical models, and it provides a helpful viewpoint for designing the image segmentation network architecture.
  • The proposed architecture can be trained from additional model information obtained by enforcing some of the mathematical and physical laws for better accuracy, faster training, and improved generalization. Extensive experimental results show that it performs better than the other state-of-the-art methods.
The rest of the paper is organized as follows. The novel FAS-UNet framework for solving nonlinear inverse problems by analyzing variational segmentation theory and the FAS algorithm is proposed in Section 2. We show experimental results in Section 3. Finally, we conclude this work in Section 4.

2. Variational Segmentation via the CNN Framework

The goal of image segmentation is to partition a given image $f:\Omega\to\mathbb{R}$ into $r$ regions $\{\Omega_i\}_{i=1}^{r}$ that contain distinct objects and satisfy $\Omega_i\cap\Omega_j=\emptyset$ for $j\neq i$ and $\bigcup_{i=1}^{r}\Omega_i=\Omega$, where the image domain $\Omega$ is a bounded and open subset of $\mathbb{R}^2$. Assume that $\Gamma=\bigcup_i\partial\Omega_i$ is the union of the boundaries of the $\Omega_i$, with $|\Gamma|$ denoting the arc length of the curve $\Gamma$.

2.1. Multiphase Variational Image Segmentation

As mentioned, various ways of variational image segmentation have been proposed. Below, we review a few of them.

2.1.1. Variational Image Segmentation

The Mumford–Shah (M-S) model is a well-known variational image segmentation method proposed by Mumford and Shah [34], which can be defined as follows:
$$\min_{u,\Gamma}\;\tau_1\int_{\Omega}(f-u)^2\,dx+\tau_2\int_{\Omega\setminus\Gamma}|\nabla u|^2\,dx+|\Gamma|,$$
where $\tau_1$ and $\tau_2$ are the weight parameters. The first term requires that $u:\Omega\to\mathbb{R}$ approximates $f$, the second term that $u$ does not vary much on each $\Omega_i$, and the third term that the boundary $\Gamma$ is as short as possible. This shows that $u$ is a piecewise smooth approximation of $f$.
In particular, Chan and Vese considered the special case of the M-S model where the function u is chosen to be a piecewise constant function; thus, the minimization for two-phase segmentation is given as
$$\min_{\Gamma,c_1,c_2}\;\lambda_1\int_{\mathrm{inside}(\Gamma)}(f-c_1)^2\,dx+\lambda_2\int_{\mathrm{outside}(\Gamma)}(f-c_2)^2\,dx+|\Gamma|,$$
where $c_1$ and $c_2$ are the average image intensities inside and outside of the boundary $\Gamma$, respectively, and $\lambda_1$ and $\lambda_2$ are the weight parameters.
Sometimes, the given image is degraded by noise and problem-related blur operator A . Therefore, Cai et al. [30] extended the two-stage image segmentation strategy using a convex variant of the Mumford–Shah model as
$$\min_{u\in W^{1,2}(\Omega)}\;\int_{\Omega}\Big[\kappa_1(f-Au)^2+\kappa_2|\nabla u|^2+|\nabla u|\Big]\,dx,$$
where $\kappa_1$ and $\kappa_2$ are positive parameters, and the existence and uniqueness of $u$ were analyzed in their work.
We assume the image features $\mathbf{u}=(u_1,\dots,u_d)^{\top}:\Omega\to\mathbb{R}^{d}$, where $u_i:\Omega_i\to\mathbb{R}$ is a smooth mapping defined on the tissue or lesion $\Omega_i$. In this work, we extend the above model (4) to the multiphase case, which can deal with $d$-phase segmentation (multiple objects) and refers to a two-stage composite process of feature extraction (6) and feature fusion segmentation (3).

2.1.2. Feature Extraction

The first stage is to extract the image features $\mathbf{u}$ by maximizing a posterior probabilistic distribution (6) for feature representations $\mathbf{u}$ of a given image $f$ as
$$\arg\max_{\mathbf{u}}\;p(\mathbf{u}\mid f;\alpha)=\arg\max_{\mathbf{u}}\;\log p(\mathbf{u}\mid f;\alpha)=\arg\max_{\mathbf{u}}\;\log\frac{p(f\mid\mathbf{u};\alpha)\,p(\mathbf{u};\alpha)}{p(f)}=\arg\max_{\mathbf{u}}\;\log\big(p(f\mid\mathbf{u};\alpha)\,p(\mathbf{u};\alpha)\big),$$
where $\alpha$ is the environmental parameter in classical "unconscious inference" or the inverse problem. In particular, the likelihood probability $p(f\mid\mathbf{u};\alpha)$ and the prior probability $p(\mathbf{u};\alpha)$ can be modeled as normal distributions, denoted respectively by
$$p(f\mid\mathbf{u};\alpha)\propto e^{-\frac{1}{2\sigma^{2}}\int_{\Omega}(A\mathbf{u}-f)^{2}\,d\Omega}=e^{-\gamma\int_{\Omega}(A\mathbf{u}-f)^{2}\,d\Omega},\qquad p(\mathbf{u};\alpha)\propto e^{-\lambda\int_{\Omega}\phi(\nabla\mathbf{u})\,d\Omega};$$
thus, the first stage is to find a smooth approximation $\mathbf{u}$ by minimizing the multiphase generalization (TS-MCMS) of (4), which can be rewritten as
$$\min_{\mathbf{u}\in W^{1,2}(\Omega)}\;\int_{\Omega}(f-A\mathbf{u})^{2}\,dx+\mu\int_{\Omega}\phi(\nabla\mathbf{u})\,dx,$$
where $A:\mathbb{R}^{d}\to\mathbb{R}$ is a convolutional blur operator, $\phi(\nabla\mathbf{u})=\nu|\nabla\mathbf{u}|^{2}+|\nabla\mathbf{u}|$ is a geometric prior on $\mathbf{u}$, and $\mu=\lambda/\gamma$. Hence, this leads to the nonlinear system
$$F(\mathbf{u};\alpha):=A^{\top}A\,\mathbf{u}-\mu\,\nabla\cdot\big(\phi'(\nabla\mathbf{u})\big)=\mathbf{b},$$
where $\mathbf{b}=A^{\top}f$, $\phi'$ denotes the derivative of $\phi$ with respect to its argument, and $\alpha=(A,\mu,\nu)$.
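For readers unfamiliar with the derivation, the nonlinear system above is simply the first-order optimality (Euler–Lagrange) condition of the preceding energy; a brief sketch, with multiplicative constants absorbed into $\mu$, is
$$\frac{\delta}{\delta\mathbf{u}}\left[\int_{\Omega}(f-A\mathbf{u})^{2}\,dx+\mu\int_{\Omega}\phi(\nabla\mathbf{u})\,dx\right]=-2A^{\top}(f-A\mathbf{u})-\mu\,\nabla\cdot\big(\phi'(\nabla\mathbf{u})\big)=0\;\Longrightarrow\;A^{\top}A\,\mathbf{u}-\mu\,\nabla\cdot\big(\phi'(\nabla\mathbf{u})\big)=A^{\top}f=:\mathbf{b}.$$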

2.1.3. Feature Fusion Segmentation

Once the features u are obtained, the segmentation is performed by fusing u properly in the second stage; for example, many novel image segmentation methods [30,31,32] have been proposed based on thresholding the smooth solution u . Then, the fusing process for feature u is finished in (3).
The model-driven methods introduce prior knowledge regarding many desirable mathematical properties of the underlying anatomical structure, such as phase field theory, $\Gamma$-approximation, smoothness, and sparseness. These informed priors may help to render the segmentation method more robust and stable. However, these model-inspired methods generally solve the optimization problem in the image domain, and the numerical minimization for the feature representations $\mathbf{u}$ is very slow because the TV-norm regularization, the high dimensionality of $\mathbf{u}$, and the nonlinear relationship between the images and the parameters pose significant computational challenges. Furthermore, it is challenging to introduce priors flexibly under different clinical scenes. These limitations make it hard for purely model-based segmentation to obtain solutions efficiently and flexibly.
The goal of this work was to learn powerful solvers of (7) and (3) to aggregate a variety of mechanisms to address the medical image segmentation problem efficiently.

2.2. Proposed Learnable Framework of TS-MCMS Algorithm

We summarize the two-stage algorithm for medical image segmentation based on the TS-MCMS model and, inspired by CNN architectures built from unrolled iterations, propose a learnable framework with two CNN modules on multiscale feature spaces, FAS-UNet (see Figure 1b), which aims to learn the nonlinear inverse operators of (7) and (3) in the context of the variational inverse problem to segment a given image $f$.
It is already well known that the unrolled iterations of many classical algorithms can be considered as the layers of a neural network [22,23,24,25,26]. In this part, we are not interested in designing another approach for inferring the classes in MgNet [27], but rather, we aim at extracting the features of a given image f.
Inspired by the variational segmentation model (6), one of the key ideas in the proposed architecture is that we split our framework into a solution module T K ( f ; θ 1 ) and a feature fusion module S K ( u ; θ 2 ) , where T K ( f ; θ 1 ) is the feature extraction part of the framework (in the multi-stage case) and S K ( · ; θ 2 ) is the stage fusion part to be learned. Therefore, how to design the effective function maps T K for approximately solving (7) and S K for approximating (3) is an important problem.
This work applies a nonlinear multigrid method to design FAS-UNet for explainable medical image segmentation by learning the two following modules:
$$\mathbf{u}=T_K(f;\theta_1),\qquad \mathbf{s}=S_K(\mathbf{u};\theta_2),$$
where $f$ is an input image, $\mathbf{u}$ is the feature maps, and $\mathbf{s}$ is the prediction of the ground-truth partitions, leading to the overall approximation function
$$\mathbf{s}=S_K\big(T_K(f;\theta_1);\theta_2\big),$$
where $\theta_1$ and $\theta_2$ are parameters to be learned in the proposed explainable FAS-UNet architecture.
To understand the approximation ability of the proposed modules T K ( f ; θ 1 ) and S K ( u ; θ 2 ) generated by the FAS-UNet architecture, we refer the readers to D. Zhou’s work [35], which answers an open question in CNN learning theory about how deep CNN can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough.
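To make the two-module composition concrete, the following is a minimal PyTorch sketch of $\mathbf{s}=S_K(T_K(f;\theta_1);\theta_2)$; the module names (FASUNetSketch, feature_extractor, fusion_head), the stand-in convolutional stack, and the class count are illustrative placeholders rather than the exact implementation.

```python
# Minimal sketch of the two-stage composition: Stage 1 extracts features u = T_K(f; θ1),
# Stage 2 fuses them into class probabilities s = S_K(u; θ2).
import torch
import torch.nn as nn

class FASUNetSketch(nn.Module):
    def __init__(self, in_ch=1, feat_ch=64, num_classes=5):
        super().__init__()
        # Stage 1: feature extraction (here a stand-in conv stack, not the FAS-Solution module)
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(feat_ch),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(feat_ch),
        )
        # Stage 2: feature fusion segmentation head
        self.fusion_head = nn.Conv2d(feat_ch, num_classes, kernel_size=1)

    def forward(self, f):
        u = self.feature_extractor(f)      # multiscale features (Stage 1)
        s = self.fusion_head(u)            # per-pixel class scores (Stage 2)
        return torch.softmax(s, dim=1)     # segmentation probability maps

# Usage: probs = FASUNetSketch()(torch.randn(2, 1, 128, 128))  # -> shape (2, 5, 128, 128)
```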

2.3. FAS-Module for Feature Extraction

In this part, we discuss how the multigrid method can be used to solve nonlinear problems. The Helmholtz Hypothesis [29] suggests that the extracted features can also be represented by solving the equation
$$F(\mathbf{u},\alpha)=\mathbf{b}:=A^{\top}f,$$
subject to
$$\min_{\mathbf{u}}\;\big\|S_K(\mathbf{u};\theta_2)-\mathbf{y}\big\|,$$
where $F$ denotes the transformation combining the feature $\mathbf{u}$ with the deblurred image $\mathbf{b}=A^{\top}f$, $\mathbf{u}$ denotes the unknown features, and $\mathbf{y}$ is the ground-truth of the image $f$. Our starting point is the traditional FAS algorithm for solving (10).

2.3.1. The Full Approximation Scheme

The multigrid method is usually used to solve nonlinear algebraic systems (10). For simplicity, the parameter $\alpha$ in $F(\mathbf{u},\alpha)$ is omitted when only the classical FAS algorithm is involved, i.e.,
$$F(\mathbf{u})=\mathbf{b}.$$
The multigrid ingredients, including the error smoothing and coarse grid correction ideas, are not restricted to the linear situation, but can be used directly for the nonlinear problem itself, which leads to the so-called FAS algorithm. The fundamental idea of the nonlinear multigrid is the same as in the linear case, and the FAS method can be recursively defined on the basis of a two-grid method. We start with the description of one fine–coarse cycle (finer grid layer $\ell$ and coarser grid layer $\ell+1$) of the nonlinear two-grid method for solving (11). To proceed, let the fine grid equation be written as
$$F_{\ell}(\mathbf{u}_{\ell})=\mathbf{b}_{\ell}.$$
Firstly, we compute an approximation $\bar{\mathbf{u}}_{\ell}:=\mathbf{u}_{\ell}^{m}$ of the fine grid problem by applying $m$ pre-smoothing steps to $\mathbf{u}_{\ell}$ as follows
  • $\mathbf{u}_{\ell}^{0}=\mathbf{u}_{\ell}$;
  • for $k=1:m$ do
  •    $\mathbf{u}_{\ell}^{k}=\mathbf{u}_{\ell}^{k-1}+B_{\ell}\big(\mathbf{b}_{\ell}-F_{\ell}(\mathbf{u}_{\ell}^{k-1})\big)$,
  • end for
  • $\bar{\mathbf{u}}_{\ell}=\mathbf{u}_{\ell}^{m}$,
where $B_{\ell}$ approximates the inverse of the Jacobian $F_{\ell}'$ (a Newton-type correction); this approximation can also be obtained via solving the least-squares problem defined by
$$\min_{\mathbf{u}_{\ell}}\;E(\mathbf{u}_{\ell}):=\frac{1}{2}\big\|F_{\ell}(\mathbf{u}_{\ell})-\mathbf{b}_{\ell}\big\|^{2}.$$
Secondly, the errors to the solution have to be smoothed such that they can be approximated on a coarser grid. Then, the defect $\mathbf{r}_{\ell}=\mathbf{b}_{\ell}-F_{\ell}(\bar{\mathbf{u}}_{\ell})$ is computed, and an analog of the linear defect equation is transferred to the coarse grid, which is defined by
$$F_{\ell+1}(\mathbf{u}_{\ell+1})=\mathbf{b}_{\ell+1}:=I_{\ell}^{\ell+1}\mathbf{r}_{\ell}+F_{\ell+1}\big(I_{\ell}^{\ell+1}\bar{\mathbf{u}}_{\ell}\big).$$
The coarse grid corrections are interpolated back to the fine grid by
$$\bar{\mathbf{u}}_{\ell}\leftarrow\bar{\mathbf{u}}_{\ell}+I_{\ell+1}^{\ell}\big(\mathbf{u}_{\ell+1}-I_{\ell}^{\ell+1}\bar{\mathbf{u}}_{\ell}\big),$$
where $\mathbf{u}_{\ell+1}$ is a solution of the coarse grid equations; the errors are then finally post-smoothed by
  • $\mathbf{u}_{\ell}^{0}=\bar{\mathbf{u}}_{\ell}$;
  • for $j=1:m$ do
  •    $\mathbf{u}_{\ell}^{j}=\mathbf{u}_{\ell}^{j-1}+B_{\ell}\big(\mathbf{b}_{\ell}-F_{\ell}(\mathbf{u}_{\ell}^{j-1})\big)$,
  • end for
  • $\hat{\mathbf{u}}_{\ell}=\mathbf{u}_{\ell}^{m}$.
This means that, once the solution of the fine grid problem is obtained, the coarse grid correction does not introduce any changes via interpolation. We regard this property as an essential one, and in our derivation of the coarse grid optimization problem, we make sure that it is satisfied.
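The following is a schematic NumPy implementation of one classical two-grid FAS cycle for $F(\mathbf{u})=\mathbf{b}$; the function names and the Richardson-type smoother are illustrative assumptions (the caller supplies the nonlinear operators, the smoother, and the restriction/prolongation maps), and this is the textbook scheme rather than the learned network described next.

```python
# Schematic two-grid FAS cycle: pre-smooth, restrict defect + approximation,
# (approximately) solve the coarse problem, correct, post-smooth.
import numpy as np

def fas_two_grid(u, b, F_fine, F_coarse, smooth, restrict, prolong,
                 coarse_solve, m_pre=3, m_post=3):
    # Pre-smoothing: m_pre correction steps on the fine grid
    for _ in range(m_pre):
        u = smooth(u, b, F_fine)
    # Transfer the defect and the current approximation to the coarse grid
    r = b - F_fine(u)
    u_c0 = restrict(u)
    b_c = restrict(r) + F_coarse(u_c0)
    # (Approximately) solve the coarse-grid problem F_coarse(u_c) = b_c
    u_c = coarse_solve(u_c0, b_c, F_coarse)
    # Coarse-grid correction, interpolated back to the fine grid
    u = u + prolong(u_c - u_c0)
    # Post-smoothing
    for _ in range(m_post):
        u = smooth(u, b, F_fine)
    return u

# Example smoother: a damped residual (Richardson-type) correction step
def richardson_smoother(u, b, F, tau=0.1):
    return u + tau * (b - F(u))
```

Note that if the fine-grid problem is already solved exactly, the restricted defect vanishes, the coarse solve returns the restricted approximation, and the correction is zero, which is exactly the property emphasized above.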

2.3.2. FAS-Solution Module—A Learnable Architecture for FAS Solution

In this part, we unroll the multiscale correction process of the multigrid method and design a series of deep FAS-Solution modules to propagate the features of input image f. Figure 2 demonstrates the cascade of all ingredients at each FAS iteration of our propagative network.
We now consider a decomposition of the image representation into partial sums, which correspond to the multiscale feature sequence, having the following idea in mind. To learn features that are invariant to noise and uninformative intensity variations, we propose a generative feature module u = T K ( f ; θ 1 ) allowing for a significant reduction of the number of parameters involved, which involves a learnable FAS update for solving the nonlinear system (10). Next, we analyze its three key components as follows.
Pre-/coarsest-/post-smoothing block: Error smoothing is one of the two basic principles of the FAS approach. At each pre-smoothing (or coarsest-processing or post-processing) step, to establish an efficient error correction and reduce the computational costs, we propose to generate the error-based iterative scheme. The main motivation of the learnable pre-/coarsest-/post-smoothing blocks (LSB/CSB/RSB) is to provide another way to robustly resolve the ambiguity of the feasible solutions. Therefore, we further unroll the Newton update process for calculating the approximate solution of the feature maps, then design a series of deep pre-/coarsest-/post-smoothing blocks as
  • for $j=0:k_q-1$ do
  •    $\mathbf{u}_{\ell}^{j+1}=\mathbf{u}_{\ell}^{j}+\mathcal{M}\big(\mathbf{u}_{\ell}^{j};K_{q,\ell},K_{q,\ell,j},\mathbf{b}_{\ell}\big)=\mathbf{u}_{\ell}^{j}+\mathcal{F}_{K_{q,\ell,j}}\big(\mathbf{b}_{\ell}-\mathcal{F}_{K_{q,\ell}}(\mathbf{u}_{\ell}^{j})\big)$,
  • end for
  • $\bar{\mathbf{u}}_{\ell}=\mathbf{u}_{\ell}^{k_q}$;
where the residual network block $\mathcal{M}(\mathbf{u};K,K',\mathbf{b})$ is a trainable feature correction network; here, $\ell=1,\dots,L$ with $L$-grid cycles. The above deep smoothing series denotes the pre-smoothing block when $q:=l$, the $L$-th (coarsest) smoothing when $q:=m$, and the post-smoothing when $q:=r$. $\mathcal{F}_{K_{q,\ell,j}}$ (or $\mathcal{F}_{K_{q,\ell}}$, which means that the convolution $K_{q,\ell}$ shares the same weights over all pre-/post-processing smoothing steps of the $\ell$-th grid cycle) represents an operation consisting of three main components, namely a convolution $K_{q,\ell,j}$ (or $K_{q,\ell}$) with $p$ filters, the ReLU function $\varphi(\cdot)$, and batch normalization $\psi(\cdot)$, such that $\mathcal{F}_{K}(\cdot):=\psi(\varphi(K(\cdot)))$. In particular, $\mathbf{b}_{1}:=K_0 f$ is the initial feature on the finest grid, which is obtained by learning the convolution $K_0$ with $p$ filters.
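A minimal 2D PyTorch sketch of one such learnable smoothing block is given below; the class and helper names are illustrative, the unshared-weights choice for the per-iteration convolutions follows the default discussed later, and channel counts are left as parameters.

```python
# Unrolled learnable smoothing block (LSB/CSB/RSB):
#   u^{j+1} = u^j + F_{K_{q,l,j}}( b - F_{K_{q,l}}(u^j) ),
# where F_K(.) := BN(ReLU(K(.))) with p filters.
import torch.nn as nn

def conv_relu_bn(ch):
    # F_K(.) = ψ(φ(K(.))): convolution, then ReLU, then batch normalization
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(ch))

class SmoothingBlock(nn.Module):
    def __init__(self, channels, num_iters):
        super().__init__()
        self.F_shared = conv_relu_bn(channels)                               # F_{K_{q,l}}, shared over iterations
        self.F_steps = nn.ModuleList([conv_relu_bn(channels)                 # F_{K_{q,l,j}}, one per iteration j
                                      for _ in range(num_iters)])

    def forward(self, u, b):
        for F_step in self.F_steps:
            u = u + F_step(b - self.F_shared(u))   # Newton-type feature correction
        return u
```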
Feature downsample block: The choice of the restriction and interpolation operators $I_{\ell}^{\ell+1}$ and $I_{\ell+1}^{\ell}$ in the FAS algorithm, for the intergrid transfer of grid functions, is closely related to the choice of the coarsening strategy. Here, we design learnable convolutions for the transfer operators, i.e., the grid transfers between the finer grid $\ell$ and the coarser grid $\ell+1$.
The low-frequency components represent meaningful image features on the coarse grid $\ell+1$, whereas the high-frequency components do not because they are not "visible" on the coarse grid, which means that the frequency information on the coarse grid can be extracted from the right-side term defined by
$$\mathbf{b}_{\ell+1}=K_{\ell}^{\ell+1}\big(\mathbf{b}_{\ell}-\mathcal{F}_{K_{\ell}}(\bar{\mathbf{u}}_{\ell})\big)+\mathcal{F}_{K_{\ell+1}}\big(K_{\ell}^{\ell+1}\bar{\mathbf{u}}_{\ell}\big),$$
where $\mathbf{b}_{\ell}$ and $\bar{\mathbf{u}}_{\ell}$ are the inputs of the downsample block and $\mathbf{b}_{\ell+1}$ is the output of the downsampling module in the feature space; here, $\ell=1,\dots,L-1$ with $L$-grid cycles. Similar to $\mathcal{F}_{K_{q,\ell,j}}$, $K_{\ell}^{\ell+1}$ is a learnable downsample operation used to approximate the restriction function $I_{\ell}^{\ell+1}$ in (12) or (13), such as a convolution with a stride of 2 and $p$ filters. $\mathcal{F}_{K_{\ell}}$ and $\mathcal{F}_{K_{\ell+1}}$ are the nonlinear convolutional blocks in the fine and coarse layers, respectively. In general, $\mathcal{F}_{K}$ denotes the operator consisting of three main components, namely a convolution $K$ with $p$ filters, the ReLU function $\varphi(\cdot)$, and batch normalization $\psi(\cdot)$, such that $\mathcal{F}_{K}(\cdot):=\psi(\varphi(K(\cdot)))$. Note that $\mathbf{b}_{\ell}-\mathcal{F}_{K_{\ell}}(\bar{\mathbf{u}}_{\ell})$ is equivalent to the residual of the images in the fine layer, and $\mathcal{F}_{K_{\ell+1}}(K_{\ell}^{\ell+1}\bar{\mathbf{u}}_{\ell})$ is added to reduce the loss of image information compared with directly pooling the image. The feature downsample block (FDB) architecture is shown in Figure 2b.
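A minimal PyTorch sketch of such a feature downsample block follows; the stride-2 restriction convolution and the class/variable names are illustrative assumptions consistent with the formula above, not the authors' exact implementation.

```python
# Feature downsample block (FDB):
#   b_{l+1} = K_l^{l+1}( b_l - F_{K_l}(u_bar) ) + F_{K_{l+1}}( K_l^{l+1} u_bar )
import torch.nn as nn

class FeatureDownsampleBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.F_fine = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                    nn.ReLU(), nn.BatchNorm2d(channels))       # F_{K_l}
        self.restrict = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # K_l^{l+1}, stride-2 restriction
        self.F_coarse = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                      nn.ReLU(), nn.BatchNorm2d(channels))     # F_{K_{l+1}}

    def forward(self, b, u_bar):
        residual = b - self.F_fine(u_bar)              # fine-grid defect b_l - F_{K_l}(u_bar)
        u_coarse0 = self.restrict(u_bar)               # restricted fine-grid approximation
        b_coarse = self.restrict(residual) + self.F_coarse(u_coarse0)
        return b_coarse, u_coarse0
```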
Feature correction block: The purpose of the feature correction block (FCB) is to take the detailed information extracted from the coarser grid into account and help to compensate the encoded features $\bar{\mathbf{u}}_{\ell}$. The coarse grid corrections are interpolated back to the fine grid, i.e.,
$$\hat{\mathbf{u}}_{\ell}\leftarrow\bar{\mathbf{u}}_{\ell}+K_{\ell+1}^{\ell}\big(\mathbf{u}_{\ell+1}-K_{\ell}^{\ell+1}\bar{\mathbf{u}}_{\ell}\big),$$
where $K_{\ell+1}^{\ell}$ is a learnable upsampling operation used to approximate the interpolation function $I_{\ell+1}^{\ell}$ in (13), such as a transposed convolution with a stride of 2 and $p$ filters; here, $\ell=1,\dots,L-1$ with $L$-grid cycles. Obviously, $\mathbf{e}_{\ell+1}=\mathbf{u}_{\ell+1}-K_{\ell}^{\ell+1}\bar{\mathbf{u}}_{\ell}$ is the residual feature on the coarse grid. Compared to directly upsampling $\mathbf{u}_{\ell+1}$, the transposed convolution $K_{\ell+1}^{\ell}\mathbf{e}_{\ell+1}$ of the residual feature maps $\mathbf{e}_{\ell+1}$ is used as the error correction to update the fine grid approximation $\hat{\mathbf{u}}_{\ell}$, which compensates the information of the feature maps $\bar{\mathbf{u}}_{\ell}$. Such a transposed convolution can learn a self-adaptive mapping to restore features with more detailed information.
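A corresponding minimal sketch of the feature correction block is shown below; the kernel sizes of the restriction and transposed convolution are illustrative choices that keep the fine- and coarse-grid shapes compatible for even spatial sizes.

```python
# Feature correction block (FCB):
#   u_hat = u_bar + K_{l+1}^{l}( u_{l+1} - K_l^{l+1} u_bar ),
# where K_{l+1}^{l} is a learnable transposed convolution (prolongation).
import torch.nn as nn

class FeatureCorrectionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.restrict = nn.Conv2d(channels, channels, 3, stride=2, padding=1)   # K_l^{l+1}
        self.prolong = nn.ConvTranspose2d(channels, channels, 2, stride=2)      # K_{l+1}^{l}

    def forward(self, u_bar, u_coarse):
        e_coarse = u_coarse - self.restrict(u_bar)   # residual features on the coarse grid
        return u_bar + self.prolong(e_coarse)        # error-corrected fine-grid features
```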
Based on these designs for the nonlinear operators $\mathcal{F}_{K}$ and the two grid transfer convolutions $K_{\ell}^{\ell+1}$ and $K_{\ell+1}^{\ell}$ with $p$ filters, we aim to approximate the feature solution of (6) by learning the feature extraction parameters
$$\theta_1=\Big\{K_{\ell}^{\ell+1},\,K_{\ell+1}^{\ell},\,(K_{q,\ell}),\,(K_{q,\ell,j})_{j=1}^{k_q},\,K_{\ell}\;\Big|\;\ell=1,\dots,L-1,\;q\in\{l,r\}\Big\}\cup\Big\{K_{0},\,(K_{m,L}),\,(K_{m,L,j})_{j=1}^{k_m}\Big\}$$
in $\mathbf{u}=T_K(f;\theta_1)$, thus further improving image segmentation.

2.4. Learning Feature Fusion Segmentation

It is well known that, in the segmentation task, each pixel is labeled as either 0 or 1 so that organ pixels can be accurately identified within a tight bounding box. In the second stage of the two-stage multiphase variational image segmentation (6), the traditional approach is that users manually set one or more thresholds according to their professional prior, with all pixels in the same object sharing a threshold, and then filter the feature to obtain the segmentation result. Another approach is to obtain the final segmentation result by k-means clustering (the number of clusters is specified manually, and the cluster centers are updated continuously during the clustering process) [30]. This approach leads to a large amount of computation (the metrics are recalculated at each iteration) and unstable segmentation results (only the relationship between pixels and centers is considered, not the relationship between pixels).
The second key component of our proposed FAS-UNet framework is the design of the segmentation module $S_K(\mathbf{u};\theta_2)$ that computes the segmented mask $\mathbf{s}$. The fusion segmentation module takes a batch of multiscale features from the FAS module as the input and outputs the segmentation masks. Finally, the pixel segmentation computes the mapping from smaller-scale probability predictions to binary masks.
Based on this idea, the feature fusion segmentation module is constructed, comprising a convolutional operation and an activation function. Intuitively, the feature extraction module $T_K(f;\theta_1)$ based on deep learning produces the multi-channel feature maps $\mathbf{u}$ (with many more channels than the number of categories) in the first stage. Then, a shallow convolutional network is constructed to learn the parameters $K_p$ corresponding to the mapping $\rho(K_p(\cdot)):\mathbb{R}^{p}\to\mathbb{R}^{c}$ from the feature maps $\mathbf{u}$ to the segmentation probability maps through the softmax function $\rho(\cdot)$, which improves on the traditional practice and generalizes better.
Based on this design for the channel transfer convolution $K_p$ with $c$ filters ($c$ is the number of segmentation categories), we aim to approximate the final multiphase segmentation probability maps $S(\cdot\,;\beta)$ of (6) by learning the fusion parameters
$$\theta_2=\{K_p\}$$
in $\mathbf{s}=S_K(\mathbf{u};\theta_2)$, thus further refining the segmentation mask.
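A minimal PyTorch sketch of the fusion segmentation head is given below; here $K_p$ is sketched as a 1×1 convolution (the kernel size is an assumption) mapping the $p$ feature channels to $c$ class probabilities via a softmax, and the default channel counts are illustrative.

```python
# Fusion segmentation module S_K(u; θ2): channel-transfer convolution K_p followed by softmax ρ(·).
import torch
import torch.nn as nn

class FusionSegmentationHead(nn.Module):
    def __init__(self, feat_channels=64, num_classes=5):
        super().__init__()
        self.K_p = nn.Conv2d(feat_channels, num_classes, kernel_size=1)  # learnable K_p with c filters

    def forward(self, u):
        return torch.softmax(self.K_p(u), dim=1)   # per-pixel class probability maps
```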

2.5. Loss Function

The proposed FAS-UNet architecture can be rewritten as
$$\mathbf{s}=S_K\big(T_K(f;\theta_1);\theta_2\big),$$
which requires the loss function L ( θ ; D train ) to optimize the model parameters θ : = { θ 1 , θ 2 } . It can measure the error between the prediction and labels, and the gradients of the weights in the loss function can be back-propagated to the previous layers in order to update the model weights.
To proceed, we consider the training data $\mathcal{D}_{\mathrm{train}}=\{f_i,\mathbf{y}_i\}_{i=1}^{n}$ with labels from the class set $\mathcal{C}_{\mathrm{train}}=\{0,\dots,c-1\}^{d}$ used for training a pixel classifier, where $f_i$ is an image sample, $\mathbf{y}_i\in\mathcal{C}_{\mathrm{train}}$ is the corresponding label, $c$ is the number of object categories segmented in the datasets, $d$ indicates the number of image pixels, and $n$ denotes the number of training samples. We employ the cross-entropy as the loss function, leading to the following optimization problem:
$$\min_{\theta}\;\mathcal{L}(\theta;\mathcal{D}_{\mathrm{train}}):=-\sum_{i=1}^{n}\sum_{j=1}^{d}\Big[\bar{\mathbf{s}}_i(j)\cdot\log\big(\mathbf{s}_i(j)\big)+\big(1-\bar{\mathbf{s}}_i(j)\big)\cdot\log\big(1-\mathbf{s}_i(j)\big)\Big],$$
where $\mathbf{s}_i(j)\in\mathbb{R}^{c}$ denotes the predicted probability vector of the $j$-th pixel in the $i$-th sample and $\bar{\mathbf{s}}_i(j)$ is the one-hot-encoded label of the ground-truth $y_i(j)$ at the $j$-th pixel in the $i$-th sample. Finally, the predicted class of the $j$-th pixel of the $i$-th image is given by
$$\hat{y}_i(j)=\arg\max_{k}\;\big\{\mathbf{s}_i(j)(1),\dots,\mathbf{s}_i(j)(k),\dots,\mathbf{s}_i(j)(c)\big\},$$
where $\mathbf{s}_i(j)(k)$ denotes the predicted probability of the $k$-th class.
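The following is a minimal PyTorch sketch of this per-pixel cross-entropy computed from one-hot labels; tensor shapes, the averaging over pixels, and the helper name are assumptions for illustration.

```python
# Per-pixel cross-entropy loss on one-hot labels, matching the objective above.
# probs: predicted probabilities with shape (batch, classes, H, W)
# labels: integer ground-truth class indices with shape (batch, H, W)
import torch
import torch.nn.functional as F

def pixelwise_ce_loss(probs, labels, num_classes, eps=1e-7):
    one_hot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()  # s_bar_i(j)
    probs = probs.clamp(eps, 1.0 - eps)                                   # numerical safety
    loss = -(one_hot * torch.log(probs) + (1.0 - one_hot) * torch.log(1.0 - probs))
    return loss.sum(dim=1).mean()   # sum over classes, average over pixels and batch

# The predicted class of each pixel is then probs.argmax(dim=1).
```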

3. Datasets and Experiments

In this section, we first introduce the evaluation metrics for medical image segmentation and describe the datasets and experimental settings used for 2D CT image segmentation and 3D medical volumetric segmentation. Next, we analyze the sensitivity of 2D FAS-UNet to each hyperparameter configuration through a series of experiments. Finally, we evaluate the effectiveness of the proposed 2D/3D FAS-UNet through comparative experiments.

3.1. Evaluation Metrics

There are many metrics to quantitatively evaluate segmentation accuracy, each of which focuses on different aspects. In this work, we employed the average Dice similarity coefficient (a-DSC), average precision (a-Preci), and average symmetric surface distance (a-SSD), which are widely used in the segmentation task as evaluation metrics to evaluate the performance of the model. The a-DSC/a-Preci/a-SSD are calculated by averaging the DSC/Preci/SSD of each category [36].
The Dice score is the most-used metric in validating medical image segmentation, also called the DSC score [37], defined by
$$\mathrm{DSC}(S,Y)=\frac{2\,|S\cap Y|}{|S|+|Y|},$$
where $S$ and $Y$ denote the automatically segmented set of images and the manually annotated ground-truth, respectively, and $|\cdot|$ denotes the measure of a set. This formula computes the overlap between the prediction and the ground-truth to evaluate the overall quality of the segmentation results. However, it is fairly insensitive to the precise boundary of the segmented regions. Precision effectively describes the purity of the prediction relative to the ground-truth, or measures the number of predicted pixels that actually have a matching ground-truth annotation, by calculating
$$\mathrm{Preci}=\frac{TP}{TP+FP},$$
where a true positive (TP) is observed when a prediction–target mask pair has a score that exceeds some predefined threshold and a false positive (FP) indicates that a predicted object mask has no associated ground-truth. The SSD value between two finite point sets $S$ and $Y$ is defined as follows:
$$\mathrm{SSD}(S,Y)=\frac{\sum_{s\in S}d(s,Y)+\sum_{y\in Y}d(y,S)}{|S|+|Y|},$$
where $d(v,X)=\min_{x\in X}\|v-x\|$ denotes the minimum Euclidean distance from the point $v$ to all points of $X$.
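A minimal NumPy/SciPy sketch of these three metrics for a single binary class is given below; the per-class averaging used in the tables, the helper names, and the choice of point sets (in practice usually boundary/surface voxels for SSD) are assumptions.

```python
# Dice, precision, and symmetric surface distance for one binary class.
import numpy as np
from scipy.spatial import cKDTree

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def precision(pred, gt):
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, np.logical_not(gt)).sum()
    return tp / (tp + fp)

def ssd(pred_points, gt_points):
    # Symmetric surface distance between two finite point sets, as in the formula above.
    d_pg = cKDTree(gt_points).query(pred_points)[0]   # min distances pred -> gt
    d_gp = cKDTree(pred_points).query(gt_points)[0]   # min distances gt -> pred
    return (d_pg.sum() + d_gp.sum()) / (len(pred_points) + len(gt_points))
```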

3.2. Datasets and Experimental Setup

We evaluated the proposed method and other state-of-the-art methods on the 2D SegTHOR datasets [38], 3D HVSMR-2016 datasets [39], and 3D CHAOS-CT datasets [40,41], respectively. We introduce their details and data processing methods as follows.

3.2.1. Data Preparation

For the SegTHOR datasets, in order to reduce the GPU memory cost and the image noise, we first split the original 3D data into 2D images along the Z-axis. Secondly, we used $f=I[96{:}400,\,172{:}396]$ as the input image, where $I$ denotes the original 2D slice. Slices whose ground-truth was entirely background were removed during training.
For the 3D HVSMR-2016 datasets, we directly used a sliding window cropping method with strides of $64\times 64\times 32$ to crop the volumes. Before cropping the 3D whole-volume into several overlapping sub-volumes of size $128\times 128\times 64$, a $(32,32,16)$-voxel zero padding was first added in each direction of the 3D whole-volume. After these operations, all remaining sub-volumes whose sizes were smaller than $128\times 128\times 64$ were resized to $128\times 128\times 64$ with zero-filling, and the intensity values of all patches were in $[0,4808]$.
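The following is a schematic NumPy version of this sliding-window cropping; the axis ordering, the function name, and the handling of undersized remainders are illustrative assumptions consistent with the description above.

```python
# Zero-pad by (32, 32, 16) voxels per side, then extract 128x128x64 patches
# with strides of 64x64x32, zero-filling any undersized remainder patch.
import numpy as np

def crop_subvolumes(volume, patch=(128, 128, 64), stride=(64, 64, 32), pad=(32, 32, 16)):
    vol = np.pad(volume, [(p, p) for p in pad], mode="constant")
    patches, origins = [], []
    for z in range(0, vol.shape[0], stride[0]):
        for y in range(0, vol.shape[1], stride[1]):
            for x in range(0, vol.shape[2], stride[2]):
                sub = vol[z:z + patch[0], y:y + patch[1], x:x + patch[2]]
                if sub.shape != tuple(patch):                      # undersized remainder: zero-fill
                    padded = np.zeros(patch, dtype=vol.dtype)
                    padded[:sub.shape[0], :sub.shape[1], :sub.shape[2]] = sub
                    sub = padded
                patches.append(sub)
                origins.append((z, y, x))
    return np.stack(patches), origins
```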
For the 3D CHAOS-CT datasets, which were used for the liver segmentation experiments, we first cropped the volumes in the $x,y$ directions to obtain an ROI with a size of $380\times 440\times z$ and then used the above sliding window cropping method to crop out several 3D sub-volumes, whose intensity values were in $[-1200,1096]$.
Although the noise problem can be improved by data pre-processing, our aim was not to pursue the best performance of the network on these datasets; we compared the performance of each method under fairer conditions. Using some data pre-processing techniques may be particularly beneficial to some methods, while at the same time, they may degrade the performance of others, so we did not use more complex data pre-processing techniques.

3.2.2. Experimental Configurations

We used mini-batch stochastic gradient descent (SGD) to optimize the proposed model, in which the initial learning rate, momentum parameter, and weight decay parameter were set to 0.01, 0.99, and $10^{-4}$, respectively. We set the batch size to 16, 4, and 4 for the 2D SegTHOR datasets, 3D HVSMR-2016 datasets, and 3D CHAOS-CT datasets, respectively. The maximum numbers of epochs for the three datasets were set to 150, 150, and 300, respectively. We also used a decay strategy to update the learning rate. The network initialization method was Kaiming initialization, and the activation function was ReLU. The numbers of grid cycles for 2D and 3D FAS-UNet were $L=5$ and $L=4$, respectively. The kernel sizes of the 2D and 3D networks were set to $3\times 3$ and $3\times 3\times 3$ by default, respectively. We did not use the weight-sharing scheme for the convolution $K_{q,\ell,j}$ (within the outer-level nonlinear operator $\mathcal{F}_{K_{q,\ell,j}}$) within one smoothing block. Table 1 shows the details of the 2D FAS-UNet framework; 3D FAS-UNet has a similar architecture, except that the 2D convolutions are replaced with 3D convolutions.

3.2.3. Parameter Complexity

To compute the number of parameters of the proposed model, we first denote the number of parameters of a 2D convolution kernel $K_{2d}$ with the shape $p\times p\times k_c\times k_c$ as
$$\eta(K_{2d})=p^{2}N=p^{2}(k_c)^{2}.$$
Similarly, the number of parameters of a 3D convolution $K_{3d}$ with the shape $rp\times rp\times k_c\times k_c\times k_c$ is defined as
$$\eta(K_{3d})=r^{2}p^{2}(k_c)^{3}.$$
In particular, if the channel ratio satisfies $r<1/\sqrt{k_c}$, one has $\eta(K_{2d})>\eta(K_{3d})$. Thus, the number of parameters of the convolution kernel set $\theta_1$ in the proposed FAS-UNet can be computed by
$$\eta(\theta_1)=c\,p\,(k_c)^{2}+\big((k_m+1)+(L-1)(k_l+k_r+5)\big)\,\eta(K)\approx\big((k_m+1)+(L-1)(k_l+k_r+5)\big)\,\eta(K),$$
where $c$ is the channel number of the input image and $\eta(K)$ is the number of parameters of each 2D or 3D convolution $K$.
If we set $k_c=3$, $p=64$, $\{k_l,k_m,k_r\}=\{3,7,4\}$, and $L=5$, one has $\eta(K_{2d})=36{,}864$; hence, 2D FAS-UNet has approximately $\eta(\theta_1)\approx 56\,\eta(K_{2d})=2{,}064{,}384$ parameters. In addition, one also has $\eta(K_{3d})=27{,}648$ when $k_c=3$ and $p=32$. If we set $\{k_l,k_m,k_r\}=\{3,5,2\}$ and $L=4$, then 3D FAS-UNet has approximately $\eta(\theta_1)\approx 36\,\eta(K_{3d})=995{,}328$ parameters. Here, we did not compute $\eta(\theta_2)$, which involves only a small number of parameters.
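As a quick sanity check of these counts, the following short snippet evaluates the formulas above for the default configurations; the helper names are ours, and the printed totals include the small input-convolution term $c\,p\,(k_c)^2$ that the approximate figures in the text drop.

```python
# Evaluate the parameter-count formulas for the default 2D and 3D configurations.
def eta_conv2d(p, kc):            # p x p x kc x kc kernel
    return p * p * kc * kc

def eta_conv3d(p, kc, r=1.0):     # rp x rp x kc x kc x kc kernel
    return int(r * p) ** 2 * kc ** 3

def eta_theta1(eta_K, kl, km, kr, L, c, p, kc):
    return c * p * kc ** 2 + ((km + 1) + (L - 1) * (kl + kr + 5)) * eta_K

# 2D FAS-UNet: kc=3, p=64, {kl, km, kr}={3, 7, 4}, L=5, single-channel input
print(eta_conv2d(64, 3))                                    # 36864
print(eta_theta1(eta_conv2d(64, 3), 3, 7, 4, 5, 1, 64, 3))  # 2064960 = 56 * 36864 + 576
# 3D FAS-UNet: kc=3, p=32, {kl, km, kr}={3, 5, 2}, L=4, single-channel input
print(eta_conv3d(32, 3))                                    # 27648
print(eta_theta1(eta_conv3d(32, 3), 3, 5, 2, 4, 1, 32, 3))  # 995616 = 36 * 27648 + 288
```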
Our experiments were implemented on the PyTorch framework and two NVIDIA Geforce RTX 2080Ti GPUs with 11GB memory. For each quantitative result in the experiments, we repeated the experiment twice and chose the best one to compute the mean/std. Note that we used the same pipeline for all these experiments of each dataset for a fair comparison. The networks under comparison were trained from scratch.

3.3. Ablation Studies

We conducted four groups of ablation studies on the 2D SegTHOR datasets to optimize the hyperparameter configurations of the proposed framework.

3.3.1. Blocks’ Sensitivity

Firstly, we assessed the effect of the smoothing block configurations $\{k_l,k_m,k_r\}$, where $k_l$, $k_m$, and $k_r$ denote the numbers of smoothing iterations in the LSB, CSB, and RSB, respectively. Here, we first fixed the channel configuration with $p=32$ as the default. Table 2 shows the quantitative results of different block parameter sets; we observed from the pre-smoothing experiments (fixing $k_m$ and $k_r$, varying $k_l$) that 2D FAS-UNet with $k_l=3$ iterations achieved the best a-DSC score and precision of 85.60% and 85.99%, respectively. For the post-smoothing (fixing $k_l$ and $k_m$, varying $k_r$), we see that the model with $k_r=4$ iterations achieved the highest a-SSD and precision values, with the a-DSC score being only slightly lower (by 0.04%) than the model with the block configuration $\{3,7,5\}$. To balance the prediction performance and computational cost, we set the block configuration to $\{3,7,4\}$ in all 2D experiments.

3.3.2. The Input Feature Initialization

We also evaluated the initialization of the input feature $\mathbf{b}_1=K_0 f$ on the finest cycle $\ell=1$ to verify its sensitivity. We compared different variants of the initialization method, such as zero initialization, random normal distribution initialization, and $\psi(K_0 f)$ initialization with the batch normalization operation $\psi(\cdot)$, where $K_0$ was obtained by learning a convolutional kernel with a size of $p\times 3\times 3$ for 2D segmentation or $p\times 3\times 3\times 3$ for 3D segmentation.
Table 3 shows the quantitative comparison of these variants. The model with ψ ( K 0 f ) initialization achieved a-DSC, a-SSD, and a-Preci values of 85.58% (ranked second), 2.61 mm (top-ranked), and 89.75% (top-ranked), respectively. Although the model with random initialization had the highest a-DSC score, the a-SSD and a-Preci scores were significantly lower than the model with ψ ( K 0 f ) initialization. To this end, we set the proposed framework with ψ ( K 0 f ) initialization as the default in this work.

3.3.3. Weight Sharing

To demonstrate the flexibility of the proposed framework, which does not require different $K_{q,\ell,j}$ parameter configurations in the different nonlinear operators $\mathcal{F}_{K_{q,\ell,j}}$ (with respect to $j$) within the $\ell$-th pre-smoothing ($q=l$) or post-smoothing ($q=r$) block, we constructed several variants with different sharing settings among the $K_{q,\ell,j}$ (with respect to the iteration step $j$).
We only varied the weight sharing settings to verify the sensitivity of the model. We compared four variants, and the results are shown in Table 4. From the evaluation metrics, we see that the model without weight sharing had more parameters and achieved the best performance on the a-DSC, a-SSD, and a-Preci values. The performance of the other three models did not show significant differences. Therefore, we adopted the default unshared version in the rest of this paper.

3.3.4. Effects of Varying the Number of Channels

In this section, we analyze the segmentation performance of the proposed method when varying the number $p$ of feature channels. In the FAS-Solution module, the pre-smoothing and post-smoothing steps (with $k_l$ and $k_r$ iterations, respectively) have the same update structure as the coarsest smoothing step, which includes $k_m$ iterations. Therefore, we set the number of smoothing iterations as $\{k_l,k_m,k_r\}=\{3,7,4\}$ and adopted a series of channel parameters $p=80,64,48,32,16$ for comparison. Here, $p=80$ means that, in each convolutional operation of the FAS-Solution module, there are 80 filters with the same kernel size of $3\times 3$.
Table 5 shows the quantitative comparison of different $p$-configurations. It reveals that, as the number of channels increases, the number of parameters of our model grows quadratically. Additionally, the networks with 64 and 16 channels achieved a-DSC scores of 86.83% (ranked first) and 84.88% (ranked last), respectively. When the number of channels was less than 64, increasing the number of channels improved the a-DSC value, and one can see from Table 5 that the number of channels has a significant impact on the performance of the model. Based on this observation, the configuration with 64 channels is a preferable setting to balance segmentation performance and computational cost, and we fixed $p=64$ throughout all the 2D experiments.
To provide insights into the model hyperparameter configurations of the proposed 3D FAS-UNet version on the 3D HVSMR-2016 datasets and 3D CHAOS-CT datasets, we also carried out a series of ablation experiments to investigate the influence of two key design variables, the number of channels and the number of convolutional blocks. The evaluation indicated that the network performed better with the configurations p = 32 , { k l , k m , k r } = { 3 , 6 , 2 } for the 3D HVSMR-2016 datasets and { 3 , 5 , 2 } for the 3D CHAOS-CT datasets as the default. Here, we do not detail these comparisons.
Finally, we illustrate the hyperparameter configurations of the proposed FAS-UNet on each dataset throughout all experiments, as shown in Table 6.

3.4. The 2D FAS-UNet for the SegTHOR Datasets

We evaluated the proposed network on the 2D SegTHOR datasets and compared the visualizations and quantitative metrics with the existing state-of-the-art segmentation methods, including 2D UNet [9], CA-Net [20], CE-Net [42], CPFNet [43], ERFNet [44], UNet++ [11], and LinkNet [45].
In Table 7, we show the quantitative results of each organ’s segmentation compared with the other seven models. We can see that the segmentation performance of the heart was the best among all organs, and its a-DSC score was more than 92% for each method, followed by the aorta, and the worst was the esophagus. The main reason for the good performance in extracting the heart was that the heart region is the largest, and its inner pixel value changes little, while its boundary is more obvious (see Figure 3), while the esophageal region is the smallest among all organs, which increased the segmentation difficulty.
The proposed method achieved an a-DSC value of 86.83% (ranked first) and an a-Preci value of 88.21% (ranked first) with only 2.08 M parameters. Compared with the state-of-the-art models CA-Net (second-ranked in the a-DSC score) and UNet++ (ranked second in the a-Preci value), the proposed 2D FAS-UNet obtained a 0.17% improvement of the a-DSC score with only 75% as many parameters as CA-Net and a 0.19% improvement of the a-Preci value with only 22% as many parameters as UNet++. Our method also had a higher a-DSC score than the third-ranked UNet by 0.75%, but had far fewer parameters than the 17.26 M of UNet. Compared with ERFNet, which achieved a-DSC and a-Preci values of 82.86% and 80.48% with the fewest parameters, respectively, FAS-UNet was higher in the overall a-DSC rankings. The a-DSC score of ERFNet ranked last, so we think it may fall into an under-fitting situation. This shows that ERFNet reduces the performance of the network while saving the parameters. CA-Net obtained a good a-DSC value because the attention mechanism may improve the segmentation results of small organs.
The evaluation results were also measured in terms of the a-SSD value for the segmentation predictions of the eight methods. The proposed method achieved an a-SSD value of 5.19 mm (ranked fourth); CE-Net ranked first with an a-SSD value of 4.08 mm; ERFNet, with the fewest parameters, ranked fifth with 5.92 mm. CA-Net and LinkNet ranked second and third, with 2.78 M and 21.79 M parameters, respectively. The good results of CA-Net in terms of the a-SSD metric may be due to the use of multiple attention mechanisms, which enable the network to suppress the background region and give it a stronger ability to recognize the object region. The a-SSD score of UNet++ was much higher than that of 2D FAS-UNet, which shows that its over-segmented pixels were fewer than its under-segmented pixels. Meanwhile, we observed that the a-SSD value of our method was close to that of LinkNet in three organs; only the tracheal region was significantly worse, which makes LinkNet better than our method in the mean a-SSD score.
Figure 3 shows the visualizations of the segmented predictions obtained by the popular methods. One can observe that all methods except CA-Net (which shows obvious over-segmentation) can accurately segment the aortic region (red). The reason may be that, during the imaging process, the aortic organ may be assigned to the very same image pixel value, which leads to a small difference in the internal pixel values; in particular, the network can accurately learn its features. Almost all methods can also approximately segment the tracheal organ shape (green); only the organ boundary is not clearly visible. This may be a common problem in small object segmentation because hard-to-detect small-scale features degrade rapidly with convolution and pooling. All methods were able to extract the heart location (magenta), while the visual quality of the proposed method was significantly better than that of the state-of-the-art methods. Our approach produced the fewest missing pixels at the organ boundary, which resulted in substantially better performance than the existing results, while the visualization remained comparable. It should also be emphasized that only a few methods performed well on the left boundary because the left side of the heart's boundary is very blurry. For example, UNet++ presented over-segmentation, CA-Net on the left boundary was significantly different from the ground-truth, and the results of ERFNet and LinkNet showed significant differences from the ground-truth in shape. This also shows that our method is robust to heavy occlusion of illumination and large background clutter.
For the esophagus organ (blue) in Figure 3, the segmented results of CE-Net, CPFNet, and UNet showed significant differences from the ground-truth in shape. The results of ERFNet and UNet++ contained some trachea pixels within the esophagus, and LinkNet's prediction contained some aortic pixels, all of which were clearly mis-segmented. The predictions of CA-Net and FAS-UNet were similar to the manual segmentation, but the result of CA-Net had a small aorta patch in the background region, while our method obviously performed better. The reasons for the poor esophagus results are the fuzzy boundary and the small pixel value difference; in particular, the esophagus and aorta almost overlap in the second slice. Therefore, the proposed method is more robust and less easily affected by noise. This further shows that our method is effective in medical image segmentation.

3.5. The 3D FAS-UNet for the HVSMR-2016 Datasets

We also conducted the segmentation experiments of our 3D FAS-UNet on the HVSMR-2016 datasets. We compared our predictions with seven baseline models including 3D UNet [46], AnatomyNet [47], DMFNet [48], HDC-Net [49], RSANet [50], Bui-Net [51], and VoxResNet [52]. Table 8 shows the segmentation results of different methods. Clearly, our method with fewer parameters ranked second in both the a-DSC and a-SSD values and obtained the top rank in mean precision.
The proposed method achieved an a-DSC value of 82.75% (ranked second) with only 1.01 M parameters, trailing the first-ranked 3D UNet by 0.1% in the a-DSC score with only 15% as many parameters as 3D UNet. Meanwhile, our method outperformed the third-ranked DMFNet by 2.84%. Although the numbers of parameters of HDC-Net and AnatomyNet were lower than that of our method, the a-DSC score of 3D FAS-UNet was 3.82% and 4.96% higher than theirs, respectively. One may notice that a black-box (unexplainable) network with a small number of parameters has low segmentation performance, which may be due to under-fitting. However, the number of parameters in RSANet is somewhat large, and its performance was also not good enough; this may be due to too little training data, so the model appears to over-fit. Our method also obtained an a-SSD value of 2.44 mm (ranked second), slightly higher than DMFNet's 2.40 mm (ranked first), and a slight improvement over 3D UNet, whose a-SSD value was 2.56 mm (ranked third). Although 3D FAS-UNet had a slightly higher a-SSD than DMFNet, it had 2.86 M fewer parameters. Compared with HDC-Net and AnatomyNet, which have fewer parameters, our method performed better on the a-SSD metric for the myocardium and blood pool. The a-SSD results show that our method has good performance in segmenting object boundaries. The proposed method obtained the best a-Preci score of 87.90%, which was higher than the second-ranked Bui-Net by 1.10%. Although 3D FAS-UNet ranked third in the number of parameters, all three metrics were better than those of HDC-Net and AnatomyNet with fewer parameters. Therefore, our method achieves a good balance between the number of parameters and performance. Experiments on these datasets showed that the proposed FAS-driven explainable model can be robustly applied to 3D medical segmentation tasks.
Figure 4 visualizes the segmentation results of different methods on two slices of the HVSMR-2016 datasets, and it can be clearly observed that the proposed model highlights fewer over-segmented regions outside the ground-truth compared with the other methods. Meanwhile, our method was hardly affected by voxels in the background region: it did not predict background voxels as blood pool or myocardium, whereas most other methods predicted more background voxels as the object. These results show that those methods are easily affected by noise in the background region.
In general, one can observe that AnatomyNet, VoxResNet, and 3D UNet showed obvious segmentation noise (over-segmented regions). The reason is that, due to the overly simple data pre-processing, the network collects much noise information from the input data, which affects the feature extraction. Several methods presented over-segmentation in the myocardial zoom-in because the pixel values of this organ are very close to the background. The myocardium is structurally distorted, which makes its shape completely different from normal/healthy myocardium. Although VoxResNet did not show this phenomenon, it labeled the middle part of the myocardial region as blood pool, which is also an obvious mis-segmentation. Only 3D UNet and our method performed better; in particular, our method was closer to the ground-truth in shape. Further, all methods had poor segmentation results in the upper myocardial region; the intensity homogeneity between this organ and the upper background indicates that this region is very difficult to segment. We can see from the zoom-in results that many methods show obvious over-segmentation or under-segmentation for the myocardium and blood pool. Compared with Bui-Net and RSANet, the proposed learnable specialized FAS-UNet network still had obvious advantages in this region, and the results were very close to the ground-truth in the myocardial region (red) with respect to shape and size. For the blood pool region (blue), our results did not show significant differences from the other methods.
The proposed network integrates medical image data and the variational convexity MS model and algorithm (FAS scheme), and implements them through convolution-based deep learning, so it may be possible to design specialized modules that automatically satisfy some of the physical invariants for better accuracy and robustness. Qualitative and quantitative experimental results demonstrated the effectiveness and superiority of our method. It can not only correctly locate the position of the myocardium, but also segment the myocardium and blood pool in the complex marginal region. Moreover, the integrity and continuity of our method in the object were well preserved. Overall, it performed better than the existing state-of-the-art methods in 3D medical image segmentation.

3.6. The 3D FAS-UNet for the CHAOS-CT Datasets

In this part, the proposed 3D FAS-UNet was compared with seven baseline models on the 3D CHAOS-CT datasets, including 3D UNet [46], Bui-Net [51], DMFNet [48], 3D ESPNet [53], RSANet [50], RatLesNetV2 [54], and HDC-Net [49].
Firstly, we used a post-processing technique to improve the prediction results, since small undesirable clusters of voxels separated from the largest connected component may be over-segmented and "holes" inside the liver may be under-segmented. Table 9 shows the prediction results of the different methods. Before post-processing, the proposed method achieved an a-DSC of 96.69% (top-ranked) with only 1.00 M parameters, which is slightly higher than the second-ranked RSANet by 0.08% in the mean a-DSC score with only about 4% as many parameters as RSANet, and it further outperformed the third-ranked DMFNet by 0.28%. Our method also obtained an a-SSD value of 4.04 mm (ranked second), the same as RSANet (top-ranked). Although 3D FAS-UNet had a lower mean a-Preci value than 3D ESPNet by 1.54%, it had 2.57 M fewer parameters.
The a-DSC scores of all methods were significantly improved by post-processing techniques. The 3D FAS-UNet achieved an a-DSC score of 97.11% (ranked second) and followed the first-ranked 3D UNet by 0.05% in the mean a-DSC with only 15% as many parameters as 3D UNet. Our method also obtained an a-SSD value of 1.16 mm (top-ranked), and it was better than 3D UNet (ranked 2nd) and RSANet (ranked 3rd) by 0.07 mm and 0.10 mm, respectively. The 3D FAS-UNet achieved an a-Preci of 96.73%, which is lower than 3D ESPNet and RatLesNetV2. Although 3D FAS-UNet ranked third in the parameter evaluation, all five metrics were better than HDC-Net and RatLesNetV2 with fewer parameters; only the a-Preci value with post-processing was lower than that of RatLesNetV2. Thus, our method achieves a good balance between the number of parameters and segmentation performance.
Figure 5 visualizes the prediction results of the different networks on two slices of the CHAOS-CT datasets; it can be clearly observed that the other networks produced more over-segmented regions outside the ground-truth than our network. Furthermore, the results of Bui-Net, DMFNet, HDC-Net, RatLesNetV2, and RSANet show obvious segmentation noise in the background region. Although 3D ESPNet did not present this phenomenon, its result shows an obvious "hole" inside the liver, which is a clear mis-segmentation. Our approach did not show these evident inaccuracies. In addition, most methods showed over-segmentation or under-segmentation on the boundaries of the liver, because these boundaries are very blurred in the CT images. From the zoom-in results of the first two rows of Figure 5, we can see that all of the compared methods had obvious under-segmentation except HDC-Net and our method, and our method had less noise in the background region. From the zoom-in results of the last two rows of Figure 5, we observe that many methods showed obvious over-segmentation on the liver boundaries. Only our method and 3D ESPNet performed well in this region, but 3D ESPNet extracted a "hole" in the liver region. In summary, compared with the other methods, the boundaries obtained by our method were smoother, and the shape of the liver was more similar to the ground-truth.
We show visual comparisons before and after post-processing in Figure 6. The results demonstrate that the "hole" (first row of Figure 6) was effectively filled and the "island" in the background (second row of Figure 6) was removed by the post-processing. These experiments indicate that our model-driven approach combined with post-processing is more effective.
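A minimal sketch of this post-processing step, assuming a binary liver mask and standard scipy.ndimage routines (the released code may differ in detail), keeps only the largest connected component and fills interior holes:

```python
import numpy as np
from scipy import ndimage

def postprocess_liver(mask):
    """Keep the largest connected component and fill holes in a binary 3D mask."""
    mask = mask.astype(bool)
    # Remove "islands": label connected components and keep the largest one.
    labels, num = ndimage.label(mask)
    if num > 1:
        sizes = ndimage.sum(mask, labels, range(1, num + 1))
        mask = labels == (np.argmax(sizes) + 1)
    # Fill "holes" fully enclosed inside the organ.
    mask = ndimage.binary_fill_holes(mask)
    return mask.astype(np.uint8)
```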
Qualitative and quantitative results demonstrated the effectiveness and advantages of the proposed method. Our method achieved robust and accurate performance compared with the existing state-of-the-art methods in the 3D liver segmentation task and can be applied to other 3D medical image segmentation problems.

4. Conclusions

In this work, we proposed a novel deep learning framework, FAS-UNet, for 2D and 3D medical image segmentation by enforcing some of the mathematical and physical laws (e.g., the convexity Mumford–Shah model and the FAS algorithm); the framework focuses on learning multiscale image features to generate the segmentation results.
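For reference, the classical FAS two-grid step that motivates the feature extraction module can be written in the standard form of the multigrid literature [33]; the learned network replaces the restriction, prolongation, and smoothing operators with convolutions, so this is the motivating scheme rather than the exact learned update:

$$
\mathcal{A}_H(u_H) \;=\; \mathcal{A}_H\!\left(I_h^H u_h\right) + I_h^H\!\left(f_h - \mathcal{A}_h(u_h)\right),
\qquad
u_h \;\leftarrow\; u_h + I_H^h\!\left(u_H - I_h^H u_h\right),
$$

where $\mathcal{A}_h$ and $\mathcal{A}_H$ denote the nonlinear fine- and coarse-grid operators, $I_h^H$ and $I_H^h$ are the restriction and prolongation operators, and the coarse-grid problem is solved (approximately) before the correction is interpolated back to the fine grid.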
Compared with other existing works that analyze the connection between multigrid methods and CNNs, FAS-UNet integrates medical image data and mathematical models, strengthens the connection between data-driven and traditional variational methodologies, and provides a helpful viewpoint for designing image segmentation network architectures. Compared with UNet, the proposed FAS-UNet introduces the concept of the data space, which exploits model prior information to extract features. Specifically, the feature extraction task leads to solving nonlinear equations, and an iterative scheme from numerical algorithms is designed to learn the features. Our experimental results showed that the proposed method improves medical image segmentation in different tasks, including the segmentation of thoracic organs at risk, the whole heart and great vessels, and the liver. We believe the approach is general and can be applied to other image processing tasks, such as image denoising and image reconstruction. In addition, we found that the topological interaction module proposed in [55] can effectively improve the performance of many segmentation methods; we will therefore incorporate this module into FAS-UNet in future work.

Author Contributions

Conceptualization, H.Z., S.S. and J.Z.; methodology, H.Z., S.S. and J.Z.; software, H.Z.; validation, H.Z. and J.Z.; formal analysis, H.Z., S.S. and J.Z.; investigation, H.Z., S.S. and J.Z.; resources, H.Z., S.S. and J.Z.; data curation, H.Z.; writing—original draft preparation, H.Z. and J.Z.; writing—review and editing, H.Z., S.S. and J.Z.; visualization, H.Z.; supervision, S.S. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC) under Grants 11971414 and 11771369 and partly by grants from the Natural Science Foundation of Hunan Province (Grants 2018JJ2375, 2018XK2304, and 2018WK4006).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and datasets are available at https://github.com/zhuhui100/FASUNet (accessed on 2 October 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Minaee, S.; Boykov, Y.Y.; Porikli, F.; Plaza, A.J.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
  2. Boveiri, H.R.; Khayami, R.; Javidan, R.; Mehdizadeh, A. Medical Image Registration Using Deep Neural Networks: A Comprehensive Review. Comput. Electr. Eng. 2020, 87, 106767. [Google Scholar] [CrossRef]
  3. Cai, L.; Gao, J.; Zhao, D. A Review of the Application of Deep Learning in Medical Image Classification and Segmentation. Ann. Transl. Med. 2020, 8, 713. [Google Scholar] [CrossRef] [PubMed]
  4. Chen, C.; Qin, C.; Qiu, H.; Tarroni, G.; Duan, J.; Bai, W.; Rueckert, D. Deep Learning for Cardiac Image Segmentation: A Review. Front. Cardiovasc. Med. 2020, 7, 25. [Google Scholar] [CrossRef]
  5. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; van Ginneken, B.; Sánchez, C.I. A Survey on Deep Learning in Medical Image Analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Miotto, R.; Wang, F.; Wang, S.; Jiang, X.; Dudley, J.T. Deep Learning for Healthcare: Review, Opportunities and Challenges. Briefings Bioinform. 2018, 19, 1236–1246. [Google Scholar] [CrossRef] [PubMed]
  7. Fu, Y.; Lei, Y.; Wang, T.; Curran, W.J.; Liu, T.; Yang, X. Deep Learning in Medical Image Registration: A Review. Phys. Med. Biol. 2020, 65, 20TR01. [Google Scholar] [CrossRef] [Green Version]
  8. Liu, X.; Song, L.; Liu, S.; Zhang, Y. A Review of Deep-Learning-Based Medical Image Segmentation Methods. Sustainability 2021, 13, 1224. [Google Scholar] [CrossRef]
  9. Ronneberger, O.; Fischer, P.; Brox, T. UNet: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer: Cham, Switzerland, 2015. [Google Scholar]
  10. Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar] [CrossRef] [Green Version]
  11. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar] [CrossRef]
  12. Cicek, O.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D UNet: Learning Dense Volumetric Segmentation from Sparse Annotation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2016: 19th International Conference, Athens, Greece, 17–21 October 2016; Lecture Notes in Computer Science; Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W., Eds.; Springer International Publishing: Cham, Switzerland, 2016; Volume 9901, pp. 424–432. [Google Scholar] [CrossRef] [Green Version]
  13. Mehta, S.; Mercan, E.; Bartlett, J.; Weaver, D.; Elmore, J.G.; Shapiro, L. Y-Net: Joint Segmentation and Classification for Diagnosis of Breast Biopsy Images. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2018; Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 893–901. [Google Scholar]
  14. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted Res-UNet for High-Quality Retina Vessel Segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331. [Google Scholar] [CrossRef]
  15. Valanarasu, J.M.J.; Sindagi, V.A.; Hacihaliloglu, I.; Patel, V.M. KiU-Net: Towards Accurate Segmentation of Biomedical Images Using Over-Complete Representations. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2020; Lecture Notes in Computer Science; Martel, A.L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12264, pp. 363–373. [Google Scholar] [CrossRef]
  16. Li, X.; Chen, H.; Qi, X.; Dou, Q.; Fu, C.W.; Heng, P.A. H-DenseUNet: Hybrid Densely Connected UNet for Liver and Tumor Segmentation From CT Volumes. IEEE Trans. Med. Imaging 2018, 37, 2663–2674. [Google Scholar] [CrossRef] [Green Version]
  17. Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A Self-Configuring Method for Deep Learning-Based Biomedical Image Segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef] [PubMed]
  18. He, X.; Mo, Z.; Wang, P.; Liu, Y.; Yang, M.; Cheng, J. ODE-Inspired Network Design for Single Image Super-Resolution. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1732–1741. [Google Scholar]
  19. Lu, Y.; Zhong, A.; Li, Q.; Dong, B. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. In Proceedings of the 35th International Conference on Machine Learning, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 3276–3285. [Google Scholar]
  20. Gu, R.; Wang, G.; Song, T.; Huang, R.; Aertsen, M.; Deprest, J.; Ourselin, S.; Vercauteren, T.; Zhang, S. CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 699–711. [Google Scholar] [CrossRef] [PubMed]
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  22. Larsson, G.; Maire, M.; Shakhnarovich, G. FractalNet: Ultra-Deep Neural Networks without Residuals. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017. [Google Scholar]
  23. Zhang, X.; Li, Z.; Change Loy, C.; Lin, D. Polynet: A pursuit of structural diversity in very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 718–726. [Google Scholar]
  24. Gomez, A.N.; Ren, M.; Urtasun, R.; Grosse, R.B. The reversible residual network: Backpropagation without storing activations. arXiv 2017, arXiv:1707.04585. [Google Scholar]
  25. Chen, R.T.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural ordinary differential equations. arXiv 2018, arXiv:1806.07366. [Google Scholar]
  26. Yang, Y.; Sun, J.; Li, H.; Xu, Z. Deep ADMM-Net for Compressive Sensing MRI. In Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016; Lee, D.D., Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 10–18. [Google Scholar]
  27. He, J.; Xu, J. MgNet: A unified framework of multigrid and convolutional neural network. Sci. China Math. 2019, 62, 1331–1354. [Google Scholar] [CrossRef] [Green Version]
  28. Alt, T.; Schrader, K.; Augustin, M.; Peter, P.; Weickert, J. Connections between numerical algorithms for PDEs and neural networks. arXiv 2021, arXiv:2107.14742. [Google Scholar] [CrossRef]
  29. Heyer, D.; Mausfeld, R. Perception and the Physical World: Psychological and Philosophical Issues in Perception; John Wiley & Sons, Ltd.: New York, NY, USA, 2002. [Google Scholar]
  30. Cai, X.; Chan, R.; Zeng, T. A Two-Stage Image Segmentation Method Using a Convex Variant of the Mumford–Shah Model and Thresholding. SIAM J. Imaging Sci. 2013, 6, 368–390. [Google Scholar] [CrossRef] [Green Version]
  31. Liu, C.; Ng, M.K.P.; Zeng, T. Weighted Variational Model for Selective Image Segmentation with Application to Medical Images. Pattern Recognit. 2018, 76, 367–379. [Google Scholar] [CrossRef]
  32. Ma, Q.; Peng, J.; Kong, D. Image Segmentation via Mean Curvature Regularized Mumford–Shah Model and Thresholding. Neural Process. Lett. 2018, 48, 1227–1241. [Google Scholar] [CrossRef]
  33. McCormick, S.F. Multigrid Methods; SIAM: Philadelphia, PA, USA, 1987. [Google Scholar]
  34. Mumford, D.B.; Shah, J. Optimal approximations by piecewise smooth functions and associated variational problems. Commun. Pure Appl. Math. 1989, 42, 577–685. [Google Scholar] [CrossRef] [Green Version]
  35. Zhou, D.X. Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 2020, 48, 787–794. [Google Scholar] [CrossRef] [Green Version]
  36. Taha, A.A.; Hanbury, A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med. Imaging 2015, 15, 29. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  37. Dice, L.R. Measures of the Amount of Ecologic Association Between Species. Ecology 1945, 26, 297–302. [Google Scholar] [CrossRef]
  38. Lambert, Z.; Petitjean, C.; Dubray, B.; Kuan, S. SegTHOR: Segmentation of Thoracic Organs at Risk in CT images. In Proceedings of the 2020 Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA), Paris, France, 9–12 November 2020; pp. 1–6. [Google Scholar] [CrossRef]
  39. Pace, D.F.; Dalca, A.V.; Geva, T.; Powell, A.J.; Moghari, M.H.; Golland, P. Interactive Whole-Heart Segmentation in Congenital Heart Disease. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer: Cham, Switzerland, 2015; Volume 9351, pp. 80–88. [Google Scholar]
  40. Kavur, A.E.; Gezer, N.S.; Barış, M.; Aslan, S.; Conze, P.H.; Groza, V.; Pham, D.D.; Chatterjee, S.; Ernst, P.; Özkan, S.; et al. CHAOS challenge-combined (CT-MR) healthy abdominal organ segmentation. Med. Image Anal. 2021, 69, 101950. [Google Scholar] [CrossRef] [PubMed]
  41. Kavur, A.E.; Selver, M.A.; Dicle, O.; Barış, M.; Gezer, N.S. CHAOS—Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge Data. Zenodo 2019. [Google Scholar] [CrossRef]
  42. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. CE-Net: Context Encoder Network for 2D Medical Image Segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292. [Google Scholar] [CrossRef] [Green Version]
  43. Feng, S.; Zhao, H.; Shi, F.; Cheng, X.; Wang, M.; Ma, Y.; Xiang, D.; Zhu, W.; Chen, X. CPFNet: Context Pyramid Fusion Network for Medical Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 3008–3018. [Google Scholar] [CrossRef]
  44. Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2018, 19, 263–272. [Google Scholar] [CrossRef]
  45. Chaurasia, A.; Culurciello, E. Linknet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4. [Google Scholar]
  46. Cicek, O.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D UNet: Learning Dense Volumetric Segmentation from Sparse Annotation; Springer: Cham, Switzerland, 2016; pp. 424–432. [Google Scholar]
  47. Zhu, W.; Huang, Y.; Zeng, L.; Chen, X.; Liu, Y.; Qian, Z.; Du, N.; Fan, W.; Xie, X. AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy. Med. Phys. 2019, 46, 576–589. [Google Scholar] [CrossRef] [Green Version]
  48. Chen, C.; Liu, X.; Ding, M.; Zheng, J.; Li, J. 3D Dilated Multi-Fiber Network for Real-time Brain Tumor Segmentation in MRI. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), Shenzhen, China, 13–17 October 2019. [Google Scholar]
  49. Luo, Z.; Jia, Z.; Yuan, Z.; Peng, J. HDC-Net: Hierarchical decoupled convolution network for brain tumor segmentation. IEEE J. Biomed. Health Inform. 2020, 25, 737–745. [Google Scholar] [CrossRef]
  50. Zhang, H.; Zhang, J.; Zhang, Q.; Kim, J.; Zhang, S.; Gauthier, S.A.; Spincemaille, P.; Nguyen, T.D.; Sabuncu, M.; Wang, Y. Rsanet: Recurrent slice-wise attention network for multiple sclerosis lesion segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019; pp. 411–419. [Google Scholar]
  51. Bui, T.D.; Shin, J.; Moon, T. Skip-connected 3D DenseNet for volumetric infant brain MRI segmentation. Biomed. Signal Process. Control 2019, 54, 101613. [Google Scholar] [CrossRef]
  52. Chen, H.; Dou, Q.; Yu, L.; Qin, J.; Heng, P.A. VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images. NeuroImage 2018, 170, 446–455. [Google Scholar] [CrossRef]
  53. Nuechterlein, N.; Mehta, S. 3D-ESPNet with pyramidal refinement for volumetric brain tumor image segmentation. In Proceedings of the International MICCAI Brainlesion Workshop, Granada, Spain, 16 September 2018; pp. 245–253. [Google Scholar]
  54. Valverde, J.M.; Shatillo, A.; De Feo, R.; Gröhn, O.; Sierra, A.; Tohka, J. Ratlesnetv2: A fully convolutional network for rodent brain lesion segmentation. Front. Neurosci. 2020, 14, 610239. [Google Scholar] [CrossRef] [PubMed]
  55. Gupta, S.; Hu, X.; Kaan, J.; Jin, M.; Mpoy, M.; Chung, K.; Singh, G.; Saltz, M.; Kurc, T.; Saltz, J.; et al. Learning Topological Interactions for Multi-Class Medical Image Segmentation. arXiv 2022, arXiv:2207.09654. [Google Scholar]
Figure 1. Classical variational image segmentation and model-inspired learning method. (a) The first stage solves the nonlinear differential equations using the classical iterative method, and then, the second stage thresholds the smooth solution in the first stage to extract objects. (b) The first stage learns the solution mapping T_K(f; θ_1) by optimizing the convolution kernel θ_1 to extract image features. The second stage learns the feature fusion and segmentation thresholding parameter.
Figure 2. The overall flowchart of the proposed feature extraction module (FAS-Solution module) with the multigrid architecture. It consists of three major ingredients, i.e., pre-/coarsest-/post-smoothing blocks (LSB/CSB/RSB), feature downsample block (FDB), and feature correction block (FCB).
Figure 3. Comparison with the other state-of-the-art networks on the validation set of the 2D SegTHOR datasets. The blue, pink, green, and red regions represent the esophagus, heart, trachea, and aorta, respectively. From left to right in the first and third rows: the original images, ground-truth, and segmentation results of CANet, CE-Net, and CPFNet, respectively. From left to right in the second and fourth rows: the segmentation results of ERFNet, LinkNet, 2D UNet, UNet++, and 2D FAS-UNet, respectively.
Figure 4. Visualizations of different methods for cardiovascular MR segmentation of different slices. From left to right in the first and third rows: the original images, the ground-truth, and the segmentation results of AnatomyNet, Skip-connected 3D DenseNet, and DMFNet, respectively. From left to right in the second and fourth rows: the segmentation results of HDC-Net, RSANet, 3D UNet, VoxResNet, and 3D FAS-UNet, respectively. The blue and red colors represent blood pool and myocardium, respectively.
Figure 5. Visualizations (without post-processing) of different methods for liver CT segmentation of two slices. From left to right in the first and third rows: the original images, ground-truth, and the segmentation results of Skip-connected 3D DenseNet, DMFNet, and HDC-Net, respectively. From left to right in the second and fourth rows: the segmentation results of 3D ESPNet, RatLesNetV2, RSANet, 3D UNet, and 3D FAS-UNet, respectively.
Figure 6. The segmentation results with post-processing. From left to right: the selected scans of the validation set (first column), ground-truths (second column), segmentation results of 3D FAS-UNet without post-processing (third column), and results with post-processing (fourth column).
Table 1. A standard configuration of the proposed 2D FAS-UNet. p × H × W denotes the number of channels and the height and width of the feature maps, respectively. “conv3, s = 1 or 2” indicates the convolution with a kernel size of 3 × 3 and the stride of convolution being 1 or 2, respectively. ReLU and BN are the ReLU activation function and batch normalization.
Name2D OperationOutput SizeName2D OperationOutput Size
Inputf,   initialization u 0 1 p × H × W
LSB1 conv 3 , s = 1 ReLU + BN conv 3 , s = 1 ReLU + BN × k l p × H × W RSB1 conv 3 , s = 1 ReLU + BN conv 3 , s = 1 ReLU + BN × k r p × H × W
FDB1conv3, s = 2 for
conv3, s = 2 for u
p × H 2 × W 2 FCB1deconv3, s = 2 p × H × W
LSB2 conv 3 , s = 1 ReLU + BN conv 3 , s = 1 ReLU + BN × k l p × H 2 × W 2 RSB2 conv 3 , s = 1 ReLU + BN conv 3 , s = 1 ReLU + BN × k r p × H 2 × W 2
FDB2conv3, s = 2 for
conv3, s = 2 for u
p × H 4 × W 4 FCB2deconv3, s = 2 p × H 2 × W 2
LSB3 conv 3 , s = 1 ReLU + BN conv 3 , s = 1 ReLU + BN × k l p × H 4 × W 4 RSB3 conv 3 , s = 1 ReLU + BN conv 3 , s = 1 ReLU + BN × k r p × H 4 × W 4
FDB3conv3, s = 2 for
conv3, s = 2 for u
p × H 8 × W 8 FCB3deconv3, s = 2 p × H 4 × W 4
LSB4 conv 3 , s = 1 ReLU + BN conv 3 , s = 1 ReLU + BN × k l p × H 8 × W 8 RSB4 conv 3 , s = 1 ReLU + BN conv 3 , s = 1 ReLU + BN × k r p × H 8 × W 8
FDB4conv3, s = 2 for
conv3, s = 2 for u
p × H 16 × W 16 FCB4deconv3, s = 2 p × H 8 × W 8
CSB conv 3 , s = 1 ReLU + BN conv 3 , s = 1 ReLU + BN × k m p × H 16 × W 16
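To make the repeated "conv3, s = 1, ReLU + BN" pattern listed in Table 1 concrete, the following PyTorch-style sketch shows what one smoothing block (as used in the LSB/CSB/RSB blocks) might look like; the class name SmoothingBlock and the argument num_iters are illustrative and not taken from the released code, which may organize the layers differently.

```python
import torch.nn as nn

class SmoothingBlock(nn.Module):
    """k iterations of two (conv3x3, stride 1) -> ReLU -> BatchNorm layers,
    keeping the channel number p and the spatial size unchanged (cf. Table 1)."""
    def __init__(self, p, num_iters):
        super().__init__()
        layers = []
        for _ in range(num_iters):      # k_l, k_m, or k_r smoothing iterations
            for _ in range(2):          # two conv3-ReLU-BN pairs per iteration
                layers += [
                    nn.Conv2d(p, p, kernel_size=3, stride=1, padding=1),
                    nn.ReLU(inplace=True),
                    nn.BatchNorm2d(p),
                ]
        self.body = nn.Sequential(*layers)

    def forward(self, u):
        return self.body(u)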
Table 2. Quantitative assessment with the a-DSC, a-SSD, and a-Preci values of different pre-/coarsest-/post-smoothing iteration numbers {k_l, k_m, k_r} using the proposed 2D FAS-UNet framework on the 2D SegTHOR validation datasets with four organs: esophagus, heart, trachea, and aorta. "Mean" denotes the average score over all organs. The best and second places are highlighted in bold font and underlined, respectively.
Number of Smoothing StepsFixing k m and k r , Varying k l Fixing k l and k m , Varying k r
{2, 7, 2}{3, 7, 2}{4, 7, 2}{3, 7, 3}{3, 7, 4}{3, 7, 5}
Params0.41 M0.44 M0.48 M0.48 M0.52 M0.56 M
a-DSC (%)Esophagus73.56 ± 9.5874.55 ± 8.8371.82 ± 9.4274.06 ± 9.8074.24 ± 8.0173.19 ± 9.79
Heart93.63 ± 2.2292.96 ± 3.4693.21 ± 2.2892.98 ± 1.5594.22 ± 1.5292.64 ± 4.72
Trachea83.06 ± 4.3385.78 ± 4.3284.22 ± 4.0486.43 ± 4.0184.47 ± 5.3384.54 ± 6.81
Aorta89.78 ± 4.6689.09 ± 4.989.96 ± 5.7888.27 ± 5.1389.39 ± 7.2692.11 ± 1.36
Mean85.0185.6084.8085.4485.5885.62
a-SSD (mm)Esophagus4.16 ± 1.912.67 ± 0.983.35 ± 1.343.22 ± 1.142.67 ± 0.792.69 ± 1.06
Heart4.61 ± 3.9712.48 ± 16.5316.54 ± 24.382.66 ± 0.941.94 ± 0.6618.5 ± 27.98
Trachea5.22 ± 2.373.18 ± 1.156.61 ± 4.574.80 ± 4.703.01 ± 2.486.14 ± 4.48
Aorta4.14 ± 2.335.41 ± 3.322.16 ± 1.222.26 ± 0.752.82 ± 1.335.27 ± 4.90
Mean4.535.947.173.242.618.15
a-Preci (%)Esophagus79.32 ± 7.9877.58 ± 6.9674.22 ± 11.2576.66 ± 7.9479.67 ± 6.6782.10 ± 6.82
Heart95.73 ± 3.5792.84 ± 6.1395.10 ± 4.6096.34 ± 3.6496.60 ± 2.2792.70 ± 8.71
Trachea77.01 ± 8.8385.75 ± 6.5678.43 ± 6.7884.60 ± 7.8290.98 ± 6.7380.29 ± 10.06
Aorta89.70 ± 3.3887.79 ± 5.6890.51 ± 3.8492.28 ± 2.9891.76 ± 2.7890.87 ± 3.10
Mean85.4485.9984.5687.4789.7586.49
Table 3. Quantitative assessment with the a-DSC, a-SSD, and a-Preci values of different initialization techniques (zero initialization, random initialization, and feature extraction initialization ψ ( K 0 ( f ) ) ) using the proposed 2D FAS-UNet framework on the 2D SegTHOR validation datasets with four organs: esophagus, heart, trachea, and aorta. “Mean” denotes an average score segmenting all organs. The best and second places are highlighted in bold font and underlined, respectively.
Feature InitializationZero InitiRandom Initi ψ ( K 0 ( f ) ) Initi
Params0.52 M0.52 M0.52 M
a-DSC (%)Esophagus72.56 ± 11.6675.84 ± 10.0274.24 ± 8.01
Heart92.40 ± 4.4892.37 ± 5.0694.22 ± 1.52
Trachea84.21 ± 4.2785.08 ± 3.3884.47 ± 5.33
Aorta91.17 ± 4.4989.81 ± 4.2289.39 ± 7.26
Mean85.0885.7785.58
a-SSD (mm)Esophagus2.39 ± 1.22.85 ± 0.872.67 ± 0.79
Heart15.38 ± 21.816.82 ± 32.151.94 ± 0.66
Trachea5.67 ± 2.984.66 ± 3.593.01 ± 2.48
Aorta2.85 ± 1.977.59 ± 8.942.82 ± 1.33
Mean6.577.982.61
a-Preci (%)Esophagus81.8 ± 6.4181.3 ± 6.4979.67 ± 6.67
Heart92.53 ± 8.4393.31 ± 8.2596.6 ± 2.27
Trachea79.92 ± 8.3479.0 ± 5.4690.98 ± 6.73
Aorta91.92 ± 2.7588.57 ± 5.2891.76 ± 2.78
Mean86.5585.5589.75
Table 4. Quantitative assessment with the a-DSC, a-SSD, and a-Preci values of using the proposed 2D FAS-UNet framework with/without weight sharing of pre-/post smoothing steps on the 2D SegTHOR validation datasets with four organs: esophagus, heart, trachea, and aorta. “Mean” denotes an average score segmenting all organs. The best and second places are highlighted in bold font and underlined, respectively.
Block Weight Sharingw/oLSBRSBLSB and RSB
Params0.52 M0.45 M0.41 M0.34 M
a-DSC (%)Esophagus74.24 ± 8.0172.53 ± 10.1773.01 ± 9.9172.53 ± 10.28
Heart94.22 ± 1.5290.51 ± 6.9889.33 ± 5.1192.00 ± 3.97
Trachea84.47 ± 5.3384.14 ± 5.2184.55 ± 2.6083.37 ± 4.10
Aorta89.39 ± 7.2690.46 ± 2.9990.17 ± 4.3090.03 ± 2.40
Mean85.5884.4184.2684.48
a-SSD (mm)Esophagus2.67 ± 0.793.22 ± 0.972.69 ± 0.724.26 ± 1.91
Heart1.94 ± 0.6622.37 ± 16.9934.91 ± 31.7614.51 ± 19.82
Trachea3.01 ± 2.486.30 ± 4.505.17 ± 3.365.96 ± 3.38
Aorta2.82 ± 1.334.64 ± 3.874.61 ± 1.788.12 ± 5.49
Mean2.619.1311.858.21
a-Preci (%)Esophagus79.67 ± 6.6777.94 ± 7.6777.36 ± 6.6074.64 ± 8.14
Heart96.60 ± 2.2788.05 ± 11.5088.72 ± 9.1492.19 ± 7.80
Trachea90.98 ± 6.7382.34 ± 9.2678.14 ± 5.0077.36 ± 8.26
Aorta91.76 ± 2.7888.84 ± 3.5787.08 ± 6.0985.82 ± 3.77
Mean89.7584.2982.8282.50
Table 5. Quantitative assessment with the a-DSC, a-SSD, and a-Preci values of different channel numbers p using the proposed 2D FAS-UNet framework on the 2D SegTHOR validation datasets with four organs: esophagus, heart, trachea, and aorta. “Mean” denotes an average score segmenting all organs. The best and second places are highlighted in bold font and underlined, respectively.
Number of Channels8064483216
Params3.24 M2.08 M1.17 M0.52 M0.13 M
a-DSC (%)Esophagus75.20 ± 11.5375.34 ± 12.5974.50 ± 11.6274.24 ± 8.0170.30 ± 11.04
Heart94.19 ± 1.9593.79 ± 1.7992.04 ± 5.6694.22 ± 1.5293.17 ± 2.41
Trachea84.77 ± 4.0886.97 ± 5.0587.29 ± 3.6584.47 ± 5.3385.86 ± 3.62
Aorta91.46 ± 5.1491.20 ± 3.4590.20 ± 6.7689.39 ± 7.2690.17 ± 5.29
Mean86.4186.8386.0185.5884.88
a-SSD (mm)Esophagus2.72 ± 0.982.49 ± 1.222.37 ± 1.262.67 ± 0.793.18 ± 1.71
Heart8.33 ± 12.28.45 ± 16.5412.79 ± 26.801.94 ± 0.6612.11 ± 18.35
Trachea7.52 ± 4.884.61 ± 3.503.66 ± 2.633.01 ± 2.483.20 ± 1.95
Aorta1.99 ± 1.025.20 ± 4.922.80 ± 1.732.82 ± 1.332.38 ± 1.33
Mean5.145.195.402.615.22
a-Preci (%)Esophagus81.02 ± 6.2181.19±10.7080.33 ± 8.2379.67 ± 6.6777.39 ± 8.61
Heart95.84 ± 3.9095.85 ± 3.4591.52 ± 8.8996.60 ± 2.2794.31 ± 2.92
Trachea80.30 ± 8.4484.64 ± 9.5086.50 ± 8.4190.98 ± 6.7385.14 ± 9.38
Aorta93.04 ± 2.3691.17 ± 3.8191.93 ± 3.6191.76 ± 2.7889.85 ± 3.56
Mean87.5588.2187.5789.7586.67
Table 6. The hyperparameter configurations of the proposed FAS-UNet framework on different datasets.
Datasets | Layer Number L | Feature Number p | {k_l, k_m, k_r} | Weight Sharing on K^{q,·,j} | Feature Initialization | Params
2D SegTHOR | 5 | 64 | {3, 7, 4} | No | ψ(K^0(f)) | 2.08 M
3D HVSMR 2016 | 4 | 32 | {3, 6, 2} | No | ψ(K^0(f)) | 1.01 M
3D CHAOS CT | 4 | 32 | {3, 5, 2} | No | ψ(K^0(f)) | 1.00 M
Table 7. Performance comparisons between the proposed 2D FAS-UNet and the popular networks using the a-DSC, a-SSD, and a-Preci values on the 2D SegTHOR validation datasets with four organs: esophagus, heart, trachea, and aorta. “Mean” denotes an average score segmenting all organs. The best and second places are highlighted in bold font and underlined, respectively.
Method2D UNetCA-NetCE-NetCPFNetERFNetUNet++LinkNet2D FAS-UNet
[9][20][42][43][44][11][45]
Params17.26 M2.78 M29.00 M30.65 M2.06 M9.05 M21.79 M2.08 M
a-DSC (%)Esophagus73.77 ± 13.2376.26 ± 8.7264.17 ± 10.7866.58 ± 10.0269.68 ± 5.9369.63 ± 9.2865.19 ± 9.1875.34 ± 12.59
Heart93.42 ± 3.27093.61 ± 2.4594.01 ± 1.1792.05 ± 4.3992.59 ± 5.2993.46 ± 2.3693.19 ± 9.1893.79 ± 1.79
Trachea86.71 ± 2.6985.58 ± 4.3885.12 ± 4.3787.41 ± 2.8479.10 ± 5.7985.63 ± 5.7886.16 ± 4.3586.97 ± 5.05
Aorta90.42 ± 3.4591.2 ± 1.5988.38 ± 3.9588.92 ± 4.9890.08 ± 4.0690.61 ± 2.6688.94 ± 3.6191.2 ± 3.45
Mean86.0886.6682.9283.7482.8684.8383.3786.83
a-SSD (mm)Esophagus2.78 ± 1.364.14 ± 1.462.54 ± 0.922.57 ± 0.793.92 ± 0.934.77 ± 1.332.88 ± 1.222.49 ± 1.22
Heart14.37 ± 22.784.87 ± 7.094.57 ± 5.3116.02 ± 22.267.76 ± 15.639.99 ± 11.178.42 ± 17.028.45 ± 16.54
Trachea6.05 ± 6.146.85 ± 4.873.81 ± 2.054.46 ± 3.918.62 ± 4.265.84 ± 5.382.57 ± 1.354.61 ± 3.50
Aorta6.36 ± 6.513.64 ± 1.685.42 ± 1.627.51 ± 5.533.36 ± 2.314.41 ± 2.295.85 ± 4.025.2 ± 4.92
Mean7.394.884.087.645.926.254.935.19
a-Preci (%)Esophagus82.82 ± 4.6982.67 ± 5.4679.14 ± 7.3973.92 ± 7.2968.78 ± 7.8879.83 ± 7.7274.1 ± 7.3281.19 ± 10.70
Heart93.47 ± 5.3994.91 ± 3.7995.26 ± 2.6691.85 ± 8.3992.61 ± 9.6294.27 ± 4.5895.00 ± 4.7195.85 ± 3.45
Trachea85.21 ± 5.9181.72 ± 7.2984.81 ± 9.7484.48 ± 6.9072.46 ± 9.7187.59 ± 9.0583.88 ± 9.2684.64 ± 9.50
Aorta90.01 ± 2.8188.84 ± 4.0486.10 ± 4.7388.84 ± 2.9288.07 ± 3.3590.39 ± 3.5788.91 ± 5.2891.17 ± 3.81
Mean87.8887.0386.3384.7780.4888.0285.4788.21
Table 8. Performance comparisons of the proposed 3D FAS-UNet and the popular networks using the a-DSC, a-SSD, and a-Preci values on the 3D HVSMR-2016 validation datasets with both organs: myocardium and blood pool. “MY” and “BP” denote myocardium and blood pool, respectively, “Mean” denotes an average score segmenting all organs. The best and second places are highlighted in bold font and underlined, respectively.
Method | 3D UNet [46] | AnatomyNet [47] | DMFNet [48] | HDC-Net [49] | RSANet [50] | Bui-Net [51] | VoxResNet [52] | 3D FAS-UNet
Params | 8.16 M | 0.73 M | 3.87 M | 0.30 M | 24.55 M | 2.53 M | 1.70 M | 1.01 M
a-DSC (%), MY | 76.42 ± 3.32 | 69.50 ± 4.69 | 70.30 ± 1.03 | 69.77 ± 5.10 | 71.94 ± 0.69 | 69.84 ± 0.47 | 65.79 ± 6.00 | 76.42 ± 4.38
a-DSC (%), BP | 89.21 ± 1.61 | 86.01 ± 1.51 | 89.45 ± 0.18 | 88.03 ± 2.24 | 84.71 ± 0.93 | 89.4 ± 0.46 | 84.66 ± 1.67 | 89.01 ± 0.18
a-DSC (%), Mean | 82.82 | 77.76 | 79.88 | 78.90 | 78.33 | 79.62 | 75.23 | 82.72
a-SSD (mm), MY | 2.06 ± 0.73 | 4.03 ± 0.31 | 2.74 ± 0.20 | 3.36 ± 0.32 | 2.31 ± 0.30 | 2.56 ± 0.22 | 2.52 ± 1.06 | 2.32 ± 0.27
a-SSD (mm), BP | 3.05 ± 1.53 | 2.98 ± 1.21 | 2.07 ± 0.48 | 2.65 ± 0.02 | 4.01 ± 0.55 | 2.58 ± 0.51 | 3.47 ± 0.04 | 2.57 ± 1.15
a-SSD (mm), Mean | 2.56 | 3.50 | 2.40 | 3.00 | 3.16 | 2.57 | 2.99 | 2.44
a-Preci (%), MY | 81.02 ± 9.13 | 71.25 ± 6.22 | 80.27 ± 4.22 | 67.41 ± 12.94 | 76.76 ± 9.00 | 83.94 ± 6.73 | 77.94 ± 10.40 | 84.61 ± 7.14
a-Preci (%), BP | 87.44 ± 2.62 | 89.73 ± 4.61 | 91.78 ± 0.54 | 87.84 ± 2.09 | 82.86 ± 6.08 | 89.66 ± 0.56 | 84.29 ± 1.13 | 91.18 ± 2.36
a-Preci (%), Mean | 84.23 | 80.49 | 86.03 | 77.62 | 79.81 | 86.80 | 81.11 | 87.90
Table 9. Performance comparisons of liver segmentation between the different 3D networks with/without post-processing using the a-DSC, a-SSD, and a-Preci values on the 3D CHAOS-CT validation datasets. The best and second places are highlighted in bold font and underlined, respectively.
Method | Params | a-DSC (%) w/o post | a-SSD (mm) w/o post | a-Preci (%) w/o post | a-DSC (%) w/ post | a-SSD (mm) w/ post | a-Preci (%) w/ post
3D UNet [46] | 8.16 M | 96.18 ± 1.72 | 5.65 ± 5.15 | 94.87 ± 4.12 | 97.16 ± 0.54 | 1.23 ± 0.28 | 96.71 ± 2.01
Bui-Net [51] | 2.24 M | 89.41 ± 4.86 | 11.54 ± 4.52 | 85.90 ± 8.20 | 92.50 ± 4.02 | 4.42 ± 3.60 | 91.85 ± 7.92
DMFNet [48] | 3.87 M | 96.41 ± 0.99 | 4.45 ± 2.98 | 95.34 ± 3.28 | 96.93 ± 0.47 | 1.29 ± 0.11 | 96.31 ± 2.42
3D ESPNet [53] | 3.57 M | 95.72 ± 0.81 | 4.32 ± 1.74 | 97.46 ± 2.15 | 96.15 ± 0.80 | 1.68 ± 0.67 | 98.33 ± 1.31
RSANet [50] | 24.54 M | 96.61 ± 1.00 | 4.04 ± 2.30 | 95.23 ± 2.70 | 97.08 ± 0.67 | 1.26 ± 0.33 | 96.11 ± 1.95
RatLesNetV2 [54] | 0.83 M | 95.73 ± 1.22 | 7.04 ± 3.12 | 95.13 ± 3.40 | 96.82 ± 0.69 | 1.27 ± 0.26 | 97.23 ± 2.12
HDC-Net [49] | 0.29 M | 95.30 ± 1.64 | 7.50 ± 4.92 | 93.77 ± 4.23 | 96.42 ± 0.67 | 1.30 ± 0.17 | 95.87 ± 2.51
3D FAS-UNet | 1.00 M | 96.69 ± 0.76 | 4.04 ± 3.11 | 95.92 ± 2.78 | 97.11 ± 0.32 | 1.16 ± 0.13 | 96.73 ± 2.08
