Article

Enhanced Liver and Tumor Segmentation Using a Self-Supervised Swin-Transformer-Based Framework with Multitask Learning and Attention Mechanisms

1 Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610209, China
2 The School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3985; https://doi.org/10.3390/app15073985
Submission received: 13 February 2025 / Revised: 1 April 2025 / Accepted: 2 April 2025 / Published: 4 April 2025

Abstract

Automatic liver and tumor segmentation in contrast-enhanced magnetic resonance imaging (CE-MRI) images is of great value in clinical practice, as it can reduce surgeons’ workload and increase the probability of surgical success. However, it remains a challenging task because of the complex background, irregular shapes, and low contrast between organ and lesion. In addition, the size, number, shape, and spatial location of liver tumors vary from person to person, and existing automatic segmentation models are unable to achieve satisfactory results. In this work, drawing inspiration from self-attention mechanisms and multitask learning, we propose a segmentation network that uses Swin-Transformer as the backbone and incorporates self-supervised learning strategies to enhance performance. Accurately segmenting the boundaries and spatial location of liver tumors is the biggest challenge. To address this, we propose a multitask learning strategy based on segmentation and signed distance map (SDM) regression, incorporating an attention gate into the skip connections; the network thus performs liver tumor segmentation and SDM regression simultaneously. The SDM regression branch effectively improves the detection and segmentation of small objects because it imposes additional shape and global constraints on the network. We performed comprehensive quantitative and qualitative evaluations of our approach. The proposed model outperforms existing state-of-the-art models in terms of the DSC, 95HD, and ASD metrics. This research provides a valuable solution that lessens the burden on surgeons and improves the chances of successful surgeries.

1. Introduction

According to Global Cancer Statistics 2020, there were 905,000 new cases of primary liver cancer and 830,000 deaths worldwide in 2020, ranking sixth in incidence and third in mortality among all cancers and imposing a heavy burden on patients [1]. Hepatocellular carcinoma (HCC) is the most common pathological type of liver cancer, accounting for 70–85% of all cases [2]. At present, hepatectomy remains the primary treatment for liver cancer, though its effectiveness heavily relies on the precise segmentation of the organ and lesions. Identifying tumor location and count pre-operatively while maximizing the preservation of liver volume and function is crucial for enhancing long-term survival rates following resection [3].
Compared with computed tomography (CT), contrast-enhanced magnetic resonance imaging has higher soft-tissue resolution and supports multi-sequence, multi-parameter imaging, making it an essential medical imaging method for diagnosing and monitoring HCC. After the contrast agent is injected, CE-MRI images of different phases, such as the arterial, venous, and delayed phases, can be obtained by scanning at different time points. These images have higher contrast and display the lesion more clearly, especially in the arterial phase for liver tumors (shown in Figure 1). Liver and tumor segmentation forms the basis for disease diagnosis, surgical planning, and evaluation of treatment efficacy [4]. Therefore, segmenting the liver and tumors from CE-MRI images is of considerable clinical importance. In conventional practice, clinicians delineate the liver and tumors manually slice by slice, which is tedious and time-consuming and is subject to intra- and inter-observer variability. There is therefore an urgent clinical need to segment the liver and tumors in CE-MRI images automatically and accurately.
In earlier research, various traditional image-processing methods were employed for automatic liver tumor segmentation [5,6,7,8,9,10,11]. These methods primarily focused on structural, grayscale, and texture-based approaches. However, traditional segmentation methods often rely heavily on expert knowledge, which limits their performance. In recent years, convolutional neural networks (CNNs) have achieved remarkable success in medical image segmentation, surpassing traditional methods in terms of accuracy, with the encoder–decoder architecture (U-Net) being the most commonly adopted [12,13,14,15]. Although models based on convolutional neural networks have made substantial progress over traditional techniques, segmenting HCC lesions remains challenging. This is primarily due to the low contrast between the liver and lesions, variability in contrast levels, tissue abnormalities, and significant differences in lesion size, quantity, and morphology [16]. Consequently, CNNs still need enhancements to fulfill clinical requirements. A key limitation of CNNs is their focus on local features, often failing to capture the broader context of an image. When the liver and tumor are difficult to recognize from local features alone, CNN-based networks may not be the optimal framework for segmentation. In contrast to CNNs, Vision Transformers (ViT) [17] offer an innovative architecture that encodes visual information across a series of patches spanning the entire image, overcoming the limited receptive field of CNNs. Among ViT models, Swin-Transformer [18] has emerged as one of the most efficient, achieving state-of-the-art performance on computer vision benchmarks such as BTCV [19], MSD [20], and ImageNet [21].
Inspired by Swin-Transformer, we propose a Swin-Transformer-based network framework for segmenting the liver and tumor in CE-MRI images, addressing the needs of clinical practice and surgical procedures. As illustrated in Figure 1, the exact location and shape of the liver and tumor are distinctly outlined. Following the administration of a contrast agent, the blood flow dynamics of the tumor and normal liver differ, leading to relatively clear boundaries on CE-MRI images. However, the tumor’s boundary can be either smooth or irregular and may vary in size. Thus, accurately locating the tumor within the liver and delineating its boundaries is essential. To capture the global spatial relationships between the liver and tumor, we designed a Swin-Transformer-based U-Net framework that extracts global features from the CE-MRI images. Furthermore, we integrated an attention gate and an SDM regression task into the Swin-Transformer-based encoder–decoder architecture, providing additional shape and global constraints for the segmentation task. These enhancements significantly improve the model’s ability to capture local details of small targets, such as tumors.
Deep-learning methods typically require large amounts of annotated data, but in medical imaging, only a small portion is labeled, leaving much data unannotated. Self-supervised learning is an innovative approach that enables the model to learn relevant visual features for segmentation tasks through pretraining on pretext tasks. Compared to CNN-based architectures, ViT-based frameworks can learn more stable visual representations during pretraining on these tasks [22]. Recent studies have developed creative pretext tasks, which yield promising results when fine-tuned for downstream tasks [23,24,25,26,27,28]. In our approach, we apply contrastive learning to differentiate between various image regions, enabling the model to capture more meaningful data representations. We pretrain the encoder using contrastive learning to acquire valuable visual features, which are then refined through supervised learning in the subsequent training phase.
The main contributions of our work are threefold:
  • We propose a deep-learning architecture based on the Swin-Transformer structure, which is employed to capture the spatial relationships and global information for liver and tumor segmentation.
  • To enhance the Swin-Transformer framework’s ability to capture local details of small targets (such as tumors), we incorporate an attention gate and an SDM regression task into the Swin-Transformer-based encoder–decoder architecture, providing additional shape and global constraints for the segmentation task.
  • We design an upstream pretext task to pretrain the proposed network, enabling it to learn effective visual features from a larger amount of unlabeled data, thereby enhancing the model’s performance.
Our model is evaluated on a dataset of CE-MRI scans from 192 liver cancer patients, with performance compared against other segmentation approaches. Additionally, we conduct several ablation studies to demonstrate the significance of our key design choices.

2. Related Works

2.1. Liver and Tumor Segmentation

Recently, many researchers have proposed liver and tumor segmentation methods based on CNNs. Christ et al. [29] proposed a method for automatically segmenting the liver and tumors in CT and MRI abdominal images using cascaded fully convolutional neural networks: the first FCN segments the liver, and the second FCN segments only the lesions within the liver region predicted in the first step. To extract multiscale features, Zhang et al. [30] proposed a Scale-Axis-Attention (SAA) mechanism to model multiscale features and spatial information and incorporated it into a U-shaped network (SAA-Net) to improve liver tumor segmentation performance. Wang et al. [31] introduced CPAD-Net, a novel network for liver tumor segmentation that integrates parallel contextual attention and dilated convolutions. This network substitutes max-pooling with a subsampling module to retain fine-grained features, integrates a contextual parallel attention module at the skip connections, and combines multiscale contextual features while concurrently extracting channel–space features. Li et al. [32] proposed a deeply supervised network based on channel attention and Res-U-Net++ for segmenting liver CT images, introducing an efficient attention module that combines the deep feature map with spatial information to alleviate the influence of uneven sample distribution. Kushnure et al. [33] proposed HFRU-Net, which modifies the skip connections using a feature fusion mechanism and local feature reconstruction. Jiang et al. [34] proposed a residual multiscale attention U-Net (RMAU-Net) for liver and tumor segmentation by introducing two modules, Res-SE-Block and the Multiscale Attention Block (MAB). Res-SE-Block improves the quality of representations by explicitly modeling interdependencies between feature channels and recalibrating features. MAB leverages rich multiscale feature information while simultaneously capturing inter-channel and spatial relationships of features. Currently, most of these algorithms are based on CT images, and research on CE-MRI images remains limited.

2.2. Multitask Learning

The concept of multitask learning has been widely used in different medical image analysis tasks and across different kinds of medical images. Myronenko et al. [35] proposed a semantic segmentation network based on an encoder–decoder architecture for segmenting tumor sub-regions from 3D MRI, adding a variational auto-encoder branch to reconstruct the input image and thereby impose additional constraints on the network. Chakravarty et al. [36] proposed a multitask convolutional neural network that jointly segments the optic disc and optic cup and predicts the presence or absence of glaucoma in color fundus images. Chen et al. [37] proposed a deep-learning-based, fully automated framework for left atrium segmentation in MRI images; by sharing features between related tasks, the network gains additional anatomical information and achieves more accurate atrial segmentation. Zhou et al. [38] proposed a multitask learning framework for joint segmentation and classification of tumors in ultrasound images, comprising two sub-networks: an encoder–decoder network for segmentation and a lightweight multiscale network for classification. The proposed multitask framework improves both tumor segmentation and classification. Qu et al. [39] proposed a multitask learning framework for joint nucleus segmentation and classification in pathological images, which segments individual nuclei and classifies them as tumor, lymphocyte, or stromal nuclei; a perceptual loss is also used to sharpen the segmentation of details. Compared with the above methods, this paper incorporates shape constraints into the segmentation task via multitask learning.

2.3. Segmentation of Medical Images Using Vision Transformers

Transformers were originally designed for machine translation tasks in natural language processing (NLP) [40]. Inspired by this architecture, ViT was developed for image classification tasks. ViT has shown superior performance over traditional convolutional models, such as ResNet [41], on benchmarks like ImageNet [21]. Models such as TransUNet [42] and UNetR [43] incorporated ViT by replacing or enhancing convolutional layers in the U-Net encoder for both 2D and 3D segmentation tasks. Segmentation networks, like U-Net, rely on hierarchical architectures to capture multiscale information for pixel-level predictions. However, the quadratic computational complexity of self-attention in ViT poses a challenge when applied to high-resolution images. Swin-Transformer addresses this with a shifted window mechanism to build hierarchical encodings. Recent studies, including DS-TransUNet [44] and SwinUNet [45], have successfully employed Swin-Transformer for 2D segmentation tasks, yielding promising outcomes. SwinUNetR [46] combines the spatial representation capabilities of Swin-Transformer in the encoder with a convolutional decoder, utilizing pretraining on large datasets followed by fine-tuning on more specific datasets. This model achieved state-of-the-art results in both the BTCV Multi-Organ Segmentation Challenge (BTCV) [19] and the MSD Challenge (MSD) [20]. In addition, Yang et al. [47] proposed a 3DUNet network based on Swin-Transformer for brain tumor segmentation, which demonstrated highly competitive performance on the BraTS dataset [48,49], demonstrating the effectiveness of Swin-Transformer-based networks in MRI image segmentation.

2.4. Self-Supervised Learning for Medical Image Analysis

Self-supervised learning (SSL) learns effective visual features from unlabeled data by creating supervisory signals in upstream pretext tasks. These learned features are then used as initialization in downstream tasks, where further optimization of the task model takes place. Self-supervised learning has become increasingly popular in medical image segmentation, where annotated datasets are limited, but unlabeled data are abundant. Numerous studies have shown the efficacy of self-supervised learning in this field.
Zheng et al. [50] proposed the MVRL model, which utilizes SSL to improve medical image segmentation. This model incorporates multiscale representations, canvas matching, embedding pre-sampling, a centrality branch, and cross-level consistency loss to reduce dependence on expert annotations and enhance performance. When pretrained on unlabeled data, MVRL surpassed existing competitive methods across various datasets, highlighting its effectiveness in non-labeled segmentation tasks and its ability to manage scale variations. Zhao et al. [51] employed a multimodal network with SSL for brain tumor segmentation. They introduced a pretext task aimed at filling masked regions, which facilitated SSL and strengthened the network’s capability to extract multimodal features and resist noise. Their experimental findings demonstrated that this approach outperformed other methods, validating its efficacy in brain tumor segmentation.
A core aspect of SSL is the creation of appropriate pretext tasks to help the network learn valuable visual feature representations. The three main types of pretext tasks are prediction, generation, and contrastive learning. Predictive self-supervised learning tasks, such as relative position prediction [52], rotation prediction [53], and jigsaw [54], treat the task as a classification or regression problem. Generative self-supervised learning tasks, including methods like variational autoencoders (VAE) [55] and generative adversarial networks (GAN) [56], aim to uncover latent features through the image synthesis process. On the other hand, contrastive learning involves differentiating between similar (positive) and dissimilar (negative) pairs or enhancing consistency across multiple positive pairs. Techniques such as MOCO [24,28], SimCLR [23,27], BYOL [25], and DINO [26] have emerged in this field.

3. Methodology

The structure of the proposed network is illustrated in Figure 2. It comprises an encoder and a decoder path, both based on Swin-Transformer. Each MRI volume is processed in the encoder path as patches. After the encoder, we repeatedly apply upsampling followed by a Swin-Transformer block to generate high-resolution segmentation results and the corresponding SDM. The semantic gap problem is alleviated by skip connections between the encoder and decoder. In addition, an attention-gating mechanism is inserted between the encoder and decoder features to suppress features of irrelevant background regions. Finally, after the decoder, the segmentation result and the signed distance maps of the liver and tumor are output through a Softmax layer and a Tanh layer, respectively. In addition, we use self-supervised learning to pretrain the model. The details are as follows:

3.1. Segmentation Network Based on Swin-Transformer

Our proposed architecture is built around an encoder based on the Swin-Transformer architecture that processes patches linked to a decoder based on the Swin-Transformer architecture through skip connections at multiple stages. The overall workflow of the architecture is depicted in Figure 2, and we will provide further details in this section.
Swin-Transformer Encoder: To hierarchically capture global features, we employ a four-stage Swin-Transformer that progressively reduces the resolution of the input images for segmentation. The input is a patch with dimensions (H, W, D, 1), where H, W, and D represent the height, width, and depth of the MRI, and the last dimension corresponds to the single input channel. The input is first divided into non-overlapping 3D tokens of size (S, S, S) by a patch partitioning layer, resulting in (H/S) × (W/S) × (D/S) tokens. This step reduces computation costs while preserving local characteristics in each patch. In our approach, S is set to 2, which results in a feature size of M = S^3 = 8 per token, and the token dimension becomes (H/2 × W/2 × D/2, 8). A linear embedding layer then projects this feature into a hidden dimension C. To achieve multiscale feature extraction and hierarchical representation, a patch merging layer groups tokens at a 2 × 2 × 2 resolution and concatenates them, producing a feature map with 4C dimensions at each stage; a linear layer then reduces the dimension to 2C at the downsampled resolution. The process generates feature maps at stage 2, stage 3, and stage 4 with dimensions (H/4) × (D/4) × (W/4) × 2C, (H/8) × (D/8) × (W/8) × 4C, and (H/16) × (D/16) × (W/16) × 8C, respectively. To model token interactions efficiently, self-attention is computed for each token through Swin-Transformer blocks. As illustrated in Figure 3, each Swin-Transformer block applies a Window-based Multi-head Self-Attention (W-MSA) module followed by a Shifted Window-based Multi-head Self-Attention (SW-MSA) module; self-attention is computed by these two modules using regular and shifted window partitioning strategies, respectively. Between the attention modules, a two-layer MLP with GELU activation functions is used. LayerNorm is applied before each MSA module, and residual connections are added after each module to enhance stability and efficiency during training. To lessen the computational burden, W-MSA partitions tokens into non-overlapping windows and computes self-attention (SA) within each window. To maintain computational efficiency while enabling cross-window interactions and global attention, SW-MSA shifts the window positions by (M/2, M/2, M/2) voxels from their original locations. Figure 4 illustrates the 3D multi-head self-attention mechanism in Swin-Transformer.
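To make the window mechanism concrete, the following is a minimal PyTorch sketch of 3D window partitioning and the cyclic shift used for SW-MSA; the function names, window size, and tensor shapes are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch


def window_partition_3d(x, window_size):
    """Split a 5D feature map (B, D, H, W, C) into non-overlapping 3D windows.

    Returns a tensor of shape (num_windows * B, window_size**3, C) on which
    multi-head self-attention can be computed independently per window.
    """
    B, D, H, W, C = x.shape
    ws = window_size
    x = x.view(B, D // ws, ws, H // ws, ws, W // ws, ws, C)
    windows = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return windows.view(-1, ws * ws * ws, C)


def shifted_windows_3d(x, window_size):
    """Cyclically shift the volume by half a window along each axis (SW-MSA),
    then partition, so tokens can attend across the previous window borders."""
    shift = window_size // 2
    shifted = torch.roll(x, shifts=(-shift, -shift, -shift), dims=(1, 2, 3))
    return window_partition_3d(shifted, window_size)


# Toy usage: a (1, 32, 32, 32, 48) feature map with 4x4x4 windows.
feat = torch.randn(1, 32, 32, 32, 48)
regular = window_partition_3d(feat, 4)   # shape (512, 64, 48)
shifted = shifted_windows_3d(feat, 4)    # shape (512, 64, 48)
print(regular.shape, shifted.shape)
```

Attention is then computed independently within each window, which keeps the cost roughly linear in the number of windows rather than quadratic in the total number of tokens.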
Swin-Transformer Decoder: To transform the global information obtained by the encoder into a segmentation map, a 3-stage Swin-Transformer decoder is employed, which gradually upsamples the features. At each decoder stage i, an upsampling layer first increases the resolution of H, W, and D by a factor of two while concurrently decreasing the channel dimensions via deconvolution. After upsampling, the resulting features are combined with the output from the corresponding encoder stage i−1, denoted as y_e^(i−1), which acts as a skip connection to mitigate any information loss during the encoder’s downsampling. The concatenated features are then input into a Swin-Transformer block, which captures long-range dependencies throughout the feature map. In this architecture, the deconvolution layer is solely responsible for upsampling, while the Swin-Transformer block handles the decoding of the features. We believe this strategy facilitates the extraction of richer and more meaningful features while also improving the modeling of spatial dependencies, especially for accurate segmentation of the liver and tumor boundaries. Finally, as illustrated in Figure 5, a feature extraction block is employed to separately extract features for the SDM task (blue branch) and the segmentation task (orange branch).
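As a rough illustration of one decoder stage described above (transposed convolution for upsampling, concatenation with the encoder skip feature, then a transformer block for decoding), a hedged PyTorch sketch follows; the placeholder Swin block, the 1 × 1 × 1 fusion convolution, and the channel counts are assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn


class DecoderStage(nn.Module):
    """One decoder stage: 2x upsampling by transposed convolution,
    concatenation with the encoder skip feature, channel fusion, then a
    (placeholder) Swin-Transformer block that models long-range dependencies."""

    def __init__(self, in_ch, skip_ch, out_ch, swin_block: nn.Module):
        super().__init__()
        # Deconvolution is only responsible for upsampling and channel reduction.
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = nn.Conv3d(out_ch + skip_ch, out_ch, kernel_size=1)
        self.swin_block = swin_block  # decodes the fused features

    def forward(self, x, skip):
        x = self.up(x)                    # (B, out_ch, 2D, 2H, 2W)
        x = torch.cat([x, skip], dim=1)   # skip connection from encoder stage i-1
        x = self.fuse(x)
        return self.swin_block(x)


# Toy usage with an identity module standing in for the Swin-Transformer block.
stage = DecoderStage(in_ch=96, skip_ch=48, out_ch=48, swin_block=nn.Identity())
y = stage(torch.randn(1, 96, 12, 12, 12), torch.randn(1, 48, 24, 24, 24))
print(y.shape)  # torch.Size([1, 48, 24, 24, 24])
```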

3.2. Attention Gate and SDM Branch

Attention gates are added between the encoder and its corresponding decoder layers to enhance features in important regions and suppress responses in the irrelevant background. As shown in Figure 6, the attention gate receives two input features, x^l and g. Each input undergoes a 1 × 1 × 1 convolution followed by batch normalization. The resulting features are summed element-wise, and the sum is passed through a ReLU activation layer. To generate the attention map, a final 1 × 1 × 1 convolution layer is applied, followed by a Sigmoid operation. The output of the attention gate is the element-wise multiplication of the input features and the attention map. The specific formulas are as follows:
$$att = \sigma_2\big(\psi\big(\sigma_1(W_x \cdot x_i^l + W_g \cdot g_i)\big)\big)$$
$$\hat{x}_i^l = att \otimes x_i^l$$
Here, W_x, W_g, and ψ denote three different 1 × 1 × 1 convolution operators; σ_1 and σ_2 denote the ReLU and Sigmoid activation functions, respectively; x_i^l is the feature output from layer l of the encoder; g_i is the feature from the upsampling process of layer l + 1 of the decoder; att is the attention vector generated by the attention-gating mechanism; ⊗ denotes element-wise multiplication; and x̂_i^l is the output of the attention-gating mechanism.
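A minimal PyTorch module consistent with the two equations above might look as follows; the intermediate channel count and the placement of batch normalization after each 1 × 1 × 1 convolution are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn


class AttentionGate3D(nn.Module):
    """Attention gate: the gating signal g (from the decoder upsampling path) and
    the skip feature x_l (from the encoder) produce an attention map that
    re-weights x_l, suppressing irrelevant background regions."""

    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Sequential(nn.Conv3d(x_ch, inter_ch, kernel_size=1), nn.BatchNorm3d(inter_ch))
        self.w_g = nn.Sequential(nn.Conv3d(g_ch, inter_ch, kernel_size=1), nn.BatchNorm3d(inter_ch))
        self.psi = nn.Sequential(nn.Conv3d(inter_ch, 1, kernel_size=1), nn.BatchNorm3d(1), nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x_l, g):
        att = self.psi(self.relu(self.w_x(x_l) + self.w_g(g)))  # sigma_2(psi(sigma_1(...)))
        return x_l * att                                         # element-wise re-weighting


# Toy usage: a 48-channel skip feature gated by a 48-channel decoder feature.
gate = AttentionGate3D(x_ch=48, g_ch=48, inter_ch=24)
out = gate(torch.randn(1, 48, 24, 24, 24), torch.randn(1, 48, 24, 24, 24))
print(out.shape)  # torch.Size([1, 48, 24, 24, 24])
```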
In our multitask network, task 1 is a segmentation task and task 2 is an SDM regression task. Pixel-level classification for segmentation has been studied extensively, whereas SDM regression is a classic formulation that captures geometric contour and distance information [57] and has recently been revived in combination with CNNs [58]. It is defined as follows:
$$T(x) = \begin{cases} -\inf_{y \in S} \| x - y \|_2, & x \in S_{in} \\ 0, & x \in S \\ +\inf_{y \in S} \| x - y \|_2, & x \in S_{out} \end{cases}$$
where x and y are two different pixels in the same segmentation mask, and S is the zero-level set, which represents the contour of the target object. S_in and S_out denote the regions inside and outside the segmentation target, respectively, so T(x) defines the transformation from a segmentation map to an SDM. In this experiment, a segmentation mask is first converted into three binarized channels by a one-hot operation, representing the background, liver, and tumors. The corresponding SDMs are then calculated for the liver and tumor channels using the above formula.
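For illustration, the ground-truth SDM of a binary mask can be approximated with SciPy's Euclidean distance transform as sketched below; the sign convention (negative inside, positive outside) follows the equation above, while the function name and the optional normalization to [−1, 1] for the Tanh output head are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def mask_to_sdm(binary_mask: np.ndarray, normalize: bool = True) -> np.ndarray:
    """Approximate signed distance map of a binary mask: negative inside the
    object, positive outside, and close to zero at the contour voxels."""
    mask = binary_mask.astype(bool)
    if not mask.any() or mask.all():
        # No contour exists; return an all-zero map.
        return np.zeros(mask.shape, dtype=np.float32)
    dist_out = distance_transform_edt(~mask)  # distance to the object, for outside voxels
    dist_in = distance_transform_edt(mask)    # distance to the background, for inside voxels
    sdm = dist_out - dist_in                  # positive outside, negative inside
    if normalize:
        # Optional scaling to [-1, 1] so the target range matches a Tanh output layer.
        sdm = np.where(sdm > 0, sdm / dist_out.max(), sdm / dist_in.max())
    return sdm.astype(np.float32)
```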
The proposed network is trained using a mixed loss function, which includes cross-entropy loss (CE loss), DSC loss, and SDM loss. CE loss is defined as follows:
$$L_{CE}(p, \hat{p}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ p_i \log \hat{p}_i + (1 - p_i) \log (1 - \hat{p}_i) \right]$$
where N represents the number of voxels in each sample, p and p̂ denote the ground truth and the prediction, respectively, and p_i and p̂_i denote the probability values at position i of the ground truth and the prediction, respectively. The DSC loss is defined as follows:
$$L_{DSC} = 1 - \frac{2 |G \cap P|}{|G| + |P|}$$
where G and P represent the voxel sets of the ground truth and the prediction, respectively, and |·| denotes the voxel count. The SDM loss is defined as follows:
$$L_{SDM} = \| S - T(M) \|_2$$
where S and M represent the predicted SDM and the corresponding mask, respectively, T(·) denotes the operation of computing the ground-truth SDM from the mask, and ‖·‖_2 denotes the L2 loss. The three loss terms are combined, with the parameter α adjusting the weight of L_SDM:
$$L = L_{CE} + L_{DSC} + \alpha L_{SDM}$$
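A compact PyTorch sketch of this combined objective is given below; the use of mean-squared error for the L2 SDM term, the tensor layouts, and the helper names are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F


def dice_loss(probs, target_onehot, eps=1e-5):
    """Soft Dice loss averaged over classes; inputs are (B, C, D, H, W)."""
    dims = (0, 2, 3, 4)
    inter = torch.sum(probs * target_onehot, dims)
    denom = torch.sum(probs, dims) + torch.sum(target_onehot, dims)
    return 1.0 - torch.mean((2.0 * inter + eps) / (denom + eps))


def total_loss(seg_logits, sdm_pred, target_onehot, sdm_gt, alpha=0.1):
    """L = L_CE + L_DSC + alpha * L_SDM; alpha = 0.1 gave the best results in the ablation."""
    l_ce = F.cross_entropy(seg_logits, target_onehot.argmax(dim=1))
    l_dsc = dice_loss(torch.softmax(seg_logits, dim=1), target_onehot)
    l_sdm = F.mse_loss(sdm_pred, sdm_gt)  # L2 penalty between predicted and ground-truth SDMs
    return l_ce + l_dsc + alpha * l_sdm
```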

3.3. The Self-Supervised Learning in Our Framework

As illustrated in Figure 7, pretext tasks are employed for self-supervised learning by combining contrastive learning with an exponentially weighted moving average (EWMA) technique. This method helps in learning strong feature representations from the input data. Once the pretext training is finished, the obtained weights are utilized to initialize the encoder, and the entire model undergoes fine-tuning via supervised learning.
Contrastive learning boosts representation robustness by distinguishing similar (positive) from dissimilar (negative) pairs or by improving the alignment within positive pairs. Given an input batch x, we apply random image augmentations T, generating two augmented views x_1 = t(x) and x_2 = t′(x), where t and t′ are drawn from T. These augmented versions are passed through the base and momentum networks, each consisting of a Swin-Transformer encoder and a two-layer MLP projector. The base model f_θ is trained with backpropagation, while the momentum network f_ζ updates its parameters as an exponentially weighted moving average of the base model:
$$\zeta \leftarrow \phi \zeta + (1 - \phi) \theta$$
where φ ∈ [0, 1] is the decay rate. A one-layer MLP predictor is added to the base network to prevent representational collapse. The similarity between representations is computed via the dot product, and the contrastive loss for a pair z_i and z_j is given by:
$$L_{con} = -\log \frac{\exp\left( (z_i \cdot z_j) / \tau \right)}{\sum_{m=1}^{2N} \mathbb{1}_{[m \neq i]} \exp\left( (z_i \cdot z_m) / \tau \right)}$$
In this equation, 𝟙_[m ≠ i] ∈ {0, 1} is an indicator function that equals 1 when m ≠ i, and τ denotes the temperature parameter. As a pretext task, contrastive learning draws intra-class pairs closer together while separating inter-class pairs, thus capturing essential semantic features.
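The following PyTorch sketch illustrates one pretext training step combining the contrastive loss above with the EWMA (momentum) update; the encoder, projector, and predictor modules, the temperature, the decay rate, and the cosine-normalized similarity are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(momentum_net, base_net, phi=0.996):
    """zeta <- phi * zeta + (1 - phi) * theta for every parameter pair."""
    for p_m, p_b in zip(momentum_net.parameters(), base_net.parameters()):
        p_m.data.mul_(phi).add_(p_b.data, alpha=1.0 - phi)


def contrastive_loss(z1, z2, tau=0.1):
    """InfoNCE over a batch: for each of the 2N embeddings, the other view of the
    same image is the positive; the remaining 2N - 2 embeddings are negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), cosine-normalized
    sim = z @ z.t() / tau                                # pairwise similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))           # exclude self-pairs (m != i)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


def pretrain_step(base_net, momentum_net, predictor, x1, x2, optimizer):
    """One step: two augmented views -> base and momentum embeddings -> loss -> EMA."""
    p1 = predictor(base_net(x1))        # base branch (trained by backpropagation)
    with torch.no_grad():
        z2 = momentum_net(x2)           # momentum branch (no gradients)
    loss = contrastive_loss(p1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(momentum_net, base_net)  # EWMA update of the momentum network
    return loss.item()
```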

4. Experiments

4.1. Datasets and Preprocessing

Segmentation Dataset
The network was trained and validated using a clinical dataset obtained from the First Affiliated Hospital of Dalian Medical University. This dataset consists of contrast-enhanced MRI (CE-MRI) scans, including arterial phase, venous phase, and delayed phase images, collected from 192 patients diagnosed with hepatocellular carcinoma (HCC). The regions of interest (ROIs) of the liver and tumors were manually annotated by two radiologists with more than five years of clinical experience. Each image slice has a resolution of 512 × 512 pixels, with 60–140 slices per scan and a voxel spacing of [0.78 mm, 0.78 mm, 2.2 mm]. Based on patient-level grouping, the dataset was divided into a training cohort (128 patients, 384 MRIs), a validation cohort (32 patients, 96 MRIs), and a test cohort (32 patients, 96 MRIs). MRI scans from different phases of the same patient were kept within the same cohort to prevent data leakage.
To improve the accuracy of liver tumor segmentation, all images underwent several preprocessing steps. First, the MR images were cropped to remove irrelevant background regions. Next, the images were resampled to an isotropic resolution of [1.00 mm, 1.00 mm, 1.00 mm], and pixel intensities were normalized to the range [0, 255]. Furthermore, to enhance the model’s generalization and robustness, several image augmentation techniques were applied during training. These included random rotations, horizontal and vertical flips, scaling, elastic deformations, and intensity variations such as Gaussian noise and contrast adjustment. These augmentations help simulate real-world variability and reduce the risk of overfitting.
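As a hedged illustration, the preprocessing and augmentation pipeline described above can be assembled from standard MONAI dictionary transforms roughly as follows; the specific transform choices and parameter values are assumptions rather than the exact configuration used for this dataset.

```python
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, Spacingd, CropForegroundd,
    ScaleIntensityd, RandFlipd, RandRotate90d, RandGaussianNoised,
    RandAdjustContrastd, RandSpatialCropSamplesd,
)

keys = ["image", "label"]
train_transforms = Compose([
    LoadImaged(keys=keys),
    EnsureChannelFirstd(keys=keys),
    # Crop away irrelevant background and resample to isotropic 1.0 mm spacing.
    CropForegroundd(keys=keys, source_key="image"),
    Spacingd(keys=keys, pixdim=(1.0, 1.0, 1.0), mode=("bilinear", "nearest")),
    # Normalize intensities to [0, 255].
    ScaleIntensityd(keys=["image"], minv=0.0, maxv=255.0),
    # A subset of the augmentations: flips, rotations, noise, contrast adjustment.
    RandFlipd(keys=keys, prob=0.5, spatial_axis=[0, 1]),
    RandRotate90d(keys=keys, prob=0.5, spatial_axes=(0, 1)),
    RandGaussianNoised(keys=["image"], prob=0.5),
    RandAdjustContrastd(keys=["image"], prob=0.5),
    # Eight random 96x96x96 patches per volume (see Section 4.2).
    RandSpatialCropSamplesd(keys=keys, roi_size=(96, 96, 96), num_samples=8, random_size=False),
])
```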
SSL Dataset
Our self-supervised learning dataset was created by gathering chest MRI scans from 594 patients, comprising 192 annotated cases and 402 unannotated cases. During pretraining, only the unannotated cases were used, with no reliance on the existing annotations. To maintain consistency, the preprocessing applied to the unlabeled patient MRIs mirrored that of the segmentation dataset.
The study was approved by the ethics committee of the First Affiliated Hospital, Dalian Medical University (PJ-KS-KY-2019-167).

4.2. Implementation Details and Evaluation Metrics

The deep network is trained with the Adam optimizer. The initial learning rate is 1 × 10⁻⁴ and is decayed according to the CosineAnnealingLR strategy, reaching 1 × 10⁻⁶ by the 100th epoch. Due to limited computing resources, images are randomly cropped before being input into the network; the patch size is set to [96, 96, 96], each image is randomly divided into eight patches, and the batch size is set to 4. The network is implemented using PyTorch 2.0 and MONAI 1.2.0 [59], and the server configuration is an Intel(R) Xeon(R) CPU E5-1620, 3.60 GHz, 64 GB RAM, an Nvidia GeForce RTX 3090 (24 GB), and Ubuntu 18.04 (Intel, Santa Clara, CA, USA). To improve the generalization performance of the model and reduce overfitting, we use data augmentation during training, including random bias field, random Gaussian noise, random contrast adjustment, and random intensity offset, each with a probability of 0.5. We did not use any post-processing or ensemble methods, for a fair comparison. In the self-supervised learning phase, a batch size of 4 is used. The learning rate starts at 1 × 10⁻⁵, using the Adam optimizer with a warm-up scheduler for the first 500 steps. The model is trained for 200,000 iterations, and the final model is used for the subsequent fine-tuning tasks.
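For reference, the learning-rate schedule described above corresponds to a standard PyTorch Adam plus CosineAnnealingLR setup; the minimal self-contained check below uses a dummy parameter in place of the real network.

```python
import torch

# Adam starting at 1e-4, decayed by cosine annealing to 1e-6 at epoch 100.
dummy_param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([dummy_param], lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    # ... one training epoch over the randomly cropped 96^3 patches would run here ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # approximately 1e-6 after 100 epochs
```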
We use the Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (95HD), and Average Symmetric Distance (ASD) to evaluate the segmentation performance of the proposed model. The DSC coefficient measures the volumetric overlap between ground truth and segmentation results, while the 95HD measures the distance between ground truth and the boundary pixels of the segmentation result. Additionally, the ASD provides a measure of the average distance between the boundary points of the segmentation and ground truth. These metrics give a well-rounded understanding of both the volumetric overlap and the boundary discrepancies between the ground truth and the segmentation results.
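For completeness, these three metrics can be computed with MONAI's metric classes roughly as sketched below; the one-hot tensor layout (B, C, D, H, W) and the reduction settings are assumptions.

```python
import torch
from monai.metrics import DiceMetric, HausdorffDistanceMetric, SurfaceDistanceMetric

# All three metrics consume one-hot predictions and labels of shape (B, C, D, H, W).
dsc_metric = DiceMetric(include_background=False, reduction="mean")
hd95_metric = HausdorffDistanceMetric(include_background=False, percentile=95.0, reduction="mean")
asd_metric = SurfaceDistanceMetric(include_background=False, symmetric=True, reduction="mean")


def evaluate_case(pred_onehot: torch.Tensor, gt_onehot: torch.Tensor):
    """Return (DSC, 95HD, ASD) averaged over foreground classes for one case."""
    dsc = dsc_metric(y_pred=pred_onehot, y=gt_onehot).mean().item()
    hd95 = hd95_metric(y_pred=pred_onehot, y=gt_onehot).mean().item()
    asd = asd_metric(y_pred=pred_onehot, y=gt_onehot).mean().item()
    return dsc, hd95, asd
```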

4.3. Quantitative Results

This section compares the proposed network with five state-of-the-art methods to evaluate its effectiveness and robustness on the CE-MRI dataset. 3DUNet [60] is widely used for medical image segmentation; it includes an encoder that downsamples to extract features, a decoder that upsamples to generate the segmentation map, and skip connections that bridge the encoder and decoder to recover information lost during downsampling. The structure of V-Net [61] is similar to that of 3DUNet but employs residual connections to enhance feature learning across layers. nnU-Net [62] presents a framework that automatically adapts to different datasets, built upon the foundation of standard U-Nets. UNETR [43] is a Transformer-based medical image segmentation network that combines the advantages of U-Net and Transformer: it uses the self-attention mechanism to capture global information in the image and U-Net’s encoder–decoder structure for local feature extraction and upsampling. SwinUNetR [46] uses Swin-Transformer for feature extraction and encoding; its hierarchical, window-based local attention captures global information in an image, while shifted windows promote feature transmission and integration across windows.
To compare the proposed method with other advanced methods, DSC, 95HD, and ASD are used to measure the segmentation performance for the liver and tumor. As shown in Table 1, the proposed method outperforms the other approaches on these metrics, achieving a DSC of 0.9662 ± 0.0219 for liver segmentation and 0.8647 ± 0.0256 for tumor segmentation. The proposed model outperforms the other segmentation networks on tumor segmentation by a large margin and slightly outperforms them on liver segmentation, mainly because it can segment small tumors more accurately.
Table 2 compares the computational cost and parameter size of our method with several representative 3D medical image segmentation networks. Although transformer-based architectures are generally associated with increased computational burden, our approach demonstrates a favorable trade-off between accuracy and efficiency. Notably, while our model achieves competitive performance (as shown in Table 1), it only introduces 63.4 M parameters and 188.6 GFLOPs, which is comparable to SwinUNetR (62.1 M, 237.9 GFLOPs) and significantly lower than nnU-Net (91.2 M, 213.8 GFLOPs). Compared with UNetR and SwinUNetR, our method shows only a modest increase in parameters (by 4.9 M and 1.3 M, respectively) and even a notable reduction in GFLOPs compared to SwinUNetR. These results suggest that our design is efficient in terms of computational complexity and model size and that the performance gains do not come at the expense of a significant increase in resource consumption.
To evaluate the statistical significance of the improvements made by our model, a paired t-test was conducted between each compared model and ours, with two-sided p-values computed and a significance threshold of 0.05. In Table 3, we report the p-values for the liver and tumor based on DSC. The results indicate a statistically significant difference between all compared models and our model, with p-values of 0.0151 for the liver and 0.0032 for the tumor in the comparison with nnU-Net.
To determine if the contours predicted by our model are generally larger or smaller than the annotated ones, we calculated the Volume Similarity for all models, and the results are shown in Table 4. It was observed that the majority of models tend to predict smaller contours for both the liver and tumor. However, our model and nnU-Net [62] typically predict slightly larger contours for both the liver and tumor, showing better performance in detecting smaller tumors. In particular, for the liver, the average difference between our model’s predictions and the annotated contours in the test set is 4.32% smaller, with a standard deviation of 9.53%. For the tumor, the model’s predictions are, on average, 4.67% smaller, with a standard deviation of 8.67%.

4.4. Qualitative Results

To further evaluate the performance of our model, we visualized the segmentation results and compared them with the annotations provided by the radiation oncologist. The segmentation results for two different patients across three MRI phases are displayed in Figure 8. The predicted liver contours are in close agreement with the manual annotations, showing minimal discrepancies. For the tumor, the predicted contours exhibit a slight outward expansion relative to the manual annotations, while the overall shape still aligns with the manually drawn contours.
We present slices from arterial phase MRI images of the same patient, highlighting liver and tumor regions at six distinct locations, along with the segmentation results from various models to visually assess their performance. As illustrated in Figure 9, the prediction outcomes of different models are compared. Notable differences are observed in the segmentation of tumor edges, particularly for smaller tumors. Specifically, 3DUNet and V-Net do not perform well in tumor segmentation, and 3DUNet does not even detect small tumors. While the other three baseline models identify the presence of tumors, their results significantly deviate from the ground truth. In contrast, our model demonstrates superior performance in both liver and tumor segmentation.

4.5. Ablation Experiment

In this section, we conduct different ablation experiments to analyze the proposed model comprehensively. First, we evaluate the improvement brought by the SDM regression branch to demonstrate the importance of multitask learning. Table 5 shows the segmentation results with and without the SDM branch, evaluated using DSC and 95HD. The results show that the model retaining the SDM branch achieves a higher DSC and a lower 95HD for both liver and tumor segmentation, indicating that it can better localize tumors and segment boundary regions.
Introducing the SSL method for pretraining, the network helps improve the performance of the liver and tumor segmentation model, as shown in Table 5. By pretraining on a large amount of unlabeled data, the model can learn richer feature representations, enabling better generalization on limited labeled data. Experimental results demonstrate that, compared to models without self-supervised pretraining, models trained with self-supervised learning achieve significant improvements in DSC score, 95HD, and ASD, leading to more accurate target region predictions and smoother boundary segmentation. This indicates that self-supervised pretraining can effectively enhance the model’s representation capability, thereby improving liver and tumor segmentation performance.
After introducing the attention gate mechanism, the performance of the liver and tumor segmentation model has significantly improved, as shown in Table 5. Specifically, DSC increased by approximately 0.59% and 1.39% for liver and tumor segmentation, respectively, indicating that the model can more accurately capture the boundaries and shapes of the target regions. Furthermore, 95HD for liver and tumor decreased by approximately 0.2 and 0.25, respectively, demonstrating a notable enhancement in prediction accuracy for boundary regions, particularly when handling complex and small-sized tumors. These improvements suggest that the attention gate mechanism effectively enhances the model’s ability to focus on critical regions, thereby improving segmentation accuracy and robustness.
Table 6 presents the segmentation results of our model on different CE-MRI phases. Notably, the model performed best on arterial-phase MRI images, with liver and tumor DSC values reaching 0.968 ± 0.025 and 0.878 ± 0.066, respectively, higher than its performance on the venous and delayed phases; this may be attributed to the greater contrast of tumors and the sharper visualization of surrounding areas in the arterial phase. Figure 8 shows the segmentation results of the model on different MRI phases for two different patients. The results show that segmentation on arterial-phase CE-MRI was superior to that on the other two phases, with more complete segmentation results and more accurate edges.
Finally, to analyze the impact of the SDM branch on segmentation performance, we experimented with different weights α. The results in Table 7 show that increasing the weight α does not yield a corresponding improvement; instead, performance declines, mainly in the 95HD and ASD of the tumor, indicating that segmentation of boundary regions deteriorates. The model achieved its best performance at α = 0.1.

5. Discussion

To address the challenging problem of liver and tumor segmentation in HCC patients, we developed a Swin-Transformer-based segmentation framework with a multitask and attention strategy for liver and tumor segmentation. In contrast to CNN-based segmentation models, the Swin-Transformer leverages self-attention mechanisms, enabling it to capture contextual information and long-range dependencies across the entire CE-MRI. The global attention capability enhances the model’s understanding of the spatial relationships between the liver and tumor in MRI scans, thereby improving the segmentation accuracy for both regions.
Regarding the attention mechanism, the liver and tumor segmentation model showed significant performance improvements following the integration of the attention gate, as shown in Table 5. Specifically, the DSC scores for liver and tumor segmentation increased by approximately 0.59% and 1.39%, respectively, highlighting the model’s enhanced ability to accurately delineate the boundaries and shapes of the target regions. Additionally, the 95HD for both liver and tumor decreased by approximately 0.2 and 0.25, respectively, indicating a marked improvement in boundary prediction accuracy, particularly for small or complex tumors. These results underscore the effectiveness of the attention gate mechanism in improving the model’s focus on critical regions, thereby boosting both segmentation accuracy and robustness.
We introduced the SDM regression branch in the proposed method, enabling simultaneous learning of the segmentation and SDM regression tasks. At the same time, the SDM regression branch adds shape constraints to the segmentation task, further improving robustness and stability. To evaluate the effectiveness of the proposed method, we conducted comprehensive experiments and ablation analysis. Table 1 and Figure 9 show the comparison with other methods; our proposed method is superior in terms of DSC, 95HD, and ASD. The DSC values for liver and tumor segmentation reached 0.966 ± 0.022 and 0.865 ± 0.026, 95HD reached 3.62 ± 1.91 and 4.12 ± 1.55, and ASD reached 0.92 ± 0.28 and 1.02 ± 0.32, respectively. The visualization of the segmentation results shows that, compared with the baseline models, the proposed model detects small tumors better, and the segmentation of target edge regions is smoother. Several works have explored the application of signed distance maps to medical images. Xue et al. [58] converted the segmentation task into SDM prediction and introduced an approximate Heaviside function so that the model is trained by simultaneously predicting the SDM and the segmentation mask, giving the segmentation results better smoothness and shape continuity. Ma et al. [63] embedded the global geometric information of objects into the learning framework through classic geodesic active contours (GAC) and proposed a level set function (LSF) regression network, which not only predicts the segmentation mask but also minimizes the GAC energy function. Li et al. [64] incorporated the SDM into a semi-supervised segmentation task to enforce geometric shape constraints on the segmentation output, proposing a multitask deep network that jointly predicts the semantic segmentation and the SDM of object surfaces. Building on this, Luo et al. [65] transformed the SDM into an approximate segmentation map via a differentiable task-conversion layer and introduced dual-task consistency regularization between the SDM-derived map and the directly predicted segmentation map. However, to our knowledge, joint prediction of the signed distance map and segmentation has not previously been applied to liver tumor segmentation in CE-MRI images; this study is the first to explore it.
Obtaining high-quality annotated images for HCC patients is a challenging endeavor, as deep-learning models need large datasets to realize their full potential. To alleviate the manual labeling burden for training data, we employ SSL methods by designing a pretraining task oriented toward liver and tumor segmentation. By collecting a significant volume of unlabeled data sourced from hospitals, the model can learn meaningful visual representations from this extensive dataset before transitioning to supervised learning. Experimental results show that our self-supervised pretraining task significantly boosts the accuracy of the final segmentation model. Specifically, without self-supervised learning, the DSC scores for liver and tumor segmentation are 96.15% and 86.18%, respectively, with 95HD values of 3.76 mm and 4.21 mm and ASD values of 0.97 mm and 1.09 mm. After applying self-supervised learning, the DSC scores improve to 96.62% and 86.47%, the 95HD values improve to 3.62 mm and 4.12 mm, and the ASD values reduce to 0.92 mm and 1.02 mm.
Despite the strong performance of our method on the test set, several limitations remain in this study. The most significant limitation is that the test set comprises only 32 patients, all from a single medical center. This relatively small and homogeneous dataset may not capture the variability in imaging protocols, scanner types, and patient populations that exist across different institutions. For example, in our study, all enrolled cases were patients diagnosed with hepatocellular carcinoma (HCC), which accounts for approximately 80–90% of all liver cancer cases. Other types of primary liver tumors, such as intrahepatic cholangiocarcinoma, combined hepatocellular–cholangiocarcinoma, and hepatic angiosarcoma, were not included in our dataset. As a result, there may be potential dataset biases, and the current results may not fully generalize to broader clinical settings. To address this issue, multicenter validation is essential to evaluate the robustness and applicability of the proposed method in real-world clinical practice. In future work, we plan to collect more diverse datasets from multiple medical centers and MRI devices to comprehensively assess the generalizability and reliability of our approach. This will help mitigate potential biases and ensure that our method performs consistently across varying clinical environments.

6. Conclusions

Automatic segmentation of the liver and tumor plays a crucial role in clinical settings by reducing the workload of surgeons and enhancing the likelihood of successful surgeries. In this study, we present a Swin-Transformer-based segmentation framework incorporating multitask learning and attention mechanisms for liver and tumor segmentation. Our experimental results demonstrate that the proposed approach outperforms current state-of-the-art medical image segmentation networks for liver and tumor segmentation in CE-MRI images; the DSC values for liver and tumor segmentation are 0.966 ± 0.022 and 0.865 ± 0.026, respectively.

Author Contributions

Conceptualization, Z.C.; Methodology, Z.C.; Software, Z.C.; Validation, Z.C. and M.D.; Formal analysis, Z.C. and M.D.; Investigation, Z.C.; Resources, Z.C. and X.L.; Data curation, Z.C. and M.D.; Writing—original draft, Z.C.; Writing—review & editing, X.L.; Visualization, Z.C.; Supervision, X.L. and Y.Y.; Project administration, X.L.; Funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The National Natural Science Foundation of China (Grant No. 61971091).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

ASD: Average Symmetric Distance
BTCV: BTCV Multi-Organ Segmentation Challenge
CE Loss: Cross-Entropy Loss
CE-MRI: Contrast-Enhanced Magnetic Resonance Imaging
CNN: Convolutional Neural Network
CT: Computed Tomography
DSC: Dice Similarity Coefficient
EWMA: Exponentially Weighted Moving Average
FCN: Fully Convolutional Network
GAC: Geodesic Active Contours
GAN: Generative Adversarial Networks
HCC: Hepatocellular Carcinoma
LSF: Level Set Function
MAB: Multiscale Attention Block
MSD: MSD Challenge
NLP: Natural Language Processing
RMAU-Net: Residual Multiscale Attention U-Net
SAA: Scale-Axis-Attention
SA: Self-Attention
SDM: Signed Distance Map
SSL: Self-Supervised Learning
SW-MSA: Shifted Window-Based Multi-Head Self-Attention
VAE: Variational Autoencoders
ViT: Vision Transformers
W-MSA: Window-Based Multi-Head Self-Attention
95HD: 95% Hausdorff Distance

References

  1. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2021, 71, 209–249. [Google Scholar] [CrossRef] [PubMed]
  2. Perz, J.F.; Armstrong, G.L.; Farrington, L.A.; Hutin, Y.J.; Bell, B.P. The contributions of hepatitis B virus and hepatitis C virus infections to cirrhosis and primary liver cancer worldwide. J. Hepatol. 2006, 45, 529–538. [Google Scholar] [PubMed]
  3. Yang, J.; Tao, H.S.; Cai, W.; Zhu, W.; Zhao, D.; Hu, H.Y.; Liu, J.; Fang, C.H. Accuracy of actual resected liver volume in anatomical liver resections guided by 3-dimensional parenchymal staining using fusion indocyanine green fluorescence imaging. J. Surg. Oncol. 2018, 118, 1081–1087. [Google Scholar] [PubMed]
  4. Heidari, M.; Taghizadeh, M.; Masoumi, H.; Valizadeh, M. Liver Segmentation in MRI Images using an Adaptive Water Flow Model. J. Biomed. Phys. Eng. 2021, 11, 527. [Google Scholar]
  5. Dakua, S.P. Use of chaos concept in medical image segmentation. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 2013, 1, 28–36. [Google Scholar]
  6. Dakua, S.P.; Abinahed, J.; Al-Ansari, A.A. Pathological liver segmentation using stochastic resonance and cellular automata. J. Vis. Commun. Image Represent. 2016, 34, 89–102. [Google Scholar]
  7. Mesejo, P.; Valsecchi, A.; Marrakchi-Kacem, L.; Cagnoni, S.; Damas, S. Biomedical image segmentation using geometric deformable models and metaheuristics. Comput. Med. Imaging Graph. 2015, 43, 167–178. [Google Scholar]
  8. Yang, X.; Yu, H.C.; Choi, Y.; Lee, W.; Wang, B.; Yang, J.; Hwang, H.; Kim, J.H.; Song, J.; Cho, B.H.; et al. A hybrid semi-automatic method for liver segmentation based on level-set methods using multiple seed points. Comput. Methods Programs Biomed. 2014, 113, 69–79. [Google Scholar] [CrossRef]
  9. Das, A.; Sabut, S.K. Kernelized fuzzy C-means clustering with adaptive thresholding for segmenting liver tumors. Procedia Comput. Sci. 2016, 92, 389–395. [Google Scholar]
  10. Seal, A.; Bhattacharjee, D.; Nasipuri, M. Predictive and probabilistic model for cancer detection using computer tomography images. Multimed. Tools Appl. 2018, 77, 3991–4010. [Google Scholar] [CrossRef]
  11. Wiseman, Y.; Fredj, E. Contour extraction of compressed JPEG images. J. Graph. Tools 2001, 6, 37–43. [Google Scholar] [CrossRef]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  13. Ben-Cohen, A.; Diamant, I.; Klang, E.; Amitai, M.; Greenspan, H. Fully convolutional network for liver segmentation and lesions detection. In Proceedings of the Deep Learning and Data Labeling for Medical Applications: First International Workshop, LABELS 2016, and Second International Workshop, DLMIA 2016, Held in Conjunction with MICCAI 2016, Athens, Greece, 21 October 2016; Proceedings 1. Springer: Berlin/Heidelberg, Germany, 2016; pp. 77–85. [Google Scholar]
  14. Sun, C.; Guo, S.; Zhang, H.; Li, J.; Chen, M.; Ma, S.; Jin, L.; Liu, X.; Li, X.; Qian, X. Automatic segmentation of liver tumors from multiphase contrast-enhanced CT images based on FCNs. Artif. Intell. Med. 2017, 83, 58–66. [Google Scholar] [CrossRef] [PubMed]
  15. Schlemper, J.; Oktay, O.; Schaap, M.; Heinrich, M.; Kainz, B.; Glocker, B.; Rueckert, D. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal. 2019, 53, 197–207. [Google Scholar] [CrossRef]
  16. Gul, S.; Khan, M.S.; Bibi, A.; Khandakar, A.; Ayari, M.A.; Chowdhury, M.E. Deep learning techniques for liver and liver tumor segmentation: A review. Comput. Biol. Med. 2022, 147, 105620. [Google Scholar] [CrossRef]
  17. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  19. Landman, B.; Xu, Z.; Igelsias, J.; Styner, M.; Langerak, T.; Klein, A. Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge. In Proceedings of the MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, Munich, Germany, 5–9 October 2015; Volume 5, p. 12. [Google Scholar]
  20. Antonelli, M.; Reinke, A.; Bakas, S.; Farahani, K.; Kopp-Schneider, A.; Landman, B.A.; Litjens, G.; Menze, B.; Ronneberger, O.; Summers, R.M.; et al. The medical segmentation decathlon. Nat. Commun. 2022, 13, 4128. [Google Scholar] [CrossRef]
  21. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. arXiv 2014, arXiv:1409.0575. [Google Scholar] [CrossRef]
  22. Matsoukas, C.; Haslum, J.F.; Söderberg, M.; Smith, K. Is it time to replace cnns with transformers for medical images? arXiv 2021, arXiv:2108.09038. [Google Scholar]
  23. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  24. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  25. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  26. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar]
  27. Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; Hinton, G.E. Big self-supervised models are strong semi-supervised learners. Adv. Neural Inf. Process. Syst. 2020, 33, 22243–22255. [Google Scholar]
  28. Chen, X.; Xie, S.; He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9640–9649. [Google Scholar]
  29. Christ, P.F.; Ettlinger, F.; Grün, F.; Elshaera, M.E.A.; Lipkova, J.; Schlecht, S.; Ahmaddy, F.; Tatavarty, S.; Bickel, M.; Bilic, P.; et al. Automatic liver and tumor segmentation of CT and MRI volumes using cascaded fully convolutional neural networks. arXiv 2017, arXiv:1702.05970. [Google Scholar]
  30. Zhang, C.; Lu, J.; Hua, Q.; Li, C.; Wang, P. SAA-Net: U-shaped network with Scale-Axis-Attention for liver tumor segmentation. Biomed. Signal Process. Control 2022, 73, 103460. [Google Scholar] [CrossRef]
  31. Wang, X.; Wang, S.; Zhang, Z.; Yin, X.; Wang, T.; Li, N. CPAD-Net: Contextual parallel attention and dilated network for liver tumor segmentation. Biomed. Signal Process. Control 2023, 79, 104258. [Google Scholar] [CrossRef]
  32. Li, J.; Liu, K.; Hu, Y.; Zhang, H.; Heidari, A.A.; Chen, H.; Zhang, W.; Algarni, A.D.; Elmannai, H. Eres-UNet++: Liver CT image segmentation based on high-efficiency channel attention and Res-UNet++. Comput. Biol. Med. 2023, 158, 106501. [Google Scholar] [CrossRef]
  33. Kushnure, D.T.; Talbar, S.N. HFRU-Net: High-level feature fusion and recalibration unet for automatic liver and tumor segmentation in CT images. Comput. Methods Programs Biomed. 2022, 213, 106501. [Google Scholar] [CrossRef]
  34. Jiang, L.; Ou, J.; Liu, R.; Zou, Y.; Xie, T.; Xiao, H.; Bai, T. RMAU-Net: Residual Multi-Scale Attention U-Net For liver and tumor segmentation in CT images. Comput. Biol. Med. 2023, 158, 106838. [Google Scholar] [CrossRef]
  35. Myronenko, A. 3D MRI brain tumor segmentation using autoencoder regularization. In Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2018; Revised Selected Papers, Part II 4. Springer: Berlin/Heidelberg, Germany, 2019; pp. 311–320. [Google Scholar]
  36. Chakravarty, A.; Sivaswamy, J. A deep learning based joint segmentation and classification framework for glaucoma assessment in retinal color fundus images. arXiv 2018, arXiv:1808.01355. [Google Scholar]
  37. Chen, C.; Bai, W.; Rueckert, D. Multi-task learning for left atrial segmentation on GE-MRI. In Proceedings of the Statistical Atlases and Computational Models of the Heart. Atrial Segmentation and LV Quantification Challenges: 9th International Workshop, STACOM 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2018; Revised Selected Papers 9. Springer: Berlin/Heidelberg, Germany, 2019; pp. 292–301. [Google Scholar]
  38. Zhou, Y.; Chen, H.; Li, Y.; Liu, Q.; Xu, X.; Wang, S.; Yap, P.T.; Shen, D. Multi-task learning for segmentation and classification of tumors in 3D automated breast ultrasound images. Med. Image Anal. 2021, 70, 101918. [Google Scholar] [CrossRef]
  39. Qu, H.; Riedlinger, G.; Wu, P.; Huang, Q.; Yi, J.; De, S.; Metaxas, D. Joint segmentation and fine-grained classification of nuclei in histopathology images. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 8–11 April 2019; IEEE: New York, NY, USA, 2019; pp. 900–904. [Google Scholar]
  40. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  43. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 574–584. [Google Scholar]
  44. Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 1–15. [Google Scholar]
  45. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar]
  46. Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.R.; Xu, D. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In Proceedings of the International MICCAI Brainlesion Workshop, Virtual, 27 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 272–284. [Google Scholar]
  47. Liang, J.; Yang, C.; Zhong, J.; Ye, X. BTSwin-Unet: 3D U-shaped symmetrical Swin transformer-based network for brain tumor segmentation with self-supervised pre-training. Neural Process. Lett. 2023, 55, 3695–3713. [Google Scholar]
  48. Bakas, S.; Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.S.; Freymann, J.B.; Farahani, K.; Davatzikos, C. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 2017, 4, 1–13. [Google Scholar]
  49. Bakas, S.; Reyes, M.; Jakab, A.; Bauer, S.; Rempfler, M.; Crimi, A.; Shinohara, R.T.; Berger, C.; Ha, S.M.; Rozycki, M.; et al. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv 2018, arXiv:1811.02629. [Google Scholar]
  50. Zheng, R.; Zhong, Y.; Yan, S.; Sun, H.; Shen, H.; Huang, K. MsVRL: Self-supervised multiscale visual representation learning via cross-level consistency for medical image segmentation. IEEE Trans. Med. Imaging 2022, 42, 91–102. [Google Scholar]
  51. Zhao, L.; Jia, C.; Ma, J.; Shao, Y.; Liu, Z.; Yuan, H. Medical image segmentation based on self-supervised hybrid fusion network. Front. Oncol. 2023, 13, 1109786. [Google Scholar]
  52. Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430. [Google Scholar]
  53. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv 2018, arXiv:1803.07728. [Google Scholar]
  54. Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 69–84. [Google Scholar]
  55. Kingma, D.P. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  56. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
  57. Zhao, H. A fast sweeping method for eikonal equations. Math. Comput. 2005, 74, 603–627. [Google Scholar] [CrossRef]
  58. Xue, Y.; Tang, H.; Qiao, Z.; Gong, G.; Yin, Y.; Qian, Z.; Huang, C.; Fan, W.; Huang, X. Shape-aware organ segmentation by predicting signed distance maps. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–8 February 2020; Volume 34, pp. 12565–12572. [Google Scholar]
  59. Cardoso, M.J.; Li, W.; Brown, R.; Ma, N.; Kerfoot, E.; Wang, Y.; Murrey, B.; Myronenko, A.; Zhao, C.; Yang, D.; et al. Monai: An open-source framework for deep learning in healthcare. arXiv 2022, arXiv:2211.02701. [Google Scholar]
  60. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2016: 19th International Conference, Athens, Greece, 17–21 October 2016; Proceedings, Part II 19. Springer: Berlin/Heidelberg, Germany, 2016; pp. 424–432. [Google Scholar]
  61. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; IEEE: New York, NY, USA, 2016; pp. 565–571. [Google Scholar]
  62. Isensee, F.; Jaeger, P.F.; Kohl, S.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef] [PubMed]
  63. Ma, J.; He, J.; Yang, X. Learning geodesic active contours for embedding object global information in segmentation CNNs. IEEE Trans. Med. Imaging 2020, 40, 93–104. [Google Scholar] [CrossRef] [PubMed]
  64. Li, S.; Zhang, C.; He, X. Shape-aware semi-supervised 3D semantic segmentation for medical images. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Proceedings, Part I 23. Springer: Berlin/Heidelberg, Germany, 2020; pp. 552–561. [Google Scholar]
  65. Luo, X.; Chen, J.; Song, T.; Wang, G. Semi-supervised medical image segmentation through dual-task consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 19–21 May 2021; Volume 35, pp. 8801–8809. [Google Scholar]
Figure 1. The first row from left to right represents MRI scans in the arterial phase, venous phase, and delayed phase, respectively. The second row shows the corresponding annotated liver and tumor regions for each phase, where white indicates the tumor, gray represents the liver, and black corresponds to the background.
Figure 2. Schematic of our segmentation model. It includes an encoder that extracts and compresses features and a decoder that reconstructs and restores them. Skip connections with attention gates link the encoder and decoder at each stage, and both components are built upon the Swin-Transformer block. After feature extraction and reconstruction, the resulting features are fed separately into the segmentation task and the SDM regression task. The segmentation block consists of a 2 × 2 deconvolution layer and a 1 × 1 convolution layer.
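To make the two output branches in Figure 2 concrete, the following is a minimal PyTorch sketch of a dual-head decoder output. The module name, channel counts, and the structure of the SDM head (a mirrored branch ending in tanh) are illustrative assumptions; only the segmentation block's 2 × 2 deconvolution followed by a 1 × 1 convolution is taken from the caption.

```python
import torch
import torch.nn as nn

class DualHead(nn.Module):
    """Hypothetical sketch of the two output branches in Figure 2.

    The segmentation block follows the caption (2x2x2 transposed conv + 1x1x1 conv);
    the SDM head is an assumption: a mirrored branch with tanh to bound the regressed
    signed distance map to [-1, 1].
    """
    def __init__(self, in_channels: int = 48, num_classes: int = 3):
        super().__init__()
        self.seg_head = nn.Sequential(
            nn.ConvTranspose3d(in_channels, in_channels // 2, kernel_size=2, stride=2),
            nn.Conv3d(in_channels // 2, num_classes, kernel_size=1),
        )
        self.sdm_head = nn.Sequential(
            nn.ConvTranspose3d(in_channels, in_channels // 2, kernel_size=2, stride=2),
            nn.Conv3d(in_channels // 2, num_classes, kernel_size=1),
            nn.Tanh(),  # signed distance maps are commonly normalized to [-1, 1]
        )

    def forward(self, feats: torch.Tensor):
        return self.seg_head(feats), self.sdm_head(feats)

# Example: decoder features of shape (B, C, D, H, W)
feats = torch.randn(1, 48, 48, 48, 48)
seg_logits, sdm_pred = DualHead()(feats)
```

Because both heads read the same decoder features, the SDM regression branch can act as a shape and global constraint on the segmentation branch, which is the motivation given for the multitask design.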
Figure 3. The Swin-Transformer block in the network architecture is composed of two consecutive Swin-Transformer layers.
Figure 4. Flowchart of the multi-head self-attention computation using the shifted-window strategy in the Swin-Transformer architecture.
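The regular/shifted window alternation sketched in Figures 3 and 4 can be illustrated as below. This is a simplified, self-contained sketch of 3D window partitioning and the half-window cyclic shift; the window size, feature map sizes, and function names are illustrative and not taken from the paper.

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a (B, D, H, W, C) feature volume into non-overlapping ws x ws x ws windows,
    returning (num_windows * B, ws**3, C) token groups for windowed self-attention."""
    B, D, H, W, C = x.shape
    x = x.view(B, D // ws, ws, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, ws ** 3, C)

# Figure 4's shifted-window step: the second layer of each block cyclically shifts the
# volume by half a window before partitioning, so tokens attend across window borders.
x = torch.randn(1, 8, 8, 8, 96)            # (B, D, H, W, C); sizes are illustrative
ws = 4                                     # assumed window size
regular_windows = window_partition(x, ws)  # input to W-MSA (first Swin layer)
shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2, -ws // 2), dims=(1, 2, 3))
shifted_windows = window_partition(shifted, ws)  # input to SW-MSA (second Swin layer)
print(regular_windows.shape, shifted_windows.shape)  # torch.Size([8, 64, 96]) twice
```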
Figure 5. Flowchart of the feature extraction block.
Figure 6. Schematic diagram of the attention gate mechanism.
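A minimal additive attention gate in the spirit of Figure 6 and Schlemper et al. [15] is sketched below; the channel sizes and the assumption that the gating signal has already been resized to the skip-connection resolution are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate (Figure 6); channel sizes are illustrative, not the paper's."""
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.theta = nn.Conv3d(skip_ch, inter_ch, kernel_size=1)  # project skip features
        self.phi = nn.Conv3d(gate_ch, inter_ch, kernel_size=1)    # project gating signal
        self.psi = nn.Conv3d(inter_ch, 1, kernel_size=1)          # attention coefficients
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # gate is assumed to be upsampled/resized to the skip resolution beforehand
        alpha = self.sigmoid(self.psi(self.relu(self.theta(skip) + self.phi(gate))))
        return skip * alpha  # re-weight encoder features before they enter the decoder

skip = torch.randn(1, 48, 32, 32, 32)   # encoder skip features
gate = torch.randn(1, 96, 32, 32, 32)   # decoder gating signal
out = AttentionGate(48, 96, 24)(skip, gate)   # same shape as `skip`
```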
Figure 7. Diagram of our pretraining method. A random transformation is applied to the input patch x, generating two different views, x1 and x2, which are then fed into the framework for the contrastive learning task.
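The pretraining objective implied by Figure 7, pulling the embeddings of the two views x1 and x2 of the same patch together while pushing other patches apart, can be written as a SimCLR-style NT-Xent loss [23]. Whether the framework uses exactly this formulation rather than, e.g., a momentum or BYOL-style variant [24,25] is not restated here, so treat the sketch as illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """SimCLR-style contrastive loss between two views (B, D) of the same patches."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2B x D, unit norm
    sim = z @ z.t() / tau                               # cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # exclude self-pairs
    B = z1.size(0)
    # for row i < B the positive is i + B, and vice versa
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# embeddings of the two augmented views x1, x2 after the encoder + projection head
loss = nt_xent(torch.randn(4, 128), torch.randn(4, 128))
```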
Figure 8. Qualitative comparison of our method in different MRI images.
Figure 9. Qualitative comparison between our method and other segmentation methods.
Table 1. Performance comparison of different methods.
| Metric | Region | 3DUNet [60] | V-Net [61] | nnU-Net [62] | UNetR [43] | SwinUNetR [46] | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DSC (%) | Liver | 94.43 ± 4.72 | 95.84 ± 3.11 | 93.53 ± 4.16 | 95.28 ± 4.09 | 95.51 ± 3.67 | 96.62 ± 2.19 |
| DSC (%) | Tumor | 73.84 ± 7.65 | 77.66 ± 6.54 | 82.91 ± 3.83 | 81.07 ± 4.03 | 83.24 ± 3.69 | 86.47 ± 2.56 |
| 95HD | Liver | 8.73 ± 7.63 | 4.96 ± 3.62 | 4.73 ± 2.82 | 7.73 ± 6.82 | 4.13 ± 3.18 | 3.62 ± 1.91 |
| 95HD | Tumor | 8.32 ± 3.82 | 6.89 ± 2.54 | 5.06 ± 1.94 | 6.01 ± 1.86 | 4.46 ± 1.61 | 4.12 ± 1.55 |
| ASD | Liver | 1.68 ± 0.55 | 1.38 ± 0.48 | 1.09 ± 0.42 | 1.32 ± 0.46 | 1.09 ± 0.39 | 0.92 ± 0.28 |
| ASD | Tumor | 1.99 ± 0.63 | 1.50 ± 0.50 | 1.21 ± 0.36 | 1.61 ± 0.58 | 1.25 ± 0.39 | 1.02 ± 0.32 |
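For reference, the DSC values in Table 1 measure the overlap between predicted and reference masks and can be computed per class as in the sketch below; 95HD and ASD are surface-distance metrics that are usually computed with a dedicated library such as MONAI [59], and the exact tooling is not restated here.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, label: int) -> float:
    """Dice similarity coefficient for one class label in two integer label volumes."""
    p, g = (pred == label), (gt == label)
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom > 0 else 1.0

# toy volumes with background = 0, liver = 1, tumor = 2 (values are illustrative)
pred = np.zeros((8, 8, 8), dtype=np.int64); pred[2:6, 2:6, 2:6] = 1; pred[3:5, 3:5, 3:5] = 2
gt = np.zeros_like(pred);                   gt[2:6, 2:6, 2:7] = 1;   gt[3:5, 3:5, 3:6] = 2
print(dice(pred, gt, 1), dice(pred, gt, 2))
```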
Table 2. Comparison of computational cost and model parameters.
| Method | Parameters (M) | GFLOPs |
| --- | --- | --- |
| 3DUNet [60] | 16.9 | 89.4 |
| V-Net [61] | 10.8 | 65.2 |
| nnU-Net [62] | 91.2 | 213.8 |
| UNetR [43] | 58.5 | 155.1 |
| SwinUNetR [46] | 62.1 | 237.9 |
| Ours | 63.4 | 188.6 |
Table 3. Comparison of different methods on DSC metrics with p-Values.
| Methods | Liver DSC (%) | p-Value | Tumor DSC (%) | p-Value |
| --- | --- | --- | --- | --- |
| 3DUNet [60] | 94.43 ± 4.72 | 5.19 × 10⁻³ | 73.84 ± 7.65 | 7.64 × 10⁻¹⁰ |
| V-Net [61] | 95.84 ± 3.11 | 7.37 × 10⁻³ | 77.66 ± 6.54 | 4.25 × 10⁻⁸ |
| nnU-Net [62] | 93.53 ± 4.16 | 1.15 × 10⁻² | 82.91 ± 3.83 | 3.19 × 10⁻³ |
| UNetR [43] | 95.51 ± 3.67 | 9.64 × 10⁻³ | 81.07 ± 4.03 | 1.65 × 10⁻⁵ |
| SwinUNetR [46] | 95.51 ± 3.67 | 5.13 × 10⁻³ | 83.24 ± 3.69 | 2.49 × 10⁻⁴ |
| Ours | 96.62 ± 2.19 | – | 86.47 ± 2.56 | – |
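The p-values in Table 3 come from paired comparisons of per-case DSC scores between each baseline and our model. A hedged sketch of such a comparison is shown below; the choice of the Wilcoxon signed-rank test and the numbers are illustrative assumptions, since the specific test is not restated here.

```python
import numpy as np
from scipy import stats

# per-case tumor DSC values for two methods on the same test cases (illustrative numbers)
dsc_ours = np.array([0.87, 0.85, 0.88, 0.84, 0.89, 0.86])
dsc_baseline = np.array([0.82, 0.80, 0.85, 0.79, 0.84, 0.83])

# paired comparison: each case serves as its own control
stat, p_value = stats.wilcoxon(dsc_ours, dsc_baseline)
print(f"Wilcoxon signed-rank p-value: {p_value:.4g}")
```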
Table 4. Comparison of different methods on Volume Similarity metrics.
| Methods | Liver Volume Similarity (%) [mean ± std] | Tumor Volume Similarity (%) [mean ± std] |
| --- | --- | --- |
| 3DUNet [60] | −10.02 ± 14.96 | −12.34 ± 17.62 |
| V-Net [61] | −4.71 ± 16.93 | −15.89 ± 18.12 |
| nnU-Net [62] | +3.86 ± 11.25 | +8.21 ± 10.56 |
| UNetR [43] | −6.08 ± 18.97 | −13.45 ± 14.22 |
| SwinUNetR [46] | +5.72 ± 12.39 | −7.62 ± 11.73 |
| Ours | +4.32 ± 9.53 | +4.67 ± 8.67 |
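The signed values in Table 4 indicate whether a method over-segments (positive) or under-segments (negative) relative to the reference volume. One common signed formulation is the relative volume difference sketched below; this particular definition is an assumption, as the paper's exact formula is not restated here.

```python
import numpy as np

def relative_volume_difference(pred: np.ndarray, gt: np.ndarray, label: int) -> float:
    """Signed relative volume difference in percent: positive means over-segmentation.
    This particular definition is an assumption; other signed volume metrics exist."""
    v_pred = float((pred == label).sum())
    v_gt = float((gt == label).sum())
    return 100.0 * (v_pred - v_gt) / v_gt

pred = np.zeros((8, 8, 8), dtype=np.int64); pred[2:6, 2:6, 2:6] = 2
gt = np.zeros_like(pred);                   gt[2:6, 2:6, 2:7] = 2
print(relative_volume_difference(pred, gt, 2))  # negative: the prediction is smaller
```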
Table 5. Ablation study for liver and tumor segmentation.
| SSL | AG | SDM | Liver DSC (%) | Tumor DSC (%) | Liver 95HD | Tumor 95HD | Liver ASD | Tumor ASD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  | 95.28 ± 3.45 | 83.62 ± 3.62 | 4.08 ± 2.99 | 4.56 ± 1.78 | 1.09 ± 0.40 | 1.16 ± 0.35 |
|  |  |  | 95.87 ± 2.34 | 85.01 ± 2.91 | 3.88 ± 2.32 | 4.31 ± 1.53 | 1.11 ± 0.36 | 1.09 ± 0.75 |
|  |  |  | 96.15 ± 2.91 | 86.18 ± 2.68 | 3.76 ± 2.19 | 4.21 ± 1.65 | 0.97 ± 0.31 | 1.09 ± 0.38 |
|  |  |  | 96.62 ± 2.19 | 86.47 ± 2.56 | 3.62 ± 1.91 | 4.12 ± 1.55 | 0.92 ± 0.28 | 1.02 ± 0.32 |
Table 6. Quantitative comparison of our method on different MRI phases.
| Phase | Liver DSC | Tumor DSC | Liver 95HD | Tumor 95HD | Liver ASD | Tumor ASD |
| --- | --- | --- | --- | --- | --- | --- |
| Arterial phase | 0.968 ± 0.025 | 0.878 ± 0.066 | 3.93 ± 1.49 | 4.11 ± 1.33 | 0.91 ± 0.27 | 1.01 ± 0.30 |
| Venous phase | 0.967 ± 0.022 | 0.868 ± 0.065 | 4.53 ± 3.74 | 6.74 ± 2.28 | 1.82 ± 0.54 | 2.97 ± 0.88 |
| Delayed phase | 0.963 ± 0.023 | 0.851 ± 0.074 | 4.16 ± 3.64 | 5.81 ± 2.73 | 1.61 ± 0.93 | 2.46 ± 1.07 |
| Combined | 0.966 ± 0.022 | 0.865 ± 0.026 | 3.62 ± 1.91 | 4.12 ± 1.55 | 0.92 ± 0.28 | 1.02 ± 0.32 |
Table 7. Ablation study on the impact of the SDM loss coefficient on model performance.
| Weight (α) | Liver DSC | Tumor DSC | Liver 95HD | Tumor 95HD | Liver ASD | Tumor ASD |
| --- | --- | --- | --- | --- | --- | --- |
| α = 0.9 | 0.958 ± 0.025 | 0.848 ± 0.066 | 6.03 ± 4.00 | 6.20 ± 3.33 | 1.52 ± 0.44 | 1.86 ± 0.69 |
| α = 0.7 | 0.957 ± 0.022 | 0.858 ± 0.025 | 5.53 ± 3.74 | 5.74 ± 2.28 | 1.75 ± 0.31 | 1.97 ± 0.44 |
| α = 0.5 | 0.963 ± 0.023 | 0.851 ± 0.024 | 5.16 ± 2.64 | 5.81 ± 2.73 | 1.62 ± 0.39 | 1.67 ± 0.45 |
| α = 0.3 | 0.965 ± 0.022 | 0.862 ± 0.026 | 4.24 ± 1.64 | 4.62 ± 1.86 | 0.90 ± 0.26 | 1.43 ± 0.36 |
| α = 0.1 | 0.966 ± 0.022 | 0.865 ± 0.026 | 3.62 ± 1.91 | 4.12 ± 1.55 | 0.92 ± 0.28 | 1.02 ± 0.32 |
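Table 7 varies the weight α given to the SDM regression term relative to the segmentation loss. The sketch below shows one way to derive an SDM training target from a binary mask with Euclidean distance transforms, in the spirit of [58]; the sign convention (negative inside, positive outside), the per-volume normalization, and the combined loss written in the final comment are assumptions rather than the paper's exact formulation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask: np.ndarray) -> np.ndarray:
    """Signed distance map of a binary mask: negative inside the object, positive
    outside, roughly zero on the boundary; normalized to about [-1, 1] (assumed)."""
    if mask.sum() == 0 or mask.sum() == mask.size:
        return np.zeros(mask.shape, dtype=np.float32)
    dist_out = distance_transform_edt(1 - mask)  # distance to the object, outside it
    dist_in = distance_transform_edt(mask)       # distance to the background, inside it
    sdm = dist_out / dist_out.max() - dist_in / dist_in.max()
    return sdm.astype(np.float32)

mask = np.zeros((8, 8, 8), dtype=np.uint8); mask[2:6, 2:6, 2:6] = 1
sdm_target = signed_distance_map(mask)

# assumed combined objective for Table 7: alpha scales the SDM regression (e.g., MSE) term
# loss = seg_loss(seg_logits, mask) + alpha * mse_loss(sdm_pred, sdm_target)
```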