Article

Scattering-Based Self-Supervised Learning for Label-Efficient Cardiac Image Segmentation

Department of Computer Engineering, Inonu University, Malatya 44200, Türkiye
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 506; https://doi.org/10.3390/electronics15030506
Submission received: 13 December 2025 / Revised: 19 January 2026 / Accepted: 22 January 2026 / Published: 24 January 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

Deep learning models based on supervised learning rely heavily on large annotated datasets; in medical image segmentation in particular, the requirement for pixel-level annotations makes the labeling process labor-intensive, time-consuming, and expensive. To overcome these limitations, self-supervised learning (SSL) has emerged as a promising alternative that learns generalizable representations from unlabeled data; however, existing SSL frameworks often employ highly parameterized encoders that are computationally expensive and may lack robustness in label-scarce settings. In this work, we propose a scattering-based SSL framework that integrates Wavelet Scattering Networks (WSNs) and Parametric Scattering Networks (PSNs) into a Bootstrap Your Own Latent (BYOL) pretraining pipeline. By replacing the initial stages of the BYOL encoder with fixed or learnable scattering-based front-ends, the proposed method reduces the number of learnable parameters while embedding translation-invariant and small deformation-stable representations into the SSL pipeline. The pretrained encoders are transferred to a U-Net and fine-tuned for cardiac image segmentation on two datasets with different imaging modalities, namely, cardiac cine MRI (ACDC) and cardiac CT (CHD), under varying amounts of labeled data. Experimental results show that scattering-based SSL pretraining consistently improves segmentation performance over random initialization and ImageNet pretraining in low-label regimes, with particularly pronounced gains when only a few labeled patients are available. Notably, the PSN variant achieves improvements of 4.66% and 2.11% in average Dice score over standard BYOL with only 5 and 10 labeled patients, respectively, on the ACDC dataset. These results demonstrate that integrating mathematically grounded scattering representations into SSL pipelines provides a robust and data-efficient initialization strategy for cardiac image segmentation, particularly under limited annotation and domain shift.

1. Introduction

In the past decade, deep learning models based on supervised learning, which can automatically extract problem-specific features without the need for expert knowledge, have demonstrated remarkable performance in computer vision tasks such as image classification [1,2,3], object detection [4,5], and image segmentation [6,7,8]. However, supervised learning approaches need large amounts of manually labeled data to extract distinctive features effectively [9]. Although many labeling tools have been developed to facilitate the annotation process, obtaining high-quality labeled data remains labor-intensive, time-consuming, and expensive. This challenge becomes particularly pronounced in fields such as medical imaging, where image acquisition is difficult and expert knowledge is essential for accurate annotation [10]. Furthermore, the problem is even more critical in image segmentation, where the need for pixel-level annotations significantly increases the annotation burden [11].
To mitigate the dependence on large annotated datasets, transfer learning has become a widely adopted strategy, where models pretrained on large-scale natural image datasets such as ImageNet [12] are fine-tuned for a target task with limited labeled data [13,14]. However, transfer learning relies on pretraining with very large labeled datasets, and its effectiveness diminishes when there is a substantial domain gap between the source and target domains, causing the learned representations to generalize poorly to the target task [15,16].
Due to the limitations of supervised learning-based deep learning methods, self-supervised learning (SSL) approaches have gained significant attention in recent years. SSL enables models to learn generalizable representations directly from unlabeled data by constructing supervision signals from the data itself [17,18,19,20,21]. By leveraging large-scale unlabeled datasets, SSL methods have achieved performance competitive with supervised approaches [22,23,24,25,26,27]. In these methods, the supervision signal is derived directly from the data itself through a pretext task, where the model learns to solve an auxiliary task without requiring manual annotations. A model pretrained using an SSL framework can be effectively transferred to downstream tasks such as image classification, object detection, and image segmentation, often requiring only a limited amount of labeled data. This strategy offers a strong alternative to training models from scratch with random initialization, as the pretrained representations provide a more effective and generalizable starting point for downstream learning.
Despite the significant success of SSL methods in leveraging large unlabeled datasets, these networks have a large number of learnable parameters, making their training computationally expensive. Wavelet Scattering Networks (WSNs) [28] offer a promising solution to this challenge. As a mathematically structured and predefined set of wavelet filters, WSNs produce translation-invariant and small deformation-stable representations without requiring extensive parameter tuning or large training datasets. By replacing the initial stages of the SSL encoder with a WSN, the number of learnable parameters can be significantly reduced, while preserving the model’s ability to capture discriminative features. This parameter-efficient and stability-enhanced design is particularly advantageous for downstream medical image segmentation, where datasets are small and robust feature representations are essential for accurate fine-tuning.
While fixed WSNs provide strong inductive priors, their predefined filters may limit adaptability to domain-specific image characteristics. To address this limitation, Parametric Scattering Networks (PSNs) extend the scattering framework by introducing a small set of learnable wavelet parameters, enabling the wavelet filters to adapt to the underlying data distribution while preserving the stability properties of the scattering transform. Building upon the complementary strengths of SSL, WSNs, and PSN, this work proposes a scattering-based SSL framework that integrates both fixed and parametric scattering front-ends into the SSL encoder architecture. By replacing the initial stages of the SSL encoder with scattering-based front-ends, we introduce a more stable and parameter-efficient representation learning framework tailored to the characteristics of cardiac MRI. Our experiments demonstrate that this design not only reduces model complexity but also yields substantial gains in downstream segmentation performance compared to conventional SSL and supervised pretraining when the amount of available labeled data is limited. These results highlight the potential of scattering-based SSL as a powerful and data-efficient alternative for cardiac image segmentation.
The main contributions of this paper are summarized as follows:
  • We propose a scattering-based self-supervised learning (SSL) framework that integrates both Wavelet Scattering Networks (WSNs) and Parametric Scattering Networks (PSNs) into the initial stages of the SSL encoder. By replacing the early convolutional layers with scattering-based front-ends, the proposed design reduces the number of learnable parameters while preserving the expressive capacity of the deep encoder.
  • We introduce a stability-aware representation learning strategy that explicitly exploits the translation invariance and deformation stability properties of scattering transforms within an SSL pipeline. By embedding these mathematically grounded priors into self-supervised pretraining, the proposed approach yields a more robust and data-efficient initialization for downstream cardiac image segmentation, particularly under limited supervision.
  • Through extensive experiments on two cardiac imaging datasets with different modalities (cardiac cine MRI and cardiac CT), we demonstrate that the proposed scattering-based SSL framework consistently improves segmentation performance over random initialization and standard SSL baselines in label-scarce regimes. Among the evaluated variants, the PSN-based approach exhibits particularly strong performance in low-label settings, highlighting the benefit of combining structured scattering priors with a small number of learnable wavelet parameters for cardiac image segmentation.
The remainder of this paper is organized as follows: Section 2 presents a review of related work covering SSL, its application to semantic segmentation, and WSNs. Section 3 introduces the proposed scattering-based SSL method. Section 4 presents the experimental design and reports quantitative and qualitative results. Section 5 discusses our findings, while Section 6 offers the conclusion of our work.

2. Related Works

2.1. Self-Supervised Learning

Self-supervised learning (SSL) is a representation learning approach that leverages the intrinsic structure of data to generate supervision signals, eliminating the need for manually annotated labels [29]. In SSL, a model is initially trained to solve a pretext task, which is a carefully designed auxiliary task that enables the model to learn generalizable representations [30]. The representations learned through the pretext task are transferred to downstream tasks such as image classification, object detection, or image segmentation, which often involve only a limited amount of labeled data [31].
Early SSL methods relied on handcrafted pretext tasks intended to capture spatial and semantic relationships within images. These include context prediction [21], which involves predicting the relative position of one image patch with respect to another, jigsaw puzzle solving [18], in which the model predicts the correct spatial arrangement of randomly shuffled image patches, and rotation prediction [19], which requires the model to classify the degree of rotation applied to an image. While these handcrafted pretext tasks demonstrated the potential of self-supervised learning, their effectiveness was often constrained by task-specific biases, limited scalability across domains, and restricted architectural generality.
To overcome these limitations, contrastive learning emerged as a more effective and generalizable framework. Instead of relying on manually crafted supervision signals, contrastive methods train models to distinguish between positive pairs, which are two different augmented views of the same image, and negative pairs, which are views from different images. The goal is to bring positive pairs closer together in the feature space while pushing negative pairs apart [32]. This instance discrimination [33] objective encourages the model to learn invariant and discriminative representations, leading to significant improvements in downstream task performance. Prominent contrastive methods such as SimCLR [22] and MoCo [27] have played a pivotal role in advancing this direction. SimCLR relies on large batch sizes to provide sufficient negative samples and uses a simple yet effective architecture consisting of an encoder and a projection head trained with a contrastive loss [34,35,36]. In contrast, MoCo introduces a momentum-updated encoder and a memory bank to maintain a large and consistent queue of negative samples across mini-batches, enabling efficient training even with smaller batch sizes.
While contrastive approaches have shown strong performance, their reliance on large numbers of negative samples has motivated the development of non-contrastive approaches such as BYOL [26] and SimSiam [37], which aim to learn meaningful representations without using negative pairs. These methods employ Siamese architectures in which two augmented views of the same image are processed by parallel networks. BYOL employs a momentum-updated target encoder and a prediction head to align representations, whereas SimSiam simplifies the design by removing the momentum encoder and incorporating a stop-gradient operation to prevent representational collapse. In the context of medical imaging, where visual distinctions between pathological and healthy structures can be subtle, non-contrastive self-supervised learning methods offer a promising avenue [38].

2.2. Self-Supervised Learning for Semantic Segmentation

In traditional deep learning pipelines for semantic segmentation, the encoder network is often initialized with weights pretrained on large-scale labeled datasets such as ImageNet. Methods like DeepLab [39], FCN [7], and SegNet [6] have demonstrated substantial performance improvements when leveraging ImageNet-pretrained backbones. However, supervised pretraining on natural image datasets such as ImageNet introduces domain-specific biases that may not transfer effectively to target domains with fundamentally different visual characteristics [40]. This limitation becomes particularly pronounced in medical imaging, where anatomical structures, imaging modalities, and intensity distributions differ markedly from those of natural scenes.
To address the limitations of supervised pretraining, SSL has emerged as a promising alternative for learning transferable image representations without manual annotations [41]. Among SSL strategies, image-level self-supervised learning methods aim to learn global feature representations by pretraining encoders on instance discrimination tasks and subsequently fine-tuning them for downstream semantic segmentation. A widely adopted strategy is to pretrain only the encoder, which corresponds to the downsampling path of segmentation models, while the decoder is randomly initialized and trained during the supervised fine-tuning phase.
Several image-level SSL frameworks have been proposed for image segmentation. For example, Zeng et al. [42] leverage the inherent positional relationships between image slices in volumetric datasets to design a contrastive learning framework that captures spatial context during pretraining. BT-UNet [43] integrates Barlow Twins with a U-Net encoder to learn robust, redundancy-reduced representations from unlabeled biomedical images. More recently, Kalapos and Gyires-Tóth [44] demonstrated that initializing a U-Net encoder with BYOL pretrained weights leads to notable improvements in medical image segmentation performance. Building on this idea, we adopt a similar self-supervised pretraining strategy based on BYOL to initialize the U-Net encoder. However, our approach introduces a key architectural modification by replacing the initial stages of the BYOL encoder with a scattering-based front-end.
In contrast to image-level approaches, pixel-level self-supervised learning methods aim to directly learn dense, spatially aware representations for semantic segmentation. Rather than operating on global image embeddings, these methods formulate pretext tasks at the resolution of individual pixels or regions, encouraging the model to capture fine-grained semantic and structural information required for dense prediction tasks. DenseCL [45] extends the instance discrimination paradigm to dense prediction by aligning pixel-wise features between different augmentations of the same image while contrasting them with pixels from other images. PixPro [46] proposes a novel pixel-to-propagation consistency loss that smooths pixel features using a propagation module, enabling the model to learn spatially coherent and semantically meaningful representations for dense prediction tasks. Furthermore, Chaitanya et al. [47] propose a contrastive learning framework that jointly leverages both global image representations and localized pixel-level features, enabling the model to capture domain-specific structural patterns critical for accurate segmentation.

2.3. Wavelet Scattering Networks (WSNs)

Scattering transforms, originally introduced by Mallat [48], are mathematically grounded feature extractors. Unlike conventional Convolutional Neural Networks (CNNs) that learn filters from data, the scattering transform employs a fixed set of wavelet filters. WSNs compute representations through cascades of wavelet convolutions, complex modulus nonlinearities, and low-pass filtering operations, generating hierarchical features that are both stable to small deformations and invariant to local translations [28,48]. Furthermore, rotation invariance can be incorporated through extensions such as the roto-translation scattering transform proposed by Sifre and Mallat [49]. Early scattering-based models demonstrated strong performance on tasks involving handwritten digits and texture recognition, but their scalability to more complex visual tasks has been limited [50].
To address these limitations, hybrid architectures that combine WSNs with learnable CNN layers have been widely explored. In such models, WSNs serve as fixed front-end feature extractors, followed by trainable convolutional layers. Oyallon et al. [51,52] showed that such hybrid WSN-CNN architectures can achieve competitive accuracy with fewer parameters and improved generalization. Furthermore, Cotter and Kingsbury [53] proposed a locally invariant convolutional layer, a learnable extension of the scattering transform in which the fixed scattering coefficients are modulated with learnable weights. More recently, Gauthier et al. [54] introduced the PSN, overcoming the design constraints of classical WSNs by learning the parameters of the mother wavelet via backpropagation rather than relying on a predefined filter bank.
Beyond supervised and hybrid architectures, scattering transforms have also been explored in the context of self-supervised learning. ScatSimCLR [55] is the first study to demonstrate the use of WSNs within a self-supervised learning framework. Although ScatSimCLR appears similar to our proposed method in its use of WSNs within self-supervised learning, our framework differs in several key aspects. We employ a non-contrastive BYOL objective rather than contrastive SimCLR, replace only the initial stages of ResNet-50 instead of the entire encoder, and evaluate the learned representations on segmentation rather than classification. Moreover, unlike ScatSimCLR, our method extends beyond fixed WSNs by also incorporating the learnable Parametric Scattering Network.

3. Methods

In this section, we first provide background on the key components of our proposed method, followed by a detailed description of the overall proposed method.

3.1. Bootstrap Your Own Latent (BYOL)

BYOL is a non-contrastive self-supervised learning method designed to learn useful visual representations without the need for negative samples. It involves two networks, called the online and target networks, that interact and learn from each other. Both are composed of an encoder and a projection head. The online network includes an additional component, a prediction head, which creates an asymmetric architecture essential for preventing collapse. During training, two distinct augmented views of the same image are passed through the online and target networks, and the online network is trained to predict the representation generated by the target network. The online network is updated by gradient-based optimization, whereas the target network is updated using an exponential moving average (EMA) of the online network's parameters. This asymmetric architecture and the delayed update via EMA are the core mechanisms that prevent representational collapse, allowing the model to learn meaningful and stable features. Unlike contrastive learning methods that rely on contrastive loss functions, BYOL employs a Mean Squared Error (MSE) loss, which minimizes the distance between the normalized embedding vectors of the online and target networks. The BYOL loss function is given in Equation (1), where $q_\theta(z_\theta)$ and $z_\xi$ denote the embeddings produced by the online and target networks, respectively.
$$\mathcal{L}_{\theta}^{\mathrm{BYOL}} = 2 - 2 \cdot \frac{\left\langle q_\theta(z_\theta),\, z_\xi \right\rangle}{\left\lVert q_\theta(z_\theta) \right\rVert_2 \cdot \left\lVert z_\xi \right\rVert_2} \quad (1)$$
BYOL has also been shown to exhibit greater robustness to variations in batch size and augmentation strategies compared to contrastive learning methods. Moreover, because it does not require the definition of negative samples, BYOL is particularly well suited for medical imaging, where the high structural similarity across scans makes reliable negative pair selection challenging. The overall architecture and mechanism of the BYOL method are shown in Figure 1.
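For concreteness, the loss in Equation (1) reduces to a negative cosine similarity between the online prediction and the target projection. The following is a minimal PyTorch sketch of this computation, assuming batched embedding tensors; the names are illustrative and the stop-gradient mirrors the EMA-updated target branch described above.

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    """Equation (1) for a batch: 2 - 2 * cosine similarity between the
    online prediction q_theta(z_theta) and the target projection z_xi."""
    q = F.normalize(online_pred, dim=-1)           # q_theta(z_theta) / ||.||_2
    z = F.normalize(target_proj.detach(), dim=-1)  # stop-gradient: no gradients to target
    return (2 - 2 * (q * z).sum(dim=-1)).mean()

# In BYOL the loss is symmetrized by swapping the two augmented views
# and averaging the two resulting terms.
```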

3.2. Scattering-Based Feature Extraction

A wavelet is a localized function in both the spatial and frequency domains and is characterized by having zero mean. The scattering transform employs a wavelet filter bank parameterized by the number of scales $J$ and the number of orientations $L$. Each wavelet $\psi_{j,\theta}$ is constructed by dilating the mother wavelet $\psi$ with a scale factor $2^j$ and applying a rotation of angle $\theta$. This results in a family of wavelets $\{\psi_{j,\theta}\}$ that capture signal variations across multiple resolutions and directions. The 2D Wavelet Scattering Transform constructs hierarchical, translation-invariant, and deformation-stable signal representations by cascading wavelet convolutions with complex modulus nonlinearities followed by low-pass filtering. Given an input signal $x(u)$, where $u$ denotes the spatial location, the transform produces a set of scattering coefficients $S_0 x$, $S_1 x$, and $S_2 x$ corresponding to the zeroth-, first-, and second-order features, respectively. To compute the zeroth-order scattering coefficients, the input grayscale image $x$ is convolved with a low-pass averaging filter $\phi_J$, typically chosen as a Gaussian kernel with a spatial window corresponding to scale $2^J$. Formally, it is defined as:
$$S_0 x = x \star \phi_J$$
The low-pass filtering operation yields a coarse representation that captures the global structure of the image and entails downsampling by a factor of $2^J$. Consequently, for an input grayscale image of size $N \times N$, the zeroth-order scattering output $S_0 x$ is a feature map of reduced resolution $N/2^J \times N/2^J$. While this averaging yields representations invariant to translations smaller than $2^J$, it also discards high-frequency information that may be critical for distinguishing fine details. The subsequent stages of the scattering transform address this limitation by recovering the lost high-frequency content through cascaded wavelet filtering and nonlinearity. The first-order coefficients are computed by convolving $x$ with complex wavelets $\psi_{\lambda_1}$, applying a modulus nonlinearity, and then a low-pass filter:
$$S_1 x(\lambda_1) = \left| x \star \psi_{\lambda_1} \right| \star \phi_J$$
For a grayscale image input of size $N \times N$, the first-order scattering output $S_1 x$ consists of $JL$ feature maps, each with a spatial resolution of $N/2^J \times N/2^J$. The second-order scattering coefficients extend the cascade by applying an additional wavelet transform to the modulus output of the first-order scattering stage:
$$S_2 x(\lambda_1, \lambda_2) = \left| \left| x \star \psi_{\lambda_1} \right| \star \psi_{\lambda_2} \right| \star \phi_J$$
For a grayscale image input of size $N \times N$, the second-order scattering output $S_2 x$ consists of $\frac{J(J-1)}{2} L^2$ feature maps, each with a spatial resolution of $N/2^J \times N/2^J$. The final scattering representation is the concatenation of $S_0 x$, $S_1 x$, and $S_2 x$, which captures local textures and structures across multiple scales and orientations. An example configuration of a Wavelet Scattering Network with $J = 2$ scales and $L = 3$ orientations is illustrated in Figure 2.
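These channel counts and resolutions can be verified directly with the Kymatio library (also used for our WSN implementation in Section 4.2). The sketch below is illustrative; the input size is arbitrary and the output layout follows Kymatio's conventions.

```python
import torch
from kymatio.torch import Scattering2D

J, L, N = 2, 3, 224                      # the Figure 2 configuration
scattering = Scattering2D(J=J, shape=(N, N), L=L)

x = torch.randn(4, 1, N, N)              # a batch of grayscale images
Sx = scattering(x)                       # (4, 1, C, N/2^J, N/2^J)

# 1 zeroth-order + J*L first-order + L^2 * J*(J-1)/2 second-order maps
C = 1 + J * L + L ** 2 * J * (J - 1) // 2
print(Sx.shape, C)                       # torch.Size([4, 1, 16, 56, 56]) 16
```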

3.3. Parametric Scattering Networks (PSNs)

In standard scattering transforms, wavelet filters are constructed from a parameterized mother wavelet in a way that forms a tight frame with well-established energy preservation properties. This tight-frame condition is typically achieved by selecting the wavelet parameters according to a predefined analytical scheme. Parametric Scattering Networks revisit the optimality of such conventional filter bank designs by introducing wavelet parameters that are learned during training. This formulation focuses on Morlet wavelets and introduces a learning-based approach to adapt their Gaussian window scale $\sigma$, global orientation $\theta$, frequency scale $\xi$, and aspect ratio $\gamma$, thereby enabling task-specific parameterization of the scattering transform. The Morlet wavelets are expressed as:
$$\psi_{\sigma,\theta,\xi,\gamma}(u) = \exp\left( -\frac{\lVert D_\gamma R_\theta u \rVert^2}{2\sigma^2} \right) \left( e^{i\xi u'} - \beta \right),$$
where $R_\theta$ is the rotation matrix of angle $\theta$, $D_\gamma$ is an anisotropic scaling matrix with aspect ratio $\gamma$, $u'$ denotes the first coordinate of the rotated position $R_\theta u$, and $\beta$ is a normalization constant ensuring zero mean. To provide a stable starting point for optimization, the PSN initializes the wavelet parameters using a tight-frame scheme. From this initialization, the parameters are optimized end-to-end via backpropagation, allowing the scattering representation to adapt to the structure of the data. This task-specific adaptation has been shown to yield substantial performance improvements in small-sample classification settings compared to fixed scattering transforms.
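The sketch below illustrates how such a Morlet filter can be sampled on a discrete grid in PyTorch with $\sigma$, $\theta$, $\xi$, and $\gamma$ as differentiable tensors, so gradients reach the wavelet parameters. It is an illustrative reimplementation of the equation above under the stated conventions (with $D_\gamma$ taken as $\mathrm{diag}(1, \gamma)$), not the official PSN code used in our experiments.

```python
import torch

def morlet(size: int, sigma: torch.Tensor, theta: torch.Tensor,
           xi: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
    """Sample a zero-mean Morlet wavelet on a size x size grid.

    sigma, theta, xi, gamma may be nn.Parameters, so gradients flow back
    to the wavelet parameters, as in a PSN."""
    g = torch.arange(size, dtype=torch.float32) - size // 2
    y, x = torch.meshgrid(g, g, indexing="ij")
    u = torch.stack([x, y], dim=-1)                      # spatial positions

    # Rotate positions by theta (R_theta u), then weight the second axis by gamma (D_gamma)
    rot = torch.stack([torch.stack([torch.cos(theta), -torch.sin(theta)]),
                       torch.stack([torch.sin(theta),  torch.cos(theta)])])
    v = u @ rot.T
    quad = v[..., 0] ** 2 + (gamma * v[..., 1]) ** 2     # ||D_gamma R_theta u||^2

    envelope = torch.exp(-quad / (2 * sigma ** 2))
    carrier = torch.exp(1j * xi * v[..., 0])             # oscillation along u'

    beta = (envelope * carrier).sum() / envelope.sum()   # enforces zero mean
    return envelope * (carrier - beta)

# Example: a learnable filter; initial values here are arbitrary placeholders
params = [torch.nn.Parameter(torch.tensor(v)) for v in (0.8, 0.0, 2.356, 0.5)]
psi = morlet(32, *params)   # complex (32, 32) tensor, differentiable w.r.t. params
```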

3.4. Image Segmentation with U-Net

U-Net is a widely used fully convolutional network designed for pixel-level prediction tasks and has demonstrated particular effectiveness in biomedical image segmentation. The network comprises an encoder path that progressively reduces the spatial resolution to extract high-level feature representations and a decoder path that gradually reconstructs the original resolution through upsampling, ultimately producing a dense pixel-wise prediction map. To mitigate the loss of spatial detail introduced by downsampling, skip connections are used to concatenate feature maps from the encoder with those in the corresponding decoder stages. These skip connections facilitate the recovery of fine-grained spatial information and substantially improve localization accuracy in segmentation tasks. The U-Net architecture is shown in Figure 3.

3.5. Proposed Method

The proposed method integrates scattering-based feature extractors into a BYOL self-supervised learning pipeline and subsequently transfers the pretrained encoder to a U-Net for downstream cardiac MRI segmentation. It consists of two main stages: (1) self-supervised pretraining using a modified BYOL architecture and (2) supervised fine-tuning for image segmentation with a U-Net model. In the pretraining stage, the modified BYOL framework is employed to learn general-purpose image representations from unlabeled cardiac MRI slices. The encoder network within the BYOL framework is traditionally a standard ResNet-50 architecture. ResNet-50 is conventionally structured into five sequential stages, with each stage operating at a specific spatial resolution and capturing progressively more abstract visual features.
In the proposed method, the early stages of ResNet-50, commonly referred to as Stage 1 and Stage 2, are replaced with either a WSN, which uses predefined Morlet wavelets, or a PSN, in which wavelet parameters are optimized end-to-end via backpropagation. Stage 1, often called the initial stem, consists of a 7 × 7 convolution, batch normalization, ReLU, and max-pooling, which performs initial feature extraction and downsampling. Stage 2 follows with a set of residual blocks that extract early semantic features while maintaining the spatial resolution. Replacing these components with a scattering front-end introduces mathematically grounded invariances and improves the robustness of low-level feature extraction.
Because the dimensionality of the scattering representation does not match the expected input shape of the subsequent ResNet-50 layers, a lightweight Feature Alignment Module (FAM) is introduced to ensure compatibility. The FAM first applies Group Normalization (GN) with the number of groups equal to the number of scattering channels, thereby performing channel-wise normalization in which each scattering coefficient is normalized independently. This normalization strategy is well suited to scattering representations, as each channel corresponds to a distinct scale–orientation response and should be normalized independently without introducing inter-channel coupling. Following normalization, a 2D convolutional projection with kernel size 3 × 3, stride 2, and padding 1 is employed to align both the spatial resolution and the channel dimensionality of the scattering features with the input requirements of ResNet-50 Stage 3. The stride-2 operation adjusts the spatial resolution accordingly, while the 3 × 3 convolution acts as a projection layer that maps the scattering coefficients to the 256-channel feature space expected by the subsequent residual blocks. A second Group Normalization layer with 32 groups is then applied to the projected feature map, followed by a ReLU activation. This configuration follows established normalization practices in deep residual architectures and provides stable training under the small batch sizes commonly used in self-supervised medical imaging setups. The design of the FAM is illustrated in Figure 4. After this module, the remaining stages of ResNet-50 (Stage 3, Stage 4, and Stage 5) are preserved in their standard form and operate sequentially on the adapted features. In this way, the model retains the deep encoder’s capacity to extract high-level semantic information while benefiting from the scattering front-end, which provides stable and robust low-level feature representations.
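A minimal PyTorch sketch of the FAM as described above is given below; the default channel counts (17 scattering coefficients projected to 256 channels) follow the configuration detailed in Section 4.2.

```python
import torch.nn as nn

class FeatureAlignmentModule(nn.Module):
    """FAM: channel-wise GN -> stride-2 3x3 projection -> GN(32) -> ReLU."""

    def __init__(self, in_channels: int = 17, out_channels: int = 256):
        super().__init__()
        self.align = nn.Sequential(
            # One group per channel: each scattering coefficient is normalized independently
            nn.GroupNorm(num_groups=in_channels, num_channels=in_channels),
            # 3x3 stride-2 projection to Stage 3 resolution and channel width
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(num_groups=32, num_channels=out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.align(x)
```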
This architectural replacement also yields a notable reduction in model complexity. After accounting for the FAM, substituting Stage 1 and Stage 2 with a scattering-based front-end provides a net reduction of approximately 180 k learnable parameters compared with the standard BYOL encoder. The WSN variant introduces no additional trainable parameters, while the PSN front-end remains extremely compact, adding only 68 learnable wavelet parameters. This design substantially decreases the parameter burden of the early feature extractor while preserving the expressive capacity of the deeper encoder.
During supervised fine-tuning, the pretrained encoder is transferred directly into a U-Net architecture for cardiac MRI segmentation. To ensure compatibility with the segmentation model, the architecture of the encoder used during self-supervised pretraining is preserved when transferring its weights to the U-Net. This architectural consistency is essential for enabling direct weight initialization from the pretrained model. Specifically, the encoder, incorporating either a WSN or PSN, is retained without modification during transfer, while the decoder follows the standard U-Net design and is initialized randomly, as it is not part of the pretraining process. Skip connections are established by extracting intermediate feature maps from the retained encoder stages, allowing the decoder to progressively recover spatial detail during upsampling. This modular design integrates the low-level, invariant features produced by the scattering front-end with the mid- and high-level representations learned by the remaining ResNet encoder, while the decoder focuses on spatial reconstruction and boundary refinement.
The segmentation model takes a 2D grayscale cardiac MRI slice as input and produces a multi-class, pixel-wise segmentation mask. To optimize the segmentation network, we employ a combination of Dice loss and cross-entropy loss, a formulation widely used in medical image segmentation due to its complementary strengths. The cross-entropy loss focuses on pixel-wise classification accuracy, while the Dice loss directly maximizes the overlap between the predicted and ground truth masks. By jointly minimizing these two losses, the model benefits from both robust per-pixel classification and improved region-level consistency, enabling more accurate boundary delineation. The overall structure of the proposed method is illustrated in Figure 5.
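A sketch of the combined objective is given below; the equal weighting of the two terms and the smoothing constant are illustrative assumptions rather than reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor,
                 eps: float = 1e-6) -> torch.Tensor:
    """Combined cross-entropy + soft Dice loss for multi-class segmentation.

    logits: (B, C, H, W) raw network outputs; target: (B, H, W) integer labels."""
    ce = F.cross_entropy(logits, target)

    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                   # aggregate over batch and space
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice_per_class = (2 * intersection + eps) / (cardinality + eps)
    return ce + (1 - dice_per_class.mean())
```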

4. Experiments and Results

4.1. Dataset

We used the publicly available cardiac cine-MRI dataset from the 2017 Automated Cardiac Diagnosis Challenge (ACDC) [56] in this study. The dataset comprises short-axis cine-MRI scans from 150 patients, which are divided according to the official challenge configuration into 100 cases for training/validation and 50 held-out cases for testing. These 150 patients are evenly distributed across five clinical subgroups: normal, myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, and abnormal right ventricle. The cine-MRI scans were acquired over a six-year period using 1.5 T and 3.0 T MRI scanners, with spatial resolutions ranging from 1.22 to 1.68 mm² per pixel. Each patient's scan captures the full cardiac cycle as a 3D volumetric stack of short-axis slices. Segmentation masks for the left ventricle (LV), right ventricle (RV), and myocardium (MYO) are provided only at the end-diastolic (ED) and end-systolic (ES) phases, while all remaining frames in the cardiac cycle are unlabeled and are leveraged for self-supervised pretraining. The dataset also contains substantial slice-wise variability in structure presence, with some slices lacking any cardiac anatomy and others containing different combinations of LV, RV, and MYO [57].
In addition to ACDC, we conducted experiments on the publicly available CHD dataset [58], which consists of cardiac computed tomography (CT) scans acquired from patients with congenital heart disease. The dataset originally contains 68 three-dimensional cardiac CT volumes; however, 1 case was excluded due to inconsistent spatial resolution, resulting in a total of 67 patients used in our experiments. The CHD dataset covers 14 distinct congenital heart disease types and exhibits substantially higher anatomical variability compared to the relatively regular anatomy present in ACDC. Manual voxel-level annotations are provided for seven cardiac structures, including the left ventricle (LV), right ventricle (RV), left atrium (LA), right atrium (RA), myocardium (MYO), aorta, and pulmonary artery.

4.2. Experimental Setup

All experiments were implemented using the PyTorch framework (version 2.9.0) and were conducted on an NVIDIA A100 GPU.

4.2.1. Self-Supervised Pretraining

Self-supervised representation learning was performed using the BYOL framework on unlabeled slices from both datasets. For pretraining on ACDC, all image slices from the unlabeled 4D cine-MRI volumes of the 100 training patients were used without accessing any segmentation annotations. This includes slices from all cardiac phases, while the ED and ES labels were explicitly ignored during this stage, yielding 27,253 unlabeled slices in total. For the CHD dataset, all image slices from the 50 training patients were likewise used without segmentation annotations, resulting in a total of 13,188 unlabeled slices for pretraining.
Three encoder architectures were evaluated: (1) a standard ResNet-50 baseline, (2) a modified ResNet-50, in which Stage 1 and Stage 2 were replaced with a WSN, and (3) a modified ResNet-50, in which Stage 1 and Stage 2 were replaced with a PSN.
The WSN was implemented using Kymatio [59] with scale $J = 1$ and $L = 16$ orientations instead of a deeper setup such as $J = 2$. This choice was motivated by the characteristics of cardiac MRI, where critical structures, such as the thin MYO and RV, contain fine spatial details. Because larger scattering scales introduce stronger averaging that tends to smooth out these small anatomical features, using $J = 1$ helps preserve fine spatial detail, while $L = 16$ provides sufficient directional information to effectively capture cardiac edges and textures. This configuration produces 17 scattering coefficients, which were then projected to 256 channels using the Feature Alignment Module (FAM) to match the expected input dimensionality of Stage 3 of ResNet-50. The PSN followed the same architectural structure as the WSN-based encoder but employed learnable wavelet parameters, using the official code provided by the authors of the Parametric Scattering Network paper.
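The 17-coefficient count follows because, with $J = 1$, no second-order paths exist (they require $j_2 > j_1$), leaving $1 + JL = 1 + 16 = 17$ channels at half the input resolution. This can be checked directly with Kymatio:

```python
import torch
from kymatio.torch import Scattering2D

# J = 1, L = 16: only zeroth- and first-order coefficients, 1 + 16 = 17 channels
scattering = Scattering2D(J=1, shape=(224, 224), L=16)
x = torch.randn(2, 1, 224, 224)
print(scattering(x).shape)  # torch.Size([2, 1, 17, 112, 112])
```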
The projection and prediction heads were implemented as two-layer multilayer perceptrons (MLPs) with hidden dimensions set to 4096. LARS [60] was used as the optimizer, as proposed in the BYOL paper. The learning rate was scheduled with cosine decay over 400 epochs and included a 10-epoch linear warmup phase. Training was conducted with a batch size of 128. Following the linear scaling rule, the base learning rate was set to 0.3 and adjusted in proportion to the batch size as $\mathrm{lr} = 0.3 \times \text{batch size} / 256$. The exponential moving average (EMA) momentum for the target encoder was initialized at 0.99 and increased toward 1.0 following a cosine schedule throughout training.
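The learning rate scaling and the EMA momentum schedule can be written compactly as below; the cosine form of the momentum schedule follows the BYOL paper.

```python
import math

def scaled_lr(base_lr: float = 0.3, batch_size: int = 128) -> float:
    """Linear scaling rule: lr = 0.3 * batch_size / 256 (gives 0.15 here)."""
    return base_lr * batch_size / 256

def ema_momentum(step: int, total_steps: int, tau_base: float = 0.99) -> float:
    """Cosine increase of the target EMA momentum from tau_base toward 1.0."""
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * step / total_steps) + 1) / 2

# Target update per step: xi <- tau * xi + (1 - tau) * theta
```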
Data augmentations included random resized cropping, random horizontal flipping, color jittering, and Gaussian blur. To encourage more spatially coherent cardiac structures within the crops, random resized cropping was performed with a scale range of $[0.2, 1.0]$, differing from the original BYOL configuration, which uses a minimum scale of 0.08. After cropping, all images were resized to a fixed resolution of $224 \times 224$.
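A sketch of this augmentation pipeline using torchvision is shown below; the jitter and blur strengths are illustrative defaults rather than values reported here.

```python
from torchvision import transforms

pretrain_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # minimum scale raised to 0.2
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4),  # grayscale-safe jitter
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
```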

4.2.2. Segmentation Fine-Tuning

Following self-supervised pretraining, the learned encoder weights were transferred to a U-Net architecture for downstream semantic segmentation. For supervised segmentation on the ACDC dataset, only the labeled end-diastolic (ED) and end-systolic (ES) slices were used. Across the 100 training patients, this corresponds to 1,902 labeled slices, with an average of approximately 19 slices per patient.
To ensure robust evaluation and pathology-balanced training splits, we employed stratified 5-fold cross-validation at the patient level, where each fold preserved the distribution of the five ACDC pathology categories. The entire 5-fold cross-validation process was repeated three times with different random seeds, generating new train/validation splits in each repetition and resulting in 15 independent training runs for each method. To prevent data leakage, all splits were performed at the patient level, meaning that slices from the same patient never appeared in both the training and validation sets of a single fold [61]. For each fold, the model achieving the best performance on the validation set was selected for final evaluation. These models were subsequently evaluated on the held-out test set consisting of 50 patients with a total of 1,076 labeled slices, and final performance was reported as the average Dice score (mean ± standard deviation).
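The patient-level stratified splitting can be sketched as follows; pathology labels serve as the stratification variable, and the function names are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def patient_level_folds(patient_ids, pathologies, n_splits=5, seed=0):
    """Yield (train_patients, val_patients) with pathology-stratified folds.

    Splitting is done over patients (one entry per patient), so slices from
    the same patient can never appear on both sides of a fold."""
    patient_ids = np.asarray(patient_ids)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(patient_ids, pathologies):
        yield patient_ids[train_idx], patient_ids[val_idx]

# Repeating with seed = 0, 1, 2 reproduces the 3 x 5 = 15 runs per method.
```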
To evaluate the label efficiency of the proposed models, we adopted a label-limited fine-tuning strategy similar to prior works [42,62,63]. At each cross-validation fold, we randomly sampled $M$ patients from the training patients and used only their corresponding segmentation masks for fine-tuning. We experimented with $M \in \{5, 10, 20, 40, 80\}$ to thoroughly assess the impact of pretraining under label-scarce segmentation settings. For each value of $M$, the labeled subset was drawn from the training portion of the corresponding cross-validation split while maintaining pathology balance.
For the CHD dataset, segmentation fine-tuning was performed using a patient-level split consisting of 50 patients for training and validation and 17 held-out patients for testing. Five-fold cross-validation was conducted on the training/validation subset. All splits were performed at the patient level to avoid slice-level data leakage.
Unlike ACDC, where annotations are available only at the ED and ES phases, each CHD volume contains dense slice-level annotations. To ensure a fair comparison of label efficiency across datasets, we restricted the amount of supervision per patient by selecting a fixed number of annotated slices. Specifically, for each CHD patient, 40 slices were selected by uniform sampling at fixed intervals along the axial direction of the volume, providing coverage of the full cardiac anatomy while limiting the amount of supervision per patient [64]. This resulted in 2,000 labeled slices for supervised fine-tuning when using all 50 training patients. Label-limited experiments were conducted with $M \in \{2, 5, 10, 20, 40\}$, where $M$ denotes the number of labeled patients used during training. The held-out test set consists of 17 patients with a total of 4,088 labeled slices and was used exclusively for final evaluation.
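A sketch of the uniform slice selection is given below; the exact interval-alignment convention is an assumption, as any fixed-interval scheme over the axial axis satisfies the description above.

```python
import numpy as np

def sample_slice_indices(num_slices: int, k: int = 40) -> np.ndarray:
    """Pick k axial slice indices at (approximately) fixed intervals."""
    return np.linspace(0, num_slices - 1, num=k).round().astype(int)

print(sample_slice_indices(163, 40)[:5])  # [ 0  4  8 12 17]
```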
Supervised training was performed using a combination of Dice loss and cross-entropy loss. The Adam optimizer was used with an initial learning rate of $10^{-4}$ and a batch size of 32. The learning rate schedule consisted of a 10-epoch warmup phase followed by cosine annealing over 150 epochs.

4.3. Evaluation Metrics

Our evaluation focuses on measuring the overlap between the predicted segmentation masks and the ground truth masks. Specifically, we utilize the Dice score. This metric was computed using the official evaluation code provided by ACDC, which relies on the medpy library for robust and standardized metric computation.
The Dice score is a widely used metric for evaluating the similarity between two sets, particularly in image segmentation tasks. It quantifies the spatial overlap between the prediction and the ground truth. The Dice score ranges from 0 to 1, where 1 indicates perfect agreement and 0 indicates no overlap. The formula for the Dice score is given by:
$$\text{Dice score} = \frac{2\,\lvert A \cap B \rvert}{\lvert A \rvert + \lvert B \rvert},$$
where $A$ and $B$ denote the predicted and ground-truth masks, respectively.
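A direct NumPy implementation of this metric is straightforward; for non-empty masks it matches the binary Dice provided by medpy, which the official ACDC evaluation code relies on.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Binary Dice overlap; matches medpy.metric.binary.dc for non-empty masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom > 0 else 0.0

# Per-structure evaluation: binarize the multi-class mask per label first,
# e.g., dice_score(pred_mask == lv_label, gt_mask == lv_label)
```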

4.4. Results

4.4.1. Quantitative Results

To evaluate the data efficiency of different encoder initialization strategies, we compare the segmentation performance of a U-Net model with a ResNet-50 encoder initialized with random weights, ImageNet-pretrained weights, and three SSL methods (BYOL, BYOL + WSN, and BYOL + PSN). Performance is assessed under varying amounts of labeled fine-tuning data to examine how effectively each method transfers to the downstream segmentation task. The quantitative results are detailed in four dedicated tables: Table 1 presents the average Dice score across all structures, while Table 2, Table 3, and Table 4 report the class-specific segmentation scores for LV, RV, and MYO, respectively. $M$ denotes the number of patients used for supervised fine-tuning. Results are reported as mean ± standard deviation over 5-fold cross-validation repeated three times with different random seeds. Best results are in bold and second best underlined.
As shown in Table 1, among all encoder initialization strategies, BYOL + PSN consistently achieves the highest average Dice scores across every labeled data setting, while BYOL + WSN ranks as the second best method. This performance highlights the benefit of combining self-supervised learning with scattering-based representations for robust cardiac segmentation. In contrast, random initialization yields the weakest overall performance, especially in the low-labeled data settings ($M \le 10$). ImageNet pretraining provides moderate improvements over random initialization but remains limited by the substantial domain gap between natural images and cardiac MRI. Standard BYOL outperforms both of these non-scattering baselines at every labeled data level because its non-contrastive self-supervised learning objective enables the extraction of domain-specific features directly from unlabeled cardiac MRI data prior to fine-tuning. As the amount of labeled data increases, all methods show steady improvements, reflecting the expected benefit of additional supervision.

However, the performance gap between methods varied significantly across the data regimes. When the number of labeled patients was highest ($M = 80$), the performance differences were minimal, with the gap between the best and worst method being only 2.2% average Dice score (0.9111 for BYOL + PSN vs. 0.8891 for Random Init). Conversely, in the most label-scarce setting ($M = 5$), the difference was stark. Random initialization (0.2949) and ImageNet pretraining (0.5763) exhibited significantly poorer performance and suffered from notably high standard deviations, indicating poor stability and generalization under limited supervision. In contrast, the scattering-based SSL methods (BYOL + WSN and BYOL + PSN) demonstrated exceptional data efficiency, performing robustly even with scarce labels. Although both scattering-based SSL methods outperformed BYOL, the most notable improvements were achieved by BYOL + PSN, which surpassed the base BYOL model with average Dice score increases of 4.66% at $M = 5$ and 2.11% at $M = 10$. These gains highlight that incorporating scattering front-ends enables SSL frameworks to produce stronger and more reliable representations in low-annotation scenarios.

Furthermore, to assess the statistical significance of the improvement provided by the scattering-based SSL pretraining over the standard SSL baseline, a paired t-test was performed comparing the Dice scores of the BYOL and BYOL + PSN encoders. This analysis confirmed that the performance gains of the scattering-based SSL approach are statistically significant across the entire range of labeled data availability, yielding $p$-values substantially below the 5% significance level for all data regimes. Specifically, the $p$-values were $p = 1.5073 \times 10^{-5}$ for $M = 5$, $p = 7.7736 \times 10^{-8}$ for $M = 10$, $p = 2.9822 \times 10^{-7}$ for $M = 20$, $p = 1.5507 \times 10^{-8}$ for $M = 40$, and $p = 5.0855 \times 10^{-10}$ for $M = 80$. While the paired t-test indicates a statistically significant difference even at $M = 80$ ($p = 5.0855 \times 10^{-10}$), the practical performance gap in high-label regimes is small, and all methods converge in Dice score. This suggests that the benefit of scattering-based pretraining is most meaningful in low- and moderate-label settings.
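The paired test can be reproduced with SciPy as sketched below; the arrays hold one Dice score per matched run (3 seeds × 5 folds), and the values shown are placeholders rather than the scores behind Table 1.

```python
import numpy as np
from scipy.stats import ttest_rel

# Paired comparison over the 15 matched runs (3 seeds x 5 folds) of one regime.
dice_byol = np.array([0.830, 0.842, 0.825, 0.851, 0.833, 0.847, 0.821, 0.838,
                      0.844, 0.852, 0.829, 0.824, 0.846, 0.835, 0.850])
dice_psn = dice_byol + np.array([0.021, 0.018, 0.025, 0.015, 0.022, 0.019, 0.027,
                                 0.020, 0.016, 0.014, 0.024, 0.026, 0.017, 0.023, 0.018])

t_stat, p_value = ttest_rel(dice_psn, dice_byol)   # same folds/seeds -> paired test
print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
```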
While Table 1 summarizes the overall segmentation quality, the class-specific Dice scores reported in Table 2, Table 3 and Table 4 provide deeper insight into how each pretraining strategy affects the segmentation of individual cardiac structures. The left ventricle (LV) segmentation (Table 2), which corresponds to the largest and most regular structure, yielded the highest absolute Dice scores across all methods, reaffirming that segmentation difficulty is inherently lower for this class. This can be explained by the high contrast between LV and the single bordering structure MYO. Although BYOL + PSN and BYOL + WSN remain the top-performing methods for LV segmentation, the performance gap relative to the base BYOL encoder is the smallest among all classes. This suggests that high-quality SSL features are already highly effective for segmenting anatomically simple and well-defined regions.
The benefits of scattering-based pretraining become significantly more pronounced when moving to the more challenging structures. The right ventricle (RV), reported in Table 3, is notoriously difficult to segment due to its high morphological variability and its adjacency to low-contrast background regions, which makes its boundaries substantially less distinct than those of the LV. In this setting, BYOL + PSN maintains a substantial and consistent lead over all non-SSL baselines, with the performance gap widening dramatically in low-label regimes. At the most label-scarce setting ($M = 5$), BYOL + PSN achieves a Dice score of 0.7879, an absolute improvement of 50.1 percentage points over random initialization (0.2872). These findings underscore the importance of effective pretraining when segmenting anatomically complex structures under limited supervision.
The most pronounced relative performance gains for the scattering-based methods are observed in the highly challenging myocardium (MYO) segmentation task (Table 4), which is particularly sensitive to small boundary errors due to the thinness of the myocardial wall. MYO segmentation is inherently more challenging because it requires delineating both the endocardial and epicardial boundaries, whereas LV and RV segmentation involves identifying only a single contour [65]. At $M = 5$, the scattering-enhanced SSL methods deliver their largest improvements over the base BYOL approach: BYOL + PSN achieves a Dice score of 0.7898, a 5.6 percentage-point gain over standard BYOL (0.7338). This result strongly supports the hypothesis that integrating scattering-based front-ends (WSN and PSN) into the SSL framework produces representations that are especially well suited for accurate boundary delineation in thin and anatomically complex structures. In summary, while BYOL + PSN consistently provides the highest performance across all classes, its advantage is most pronounced for the RV and MYO, where segmentation complexity and structural variability are greatest.
Table 5 reports the average Dice scores on the CHD dataset under varying numbers of labeled patients $M$. Overall, self-supervised and scattering-based initializations provide clear benefits in low-label regimes, while the advantage diminishes as the amount of labeled supervision increases. In the most label-scarce setting ($M = 2$), the proposed BYOL + PSN achieves the highest performance, indicating that incorporating scattering priors yields a more data-efficient initialization when only a few labeled patients are available. As $M$ increases, the performance differences between strong initializations become smaller and their ranking occasionally changes. For example, BYOL + WSN achieves the best performance at $M = 20$, whereas ImageNet pretraining yields the highest Dice score at $M = 40$. Importantly, these differences in higher-label settings are relatively small in magnitude, suggesting that multiple initialization strategies converge as supervision becomes sufficient. These results support the main conclusion that scattering-enhanced SSL is most beneficial under limited annotation, while competitive baselines (including ImageNet pretraining) can match or exceed SSL-based initialization when larger labeled subsets are available in CHD.

4.4.2. Qualitative Results

To provide a more intuitive understanding of the quantitative results, Figure 6 presents a qualitative comparison of segmentation masks generated by the different methods across various labeled data regimes. For each method and labeling level, we selected the checkpoint whose test set Dice score was closest to the final averaged Dice score (3 repeats × 5 folds), ensuring that the visualizations reflect a representative model rather than a cherry-picked example. For qualitative comparison, we selected a mid-ventricular ED slice from a dilated cardiomyopathy (DCM) case, as this configuration exhibits one of the most difficult MYO appearances in the ACDC dataset. The dilated LV cavity and globally thinned myocardial wall create substantial ambiguity in boundary localization, making this case particularly suitable for assessing the effectiveness of scattering-based pretraining. As illustrated in Figure 6, the scattering-enhanced SSL methods produce noticeably more accurate and coherent segmentations than the remaining baselines, especially under the low-label setting ($M = 5$), further reinforcing the advantages observed in the quantitative results.
To qualitatively assess the segmentation performance on the CHD dataset, Figure 7 presents representative segmentation results produced by different initialization strategies across varying label regimes $M \in \{2, 5, 10, 20, 40\}$. As shown in Figure 7, random initialization produces fragmented and anatomically inconsistent segmentations in low-label settings, while pretrained models yield more coherent predictions. In particular, the proposed scattering-enhanced self-supervised methods generate more stable and anatomically plausible segmentations under limited supervision ($M = 2$ and $M = 5$), consistent with the quantitative results. As the number of labeled patients increases, performance differences between strong initializations become less pronounced.

5. Discussion

This study examined the impact of integrating scattering-based feature extractors into self-supervised learning (SSL) frameworks for cardiac MRI segmentation under limited annotation. The results indicate that embedding scattering-based priors into SSL pretraining leads to more stable and data-efficient representations, with the most pronounced benefits observed in low-label regimes and for anatomically challenging structures such as the right ventricle (RV) and myocardium (MYO). These findings support the central contribution of this work, namely, that fixed or parametrized scattering front-ends enhance the transferability of learned representations for cardiac image segmentation tasks.
Our findings confirm the importance of self-supervised pretraining for cardiac image analysis, a domain in which annotated datasets are inherently limited. The segmentation models initialized with SSL methods (BYOL, BYOL + WSN, and BYOL + PSN) consistently and significantly outperformed the non-SSL baselines (random initialization and ImageNet pretraining) in low-label regimes, while differences diminish as the amount of labeled supervision increases, particularly on the CHD dataset.
In contrast, encoders initialized with random weights or ImageNet pretrained weights exhibited high variance and degraded performance in low-data scenarios, reflecting their sensitivity to limited labeled supervision. Although ImageNet pretraining provides a stronger starting point than random initialization, its effectiveness is greatly constrained in medical imaging due to the substantial domain mismatch between the source and target domains [66]. Conversely, self-supervised learning mitigates this domain shift by learning representations directly from the unlabeled cardiac MRI data, thereby tailoring the features to the characteristics of the target domain. As a result, SSL-based models demonstrate far greater stability and substantially improved segmentation performance compared with non-SSL baselines.
While contrastive SSL methods have shown strong performance in natural image settings, we selected the non-contrastive BYOL framework due to critical challenges inherent to medical imaging. A fundamental limitation of contrastive SSL in this domain is the difficulty of defining true negative pairs. Whereas natural images display large visual variability, medical images are highly homogeneous and often contain only subtle pathological differences. This makes reliable negative identification challenging and significantly increases the risk of false negatives. BYOL's non-contrastive design sidesteps this issue, as it does not require negative samples.
The central finding of this study is the statistically significant performance improvement obtained by integrating Scattering Networks into the SSL pipeline compared with the standard BYOL approach. This improvement highlights the value of incorporating a strong mathematical prior into the feature extractor for label-scarce medical imaging tasks. While BYOL learns features by maximizing agreement between augmented views of the same image, it does not impose any inherent structural constraints on the learned representations. In contrast, the Wavelet Scattering Network (WSN) is grounded in a rigorous mathematical framework that guarantees translation invariance and stability to small deformations. This is particularly beneficial in cardiac MRI, where minor positional shifts and patient-specific anatomical variations are common, yet the underlying anatomical structures must still be identified with high robustness. By simplifying the feature space and mitigating the effects of small transformations, the scattering front-end enables the encoder to achieve better generalization and substantially higher data efficiency during the U-Net fine-tuning stage.
Although both scattering approaches inherit the stability properties of wavelet representations, the consistent superiority of BYOL + PSN over BYOL + WSN can be attributed to the learnable filters in PSN, which allow it to adapt more effectively to the data. WSN relies on a fixed predefined filter bank that provides stable but relatively generic features. In contrast, PSN optimizes its wavelet parameters during training, allowing the scattering representation to become task-driven and better aligned with the underlying anatomical structures.
The effectiveness of this scattering-based representation is quantitatively demonstrated in our class-specific segmentation results. The clearest evidence comes from the challenging MYO segmentation task (Table 4), where the largest relative gain over standard BYOL was achieved. This finding strongly supports the hypothesis that the structural and boundary-aware properties introduced by the scattering front-end are particularly effective for resolving subtle and complex myocardial boundaries. Moreover, the consistent superiority of BYOL + PSN across all anatomical classes confirms it as the most effective initialization strategy among the evaluated methods.
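All reported scores use the Dice similarity coefficient, 2|P ∩ G| / (|P| + |G|) for a predicted mask P and ground-truth mask G. A minimal per-structure sketch, assuming binary PyTorch masks (the smoothing constant is an illustrative choice):

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice similarity coefficient for one binary structure mask."""
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```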
Despite these promising results, our study has several limitations. First, although we evaluate the proposed method on two cardiac datasets with different imaging modalities (cardiac MRI and cardiac CT), the experiments are limited to cardiac image segmentation. Consequently, the results demonstrate robustness under domain shift within cardiac imaging, while generalization beyond the cardiac domain is not explicitly assessed in this study. Second, scattering-based transformations involve fixed design choices (the number of scales and orientations), even in the PSN model, which may limit how well the representation adapts to all myocardial substructures. Finally, our evaluation focused on 2D segmentation; extending the approach to 3D models may reveal additional challenges or opportunities.
Building on these findings, several promising avenues remain for future research. First, the augmentation strategy in our study followed the standard BYOL set with only minimal adjustments (a possible starting point is sketched below). Future work could investigate more principled augmentation designs, both by aligning the pretraining transformations with the structural properties of the scattering representation and by developing segmentation-specific augmentations that improve alignment between pretraining and fine-tuning. Second, although U-Net served as a widely adopted baseline for cardiac MRI segmentation, the proposed scattering-based SSL method could be evaluated with alternative architectures such as nnU-Net [67], DeepLabV3+ [68], or U-Net++ [69]. Finally, addressing the limitations noted above, future work will investigate the generalizability of scattering-based SSL across different medical imaging modalities and extend the approach to 3D volumetric and 4D spatiotemporal segmentation to address the full complexity of clinical cardiac analysis.
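As one possible starting point for such augmentation studies, the standard BYOL set restricted to transforms meaningful for single-channel cardiac slices might look like the following torchvision sketch; all parameter values are illustrative:

```python
from torchvision import transforms

# Approximation of the standard BYOL augmentations for grayscale inputs
# (hue/saturation jitter and solarization are omitted as uninformative
# for single-channel medical images).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
```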

6. Conclusions

In this work, we proposed a scattering-based self-supervised learning (SSL) framework that integrates Wavelet Scattering Networks (WSNs) and Parametric Scattering Networks (PSNs) into the BYOL pretraining pipeline. By pretraining on a large corpus of unlabeled cardiac cine-MRI slices, the proposed method learns domain-specific and anatomically meaningful representations without relying on manual annotations. The pretrained encoders were subsequently transferred to a U-Net and fine-tuned for cardiac image segmentation. Our experiments demonstrate that scattering-based SSL pretraining provides statistically significant advantages over conventional baselines (random initialization and ImageNet pretraining) and standard BYOL, particularly when training is conducted with limited labeled data. These findings underline the benefit of embedding mathematically grounded scattering priors into SSL pipelines and show that PSN, which learns its wavelet parameters during training, achieves the best overall performance, highlighting a promising direction for developing more data-efficient cardiac image segmentation models.
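As an illustration of the transfer step described above, the sketch below copies pretrained encoder weights into a U-Net using the segmentation_models_pytorch package as one possible implementation; the checkpoint path is hypothetical, and the scattering front-end and Feature Alignment Module are omitted for brevity:

```python
import torch
import segmentation_models_pytorch as smp

# Hypothetical transfer of a BYOL-pretrained ResNet-50 encoder into a U-Net.
unet = smp.Unet(encoder_name="resnet50", encoder_weights=None,
                in_channels=1, classes=4)  # e.g., background, LV, RV, MYO
state = torch.load("byol_psn_encoder.pth", map_location="cpu")
missing, unexpected = unet.encoder.load_state_dict(state, strict=False)
# strict=False skips BYOL projector/predictor weights that have no counterpart
# in the segmentation encoder; the decoder stays randomly initialized and is
# trained during supervised fine-tuning on the labeled subset.
```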

Author Contributions

Conceptualization, S.A. and M.F.T.; methodology, S.A. and M.F.T.; software, S.A.; validation, S.A. and M.F.T.; formal analysis, S.A.; investigation, S.A.; resources, S.A.; data curation, S.A.; writing—original draft preparation, S.A.; writing—review and editing, S.A. and M.F.T.; visualization, S.A.; supervision, M.F.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study was conducted as part of S.A.’s doctoral dissertation and received no external funding.

Data Availability Statement

The ACDC dataset is publicly available [56]: https://www.creatis.insa-lyon.fr/Challenge/acdc/ (accessed on 13 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  2. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  4. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  6. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  9. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
  10. Jin, C.; Guo, Z.; Lin, Y.; Luo, L.; Chen, H. Label-efficient deep learning in medical image analysis: Challenges and future directions. arXiv 2023, arXiv:2303.12484. [Google Scholar] [CrossRef]
  11. Rayed, M.E.; Islam, S.M.S.; Niha, S.I.; Jim, J.R.; Kabir, M.M.; Mridha, M.F. Deep Learning for Medical Image Segmentation: State-of-the-Art Advancements and Challenges. Inform. Med. Unlocked 2024, 47, 101504. [Google Scholar] [CrossRef]
  12. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  13. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3320–3328. [Google Scholar]
  14. Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. Learning and Transferring Mid-Level Image Representations Using Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1717–1724. [Google Scholar]
  15. Zoetmulder, R.; Gavves, E.; Caan, M.; Marquering, H. Domain- and task-specific transfer learning for medical segmentation tasks. Comput. Methods Programs Biomed. 2021, 214, 106539. [Google Scholar] [CrossRef]
  16. Tendle, A.; Hasan, M.R. A study of the generalizability of self-supervised representations. Mach. Learn. Appl. 2021, 6, 100124. [Google Scholar] [CrossRef]
  17. Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep Clustering for Unsupervised Learning of Visual Features. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 132–149. [Google Scholar]
  18. Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 69–84. [Google Scholar]
  19. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised Representation Learning by Predicting Image Rotations. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  20. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
  21. Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430. [Google Scholar]
  22. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  23. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  24. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar]
  25. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 12310–12320. [Google Scholar]
  26. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  27. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  28. Bruna, J.; Mallat, S. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1872–1886. [Google Scholar] [CrossRef] [PubMed]
  29. Jing, L.; Tian, Y. Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4037–4058. [Google Scholar] [CrossRef]
  30. Ohri, K.; Kumar, M. Review on self-supervised image recognition using deep neural networks. Knowl.-Based Syst. 2021, 224, 107090. [Google Scholar] [CrossRef]
  31. Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J.; Tang, J. Self-Supervised Learning: Generative or Contrastive. IEEE Trans. Knowl. Data Eng. 2021, 35, 857–876. [Google Scholar] [CrossRef]
  32. Albelwi, S. Survey on self-supervised learning: Auxiliary pretext tasks and contrastive learning methods in imaging. Entropy 2022, 24, 551. [Google Scholar] [CrossRef] [PubMed]
  33. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3733–3742. [Google Scholar]
  34. Chen, X.; Fan, H.; Girshick, R.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv 2020, arXiv:2003.04297. [Google Scholar] [CrossRef]
  35. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  36. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742. [Google Scholar]
  37. Chen, X.; He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15750–15758. [Google Scholar]
  38. Wong, B.; Yi, M.Y. Premix: Label-Efficient Multiple Instance Learning via Non-Contrastive Pre-Training and Feature Mixing. arXiv 2024, arXiv:2408.01162. [Google Scholar]
  39. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  40. Hidy, G.; Bakos, B.; Lukács, A. Enhancing Pretraining Efficiency for Medical Image Segmentation via Transferability Metrics. arXiv 2024, arXiv:2410.18677. [Google Scholar] [CrossRef]
  41. Rani, V.; Kumar, M.; Gupta, A.; Sachdeva, M.; Mittal, A.; Kumar, K. Self-supervised learning for medical image analysis: A comprehensive review. Evol. Syst. 2024, 15, 1607–1633. [Google Scholar] [CrossRef]
  42. Zeng, D.; Wu, Y.; Hu, X.; Xu, X.; Yuan, H.; Huang, M.; Zhuang, J.; Hu, J.; Shi, Y. Positional contrastive learning for volumetric medical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part II 24. pp. 221–230. [Google Scholar]
  43. Punn, N.S.; Agarwal, S. BT-Unet: A self-supervised learning framework for biomedical image segmentation using barlow twins with U-net models. Mach. Learn. 2022, 111, 4585–4600. [Google Scholar] [CrossRef]
  44. Kalapos, A.; Gyires-Tóth, B. Self-Supervised Pretraining for 2D Medical Image Segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; pp. 472–484. [Google Scholar]
  45. Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3024–3033. [Google Scholar]
  46. Xie, Z.; Lin, Y.; Zhang, Z.; Cao, Y.; Lin, S.; Hu, H. Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16684–16693. [Google Scholar]
  47. Chaitanya, K.; Erdil, E.; Karani, N.; Konukoglu, E. Contrastive learning of global and local features for medical image segmentation with limited annotations. Adv. Neural Inf. Process. Syst. 2020, 33, 12546–12558. [Google Scholar]
  48. Mallat, S. Group Invariant Scattering. Commun. Pure Appl. Math. 2012, 65, 1331–1398. [Google Scholar] [CrossRef]
  49. Sifre, L.; Mallat, S. Rotation, Scaling and Deformation Invariant Scattering for Texture Discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1233–1240. [Google Scholar]
  50. Leterme, H.; Polisano, K.; Perrier, V.; Alahari, K. On the Shift Invariance of Max Pooling Feature Maps in Convolutional Neural Networks. arXiv 2022, arXiv:2209.11740. [Google Scholar]
  51. Oyallon, E.; Belilovsky, E.; Zagoruyko, S. Scaling the Scattering Transform: Deep Hybrid Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5618–5627. [Google Scholar]
  52. Oyallon, E.; Zagoruyko, S.; Huang, G.; Komodakis, N.; Lacoste-Julien, S.; Blaschko, M.B.; Belilovsky, E. Scattering Networks for Hybrid Representation Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2208–2221. [Google Scholar] [CrossRef] [PubMed]
  53. Cotter, F.; Kingsbury, N. A Learnable ScatterNet: Locally Invariant Convolutional Layers. In Proceedings of the IEEE International Conference on Image Processing, Taipei, Taiwan, 22–25 September 2019; pp. 350–354. [Google Scholar]
  54. Gauthier, S.; Thérien, B.; Alsene-Racicot, L.; Chaudhary, M.; Rish, I.; Belilovsky, E.; Eickenberg, M.; Wolf, G. Parametric scattering networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5749–5758. [Google Scholar]
  55. Kinakh, V.; Taran, O.; Voloshynovskiy, S. ScatSimCLR: Self-Supervised Contrastive Learning with Pretext Task Regularization for Small-Scale Datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Virtual, 11–17 October 2021; pp. 1098–1106. [Google Scholar]
  56. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M.A.G.; et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525. [Google Scholar] [CrossRef]
  57. Xu, X.; Wang, T.; Shi, Y.; Yuan, H.; Jia, Q.; Huang, M.; Zhuang, J. Whole heart and great vessel segmentation in congenital heart disease using deep neural networks and graph matching. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019; Springer: Cham, Switzerland, 2019; pp. 477–485. [Google Scholar]
  58. Janik, A.; Dodd, J.; Ifrim, G.; Sankaran, K.; Curran, K. Interpretability of a Deep Learning Model in the Application of Cardiac MRI Segmentation with an ACDC Challenge Dataset. In Proceedings of the Medical Imaging 2021: Image Processing, Virtual Event, 15–20 February 2021; pp. 861–872. [Google Scholar]
  59. Andreux, M.; Angles, T.; Exarchakis, G.; Leonarduzzi, R.; Rochette, G.; Thiry, L.; Zarka, J.; Mallat, S.; Andén, J.; Belilovsky, E.; et al. Kymatio: Scattering transforms in python. J. Mach. Learn. Res. 2020, 21, 1–6. [Google Scholar]
  60. You, Y.; Gitman, I.; Ginsburg, B. Large batch training of convolutional networks. arXiv 2017, arXiv:1708.03888. [Google Scholar] [CrossRef]
  61. Yagis, E.; Atnafu, S.W.; de Herrera, A.G.S.; Marzi, C.; Scheda, R.; Giannelli, M.; Tessa, C.; Citi, L.; Diciotti, S. Effect of data leakage in brain MRI classification using 2D convolutional neural networks. Sci. Rep. 2021, 11, 22544. [Google Scholar] [CrossRef] [PubMed]
  62. Li, H.; Ouyang, Y.; Wan, X. Self-Supervised Alignment Learning for Medical Image Segmentation. In Proceedings of the IEEE International Symposium on Biomedical Imaging, Athens, Greece, 27–30 May 2024; pp. 1–5. [Google Scholar]
  63. Seince, M.; Le Folgoc, L.; Facury de Souza, L.A.; Angelini, E. Dense Self-Supervised Learning for Medical Image Segmentation. arXiv 2024, arXiv:2407.20395. [Google Scholar] [CrossRef]
  64. Miron, R.; Moisii, C.; Dinu, S.; Breaban, M.E. Evaluating volumetric and slice-based approaches for COVID-19 detection in chest CTs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 529–536. [Google Scholar]
  65. Ammar, A.; Bouattane, O.; Youssfi, M. Automatic cardiac cine MRI segmentation and heart disease classification. Comput. Med. Imaging Graph. 2021, 88, 101864. [Google Scholar] [CrossRef] [PubMed]
  66. Wen, Y.; Chen, L.; Deng, Y.; Zhou, C. Rethinking pre-training on medical imaging. J. Vis. Commun. Image Represent. 2021, 78, 103145. [Google Scholar] [CrossRef]
  67. Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A Self-Configuring Method for Deep Learning-Based Biomedical Image Segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef] [PubMed]
  68. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  69. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
Figure 1. Architecture of the BYOL self-supervised learning framework. Two augmented views of the same image are processed by online and target networks, with the target updated via exponential moving average (EMA).
Figure 2. Example configuration of a 2D Wavelet Scattering Network (WSN) with J = 2 scales and L = 3 orientations, producing zeroth-, first-, and second-order scattering coefficients, where * denotes the convolution operator. The dashed arrows indicate the propagation of scattering paths, and the dashed lines represent hierarchical feature extraction across different orders.
Figure 3. U-Net architecture used for cardiac MRI segmentation, consisting of an encoder–decoder structure with skip connections for precise spatial localization.
Figure 4. Feature Alignment Module (FAM) used to adapt scattering coefficients to the channel dimensions required by the remaining ResNet-50 encoder stages.
Figure 5. Overall framework of the proposed method, including scattering-based BYOL pretraining and subsequent transfer of the pretrained encoder to a U-Net for supervised segmentation.
Figure 6. Segmentation examples on the ACDC dataset for U-Net models with different encoder initializations and labeled fine-tuning settings. Blue, green, and red indicate LV, MYO, and RV predictions, respectively.
Figure 7. Segmentation examples on the CHD dataset for U-Net models with different encoder initializations and labeled fine-tuning settings. Red, green, blue, yellow, magenta, and cyan indicate LV, RV, LA, RA, MYO, aorta, and pulmonary artery predictions, respectively.
Table 1. Performance comparison of U-Net models with different encoder initialization strategies on the ACDC dataset, reported in terms of average Dice score across all cardiac structures.
Pretraining | M = 5 | M = 10 | M = 20 | M = 40 | M = 80
Random Init. | 0.2949 ± 0.1364 | 0.6054 ± 0.0941 | 0.7955 ± 0.0291 | 0.8586 ± 0.0206 | 0.8891 ± 0.0075
ImageNet | 0.5763 ± 0.1491 | 0.7993 ± 0.0280 | 0.8579 ± 0.0092 | 0.8825 ± 0.0056 | 0.8989 ± 0.0037
BYOL | 0.7857 ± 0.0263 | 0.8435 ± 0.0097 | 0.8728 ± 0.0039 | 0.8863 ± 0.0039 | 0.8986 ± 0.0026
BYOL + WSN | 0.8172 ± 0.0172 | 0.8596 ± 0.0107 | 0.8830 ± 0.0051 | 0.8971 ± 0.0053 | 0.9082 ± 0.0020
BYOL + PSN | 0.8273 ± 0.0138 | 0.8646 ± 0.0096 | 0.8871 ± 0.0071 | 0.9010 ± 0.0045 | 0.9111 ± 0.0022
Table 2. Performance comparison of U-Net models with different encoder initialization strategies on the ACDC dataset for left ventricle (LV) segmentation.
Pretraining | M = 5 | M = 10 | M = 20 | M = 40 | M = 80
Random Init. | 0.4449 ± 0.1543 | 0.7008 ± 0.1115 | 0.8688 ± 0.0232 | 0.9141 ± 0.0118 | 0.9316 ± 0.0044
ImageNet | 0.7679 ± 0.0655 | 0.8657 ± 0.0241 | 0.9094 ± 0.0091 | 0.9256 ± 0.0059 | 0.9370 ± 0.0034
BYOL | 0.8725 ± 0.0274 | 0.9066 ± 0.0060 | 0.9235 ± 0.0057 | 0.9316 ± 0.0040 | 0.9380 ± 0.0023
BYOL + WSN | 0.9003 ± 0.0116 | 0.9186 ± 0.0098 | 0.9311 ± 0.0043 | 0.9382 ± 0.0044 | 0.9450 ± 0.0022
BYOL + PSN | 0.9041 ± 0.0096 | 0.9227 ± 0.0067 | 0.9336 ± 0.0059 | 0.9420 ± 0.0031 | 0.9472 ± 0.0011
Table 3. Performance comparison of U-Net models with different encoder initialization strategies on the ACDC dataset for right ventricle (RV) segmentation.
Pretraining | M = 5 | M = 10 | M = 20 | M = 40 | M = 80
Random Init. | 0.2872 ± 0.1979 | 0.5960 ± 0.1128 | 0.7957 ± 0.0296 | 0.8560 ± 0.0241 | 0.8921 ± 0.0091
ImageNet | 0.5720 ± 0.2460 | 0.8023 ± 0.0376 | 0.8592 ± 0.0099 | 0.8858 ± 0.0055 | 0.9030 ± 0.0042
BYOL | 0.7509 ± 0.0391 | 0.8244 ± 0.0190 | 0.8664 ± 0.0062 | 0.8825 ± 0.0082 | 0.8995 ± 0.0035
BYOL + WSN | 0.7688 ± 0.0330 | 0.8392 ± 0.0167 | 0.8710 ± 0.0091 | 0.8915 ± 0.0102 | 0.9062 ± 0.0028
BYOL + PSN | 0.7879 ± 0.0272 | 0.8432 ± 0.0175 | 0.8771 ± 0.0118 | 0.8958 ± 0.0088 | 0.9097 ± 0.0046
Table 4. Performance comparison of U-Net models with different encoder initialization strategies on the ACDC dataset for myocardium (MYO) segmentation.
Pretraining | M = 5 | M = 10 | M = 20 | M = 40 | M = 80
Random Init. | 0.1526 ± 0.1766 | 0.5192 ± 0.0841 | 0.7220 ± 0.0418 | 0.8057 ± 0.0270 | 0.8437 ± 0.0104
ImageNet | 0.3890 ± 0.2831 | 0.7300 ± 0.0333 | 0.8052 ± 0.0138 | 0.8363 ± 0.0088 | 0.8567 ± 0.0055
BYOL | 0.7338 ± 0.0306 | 0.7996 ± 0.0077 | 0.8284 ± 0.0058 | 0.8448 ± 0.0047 | 0.8584 ± 0.0039
BYOL + WSN | 0.7824 ± 0.0155 | 0.8210 ± 0.0117 | 0.8469 ± 0.0051 | 0.8616 ± 0.0044 | 0.8734 ± 0.0037
BYOL + PSN | 0.7898 ± 0.0120 | 0.8278 ± 0.0122 | 0.8507 ± 0.0081 | 0.8652 ± 0.0051 | 0.8763 ± 0.0023
Table 5. Performance comparison of U-Net models with different encoder initialization strategies on the CHD dataset, reported in terms of average Dice score across all cardiac structures.
Pretraining | M = 2 | M = 5 | M = 10 | M = 20 | M = 40
Random Init. | 0.2182 ± 0.0748 | 0.2971 ± 0.0564 | 0.5149 ± 0.0392 | 0.5764 ± 0.0158 | 0.6395 ± 0.0064
ImageNet | 0.3027 ± 0.0696 | 0.4567 ± 0.0372 | 0.5901 ± 0.0207 | 0.6372 ± 0.0116 | 0.6982 ± 0.0037
BYOL | 0.3493 ± 0.0523 | 0.4692 ± 0.0785 | 0.5842 ± 0.0178 | 0.6440 ± 0.0118 | 0.6904 ± 0.0048
BYOL + WSN | 0.3286 ± 0.0688 | 0.4733 ± 0.0795 | 0.5654 ± 0.0571 | 0.6498 ± 0.0079 | 0.6959 ± 0.0073
BYOL + PSN | 0.3547 ± 0.0637 | 0.4673 ± 0.0707 | 0.5940 ± 0.0130 | 0.6451 ± 0.0159 | 0.6873 ± 0.0090
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
