Article

Fish Body Pattern Style Transfer Based on Wavelet Transformation and Gated Attention

College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5150; https://doi.org/10.3390/app15095150
Submission received: 27 March 2025 / Revised: 29 April 2025 / Accepted: 30 April 2025 / Published: 6 May 2025
(This article belongs to the Special Issue Advanced Pattern Recognition & Computer Vision)

Abstract

To address temporal jitter, low segmentation accuracy, and the lack of high-precision transformations for specific object classes in video generation, we propose the fish body pattern sync-styling network (FSSNet) for ornamental fish videos. The network innovatively integrates dynamic texture transfer with instance segmentation in a two-stage processing architecture. First, high-precision video frame segmentation is performed using Mask2Former to exclude background elements from the style transfer process. We then introduce the wavelet-gated styling network (WGSNet), which reconstructs a multi-scale feature space via the discrete wavelet transform, enhancing the granularity of multi-scale style features during image generation. Additionally, we embed a convolutional block attention module within the residual modules, which improves the realism of the generated images and effectively reduces boundary artifacts around foreground objects. Furthermore, to mitigate the frame-to-frame jitter commonly observed in generated videos, we incorporate a contrastive coherence preserving loss into the training of the style transfer network. This strengthens the perceptual loss function, preventing video flickering and ensuring improved temporal consistency. In real-world aquarium scenes, compared to state-of-the-art methods, FSSNet effectively preserves localized texture details in the generated videos and achieves competitive SSIM and PSNR scores, while temporal consistency improves significantly, with the flow warping error decreasing to 1.412. Using fast neural style transfer (FNST) as the baseline model, FSSNet also improves on parameter count and runtime efficiency. In a user study, 43.75% of participants preferred the dynamic effects generated by our method.

1. Introduction

Koi fish are a popular ornamental species worldwide, admired for their vibrant colors and auspicious symbolism. As aesthetic preferences evolve, market demand for ornamental fish continues to grow. However, traditional fish farming faces challenges such as high breeding costs and environmental limitations, and different fish species exhibit significant variations in body patterns. These issues have attracted attention not only from ichthyologists but also from computer vision researchers. To address this demand, we explore semantic image style transfer methods. By leveraging neural style transfer (NST) technology, we can dynamically generate diverse fish body patterns, enhancing the economic potential of fish breeding while catering to personalized aesthetic preferences.
Traditional style transfer methods, such as patch-based synthesis [1], often struggle to maintain global coherence and stylistic fidelity, limiting their applicability to complex styles. Similarly, traditional image segmentation techniques face challenges such as computational complexity, sensitivity to image variations, and subjective interpretation, which can hinder their effectiveness in real-world applications. Neural style transfer extracts high-level semantic features from images using deep neural networks, allowing it to preserve the structural content of an image while integrating the texture characteristics of a style image. This technique has been widely applied in artistic rendering, visual effects in film production, and virtual reality environments [2]. Compared to traditional texture synthesis and optimization-based algorithms, NST enables more flexible cross-domain style transfer, such as applying oil painting brushstrokes to natural landscape photographs or achieving a unified artistic style in dynamic videos [3].
As illustrated in Figure 1, by integrating instance segmentation with style transfer, we can achieve localized control over the application of stylistic transformations within an image.
To achieve high-quality, visually appealing, and real-time high-resolution fish body pattern video style transfer, we propose the fish body pattern sync-styling network (FSSNet), a localized video style transfer model applicable to various ornamental fish species. Our method adapts the appearance of ornamental fish in video sequences by performing style matching. To process an input video, our approach first leverages a pre-trained Mask2Former [4] model to extract instance segmentation masks for multiple fish objects in each frame, thereby effectively separating the foreground fish from the background. Based on these segmentation masks, we apply an improved style transfer network to accurately extract fish body patterns, capturing their morphology and texture characteristics. This enables the generation of videos with novel styles, incorporating modifications in shape, texture, and color contrast.
To address the common challenges of style consistency and frame jitter in generated videos, we incorporate contrastive coherence preserving loss [5] (CCPL). This significantly enhances the temporal consistency of our generated videos by preserving structural coherence over time. Additionally, through spatial control mechanisms, users can selectively specify different style effects for the fish bodies and the background, enabling artistic and region-specific stylization. Our main contributions are as follows:
  • High-precision mask-based local style transfer: We propose FSSNet, a style transfer model tailored for various ornamental fish species, optimizing the accuracy of fish body mask generation in complex underwater scenes. Using segmentation masks, we achieve fast and synchronized pattern transformations while supporting distinct style transfers for foreground fish and the background. Our model effectively adapts multiple styles, such as oil painting and blue-and-white porcelain, enhancing visual diversity.
  • Wavelet transform and attention-enhanced style transfer: We introduce a wavelet-gated styling network (WGSNet), which incorporates a convolutional block attention module [6] into residual networks to bolster feature matching in fish body regions. Furthermore, we employ wavelet transform functions to refine the downsampling process, effectively capturing localized image details.
  • Temporal consistency optimization: To mitigate video flickering and jitter, we integrate contrastive coherence preserving loss within the perceptual loss function, significantly improving the temporal consistency of the generated frames.

2. Related Work

2.1. Instance Segmentation

Instance segmentation aims to distinguish individual object instances at the pixel level while differentiating between objects of the same class. As a task at the intersection of object detection and semantic segmentation, instance segmentation is particularly challenging in underwater video scenarios due to issues such as uneven lighting, motion blur, and complex backgrounds, which necessitate more robust segmentation techniques. Existing methods, such as Mask R-CNN, struggle in complicated underwater environments, often producing coarse segmentation mask edges [7].
He et al. [7] introduced the Mask R-CNN architecture, which improves upon Faster R-CNN [8] by performing pixel-level segmentation using an enhanced ROIAlign module for greater segmentation accuracy. However, its high computational overhead limits its real-time applicability. To enhance efficiency, Kurzman et al. [9] proposed class-based stylization using a depth-wise asymmetric bottleneck network (DABNet) to obtain object-class masks. Although this model achieves real-time semantic segmentation, it suffers from significant loss of fine texture details. To address dynamic object segmentation challenges in complex scenes, the space–time correspondence network (STCN) [10] optimizes memory-matching mechanisms, while Wang et al. [11] introduced a vision saliency model based on transformers, integrating motion and color-space features for better segmentation performance.

2.2. Global Style Transfer

Gatys et al. [12] pioneered a neural style transfer (NST) method based on convolutional neural networks (CNNs), enabling the optimization-based separation of content and style features, thus laying the foundation for style transfer research. However, due to its high computational cost and long inference time, its real-time application remains limited. Since 2015, advancements, such as CycleGAN [13], AdaIN [14], and AdaAttN [15], have significantly improved both the efficiency and realism of style transfer.
AdaAttN effectively transfers arbitrary styles yet introduces artifacts in fine-detail regions. Additionally, as it primarily focuses on static images, it lacks explicit temporal consistency constraints, leading to noticeable distortions between consecutive frames in videos. In the domain of video generation, Yan et al. [16] proposed a temporally consistent video transformer model, incorporating temporal dependency modeling during compression and input stages to enhance long-term video coherence. Similarly, Johnson et al. [17] proposed a fast neural style transfer method using an end-to-end CNN, enabling real-time inference. This pioneering framework significantly accelerated style transfer, making it feasible for video applications.
Despite these advancements, existing style transfer methods often produce inconsistent styles or frame skipping in videos, leading to incoherent or unnatural visual effects. Furthermore, fish body patterns exhibit non-rigid deformations, necessitating semantic constraints to refine details and maintain biological feature consistency during style transfer. To ensure visual smoothness in videos, temporal consistency optimization has become a crucial problem. Chen et al. [18] introduced a temporal consistency loss function in video style transfer to constrain stylistic variations between frames, ensuring perceptual continuity. However, for highly detailed and dynamically changing scenes, existing methods may still yield unnatural results and suffer from detail loss.

2.3. Local Style Transfer

Traditional global style transfer methods [12] struggle to achieve spatially controllable style rendering, especially when applied to complex scenes. This often results in uneven style distributions and visual artifacts, leading to style bleeding issues. Li et al. [19] attempted to introduce attention mechanisms for preliminary regional-style control, but their segmentation precision was constrained by the generalization capabilities of pre-trained models.
Inspired by mobile video platforms that utilize face-tracking technology to apply stickers or effects to faces, we explore the use of a CNN for the artistic stylization of ornamental fish. Local style transfer integrates image segmentation with style adaptation, allowing specific styles to be applied to designated regions within an image or video while preserving the original appearance of other areas. However, selecting the target region and ensuring smooth boundary transitions remain challenging, as improper processing can result in content leakage or unnatural style blending. Luan et al. [20] incorporated semantic segmentation into style transfer to enable photorealistic stylization, ensuring spatial coherence while maintaining correct style matching across different semantic regions. Sun et al. [21] further applied mask-based segmentation and residual networks for localized image style transfer. However, these methods primarily focus on static images and do not address the challenge of maintaining frame-to-frame consistency in video sequences.
CCPL enhances the temporal stability of video stylization by utilizing global feature maps and attention mechanisms, ensuring consistent styles across consecutive frames through contrastive learning. This approach allows the system to adaptively choose between multiple styles while preserving the original content information [5]. Despite its effectiveness in improving temporal consistency, CCPL occasionally exhibits insufficient image refinement in practical applications.

3. Proposed System

3.1. Architecture

Our goal is to design a localized style transfer model for ornamental fish videos, replacing fish body patterns with creative and refined stylized textures while keeping the original background or generating a distinguishable pattern for it. To achieve this, we follow the methodology illustrated in Figure 2.
First, we employ an instance segmentation network trained on an ornamental fish dataset to extract a binary mask $M_f$ for the target fish $f$. The mask $M_f$ has the same height and width as the input video frame $C_f$, with pixels belonging to class $f$ set to 1 and background pixels set to 0. Next, our image transfer network $S$ converts $C_f$ into a stylized image $T_s$. FSSNet applies the binary mask $M_f$ to separate the foreground $I_f$ and the background image $B_f$ from the original input frame $C_f$. Finally, the stylized foreground $T_s$ is fused with $B_f$, yielding the final output image $G_f$, which retains the original background while applying style transfer selectively to the fish body. The overall calculation process is formulated as follows:
$$G_f(C_f, S) = T_s + (1 - M_f) \times B_f, \qquad T_s = S(I_f \times M_f) \quad (1)$$
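As a concrete illustration, the following minimal PyTorch sketch implements the fusion step of Equation (1), assuming the binary mask, the content frame, and a trained stylization network are already available; the function name and the re-masking of the stylized output are our own additions rather than part of the released implementation.

```python
import torch

def fuse_stylized_foreground(content_frame: torch.Tensor,
                             mask: torch.Tensor,
                             style_net: torch.nn.Module) -> torch.Tensor:
    """Apply style transfer only to the masked fish region (Equation (1)).

    content_frame: (B, 3, H, W) input frame C_f
    mask:          (B, 1, H, W) binary mask M_f (1 = fish, 0 = background)
    style_net:     stylization network S (e.g., WGSNet)
    """
    foreground = content_frame * mask            # I_f x M_f
    stylized = style_net(foreground) * mask      # T_s, re-masked as a precaution
    background = content_frame * (1.0 - mask)    # (1 - M_f) x B_f
    return stylized + background                 # G_f
```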
As depicted in Figure 3, WGSNet is the key component of FSSNet, designed using an encoder–decoder architecture. During the inference phase, WGSNet transforms the content image P into a stylized image G. During training, the optimizer updates the weights by optimizing style loss, content loss, and contrastive coherence preserving loss to ensure high-quality stylization while maintaining temporal consistency across video frames.
To extract hierarchical feature representations, we utilize a pre-trained VGG-16 [22] network as the feature extractor. Here, gram_style refers specifically to the Gram function computed on the style image's features. Formally, given a feature map $F \in \mathbb{R}^{C \times H \times W}$ extracted from the style image $S$, we reshape it into a 2D matrix $F \in \mathbb{R}^{C \times (HW)}$. The Gram function is defined as follows:
$$G = \frac{1}{C \times H \times W} F F^{\top} \quad (2)$$
The Gram matrix encodes style features through inter-channel covariance, which allows texture statistics to be transferred independently of the content structure [23].
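A minimal sketch of the Gram computation in Equation (2), assuming PyTorch feature maps in (B, C, H, W) layout:

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a feature map, normalized as in Equation (2).

    features: (B, C, H, W) activations, e.g., from a VGG-16 layer.
    Returns:  (B, C, C) inter-channel covariance capturing texture statistics.
    """
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)            # reshape to C x (HW)
    gram = torch.bmm(f, f.transpose(1, 2))    # F F^T
    return gram / (c * h * w)                 # normalize by C x H x W
```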

3.2. Wavelet-Gated Styling Network

FNST [17] often struggles to preserve high-frequency details because of its reliance on single-scale feature representations, which can lead to over-smoothed stylization outputs. To better capture semantic differences in the image loss and to address detail loss and incomplete feature extraction during image generation, we build on the TransformerNet architecture of FNST and propose the wavelet-gated styling network (WGSNet). We explicitly address FNST's spectral distortion by integrating multi-level wavelet decomposition, enabling the joint optimization of low-frequency structural coherence and high-frequency artistic texture fidelity.
As depicted in Figure 4, the input image is first processed by an initial convolution layer (conv1), expanding the input from 3 channels to 32 channels while maintaining spatial resolution. This is followed by instance normalization (IN) and ReLU activation, which removes instance-specific mean and variance, thereby enhancing style transformation effectiveness [24].
To further enhance feature matching in key regions, we incorporate the convolutional block attention module into the residual module, thereby improving the perceptual loss network. By integrating low-frequency and high-frequency attention mechanisms, we apply attention weighting to the feature maps of both the content image and the generated image, optimizing the final stylized output.

3.2.1. Upsampling

Following the FNST framework, the decoder architecture includes a sequence of deconvolution layers and residual blocks to reconstruct high-resolution stylized outputs. As shown in Figure 5, the upsampling process employs three deconvolution layers with stride 2, progressively recovering spatial resolution from the bottleneck features. Each deconvolution layer is followed by instance normalization [24] and ReLU activation.
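The sketch below outlines a decoder of this kind, assuming the residual blocks of Section 3.2.3 come immediately before it and the bottleneck carries 128 channels; the intermediate channel widths and the final 9 × 9 output convolution are assumptions rather than the released architecture.

```python
import torch.nn as nn

def make_decoder(in_channels: int = 128) -> nn.Sequential:
    """Decoder sketch: three stride-2 deconvolutions, each followed by
    instance normalization and ReLU, then a 9x9 convolution back to RGB."""
    layers = []
    channels = [in_channels, 64, 32, 16]      # assumed channel widths
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [
            nn.ConvTranspose2d(c_in, c_out, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.InstanceNorm2d(c_out, affine=True),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.Conv2d(channels[-1], 3, kernel_size=9, stride=1, padding=4))
    return nn.Sequential(*layers)
```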

3.2.2. Wavelet-Based Downsampling

Traditional convolutional downsampling methods use stride-based operations to compress feature maps spatially, often causing high-frequency detail loss and local feature confusion [25]. To mitigate this issue, we integrate a multi-resolution feature extraction module based on the discrete wavelet transform (DWT). It consists of a lossless feature encoding block and a feature representation learning block. In the network encoder, two parallel branches process low-frequency and high-frequency information, respectively, and the fusion module then performs channel-wise concatenation or weighted summation. WGSNet employs DWT in its high-level semantic layers, replacing traditional downsampling while removing redundant instance normalization layers. In standard TransformerNet architectures, the conv2 and conv3 layers use a stride of 2 to reduce the size of the feature maps. In our network, reflection padding is applied in the conv1 layer to maintain the image size, followed by a 9 × 9 convolution with stride 1, yielding a 32 × H × W feature tensor. We replace the conv2 and conv3 downsampling operations with Haar wavelet downsampling [26], ensuring that spatial information is encoded into the channel dimension to reduce information loss.
In the first wavelet module (DWT1), the input feature map P is decomposed into a low-frequency approximation component (LL) and high-frequency components (LH, HL, HH), as shown in Figure 6. The output of this module consists of 64 channels. Similarly, the second wavelet module (DWT2) transforms 64 input channels into 128 output channels, effectively preserving both local and global details.
The low-frequency component (LL) is generated by the LDF through progressive decomposition along the row and column directions of the image, preserving the global structure and smooth information. The high-frequency components (LH, HL, and HH) are extracted by cross-combinations of HDF and LDF, corresponding to horizontal edges (LH), vertical edges (HL), and diagonal details (HH), respectively.
The four decomposed frequency components are fused into $Y$ via channel concatenation, followed by a $1 \times 1$ convolution layer for channel dimension compression. This process is described in Equation (3), where $Y \in \mathbb{R}^{B \times 4C \times \frac{H}{2} \times \frac{W}{2}}$:
$$Y = \mathrm{Concat}(Y_{LL}, Y_{HL}, Y_{LH}, Y_{HH}), \qquad Y_{out} = \mathrm{Conv}_{1 \times 1}(Y) \quad (3)$$
Since low-frequency components contain global statistical information, applying direct channel normalization can reduce style leakage [23]. Thus, we integrate batch normalization and activation functions within the DWT process, eliminating instance normalization layers previously used in conv2 and conv3.
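A possible implementation of the Haar downsampling block described above is sketched here: the four sub-bands are computed with 2 × 2 Haar analysis, concatenated along the channel axis, and compressed with a 1 × 1 convolution followed by batch normalization and ReLU, as in Equation (3). The sign convention for the detail sub-bands and the assumption of even spatial dimensions are ours.

```python
import torch
import torch.nn as nn

class HaarDownsample(nn.Module):
    """Haar wavelet downsampling sketch (Section 3.2.2, Equation (3))."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.compress = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 2x2 Haar analysis via even/odd sampling (assumes even H and W)
        a = x[:, :, 0::2, 0::2]   # top-left
        b = x[:, :, 1::2, 0::2]   # bottom-left
        c = x[:, :, 0::2, 1::2]   # top-right
        d = x[:, :, 1::2, 1::2]   # bottom-right
        ll = (a + b + c + d) / 2  # low-frequency approximation
        lh = (a - b + c - d) / 2  # horizontal detail
        hl = (a + b - c - d) / 2  # vertical detail
        hh = (a - b - c + d) / 2  # diagonal detail
        y = torch.cat([ll, hl, lh, hh], dim=1)   # (B, 4C, H/2, W/2)
        return self.compress(y)

# Following the channel counts in the text: DWT1 maps 32 -> 64 channels,
# DWT2 maps 64 -> 128 channels.
dwt1, dwt2 = HaarDownsample(32, 64), HaarDownsample(64, 128)
```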

3.2.3. Residual Module with Attention Mechanism

While DWT effectively extracts localized frequency-domain features, it lacks global semantic modeling capability. Lower-level features tend to be more fundamental and often require less complex attention mechanisms. To achieve precise feature localization, we embed a CBAM within the residual layers, as shown in Figure 7. The CBAM combines channel attention and spatial attention, enabling adaptive feature reweighting along two dimensions to enhance the network's focus on key regions.
The improved residual module includes two convolution layers, instance normalization layers, and the CBAM. For an input feature map $F \in \mathbb{R}^{B \times C \times H \times W}$, its processing steps are formalized as follows:
$$F' = \sigma(\mathrm{IN}(\mathrm{Conv}(F))), \quad F'' = \sigma(\mathrm{IN}(\mathrm{Conv}(F'))), \quad F_{att} = \mathrm{CBAM}(F''), \quad F_{out} = F_{in} + F_{att} \quad (4)$$
where σ denotes the ReLU, while IN represents instance normalization operations.
The channel attention module compresses spatial dimensions using global average pooling (GAP) and global max pooling (GMP), generating channel-wise descriptors. These descriptors are passed through fully connected layers (MLP) and a Sigmoid function to produce channel weights Ws, as follows:
$$W_s = \sigma\left(f_{\mathrm{MLP}}\left([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]\right)\right) \quad (5)$$
where σ represents the Sigmoid, while MLP is the multi-layer perceptron.
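The residual block with CBAM can be sketched as follows, combining the channel and spatial attention stages of Equations (4) and (5); the reduction ratio of 16 and the 7 × 7 spatial kernel follow the defaults of the CBAM paper [6] and are assumptions here, and the shared MLP is realized with 1 × 1 convolutions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention (Equation (5)): GAP and GMP descriptors pass through
    a shared MLP (1x1 convolutions) and a Sigmoid to give per-channel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise mean/max maps -> 7x7 conv -> Sigmoid."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAMResidualBlock(nn.Module):
    """Residual block of Equation (4): two Conv-IN-ReLU stages, CBAM
    reweighting, and an identity skip connection."""
    def __init__(self, channels: int = 128):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.in1 = nn.InstanceNorm2d(channels, affine=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.in2 = nn.InstanceNorm2d(channels, affine=True)
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.in1(self.conv1(x)))
        out = torch.relu(self.in2(self.conv2(out)))
        out = out * self.ca(out)   # channel reweighting
        out = out * self.sa(out)   # spatial reweighting
        return x + out             # F_out = F_in + F_att
```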

3.3. Contrastive Coherence Preserving Loss

Style transfer involves applying the color, texture, and fine details of a style image to a target image, modifying objects and scenes accordingly. This process is governed by two key loss functions: content loss and style loss. Content loss assesses the high-dimensional feature differences between the generated image and the target content image by comparing feature maps extracted from a CNN. Style loss, on the other hand, quantifies the stylistic discrepancies between the generated and target images, which is commonly achieved using Gram matrix comparisons [27].
When initializing style transfer on consecutive video frames using independent Gaussian noise, each frame may converge to significantly different local minima, leading to noticeable flickering artifacts [28]. To mitigate style jitter and content leakage in the generated videos, we incorporate CCPL within the perceptual loss function, improving temporal consistency across frames.
The CCPL module randomly samples $N$ vectors from the encoded features of the input image $C$ and the stylized image $G$; the sampled vectors $G_a^x$, $x = 1, \ldots, N$, are referred to as anchors. Positive samples are obtained by sampling the eight nearest neighboring vectors $G_n^{x,y}$, $y = 1, \ldots, 8$, within the spatial neighborhood of each anchor $G_a^x$. Negative samples are the difference vectors associated with the other anchor points in the same batch. In Equation (6), $d_g^{x,y} = G_a^x \ominus G_n^{x,y}$ is the feature difference between an anchor and a positive sample in the stylized image, and $d_c^{x,y} = C_a^x \ominus C_n^{x,y}$ is the difference at the corresponding positions in the input image:
$$d_g^{x,y} = G_a^x \ominus G_n^{x,y}, \qquad d_c^{x,y} = C_a^x \ominus C_n^{x,y}$$
$$L_{ccp} = -\sum_{m=1}^{8 \times N} \log \left[ \frac{\exp\left(d_g^m \cdot d_c^m / \tau\right)}{\exp\left(d_g^m \cdot d_c^m / \tau\right) + \sum_{n=1, n \neq m}^{8 \times N} \exp\left(d_g^m \cdot d_c^n / \tau\right)} \right] \quad (6)$$
where $\ominus$ denotes vector subtraction and $\tau$ is a temperature parameter that controls the sharpness of the similarity distribution.
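The following sketch shows one way to realize the CCPL term of Equation (6), assuming encoder features of the content and stylized images at the same layer; the number of anchors, the temperature, and the cosine-normalized InfoNCE formulation are illustrative choices rather than the exact released implementation of [5].

```python
import torch
import torch.nn.functional as F

def ccpl_loss(feat_c: torch.Tensor, feat_g: torch.Tensor,
              num_anchors: int = 8, tau: float = 0.07) -> torch.Tensor:
    """CCPL sketch (Equation (6)): difference vectors between anchors and
    their 8 neighbors in the stylized features should match the
    corresponding content-feature differences (InfoNCE over the batch).

    feat_c, feat_g: (B, C, H, W) encoder features of the content and
    stylized images at the same layer.
    """
    b, c, h, w = feat_c.shape
    # sample anchors away from the border so all 8 neighbors exist
    ys = torch.randint(1, h - 1, (num_anchors,))
    xs = torch.randint(1, w - 1, (num_anchors,))
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]

    d_g, d_c = [], []
    for y, x in zip(ys.tolist(), xs.tolist()):
        for dy, dx in offsets:
            d_g.append(feat_g[..., y, x] - feat_g[..., y + dy, x + dx])
            d_c.append(feat_c[..., y, x] - feat_c[..., y + dy, x + dx])
    d_g = F.normalize(torch.stack(d_g, dim=1), dim=-1)   # (B, 8N, C)
    d_c = F.normalize(torch.stack(d_c, dim=1), dim=-1)

    logits = torch.bmm(d_g, d_c.transpose(1, 2)) / tau   # (B, 8N, 8N)
    labels = torch.arange(logits.size(1), device=logits.device)
    labels = labels.unsqueeze(0).expand(logits.size(0), -1)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1))
```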

3.4. The Training and Loss Function

We adopt a phased training strategy to train both WGSNet and the mask sub-network. The training objective consists of segmentation loss and image perceptual loss, achieving localized style transfer through multi-task optimization. The detailed configuration is as follows:
  • Segmentation loss is a weighted combination of the classification loss, dice loss, and mask loss. The overall loss function is expressed as follows:
    $$L_{total} = \lambda_{cls} L_{cls} + \lambda_{mask} L_{mask} + \lambda_{dice} L_{dice} \quad (7)$$
    where $L_{cls}$ represents the cross-entropy loss, $L_{mask}$ represents the binary cross-entropy loss, and $L_{dice}$ represents the dice loss.
  • Cross-entropy loss employs class-weighted balanced cross-entropy to address class imbalance, ensuring that the model can learn distinguishing boundaries between different classes. Dice loss enhances the modeling of segmentation region continuity, focusing on improving the boundary accuracy between foreground objects and the background, making it particularly suitable for improving segmentation quality in scenarios involving small targets or complex backgrounds. If the task prioritizes overall region accuracy, increasing the weight of the dice loss is advisable. Conversely, if pixel-level classification precision is more critical, elevating the weight of the cross-entropy loss is recommended. Mask loss is based on binary cross-entropy, focusing on refining mask predictions. Equation (8) is as follows:
    $$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N} w_c\, y_i \log p_i, \qquad L_{dice} = 1 - \frac{2\sum_i p_i y_i + \epsilon}{\sum_i p_i + \sum_i y_i + \epsilon}, \qquad L_{mask} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \sigma(p_i) + (1 - y_i)\log\left(1 - \sigma(p_i)\right) \right] \quad (8)$$
    where $N$ represents the number of instances, $w_c$ denotes the class weight, $y_i$ is the ground-truth label, and $p_i$ is the predicted probability.
  • Content loss is computed by measuring the mean squared error between high-level convolutional features of the target and content images, evaluating content similarity.
Style loss captures the texture statistics of style images through the Gram matrix. Equation (9) is as follows:
$$L_{content} = \lambda_{content}\sum_{i,j}\left( F_{ij}^{l}(I) - F_{ij}^{l}(I_{content}) \right)^2, \qquad L_{style} = \lambda_{style}\sum_{i,j}\left\| G_{ij}^{l}(I) - G_{ij}^{l}(I_{style}) \right\|^2 \quad (9)$$
where $F^l(I) \in \mathbb{R}^{C_l \times H_l \times W_l}$ and $F_{ij}^{l}$ denotes the activation at the $i$-th channel and $j$-th spatial location of the $l$-th convolutional layer of the VGG-16 network. $G^l$ represents the corresponding Gram matrix of the $l$-th layer. $I$ is the generated image, $I_{style}$ is the style reference image, and $I_{content}$ is the content image.
  • The perceptual loss function for the style transfer is expressed as follows (a minimal computation sketch is given after this list):
    $$L_{total} = \lambda_{content} L_{content} + \lambda_{style} L_{style} + \lambda_{ccp} L_{ccp} \quad (10)$$
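A minimal sketch of how the total perceptual loss in Equation (10) can be assembled, reusing the gram_matrix and ccpl_loss sketches given earlier; the layer choices follow Section 4.2, while the loss weights shown are illustrative rather than the exact training configuration.

```python
import torch.nn.functional as F

def perceptual_loss(vgg_g, vgg_c, vgg_s,
                    lambda_content=1e5, lambda_style=1e4, lambda_ccp=1e5):
    """Total perceptual loss sketch (Equations (9) and (10)).

    vgg_g, vgg_c, vgg_s: dicts of VGG-16 activations keyed by layer name
    for the generated, content, and style images. gram_matrix and
    ccpl_loss are the sketches from Sections 3.1 and 3.3.
    """
    content_layer = "relu3_3"
    style_layers = ["relu1_2", "relu2_2", "relu3_3", "relu4_3"]

    l_content = F.mse_loss(vgg_g[content_layer], vgg_c[content_layer])
    l_style = sum(F.mse_loss(gram_matrix(vgg_g[k]), gram_matrix(vgg_s[k]))
                  for k in style_layers)
    l_ccp = ccpl_loss(vgg_c[content_layer], vgg_g[content_layer])

    return (lambda_content * l_content
            + lambda_style * l_style
            + lambda_ccp * l_ccp)
```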

4. Experiments

This section presents the experimental performance of key modules in our study. First, we introduce the datasets used to train the instance segmentation model and WGSNet. Then, we describe the experimental setup. Finally, we evaluate the qualitative and quantitative results to verify the effectiveness of our model for video stylization and analyze user preferences across different models.

4.1. Dataset

To train the image segmentation model, we used a large-scale fish segmentation dataset available on Kaggle [29]. The dataset provided pixel-level fish body masks and species classifications, enabling the Mask2Former module to accurately separate fish foregrounds. This ensured that the style transfer was applied exclusively to target regions, minimizing background interference. To train the stylization network, we selected the MS-COCO dataset [30], which contains approximately 83,000 images, with each image resized to 512 × 512 pixels. Its diverse and complex contexts offer rich content features, enhancing the generalization capabilities of style transfer models. Moreover, the extensive volume of samples aids in reducing overfitting risks and stabilizing model learning outcomes. To validate the practical applicability and reliability of our proposed model, we performed comparative experiments using our self-constructed ornamental fish video dataset.
We created a dataset simulating a real fish tank environment, covering various freshwater and tropical ornamental fish species, such as red-and-white guppy, hat goldfish, tancho koi, and Asian arowana. As illustrated in Figure 8, we manually annotated video footage of red-white koi fish, containing 2020 fish segmentation annotations. The dataset was captured using a Sony ZV-E10 camera (Tokyo, Japan) at a 1920 × 1080 resolution at 30 FPS.

4.2. Implementation Details

Our experiments were conducted on a desktop computing platform running Ubuntu 16.04.4 LTS, with 16 GB of memory, an AMD Ryzen 9 5950X 16-core CPU, and an NVIDIA GeForce RTX 3090 24 GB GPU. The programming language was Python 3.7, the deep learning framework was PyTorch 1.13.1, and the CUDA version was 11.7.
For the segmentation model, we used Mask2Former, pre-trained on the ImageNet-22K dataset [31]. The optimization followed the AdamW optimizer with a learning rate of 1 × 10−4 and weight decay of 0.05. For certain backbone parameters, a lower learning rate multiplier was applied. We trained the model for 20,000 iterations using a polynomial learning rate decay schedule.
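For reference, the optimizer and schedule described above can be set up as in the sketch below; the backbone learning-rate multiplier of 0.1, the parameter-name matching, and the decay power of 0.9 are assumptions, while the base learning rate, weight decay, and iteration budget follow the text.

```python
import torch

def build_segmentation_optimizer(model: torch.nn.Module,
                                 base_lr: float = 1e-4,
                                 backbone_lr_mult: float = 0.1,
                                 max_iters: int = 20_000,
                                 power: float = 0.9):
    """AdamW with a lower backbone learning rate and polynomial decay."""
    backbone, head = [], []
    for name, param in model.named_parameters():
        (backbone if "backbone" in name else head).append(param)
    optimizer = torch.optim.AdamW(
        [{"params": backbone, "lr": base_lr * backbone_lr_mult},
         {"params": head, "lr": base_lr}],
        weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.PolynomialLR(
        optimizer, total_iters=max_iters, power=power)
    return optimizer, scheduler   # call scheduler.step() once per iteration
```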
For WGSNet, we computed the content loss and CCPL loss using the relu3_3 layer of the VGG-16 network [22], while the style loss was computed from relu1_2, relu2_2, relu3_3, and relu4_3. Training the model on a single NVIDIA GeForce RTX 3090 GPU took approximately 2 h. Applying our algorithm to stylize a 295-frame video with a resolution of 1440 × 830 took a total of 263.08 s, averaging 0.89 s per high-resolution frame. In addition, to assess the model's performance on long sequences, we selected a long video from the fish tank simulation dataset for these experiments.
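The loss layers listed above can be extracted as in the sketch below, which assumes torchvision's vgg16().features indexing (relu1_2, relu2_2, relu3_3, and relu4_3 sit at indices 3, 8, 15, and 22) and a torchvision version that supports the weights enum.

```python
import torch
import torchvision.models as models

_LOSS_LAYERS = {3: "relu1_2", 8: "relu2_2", 15: "relu3_3", 22: "relu4_3"}

def vgg_features(image: torch.Tensor, vgg=None) -> dict:
    """Return the VGG-16 activations used for the content, style, and CCPL losses."""
    if vgg is None:
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)   # the loss network stays frozen
    feats, x = {}, image
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in _LOSS_LAYERS:
            feats[_LOSS_LAYERS[idx]] = x
        if idx >= max(_LOSS_LAYERS):
            break
    return feats
```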

4.3. Comparison

4.3.1. Fish Body Instance Segmentation

We trained multiple instance segmentation models on our ornamental fish dataset; the quantitative comparison results are presented in Table 1. During training, Mask2Former achieved its best fish segmentation performance at 17,000 iterations, recording an IoU of 94.81% and a dice coefficient of 97.32%. Because Mask2Former is based on a transformer architecture and employs dynamic mask prediction along with multi-head attention and multi-scale feature enhancement, the experimental results show that it has clear advantages when segmenting small objects such as fish. We therefore selected Mask2Former as our primary instance segmentation model.
As shown in Figure 9a, we compared the segmentation results of different models on the goldfish dataset. The first column represents input video frames, while columns two to five show the segmentation predictions from different models. The selected images contain overlapping fish, as well as partially occluded or disappearing and reappearing objects. Compared to the other models, Mask2Former effectively handles these challenges, accurately segmenting fish instances, even in complex occlusion scenarios. Additionally, Figure 9b highlights the segmentation of goldfish tails, demonstrating that Mask2Former achieves more precise extraction in cases where tail segmentation is typically challenging.
Figure 10 illustrates a comparison of localized style transfer images based on different segmentation masks. In areas where the fish’s edges are difficult to distinguish from the background, K-Net and Mask2Former yield different results. In Figure 10c, which shows the method using K-Net, there is style blending of the background and foreground in the regions where fish tails overlap with multiple fish bodies. In contrast, the image based on Mask2Former, as shown in Figure 10d, accurately captures complex boundaries, achieving clear style separation between the fish and the background while preserving the texture details of the content image.
Although the model was specifically designed with stability for long-sequence processing in mind, we observed during actual operation that occasional color drifts occur when the fish body makes large movements within a short period. These drifts cause a loss of texture in the stylized images and adversely affect certain video frames. The primary challenges encountered during long-sequence video testing were the model's difficulty in adapting to rapidly moving fish and flickering stylization effects caused by underwater disturbances, such as seagrass and oxygenation bubbles. These factors can lead to occasional segmentation errors, affecting the overall quality of the stylized video. Figure 11 illustrates this phenomenon.

4.3.2. Quantitative Results

To compare our proposed FSSNet model with existing methods, we selected CycleGAN, AdaAttN, and SCTNet + Lccp for quantitative evaluation using the SSIM and PSNR metrics [35]. The results in Table 2 indicate that FSSNet outperforms the competing methods on both SSIM and PSNR, reflecting better image fidelity and perceptual quality in the generated video frames.
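SSIM and PSNR are computed per frame between the content frame and its stylized counterpart and averaged over the video; a minimal sketch using scikit-image (assuming a version that accepts channel_axis) is shown below.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def frame_quality(content_frame: np.ndarray, stylized_frame: np.ndarray):
    """SSIM and PSNR for one pair of uint8 RGB frames of shape (H, W, 3)."""
    ssim = structural_similarity(content_frame, stylized_frame, channel_axis=2)
    psnr = peak_signal_noise_ratio(content_frame, stylized_frame, data_range=255)
    return ssim, psnr
```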

4.3.3. Qualitative Results

To ensure that the generated images maintained a high resemblance to the content image while balancing stylized textures, we conducted multiple evaluations by adjusting the style weight parameter. When the CCPL weight is set too low, the flow warping error increases, indicating decreased temporal coherence. We therefore kept the content weight and the CCPL weight fixed at 10^5. This allowed us to avoid both excessive similarity, which leads to a loss of artistic realism, and excessive difference, which results in poor texture retention.
Figure 12 illustrates how different style weights impact the final stylized output. When the style loss is relatively high, the fish structure is well represented through the masks; however, much of the content image’s texture is lost. Conversely, when the style loss is too low, the generated image retains only some color features from the style image, resulting in reduced detail. Thus, it is crucial to balance the weights of content loss, style loss, and CCPL to achieve significant stylization effects while maintaining structural fidelity.
Figure 13 demonstrates the spatial control capabilities of FSSNet. By manipulating the segmentation masks for the foreground fish and background, our model allows users to apply different artistic styles independently to each region, offering greater customizability in artistic fish video generation.
To further validate the perceptual quality of our generated videos under different style transfer models, we conducted a user study. We selected four ornamental fish video clips from our dataset and applied three different style transfer algorithms using four representative paintings as reference styles.
The four videos represented typical scenarios pertinent to our study, namely multiple fish with mutual occlusion, unstable background influences, rapid swimming of multiple fish, and steady swimming by a single fish. The partial frames of the generated video are shown in Figure 14.
A total of 40 participants were recruited for the study, aged between 20 and 45, with a 1:1 gender ratio. The group included 10 computer vision researchers, 10 students majoring in marine science, and 20 general users with no relevant background, ensuring diversity of evaluation perspectives. In each round of the survey, participants selected their preferred video from the four generated results. The results are shown in Table 3. Our model with the spatial control mechanism was preferred by the most participants across all evaluation metrics, demonstrating its effectiveness in balancing style adaptation and content preservation. For the stylization strength metric, FSSNet with spatial control achieved a score of 56.25%.

4.3.4. Ablation

As illustrated in Figure 15, compared to traditional convolutional downsampling, the DWT module significantly enhances the capture of textural details by preserving the frequency sub-bands.
To assess the computational complexity of our model, we calculated the total number of parameters responsible for style transfer and compared them with those of FNST. To simulate real-world working conditions, we used 1920 × 1080 three-channel images to compute FLOPs. For FPS measurements, we processed 100 frames of 1920 × 1080 images with each model, repeating the computation three times to obtain the average FPS. The stylization method used was spatial control. As shown in Table 4, the introduction of DWT reduces the parameter count, while the addition of CBAM increases both the parameter count and FLOPs. Overall, we enhanced the model's expressiveness with minimal impact on runtime efficiency.
To evaluate the temporal consistency improvements of our model on video frames, we compared the flow warping error among different models, following the definition in [36]. For this experiment, we selected three long ornamental fish videos from our fish video database, containing occlusions and fast-swimming fish, to test challenging scenarios. To ensure consistency across all tested models, the output videos were configured to 10 FPS, following the default setting in [5], and the resolution was kept at the original 1920 × 1080 pixels. As shown in Table 5, our FSSNet model demonstrates significant improvements, achieving a flow warping error improvement of 5.443–5.862 compared to the baseline models.
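For reference, the flow warping error can be computed along the lines of the sketch below, in the spirit of [36]: the previous stylized frame is backward-warped to the current one with pre-computed optical flow, and the squared difference is averaged over non-occluded pixels. The flow estimator, occlusion handling, and normalization details here are assumptions, not the exact protocol of [36].

```python
import torch
import torch.nn.functional as F

def flow_warping_error(prev_stylized: torch.Tensor,
                       curr_stylized: torch.Tensor,
                       flow: torch.Tensor,
                       occlusion_mask: torch.Tensor) -> torch.Tensor:
    """Masked warping error between consecutive stylized frames.

    prev_stylized, curr_stylized: (B, 3, H, W) stylized frames t-1 and t
    flow:           (B, 2, H, W) flow from frame t to t-1, in pixels
    occlusion_mask: (B, 1, H, W), 1 where the flow is reliable
    """
    b, _, h, w = flow.shape
    # base sampling grid, then convert to normalized [-1, 1] coordinates
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(flow.device)   # (H, W, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)            # add flow
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    warped = F.grid_sample(prev_stylized, grid, align_corners=True)
    diff = (curr_stylized - warped) ** 2 * occlusion_mask
    return diff.sum() / (occlusion_mask.sum() * 3 + 1e-8)
```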

5. Conclusions

Considering the localized style transfer requirements of ornamental fish videos, this paper proposes FSSNet, a model that balances aesthetic quality and temporal consistency. Our approach effectively addresses several challenges of traditional style transfer methods, including high-frequency detail loss, inadequate global feature understanding, and limited multi-style generalization capability. Its effectiveness in dynamic video processing was further validated in user studies. In future work, we plan to improve the speed and accuracy of instance segmentation for multiple fish targets and to explore applying FSSNet to real-time streaming scenarios and higher-quality imagery.

Author Contributions

Conceptualization, Y.W. and H.Y.; methodology, Y.W.; software, Y.W.; validation, Y.W. and H.Y.; formal analysis, Y.W.; investigation, Y.W.; resources, Y.W. and H.Y.; data curation, Y.W. and H.Y.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W.; visualization, Y.W.; supervision, H.Y.; project administration, H.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 52478564).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All original contributions of this study are contained within the article. For further inquiries, please contact the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, T.Q.; Schmidt, M. Fast Patch-based Style Transfer of Arbitrary Style. arXiv 2016. [Google Scholar] [CrossRef]
  2. Singh, A.; Jaiswal, V.; Joshi, G.; Sanjeeve, A.; Gite, S.; Kotecha, K. Neural Style Transfer: A Critical Review. IEEE Access 2021, 9, 131583–131613. [Google Scholar] [CrossRef]
  3. Luan, F.; Paris, S.; Shechtman, E.; Bala, K. Deep Painterly Harmonization. Comput. Graph. Forum 2018, 37, 95–106. [Google Scholar] [CrossRef]
  4. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. arXiv 2022. [Google Scholar] [CrossRef]
  5. Wu, Z.; Zhu, Z.; Du, J.; Bai, X. CCPL: Contrastive Coherence Preserving Loss for Versatile Style Transfer. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 189–206. [Google Scholar] [CrossRef]
  6. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of The European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  7. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018. [Google Scholar] [CrossRef]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2016. [Google Scholar] [CrossRef]
  9. Kurzman, L.; Vazquez, D.; Laradji, I. Class-Based Styling: Real-time Localized Style Transfer with Semantic Segmentation. arXiv 2019. [Google Scholar] [CrossRef]
  10. Cheng, H.K.; Tai, Y.-W.; Tang, C.-K. Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 11781–11794. [Google Scholar]
  11. Wang, F.; Luo, X.; Wang, Q.; Li, L. Aerial-BiSeNet: A real-time semantic segmentation network for high resolution aerial imagery. Chin. J. Aeronaut. 2021, 34, 47–59. [Google Scholar] [CrossRef]
  12. Gatys, L.A.; Ecker, A.S.; Bethge, M. A Neural Algorithm of Artistic Style. arXiv 2015. [Google Scholar] [CrossRef]
  13. Chu, C.; Zhmoginov, A.; Sandler, M. CycleGAN, a Master of Steganography. arXiv 2017. [Google Scholar] [CrossRef]
  14. Huang, X.; Belongie, S. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. arXiv 2017. [Google Scholar] [CrossRef]
  15. Liu, S.; Lin, T.; He, D.; Li, F.; Wang, M.; Li, X.; Sun, Z.; Li, Q.; Ding, E. AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer. arXiv 2021. [Google Scholar] [CrossRef]
  16. Yan, W.; Hafner, D.; James, S.; Abbeel, P. Temporally consistent transformers for video generation. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; JMLR.org: Honolulu, HI, USA, 2023; Volume 202, pp. 39062–39098. [Google Scholar]
  17. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv 2016. [Google Scholar] [CrossRef]
  18. Chen, D.; Liao, J.; Yuan, L.; Yu, N.; Hua, G. Coherent Online Video Style Transfer. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1114–1123. [Google Scholar] [CrossRef]
  19. Li, X.; Liu, S.; Kautz, J.; Yang, M.-H. Learning Linear Transformations for Fast Arbitrary Style Transfer. arXiv 2018. [Google Scholar] [CrossRef]
  20. Luan, F.; Paris, S.; Shechtman, E.; Bala, K. Deep Photo Style Transfer. arXiv 2017. [Google Scholar] [CrossRef]
  21. Sun, J.; Liu, X. Local Style Transfer Method Based on Residual Neural Network. Prog. Laser Optoelectron. 2020, 57, 081012. [Google Scholar] [CrossRef]
  22. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015. [Google Scholar] [CrossRef]
  23. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image Style Transfer Using Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423. [Google Scholar]
  24. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv 2017. [Google Scholar] [CrossRef]
  25. Mallat, S. A Wavelet Tour of Signal Processing; Elsevier: Amsterdam, The Netherlands, 1999; ISBN 978-0-08-052083-4. [Google Scholar]
  26. Xu, G.; Liao, W.; Zhang, X.; Li, C.; He, X.; Wu, X. Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation. Pattern Recognit. 2023, 143, 109819. [Google Scholar] [CrossRef]
  27. Qiu, X.; Xu, R.; He, B.; Zhang, Y.; Zhang, W.; Ge, W. ColoristaNet for Photorealistic Video Style Transfer. arXiv 2022. [Google Scholar] [CrossRef]
  28. Ruder, M.; Dosovitskiy, A.; Brox, T. Artistic Style Transfer for Videos. In Pattern Recognition, Proceedings of the 38th German Conference, GCPR 2016, Hannover, Germany, 12–15 September 2016; Rosenhahn, B., Andres, B., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 26–36. [Google Scholar] [CrossRef]
  29. Ulucan, O.; Karakaya, D.; Turkan, M. A Large-Scale Dataset for Fish Segmentation and Classification. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020; pp. 1–5. [Google Scholar] [CrossRef]
  30. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]
  31. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Kai, L.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar] [CrossRef]
  32. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  33. Zhang, W.; Pang, J.; Chen, K.; Loy, C.C. K-Net: Towards Unified Image Segmentation. arXiv 2021. [Google Scholar] [CrossRef]
  34. Poudel, R.P.K.; Liwicki, S.; Cipolla, R. Fast-SCNN: Fast Semantic Segmentation Network. arXiv 2019. [Google Scholar] [CrossRef]
  35. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  36. Lai, W.-S.; Huang, J.-B.; Wang, O.; Shechtman, E.; Yumer, E.; Yang, M.-H. Learning Blind Video Temporal Consistency. In Computer Vision—ECCV 2018; Lecture Notes in Computer Science; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11219, pp. 179–195. ISBN 978-3-030-01266-3. [Google Scholar] [CrossRef]
Figure 1. Comparison between local style transfer and global style transfer.
Figure 2. Pipeline of the fish pattern sync-styling network. Our approach consists of three main stages, namely image segmentation, stylized image generation, and background fusion.
Figure 3. Framework of the fish body pattern sync-styling network. As shown, FSSNet comprises five core components, namely a wavelet-gated styling network (WGSNet), feature extraction network (VGG-16), style loss module, content loss module, and contrastive coherence preserving loss (CCPL) module.
Figure 4. Framework of a wavelet-gated style network.
Figure 5. Framework of the upsampling module.
Figure 6. Illustration of the wavelet transform method. HDF represents a high-pass decomposition filter; LDF represents a low-pass decomposition filter.
Figure 7. Residual module with channel attention and spatial attention.
Figure 8. Example of the goldfish segmentation dataset. (a) shows real goldfish video frames; (b) shows the mask corresponding to each frame.
Figure 9. Mask effects of different segmentation models. (a) shows the input video frames and the generated masks, while (b) shows a detail comparison between K-Net and Mask2Former.
Figure 10. Comparison of localized style transfer images generated using masks from different image segmentation models. (a) shows the selected content image, (b) is the style image with blue and white porcelain patterns, (c) is the result of localized style transfer using masks obtained with K-Net, and (d) is the result using masks obtained with Mask2Former.
Figure 11. Model performance in long-sequence frames. (a) represents the expected normal result, (b) shows color drifts caused by rapid fish movement, (c) demonstrates the model’s adjustment to the color drifts, with the video returning to normal levels.
Figure 12. Relationship between the generated images (blue and white porcelain style) and the style weight. (a) Content; (b) style weight = 10^5; (c) style weight = 10^4; (d) style weight = 2 × 10^3; (e) style weight = 10^3.
Figure 13. FSSNet under spatial control.
Figure 14. Stylized video results. (a) Original video; (b) style image; (c) SCTNet + CCPL; (d) globally stylized FSSNet; (e) locally stylized FSSNet.
Figure 15. Comparison of the effects of different modules. (a) Content image. (b) Style image. (c) Baseline. (d) Baseline + DWT. (e) Baseline + DWT + CBAM.
Table 1. Segmentation index of the fish body by different models. The symbol “↑” indicates that a higher value is preferable; the best-performing results are highlighted in bold.
Model | IoU ↑ | Acc ↑ | Dice ↑ | Precision ↑ | Recall ↑
Deeplabv3+ [32] | 94.46 | 96.26 | 97.15 | 97.69 | 96.30
K-Net [33] | 94.78 | 96.12 | 97.32 | 97.58 | 96.12
Fast-SCNN [34] | 83.65 | 86.23 | 86.78 | 87.04 | 86.23
Mask2Former [4] | 94.81 | 96.59 | 97.33 | 98.07 | 96.59
Table 2. Quantitative comparison of SSIM and PSNR between different style transfer models. “↑” means the higher the coefficient, the better the model.
Style | Index | CycleGAN [13] | AdaAttN [15] | SCTNet + Lccp [5] | FSSNet (Ours)
Blue and white porcelain | SSIM ↑ | 0.607 | 0.620 | 0.582 | 0.632
Blue and white porcelain | PSNR ↑ | 16.17 | 16.32 | 16.15 | 16.41
Starry night | SSIM ↑ | 0.612 | 0.641 | 0.577 | 0.653
Starry night | PSNR ↑ | 16.43 | 16.68 | 16.35 | 16.64
Mosaic | SSIM ↑ | 0.543 | 0.612 | 0.581 | 0.625
Mosaic | PSNR ↑ | 16.06 | 16.44 | 16.21 | 16.53
Feathers | SSIM ↑ | 0.492 | 0.528 | 0.503 | 0.546
Feathers | PSNR ↑ | 15.64 | 16.29 | 15.88 | 16.38
Table 3. Generated video user preference survey results. The user preference results show that participants selected the most aesthetically pleasing video. Temporal consistency is the degree of flickering in stylized videos. Stylization strength shows how well the generated video aligns with the reference style. Content fidelity means image sharpness and texture quality in the transformed videos.
Stylized Method | User Preference (%) | Temporal Consistency (%) | Stylization Strength (%) | Content Fidelity (%)
SCTNet + Lccp | 18.75 | 6.25 | 12.5 | 12.5
Global FSSNet | 25 | 25 | 18.75 | 25
Local FSSNet | 25 | 31.25 | 12.5 | 18.75
FSSNet with spatial control | 31.25 | 37.5 | 56.25 | 43.75
Table 4. Comparison of model complexity and running performance.
Style Model | Params (M) | FLOPs (GFLOPs) | FPS
Baseline | 1.679 | 321,010 | 1.225
Baseline + DWT | 1.628 | 310,393 | 1.200
Baseline + DWT + CBAM | 1.639 | 312,155 | 1.198
Table 5. Flow warping error comparison across different methods. Lower values indicate better temporal consistency.
Video Number | FNST | WGSNet | WGSNet + Lccp
1 | 6.405 | 4.768 | 1.412
2 | 7.855 | 5.314 | 1.791
3 | 7.653 | 4.849 | 1.812