Boundary-Aware Camouflaged Object Detection via Spatial-Frequency Domain Supervision

Wang, Penglin; Zhao, Yaochi; Hu, Zhuhua

doi:10.3390/electronics14132541


by Penglin Wang, Yaochi Zhao * and Zhuhua Hu

School of Cyberspace Security, Hainan University, No. 58 Renmin Avenue, Meilan District, Haikou 570228, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(13), 2541; https://doi.org/10.3390/electronics14132541

Submission received: 1 May 2025 / Revised: 6 June 2025 / Accepted: 16 June 2025 / Published: 23 June 2025

(This article belongs to the Section Artificial Intelligence)


Abstract

Camouflaged object detection (COD) aims to detect objects that seamlessly integrate with their surrounding environment and are thereby intractable to distinguish from the background. Existing approaches face difficulties in dynamically adapting to scenarios where the foreground closely resembles the background. Additionally, these methods primarily rely on single-domain boundary supervision while overlooking multi-dimensional constraints, leading to indistinct object boundaries. Inspired by the hawk’s visual predation mechanism, namely, global perception and local refinement, we design an innovative two-stage boundary-aware network, namely, SFNet, which relies on supervision in the spatial-frequency domains. In detail, to simulate the global perception mechanism, we design a multi-scale dynamic attention module to capture contextual relationships between camouflaged objects and surroundings and to enhance key feature representation. In the local refinement stage, we introduce a dual-domain boundary supervision mechanism that jointly optimizes boundaries in frequency and spatial domains, along with an adaptive gated boundary guided module to maintain global semantic consistency. Extensive experiments on four camouflaged object detection datasets demonstrate that SFNet surpasses state-of-the-art methods by 4.1%, with lower computational overhead and memory costs.

Keywords:

camouflaged object detection; boundary supervision; multi-scale dynamic attention

1. Introduction

Camouflage is an adaptive survival strategy among animals, allowing them to blend into their environment by modifying their color, texture, or markings to evade predators [1]. Camouflaged object detection (COD) is an essential research field in computer vision, with the central challenge of accurately detecting and segmenting objects that are highly similar in appearance to their surrounding background. COD has diverse applications, including medical image segmentation [2], post-disaster search and rescue for locating injured individuals [3], military reconnaissance to detect hidden facilities [4], agricultural monitoring for identifying concealed crops [5], and creative media applications [6].

Early COD methods predominantly depended on handcrafted features, including texture [7], color [8], intensity [9], and optical flow [10]. However, these approaches were heavily dependent on manual feature engineering and lacked adaptability to complex object structures and varying environmental conditions, limiting their effectiveness. With the emergence of deep learning, convolutional neural networks (CNNs) have significantly advanced COD by automatically learning discriminative features and capturing intricate patterns. Current COD methods based on deep learning are generally categorized into two major paradigms according to their design principles: (1) Feature Representation Learning: These approaches aim to detect camouflaged objects without explicit boundary supervision. Representative strategies include simulating predator vision for precise target localization [11], leveraging deep texture features to enhance object differentiation [12], modeling uncertainty to address ambiguous regions [13], integrating depth estimation to provide more comprehensive clues [14], and incorporating prompt-based learning to guide network adaptation [15]. (2) Boundary-Supervised Learning: These methods introduce explicit boundary supervision to refine boundary delineation [16]. However, they often struggle to model the dynamic interactions between foreground and background, making them less robust in complex environments. Although some studies have incorporated frequency-domain features to enhance boundary representations, they frequently fail to effectively integrate spatial and frequency information, resulting in suboptimal feature fusion and limited adaptability [17].

To address these challenges, we draw inspiration from the multi-scale perception, dynamic attention, and heightened boundary sensitivity of hawk vision during predation. Hawks utilize a hierarchical visual system that seamlessly integrates global and local information, enabling rapid prey detection even in complex environments. Their vision operates at multiple scales—first scanning broad areas to identify potential movement (global perception) before refining their focus on finer details (local refinement) to lock onto a target. Moreover, their dynamic attention mechanism selectively enhances salient regions while suppressing irrelevant background noise. Most critically, hawks exhibit exceptional sensitivity to boundary variations, allowing them to distinguish prey even when it blends seamlessly into its surroundings.

Motivated by these biological principles, we propose the two-stage boundary-aware network with spatial-frequency domain supervision (SFNet), a novel framework designed to enhance camouflaged object detection through a coarse-to-fine learning paradigm. Specifically, SFNet integrates three key modules: (1) The multi-scale dynamic attention (MSDA) module emulates the hawk’s “global-local perception” strategy by first capturing global context through multi-scale convolution and emphasizing key features via a channel attention mechanism. The high-frequency details are then isolated and combined with a spatial attention mechanism to focus on target regions, enabling the initial segmentation of camouflaged objects. (2) The dual-domain boundary supervision (DDBS) module is inspired by the hawk’s ability to accurately perceive fine-grained boundaries in natural environments. It simultaneously optimizes boundary features in both the spatial and frequency domains, enhancing the representation of boundary regions through cross-domain information fusion. This approach significantly improves the discriminability of camouflaged target boundaries. (3) The adaptive gated boundary guidance (AGBG) module dynamically balances object body and boundary features, ensuring robust boundary delineation and consistency in object segmentation.

By synergistically combining these modules, SFNet progressively refines the detection process from coarse to fine, achieving superior boundary accuracy and robustness. As shown in Figure 1, SFNet outperforms previous methods in challenging cases such as small objects, large objects, multiple objects, and occlusions. Importantly, our method effectively addresses two challenges that existing approaches struggle with: dynamically adapting to scenes where the foreground and background are highly similar, and handling ambiguous and blurred boundaries. Related discussions on existing methods are presented in Section 2. Our main contributions include framework and component innovations, as well as performance and efficiency optimization, as follows:

(1) We design a two-stage boundary supervision framework, SFNet, which extracts subject features by dynamically capturing detailed changes and adaptively combines boundary information obtained through a dual-domain boundary supervision mechanism, enabling the accurate detection of camouflaged objects.
(2) To ensure consistent characterization of both the body and the boundary features of the camouflaged object, we introduce a multi-scale dynamic attention module, a dual-domain boundary supervision mechanism, and an adaptive gated boundary guidance module for extracting, enhancing, and fusing object body and boundary features.
(3) Extensive experiments indicate that the model we propose achieves the highest performance across all four evaluation metrics, outperforming 18 existing advanced methods by a significant margin while incurring lower computational overhead and memory costs.

2. Related Work

Current camouflaged object detection methods can be broadly categorized into two paradigms. The first comprises feature representation learning methods, which enhance the model’s perception of camouflaged objects by extracting multi-level and multi-modal deep features. The second comprises boundary-supervised learning methods, which leverage boundary information as a supervisory signal to improve the model’s ability to distinguish between the target and the background at a fine-grained level.

2.1. Feature Representation Learning

Feature representation learning aims to enhance the detection of camouflaged targets by leveraging deep-level features. These approaches often utilize multi-level features or integrate various modalities to enhance the perception of camouflaged objects.

Inspired by the foraging strategies observed in animals in the natural world, several studies have mimicked how they recognize and distinguish camouflage patterns. Fan et al. [20] designed two modules, search and recognition, the former locating the approximate area of the camouflaged object, while the latter refines the detection results. He et al. [21] introduced an adversarial training framework inspired by prey–predator dynamics. Challenging camouflaged objects are generated on the prey side, while the completeness and boundary clarity of segmentation results are enhanced on the predator side.

Furthermore, some studies have focused on excavating deep texture features to address the challenge of camouflage. Ren et al. [22] incorporated multiple modules to learn texture-aware features across various levels, thus amplifying texture contrasts and improving differentiation between camouflaged targets and their background. Ji et al. [23] introduced a dual-branch architecture comprising a context encoder for learning contextual semantic information and a texture encoder for extracting structural textures.

In parallel, uncertainty-based estimation is another direction in camouflage detection. Liu et al. [24] accomplished stochastic uncertainty estimation by dynamically deriving the discrepancy between prediction and ground truth. The generated confidence maps accurately localize segmentation error regions and allow for a preliminary assessment of the prediction in the absence of ground truth. Zhang et al. [25] modeled both model and data biases in the training data, enabling efficient, highly reliable, and highly accurate camouflaged object detection.

Beyond these methods, approaches that integrate depth perception have garnered significant attention. Wu et al. [26] proposed an all-round attentive fusion strategy to enhance semantic consistency between decoding layers and thus reduce the effect of inherent noise in depth estimation.

Prompt-learning approaches have also gained traction in the field. Hu et al. [27] proposed a cross-modal chain-of-thought prompting method to generate specific visual prompts. Subsequently, the segmentation results were progressively optimized through stepwise mask generation. Luo et al. [28] demonstrated the use of 2D prompts to capture task peculiarities. With joint training, their approach was applicable to both COD and SOD (salient object detection) [29] tasks.

2.2. Boundary-Supervised Learning

Camouflaged objects typically blend into the background along their boundaries, and these boundary regions often contain detailed information. By incorporating boundary information as a supervisory signal, the model can better distinguish fine-grained contrasts between camouflaged objects and the surrounding background, thereby enhancing its discriminative capability. Sun et al. [18] designed a straightforward and efficient boundary extraction module to mine accurate semantic information of object boundaries. Li et al. [30] focused on the boundaries and textural characteristics of camouflaged objects separately and combined the enhanced features of both to derive the final prediction results. Guan et al. [31] proposed a two-branch COD framework that reconstructs structural and detailed information separately to decouple target recognition and edge detection, preventing confusion between target and edge features. Zhao et al. [32] extracted semantic and boundary information using hierarchical features, incorporated a dual-attention mechanism to filter discriminative information, and then refined the obtained information to generate clearer boundaries.

Moreover, novel approaches have been explored to integrate various types of information with boundary supervision, with the aim of improving the accuracy of camouflaged object detection. He et al. [33] incorporated frequency-domain information by employing learnable wavelets to break down features. This process split the features into diverse frequency bands, focusing particularly on the bands harboring abundant discriminative information. Additionally, they designed an auxiliary task to reconstruct highly precise and complete boundary prediction maps. The two tasks are jointly learned, ultimately producing precise segmentation results.

Although these approaches have promising performance, most methods focus on a single domain: either by enhancing features in the spatial domain or by implementing constraint mechanisms in the frequency domain. Such single-domain optimization strategies may result in suboptimal outcomes; frequency-domain techniques risk the loss of critical spatial details during high-frequency processing, while spatial-domain methods are comparatively less effective at mitigating noise. In contrast, our proposed dual-domain boundary supervision mechanism establishes enhancement channels in both the frequency and spatial domains simultaneously, achieving complementarity via feature fusion to improve camouflaged object detection performance.

3. Methodology

Figure 2 illustrates the overall architecture of the proposed SFNet. It is composed of two main phases: global perception and local refinement. The pre-trained PVTv2 [34] is employed as the backbone to extract features at multiple levels from the input image, represented as $f_{i}$, $i \in \{1, 2, 3, 4\}$. In the global perception phase, these multi-scale features are first processed by a Receptive Field Block (RFB) [35] to generate a collection of enhanced features $rf_{i}$, $i \in \{1, 2, 3, 4\}$. These enhanced features then pass through the MSDA module, which utilizes a dynamic attention mechanism to perform the localization and feature optimization of camouflaged objects. This mechanism combines multi-scale information [36] with an attention mechanism that integrates frequency channel attention [37] and spatial attention [38]. Subsequently, the Feature Fusion Block (FFB) aggregates the multi-scale features to produce a rough segmentation result. In the local refinement phase, we introduce the DDBS module, which enhances boundary information by jointly optimizing features in both the spatial domain and frequency domain, thereby capturing more accurate and detailed object boundaries. Finally, the AGBG module adaptively fuses the body and boundary features of the object, generating the final camouflaged object detection result while preserving global semantic consistency.

3.1. Multi-Scale Dynamic Attention Module

To effectively capture the complex relationship between the object and the background and achieve comprehensive perception of the scene from global to local information, this paper proposes an innovative multi-scale dynamic attention (MSDA) module, whose structure is depicted in Figure 3. At the global perception level, the module employs parallel multi-scale convolutional branches with kernels of varying sizes to accommodate feature extraction at different scales. Among them, smaller convolutional kernels (3 × 3) are adept at capturing fine-grained features and boundary information, making them well suited for complex scenarios where the foreground is highly blended with the background. They are particularly effective in mining subtle cues hidden within camouflaged objects, thereby enabling rapid localization. In contrast, larger convolutional kernels (5 × 5 and 7 × 7) possess a broader receptive field, allowing the extraction of richer contextual information. This facilitates a more comprehensive understanding of the overall structure of the camouflaged regions and their relationship with the surrounding environment, which in turn aids in delineating the global contour and shape of the object. Therefore, combining the fine-grained feature extraction of small kernels with the global contextual awareness of large kernels enables more accurate and comprehensive localization of camouflaged objects. Specifically, the output feature maps from each branch are first multiplied by their corresponding scale weights to produce weighted feature maps. These weighted multi-scale feature maps are then summed to obtain the adaptively fused multi-scale features $f_{i}^{m}$:

$$f_{i}^{m} = \sum_{k=1}^{3} w_{k} \cdot \mathrm{Conv2d}_{s_{k}}(rf_{i}) \tag{1}$$

where $\sum$ denotes summation, $w_{k}$ represents the weights, and $\mathrm{Conv2d}_{s_{k}}(rf_{i})$ refers to the convolution operation performed at each scale $s_{k}$.

Next, we apply a channel attention mechanism to capture critical features across channels, thereby enhancing the identification of key channels within the feature map [39]. This mechanism captures channel-level statistics via global average pooling and generates attention weights for each channel using a two-layer convolutional structure. The channel attention weights are calculated as follows:

$$C_{A} = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{ReLU}(\mathrm{Conv}_{1 \times 1}(\mathrm{Avg}(f_{i}^{m}))))) \tag{2}$$

where $\mathrm{Avg}$ denotes average pooling, $\mathrm{Conv}_{1 \times 1}$ represents a 1 × 1 convolution, the ReLU activation function is represented by $\mathrm{ReLU}$, while $\sigma$ indicates the Sigmoid activation function. Subsequently, the multi-scale feature is multiplied element-wise by the channel attention weights to produce the channel attention feature map $f_{i}^{c}$:

$$f_{i}^{c} = f_{i}^{m} \otimes C_{A} \tag{3}$$
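The channel attention of Equations (2) and (3) can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' implementation: the function name `channel_attention`, the bias-free weight matrices `w1` and `w2`, and the nested-list tensor layout are all assumptions made for readability. On a globally pooled vector, a 1 × 1 convolution reduces to a matrix-vector product, which is what the code uses.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat, w1, w2):
    """Sketch of Eqs. (2)-(3).

    feat: list of C channels, each an H x W list of lists.
    w1:   hidden x C weights of the first 1x1 convolution (hypothetical, no bias).
    w2:   C x hidden weights of the second 1x1 convolution (hypothetical, no bias).
    Returns the channel weights C_A and the reweighted feature map f^c.
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # First 1x1 convolution followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Second 1x1 convolution followed by Sigmoid gives C_A.
    c_a = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Eq. (3): rescale every channel by its attention weight.
    f_c = [[[v * a for v in row] for row in ch] for ch, a in zip(feat, c_a)]
    return c_a, f_c


feat = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
c_a, f_c = channel_attention(feat, w1=[[0.0, 0.0]], w2=[[1.0], [1.0]])
```

With the all-zero first layer used here, every channel receives the neutral weight 0.5, so the output is simply the input halved; trained weights would instead emphasize some channels over others.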

In terms of local perception, high-frequency components of feature maps contain rich details such as boundaries and textures, whereas low-frequency components primarily represent the overall structure or background information. To better focus on these fine and critical details, we enhance the high-frequency features. This is achieved by decomposing the feature map into a low-frequency component $f_{i}^{l}$ and a high-frequency component $f_{i}^{h}$. The low-frequency component is computed by averaging the feature map, and the high-frequency component is obtained by subtracting the low-frequency component from the original feature map and passing the difference through a 3 × 3 convolution. This process significantly highlights local detail information. The specific computation process is as follows:

$$\begin{cases} f_{i}^{l} = \mathrm{Avg}(f_{i}^{m}) \\ f_{i}^{h} = \mathrm{Conv}_{3 \times 3}(f_{i}^{m} - f_{i}^{l}) \end{cases} \tag{4}$$
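A minimal sketch of the decomposition in Equation (4) is given below. Two points are assumptions rather than details from the text: the average is taken globally over each channel (a local average pooling would be an equally plausible reading), and the learned 3 × 3 convolution is omitted, so the function returns the raw residual. The name `split_frequencies` is hypothetical.

```python
def split_frequencies(channel):
    """Split one H x W channel into low- and high-frequency parts.

    Assumption: Avg is a global mean over the channel.
    The learned 3x3 convolution of Eq. (4) is left out of this sketch.
    """
    h, w = len(channel), len(channel[0])
    mean = sum(sum(row) for row in channel) / (h * w)
    low = [[mean] * w for _ in range(h)]                  # f^l: smooth background level
    high = [[v - mean for v in row] for row in channel]   # f^m - f^l: local detail
    return low, high


low, high = split_frequencies([[1.0, 3.0], [5.0, 7.0]])
```

By construction the high-frequency part sums to zero, which reflects that it carries only deviations from the overall level.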

Subsequently, the channel attention feature maps undergo mean and max pooling separately. The resulting pooled features $f_{i}^{c\_mean}$ and $f_{i}^{c\_max}$ are fused with the high-frequency components, and the result undergoes an activation operation to generate the spatial attention weights $S_{A}$, which accurately locate the target region. The formula is as follows:

$$S_{A} = \sigma(\mathrm{Conv}_{7 \times 7}(f_{i}^{c\_mean} \,©\, f_{i}^{c\_max} \,©\, f_{i}^{h})) \tag{5}$$

where $\mathrm{Conv}_{7 \times 7}$ denotes a 7 × 7 convolution, $\sigma$ represents the Sigmoid activation function, and © indicates concatenation.

Ultimately, the channel attention feature maps are combined with the spatial attention weights through element-wise multiplication, resulting in feature maps that fuse the channel and spatial attention $f_{i}^{s}$:

$$f_{i}^{s} = f_{i}^{c} \otimes S_{A} \tag{6}$$

A learnable weight $\alpha$ is employed to balance the contributions of channel and spatial attention, allowing the model to dynamically shift its focus between semantic consistency and precise localization. Through this adaptive fusion, the final feature map $f_{i}^{\mathrm{MSDA}}$ integrates a weighted combination of both attention components:

$$f_{i}^{\mathrm{MSDA}} = \alpha \cdot f_{i}^{c} + (1 - \alpha) \cdot f_{i}^{s} \tag{7}$$
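The blend in Equations (6) and (7) can be illustrated on a single channel. This is a hedged sketch: `msda_blend` is a hypothetical name, and `alpha` is passed as a plain number even though the text describes it as a learnable weight.

```python
def msda_blend(f_c, s_a, alpha):
    """Sketch of Eqs. (6)-(7) for one H x W channel.

    f_c:   channel-attention feature map.
    s_a:   spatial attention weights in [0, 1], same shape as f_c.
    alpha: balance between the two branches (learnable in the paper, fixed here).
    """
    # Eq. (6): element-wise product with the spatial attention map.
    f_s = [[c * s for c, s in zip(rc, rs)] for rc, rs in zip(f_c, s_a)]
    # Eq. (7): convex combination of the channel-only and channel+spatial maps.
    return [[alpha * c + (1 - alpha) * s for c, s in zip(rc, rs)]
            for rc, rs in zip(f_c, f_s)]


blended = msda_blend([[2.0, 4.0]], [[0.5, 1.0]], alpha=0.5)
```

Setting `alpha` to 1 returns the channel-attention map unchanged, while smaller values let the spatial mask suppress more of the low-attention regions.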

$f_{i}^{\mathrm{MSDA}}$ is then passed through the feature fusion block, where a residual connection is utilized to generate the fused feature map $f_{i}^{\mathrm{ffb}}$, which is then processed by two convolution operations to generate a rough segmentation map. The first is a $3 \times 3$ convolution that preserves the spatial dimensions while extracting features, and the second is a $1 \times 1$ convolution that shrinks the channel dimension to a single channel, producing the final single-channel output:

$$f_{i}^{\mathrm{ffb}} = G_{2}^{R}(G_{1}^{R}(f_{i+1}^{\mathrm{ffb}} \oplus f_{i}^{\mathrm{MSDA}})) \tag{8}$$

$$f_{r} = \mathrm{Conv}_{1 \times 1}(\mathrm{Conv}_{3 \times 3}(f^{\mathrm{ffb}})) \tag{9}$$

where $G_{1}^{R}$ and $G_{2}^{R}$ represent residual connections.

3.2. Dual-Domain Boundary Supervision Module

Fine boundary information is of paramount importance in improving the accuracy of object detection and segmentation [40]. However, relying solely on spatial-domain supervision presents several limitations. On one hand, it exhibits weak sensitivity to faint or blurred boundaries; on the other hand, it struggles to effectively capture the global structural information of objects and is susceptible to background noise. These issues hinder the model’s ability to accurately delineate object boundaries in scenarios characterized by complex backgrounds or ambiguous edges.

To address the aforementioned challenges, particularly the inadequacy of single-spatial-domain supervision in modeling high-frequency boundary details, this paper proposes a dual-domain boundary supervision (DDBS) module that integrates both spatial and frequency domain information. The module consists of two collaborative branches: the spatial-domain branch focuses on detecting and extracting boundary information at the spatial level, while the frequency-domain branch aims to enhance high-frequency features related to boundaries, thereby improving the model’s sensitivity to detailed structural cues.

Since object boundaries often correspond to high-frequency components in the image, which are easily overlooked by conventional spatial-domain supervision, the design of the frequency-domain branch is particularly crucial. We first rescale the multi-channel feature maps by several ratios to generate images at various resolutions. To effectively separate low-frequency and high-frequency information, we then perform a two-dimensional fast Fourier transform (FFT) on each image to map it into the frequency domain, followed by the computation of its amplitude spectrum. Based on the average amplitude, the high-frequency regions are then enhanced to highlight the image’s details and boundary information. This frequency information is later resized via interpolation to match the original image size, ensuring consistency for subsequent processing. The detailed formulae are as follows:

$$\begin{cases} M = |\mathrm{FFT}_{2}(R(f^{\mathrm{ffb}}, z))| \\ M^{\prime} = M \times (1 + \alpha \cdot \mathbb{1}(M > \mu(M))) \\ M_{h} = R(M^{\prime}, \mathrm{size}(f^{\mathrm{ffb}})) \\ M_{\hat{h}} = \{M_{h} \mid z \in \mathrm{scales}\} \end{cases} \tag{10}$$

where $R$ is a scaling function used to scale the multi-channel feature map $f^{\mathrm{ffb}}$ by a ratio of $z$. $\mathrm{FFT}_{2}$ represents the two-dimensional fast Fourier transform. $M$ denotes the frequency amplitude, and $M^{\prime}$ is the result of processing the frequency amplitude $M$. Its processing method depends on the indicator function $\mathbb{1}(M > \mu(M))$, which is used to mark the frequency components whose amplitude is greater than the average value $\mu(M)$. $\alpha$ serves as an enhancement coefficient that is responsible for regulating the intensity of high-frequency enhancement. $\mathrm{size}(f^{\mathrm{ffb}})$ represents the original image’s size. $M_{h}$ is the frequency information resized to match the original image, obtained by interpolating $M^{\prime}$. $M_{\hat{h}}$ indicates the set of processed frequency information at different scales.
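To make the amplitude enhancement in Equation (10) concrete, the sketch below processes a single channel at a single scale. It is an illustration under stated assumptions, not the authors' code: the rescaling function $R$ and the multi-scale loop are left out, a direct (slow) discrete Fourier transform stands in for the FFT so that the example needs only the standard library, and the names `dft2` and `enhance_amplitude` as well as the default `alpha=0.5` are hypothetical.

```python
import cmath


def dft2(img):
    """Direct 2D discrete Fourier transform of an H x W list of lists.

    Numerically equivalent to a 2D FFT, but O((HW)^2); for illustration only.
    """
    h, w = len(img), len(img[0])
    spec = []
    for u in range(h):
        row = []
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2j * cmath.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(angle)
            row.append(acc)
        spec.append(row)
    return spec


def enhance_amplitude(img, alpha=0.5):
    """Second line of Eq. (10): boost components whose amplitude exceeds the mean."""
    amp = [[abs(c) for c in row] for row in dft2(img)]    # M
    mu = sum(map(sum, amp)) / (len(amp) * len(amp[0]))    # mu(M)
    return [[a * (1 + alpha) if a > mu else a for a in row] for row in amp]  # M'


enhanced = enhance_amplitude([[1.0, 1.0], [1.0, 1.0]], alpha=0.5)
```

For this constant test image the only non-zero amplitude sits at index (0, 0), so it is the only component above the mean and the only one scaled by `1 + alpha`. In the full pipeline the result would additionally be interpolated back to the size of the input feature map and collected over all scales.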

In the parallel spatial branch, we employ the Sobel operator to extract spatial boundary information from the image. Specifically, the horizontal and vertical convolution kernels of the Sobel operator are, respectively, applied to the coarse segmentation map via convolution to calculate the boundary response in both the horizontal and vertical directions. The spatial boundary information $E$ is obtained by calculating the gradient amplitudes in these two directions. The calculation process unfolds in the following way:

$$E = \|\vec{G}\|_{2} = \|(C(f_{r}, K_{x}), C(f_{r}, K_{y}))\|_{2} \tag{11}$$

where $K_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$ and $K_{y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$ denote the horizontal kernel and vertical kernel of the Sobel operator, respectively. $C(\cdot, \cdot)$ represents the two-dimensional convolution operation.
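Equation (11) can be sketched directly with the two kernels given above. The code below is a minimal illustration: the helper `correlate3x3` is hypothetical, zero padding at the image border is an assumption, and the kernels are applied without flipping (cross-correlation, as deep learning frameworks do), which changes only the sign of each response and therefore not the magnitude $E$.

```python
import math

K_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal Sobel kernel
K_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical Sobel kernel


def correlate3x3(img, kernel):
    """Apply a 3x3 kernel to an H x W map with zero padding (assumed)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out


def sobel_magnitude(img):
    """Eq. (11): L2 norm of the horizontal and vertical responses per pixel."""
    gx = correlate3x3(img, K_X)
    gy = correlate3x3(img, K_Y)
    return [[math.hypot(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]


# A coarse map with a vertical edge between columns 1 and 2.
coarse = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
edges = sobel_magnitude(coarse)
```

In the middle row the response peaks at the two columns adjacent to the edge. The non-zero values along the outer border of a constant region are an artifact of the assumed zero padding.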

Finally, the multi-scale frequency information and the spatial boundary information are concatenated as the input to two convolutions. The first convolution fuses these features and further extracts deep boundary features. The second convolution then produces the final fine boundary map. The corresponding formulae are presented below:

$$f_{e} = \mathrm{Conv}_{3 \times 3}(M_{h} \,©\, E) \tag{12}$$

$$E_{d} = \mathrm{Conv}_{3 \times 3}(f_{e}) \tag{13}$$

3.3. Adaptive Gated Boundary Guided Module

Figure 4 presents the integrated structure of the DDBS and AGBG modules. The AGBG module is designed to efficiently fuse the body and boundary features of an object. Unlike traditional fusion approaches that adopt static strategies and lack regional adaptability, the AGBG module introduces a learnable gating mechanism to enable dynamic, region-specific modulation of boundary information.

Specifically, the body features are first concatenated with the refined boundary features along the channel dimension. This concatenated tensor is processed by a $1 \times 1$ convolution to integrate inter-channel information while preserving spatial resolution. Subsequently, another $1 \times 1$ convolution followed by a Sigmoid activation produces a single-channel spatial gating map $W_{g}$, which serves as a soft attention mask. Compared to traditional noise suppression methods such as wavelet thresholding and non-local filtering, the spatial variation of $W_{g}$ allows the network to adaptively control the contribution of boundary features at each location. This ensures that noisy or irrelevant regions are suppressed while salient boundary cues are emphasized.

The gated boundary features are then further modulated by a Sigmoid function to generate boundary attention maps. These maps are used to enhance the original body features through element-wise multiplication and residual addition, yielding the final feature representation $f_{d}$. This residual connection ensures that essential body information is preserved while incorporating selectively enhanced boundary cues, thereby maintaining stability and reducing the risk of over-suppression.

$$W_{g} = \sigma(\mathrm{Conv}_{1 \times 1}(\mathrm{Conv}_{1 \times 1}(f^{\mathrm{ffb}} \,©\, f_{e}))) \tag{14}$$

$$f_{d} = (\sigma(W_{g} \otimes f_{e})) \otimes f^{\mathrm{ffb}} + f^{\mathrm{ffb}} \tag{15}$$
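The gating in Equations (14) and (15) can be illustrated per pixel on single-channel maps. The sketch below makes strong simplifying assumptions: both inputs have one channel, each 1 × 1 convolution collapses to a scalar weight (`w_body`, `w_edge`, `w_out`, all hypothetical and bias-free), and the function name `agbg_fuse` is invented for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def agbg_fuse(body, edge, w_body=1.0, w_edge=1.0, w_out=1.0):
    """Sketch of Eqs. (14)-(15) on single-channel H x W maps.

    body: body features f^ffb; edge: boundary features f_e.
    The three scalar weights stand in for the two 1x1 convolutions (assumption).
    """
    fused = []
    for row_b, row_e in zip(body, edge):
        out_row = []
        for b, e in zip(row_b, row_e):
            hidden = w_body * b + w_edge * e     # first 1x1 conv on the concatenation
            gate = sigmoid(w_out * hidden)       # Eq. (14): spatial gate W_g
            attention = sigmoid(gate * e)        # boundary attention map
            out_row.append(attention * b + b)    # Eq. (15): modulation + residual
        fused.append(out_row)
    return fused


f_d = agbg_fuse([[2.0]], [[0.0]])
```

Because the attention value lies strictly between 0 and 1, a positive body feature is scaled by a factor between 1 and 2: the residual term keeps the body information intact, and the gated boundary cue can only add emphasis on top of it.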

Finally, the segmentation result $f_{out}$ is obtained by two convolution operations:

$$f_{out} = \mathrm{Conv}_{1 \times 1}(\mathrm{Conv}_{3 \times 3}(f_{d})) \tag{16}$$

3.4. Loss Function

We utilize a weighted IoU loss [41] along with a BCE loss to provide comprehensive supervision for camouflaged objects and their boundaries. The total loss function $L$ is formulated in the following way, with the weights of all loss components set to 1:

$$L = L_{IoU}^{w}(f_{r}, G_{o}) + L_{IoU}^{w}(f_{d}, G_{o}) + L_{BCE}(f_{e}, G_{e}) \tag{17}$$

where $L_{IoU}^{w}(f_{r}, G_{o})$ and $L_{IoU}^{w}(f_{d}, G_{o})$ represent the weighted IoU losses of the coarse prediction map and the final prediction map, respectively, with respect to the ground truth mask, while $L_{BCE}(f_{e}, G_{e})$ denotes the BCE loss between the predicted boundary and the ground truth boundary.
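The structure of Equation (17) can be sketched as below. This is an assumption-laden illustration rather than the loss used in the paper: the per-pixel weighting scheme of [41] is not reproduced (a uniform weight map is used unless one is supplied), the `+ 1.0` smoothing terms and the clamping constant `eps` are choices made for this sketch, and all function names are hypothetical. Inputs are probabilities in [0, 1].

```python
import math


def weighted_iou_loss(pred, target, weights=None):
    """Soft IoU loss with optional per-pixel weights (uniform if omitted)."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    w = [1.0] * len(p) if weights is None else [v for row in weights for v in row]
    inter = sum(wi * pi * ti for wi, pi, ti in zip(w, p, t))
    union = sum(wi * (pi + ti - pi * ti) for wi, pi, ti in zip(w, p, t))
    return 1.0 - (inter + 1.0) / (union + 1.0)   # +1 smoothing is an assumption


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clamped away from 0 and 1."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            p = min(max(p, eps), 1.0 - eps)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def total_loss(coarse, final, edge, mask, edge_gt):
    """Eq. (17): all three terms carry weight 1."""
    return (weighted_iou_loss(coarse, mask)
            + weighted_iou_loss(final, mask)
            + bce_loss(edge, edge_gt))


mask = [[1.0, 0.0]]
perfect = total_loss(mask, mask, mask, mask, mask)
```

A perfect prediction drives all three terms to (numerically) zero, whereas a fully inverted mask yields a large IoU term, which is the behavior the combined supervision relies on.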

4. Experiments

This section provides a detailed description of the experimental validation process for the proposed SFNet in the task of camouflaged object detection (COD). Specifically, it covers the experimental setup—including dataset specifications, evaluation metrics, and training settings—followed by quantitative, qualitative, and efficiency analyses. Through comprehensive experimental evaluations, the superior performance of SFNet is demonstrated in terms of detection accuracy, robustness, and computational efficiency.

4.1. Datasets

In line with previous studies, we utilize the CAMO-Train [42] and COD10K-Train [1] datasets to train our model and evaluate its performance on their respective test sets as well as on the CHAMELEON [43] and NC4K [44] datasets. CAMO consists of 1250 images, with 1000 images allocated to the training set and the remaining 250 images designated for the test set. The dataset is categorized into two types: natural camouflaged objects, which include real-world animals, and artificially camouflaged objects, which consist of human-made items. CHAMELEON provides 76 images for testing, while COD10K comprises 10,000 images, which are categorized into 10 super-classes and further classified into 78 sub-classes. Additionally, NC4K contains 4121 images featuring a diverse range of camouflaged objects, further enriching our evaluation.

4.2. Evaluation Metrics

Our model is evaluated using four widely adopted metrics: structure measure ($S_{\alpha}$) [45], E-measure ($E_{\phi}$) [46], weighted F-measure ($F_{\beta}^{w}$) [47], and MAE ($M$) [48].

$S_{\alpha}$ assesses the structural similarity between the ground truth and predictions. The following is the formula for $S_{\alpha}$:

$$S_{\alpha} = \alpha S_{o} + (1 - \alpha) S_{r} \tag{18}$$

where $S_{o}$ denotes the target-perceived similarity, which focuses on the precise location of the target and the boundary alignment, $S_{r}$ denotes the region-perceived similarity, which evaluates the overlap between the true and estimated regions, and $\alpha$ is the weighting coefficient, set to 0.5, which balances the influence of the target-perceived and region-perceived similarities.

$E_{\phi}$ combines local pixel values with image-level averages in a single term, capturing both image-level statistical information and local pixel-matching information. The formula for calculation is as follows:

$$E_{\phi} = \frac{1}{w \times h} \sum_{x=1}^{w} \sum_{y=1}^{h} \phi_{FM}(x, y) \tag{19}$$

where $w$ represents the image’s width, and $h$ represents its height. $\phi_{FM}(x, y)$ represents the enhanced alignment matrix.

$F_{\beta}$ is based on the harmonic average of two metrics: Precision and Recall, serving as an essential metric for striking a balance between a model’s accuracy and coverage in classification tasks. $F_{\beta}^{w}$ is a weighted version of F-measure, which can be weighted according to the importance of the different categories during the computation, thus making the model evaluation more accurate. The calculation formula is as follows:

$$F_{\beta}^{w} = \frac{(1 + \beta^{2}) \cdot \mathrm{Precision}_{\omega} \cdot \mathrm{Recall}_{\omega}}{\beta^{2} \cdot \mathrm{Precision}_{\omega} + \mathrm{Recall}_{\omega}} \tag{20}$$

where $\mathrm{Precision}_{\omega}$ represents the weighted precision that assigns weights to the precision of each category, and $\mathrm{Recall}_{\omega}$ denotes the weighted recall that assigns weights to the recall of each category.
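The final combination step of Equation (20) is a one-line computation once the weighted precision and recall are available. The sketch below covers only that step: how the weighted precision and recall themselves are obtained follows [47] and is not reproduced here, and the function name `f_beta` and the guard for a zero denominator are assumptions. The value of $\beta^{2}$ is left as an explicit argument because the text does not state it.

```python
def f_beta(precision, recall, beta_sq):
    """Combine (weighted) precision and recall as in Eq. (20)."""
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0   # convention chosen for this sketch when both inputs are zero
    return (1 + beta_sq) * precision * recall / denom


balanced = f_beta(0.5, 0.5, beta_sq=1.0)
```

With `beta_sq=1.0` the expression reduces to the ordinary harmonic mean, so equal precision and recall return that same value; values of `beta_sq` below 1 shift the emphasis toward precision.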

M is a widely used metric for assessing the discrepancy between predicted and actual values. It calculates the average error across all pixels, offering a comprehensive measure of the model’s prediction accuracy. M is calculated as follows:

M = \frac{1}{W \times H} \sum_{x = 1}^{W} \sum_{y = 1}^{H} |S (x, y) - G (x, y)|

(21)

where $H$ and $W$ denote the height and width of the image, and $S (x, y)$ and $G (x, y)$ denote the predicted and ground-truth values at position $(x, y)$, respectively.
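Eq. (21) translates directly into NumPy. The sketch below assumes that both the prediction and the ground truth are already scaled to the range [0, 1]; the function name `mean_absolute_error` is chosen for illustration only.

```python
import numpy as np


def mean_absolute_error(pred, gt) -> float:
    """Eq. (21): mean of |S(x, y) - G(x, y)| over all pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    return float(np.abs(pred - gt).mean())
```

For a 2 × 2 map in which one pixel is off by 0.5 and the other three are exact, the function returns 0.5 / 4 = 0.125.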

4.3. Training Settings

The proposed SFNet model is implemented in PyTorch (version 2.5.1) and runs on a server equipped with a 16-core Intel Xeon Gold 6430 CPU and an RTX 4090 GPU. The backbone network is PVTv2 pre-trained on ImageNet [49]. During training, the batch size is set to 8, and parameters are updated with the Adam optimizer at an initial learning rate of $1 \times 10^{-4}$. Training spans 80 epochs and takes approximately 4 h. A polynomial decay schedule adjusts the learning rate throughout training, and gradient clipping is applied to ensure stable convergence. In both the training and testing phases, images are resized to 512 × 512, and the final predictions are resized back to the original resolution by bilinear interpolation. In the ablation studies, we further investigate the impact of different backbone networks by experimenting with ResNet [50] and Res2Net [51] pre-trained on ImageNet.
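The polynomial decay mentioned above can be sketched as a plain Python function. The initial learning rate and the number of epochs come from the text; the decay exponent is not reported in this section, so `power=0.9` is an assumption, as are the function name `poly_lr` and the choice to update once per epoch rather than once per iteration.

```python
def poly_lr(epoch: int, base_lr: float = 1e-4, total_epochs: int = 80, power: float = 0.9) -> float:
    """Polynomial decay: base_lr * (1 - epoch / total_epochs) ** power.

    The exponent `power` is an assumed value, not one given in the article.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    return base_lr * (1.0 - epoch / total_epochs) ** power


# Learning rate at the start, midpoint, and end of an 80-epoch run.
schedule = [poly_lr(e) for e in (0, 40, 80)]
```

In PyTorch, a factor of this form is typically supplied to `torch.optim.lr_scheduler.LambdaLR`. The text does not say whether gradients are clipped by norm or by value; `torch.nn.utils.clip_grad_norm_` and `torch.nn.utils.clip_grad_value_` are the two usual options.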

4.4. Comparison to the State of the Art

Our method is compared with 18 state-of-the-art camouflaged object detection methods, including SINet [1], MGL [16], UGTR [52], OCENet [24], SegMaR [53], ZoomNet [11], FEDER [33], MFFN [54], C²FNet [55], BGNet [18], FAPNet [19], BgNet [56], FSPNet [57], VSCode [28], FPNet [58], RISNet [59], SARNet [60], and MSCAF-Net [61]. The prediction results for all comparison methods were obtained either directly from the authors or generated by leveraging publicly accessible pre-trained models without any modifications. To ensure a fair and consistent evaluation, all predictions were assessed using the same codebase under identical experimental settings.

4.5. Quantitative Analysis

Table 1 compares our method with 18 state-of-the-art camouflaged object detection (COD) methods on four benchmark datasets using four commonly used metrics. Our method consistently surpasses all 18 competitors on every dataset. Specifically, under standard evaluation settings, it improves over the second-best competitor by 0.4%, 1.2%, 3.6%, and 8.7% on CAMO-Test; 0.3%, 0.8%, 3.0%, and 14.3% on CHAMELEON-Test; 0.5%, 1.5%, 2.8%, and 16.7% on COD10K-Test; and 0.2%, 0.7%, 1.3%, and 10.3% on NC4K-Test with respect to $S_{α}$, $E_{ϕ}$, $F_{β}^{w}$, and $M$, respectively. These improvements highlight the strong generalization capability of our method, which we attribute to two key innovations: the multi-scale dynamic attention mechanism and dual-domain boundary supervision. The former extracts target features across different scale levels, attending to fine-grained details while capturing the holistic context. This enables the model to more accurately discern subtle differences in camouflaged objects, as well as their positions and sizes. In the latter, frequency-domain supervision emphasizes high-frequency details in the image, while spatial-domain supervision reinforces the spatial structure of object boundaries. Together, these two forms of supervision enhance the clarity and integrity of object boundaries and suppress background interference and noise, which improves both robustness in complex environments and detection accuracy in challenging camouflaged scenarios.
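This section does not state how the percentage gains are computed. Gains above 10% in $M$ cannot be absolute differences on a metric bounded by [0, 1], so the sketch below assumes they are relative changes with respect to the reference method. The helper `relative_gain` is hypothetical, and the example values are the CAMO entries for SARNet and MSCAF-Net from Table A1, used only to show the arithmetic.

```python
def relative_gain(candidate: float, reference: float, lower_is_better: bool = False) -> float:
    """Relative change of `candidate` over `reference`, in percent.

    For error-type metrics such as M, a reduction counts as a positive gain.
    """
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    change = (reference - candidate) if lower_is_better else (candidate - reference)
    return 100.0 * change / reference


# Table A1, CAMO: MSCAF-Net (0.873, 0.046) against SARNet (0.868, 0.047).
gain_s_alpha = relative_gain(0.873, 0.868)
gain_mae = relative_gain(0.046, 0.047, lower_is_better=True)
```

Under this assumption, a small absolute change in $M$ produces a large percentage because the reference value is itself small.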

4.6. Qualitative Analysis

As depicted in Figure 5, we present a comparative analysis of several representative scenes selected from the four benchmark datasets. These scenes cover a diverse range of challenging cases, including human subjects (Row 1), tiny objects (Rows 3, 4, and 7), large objects (Rows 2 and 5), objects with intricate boundaries (Row 5), occluded objects (Row 6), and multiple camouflaged objects (Row 8). In these scenarios, our method exhibits marked advantages. For human targets characterized by complex structures and rich semantics, the model precisely distinguishes detailed regions such as the head, torso, and limbs, mitigating the adhesion and missed segmentation found in other methods. For small objects, our model captures subtle image details to achieve accurate localization and segmentation, whereas alternative methods are more susceptible to missed detections and localization errors. For large objects, it models the global semantic context to ensure both structural coherence and preservation of local details, reducing boundary discontinuities. For objects with intricate boundaries, our model retains fine-grained boundary information, while other methods often produce blurred edges. Under occlusion, our method segments both primary objects and occluding objects (e.g., branches in Row 6) with clear structural delineation; in contrast, other methods frequently yield incomplete segmentation or fail to differentiate occluders. In scenes containing multiple camouflaged objects, our model distinguishes each object clearly, avoiding the common issue of target merging.

We also tested the proposed method in various real-world scenarios, with the results illustrated in Figure 6. For objects whose boundaries are obscured by dynamic lighting or motion blur, conditions that often cause objects to blend into the background, our method achieves accurate segmentation of camouflaged objects while preserving boundary clarity. These results further highlight the robustness and generalization capability of our method in complex and challenging environments.

4.7. Efficiency Analysis

To clarify the relationship between accuracy and computational cost, Figure 7 and Figure 8 compare different models with respect to performance ($F_{β}^{w}$), parameters (M), FLOPs (G), and frame rate (FPS). Our method achieves a favorable balance between computational efficiency and segmentation accuracy and compares favorably in terms of parameters, FLOPs, and frame rate. Notably, SFNet requires 12.898 G fewer floating-point operations than the second-best model while maintaining higher segmentation accuracy and a high frame rate. This lightweight design enhances the model’s applicability in resource-constrained environments, such as mobile devices and drone surveillance. Furthermore, our method achieves substantial performance gains over existing leading models. For example, it outperforms MSCAF-Net by 5.9% in the $F_{β}^{w}$ metric. Overall, these results indicate that our method balances high accuracy with low computational cost, making it well suited for mobile and edge computing devices with strict constraints on computational resources and real-time responsiveness.
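The measurement protocol behind the reported frame rates is not described in this section. A common way to obtain such a number is to time repeated forward passes after a short warm-up, as in the following sketch. The helper `measure_fps`, its default iteration counts, and the stand-in workload are all hypothetical.

```python
import time
from typing import Callable


def measure_fps(run_once: Callable[[], object], warmup: int = 5, iters: int = 50) -> float:
    """Average number of calls to `run_once` completed per second."""
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        run_once()

    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very coarse clocks.
    return iters / max(elapsed, 1e-12)


# Stand-in workload; in practice this would be one forward pass on one image.
fps = measure_fps(lambda: sum(range(10_000)), warmup=2, iters=20)
```

When the workload runs on a GPU, execution is asynchronous, so a synchronization call such as `torch.cuda.synchronize()` is needed before each clock reading for the timing to be meaningful.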

5. Ablation Study

To evaluate the effectiveness of the essential components of SFNet, we conducted a series of ablation experiments and analyze the results below.

5.1. Effectiveness of Multi-Scale Dynamic Attention

The effectiveness of the multi-scale dynamic attention (MSDA) mechanism is validated through both quantitative evaluation and qualitative visual comparison. By emphasizing high-frequency details in the target regions, the MSDA module enhances the model’s ability to capture intricate boundary structures, preserving fine-grained boundary details that are often lost in conventional methods. Additionally, by adaptively integrating key features across multiple scales and dynamically adjusting attention weights, it refines crucial target representations, leading to more precise segmentation. As shown in Figure 9, incorporating the MSDA module enhances feature representation compared to the baseline model, producing clearer object boundaries and reducing noise in complex regions. The quantitative results in Table 2 show consistent gains across all four metrics. Specifically, relative to the baseline, our method achieves improvements of 1.5% in $S_{α}$, 1.2% in $E_{ϕ}$, 3.5% in $F_{β}^{w}$, and 10.4% in $M$. These results demonstrate that the MSDA mechanism improves segmentation accuracy while maintaining robust boundary delineation.

5.2. Effectiveness of Dual-Domain Boundary Supervision and Adaptive Gated Guidance

Building upon the baseline model, we introduce two modules: the dual-domain boundary supervision (DDBS) module and the adaptive gated boundary guided (AGBG) module. This ablation evaluates the synergistic impact of boundary supervision in both the frequency and spatial domains, as well as the benefit of adaptively guiding the integration of boundary information with global semantic features. As illustrated in Figure 9, including the DDBS and AGBG modules markedly improves the clarity of boundary predictions. This results in more accurate localization and segmentation, particularly for small or challenging objects, such as those in Row 4 of the figure. The results in Table 2 confirm the effectiveness of these modules: our model consistently surpasses the baseline across all metrics. Specifically, we observe notable improvements in $S_{α}$, $E_{ϕ}$, and $F_{β}^{w}$, along with a reduction in $M$. These results show that our model has two advantages: it enhances the precision of local boundary details by fusing frequency-domain boundary features with spatial-domain semantic information, and it preserves the target object’s integrity through the adaptive integration of boundary details and contextual semantics.
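To make the distinction between spatial-domain and frequency-domain boundary cues concrete, the following NumPy sketch extracts both from a toy mask. It is a generic illustration and not the DDBS module itself: the finite-difference boundary, the ideal high-pass filter, the `cutoff` value, and both function names are assumptions introduced here.

```python
import numpy as np


def spatial_boundary(mask: np.ndarray) -> np.ndarray:
    """Spatial cue: pixels where the finite-difference gradient of the mask is non-zero."""
    gy, gx = np.gradient(mask.astype(np.float64))
    return (np.hypot(gx, gy) > 0).astype(np.float64)


def high_frequency_component(image: np.ndarray, cutoff: float = 0.1) -> np.ndarray:
    """Frequency cue: keep only components above `cutoff` cycles per pixel."""
    img = image.astype(np.float64)
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies
    keep = np.sqrt(fx ** 2 + fy ** 2) > cutoff  # ideal high-pass mask
    spectrum = np.fft.fft2(img)
    # The filter is symmetric in frequency, so the result is real up to rounding.
    return np.real(np.fft.ifft2(spectrum * keep))


# Toy example: a square object on an empty background.
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0
edge_map = spatial_boundary(mask)
high_freq = high_frequency_component(mask)
```

The spatial cue is exactly zero away from the object outline, while the high-pass response concentrates around the same outline but also carries ringing, which is one intuition for why the two views can complement each other.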

5.3. Effectiveness of the PVTv2

To validate the effectiveness of the adopted PVTv2 backbone in our model, as well as to evaluate the sensitivity and generalizability of the proposed module across different backbone architectures, we conducted comparative experiments by replacing PVTv2 with representative convolutional neural network backbones, such as ResNet and Res2Net. To ensure a fair comparison, all models were trained using the same input resolution (512 × 512) and consistent training strategies and hyperparameter settings. As illustrated in Figure 10, the experimental results demonstrate that the model based on PVTv2 outperforms those based on ResNet and Res2Net in terms of detection accuracy. After integrating our proposed module, PVTv2 exhibited a notable improvement in detection accuracy compared to its original version. Similarly, the ResNet and Res2Net models also demonstrated enhanced performance over their respective B1 and B2 baselines when equipped with our module. These results clearly indicate that our proposed module exhibits strong adaptability to different backbone architectures and does not rely on any specific backbone. Furthermore, in terms of overall performance, Transformer-based backbones generally outperform convolution-based backbones in the camouflaged object detection task, primarily due to their superior ability to model global contextual information. It is worth noting that although ResNet and Res2Net are traditional CNN architectures, they still achieved substantial performance gains after incorporating our module, further validating its effectiveness and universality across diverse network structures.

Finally, as illustrated in Figure 11, we visualize the feature maps produced at each stage. The feature map generated by the MSDA module is shown in the fourth column. After the DDBS module is introduced, boundary perception is substantially enhanced, yielding more complete and clearer boundary information (Col. 5). The AGBG module then adaptively fuses the object information with the boundary information, retaining key contour and structural information while ensuring global semantic consistency, and thus produces a better segmentation result (Col. 6).
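The idea of gated fusion between object and boundary features can be sketched generically as follows. This is not the AGBG module as defined in the paper: the scalar gate weights, the bias, and the function names are placeholders standing in for learned parameters.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(obj_feat, bnd_feat, w_obj: float, w_bnd: float, bias: float = 0.0) -> np.ndarray:
    """Blend two feature maps with a per-element gate in (0, 1).

    A gate near 1 passes the boundary feature; a gate near 0 passes the object feature.
    """
    obj_feat = np.asarray(obj_feat, dtype=np.float64)
    bnd_feat = np.asarray(bnd_feat, dtype=np.float64)
    gate = sigmoid(w_obj * obj_feat + w_bnd * bnd_feat + bias)
    return gate * bnd_feat + (1.0 - gate) * obj_feat


# With zero weights and zero bias the gate is 0.5 everywhere: a plain average.
fused = gated_fusion([0.0, 1.0], [1.0, 0.0], w_obj=0.0, w_bnd=0.0)
```

Because the output is a convex combination of the two inputs at every position, it always lies between them, which is what lets a gate emphasize contours without overwriting the object response.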

6. Conclusions

In this paper, we propose a novel two-stage boundary-aware network with spatial-frequency domain supervision, termed SFNet, inspired by the exceptional visual capabilities of hawks. Specifically, we enhance high-frequency features through a multi-scale dynamic attention module to achieve the initial segmentation of camouflaged objects. Furthermore, we design a dual-domain boundary supervision mechanism that reinforces boundary features through the meticulous integration of spatial and frequency domain information. In addition, we introduce an adaptive gated boundary guidance module that controls the fusion of object and boundary features. These innovations enable SFNet to accurately capture subtle boundaries and delineate object contours, even in the presence of background noise and occlusion.

The experimental results demonstrate that SFNet achieves notable improvements in accuracy compared to state-of-the-art methods across multiple benchmark datasets while significantly reducing computational overhead and model complexity. This balance between high accuracy and efficiency endows SFNet with unique advantages in real-world applications. In environmental monitoring, SFNet can rapidly and accurately detect elusive wildlife in complex scenarios, such as camouflaged insects in dense forests or marine creatures blending into the seafloor, thereby supporting biodiversity research and conservation. In medical imaging, the network is capable of efficiently distinguishing subtle boundaries, allowing for more precise identification of tumors or lesions relative to surrounding tissues, ultimately improving early diagnosis and treatment planning.

In future work, we will focus on further optimizing SFNet to tackle more challenging scenarios, such as dynamic environments and real-time detection tasks, thereby broadening its applicability across critical domains.

Author Contributions

Conceptualization, P.W. and Y.Z.; methodology, P.W. and Y.Z.; software, Z.H.; validation, P.W., Y.Z. and Z.H.; formal analysis, P.W. and Y.Z.; investigation, P.W. and Z.H.; resources, P.W., Y.Z. and Z.H.; data curation, P.W.; writing—original draft preparation, P.W.; writing—review and editing, P.W. and Y.Z.; visualization, P.W.; supervision, Y.Z. and Z.H.; project administration, Y.Z.; funding acquisition, Y.Z. and Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant No. 62361024 and Grant No. 62161010), the Key Research and Development Project of Hainan Province (Grant No. ZDYF2024GXJS021 and Grant No. ZDYF2022GXJS348), and the Hainan Province Natural Science Foundation (623RC446).

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

To present the research framework more clearly, we systematically categorized and summarized the representative methods selected from related works and comparative approaches (see Table A1). “Biology” refers to biologically inspired methods; “Texture” covers learning approaches based on texture features; “Uncertainty” includes uncertainty-based modeling methods; “Depth” involves comprehensive methods that integrate depth information; “Prompt” denotes prompt-learning techniques; “Boundary” represents strategies employing boundary supervision; and “Frequency” indicates methods incorporating frequency domain information.

Table A1. Categories of different methods and detection accuracy. '↑' indicates higher is better; '↓' indicates lower is better; '‡' denotes results not publicly available.

Method	Category	CAMO				CHAMELEON				COD10K				NC4K
Method	Category	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$
SINet [1]	Biology	0.751	0.771	0.606	0.100	0.869	0.891	0.740	0.044	0.771	0.806	0.551	0.051	‡	‡	‡	‡
SegMaR [53]	Biology	0.815	0.874	0.753	0.071	0.906	0.951	0.860	0.025	0.833	0.899	0.724	0.034	‡	‡	‡	‡
ZoomNet [11]	Biology	0.820	0.877	0.752	0.066	0.902	0.943	0.845	0.023	0.838	0.888	0.729	0.029	0.853	0.896	0.784	0.043
MFFN [54]	Biology	‡	‡	‡	‡	‡	‡	‡	‡	0.851	0.897	0.752	0.028	0.858	0.902	0.793	0.043
SARNet [60]	Biology	0.868	0.927	0.828	0.047	0.912	0.957	0.871	0.021	0.864	0.931	0.777	0.024	0.886	0.937	0.842	0.032
MSCAF-Net [61]	Biology	0.873	0.929	0.828	0.046	0.912	0.958	0.865	0.022	0.865	0.927	0.775	0.024	0.887	0.934	0.838	0.032
TINet [12]	Texture	0.781	0.847	0.678	0.087	0.874	0.916	0.783	0.038	0.793	0.848	0.635	0.043	‡	‡	‡	‡
TANet [22]	Texture	0.830	0.884	0.763	0.066	0.903	0.963	0.862	0.023	0.823	0.884	0.763	0.066	‡	‡	‡	‡
UGTR [52]	Uncertainty	0.785	0.823	0.686	0.086	0.888	0.911	0.796	0.031	0.818	0.853	0.667	0.035	0.839	0.874	0.747	0.052
OCENet [24]	Uncertainty	0.802	0.852	0.723	0.080	0.897	0.940	0.833	0.027	0.827	0.894	0.707	0.033	0.853	0.902	0.785	0.045
RISNet [59]	Depth	0.870	0.922	0.827	0.050	‡	‡	‡	‡	0.873	0.931	0.799	0.025	0.882	0.925	0.834	0.037
VSCode [28]	Prompt	0.836	0.892	0.768	0.060	‡	‡	‡	‡	0.847	0.913	0.744	0.028	0.874	0.920	0.813	0.038
MGL [16]	Boundary	0.775	0.812	0.673	0.088	0.893	0.917	0.812	0.031	0.814	0.851	0.666	0.035	0.833	0.867	0.739	0.053
BGNet [18]	Boundary	0.812	0.870	0.749	0.073	0.901	0.943	0.850	0.027	0.831	0.901	0.722	0.033	0.851	0.907	0.788	0.044
BgNet [56]	Boundary	0.831	0.884	0.762	0.065	0.894	0.943	0.823	0.029	0.826	0.898	0.703	0.034	0.855	0.908	0.784	0.040
FEDER [33]	Boundary+Frequency	0.802	0.867	0.738	0.071	0.887	0.946	0.834	0.030	0.822	0.900	0.716	0.032	‡	‡	‡	‡
FPNet [58]	Boundary+Frequency	0.851	0.905	0.802	0.056	0.914	0.960	0.868	0.022	0.850	0.912	0.755	0.028	0.889	0.934	0.851	0.032

References

1. Fan, D.P.; Ji, G.P.; Sun, G.; Cheng, M.M.; Shen, J.; Shao, L. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2777–2787.
2. Fan, D.P.; Ji, G.P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Pranet: Parallel reverse attention network for polyp segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Lima, Peru, 4–8 October 2020; pp. 263–273.
3. Li, J.; He, W.; Li, Z.; Guo, Y.; Zhang, H. Overcoming the uncertainty challenges in detecting building changes from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2025, 220, 1–17.
4. Liu, M.; Di, X. Extraordinary MHNet: Military high-level camouflage object detection network and dataset. Neurocomputing 2023, 549, 126466.
5. Li, J.; Wei, Y.; Wei, T.; He, W. A Comprehensive Deep-Learning Framework for Fine-Grained Farmland Mapping from High-Resolution Images. IEEE Trans. Geosci. Remote Sens. 2024, 63.
6. Chu, H.K.; Hsu, W.H.; Mitra, N.J.; Cohen-Or, D.; Wong, T.T.; Lee, T.Y. Camouflage images. ACM Trans. Graph. 2010, 29, 51–61.
7. Galun; Sharon; Basri; Brandt. Texture segmentation by multiscale aggregation of filter responses and shape elements. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; pp. 716–723.
8. Pulla Rao, C.; Guruva Reddy, A.; Rama Rao, C. Camouflaged object detection for machine vision applications. Int. J. Speech Technol. 2020, 23, 327–335.
9. Tankus, A.; Yeshurun, Y. Detection of regions of interest and camouflage breaking by direct convexity estimation. In Proceedings of the 1998 IEEE Workshop on Visual Surveillance, Bombay, India, 2 January 1998; pp. 42–48.
10. Beiderman, Y.; Teicher, M.; Garcia, J.; Mico, V.; Zalevsky, Z. Optical technique for classification, recognition and identification of obscured objects. Opt. Commun. 2010, 283, 4274–4282.
11. Pang, Y.; Zhao, X.; Xiang, T.Z.; Zhang, L.; Lu, H. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2160–2170.
12. Zhu, J.; Zhang, X.; Zhang, S.; Liu, J. Inferring camouflaged objects by texture-aware interactive guidance network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 3599–3607.
13. Kajiura, N.; Liu, H.; Satoh, S. Improving camouflaged object detection with the uncertainty of pseudo-edge labels. In Proceedings of the 3rd ACM International Conference on Multimedia in Asia, Gold Coast, Australia, 1–3 December 2021; pp. 1–7.
14. Wang, Q.; Yang, J.; Yu, X.; Wang, F.; Chen, P.; Zheng, F. Depth-aided camouflaged object detection. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3297–3306.
15. Tang, L.; Jiang, P.T.; Shen, Z.H.; Zhang, H.; Chen, J.W.; Li, B. Chain of visual perception: Harnessing multimodal large language models for zero-shot camouflaged object detection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 8805–8814.
16. Zhai, Q.; Li, X.; Yang, F.; Jiao, Z.; Luo, P.; Cheng, H.; Liu, Z. MGL: Mutual graph learning for camouflaged object detection. IEEE Trans. Image Process. 2022, 32, 1897–1910.
17. Zhong, Y.; Li, B.; Tang, L.; Kuang, S.; Wu, S.; Ding, S. Detecting camouflaged object in frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4504–4513.
18. Sun, Y.; Wang, S.; Chen, C.; Xiang, T.Z. Boundary-Guided Camouflaged Object Detection. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; pp. 1335–1341.
19. Zhou, T.; Zhou, Y.; Gong, C.; Yang, J.; Zhang, Y. Feature aggregation and propagation network for camouflaged object detection. IEEE Trans. Image Process. 2022, 31, 7036–7047.
20. Fan, D.P.; Ji, G.P.; Cheng, M.M.; Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6024–6042.
21. He, C.; Li, K.; Zhang, Y.; Zhang, Y.; You, C.; Guo, Z.; Li, X.; Danelljan, M.; Yu, F. Strategic Preys Make Acute Predators: Enhancing Camouflaged Object Detectors by Generating Camouflaged Objects. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
22. Ren, J.; Hu, X.; Zhu, L.; Xu, X.; Xu, Y.; Wang, W.; Deng, Z.; Heng, P.A. Deep texture-aware features for camouflaged object detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 33, 1157–1167.
23. Ji, G.P.; Fan, D.P.; Chou, Y.C.; Dai, D.; Liniger, A.; Van Gool, L. Deep gradient learning for efficient camouflaged object detection. Mach. Intell. Res. 2023, 20, 92–108.
24. Liu, J.; Zhang, J.; Barnes, N. Modeling aleatoric uncertainty for camouflaged object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1445–1454.
25. Zhang, Y.; Zhang, J.; Hamidouche, W.; Deforges, O. Predictive uncertainty estimation for camouflaged object detection. IEEE Trans. Image Process. 2023, 32, 3580–3591.
26. Wu, Z.; Wang, J.; Zhou, Z.; An, Z.; Jiang, Q.; Demonceaux, C.; Sun, G.; Timofte, R. Object segmentation by mining cross-modal semantics. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3455–3464.
27. Hu, J.; Lin, J.; Gong, S.; Cai, W. Relax image-specific prompt requirement in sam: A single generic prompt for segmenting camouflaged objects. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 12511–12518.
28. Luo, Z.; Liu, N.; Zhao, W.; Yang, X.; Zhang, D.; Fan, D.P.; Khan, F.; Han, J. Vscode: General visual salient and camouflaged object detection with 2d prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17169–17180.
29. Wang, L.; Lu, H.; Wang, Y.; Feng, M.; Wang, D.; Yin, B.; Ruan, X. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 136–145.
30. Li, P.; Yan, X.; Zhu, H.; Wei, M.; Zhang, X.P.; Qin, J. Findnet: Can you find me? boundary-and-texture enhancement network for camouflaged object detection. IEEE Trans. Image Process. 2022, 31, 6396–6411.
31. Guan, J.; Fang, X.; Zhu, T.; Qian, W. SDRNet: Camouflaged object detection with independent reconstruction of structure and detail. Knowl.-Based Syst. 2024, 299, 112051.
32. Zhao, Z.; Liu, Z.; Peng, C. AGFNet: Attention guided fusion network for camouflaged object detection. In Proceedings of the CAAI International Conference on Artificial Intelligence, Beijing, China, 27–28 August 2022; Springer: Berlin/Heidelberg, Germany; pp. 478–489.
33. He, C.; Li, K.; Zhang, Y.; Tang, L.; Zhang, Y.; Guo, Z.; Li, X. Camouflaged object detection with feature decomposition and edge reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22046–22055.
34. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424.
35. Liu, S.; Huang, D.; Wang, Y. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400.
36. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9413–9422.
37. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
38. Nam, J.H.; Syazwany, N.S.; Kim, S.J.; Lee, S.C. Modality-agnostic domain generalizable medical image segmentation by multi-frequency in multi-scale attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 11480–11491.
39. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915.
40. Zhao, R.; Li, Y.; Zhang, Q.; Zhao, X. Bilateral decoupling complementarity learning network for camouflaged object detection. Knowl.-Based Syst. 2025, 314, 113158.
41. Wei, J.; Wang, S.; Huang, Q. F³Net: Fusion, feedback and focus for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12321–12328.
42. Le, T.N.; Nguyen, T.V.; Nie, Z.; Tran, M.T.; Sugimoto, A. Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 2019, 184, 45–56.
43. Skurowski, P.; Abdulameer, H.; Błaszczyk, J.; Depta, T.; Kornacki, A.; Kozieł, P. Animal camouflage analysis: Chameleon database. Unpubl. Manuscr. 2018, 2, 7.
44. Lv, Y.; Zhang, J.; Dai, Y.; Li, A.; Liu, B.; Barnes, N.; Fan, D.P. Simultaneously localize, segment and rank the camouflaged objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11591–11601.
45. Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4548–4557.
46. Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-alignment Measure for Binary Foreground Map Evaluation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, Stockholm, Sweden, 13–19 July 2018; pp. 698–704.
47. Margolin, R.; Zelnik-Manor, L.; Tal, A. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 248–255.
48. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 733–740.
49. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
50. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
51. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662.
52. Yang, F.; Zhai, Q.; Li, X.; Huang, R.; Luo, A.; Cheng, H.; Fan, D.P. Uncertainty-guided transformer reasoning for camouflaged object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4146–4155.
53. Jia, Q.; Yao, S.; Liu, Y.; Fan, X.; Liu, R.; Luo, Z. Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4713–4722.
54. Zheng, D.; Zheng, X.; Yang, L.T.; Gao, Y.; Zhu, C.; Ruan, Y. Mffn: Multi-view feature fusion network for camouflaged object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6232–6242.
55. Chen, G.; Liu, S.J.; Sun, Y.J.; Ji, G.P.; Wu, Y.F.; Zhou, T. Camouflaged object detection via context-aware cross-level fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6981–6993.
56. Chen, T.; Xiao, J.; Hu, X.; Zhang, G.; Wang, S. Boundary-guided network for camouflaged object detection. Knowl.-Based Syst. 2022, 248, 108901.
57. Huang, Z.; Dai, H.; Xiang, T.Z.; Wang, S.; Chen, H.X.; Qin, J.; Xiong, H. Feature shrinkage pyramid for camouflaged object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5557–5566.
58. Cong, R.; Sun, M.; Zhang, S.; Zhou, X.; Zhang, W.; Zhao, Y. Frequency perception network for camouflaged object detection. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1179–1189.
59. Wang, L.; Yang, J.; Zhang, Y.; Wang, F.; Zheng, F. Depth-aware concealed crop detection in dense agricultural scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17201–17211.
60. Xing, H.; Gao, S.; Wang, Y.; Wei, X.; Tang, H.; Zhang, W. Go closer to see better: Camouflaged object detection via object area amplification and figure-ground conversion. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 5444–5457.
61. Liu, Y.; Li, H.; Cheng, J.; Chen, X. MSCAF-Net: A general framework for camouflaged object detection via learning multi-scale context-aware features. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4934–4947.

Figure 1. Visual comparison of our camouflaged object detection method with two other boundary supervision methods (BGNet [18] and FAPNet [19]) in a range of challenging scenarios.

Figure 2. The overall architecture of the proposed SFNet. MSDA, DDBS, and AGBG are short for the multi-scale dynamic attention module, the dual-domain boundary supervision module, and the adaptive gated boundary guided module, respectively. RFB and FFB represent the Receptive Field Block and the Feature Fusion Block.

Figure 3. Details of the proposed MSDA. We leverage multi-scale information to facilitate the recognition of object size and position and introduce a dynamic attention mechanism to selectively highlight target regions, thereby enhancing the discriminative feature representation.

Figure 4. Details of the proposed DDBS and AGBG. The DDBS comprises a frequency branch and a spatial branch, which collaboratively optimize boundary features across two dimensions. The AGBG dynamically combines the object features with boundary features to generate more refined segmentation results.

Figure 5. Visual comparison with eight state-of-the-art methods on four datasets: CAMO (rows 1–2), CHAMELEON (rows 3–4), COD10K (rows 5–6), and NC4K (rows 7–8). The examples cover different scenarios, such as human bodies, small objects, large objects, objects with complex boundaries, multiple objects, and occluded objects.

Figure 6. Real-world challenges and detection results of SFNet.

Figure 7. Comparison of our method with other state-of-the-art methods on COD10K-Test in terms of performance ($F_{β}^{w}$), parameters, and FLOPs.

Figure 8. Parallel histogram comparison of performance ($F_{β}^{w}$), parameters (M), FLOPs (G), and frame rate (FPS).

Figure 9. Qualitative results of the ablation study. “B” denotes the combination of PVTv2, RFB, and FFB.

Figure 10. Quantitative analysis of the effectiveness of PVTv2. B1, B2, and B3 denote ResNet+RFB+FFB, Res2Net+RFB+FFB, and PVTv2+RFB+FFB, respectively. The abbreviations M, D, and A stand for the multi-scale dynamic attention module, the dual-domain boundary supervision module, and the adaptive gated boundary guided module, respectively.

Figure 11. Visualization of feature maps. $f_{r}$, $f_{e}$, and $f_{out}$ are the output feature maps of the MSDA, DDBS, and AGBG modules, respectively. Red squares mark regions where boundary features are enhanced.

Table 1. Quantitative comparison of SFNet with other state-of-the-art (SOTA) methods on four COD benchmark datasets. The best and second-best results are highlighted in red and blue, respectively. '↑' indicates higher is better; '↓' indicates lower is better; '‡' denotes results not publicly available.

Method	Backbone	CAMO				CHAMELEON				COD10K				NC4K
Method	Backbone	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$
SINet [1]	ResNet50	0.751	0.771	0.606	0.100	0.869	0.891	0.740	0.044	0.771	0.806	0.551	0.051	‡	‡	‡	‡
MGL [16]	ResNet50	0.775	0.812	0.673	0.088	0.893	0.917	0.812	0.031	0.814	0.851	0.666	0.035	0.833	0.867	0.739	0.053
UGTR [52]	ResNet50	0.785	0.823	0.686	0.086	0.888	0.911	0.796	0.031	0.818	0.853	0.667	0.035	0.839	0.874	0.747	0.052
OCENet [24]	ResNet50	0.802	0.852	0.723	0.080	0.897	0.940	0.833	0.027	0.827	0.894	0.707	0.033	0.853	0.902	0.785	0.045
SegMaR [53]	ResNet50	0.815	0.874	0.753	0.071	0.906	0.951	0.860	0.025	0.833	0.899	0.724	0.034	‡	‡	‡	‡
ZoomNet [11]	ResNet50	0.820	0.877	0.752	0.066	0.902	0.943	0.845	0.023	0.838	0.888	0.729	0.029	0.853	0.896	0.784	0.043
FEDER [33]	ResNet50	0.802	0.867	0.738	0.071	0.887	0.946	0.834	0.030	0.822	0.900	0.716	0.032	‡	‡	‡	‡
MFFN [54]	ResNet50	‡	‡	‡	‡	‡	‡	‡	‡	0.851	0.897	0.752	0.028	0.858	0.902	0.793	0.043
C²FNet [55]	Res2Net50	0.799	0.859	0.730	0.077	0.893	0.946	0.845	0.028	0.811	0.887	0.691	0.036	‡	‡	‡	‡
BGNet [18]	Res2Net50	0.812	0.870	0.749	0.073	0.901	0.943	0.850	0.027	0.831	0.901	0.722	0.033	0.851	0.907	0.788	0.044
FAPNet [19]	Res2Net50	0.815	0.865	0.734	0.076	0.893	0.940	0.825	0.028	0.822	0.888	0.694	0.036	0.851	0.899	0.775	0.047
BgNet [56]	Res2Net50	0.831	0.884	0.762	0.065	0.894	0.943	0.823	0.029	0.826	0.898	0.703	0.034	0.855	0.908	0.784	0.040
FSPNet [57]	ViT	0.856	0.899	0.799	0.050	0.908	0.943	0.851	0.023	0.851	0.895	0.735	0.026	0.879	0.915	0.816	0.035
VSCode [28]	Swin	0.836	0.892	0.768	0.060	‡	‡	‡	‡	0.847	0.913	0.744	0.028	0.874	0.920	0.813	0.038
FPNet [58]	PVT	0.851	0.905	0.802	0.056	0.914	0.960	0.868	0.022	0.850	0.912	0.755	0.028	0.889	0.934	0.851	0.032
RISNet [59]	PVT	0.870	0.922	0.827	0.050	‡	‡	‡	‡	0.873	0.931	0.799	0.025	0.882	0.925	0.834	0.037
SARNet [60]	PVTv2	0.868	0.927	0.828	0.047	0.912	0.957	0.871	0.021	0.864	0.931	0.777	0.024	0.886	0.937	0.842	0.032
MSCAF-Net [61]	PVTv2	0.873	0.929	0.828	0.046	0.912	0.958	0.865	0.022	0.865	0.927	0.775	0.024	0.887	0.934	0.838	0.032
SFNet (Ours)	PVTv2	0.877	0.940	0.858	0.042	0.917	0.968	0.897	0.018	0.877	0.945	0.821	0.020	0.889	0.944	0.862	0.029
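
The M column in Table 1 is a lower-is-better error score. In the COD literature it conventionally denotes the mean absolute error (MAE) between a predicted map and the ground-truth mask. The following is a minimal sketch under that assumption, with both maps given as same-sized 2D lists of values in [0, 1]. The function name `mean_absolute_error` and the toy inputs are illustrative only; this is not the evaluation code used in the paper.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[Sequence[float]],
                        mask: Sequence[Sequence[float]]) -> float:
    """Average per-pixel absolute difference between two same-sized maps."""
    if len(pred) != len(mask) or any(len(p) != len(m) for p, m in zip(pred, mask)):
        raise ValueError("prediction and mask must have the same shape")
    total = 0.0
    count = 0
    for pred_row, mask_row in zip(pred, mask):
        for p, m in zip(pred_row, mask_row):
            total += abs(p - m)
            count += 1
    if count == 0:
        raise ValueError("maps must contain at least one pixel")
    return total / count


# Toy 2x2 example (made-up values, not taken from any dataset).
prediction = [[0.9, 0.2], [0.1, 0.8]]
ground_truth = [[1.0, 0.0], [0.0, 1.0]]
score = mean_absolute_error(prediction, ground_truth)
print(f"M = {score:.3f}")
```

A perfect prediction gives a score of 0, which is why smaller values in the M column indicate better results.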

Table 2. Quantitative results of the ablation experiments. '↑' indicates higher is better; '↓' indicates lower is better. Red indicates the best result.

Method	CAMO				CHAMELEON				COD10K				NC4K
Method	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$	$S_{α} ↑$	$E_{ϕ} ↑$	$F_{β}^{w} ↑$	$M ↓$
Baseline	0.860	0.928	0.830	0.047	0.879	0.942	0.836	0.027	0.854	0.930	0.781	0.024	0.877	0.939	0.843	0.032
Baseline+MSDA	0.870	0.939	0.847	0.043	0.897	0.957	0.867	0.024	0.868	0.941	0.805	0.021	0.886	0.945	0.859	0.029
Baseline+DDBS+AGBG	0.867	0.939	0.843	0.043	0.892	0.954	0.854	0.023	0.865	0.943	0.802	0.021	0.885	0.946	0.856	0.029
Baseline+MSDA+DDBS+AGBG	0.877	0.940	0.858	0.042	0.917	0.968	0.897	0.018	0.877	0.945	0.821	0.020	0.889	0.944	0.862	0.029
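
As a reading aid for Table 2, the short Python sketch below computes the relative change of the full model over the baseline on COD10K, using only the values copied from the first and last rows of the table. The helper name `relative_change` and the dictionary keys are illustrative choices, not part of the paper.

```python
# Values copied from Table 2 (COD10K columns): S_alpha, E_phi, F_beta^w, M.
baseline = {"S_alpha": 0.854, "E_phi": 0.930, "F_w": 0.781, "M": 0.024}
full = {"S_alpha": 0.877, "E_phi": 0.945, "F_w": 0.821, "M": 0.020}


def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Positive values are improvements for S_alpha, E_phi, and F_w (higher is better);
# a negative value is an improvement for M (lower is better).
changes = {name: relative_change(baseline[name], full[name]) for name in baseline}

for name, pct in changes.items():
    print(f"{name}: {baseline[name]:.3f} -> {full[name]:.3f} ({pct:+.1f}%)")
```

With these numbers, the weighted F-measure on COD10K rises by about 5.1% relative to the baseline, and M falls by about 16.7%.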

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, P.; Zhao, Y.; Hu, Z. Boundary-Aware Camouflaged Object Detection via Spatial-Frequency Domain Supervision. Electronics 2025, 14, 2541. https://doi.org/10.3390/electronics14132541
