Boundary-Guided Differential Attention: Enhancing Camouflaged Object Detection Accuracy

Hongliang Zhang; Bolin Xu; Sanxin Jiang

doi:10.3390/jimaging11110412

,

and

College of Electronics and Information Engineering, Shanghai University of Electric Power, Shanghai 201306, China

^*

Author to whom correspondence should be addressed.

J. Imaging2025, 11(11), 412;https://doi.org/10.3390/jimaging11110412

This article belongs to the Section Computer Vision and Pattern Recognition

Version Notes

Order Reprints

Abstract

Camouflaged Object Detection (COD) is a challenging computer vision task aimed at accurately identifying and segmenting objects seamlessly blended into their backgrounds. This task has broad applications across medical image segmentation, defect detection, agricultural image detection, security monitoring, and scientific research. Traditional COD methods often struggle with precise segmentation due to the high similarity between camouflaged objects and their surroundings. In this study, we introduce a Boundary-Guided Differential Attention Network (BDA-Net) to address these challenges. BDA-Net first extracts boundary features by fusing multi-scale image features and applying channel attention. Subsequently, it employs a differential attention mechanism, guided by these boundary features, to highlight camouflaged objects and suppress background information. The weighted features are then progressively fused to generate accurate camouflage object masks. Experimental results on the COD10K, NC4K, and CAMO datasets demonstrate that BDA-Net outperforms most state-of-the-art COD methods, achieving higher accuracy. Here we show that our approach improves detection accuracy by up to 3.6% on key metrics, offering a robust solution for precise camouflaged object segmentation.

Keywords:

camouflaged object detection; attention mechanism; boundary detection; deep neural network

1. Introduction

Camouflaged Object Detection (COD) is an emerging computer vision task aimed at accurately recognizing and segmenting camouflaged targets that are seamlessly hidden in their backgrounds []. COD has wide applications in various detection tasks across multiple domains, including medical image segmentation, defect detection, agricultural image detection (e.g., locust detection), security monitoring (e.g., obstacle detection), and scientific research (e.g., biological studies). Therefore, the first COD method based on deep neural networks [] gained widespread attention immediately after its introduction.

Early COD methods [], inspired by the hunting process of predators, typically included a search module for the preliminary localization of camouflaged objects, followed by a recognition module for accurate segmentation. Additionally, some approaches [,] drew inspiration from bionics, simulating the hunting and observation behaviors of animals, making COD simpler and more efficient. However, these methods remained somewhat rudimentary and struggled to segment camouflaged objects with precision.

To further improve performance, an increasing number of COD methods [,,,] developed in recent years have begun incorporating guidance information. These approaches either extract texture features from the input image or estimate boundary features of the camouflaged object, using them to guide the detection. Accurate boundary priors not only aid in locating camouflaged objects but also help mitigate boundary blurriness during object segmentation. However, existing methods often rely solely on low-level features from the input image for boundary extraction, overlooking the high-level features. Additionally, when it comes to object localization and segmentation, these methods tend to focus heavily on global information. Yet, due to the high similarity between camouflaged objects and their background, this global information often fails to effectively highlight the differences between the objects and their surroundings.

Currently, to better differentiate camouflaged objects from their backgrounds, numerous COD methods have introduced attention mechanisms. These attention mechanisms either focus on extracting boundary knowledge or enhancing the prominence of camouflaged objects. For example, the edge attention network [] and edge-assisted position aware attention network [] are designed to extract informative boundary features. Meanwhile, others, such as the overlapping window cross-layer attention mechanism [,], use high-level features to guide the enhancement of lower-level features, and the dual attention mechanism [] captures the scale diversity of camouflaged objects. Methods like the parallel attention selection mechanism [] and Multiple Attention Mechanisms (PAM) [] use multiple attention mechanisms to emphasize the separation between background and camouflaged objects.

Inspired by how the human visual system achieves functional specialization and efficiency gains by responding to different frequency stimuli through distinct neural pathways, Liang et al. proposed the Efficient Frequency Injection Module (FIM). This module enhances the representational capacity of lightweight backbone networks by injecting fine-grained high-frequency features and object-level low-frequency features at different stages []. Liu et al. [] propose a depth-aware attention fusion network that incorporates depth maps as auxiliary inputs to enhance the network’s perception of three-dimensional information. Concurrently, a ternary branch encoder is employed to extract color and depth information along with their interactive relationships. Guan et al. [] proposed a dual-branch strategy to reconstruct structural and detailed features separately, addressing the disparity in reconstruction requirements between structure and detail. This approach aims to identify camouflaged objects and their edges. Zhang et al. [] proposed collaborative cross-scale feature learning network(CCNet), which efficiently detects collaborative camouflaged targets by leveraging synergistic information between single images and camouflage image groups. Fang et al. [] developed a method that uses learnable wavelets to extract high-frequency edge details, refines them by aggregating contextual features and sensing inter-branch differences, and employs a scene enhancement module with reverse attention to recover structural information from occluded areas. These specially designed attention mechanisms have significantly improved the accuracy of camouflaged object detection. However, few attention mechanisms are directly built to specifically focus on the distinction between camouflaged objects and their surroundings.

2. Motivation and Contributions

To address the dual challenges of accurately localizing camouflaged objects within complex backgrounds and suppressing distracting contextual information, this study introduces a Boundary-Guided Differential Attention Mechanism. This mechanism leverages the disparity between Global Average Pooling (GAP) and Global Max Pooling (GMP) to enhance subtle discriminative cues, thereby improving the model’s ability to distinguish camouflaged regions from the background.

Inspired by the hierarchical nature of human visual perception—where coarse boundary localization precedes fine-grained detail analysis—we propose the Boundary-Guided Differential Attention Network (BDA-Net) for camouflaged object detection.The visual detection results are shown in Figure 1. When boundary cues are available, human vision tends to disregard irrelevant background details and focus on object-specific features. Following this principle, BDA-Net first extracts boundary priors from multi-scale image features and then utilizes boundary-guided differential attention, derived from the GAP–GMP disparity, to suppress background responses and enhance object representations. The refined features are subsequently fused to produce accurate and detail-preserving segmentation of camouflaged targets.The main contributions of this work are as follows:

Figure 1. Detection results of camouflaged objects using BDA-Net. (a) Input images containing camouflaged objects; (b) Ground truth binary masks of the camouflaged objects; (c) Prediction results of the proposed BDA-Net.

Inspired by how the human eye perceives camouflaged objects, we propose a differential attention mechanism, which leverages the difference between GAP and GMP under the guidance of boundary knowledge to highlight camouflaged objects.
We introduce a method for boundary prior extraction, which first fuses multi-scale features of the input image and then applies channel attention mechanism to emphasize boundary features.
Building on these insights, we developed BDA-Net for COD. Guided by boundary priors, the network applies differential attention to the multi-scale features of the input image, achieving superior performance.

3. Method

3.1. Overall Architecture

The overall architecture of BDA-Net is illustrated in Figure 2. Broadly, BDA-Net consists of four main components: the PVTv2 backbone network for feature extraction, the Boundary Detection Module (BDM) for extracting boundary priors, the Differential Attention Module (DAM) for highlighting camouflaged objects, and the Context Aggregation Module (CAM) for predicting and segmenting the camouflaged objects.

Figure 2. Overall architecture of BDA-Net. First, the input image, with a fixed resolution of

416 \times 416 \times 3

, is processed by a Transformer-based backbone network to extract features at four different scales. These features are sent to the BDM module to extract boundary priors, which guide the differential attention mechanism. Simultaneously, the features are fed into four DAM modules, where the boundary priors help highlight the camouflaged object’s features. Finally, the CAM module fuses the enhanced features to produce the camouflage object mask.

In BDA-Net, the input image, with a fixed resolution of

416 \times 416 \times 3

, is first processed by PVTv2, resulting in four multi-scale feature maps, denoted as

f_{i} (i = 1, 2, 3, 4)

. These features are processed in two ways: first, they are fed into the BDM to extract the boundary prior

f_{e}

, which serves as guidance for the differential attention mechanism; second, they are passed through the four DAM modules, where the boundary prior

f_{e}

is used to highlight the camouflaged objects, producing refined feature maps

f_{i}^{'} (i = 1, 2, 3, 4)

. Finally, the three CAM modules progressively fuse the enhanced features to generate camouflaged object masks at three different scales, denoted as

m_{i} (i = 1, 2, 3)

. Among them,

m_{1}

, which integrates all contextual information, has the highest precision and the largest size, and is chosen as the final output.

3.2. Boundary Detection Module

The architecture of BDM is shown in Figure 3. In the BDM, multi-scale features are first concatenated into a single-scale feature. Since the features

f_{1}, f_{2}, \dots, f_{4}

correspond to four layers of the PVTv2 and have different dimensions, specifically

104 \times 104

,

52 \times 52

, ⋯, and

13 \times 13

, it’s necessary to upsample the smaller features

f_{2}, f_{3}

, and

f_{4}

to match the resolution of

f_{1}

. This ensures seamless concatenation, resulting in a unified feature map

f_{u}

with a size of

104 \times 104

. A

3 \times 3

convolution is then applied to this unified feature map, producing the fused feature

f_{u}

. It is important to note that a

1 \times 1

convolution is applied to each feature before concatenation, primarily to adjust the number of channels. This process can be expressed by the following equation:

f_{u} = {C o n v}_{3 \times 3} (f_{1} + U (f_{2}) + U (f_{3}) + U (f_{4})),

(1)

where

U (\cdot)

represents the upsampling function. Since higher-level features typically capture the semantic information of the camouflaged object, while lower-level features focus more on its detailed structures, the simultaneous use of multi-scale features allows the BDM to be highly adaptive.

Figure 3. Architecture of the BDM module. This module first upsamples the smaller-scale features

f_{2}

,

f_{3}

, and

f_{4}

to match the size of feature

f_{1}

. These features are then concatenated to form a single-scale feature map. Finally, a channel attention is applied to weight the feature map, highlighting the boundaries of the camouflaged object.

To enhance boundary features, we implemented a simplified version of the Efficient Channel Attention (ECA) [] developed by Wang et al. Specifically, in ECA, the input features are duplicated into two copies. One copy is used to generate weights, which are then multiplied element-wise with the other copy. The resulting output is mapped to the range (0,1) using a Sigmoid function

σ

, forming the boundary prior.

To obtain the weights and extract boundary information, we apply GAP followed by a ReLU activation to the feature

f_{u}

. This not only reduces model parameters but also introduces non-linearity, enhancing the robustness of the BDM. Finally, the weights are constrained within the range of (0,1) using a Sigmoid function. The process can be expressed by the following equation:

w_{e} = σ ({C o n v}_{1 \times 1} (ReLU ({C o n v}_{1 \times 1} (f_{u}^{GAP})))),

(2)

where

w_{e}

and

f_{u}^{GAP}

represents the element weight and the features obtained by applying GAP to

f_{u}

, respectively.

3.3. Differential Attention Module

The architecture of DAM is shown in Figure 4. This module first integrates the boundary prior generated by the BDM with the features extracted by the backbone network. It then weights the fused features to highlight the camouflaged object using a differential attention mechinism. Accordingly, the workflow of DAM can be divided into two stages: boundary prior fusion and differential attention.

Figure 4. Architecture of the DAM module. In this module, the feature

f_{i}

is first fused with the boundary prior

f_{e}

. The resulting features are then weighted using a differential attention mechanism, producing the enhanced feature

f_{i}^{'}

that highlights the camouflaged object.

3.3.1. Boundary Prior Fusion

To achieve the fusion, we multiply the boundary prior

f_{e}

with the feature

f_{i}

pixel by pixel, then add the result to the feature

f_{i}

pixel by pixel. Finally, we apply a

3 \times 3

convolution to the summed features, resulting in a preliminary boundary-guided feature

f_{i}^{e}

. It is important to note that

f_{i}

, extracted by the backbone network, can have four different possible sizes. Therefore, the boundary prior may need to be resized through downsampling before pixel-wise multiplication to ensure compatibility. As shown in the Figure 5, after applying the boundary prior fusion operation, the detail information in the feature visualization image is significantly improved. The entire process can be described by the following formula:

f_{i}^{e} = {C o n v}_{3 \times 3} ((f_{i} \otimes D (f_{e})) \oplus f_{i}), i = (1, 2, 3, 4)

(3)

where ⊕ denotes element-wise addition and

D (\cdot)

represents downsampling.

Figure 5. Comparison of visualization results from Boundary Prior Fusion.

3.3.2. Differential Attention Mechanism

To design the differential attention mechanism, we primarily leverage GAP and GMP. Guided by boundary priors, background features are suppressed by exploiting the difference between GAP and GMP.As illustrated in Figure 6, the absolute difference |GMP-GAP| demonstrates superior detection performance. Similar to AP and MP, GAP computes the average of all pixels in each channel, which tends to capture holistic features. In contrast, GMP selects the maximum pixel value from each channel, making it more suitable for highlighting the most prominent parts of the feature map. The difference between the two can suppress background information while highlighting the camouflaged object. Based on this observation, the feature map

f_{e}

is processed in three parallel branches. Two branches apply GAP and GMP, respectively, and then subtract the results. The resulting difference is normalized and mapped to the range of (0,1) using the Sigmoid function. This output is used to weight the third branch, thereby enhancing the camouflaged target. The process can be described by the following formula:

f_{i}^{'} = f_{i}^{e} \otimes σ (N o r m (a b s (f_{i}^{GAP} ⊖ f_{i}^{GMP}) \oplus f_{i}^{GAP})) .

(4)

Here, i represents the index of the feature layer, while

a b s (\cdot)

and

N o r m (\cdot)

denote the absolute value and normalization functions, respectively. Additionally,

f_{i}^{GAP}

and

f_{i}^{GMP}

represent the features obtained by applying GAP and GMP to

f_{e}

, followed by a

1 \times 1

convolution.

Figure 6. Comparison of the resulting images after applying GAP, GMP, and

| GAP - GMP |

on the source image.

3.4. Context Aggregation Module

The CAM module is designed to aggregate contextual information from each feature layer and generate the corresponding camouflage object mask. In this process, the CAM takes two inputs: the current layer’s features enhanced by the DAM and the mask output from the previous layer’s CAM. It is important to note that as the feature layer increases, the semantic information becomes richer while the detail information decreases, making mask prediction more challenging. Therefore, we generate masks only for the lower three layers. For the highest layer,

f_{4}^{'}

, it is combined with

f_{3}^{'}

to generate the mask for layer 3. The implementation of CAM follows the method described in [].

3.5. Loss Function

Our method produces four prediction results: three camouflaged object masks and one object edge. For each mask, we use a weighted binary cross-entropy loss

L_{B C E}^{ω}

and a weighted IOU loss

L_{I O U}^{ω}

[] together to more accurately capture the key pixels in the image. For target edge learning, once the object’s spatial localization is obtained through multi-level feature fusion and refinement via the BDM, the edge mask is optimized using the Dice loss function. The Dice loss indirectly enhances boundary precision by maximizing the Dice similarity coefficient between the predicted mask and the ground truth. As a region-level metric, it is particularly sensitive to pixel discrepancies along object boundaries. During backpropagation, misclassifications at edge points produce large gradient updates, strongly guiding parameter optimization and enabling the model to generate segmentation results with sharper and more accurate contours. Consequently, the total loss function proposed by our method, denoted as

L_{Total}

, can be formulated as:

\begin{matrix} L_{Total} = \sum_{i = 1}^{3} (L_{BCE}^{ω} (m_{i}, G_{m}) + L_{IOU}^{ω} (m_{i}, G_{m})) \\ + λ L_{Dice} (f_{e}, G_{e}) . \end{matrix}

(5)

Here,

G_{m}

and

G_{e}

represent the Ground Truth (GT) for the camouflaged object’s mask and edges, respectively. Correspondingly,

m_{i}

and

f_{e}

denote the predicted results for the camouflaged object’s mask and edges produced by the proposed method, where

i = (1, 2, 3)

indicates the feature layer index. The hyperparameter

λ

is set to 3.

4. Experiments

In this section, we present the details of our implementation and comprehensively compare BDA-Net with the latest COD methods on three publicly available datasets using common evaluation metrics. Additionally, we conduct ablation studies to validate the effectiveness of the key modules in our proposed method.

4.1. Implementation Details

In BDA-Net, we used PVTv2 pre-trained on ImageNet-1k as the backbone. During the training phase, we resized the input images to

416 \times 416 \times 3

and employed the Adam optimizer with a batch size of 12, setting the number of iterations to 50. Additionally, the learning rate was initially set to

1 \times 10^{- 4}

and decayed according to a poly learning strategy with an exponent of 0.9. Our model was trained on an NVIDIA RTX 3090 GPU (with 24 GB memory).

4.2. Datasets

We trained and evaluated our models on three public benchmark datasets: CAMO [], COD10K [] and NC4K []. CAMO contains 1250 camouflaged images across 8 categories, with 1000 images used for training and 250 for testing. COD10K is the largest COD dataset, comprising 5066 images, with 3040 allocated for training and 2026 for testing, spanning 5 main categories and 69 subcategories. NC4K, consisting of 4121 images collected from the Internet, is the largest COD test set available to date.

4.3. Evaluation Metrics

We employ four widely used evaluation metrics to assess the performance of our model: Structure-measure (

S_{α}

) [], Mean Absolute Error (MAE) [], weighted F-measure (

F_{β}^{ω}

) [], and average E-measure (

E_{Φ}

) [].

4.4. Comparison with SOTA Methods

We compare the proposed method with 12 SOTA methods including PFNet [], SINet-v2 [], SegMar [], ZoomNet [], TPRNet [], DTINet [], PolarNet [], MSCAF-Net [], SARNet [], FEDER [], FSPNet [], EANet [].

4.4.1. Quantitative Comparison

The test results of the proposed method compared with 12 state-of-the-art COD methods on the COD10K, CAMO, and NC4K datasets are shown in Table 1. As evident from the table, BDA-Net consistently outperforms all other methods across all three datasets. Specifically, on the COD10K dataset, our method shows enhancements of

1.38 %

in

S_{α}

,

3.60 %

in

F_{β}^{ω}

, and

0.64 %

in

E_{Φ}

compared to the second-best method, SARNet. On the CAMO dataset, our method outperforms the second-best method, MSCAF-Net, with improvements of

0.69 %

in

S_{α}

,

1.81 %

in

F_{β}^{ω}

, and

0.22 %

in

E_{Φ}

. Additionally, on the NC4K dataset, our method achieves better results with increases of

0.56 %

in

S_{α}

,

1.31 %

in

F_{β}^{ω}

, and

0.21 %

in

E_{Φ}

relative to the second-best method, SARNet.

Table 1. Quantitative comparison with state-of-the-art methods for COD on three benchmarks using four widely used evaluation metrics (

S_{α}

,

M A E

,

F_{β}

,

E_{Φ}

).

4.4.2. Evaluation Curves of COD Methods

To further evaluate the performance, we present the

F_{β}

-Threshold,

E_{m}

-Threshold, and Precision-Recall (PR) curves of BDA-Net compared with 12 other COD methods on the COD10K dataset, as shown in Figure 7, respectively. From the figures, it can be observed that BDA-Net consistently outperforms other methods in both the

F_{β}

-Threshold and

E_{m}

-Threshold curves, indicating higher detection accuracy. However, we also observe that when the recall rate of BDA-Net falls below

0.7

, its precision slightly lags behind MSCAF-Net and SARNet, indicating that there is still room for further improvement.

Figure 7. Presion-Recall,

F_{β}^{ω}

-Threshold and

E_{Φ}

-Threshold curves of BDA-Net and the recent SOTA algorithms on COD10K dataset.

BDA-Net delivers superior detection accuracy; however, this performance comes at the expense of reduced inference speed (3.67 FPS) and increased computational complexity, as reflected by a higher parameter count (69.476 M) and greater FLOPs (55.998 G). Therefore, BDA-Net is particularly well-suited for accuracy-critical applications, such as medical endoscopy, as well as latency-tolerant scenarios, including license plate recognition and underwater object detection.

4.5. Ablation Analysis

4.5.1. Key Modules

Within BDA-Net, BDM and DAM are two key modules and represent our main contributions. To validate the effectiveness of the two modules, we conducted a series of ablation experiments, with the results shown in Table 2. It is worth noting that the baseline model retains only the PVTv2 backbone and CAM modules. To ensure consistency and feasibility across experiments, we employed identical experimental hardware and training parameters for all models—including the baseline model and variants utilizing BDM, DAM, and other architectures. Specifically, training was conducted on an NVIDIA RTX 3090 GPU (with 24 GB memory) using batch size 12 and hyperparameters set to 50 epochs.

Table 2. Quantitative evaluation for ablation studies on COD10K using four widely used evaluation metrics (

S_{α}

, MAE,

F_{β}

,

E_{Φ}

).

From this table, we can observe at least two points. First, when adding either the BDM or DAM to the Baseline, the performance of the network improves to some extent. Specifically, there are obvious improvements in metrics

S_{α}

and

F_{β}^{ω}

, while MAE and

E_{Φ}

remain largely unchanged. This indicates that enhancing the features extracted by the backbone network, whether through boundary detection or feature highlighting, contributes to improving the accuracy of camouflaged object detection. Second, when both BDM and DAM are added to the baseline, the network’s performance improves further, with all four metrics surpassing those of the baseline. This demonstrates that under the guidance of boundary information provided by BDM, DAM can more accurately highlight the features of camouflaged objects, thereby further enhancing detection performance.

To visually observe the impact of BDM and DAM on the detection results, we present the outputs of these four models on four test samples, as shown in Figure 8. From this figure, it can be seen that when using only the Baseline, the resulting masks contain not only camouflaged objects but also some background objects. However, when BDM is added to the Baseline, the resulting masks show a reduction in background objects, while the camouflaged objects remain unchanged. Conversely, when DAM is added to the Baseline, the resulting masks retain only high-contrast objects, and some or all of the camouflaged objects may be lost. Furthermore, when both BDM and DAM are added to the Baseline, the background objects in the resulting masks almost completely disappear, and the resulting masks become highly similar to GT. This indicates that, under the guidance of boundary information, DAM effectively focuses on the camouflaged object, thereby improving the accuracy of the resulting masks.

Figure 8. The visual comparison of detection results obtained by different models in the ablation study, (a) Baseline, (b) Baseline + BDM (GT), (c) Baseline + DAM, (d) Baseline + BDM (Edge), (e) Ours (GT), (f) ours (Edge).

An insufficient number of input feature channels constrains the representational capacity of the model, often resulting in underfitting and the loss of discriminative information. Conversely, excessively large channel dimensions increase computational overhead and memory consumption while introducing redundant representations that may hinder generalization. To systematically examine this trade-off, we conducted an ablation study on channel dimension adjustment using 1 × 1 convolutional up-sampling within the BDM module. As summarized in Table 3, four configurations were evaluated: (N1) direct fusion of multi-level backbone features while preserving their original channel dimensions; (N2) unification of all feature layers to 16 channels; (N3) unification to 128 channels; and (N4) unification to 64 channels. The quantitative results indicate that the model achieves its best performance when the channel dimensions of all feature layers are consistent with those of

f_{1}

, suggesting that balanced channel allocation effectively preserves feature integrity while maintaining computational efficiency.

Table 3. Quantitative performance comparison of BDA-Net across different channel inputs on the COD10K dataset. The best results for each evaluation metric are marked in bold.

4.5.2. Differential Attention Analysis

Within the DAM, the differential attention is achieved by performing ⊖ and ⊕ operations on the outputs of GAP and GMP, as shown in Equation (4). There are six possible combinations depending on the presence or absence of GAP and GMP, each representing a different attention mechanism. For simplicity, we label them as A1, A2, …, A6. We tested these attention mechanisms on the COD10K dataset, and the results are presented in Table 4.

Table 4. Quantitative evaluation for ablation studies of feature difference highlighting module on COD10K.

From Table 4, at least two key points can be observed: first, using only the operator ⊖, corresponding to A1, A3, and A5, yields

S_{α}

values of 0.873, 0.875, and 0.874, respectively, effectively highlighting camouflaged objects. This confirms the effectiveness of the attention mechanism. Second, the subsequent operator ⊕, corresponding to A2, A4, and A6, further improves attention to camouflaged objects, with

S_{α}

values of 0.876, 0.876, and 0.877, respectively. This indicates that operator ⊕ strengthens the effect of operator ⊖, and that the absolute difference

| GAP - GMP |

demonstrates better attention to camouflaged objects, leading to improved detection accuracy.

4.6. Analysis of the Effects of Different $λ$ Hyparameters

To validate the effectiveness of

λ

hyperparameters in BDA-Net, experiments were conducted with different lambda settings. As shown in Table 5, the proposed loss function consistently enhances detection accuracy on the COD10K dataset.

Table 5. Quantitative performance comparison of BDA-Net across different

λ

hyparameters inputs on the COD10K dataset. The best results for each evaluation metric are marked in bold.

4.7. Failure Cases

BDA-Net may exhibit biased predictions under challenging conditions. As shown in Figure 9, limitations arise in three scenarios: (1) objects with complex textures, where spatial reasoning is insufficient; (2) small targets, where discriminative features are inadequately captured; and (3) occluded objects, which the network often fails to identify. Future work will focus on improving robustness through enhanced camouflage modeling and feature representation learning.

Figure 9. Visualization of detection failures in BDA-Net under severe conditions.

5. Conclusions

Motivated by the human visual system’s strategy for detecting camouflaged objects—a process of progressive refinement from global contour localization to internal detail analysis—we introduce a BDA-Net for COD. In this network, we first extract boundary priors based on the multi-scale features of the image. Then, guided by the boundary information, we use the difference between GAP and GMP to suppress background features and construct a differential attention mechanism to highlight camouflaged objects. Finally, feature fusion is performed to achieve the segmentation of camouflaged objects. We evaluated BDA-Net on three commonly used COD datasets, and the results show that it delivers highly competitive performance compared to 12 state-of-the-art COD methods.

In subsequent investigations, our efforts will be directed toward addressing persistent challenges in COD, with particular emphasis on the accurate identification of microscale objects and partially occluded targets within complex natural scenes. These factors represent critical bottlenecks that currently impede high-precision detection performance. The development of robust computational mechanisms capable of effectively mitigating these issues will be instrumental in advancing BDA-Net’s operational robustness and real-world applicability under challenging environmental conditions.

Author Contributions

Conceptualization, B.X. and H.Z.; methodology, B.X. and H.Z.; validation, B.X. and H.Z.; formal analysis, B.X. and H.Z.;writing—original draft preparation, B.X., H.Z. and S.J.; writing—review and editing, B.X., H.Z. and S.J.; visualization, B.X., H.Z. and S.J.; supervision, B.X., H.Z. and S.J.; project administration, H.Z. and S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Xiao, F.; Hu, S.; Shen, Y.; Fang, C.; Huang, J.; He, C.; Tang, L.; Yang, Z.; Li, X. A survey of camouflaged object detection and beyond. arXiv 2024, arXiv:2408.14562. [Google Scholar] [CrossRef]
Fan, D.; Ji, G.; Sun, G.; Cheng, M.; Shen, J.; Shao, L. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2777–2787. [Google Scholar]
Ren, J.; Hu, X.; Zhu, L.; Xu, X.; Xu, Y.; Wang, W.; Deng, Z.; Heng, P. Deep texture-aware features for camouflaged object detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 33, 1157–1167. [Google Scholar] [CrossRef]
Mei, H.; Ji, G.; Wei, Z.; Yang, X.; Wei, X.; Fan, D. Camouflaged object segmentation with distraction mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8772–8781. [Google Scholar]
Pang, Y.; Zhao, X.; Xiang, T.; Zhang, L.; Lu, H. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2160–2170. [Google Scholar]
Sun, Y.; Wang, S.; Chen, C.; Xiang, T. Boundary-guided camouflaged object detection. arXiv 2022, arXiv:2207.00794. [Google Scholar] [CrossRef]
Xiao, J.; Chen, T.; Hu, X.; Zhang, G.; Wang, S. Boundary-guided context-aware network for camouflaged object detection. Neural Comput. Appl. 2023, 35, 15075–15093. [Google Scholar] [CrossRef]
Yang, F.; Zhai, Q.; Li, X.; Huang, R.; Luo, A.; Cheng, H.; Fan, D. Uncertainty-guided transformer reasoning for camouflaged object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4146–4155. [Google Scholar]
Yang, J.; Shi, Y. EPANet: Edge-assisted Position Aware Attention Network for Camouflaged Object Detection. In Proceedings of the 2023 8th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 23–25 November 2023; pp. 376–382. [Google Scholar]
Liu, Z.; Jiang, P.; Lin, L.; Deng, X. Edge attention learning for efficient camouflaged object detection. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5230–5234. [Google Scholar]
Zhu, H.; Li, P.; Xie, H.; Yan, X.; Liang, D.; Chen, D.; Wei, M.; Qin, J. I can find you! boundary-guided separated attention network for camouflaged object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Pomona, CA, USA, 24–28 October 2022; pp. 3608–3616. [Google Scholar]
Bayraktar, I.; Bakirci, M. Attention-Augmented YOLO11 for High-Precision Aircraft Detection in Synthetic Aperture Radar Imagery. In Proceedings of the 2025 27th International Conference on Digital Signal Processing and Its Applications (DSPA), Moscow, Russia, 26–28 March 2025; pp. 1–6. [Google Scholar]
Fan, B.; Cong, K.; Zou, W. Dual Attention and Edge Refinement Network for Camouflaged Object Detection. In Proceedings of the 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 27–29 July 2023; pp. 60–65. [Google Scholar]
Xiang, J.; Pan, Q.; Zhang, Z.; Fu, S.; Qin, Y. Double-branch fusion network with a parallel attention selection mechanism for camouflaged object detection. Sci. China Inf. Sci. 2023, 66, 162403. [Google Scholar] [CrossRef]
Du, S.; Yao, C.; Kong, Y.; Yang, Y. BANet: Camouflaged Object Detection Based on Boundary Guidance and Multiple Attention Mechanisms. In Proceedings of the 2023 9th Annual International Conference on Network and Information Systems for Computers (ICNISC), Wuhan, China, 27–29 October 2023; pp. 464–469. [Google Scholar]
Liang, W.; Wu, J.; Wu, Y.; Mu, X.; Xu, J. FINet: Frequency injection network for lightweight camouflaged object detection. IEEE Signal Process. Lett. 2024, 31, 526–530. [Google Scholar] [CrossRef]
Liu, X.; Qi, L.; Song, Y.; Wen, Q. Depth awakens: A depth-perceptual attention fusion network for RGB-D camouflaged object detection. Image Vis. Comput. 2024, 143, 104924. [Google Scholar] [CrossRef]
Guan, J.; Fang, X.; Zhu, T.; Qian, W. SDRNet: Camouflaged object detection with independent reconstruction of structure and detail. Knowl. Based Syst. 2024, 299, 112051. [Google Scholar] [CrossRef]
Zhang, C.; Bi, H.; Mo, D.; Sun, W.; Tong, J.; Jin, W.; Sun, Y. CCNet: Collaborative Camouflaged Object Detection via decoder-induced information interaction and supervision refinement network. Eng. Appl. Artif. Intell. 2024, 133, 108328. [Google Scholar] [CrossRef]
Fang, X.; Chen, J.; Wang, Y.; Jiang, M.; Ma, J.; Wang, X. EPFDNet: Camouflaged object detection with edge perception in frequency domain. Image Vis. Comput. 2025, 154, 105358. [Google Scholar] [CrossRef]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
Wei, J.; Wang, S.; Huang, Q. F³Net: Fusion, feedback and focus for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12321–12328. [Google Scholar]
Le, T.; Nguyen, T.; Nie, Z.; Tran, M.; Sugimoto, A. Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 2019, 184, 45–56. [Google Scholar] [CrossRef]
Lv, Y.; Zhang, J.; Dai, Y.; Li, A.; Liu, B.; Barnes, N.; Fan, D. Simultaneously localize, segment and rank the camouflaged objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11591–11601. [Google Scholar]
Fan, D.; Cheng, M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4548–4557. [Google Scholar]
Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 733–740. [Google Scholar]
Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1597–1604. [Google Scholar]
Fan, D.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. arXiv 2018, arXiv:1805.10421. [Google Scholar] [CrossRef]
Fan, D.; Ji, G.; Cheng, M.; Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6024–6042. [Google Scholar] [CrossRef] [PubMed]
Jia, Q.; Yao, S.; Liu, Y.; Fan, X.; Liu, R.; Luo, Z. Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4713–4722. [Google Scholar]
Zhang, Q.; Ge, Y.; Zhang, C.; Bi, H. Tprnet: Camouflaged object detection via transformer-induced progressive refinement network. Vis. Comput. 2023, 39, 4593–4607. [Google Scholar] [CrossRef]
Liu, Z.; Zhang, Z.; Tan, Y.; Wu, W. Boosting camouflaged object detection with dual-task interactive transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 140–146. [Google Scholar]
Wang, X.; Zhang, Z.; Gao, J. Polarization-based camouflaged object detection. Pattern Recognit. Lett. 2023, 174, 106–111. [Google Scholar] [CrossRef]
Liu, Y.; Li, H.; Cheng, J.; Chen, X. MSCAF-Net: A general framework for camouflaged object detection via learning multi-scale context-aware features. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4934–4947. [Google Scholar] [CrossRef]
Xing, H.; Gao, S.; Wang, Y.; Wei, X.; Tang, H.; Zhang, W. Go closer to see better: Camouflaged object detection via object area amplification and figure-ground conversion. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 5444–5457. [Google Scholar] [CrossRef]
He, C.; Li, K.; Zhang, Y.; Tang, L.; Zhang, Y.; Guo, Z.; Li, X. Camouflaged object detection with feature decomposition and edge reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22046–22055. [Google Scholar]
Huang, Z.; Dai, H.; Xiang, T.; Wang, S.; Chen, H.; Qin, J.; Xiong, H. Feature shrinkage pyramid for camouflaged object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5557–5566. [Google Scholar]
Liang, W.; Wu, J.; Mu, X.; Hao, F.; Du, J.; Xu, J.; Li, P. Weighted dense semantic aggregation and explicit boundary modeling for camouflaged object detection. IEEE Sens. J. 2024, 24, 21108–21122. [Google Scholar] [CrossRef]
Zhang, D.; Wang, C.; Fu, Q. Efficient camouflaged object detection via progressive refinement network. IEEE Signal Process. Lett. 2023, 31, 231–235. [Google Scholar] [CrossRef]
Yang, H.; Zhu, Y.; Sun, K.; Ding, H.; Lin, X. Camouflaged object detection via dual-branch fusion and dual self-similarity constraints. Pattern Recognit. 2025, 157, 110895. [Google Scholar] [CrossRef]

Figure 1. Detection results of camouflaged objects using BDA-Net. (a) Input images containing camouflaged objects; (b) Ground truth binary masks of the camouflaged objects; (c) Prediction results of the proposed BDA-Net.

Figure 2. Overall architecture of BDA-Net. First, the input image, with a fixed resolution of

416 \times 416 \times 3

, is processed by a Transformer-based backbone network to extract features at four different scales. These features are sent to the BDM module to extract boundary priors, which guide the differential attention mechanism. Simultaneously, the features are fed into four DAM modules, where the boundary priors help highlight the camouflaged object’s features. Finally, the CAM module fuses the enhanced features to produce the camouflage object mask.

Figure 3. Architecture of the BDM module. This module first upsamples the smaller-scale features

f_{2}

,

f_{3}

, and

f_{4}

to match the size of feature

f_{1}

. These features are then concatenated to form a single-scale feature map. Finally, a channel attention is applied to weight the feature map, highlighting the boundaries of the camouflaged object.

Figure 4. Architecture of the DAM module. In this module, the feature

f_{i}

is first fused with the boundary prior

f_{e}

. The resulting features are then weighted using a differential attention mechanism, producing the enhanced feature

f_{i}^{'}

that highlights the camouflaged object.

Figure 5. Comparison of visualization results from Boundary Prior Fusion.

Figure 6. Comparison of the resulting images after applying GAP, GMP, and

| GAP - GMP |

on the source image.

Figure 7. Presion-Recall,

F_{β}^{ω}

-Threshold and

E_{Φ}

-Threshold curves of BDA-Net and the recent SOTA algorithms on COD10K dataset.

Figure 8. The visual comparison of detection results obtained by different models in the ablation study, (a) Baseline, (b) Baseline + BDM (GT), (c) Baseline + DAM, (d) Baseline + BDM (Edge), (e) Ours (GT), (f) ours (Edge).

Figure 9. Visualization of detection failures in BDA-Net under severe conditions.

Table 1. Quantitative comparison with state-of-the-art methods for COD on three benchmarks using four widely used evaluation metrics (

S_{α}

,

M A E

,

F_{β}

,

E_{Φ}

).

Table 1. Quantitative comparison with state-of-the-art methods for COD on three benchmarks using four widely used evaluation metrics (

S_{α}

,

M A E

,

F_{β}

,

E_{Φ}

).

Method	Year	COD10K				CAMO				NC4K
Method	Year	$S_{α} ↑$	MAE ↓	$F_{β}^{ω} ↑$	$E_{Φ} ↑$	$S_{α} ↑$	MAE ↓	$F_{β}^{ω} ↑$	$E_{Φ} ↑$	$S_{α} ↑$	MAE ↓	$F_{β}^{ω} ↑$	$E_{Φ} ↑$
PFNet []	2021	0.800	0.040	0.660	0.877	0.782	0.085	0.695	0.855	0.829	0.053	0.745	0.887
SINetV2 []	2022	0.815	0.037	0.680	0.887	0.820	0.070	0.743	0.882	0.847	0.048	0.770	0.903
SegMar []	2022	0.833	0.034	0.724	0.899	0.815	0.071	0.753	0.874	0.841	0.046	0.781	0.896
ZoomNet []	2022	0.838	0.029	0.729	0.888	0.820	0.066	0.752	0.877	0.853	0.043	0.784	0.896
DTINet []	2022	0.824	0.034	0.695	0.896	0.856	0.050	0.796	0.916	0.863	0.041	0.792	0.917
PolarNet []	2023	0.820	0.034	0.735	0.896	0.816	0.073	0.785	0.874	0.849	0.046	0.810	0.905
FEDER []	2023	0.822	0.032	0.716	0.900	0.802	0.071	0.738	0.867	0.847	0.044	0.789	0.907
BCNet []	2023	0.827	0.033	0.704	0.894	0.829	0.068	0.761	0.889	0.857	0.043	0.788	0.910
FSPNet []	2023	0.851	0.026	0.735	0.895	0.856	0.071	0.799	0.899	0.879	0.035	0.816	0.915
MSCAF-Net []	2023	0.865	0.024	0.775	0.927	0.873	0.046	0.828	0.929	0.887	0.032	0.838	0.934
SARNet []	2023	0.864	0.024	0.777	0.931	0.868	0.047	0.828	0.927	0.886	0.032	0.842	0.937
EANet []	2024	0.825	0.029	0.709	0.910	0.841	0.051	0.793	0.919	0.825	0.039	0.798	0.922
SAE-Net []	2024	0.837	0.064	0.770	0.891	0.837	0.064	0.770	0.891	0.862	0.042	0.796	0.912
PRNet []	2024	0.821	0.035	0.742	0.895	0.816	0.072	0.791	0.875	0.844	0.047	0.813	0.903
DSNet []	2025	0.809	0.038	0.657	0.878	0.817	0.073	0.726	0.870	0.843	0.050	0.753	0.894
BDA-Net (Ours)	-	0.877	0.021	0.805	0.938	0.879	0.043	0.847	0.931	0.891	0.030	0.853	0.939

Table 2. Quantitative evaluation for ablation studies on COD10K using four widely used evaluation metrics (

S_{α}

, MAE,

F_{β}

,

E_{Φ}

).

Table 2. Quantitative evaluation for ablation studies on COD10K using four widely used evaluation metrics (

S_{α}

, MAE,

F_{β}

,

E_{Φ}

).

Model	COD10K
Model	$S_{α} ↑$	MAE ↓	$F_{β}^{ω} ↑$	$E_{Φ} ↑$
Baseline	0.869	0.023	0.784	0.928
+BDM	0.874	0.023	0.796	0.928
+DAM	0.874	0.022	0.796	0.926
+DAM +BDM (Ours)	0.877	0.021	0.805	0.938

Table 3. Quantitative performance comparison of BDA-Net across different channel inputs on the COD10K dataset. The best results for each evaluation metric are marked in bold.

Model	COD10K
Model	$S_{α} ↑$	MAE ↓	$F_{β}^{ω} ↑$	$E_{Φ} ↑$
N1	0.874	0.022	0.802	0.936
N2	0.873	0.022	0.802	0.934
N3	0.876	0.021	0.803	0.938
N4 (Ours)	0.877	0.021	0.805	0.938

Table 4. Quantitative evaluation for ablation studies of feature difference highlighting module on COD10K.

Attention	Combinations			COD10K
	⊖		⊕	COD10K
	GMP	GAP	GAP	$S_{α} ↑$	MAE ↓	$F_{β}^{ω} ↑$	$E_{Φ} ↑$
A1	✓			0.873	0.022	0.794	0.933
A2	✓		✓	0.876	0.022	0.799	0.936
A3		✓		0.875	0.021	0.800	0.938
A4		✓	✓	0.876	0.021	0.799	0.937
A5	✓	✓		0.874	0.022	0.796	0.936
A6 (Ours)	✓	✓	✓	0.877	0.021	0.805	0.938

Table 5. Quantitative performance comparison of BDA-Net across different

λ

hyparameters inputs on the COD10K dataset. The best results for each evaluation metric are marked in bold.

Table 5. Quantitative performance comparison of BDA-Net across different

λ

hyparameters inputs on the COD10K dataset. The best results for each evaluation metric are marked in bold.

Model	COD10K
Model	$S_{α} ↑$	MAE ↓	$F_{β}^{ω} ↑$	$E_{Φ} ↑$
$λ = 1$	0.874	0.023	0.799	0.933
$λ = 2$	0.875	0.023	0.801	0.936
$λ = 3$ (Ours)	0.877	0.021	0.805	0.938

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Boundary-Guided Differential Attention: Enhancing Camouflaged Object Detection Accuracy

Abstract

1. Introduction

2. Motivation and Contributions

3. Method

3.1. Overall Architecture

3.2. Boundary Detection Module

3.3. Differential Attention Module

3.3.1. Boundary Prior Fusion

3.3.2. Differential Attention Mechanism

3.4. Context Aggregation Module

3.5. Loss Function

4. Experiments

4.1. Implementation Details

4.2. Datasets

4.3. Evaluation Metrics

4.4. Comparison with SOTA Methods

4.4.1. Quantitative Comparison

4.4.2. Evaluation Curves of COD Methods

4.5. Ablation Analysis

4.5.1. Key Modules

4.5.2. Differential Attention Analysis

4.6. Analysis of the Effects of Different $λ$ Hyparameters

4.7. Failure Cases

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Boundary-Guided Differential Attention: Enhancing Camouflaged Object Detection Accuracy

Abstract

1. Introduction

2. Motivation and Contributions

3. Method

3.1. Overall Architecture

3.2. Boundary Detection Module

3.3. Differential Attention Module

3.3.1. Boundary Prior Fusion

3.3.2. Differential Attention Mechanism

3.4. Context Aggregation Module

3.5. Loss Function

4. Experiments

4.1. Implementation Details

4.2. Datasets

4.3. Evaluation Metrics

4.4. Comparison with SOTA Methods

4.4.1. Quantitative Comparison

4.4.2. Evaluation Curves of COD Methods

4.5. Ablation Analysis

4.5.1. Key Modules

4.5.2. Differential Attention Analysis

4.6. Analysis of the Effects of Different λ Hyparameters

4.7. Failure Cases

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

4.6. Analysis of the Effects of Different $λ$ Hyparameters