4.2. Quantitative Comparison
We conducted a comprehensive evaluation of our proposed model against fourteen state-of-the-art (SOTA) methods, including SINet, SINetV2, PFNet [11], DGNet [29], BSA-Net [30], CamoFormer [31], BGNet [32], FEDER, BANet, EANet [33], CINet [34], ZoomNet, SARNet [4], and SDRNet [35]. As shown in Table 1, on the COD10K evaluation dataset, our method outperformed SDRNet (which also utilizes PVTv2 as the backbone) by 3.0%, 6.3%, and 3.5% on the $S_\alpha$, $F_\beta^w$, and $E_\phi$ metrics, respectively, while reducing the $M$ metric by 0.006.
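For reference, these are the four standard COD evaluation measures: the structure measure $S_\alpha$, the weighted F-measure $F_\beta^w$, the enhanced-alignment measure $E_\phi$, and the mean absolute error $M$ (lower is better, hence reported as a reduction). The last of these, for a normalized prediction map $P$ and binary ground truth $G$ of size $W \times H$, is

```latex
M = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \bigl| P(x, y) - G(x, y) \bigr|
```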
To further assess performance on small targets, we calculated the proportion of white (foreground) pixels in the ground truth (GT) of each test image relative to the entire image. Images with less than 3% white pixels were classified as containing small targets, resulting in the small-target test subsets COD10K-small (28.8%), CAMO-small (4%), and NC4K-small (11%). On these subsets, we compared our model with SDRNet. On COD10K-small, our model achieved improvements over SDRNet of 2.6%, 8.4%, and 3.3% in terms of $S_\alpha$, $F_\beta^w$, and $E_\phi$, respectively, while reducing the $M$ metric by 0.007. This demonstrated the effectiveness of our approach, particularly in the challenging task of detecting small camouflaged objects.
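The subset construction is straightforward to reproduce: score each GT mask by its foreground ratio and apply the 3% cutoff. A minimal sketch follows (the directory layout and file naming are hypothetical):

```python
import numpy as np
from pathlib import Path
from PIL import Image

SMALL_TARGET_RATIO = 0.03  # GT foreground below 3% of the image => "small"

def is_small_target(gt_path: Path) -> bool:
    """Return True if the white (foreground) pixels of the binary GT mask
    cover less than 3% of the whole image."""
    mask = np.array(Image.open(gt_path).convert("L")) > 127
    return mask.mean() < SMALL_TARGET_RATIO

# Hypothetical layout: gather the COD10K-small subset from the test GTs.
gt_dir = Path("COD10K/Test/GT")
small_subset = sorted(p.name for p in gt_dir.glob("*.png") if is_small_target(p))
print(f"{len(small_subset)} small-target images")
```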
The detection results are illustrated in Figure 8, where our approach demonstrated a clear advantage over the other SOTA methods in detecting both multi-scale and small targets. This superior performance highlights the robustness and adaptability of our model in handling objects of varying sizes and complexities.
In addition to the quantitative comparisons based on the four evaluation metrics above, we also present precision–recall (PR) curves in Figure 9. The PR curves offer a more detailed view of the trade-off between precision and recall across different binarization thresholds. As observed, our model consistently outperformed the other camouflaged object detection (COD) methods at all threshold levels. These results further validate the effectiveness of our approach in accurately capturing camouflaged targets, particularly those that are small or exhibit multi-scale characteristics. The combination of strong quantitative metrics and favorable PR curves underscores the advancement our method brings to the field of COD.
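For reference, a PR curve for a soft prediction map is obtained by sweeping a binarization threshold and computing precision and recall at each step; dataset-level curves average these values over all images. A minimal sketch following the usual SOD/COD protocol:

```python
import numpy as np

def pr_curve(pred: np.ndarray, gt: np.ndarray, n_thresholds: int = 255):
    """Precision/recall of a soft prediction map in [0, 1] against a binary
    GT mask, swept over uniformly spaced thresholds."""
    gt = gt.astype(bool)
    precision, recall = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision.append(tp / max(binary.sum(), 1))  # guard empty predictions
        recall.append(tp / max(gt.sum(), 1))         # guard empty masks
    return np.array(precision), np.array(recall)

# Toy usage with a random map and mask.
rng = np.random.default_rng(0)
p, r = pr_curve(rng.random((64, 64)), rng.random((64, 64)) > 0.7)
```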
4.3. Qualitative Comparison
The detection results are presented in Figure 8, where our proposed approach consistently demonstrated superior performance in detecting both large and small camouflaged targets. Notably, our model showed significant improvements in handling small targets, which are typically challenging to detect due to their subtle features and low contrast with the background.
Figure 8 compares the detection results of our model with those of five SOTA COD methods. Visual analysis clearly shows that our model consistently achieved more accurate and complete segmentation of the camouflaged objects.
Large Targets (First Row): Our model effectively handled large-structure camouflaged objects and provided accurate segmentation results. In contrast, methods such as BGNet, ZoomNet, and SDRNet often exhibited segmentation inaccuracies, including missing details in boundary regions.
Small Targets (Second and Third Rows): For small-structure targets that were difficult to distinguish from the background and prone to misclassification, our model outperformed the other methods shown in Figure 8. It accurately localized the targets and captured fine-grained details, demonstrating robustness in detecting challenging small targets.
Heavily Occluded Targets (Fourth Row): When camouflaged objects were heavily occluded by their surroundings and had unclear boundaries due to blending with the background, our model accurately detected the objects. As shown in the figure, it effectively preserved the edge details of targets partially obscured by grass, providing more precise segmentation results compared to the other methods.
Multiple Targets (Fifth Row): Detecting multiple camouflaged objects in a single scene is particularly challenging. While some methods failed to locate all targets, our model successfully detected and segmented multiple objects, showcasing its adaptability and scalability in handling complex scenarios.
In summary, the results in Figure 8 show that our model consistently maintained SOTA performance under various complex and challenging conditions. In camouflaged object detection across multiple scales, complex backgrounds, and occlusions, our approach demonstrated strong robustness and adaptability. These findings highlight the effectiveness of our model in COD tasks, particularly for small object detection, detail preservation, and multi-scale detection.
4.4. Ablation Study
(a) Effectiveness of Proposed Modules
To evaluate the contributions of the key components of the proposed model, we performed a series of ablation studies by systematically removing individual components from the full model. As shown in Table 2, these ablation experiments were conducted on the COD10K and CAMO test datasets to validate the effectiveness of each module.
Baseline (I): The baseline model directly used multi-scale features from the backbone and simply fused them for object recognition. While providing a basic structure, it lacked advanced mechanisms for feature enhancement and scale-specific learning.
Baseline + A2SPP (II): Building upon the baseline, the A2SPP module was introduced to guide the model in automatically selecting feature scales. This mechanism facilitated more effective multi-scale learning by dynamically adapting to the contextual needs of the input, leading to enhanced performance.
Baseline + FEM (III): This variant incorporated the FEM module into the baseline, which dynamically adjusts features from multiple receptive fields to improve the accuracy of detecting small objects. Additionally, the FEM integrates features from various hierarchical levels, thereby enhancing the model’s ability to identify targets with greater precision (a generic sketch of this idea follows the list below).
Baseline + FEM + A2SPP (IV): This configuration combined both the FEM and A2SPP modules with the baseline to leverage the complementary strengths of dynamic multi-receptive field feature adjustment and adaptive scale selection. This holistic approach maximized the model’s capability to detect camouflaged objects, particularly small and subtle ones.
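The text does not spell out the FEM's internals beyond the description in variant (III), so the sketch below shows one common way to realize dynamic multi-receptive-field adjustment: selective-kernel-style gating over parallel dilated branches. Treat it as a generic stand-in, not the actual FEM.

```python
import torch
import torch.nn as nn

class MultiRFAdjust(nn.Module):
    """Selective-kernel-style gating over parallel dilated 3x3 branches: a
    generic stand-in for the FEM's dynamic receptive-field adjustment."""
    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates
        )
        # Global context -> one softmax weight per branch.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(rates), 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, R, C, H, W)
        w = torch.softmax(self.gate(x), dim=1)                     # (B, R, 1, 1)
        return (feats * w.unsqueeze(2)).sum(dim=1)                 # (B, C, H, W)

print(MultiRFAdjust(32)(torch.randn(1, 32, 40, 40)).shape)  # (1, 32, 40, 40)
```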
The quantitative results of the ablation studies are presented in Table 2, while the visual detection performance on small-structured objects for the four configurations (I–IV) is illustrated in Figure 10. The results clearly demonstrate that the inclusion of the A2SPP and FEM modules significantly improved the model’s ability to detect small-structured objects. Specifically, the A2SPP module excelled in learning and adapting to multi-scale features, while the FEM module enhanced the detection of fine-grained details and minute structures. Together, these components effectively address the challenges of small object detection and camouflaged object identification, leading to substantial improvements in overall performance. These findings highlight the critical role of each module in increasing the model’s efficacy and robustness across diverse test scenarios.
(b) Mixed-sampling Input Scheme
In the scale fusion network, we downsample the high-resolution features, aiming to produce effectively integrated and diverse multi-scale features that enhance the detection of camouflaged objects. Downsampling plays a critical role in this process, as it helps incorporate complementary feature information across different scales, ensuring a more robust representation of objects with varying sizes and complexities.
To determine the most appropriate downsampling method for the SSCOD task, we conducted a comparative analysis of several commonly used techniques, including average pooling, max pooling, bilinear interpolation, and bicubic interpolation, as shown in Table 3. Each method was evaluated on its ability to preserve key structural details while effectively reducing the spatial resolution of the feature maps. From this analysis, it became clear that the HWD method, introduced earlier in our work, consistently outperformed the other approaches. This superior performance resulted from its ability to retain fine-grained details and critical boundary information, making it particularly suitable for the challenges associated with camouflaged object detection.
Consequently, the HWD method served as the default setting in our framework for the subsequent studies and experiments. By leveraging the strengths of HWD, the scale fusion network achieves a more comprehensive and detailed feature representation, significantly enhancing the overall effectiveness of the proposed approach in addressing small-structure camouflaged object detection scenarios.
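As a concrete reference, one common formulation of Haar wavelet downsampling replaces strided pooling with a lossless 2x2 Haar decomposition, so the high-frequency (edge) sub-bands survive the resolution reduction; a 1x1 convolution then compresses the sub-band channels back to the original width. The sketch below follows that formulation and may differ in detail from the HWD module used here.

```python
import torch
import torch.nn as nn

class HaarDownsample(nn.Module):
    """Downsample by a 2x2 Haar wavelet transform followed by a 1x1 conv.
    A minimal sketch of Haar-wavelet downsampling (HWD); the actual module
    may add normalization or activation layers."""
    def __init__(self, channels: int):
        super().__init__()
        # Four orthonormal 2x2 Haar kernels: LL, LH, HL, HH.
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)  # (4, 1, 2, 2)
        # One set of 4 kernels per input channel (depthwise, stride 2).
        self.register_buffer("weight", kernels.repeat(channels, 1, 1, 1))
        self.channels = channels
        # Fuse the 4C sub-band channels back to C channels.
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 4C, H/2, W/2): per-channel Haar sub-bands.
        subbands = nn.functional.conv2d(
            x, self.weight, stride=2, groups=self.channels
        )
        return self.fuse(subbands)

feat = torch.randn(1, 64, 44, 44)
print(HaarDownsample(64)(feat).shape)  # torch.Size([1, 64, 22, 22])
```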
(c) Validation of the Design Choices of A2SPP
Considering the inherent limitations of ASPP, we proposed an enhanced A2SPP module that integrates an MAM into each branch of the ASPP structure. The MAM module builds upon the Convolutional Block Attention Module (CBAM) [37] framework, incorporating a self-attention mechanism specifically within the computation of spatial attention. This enhancement allows the module to capture more nuanced spatial relationships and improve its ability to process multi-scale contextual information.
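Because the text only states that self-attention is inserted into the spatial-attention computation of CBAM, the sketch below is a plausible reading rather than the exact MAM: channel attention is standard CBAM, while the spatial branch runs positional self-attention over the pooled spatial descriptors before the sigmoid gate. The A2SPP wrapper then attaches one such block to each dilated branch.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Standard CBAM channel attention (shared MLP over avg/max descriptors)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class MAMSpatialAttention(nn.Module):
    """CBAM spatial attention with a positional self-attention step before
    the sigmoid gate (hypothetical reading of the MAM design)."""
    def __init__(self, qk_dim: int = 8, kernel_size: int = 7):
        super().__init__()
        self.q = nn.Conv2d(2, qk_dim, 1)
        self.k = nn.Conv2d(2, qk_dim, 1)
        self.v = nn.Conv2d(2, 1, 1)
        self.out = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel-wise avg/max descriptors, exactly as in CBAM.
        d = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        b, _, h, w = d.shape
        q = self.q(d).flatten(2).transpose(1, 2)              # (B, HW, d)
        k = self.k(d).flatten(2)                              # (B, d, HW)
        v = self.v(d).flatten(2).transpose(1, 2)              # (B, HW, 1)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, -1)  # (B, HW, HW)
        s = (attn @ v).transpose(1, 2).reshape(b, 1, h, w)
        return torch.sigmoid(self.out(s))

class MAM(nn.Module):
    """CBAM-style block whose spatial branch uses self-attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), MAMSpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

class A2SPP(nn.Module):
    """ASPP whose parallel dilated branches are each refined by a MAM."""
    def __init__(self, channels: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                MAM(channels),
            ) for r in rates
        )
        self.project = nn.Conv2d(len(rates) * channels, channels, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], 1))

print(A2SPP(32)(torch.randn(1, 32, 22, 22)).shape)  # (1, 32, 22, 22)
```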
To evaluate the effectiveness of this design, we conducted a series of ablation studies that systematically modified the MAM. First, we removed the self-attention mechanism, reducing the MAM to a standard CBAM module, and replaced all MAM modules in A2SPP with CBAM modules to observe the impact on performance. We then separately substituted the MAM with simpler attention mechanisms, namely spatial attention (SA) and channel attention (CA), to assess their relative contributions.
The experimental results, detailed in Table 4, revealed a significant decline in performance for all of these modified configurations. This demonstrates that the inclusion of the self-attention mechanism in the CBAM framework is crucial to the enhanced performance of the MAM. Specifically, self-attention enables a more effective weighting of spatial features, improving the model’s ability to capture fine-grained details and complex spatial relationships. These findings underscore the critical role of the MAM in the A2SPP module and highlight the importance of advanced attention mechanisms for robust multi-scale feature learning.
Figure 11 illustrates the intermediate feature maps at different stages of the decoder, with layers 1 to 5 representing progressively shallower levels of the decoder. We visualize and compare the intermediate feature maps across these layers to demonstrate the effectiveness of the A2SPP module. As shown in the figure, the model equipped with A2SPP exhibited superior performance in capturing fine details, such as the legs of the butterfly, compared to standard pooling methods. This highlights the enhanced capability of A2SPP for preserving small structural features during the decoding process.
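Figure-11-style visualizations can be reproduced with forward hooks; the sketch below records a channel-averaged map from each decoder stage (the five-layer decoder here is a stand-in for the real one).

```python
import torch
import torch.nn as nn

# Stand-in five-stage decoder; swap in the real model's decoder layers.
decoder = nn.Sequential(*[nn.Conv2d(8, 8, 3, padding=1) for _ in range(5)])

features = {}

def save(name):
    def hook(module, args, output):
        # Average over channels -> one grayscale map per stage for display.
        features[name] = output.detach().mean(dim=1)
    return hook

for i, layer in enumerate(decoder, start=1):
    layer.register_forward_hook(save(f"layer{i}"))

decoder(torch.randn(1, 8, 64, 64))
print({k: tuple(v.shape) for k, v in features.items()})
```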
4.5. Experiment on Polyp Segmentation
(1) Datasets: We evaluated our model on four benchmark datasets for polyp segmentation: ETIS [38], CVC-ClinicDB [39], CVC-ColonDB [40], and Kvasir [41]. Specifically, the ETIS dataset consists of 196 polyp images, CVC-ClinicDB contains 612, CVC-ColonDB includes 380, and Kvasir comprises 1000.
(2) Comparison with other SOTA Polyp Segmentation methods: The primary objective of polyp segmentation is to accurately detect polyp tissue that closely resembles the surrounding background, which plays a crucial role in the effective prevention of cancer in clinical practice. To evaluate the performance of our model in the polyp segmentation task, we validated it on four publicly available datasets: ETIS, ColonDB, ClinicDB, and Kvasir-SEG, while ensuring that all hyperparameter settings remained consistent with those used in the SSCOD task described in the main text.
We compared our model against eight state-of-the-art polyp segmentation methods: UNet [42], UNet++ [43], SFA [44], EU-Net [45], SANet [46], PraNet [47], DCRNet [48], and FDRNet [49]. The quantitative comparison results appear in Table 5, showing that our model achieved the best overall performance among all methods. This demonstrates its capability to effectively handle tasks that share similar challenges with SSCOD.
The qualitative evaluation results appear in Figure 12. As shown in the figure, our proposed method consistently produced more accurate and complete segmentation results while maintaining a clean background. Additionally, the first and second rows of the figure indicate that our model performed well in segmenting small-structure targets within the polyp segmentation task.