Edge-Guided Camouflaged Object Detection via Multi-Level Feature Integration

Camouflaged object detection (COD) aims to segment objects that blend almost perfectly into their surroundings. Because the boundary contrast between camouflaged objects and their backgrounds is low, detecting them poses a significant challenge. Despite the many excellent COD methods developed in recent years, issues such as boundary refinement and multi-level feature extraction and fusion still need further exploration. In this paper, we propose a novel multi-level feature integration network (MFNet) for camouflaged object detection. First, we design an edge guidance module (EGM) that improves COD performance by providing additional boundary semantics: it combines high-level semantic information with low-level spatial details to model the edges of camouflaged objects. In addition, we propose a multi-level feature integration module (MFIM), which leverages the fine local information of low-level features and the rich global information of high-level features across three adjacent feature levels to provide a supplementary representation for the current-level features, effectively integrating the full contextual semantics. Finally, we propose a context aggregation refinement module (CARM) to efficiently aggregate and refine the cross-level features and obtain clear prediction maps. Extensive experiments on three benchmark datasets show that MFNet is an effective COD model that outperforms other state-of-the-art models on all four evaluation metrics (S_α, E_φ, F_β^w, and MAE).


Introduction
In nature, organisms use body color, texture, and coverings to conceal themselves within their surroundings, thereby avoiding detection by predators. Camouflaged object detection (COD) is an emerging computer vision segmentation task that aims to segment these objects that blend perfectly into their surroundings [1]. Unlike salient object detection [2][3][4][5], which segments salient objects with distinct boundaries and high contrast with their backgrounds, camouflaged objects often lack clear visual boundaries with their backgrounds and may be obscured by other objects in their surroundings, which makes accurate camouflaged object detection more challenging. Nevertheless, due to its significant research implications and wide application in medical image processing (e.g., polyp segmentation [6], lung infection segmentation [7]), pest detection [8], defect detection [9], and underwater object detection [10], COD has become a research hotspot.
Traditional camouflaged object detection methods [11][12][13][14][15] typically rely on handcrafted features (textures, colors, edges, etc.) to differentiate camouflaged objects from their surroundings. However, due to their limited ability to extract and analyze high-level semantic information, these methods tend to perform poorly in challenging COD scenarios. In contrast, recent deep-learning-based methods have advanced the field considerably; in particular, Fan et al. [1] made a substantial contribution by comprehensively studying the COD task. The main contributions of this paper are as follows:
• We propose the novel MFNet to investigate the effectiveness of adjacent-layer feature integration in the COD task and confirm the rationality of fully capturing contextual information through the interaction of adjacent-layer features;
• We propose an edge guidance module to explicitly learn object edge representations and guide the model to discriminate camouflaged objects' edges effectively;
• We propose a multi-level feature integration module to efficiently extract and integrate global semantic information and local detail information in multi-scale features;
• We propose a context aggregation refinement module to aggregate and refine cross-layer features via the attention mechanism, atrous convolution, and asymmetric convolution.
With the introduction of the large-scale COD dataset COD10K and the animal-predation-inspired baseline SINet by Fan et al. [1], more and more deep-learning-based COD methods have emerged. For instance, Mei et al. [21] proposed PFNet, which uses high-level semantic information to roughly localize camouflaged objects and then wipes out the false-positive and false-negative areas that would otherwise degrade the segmentation results. Zhai et al. [18] proposed MGL, which models the localization and refinement process in camouflaged object detection through graph convolutional networks. Li et al. [25] proposed JCSOD, which employs a joint adversarial learning framework to perform both salient object detection (SOD) and COD tasks to improve the accuracy and robustness of camouflaged object detection. Yang et al. [19] proposed UGTR, a method that combines a convolutional neural network and a Transformer, leveraging a probabilistic representation model to learn the uncertainty of camouflaged objects within the Transformer framework. This enables the model to pay more attention to uncertain regions, leading to more precise segmentation. Lv et al. [26] proposed a novel COD network, LSR, to simultaneously localize and segment camouflaged objects and rank them according to their detectability. Fan et al. [27] proposed SINet-V2, which first introduced group reverse attention to address the COD problem and obtained excellent detection performance by combining it with the location information provided by the neighbor connection decoder module. Recently, Pang et al. [28] proposed a camouflaged object detection model called ZoomNet, which mimics human behavior (zooming in and out) when observing blurred images, using a scaling strategy to learn mixed-scale semantics through scale integration and hierarchical scale mixing.

Multi-Level Feature Fusion
Multi-level feature fusion strategies have been widely used in detection and segmentation tasks. Integrating various levels of feature information enables the effective extraction of contextual semantic information, which enhances the learning capability of the model. Furthermore, coordinating high-level semantic features with low-level fine details is crucial in camouflaged object detection (COD) tasks. Previous works have proposed different multi-level feature fusion strategies. Some methods [29][30][31][32] connect features of the corresponding level in the encoder to the decoder through a transport layer. Since single-level features can only characterize information at a specific scale, this top-down connectivity greatly diminishes the ability to characterize details in low-level features. Each level of features contains rich information, and in order to retain as much of it as possible, Refs. [33][34][35] combined features from multiple levels in a fully connected or heuristic manner. However, the extensive integration of cross-scale information tends to incur high computational costs and introduce considerable noise, thus reducing the model's performance. Xia et al. [36] proposed an aggregated interaction strategy that takes three adjacent levels of features as input, feeds each into one of three branches, and flexibly integrates information from the other branches through interactive learning. The method makes better use of multi-level features, avoids the interference caused by resolution differences during feature fusion, and effectively integrates contextual information from adjacent resolutions. Zhou et al. [24] designed a cross-level fusion and propagation module.
It first fuses cross-level features through a series of convolutional layers and residual connections; the feature propagation part then allows the decoder to obtain more effective features from the encoder, improving detection performance by weighing the contributions of features from the encoder and decoder. Moreover, we consider the scale variation among the three adjacent feature levels, using the fine detail information in the low-level, high-resolution features and the rich semantic information in the high-level, low-resolution features as a complement that provides local and global information for the current-level features. In this way, the extraction and fusion of contextual semantic information across multi-level features are facilitated, providing rich feature representations for the decoder.

Boundary-Aware Learning
Edge information is increasingly used as auxiliary information to refine object segmentation boundaries, resulting in more accurate segmentation results. Ding et al. [37] suggested learning edges as additional semantic classes to enable the network to learn the boundary layout of scene segmentation effectively. Zhao et al. [3] considered the complementarity between salient edge information and salient object information and modeled both in the network; by doing so, they fully utilized the salient edge information to achieve more effective object segmentation. Zhu et al. [38] integrated boundary information into the feature space using the multi-level features of the encoder to enhance the model's sensitivity to boundaries. Zhou et al. [24], considering that only low-level features contain sufficient boundary information, designed a boundary guidance module that explicitly models boundary information by taking as input the two lowest-level encoder features, which contain rich edge details; this module aids the localization of camouflaged objects and the refinement of edges. Unlike the above methods, our proposed method recognizes that high-level semantic information can guide the model in filtering edge noise. Therefore, we propose an edge guidance module that explicitly generates the edges of the camouflaged object by combining high-level semantic information with low-level detail information and guides the refinement of camouflaged object edges by embedding the generated edge semantic features into the model.

Proposed Method
The overall framework of the proposed MFNet is shown in Figure 1, consisting of three key components: the edge guidance module, multi-level feature integration module, and context aggregation refinement module. Specifically, we use the pre-trained Res2Net-50 [39] as the backbone to extract multi-level features from an input image I ∈ R^(H×W×3), resulting in a set of features f_i, i ∈ {1, 2, 3, 4, 5}. The resolution of f_i is H/2^(i+1) × W/2^(i+1), i ∈ {1, 2, 3, 4, 5}. Next, we propose the EGM, which uses the high-level features f_5 and the low-level features f_3 to model the edge information f_e associated with the camouflaged object and obtain object-related edge semantics. Then, the proposed MFIM integrates multi-level features and edge cues, leveraging high-level semantic information and low-level detail cues to guide the extraction of global and local information by the current-layer features, facilitating feature learning and enhancing the boundary representation. Subsequently, the aggregated features are fed into the proposed CARM to effectively integrate cross-level features in a top-down manner, refining the camouflaged object detection results. Finally, we employ a multi-level supervision strategy to improve the COD performance. We describe these three key modules in detail in the following sections.

Figure 1. The whole pipeline of the proposed multi-level feature integration network (MFNet), consisting of three main components, i.e., the edge guidance module (EGM), multi-level feature integration module (MFIM), and context aggregation refinement module (CARM). Please refer to Section 3 for details.

Edge Guidance Module
Complete edge information is crucial for object localization and segmentation. However, in contrast to the method used in [24], which relies solely on low-level features to obtain edge cues, we argue that low-level features contain many irrelevant non-object edge details. Therefore, it is necessary to utilize the rich semantic information in high-level features to guide the generation of object edge features. For this purpose, we design the edge guidance module (EGM) to explicitly model the edges of camouflaged objects by combining high-level features f 5 and low-level features f 3 . As shown in Figure 1, when the features enter the EGM, two 1 × 1 convolutional layers are first used to reduce the channels. The features are then integrated by concatenation, and the integrated features are fed to a 3 × 3 convolutional layer to obtain the fused feature representation. Finally, the fused features are fed into a 1 × 1 convolutional layer and a Sigmoid function to obtain the final edge prediction map.
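The EGM flow described above can be sketched in PyTorch. This is a minimal illustration, not the paper's exact implementation: the channel widths (low_ch, high_ch, mid_ch) and the bilinear upsampling used to align f_5 with f_3 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EGM(nn.Module):
    """Sketch of the edge guidance module: two 1x1 convs reduce the channels
    of the low-level (f3) and high-level (f5) features, the high-level feature
    is upsampled to f3's resolution, the two are concatenated and fused by a
    3x3 conv, and a 1x1 conv + Sigmoid yields the edge prediction map."""
    def __init__(self, low_ch=512, high_ch=2048, mid_ch=64):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, mid_ch, 1)    # 1x1 channel reduction
        self.reduce_high = nn.Conv2d(high_ch, mid_ch, 1)  # 1x1 channel reduction
        self.fuse = nn.Conv2d(2 * mid_ch, mid_ch, 3, padding=1)  # 3x3 fusion
        self.pred = nn.Conv2d(mid_ch, 1, 1)               # 1x1 edge head

    def forward(self, f3, f5):
        # Align the high-level feature to the low-level spatial size (assumed)
        h = F.interpolate(self.reduce_high(f5), size=f3.shape[2:],
                          mode='bilinear', align_corners=False)
        l = self.reduce_low(f3)
        f_e = self.fuse(torch.cat([l, h], dim=1))   # fused edge feature f_e
        edge = torch.sigmoid(self.pred(f_e))        # final edge prediction map
        return f_e, edge
```

The fused feature f_e is what gets embedded into the MFIM as the edge cue, while the sigmoid map is supervised against the edge ground truth.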

Multi-Level Feature Integration Module
In COD tasks, high-level features typically contain rich semantic information, while low-level features contain more detailed local cues. To take both semantic and detail information into account, we propose the multi-level feature integration module (MFIM). We divide the MFIM into two cases: the first case is for f_2, f_3, and f_4, which have two adjacent feature layers, while the other case is for f_5, which has only one adjacent feature layer because of its location. In order to take full advantage of the multi-layer features and reduce the model parameters, we introduce f_1 into the network as the adjacent low-level feature of f_2. The framework of the MFIM is illustrated in Figure 2, and we denote the MFIM process as F(·). In practice, we design three branches (i.e., one current branch and two adjacent branches) in the MFIM. The current branch introduces edge cues through a channel attention module and captures rich contextual information by applying atrous convolution and asymmetric convolution with different dilation rates in parallel; for the detailed flow, see Section 3.2.1. The two adjacent branches are the adjacent low-level feature branch and the adjacent high-level feature branch. The former extracts local detail information via the spatial attention module, while the latter extracts global semantic information via the self-attention mechanism; for details, refer to Section 3.2.2. The features of the three branches are integrated via an element-wise addition operation; for details, refer to Section 3.2.3.

Current Branch
The current branch performs two main operations to process the feature f_i. First, we incorporate the edge cue f_e into the network using a channel attention (CA) module [40], which selectively amplifies or suppresses informative channels in the feature maps to explore cross-channel interactions and extract the critical information between channels; we denote the resulting feature by f_i^ca. Then, we utilize the scale-related pyramid convolution (SRPC) module to combine multi-scale information more effectively, obtaining the output feature f_i^s. Our proposed SRPC module is dedicated to multi-scale feature learning and integration. Unlike the global context module in BBSNet [41], which independently extracts information at different scales through separate branches, the SRPC module follows [42], fully considers the cross-scale interaction between adjacent branches, and increases feature scale diversity through asymmetric convolution and atrous convolution. Specifically, let f_i^k denote the output of f_i after a 1 × 1 convolutional layer. We divide f_i^k uniformly into four feature maps along the channel dimension for multi-scale learning. Each branch fuses the feature of its adjacent branch by element-wise summation (⊕) and obtains multi-scale contextual features via a 3 × 3 asymmetric convolutional layer, Conv_3^a, followed by a 3 × 3 atrous convolutional layer, Conv_3^(n_j), with dilation rate n_j. Following EDN [42], we set n_j ∈ {2, 3, 4}. Finally, the features f_i^(k_j), j ∈ {1, 2, 3, 4}, from the four branches are concatenated, passed through a 3 × 3 convolutional layer, Conv_3, and combined with a residual connection to produce the SRPC output f_i^s.
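The cross-branch interaction in the SRPC can be sketched as follows. This is a rough PyTorch approximation under stated assumptions: the first branch's dilation rate (set to 1 here), the 1x3/3x1 factorization of the asymmetric convolution, and the absence of normalization layers are all illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

def asym_conv(ch):
    # 3x3 asymmetric convolution factored as a 1x3 conv followed by a 3x1 conv
    return nn.Sequential(nn.Conv2d(ch, ch, (1, 3), padding=(0, 1)),
                         nn.Conv2d(ch, ch, (3, 1), padding=(1, 0)))

class SRPC(nn.Module):
    """Sketch of the scale-related pyramid convolution: the input is split
    into four channel groups; each group is summed with the previous branch's
    output (cross-scale interaction) and passed through an asymmetric conv
    and an atrous conv (dilation rates 2, 3, 4 for the last three branches,
    per the text). Branch outputs are concatenated, merged by a 3x3 conv,
    and added back to the input as a residual connection."""
    def __init__(self, ch=64):
        super().__init__()
        assert ch % 4 == 0
        g = ch // 4
        self.branches = nn.ModuleList(
            nn.Sequential(asym_conv(g),
                          nn.Conv2d(g, g, 3, padding=d, dilation=d))
            for d in (1, 2, 3, 4))  # dilation 1 for the first branch is assumed
        self.merge = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        groups = torch.chunk(x, 4, dim=1)
        outs, prev = [], 0
        for g_in, branch in zip(groups, self.branches):
            prev = branch(g_in + prev)   # fuse with the adjacent branch output
            outs.append(prev)
        return self.merge(torch.cat(outs, dim=1)) + x  # residual connection
```

The chained `prev` term is what realizes the cross-scale interaction between adjacent branches that distinguishes the SRPC from independent pyramid branches.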

Adjacent Branch
The adjacent branches are of two types. The first is the adjacent lower-level feature branch: lower-level features usually contain more spatial detail. The spatial attention module [40] focuses on the local information of a feature map by performing a max pooling operation along the channel dimension, so we enhance the current feature representation by extracting fine spatial details through the spatial attention module and multiplying the result element-wise (⊗) with the current branch's output f_i^s, yielding the feature f_i^sa. The second is the adjacent higher-level feature branch: higher-level features contain more contextual semantic information. The multi-dconv head transposed attention (MHTA) module [43] can effectively model long-range dependencies and thereby capture global feature information. Thus, we capture rich contextual information to enhance the global semantic representation via the MHTA module, whose output feature is f_i^mh.
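The spatial attention step on the adjacent lower-level branch can be sketched as below. The text mentions channel-wise max pooling; the additional mean-pooling branch and the 7x7 kernel follow the common CBAM formulation [40] and are assumptions here, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention sketch: pool the feature map along the
    channel dimension (max and mean), fuse the two single-channel maps with
    a conv, and use a sigmoid gate to reweight spatial positions."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        mx, _ = x.max(dim=1, keepdim=True)   # channel-wise max pooling
        mn = x.mean(dim=1, keepdim=True)     # channel-wise mean pooling (assumed)
        attn = torch.sigmoid(self.conv(torch.cat([mx, mn], dim=1)))
        return x * attn                      # element-wise multiplication
```

In the MFIM, the gated lower-level feature would then be multiplied with the current branch's output to inject fine spatial detail.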

Branches' Integration
After being processed by the current and adjacent branches, the features f_i^sa, f_i^s, and f_i^mh are obtained. They are then fused with f_i^ca (the output feature of CA) by an element-wise summation operation, producing f_i^m, the output feature of the MFIM.

Context Aggregation Refinement Module
The effective fusion of cross-level features from top to bottom often improves learning performance. To this end, we propose a context aggregation refinement module (CARM) that improves detection by making full use of contextual information to refine the features level by level. As shown in Figure 3, the CARM at the i-th stage (i ∈ {1, 2, 3}) first fuses the output feature of the CARM at the next stage (denoted f_(i+1)^c) with the feature obtained by the MFIM at the current stage (denoted f_i^m) through a concatenation operation. The concatenated result is fed to a 3 × 3 convolutional layer, and an element-wise summation with f_(i+1)^c is performed to obtain f_i^fused. After another 3 × 3 convolution, the high-dimensional features are mapped to a spatial-wise gate through a 1 × 1 convolutional layer, and the Softmax function is then used to obtain the weights, which are multiplied element-wise with f_i^fused to filter out interference; we denote the filtered feature by f_i^att. Atrous convolutional layers and asymmetric convolutional layers can obtain rich contextual semantic information through multi-scale receptive fields. Thus, we further refine the features with a 3 × 3 asymmetric convolutional layer, Conv_3^asy, and a 3 × 3 atrous convolutional layer with a dilation rate of 3, Conv_3^atr, followed by a 3 × 3 deconvolution layer with dropout, DConv_3^de, to produce the CARM output f_i^c. For the last CARM (i = 4), the input corresponding to the next-stage CARM is replaced by the output feature of the 4th MFIM.
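The fusion-and-gating step of the CARM can be sketched as follows. This is a minimal illustration under assumptions: the upsampling of the deeper feature, the channel width, and taking the Softmax over flattened spatial positions are plausible readings of "spatial-wise gate", not confirmed implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARMGate(nn.Module):
    """Sketch of CARM fusion and gating: concatenate the next-stage output
    f_c with the current MFIM feature f_m, fuse with a 3x3 conv, add f_c
    back, then map to a one-channel spatial gate via a 1x1 conv, normalize
    the gate with Softmax over spatial positions, and reweight the fused
    feature to suppress interference."""
    def __init__(self, ch=64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.post = nn.Conv2d(ch, ch, 3, padding=1)
        self.gate = nn.Conv2d(ch, 1, 1)

    def forward(self, f_m, f_c):
        # Align the deeper feature to the current resolution (assumed bilinear)
        f_c = F.interpolate(f_c, size=f_m.shape[2:], mode='bilinear',
                            align_corners=False)
        fused = self.fuse(torch.cat([f_m, f_c], dim=1)) + f_c
        x = self.post(fused)
        b, _, h, w = x.shape
        # Softmax over spatial positions gives the spatial-wise gate weights
        w_sp = torch.softmax(self.gate(x).view(b, 1, -1), dim=-1).view(b, 1, h, w)
        return fused * w_sp   # element-wise multiplication filters interference
```

The subsequent refinement by asymmetric and atrous convolutions would follow this gated output in the full module.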

Loss Function
In pixel-wise segmentation tasks, the binary cross-entropy (BCE) loss and the intersection-over-union (IoU) loss are widely used together to impose strong constraints on the local pixels and the global structure of the object. Inspired by the success of the weighted BCE loss and weighted IoU loss in [4], our detection loss is defined as L_det = L_BCE^w + L_IoU^w, where L_BCE^w and L_IoU^w denote the weighted BCE loss and weighted IoU loss, respectively. L_BCE^w focuses more on hard pixels (pixels belonging to difficult samples or misclassified in a pixel-level classification task) rather than treating all pixels equally, and L_IoU^w highlights the importance of hard pixels by increasing their weights. Meanwhile, the produced edge map is measured with the adaptive pixel intensity (API) loss [44], which emphasizes relatively important pixels (pixels adjacent to fine or explicit edges) by applying the pixel intensity to the L1 loss. Thus, our total loss is defined as L_total = Σ_i L_det(P_i, G_o) + L_API(P_e, G_e), where P_i is the predicted map of the camouflaged object at the i-th level, G_o is the ground truth of the camouflaged object, P_e is the predicted edge map of the camouflaged object, and G_e is the ground truth of the edge of the camouflaged object.
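The weighted BCE + weighted IoU combination from F3Net [4] is commonly implemented as below; this is a sketch of that reference loss, not of this paper's exact code, and the 31 × 31 pooling window used to derive the pixel weights is the choice popularized by F3Net.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss in the style of F3Net [4].
    Pixels whose local neighborhood average differs from the mask value
    (i.e., pixels near object boundaries, the 'hard' pixels) receive
    larger weights, so they dominate both loss terms.
    `pred` is the pre-sigmoid logit map; `mask` is the binary ground truth."""
    # Boundary-aware pixel weights: large where the 31x31 local mean deviates
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    # Weighted BCE term
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))
    # Weighted IoU term
    p = torch.sigmoid(pred)
    inter = (p * mask * weit).sum(dim=(2, 3))
    union = ((p + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()
```

Under multi-level supervision, this loss would be applied to each predicted map P_i against G_o and summed.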

Implementation Details
Our model is implemented in PyTorch [45] using Res2Net-50 [39] pre-trained on ImageNet as the backbone. All input images are resized to 416 × 416 and undergo data augmentation via random horizontal flipping. We set the batch size to 16 and employ the Adam optimizer [46]. We initialize the learning rate to 1 × 10^-4 and adjust it using a poly strategy with a power of 0.9. The training process, accelerated on an NVIDIA RTX 3090 GPU, takes around 3.5 h to complete 60 epochs. The source code and results will be released at https://github.com/WkangLiu/MFNet, accessed on 14 June 2023.
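The poly learning-rate schedule mentioned above follows a standard formula; a minimal sketch (the exact epoch-vs-iteration granularity used by the authors is an assumption):

```python
def poly_lr(base_lr, epoch, max_epochs, power=0.9):
    """Poly decay: lr = base_lr * (1 - epoch / max_epochs) ** power.
    With base_lr = 1e-4, power = 0.9, and max_epochs = 60 as in the text."""
    return base_lr * (1 - epoch / max_epochs) ** power

# Typical usage: recompute the rate each epoch and assign it to the optimizer,
# e.g. for g in optimizer.param_groups: g['lr'] = poly_lr(1e-4, epoch, 60)
```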

Datasets
We evaluate our model on three popular COD benchmark datasets: CAMO [47], CHAMELEON [48], and COD10K [1]. CAMO includes a total of 1250 images, of which 1000 are used as the training set and 250 as the test set. CHAMELEON contains 76 images collected from the Internet, all of which are used as the test set. COD10K is the largest dataset, containing 5066 images collected from websites and classified into 10 super-classes and 78 sub-classes, of which 3040 images are used as the training set and 2026 as the test set.

Evaluation Metrics
We use four widely used evaluation metrics to judge the accuracy of our models, including the structure measure (S_α) [49], E-measure (E_φ) [50], weighted F-measure (F_β^w) [51], and mean absolute error (MAE) [14]. In addition, we also provide precision-recall (PR) curves and F_β-threshold curves to help evaluate the model more comprehensively.
Structure Measure (S_α)

The structure measure evaluates the structural similarity in terms of its region-aware (S_r) and object-aware (S_o) components, and is defined by

S_α = α × S_o + (1 − α) × S_r,

where α is set to 0.5 by default. A higher value of S_α indicates better model performance.

E-Measure (E φ )
The E-measure jointly considers global image-level statistics and local pixel-matching information, and is defined by

E_φ = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} φ(x, y),

where W and H are the width and height of the ground truth G, (x, y) is the coordinate of each pixel in G, and φ is the enhanced alignment matrix. We obtain a set of E_φ values by converting the prediction P into binary masks with thresholds in the range [0, 255]. A higher value of E_φ indicates better model performance.

Weighted F-Measure (F w β )
The weighted F-measure considers precision and recall simultaneously, and is defined by

F_β^w = ((1 + β²) × Precision^w × Recall^w) / (β² × Precision^w + Recall^w),

where precision = |M ∩ G| / |M| and recall = |M ∩ G| / |G|, with M the binary prediction mask. β² is set to 0.3 by default, and a higher value of F_β^w indicates better model performance.

Mean Absolute Error (MAE)
The mean absolute error calculates the average pixel-level absolute difference between the ground truth and the normalized prediction, and is defined by

MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |P(x, y) − G(x, y)|,

where W and H are the width and height of the image, and P and G are the normalized prediction and ground truth, respectively. A smaller value of MAE indicates better model performance.
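The MAE formula above translates directly into code; a minimal NumPy sketch (the inputs are assumed to already be normalized to [0, 1]):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a normalized prediction map and the
    ground-truth map, averaged over all W x H pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return np.abs(pred - gt).mean()
```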

Comparison with SOTA Methods
We compare the proposed method with 16 state-of-the-art COD baselines: CPD [2], EGNet [3], F3Net [4], UCNet [5], SINet [1], PraNet [6], C2FNet [22], PFNet [21], TINet [17], UGTR [19], R-MGL [18], JCSOD [25], LSR [26], C2FNet-V2 [52], SINet-V2 [27], and BSA-Net [38]. Table 1 details the quantitative comparison between our model and the 16 state-of-the-art methods on three benchmark datasets. For a fair comparison, we use the prediction maps provided by the original authors; if not provided, we directly use their official code and models to compute the missing prediction maps. It can be clearly seen that our network significantly outperforms other advanced models in most evaluation metrics on the three datasets. For instance, on the COD10K dataset, when compared with the second-best method, BSA-Net, our method increases S_α and F_β^w by 1.9% and 3.9%, respectively, and decreases MAE by 5.9%. On the CAMO dataset, when compared with the second-best method, C2FNet-V2, our method increases S_α and F_β^w by 3.1% and 4.5%, respectively, and decreases MAE by 13.0%. Overall, our proposed method greatly improves on the state of the art. Figure 4 provides the PR curves and the F_β curves of our method and other methods on the CAMO and CHAMELEON datasets; the higher the curve, the better the performance of the model, which further demonstrates the superiority of our method.

Table 1. Quantitative evaluation results on three benchmark datasets in terms of S-measure, E-measure, weighted F-measure, and MAE scores. The best results are highlighted in bold. "↑" and "↓" indicate that larger or smaller is better, respectively.


Qualitative Evaluation
In Figure 5, we visualize some challenging scenes and the results generated by our method and other SOTA methods. It can be seen that our model accurately segments objects at different scales, including large objects (Row 1), medium objects (Row 2), small objects (Rows 3-4), and multiple objects (Rows 5-6). Meanwhile, for objects with high luminance (Row 7), occlusion (Row 8), or abundant edge details (Rows 9-10), our method also generates accurate predictions that are highly consistent with the ground truth.

Effect of Different Modules
We conduct ablation studies to investigate the effectiveness of the MFIM, CARM, and EGM in our MFNet. Because the CHAMELEON test set contains only a small number of images, poor predictions on individual images could have a large impact on the results; we therefore only report the experimental results on the CAMO and COD10K datasets. The quantitative results of the ablation experiments are summarized in Table 2.
Effect of MFIM. We first adopt a basic model without the MFIM, CARM, and EGM as a baseline (#1). The basic model consists of an encoder-decoder structure, where the encoder uses the backbone network Res2Net-50 [39] and the decoder integrates features layer by layer in a top-down manner. By comparing #1 and #2, we find that adding the MFIM significantly improves the detection accuracy. Furthermore, from Figure 6, we can see that the MFIM effectively aggregates multi-scale features. Although a small amount of noise is introduced along with the effective features, it undeniably improves the completeness of camouflaged object detection and enhances the detection accuracy.

Figure 5. Qualitative evaluation of the proposed MFNet and other SOTA methods (i.e., SINet [1], PFNet [21], UGTR [19], JCSOD [25], LSR [26], C2FNet-V2 [52], and BSA-Net [38]).

Effect of CARM.
To explore the effectiveness of the CARM, we merge the CARM into #2. As shown in Table 2, compared with #2, the performance of model #3 with the CARM added is significantly improved, as reflected in the three evaluation metrics on both the CAMO and COD10K datasets. The CARM integrates the rich features output by the MFIM layer by layer in a top-down manner and uses the high-level features to guide the low-level features, which helps the model filter out irrelevant information in the low-level features. This is also verified by the visual comparison results shown in Figure 6. Overall, the inclusion of the CARM further improves the performance of the model.

Effect of EGM.
After comparing #3 and #4 in Table 2, it can be seen that the EGM further improves the COD performance, achieving gains of 1.2% in S_α and 1.8% in E_φ and a 7.0% reduction in MAE on the CAMO dataset. In addition, #3 and #4 in Figure 6 also show that the added edge information makes the edge details of the detected camouflaged objects clearer, while the semantic ambiguity is effectively alleviated.

Effect of Different Levels of Features as Input in EGM
To verify the importance of the high-level features used in the EGM to guide edge semantic generation, we design three variants: (1) low-level features f_1 and f_2 as input to the EGM (f_1 + f_2), (2) low-level features f_1 and high-level features f_5 as input (f_1 + f_5), and (3) low-level features f_2 and high-level features f_5 as input (f_2 + f_5). We report the quantitative results in Table 3.
Following FAP-Net [24], using the low-level features f_1 and f_2 (#5) as input to the EGM yields the worst results. Meanwhile, when f_1 (#6), f_2 (#7), or f_3 (#4) is combined with f_5 to explore edges and help locate object-related edges, better results are achieved, which proves the effectiveness of using the rich semantic information of high-level features to guide the generation of object edge features. As shown in Table 3, the combination f_3 + f_5 (#4) obtains the best camouflaged object detection performance.

Effect of Different Branches in MFIM
To verify the effectiveness of the two types of branches in the MFIM, we design two variants: (1) removing current branches in the MFIM (without CB) and (2) removing adjacent branches in the MFIM (without AB). We report the quantitative results in Table 4.
The quantitative results show that the performance without CB (#8) and without AB (#9) is worse than that of our full method (#4), which confirms the effectiveness of both the current branch and the adjacent branches. Concretely, on the CAMO dataset, the model without CB degrades noticeably, e.g., S_α: 0.824 → 0.765, E_φ: 0.883 → 0.802, MAE: 0.067 → 0.090. Comparatively, the model without AB declines less significantly on the same dataset, e.g., S_α: 0.824 → 0.799, E_φ: 0.883 → 0.847, MAE: 0.067 → 0.080. A similar trend can be observed on the COD10K dataset. We attribute this to the fact that, when the current branch is removed, the local and global information of the adjacent branches cannot effectively interact with that of the current branch, which greatly reduces the performance of the model.
By observing the visualization results in Figure 7, we can find that both variants are poorly visualized compared to our model. In particular, the visualization without CB (#8) is worse than that without AB (#9), which is consistent with our previous analysis.

Table 3. Ablation analysis of three variant models modified for EGM on the CAMO and COD10K datasets. Bold: top-1 results.

Effect of Atrous Convolution and Asymmetric Convolution in CARM
To verify the necessity of atrous convolution and asymmetric convolution in the CARM, we design two variants: (1) replacing atrous convolution and asymmetric convolution with direct connection operations (with DC) and (2) replacing atrous convolution and asymmetric convolution with 3 × 3 convolutional layers (with NC). We report the quantitative results in Table 4. Compared with the other ablation analyses, the performance gap between these two variants and our full model is smaller; nevertheless, atrous convolution and asymmetric convolution (ours) are more conducive to the CARM's refinement of camouflaged objects. Reviewing the visualization results in Figure 7, it becomes evident that the three models, with DC (#10), with NC (#11), and ours (#4), exhibit an incremental improvement in their ability to detect camouflaged objects. In conclusion, the CARM based on atrous convolution and asymmetric convolution can better capture high-quality contextual semantic information over receptive fields of different sizes and shapes to refine camouflaged objects.
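As a hedged illustration of the two operations the CARM relies on, the NumPy sketch below implements a plain "valid" cross-correlation and shows (i) how a dilated 3 × 3 kernel covers a 5 × 5 receptive field and (ii) how an asymmetric 1 × 3 / 3 × 1 pair factors a separable 3 × 3 kernel. The kernels and input are toy values, not the CARM's learned weights.

```python
import numpy as np

def conv2d(x, k, dilation=1):
    """'Valid' 2D cross-correlation with optional dilation (single channel)."""
    kh, kw = k.shape
    eh = (kh - 1) * dilation + 1  # effective kernel height
    ew = (kw - 1) * dilation + 1  # effective kernel width
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * x[i * dilation : i * dilation + out.shape[0],
                               j * dilation : j * dilation + out.shape[1]]
    return out

x = np.arange(64, dtype=float).reshape(8, 8)
k3 = np.ones((3, 3)) / 9.0

# Atrous (dilated) conv: 3x3 kernel, dilation 2 -> 5x5 receptive field.
y_atrous = conv2d(x, k3, dilation=2)   # output shape (4, 4)

# Asymmetric conv: factor 3x3 averaging into 1x3 followed by 3x1.
k13 = np.ones((1, 3)) / 3.0
k31 = np.ones((3, 1)) / 3.0
y_asym = conv2d(conv2d(x, k13), k31)   # output shape (6, 6)

# For this separable kernel, 1x3 then 3x1 equals the full 3x3 result.
assert np.allclose(y_asym, conv2d(x, k3))
```

The dilated kernel enlarges the receptive field without extra parameters, while the asymmetric pair replaces the 9 weights of a square kernel with 3 + 3, which is the trade-off motivating both choices in the CARM.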

Downstream Applications
In this section, we apply MFNet to downstream tasks related to COD to evaluate its generalization ability. The datasets used for the three downstream applications are shown in Table 5.

Polyp Segmentation
A polyp is a tumorous lesion that grows in the colon. Accurate segmentation of polyps in colonoscopy images is crucial for their timely detection and prompt surgical intervention. To evaluate the effectiveness of our method in polyp segmentation, we followed the same benchmark protocol as [6], retrained our MFNet on the KvasirSEG [53] and CVC-ClinicDB [54] datasets, and tested it on the CVC-300 dataset. Figure 8a illustrates the visual results generated by our MFNet.
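Among the four evaluation metrics used in this paper, MAE is the simplest to state: the mean absolute difference between the predicted map and the binary ground truth. A minimal sketch with a toy prediction, not actual benchmark data:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted map in [0, 1] and a binary GT."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.abs(pred - gt).mean()

gt   = np.array([[1, 1], [0, 0]], dtype=float)    # toy ground-truth mask
pred = np.array([[0.9, 0.8], [0.1, 0.0]], dtype=float)  # toy prediction
print(round(mae(pred, gt), 3))  # 0.1
```

Lower is better, which is why the ablation tables read MAE increases as degradations.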

Defect Detection
Defect detection is an essential process in industrial production to ensure the quality of products. We demonstrate the effectiveness of MFNet in defect detection tasks by taking road crack detection as an example. We retrain our MFNet on the widely used CrackForest [55] dataset, using 60% of the samples for training and 40% for testing. Figure 8b presents the visual results of our approach.
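The 60%/40% split can be sketched as a simple seeded shuffle. The file names below are illustrative stand-ins, the seed is arbitrary, and the count of 118 images (commonly cited for CrackForest) is used only for illustration:

```python
import random

def split_60_40(samples, seed=0):
    """Hypothetical sketch of a 60/40 train/test split of dataset samples."""
    rng = random.Random(seed)           # seeded for reproducibility
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(0.6 * len(samples))
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test

# Stand-in file list; CrackForest is commonly distributed with 118 images.
images = [f"crack_{i:03d}.jpg" for i in range(118)]
train, test = split_60_40(images)
print(len(train), len(test))  # 70 48
```

Shuffling before cutting avoids any ordering bias in the released file list.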


Transparent Object Segmentation
In daily life and industrial production, robots and drones need to accurately identify transparent objects (such as glass, windows, etc.) that are not easily visible, in order to avoid accidents. We further investigate the effectiveness of MFNet in transparent object segmentation tasks. For convenience, we reorganize the annotations of the Trans10K [56] dataset from instance-level to object-level for training purposes. The visual results presented in Figure 8c further demonstrate the generalization ability of MFNet.
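The instance-to-object reorganization amounts to taking the union of per-instance masks into one binary object-level mask. A minimal sketch with two toy 4 × 4 masks (the real Trans10K annotations are, of course, full-resolution):

```python
import numpy as np

def instances_to_object_mask(instance_masks):
    """Merge per-instance binary masks into a single object-level mask
    (their union), sketching the annotation reorganization described above."""
    merged = np.zeros_like(instance_masks[0], dtype=bool)
    for m in instance_masks:
        merged |= m.astype(bool)       # union of all instance regions
    return merged.astype(np.uint8)

# Two hypothetical instance masks for transparent objects.
inst1 = np.zeros((4, 4), dtype=np.uint8); inst1[0:2, 0:2] = 1
inst2 = np.zeros((4, 4), dtype=np.uint8); inst2[2:4, 2:4] = 1
obj = instances_to_object_mask([inst1, inst2])
print(obj.sum())  # 8
```

After merging, every transparent pixel is labeled 1 regardless of which instance it came from, matching the object-level supervision MFNet expects.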

Conclusions
In this paper, we propose a novel multi-level feature integration network (MFNet) for the COD task. We first explicitly model edges with the proposed EGM and use the obtained edge information to guide the network to refine the camouflaged objects' edges. Secondly, we propose the MFIM to effectively integrate the complete contextual semantic information using the strong correlation of features in adjacent layers. Finally, we propose the CARM to effectively aggregate and refine the cross-layer features to obtain clear prediction maps. Through extensive experiments, we prove that our MFNet outperforms other state-of-the-art COD methods and exhibits excellent detection performance.
Author Contributions: K.L. contributed to the conceptualization, methodology, validation, data analysis, and writing of the paper; X.L. supervised the conception, reviewed the work, and approved the final manuscript; T.Q. assisted in data acquisition, review, and editing; Y.Y. helped in interpreting results, critical revisions, and theoretical framework verification; S.L. assisted in provision of study materials, grammar and spellchecking, and additional experiments. All authors have read and agreed to the published version of the manuscript.