Article

Edge-Guided Camouflaged Object Detection via Multi-Level Feature Integration

Key Laboratory of Signal Detection and Processing, Department of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(13), 5789; https://doi.org/10.3390/s23135789
Submission received: 18 April 2023 / Revised: 30 May 2023 / Accepted: 15 June 2023 / Published: 21 June 2023

Abstract

Camouflaged object detection (COD) aims to segment camouflaged objects that blend perfectly into their surroundings. Due to the low boundary contrast between camouflaged objects and their surroundings, their detection poses a significant challenge. Despite the numerous excellent camouflaged object detection methods developed in recent years, issues such as boundary refinement and multi-level feature extraction and fusion still need further exploration. In this paper, we propose a novel multi-level feature integration network (MFNet) for camouflaged object detection. Firstly, we design an edge guidance module (EGM) that models the edges of camouflaged objects by combining high-level semantic information with low-level spatial details, providing additional boundary semantics that improve the COD performance. Additionally, we propose a multi-level feature integration module (MFIM), which leverages the fine local information of low-level features and the rich global information of high-level features among adjacent three-level features to provide a supplementary feature representation for the current-level features, effectively integrating the full contextual semantic information. Finally, we propose a context aggregation refinement module (CARM) to efficiently aggregate and refine the cross-level features and obtain clear prediction maps. Our extensive experiments on three benchmark datasets show that the MFNet model is an effective COD model and outperforms other state-of-the-art models in all four evaluation metrics ($S_\alpha$, $E_\phi$, $F_\beta^w$, and MAE).

1. Introduction

In nature, organisms use body color, texture, and coverings to conceal themselves within their surroundings, thereby avoiding detection by predators. Camouflaged object detection (COD) is an emerging computer vision segmentation task that aims to segment these objects that blend perfectly into their surroundings [1]. Unlike salient object detection [2,3,4,5], which segments salient objects with distinct boundaries and high contrast with their backgrounds, camouflaged objects often lack clear visual boundaries with their backgrounds and may be obscured by other objects in their surroundings, which makes accurate camouflaged object detection more challenging. Nevertheless, due to its significant research implications and wide application in medical image processing (e.g., polyp segmentation [6], lung infection segmentation [7]), pest detection [8], defect detection [9], and underwater object detection [10], COD has become a research hotspot.
Traditional camouflaged object detection methods [11,12,13,14,15] typically rely on hand-crafted features (textures, colors, edges, etc.) to differentiate camouflaged objects from their surroundings. However, due to their limited ability to extract and analyze high-level semantic information, these methods tend to perform poorly in challenging COD scenarios. In recent years, in contrast, deep-learning-based methods have advanced this field considerably. Fan et al. [1] made a substantial contribution by comprehensively studying the COD task and creating the COD10K dataset. They proposed a search and recognition network called SINet. This network simulates a predator's predation process in nature to decouple COD into two stages, rough localization and accurate segmentation, thereby achieving the precise segmentation of camouflaged objects. Subsequently, a series of works [16,17,18,19,20,21] have further explored this area.
While the existing methods achieve decent detection performance in most scenarios, there is still considerable room for improvement when dealing with highly challenging situations. In the COD task, multi-level feature fusion also plays a vital role [22]. Convolutional neural networks (CNNs) are known to extract low-level image features through the initial layers. As the network further deepens and processes these low-level features, it effectively captures and incorporates rich semantic information into higher-level features. Additionally, some methods [1,21,23] tend to prioritize the localization of the camouflaged object, overlooking the significance of edge refinement.
In response to the aforementioned issues, we propose a general framework for COD called the multi-level feature integration network (MFNet), which focuses on learning and integrating multi-level contextual features from input images. Specifically, we introduce the edge guidance module (EGM), which generates the edges of camouflaged objects by combining high-level semantic information and low-level spatial details, and guides the network to obtain a more explicit depiction of the edges of camouflaged objects. In addition, we design a multi-level feature integration module (MFIM) to extract and integrate high-level semantic and spatial details in multi-level features. To efficiently aggregate and refine cross-level contextual information, we propose a context aggregation refinement module (CARM) to filter interference information via the attention mechanism [24] and capture and refine multi-scale contextual information via atrous convolution and asymmetric convolution. Our proposed method can significantly improve the detection performance compared to the state-of-the-art methods. The main contributions of this paper are four-fold:
  • We propose the novel MFNet to investigate the effectiveness of adjacent layer feature integration in the COD task and confirm the rationality of fully capturing contextual information through the interaction of adjacent layer features;
  • We propose an edge guidance module to explicitly learn object edge representations and guide the model to discriminate camouflaged objects’ edges effectively;
  • We propose a multi-level feature integration module to efficiently extract and integrate global semantic information and local detail information in multi-scale features;
  • We propose a context aggregation refinement module to aggregate and refine cross-layer features via the attention mechanism, atrous convolution, and asymmetric convolution.

2. Related Work

2.1. Camouflaged Object Detection

Traditional COD methods segment camouflaged objects by extracting hand-crafted features such as texture, color, edge, contrast, and 3D convexity in camouflage scenes [11,12,13,14,15]. However, such features are often ineffective in complex scenes.
With the introduction of the large-scale COD dataset COD10K and the animal-predation-inspired baseline SINet by Fan et al. [1], more and more deep-learning-based COD methods have emerged. For instance, Mei et al. [21] proposed PFNet, which uses high-level semantic information to roughly localize camouflaged objects and then removes the false-positive and false-negative regions that would otherwise disturb the segmentation results. Zhai et al. [18] proposed MGL, which models the localization and refinement processes in camouflaged object detection through graph convolutional networks. Li et al. [25] proposed JCSOD, which employs a joint adversarial learning framework to perform both salient object detection (SOD) and COD tasks to improve the accuracy and robustness of camouflaged object detection. Yang et al. [19] proposed UGTR, a method that combines a convolutional neural network and a Transformer and leverages a probabilistic representation model to learn the uncertainty of camouflaged objects within the Transformer framework. This enables the model to pay more attention to uncertain regions, leading to more precise segmentation. Lv et al. [26] proposed LSR, a novel COD network that simultaneously localizes and segments camouflaged objects and ranks them according to their detectability. Fan et al. [27] proposed SINet-V2, which first introduced group reverse attention to address the COD problem and obtained excellent detection performance by combining it with the location information provided by the neighbor connection decoder module. Recently, Pang et al. [28] proposed a camouflaged object detection model called ZoomNet, which mimics human behavior (zooming in and out) when observing blurred images, using a scaling strategy to learn mixed-scale semantics through scale integration and hierarchical scale mixing.

2.2. Multi-Level Feature Fusion

Multi-level feature fusion strategies have been widely used in detection and segmentation tasks. Integrating various levels of feature information enables the effective extraction of contextual semantic information, which enhances the learning capability of the model. Furthermore, coordinating high-level semantic features with low-level fine details is crucial in camouflaged object detection (COD) tasks. Previous works have proposed different multi-level feature fusion strategies. Some methods [29,30,31,32] connect features of the corresponding level in the encoder to the decoder through the transport layer. Since single-level features can only characterize information at a specific scale, this top-down connectivity greatly diminishes the ability to characterize details in low-level features. Each level of features contains rich information, and in order to retain as much of it as possible, Refs. [33,34,35] combined features from multiple levels in a fully connected or heuristic manner. However, the extensive integration of cross-scale information tends to incur high computational costs and introduce considerable noise, thus reducing the model's performance. Pang et al. [36] proposed an aggregated interaction strategy that takes adjacent three-level features as input, feeds them into three branches, and flexibly integrates the information from the other branches into each branch through interactive learning. This method makes better use of multi-level features, avoids the interference caused by resolution differences in feature fusion, and effectively integrates contextual information from adjacent resolutions. Zhou et al. [24] designed a cross-level fusion and propagation module: it first fuses cross-level features through a series of convolutional layers and residual connections, and its feature propagation part then allows the decoder to obtain more effective features from the encoder, improving the detection performance by weighing the contributions of features from the encoder and decoder. Differently from these methods, we consider the scale variation among adjacent three-level features, using the fine detail information in the low-level, high-resolution features and the rich semantic information in the high-level, low-resolution features as complements that provide local and global information for the current-level features. In this way, the extraction and fusion of the contextual semantic information of multi-level features are facilitated, thus providing rich feature representations for the decoder.

2.3. Boundary-Aware Learning

Edge information is increasingly used as auxiliary information to refine object segmentation boundaries, resulting in more accurate segmentation results. Ding et al. [37] suggested learning edges as an additional semantic class to enable the network to learn the boundary layout of scene segmentation effectively. Zhao et al. [3] considered the complementarity between salient edge information and salient object information and modeled both in the network; by doing so, they fully utilized the salient edge information to achieve more effective object segmentation. Zhu et al. [38] attempted to integrate the boundary information into the feature space using multi-level features of the encoder to enhance the sensitivity of the model to the boundary. Zhou et al. [24], assuming that the low-level features alone contain sufficient boundary information, designed a boundary guidance module that explicitly models boundary information from the two lowest-level encoder features, which contain rich edge details; this module aids the localization of camouflaged objects and the refinement of edges. Unlike the above methods, our proposed method considers that high-level semantic information can guide the model to filter out edge noise. Therefore, we propose an edge guidance module that explicitly generates the edges of the camouflaged object by combining high-level semantic information and low-level detail information and guides the refinement of camouflaged object edges by embedding the generated edge semantic features into the model.

3. Proposed Method

The overall framework of the proposed MFNet is shown in Figure 1, consisting of three key components: the edge guidance module, the multi-level feature integration module, and the context aggregation refinement module. Specifically, we use the pre-trained Res2Net-50 [39] as the backbone to extract multi-level features from an input image $I \in \mathbb{R}^{H \times W \times 3}$, resulting in a set of features $f_i$, $i \in \{1, 2, 3, 4, 5\}$, where the resolution of $f_i$ is $H/2^{i+1} \times W/2^{i+1}$. Next, we propose the EGM, which uses the high-level feature $f_5$ and the low-level feature $f_3$ to model the edge information $f_e$ associated with the camouflaged object and obtain object-related edge semantics. Then, the proposed MFIM integrates multi-level features and edge cues, leveraging high-level semantic information and low-level detail cues to guide the extraction of global and local information by the current-level features, facilitating feature learning and enhancing the boundary representation. Subsequently, the aggregated features are fed into the proposed CARM to effectively integrate cross-level features in a top-down manner, refining the camouflaged object detection results. Finally, we employ a multi-level supervision strategy to improve the COD performance. We describe the three key modules in detail in the following subsections.
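To make the data flow concrete, the sketch below wires the components together in PyTorch. The class, argument names, and per-level calls are illustrative assumptions based on Figure 1 and the module descriptions that follow, not the authors' released code; prediction heads (1-channel convolutions on each stage output) are omitted.

```python
import torch.nn as nn

class MFNetSketch(nn.Module):
    """Illustrative wiring of backbone -> EGM -> MFIM -> CARM; not the authors' released code."""

    def __init__(self, backbone, egm, mfim2, mfim3, mfim4, mfim5, carms):
        super().__init__()
        self.backbone = backbone                 # returns the five features f1..f5
        self.egm = egm                           # edge guidance module
        self.mfim2, self.mfim3 = mfim2, mfim3    # multi-level feature integration modules
        self.mfim4, self.mfim5 = mfim4, mfim5
        self.carms = nn.ModuleList(carms)        # top-down context aggregation refinement stages

    def forward(self, image):
        f1, f2, f3, f4, f5 = self.backbone(image)

        # Edge cue from a low-level and a high-level feature (Section 3.1).
        f_e, edge_map = self.egm(f3, f5)

        # Integrate each level with its adjacent levels and the edge cue (Section 3.2).
        m2 = self.mfim2(f1, f2, f3, f_e)
        m3 = self.mfim3(f2, f3, f4, f_e)
        m4 = self.mfim4(f3, f4, f5, f_e)
        m5 = self.mfim5(f4, f5, f_e)             # f5 has only one adjacent level

        # Top-down refinement; the deepest stage takes the corresponding MFIM output
        # in place of a higher-stage CARM output (Section 3.3).
        c = m5
        side_outputs = []
        for carm, m in zip(self.carms, (m4, m3, m2)):
            c = carm(m, c)
            side_outputs.append(c)
        return side_outputs, edge_map            # multi-level supervision + edge supervision
```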

3.1. Edge Guidance Module

Complete edge information is crucial for object localization and segmentation. However, in contrast to the method used in [24], which relies solely on low-level features to obtain edge cues, we argue that low-level features contain many irrelevant non-object edge details. Therefore, it is necessary to utilize the rich semantic information in high-level features to guide the generation of object edge features. For this purpose, we design the edge guidance module (EGM) to explicitly model the edges of camouflaged objects by combining the high-level feature $f_5$ and the low-level feature $f_3$. As shown in Figure 1, when the features enter the EGM, two 1 × 1 convolutional layers are first used to reduce the channels. The features are then integrated by concatenation, and the integrated features are fed to a 3 × 3 convolutional layer to obtain the fused feature representation. Finally, the fused features are fed into a 1 × 1 convolutional layer and a Sigmoid function to obtain the final edge prediction map.
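This description maps directly onto a small convolutional block. The following PyTorch sketch follows that flow; the channel widths, the normalization layers, and the bilinear upsampling of $f_5$ to the resolution of $f_3$ are assumptions not stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EGM(nn.Module):
    """Edge guidance module sketch: fuse low-level f3 and high-level f5 into an edge map."""

    def __init__(self, low_ch, high_ch, mid_ch=64):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, mid_ch, kernel_size=1)    # 1x1 conv, channel reduction
        self.reduce_high = nn.Conv2d(high_ch, mid_ch, kernel_size=1)  # 1x1 conv, channel reduction
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * mid_ch, mid_ch, kernel_size=3, padding=1),  # 3x3 fusion conv
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        self.predict = nn.Conv2d(mid_ch, 1, kernel_size=1)            # 1x1 conv -> edge logits

    def forward(self, f3, f5):
        low = self.reduce_low(f3)
        high = self.reduce_high(f5)
        # Assumed: bring f5 up to f3's spatial size before concatenation.
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear", align_corners=False)
        f_e = self.fuse(torch.cat([low, high], dim=1))                # edge feature fed to the MFIM
        edge_map = torch.sigmoid(self.predict(f_e))                   # final edge prediction map
        return f_e, edge_map
```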

3.2. Multi-Level Feature Integration Module

In COD tasks, high-level features typically contain rich semantic information, while low-level features contain more detailed local clues. To take both semantic and detailed information into account, we propose the multi-level feature integration module (MFIM). We divide the MFIM into two cases: the first is for $f_2$, $f_3$, and $f_4$, each of which has two adjacent feature levels, while the other is for $f_5$, which has only one adjacent feature level because of its position. In order to take full advantage of the multi-level features and reduce the model parameters, we introduce $f_1$ into the network as the adjacent lower-level feature of $f_2$. The framework of the MFIM is illustrated in Figure 2. If we define the MFIM process as $F(\cdot)$, it can be described as follows:
$$f_i^{m} = \begin{cases} F(f_{i-1},\ f_i,\ f_{i+1}), & i = 2, 3, 4 \\ F(f_{i-1},\ f_i), & i = 5 \end{cases}$$
where $f_i^{m} \in \mathbb{R}^{h_i \times w_i \times c_i}$ is the output feature of the MFIM, and $f_{i-1}$, $f_i$, and $f_{i+1}$ are the adjacent lower-level feature, the current-level feature, and the adjacent higher-level feature, respectively.
In practice, we design three branches (i.e., one current branch and two adjacent branches) in MFIM. The current branch introduces edge cues through a channel attention module and captures rich contextual information by using atrous convolution and asymmetric convolution with different dilation rates in parallel; for the detailed flow, see Section 3.2.1. The two adjacent branches are the adjacent low-level feature branch and the adjacent high-level feature branch. The adjacent low-level feature branch extracts local detail information via the spatial attention module, while the adjacent high-level feature branch extracts global semantic information via the self-attention mechanism; for details, refer to Section 3.2.2. The features of the three branches are integrated via element-by-element addition operation; for details, refer to Section 3.2.3.

3.2.1. Current Branch

The current branch performs two main operations to process the feature $f_i$. Firstly, we incorporate the edge cue $f_e$ into the network using a channel attention (CA) module [40], which selectively amplifies or suppresses informative channels in the feature maps to explore cross-channel interactions and extract critical inter-channel information. This process can be formulated as follows:
$$f_i^{ca} = CA(f_i,\ f_e),$$
where $f_i^{ca}$ is the output feature of the CA module and $f_e$ is the edge feature. Then, we utilize the scale-related pyramid convolution (SRPC) module to combine multi-scale information more effectively. This process can be expressed as follows:
$$f_i^{s} = SRPC(f_i^{ca}),$$
where $f_i^{s}$ is the output feature of the SRPC module. Our proposed SRPC module is dedicated to multi-scale feature learning and integration. Unlike the global context module in BBSNet [41], which independently extracts information at different scales through separate branches, the SRPC module, inspired by [42], fully considers the cross-scale interaction between adjacent branches and increases the feature scale diversity through asymmetric convolution and atrous convolution. Specifically, taking the output feature $f_i^{k}$ of $f_i$ after a 1 × 1 convolutional layer as an example, we divide $f_i^{k}$ uniformly into four feature maps along the channel dimension ($f_i^{k_1}$, $f_i^{k_2}$, $f_i^{k_3}$, $f_i^{k_4}$) for multi-scale learning. The features of adjacent branches are then fused, and multi-scale contextual features are obtained via a series of atrous convolutional layers and asymmetric convolutional layers. This process can be formulated as follows:
$$f_i^{k_j} = \begin{cases} Conv_3^{a}\big(f_i^{k_j} \oplus f_i^{k_{j+1}}\big), & j = 1 \\ Conv_3^{n_j}\big(f_i^{k_{j-1}} \oplus f_i^{k_j} \oplus f_i^{k_{j+1}}\big), & j = 2, 3 \\ Conv_3^{n_j}\big(f_i^{k_{j-1}} \oplus f_i^{k_j}\big), & j = 4 \end{cases}$$
where $\oplus$ is element-wise summation, $Conv_3^{a}$ is a 3 × 3 asymmetric convolutional layer, and $Conv_3^{n_j}$ is a 3 × 3 atrous convolutional layer with a dilation rate of $n_j$. Referring to EDN [42], we set $n_j \in \{2, 3, 4\}$. Finally, the features $f_i^{k_j}$, $j \in \{1, 2, 3, 4\}$, from the four branches are concatenated, passed through a residual connection, and fed into a 3 × 3 convolutional layer. This process can be formulated as follows:
$$f_i^{s} = Conv_3\big(Concat(f_i^{k_1}, f_i^{k_2}, f_i^{k_3}, f_i^{k_4})\big),$$
where $Concat(\cdot)$ is the concatenation operation, $Conv_3$ is a 3 × 3 convolutional layer, and $f_i^{s}$ is the output feature of the SRPC module.
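A possible PyTorch realization of the SRPC module is sketched below. The realization of the asymmetric convolution as a 1 × 3 followed by a 3 × 1 convolution, the sequential reuse of already-processed neighbouring chunks, and the placement of the residual connection are interpretations of the description above rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class SRPC(nn.Module):
    """Scale-related pyramid convolution sketch; hyper-parameters are one interpretation."""

    def __init__(self, channels):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)
        # Branch 1: 3x3 asymmetric convolution, realised here as 1x3 followed by 3x1.
        self.asym = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0)),
        )
        # Branches 2-4: 3x3 atrous convolutions with dilation rates 2, 3, 4.
        self.atrous = nn.ModuleList(
            [nn.Conv2d(c, c, kernel_size=3, padding=d, dilation=d) for d in (2, 3, 4)]
        )
        self.out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        k = self.reduce(x)                                  # f_i^k
        k1, k2, k3, k4 = torch.chunk(k, 4, dim=1)
        # Sequential cross-scale fusion of adjacent chunks (interpretation of the piecewise equation).
        b1 = self.asym(k1 + k2)
        b2 = self.atrous[0](b1 + k2 + k3)
        b3 = self.atrous[1](b2 + k3 + k4)
        b4 = self.atrous[2](b3 + k4)
        # Concatenate the four branches, add a residual connection (assumed to be f_i^k),
        # and apply the final 3x3 convolution.
        return self.out(torch.cat([b1, b2, b3, b4], dim=1) + k)
```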

3.2.2. Adjacent Branch

Adjacent branches can be divided into two types. The first is the adjacent lower-level feature branch; lower-level features usually contain more spatial detail information. The spatial attention module [40] focuses on the local information of the feature map by performing a max pooling operation along the channel dimension, so we enhance the current feature representation by extracting fine spatial details through the spatial attention module, which can be computed as
$$f_i^{sa} = SA(f_{i-1}) \otimes f_i^{s},$$
where $\otimes$ is element-wise multiplication, $f_i^{sa}$ is the output feature of the adjacent lower-level feature branch, and $f_i^{s}$ is the output feature of the current branch. The second is the adjacent higher-level feature branch; higher-level features contain more contextual semantic information. The multi-dconv head transposed attention (MHTA) module [43] can effectively model long-range dependencies and thereby capture global feature information. Thus, we capture rich contextual information to enhance the global semantic representation via the MHTA module, which can be computed as
$$f_i^{mh} = MHTA(f_{i+1}) \otimes f_i^{s},$$
where $f_i^{mh}$ is the output feature of the adjacent higher-level feature branch.
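The sketch below illustrates the two adjacent branches. The spatial attention follows the text in using only channel-wise max pooling, the transposed attention is a simplified single-head variant of MHTA [43], and the resizing of the neighbouring features and the shared channel width across levels are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention; per the text, only channel-wise max pooling is used here."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled, _ = torch.max(x, dim=1, keepdim=True)        # max over channels -> (B, 1, H, W)
        return torch.sigmoid(self.conv(pooled))              # spatial attention map

class TransposedAttention(nn.Module):
    """Simplified single-head variant of multi-dconv head transposed attention (MHTA)."""

    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Sequential(
            nn.Conv2d(channels, channels * 3, kernel_size=1),
            nn.Conv2d(channels * 3, channels * 3, kernel_size=3, padding=1, groups=channels * 3),
        )
        self.project = nn.Conv2d(channels, channels, kernel_size=1)
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q, k, v = (t.flatten(2) for t in (q, k, v))                          # (B, C, HW)
        attn = torch.softmax(
            F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2) * self.temperature,
            dim=-1,
        )                                                                    # (B, C, C) channel attention
        out = (attn @ v).view(b, c, h, w)
        return self.project(out)

def adjacent_branches(f_low, f_high, f_s, sa, mhta):
    """Gate the current-branch feature f_s with cues from the two adjacent levels.
    Resizing the neighbours to f_s's resolution is an assumption not detailed in the text."""
    f_low = F.interpolate(f_low, size=f_s.shape[2:], mode="bilinear", align_corners=False)
    f_high = F.interpolate(f_high, size=f_s.shape[2:], mode="bilinear", align_corners=False)
    f_sa = sa(f_low) * f_s          # adjacent lower-level branch: local detail cue
    f_mh = mhta(f_high) * f_s       # adjacent higher-level branch: global semantic cue
    return f_sa, f_mh
```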

3.2.3. Branches’ Integration

After being processed by the current and adjacent branches, the features $f_i^{sa}$, $f_i^{s}$, and $f_i^{mh}$ are obtained. They are then fused with $f_i^{ca}$ (the output feature of the CA module) by element-wise summation, which can be defined as follows:
$$f_i^{m} = \begin{cases} f_i^{sa} \oplus f_i^{s} \oplus f_i^{ca} \oplus f_i^{mh}, & i = 2, 3, 4 \\ f_i^{s} \oplus f_i^{ca} \oplus f_i^{mh}, & i = 5 \end{cases}$$
where $f_i^{m}$ is the output feature of the MFIM.
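Reusing the SRPC, SpatialAttention, TransposedAttention, and adjacent_branches sketches above, the branch assembly for the levels $i = 2, 3, 4$ could look as follows. How the edge cue enters the channel attention module is an assumption (concatenation before CBAM-style channel attention), the common channel width is a placeholder, and the single-neighbour variant for $f_5$ is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CBAM-style channel attention, used here to inject the edge cue into the current branch."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(avg + mx)

class MFIM(nn.Module):
    """Branch assembly for i = 2, 3, 4; the wiring of the edge cue is an assumption."""

    def __init__(self, channels, edge_channels):
        super().__init__()
        self.edge_fuse = nn.Conv2d(channels + edge_channels, channels, kernel_size=1)
        self.ca = ChannelAttention(channels)
        self.srpc = SRPC(channels)
        self.sa = SpatialAttention()
        self.mhta = TransposedAttention(channels)

    def forward(self, f_low, f_cur, f_high, f_e):
        f_e = F.interpolate(f_e, size=f_cur.shape[2:], mode="bilinear", align_corners=False)
        x = self.edge_fuse(torch.cat([f_cur, f_e], dim=1))
        f_ca = self.ca(x) * x                           # current branch: edge-aware channel attention
        f_s = self.srpc(f_ca)                           # current branch: multi-scale context
        f_sa, f_mh = adjacent_branches(f_low, f_high, f_s, self.sa, self.mhta)
        return f_sa + f_s + f_ca + f_mh                 # element-wise integration of all branches
```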

3.3. Context Aggregation Refinement Module

The effective fusion of cross-level features from top to bottom often improves the learning performance. To this end, we propose a context aggregation refinement module (CARM) that improves the detection results by making full use of the contextual information to refine the features level by level. As shown in Figure 3, the CARM at the $i$-th ($i \in \{1, 2, 3\}$) stage first fuses the output feature of the CARM at the next stage (denoted as $f_{i+1}^{c}$) with the feature produced by the MFIM at the current stage (denoted as $f_i^{m}$) through a concatenation operation. The concatenated result is fed to a 3 × 3 convolutional layer and added element-wise to $f_{i+1}^{c}$. After another 3 × 3 convolution, the high-dimensional features are mapped to a spatial-wise gate through a 1 × 1 convolutional layer, and the Softmax function is then used to obtain weights that are multiplied element-wise with the fused feature $f_i^{fused}$ to filter out interference. We can express this process as follows:
$$f_i^{fused} = Conv_3\big(Conv_3\big(Concat(f_i^{m},\ f_{i+1}^{c})\big) \oplus f_{i+1}^{c}\big),$$
$$f_i^{att} = Softmax\big(Conv_1(f_i^{fused})\big) \otimes f_i^{fused},$$
where $Softmax(\cdot)$ is the Softmax function and $f_i^{att}$ is the filtered feature. Atrous and asymmetric convolutional layers can capture rich contextual semantic information through multi-scale receptive fields. Thus, we further refine the features with atrous convolution and asymmetric convolution, which can be defined as
$$f_i^{a} = Conv_3\big(Concat(Conv_3^{asy}(f_i^{att}),\ Conv_3^{atr}(f_i^{att}))\big),$$
$$f_i^{c} = DConv_3^{de}(f_i^{a}),$$
where $Conv_3^{asy}$ is a 3 × 3 asymmetric convolutional layer, $Conv_3^{atr}$ is a 3 × 3 atrous convolutional layer with a dilation rate of 3, $DConv_3^{de}$ is a 3 × 3 deconvolution layer followed by a dropout layer, and $f_i^{c}$ is the output feature of the CARM. For the last CARM (i = 4), the input corresponding to the next CARM is replaced by the output feature of the 4th MFIM.
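The CARM equations translate into the following sketch. The stride-2 transposed convolution (so that successive stages line up spatially), the dropout rate, and the application of the Softmax over spatial positions are assumptions consistent with, but not dictated by, the description above.

```python
import torch
import torch.nn as nn

class CARM(nn.Module):
    """Context aggregation refinement module sketch; layer hyper-parameters are assumptions."""

    def __init__(self, channels, dropout=0.1):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)   # after concatenation
        self.post = nn.Conv2d(channels, channels, kernel_size=3, padding=1)       # after the residual add
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)                          # spatial-wise gate
        # 3x3 asymmetric convolution (realised as 1x3 then 3x1) and 3x3 atrous convolution (dilation 3).
        self.asym = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
        )
        self.atr = nn.Conv2d(channels, channels, kernel_size=3, padding=3, dilation=3)
        self.refine = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        # Assumed stride-2 transposed convolution so that successive stages line up spatially.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.Dropout2d(dropout),
        )

    def forward(self, f_m, f_next):
        # Fuse the current MFIM feature with the output of the deeper CARM stage.
        fused = self.post(self.fuse(torch.cat([f_m, f_next], dim=1)) + f_next)
        # Spatial-wise gating: 1x1 conv, softmax over spatial positions, element-wise reweighting.
        b, _, h, w = fused.shape
        weights = torch.softmax(self.gate(fused).view(b, 1, -1), dim=-1).view(b, 1, h, w)
        f_att = weights * fused
        # Multi-receptive-field refinement with asymmetric and atrous convolutions.
        f_a = self.refine(torch.cat([self.asym(f_att), self.atr(f_att)], dim=1))
        # Deconvolution followed by dropout produces the stage output.
        return self.deconv(f_a)
```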

3.4. Loss Function

In pixel-wise segmentation tasks, the binary cross-entropy (BCE) loss and the intersection-over-union (IoU) loss are widely used together to provide strong constraints on the local pixels and the global structure of the object. Inspired by the success of the weighted IoU loss and weighted BCE loss in [4], our detection loss function is defined as
$$\mathcal{L}_{det} = \mathcal{L}_{BCE}^{w} + \mathcal{L}_{IOU}^{w},$$
where $\mathcal{L}_{BCE}^{w}$ and $\mathcal{L}_{IOU}^{w}$ denote the weighted BCE loss and the weighted IoU loss, respectively. $\mathcal{L}_{IOU}^{w}$ highlights the importance of hard pixels (difficult or easily misclassified pixels in a pixel-level classification task) by increasing their weights, and $\mathcal{L}_{BCE}^{w}$ focuses more on hard pixels rather than treating all pixels equally. Meanwhile, the produced edge map is supervised with the adaptive pixel intensity (API) loss [44], which distinguishes relatively important pixels (pixels adjacent to fine or explicit edges) by applying the pixel intensity to the L1 loss. Thus, our total loss can be defined as
$$\mathcal{L}_{total} = \sum_{i=1}^{4} \mathcal{L}_{det}(P_i, G_o) + \mathcal{L}_{edge}(P_e, G_e),$$
where $P_i$ is the predicted map of the camouflaged object, $G_o$ is the ground truth of the camouflaged object, $P_e$ is the predicted edge map of the camouflaged object, and $G_e$ is the ground-truth edge of the camouflaged object.
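As a concrete reference, the weighted BCE + weighted IoU combination can be written as below, following the publicly known F3Net [4] formulation; the neighbourhood size used for the pixel weights and the stand-in `edge_loss_fn` for the API loss [44] are assumptions, not details taken from this paper.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU detection loss in the spirit of F3Net [4]."""
    # Pixels that disagree with their local neighbourhood (hard pixels near boundaries) get larger weights.
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(side_preds, gt, edge_pred, edge_gt, edge_loss_fn):
    """Detection loss over the four side outputs plus the edge loss.
    `edge_loss_fn` stands in for the adaptive pixel intensity (API) loss of [44]."""
    return sum(structure_loss(p, gt) for p in side_preds) + edge_loss_fn(edge_pred, edge_gt)
```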

4. Experiments

4.1. Implementation Details

Our model is implemented in PyTorch [45] using Res2Net-50 [39] pre-trained on ImageNet as the backbone. All input images are resized to 416 × 416 and undergo data augmentation via random horizontal flipping. We set the batch size to 16 and employ the Adam optimizer [46]. We initialize the learning rate to $1 \times 10^{-4}$ and adjust it using a poly strategy with a power of 0.9. The training process, using an NVIDIA RTX 3090 GPU for acceleration, takes around 3.5 h to complete for 60 epochs. The source code and results will be released at https://github.com/WkangLiu/MFNet, accessed on 14 June 2023.
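A minimal training skeleton reflecting these settings might look as follows; the exact form of the poly schedule and the data-loading details are assumptions, and `model` and `criterion` stand in for an instantiated MFNet and its loss.

```python
import torch

# Hyper-parameters as stated above.
EPOCHS, BASE_LR, POWER = 60, 1e-4, 0.9

def poly_lr(epoch, base_lr=BASE_LR, max_epochs=EPOCHS, power=POWER):
    """Polynomial decay: base_lr * (1 - epoch / max_epochs) ** power."""
    return base_lr * (1 - epoch / max_epochs) ** power

def train(model, train_loader, criterion, device="cuda"):
    """Minimal training loop sketch for the settings described in Section 4.1."""
    optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR)
    for epoch in range(EPOCHS):
        for group in optimizer.param_groups:
            group["lr"] = poly_lr(epoch)                  # poly learning-rate adjustment
        for images, masks in train_loader:                # 416x416 inputs, batch size 16, random flips
            preds = model(images.to(device))
            loss = criterion(preds, masks.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```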

4.2. Datasets

We evaluate our model on three popular COD benchmark datasets: CAMO [47], CHAMELEON [48], and COD10K [1]. CAMO includes 1250 images in total, of which 1000 are used for training and 250 for testing. CHAMELEON contains 76 images collected from the Internet, all of which are used for testing. COD10K is the largest dataset, containing 5066 images collected from websites and classified into 10 super-classes and 78 sub-classes, of which 3040 are used for training and 2026 for testing.

4.3. Evaluation Metrics

We use four widely used evaluation metrics to judge the accuracy of our models: the structure measure ($S_\alpha$) [49], the E-measure ($E_\phi$) [50], the weighted F-measure ($F_\beta^w$) [51], and the mean absolute error (MAE) [14]. In addition, we also provide precision–recall (PR) curves and $F_\beta$–threshold curves to help evaluate the model more comprehensively.

4.3.1. Structure Measure ($S_\alpha$)

The structure measure evaluates the structural similarity in terms of object-aware ($S_o$) and region-aware ($S_r$) components, and it is defined by
$$S_\alpha = \alpha \times S_o + (1 - \alpha) \times S_r,$$
where $\alpha$ is set to 0.5 by default; a higher value of $S_\alpha$ indicates better model performance.

4.3.2. E-Measure ($E_\phi$)

The E-measure jointly considers global image-level statistics and local pixel-level matching information, and it is defined by
$$E_\phi = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \phi\big(P(x, y),\ G(x, y)\big),$$
where $W$ and $H$ are the width and height of the ground truth $G$, and $(x, y)$ is the coordinate of each pixel in $G$. The symbol $\phi$ denotes the enhanced alignment matrix. We obtain a set of $E_\phi$ values by converting the prediction $P$ into binary masks with thresholds in the range of [0, 255]. A higher value of $E_\phi$ indicates better model performance.

4.3.3. Weighted F-Measure ($F_\beta^w$)

The weighted F-measure considers precision and recall simultaneously and is defined by
$$F_\beta^{w} = \frac{(1 + \beta^2) \times Precision^{w} \times Recall^{w}}{\beta^2 \times Precision^{w} + Recall^{w}},$$
where $Precision = \frac{|M \cap G|}{|M|}$ and $Recall = \frac{|M \cap G|}{|G|}$, with $M$ the binary prediction mask and $G$ the ground truth. Meanwhile, $\beta^2$ is set to 0.3 by default, and a higher value of $F_\beta^{w}$ indicates better model performance.

4.3.4. Mean Absolute Error (MAE)

The mean absolute error calculates the average pixel-level absolute error between the ground truth and the normalized prediction, and it is defined by
$$MAE = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \big| P(i, j) - G(i, j) \big|,$$
where $W$ and $H$ are the width and height of the image, and $P$ and $G$ are the normalized prediction and the ground truth, respectively. A smaller value of MAE indicates better model performance.
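For reference, the (non-weighted) $F_\beta$ and MAE defined above can be computed as follows for a binary prediction mask and a normalized prediction map; the small epsilon terms are added only to avoid division by zero.

```python
import numpy as np

def precision_recall_fbeta(pred_bin, gt_bin, beta2=0.3, eps=1e-8):
    """Precision, recall, and F_beta (non-weighted form) for binary masks M (prediction) and G (ground truth)."""
    inter = np.logical_and(pred_bin, gt_bin).sum()
    precision = inter / (pred_bin.sum() + eps)
    recall = inter / (gt_bin.sum() + eps)
    fbeta = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return precision, recall, fbeta

def mae(pred, gt):
    """Mean absolute error between a normalized prediction map and the ground truth."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()
```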

4.4. Comparison with SOTA Methods

We compare the proposed method with 16 state-of-the-art COD baselines: CPD [2], EGNet [3], F3Net [4], UCNet [5], SINet [1], PraNet [6], C2FNet [22], PFNet [21], TINet [17], UGTR [19], R-MGL [18], JCSOD [25], LSR [26], C2FNet-V2 [52], SINet-V2 [27], and BSA-Net [38].

4.4.1. Quantitative Evaluation

Table 1 details the quantitative comparison between our model and the 16 state-of-the-art methods on the three benchmark datasets. For a fair comparison, we use the prediction maps provided by the original authors; if they are not provided, we directly use the official code and models to compute the missing prediction maps. It can be clearly seen that our network significantly outperforms other advanced models in most evaluation metrics on the three datasets. For instance, on the COD10K dataset, compared with the second-best method, BSA-Net, our method increases $S_\alpha$ and $F_\beta^w$ by 1.9% and 3.9%, respectively, and decreases MAE by 5.9%. On the CAMO dataset, compared with the second-best method, C2FNet-V2, our method increases $S_\alpha$ and $F_\beta^w$ by 3.1% and 4.5%, respectively, and decreases MAE by 13.0%. Overall, our proposed method greatly improves upon the state of the art. Figure 4 provides the PR curves and the $F_\beta$ curves of our method and the other methods on the CAMO and CHAMELEON datasets. The higher the curve, the better the performance of the model, which further demonstrates the superiority of our method.

4.4.2. Qualitative Evaluation

In Figure 5, we visualize some challenging scenes and the results generated by our method and other SOTA methods. It is not difficult to see that our model can accurately segment objects at different scales, including large objects (Row 1), medium objects (Row 2), small objects (Rows 3–4), and multiple objects (Rows 5–6). Meanwhile, for objects with high luminance (Row 7), occlusion (Row 8), or abundant edge details (Rows 9–10), our method is also able to generate accurate predictions that are highly consistent with the ground truth.

4.5. Ablation Analysis

4.5.1. Effect of Different Modules

We conduct ablation studies to investigate the effectiveness of the MFIM, CARM, and EGM in our MFNet. Due to the limited number of images in the CHAMELEON test set, poor predictions on individual images could have a large impact on the results, so we only report the experimental results on the CAMO and COD10K datasets. The quantitative results of the ablation experiments are summarized in Table 2.
Effect of MFIM. We first adopt a basic model without the MFIM, CARM, and EGM as a baseline (#1). The basic model consists of an encoder–decoder structure, where the encoder uses the backbone network Res2Net-50 [39] and the decoder integrates features layer by layer with a top-down approach. Based on the baseline, by comparing #1 and #2, we find that adding MFIM can significantly improve the detection accuracy. Furthermore, from Figure 6, we can see that the MFIM effectively aggregates multi-scale features. Although a small amount of noise is obtained along with the effective features, it undeniably improves the integrity of camouflaged object detection and enhances the detection accuracy.
Effect of CARM. To explore the effectiveness of the CARM, we merged the CARM into #2. As shown in Table 2, compared with #2, the performance of model #3 with the CARM added is significantly improved, which is reflected in the three evaluation metrics for both the CAMO and COD10K datasets. The CARM integrates the rich features output by the MFIM layer by layer in a top-down manner and guides the low-level features with the high-level features, which can help the model to filter out the irrelevant features in the low-level features. This is also verified by the visual comparison results shown in Figure 6. Overall, the inclusion of the CARM further improves the performance of the model.
Effect of EGM. After comparing #3 and #4 in Table 2, it can be seen that the EGM further improves the COD performance, achieving gains of 1.2% in $S_\alpha$ and 1.8% in $E_\phi$ and a 7.0% reduction in MAE on the CAMO dataset. In addition, #3 and #4 in Figure 6 also show that the added edge information makes the edge details of the detected camouflaged objects clearer, and the semantic ambiguity is effectively alleviated.

4.5.2. Effect of Different Levels of Features as Input in EGM

To verify the importance of high-level features in the EGM for guiding edge semantic generation, we design three variants: (1) low-level features $f_1$ and $f_2$ as input to the EGM ($f_1 + f_2$), (2) low-level feature $f_1$ and high-level feature $f_5$ as input to the EGM ($f_1 + f_5$), and (3) low-level feature $f_2$ and high-level feature $f_5$ as input to the EGM ($f_2 + f_5$). We report the quantitative results in Table 3.
Following FAP-Net [24], using the low-level features $f_1$ and $f_2$ (#5) as input to the EGM yields the worst results. Meanwhile, when $f_1$ (#6), $f_2$ (#7), or $f_3$ (#4) is used to explore edges together with $f_5$ to help locate object-related edges, better results are achieved, which proves the effectiveness of using the rich semantic information of high-level features to guide the generation of object edge features. As shown in Table 3, the combination of $f_3 + f_5$ (#4) obtains the best performance for camouflaged object detection.

4.5.3. Effect of Different Branches in MFIM

To verify the effectiveness of the two types of branches in the MFIM, we design two variants: (1) removing current branches in the MFIM (without CB) and (2) removing adjacent branches in the MFIM (without AB). We report the quantitative results in Table 4.
The quantitative results show that the performance without CB (#8) and without AB (#9) is worse than that of our full method (#4), which confirms the effectiveness of both the current branch and the adjacent branches. Concretely, on the CAMO dataset, the model performance without CB degrades considerably, e.g., $S_\alpha$: 0.824 → 0.765, $E_\phi$: 0.883 → 0.802, MAE: 0.067 → 0.090. Comparatively, the performance without AB declines less significantly on the same dataset, e.g., $S_\alpha$: 0.824 → 0.799, $E_\phi$: 0.883 → 0.847, MAE: 0.067 → 0.080. A similar situation can be observed on the COD10K dataset. We suggest that this is because, when the current branch is removed, the local and global information of the adjacent branches cannot effectively interact with that of the current branch, which greatly reduces the performance of the model.
By observing the visualization results in Figure 7, we can find that both variants are poorly visualized compared to our model. In particular, the visualization without CB (#8) is worse than that without AB (#9), which is consistent with our previous analysis.

4.5.4. Effect of Atrous Convolution and Asymmetric Convolution in CARM

To verify the necessity of atrous convolution and asymmetric convolution in the CARM, we design two variants: (1) replacing atrous convolution and asymmetric convolution with direct connection operations (with DC) and (2) replacing atrous convolution and asymmetric convolution with 3 × 3 convolutional layers (with NC). We report the quantitative results in Table 4.
Compared with the other ablation analyses, the performance of these two variants differs only slightly from that of our full model. Nevertheless, atrous convolution and asymmetric convolution (ours) are more conducive to the refinement of camouflaged object detection by the CARM. The visualization results in Figure 7 show that the three models, with DC (#10), with NC (#11), and ours (#4), exhibit an incremental improvement in their ability to detect camouflaged objects. In conclusion, the CARM based on atrous and asymmetric convolutions can better obtain high-quality contextual semantic information across receptive fields of different sizes and shapes to refine camouflaged objects.

5. Downstream Applications

In this section, we apply MFNet to downstream tasks related to COD to evaluate its generalization ability. The datasets used for the three downstream applications are shown in Table 5.

5.1. Polyp Segmentation

A polyp is a tumorous lesion that grows in the colon. The accurate segmentation of polyps is crucial in detecting them in colonoscopy images for prompt surgical intervention. In order to evaluate the effectiveness of our method in polyp segmentation, we followed the same benchmark protocol as [6], retrained our MFNet on the KvasirSEG [53] and CVC-ClinicDB [54] datasets, and tested it on the CVC-300 dataset. Figure 8a illustrates the visual results generated by our MFNet.

5.2. Defect Detection

Defect detection is an essential process in industrial production to ensure the quality of products. We demonstrate the effectiveness of MFNet in defect detection tasks by taking road crack detection as an example. We retrain our MFNet on the widely used CrackForest [55] dataset, using 60% of the samples for training and 40% for testing. Figure 8b presents the visual results of our approach.

5.3. Transparent Object Segmentation

In daily life and industrial production, robots and drones need to accurately identify transparent objects (such as glass, windows, etc.) that are not easily visible, in order to avoid accidents. We further investigate the effectiveness of MFNet in transparent object segmentation tasks. For convenience, we reorganize the annotations of the Trans10K [56] dataset from instance-level to object-level for training purposes. The visual results presented in Figure 8c further demonstrate the generalization ability of MFNet.

6. Conclusions

In this paper, we propose a novel multi-level feature integration network (MFNet) for the COD task. We first explicitly model edges with the proposed EGM and use the obtained edge information to guide the network to refine the camouflaged objects’ edges. Secondly, we propose the MFIM to effectively integrate the complete contextual semantic information using the strong correlation of features in adjacent layers. Finally, we propose the CARM to effectively aggregate and refine the cross-layer features to obtain clear prediction maps. Through extensive experiments, we prove that our MFNet outperforms other state-of-the-art COD methods and exhibits excellent detection performance.

Author Contributions

K.L. contributed to the conceptualization, methodology, validation, data analysis, and writing of the paper; X.L. supervised the conception, reviewed the work, and approved the final manuscript; T.Q. assisted in data acquisition, review, and editing; Y.Y. helped in interpreting results, critical revisions, and theoretical framework verification; S.L. assisted in provision of study materials, grammar and spellchecking, and additional experiments. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Xinjiang Province under grant number 2020D01C026, the open project of key laboratory, Xinjiang Uygur Autonomous Region under grant number 2022D04079, and the National Natural Science Foundation of China under grant numbers U1911401 and 61433012.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Fan, D.P.; Ji, G.P.; Sun, G.; Cheng, M.M.; Shen, J.; Shao, L. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2777–2787. [Google Scholar]
  2. Wu, Z.; Su, L.; Huang, Q. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3907–3916. [Google Scholar]
  3. Zhao, J.X.; Liu, J.J.; Fan, D.P.; Cao, Y.; Yang, J.; Cheng, M.M. EGNet: Edge guidance network for salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8779–8788. [Google Scholar]
  4. Wei, J.; Wang, S.; Huang, Q. F3Net: Fusion, feedback and focus for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12321–12328. [Google Scholar]
  5. Zhang, J.; Fan, D.P.; Dai, Y.; Anwar, S.; Saleh, F.S.; Zhang, T.; Barnes, N. UC-Net: Uncertainty inspired RGB-D saliency detection via conditional variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8582–8591. [Google Scholar]
  6. Fan, D.P.; Ji, G.P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Pranet: Parallel reverse attention network for polyp segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2020, Proceedings of the 23rd International Conference, Lima, Peru, 4–8 October 2020; Proceedings, Part VI 23; Springer: Berlin/Heidelberg, Germany, 2020; pp. 263–273. [Google Scholar]
  7. Wu, Y.H.; Gao, S.H.; Mei, J.; Xu, J.; Fan, D.P.; Zhang, R.G.; Cheng, M.M. Jcs: An explainable covid-19 diagnosis system by joint classification and segmentation. IEEE Trans. Image Process. 2021, 30, 3113–3126. [Google Scholar] [CrossRef]
  8. Fuentes, A.; Yoon, S.; Kim, S.C.; Park, D.S. A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors 2017, 17, 2022. [Google Scholar] [CrossRef] [Green Version]
  9. Zeng, N.; Wu, P.; Wang, Z.; Li, H.; Liu, W.; Liu, X. A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection. IEEE Trans. Instrum. Meas. 2022, 71, 1–14. [Google Scholar] [CrossRef]
  10. Rizzini, D.L.; Kallasi, F.; Oleari, F.; Caselli, S. Investigation of vision-based underwater object detection with multiple datasets. Int. J. Adv. Robot. Syst. 2015, 12, 77. [Google Scholar] [CrossRef] [Green Version]
  11. Kavitha, C.; Rao, B.P.; Govardhan, A. An efficient content based image retrieval using color and texture of image sub blocks. Int. J. Eng. Sci. Technol. (IJEST) 2011, 3, 1060–1068. [Google Scholar]
  12. Qiu, L.; Wu, X.; Yu, Z. A high-efficiency fully convolutional networks for pixel-wise surface defect detection. IEEE Access 2019, 7, 15884–15893. [Google Scholar] [CrossRef]
  13. Siricharoen, P.; Aramvith, S.; Chalidabhongse, T.; Siddhichai, S. Robust outdoor human segmentation based on color-based statistical approach and edge combination. In Proceedings of the The 2010 International Conference on Green Circuits and Systems, Shanghai, China, 21–23 June 2010; pp. 463–468. [Google Scholar]
  14. Perazzi, F.; Krähenbühl, P.; Pritch, Y.; Hornung, A. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 733–740. [Google Scholar]
  15. Pan, Y.; Chen, Y.; Fu, Q.; Zhang, P.; Xu, X. Study on the camouflaged target detection method based on 3D convexity. Mod. Appl. Sci. 2011, 5, 152. [Google Scholar] [CrossRef]
  16. Yan, J.; Le, T.N.; Nguyen, K.D.; Tran, M.T.; Do, T.T.; Nguyen, T.V. Mirrornet: Bio-inspired camouflaged object segmentation. IEEE Access 2021, 9, 43290–43300. [Google Scholar] [CrossRef]
  17. Zhu, J.; Zhang, X.; Zhang, S.; Liu, J. Inferring camouflaged objects by texture-aware interactive guidance network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3599–3607. [Google Scholar]
  18. Zhai, Q.; Li, X.; Yang, F.; Chen, C.; Cheng, H.; Fan, D.P. Mutual graph learning for camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12997–13007. [Google Scholar]
  19. Yang, F.; Zhai, Q.; Li, X.; Huang, R.; Luo, A.; Cheng, H.; Fan, D.P. Uncertainty-guided transformer reasoning for camouflaged object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 4146–4155. [Google Scholar]
  20. Ji, G.P.; Zhu, L.; Zhuge, M.; Fu, K. Fast camouflaged object detection via edge-based reversible re-calibration network. Pattern Recognit. 2022, 123, 108414. [Google Scholar] [CrossRef]
  21. Mei, H.; Ji, G.P.; Wei, Z.; Yang, X.; Wei, X.; Fan, D.P. Camouflaged object segmentation with distraction mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8772–8781. [Google Scholar]
  22. Sun, Y.; Chen, G.; Zhou, T.; Zhang, Y.; Liu, N. Context-aware cross-level fusion network for camouflaged object detection. arXiv 2021, arXiv:2105.12555. [Google Scholar]
  23. Wang, K.; Bi, H.; Zhang, Y.; Zhang, C.; Liu, Z.; Zheng, S. D 2 C-Net: A Dual-Branch, Dual-Guidance and Cross-Refine Network for Camouflaged Object Detection. IEEE Trans. Ind. Electron. 2021, 69, 5364–5374. [Google Scholar] [CrossRef]
  24. Zhou, T.; Zhou, Y.; Gong, C.; Yang, J.; Zhang, Y. Feature Aggregation and Propagation Network for Camouflaged Object Detection. IEEE Trans. Image Process. 2022, 31, 7036–7047. [Google Scholar] [CrossRef]
  25. Li, A.; Zhang, J.; Lv, Y.; Liu, B.; Zhang, T.; Dai, Y. Uncertainty-aware joint salient object and camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10071–10081. [Google Scholar]
  26. Lv, Y.; Zhang, J.; Dai, Y.; Li, A.; Liu, B.; Barnes, N.; Fan, D.P. Simultaneously localize, segment and rank the camouflaged objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11591–11601. [Google Scholar]
  27. Fan, D.P.; Ji, G.P.; Cheng, M.M.; Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6024–6042. [Google Scholar] [CrossRef]
  28. Pang, Y.; Zhao, X.; Xiang, T.Z.; Zhang, L.; Lu, H. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 2160–2170. [Google Scholar]
  29. Luo, Z.; Mishra, A.; Achkar, A.; Eichel, J.; Li, S.; Jodoin, P.M. Non-local deep features for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, USA, 21–26 July 2017; pp. 6609–6617. [Google Scholar]
  30. Zhang, X.; Wang, T.; Qi, J.; Lu, H.; Wang, G. Progressive Attention Guided Recurrent Network for Salient Object Detection. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  31. Chen, S.; Tan, X.; Wang, B.; Hu, X. Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 234–250. [Google Scholar]
  32. Wu, R.; Feng, M.; Guan, W.; Wang, D.; Lu, H.; Ding, E. A mutual learning method for salient object detection with intertwined multi-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8150–8159. [Google Scholar]
  33. Zhang, P.; Wang, D.; Lu, H.; Wang, H.; Ruan, X. Amulet: Aggregating multi-level convolutional features for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 202–211. [Google Scholar]
  34. Hou, Q.; Cheng, M.M.; Hu, X.; Borji, A.; Tu, Z.; Torr, P.H. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3203–3212. [Google Scholar]
  35. Wang, T.; Zhang, L.; Wang, S.; Lu, H.; Yang, G.; Ruan, X.; Borji, A. Detect globally, refine locally: A novel approach to saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3127–3135. [Google Scholar]
  36. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9413–9422. [Google Scholar]
  37. Ding, H.; Jiang, X.; Liu, A.Q.; Thalmann, N.M.; Wang, G. Boundary-aware feature propagation for scene segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6819–6829. [Google Scholar]
  38. Zhu, H.; Li, P.; Xie, H.; Yan, X.; Liang, D.; Chen, D.; Wei, M.; Qin, J. I can find you! Boundary-guided separated attention network for camouflaged object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, virtual, 22 February–1 March 2022; Volume 36, pp. 3608–3616. [Google Scholar]
  39. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [Green Version]
  40. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  41. Fan, D.P.; Zhai, Y.; Borji, A.; Yang, J.; Shao, L. BBS-Net: RGB-D salient object detection with a bifurcated backbone strategy network. In Proceedings of the Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII; Springer: Berlin/Heidelberg, Germany, 2020; pp. 275–292. [Google Scholar]
  42. Wu, Y.H.; Liu, Y.; Zhang, L.; Cheng, M.M.; Ren, B. EDN: Salient object detection via extremely-downsampled network. IEEE Trans. Image Process. 2022, 31, 3125–3136. [Google Scholar] [CrossRef]
  43. Yin, B.; Zhang, X.; Hou, Q.; Sun, B.Y.; Fan, D.P.; Van Gool, L. CamoFormer: Masked Separable Attention for Camouflaged Object Detection. arXiv 2022, arXiv:2212.06570. [Google Scholar]
  44. Lee, M.S.; Shin, W.; Han, S.W. TRACER: Extreme Attention Guided Salient Object Tracing Network (Student Abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, virtual, 22 February–1 March 2022; Volume 36, pp. 12993–12994. [Google Scholar]
  45. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  46. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  47. Le, T.N.; Nguyen, T.V.; Nie, Z.; Tran, M.T.; Sugimoto, A. Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 2019, 184, 45–56. [Google Scholar] [CrossRef]
  48. Skurowski, P.; Abdulameer, H.; Błaszczyk, J.; Depta, T.; Kornacki, A.; Kozieł, P. Animal camouflage analysis: Chameleon database. Unpubl. Manuscr. 2018, 2, 7. [Google Scholar]
  49. Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4548–4557. [Google Scholar]
  50. Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. arXiv 2018, arXiv:1805.1042. [Google Scholar]
  51. Margolin, R.; Zelnik-Manor, L.; Tal, A. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 23–28 June 2014; pp. 248–255. [Google Scholar]
  52. Chen, G.; Liu, S.J.; Sun, Y.J.; Ji, G.P.; Wu, Y.F.; Zhou, T. Camouflaged object detection via context-aware cross-level fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6981–6993. [Google Scholar] [CrossRef]
  53. Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Halvorsen, P.; de Lange, T.; Johansen, D.; Johansen, H.D. Kvasir-seg: A segmented polyp dataset. In Proceedings of the MultiMedia Modeling, Proceedings of the 26th International Conference, MMM 2020, Daejeon, Republic of Korea, 5–8 January 2020; Proceedings, Part II 26; Springer: Berlin/Heidelberg, Germany, 2020; pp. 451–462. [Google Scholar]
  54. Bernal, J.; Sánchez, F.J.; Fernández-Esparrach, G.; Gil, D.; Rodríguez, C.; Vilariño, F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 2015, 43, 99–111. [Google Scholar] [CrossRef]
  55. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  56. Xie, E.; Wang, W.; Wang, W.; Ding, M.; Shen, C.; Luo, P. Segmenting transparent objects in the wild. In Proceedings of the Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 696–711. [Google Scholar]
Figure 1. The whole pipeline of the proposed multi-level feature integration network (MFNet), consisting of three main components, i.e., edge guidance module (EGM), multi-level feature integration module (MFIM), and context aggregation refinement module (CARM). Please refer to Section 3 for details.
Figure 2. Detailed architecture of the multi-level feature integration module (MFIM). $f_e$ is the output feature of the EGM.
Figure 3. Detailed architecture of the context aggregation refinement module (CARM).
Figure 4. PR and $F_\beta$ curves of the proposed method and other SOTA methods on the CAMO and COD10K datasets.
Figure 5. Qualitative comparison of the proposed MFNet with other SOTA methods (i.e., SINet [1], PFNet [21], UGTR [19], JCSOD [25], LSR [26], C2FNet-V2 [52], and BSA-Net [38]).
Figure 6. The visual comparison of the detection results obtained by different models in the ablation study. (#1) Baseline, (#2) Baseline+MFIM, (#3) Baseline+MFIM+CARM, (#4) Baseline+MFIM+CARM+EGM (ours).
Figure 7. Visual comparison of detection results obtained with the four variants of the MFIM and CARM models in ablation studies. The detection results were obtained (#4) with our method, (#8) without CB, (#9) without AB, (#10) with DC, (#11) with NC.
Figure 8. Visualization results of three downstream applications. From top to bottom: image (1st row), ground truth (2nd row), and ours (3rd row).
Table 1. Quantitative evaluation results on three benchmark datasets regarding S-measure, E-measure, weighted F-measure, and MAE scores. The best results are highlighted in bold. “↑” and “↓” indicate that larger or smaller is better.
Method | Year | CAMO ($S_\alpha$↑ / $E_\phi$↑ / $F_\beta^w$↑ / MAE↓) | CHAMELEON ($S_\alpha$↑ / $E_\phi$↑ / $F_\beta^w$↑ / MAE↓) | COD10K ($S_\alpha$↑ / $E_\phi$↑ / $F_\beta^w$↑ / MAE↓)
CPD | 2019 | 0.716 / 0.796 / 0.658 / 0.113 | 0.857 / 0.898 / 0.813 / 0.048 | 0.750 / 0.853 / 0.640 / 0.053
EGNet | 2019 | 0.662 / 0.766 / 0.612 / 0.124 | 0.848 / 0.831 / 0.676 / 0.050 | 0.737 / 0.810 / 0.608 / 0.056
F3Net | 2020 | 0.711 / 0.780 / 0.630 / 0.109 | 0.848 / 0.917 / 0.798 / 0.047 | 0.739 / 0.819 / 0.609 / 0.051
UCNet | 2020 | 0.739 / 0.787 / 0.700 / 0.095 | 0.880 / 0.930 / 0.836 / 0.036 | 0.776 / 0.857 / 0.681 / 0.042
SINet | 2020 | 0.745 / 0.829 / 0.644 / 0.092 | 0.872 / 0.946 / 0.806 / 0.034 | 0.776 / 0.864 / 0.631 / 0.043
PraNet | 2020 | 0.769 / 0.837 / 0.663 / 0.094 | 0.860 / 0.907 / 0.763 / 0.044 | 0.789 / 0.861 / 0.629 / 0.045
C2FNet | 2021 | 0.796 / 0.864 / 0.719 / 0.080 | 0.888 / 0.935 / 0.828 / 0.032 | 0.813 / 0.890 / 0.686 / 0.036
PFNet | 2021 | 0.782 / 0.842 / 0.695 / 0.085 | 0.882 / 0.931 / 0.810 / 0.033 | 0.800 / 0.877 / 0.660 / 0.040
TINet | 2021 | 0.781 / 0.848 / 0.678 / 0.087 | 0.874 / 0.916 / 0.783 / 0.038 | 0.793 / 0.861 / 0.635 / 0.042
UGTR | 2021 | 0.784 / 0.851 / 0.684 / 0.086 | 0.888 / 0.940 / 0.794 / 0.031 | 0.818 / 0.853 / 0.667 / 0.035
R-MGL | 2021 | 0.775 / 0.847 / 0.673 / 0.088 | 0.893 / 0.923 / 0.813 / 0.030 | 0.814 / 0.852 / 0.666 / 0.035
JCSOD | 2021 | 0.800 / 0.873 / 0.728 / 0.073 | 0.894 / 0.943 / 0.848 / 0.030 | 0.809 / 0.884 / 0.684 / 0.035
LSR | 2021 | 0.787 / 0.854 / 0.696 / 0.080 | 0.893 / 0.938 / 0.839 / 0.033 | 0.804 / 0.880 / 0.673 / 0.037
C2FNet-V2 | 2022 | 0.799 / 0.859 / 0.730 / 0.077 | 0.893 / 0.947 / 0.845 / 0.028 | 0.811 / 0.891 / 0.691 / 0.036
SINet-V2 | 2022 | 0.820 / 0.882 / 0.743 / 0.070 | 0.888 / 0.942 / 0.816 / 0.030 | 0.815 / 0.887 / 0.680 / 0.037
BSA-Net | 2022 | 0.796 / 0.851 / 0.717 / 0.079 | 0.895 / 0.946 / 0.841 / 0.027 | 0.818 / 0.891 / 0.699 / 0.034
Ours | - | 0.824 / 0.883 / 0.763 / 0.067 | 0.904 / 0.948 / 0.856 / 0.026 | 0.834 / 0.901 / 0.726 / 0.032
Table 2. Ablation analyses of each component on the CAMO and COD10K datasets. Bold: top result. The quantitative evaluation results obtained by (#1) Baseline, (#2) Baseline+MFIM, (#3) Baseline+MFIM+CARM, (#4) Baseline+MFIM+CARM+EGM (ours).
No. | B | MFIM | CARM | EGM | CAMO ($S_\alpha$↑ / $E_\phi$↑ / MAE↓) | COD10K ($S_\alpha$↑ / $E_\phi$↑ / MAE↓)
#1 | 🗸 | | | | 0.782 / 0.813 / 0.083 | 0.811 / 0.859 / 0.035
#2 | 🗸 | 🗸 | | | 0.803 / 0.844 / 0.078 | 0.821 / 0.868 / 0.034
#3 | 🗸 | 🗸 | 🗸 | | 0.814 / 0.867 / 0.072 | 0.834 / 0.899 / 0.033
#4 | 🗸 | 🗸 | 🗸 | 🗸 | 0.824 / 0.883 / 0.067 | 0.834 / 0.901 / 0.032
Table 3. Ablation analysis of three variant models modified for the EGM on the CAMO and COD10K datasets. Bold: top result. The quantitative evaluation results obtained by (#5) $f_1 + f_2$, (#6) $f_1 + f_5$, (#7) $f_2 + f_5$, (#4) $f_3 + f_5$ (ours).
No. | Models | CAMO ($S_\alpha$↑ / $E_\phi$↑ / MAE↓) | COD10K ($S_\alpha$↑ / $E_\phi$↑ / MAE↓)
#5 | $f_1 + f_2$ | 0.794 / 0.845 / 0.079 | 0.825 / 0.887 / 0.033
#6 | $f_1 + f_5$ | 0.804 / 0.848 / 0.079 | 0.822 / 0.876 / 0.034
#7 | $f_2 + f_5$ | 0.817 / 0.876 / 0.071 | 0.834 / 0.898 / 0.032
#4 | $f_3 + f_5$ | 0.824 / 0.883 / 0.067 | 0.834 / 0.901 / 0.032
Table 4. Ablation analysis of four variant models modified for MFIM and CARM on the CAMO and COD10K datasets. Bold: top result. The quantitative evaluation results obtained (#4) with our method, (#8) without CB, (#9) without AB, (#10) with DC, (#11) with NC.
No. | Models | CAMO ($S_\alpha$↑ / $E_\phi$↑ / MAE↓) | COD10K ($S_\alpha$↑ / $E_\phi$↑ / MAE↓)
#8 | Without CB | 0.765 / 0.802 / 0.090 | 0.815 / 0.863 / 0.035
#9 | Without AB | 0.799 / 0.847 / 0.080 | 0.823 / 0.883 / 0.034
#10 | With DC | 0.820 / 0.881 / 0.070 | 0.830 / 0.897 / 0.034
#11 | With NC | 0.825 / 0.880 / 0.068 | 0.832 / 0.897 / 0.034
#4 | Ours | 0.824 / 0.883 / 0.067 | 0.834 / 0.901 / 0.032
Table 5. Three downstream applications and the datasets used for each.
Downstream Application | Dataset Used
Polyp Segmentation | KvasirSEG Dataset [53], CVC-ClinicDB Dataset [54], and CVC-300 Dataset
Defect Detection | CrackForest Dataset [55]
Transparent Object Segmentation | Trans10K Dataset [56]

