1. Introduction
Camouflaged object detection (COD), in a narrow sense, aims at detecting objects hidden in image scenes [1], such as chameleons hiding in their environment via epidermal color changes [2] or armed soldiers lurking in mountains and forests. In a broader sense, COD aims at detecting any image objects with low detectability, such as component defects in industry [3] or signs of lesions in pathology [4]. The study of COD is not only an important scientific problem in the field of computer vision but can also bring great economic and social benefits.
However, the colors and patterns of camouflaged objects almost blend with the environment, which greatly reduces the probability of their being detected, identified, or targeted [5].
Figure 1 shows two cases: artificial camouflage, in which soldiers conceal themselves by wearing special camouflage clothing and hiding behind objects, and natural camouflage, in which lizards and deer have evolved body colors similar to their environment to resist natural enemies and hunt prey. In both cases, the objects almost perfectly blend with the background.
To date, research on camouflaged target identification has mainly focused on two aspects. The first is the identification of camouflaged objects with the help of non-visible detection techniques. Camouflaged objects can effectively hamper detection in visible wavelengths, while detection techniques in other bands, such as infrared, hyperspectral, and polarization imaging, can make up for the shortcomings of visible detection [6,7,8]. Although this approach is effective, it does not solve the scientific problem of detecting camouflaged targets in visible wavelengths. The second aspect is the study of COD techniques in visible wavelengths. Before the rapid development of deep learning, researchers commonly used traditional digital image processing methods, such as spectral transforms [9], sparse matrices [10], and human visual system models [11], which are not robust and generalize poorly. In recent years, with the development of deep learning technology, scholars have applied it to COD, providing new ideas for detecting camouflaged objects in visible wavelengths [12].
Current general deep learning object detection networks (e.g., the multi-path vision transformer (MPViT) [13]) are effective for detecting salient objects in images, but as shown in Figure 1, they often produce missed detections and low confidence when detecting camouflaged objects, for the following three reasons:
The color and texture of the camouflaged object are highly similar to those of the background, which reduces the differences in their features.
Camouflaged objects usually have irregular shapes and varying scales, making the spatial information of these objects in images extremely complex.
The low utilization of the highly representative key point information of the camouflaged objects results in missed detections and false alarms.
To solve the abovementioned problems, this paper proposes a ternary cascade perception-based method for detecting camouflaged objects. Its core innovation is the ternary cascade perception module (TCPM), which focuses on extracting the relationship information between features, the spatial information of camouflaged targets, and the location information of key points. In addition, we improve the existing feature fusion module and propose a cascade aggregation pyramid (CAP) to achieve more efficient feature fusion. Finally, we propose a joint loss function that uses separate loss functions to predict object classes and bounding boxes and jointly obtains the training loss value.
The main contributions of this paper are as follows:
To overcome the deficiencies of existing COD methods, the TCPM was proposed to solve the problems of missed detections and low confidence levels.
Combined with the TCPM, a cascade perception-based COD network (CPNet) was constructed, and a designed CAP with a joint loss function was also used in the network to make it more targeted for COD.
The proposed CPNet demonstrated the best performance on the COD10K camouflaged object dataset, with all metrics better than or on par with those of previously published methods.
The rest of this paper is structured as follows: Section 2 introduces the related work; Section 3 describes the network structure of CPNet in detail; Section 4 discusses the experimental results; and Section 5 summarizes the research.
3. Methods
The structure of the proposed CPNet for COD is shown in Figure 3a. The CPNet sequentially cascades the feature extraction backbone, the CAP, the TCPM, and the detection output head. The feature extraction backbone downsamples the input image to 1/4, 1/8, 1/16, and 1/32 of the original size and then inputs the four sets of feature maps into the CAP for feature fusion. The CAP outputs four 3D tensors, which are then fed into the TCPM for information perception, and the object class and object bounding box information is finally output.
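For clarity, the sketch below illustrates this forward flow in PyTorch-style pseudocode; the module names and interfaces are placeholders assumed for illustration and are not taken from the paper's implementation.

```python
# Minimal sketch of the CPNet forward flow described above. Backbone, CAP, TCPM,
# and head are assumed to be provided as nn.Module placeholders.
import torch.nn as nn

class CPNetSketch(nn.Module):
    def __init__(self, backbone, cap, tcpm, head):
        super().__init__()
        self.backbone = backbone  # outputs feature maps at 1/4, 1/8, 1/16, 1/32 scale
        self.cap = cap            # cascade aggregation pyramid (feature fusion)
        self.tcpm = tcpm          # ternary cascade perception module
        self.head = head          # detection output head (Cascade R-CNN style)

    def forward(self, images):
        feats = self.backbone(images)            # list of 4 multi-scale tensors
        fused = self.cap(feats)                  # 4 fused 3D tensors
        perceived = [self.tcpm(f) for f in fused]
        return self.head(perceived)              # object classes + bounding boxes
```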
The design ideas and specific structures of each component will be discussed in detail below.
3.1. Feature Extraction Backbone
The feature extraction backbone (Backbone) is used to extract feature information at each scale of the input image. To handle the multi-scale nature of camouflaged objects, the Swin transformer feature extraction network [37], which has good comprehensive performance, was used; its structure is shown in Figure 3b. The Swin transformer draws on the layered sliding-window structure of traditional CNNs. First, the input image is partitioned into patches; each group of adjacent 4 × 4 pixels constitutes a patch, and each patch is mapped to a superpixel point with 4 × 4 × 3 output channels. Then, the Swin transformer performs downsampling by merging patches to obtain multi-scale feature representations. Each stage extracts both low-level and high-level features through the sliding windows of the Swin transformer blocks; the Swin transformer block is shown in Figure 3c. Finally, the feature maps generated at each stage are input into the CAP. Swin transformer backbones of different sizes differ in the number of Swin transformer blocks n and the number of channels C, and models with larger n and C achieve better classification accuracy on the ImageNet dataset [38]. CPNet selected two models, Swin-T and Swin-S, for the experiments. Table 1 shows the parameter settings of the Swin transformer backbone and the results obtained on the standard classification dataset (ImageNet-1K).
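As an illustration of the 4 × 4 patch partition step described above, the following minimal PyTorch sketch splits an image into flattened patches of length 4 × 4 × 3 = 48 before linear embedding; the function name and tensor layout are assumptions, not the official Swin transformer code.

```python
# Minimal sketch of the 4x4 patch partition step (illustrative only).
import torch

def patch_partition(x, patch_size=4):
    """Split an image batch (B, 3, H, W) into flattened patches.

    Each 4x4 patch becomes a vector of length 4*4*3 = 48, matching the channel
    count mentioned in the text, before the linear embedding to C channels.
    """
    B, C, H, W = x.shape
    x = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # shape: (B, C, H/4, W/4, 4, 4) -> (B, H/4, W/4, 4*4*C)
    x = x.permute(0, 2, 3, 4, 5, 1).reshape(B, H // patch_size, W // patch_size, -1)
    return x

patches = patch_partition(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 56, 56, 48])
```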
3.2. CAP
In the feature extraction network, the low-level feature maps contain a large amount of location information, while the high-level feature maps carry stronger semantic information after continuous feature extraction. In 2021, Qiao et al. proposed the recursive feature pyramid (RFP) [17], a module that improves upon the feature pyramid network (FPN) [39] by adopting the design idea of "looking and thinking twice": the feature maps from the primary training step of the FPN are fed back into the backbone network so that the information between high- and low-level features is fused several times, which enhances both semantic representation and target localization. The FPN structure is shown in Figure 4a, and the RFP structure is shown in Figure 4b. In the RFP, the receptive field of the feature maps first output by the FPN is enhanced using atrous spatial pyramid pooling (ASPP) [40], and then feature extraction and feature fusion are performed again using the FPN. Finally, the two sets of feature maps are fused to achieve a stronger representation.
The RFP's complex structure and repeated passes through the backbone make the model's training and inference times too long. In this paper, the RFP was improved in a targeted way for efficient object detection; the resulting module, called the CAP, is shown in Figure 4c.
The CAP changes the secondary training part of the RFP to an aggregation network (AN). First, the lower-level feature map is downsampled using the convolutional layer combination of convolution + batch normalization + ReLU activation (CBR) with a stride of 2, which aggregates the low-level representations into a higher-level one. This result is then added to the receptive-field-enhanced feature map output from the primary training part at the same scale, and finally the fusion between the feature maps is performed using a CBR with a stride of 1 to obtain the higher-level features.
In the CAP, the feature fusion from high-level representations to low-level representations is implemented by the FPN block of the FPN, and the feature aggregation from low-level representations to high-level representations is implemented by the AN block. The AN block enhances the low-level location features by aggregating them through multiple CBR steps, which further improves the localization capability of the network. After feature fusion by the CAP, multi-scale feature maps containing both location and semantic information are obtained.
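The following minimal PyTorch sketch illustrates one AN aggregation step as described above, under the assumption that all pyramid levels share the same channel count; the layer names are illustrative.

```python
# Minimal sketch of one aggregation network (AN) step of the CAP.
import torch.nn as nn

def cbr(in_ch, out_ch, stride):
    """Convolution + batch normalization + ReLU (the CBR combination)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ANBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.down = cbr(channels, channels, stride=2)   # aggregate the lower-level map
        self.fuse = cbr(channels, channels, stride=1)   # fuse after addition

    def forward(self, low_level, fpn_out):
        # Downsample the lower-level map with a stride-2 CBR, add the
        # receptive-field-enhanced FPN output at the same resolution,
        # then fuse with a stride-1 CBR to obtain the higher-level features.
        return self.fuse(self.down(low_level) + fpn_out)
```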
3.3. TCPM
To accurately detect camouflaged objects in an image and address the difficulty of obtaining their key information, the TCPM was purposely constructed. It focuses on extracting the relationship information between features, the spatial information of the camouflaged object, and the location information of key points.
The TCPM was built based on the characteristics of camouflaged objects, and it includes a feature perception module (FPM), a spatial perception module (SPM), and a key point perception module (KPM), which are embedded into the network in a cascaded manner. The structure of the TCPM is shown in Figure 5.
3.3.1. FPM
The FPM uses a graph convolution network (GCN) [41] to obtain information regarding the relative relationships between adjacent features, thus increasing the variance of the features and making the network more capable of distinguishing mismatched content from the background.
The input tensor is first preprocessed using the convolutional layer combination (CBR) to obtain a preprocessed feature map. Second, the entire feature map is pooled in both spatial dimensions by global average pooling (GAP), which reduces the input feature map to a pooled feature map with W = 1 and H = 1. Then, the four-dimensional tensor output by the GAP is split into B three-dimensional tensors. In a CNN, the number of channels C represents the number of features extracted from the feature map, so each of the B three-dimensional tensors can be regarded as a point set of C features. The GCN (a convolutional network capable of capturing relationship information) is used to further obtain the relative relationships between the feature nodes; its schematic is shown in Figure 6. Suppose there are five feature tensors, regarded as five nodes; through the GCN, each node incorporates the information of its neighbors (for example, a node carries the information of the two nodes adjacent to it to form a new node after the GCN).
After obtaining the new three-dimensional feature tensors carrying the node relationships, the B tensors are concatenated back together, and a softmax operation is performed on the channel dimension to normalize the relationship information into a weight factor feature map. The weight factor feature map is multiplied with the preprocessed feature map channel by channel to obtain a relationship-weighted feature map. Finally, this weighted feature map is connected with the original input through a residual connection and added pixel by pixel to obtain the output feature tensor of the FPM, which is fed back into the main network.
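A hedged PyTorch sketch of the FPM computation described above is given below. The chain adjacency over channel nodes (each node linked to its two neighbors, as in Figure 6) and the layer sizes are assumptions for illustration only.

```python
# Illustrative sketch of the FPM: CBR preprocessing, GAP to channel nodes,
# a simple graph convolution over the nodes, softmax channel weights, residual.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPMSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pre = nn.Sequential(                      # CBR preprocessing
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.gcn_weight = nn.Linear(1, 1)              # node feature transform
        # assumed chain adjacency: each channel node connects to its two neighbors
        adj = torch.eye(channels)
        adj += torch.diag(torch.ones(channels - 1), 1)
        adj += torch.diag(torch.ones(channels - 1), -1)
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True))

    def forward(self, x):
        f = self.pre(x)                                # (B, C, H, W)
        nodes = F.adaptive_avg_pool2d(f, 1).flatten(2) # GAP -> (B, C, 1) node features
        nodes = self.gcn_weight(self.adj @ nodes)      # graph convolution step
        w = torch.softmax(nodes, dim=1).unsqueeze(-1)  # channel weights (B, C, 1, 1)
        return x + f * w                               # weighted map + residual input
```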
3.3.2. SPM
The SPM obtains critical features through spatial normalization and uses this spatial information to mitigate missed detections, particularly in multi-target scenes.
The output tensor of the FPM is first preprocessed using the convolutional layer combination (CBR) to obtain a preprocessed feature map.
The SPM improves the existing batch normalization (BN) method to obtain the spatial feature weights. Standard batch normalization is performed over the same channel of different feature maps within each batch, which allows the semantic information of different feature maps to be exploited. However, the literature [42] has experimentally demonstrated that BN produces large errors with small batches. In addition, the category and environment gaps between different images in a camouflaged object dataset are large, the semantic information of the images within the same batch is often not rich, and a uniform weight calculation makes the results unstable, which is equivalent to introducing noisy data into a single image. Therefore, the four-dimensional tensor is first divided into B three-dimensional tensors, each of which is then normalized along the channel dimension. Figure 7 shows the schematic diagram of the improved normalization method. Each feature map within a batch is spatially normalized along the channel direction, which ensures that the weight calculation is performed only within the individual feature map and is not affected by the other feature maps in the batch.
In addition, the normalization operation trains a scale factor for each channel, which measures the variance within that channel. To calculate the feature weights of the spatial dimension and indicate the importance of the channel information, each channel is assigned a weight derived from its scale factor, and the weights are fused with the normalized features. Then, the B three-dimensional tensors are concatenated, and a softmax normalization operation is performed on the channel dimension to normalize the spatial information to the range 0~1, yielding the weight factor feature map. The weight factor feature map is multiplied with the preprocessed feature map to yield a spatially attention-weighted feature map, which is finally connected with the original input; the output feature tensor of the SPM is obtained by pixel-wise addition.
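The following sketch illustrates the SPM weighting idea described above: per-image normalization along the channel dimension and channel weights derived from learned scale factors. The exact weight formula is an assumption, since the text does not reproduce it.

```python
# Illustrative sketch of the SPM: per-sample channel normalization, scale-factor
# weights, softmax weight map, residual connection with the module input.
import torch
import torch.nn as nn

class SPMSketch(nn.Module):
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.pre = nn.Sequential(                      # CBR preprocessing
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.gamma = nn.Parameter(torch.ones(channels))  # per-channel scale factor
        self.eps = eps

    def forward(self, x):
        f = self.pre(x)                                              # (B, C, H, W)
        # normalize each sample independently over its channel dimension,
        # so the result is not affected by other images in the batch
        mean = f.mean(dim=1, keepdim=True)
        var = f.var(dim=1, keepdim=True, unbiased=False)
        f_norm = (f - mean) / torch.sqrt(var + self.eps)
        # assumed weighting: channel weight proportional to its scale factor
        w = self.gamma / self.gamma.abs().sum()
        attn = torch.softmax(f_norm * w.view(1, -1, 1, 1), dim=1)    # weight factor map
        return x + f * attn                                          # residual output
```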
3.3.3. KPM
The KPM performs information perception for the key points of the camouflaged object by two-dimensional average pooling (AP) and dilated convolution (DConv). Key points refer to those points that carry key information and can significantly affect the detection results. If the key point information of the camouflaged object is determined, the confidence levels of the remaining locations can be weakened, thus reducing the false alarm rate of the network. First, we obtained the location information of the key points by conducting AP in different dimensions; then, we increased the receptive field via DConvs at different scales to obtain more multi-scale information about the area around the key points.
The output tensor of the SPM is first preprocessed using the CBR combination to obtain a preprocessed feature map.
Unlike in the FPM, the AP does not operate on the whole feature map. As shown in Figure 4, the pooling is split into two dimensions to obtain the positions of nodes along the horizontal and vertical directions of the feature map, producing a pair of three-dimensional pooled feature maps. Next, the two feature maps containing the key-point location information in the two directions are multiplied (as shown in Figure 8a), and the CBR combination is then used to fuse the information between the channels, yielding a feature map in which the global location information of the key points is embedded.
Location information alone is not enough; to accurately identify the key points of the camouflaged object, information near the key points is needed for further discrimination. For this purpose, a smaller feature extraction range is required on the larger location-information feature map. Two DConv branches of different sizes are used to expand the receptive field and obtain multi-scale regional information; the schematic diagram is shown in Figure 8b, with DConv kernel sizes of 3 × 3 and 5 × 5 and dilation coefficients of 2. The two sets of feature maps are concatenated along the channel dimension to obtain a receptive-field-enhanced feature map with twice the number of channels. The CBR combination is then used for channel recovery and information fusion, followed by a softmax normalization operation in the channel dimension, which finally yields the three-dimensional multi-scale weight factor feature map.
The weight factor feature map is multiplied with the preprocessed feature map channel by channel to obtain a weighted feature map. Finally, this weighted feature map is connected with the module input (i.e., the output feature tensor of the SPM) through pixel-by-pixel addition to obtain the output feature tensor of the KPM, which, as shown in Figure 5, is fed back into the network.
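A minimal PyTorch sketch of the KPM is given below; the kernel sizes and dilation coefficient follow the text, while the layer widths and the exact fusion order are illustrative assumptions.

```python
# Illustrative sketch of the KPM: directional average pooling, dilated-conv
# branches (3x3 and 5x5, dilation 2), channel concatenation, softmax weights.
import torch
import torch.nn as nn

class KPMSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        def cbr(cin, cout, k=3, d=1):
            p = d * (k - 1) // 2
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, padding=p, dilation=d, bias=False),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.pre = cbr(channels, channels)               # CBR preprocessing
        self.fuse_pos = cbr(channels, channels)          # fuse directional information
        self.dconv3 = cbr(channels, channels, k=3, d=2)  # dilated conv, 3x3, dilation 2
        self.dconv5 = cbr(channels, channels, k=5, d=2)  # dilated conv, 5x5, dilation 2
        self.recover = cbr(2 * channels, channels)       # channel recovery after concat

    def forward(self, x):
        f = self.pre(x)                                   # (B, C, H, W)
        pool_h = f.mean(dim=3, keepdim=True)              # average pool along width  -> (B, C, H, 1)
        pool_w = f.mean(dim=2, keepdim=True)              # average pool along height -> (B, C, 1, W)
        pos = self.fuse_pos(pool_h * pool_w)              # key-point location map (B, C, H, W)
        multi = torch.cat([self.dconv3(pos), self.dconv5(pos)], dim=1)  # 2C channels
        attn = torch.softmax(self.recover(multi), dim=1)  # weight factor map
        return x + f * attn                               # residual with the SPM output
```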
3.4. Detection Head and Loss Function
The detection head uses the Cascade R-CNN [43] to predict object classes and bounding boxes, and a joint loss function is designed for the output predictions. The focal loss function [44] is used to obtain the object class loss value, the complete intersection-over-union (CIoU) function [45] is used to obtain the object bounding box loss value, and the two are combined to obtain the total loss value.
The object class loss value is calculated using the focal loss function, which controls the effects of positive and negative samples on the total loss by adjusting the focus factor:

$$L_{cls} = -\alpha \left(1 - p\right)^{\gamma} \log\left(p\right),$$

where $\alpha$ is the focus factor; since camouflaged objects account for a relatively small portion of the image, $\alpha$ is set to 0.25 to address the imbalance between classes. $\gamma$ denotes the modulation factor and is set to 2 to reduce the effect of the large number of high-confidence negative samples. $p$ denotes the classification probability of the corresponding class.
The object bounding box loss value is obtained by calculating the distance between the true object bounding box and the predicted object bounding box using the CIoU function:

$$L_{box} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v,$$

where $IoU$ denotes the intersection over union between the area of the predicted bounding box and the area of the true bounding box, $\rho\left(b, b^{gt}\right)$ denotes the Euclidean distance between the center point $b$ of the predicted bounding box and the center point $b^{gt}$ of the true bounding box, $c$ is the length of the diagonal of the smallest box enclosing the predicted and true bounding boxes, $\alpha$ is a positive trade-off parameter, and $v$ measures the consistency of the aspect ratio between the predicted and true bounding boxes:

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2,$$

where $w^{gt}$ and $h^{gt}$ represent the width and height of the true bounding box, and $w$ and $h$ represent the width and height of the predicted bounding box, respectively.
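For reference, the sketch below combines a focal class loss (α = 0.25, γ = 2) and a CIoU box loss into a joint loss using torchvision's built-in operators; the equal weighting of the two terms is an assumption, and the paper's own implementation may differ.

```python
# Illustrative joint loss: focal classification loss + CIoU bounding-box loss.
from torchvision.ops import sigmoid_focal_loss, complete_box_iou_loss

def joint_loss(cls_logits, cls_targets, pred_boxes, true_boxes):
    """Total loss = class loss (focal, alpha=0.25, gamma=2) + box loss (CIoU).

    cls_logits/cls_targets: float tensors of the same shape (per-class logits
    and binary targets); boxes are in (x1, y1, x2, y2) format.
    """
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets,
                               alpha=0.25, gamma=2.0, reduction="mean")
    l_box = complete_box_iou_loss(pred_boxes, true_boxes, reduction="mean")
    return l_cls + l_box  # assumed simple sum of the two terms
```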
5. Conclusions
In this paper, we proposed a ternary cascade perception-based detection method and constructed the CPNet to solve the problems of missed detections and low detection confidence in COD applications. The TCPM focuses on extracting the relationship information between features, the spatial information of the camouflaged object, and the location information of key points, and it fuses the representations in the deep network more efficiently. For the loss function, the focal loss function was used to obtain the object class loss value and the CIoU function was used to obtain the object bounding box loss value; this joint training method is more robust. A comparison conducted on the challenging COD10K dataset demonstrated the comprehensive performance advantage of the CPNet in COD: CPNet outperformed or was on par with state-of-the-art models, and its AP50 and AP75 exceeded those of the MPViT by 10%. The ablation experiments demonstrated the effectiveness of each proposed module. Finally, extended experiments demonstrated the effectiveness of the CPNet when applied to generalized COD, offering the community an initial model for generalized COD. In future work, we plan to introduce contrastive learning into COD so that the network can learn parameters by comparing the feature differences between camouflaged and generic objects. The introduction of edge information to refine the detection results can also be explored. Subsequently, we will also advance our research on generalized COD to promote the translation of research results into practical applications.