Features Split and Aggregation Network for Camouflaged Object Detection

Camouflaged objects are not visually distinct, so the difference between foreground and background is easy to miss, placing higher demands on detection systems. In this paper, we present a new framework for Camouflaged Object Detection (COD) named FSANet, which consists mainly of three operations: spatial detail mining (SDM), cross-scale feature combination (CFC), and a hierarchical feature aggregation decoder (HFAD). The framework simulates the three-stage detection process of the human visual mechanism when observing a camouflaged scene. Specifically, we extract five feature layers using the backbone and divide them into two parts with the second layer as the boundary. The SDM module simulates the human's cursory inspection of the camouflaged objects to gather spatial details (such as edge, texture, etc.) and fuses the features to create a cursory impression. The CFC module observes high-level features from various viewing angles and extracts shared features by thoroughly filtering features of various levels. We also design a side-join multiplication in the CFC module to avoid detail distortion and use feature element-wise multiplication to filter out noise. Finally, we construct an HFAD module to deeply mine effective features from these two stages, direct the fusion of low-level features using high-level semantic knowledge, and improve the camouflage map using hierarchical cascade technology. Compared with nineteen deep-learning-based methods on seven widely used metrics, our proposed framework has clear advantages on four public COD datasets, demonstrating the effectiveness and superiority of our model.


Introduction
When viewing an image or encountering a scene, it can be challenging to notice an object at first glance if the difference between foreground and background is minimal [1][2][3][4][5][6]. The attempt by one or more objects to modify their traits (such as color, texture, etc.) to blend into the environment and avoid discovery is known as camouflage. There are two types of camouflage, as depicted in Figure 1. Natural camouflaged objects [7,8] are animals that use their inherent advantages to blend in with their surroundings and protect themselves. For instance, chameleons and other animals can change their color and physical appearance to match the hues and patterns of their surroundings. Artificial camouflaged objects, by contrast, were initially employed in warfare, where soldiers and military gear use camouflage to blend into their surroundings. In daily life, we can also see artificial camouflage, such as body art. Because of the characteristics of camouflaged objects, the study of COD has not only scientific value but also significant engineering applications, such as surface defect detection [9], polyp segmentation [10], pest control, and search and rescue [11,12]. Research on camouflage can be traced back to 1998. In recent years, COD has attracted more and more attention from researchers. Traditional models are mainly based on hand-crafted features (color, texture, optical flow, etc.) to describe the unified features of the object [13][14][15][16]. However, limited by hand-crafted features, traditional models cannot work well when the background environment changes [17,18]. To address this, deep-learning-based techniques for COD have been developed, which utilize deep features automatically learned by the network from extensive training images. These features are more generic and effective than hand-crafted features. For example, Fan et al. [19] designed the first deep-learning-based model, SINet, which simulates the human visual mechanism and is especially suited for COD. Zheng et al. [20] successfully predicted the camouflage map using short connections in the framework. However, there are still some shortcomings in existing models. Specifically, (1) they cannot deeply explore high-level features, leading to imprecise localization of small objects; (2) there is no particularly effective method for integrating high-level and low-level features, and some models even directly discard low-level features, resulting in suboptimal handling of object edges.
Inspired by the observations above, we propose a new model for COD named the features split and aggregation network (FSANet) to solve these problems, as shown in Figure 2, which primarily consists of three modules. Taking into account how low-level features and high-level features contribute differently to the creation of camouflage maps [19,20], we use the backbone to extract five feature layers and divide them into two parts with the second layer as the boundary. The first stage consists of the backbone's first two layers, simulating a human's cursory examination of the scene to gather spatial details (such as edge, texture, etc.). The SDM module is used in this phase to fuse the features and create a cursory impression. The second stage consists of the backbone's last three layers, simulating a person's additional observation and reworking of imperfect scenes. Specifically, both shared and distinct information is observed for the same feature across different viewing angles. We employ an ordinary convolution layer and the TEM [1] module to mimic the human evaluation of features from various angles. After that, we fuse these features to enhance the shared ones. To avoid detail distortion and to filter out noise through feature multiplication, we simultaneously utilize side-join multiplication (SJM), as shown by the red line in Figure 2. These operations constitute the CFC module. Finally, to obtain more thorough detection results, we build the HFAD module to thoroughly mine effective information from the two stages. We guide the fusion of low-level features by using high-level semantic information, and we enhance the camouflage map generated in the earlier stage by using a hierarchical cascade technique.
Overall, we can summarize our main contributions as follows:
1. We simulate how humans observe camouflaged scenes to propose a new COD method that includes a spatial detail mining module, a cross-scale feature combination module, and a hierarchical feature aggregation decoder. We rigorously test our model against nineteen others on four public datasets (CAMO [21], CHAMELEON [22], COD10K [1], and NC4K [2]) and evaluate it across seven metrics, where it demonstrates clear advantages.
2. To fully mine spatial detail information, we design a spatial detail mining module that interacts with first-level feature information, simulating a human's cursory examination. To effectively mine information in high-level features, we design a cross-scale feature combination module that strengthens high-level semantic information by combining features from adjacent scales, simulating humans' evaluation of features from various angles. Furthermore, we build a hierarchical feature aggregation decoder to fully integrate multi-level deep features, simulating humans' aggregation and processing of information.

Related Works
This section discusses COD based on deep learning and context-aware deep learning, both of which are related to our model.

Camouflaged Object Detection (COD)
Camouflaged Object Detection (COD) has become an important area of research for identifying objects that blend in with their surroundings. In this emerging field, significant contributions have been made.
Fan et al. [19] utilized a search module (SM) alongside a partial decoder component (PDC) [23] to enhance the accuracy of initial detection zones, fine-tuning the identification of camouflaged objects by focusing on salient features within the rough areas. Sun et al. [24] developed C2F-Net, which employs multi-scale channel attention to guide the fusion of features across different levels. Their approach ensures that both local nuances and global context are considered, thus improving the detection of objects across various scales. Mei et al. [3] introduced PFNet, which combines high-level feature maps with inverted predictions. By integrating these with the current layer's attributes and processing them through a context exploration block, the network effectively reduces false positive and false negative detections through strategically employed subtraction. Li et al. [25] proposed JCOSOD, which considers the uncertainties inherent in fully labeling camouflaged objects. They used a fully convolutional discriminator to gauge confidence in predictions, and an adversarial training strategy was applied to refine the model's ability to estimate prediction confidence.
Together, these advancements reflect a growing sophistication in COD, showing a trend towards more nuanced algorithms capable of distinguishing objects that are naturally or artificially designed to be hard to detect.

Context-Aware Deep Learning
Contextual information is important in object segmentation tasks, as it can improve feature representation and, in turn, performance. Various efforts have been made to exploit contextual information. Stars et al. [26] proposed a saliency detection algorithm based on four psychological principles. The model defines the algorithm using local low-level considerations, global considerations, and visual organization rules. High-level factors are used for post-processing and help produce compact, attractive, and information-rich results. Chen et al. [27] created ASPP, which collects contextual information using various dilated convolutions. They proposed an approach that ultimately produces accurate semantic segmentation results based on the DCNN's capability to detect objects and the fully connected CRF's capability to localize objects with fine detail. To improve the features of the local context, Tan et al. [28] employed LCANet to merge the local area context and the global scene context in a coarse-to-fine framework.

The Proposed Method
In this section, we present the overall architecture of FSANet before delving into the specifics of each module. Finally, we discuss the training loss function of the proposed model.

Overall Architecture
The overall architecture of FSANet can be seen in Figure 2. It consists primarily of the spatial detail mining module, the cross-scale feature combination module, and the hierarchical feature aggregation decoder, which together endow the model with the ability to detect camouflaged objects. Specifically, for an input image I ∈ R^{W×H×3}, we use Res2Net-50 [29] as the backbone to extract five different levels of information, denoted as F_i, i ∈ {1, 2, 3, 4, 5}. The resolution of each layer is (H/k) × (W/k), with k = 4, 4, 8, 16, 32, respectively. We divide the backbone into two parts, with the second layer as the boundary. The first two levels of features are low-level fine-detail features, including spatial details (such as edge, texture, etc.), while the last three layers F_i, i ∈ {3, 4, 5} are high-level semantic features that include specific details (such as semantic information, position, etc.). We obtain the spatial details from the low-level features by using the spatial detail mining module and combine the features to provide a superficial impression, denoted by P_1; however, this impression contains more redundant information. For the high-level features, we design a cross-scale feature combination module to obtain three layers of high-level features denoted as P_i, i ∈ {2, 3, 4}, each with different specific semantic information. Finally, we employ the hierarchical feature aggregation decoder, which utilizes the high-level feature layers to refine and fuse the low-level features layer by layer, yielding the prediction map for COD. Below are detailed descriptions of each key component.
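As a sketch of the feature split described above, the snippet below shows the shapes of the five backbone feature maps for a 352 × 352 input and how they divide into the two stages. The per-stage channel widths are our assumption (typical Res2Net-50 values, not stated in the paper):

```python
import numpy as np

H, W = 352, 352                      # training resolution used in the paper
ks = [4, 4, 8, 16, 32]               # downsampling factor k of each backbone stage
chs = [64, 256, 512, 1024, 2048]     # assumed Res2Net-50 stage channels

# Five backbone feature maps F1..F5 (batch dimension omitted for brevity)
feats = [np.zeros((c, H // k, W // k)) for c, k in zip(chs, ks)]

low_level = feats[:2]    # F1, F2 -> spatial details, fed to SDM
high_level = feats[2:]   # F3, F4, F5 -> semantics, fed to CFC

print([f.shape for f in low_level])   # [(64, 88, 88), (256, 88, 88)]
print([f.shape for f in high_level])  # [(512, 44, 44), (1024, 22, 22), (2048, 11, 11)]
```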

Spatial Detail Mining (SDM)
Because the camouflaged object is very similar to the background, the extracted low-level features contain rich spatial detail, and adjacent features have great similarity, but they also contain more noise. Therefore, we use this module to find similar features and eliminate noise, simulating a human's cursory examination of the scene to gather spatial details. To retain the fine-detail information in the low-level features while discarding noise and unsuitable features, we combine the adjacent features of the first two layers F^C_i, i ∈ {1, 2}: element-wise multiplication extracts their shared features, followed by element-wise addition. We obtain R_1 and R_2 after these operations. Then, we concatenate R_1 and R_2 to obtain R_C. Global average pooling is applied to R_C to weight the features, which are further enhanced with local information through element-wise multiplication with the original feature R_C. Finally, the channels are reduced to 32 through convolution to obtain feature P_1 with the size (H/4) × (W/4) × 32. In the spatial detail mining process, CBR(·) represents the Conv + BN + ReLU operation, Concat{·} represents the concatenation operation along dim = 1, and G(·) represents the global average pooling operation, which is used to establish relationships between feature maps and categories.
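The SDM fusion can be sketched in plain numpy. Here `cbr` is a crude stand-in for the learned Conv + BN + ReLU block (a random 1×1 projection plus ReLU), and the two inputs are assumed to be already channel-aligned, so this is an illustrative approximation rather than the paper's exact module:

```python
import numpy as np

def cbr(x, out_ch):
    """Stand-in for Conv+BN+ReLU: a random 1x1 channel projection + ReLU."""
    c = x.shape[0]
    w = np.random.randn(out_ch, c) * 0.01
    return np.maximum(np.tensordot(w, x, axes=([1], [0])), 0)

def sdm(f1, f2):
    shared = f1 * f2                      # element-wise multiplication -> shared features
    r1 = f1 + shared                      # element-wise addition restores each branch
    r2 = f2 + shared
    rc = np.concatenate([r1, r2], axis=0) # concat along channels (dim=1 in PyTorch)
    gap = rc.mean(axis=(1, 2), keepdims=True)  # global average pooling -> channel weights
    rc = rc * gap                         # re-weight the local features
    return cbr(rc, 32)                    # reduce channels to 32 -> P1

f1 = np.random.rand(32, 88, 88)
f2 = np.random.rand(32, 88, 88)
p1 = sdm(f1, f2)
print(p1.shape)  # (32, 88, 88)
```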

Cross-Scale Feature Combination (CFC)
Different types of camouflaged objects have varying colors, physical traits, and camouflage techniques. Similarly, camouflaged objects of the same type can have different camouflage methods and sizes in various environments, making it more challenging to locate them. In studies on biological vision, researchers have long discussed the challenges of perspective, and both viewpoint-invariant and viewpoint-dependent theories have been proposed [30][31][32]. Viewpoint-invariant theories assert that a particular object can be recognized from diverse viewing angles while maintaining its properties. Viewpoint-dependent theories, on the other hand, suggest that recognition performance depends on the viewing angle. Separately, Tarr et al. [33] proposed a multi-view model in which objects are represented by a series of images from familiar viewpoints, with each view describing a different view-specific object characterization.
Inspired by this, we realize that using various receptive fields with reduced-channel operations can provide additional feature information about the objects. Thus, we design the cross-scale feature combination module, which processes features and their adjacent features differently to obtain different viewpoints and finally fuses them into advanced features, simulating a person's additional observation and reworking of imperfect scenes.
Specifically, we utilize Conv3 to handle the features F_i, i ∈ {2, 3, 4, 5} to preserve object boundaries and enhance local context information, and we use the TEM [1] to handle the features F_i, i ∈ {3, 4, 5} to further capture multi-scale information. This yields F^C_i, i ∈ {2, 3, 4, 5} and F^R_i, i ∈ {3, 4, 5}, with all feature channels adjusted to 32. After that, we use NCD*, which selectively removes upsampling from NCD [1] to ensure dimensional consistency, to fine-tune and effectively combine features from different viewpoints; the inputs are F^C_{i−1}, F^C_i, and F^R_i, with output F^N_i. However, since camouflaged objects are relatively blurred, using NCD* may lose detail and other useful information while enhancing similar features. Therefore, we use side-join multiplication to re-add details to the output features and to filter out noise through multiplication, further enhancing the object's features; this yields P_i, i ∈ {2, 3, 4}. The operation is depicted by the red line in Figure 2. In the cross-scale feature combination process, CBR(·) represents the Conv + BN + ReLU operation, N{·} represents the neighbor connection decoder (NCD*), and T(·) represents the texture-enhanced module (TEM).
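The side-join multiplication idea can be sketched as follows. The exact learned form is not spelled out in the text, so the `(out + s) * s` update below is only one plausible reading of "re-add details, then multiply to filter noise"; the branch features are hypothetical placeholders:

```python
import numpy as np

def side_join_multiply(fused, sides):
    """Sketch of SJM: addition restores detail from each side branch,
    multiplication suppresses responses the branch does not support."""
    out = fused
    for s in sides:
        out = (out + s) * s
    return out

fn = np.random.rand(32, 44, 44)   # NCD* output F^N_i (placeholder values)
fc = np.random.rand(32, 44, 44)   # Conv3 branch F^C_i
fr = np.random.rand(32, 44, 44)   # TEM branch F^R_i
p = side_join_multiply(fn, [fc, fr])
print(p.shape)  # (32, 44, 44)
```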

Hierarchical Feature Aggregation Decoder (HFAD)
We obtain improved features {P_1, P_2, P_3, P_4} using the methods described above. The next crucial problem is how to successfully bridge the context and fuse these features. To address this, our model employs hierarchical cascade technology, which gradually guides the fusion of low-level features using high-level semantics, simulating human processing of aggregated information obtained from different sources. We regard the process of fusing rich features as a decoder. Formally, the hierarchical feature aggregation decoder has four inputs, as shown in Figure 2. The module's general structure is an inverted-triangle hierarchy, primarily consisting of Conv3 + BN + ReLU layers and element-wise multiplication to extract similarities between features. To ensure that the cascade can be completed, we resize the features to an appropriate size in the process by using an upsampling operation.
Specifically, we first apply an upsampling operation to P_4 to make it the same shape as P_3 and multiply them to obtain S_3. Then, we upsample P_3 and P_4, respectively, to match the size of P_2 and multiply them to obtain S_2. The same operations are applied to obtain S_1. In this progressive process, CBR(·) represents the Conv + BN + ReLU operation; to ensure that the candidate features share the same size, we apply upsampling before element-wise multiplication, where δ_2↑(·) means 2× upsampling by bilinear interpolation and δ_4↑(·) means 4× upsampling. After performing the above operations, we obtain three refined features, denoted by {S_1, S_2, S_3}. Then, we use concatenation with Conv3 + BN + ReLU layers to enhance the features step by step, obtaining S^Cat_3, S^Cat_2, and S^Cat_1. Finally, we use a convolution layer to reduce the channels and obtain the final prediction map M ∈ R^{W×H×1}. In these formulas, CBR(·) represents the Conv3 + BN + ReLU operation, δ_2↑(·) means 2× upsampling by bilinear interpolation, Concat{·} represents concatenation along dim = 1, and Conv means a 1 × 1 convolutional layer. Following these operations, we obtain the prediction map.
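The cascade above can be sketched in numpy. Nearest-neighbour repetition stands in for bilinear upsampling, the Conv3 + BN + ReLU layers are omitted, and the exact composition of S_1 (here gated by all three higher levels) is our assumption:

```python
import numpy as np

def up(x, factor):
    """Nearest-neighbour stand-in for the paper's bilinear upsampling."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

# P1..P4 at strides 4, 8, 16, 32 (32 channels each, 352x352 input)
P1 = np.random.rand(32, 88, 88)
P2 = np.random.rand(32, 44, 44)
P3 = np.random.rand(32, 22, 22)
P4 = np.random.rand(32, 11, 11)

# Inverted-triangle refinement: high-level semantics gate lower levels
S3 = P3 * up(P4, 2)
S2 = P2 * up(P3, 2) * up(P4, 4)
S1 = P1 * up(P2, 2) * up(P3, 4) * up(P4, 8)

# Hierarchical cascade: concatenate upward step by step (convs omitted)
S3c = S3
S2c = np.concatenate([S2, up(S3c, 2)], axis=0)
S1c = np.concatenate([S1, up(S2c, 2)], axis=0)
print(S1c.shape)  # channels grow per concat; a final 1x1 conv would reduce them to 1
```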

Loss Function
The binary cross-entropy (BCE) [34] loss, which highlights pixel-level differences, disregards discrepancies between neighboring pixels and weights foreground and background pixels equally, is often employed in binary classification problems. For object detection and segmentation, IoU is a commonly used metric that emphasizes global structure. Inspired by [35,36], we adopt the weighted BCE loss and the weighted IoU loss as a combined loss; compared with the regular BCE and IoU losses, the weighted variants place greater focus on hard samples. We define our loss as L = L^w_BCE + L^w_IoU, where L^w_BCE and L^w_IoU denote the weighted BCE loss and the weighted IoU loss, respectively. We apply the same parameter definition and setup as [36,37], which has proven successful.
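A minimal sketch of the combined loss is given below. The per-pixel `weight` map stands for the hard-sample weighting of the weighted variants (in [35,36] it is derived from local GT variation); with `weight=None` this reduces to the plain BCE + IoU combination, so it illustrates the structure of the loss rather than the paper's exact implementation:

```python
import numpy as np

def bce_iou_loss(pred_logits, gt, weight=None):
    """Combined (optionally weighted) BCE + IoU loss on a single map."""
    pred = 1.0 / (1.0 + np.exp(-pred_logits))   # sigmoid -> probabilities
    w = np.ones_like(gt) if weight is None else weight
    eps = 1e-8

    # Weighted BCE: pixel-level differences
    bce = -(gt * np.log(pred + eps) + (1 - gt) * np.log(1 - pred + eps))
    bce = (w * bce).sum() / w.sum()

    # Weighted IoU: global structure
    inter = (w * pred * gt).sum()
    union = (w * (pred + gt)).sum()
    iou = 1.0 - (inter + eps) / (union - inter + eps)
    return bce + iou

gt = np.array([[1.0, 0.0], [0.0, 1.0]])
good = bce_iou_loss(np.where(gt > 0, 10.0, -10.0), gt)  # confident, correct
bad = bce_iou_loss(np.where(gt > 0, -10.0, 10.0), gt)   # confident, wrong
```
A confident correct prediction yields a near-zero loss, while a confidently wrong one is heavily penalized by both terms.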
The four supervision maps in this model are all deeply supervised, and their locations are illustrated in Figure 2. Each map is enlarged through upsampling to align its dimensions with the GT. The total loss is L_total = Σ_{i=1}^{4} L(l_i, G), where G represents the GT and L(l_i, G) represents the loss between each output and the ground truth.

Experimental Results
In this section, we will delve into greater detail about the benchmark datasets in the COD field, evaluation measures, experimental setup, and ablation study.
The CAMO [21] dataset includes 1250 images and was proposed in 2019. It contains two types of scenes: indoor scenes (artworks) and outdoor scenes (disguised humans/animals). The dataset also includes a few images that are not camouflaged.
The CHAMELEON [22] dataset includes 76 natural images, each matched with an instance-level annotation. This collection primarily focuses on creatures disguised in complicated backgrounds, making it challenging for humans to distinguish them from the environment.
The COD10K [1] dataset includes 10K images, classified as 5066 camouflaged, 3000 background, and 1934 non-camouflaged images. The dataset is divided into five main categories and sixty-nine subcategories, including images of land, sea, air, and amphibian animals in camouflaged scenes as well as images of non-camouflaged environments.
The NC4K [2] dataset includes 4121 images and is the largest existing COD testing dataset. Its camouflaged scenes can be broadly classified into two categories, natural and artificial camouflage, and the majority of the visual scenes in this collection are naturally hidden.
Implementation Details: We train our model using the same training dataset as stated in [1], which consists of 4040 images from the COD10K and CAMO datasets; the remaining images are used for testing. Additionally, the training dataset is augmented by randomly flipping images to improve the sufficiency of network training, and each training image is resized to 352 × 352 in the training phase. Our model is built with PyTorch and runs on a PC with an NVIDIA GTX 2080Ti GPU. Some parameters are initialized from Res2Net-50 [29] during training, while the remaining parameters are randomly initialized. The network is optimized with the Adam algorithm [38], with the initial learning rate, batch size, and maximum epoch number set to 10^−4, 16, and 100, respectively.
Precision and recall are common metrics used to evaluate how well a model works. With the recall value as the horizontal coordinate and the precision value as the vertical coordinate, we can plot the associated precision and recall scores to evaluate the effectiveness of the models.
The S-measure [39] is used to determine the structural similarity between the prediction map and the corresponding ground truth. It is defined as S = α × S_o + (1 − α) × S_r, where S_o represents the object-level structural similarity measurement and S_r represents the region-based similarity. Following [39], α is set to 0.5. The F-measure [40] calculates the weighted harmonic mean of precision and recall under non-negative weights and is often used to compare the similarity of two images. It is expressed as F_β = ((1 + β²) × P × R) / (β² × P + R), where P represents precision and R represents recall. We set β² to 0.3, as suggested in [43], to emphasize precision. To better balance accuracy and completeness, we additionally weight recall and precision, as similarly conducted in [41], yielding the weighted F-measure F^w_β; its parameters are the same as those of F_β, and the superscript w indicates the weighted harmonic mean of precision and recall.
The E-measure [42] assesses the similarity between the prediction map and the ground truth by using the pixel-level significance value and the image-level average significance value. It is computed as E_m = (1 / (W × H)) Σ_x Σ_y φ(S(x, y), G(x, y)), where φ(·) stands for the enhanced alignment term, which captures statistics at both the image level and the pixel level, and W and H denote the image's width and height. MAE quantifies the average absolute difference between the model's output and the ground truth, acting as a pixel-level error index: MAE = (1 / (W × H)) Σ_i |S(i) − G(i)|, where S(i) represents the predicted map and G(i) represents the GT.
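Two of these metrics are simple enough to sketch directly. Note that published F-measure protocols typically sweep or adapt the binarization threshold; a fixed threshold of 0.5 is used here purely for illustration:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between prediction map and GT, both in [0, 1]."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """F-measure at a fixed binarization threshold (beta^2 = 0.3)."""
    b = pred >= thresh
    tp = np.logical_and(b, gt > 0.5).sum()
    precision = tp / max(b.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0          # a 2x2 camouflaged object in a 4x4 map
pred = gt.copy()            # perfect prediction
print(mae(pred, gt))        # 0.0
print(f_measure(pred, gt))  # 1.0
```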

Quantitative Comparison
For the COD datasets, we first present PR curves and F-measure curves for quantitative comparison. As shown in Figure 3, our model outperforms the other models in terms of both the PR curves and the F_β curves. This is due to the feature fusion approach we use (see Section 3.4 for details). Moreover, as listed in Table 1, our model obtains superior scores on the four COD datasets under five public camouflage-map quality evaluation metrics. For instance, our model outperforms all the advanced models in five evaluation metrics on the CHAMELEON and COD10K datasets, achieving MAEs of 0.026 and 0.034, respectively, which are 7.14% and 5.56% lower than FAPNet. Similarly, compared with FAPNet, F_β also improves by 2.06% and 1.66% on the CAMO and CHAMELEON datasets, respectively. Although our model's F^w_β score ranks second among the available models on the NC4K dataset, it decreases by only 0.26%. Furthermore, as shown in Table 2, our model achieves great results across categories within the COD10K dataset; for instance, on COD10K-Amphibian, our model's MAE decreases by 15.63% compared with FAPNet. Table 1. Quantitative comparison of different methods on four COD testing datasets, covering S-measure (S_m), weighted F-measure (F^w_β), F-measure (F_β), E-measure (E_m), and mean absolute error (MAE). Here, "↑" ("↓") means that larger (smaller) is better. The best three results in each column are marked in red, green, and blue. Overall, Figure 3 and Tables 1 and 2 readily show the excellence and efficiency of our model, which attains SOTA performance.

Qualitative Comparison
We carry out several visual contrast experiments and provide the corresponding images for a qualitative comparison of all models. As shown in Figure 4, our model's detection results are closer to the GT, indicating that our results are more complete and precise than those of the other models. In general, our model has two major advantages: (a) Object placement accuracy: in the first, second, seventh, and eighth rows of Figure 4, our model's outcomes closely resemble the GT. In contrast, other deep-learning-based models (e.g., (d) FAPNet [8], (e) SINet_V2 [1], (l) SINet [19], etc.) shown in Figure 4 find the object but mistake a portion of the background for the object in the process.
(b) Advantages in optimizing edge details: in the third, fourth, ninth, and tenth rows of Figure 4, our model precisely locates the object and properly identifies microscopic details. Other models (e.g., (d) FAPNet [8], (e) SINet_V2 [1], (g) C2FNet [24], (l) SINet [19], etc.) may detect the object's major portion, but the object's boundary is unclear, tailing artifacts are severe, and the edge details are not readily apparent.
Based on the above comparisons, we can clearly establish the efficacy and superiority of the proposed FSANet. When identifying camouflaged objects, whether within the object or along its edge, our model performs better than the other models.

Ablation Studies
In this section, we conduct thorough experiments on two COD datasets to demonstrate the efficacy of each model component. Table 3 displays the quantitative comparison; Figure 5A-E display the qualitative comparisons. We conduct experiments on the SDM, TEM, SJM, and HFAD modules to validate their effectiveness; the implementation details of the experiments are given below. Table 3. Ablation studies on two testing datasets. Here, m-m SJM represents many-to-many side-join multiplication (as shown by the red line in Figure 2); o-m SJM represents one-to-many side-join multiplication. Here, "↑" ("↓") means that larger (smaller) is better. Table 3 demonstrates how the various operations further enhance the model's performance. When all the proposed modules are combined, our model performs best, particularly on the CAMO dataset, where it outperforms every other configuration. Compared with the variant that replaces many-to-many side-join multiplication with one-to-many side-join multiplication (No. #5), the S_m and MAE of our model on CAMO improve by 3.67% and 5.56%, respectively. When we remove TEM from the CFC module and use Conv3 instead (No. #1), the metrics on both datasets are noticeably worse; in particular, S_m and MAE show the most obvious decline. Experiment No. #3 verifies the effectiveness of HFAD: if we remove the HFAD, the F_β on the COD10K dataset improves marginally compared with ours, but the other indicators decrease significantly; in particular, E_m on the CAMO dataset declines by 19.25%.
We also provide the prediction maps of five ablation settings to visually demonstrate the effectiveness of our strategy. When we do not use TEM to enlarge the receptive field in the CFC module (No. #1), as shown in Figure 5A, the camouflaged objects can roughly be resolved, but the edge details are not smooth enough. As shown in Figure 5B,C, we introduce the SDM module and the CFC module (No. #2, #3) to address the issue that the prediction map has voids because high-level semantic features do not contain image spatial details and related information. Furthermore, we separately examine the many-to-many and one-to-many side-join multiplication in the CFC module (No. #4, #5), as shown in Figure 5D,E; the many-to-many side-join multiplication we devised improves detection accuracy by re-adding the information that NCD* overlooked to the prediction feature.
The qualitative and quantitative analyses of the above ablation study demonstrate that our model fully meets the intended design goals.

Failure Cases and Analysis
As shown in Table 4, we evaluate the inference speed of our model in comparison with other models. The findings demonstrate that, despite our model's successful use of the SDM, CFC, and HFAD modules and achievement of the primary design goals, a significant amount of redundancy still exists in it; thus, the model will be further developed from an efficiency standpoint. On the other hand, the first and second rows of Figure 6 depict failure cases in which numerous camouflaged objects are present but only one is detected by our model. This may be because the CFC module focuses more on scenarios with a single camouflaged object and filters out the other objects as background. In subsequent research, we will further explore methods for multi-object detection, such as instance segmentation [54]. Furthermore, as demonstrated in the third and fourth rows, the object's edges are processed coarsely and the background is wrongly identified as foreground under artificial camouflage. This may result from the SDM module's limited ability to filter out interference. These results offer fresh perspectives for our upcoming model design.

Conclusions
In this paper, we propose a new model named the features split and aggregation network (FSANet) to detect camouflaged objects. It can be divided into three modules that simulate the three-stage detection process of the human visual mechanism when viewing a camouflaged scene. First, we divide the backbone into two stages. The SDM module is used in the first stage to perform information interaction on first-level features, fully mining the spatial details (such as edge, texture, etc.) and fusing the features to create a cursory impression. In parallel, high-level semantic information from several receptive fields is mined using the CFC module. Furthermore, we apply side-join multiplication in CFC to prevent detail distortion and reduce noise. Finally, we configure HFAD to completely fuse the effective information between the two stages and acquire more thorough detection results. Through in-depth experiments on four public camouflage datasets, we observe that both quantitative and qualitative results verify the effectiveness of our methodology and prove the validity and superiority of our model. However, our model still has some limitations. When there are numerous camouflaged objects, our model can detect only one. Additionally, for artificially camouflaged objects, our model fails to perform fine-grained edge processing. These results provide new directions for our upcoming model design. Furthermore, we aspire for our model to be adaptable across a broader range of applications, including but not limited to industrial defect detection and medical image segmentation and detection.

Figure 1 .
Figure 1. Examples of camouflaged objects; from left to right: natural camouflaged objects and artificial camouflaged objects.

Figure 2 .
Figure 2. The overall architecture of the proposed FSANet, which can be divided into three key components: the spatial detail mining module, the cross-scale feature combination module, and the hierarchical feature aggregation decoder. The input is the camouflaged image I, and the result is the prediction map M.

Figure 3.
Figure 3. Quantitative evaluation of different models. The first row shows PR curves; the second row shows F-measure curves; (a,b) display the results for the CAMO dataset; (c,d) display the results for the CHAMELEON dataset.

Table 2 .
Quantitative comparison of different methods on four COD10K testing dataset categories, covering S-measure (S_m), weighted F-measure (F^w_β), F-measure (F_β), E-measure (E_m), and mean absolute error (MAE).

Table 4 .
Comparisons of the number of parameters, FLOPs, and FPS corresponding to recent COD methods.All evaluations follow the inference settings in the corresponding papers.