LiteCOD: Lightweight Camouflaged Object Detection via Holistic Understanding of Local-Global Features and Multi-Scale Fusion

Khan, Abbas; Ullah, Hayat; Munir, Arslan

doi:10.3390/ai6090197

by Abbas Khan, Hayat Ullah and Arslan Munir *

Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA

* Author to whom correspondence should be addressed.

AI 2025, 6(9), 197; https://doi.org/10.3390/ai6090197

Submission received: 7 July 2025 / Revised: 13 August 2025 / Accepted: 19 August 2025 / Published: 22 August 2025

Abstract

Camouflaged object detection (COD) represents one of the most challenging tasks in computer vision, requiring sophisticated approaches to accurately extract objects that seamlessly blend within visually similar backgrounds. While contemporary techniques demonstrate promising detection performance, they predominantly suffer from computational complexity and resource requirements that severely limit their deployment in real-time applications, particularly on mobile devices and edge computing platforms. To address these limitations, we propose LiteCOD, an efficient lightweight framework that integrates local and global perceptions through holistic feature fusion and specially designed efficient attention mechanisms. Our approach achieves superior detection accuracy while maintaining computational efficiency essential for practical deployment, with enhanced feature propagation and minimal computational overhead. Extensive experiments validate LiteCOD’s effectiveness, demonstrating that it surpasses existing lightweight methods with average improvements of 7.55% in the F-measure and an 8.08% overall performance gain across three benchmark datasets. Our results indicate that our framework consistently outperforms 20 state-of-the-art methods across quantitative metrics, computational efficiency, and overall performance while achieving real-time inference with a parameter count of only 5.15M. LiteCOD establishes a practical solution bridging the gap between detection accuracy and deployment feasibility in resource-constrained environments.

Keywords:

computer vision; efficient camouflaged object detection; concealed object segmentation; vision transformer; efficient spatial attention; convolutional neural network

1. Introduction

Camouflaged object detection (COD) presents a fundamental challenge in computer vision, requiring models to identify objects that exhibit high visual similarity to their surrounding environments [1]. Inspired by natural camouflage mechanisms observed in chameleons, stingrays, and caterpillars [2], this task demands sophisticated algorithms capable of discerning subtle appearance variations between foreground objects and background regions [3]. The inherent difficulty lies in the objects’ evolutionary adaptation to blend seamlessly with their surroundings, making traditional detection methods inadequate. Beyond its biological inspiration, COD has emerged as a critical component in diverse applications including medical image analysis [4], military target detection [5], underwater image analysis [6], infrastructure inspection [7], and wildlife conservation [8]. The key challenge in achieving robust COD performance centers on developing discriminative feature representations that can capture fine-grained differences while maintaining robustness to environmental variations and lighting conditions.

Recent advances in COD can be broadly categorized into two main research directions. The first focuses on designing sophisticated architectural components that emulate biological hunting mechanisms [9,10] or human visual attention patterns [11,12] for camouflaged object identification. The second direction explores the integration of auxiliary supervision [13,14] or domain-specific prior knowledge [15,16] to enhance detection accuracy. While these approaches have demonstrated significant performance improvements, they typically require substantial computational resources and model complexity, limiting their practical deployment in resource-constrained environments such as mobile platforms. Despite remarkable progress in COD performance, most existing methods suffer from significant computational overhead that constrains their practical deployment. The trend towards increasingly complex architectures, while beneficial for accuracy, often neglects efficiency considerations crucial for real-world applications. Current limitations include heavy reliance on large-scale Vision Transformer (ViT) [17] architectures that result in substantial memory requirements and inference latency, multi-branch and multi-scale designs that lead to parameter redundancy without proportional performance gains, and limited consideration for edge computing scenarios where computational resources are severely restricted. These constraints collectively hinder the deployment of state-of-the-art (SOTA) COD methods in mobile devices, embedded systems, and real-time processing scenarios where efficiency is paramount.

To address these efficiency constraints, recent research has shifted toward developing lightweight COD architectures. These methods primarily leverage multi-scale feature aggregation [18,19], frequency domain analysis [20], texture-aware representations and structural priors [21] to compensate for the reduced capacity of compact backbone networks. However, existing lightweight approaches often sacrifice detection accuracy for computational efficiency, failing to achieve an optimal balance between model performance and practical deployment requirements. A critical limitation in current COD methods is the insufficient integration of global semantic understanding and local spatial precision. Existing approaches primarily process global contextual information only in the encoder stages, neglecting the importance of holistic feature integration throughout the entire decoder hierarchy for dense prediction tasks. This architectural limitation becomes particularly pronounced in lightweight models where computational constraints further restrict the capacity for comprehensive feature fusion. For practical deployment considerations, we define real-time performance as achieving inference speeds suitable for interactive visual applications, specifically targeting frame rates of approximately 30 FPS and parameter counts of less than 10M for continuous video processing scenarios. This corresponds to soft real-time requirements where occasional deadline misses are acceptable, enabling deployment in surveillance systems, agriculture and underwater imaging analysis, military target detection, and mobile wildlife monitoring applications where consistent but not strictly deterministic timing is sufficient.

Drawing inspiration from GLCONET [22], a sophisticated and promising approach for camouflaged object detection, we present LiteCOD, which addresses computational overhead limitations while maintaining detection effectiveness. Our lightweight COD framework effectively balances detection accuracy with computational efficiency through holistic global–local feature integration. While GLCONET demonstrates excellent detection performance, its computational complexity limits practical deployment in resource-constrained environments. Our approach addresses this fundamental trade-off between model performance and practical deployment constraints by introducing LiteCOD, a lightweight COD architecture featuring Holistic Unification Modules (HUMs) for bilateral global–local feature enhancement and Enhanced Context Generation (ECG) for multi-scale contextual understanding. The framework incorporates an ultra-lightweight Efficient Spatial Attention (ESA) mechanism within the HUMs that captures long-range spatial dependencies through shared query–key projections, alongside a multi-stage feature integration (MFI) strategy that facilitates effective information propagation from coarse semantic features to fine-grained spatial details across the decoder hierarchy. Our modular approach consistently outperforms existing lightweight methods while maintaining competitive performance against heavyweight approaches, establishing an optimal balance between detection accuracy and computational efficiency for practical deployment scenarios. The primary contributions of this work are as follows:

We propose LiteCOD, a lightweight COD architecture featuring Holistic Unification Modules (HUMs) for bilateral global–local feature enhancement and Enhanced Context Generation (ECG) for multi-scale contextual understanding, achieving SOTA performance with only 5.1M parameters and real-time inference at 76 FPS.
We introduce an ultra-lightweight Efficient Spatial Attention (ESA) mechanism within the HUMs that captures long-range spatial dependencies with a significant reduction in FLOPs through shared query–key projections and streamlined attention computation, enabling comprehensive spatial relationship modeling while preserving computational efficiency.
We design a multi-stage feature integration (MFI) strategy that incorporates progressive contextual guidance and region-aware spatial weighting, facilitating effective information propagation from coarse semantic features to fine-grained spatial details across the entire decoder hierarchy through systematic feature refinement.
Comprehensive experiments on standard COD benchmarks (CAMO, COD10K, NC4K) demonstrate that our modular approach consistently outperforms existing lightweight methods while maintaining competitive performance against heavyweight approaches.

The remainder of this paper is organized as follows: Section 2 provides a comprehensive review of related work in camouflaged object detection, covering traditional approaches, deep learning-based methods, and lightweight COD techniques. Section 3 presents our proposed methodology, detailing the overall framework architecture, lightweight backbone feature extraction, global–local processing modules, Holistic Unification Modules (HUMs), Enhanced Context Generation (ECG), multi-stage feature integration (MFI), and progressive upsampling with multi-scale supervision. Section 4 describes the experimental setup, implementation details, datasets, and evaluation metrics and presents comprehensive quantitative and qualitative results, including comparisons with SOTA methods and ablation studies validating each component’s effectiveness. Finally, Section 5 concludes the paper by summarizing our contributions and discussing future research directions for lightweight camouflaged object detection.

2. Related Work

2.1. Traditional Camouflaged Object Detection

Early COD methods relied on hand-crafted features and low-level visual cues such as color consistency, texture patterns, and intensity variations [23,24]. These approaches exploited statistical differences between camouflaged objects and backgrounds through adaptive identification of texture elements (size, aspect ratio, orientation), photometric properties (brightness, color, intensity), and spatial frequency responses combined with filter statistics for foreground–background differentiation [25,26]. However, the inherent similarity between camouflaged objects and background regions severely limits low-level feature effectiveness, as manually designed features fail to capture complex semantic relationships and structural patterns necessary for accurate detection of sophisticated natural camouflage strategies.

2.2. Deep Learning-Based COD Methods

The advent of large-scale COD datasets, including CAMO [27], COD10K [28], and NC4K [29], has catalyzed the development of sophisticated deep learning approaches. Modern COD methods can be categorized into several key paradigms.

2.2.1. Biologically Inspired Approaches

Drawing inspiration from human visual perception mechanisms, several methods have developed dual-stage and multi-stage processing frameworks that mirror natural visual cognition. SINet [1] introduced a search–identification paradigm that mimics the human visual system’s hierarchical processing, where a search module first localizes potential camouflaged regions through coarse attention, followed by an identification module that performs precise segmentation through detailed analysis. This biologically motivated approach effectively replicates human ability to rapidly scan environments for anomalies before engaging in focused examination. Building upon this foundation, PFNet [10] extends the localization–segmentation framework by incorporating a human-like distraction-aware mechanism that first identifies object boundaries and subsequently employs attention refinement to counter visual distractions in camouflaged scenarios. This two-stage paradigm has inspired numerous approaches [30,31] with various extensions including additional stages for object restoration, feature matching, and signal amplification to achieve enhanced detection precision and adaptability across diverse camouflage patterns. Beyond dual-stage frameworks, other techniques explore different aspects of human visual perception. ZoomNet [11] emulates human visual patterns of examining ambiguous images through dynamic zooming mechanisms, implementing scale integration and hierarchical processing units to capture mixed-scale semantic information. Additionally, MirrorNet [32] employs a mirror stream architecture with embedded image flipping as a bio-inspired strategy, disrupting camouflage patterns by leveraging the visual system’s sensitivity to symmetry and orientation changes, thereby enhancing detection of objects that rely on directional camouflage mechanisms.

2.2.2. Supplementary Information-Based Approaches

Recent COD approaches integrate diverse supplementary information sources to enhance detection accuracy and robustness. Frequency-domain methods include FEMNet, which first extended COD into spatial–frequency domains through RGB–frequency feature fusion, and FEDER [15], which decomposes features into frequency bands via learnable wavelets with edge reconstruction. Depth-perceptual approaches utilize auxiliary depth information: DCE [33] introduced depth-guided networks with a multimodal confidence-aware loss, XMSNet [34] implemented attentive fusion for cross-modal semantic mining, and PopNet [35] employed source-free depth for object popout. Prompt-learning methods leverage textual or visual guidance, including CoVP [36] with vision–language prompts for large models, GenSAM [37] employing cross-modal thought prompting, and VSCode [38] introducing 2D domain-specific prompt learning for multimodal SOD and COD with zero-shot capabilities. However, these multimodal approaches require additional data sources and increased computational complexity, limiting practical deployment in resource-constrained environments.

2.2.3. Multi-Scale Feature Integration and Contextual Enhancement

This strategy captures the diverse appearances and varying scales of camouflaged objects with rich context information and then aggregates cross-level features [39,40], while gradually refining features [41,42] specifically in a hierarchical [11,43], residual [44], dual-branch [45,46], X-connection [47], or iterative [12] manner, to enhance the representation. Some are further enhanced by edge [16], frequency [15], or coarse prediction maps [48] in the fusion process. ERRNet [39] and HCM [49] propose a reversible recalibration mechanism that leverages prior prediction maps, specifically targeting low-confidence regions to detect previously missed parts. This approach refines detection by focusing on regions that are initially overlooked. To improve efficiency, TinyCOD [18] introduces an adjacent scale feature fusion strategy whereas CamoFormer [50] adopts masked separable attention, where multi-head self-attention is divided into three components. This approach allows for the simultaneous refinement of features at different levels in a top–down manner. While Vision Transformers excel in global context modeling, they often struggle with locality modeling and feature fusion. To address these issues, FSPNet [51] introduces a nonlocal token enhancement mechanism for improved feature interaction and a feature shrinkage decoder. OWinCANet [52] employs overlapped window cross-level attention. This method enhances low-level features with high-level guidance by sliding aligned window pairs across feature maps, ensuring a balance between local and global features for superior performance.

Pixel-wise annotation of camouflaged objects is time-consuming and labor-intensive. Weakly supervised methods, using limited or scribble annotations, aim to reduce labeling effort and address boundary ambiguities. CRNet [53] designs a local-context contrasted module to enhance image contrast and a logical semantic relation module to analyze semantic relations, combined with feature-guided and consistency losses to impose stability on the predictions. Techniques like [53] utilize the visual foundation model SAM [54] with sparse annotations as prompts to achieve initial coarse segmentation. Enhanced with multiscale feature grouping, they generate reliable pseudo-labels for training off-the-shelf methods, addressing intrinsic similarity issues in coherent segmentation.

2.2.4. Lightweight COD Methods

To address computational efficiency constraints in camouflaged object detection, recent research has increasingly focused on developing lightweight architectures suitable for resource-constrained environments. These approaches typically employ lightweight backbones as encoders combined with various multi-scale feature fusion strategies and auxiliary enhancement techniques. Several notable contributions have emerged in this domain: TinyCOD [18] introduces an Adjacent Scale Feature Fusion module to improve lightweight backbone representations alongside an Edge Area Focus module specifically designed for challenging edge regions where camouflaged objects blend with backgrounds. Zhang et al. [55] proposed an attention-induced semantic and boundary interaction network that employs contrastive learning to effectively separate camouflaged objects from their surroundings. DGNet-S [21] utilizes gradient supervision with texture-aware dual context-texture encoders within a lightweight backbone framework to achieve real-time performance. Khan et al. [56] enhanced lightweight backbone representations through focal modulation blocks and masking strategies for camouflaged object excavation. Most recently, FINet [20] introduced a frequency injection module that separately incorporates high-frequency details and low-frequency object-level cues to address computational limitations.

Despite these advances, existing lightweight COD models struggle to balance computational efficiency with detection accuracy due to inadequate integration of global semantic understanding and local spatial precision. As summarized in Table 1, most methods process these complementary information sources separately, resulting in suboptimal feature representations that fail at fine-grained boundary delineation and subtle camouflage pattern recognition. Many lightweight approaches sacrifice critical attention mechanisms to achieve efficiency, reducing their discriminative capability for seamlessly blended objects. Our work addresses these limitations by introducing a holistic integration framework that maintains both global semantic coherence and local spatial precision while preserving computational advantages for practical deployment.

3. Proposed Methodology

In this section, as illustrated in Figure 1, we present our lightweight holistic global–local integration network for COD. Our framework addresses the limitations of existing methods by efficiently combining global semantic understanding with local detail preservation through a unified architectural design.

3.1. Overall Framework Architecture

Figure 1 depicts the overall architecture of our proposed method, which follows an encoder–decoder paradigm with holistic feature integration mechanisms. Given an input image, $I \in \mathbb{R}^{H \times W \times 3}$, the lightweight backbone encoder generates initial multi-scale features, $\{S_{i}\}_{i=1}^{4}$, with resolutions of $\frac{W}{2^{i}} \times \frac{H}{2^{i}}$. Our global–local (G-L) processing modules split backbone features into global contextual understanding ($G_{i}$) and local detail preservation ($L_{i}$) pathways, generating enhanced representations with reduced channel dimensions. The Holistic Unification Modules (HUMs) fuse these complementary features to produce comprehensive holistic representations, $\{H_{i}\}_{i=1}^{4}$. Additionally, the Enhanced Context Generation (ECG) module processes features $S_{4}$ and $H_{4}$ to generate enriched contextual guidance, $E_{c}$. The holistic features are then fed into multi-stage feature integration (MFI) modules to produce refined outputs, $\{M_{i}\}_{i=1}^{4}$, through progressive refinement. Finally, auxiliary supervision generates multi-scale predictions, $\{P_{i}\}_{i=1}^{4}$ and $P_{c}$, in intermediate stages. Through collaborative optimization of these lightweight components, our method achieves superior performance while maintaining computational efficiency for challenging COD.

3.2. Lightweight Backbone Feature Extraction

We employ MobileViT-S [57] as our backbone encoder, chosen for its balance between representational capacity and computational efficiency, as illustrated empirically later in the experimental section. The encoder generates multi-scale feature maps at four different resolutions as given in Equation (1):

$$
\{S_{1}, S_{2}, S_{3}, S_{4}\} = \text{MobileViT-S}(I) \tag{1}
$$

where $I \in \mathbb{R}^{H \times W \times 3}$ represents the input image. While MobileViT-S [57] originally contains five stages of feature extraction, we utilize features from the second through fifth stages, as the first stage contains only rudimentary low-level information with limited discriminative capacity for COD tasks. The extracted features have the dimensions $S_{1} \in \mathbb{R}^{n \times H/2 \times W/2}$, $S_{2} \in \mathbb{R}^{2n \times H/4 \times W/4}$, $S_{3} \in \mathbb{R}^{4n \times H/8 \times W/8}$, and $S_{4} \in \mathbb{R}^{8n \times H/16 \times W/16}$, where $n$ represents the base channel dimension that scales progressively across different feature levels. These features capture hierarchical representations with progressively increasing receptive fields, ranging from high-resolution low-level details in $S_{1}$ to low-resolution high-level semantic information in $S_{4}$. The choice of MobileViT-S [57] enables our method to maintain lightweight characteristics while capturing essential multi-scale representations necessary for accurate COD.
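As a quick check on the feature pyramid described above, the following Python sketch lists the channel count and spatial size of each backbone feature map using only the dimensions stated in the text. The function name `backbone_feature_shapes`, the 384 × 384 input size, and the base width n = 32 are illustrative assumptions, not values taken from the paper.

```python
def backbone_feature_shapes(height, width, base_channels):
    """Return (channels, height, width) for S_1..S_4 as stated in the text.

    S_i has 2**(i - 1) * n channels and spatial size H / 2**i x W / 2**i.
    """
    shapes = []
    for i in range(1, 5):
        stride = 2 ** i
        if height % stride or width % stride:
            raise ValueError("height and width must be divisible by 16")
        channels = base_channels * 2 ** (i - 1)
        shapes.append((channels, height // stride, width // stride))
    return shapes


# Illustrative values only: the input size and n = 32 are assumptions.
for level, shape in enumerate(backbone_feature_shapes(384, 384, 32), start=1):
    print(f"S_{level}: {shape}")
```

Each level halves the spatial resolution and doubles the channel width, so the deepest map $S_{4}$ is the cheapest place to apply attention.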

A core component of both our global modules and HUMs is the Efficient Spatial Attention (ESA) mechanism, designed to capture long-range spatial dependencies while maintaining computational efficiency essential for lightweight architectures. The ESA mechanism operates on the principle that camouflaged objects require understanding subtle spatial relationships between distant regions to distinguish object boundaries from background patterns.

The attention mechanism begins by conditionally applying adaptive spatial downsampling to the input features, as shown in Equation (2). This is followed by a shared projection-based feature transformation and efficient attention computation. Given the input features $F \in \mathbb{R}^{C \times H \times W}$, the mechanism first performs resolution-aware downsampling:

$$
F_{\mathrm{down}} =
\begin{cases}
\mathrm{AdaptivePool}(F, H/4, W/4) & \text{if } H \times W > 32^{2} \\
F & \text{otherwise}
\end{cases}
\tag{2}
$$

where spatial reduction maintains computational efficiency while preserving essential spatial structure. The mechanism generates shared query ($Q$), key ($K$) and value ($V$) representations through efficient projections, as given in Equations (3) and (4), respectively:

$$
QK = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Conv}_{3 \times 3}^{dw}(F_{\mathrm{down}})\right) \in \mathbb{R}^{C_{\mathrm{red}} \times H' \times W'} \tag{3}
$$

$$
V = \mathrm{Conv}_{1 \times 1}(F_{\mathrm{down}}) \in \mathbb{R}^{C_{v} \times H' \times W'} \tag{4}
$$

where $C_{\mathrm{red}} = C/32$ and $C_{v} = C/4$ achieve extreme parameter reduction. The core attention computation focuses on row-wise spatial relationships by reshaping features (Equations (5) and (6)) and computing attention across horizontal dimensions as given in Equations (7) and (8):

$$
Q_{h} = \mathrm{Reshape}(QK) \in \mathbb{R}^{(B \times W') \times H' \times C_{\mathrm{red}}} \tag{5}
$$

$$
K_{h} = \mathrm{Reshape}(QK) \in \mathbb{R}^{(B \times W') \times C_{\mathrm{red}} \times H'} \tag{6}
$$

$$
\mathrm{Attn} = \mathrm{Softmax}\left(\frac{Q_{h} K_{h}}{\sqrt{C_{\mathrm{red}}}}\right) \in \mathbb{R}^{(B \times W') \times H' \times H'} \tag{7}
$$

$$
\mathrm{Out} = \mathrm{Attn} \cdot \mathrm{Reshape}(V) \in \mathbb{R}^{(B \times W') \times H' \times C_{v}} \tag{8}
$$

The final output combines attended features with a residual connection as shown in Equation (9):

$$
\mathrm{ESA}(F) = \gamma \cdot \mathrm{Interpolate}(\mathrm{Reshape}(\mathrm{Out}), H, W) + F \tag{9}
$$

where $\gamma$ is a learnable parameter and interpolation restores original spatial dimensions. This streamlined approach reduces computational complexity by a substantial margin while effectively capturing essential spatial dependencies crucial for COD, enabling both global modules to model long-range semantic relationships and HUMs to capture cross-modal spatial dependencies between global and local features.
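To make the tensor shapes in Equations (5)–(8) concrete, the following NumPy sketch implements only the attention core, under stated assumptions. The names `softmax` and `esa_attention` are illustrative and are not taken from the authors' code. The downsampling of Equation (2), the projections of Equations (3) and (4), and the interpolation and residual of Equation (9) are deliberately left out, so the shared query–key map and the value map are passed in directly. Following the stated shapes, each of the $B \times W'$ columns is treated as one sequence of length $H'$.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def esa_attention(qk, v):
    """Attention core of Equations (5)-(8) (illustrative sketch).

    qk: (B, C_red, H', W') shared query/key map
    v:  (B, C_v,  H', W') value map
    returns an array of shape (B, C_v, H', W')
    """
    b, c_red, h, w = qk.shape
    c_v = v.shape[1]
    # Each of the B * W' columns becomes one sequence of length H'.
    q_h = qk.transpose(0, 3, 2, 1).reshape(b * w, h, c_red)  # Eq. (5)
    k_h = qk.transpose(0, 3, 1, 2).reshape(b * w, c_red, h)  # Eq. (6)
    v_h = v.transpose(0, 3, 2, 1).reshape(b * w, h, c_v)
    attn = softmax(q_h @ k_h / np.sqrt(c_red), axis=-1)      # Eq. (7)
    out = attn @ v_h                                         # Eq. (8)
    # Undo the reshape so the result is again channel-first.
    return out.reshape(b, w, h, c_v).transpose(0, 3, 2, 1)


# Assumed sizes: C = 64, so C_red = C / 32 = 2 and C_v = C / 4 = 16.
rng = np.random.default_rng(0)
batch, channels = 2, 64
h_red, w_red = 16, 16  # e.g. a 64 x 64 map after 4x downsampling
qk = rng.standard_normal((batch, channels // 32, h_red, w_red))
v = rng.standard_normal((batch, channels // 4, h_red, w_red))
out = esa_attention(qk, v)
print(out.shape)  # (2, 16, 16, 16)
```

With these assumed sizes, the score tensor holds 16 · 16² = 4096 entries per image, compared with (64 · 64)² = 16,777,216 for dense self-attention over every position of the original 64 × 64 map. Note that the output has $C_{v}$ channels, so adding it to $F$ as in Equation (9) presupposes a step that matches the channel count, which this sketch does not model.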

3.3. Global–Local Feature Processing Modules

The proposed global and local modules constitute a dual-pathway architecture that efficiently decomposes incoming multi-scale features into complementary representational spaces. This design enables simultaneous extraction of high-level semantic context and fine-grained local details, both critical for effective COD. The global pathway captures long-range dependencies and holistic scene understanding, while the local pathway preserves spatial precision and subtle textural variations essential for accurate boundary localization. The technical implementation of these modules is detailed as follows.

3.3.1. Global Feature Extraction (G)

The global modules capture long-range dependencies and semantic relationships across the entire feature map through an ultra-lightweight architecture. As illustrated in Figure 1, the backbone features $\{S_{1}, S_{2}, S_{3}, S_{4}\}$ are processed through dedicated global processing pathways to generate enhanced global representations $\{G_{1}, G_{2}, G_{3}, G_{4}\}$. For each scale, $i$, the global processing is formulated using Equation (10):

$$
G_{i} = \mathrm{Global}_{i}(S_{i}) = \mathrm{ESA}(\Phi_{ds}(S_{i})) + R(S_{i}) \tag{10}
$$

where $S_{i}$ represents the input features from the backbone at the scale $i$, and $G_{i} \in \mathbb{R}^{64 \times H_{i} \times W_{i}}$ denotes the extracted global features that feed into subsequent HUM processing. The global modules employ a depth-wise separable convolution, $\Phi_{ds}(\cdot)$, for efficient feature transformation, followed by our Efficient Spatial Attention mechanism $\mathrm{ESA}(\cdot)$, as described previously.

As formulated in Equation (11), the depth-wise separable convolution decomposes standard convolution into efficient components:

$$
\Phi_{ds}(S_{i}) = \mathrm{ReLU}\left(\mathrm{AdaptiveNorm}\left(\mathrm{Conv}_{1 \times 1}\left(\mathrm{Conv}_{3 \times 3}^{dw}(S_{i})\right)\right)\right) \tag{11}
$$

where $\mathrm{Conv}_{3 \times 3}^{dw}$ performs depth-wise spatial filtering and $\mathrm{Conv}_{1 \times 1}$ provides point-wise feature mixing. This decomposition enables efficient feature transformation while preserving spatial relationships essential for global context understanding. The ESA mechanism then captures long-range spatial dependencies across the transformed features, enabling the global modules to model semantic relationships between distant regions crucial for understanding holistic scene context and camouflaged object boundaries.
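The saving from this decomposition can be estimated with a simple weight count. The sketch below compares a dense 3 × 3 convolution with a depth-wise 3 × 3 convolution followed by a 1 × 1 point-wise convolution, ignoring biases and normalization parameters. The function names and the 128 → 64 channel example are assumptions chosen for illustration; only the output width of 64 is taken from the text.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a dense k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a k x k depth-wise conv plus a 1 x 1 point-wise conv."""
    return k * k * c_in + c_in * c_out


# Assumed example widths; 64 matches the stated channel count of G_i.
c_in, c_out = 128, 64
dense = standard_conv_params(c_in, c_out)
separable = depthwise_separable_params(c_in, c_out)
print(dense, separable, round(dense / separable, 2))
```

For this example the dense layer needs 73,728 weights and the separable pair needs 9,344, roughly an eightfold reduction.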

The conditional residual connection $R(\cdot)$ adapts to channel dimension changes as illustrated in Equation (12):

$$
R(S_{i}) =
\begin{cases}
\mathrm{Conv}_{1 \times 1}(S_{i}) & \text{if } C_{in} \neq C_{out} \\
S_{i} & \text{otherwise}
\end{cases}
\tag{12}
$$

The resulting global features $G_{i}$ are then combined with their corresponding local counterparts, $L_{i}$, in the HUMs to generate comprehensive feature representations, $H_{i}$, as depicted in Figure 1. This synergistic design enables effective global context modeling with minimal computational cost, making it suitable for real-time COD applications.

3.3.2. Local Feature Extraction (L)

Complementary to global processing, local modules focus on preserving fine-grained spatial details and local texture information through an ultra-efficient architecture achieving significant parameter reduction. As depicted in Figure 1, the backbone features $\{S_{1}, S_{2}, S_{3}, S_{4}\}$ are simultaneously processed through dedicated local processing pathways to generate enhanced local representations, $\{L_{1}, L_{2}, L_{3}, L_{4}\}$. For each scale, $i$, the local processing is given in Equation (13):

$$
L_{i} = \mathrm{Local}_{i}(S_{i}) = \Theta_{scale}(\Phi_{group}(S_{i})) + R(S_{i}) \tag{13}
$$

where $S_{i}$ represents the input features at the scale $i$, and $L_{i} \in \mathbb{R}^{64 \times H_{i} \times W_{i}}$ denotes the extracted local features that complement their global counterparts $G_{i}$ in the HUMs.

Similarly, as given in Equation (14), the local modules employ an ultra-efficient grouped convolution strategy, $\Phi_{group}(\cdot)$, that directly processes spatial features while minimizing computational overhead:

$$
\Phi_{group}(S_{i}) = \mathrm{ReLU}\left(\mathrm{AdaptiveNorm}\left(\mathrm{Conv}_{3 \times 3}^{group}(S_{i})\right)\right) \tag{14}
$$

where $\mathrm{Conv}_{3 \times 3}^{group}$ performs spatial filtering with grouped convolutions using $\min(C_{in}, C_{out})$ groups to minimize parameters while preserving essential local spatial relationships and textural patterns. This grouped convolution strategy focuses on neighborhood-level feature extraction within localized receptive fields, contrasting with the global modules that capture long-range spatial dependencies across the entire feature map.
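The effect of choosing $\min(C_{in}, C_{out})$ groups can be checked with a short weight count. The sketch below is illustrative: the function name `grouped_conv_params` and the channel widths are assumptions, and biases are ignored. It also assumes that both channel counts are divisible by the number of groups, which is the usual constraint on grouped convolutions (PyTorch's `Conv2d`, for example, enforces it).

```python
def grouped_conv_params(c_in, c_out, k=3, groups=None):
    """Weights in a k x k grouped convolution (bias ignored).

    When groups is None, min(c_in, c_out) groups are used, as in the text.
    """
    if groups is None:
        groups = min(c_in, c_out)
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return k * k * (c_in // groups) * c_out


# Assumed example widths; 64 matches the stated channel count of L_i.
print(grouped_conv_params(128, 64))            # min(128, 64) = 64 groups
print(grouped_conv_params(64, 64))             # reduces to a depth-wise conv
print(grouped_conv_params(128, 64, groups=1))  # dense baseline
```

With 64 groups, the 128 → 64 layer needs 1,152 weights instead of the 73,728 of the dense baseline. When $C_{in} = C_{out}$ the layer becomes a pure depth-wise convolution with no cross-channel mixing, which fits the local pathway's focus on neighborhood-level texture.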

Likewise, as given in Equation (15), the ultra-minimal channel recalibration mechanism $\Theta_{scale}(\cdot)$ implements adaptive feature enhancement through learnable channel-wise scaling:

$$
\Theta_{scale}(F) = F \odot \alpha, \quad \text{where } \alpha \in \mathbb{R}^{1 \times C \times 1 \times 1} \tag{15}
$$

This channel scaling mechanism requires only a single learnable parameter ($\alpha$) per channel, dramatically reducing computational complexity while enabling adaptive feature importance weighting. The channel-wise scaling emphasizes discriminative local texture patterns and fine-grained spatial details essential for precise boundary localization, unlike the global attention that models semantic relationships across distant regions.
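Equation (15) amounts to a broadcast multiplication, which the following NumPy sketch illustrates. The function name `channel_scale` and the toy values are assumptions made for the example; in a trained model $\alpha$ would be learned rather than set by hand.

```python
import numpy as np


def channel_scale(features, alpha):
    """Theta_scale(F) = F * alpha, with alpha of shape (1, C, 1, 1)."""
    expected = (1, features.shape[1], 1, 1)
    if alpha.shape != expected:
        raise ValueError(f"alpha must have shape {expected}")
    # Broadcasting applies one scalar per channel at every spatial position.
    return features * alpha


# Toy input: batch of 2, C = 4 channels, 3 x 3 spatial grid of ones.
features = np.ones((2, 4, 3, 3))
alpha = np.array([0.5, 1.0, 2.0, 0.0]).reshape(1, 4, 1, 1)
scaled = channel_scale(features, alpha)
print(scaled[0, :, 0, 0])
```

The operation adds exactly $C$ parameters per module, so for the 64-channel local features it contributes 64 weights regardless of spatial resolution.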

Similarly to $G$, the conditional residual connection $R(\cdot)$, as given in Equation (12), ensures effective gradient propagation. This residual design preserves original feature information while allowing the local pathway to focus on incremental spatial detail enhancement rather than complete feature reconstruction. The resulting local features $L_{i}$ are subsequently fused with their corresponding global counterparts, $G_{i}$, through the HUMs to generate comprehensive representations, $H_{i}$. This architecture enables precise local feature extraction with minimal computational overhead, crucial for accurate boundary delineation and texture preservation in camouflaged object detection applications.

3.4. Holistic Unification Modules (HUMs)

Upon extraction of global and local feature representations,

{G_{i}, L_{i}}_{i = 1}^{4}

, the critical challenge lies in effectively fusing these complementary features into unified holistic representations,

{H_{i}}_{i = 1}^{4}

, that preserve both semantic richness and spatial precision. As illustrated in Figure 1, our framework employs four distinct HUM configurations optimized for different processing stages and information requirements throughout the hierarchical decoding process. At the deepest semantic level, HUM1 performs bilateral enhancement followed by direct fusion between global and local features through a two-stage process as given in Equations (16) and (17), respectively:

\begin{matrix} G_{4}^{enh}, L_{4}^{enh} & = BE (G_{4}, L_{4}) \end{matrix}

(16)

\begin{matrix} H_{4} & = Ψ_{proc} ([G_{4}^{enh}; L_{4}^{enh}]) \end{matrix}

(17)

where

BE (\cdot, \cdot)

denotes bilateral enhancement that mutually refines both feature streams through shared spatial weighting. The bilateral enhancement operates by concatenating downsampled global and local features, generating unified spatial attention weights through extreme channel reduction (C/64), and applying these shared weights to enhance both pathways simultaneously, enabling global features to benefit from local spatial precision while local features gain semantic understanding from global context. The subsequent HUMs, HUM2-4, employ the same bilateral enhancement mechanism (Equation (18)) followed by region-aware spatial weighting (Equation (19)) that incorporates multi-scale contextual guidance, where for levels

i \in {3, 2, 1}

, the progressive holistic integration is given in Equation (20):

\begin{matrix} G L_{i} & = BE (G_{i}, L_{i}) \end{matrix}

(18)

\begin{matrix} G L_{i}^{weighted} & = W_{region} (G L_{i}, H_{prev}, E_{c}) \end{matrix}

(19)

\begin{matrix} H_{i} & = Ψ_{out} ([G L_{i}^{weighted}; G L_{i}]) + H_{prev} + E_{c} \end{matrix}

(20)

where

W_{region} (\cdot, \cdot, \cdot)

generates adaptive spatial weights by processing the concatenation of the current features, the inverted previous prediction, and the enhanced context through lightweight convolutions. The bilateral enhancement mechanism ensures that global semantic understanding and local spatial precision reinforce each other before fusion, while the region-aware weighting selectively emphasizes discriminative regions based on multi-scale contextual priors. Together, these steps propagate information from coarse semantic levels to fine spatial details, with each stage benefiting from both current-scale bilateral enhancement and multi-scale contextual guidance.
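The shared spatial weighting behind bilateral enhancement can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: a channel mean followed by a sigmoid stands in for the learned channel-reduction layers, the downsampling step is omitted, and the names `shared_spatial_weights` and `bilateral_enhance` are hypothetical.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # layout: [channel][row][col]


def shared_spatial_weights(g_feat: FeatureMap, l_feat: FeatureMap) -> List[List[float]]:
    """One weight per pixel computed from the channel-wise concatenation.

    Assumption: a channel mean plus sigmoid replaces the learned reduction.
    """
    stacked = g_feat + l_feat  # concatenate along the channel axis
    rows, cols = len(stacked[0]), len(stacked[0][0])
    weights = []
    for r in range(rows):
        row = []
        for c in range(cols):
            mean = sum(ch[r][c] for ch in stacked) / len(stacked)
            row.append(1.0 / (1.0 + math.exp(-mean)))
        weights.append(row)
    return weights


def bilateral_enhance(g_feat: FeatureMap, l_feat: FeatureMap) -> Tuple[FeatureMap, FeatureMap]:
    """Apply the same spatial weight map to both feature streams."""
    w = shared_spatial_weights(g_feat, l_feat)

    def apply(x: FeatureMap) -> FeatureMap:
        return [[[v * w[r][c] for c, v in enumerate(row)]
                 for r, row in enumerate(ch)] for ch in x]

    return apply(g_feat), apply(l_feat)


g_feat = [[[0.0, 2.0]]]   # one global channel on a 1x2 map (made-up values)
l_feat = [[[0.0, -2.0]]]  # one local channel with the same spatial size
g_enh, l_enh = bilateral_enhance(g_feat, l_feat)
print(g_enh, l_enh)
```

Because a single weight map is derived from both streams and applied to both, each stream is modulated by evidence from the other, which is the property the text attributes to the mechanism.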

3.5. Enhanced Context Generation (ECG)

To provide consistent global understanding and semantic guidance across all scales, we employ an Enhanced Context Generation (ECG) module that generates comprehensive contextual features through ultra-efficient processing. As depicted in Figure 1, the ECG module processes the deepest backbone features

S_{4}

and the corresponding holistic representation

H_{4}

to produce enriched contextual guidance,

E_{c}

, that influences all subsequent multi-stage feature integration (MFI) stages. The enhanced context generation

E_{c}

is formulated as given in Equation (21):

\begin{matrix} E_{c} = ECG (Cat [S_{4}, H_{4}]) = {Conv}_{1 \times 1} (ESA (Φ_{ds} (Cat [S_{4}, H_{4}])) + R (Cat [S_{4}, H_{4}])) \end{matrix}

(21)

where

Cat [S_{4}, H_{4}] \in R^{2 C \times H_{4} \times W_{4}}

represents the concatenation of the deepest encoder features and the corresponding holistic representation, and

Φ_{ds} (\cdot)

denotes depth-wise separable convolution processing. The ECG module processes this fused representation through efficient operations designed to maximize contextual understanding at minimal computational cost. The depth-wise separable convolution transforms the concatenated deep features while preserving essential spatial relationships, and the ESA mechanism captures long-range spatial dependencies across the deepest feature representations, modeling relationships between different regions of the scene at the most semantic level. A residual connection ensures effective information flow and gradient propagation, while the final

1 \times 1

convolution generates the contextual prediction map, where

E_{c} \in R^{1 \times H_{4} \times W_{4}}

represents the enhanced contextual features that serve as a global semantic reference for all subsequent MFI stages. This contextual guidance supports a consistent understanding of camouflaged object boundaries and semantic regions throughout the multi-scale hierarchy, which helps detect objects that blend with their surroundings. Placing the ECG module at the deepest feature level ensures that global semantic information is propagated throughout the entire decoding process.

3.6. Multi-Stage Feature Integration (MFI)

Our LiteCOD employs a cascaded multi-stage feature integration strategy that progressively refines holistic features with enhanced contextual guidance through specialized integration mechanisms. As illustrated in Figure 1, the hierarchical refinement processes features sequentially from deepest to shallowest levels through a unified integration approach that employs reverse attention weighting in the initial stage. The multi-stage integration operates through bilateral enhancement of holistic features followed by contextual refinement, where the integration process is given in Equation (22):

\begin{matrix} M_{i} = {MFI}_{i} (H_{i}, H_{prev}, E_{c}) = Ψ_{out} (A_{rev} (BE (H_{i}, H_{i}), E_{c})) + E_{c} \end{matrix}

(22)

where

BE (\cdot, \cdot)

denotes bilateral enhancement that processes holistic features through the same mechanism used in the HUMs,

A_{rev} (\cdot, \cdot)

represents reverse attention weighting that emphasizes discriminative regions while suppressing background interference through inverted enhanced context

(1 - sigmoid (E_{c}))

, and

Ψ_{out} (\cdot)

denotes the output projection function. The reverse attention mechanism operates by expanding the inverted contextual guidance across all feature channels and applying element-wise multiplication to the bilateral-enhanced features, effectively highlighting regions where global context indicates low confidence for camouflaged object presence. This approach enables the integration modules to focus on areas requiring refinement while maintaining spatial consistency across the multi-scale hierarchy. The unified MFI design processes each holistic representation,

H_{i}

, through the same bilateral enhancement and reverse attention pipeline, ensuring consistent feature refinement strategies across all scales while the enhanced context

E_{c}

provides global semantic guidance that influences all integration stages. This cascaded integration strategy generates semantically rich multi-stage representations,

{M_{i}}_{i = 1}^{4}

, that effectively balance fine-grained spatial precision with global semantic coherence, forming the foundation for subsequent progressive upsampling and final camouflaged object detection.
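The reverse attention step in Equation (22) can be illustrated with a few lines of code. The sketch below is a simplified reading under stated assumptions: the context map is treated as a single-channel logit map that has already been resized to the spatial size of the features, and the function name `reverse_attention` is hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # layout: [channel][row][col]
Map2D = List[List[float]]             # layout: [row][col]


def reverse_attention(features: FeatureMap, context: Map2D) -> FeatureMap:
    """Weight every channel by (1 - sigmoid(E_c)) at each pixel."""
    inverted = [[1.0 - 1.0 / (1.0 + math.exp(-e)) for e in row] for row in context]
    # The single-channel weight map is broadcast across all feature channels.
    return [[[v * inverted[r][c] for c, v in enumerate(row)]
             for r, row in enumerate(ch)] for ch in features]


# A 1x3 map whose context logits say "object", "unsure", "background".
context = [[6.0, 0.0, -6.0]]
features = [[[1.0, 1.0, 1.0]]]
out = reverse_attention(features, context)
print([round(v, 3) for v in out[0][0]])  # → [0.002, 0.5, 0.998]
```

Pixels that the context already marks confidently as object are suppressed, so later layers spend their capacity on regions the context has not yet claimed.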

3.7. Progressive Upsampling and Multi-Scale Supervision

The refined multi-stage features

{M_{i}}_{i = 1}^{4}

and enhanced context

E_{c}

undergo progressive upsampling to generate full-resolution predictions for multi-scale supervision. Our framework generates outputs at each processing stage through direct bilinear interpolation as given in Equation (23):

\begin{matrix} P = {P_{c}, P_{4}, P_{3}, P_{2}, P_{1}} = {Interpolate (E_{c}, H \times W), Interpolate (M_{i}, H \times W)}_{i = 1}^{4} \end{matrix}

(23)

where

Interpolate (\cdot, H \times W)

denotes bilinear interpolation that upsamples features to the original input resolution. These multi-scale predictions

P

are used for auxiliary supervision during training, providing learning signals in intermediate processing stages to ensure effective gradient flow and robust optimization throughout the lightweight network architecture, ultimately leading to superior camouflaged object detection performance.
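For readers unfamiliar with the interpolation step in Equation (23), the following sketch resizes a single-channel map with bilinear interpolation. It assumes the half-pixel-center coordinate convention, which the text does not specify, and `bilinear_resize` is an illustrative helper rather than a function from the released code.

```python
from typing import List, Tuple

Map2D = List[List[float]]


def bilinear_resize(src: Map2D, out_h: int, out_w: int) -> Map2D:
    """Resize a 2D map with bilinear interpolation (half-pixel centers)."""
    in_h, in_w = len(src), len(src[0])

    def source_coord(dst: int, in_size: int, out_size: int) -> Tuple[int, int, float]:
        # Map the output pixel center back into input coordinates, then clamp.
        pos = (dst + 0.5) * in_size / out_size - 0.5
        pos = min(max(pos, 0.0), in_size - 1.0)
        lo = int(pos)
        hi = min(lo + 1, in_size - 1)
        return lo, hi, pos - lo

    out = []
    for y in range(out_h):
        y0, y1, fy = source_coord(y, in_h, out_h)
        row = []
        for x in range(out_w):
            x0, x1, fx = source_coord(x, in_w, out_w)
            top = src[y0][x0] * (1 - fx) + src[y0][x1] * fx
            bottom = src[y1][x0] * (1 - fx) + src[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


print(bilinear_resize([[0.0, 1.0]], 1, 4))  # → [[0.0, 0.25, 0.75, 1.0]]
```

Applying the same resize to the context map and to each stage output yields five predictions at the input resolution, one per supervised output.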

4. Experiments and Discussion

Our lightweight framework combines an efficient bilateral enhancement mechanism with multi-scale feature integration to detect camouflaged objects accurately at low computational cost. In this section, we present the experimental validation, including implementation details, quantitative comparisons with SOTA methods, qualitative analysis, and ablation studies that assess the contribution of each proposed component.

4.1. Implementation Settings and Reproducibility

Following recent works [58], we implemented our proposed method using the open-source PyTorch version 2.5 deep learning framework [59] and conducted all experiments on a single NVIDIA GeForce RTX 4080 GPU with 16 GB of memory for development and benchmarking purposes. To evaluate real-world edge deployment capabilities, we additionally tested our method on an NVIDIA Jetson Orin AGX edge computing platform, which better represents the computational constraints of practical deployment scenarios in mobile and embedded systems. For the backbone encoder, we employed MobileViT-S [57] pre-trained on the ImageNet dataset [60] as our primary architecture, while also evaluating other backbone networks including ResNet-50 [61] and Pyramid Vision Transformer (PVT-V2) [62] for comprehensive analysis. All input images were resized to

512 \times 512

pixels during both the training and inference phases. We adopted the Adam optimizer [63] with an initial learning rate of

1 \times 10^{- 4}

and applied a learning rate decay factor of 0.1 every 60 epochs. The model was trained for 250 epochs with a batch size of 16. Additionally, similarly to prominent techniques like SINet [1], we employed several data augmentation strategies including horizontal flipping, random cropping, and color enhancement during the training process to prevent model overfitting and enhance generalization capability.
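The learning rate schedule described above can be written out explicitly. The sketch below assumes that the decay is applied at every epoch index that is a multiple of 60 (step decay with zero-indexed epochs); the function name `step_decay_lr` is illustrative and the exact scheduler used in the released code may differ.

```python
def step_decay_lr(epoch: int, base_lr: float = 1e-4,
                  gamma: float = 0.1, step: int = 60) -> float:
    """Learning rate at a zero-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step)


total_epochs = 250
for epoch in (0, 59, 60, 120, 180, 240, total_epochs - 1):
    # Print the rate at each boundary and at the final epoch.
    print(epoch, f"{step_decay_lr(epoch):.0e}")
```

Under this reading, the rate drops four times over the 250 epochs (at epochs 60, 120, 180, and 240), so the last ten epochs run at roughly 1e-8.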

4.2. Datasets

We conducted comprehensive experiments on three widely adopted benchmark datasets for camouflaged object detection: CAMO [27], COD10K [28], and NC4K [29]. The CAMO dataset comprises 1250 images featuring camouflaged objects in natural scenes, with a standard split of 1000 training and 250 testing images. COD10K [28] is the largest-scale COD benchmark, containing 10,000 high-quality images across 78 object categories, of which 3040 serve as training samples and 2026 as testing samples. The NC4K [29] dataset provides an additional challenging evaluation set of 4121 testing images with diverse camouflaged scenarios. Following standard evaluation protocols used in prominent techniques [1,28], we trained our LiteCOD on the combined training sets from CAMO-Train (1000 images) and COD10K-Train (3040 images), 4040 images in total, ensuring fair comparison with existing SOTA methods.

4.3. Objective Function

To train our network effectively, we adopted a weighted combination of multi-scale loss functions between the predictions

P_{i}

and the ground truth

G_{t}

. The total loss function is defined in Equation (24):

\begin{matrix} L_{total} = α_{c} L_{BCE} (P_{c}, G_{t}) + \sum_{i = 1}^{4} α_{i} L_{BCE} (P_{i}, G_{t}) + β \sum_{j \in {c, 1, 2, 3, 4}} L_{IoU} (P_{j}, G_{t}) \end{matrix}

(24)

Here,

L_{BCE}

and

L_{IoU}

denote the binary cross-entropy and intersection-over-union loss functions, respectively. The term

P_{c}

refers to the enhanced context prediction, while

P_{1}

through

P_{4}

represent the multi-stage refined predictions at various scales. The weights

α_{c}

and

α_{i}

control the influence of the context and stage-wise predictions, respectively, and

β

balances the contribution of the IoU loss. These hyperparameters are empirically set to encourage effective learning of both global semantics and fine-grained spatial details across all supervision stages.
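To make the structure of Equation (24) concrete, the sketch below evaluates it on flattened probability maps. Several details are assumptions for illustration only: the predictions are taken to be probabilities in [0, 1], the IoU term uses the common soft formulation, and all weights are set to 1.0 because the text does not report their values.

```python
import math
from typing import Dict, List

EPS = 1e-7  # numerical guard for log() and division


def bce_loss(pred: List[float], target: List[float]) -> float:
    """Mean binary cross-entropy over flattened probability maps."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total / len(pred)


def iou_loss(pred: List[float], target: List[float]) -> float:
    """Soft IoU loss: 1 - intersection / union computed on probabilities."""
    inter = sum(p * g for p, g in zip(pred, target))
    union = sum(p + g - p * g for p, g in zip(pred, target))
    return 1.0 - (inter + EPS) / (union + EPS)


def total_loss(preds: Dict[str, List[float]], target: List[float],
               alpha: Dict[str, float], beta: float) -> float:
    """Weighted BCE per output plus beta times the summed IoU loss."""
    bce = sum(alpha[k] * bce_loss(p, target) for k, p in preds.items())
    iou = sum(iou_loss(p, target) for p in preds.values())
    return bce + beta * iou


target = [1.0, 0.0, 1.0, 0.0]
keys = ["c", "1", "2", "3", "4"]          # context output plus four stages
alpha = {k: 1.0 for k in keys}            # hypothetical weights
good = {k: [0.9, 0.1, 0.8, 0.2] for k in keys}
poor = {k: [0.5, 0.5, 0.5, 0.5] for k in keys}
print(total_loss(good, target, alpha, beta=1.0) <
      total_loss(poor, target, alpha, beta=1.0))  # → True
```

Every one of the five outputs contributes both a BCE and an IoU term, so each intermediate stage receives a direct training signal.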

4.4. Evaluation Metrics

We adopt four widely used evaluation metrics established in the camouflaged object detection literature: the Structure-measure

(S_{m})

[64], Mean Absolute Error (MAE), weighted F-measure (

F_{β}^{w}

) [65], and adaptive E-measure (

E_{m}^{ϕ}

) [66]. The Structure-measure

(S_{m})

evaluates the structural similarity between predicted segmentation maps and ground truth by considering both region-aware and object-aware structural information. The MAE computes the pixel-wise absolute difference between predictions and ground truth, providing a direct measure of segmentation accuracy with lower values indicating superior performance. The weighted F-measure (

F_{β}^{w}

) calculates the harmonic mean of precision and recall with an adaptive weighting mechanism to address inherent class imbalance in COD datasets, while we also report the mean F-measure (

F_{β}^{mean}

) to provide an additional perspective on precision–recall balance. The adaptive E-measure (

E_{m}^{ϕ}

) provides a comprehensive assessment by jointly considering local pixel-level accuracy and global image-level statistics, offering enhanced discriminative power for camouflaged object evaluation; we complement this with the maximum E-measure (

E_{m}^{max}

) to capture the peak performance potential across different thresholds. For all metrics except the MAE, higher values indicate better performance.
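Two of these metrics are simple enough to state in code. The sketch below computes the MAE and a plain thresholded F-measure on flattened maps. It is illustrative only: β² = 0.3 is the value commonly used in the segmentation literature and is an assumption here, and the weighted F-measure [65], the Structure-measure [64], and the E-measure [66] involve additional terms that are not reproduced.

```python
from typing import List


def mae(pred: List[float], target: List[float]) -> float:
    """Mean absolute error between a probability map and a binary mask."""
    return sum(abs(p - g) for p, g in zip(pred, target)) / len(pred)


def f_beta(pred: List[float], target: List[float],
           threshold: float = 0.5, beta_sq: float = 0.3) -> float:
    """F-measure of the thresholded prediction; beta_sq < 1 favors precision."""
    binary = [1.0 if p >= threshold else 0.0 for p in pred]
    tp = sum(b * g for b, g in zip(binary, target))
    precision = tp / sum(binary) if sum(binary) else 0.0
    recall = tp / sum(target) if sum(target) else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


pred = [0.9, 0.6, 0.2, 0.1]    # made-up prediction for four pixels
target = [1.0, 0.0, 1.0, 0.0]  # made-up ground truth
print(round(mae(pred, target), 3), round(f_beta(pred, target), 3))  # → 0.4 0.5
```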

4.5. Qualitative and Quantitative Results

LiteCOD achieved the best results on most evaluation metrics across the three benchmark datasets, as shown in Table 2. On COD10K and NC4K it led on every reported metric; on CAMO it obtained the highest E-measure and weighted F-measure, while ESNet [68] reported a higher Structure-measure (0.848 vs. 0.841) and a lower MAE (0.049 vs. 0.056). These results were obtained while maintaining real-time inference at 76 FPS with only 5.1M parameters. As shown in the comparative analysis, LiteCOD consistently outperformed existing lightweight methods while maintaining competitive performance against heavyweight approaches. Qualitatively, Figure 2 highlights our method’s detection quality, with well-defined object boundaries and fewer false positives than contemporary approaches. Likewise, Figure 3 illustrates LiteCOD’s boundary preservation and accurate object localization across diverse challenging scenarios, including complex textural patterns, varying object scales, and extreme camouflage conditions. The lightweight architecture balances detection accuracy with computational efficiency, requiring substantially fewer computational resources than heavyweight approaches and thereby offering a favorable trade-off between accuracy and deployment feasibility in resource-constrained environments.

Table 2. Performance comparison of LiteCOD against SOTA methods across multiple COD benchmarks (CAMO, COD10K, NC4K). Our method achieves optimal balance between detection accuracy and computational efficiency, outperforming lightweight techniques while maintaining competitive performance with heavyweight approaches at significantly reduced parameter count and computational overhead. Best results are highlighted in bold.

Method	Publication	Params (M)	FLOPs (G)	FPS	CAMO (250) $S_{m}$ ↑	CAMO (250) $E_{m}^{ϕ}$ ↑	CAMO (250) $F_{β}^{w}$ ↑	CAMO (250) MAE ↓	COD10K (2026) $S_{m}$ ↑	COD10K (2026) $E_{m}^{ϕ}$ ↑	COD10K (2026) $F_{β}^{w}$ ↑	COD10K (2026) MAE ↓	NC4K (4121) $S_{m}$ ↑	NC4K (4121) $E_{m}^{ϕ}$ ↑	NC4K (4121) $F_{β}^{w}$ ↑	NC4K (4121) MAE ↓
SINet [1]	CVPR’20	48.95	19.30	82	0.751	0.771	0.606	0.100	0.771	0.797	0.551	0.051	0.808	0.838	0.723	0.058
PFNet [10]	CVPR’21	46.50	26.39	106	0.782	0.855	0.695	0.085	0.800	0.868	0.660	0.040	0.829	0.894	0.745	0.053
LSR [29]	CVPR’21	50.94	17.36	150	0.793	0.859	0.743	0.080	0.804	0.883	0.678	0.037	0.840	0.904	0.666	0.048
SINetV2 [28]	TPAMI’22	26.98	12.17	130	0.820	0.884	0.743	0.070	0.815	0.864	0.689	0.037	0.847	0.901	0.770	0.044
BGNet [16]	IJCAI’22	79.85	58.24	88	0.812	0.876	0.749	0.073	0.831	0.892	0.722	0.033	0.851	0.911	0.788	0.044
SegMaR [12]	CVPR’22	56.97	33.49	85	0.815	0.881	0.753	0.071	0.833	0.869	0.724	0.034	0.861	0.905	0.781	0.046
ZoomNet [11]	CVPR’22	32.28	101.35	41	0.820	0.883	0.752	0.066	0.838	0.893	0.729	0.029	0.853	0.907	0.784	0.043
FEDER [15]	CVPR’23	44.13	35.80	42	0.802	0.877	0.738	0.071	0.822	0.901	0.716	0.032	0.847	0.913	0.789	0.044
CamoFocus-P [56]	WACV’24	73.2	44	45	0.817	0.884	0.752	0.067	0.838	0.900	0.724	0.029	0.865	0.913	0.788	0.042
FSEL [67]	ECCV’24	29.15	35.64	-	0.822	0.892	0.758	0.067	0.833	0.898	0.728	0.031	0.855	0.913	0.792	0.042
CamoFormer [50]	TPAMI’24	36.11	34.20	87	0.817	0.884	0.752	0.067	0.838	0.900	0.724	0.029	0.865	0.913	0.788	0.042
ESNet [68]	KBS’25	10.77	4.52	99	0.848	-	-	0.049	0.850	-	-	0.031	0.862	-	-	0.040
Lightweight COD Techniques
DGNet-S [21]	MIR’23	7.02	1.14	153	0.826	0.896	0.754	0.063	0.810	0.869	0.672	0.036	0.845	0.902	0.764	0.047
ASBI [55]	CVIU’23	9.47	9.84	95	0.839	0.896	0.761	0.064	0.825	0.872	0.690	0.035	0.855	0.902	0.775	0.046
ERRNET [39]	PR	9.47	9.84	95	0.839	0.896	0.761	0.064	0.825	0.872	0.690	0.035	0.855	0.902	0.775	0.046
CamoFocus-E [56]	WACV’24	4.76	5.54	78	0.817	0.884	0.752	0.067	0.838	0.900	0.724	0.029	0.865	0.913	0.788	0.042
TinyCOD [18]	ICASSP’23	4.72	1.40	60	0.822	0.890	0.752	0.066	0.831	0.877	0.678	0.036	0.843	0.903	0.766	0.047
FINet [20]	SPL’24	3.74	1.16	127	0.828	0.890	0.752	0.065	0.817	0.882	0.686	0.034	0.847	0.904	0.771	0.047
LiteCOD(Ours)	-	5.15	7.95	72	0.841	0.907	0.796	0.056	0.852	0.920	0.765	0.026	0.870	0.926	0.822	0.036

- denotes that the result is unavailable. The best results are presented in bold. The ↑ symbol indicates that higher values represent better performance, whereas ↓ denotes that lower values are more favorable.

4.6. Ablation Study

We conducted comprehensive ablation studies to validate the effectiveness of our design choices, focusing on three critical aspects: backbone architecture selection, input image resolution impact, and incremental analysis of the proposed LiteCOD modules. The incremental module analysis (Table 3) demonstrates the individual contribution of each proposed component, while the resolution study evaluates the computational–accuracy trade-offs across different input sizes.

Backbone Architecture Analysis: We evaluate our LiteCOD framework with different backbone encoders to demonstrate its generalizability and identify the best efficiency-accuracy trade-off. As shown in Table 4, we compare three representative architectures: ResNet-50 [61] (CNN-based), PVT-V2 [62] (transformer), and MobileViT-S [57] (hybrid CNN-transformer). The results demonstrate that MobileViT-S achieves the best balance between computational efficiency and detection performance, providing superior accuracy while maintaining a significantly lower parameter count and faster inference speed. ResNet-50 achieves competitive accuracy but requires more parameters, while PVT-V2 shows strong performance at the cost of increased computational overhead. These findings validate our choice of MobileViT-S as the backbone for lightweight COD.

Input Image Resolution Analysis: We investigate the impact of different input resolutions on both detection performance and computational efficiency. Table 5 presents results across three resolution settings: 384 × 384, 416 × 416, and 512 × 512 pixels. The analysis reveals that higher resolutions consistently improve detection accuracy, with 512 × 512 achieving the best performance across all metrics. However, this comes with increased computational cost and reduced inference speed. The 416 × 416 resolution provides an optimal balance for most practical applications, offering substantial performance gains over 384 × 384 while maintaining reasonable computational requirements. These results guide the selection of appropriate input resolution based on specific deployment constraints and accuracy requirements.

4.7. Edge Computing Deployment Analysis

To validate the practical applicability of our LiteCOD for edge computing scenarios, we conducted additional experiments on an NVIDIA Jetson Orin AGX development kit, which represents a typical edge computing platform with significantly reduced computational resources compared to our development hardware. As shown in Table 6, the Jetson Orin AGX features an ARM-based CPU architecture with integrated GPU acceleration, 32 GB of shared memory, and a 50 W power envelope, closely approximating computational constraints in real-world deployment scenarios. Our experiments demonstrated that LiteCOD maintained practical inference capabilities on this edge hardware, achieving approximately 20 FPS with

512 \times 512

input resolution while consuming only 50 W of power. Although this represents a reduction from our RTX 4080 results (76 FPS), the performance remains suitable for real-time applications and validates our lightweight design philosophy. Key deployment factors include memory bandwidth limitations and thermal constraints, where our architectural choices—particularly the ultra-lightweight Efficient Spatial Attention mechanism and 5.1 M parameter count—prove essential for maintaining performance under edge computing constraints.
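The reported frame rates translate directly into per-frame budgets. The short calculation below uses only the figures quoted above and treats the 50 W figure as an upper bound on sustained draw, which is an assumption; the helper `per_frame_budget` is illustrative.

```python
from typing import Tuple


def per_frame_budget(fps: float, power_watts: float) -> Tuple[float, float]:
    """Return (latency in ms, energy in joules) per frame at a steady frame rate."""
    return 1000.0 / fps, power_watts / fps


latency_ms, energy_j = per_frame_budget(fps=20, power_watts=50)
print(f"Jetson: {latency_ms:.1f} ms/frame, up to {energy_j:.2f} J/frame")
print(f"RTX 4080: {1000.0 / 76:.1f} ms/frame")
```

At 20 FPS each frame has a 50 ms budget, compared with about 13 ms on the desktop GPU at 76 FPS.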

4.8. Discussion and Limitations

Our LiteCOD successfully addresses the computational challenges in camouflaged object detection while maintaining competitive accuracy. The method excels in real-time applications and resource-constrained environments, making it suitable for mobile and edge computing scenarios. However, certain limitations persist: the lightweight design occasionally struggles with extremely small camouflaged objects (<1% image area) and complex multi-object scenarios with significant occlusion. The streamlined attention mechanism, while computationally efficient, may miss subtle texture variations in highly challenging cases. Future work could explore adaptive attention mechanisms and multi-scale training strategies to further enhance performance on edge cases while preserving the lightweight characteristics that make this approach practical for real-world deployment.

Edge Deployment Considerations: While our Jetson Orin AGX experiments demonstrate practical edge deployment capabilities, several platform-specific challenges remain. Memory bandwidth limitations on ARM-based systems can affect feature map processing efficiency, and thermal management becomes critical during continuous operation. Additionally, quantization and model compression techniques could further enhance edge performance but may introduce accuracy trade-offs that require careful evaluation. Future work should explore hardware-specific optimizations, including custom inference engines like TensorRT and specialized neural processing unit acceleration, to maximize performance across diverse edge computing platforms while maintaining the detection quality achieved in our current implementation.

5. Conclusions

In this work, we addressed the challenging task of camouflaged object detection (COD) by proposing LiteCOD, a lightweight framework that effectively balances detection accuracy with computational efficiency for practical deployment scenarios. Our approach tackles the fundamental limitations of existing methods that suffer from computational complexity limiting their real-time applicability on mobile devices and edge computing platforms. LiteCOD integrates local and global perceptions through holistic feature fusion and efficient attention mechanisms, achieving superior detection accuracy while maintaining minimal computational overhead. Comprehensive experiments across standard COD benchmarks (CAMO, COD10K, NC4K) demonstrate that LiteCOD consistently surpasses existing lightweight methods while maintaining competitive performance with heavyweight approaches, successfully bridging the gap between detection accuracy and deployment feasibility with only 5.1M parameters and real-time inference at 76 FPS. Future research should address the identified limitations through (1) developing adaptive attention mechanisms for extremely small camouflaged objects and complex multi-object scenarios, (2) investigating model compression and quantization techniques for enhanced edge device performance, (3) exploring hardware-specific optimizations including TensorRT and neural processing unit acceleration, and (4) designing multi-scale training strategies for improved texture variation detection. Additionally, addressing thermal management challenges and expanding to video sequences with temporal consistency could further enhance practical applicability in surveillance, autonomous navigation, and environmental monitoring applications.

Author Contributions

Conceptualization, A.K.; methodology, A.K.; software, A.K.; validation, A.K. and H.U.; formal analysis, A.K. and H.U.; investigation, A.M.; resources, A.M.; data curation, A.K.; writing—original draft preparation, A.K.; writing—review and editing, H.U. and A.M.; visualization, A.K. and H.U.; supervision, A.M.; project administration, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

We provide the code used in this study for reproducibility at https://github.com/iscaas/AFOSR-HAR-2021-2025/tree/main/LiteCOD. All datasets used in this research are open-source and accessible online. Specifically, the COD10K dataset is available at https://paperswithcode.com/dataset/cod10k (accessed on 15 May 2025), the CAMO dataset can be found at https://paperswithcode.com/dataset/camo (accessed on 15 May 2025), and the NC4K dataset is accessible at https://paperswithcode.com/dataset/nc4k (accessed on 15 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Fan, D.P.; Ji, G.P.; Sun, G.; Cheng, M.M.; Shen, J.; Shao, L. Camouflaged Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
Stevens, M.; Ruxton, G.D. The key role of behaviour in animal camouflage. Biol. Rev. 2019, 94, 116–134.
Xiao, F.; Hu, S.; Shen, Y.; Fang, C.; Huang, J.; He, C.; Tang, L.; Yang, Z.; Li, X. A survey of camouflaged object detection and beyond. arXiv 2024, arXiv:2408.14562.
Ngoc Lan, P.; An, N.S.; Hang, D.V.; Long, D.V.; Trung, T.Q.; Thuy, N.T.; Sang, D.V. Neounet: Towards accurate colon polyp segmentation and neoplasm detection. In Proceedings of the Advances in Visual Computing: 16th International Symposium, ISVC 2021, Virtual, 4–6 October 2021; pp. 15–28.
Liu, M.; Di, X. Extraordinary MHNet: Military high-level camouflage object detection network and dataset. Neurocomputing 2023, 549, 126466.
Luo, C.; Wu, J.; Sun, S.; Ren, P. TransCODNet: Underwater transparently camouflaged object detection via RGB and event frames collaboration. IEEE Robot. Autom. Lett. 2023, 9, 1444–1451.
Khan, A.; Khan, M.; Gueaieb, W.; El Saddik, A.; De Masi, G.; Karray, F. SpotCrack: Leveraging a Lightweight Framework for Crack Segmentation in Infrastructure. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2024; pp. 1–5.
Pérez-de la Fuente, R.; Delclòs, X.; Peñalver, E.; Speranza, M.; Wierzchos, J.; Ascaso, C.; Engel, M.S. Early evolution and ecology of camouflage in insects. Proc. Natl. Acad. Sci. USA 2012, 109, 21414–21419.
Fan, D.P.; Ji, G.P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Pranet: Parallel reverse attention network for polyp segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention—MICCAI 2020, Lima, Peru, 4–8 October 2020; pp. 263–273.
Mei, H.; Ji, G.P.; Wei, Z.; Yang, X.; Wei, X.; Fan, D.P. Camouflaged object segmentation with distraction mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8772–8781.
Pang, Y.; Zhao, X.; Xiang, T.Z.; Zhang, L.; Lu, H. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2160–2170.
Jia, Q.; Yao, S.; Liu, Y.; Fan, X.; Liu, R.; Luo, Z. Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4713–4722.
Wang, Q.; Yang, J.; Yu, X.; Wang, F.; Chen, P.; Zheng, F. Depth-aided camouflaged object detection. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3297–3306.
Fang, C.; He, C.; Tang, L.; Zhang, Y.; Zhu, C.; Shen, Y.; Chen, C.; Xu, G.; Li, X. Integrating extra modality helps segmentor find camouflaged objects well. arXiv 2025, arXiv:2502.14471.
He, C.; Li, K.; Zhang, Y.; Tang, L.; Zhang, Y.; Guo, Z.; Li, X. Camouflaged object detection with feature decomposition and edge reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 22046–22055.
Sun, Y.; Wang, S.; Chen, C.; Xiang, T.Z. Boundary-guided camouflaged object detection. arXiv 2022, arXiv:2207.00794.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
Xing, H.; Gao, S.; Tang, H.; Mok, T.Q.; Kang, Y.; Zhang, W. TINYCOD: Tiny and effective model for camouflaged object detection. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
Khan, A.; Khan, M.; Gueaieb, W.; El Saddik, A.; De Masi, G.; Karray, F. Recod: Resource-efficient camouflaged object detection for UAV-based smart cities applications. In Proceedings of the 2023 IEEE International Smart Cities Conference (ISC2), Bucharest, Romania, 24–27 September 2023; pp. 1–5.
Liang, W.; Wu, J.; Wu, Y.; Mu, X.; Xu, J. FINet: Frequency injection network for lightweight camouflaged object detection. IEEE Signal Process. Lett. 2024, 31, 526–530.
Ji, G.P.; Fan, D.P.; Chou, Y.C.; Dai, D.; Liniger, A.; Van Gool, L. Deep gradient learning for efficient camouflaged object detection. Mach. Intell. Res. 2023, 20, 92–108.
Sun, Y.; Xuan, H.; Yang, J.; Luo, L. Glconet: Learning multisource perception representation for camouflaged object detection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 13262–13275.
Galun, M.; Sharon, E.; Basri, R.; Brandt, A. Texture segmentation by multiscale aggregation of filter responses and shape elements. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; pp. 716–723.
Bhajantri, N.U.; Nagabhushan, P. Camouflage defect identification: A novel approach. In Proceedings of the 9th International Conference on Information Technology (ICIT’06), Bhubaneswar, India, 18–21 December 2006; pp. 145–148.
Song, L.; Geng, W. A new camouflage texture evaluation method based on WSSIM and nature image features. In Proceedings of the 2010 International Conference on Multimedia Technology, Ningbo, China, 29–31 October 2010; pp. 1–4.
Boot, W.R.; Neider, M.B.; Kramer, A.F. Training and transfer of training in the search for camouflaged targets. Atten. Percept. Psychophys. 2009, 71, 950–963.
Le, T.N.; Nguyen, T.V.; Nie, Z.; Tran, M.T.; Sugimoto, A. Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 2019, 184, 45–56.
Fan, D.P.; Ji, G.P.; Cheng, M.M.; Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6024–6042.
Lv, Y.; Zhang, J.; Dai, Y.; Li, A.; Liu, B.; Barnes, N.; Fan, D.P. Simultaneously Localize, Segment and Rank the Camouflaged Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11591–11601.
Liu, Y.; Zhang, D.; Zhang, Q.; Han, J. Integrating part-object relationship and contrast for camouflaged object detection. IEEE Trans. Inf. Forensics Secur. 2021, 16, 5154–5166.
Mei, H.; Xu, K.; Zhou, Y.; Wang, Y.; Piao, H.; Wei, X.; Yang, X. Camouflaged object segmentation with omni perception. Int. J. Comput. Vis. 2023, 131, 3019–3034.
Yan, J.; Le, T.N.; Nguyen, K.D.; Tran, M.T.; Do, T.T.; Nguyen, T.V. MirrorNet: Bio-inspired camouflaged object segmentation. IEEE Access 2021, 9, 43290–43300.
Xiang, M.; Zhang, J.; Lv, Y.; Li, A.; Zhong, Y.; Dai, Y. Exploring depth contribution for camouflaged object detection. arXiv 2021, arXiv:2106.13217.
Wu, Z.; Wang, J.; Zhou, Z.; An, Z.; Jiang, Q.; Demonceaux, C.; Sun, G.; Timofte, R. Object segmentation by mining cross-modal semantics. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3455–3464.
Wu, Z.; Paudel, D.P.; Fan, D.P.; Wang, J.; Wang, S.; Demonceaux, C.; Timofte, R.; Van Gool, L. Source-free depth for object pop-out. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 1032–1042.
Tang, L.; Jiang, P.T.; Shen, Z.H.; Zhang, H.; Chen, J.W.; Li, B. Chain of visual perception: Harnessing multimodal large language models for zero-shot camouflaged object detection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 8805–8814.
Hu, J.; Lin, J.; Gong, S.; Cai, W. Relax image-specific prompt requirement in SAM: A single generic prompt for segmenting camouflaged objects. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 38, pp. 12511–12518.
Luo, Z.; Liu, N.; Zhao, W.; Yang, X.; Zhang, D.; Fan, D.P.; Khan, F.; Han, J. VSCode: General visual salient and camouflaged object detection with 2D prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 17169–17180.
Ji, G.P.; Zhu, L.; Zhuge, M.; Fu, K. Fast camouflaged object detection via edge-based reversible re-calibration network. Pattern Recognit. 2022, 123, 108414.
Zhang, Q.; Yan, W. CFANet: A cross-layer feature aggregation network for camouflaged object detection. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2441–2446.
Hu, X.; Zhang, X.; Wang, F.; Sun, J.; Sun, F. Efficient camouflaged object detection network based on global localization perception and local guidance refinement. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5452–5465.
Dong, Y.; Zhou, H.; Li, C.; Xie, J.; Xie, Y.; Li, Z. You do not need additional priors in camouflage object detection. arXiv 2023, arXiv:2310.00702.
Zhang, C.; Wang, K.; Bi, H.; Liu, Z.; Yang, L. Camouflaged object detection via neighbor connection and hierarchical information transfer. Comput. Vis. Image Underst. 2022, 221, 103450.
Ren, J.; Hu, X.; Zhu, L.; Xu, X.; Xu, Y.; Wang, W.; Deng, Z.; Heng, P.A. Deep texture-aware features for camouflaged object detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 33, 1157–1167.
Wang, K.; Bi, H.; Zhang, Y.; Zhang, C.; Liu, Z.; Zheng, S. D²C-Net: A Dual-Branch, Dual-Guidance and Cross-Refine Network for Camouflaged Object Detection. IEEE Trans. Ind. Electron. 2021, 69, 5364–5374.
Chen, G.; Liu, S.J.; Sun, Y.J.; Ji, G.P.; Wu, Y.F.; Zhou, T. Camouflaged object detection via context-aware cross-level fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6981–6993.
Zhuge, M.; Lu, X.; Guo, Y.; Cai, Z.; Chen, S. CubeNet: X-shape connection for camouflaged object detection. Pattern Recognit. 2022, 127, 108644.
Chen, T.; Xiao, J.; Hu, X.; Zhang, G.; Wang, S. A bioinspired three-stage model for camouflaged object detection. arXiv 2023, arXiv:2305.12635.
Xiao, F.; Zhang, P.; He, C.; Hu, R.; Liu, Y. Concealed object segmentation with hierarchical coherence modeling. In Proceedings of the CAAI International Conference on Artificial Intelligence, Fuzhou, China, 22–23 July 2023; pp. 16–27.
Yin, B.; Zhang, X.; Fan, D.P.; Jiao, S.; Cheng, M.M.; Van Gool, L.; Hou, Q. CamoFormer: Masked separable attention for camouflaged object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10362–10374.
Huang, Z.; Dai, H.; Xiang, T.Z.; Wang, S.; Chen, H.X.; Qin, J.; Xiong, H. Feature shrinkage pyramid for camouflaged object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5557–5566.
Li, J.; Lu, F.; Xue, N.; Li, Z.; Zhang, H.; He, W. Cross-level attention with overlapped windows for camouflaged object detection. arXiv 2023, arXiv:2311.16618.
He, C.; Li, K.; Zhang, Y.; Xu, G.; Tang, L.; Zhang, Y.; Guo, Z.; Li, X. Weakly-supervised concealed object segmentation with SAM-based pseudo labeling and multi-scale feature grouping. Adv. Neural Inf. Process. Syst. 2023, 36, 30726–30737.
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 4015–4026.
Zhang, Q.; Sun, X.; Chen, Y.; Ge, Y.; Bi, H. Attention-induced semantic and boundary interaction network for camouflaged object detection. Comput. Vis. Image Underst. 2023, 233, 103719.
Khan, A.; Khan, M.; Gueaieb, W.; El Saddik, A.; De Masi, G.; Karray, F. CamoFocus: Enhancing camouflage object detection with split-feature focal modulation and context refinement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1434–1443.
Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
Gao, D.; Zhou, Y.; Yan, H.; Chen, C.; Hu, X. COD-SAM: Camouflage object detection using SAM. Pattern Recognit. 2025, 168, 111826.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424.
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4548–4557.
Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1597–1604.
Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. arXiv 2018, arXiv:1805.10421.
Sun, Y.; Xu, C.; Yang, J.; Xuan, H.; Luo, L. Frequency-spatial entanglement learning for camouflaged object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 343–360.
Ren, P.; Bai, T.; Sun, F. ESNet: An Efficient Skeleton-guided Network for camouflaged object detection. Knowl.-Based Syst. 2025, 311, 113056.

Figure 1. Overall architecture of LiteCOD featuring multi-scale feature extraction through hierarchical stages ($S_{1}$–$S_{4}$), Holistic Unification Modules (HUMs) for bilateral global–local feature enhancement, Enhanced Context Generation (ECG) for semantic guidance, multi-stage feature integration (MFI) for progressive feature refinement, and multi-level supervision with predictions ($P_{1}$–$P_{4}$ and $P_{c}$) in each stage for comprehensive COD.

Figure 2. Qualitative comparison of LiteCOD with recent COD methods across diverse challenging scenarios. The comparison includes SegMaR, ZoomNet, SINet-V2, FSPNet, FEDER, MRRNet, PUENet, and EVP across five test cases showing camouflaged animals and objects in natural environments. Each row shows the original image, the ground truth mask, and the segmentation results from LiteCOD (ours) and the comparison methods. The results show that LiteCOD detects and segments camouflaged objects more accurately than existing approaches, with better boundary preservation and structural fidelity.

Figure 3. Qualitative comparison between our proposed method and contemporary lightweight COD approaches (TinyCOD [18], FINet [20], and DGNet-S [21], among others) across diverse challenging scenarios. Our lightweight approach demonstrates superior boundary preservation and structural fidelity compared to other efficient methods, particularly in cases involving both large-scale and minute camouflaged targets and complex textural patterns. These results validate the effectiveness of our proposed architecture in achieving high-quality detection while maintaining computational efficiency suitable for practical deployment.

Table 1. Summary of related work in COD: advantages and limitations.

Category	Representative Methods	Advantages	Limitations
Traditional Approaches	Galun et al. [23], Bhajantri & Nagabhushan [24], Song & Geng [25]	Low computational requirements; simple implementation; no training data needed; interpretable hand-crafted features	Limited accuracy in complex scenes; fails with sophisticated camouflage; poor generalization across domains; cannot capture semantic relationships
Biologically Inspired Methods	SINet [1], PFNet [10], ZoomNet [11], MirrorNet [32]	Mimics human visual perception; effective two-stage processing; good boundary localization; strong performance on standard benchmarks	High computational complexity; large parameter count (26–57 M); slow inference speed; not suitable for real-time applications
Supplementary Information Methods	FEMNet, FEDER [15], DCE [33], XMSNet [34], CoVP [36]	Enhances accuracy through multimodal fusion; leverages frequency/depth information; strong performance on challenging cases; vision–language prompt capabilities	Requires additional data sources; increased computational overhead; complex training procedures; limited practical deployment scenarios
Multi-Scale Feature Integration	ERRNet [39], CamoFormer [50], FSPNet [51], OWinCANet [52]	Captures diverse object scales; rich contextual information; hierarchical feature refinement; good cross-level feature interaction	Computational redundancy in multi-branch designs; parameter inefficiency; memory-intensive operations; complex architecture design
Lightweight COD Methods	TinyCOD [18], DGNet-S [21], FINet [20], CamoFocus [56]	Real-time inference capability; low parameter count (3–10 M); suitable for edge deployment; energy-efficient processing	Accuracy sacrifice for efficiency; limited global context modeling; poor performance in complex scenarios; insufficient attention mechanisms

This table summarizes the key advantages and limitations of different COD approaches, highlighting the research gap that our LiteCOD addresses by combining accuracy with efficiency.

Table 3. Progressive ablation study of the incremental contribution of each component in LiteCOD on the COD10K dataset. Adding the HUMs and then MFI with ECG progressively improves performance while maintaining computational efficiency, showing the importance of each architectural component for achieving SOTA COD.

Model Variant	$S_{m}$ ↑	$F_{β}^{w}$ ↑	$F_{β}^{mean}$ ↑	$E_{m}^{ϕ}$ ↑	$E_{m}^{max}$ ↑	$M$ ↓
Backbone	0.811	0.678	0.713	0.876	0.916	0.036
+ HUM module	0.831	0.719	0.747	0.896	0.923	0.030
+ HUM + MFI w/o ECG	0.849	0.758	0.785	0.916	0.926	0.027
+ HUM + MFI (Ours)	0.852	0.765	0.790	0.920	0.928	0.026

The ↑ symbol indicates that higher values represent better performance, whereas ↓ denotes that lower values are more favorable.
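
The incremental contribution of each component in Table 3 can be read off by differencing consecutive rows. The short Python sketch below does this for three of the columns; the numbers are copied from the table, while the data layout, the shortened metric keys (`S_m`, `F_w`, `M`), and the helper name `stepwise_gains` are illustrative choices rather than anything taken from the original work.

```python
# Values copied from Table 3 (COD10K). Layout and names are illustrative only.
ABLATION = [
    ("Backbone",            {"S_m": 0.811, "F_w": 0.678, "M": 0.036}),
    ("+ HUM module",        {"S_m": 0.831, "F_w": 0.719, "M": 0.030}),
    ("+ HUM + MFI w/o ECG", {"S_m": 0.849, "F_w": 0.758, "M": 0.027}),
    ("+ HUM + MFI (Ours)",  {"S_m": 0.852, "F_w": 0.765, "M": 0.026}),
]


def stepwise_gains(rows):
    """Return, for each added component, the metric change versus the previous row."""
    gains = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        delta = {key: round(cur[key] - prev[key], 3) for key in cur}
        gains.append((name, delta))
    return gains


if __name__ == "__main__":
    for name, delta in stepwise_gains(ABLATION):
        cells = "  ".join(f"{key} {value:+.3f}" for key, value in delta.items())
        print(f"{name:20s} {cells}")
```

Read this way, the HUM step accounts for the largest single change in the weighted F-measure (+0.041), and adding ECG on top of MFI contributes a further +0.007.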

Table 4. Ablation study evaluating the impact of different backbone architectures on LiteCOD performance across the COD10K and NC4K datasets. The analysis demonstrates how varying backbone configurations affect detection accuracy, computational complexity, and inference speed, validating the effectiveness of our lightweight design choices.

Backbone	Params (M)	FLOPs (G)	FPS	COD10K $S_{m}$ ↑	COD10K $F_{β}^{w}$ ↑	COD10K $E_{m}^{ϕ}$ ↑	COD10K $M$ ↓	NC4K $S_{m}$ ↑	NC4K $F_{β}^{w}$ ↑	NC4K $E_{m}^{ϕ}$ ↑	NC4K $M$ ↓
PVT-V2 [62]	63.55	34.2	46	0.859	0.791	0.929	0.023	0.884	0.846	0.933	0.031
ResNet-50 [61]	31.34	15.8	62	0.830	0.777	0.922	0.025	0.872	0.829	0.924	0.035
MobileViT-S [57]	5.13	7.95	76	0.852	0.765	0.920	0.026	0.870	0.822	0.926	0.036

The ↑ symbol indicates that higher values represent better performance, whereas ↓ denotes that lower values are more favorable.
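
To make the accuracy versus cost trade-off in Table 4 concrete, the sketch below compares one backbone against another in relative terms. The figures are copied from the table; the dictionary layout and the function name `compare` are hypothetical, and only the COD10K structure measure is included for brevity.

```python
# (params in M, FLOPs in G, FPS, COD10K S_m) copied from Table 4.
# The layout and the helper below are illustrative, not part of the original work.
BACKBONES = {
    "PVT-V2": (63.55, 34.2, 46, 0.859),
    "ResNet-50": (31.34, 15.8, 62, 0.830),
    "MobileViT-S": (5.13, 7.95, 76, 0.852),
}


def compare(light, heavy, table=BACKBONES):
    """Relative savings of `light` over `heavy`, plus the COD10K S_m gap."""
    l_params, l_flops, l_fps, l_sm = table[light]
    h_params, h_flops, h_fps, h_sm = table[heavy]
    return {
        "param_reduction_pct": 100.0 * (1.0 - l_params / h_params),
        "flops_reduction_pct": 100.0 * (1.0 - l_flops / h_flops),
        "speedup": l_fps / h_fps,
        "s_m_gap": round(h_sm - l_sm, 3),
    }


if __name__ == "__main__":
    for key, value in compare("MobileViT-S", "PVT-V2").items():
        print(f"{key}: {value:.3f}")
```

With these numbers, the MobileViT-S configuration uses roughly 92% fewer parameters and 77% fewer FLOPs than PVT-V2 and runs about 1.65 times faster, at a cost of 0.007 in $S_{m}$ on COD10K.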

Table 5. Ablation study investigating the impact of varying input image sizes on LiteCOD performance across the COD10K and NC4K datasets. The analysis shows how different image resolutions affect detection accuracy, computational complexity (FLOPs), and inference speed (FPS), showcasing the method’s scalability across various input configurations.

Image Size	FLOPs (G)	FPS	COD10K $S_{m}$ ↑	COD10K $F_{β}^{w}$ ↑	COD10K $E_{m}^{ϕ}$ ↑	COD10K $E_{m}^{max}$ ↑	COD10K $M$ ↓	NC4K $S_{m}$ ↑	NC4K $F_{β}^{w}$ ↑	NC4K $E_{m}^{ϕ}$ ↑	NC4K $E_{m}^{max}$ ↑	NC4K $M$ ↓
384 × 384	4.47	88	0.844	0.759	0.912	0.915	0.032	0.861	0.814	0.917	0.924	0.039
416 × 416	5.28	83	0.847	0.761	0.919	0.921	0.029	0.863	0.817	0.923	0.928	0.033
512 × 512	7.95	76	0.852	0.765	0.920	0.928	0.026	0.870	0.822	0.926	0.930	0.036

The ↑ symbol indicates that higher values represent better performance, whereas ↓ denotes that lower values are more favorable.
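
The FLOPs column in Table 5 grows close to quadratically with the side length of the input. The sketch below checks this by scaling the 384 × 384 figure by the pixel-count ratio. The proportionality is an assumption made for illustration (attention layers in the backbone need not scale exactly with pixel count), and the function name `pixel_scaled_gflops` is hypothetical.

```python
# (side length in pixels -> reported GFLOPs), copied from Table 5.
REPORTED_GFLOPS = {384: 4.47, 416: 5.28, 512: 7.95}


def pixel_scaled_gflops(side, ref_side=384, ref_gflops=4.47):
    """Scale a reference FLOPs figure by the pixel-count ratio.

    Assumes FLOPs are proportional to H * W, which is only an approximation.
    """
    return ref_gflops * (side / ref_side) ** 2


if __name__ == "__main__":
    for side, reported in REPORTED_GFLOPS.items():
        estimate = pixel_scaled_gflops(side)
        print(f"{side} x {side}: reported {reported:.2f} G, estimate {estimate:.2f} G")
```

Under this assumption the estimates come out at about 5.25 G for 416 × 416 and 7.95 G for 512 × 512, close to the reported 5.28 G and 7.95 G.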

Table 6. Ablation study of LiteCOD performance at different input resolutions on the COD10K dataset. The table reports GFLOPs, GMACs, and inference speed on the NVIDIA Jetson AGX Orin (50 W mode), showing the model’s lightweight design and near-real-time capability. FPS values are reported as the mean ± standard deviation over multiple test runs with varying scene complexities.

Input Resolution	FLOPs (G)	GMACs	FPS (Jetson)	$S_{m}$ ↑	$F_{β}^{w}$ ↑	$E_{m}^{ϕ}$ ↑	$E_{m}^{max}$ ↑	$M$ ↓
384 × 384	4.47	4.6	$24.5 \pm 1.5$	0.846	0.753	0.914	0.922	0.028
416 × 416	5.28	5.61	$22.5 \pm 1.6$	0.849	0.759	0.917	0.925	0.027
512 × 512	7.95	9.44	$19.0 \pm 1.8$	0.852	0.765	0.920	0.928	0.026

The ↑ symbol indicates that higher values represent better performance, whereas ↓ denotes that lower values are more favorable.
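
When budgeting an edge pipeline, per-frame latency is often easier to work with than frame rate. The sketch below converts the Jetson AGX Orin FPS values from Table 6 to milliseconds per frame and propagates the reported standard deviation with a first-order approximation. The conversion is standard arithmetic; the first-order spread and the function name `latency_ms` are assumptions made for this illustration and are not taken from the original work.

```python
# (side length -> (mean FPS, std FPS)) on Jetson AGX Orin, copied from Table 6.
JETSON_FPS = {384: (24.5, 1.5), 416: (22.5, 1.6), 512: (19.0, 1.8)}


def latency_ms(fps_mean, fps_std):
    """Convert FPS to per-frame latency in milliseconds.

    The spread uses a first-order approximation: std(1000 / fps) ~ 1000 * std / mean**2.
    """
    mean_ms = 1000.0 / fps_mean
    std_ms = 1000.0 * fps_std / fps_mean ** 2
    return mean_ms, std_ms


if __name__ == "__main__":
    for side, (mean_fps, std_fps) in JETSON_FPS.items():
        mean_ms, std_ms = latency_ms(mean_fps, std_fps)
        print(f"{side} x {side}: {mean_ms:.1f} ms per frame (about ±{std_ms:.1f} ms)")
```

The reported frame rates correspond to roughly 41 ms, 44 ms, and 53 ms per frame at 384, 416, and 512 pixels, respectively.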

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Khan, A.; Ullah, H.; Munir, A. LiteCOD: Lightweight Camouflaged Object Detection via Holistic Understanding of Local-Global Features and Multi-Scale Fusion. AI 2025, 6, 197. https://doi.org/10.3390/ai6090197
