3.2. Discrete Wavelet-Enhanced State-Space Inverse Graphics Architecture
3.2.1. Overall Architecture
To enhance the accuracy and reliability of 3D attribute recovery under single-view image input conditions, we propose a novel inverse graphics network architecture termed DWT-Mamba, which is augmented with wavelet transforms. This architecture adopts a typical four-stage pyramid design. The input image is first processed by an initial convolutional stem to extract shallow features, followed by four progressively downsampled stages to capture deep semantic representations. Each stage comprises a stack of DWT-Mamba blocks, with the number of blocks per stage set to [2, 4, 8, 4] and output channel dimensions of [64, 128, 256, 512], respectively. For effective multiscale feature modeling, the number of attention heads in each stage is configured as [2, 4, 8, 16], in proportion to the channel width. Overlapping convolutional layers with a stride of 2 are used for downsampling between stages to maintain spatial continuity and support efficient multiscale semantic feature extraction. The structure of the network is illustrated in
Figure 1.
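To make the stage layout concrete, the following PyTorch sketch instantiates the four-stage pyramid with the stated depths, channel widths, head counts, and stride-2 overlapping downsampling. The class names and the placeholder block body are illustrative assumptions rather than the released implementation; a full block would contain the WD-LPM and HNC-SSD components described in Sections 3.2.2 and 3.2.3.

import torch
import torch.nn as nn

# Stage configuration taken from the text:
# depths [2, 4, 8, 4], channels [64, 128, 256, 512], heads [2, 4, 8, 16].
DEPTHS, CHANNELS, HEADS = [2, 4, 8, 4], [64, 128, 256, 512], [2, 4, 8, 16]

class PlaceholderBlock(nn.Module):
    """Stand-in for a DWT-Mamba (or final MLA) block; a real block would
    implement WD-LPM + HNC-SSD as described in Sections 3.2.2-3.2.3."""
    def __init__(self, dim, heads):
        super().__init__()
        self.mix = nn.Conv2d(dim, dim, 3, padding=1, groups=heads)
        self.norm = nn.GroupNorm(1, dim)

    def forward(self, x):
        return x + self.mix(self.norm(x))

class DWTMambaBackbone(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, CHANNELS[0], kernel_size=7, stride=2, padding=3)
        stages, downs = [], []
        for s, (d, c, h) in enumerate(zip(DEPTHS, CHANNELS, HEADS)):
            stages.append(nn.Sequential(*[PlaceholderBlock(c, h) for _ in range(d)]))
            if s < 3:
                # Overlapping stride-2 convolution between stages.
                downs.append(nn.Conv2d(c, CHANNELS[s + 1], 3, stride=2, padding=1))
        self.stages, self.downs = nn.ModuleList(stages), nn.ModuleList(downs)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for s, stage in enumerate(self.stages):
            x = stage(x)
            feats.append(x)              # multiscale features for downstream heads
            if s < len(self.downs):
                x = self.downs[s](x)
        return feats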
In the first three stages, the DWT-Mamba block serves as the core component and is repeatedly stacked. Building on the linear-complexity sequence modeling of state-space models, this block introduces a non-causal modeling mechanism that removes the information-flow constraint and implements a hybrid non-causal state-space duality (HNC-SSD). This design enables the joint modeling of local structures and long-range dependencies, effectively overcoming the inherent causal limitations of conventional state-space formulations. Furthermore, to address the intrinsic suppression of high-frequency texture details by the Mamba architecture, a wavelet-enhanced dual-branch local perception module (WD-LPM) is incorporated at the front end of the network. This module models high-frequency information through parallel pathways in the frequency and spatial domains and employs a high-frequency gating mechanism for adaptive fusion, thereby improving the modeling capacity for boundary contours and fine-grained structures.
In addition, prior studies have shown that self-attention mechanisms are beneficial for modeling high-level semantic relationships [35]. Unlike Mamba2 [31], which distributes identical blocks uniformly across stages, our architecture strategically replaces the final DWT-Mamba block with a Multi-Head Latent Attention (MLA) module. By introducing a small number of learnable latent tokens, this module enables efficient global semantic interaction while significantly reducing computational complexity compared with standard multi-head self-attention. It not only enhances the modeling of high-order feature dependencies but also complements the earlier wavelet-enhanced modules focused on local detail refinement.
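The latent-token idea can be sketched as follows; the latent count, the use of torch.nn.MultiheadAttention, and the read/write scheme are illustrative assumptions rather than the exact MLA design used here.

import torch
import torch.nn as nn

class LatentAttention(nn.Module):
    """Minimal latent-token attention sketch: tokens interact through a small
    set of learnable latents, so the cost scales with O(N*M) rather than O(N^2).
    The latent count M and head split are illustrative choices."""
    def __init__(self, dim, num_heads=16, num_latents=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, dim) flattened tokens
        lat = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        lat, _ = self.read(lat, x, x)          # latents summarize the sequence
        out, _ = self.write(x, lat, lat)       # tokens read the global summary back
        return x + out                         # residual connection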
Overall, the proposed DWT-Mamba network integrates the efficiency of state-space modeling with the multiscale frequency-aware capabilities of wavelet enhancement while incorporating lightweight attention mechanisms. This combination facilitates unified modeling from local structural detail to global semantic abstraction. The specific design and implementation of the HNC-SSD and WD-LPM modules are detailed in the following sections.
3.2.2. Hybrid Non-Causal State-Space Duality
To improve context modeling and local structure expression of state-space models in non-causal vision tasks, this study introduces an HNC-SSD. The module combines global non-causal modeling with local window perception and employs a hierarchical aggregation scheme for unified multiscale feature representation. It mitigates the restricted information flow and limited fine granularity found in conventional state-space models when used in non-causal vision applications.
In a conventional SSM, the state update equations can be expressed as [29]
$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t,$$
where $h_t$ denotes the hidden state, $x_t$ denotes the input vector at time $t$, $y_t$ denotes the output, $A$ and $B$ represent the state transition matrix and the input mapping matrix, respectively, and $C$ denotes the output mapping matrix.
From these equations, it is clear that the conventional update rule is causal: the current state $h_t$ must rely on the previous hidden state $h_{t-1}$, which creates a strictly one-directional propagation path. This sequential dependence limits information flow in both directions within the sequence and leads to insufficient use of information when processing images or other non-temporal data, presenting evident constraints in vision tasks that demand long-range dependencies and global context.
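A minimal sketch of the causal recurrence makes this limitation explicit: every step must wait for the previous hidden state, so the scan cannot be parallelized across positions. Shapes and the dense matrices below are illustrative.

import torch

def causal_ssm_scan(x, A, B, C):
    """Sequential (causal) SSM scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    x: (L, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    L = x.shape[0]
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(L):          # strictly left-to-right: h_t depends on h_{t-1}
        h = A @ h + B @ x[t]
        ys.append(C @ h)
    return torch.stack(ys)      # (L, d_out)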
To overcome the above limitation, the HNC-SSD block adopts the idea of state-space duality [31]. The state transition matrix is reduced to a scalar on each channel, and a non-recursive structure converts the state update into a prefix-sum accumulation that can be executed in parallel, thus enabling non-causal global information aggregation. The corresponding state update equation is provided by
$$h_t = a_t h_{t-1} + B_t x_t, \quad (4)$$
where $a_t$ denotes the scalar form of the simplified state transition matrix and regulates the contribution of the current token to the hidden state. For brevity, denote $u_j = B_j x_j$. Unrolling Equation (4) provides a prefix-weighted sum
$$h_i = \sum_{j=1}^{i} a_j u_j. \quad (5)$$
This reformulation eliminates the dependence on the previous hidden state, enabling parallel computation while preserving the non-causal aggregation property. The structure is essentially a prefix-weighted sum and permits information from all positions in the sequence to be accumulated in a non-recursive, parallel manner. Each token's contribution no longer depends on the hidden state of the preceding token; instead, it is directly weighted by its own scalar coefficient $a_j$. In this way, every token becomes self-referenced, enabling information to flow in both directions within the sequence and thus eliminating the causal constraint.
Further, a two-direction scan strategy integrates the forward and backward pass results to model information in both directions, thus producing a global hidden state. To capture bidirectional context, we define the forward and backward prefix accumulations at position $i$:
$$h_i^{\rightarrow} = \sum_{j=1}^{i} a_j u_j, \qquad h_i^{\leftarrow} = \sum_{j=i}^{L} a_j u_j. \quad (6)$$
Therefore, for each token $i$, its hidden state is expressed as
$$h_i = h_i^{\rightarrow} + h_i^{\leftarrow} - a_i u_i + b. \quad (7)$$
Here, $L$ denotes the sequence length, i.e., the total number of tokens after flattening the spatial dimensions of the input feature map $X$, the index $j$ enumerates over this flattened sequence, and $b$ is a bias term.
By omitting the bias term and simplifying, we obtain
$$h_i = h_i^{\rightarrow} + h_i^{\leftarrow} - a_i u_i = \sum_{j=1}^{L} a_j u_j \triangleq h_{\mathrm{glob}}. \quad (8)$$
The above expression indicates that the model at every position can access the full input feature structure: all tokens share a single global hidden state $h_{\mathrm{glob}}$, which realizes non-causal information aggregation.
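The non-causal aggregation can be sketched with two cumulative sums; the tensor layout and variable names below are illustrative assumptions.

import torch

def noncausal_aggregate(u, a):
    """Bidirectional non-causal aggregation sketch.
    u: (L, d) token contributions u_j = B_j x_j; a: (L, 1) per-token scalars a_j.
    Forward and backward prefix sums combine into one shared global state."""
    w = a * u                              # each token weighted by its own scalar a_j
    fwd = torch.cumsum(w, dim=0)           # h_i^fwd  = sum_{j<=i} a_j u_j
    bwd = torch.flip(torch.cumsum(torch.flip(w, [0]), dim=0), [0])  # sum_{j>=i}
    h = fwd + bwd - w                      # equals sum_{j=1..L} a_j u_j for every i
    return h                               # every row is the shared global hidden state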
Although global non-causal modeling improves context awareness, its lack of local perception with high-resolution images can limit the capture of fine detail [36]. The HNC-SSD block therefore adopts a hierarchical aggregation scheme. In the early layers, a local window mechanism models spatial neighborhoods, a spatial decay kernel strengthens fine feature extraction, and the layered fusion keeps both local and global information while retaining non-causality. Let a local window function $\Omega_r(i)$ denote the neighborhood of token $i$ within distance $r$; the local hidden state is then provided by
$$h_i^{\mathrm{loc}} = \frac{1}{Z_i} \sum_{j \in \Omega_r(i)} K(i, j)\, a_j u_j, \qquad Z_i = \sum_{j \in \Omega_r(i)} K(i, j). \quad (9)$$
Here, $K(i, j)$ is a learnable Gaussian kernel, and the normalization factor $Z_i$ keeps the local weight distribution stable by ensuring it sums to one even when the window size varies.
This design lets each position aggregate only its neighborhood information, preserves the non-causal property, reduces computation, and improves sensitivity to local edges and textures. In deeper layers, the global aggregation of Equation (8) is restored, gathering features from all positions to produce the global hidden state $h_{\mathrm{glob}}$. To combine local and global cues, HNC-SSD introduces a gated fusion coefficient $\gamma \in [0, 1]$ and performs dynamic weighted fusion of the two hidden states, expressed as
$$h_i = \gamma\, h_i^{\mathrm{loc}} + (1 - \gamma)\, h_{\mathrm{glob}}, \quad (10)$$
where $\gamma$ is a tunable factor that balances the contributions of local and global information. The layered aggregation ensures effective integration across levels, widening the modeling scope while refining detail representation.
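A compact sketch of the hierarchical aggregation is given below; the fixed Gaussian kernel, the depthwise 1D convolution used to realize the window sum, and the scalar gate value are illustrative stand-ins for the learnable components described above.

import torch
import torch.nn.functional as F

def local_global_fusion(u, a, radius=3, sigma=2.0, gamma=0.5):
    """Hierarchical aggregation sketch: a Gaussian-windowed local state is fused
    with the shared global state through a scalar gate gamma.
    u: (L, d) token contributions u_j = B_j x_j; a: (L, 1) per-token scalars a_j."""
    w = a * u                                             # a_j u_j
    d = w.shape[1]
    # Normalized Gaussian window of radius r (sums to one, playing the role of 1/Z_i).
    offs = torch.arange(-radius, radius + 1, dtype=w.dtype)
    k = torch.exp(-offs ** 2 / (2 * sigma ** 2))
    k = (k / k.sum()).view(1, 1, -1).repeat(d, 1, 1)      # one kernel per channel
    # A depthwise 1D convolution realizes the windowed, normalized local sum.
    h_loc = F.conv1d(w.t().unsqueeze(0), k, padding=radius, groups=d)
    h_loc = h_loc.squeeze(0).t()                          # back to (L, d)
    h_glob = w.sum(dim=0, keepdim=True).expand_as(w)      # single shared global state
    return gamma * h_loc + (1 - gamma) * h_glob           # gated fusion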
3.2.3. Wavelet-Enhanced Dual-Branch Local Perception Module
To mitigate the suppression of high-frequency information observed in Mamba-based vision tasks, this study proposes a wavelet-enhanced dual-branch local perception module, denoted as WD-LPM. By means of an explicit frequency-domain and spatial-domain two-branch design, WD-LPM preserves computational efficiency while markedly improving the model’s ability to capture high-frequency detail, thereby correcting the inherent low-frequency bias of the Mamba architecture.
Given an input feature map $X$, the frequency-domain branch first applies a two-dimensional discrete wavelet transform (DWT) in the horizontal and vertical directions, decomposing $X$ into one low-frequency subband $X_{LL}$ and three high-frequency subbands $X_{LH}$, $X_{HL}$, and $X_{HH}$. The transform adopts the classical two-dimensional orthogonal Haar basis, whose filter bank is obtained by taking the Kronecker product of a low-pass filter and a high-pass filter [37]. To ensure perfect reconstruction and strict consistency between decomposition and synthesis, both the DWT and its inverse (IDWT) adopt the same set of orthogonal Haar filters and use mirror padding at the boundaries throughout the frequency-domain pathway.
To highlight key high-frequency cues, a dynamic weight-modulation scheme is introduced, in which low-frequency semantic information is used to adjust the response strength of the high-frequency subbands. Specifically, Global Average Pooling (GAP) is applied to the low-frequency subband, and the high-frequency modulation weights $W_h$ are defined by
$$W_h = \sigma\big( W_p\, \mathrm{GAP}(X_{LL}) \big).$$
Here, $\sigma$ denotes the Sigmoid activation function and $W_p$ is a learnable projection that maps the pooled descriptor to $3C$ channels. The resulting weights $W_h$ are split into three channel groups $\{W_{LH}, W_{HL}, W_{HH}\}$ and applied to the three high-frequency subbands through per-channel multiplication implemented as
$$\tilde{X}_{s} = W_{s} \odot X_{s}, \quad s \in \{LH, HL, HH\},$$
where $\odot$ denotes multiplication applied to each channel and $C$ denotes the channel count. This mechanism enables the network to enhance high-frequency representations adaptively according to low-frequency global semantics, thereby supplying dynamic compensation for high-frequency information.
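The decomposition and the low-frequency-guided modulation can be sketched as follows; the simple 2x2 Haar implementation (without the mirror padding used in the full method) and the linear projection that produces the 3C weights are illustrative assumptions, as are the names haar_dwt and HighFreqModulation.

import torch
import torch.nn as nn

def haar_dwt(x):
    """Single-level 2D Haar DWT via 2x2 block averaging/differencing
    (non-normalized variant). x: (B, C, H, W) with even H, W."""
    a, b = x[..., 0::2, :], x[..., 1::2, :]        # split rows
    lo_r, hi_r = (a + b) / 2, (a - b) / 2
    def split_cols(t):
        c, d = t[..., :, 0::2], t[..., :, 1::2]
        return (c + d) / 2, (c - d) / 2
    LL, LH = split_cols(lo_r)
    HL, HH = split_cols(hi_r)
    return LL, LH, HL, HH

class HighFreqModulation(nn.Module):
    """Sketch of low-frequency-guided gating of the high-frequency subbands:
    GAP on LL -> learnable projection to 3C weights -> sigmoid -> per-channel
    scaling of LH/HL/HH. The linear projection is an assumed design detail."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Linear(channels, 3 * channels)

    def forward(self, LL, LH, HL, HH):
        g = LL.mean(dim=(2, 3))                    # global average pooling
        w = torch.sigmoid(self.proj(g))            # (B, 3C)
        w_lh, w_hl, w_hh = w.chunk(3, dim=1)
        scale = lambda t, wc: t * wc.unsqueeze(-1).unsqueeze(-1)
        return scale(LH, w_lh), scale(HL, w_hl), scale(HH, w_hh)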
Next, an inverse discrete wavelet transform (IDWT) reconstructs the feature map from the modulated subbands, producing the enhanced frequency-domain feature $F_{\mathrm{freq}}$. In the spatial branch, a lightweight asymmetric depthwise-separable convolution extracts local spatial cues and generates an efficient spatial representation $F_{\mathrm{spat}}$.
Finally, to balance the contributions of frequency and spatial features, a dynamic gating mechanism driven by high-frequency energy is introduced. The proportion of high-frequency energy in the total (low- plus high-frequency) energy is mapped to a fusion coefficient $g$, which adaptively weights and merges the frequency and spatial paths. The procedure is formulated as
$$g = \frac{\sum_{s \in \{LH, HL, HH\}} \|\tilde{X}_{s}\|_1}{\|X_{LL}\|_1 + \sum_{s \in \{LH, HL, HH\}} \|\tilde{X}_{s}\|_1 + \epsilon}, \qquad Y = g\, F_{\mathrm{freq}} + (1 - g)\, F_{\mathrm{spat}},$$
where $\|\cdot\|_1$ denotes the L1 norm of the feature map and $\epsilon$ denotes a small constant that keeps the computation numerically stable. When low-frequency energy is dominant, the mechanism preserves more spatial structural cues, whereas a pronounced high-frequency component strengthens frequency-domain detail, leading to a smooth fusion of the two branches.
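The reconstruction and the energy-driven gate can be sketched as below; the non-normalized Haar inverse, the asymmetric depthwise branch, and the global L1 energy statistics are illustrative simplifications of the design described above.

import torch
import torch.nn as nn

def haar_idwt(LL, LH, HL, HH):
    """Inverse of the simple Haar DWT sketched above (non-normalized variant)."""
    B, C, H, W = LL.shape
    lo_r = torch.zeros(B, C, H, 2 * W, device=LL.device, dtype=LL.dtype)
    hi_r = torch.zeros_like(lo_r)
    lo_r[..., 0::2], lo_r[..., 1::2] = LL + LH, LL - LH
    hi_r[..., 0::2], hi_r[..., 1::2] = HL + HH, HL - HH
    x = torch.zeros(B, C, 2 * H, 2 * W, device=LL.device, dtype=LL.dtype)
    x[..., 0::2, :], x[..., 1::2, :] = lo_r + hi_r, lo_r - hi_r
    return x

class AsymDWConv(nn.Module):
    """Lightweight asymmetric depthwise-separable spatial branch (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Sequential(
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), groups=channels),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), groups=channels))
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pw(self.dw(x))

def energy_gated_fusion(f_freq, f_spat, LL, LH, HL, HH, eps=1e-6):
    """High-frequency-energy gate: the share of high-frequency L1 energy sets
    the fusion coefficient g between the frequency and spatial branches."""
    hf = LH.abs().sum() + HL.abs().sum() + HH.abs().sum()
    g = hf / (LL.abs().sum() + hf + eps)
    return g * f_freq + (1 - g) * f_spat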
With the explicit cooperation of wavelet domain decomposition and spatial convolution, the method compensates for the Mamba architecture’s limited attention to high-frequency detail without notable extra computation and markedly improves both high-frequency feature modeling and overall performance.
To further enhance the transparency and reproducibility of our method, we provide a unified pseudocode implementation that systematically summarizes the entire forward process of a single DWT-Mamba block. Building on the modular decomposition in
Section 3.2.2 and
Section 3.2.3, the pseudocode explicitly integrates the WD-LPM and HNC-SSD modules, following the exact order of frequency-domain feature modulation, energy-gated fusion, and non-causal global–local aggregation. This formal description not only bridges the theoretical derivations above with the practical realization shown in
Figure 1 but also facilitates precise reproduction of the block’s computation pipeline in both research and application scenarios. The detailed procedure is presented in Algorithm 1.
Algorithm 1: DWT-Mamba Block (Forward)
1: Input: feature map $X$; orthogonal Haar filters; input/output maps $B$, $C$; gating scalar $\gamma$; learnable positive scalars $\{a_j\}$; window radii $\{r\}$; kernel MLP for $K(i, j)$
2: Output: non-causal representation $H$
// WD-LPM
3: $(X_{LL}, X_{LH}, X_{HL}, X_{HH}) \leftarrow \mathrm{DWT}(X)$
4: $W_h \leftarrow \sigma(W_p\, \mathrm{GAP}(X_{LL}))$
5: $(\tilde{X}_{LH}, \tilde{X}_{HL}, \tilde{X}_{HH}) \leftarrow$ split $W_h$ into three channel groups and scale the high-frequency subbands per channel
6: $F_{\mathrm{freq}} \leftarrow \mathrm{IDWT}(X_{LL}, \tilde{X}_{LH}, \tilde{X}_{HL}, \tilde{X}_{HH})$
7: $F_{\mathrm{spat}} \leftarrow$ asymmetric depthwise-separable convolution of $X$
8: $g \leftarrow$ high-frequency energy ratio (Section 3.2.3)
9: $Y \leftarrow g\, F_{\mathrm{freq}} + (1 - g)\, F_{\mathrm{spat}}$
// HNC-SSD
10: flatten $Y$ into a token sequence $\{x_1, \dots, x_L\}$ and set $u_j \leftarrow B_j x_j$
11: initialize the forward and backward accumulators to zero
12: for $i \leftarrow 1$ to $L$ do
13:   accumulate $h_i^{\rightarrow}$ and $h_i^{\leftarrow}$ by forward and backward scans of $a_j u_j$
14: end for
15: $h_{\mathrm{glob}} \leftarrow \sum_{j=1}^{L} a_j u_j$ (Equation (8))
16: for $i \leftarrow 1$ to $L$ do
17:   determine the local window $\Omega_r(i)$
18:   evaluate the Gaussian kernel weights $K(i, j)$ for $j \in \Omega_r(i)$
19:   $Z_i \leftarrow \sum_{j \in \Omega_r(i)} K(i, j)$
20:   $h_i^{\mathrm{loc}} \leftarrow Z_i^{-1} \sum_{j \in \Omega_r(i)} K(i, j)\, a_j u_j$
21:   $h_i \leftarrow \gamma\, h_i^{\mathrm{loc}} + (1 - \gamma)\, h_{\mathrm{glob}}$
22:   $H_i \leftarrow C\, h_i$
23: end for
24: reshape $\{H_i\}$ to the spatial layout of $X$
25: return $H$
3.3. Cross-Domain 3D Adversarial Texture Generation Framework
To address adversarial texture generation for three-dimensional objects in real-world target detection, this study presents a three-stage framework supervised by a single-view two-dimensional image. Using the image as guidance, the pipeline performs geometric modeling, texture refinement, and adversarial optimization in sequence, gradually producing a three-dimensional texture with strong adversarial effect and physical robustness. The workflow of the first two training stages is shown in
Figure 2.
Stage 1. Training stage for 3D attribute recovery with multi-view pseudo-supervision. Earlier work shows that differentiable renderers can train neural networks for three-dimensional inference, but they usually require multi-view images, camera parameters, and object silhouettes to reach high accuracy [
38,
39,
40], and collecting such data is costly. To overcome the scarcity of real three-dimensional data, this stage adopts a synthetic multi-view supervision scheme. The aim is to train the proposed inverse graphics model, the DWT-Mamba block, to predict the target object’s mesh, texture, and lighting. A StyleGAN generator [
41] supplies latent three-dimensional structure encoded in its hidden space, allowing a single-view target image to be expanded into a large set of multi-view images of the same object. The inverse graphics model is updated with these multi-view signals; it decouples the input view from a randomly sampled target view and exploits geometric constraints between views to guide the learning of latent three-dimensional representations.
The overall stage 1 training pipeline (multi-view pseudo-supervision for 3D attribute recovery) is summarized in Algorithm 2. During training, the network receives a single-view image $x$. The DWT-Mamba block predicts the target's three-dimensional attributes, which a differentiable renderer converts into images $\hat{x}_v$ from other views $v$. These renderings are compared with the new views $x_v$ produced by StyleGAN, and the resulting difference defines the loss to be minimized. This strategy prevents the network from fitting only one viewpoint.
Algorithm 2: CAM3D Stage 1: Multi-View Pseudo-Supervised Reconstruction
1: Input: single-view image $x$; StyleGAN generator $G$; inverse-graphics network $F$ (DWT-Mamba, parameters $\theta$); differentiable renderer $R$; feature extractor $\phi$; viewpoint set $V$; edge set $E$; loss weights $\lambda_{\mathrm{img}}, \lambda_{\mathrm{geo}}, \lambda_{\mathrm{sm}}$; learning rate $\eta$; maximum iteration $T$
2: Output: updated $\theta$
3: Initialize: $\theta \leftarrow \theta_0$
4: for $t \leftarrow 1$ to $T$ do
5:   sample a target viewpoint $v \in V$
6:   obtain the pseudo view $x_v$ of the same object at viewpoint $v$ from $G$
7:   predict the 3D attributes (mesh, texture, lighting) $\leftarrow F(x; \theta)$
8:   render $\hat{x}_v \leftarrow R(\text{mesh}, \text{texture}, \text{lighting}; v)$
9:   compute the masked perceptual loss $\mathcal{L}_{\mathrm{img}}(x_v, \hat{x}_v)$ with $\phi$
10:  compute the geometric consistency loss $\mathcal{L}_{\mathrm{geo}}$ and the Laplacian smoothness loss $\mathcal{L}_{\mathrm{sm}}$
11:  $\mathcal{L}_{\mathrm{stage1}} \leftarrow \lambda_{\mathrm{img}} \mathcal{L}_{\mathrm{img}} + \lambda_{\mathrm{geo}} \mathcal{L}_{\mathrm{geo}} + \lambda_{\mathrm{sm}} \mathcal{L}_{\mathrm{sm}}$
12:  $\theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}_{\mathrm{stage1}}$
13: end for
14: return $\theta$
We first define an image perceptual reconstruction loss. A pretrained feature extractor $\phi$, chosen as ResNet-50, computes at several feature levels $m$ the masked difference between the synthesized target-view image $x_v$ and the rendered image $\hat{x}_v$, written as
$$\mathcal{L}_{\mathrm{img}} = \sum_{m} \big\| M_v \odot \big( \phi_m(x_v) - \phi_m(\hat{x}_v) \big) \big\|_1,$$
where $M_v$ denotes the object mask at view $v$.
To secure geometric accuracy, we introduce a geometric consistency loss
$$\mathcal{L}_{\mathrm{geo}} = 1 - \frac{|\hat{M}_v \cap M_v|}{|\hat{M}_v \cup M_v|},$$
where $\hat{M}_v$ and $M_v$ denote the predicted and true mask regions at view $v$, respectively. A Laplacian smoothness loss further constrains the difference between unit normals of adjacent mesh vertices, expressed as
$$\mathcal{L}_{\mathrm{sm}} = \sum_{(i, j) \in E} \big\| n_i - n_j \big\|_2^2,$$
where $n_i$ denotes the unit normal of the $i$-th vertex and $E$ denotes the mesh edge set. These losses, combined with fixed weights, form the stage 1 training objective
$$\mathcal{L}_{\mathrm{stage1}} = \lambda_{\mathrm{img}} \mathcal{L}_{\mathrm{img}} + \lambda_{\mathrm{geo}} \mathcal{L}_{\mathrm{geo}} + \lambda_{\mathrm{sm}} \mathcal{L}_{\mathrm{sm}}.$$
At this stage, because the synthetic data contain view inconsistencies, the goal is to obtain reliable three-dimensional mesh predictions and a plausible texture estimate rather than a finely detailed texture. The multi-view supervision scheme limits single-view overfitting; random viewpoint shifts help the model to remain stable under unstructured noise. Although local representation errors exist across views, the random sampling makes these errors mutually uncorrelated, so statistical consistency steers the network toward an optimal geometric solution. Mathematically, this process is equivalent to a maximum-likelihood multi-hypothesis ensemble learning method.
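A sketch of how the three stage-1 terms could be combined is given below; the feature levels, the L1 perceptual norm, the soft-IoU form, and the weight values are illustrative assumptions rather than the exact training configuration.

import torch
import torch.nn.functional as F

def stage1_losses(features_a, features_b, masks, mask_pred, mask_gt,
                  normals, edges, weights=(1.0, 1.0, 0.1)):
    """Stage-1 objective sketch: masked perceptual term, mask-IoU geometric
    term, and Laplacian normal-smoothness term.
    features_a/b: lists of feature maps of the pseudo view and the rendering;
    masks: the view mask resized to each feature level;
    normals: (V, 3) unit vertex normals; edges: (E, 2) long tensor."""
    w_img, w_geo, w_sm = weights
    # Masked perceptual difference across several feature levels.
    l_img = sum(F.l1_loss(m * fa, m * fb)
                for fa, fb, m in zip(features_a, features_b, masks))
    # Soft IoU between predicted and ground-truth silhouettes.
    inter = (mask_pred * mask_gt).sum()
    union = (mask_pred + mask_gt - mask_pred * mask_gt).sum()
    l_geo = 1.0 - inter / (union + 1e-6)
    # Difference of unit normals across mesh edges (i, j).
    i, j = edges[:, 0], edges[:, 1]
    l_sm = ((normals[i] - normals[j]) ** 2).sum(dim=1).mean()
    return w_img * l_img + w_geo * l_geo + w_sm * l_sm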
Stage 2. Training stage for high-fidelity texture refinement from a real single-view image. After the first stage, the inverse graphics model attains stable preliminary predictions of three-dimensional attributes. Because the pseudo-supervision used earlier lacks certain details, the initial texture is coarse and shows color shift and edge noise. The objective now shifts from enforcing consistency across multiple views to fine-tuning texture detail under the real single view. The input remains the original single-view image $x$, the network outputs the three-dimensional attribute triple (mesh, texture, lighting), and the differentiable renderer $R$ produces a rendered image $\hat{x}$ from the same view. In contrast with stage one, the real input image $x$ itself serves as the high-fidelity supervision signal, thereby enhancing the detail quality of the generated texture. The stage 2 fine-tuning procedure for high-fidelity texture refinement from a real single view is illustrated in Algorithm 3.
Algorithm 3: CAM3D Stage 2: Real-Image Detail Refinement
1: Input: real image $x$ at view $v_0$; trained parameters $\theta$; renderer $R$; texture-domain pixel set $\mathcal{P}$; neighbor index set $\mathcal{N}$; loss weights $\lambda_{\mathrm{color}}, \lambda_{\mathrm{vs}}$; learning rate $\eta$; maximum iteration $T$
2: Output: updated parameters $\theta$
3: Initialize: set train mode of $F$ and load $\theta$ from stage 1
4: for $t \leftarrow 1$ to $T$ do
5:   predict the 3D attributes (mesh, texture, lighting) $\leftarrow F(x; \theta)$
6:   render $\hat{x} \leftarrow R(\text{mesh}, \text{texture}, \text{lighting}; v_0)$
7:   compute the stage 1 losses $\mathcal{L}_{\mathrm{img}}, \mathcal{L}_{\mathrm{geo}}, \mathcal{L}_{\mathrm{sm}}$ against the real image $x$
8:   compute the color consistency loss $\mathcal{L}_{\mathrm{color}}$ over $\mathcal{P}$
9:   compute the visual smoothness loss $\mathcal{L}_{\mathrm{vs}}$ over $\mathcal{N}$
10:  $\mathcal{L}_{\mathrm{stage2}} \leftarrow \mathcal{L}_{\mathrm{stage1}} + \lambda_{\mathrm{color}} \mathcal{L}_{\mathrm{color}} + \lambda_{\mathrm{vs}} \mathcal{L}_{\mathrm{vs}}$
11:  $\theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}_{\mathrm{stage2}}$
12: end for
13: return $\theta$
This stage maintains the losses from stage one and adds a color consistency loss and a visual smoothness loss to raise texture quality.
The color consistency loss ensures that the predicted texture matches the real texture in color space and is defined by
$$\mathcal{L}_{\mathrm{color}} = \frac{1}{|\mathcal{P}|} \sum_{n \in \mathcal{P}} \big\| \hat{c}_n - c_n \big\|_1,$$
where $\mathcal{P}$ denotes the set of all pixel indices, and $\hat{c}_n$ and $c_n$ are the predicted and real color values at pixel $n$. Optimizing this term improves color fidelity and lowers the visual gap between predicted and real textures.
The visual smoothness loss is provided by
$$\mathcal{L}_{\mathrm{vs}} = \sum_{(m, n) \in \mathcal{N}} \big\| \hat{c}_m - \hat{c}_n \big\|_2^2,$$
where $\mathcal{N}$ lists all pixel neighborhood pairs in the texture map. Minimizing this loss discourages abrupt color changes and yields a smoother texture appearance.
The overall objective for stage two is
$$\mathcal{L}_{\mathrm{stage2}} = \mathcal{L}_{\mathrm{stage1}} + \lambda_{\mathrm{color}} \mathcal{L}_{\mathrm{color}} + \lambda_{\mathrm{vs}} \mathcal{L}_{\mathrm{vs}}.$$
With real-image supervision and the joint perceptual, color, and smooth constraints, the inverse graphics network preserves sound geometry and lighting while achieving finer texture detail through more accurate view alignment.
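The two texture-quality terms and the stage-2 total can be sketched as follows; the L1 color norm, the total-variation-style neighborhood definition, and the weight values are illustrative assumptions.

import torch

def color_consistency_loss(tex_pred, tex_real):
    """Mean per-pixel color discrepancy between predicted and real texture
    values (an L1 form is assumed here)."""
    return (tex_pred - tex_real).abs().mean()

def visual_smoothness_loss(tex):
    """Penalize abrupt color changes between horizontally and vertically
    neighboring texels of the texture map (total-variation-style sketch).
    tex: (C, H, W)."""
    dh = (tex[:, 1:, :] - tex[:, :-1, :]).pow(2).mean()
    dw = (tex[:, :, 1:] - tex[:, :, :-1]).pow(2).mean()
    return dh + dw

def stage2_objective(l_stage1, tex_pred, tex_real, w_color=1.0, w_vs=0.1):
    """Stage-2 total: the stage-1 terms plus the two texture-quality terms."""
    return (l_stage1
            + w_color * color_consistency_loss(tex_pred, tex_real)
            + w_vs * visual_smoothness_loss(tex_pred))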
Stage 3. Adversarial texture generation stage. After completing three-dimensional reconstruction and texture recovery, this phase aims to create physically robust adversarial textures able to mislead mainstream detectors such as YOLOv5, DETR, CenterNet, and YOLOX across diverse viewpoints, lighting conditions, and paint application errors. Accordingly, this phase establishes an end-to-end differentiable optimization process built upon the pretrained inverse graphics network and the neural renderer. This stage’s pipeline is summarized in
Figure 3, and the optimization steps are detailed in Algorithm 4.
Algorithm 4: CAM3D Stage 3: Cross-Domain Adversarial Texture Optimization
1: Input: target image $x$; trained network $F$; renderer $R$; detector set $\mathcal{D}$; viewpoint set $V$; lighting set $\Lambda$; perturbation model $\Pi$; reference view $v_0$; reference light $l_0$; weights $\lambda_1, \lambda_2, \lambda_3, \lambda_4$; learning rate $\eta$; label $y$; box $b$; maximum iteration $N$
2: Output: physically robust adversarial texture $T_{\mathrm{adv}}$
3: Initialize: recover the 3D attributes (mesh, texture, lighting) with $F(x)$ and set $T_{\mathrm{adv}}$ to the recovered texture
4: for $t \leftarrow 1$ to $N$ do
5:   render the reference image $R(\text{mesh}, T_{\mathrm{adv}}, l_0; v_0)$ and compute the basic adversarial loss $\mathcal{L}_{\mathrm{adv}}$ over the detectors in $\mathcal{D}$
6:   for $v \in V$ do
7:     for $l \in \Lambda$ do
8:       render $x_{v, l} \leftarrow R(\text{mesh}, T_{\mathrm{adv}}, l; v)$
9:       evaluate the detectors on $x_{v, l}$
10:      accumulate the corresponding adversarial losses
11:    end for
12:  end for
13:  $\mathcal{L}_{\mathrm{view}} \leftarrow$ mean adversarial loss over the viewpoints
14:  $\mathcal{L}_{\mathrm{light}} \leftarrow$ mean adversarial loss over the lighting conditions
15:  render the color-perturbed texture $\Pi(T_{\mathrm{adv}})$ and compute $\mathcal{L}_{\mathrm{paint}}$
16:  $\mathcal{L}_{\mathrm{stage3}} \leftarrow \lambda_1 \mathcal{L}_{\mathrm{adv}} + \lambda_2 \mathcal{L}_{\mathrm{view}} + \lambda_3 \mathcal{L}_{\mathrm{light}} + \lambda_4 \mathcal{L}_{\mathrm{paint}}$
17:  back-propagate through $R$ and the detectors
18:  $T_{\mathrm{adv}} \leftarrow T_{\mathrm{adv}} - \eta \nabla_{T_{\mathrm{adv}}} \mathcal{L}_{\mathrm{stage3}}$
19: end for
20: return $T_{\mathrm{adv}}$
First, the inverse-graphics model takes a single-view image of the target and automatically reconstructs its 3D mesh, texture maps, and illumination parameters. A differentiable renderer then synthesizes 2D renderings under varied viewpoints, lighting, and simulated physical conditions, thereby emulating the diverse camouflage scenarios expected in practice. These renderings are fed to mainstream object detectors to evaluate the deception effect, and a joint adversarial loss is computed from their outputs. Finally, back-propagation simultaneously optimizes the texture, illumination, and related parameters, strengthening the adversarial texture against changes in viewpoint, complex lighting, and a spectrum of real-world disturbances, including color deviations from imperfect printing or spraying.
Specifically, the joint loss in this stage comprises four components, namely a basic adversarial loss, a multi-view robustness loss, an illumination robustness loss, and a paint-error constraint. We first define the basic adversarial loss $\mathcal{L}_{\mathrm{adv}}$, which combines the detector classification error and the bounding-box localization error to quantify how effectively the current texture deceives the detection model:
$$\mathcal{L}_{\mathrm{adv}} = -\big[ \mathcal{L}_{\mathrm{cls}}(\hat{y}, y) + \mathcal{L}_{\mathrm{loc}}(\hat{b}, b) \big],$$
so that minimizing $\mathcal{L}_{\mathrm{adv}}$ drives the detector toward incorrect predictions. Here, $\mathcal{L}_{\mathrm{cls}}$ denotes the cross-entropy loss between the predicted class $\hat{y}$ and the true class $y$, and $\mathcal{L}_{\mathrm{loc}}$ denotes the localization error between the predicted box $\hat{b}$ and the ground-truth box $b$, usually measured with IoU.
At the same time, to simulate robustness degradation arising from viewpoint changes in real deployments, we introduce the multi-view robustness loss $\mathcal{L}_{\mathrm{view}}$. With the current optimized three-dimensional attributes, a set of predefined viewing angles $V$ is specified, and a differentiable renderer produces two-dimensional renderings at each angle. Every rendering is fed to the object detector, the basic adversarial loss is computed, and their mean value is adopted as the optimization target for viewpoint invariance:
$$\mathcal{L}_{\mathrm{view}} = \frac{1}{|V|} \sum_{v \in V} \mathcal{L}_{\mathrm{adv}}\big( R(T_{\mathrm{adv}}; v) \big).$$
This loss encourages the optimized adversarial texture to mislead the detector over diverse viewpoints, thereby strengthening the generalization of the attack.
To further accommodate illumination variations encountered in physical environments and to ensure that the generated adversarial texture preserves a consistent attack capability under different light intensities, we introduce the illumination robustness loss $\mathcal{L}_{\mathrm{light}}$. With the texture and three-dimensional mesh features held fixed, the environmental illumination vector $L$ is varied over a set $\Lambda$ of lighting conditions, images $x_l$ are rendered under these conditions, and the mean detection loss serves as the constraint
$$\mathcal{L}_{\mathrm{light}} = \frac{1}{|\Lambda|} \sum_{l \in \Lambda} \mathcal{L}_{\mathrm{adv}}\big( R(T_{\mathrm{adv}}; l) \big).$$
In real-world deployment, adversarial textures undergo color shifts from painting and printing. To model these physical perturbations during training, we introduce a paint-error loss $\mathcal{L}_{\mathrm{paint}}$. Given the current texture $T_{\mathrm{adv}}$, we render both the original and a color-perturbed version $\Pi(T_{\mathrm{adv}})$ with controllable bounded deviations and minimize the average adversarial loss over the two renders:
$$\mathcal{L}_{\mathrm{paint}} = \frac{1}{2} \Big[ \mathcal{L}_{\mathrm{adv}}\big( R(T_{\mathrm{adv}}) \big) + \mathcal{L}_{\mathrm{adv}}\big( R(\Pi(T_{\mathrm{adv}})) \big) \Big].$$
The total loss for this stage combines all terms with fixed weights:
$$\mathcal{L}_{\mathrm{stage3}} = \lambda_1 \mathcal{L}_{\mathrm{adv}} + \lambda_2 \mathcal{L}_{\mathrm{view}} + \lambda_3 \mathcal{L}_{\mathrm{light}} + \lambda_4 \mathcal{L}_{\mathrm{paint}}.$$
By optimizing this joint objective, the proposed three-stage method yields three-dimensional adversarial textures that stay effective and robust under multiple perturbations in real environments. The approach is applicable to a wide range of targets, such as cars, aircraft, and ships, and offers promising practical value.
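The joint stage-3 objective can be sketched as follows; the renderer and detector interfaces (render, detector.losses, perturb) are hypothetical stand-ins, and the negated detection loss reflects the sign convention used above.

import torch

def adv_loss(detector, image, label, box):
    """Basic adversarial term: negated detector classification + localization
    error, so that minimizing it pushes the detector toward wrong outputs.
    The detector.losses interface is a hypothetical stand-in."""
    cls_loss, loc_loss = detector.losses(image, label, box)
    return -(cls_loss + loc_loss)

def stage3_objective(render, detector, texture, perturb, views, lights,
                     ref_view, ref_light, label, box, w=(1.0, 1.0, 1.0, 1.0)):
    """Joint stage-3 loss sketch: base adversarial term at the reference
    view/light, plus multi-view, illumination, and paint-error terms."""
    w_adv, w_view, w_light, w_paint = w
    l_adv = adv_loss(detector, render(texture, ref_view, ref_light), label, box)
    l_view = torch.stack([adv_loss(detector, render(texture, v, ref_light), label, box)
                          for v in views]).mean()
    l_light = torch.stack([adv_loss(detector, render(texture, ref_view, l), label, box)
                           for l in lights]).mean()
    l_paint = 0.5 * (l_adv + adv_loss(detector,
                                      render(perturb(texture), ref_view, ref_light),
                                      label, box))
    return w_adv * l_adv + w_view * l_view + w_light * l_light + w_paint * l_paint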
To ground these methodological innovations empirically, the subsequent experimental section is organized to mirror the overall design of CAM3D, covering quantitative and qualitative validation for each key module. First, to assess the improvements in geometric reconstruction accuracy and texture fidelity brought by the DWT-Mamba backbone, including its hybrid state-space modeling and wavelet enhancement modules, we conduct systematic single-view 3D reconstruction experiments across diverse object categories. Second, to quantify the cross-domain robustness and transferability of the adversarial textures produced by the three-stage optimization, we devise attack experiments spanning both digital simulation and real-world physical scenarios, reflecting the multi-view and diverse weather conditions encountered in practical deployment. Finally, to isolate the contribution of each structural module and to clarify the computational advantages of the state-space design, we perform targeted ablation studies and efficiency analyses, and we investigate how the different training stages and loss combinations affect the overall adversarial and reconstruction performance. This progressive, purpose-driven experimental design maps each methodological innovation to empirical evidence and lays the foundation for the subsequent analysis and discussion of results.