Physics-Aware Spatiotemporal Consistency for Transferable Defense of Autonomous Driving Perception

Liu, Yang; Nie, Zishan; Yu, Tong; Chen, Minghui; Yao, Zhiheng; Lu, Jieke; Peng, Linya; Fan, Fuming

doi:10.3390/s26030835

by Yang Liu ¹, Zishan Nie ¹, Tong Yu ², Minghui Chen ¹, Zhiheng Yao ¹, Jieke Lu ¹,³, Linya Peng ⁴ and Fuming Fan ¹,*

¹ Hubei Key Laboratory of Internet of Intelligence, School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China

² School of Computer Science, Northeast Electric Power University, Jilin 132000, China

³ Electric Power Research Institute of Guangxi Power Grid Co., Ltd., Nanning 530000, China

⁴ Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(3), 835; https://doi.org/10.3390/s26030835

Submission received: 19 December 2025 / Revised: 21 January 2026 / Accepted: 23 January 2026 / Published: 27 January 2026

(This article belongs to the Special Issue Advanced Sensor Technologies for Multimodal Decision-Making)


Abstract

Autonomous driving perception systems are vulnerable to physical adversarial attacks. Existing defenses largely adopt loosely coupled architectures where visual and kinematic cues are processed in isolation, thus failing to exploit physical spatiotemporal consistency as a structural prior and often struggling to balance adversarial robustness, transferability, accuracy, and efficiency under realistic attacks. We propose a physics-aware trajectory–appearance consistency defense that detects and corrects spatiotemporal inconsistencies by tightly coupling visual semantics with physical dynamics. The module combines a dual-stream spatiotemporal encoder with endogenous feature orchestration and a frequency-domain kinematic embedding, turning tracking artifacts that are usually discarded as noise into discriminative cues. These inconsistencies are quantified by a Trajectory–Appearance Mutual Exclusion (TAME) energy, which supports a physics-aware switching rule to override flawed visual predictions. Operating on detector backbone features, outputs, and tracking states, the defense can be attached as a plug-in module behind diverse object detectors. Experiments on nuScenes, KITTI, and BDD100K show that the proposed defense substantially improves robustness against diverse categories of attacks: on nuScenes, it improves Correction Accuracy (CA) from 86.5% to 92.1% while reducing the computational overhead from 42 ms to 19 ms. Furthermore, the proposed defense maintains over 71.0% CA when transferred to unseen detectors and sustains 72.4% CA under adaptive attackers.

Keywords:

autonomous driving perception; physical adversarial attacks; adversarial robustness; transferability; spatiotemporal consistency

1. Introduction

Autonomous driving is transforming transportation through enhanced safety and operational efficiency [1,2]. Modern perception stacks rely heavily on Deep Neural Networks (DNNs), which provide strong visual recognition but also introduce critical vulnerabilities [3,4,5]. Physical adversarial attacks inject real-world perturbations, such as patches or projected patterns, to mislead perception without accessing internal sensor data [6,7,8,9]. These attacks are stealthy and low-cost. They can cause dangerous misclassifications and missed detections, posing severe safety risks for intelligent transportation systems.

Existing defenses remain difficult to deploy in real driving systems. Certified defenses offer provable guarantees but their computational cost scales poorly with high-resolution, multi-sensor inputs, making real-time deployment challenging [10,11,12]. Input purification methods reconstruct or denoise sensor data, yet often distort semantics and incur high false positive rates in benign scenes [13,14,15]. More recent work exploits spatiotemporal consistency between appearance and motion [16,17,18] but typically in a loosely coupled manner, where visual and kinematic cues are processed in separate branches and only compared at score level. As a result, physical consistency is not used as a structural prior, and these defenses still struggle to jointly achieve robustness, transferability, accuracy, and real-time efficiency under realistic, adaptive physical attacks.

We argue that physical trajectories should not be treated as an external verifier, but as an internal organizer of visual representations. Building on this view, we propose a physics-aware trajectory–appearance consistency defense that uses physical motion as a structural prior to audit and correct visual predictions. Our design is based on a simple but important observation: physical adversarial attacks inevitably induce a trajectory–appearance inconsistency [16]. An attacker can make an object look like a car, but cannot fully control its inertial trend, high-frequency detection jitter, or long-term dynamics [19,20]. Genuine objects show stable alignment between how they look and how they move; adversarial objects exhibit a semantic gap, often accompanied by abnormal jitter and unstable tracks. To instantiate this idea, we employ a physics-aware, dual-stream spatiotemporal encoder with endogenous feature orchestration that consumes detector backbone features together with detection boxes, labels, and tracked trajectories as input. Motion is decomposed into low-frequency inertial trends and high-frequency jitter in the frequency domain, producing compact kinematic embeddings. These embeddings then drive the orchestration mechanism: frequency-guided queries probe the visual stream, measure trajectory–appearance discrepancy, and modulate visual features accordingly. The resulting inconsistency is quantified by a Trajectory–Appearance Mutual Exclusion (TAME) energy, which serves as a differentiable measure of physical–visual conflict. We instantiate the encoder with a lightweight Transformer for temporal modeling, but treat it as a generic spatiotemporal processor rather than an architectural novelty, and the defense is calibrated once on a source detector and then reused across different perception stacks.

On top of this pipeline, TAME energy enables a transferable physical defense module. Because the module interacts with the perception stack only through backbone features, detector outputs, and tracking states, it can be attached as a plug-in safety layer behind heterogeneous object detectors without modifying their weights or retraining the defense. The combination of frequency-domain kinematic embedding, endogenous feature orchestration, and TAME inconsistency reasoning allows the module to generalize across attack types, including adaptive attacks, as well as across datasets and detector architectures. The overall defense pipeline is illustrated in Figure 1.

The main contributions of this paper are summarized as follows:

• Dual-stream spatiotemporal encoder with frequency-domain kinematic embedding. We design a dual-stream spatiotemporal encoder that jointly models visual and kinematic streams. Motion is decomposed into low-frequency inertial trends and high-frequency jitter in the frequency domain, turning tracking artifacts that are often treated as noise into informative cues for trajectory–appearance consistency.
• Endogenous feature orchestration with TAME inconsistency head. On top of this encoder, we introduce an endogenous, frequency-guided feature orchestration module that uses kinematic queries to reorganize visual features along the trajectory–appearance consistency manifold. We further define the TAME energy as a differentiable measure of physical–visual conflict, which provides a unified inconsistency head for both attack detection and label correction when visual predictions are compromised.
• Transferable physical defense module. We package the encoder, orchestration, and TAME head into a plug-in safety module that can be attached behind heterogeneous object detectors by reusing their backbone features, outputs, and tracking states, without modifying detector weights or retraining the defense. Experiments across multiple datasets, detectors, and both patch-based and projection-based attacks show strong robustness and clear cross-detector/cross-dataset transferability. We further demonstrate that the module maintains nontrivial protection under adaptive attacks such as trajectory smoothing and joint optimization, highlighting the practicality of frequency-guided, physics-aware consistency defense.

2. Related Work

2.1. Visual Perception for Autonomous Driving

The visual perception stack is at the core of an autonomous vehicle’s ability to interpret its surroundings [1]. It is responsible for real-time analysis of road conditions and directly affects driving safety. Object detection algorithms based on Convolutional Neural Networks (CNNs) remain the dominant approach [21], and recent lightweight architectures can meet the stringent real-time requirements of autonomous driving [22,23,24]. Despite their strong performance in benign scenarios, these models are highly vulnerable to adversarial perturbations. Small but carefully crafted changes to the input can lead to severe misclassifications or complete target loss [25,26,27].

2.2. Physical Adversarial Attacks on Autonomous Driving Perception

Adversarial examples are inputs with imperceptible perturbations that cause DNNs to output incorrect predictions [28,29]. Early work mainly focused on digital-domain attacks such as the Fast Gradient Sign Method (FGSM) [30] and Projected Gradient Descent (PGD) [31], where gradient-based perturbations are generated to cross decision boundaries. However, these attacks assume full access to input pixels, which is often unrealistic for deployed autonomous systems. Physical adversarial attacks, by contrast, require perturbations that are feasible and robust in the real world [8,27]. Attackers must contend with illumination changes, viewpoint variations, and sensor noise, and therefore often optimize perturbations under constraints such as Non-Printability Score (NPS) and Total Variation (TV) [32,33].

Adversarial patches are a common vehicle for physical attacks. Robust Physical Perturbations (RP2) generate robust perturbations on road signs that mislead detectors over a wide range of distances and viewing angles [8]. Other methods, such as PatchAttack for vehicles and CAPatch for image captioning, demonstrate the versatility of patch-based attacks across tasks and domains [28,34]. In addition, optical attacks such as Short-Lived Adversarial Perturbation (SLAP) project transient patterns onto object surfaces using a projector [35], enabling non-contact, hard-to-trace attacks that pose serious challenges to visual perception systems.

2.3. Physical Adversarial Defenses for Autonomous Driving

Defense strategies against physical attacks can be broadly grouped into three categories. Certified defenses offer provable robustness guarantees through mathematical analysis. For example, Certified Interval Bound Propagation (CertIBP) [10] uses interval bound propagation to bound input perturbations, and PatchGuard [11] constrains localized corruptions via small receptive fields and feature masking. Despite their theoretical rigor, these methods are computationally expensive and scale poorly to high-resolution inputs and multi-sensor settings, limiting their applicability in real-time autonomous driving [36]. Input purification methods aim to remove perturbations at the sensor level. Approaches such as Jujutsu [14] and Diffusion Purification (DiffPure) [13] reconstruct images using classical filters or diffusion models. While they can suppress high-frequency noise, they also tend to erase fine semantic details (e.g., small or distant objects), leading to performance degradation and high false positive rates in benign driving conditions. Spatiotemporal consistency-based defenses leverage temporal information or physical cues to detect anomalies [16,17,18]. PercepGuard [16], for instance, monitors object trajectories to flag predictions that are inconsistent with motion patterns, while PhySense integrates additional physical attributes and relational cues [17,37].

Despite their effectiveness, existing consistency-based defenses still rely on loosely coupled, modular designs where visual and kinematic features are processed in separate pipelines and only fused at score or decision level [37,38]. This limits their ability to use physical consistency as a structural prior and leaves nontrivial safety gaps under adaptive physical attacks.

3. Proposed Algorithm

3.1. Method Overview

For clarity, the key symbols and abbreviations used throughout this paper are summarized in Table 1. To overcome the limitations of existing defenses in real-time performance, semantic preservation, and feature coupling, we propose a physics-aware trajectory–appearance consistency framework. Rather than relying on hand-crafted consistency checks or loosely coupled pipelines, the framework learns the nonlinear coupling between visual semantics and physical motion within a unified computation graph.

Given a continuous observation sequence $S = \{(I_t, B_t)\}_{t=1}^{T}$, where $I_t$ is the raw image at time $t$ and $B_t$ denotes 2D bounding boxes with lifted 3D coordinates, the framework proceeds in three stages, as shown in Figure 2. First, a dual-modal feature embedding module maps deep visual features and structured kinematic states into a shared latent space. The kinematic branch adopts a frequency-domain design that separately encodes low-frequency inertial trends and high-frequency jitter, yielding multi-scale motion embeddings for subsequent reasoning (Section 3.2). Second, a dual-stream spatiotemporal encoder, instantiated as a lightweight Transformer, jointly processes the visual and kinematic sequences. Within each layer, temporal self-attention aggregates long-range context in each stream, and frequency-domain cross-attention implements an endogenous feature orchestration mechanism: low-frequency Inertial Queries and high-frequency Jitter Queries retrieve two appearance patterns from the visual stream, whose discrepancy is distilled into a fused signal $Z_{dis}$ and injected back into the visual features via residual connections (Section 3.3). This layer-wise process produces consistency-aware contextual representations that encode how well appearance and trajectory agree over time. Third, a TAME head attaches classification heads to the final visual and kinematic representations to obtain $P_t^{vis}$ and $P_t^{kin}$, and defines the TAME energy as a differentiable measure of physical–visual conflict (Section 3.4).

At each time step $t$, the model outputs a tuple $(\hat{y}_t, E_t)$ for downstream safety decisions. Specifically, $\hat{y}_t$ is the object label selected by a TAME-based switching rule: when $E_t \leq \tau$, the appearance-based prediction is trusted; when $E_t > \tau$, the system overrides the visual decision with the motion-based prior. The scalar $E_t$ thus acts both as an attack confidence score and as a switch for semantic correction. The overall inference procedure is summarized in Algorithm 1.

Algorithm 1 Inference of the dual-stream consistency defense (dual-frequency retrieval + TAME correction)

Require: Image sequence $\{I_t\}_{t=1}^{T}$; detector $D$; dual-stream spatiotemporal encoder $\mathrm{Enc}_{\theta}$ ($L$ layers); Fourier matrices $M_{low}, M_{high}$; MLPs $\mathrm{MLP}_{L}, \mathrm{MLP}_{H}$; TAME threshold $\tau$.

Ensure: Corrected labels $\hat{Y} = \{\hat{y}_t\}_{t=1}^{T}$; energies $E = \{E_t\}_{t=1}^{T}$.

1: Stage 1: Dual-Modal Embedding
2: for $t = 1$ to $T$ do
3:     $(F_{map,t}, B_t) \leftarrow D(I_t)$
4:     $e_t^{vis} \leftarrow \mathrm{VisualEmbed}(F_{map,t}, B_t) \in \mathbb{R}^{d}$
5:     $s_t \leftarrow \mathrm{KinematicState}(B_{1:t}) \in \mathbb{R}^{9}$ ▹ $[x, y, l, w, h, v_x, v_y, a_x, a_y]$
6:     $\gamma_{low}(s_t) \leftarrow [\cos(2\pi M_{low} s_t), \sin(2\pi M_{low} s_t)]$
7:     $\gamma_{high}(s_t) \leftarrow [\cos(2\pi M_{high} s_t), \sin(2\pi M_{high} s_t)]$
8:     $e_t^{kin} \leftarrow \mathrm{Concat}(\mathrm{MLP}_{L}(\gamma_{low}), \mathrm{MLP}_{H}(\gamma_{high})) \in \mathbb{R}^{d}$
9: end for
10: $E_0^{vis} \leftarrow [e_t^{vis}]_{t=1}^{T} \in \mathbb{R}^{T \times d}$, $E_0^{kin} \leftarrow [e_t^{kin}]_{t=1}^{T} \in \mathbb{R}^{T \times d}$
11: Stage 2: Dual-Stream Spatiotemporal Encoding with Endogenous Orchestration
12: $(E_L^{vis}, E_L^{kin}) \leftarrow \mathrm{Enc}_{\theta}(E_0^{vis}, E_0^{kin})$ ▹ per layer: MHSA + dual-frequency cross-attn $\to Z_{low}, Z_{high}$; $\Delta Z = Z_{low} - Z_{high}$; inject $Z_{dis}$
13: Stage 3: TAME Check and Physics-Guided Correction
14: for $t = 1$ to $T$ do
15:     $P_t^{vis} \leftarrow \mathrm{Softmax}(W_c^{vis} E_L^{vis}[t] + b_c^{vis})$
16:     $P_t^{kin} \leftarrow \mathrm{Softmax}(W_c^{kin} E_L^{kin}[t] + b_c^{kin})$
17:     $E_t \leftarrow D_{KL}(P_t^{kin} \,\|\, P_t^{vis}) + D_{KL}(P_t^{vis} \,\|\, P_t^{kin})$
18:     $\hat{y}_t \leftarrow \begin{cases} \arg\max_{y} P_t^{kin}(y), & E_t > \tau \\ \arg\max_{y} P_t^{vis}(y), & \text{otherwise} \end{cases}$
19: end for
20: return $\hat{Y}, E$

3.2. Dual-Modal Feature Embedding

This module maps unstructured visual information and structured kinematic data into a unified latent space $\mathbb{R}^{d}$, enabling end-to-end interaction between heterogeneous modalities.

Visual semantic embedding. We reuse the backbone of the object detector to extract deep semantic features, avoiding redundant computation while retaining high-level information [23]. Given an input image $I_t$ and the corresponding bounding boxes $B_t$ at time $t$, we first obtain an intermediate feature map $F_{map}$. Region-of-interest features are then extracted by pooling over the bounding box locations and compressed into a feature vector using Global Average Pooling (GAP). A learnable linear projection $W_{vis}$ maps the pooled feature to the target dimension $d$:

$$e_t^{vis} = W_{vis} \cdot \mathrm{GAP}(\mathrm{Pooling}(F_{map}, B_t)). \tag{1}$$

The embedding $e_t^{vis}$ encodes texture, shape, and category-level semantics, and later serves as Keys and Values in the cross-attention with kinematic queries [39].
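A minimal NumPy sketch of Equation (1) may help fix the shapes involved. It is an illustration under stated assumptions, not the authors' implementation: the unspecified RoI pooling operator is replaced here by a plain crop of the feature map, the box is assumed to be given in feature-map cell coordinates, and the function name `visual_embed` and all sizes are hypothetical.

```python
import numpy as np

def visual_embed(feature_map, box, W_vis):
    """Sketch of Eq. (1): crop the RoI, average it, project to dimension d.

    feature_map: (C, H, W) backbone features.
    box: (x1, y1, x2, y2) in feature-map cells (assumption, not from the paper).
    W_vis: (d, C) projection matrix.
    """
    x1, y1, x2, y2 = box
    roi = feature_map[:, y1:y2, x1:x2]   # crude stand-in for RoI pooling
    pooled = roi.mean(axis=(1, 2))       # Global Average Pooling -> (C,)
    return W_vis @ pooled                # linear projection -> (d,)

rng = np.random.default_rng(0)
F_map = rng.normal(size=(64, 20, 20))          # illustrative C=64 feature map
W_vis = rng.normal(scale=0.1, size=(256, 64))  # d=256 as in the paper's setup
e_vis = visual_embed(F_map, (4, 6, 10, 12), W_vis)
```

In a real pipeline an interpolating operator such as RoI Align would normally replace the integer crop, since detector boxes rarely fall on feature-map cell boundaries.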

Frequency-domain kinematic embedding. In real-world driving, low-frequency trajectories (e.g., smooth velocity profiles) capture the coarse motion of objects, while high-frequency fluctuations often reflect sensor noise and tracking instability [20,40]. To capture both aspects, we introduce a frequency-domain motion embedding inspired by Fourier feature mappings [41]. This design separately encodes low-frequency inertial trends and high-frequency jitter, providing richer evidence for trajectory–appearance consistency.

Each instance is associated with the 3D bounding box of the object. We compute the centroid $p_t$ as the mean of the eight corners and use its $(x_t, y_t)$ coordinates on the ground plane, ignoring the vertical coordinate due to its limited variation and high noise [17]. Given a frame rate $f_r$, the instantaneous velocity $v_{\alpha,t}$ and acceleration $a_{\alpha,t}$ along axis $\alpha$ are obtained via finite differences:

$$v_{\alpha,t} = f_r \, [p_{\alpha,t} - p_{\alpha,t-1}], \quad \alpha \in \{x, y\}, \tag{2}$$

$$a_{\alpha,t} = f_r^{2} \, [p_{\alpha,t} - 2 p_{\alpha,t-1} + p_{\alpha,t-2}], \quad \alpha \in \{x, y\}. \tag{3}$$

We then construct a compact physical state vector $s_t \in \mathbb{R}^{9}$ as [17]:

$$s_t = [x_t, y_t, l_t, w_t, h_t, v_{x,t}, v_{y,t}, a_{x,t}, a_{y,t}]^{\top}. \tag{4}$$
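The finite differences in Equations (2) and (3) and the state vector in Equation (4) can be sketched in a few lines of NumPy. The function name `kinematic_states` is hypothetical, and the zero-padding of velocity and acceleration on the first frames (where the differences are undefined) is an assumption of this sketch; the paper does not say how track starts are handled.

```python
import numpy as np

def kinematic_states(centers_xy, dims_lwh, frame_rate):
    """Build s_t = [x, y, l, w, h, vx, vy, ax, ay] for each frame of one track."""
    p = np.asarray(centers_xy, dtype=float)    # (T, 2) ground-plane centroids
    dims = np.asarray(dims_lwh, dtype=float)   # (T, 3) box length, width, height
    v = np.zeros_like(p)                       # assumption: zeros where undefined
    a = np.zeros_like(p)
    v[1:] = frame_rate * (p[1:] - p[:-1])                         # Eq. (2)
    a[2:] = frame_rate**2 * (p[2:] - 2.0 * p[1:-1] + p[:-2])      # Eq. (3)
    return np.concatenate([p, dims, v, a], axis=1)                # Eq. (4), (T, 9)

# Synthetic track sampled at 10 Hz with x(t) = t^2, i.e. constant 2 m/s^2 along x.
t = np.arange(5) / 10.0
centers = np.stack([t**2, np.zeros_like(t)], axis=1)
dims = np.tile([4.5, 1.8, 1.5], (5, 1))
states = kinematic_states(centers, dims, frame_rate=10.0)
```

For a quadratic trajectory the second difference recovers the acceleration exactly, which makes this a convenient sanity check before feeding real tracker output.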

To parameterize motion at different temporal frequencies, we apply learnable Fourier feature mappings:

$$\gamma_{low}(s_t) = [\cos(2\pi M_{low} s_t), \sin(2\pi M_{low} s_t)]^{\top}, \tag{5}$$

$$\gamma_{high}(s_t) = [\cos(2\pi M_{high} s_t), \sin(2\pi M_{high} s_t)]^{\top}, \tag{6}$$

where $M_{low} \sim \mathcal{N}(0, \sigma_{low}^{2})$ encodes smooth inertial trends and $M_{high} \sim \mathcal{N}(0, \sigma_{high}^{2})$ with $\sigma_{high} \gg \sigma_{low}$ captures higher-frequency jitter.

The two frequency components are processed by separate Multi-Layer Perceptrons (MLPs) and concatenated to form the final kinematic embedding:

$$e_t^{kin} = \mathrm{Concat}(\mathrm{MLP}_{L}(\gamma_{low}(s_t)), \mathrm{MLP}_{H}(\gamma_{high}(s_t))). \tag{7}$$

By decoupling low- and high-frequency components, the encoder retains crucial jitter signals, enhancing the sensitivity of the downstream endogenous feature orchestration and TAME metric to adversarial perturbations, especially under adaptive attacks that primarily manipulate low-frequency trajectories.
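The dual-frequency embedding of Equations (5) to (7) can be illustrated with the following NumPy sketch. The number of frequencies, the MLP widths, the values of `sigma_low` and `sigma_high`, and the random (untrained) weights are all assumptions chosen for illustration; in the paper the mappings are learnable and the hyperparameters are not given in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_FREQ, D = 9, 16, 32        # illustrative sizes, not from the paper

def fourier_features(s, M):
    """Eqs. (5)/(6): [cos(2*pi*M s), sin(2*pi*M s)] for a batch of states s (T, 9)."""
    proj = 2.0 * np.pi * (s @ M.T)                       # (T, N_FREQ)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

def init_mlp(in_dim, hidden, out_dim):
    """Random two-layer MLP weights (stand-in for trained parameters)."""
    return (rng.normal(0, 0.1, (in_dim, hidden)), np.zeros(hidden),
            rng.normal(0, 0.1, (hidden, out_dim)), np.zeros(out_dim))

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2        # ReLU MLP

sigma_low, sigma_high = 0.1, 10.0                        # sigma_high >> sigma_low
M_low = rng.normal(0, sigma_low, (N_FREQ, STATE_DIM))    # smooth inertial trends
M_high = rng.normal(0, sigma_high, (N_FREQ, STATE_DIM))  # high-frequency jitter
mlp_L = init_mlp(2 * N_FREQ, 64, D // 2)
mlp_H = init_mlp(2 * N_FREQ, 64, D // 2)

def kinematic_embedding(s):
    """Eq. (7): concatenate the low- and high-frequency branches."""
    low = mlp(fourier_features(s, M_low), *mlp_L)
    high = mlp(fourier_features(s, M_high), *mlp_H)
    return np.concatenate([low, high], axis=-1)          # (T, D)

s = rng.normal(size=(8, STATE_DIM))                      # eight synthetic states
e_kin = kinematic_embedding(s)
```

Because `M_high` has a much larger variance, small perturbations of the state rotate its sinusoids quickly, which is what lets the high-frequency branch react to jitter that the low-frequency branch smooths over.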

3.3. Dual-Stream Spatiotemporal Encoder with Endogenous Feature Orchestration

The core reasoning unit of our framework is a dual-stream spatiotemporal encoder that captures temporal continuity within each modality and logical consistency across modalities. The encoder consists of $L$ identical layers. At layer $l$, the inputs are the visual feature sequence $E_{l-1}^{vis}$ and the frequency-domain kinematic feature sequence $E_{l-1}^{kin}$. Each layer comprises two components: temporal self-attention in each stream and frequency-guided cross-attention with endogenous feature orchestration. The overall structure is illustrated in Figure 3.

Temporal continuity modeling via self-attention. In the physical world, both visual appearance and motion evolve smoothly over time. To model this continuity, we apply Multi-Head Self-Attention (MHSA) independently to the visual and kinematic streams. At each layer $l$, $E_{l-1}^{vis}$ and $E_{l-1}^{kin}$ are processed to obtain temporally contextualized features $H^{vis}$ and $H^{kin}$, which aggregate long-range context in each modality and provide stable inputs for cross-modal reasoning.

Endogenous feature orchestration via frequency-domain retrieval. Beyond separate temporal modeling, we use kinematics as an internal organizer of visual representations. Leveraging the frequency-domain embeddings from Section 3.2, each layer constructs a low-frequency Inertial Query $Q_{low}$ and a high-frequency Jitter Query $Q_{high}$ from the kinematic stream, and uses them to attend to the visual stream. These queries retrieve two appearance patterns, $Z_{low}$ and $Z_{high}$, that explain the observed motion under different spectral viewpoints.

The disparity between these two retrieved patterns,

$$\Delta Z = Z_{low} - Z_{high}, \tag{8}$$

captures how consistently visual semantics are supported across inertial and jitter-aware motion cues. For benign objects, both queries typically converge to compatible semantic explanations, yielding small $\Delta Z$. Under physical attacks, high-frequency jitter and mismatched dynamics induce conflicting retrievals, resulting in a large semantic gap.

To turn this gap into an internal control signal, we feed the concatenated triplet $(Z_{low}, Z_{high}, \Delta Z)$ into a lightweight Feed-Forward Network (FFN), denoted as $\mathrm{FFN}_{disc}$, and obtain a fused discrepancy code:

$$Z_{dis} = \mathrm{FFN}_{disc}(\mathrm{Concat}(Z_{low}, Z_{high}, \Delta Z)). \tag{9}$$

Rather than using $Z_{dis}$ as a separate detector, we treat it as an endogenous feature orchestration signal that reorganizes the visual stream. Concretely, $Z_{dis}$ is injected back into the visual features through residual connections and Layer Normalization, amplifying channels that are consistent with kinematic evidence and suppressing channels dominated by adversarial artifacts or sensor noise. The kinematic stream is updated independently to preserve physically grounded dynamics.

Through this recurrent interaction, each layer performs frequency-aware feature orchestration: kinematic queries probe the visual stream, measure trajectory–appearance discrepancy, and use the resulting discrepancy code to reshape the internal representation manifold. As shown in the ablation study, removing this discrepancy feedback significantly increases false positives, confirming that endogenous feature orchestration is crucial for stabilizing benign predictions and exposing adversarial inconsistencies.
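One layer of the orchestration step (Equations (8) and (9) plus the residual injection) might look roughly like the single-head NumPy sketch below. Several details are assumptions made for illustration only: the Inertial and Jitter Queries are derived from the first and second half of the kinematic channels (mirroring the concatenation order in Equation (7)), attention is single-head, the FFN has one hidden ReLU layer, and all weights in the hypothetical `p` dictionary are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    """Scaled dot-product attention: kinematic queries read the visual stream."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T, T)
    return softmax(scores, axis=-1) @ V        # (T, d)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def orchestrate(H_vis, H_kin, p):
    """Return the updated visual features and the raw discrepancy Delta Z."""
    half = H_kin.shape[-1] // 2
    Q_low = H_kin[:, :half] @ p["Wq_low"]      # Inertial Query (assumed split)
    Q_high = H_kin[:, half:] @ p["Wq_high"]    # Jitter Query (assumed split)
    K, V = H_vis @ p["Wk"], H_vis @ p["Wv"]
    Z_low, Z_high = attend(Q_low, K, V), attend(Q_high, K, V)
    dZ = Z_low - Z_high                                        # Eq. (8)
    triplet = np.concatenate([Z_low, Z_high, dZ], axis=-1)
    Z_dis = np.maximum(triplet @ p["W1"], 0.0) @ p["W2"]       # Eq. (9)
    return layer_norm(H_vis + Z_dis), dZ       # residual injection + LayerNorm

rng = np.random.default_rng(0)
T, d = 6, 16                                   # illustrative sizes
p = {
    "Wq_low": rng.normal(0, 0.1, (d // 2, d)),
    "Wq_high": rng.normal(0, 0.1, (d // 2, d)),
    "Wk": rng.normal(0, 0.1, (d, d)),
    "Wv": rng.normal(0, 0.1, (d, d)),
    "W1": rng.normal(0, 0.1, (3 * d, d)),
    "W2": rng.normal(0, 0.1, (d, d)),
}
H_vis, H_kin = rng.normal(size=(T, d)), rng.normal(size=(T, d))
H_out, dZ = orchestrate(H_vis, H_kin, p)
```

When the two queries coincide, both retrievals are identical and the discrepancy vanishes, which matches the benign-case behavior described in the text.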

3.4. Trajectory–Appearance Mutual Exclusion Energy

Let $E_L^{vis}, E_L^{kin} \in \mathbb{R}^{T \times d}$ denote the final-layer outputs of the encoder for the visual and kinematic streams, respectively. The row vectors $e_{L,t}^{vis}$ and $e_{L,t}^{kin}$ serve as time-wise contextual representations for constructing the TAME energy.

Built on the encoder’s layer-wise reasoning, $E_L^{vis}$ and $E_L^{kin}$ integrate three sources of information: (i) visual appearance cues, (ii) frequency-domain kinematic patterns, and (iii) discrepancy-sensitive corrections injected by the endogenous feature orchestration. The final representations thus encode the compatibility between visual appearance and physical dynamics, allowing us to define the TAME energy in this inconsistency-aware space.

To obtain semantic predictions from appearance and motion, we attach classification heads to the final representations. At each time step $t$, we compute class posterior distributions via softmax:

$$P_t^{\alpha}(y) = \mathrm{Softmax}(W_c^{\alpha} e_{L,t}^{\alpha} + b_c^{\alpha})_{y}, \quad \alpha \in \{vis, kin\}. \tag{10}$$

Here, $W_c^{\alpha}$ and $b_c^{\alpha}$ denote the learnable weights and bias of the classifier. $P_t^{vis}(y)$ reflects the class distribution inferred from visual appearance, and $P_t^{kin}(y)$ reflects the motion-based prior.

Assuming that physical trajectories are harder to forge than appearance, we expect $P_t^{vis}$ and $P_t^{kin}$ to agree on benign samples and to diverge under physical attacks. We define the TAME energy at time $t$ as the sum of the forward and reverse Kullback–Leibler divergences:

$$E_t = D_{KL}(P_t^{kin} \,\|\, P_t^{vis}) + D_{KL}(P_t^{vis} \,\|\, P_t^{kin}), \tag{11}$$

with a small constant $\epsilon$ (e.g., $10^{-8}$) added for numerical stability:

$$D_{KL}(P_t^{a} \,\|\, P_t^{b}) = \sum_{y \in Y} P_t^{a}(y) \log \frac{P_t^{a}(y)}{P_t^{b}(y) + \epsilon}. \tag{12}$$

When appearance and motion are compatible (e.g., a vehicle-like appearance and a high-speed trajectory), both distributions concentrate on the same classes, leading to low TAME energy. In contrast, adversarial attacks cause a semantic conflict, and $E_t$ becomes large.

We use TAME for both attack detection and semantic correction. Given a threshold $\tau$, the final prediction at time $t$ is

$$\hat{y}_t = \begin{cases} \arg\max_{y \in Y} P_t^{vis}(y), & E_t \leq \tau, \\ \arg\max_{y \in Y} P_t^{kin}(y), & E_t > \tau. \end{cases} \tag{13}$$

Thus, $\hat{y}_t$ is an error-correcting prediction: it trusts the appearance-based decision when trajectory and appearance align and switches to the motion-based prior when a physical–visual inconsistency is detected.
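The energy in Equations (11) and (12) and the switching rule in Equation (13) are compact enough to sketch directly. The example below uses hand-written toy distributions over three classes; the function names are hypothetical, and the inputs are assumed to be strictly positive (as softmax outputs are), since Equation (12) only stabilizes the denominator.

```python
import numpy as np

EPS = 1e-8  # epsilon from Eq. (12)

def kl(p_a, p_b):
    """Eq. (12); assumes p_a > 0 everywhere, as for softmax outputs."""
    return np.sum(p_a * np.log(p_a / (p_b + EPS)), axis=-1)

def tame_energy(p_kin, p_vis):
    """Eq. (11): symmetric KL between the kinematic and visual posteriors."""
    return kl(p_kin, p_vis) + kl(p_vis, p_kin)

def tame_switch(p_vis, p_kin, tau):
    """Eq. (13): keep the visual label unless the energy exceeds tau."""
    energy = tame_energy(p_kin, p_vis)
    y_vis, y_kin = p_vis.argmax(axis=-1), p_kin.argmax(axis=-1)
    return np.where(energy > tau, y_kin, y_vis), energy

# Frame 0: both heads agree on class 0. Frame 1: the visual head says class 1
# (e.g. a patched object) while the motion-based head still says class 0.
p_vis = np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]])
p_kin = np.array([[0.85, 0.10, 0.05], [0.80, 0.10, 0.10]])
y_hat, E = tame_switch(p_vis, p_kin, tau=0.2)
```

With the threshold of 0.2 reported in the experimental setup, the first frame keeps its visual label and the second is overridden by the motion-based prediction.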

3.5. Model Training and Inference

Supervised classification. Given a training sequence with ground-truth labels $\{y_t\}_{t=1}^{T}$, we compute $P_t^{vis}$, $P_t^{kin}$, and $E_t$ as in Equations (10) and (11). The primary supervision is a cross-entropy loss applied to both heads:

$$L_{cls} = \frac{1}{T} \sum_{t=1}^{T} \left( \mathrm{CE}(P_t^{vis}, y_t) + \mathrm{CE}(P_t^{kin}, y_t) \right), \tag{14}$$

which encourages both modalities to predict the correct class.

Consistency regularization and adversarial calibration. To shape the TAME energy landscape, we penalize large energy on clean samples and enforce a margin on adversarial ones. For clean data, a consistency term

$$L_{con} = \frac{1}{T} \sum_{t=1}^{T} E_t \tag{15}$$

pushes $P_t^{vis}$ and $P_t^{kin}$ to agree, forming a low-energy manifold for benign samples. When adversarial examples $(I_t^{adv}, B_t^{adv})$ are available, we further apply a hinge-style margin loss:

$$L_{adv} = \frac{1}{T} \sum_{t=1}^{T} \left( \mathrm{CE}(P_t^{kin}, y_t) + \max(0, m - E_t) \right), \tag{16}$$

where $m > 0$ is a margin hyperparameter. This term keeps the kinematic head aligned with the true class while pushing adversarial samples to high-energy regions.

Overall objective and inference. The full training loss combines the above components on clean and adversarial data:

$$L = L_{cls}^{(clean)} + \lambda_{con} L_{con}^{(clean)} + \lambda_{adv} L_{adv}^{(adv)}, \tag{17}$$

where $\lambda_{con}$ and $\lambda_{adv}$ control the strength of consistency regularization and adversarial calibration.
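To make the interaction of the three loss terms concrete, the following NumPy sketch evaluates Equations (14) to (17) on toy posteriors. It is a forward-pass illustration only (no gradients or optimizer); the function names are hypothetical, the small constant inside the cross-entropy logarithm is an added safeguard not stated in the paper, and the default weights and margin are the values reported in the experimental setup.

```python
import numpy as np

EPS = 1e-8

def cross_entropy(P, y):
    """Per-sample CE; the EPS guard is an assumption of this sketch."""
    return -np.log(P[np.arange(len(y)), y] + EPS)

def tame_energy(P_kin, P_vis):
    """Symmetric KL of Eqs. (11)-(12); inputs assumed strictly positive."""
    fwd = np.sum(P_kin * np.log(P_kin / (P_vis + EPS)), axis=-1)
    rev = np.sum(P_vis * np.log(P_vis / (P_kin + EPS)), axis=-1)
    return fwd + rev

def training_loss(clean, adv, lam_con=0.1, lam_adv=1.0, margin=0.9):
    """Return each term of Eq. (17) for one clean and one adversarial sequence."""
    P_vis, P_kin, y = clean
    A_vis, A_kin, y_adv = adv
    L_cls = np.mean(cross_entropy(P_vis, y) + cross_entropy(P_kin, y))   # Eq. (14)
    L_con = np.mean(tame_energy(P_kin, P_vis))                           # Eq. (15)
    hinge = np.maximum(0.0, margin - tame_energy(A_kin, A_vis))
    L_adv = np.mean(cross_entropy(A_kin, y_adv) + hinge)                 # Eq. (16)
    total = L_cls + lam_con * L_con + lam_adv * L_adv                    # Eq. (17)
    return {"cls": L_cls, "con": L_con, "adv": L_adv, "total": total}

clean = (np.array([[0.90, 0.05, 0.05]]), np.array([[0.85, 0.10, 0.05]]), np.array([0]))
adv = (np.array([[0.05, 0.90, 0.05]]), np.array([[0.80, 0.10, 0.10]]), np.array([0]))
parts = training_loss(clean, adv)
```

In the adversarial toy sample the two heads already disagree strongly, so the energy exceeds the margin, the hinge is inactive, and only the kinematic cross-entropy contributes to the adversarial term.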

At inference, given $S = \{(I_t, B_t)\}_{t=1}^{T}$, we reuse the frozen detector backbone to obtain $F_{map}$, perform dual-modal embedding as in Section 3.2, and feed the resulting sequences into the dual-stream spatiotemporal encoder to obtain $E_L^{vis}$ and $E_L^{kin}$. The TAME head yields $P_t^{vis}$, $P_t^{kin}$, and $E_t$, and the decision rule in Equation (13) is applied with a threshold $\tau$ selected based on validation data. Since the detector backbone is reused and all additional modules are lightweight, the overall overhead is small, making the module suitable for real-time deployment behind existing object detectors.

4. Experiments

4.1. Experimental Setup

Datasets. We evaluate the proposed defense on three widely used autonomous driving benchmarks: KITTI [42], nuScenes [43], and BDD100K [44]. Together they span a spectrum of driving complexity, from the structured urban scenarios in KITTI and the multimodal sensor data in nuScenes to the large-scale, heterogeneous traffic environments in BDD100K. We focus on five representative traffic participants: bicycle, bus, pedestrian, car, and truck. nuScenes and KITTI provide 21,763 and 5212 object-specific clips, respectively. For BDD100K, we use its Multi-Object Tracking subset (1600 videos) and select approximately 5000 instances from our target categories. These datasets allow for a rigorous evaluation of defense performance under real-world conditions.

Implementation Details. For visual perception, we use a YOLOv8 [23] detector fine-tuned on BDD100K and keep its backbone frozen. During training, we run YOLOv8 and Simple Online and Realtime Tracking (SORT) [45] on BDD100K videos to obtain backbone feature maps, detection boxes/scores, and per-object trajectories $\{(I_t, B_t)\}_{t=1}^{T}$, and optimize the consistency encoder and TAME head end-to-end on top of these signals using AdamW with an initial learning rate of $3 \times 10^{-4}$, weight decay of $1 \times 10^{-4}$, and batch size of 32, which were optimized to ensure stable convergence on the validation set. We train for 60 epochs with a cosine learning-rate schedule and a warm-up of 5 epochs. The encoder has $L = 3$ layers, hidden dimension $d = 256$, and 8 attention heads per layer. Hyperparameters were determined through sensitivity analysis on the validation set to balance defense effectiveness and training stability: the loss weights were set to $\lambda_{con} = 0.1$ and $\lambda_{adv} = 1.0$, assigning a lower weight to $\lambda_{con}$ to prevent regularization from dominating early training while ensuring sufficient penalty on adversarial samples via $\lambda_{adv}$. The TAME margin was set to $m = 0.9$ to enforce a significant energy gap between benign and adversarial manifolds. Finally, the decision threshold $\tau = 0.2$ was determined via quantitative trade-off analysis, aiming to maximize Detection Accuracy (DA) while strictly bounding the false positive rate (FPR) below 5% in benign scenarios. At inference time, this trained module is reused as a plug-in safety layer without retraining. All experiments run on 2 NVIDIA RTX A6000 GPUs (NVIDIA Corporation, Santa Clara, CA, USA) with 48 GB memory.
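The threshold trade-off described above (maximize DA subject to an FPR bound) could be implemented as a simple sweep over candidate thresholds on validation energies. The sketch below is one plausible procedure, not the authors' protocol: the function name `select_threshold`, the use of observed energies as candidates, and the synthetic energy values are all assumptions, and DA is approximated here as the fraction of attacked samples whose energy exceeds the threshold.

```python
import numpy as np

def select_threshold(E_benign, E_attacked, max_fpr=0.05):
    """Pick the tau that maximizes detection while keeping FPR strictly below max_fpr."""
    E_benign, E_attacked = np.asarray(E_benign), np.asarray(E_attacked)
    candidates = np.unique(np.concatenate([E_benign, E_attacked]))  # sorted ascending
    best_tau, best_da = None, -1.0
    for tau in candidates:
        fpr = np.mean(E_benign > tau)      # benign samples flagged as attacked
        da = np.mean(E_attacked > tau)     # attacked samples flagged
        if fpr < max_fpr and da > best_da:
            best_tau, best_da = float(tau), float(da)
    return best_tau, best_da

# Synthetic validation energies: benign samples stay low, most attacks are high,
# and one weak attack (0.15) hides inside the benign range.
E_benign = np.linspace(0.0, 0.19, 20)
E_attacked = np.array([0.5, 0.8, 1.2, 0.15])
tau, da = select_threshold(E_benign, E_attacked)
```

Because detection can only drop as the threshold rises, the sweep effectively returns the smallest threshold that satisfies the false-positive bound.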

4.2. Attack Configuration

We focus on physically realizable adversarial attacks, as modifying the surfaces of traffic participants is a tangible threat. The perturbation mask is constrained within the target’s physical boundaries to ensure realism. We evaluate three patch scales, large ($100 \times 100$), medium ($80 \times 80$), and small ($60 \times 60$), optimized under $L_p$-norm and NPS constraints.

We use three representative attack methods:

RP2 [8]: Generates robust physical perturbations to induce misclassification under varying conditions.
CAPatch [34]: Adapted from image captioning, it maximizes detection errors in autonomous driving contexts.
SLAP [35]: A projector-based optical attack simulating light-based perturbations.

These attacks are applied to the selected object categories. We simulate dynamic attacks using ground-truth 3D poses and adjust the patch’s homography frame-by-frame, ensuring realistic appearance changes during motion. To intuitively understand these threats, visual examples of the RP2, CAPatch, and SLAP attacks applied to our target datasets are illustrated in Figure 4.

4.3. Evaluation Metrics

To evaluate defense effectiveness, we use metrics that assess detection ability, correction ability, false alarms, and efficiency.

Detection Accuracy (DA). DA reflects the ability of a defense to identify misclassified instances caused by attacks:

DA = \frac{N_{\det}}{N_{mis}}

(18)

Correction Accuracy (CA). CA measures the ability of a defense to recover the correct label once an attack has occurred:

CA = \frac{N_{corr}}{N_{mis}}

(19)

False Positive Rate (FPR). FPR characterizes the risk that benign samples are incorrectly treated as attacked by the defense:

FPR = \frac{N_{fp}}{N_{benign}}

(20)

False Negative Rate (FNR). FNR measures the proportion of truly attacked samples that are still misclassified after applying the defense, i.e., the missed attacks of the defense:

FNR = \frac{N_{fn}}{N_{mis}}

(21)

Runtime Efficiency (RE). RE evaluates whether a defense satisfies real-time constraints. Let $t_{i}$ denote the end-to-end processing time of the $i$-th sample and $n$ the total number of samples. The average runtime per sample is:

RE = \frac{1}{n} \sum_{i = 1}^{n} t_{i}

(22)
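As an illustration of Equations (18)–(22), the following sketch computes the five metrics from raw counts. The class `DefenseCounts`, the function `defense_metrics`, and all numbers in the example are hypothetical and are not taken from the paper's tables.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class DefenseCounts:
    n_mis: int     # attacked samples misclassified by the base detector (N_mis)
    n_det: int     # of those, flagged by the defense (N_det)
    n_corr: int    # of those, restored to the correct label (N_corr)
    n_fn: int      # of those, still misclassified after the defense (N_fn)
    n_benign: int  # benign samples (N_benign)
    n_fp: int      # benign samples wrongly flagged as attacked (N_fp)


def _ratio(numerator: int, denominator: int) -> float:
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def defense_metrics(c: DefenseCounts, runtimes_s: Sequence[float]) -> Dict[str, float]:
    """Evaluate Equations (18)-(22) for one experimental configuration."""
    if not runtimes_s:
        raise ValueError("at least one runtime sample is required")
    return {
        "DA": _ratio(c.n_det, c.n_mis),            # Eq. (18)
        "CA": _ratio(c.n_corr, c.n_mis),           # Eq. (19)
        "FPR": _ratio(c.n_fp, c.n_benign),         # Eq. (20)
        "FNR": _ratio(c.n_fn, c.n_mis),            # Eq. (21)
        "RE": sum(runtimes_s) / len(runtimes_s),   # Eq. (22), seconds per sample
    }


# Hypothetical counts, chosen only to exercise the formulas.
counts = DefenseCounts(n_mis=200, n_det=190, n_corr=180, n_fn=12,
                       n_benign=1000, n_fp=30)
metrics = defense_metrics(counts, runtimes_s=[0.018, 0.016, 0.020])
```

Note that DA, CA, and FNR share the denominator $N_{mis}$, whereas FPR is normalized by the number of benign samples, so the four rates are not expected to sum to one.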

4.4. Baselines

To validate the effectiveness of the proposed defense, we compare it with five representative defenses covering input purification, certified robustness, and spatiotemporal consistency modeling. These baselines include both state-of-the-art general defense strategies and physics-aware approaches in autonomous driving.

DiffPure [13] is an input purification method that uses pre-trained diffusion models to sanitize adversarial examples. While effective in removing perturbations, it may degrade high-frequency semantic details necessary for small object recognition.

PatchGuard [11] provides certified robustness against localized adversarial patches. It uses small receptive fields and robust aggregation mechanisms to limit feature corruption, but its high computational overhead restricts real-time object detection.

DetectorGuard [46] secures object detectors against patch-hiding attacks. It cross-references the detector’s output with a robust objectness predictor to detect inconsistencies. However, it focuses more on object presence than spatiotemporal dynamics.

PercepGuard [16] uses spatiotemporal consistency to detect misclassification attacks. It employs a Recurrent Neural Network (RNN) to classify 2D bounding boxes and flags alarms when the trajectory-inferred class contradicts the visual detection. However, it filters out high-frequency jitter, limiting robustness against adaptive attacks.

PhySense [17] is a physics-aware defense that integrates features like texture, dynamic behavior, and inter-object interactions. While comprehensive, its loose coupling of feature extraction modules leads to significant latency and fails to fully capture correlations between visual and kinematic modalities.

4.5. Defense Performance

We first evaluate the proposed defense against RP2, CAPatch, and SLAP on nuScenes, KITTI, and BDD100K, each with three patch scales (large, medium, small). As shown in Table 2, the proposed defense outperforms PhySense across almost all attack types, patch sizes, and datasets. In most configurations, our DA is comparable to or slightly higher than that of PhySense, while CA improves by a clear margin and FPR/FNR are typically reduced. In a few relatively easy KITTI settings, PhySense attains marginally higher DA, but ours still achieves much higher CA and significantly lower FPR/FNR, indicating a better overall robustness–utility trade-off.

Effect of patch size and attack type. As the patch size shrinks from large to small, both our method and PhySense exhibit the expected degradation in DA and CA due to the increased visual stealthiness and reduced footprint of the adversarial patch. Our CA nevertheless remains consistently higher, and our FPR/FNR generally lower, than those of PhySense across datasets and patch sizes, with only minor deviations in a few easy settings. This trend is especially salient under SLAP, the projector-based optical attack that induces rapid, transient appearance changes. On nuScenes with small SLAP patches, for instance, our method raises CA from 0.728 to 0.835 and cuts FNR by more than half, showing that the TAME energy is sensitive to physically inconsistent motion even when visual perturbations are small and short-lived.

Comparison with baselines. Table 3 further positions our method against a broader spectrum of defenses on nuScenes under RP2 with large patches. Input purification (DiffPure) and certified patch defenses (PatchGuard) provide useful robustness guarantees but either incur high false alarms on benign samples or struggle to maintain correction performance in realistic detection settings. Detector-oriented defenses (DetectorGuard) and trajectory-only methods (PercepGuard) capture parts of the physical picture but still leave a considerable gap in either DA, CA, or FPR. PhySense, as a strong physics-aware baseline, narrows this gap by integrating multiple hand-crafted physical cues, yet it still operates under a loosely coupled, modular architecture. In contrast, our method achieves leading performance across all metrics, supporting the benefits of deeply coupled, frequency-guided trajectory–appearance reasoning.

Runtime analysis. In terms of RE, we reuse the frozen detector backbone and rely only on Transformer-style operations without external hand-crafted feature extractors. As shown in Table 2, the per-frame overhead of PhySense ranges from about 0.028 s to 0.043 s across datasets, whereas our method remains in the 0.015–0.019 s range. Thus, our method achieves stronger robustness and better calibration of physical inconsistency while still meeting real-time constraints in autonomous driving deployments.

4.6. Black-Box Transferability

We further examine how well the proposed defense transfers in a realistic setting, where the safety module is trained once and then reused across heterogeneous detectors, attacks, and datasets. Using the defense module trained as described in Section 4.1, we evaluate this single model under three settings: (i) changing the base detector to Faster R-CNN [47] or CenterNet [48], (ii) changing the dataset to nuScenes or KITTI, and (iii) changing the attack family to CAPatch or SLAP, still with medium patches. Table 4 summarizes the results. The YOLOv8, BDD100K, RP2 configuration corresponds to the training setting, while all other entries represent zero-shot transfer without any re-training of the defense module.

Cross-detector transfer. On BDD100K under RP2, replacing YOLOv8 with Faster R-CNN or CenterNet leads to only a small drop in DA and CA, and a slight increase in FPR/FNR. The overall performance remains in a similar range as the original YOLOv8-based configuration. This indicates that the dual-stream spatiotemporal encoder and TAME head indeed behave as a detector-agnostic safety layer: as long as bounding boxes, labels, and trajectories are available, the module can be plugged behind different detectors without re-training, while still providing substantial gains over PhySense and other baselines (Table 2).

Cross-attack and cross-dataset transfer. Using the same model and threshold, we then change both the dataset and the attack type. Across nuScenes and KITTI, and for RP2, CAPatch, and SLAP, YOLOv8-based results show only modest degradation in DA/CA compared with the in-domain BDD100K–RP2 configuration, while FPR/FNR remain low. The trends are similar when switching to Faster R-CNN or CenterNet: although absolute performance slightly decreases due to detector- and domain-specific differences, the defense remains effective across all combinations. Notably, the model retains strong correction ability against CAPatch and SLAP even though it was adversarially calibrated on RP2, suggesting that the frequency-domain kinematic embedding and TAME-based inconsistency reasoning capture generic trajectory–appearance discrepancies instead of overfitting to a single patch pattern or dataset.

Overall, the results in Table 4 show that a single trained module can be transferred across heterogeneous perception stacks and deployment scenarios, with only limited loss of robustness. This transferability is particularly attractive for large-scale autonomous driving systems, where maintaining one bespoke safety module per detector or per fleet would be impractical.

4.7. Defense Against Adaptive Attackers

We finally evaluate the proposed defense against adaptive attackers that are aware of the trajectory–appearance consistency checks and attempt to jointly fool both the detector and the defense.

4.7.1. Attacker Knowledge and Goals

We consider a strong white-box threat model in which the attacker has access to the architecture and parameters of both the base detector and the defense module. (We assume no access to the validation set used to select the TAME threshold and no control over the tracking pipeline.) The adversary optimizes a physically realizable patch as in Section 4.2, under the same constraints on patch size, location, and NPS. The goal is two-fold: (i) induce a targeted misclassification by the detector and (ii) keep the TAME energy $E_{t}$ below the detection threshold $τ$, so that the defense neither raises an alarm nor corrects the label. In other words, the attacker seeks perturbations that jointly induce the targeted misclassification and minimize $E_{t}$ or its contributing terms.

4.7.2. Adaptive Attack Strategies

We instantiate this threat model with three representative strategies that exploit progressively more internal details:

Trajectory-Smoothing RP2. The standard RP2 loss is augmented with a smoothness regularizer on the sequence of 2D/3D bounding boxes, penalizing frame-to-frame variations in velocity and acceleration. This encourages low-frequency, inertial-like trajectories but does not directly optimize TAME.
TAME-Aware Joint Optimization. The attacker differentiates through the dual-stream encoder and TAME head. The patch is optimized to (a) drive the visual head $P_{t}^{vis}$ toward a target class $y^{adv}$ and (b) reduce the symmetric TAME energy so that $P_{t}^{vis}$ and $P_{t}^{kin}$ agree on $y^{adv}$ :

$L_{adv}^{TAME} = L_{\det} (y^{adv}) + β E_{t} (P_{t}^{vis}, P_{t}^{kin}),$

(23)

where $β$ balances misclassification and energy suppression.
Frequency-Suppression Attack. Assuming knowledge of the frequency-decoupling mechanism, the attacker penalizes the magnitude of the high-frequency component $γ_{high} (s_{t})$ :

$L_{freq} = ∥ γ_{high} (s_{t}) - γ_{low} (s_{t}) ∥_{2}^{2},$

(24)

aiming to suppress jitter-related responses in the kinematic stream while still fooling the detector.
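To make the joint objective in Equation (23) concrete, the following sketch evaluates it on toy class distributions. Several details are assumptions rather than the paper's definitions: $E_{t}$ is taken to be a symmetric KL divergence averaged over both directions (the ablation in Section 4.9.3 names Symmetric KL as the chosen metric, but the exact form is given in Section 3.4), $L_{\det}$ is taken to be the cross-entropy toward $y^{adv}$, and `beta=1.0` as well as the function names are hypothetical.

```python
import math
from typing import Sequence

EPS = 1e-12  # numerical guard against log(0)


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS))
               for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Assumed stand-in for the TAME energy E_t: mean of both KL directions."""
    return 0.5 * (kl(p, q) + kl(q, p))


def tame_aware_loss(p_vis: Sequence[float], p_kin: Sequence[float],
                    y_adv: int, beta: float = 1.0) -> float:
    """Attacker objective in the spirit of Eq. (23): L_det(y_adv) + beta * E_t."""
    l_det = -math.log(p_vis[y_adv] + EPS)  # assumed cross-entropy toward y_adv
    return l_det + beta * symmetric_kl(p_vis, p_kin)


# Toy example with classes (car, pedestrian, bicycle); the attacker targets class 1.
p_kin_honest = [0.8, 0.1, 0.1]   # kinematics still indicate "car"
p_vis_fooled = [0.1, 0.8, 0.1]   # visual head pushed to "pedestrian"

loss_heads_disagree = tame_aware_loss(p_vis_fooled, p_kin_honest, y_adv=1)
loss_heads_agree = tame_aware_loss(p_vis_fooled, p_vis_fooled, y_adv=1)
```

The loss is lower when both heads agree on $y^{adv}$, which is exactly the state this attack tries to reach; fooling the visual head alone leaves a large energy term.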

4.7.3. Results and Analysis

Table 5 summarizes the defense performance on nuScenes under adaptive attackers.

The Trajectory-Smoothing strategy reduces CA from 0.921 to 0.786 by making 3D box sequences closer to the ideal inertial motion, but the drop is moderate, as the frequency-domain embedding still captures residual discrepancies. The TAME-aware attack is the most effective, lowering CA to 0.724 and increasing FNR to 0.169, showing that a fully informed attacker can sometimes force the two heads to agree on wrong labels. The Frequency-Suppression attack achieves similar CA (0.725): suppressing jitter weakens the high-frequency cue but inevitably distorts low-frequency motion, which remains detectable.

Overall, these results expose a fundamental dilemma for adaptive attackers. To reliably fool the base detector, the patch must introduce persistent appearance changes that create additional jitter and trajectory–appearance mismatch, pushing the TAME energy $E_{t}$ upward. To evade TAME, the attacker must instead smooth motion and suppress jitter, which weakens the perturbation and undermines the misclassification. Because our frequency-domain kinematic embedding defines robustness in terms of the contrast between inertia and jitter rather than any single trajectory statistic, lowering $E_{t}$ by manipulating one band typically worsens the other; so in practice, adaptive optimization can at best move sequences from the high-energy region to a narrow band around $τ$, rather than back to the benign low-energy manifold.

4.8. Scene-Level Behavior and Consistency Landscape

Beyond aggregate metrics, we analyze how the proposed trajectory–appearance consistency behaves at the scene and trajectory level. All visualizations in this subsection are produced on held-out nuScenes sequences; the observations are representative of the trends seen on other datasets.

Frame-wise energy evolution. As illustrated in Figure 5, we plot the TAME energy $E_{t}$ over time for three typical sequences under RP2, SLAP, and adaptive attacks, together with the benign counterpart. For benign trajectories (green curves), $E_{t}$ stays close to a low baseline around 0.05 and rarely approaches the decision threshold $τ = 0.2$, indicating that appearance and motion remain compatible over the whole sequence. Once an RP2 patch becomes effective (frames 15–35), the energy quickly rises into a high plateau (≈0.6–0.9) and remains above the shaded alarm region, clearly separating attacked frames from clean ones. SLAP produces a similar but more oscillatory plateau, reflecting the transient nature of projector-based perturbations. In the adaptive case, where the attacker explicitly tries to keep $E_{t}$ small, the curve oscillates tightly around $τ$ instead of returning to the benign baseline, showing that it is difficult to simultaneously fool the detector and keep the trajectory on the low-energy manifold defined in Section 3.4.

To examine potential false alarms, as shown in Figure 6, we compare a benign trajectory, a “hard benign” case with sharp braking, and an RP2 attack. Sharp braking temporarily increases $E_{t}$ and produces a short bump that touches or slightly crosses the threshold, but quickly falls back to the benign band. In contrast, RP2 induces a long, high plateau that stays far above $τ$. This difference explains why the defense maintains a low FPR while still detecting physically inconsistent attacks.
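The contrast between a short bump and a sustained plateau can be quantified with a simple run-length statistic over the energy trace. The sketch below is only an illustrative diagnostic and not the decision rule of Equation (13); the traces are synthetic values loosely shaped after the description of Figures 5 and 6, and `longest_run_above` is a hypothetical helper.

```python
from typing import Sequence

TAU = 0.2  # decision threshold reported in Section 4.1


def longest_run_above(energies: Sequence[float], tau: float = TAU) -> int:
    """Length of the longest run of consecutive frames with energy above tau."""
    longest = current = 0
    for e in energies:
        current = current + 1 if e > tau else 0
        longest = max(longest, current)
    return longest


# Synthetic 50-frame traces (illustrative values only).
benign = [0.05] * 50
hard_braking = [0.05] * 20 + [0.15, 0.22, 0.21, 0.12] + [0.05] * 26
rp2_attack = [0.05] * 15 + [0.70] * 21 + [0.05] * 14  # plateau over frames 15-35

runs = {
    "benign": longest_run_above(benign),
    "hard_braking": longest_run_above(hard_braking),
    "rp2_attack": longest_run_above(rp2_attack),
}
```

On these synthetic traces the braking event exceeds $τ$ for only two frames, whereas the attack stays above it for the full 21-frame window, mirroring the qualitative separation described above.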

Consistency vs. detector confidence. As illustrated in Figure 7, we present scatter plots of TAME energy versus detector confidence for benign and attacked samples under RP2, SLAP and adaptive attacks. Benign detections (green dots) cluster in the lower-right region: high confidence and low energy, which corresponds to predictions that are both visually confident and physically plausible. RP2 and SLAP attacks (red crosses) mainly occupy the upper-right and upper-middle area: the base detector is still reasonably confident, but the TAME energy is well above $τ$, revealing strong trajectory–appearance conflict. Under adaptive attacks, adversarial samples move closer to the threshold and their confidence decreases slightly, yet they still form a distinct high-energy cloud separated from benign points. These plots confirm that $E_{t}$ provides information complementary to detector confidence: it exposes “high-confidence but physically inconsistent” cases that cannot be filtered by confidence alone.

Energy distributions across patch size and object class. As shown in Figure 8, we report the marginal distributions of $E_{t}$ for benign and attacked frames under large, medium and small patches. For large patches, benign and attack distributions are almost disjoint: benign frames concentrate well below $τ$, whereas attacks form a broad peak around 0.7–0.8. As the patch shrinks, the attack distribution gradually shifts towards the threshold and slightly overlaps with the benign tail, reflecting the increased visual stealthiness of smaller perturbations. Even for small patches, however, the main attack mass remains on the high-energy side of $τ$, which is consistent with the low FNR observed in Table 2.

Finally, as shown in Figure 9, we decompose the TAME distributions by object category (bicycle, bus, pedestrian, car, and truck). Across all classes, benign samples exhibit a sharp peak near zero and only a light tail around the threshold, indicating that the consistency prior is not biased towards a specific category. Attack distributions are shifted to higher energies, with large separation for buses and trucks (whose motion is more inertial) and slightly broader overlap for bicycles and pedestrians (which naturally move more erratically). Importantly, a single global threshold $τ = 0.2$ still separates most benign and adversarial frames in every class, supporting the use of a class-agnostic decision rule in Equation (13) and explaining why the defense achieves stable performance across heterogeneous traffic participants.

4.9. Ablation Study

4.9.1. Analysis of Deep-Coupling Mechanisms

We first examine the necessity of the high-order interactions modeled by the dual-stream spatiotemporal encoder. To this end, we contrast our fully coupled architecture with variants that represent typical designs. The quantitative comparison results are listed in Table 6.

The loose coupling variant follows the conventional pipeline in which visual and kinematic features are processed independently and only concatenated at the classification head. Its degraded performance in Table 6 confirms that correcting subtle inconsistencies requires early feature-level interaction to actively attenuate compromised visual cues. Replacing our frequency-domain design with a unified query (Single Q) noticeably degrades detection on jitter-heavy attacks such as SLAP, indicating that a coarse motion representation fails to probe high-frequency adversarial artifacts. Furthermore, ablating the discrepancy feedback loop (w/o Discrepancy) sharply increases FPR, demonstrating that $Z_{dis}$ acts as a necessary stabilizer to suppress ambiguous features in benign scenes. Finally, the failure of frame-wise reasoning (w/o Self-Attn) under transient attacks underscores the necessity of temporal self-attention for capturing dynamic inconsistencies.

4.9.2. Impact of Frequency-Domain Kinematic Embedding

We evaluate the spectral kinematic components in Table 7. The baseline (No Fourier) utilizing raw states underperforms, indicating that a single MLP fails to fully exploit spectrally localized cues. Crucially, discarding jitter information (w/o High Freq) results in high FNR, confirming that high-frequency fluctuations are strong discriminators for adversarial instability. Conversely, removing inertial context (w/o Low Freq) causes FPR to spike, showing that low-frequency trends are essential for stabilizing benign predictions against sensor noise. These observations are consistent with the hypothesis in Section 3.2 and justify the full frequency-domain design.
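The distinction between inertial (low-frequency) and jitter (high-frequency) content can be illustrated with a plain discrete Fourier transform of a box-center coordinate. This sketch is not the learned frequency-domain kinematic embedding of Section 3.2; it swaps in a hand-written DFT band split, and the cutoff bin, the function names, and the synthetic trajectories are all assumptions made for illustration.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft_power(signal: Sequence[float]) -> List[float]:
    """Power per non-negative frequency bin of the mean-removed signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power.append(abs(coeff) ** 2 / n)
    return power


def band_energies(signal: Sequence[float], cutoff_bin: int) -> Tuple[float, float]:
    """Split spectral power into (low, high) bands at a hypothetical cutoff bin."""
    power = dft_power(signal)
    low = sum(power[1:cutoff_bin + 1])   # slow, inertial trend (DC bin excluded)
    high = sum(power[cutoff_bin + 1:])   # fast frame-to-frame jitter
    return low, high


N = 32
CUTOFF_BIN = 4  # assumed boundary between "inertia" and "jitter"

# Smooth trajectory: one slow oscillation of the box-center x coordinate.
smooth = [10.0 + 2.0 * math.sin(2.0 * math.pi * t / N) for t in range(N)]
# Jittery trajectory: the same motion plus a frame-alternating offset of 0.5.
jittery = [x + (0.5 if t % 2 == 0 else -0.5) for t, x in enumerate(smooth)]

smooth_low, smooth_high = band_energies(smooth, CUTOFF_BIN)
jitter_low, jitter_high = band_energies(jittery, CUTOFF_BIN)
```

In this toy case the two trajectories share the same low-band energy and differ only in the high band, which is the kind of cue lost in the w/o High Freq variant; conversely, the low band carries the inertial context removed in the w/o Low Freq variant.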

4.9.3. Manifold Shaping via TAME Energy and Objectives

We analyze the decision manifold shaping in Table 8. (1) Benign Compression: Removing consistency regularization (w/o Con-Reg) causes a sharp rise in FPR, confirming that $L_{con}$ is critical for compressing benign sequences into a compact low-energy manifold. (2) Adversarial Margin: Eliminating calibration (w/o Adv-Calib) significantly reduces DA, indicating that $L_{adv}$ is necessary to explicitly push attacks into high-energy regions to ensure separability. (3) Metric Sensitivity: The inferiority of linear (L1 Distance) and asymmetric metrics (Asym. KL) highlights that Symmetric KL provides the necessary probabilistic sensitivity and steep gradients for decisive inconsistency detection.

5. Discussion

The experimental results validate our central hypothesis that physical adversarial attacks inevitably disturb the intrinsic coupling between visual appearance and motion, and that explicitly modeling this coupling in a shared latent space yields a more robust and efficient defense. Across three datasets, three attack families (RP2, CAPatch, SLAP), and multiple patch scales, our defense consistently improves DA and, more importantly, CA over PhySense while typically reducing FPR, FNR, and runtime. The scene-level visualizations further support this picture: benign trajectories remain on a compact low-energy manifold, whereas physical attacks induce sustained high-energy plateaus, and even adaptive attacks can only force the TAME curve to oscillate around the threshold instead of returning to the benign baseline (Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9). Compared with certified defenses and input purification methods, our defense offers a different trade-off: rather than reconstructing clean images or providing formal but conservative guarantees, it leverages physically grounded consistency checks to achieve strong empirical robustness under strict real-time constraints.

The comparisons with spatiotemporal consistency-based methods further highlight the benefits of deep coupling. PercepGuard- and PhySense-style approaches already exploit trajectory information, yet they operate under loosely coupled, modular architectures and largely treat motion features as a post hoc auxiliary signal. In contrast, our method integrates visual and kinematic cues throughout the entire reasoning process via dual-modal self-attention and frequency-domain cross-attention. The confidence–energy scatter plots in Figure 7 reveal that attacked samples occupy a distinct high-energy band even when the detector is confident, providing a physically interpretable signal that complements conventional confidence scores. The transfer experiments in Table 4 further demonstrate that a single module trained on YOLOv8 with RP2 on BDD100K can be plugged behind Faster R-CNN and CenterNet and transferred to nuScenes and KITTI, as well as to unseen CAPatch and SLAP attacks, with only modest accuracy degradation and consistently low FPR/FNR. Together with the measured 15–19 ms per-frame overhead, this suggests that trajectory–appearance consistency can be deployed as a detector-agnostic safety layer in practical perception stacks.

The ablation studies provide additional insight into the mechanism of robustness. Removing temporal self-attention or reverting to loose coupling significantly degrades performance, confirming that inconsistency detection requires long-range temporal context and early interaction between modalities rather than simple score-level fusion. The frequency-domain kinematic embedding also proves crucial: dropping the high-frequency branch sharply increases FNR, whereas discarding low-frequency trends raises FPR, indicating that robustness emerges from the relative configuration of inertia and jitter rather than either component alone (Table 6 and Table 7). Patch-size and class-wise TAME histograms (Figure 8 and Figure 9) are consistent with the quantitative trends: smaller patches and intrinsically jittery participants such as bicycles and pedestrians exhibit larger overlap between benign and adversarial energies and correspondingly higher FNR, while heavy vehicles are much easier to separate.

Although implemented on RGB streams, the proposed Physics-Aware Spatiotemporal Consistency principle is fundamentally applicable to multimodal AV stacks. Since production perception systems often prioritize the visual branch for semantic classification in hybrid-fusion architectures [49], compromised visual inputs can propagate erroneous semantics to the fusion engine or trigger conservative failsafes. By sanitizing the visual branch at the feature level, our method effectively blocks this error propagation source.

Moreover, the framework provides resilience against second-order attack strategies. While sophisticated adversaries might attempt to jointly optimize appearance and trajectory to evade detection, our adaptive analysis (Section 4.7) exposes a fundamental stealthiness-dynamics dilemma: enforcing effective semantic misclassification inevitably induces high-frequency jitter or inertial violations [40]. Bypassing this defense in a multimodal setting would require satisfying kinematic constraints across all sensors simultaneously (e.g., aligning fake visual and LiDAR trajectories), imposing prohibitive optimization costs that render such attacks computationally infeasible or physically conspicuous. Future work will extend TAME to explicitly model cross-modal consistency (e.g., RGB-LiDAR flow alignment) to further heighten the barrier for adaptive threats.

Despite these advantages, our work is not a complete solution to physical adversarial threats. The framework assumes reasonably reliable tracking and 3D box lifting; severe tracking failures or sensor outages could impair the quality of kinematic features and thus the effectiveness of TAME. Moreover, our experiments focus on RGB-based perception and representative patch and projector attacks; other sensing modalities (e.g., LiDAR, radar), more complex multi-object attacks, and jointly optimized sensor-fusion strategies remain to be explored. Finally, our defense is trained and deployed with access to detector backbone feature maps. This relaxes a strict output-only black-box assumption, but allows us to reuse already computed features instead of running a separate visual backbone, substantially reducing computational overhead while preserving a detector-agnostic, plug-in interface. We view this as a deliberate trade-off between strict black-box constraints and the practical need to balance robustness, universality, and real-time efficiency in large-scale autonomous driving systems. These limitations point to important directions for future research on physically grounded, spatiotemporal defenses.

6. Conclusions

We presented a physics-aware trajectory–appearance consistency defense that treats physical trajectories not as an external verifier, but as an internal organizer of visual representations. By combining a dual-stream spatiotemporal encoder with endogenous feature orchestration and a frequency-domain kinematic embedding, the defense uses inertial trends and detection jitter to probe and modulate visual features, and it quantifies trajectory–appearance conflict via TAME energy. The resulting module can be attached as a transferable safety layer behind diverse object detectors by reusing their backbone features, outputs, and tracking states without modifying detector weights.

Extensive experiments on nuScenes, KITTI, and BDD100K show that the proposed defense substantially improves robustness against patch-based and projection-based physical attacks, achieving higher Correction Accuracy and typically lower FPR/FNR than prior consistency-based defenses such as PhySense, while reducing inference latency. The defense further exhibits strong cross-detector and cross-dataset transferability and maintains nontrivial protection under adaptive attackers. In future work, we plan to extend this trajectory–appearance consistency perspective to multi-sensor 3D perception, tighter integration with detection and tracking in closed-loop systems, and stronger adaptive benchmarks that jointly optimize over appearance and motion to further stress-test physically grounded defenses.

Author Contributions

Conceptualization, F.F.; methodology, Y.L. and Z.N.; software, L.P. and M.C.; validation, Z.N., Y.L. and T.Y.; investigation, Z.Y. and T.Y.; resources, Z.N.; data curation, Z.Y. and F.F.; writing-original draft, Z.N.; writing-review and editing, J.L. and L.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Key Research and Development Program of Hubei Province, China, under Grant 2024BAA011; and in part by the Technological Innovation Program of Hubei Province, China, under Grant 2025BEA006.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are available at https://www.cvlibs.net/datasets/kitti (KITTI), https://www.nuscenes.org/nuscenes##data-collection (nuScenes), and http://bdd-data.berkeley.edu/download.html (BDD100K), all accessed on 8 December 2025. The code supporting the results of this study will be made available by the author on request.

Conflicts of Interest

Author Jieke Lu was employed by the company Electric Power Research Institute of Guangxi Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Zhao, J.; Zhao, W.; Deng, B.; Wang, Z.; Zhang, F.; Zheng, W.; Cao, W.; Nan, J.; Lian, Y.; Burke, A.F. Autonomous driving system: A comprehensive survey. Expert Syst. Appl. 2024, 242, 122836.
Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-Sign Detection and Classification in the Wild. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
Zhang, Q.; Jin, S.; Zhu, R.; Sun, J.; Zhang, X.; Chen, Q.A.; Mao, Z.M. On data fabrication in collaborative vehicular perception: Attacks and countermeasures. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 6309–6326.
Yuan, Q.; Li, R.; Zhou, B.; Lu, J.; Hu, M.; Lai, P.; Zhao, Y.; Zhang, X. Collaborative Truck-UAV Delivery Routing Optimization under Dynamic Weather Conditions and Customer Demands. IEEE Trans. Consum. Electron. 2025, 71, 10950–10964.
Hong, D.S.; Chen, H.H.; Hsiao, P.Y.; Fu, L.C.; Siao, S.M. CrossFusion net: Deep 3D object detection based on RGB images and point clouds in autonomous driving. Image Vis. Comput. 2020, 100, 103955.
Moosavi-Dezfooli, S.M.; Fawzi, A.; Frossard, P. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
Carlini, N.; Wagner, D. Towards Evaluating the Robustness of Neural Networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–24 May 2017; pp. 39–57.
Song, D.; Eykholt, K.; Evtimov, I.; Fernandes, E.; Li, B.; Rahmati, A.; Tramer, F.; Prakash, A.; Kohno, T. Physical adversarial examples for object detectors. In Proceedings of the 12th USENIX Workshop on Offensive Technologies (WOOT 18), Baltimore, MD, USA, 13–14 August 2018.
Sun, J.; Cao, Y.; Choy, C.B.; Yu, Z.; Anandkumar, A.; Mao, Z.M.; Xiao, C. Adversarially robust 3D point cloud recognition using self-supervisions. Adv. Neural Inf. Process. Syst. 2021, 34, 15498–15512.
Chiang, P.Y.; Ni, R.; Abdelkader, A.; Zhu, C.; Studer, C.; Goldstein, T. Certified Defenses for Adversarial Patches. In Proceedings of the International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020.
Xiang, C.; Bhagoji, A.N.; Sehwag, V.; Mittal, P. PatchGuard: A provably robust defense against adversarial patches via small receptive fields and masking. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Virtual, 11–13 August 2021; pp. 2237–2254.
Li, S.; Zhu, S.; Paul, S.; Roy-Chowdhury, A.; Song, C.; Krishnamurthy, S.; Swami, A.; Chan, K.S. Connecting the Dots: Detecting Adversarial Perturbations Using Context Inconsistency. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 396–413.
Nie, W.; Guo, B.; Huang, Y.; Xiao, C.; Vahdat, A.; Anandkumar, A. Diffusion models for adversarial purification. arXiv 2022, arXiv:2205.07460.
Chen, Z.; Dash, P.; Pattabiraman, K. Jujutsu: A two-stage defense against adversarial patch attacks on deep neural networks. In Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security, Melbourne, VIC, Australia, 10–14 July 2023; pp. 689–703.
Garg, S.; Chattopadhyay, N.; Chattopadhyay, A. Robust Perception for Autonomous Vehicles using Dimensionality Reduction. In Proceedings of the 2022 IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Wuhan, China, 9–11 December 2022; pp. 1516–1521.
Man, Y.; Muller, R.; Li, M.; Celik, Z.B.; Gerdes, R. That person moves like a car: Misclassification attack detection for autonomous systems using spatiotemporal consistency. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 6929–6946.
Yu, Z.; Li, A.; Wen, R.; Chen, Y.; Zhang, N. Physense: Defending physically realizable attacks for autonomous systems via consistency reasoning. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 14–18 October 2024; pp. 3853–3867.
Xiao, C.; Deng, R.; Li, B.; Lee, T.; Edwards, B.; Yi, J.; Song, D.; Liu, M.; Molloy, I. Advit: Adversarial frames identifier based on temporal consistency in videos. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3968–3977.
Ding, L.; Wang, Y.; Yuan, K.; Jiang, M.; Wang, P.; Huang, H.; Wang, Z.J. Towards universal physical attacks on single object tracking. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1236–1245. [Google Scholar]
Han, H.; Hu, X.; Hao, Y.; Xu, K.; Dang, P.; Wang, Y.; Zhao, Y.; Du, Z.; Guo, Q.; Wang, Y.; et al. Real-Time Robust Video Object Detection System Against Physical-World Adversarial Attacks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 366–379. [Google Scholar] [CrossRef]
Cheng, L.; Sengupta, A.; Cao, S. Deep learning-based robust multi-object tracking via fusion of mmWave radar and camera sensors. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17218–17233. [Google Scholar] [CrossRef]
Zhang, Y.; Guo, Z.; Wu, J.; Tian, Y.; Tang, H.; Guo, X. Real-time vehicle detection based on improved yolo v5. Sustainability 2022, 14, 12274. [Google Scholar] [CrossRef]
Varghese, R.; M., S. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
Liu, W.; Cao, B.; Lu, T.; Cai, C.; Hu, M.; Peng, K.; Xiong, Z. Lightweight Hybrid Device Identification for IoT Applications. IEEE Internet Things J. 2025, 12, 23747–23762. [Google Scholar] [CrossRef]
Wang, C.; Muller, R.; Song, R.; Monteuuis, J.P.; Petit, J.; Man, Y.; Gerdes, R.; Celik, Z.B.; Li, M. From Threat to Trust: Exploiting Attention Mechanisms for Attacks and Defenses in Cooperative Perception. In Proceedings of the 34th USENIX Security Symposium (USENIX Security 25), Seattle, WA, USA, 13–15 August 2025; pp. 7387–7406. [Google Scholar]
Chatziioannou, I.; Tsigdinos, S.; Tzouras, P.G.; Nikitas, A.; Bakogiannis, E. Connected and Autonomous Vehicles and Infrastructure Needs: Exploring Road Network Changes and Policy Interventions. In Deception in Autonomous Transport Systems: Threats, Impacts and Mitigation Policies; Springer: Berlin/Heidelberg, Germany, 2024; pp. 65–83. [Google Scholar]
Tu, J.; Wang, T.; Wang, J.; Manivasagam, S.; Ren, M.; Urtasun, R. Adversarial attacks on multi-agent communication. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7768–7777. [Google Scholar]
Yang, C.; Kortylewski, A.; Xie, C.; Cao, Y.; Yuille, A. Patchattack: A black-box texture-based attack with reinforcement learning. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 681–698. [Google Scholar]
Chen, S.T.; Cornelius, C.; Martin, J.; Chau, D.H. Shapeshifter: Robust physical adversarial attack on faster r-cnn object detector. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2018; pp. 52–68. [Google Scholar]
Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572. [Google Scholar]
Lu, W.; Xiao, Q.; Jia, H.; Jin, Y.; Mu, Y.; Zhu, J.; Shen, C.; Teodorescu, R.; Guerrero, J.M. A projected gradient descent-based distributed optimal control method of medium-voltage DC distribution system considering line loss. IEEE Trans. Power Syst. 2024, 40, 1751–1763. [Google Scholar] [CrossRef]
Purnekar, N.; Tondi, B.; Barni, M. Physical Domain Adversarial Attacks Against Source Printer Image Attribution. In Proceedings of the 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Macau, Macao, 3–6 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
Yusuf, B.; Huang, H.H. Robust Handwritten Text Recognition via Multi-Source Adversarial Domain Adaptation for Low-Resource Scripts. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, Seoul, Republic of Korea, 10–14 November 2025; pp. 5464–5468. [Google Scholar]
Zhang, S.; Cheng, Y.; Zhu, W.; Ji, X.; Xu, W. CAPatch: Physical Adversarial Patch against Image Captioning Systems. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 679–696. [Google Scholar]
Lovisotto, G.; Turner, H.; Sluganovic, I.; Strohmeier, M.; Martinovic, I. SLAP: Improving physical adversarial examples with Short-Lived adversarial perturbations. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Virtual, 11–13 August 2021; pp. 1865–1882. [Google Scholar]
Xiang, C.; Mittal, P. Patchguard++: Efficient provable attack detection against adversarial patches. arXiv 2021, arXiv:2104.12609. [Google Scholar]
Han, X.; Wang, H.; Zhao, K.; Deng, G.; Xu, Y.; Liu, H.; Qiu, H.; Zhang, T. VisionGuard: Secure and Robust Visual Perception of Autonomous Vehicles in Practice. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 14–18 October 2024; pp. 1864–1878. [Google Scholar]
Liang, J.; Yi, R.; Chen, J.; Nie, Y.; Zhang, H. Securing autonomous vehicles visual perception: Adversarial patch attack and defense schemes with experimental validations. IEEE Trans. Intell. Veh. 2024, 9, 7865–7875. [Google Scholar] [CrossRef]
Chen, Z.; Liu, Y.; Ni, W.; Hai, H.; Huang, C.; Xu, B.; Ling, Z.; Shen, Y.; Yu, W.; Wang, H.; et al. Predicting driving comfort in autonomous vehicles using road information and multi-head attention models. Nat. Commun. 2025, 16, 2709. [Google Scholar] [CrossRef] [PubMed]
Zhou, T.; Ye, Q.; Luo, W.; Zhang, K.; Shi, Z.; Chen, J. F&F Attack: Adversarial Attack against Multiple Object Trackers by Inducing False Negatives and False Positives. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 4550–4560. [Google Scholar] [CrossRef]
Yu, X.; Tian, X.; Chen, J.; Wang, Y. FreqSpace-NeRF: A fourier-enhanced Neural Radiance Fields method via dual-domain contrastive learning for novel view synthesis. Comput. Graph. 2025, 127, 104171. [Google Scholar] [CrossRef]
Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 3354–3361. [Google Scholar]
Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2633–2642. [Google Scholar] [CrossRef]
Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9686–9696. [Google Scholar]
Xiang, C.; Mittal, P. Detectorguard: Provably securing object detectors against localized patch hiding attacks. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual, 15–19 November 2021; pp. 3177–3196. [Google Scholar]
He, Q.; Mei, Z.; Zhang, H.; Xu, X. Automatic Real-Time Detection of Infant Drowning Using YOLOv5 and Faster R-CNN Models Based on Video Surveillance. J. Soc. Comput. 2023, 4, 62–73. [Google Scholar] [CrossRef]
Wang, W.; Xi, L.; Cui, H.; Yin, D.; Jingbo, L. 3D Vehicle Detection From Roadside Monocular View Based on Improved CenterNet. In Proceedings of the 2024 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM), Hangzhou, China, 8–11 August 2024; pp. 346–351. [Google Scholar] [CrossRef]
Zhou, Z.; Li, B.; Song, Y.; Yu, Z.; Hu, S.; Wan, W.; Zhang, L.Y.; Yao, D.; Jin, H. Numbod: A spatial-frequency fusion attack against object detectors. In Proceedings of the 2025 AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 1201–1209. [Google Scholar]

Figure 1. Pipeline of physical adversarial attacks and our defense on autonomous driving perception.

Figure 2. Overall architecture of the proposed physics-aware trajectory–appearance consistency defense framework.

Figure 3. Structure of the dual-stream spatiotemporal encoder with endogenous feature orchestration.

Figure 4. Illustration of physical adversarial attacks on autonomous driving perception: (a) RP2 patch-based attack on traffic signs or vehicles; (b) CAPatch-style patch attack adapted to detection scenarios; (c) SLAP optical projection attack on object surfaces.

Figure 5. Trajectory-level evolution of the TAME energy $E_{t}$ under different physical attack types. Each panel compares a benign trajectory (green) with an attacked one (colored curve), with the horizontal dashed line indicating the detection threshold $τ = 0.2$. The proposed defense responds to RP2 and SLAP by producing persistent high-energy segments, while adaptive attacks succeed in partially suppressing the peak but still incur noticeable deviations from the benign profile.

Figure 6. Comparison of TAME energy $E_{t}$ over time for a normal benign trajectory, a hard-benign scenario (sharp braking), and an RP2 attack. Although sharp braking transiently increases $E_{t}$ towards the threshold $τ$, its peak remains lower and much shorter than the sustained high-energy plateau induced by RP2, illustrating that our defense can distinguish physically plausible maneuvers from genuine adversarial instability.
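
The contrast in Figure 6 between a short transient and a sustained plateau can be illustrated with a run-length check on a thresholded energy trace. The sketch below is not the authors' switching rule: the function names, the synthetic traces, and the `min_frames` persistence parameter are assumptions made for illustration, and only the threshold value 0.2 is taken from the figure captions.

```python
import numpy as np

TAU = 0.2  # detection threshold reported in the figure captions


def longest_run_above(energy, tau=TAU):
    """Length (in frames) of the longest consecutive run with energy > tau."""
    best = current = 0
    for value in np.asarray(energy, dtype=float):
        current = current + 1 if value > tau else 0
        best = max(best, current)
    return best


def is_sustained(energy, tau=TAU, min_frames=5):
    """Hypothetical persistence test: flag only long excursions above tau."""
    return longest_run_above(energy, tau) >= min_frames


# Synthetic traces, invented for illustration only.
benign = np.full(30, 0.05)
braking = benign.copy()
braking[12:14] = 0.18   # brief bump that stays below tau
attacked = benign.copy()
attacked[10:25] = 0.45  # long plateau well above tau

for name, trace in [("benign", benign), ("braking", braking), ("attacked", attacked)]:
    print(name, longest_run_above(trace), is_sustained(trace))
```

With these invented traces, only the attacked trajectory exceeds the threshold for a long run; a real deployment would need `min_frames` tuned to the frame rate and tracker behavior.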

Figure 7. Joint distribution of detector confidence and TAME energy $E_{t}$ under different attack strategies. Green dots denote benign samples and red crosses indicate attacked samples. The shaded region in the lower-right corner represents the ideal operating regime (high confidence and low energy), while the horizontal dashed line marks the decision threshold $τ$. RP2 and SLAP mainly push samples into a high-energy band, whereas adaptive attacks concentrate around $τ$ with moderately reduced confidence, confirming that our defense reshapes the score space into a physically calibrated separation between clean and adversarial states.

Figure 8. Distribution of TAME energy $E_{t}$ for benign and attacked trajectories under different patch sizes. Each panel plots kernel-smoothed histograms of benign (green) and adversarial (red) samples, with the vertical dashed line indicating the global threshold $τ = 0.2$. Large patches lead to a clear bimodal separation, whereas smaller patches shift the attack distribution leftwards and increase overlap with benign tails, explaining the gradual increase in FNR observed in Table 2.

Figure 9. Class-wise TAME energy $E_{t}$ distributions for key traffic participants under RP2 attacks. Heavy vehicles such as buses and trucks exhibit very low benign energies and clearly separated high-energy attack modes, making them easier to defend. In contrast, bicycles and pedestrians show broader benign tails with substantial overlap around the threshold $τ$, reflecting intrinsically jittery motion patterns and explaining the slightly higher FNR on these categories. The black dashed vertical line marks the decision threshold $τ = 0.2$, below which samples are considered benign and above which they are flagged as potential attacks.

Table 1. Nomenclature and abbreviations used in the proposed framework.

Symbol	Description
Input and Embedding
S	Observation sequence $\{(I_{t}, B_{t})\}_{t=1}^{T}$
$I_{t}, B_{t}$	Image and bounding boxes (2D box with lifted 3D state) at time t
T	Temporal window length
$F_{map}$	Backbone feature map
$e_{t}^{vis}$	Visual semantic embedding ($\in \mathbb{R}^{d}$)
d	Latent space dimension
$p_{t}$	Object centroid $(x_{t}, y_{t})$ on ground plane
$f_{r}$	Input frame rate
$s_{t}$	Kinematic state (position, velocity, acceleration)
$M_{low/high}$	Random Fourier projection matrices (low/high freq.)
$γ_{low/high}$	Fourier feature mappings (inertial/jitter)
$e_{t}^{kin}$	Kinematic embedding
Dual-Stream Spatiotemporal Encoder
L	Number of encoder layers
$E_{l}^{vis/kin}$	Visual/kinematic feature sequences at layer l
MHSA	Multi-Head Self-Attention
$Q_{low/high}$	Inertial (low-freq.) and jitter (high-freq.) queries
$Z_{low/high}$	Retrieved appearance patterns via kinematic queries
$ΔZ$	Semantic discrepancy vector ($Z_{low} - Z_{high}$)
$Z_{dis}$	Fused discrepancy code for orchestration
TAME Energy and Inference
$P_{t}^{vis/kin}$	Predicted class probabilities (visual/kinematic)
$D_{KL}$	Kullback–Leibler divergence
$E_{t}$	Trajectory–Appearance Mutual Exclusion (TAME) energy
$τ$	Decision threshold
$\hat{y}_{t}$	Final predicted label
Optimization Objectives
$L_{cls}$	Cross-entropy classification loss
$L_{con}$	Consistency regularization loss
$L_{adv}$	Adversarial calibration loss
m	Margin for adversarial calibration
$λ_{con/adv}$	Loss weighting coefficients

Table 2. Defense performance against RP2, CAPatch, and SLAP attacks. Bold values indicate the best result in each comparison.

Attack	Patch	Dataset	Frames	PhySense (Baseline)					Ours
Attack	Patch	Dataset	Frames	DA	CA	FPR	FNR	RE (s)	DA	CA	FPR	FNR	RE (s)
RP2	Large	nuScenes	21,763	0.915	0.865	0.088	0.071	0.042	0.935	0.921	0.041	0.035	0.019
		BDD100K	4980	0.932	0.884	0.075	0.062	0.038	0.952	0.945	0.035	0.028	0.017
		KITTI	5212	0.975	0.925	0.042	0.031	0.029	0.971	0.942	0.015	0.012	0.015
	Medium	nuScenes	21,763	0.898	0.842	0.096	0.085	0.042	0.918	0.906	0.048	0.042	0.019
		BDD100K	4980	0.915	0.865	0.082	0.074	0.038	0.936	0.928	0.039	0.035	0.017
		KITTI	5212	0.952	0.908	0.051	0.042	0.029	0.968	0.960	0.021	0.018	0.015
	Small	nuScenes	21,763	0.882	0.821	0.105	0.098	0.042	0.902	0.891	0.055	0.051	0.019
		BDD100K	4980	0.901	0.845	0.045	0.085	0.038	0.921	0.912	0.046	0.042	0.017
		KITTI	5212	0.938	0.889	0.062	0.055	0.029	0.955	0.945	0.078	0.025	0.015
CAPatch	Large	nuScenes	21,763	0.902	0.851	0.092	0.078	0.043	0.922	0.910	0.045	0.039	0.019
		BDD100K	4980	0.921	0.872	0.081	0.068	0.039	0.941	0.932	0.038	0.032	0.018
		KITTI	5212	0.965	0.912	0.048	0.035	0.030	0.955	0.922	0.018	0.015	0.016
	Medium	nuScenes	21,763	0.885	0.832	0.101	0.089	0.043	0.905	0.894	0.051	0.046	0.019
		BDD100K	4980	0.925	0.915	0.088	0.069	0.039	0.905	0.854	0.072	0.078	0.018
		KITTI	5212	0.941	0.895	0.055	0.048	0.030	0.958	0.950	0.024	0.021	0.016
	Small	nuScenes	21,763	0.868	0.805	0.112	0.105	0.043	0.885	0.878	0.058	0.055	0.019
		BDD100K	4980	0.888	0.828	0.098	0.092	0.039	0.908	0.898	0.049	0.046	0.018
		KITTI	5212	0.925	0.872	0.068	0.061	0.030	0.942	0.935	0.031	0.028	0.016
SLAP	Large	nuScenes	21,763	0.865	0.792	0.118	0.105	0.041	0.892	0.876	0.055	0.048	0.019
		BDD100K	4980	0.882	0.818	0.105	0.095	0.037	0.912	0.894	0.048	0.042	0.017
		KITTI	5212	0.925	0.872	0.065	0.055	0.028	0.945	0.935	0.026	0.022	0.015
	Medium	nuScenes	21,763	0.842	0.762	0.128	0.122	0.041	0.875	0.852	0.065	0.058	0.019
		BDD100K	4980	0.865	0.788	0.115	0.110	0.037	0.895	0.878	0.055	0.048	0.017
		KITTI	5212	0.905	0.852	0.075	0.068	0.028	0.928	0.918	0.032	0.028	0.015
	Small	nuScenes	21,763	0.818	0.728	0.145	0.142	0.041	0.855	0.835	0.075	0.068	0.019
		BDD100K	4980	0.838	0.762	0.132	0.125	0.037	0.825	0.758	0.125	0.110	0.017
		KITTI	5212	0.882	0.825	0.082	0.055	0.028	0.912	0.895	0.090	0.035	0.015

Table 3. Comparison with baselines on nuScenes under RP2 attack with large patches.

Method	DA	CA	FPR	FNR
DiffPure	0.546	0.271	0.365	0.241
PatchGuard	0.612	0.338	0.219	0.208
DetectorGuard	0.741	0.612	0.184	0.162
PercepGuard	0.863	0.731	0.236	0.151
PhySense	0.915	0.865	0.088	0.071
Ours	0.935	0.921	0.041	0.035
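
To make the gap between the two strongest rows of Table 3 concrete, the following worked example computes absolute gains for the higher-is-better metrics (DA, CA) and relative reductions for the error rates (FPR, FNR). The numbers are copied from the table; the helper names `absolute_gain` and `relative_reduction` are illustrative and not part of the paper.

```python
# Rows copied from Table 3 (nuScenes, RP2, large patches).
physense = {"DA": 0.915, "CA": 0.865, "FPR": 0.088, "FNR": 0.071}
ours = {"DA": 0.935, "CA": 0.921, "FPR": 0.041, "FNR": 0.035}


def absolute_gain(metric):
    """Absolute gain in percentage points (higher-is-better metrics)."""
    return 100.0 * (ours[metric] - physense[metric])


def relative_reduction(metric):
    """Relative reduction in percent (lower-is-better metrics)."""
    return 100.0 * (physense[metric] - ours[metric]) / physense[metric]


for m in ("DA", "CA"):
    print(f"{m}: +{absolute_gain(m):.1f} points")
for m in ("FPR", "FNR"):
    print(f"{m}: -{relative_reduction(m):.1f}% relative")
```

On these figures, CA rises by 5.6 points while FPR and FNR are each roughly halved (about 53% and 51% relative reduction, respectively).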

Table 4. Comprehensive black-box transferability analysis. The defense module is trained on YOLOv8 (BDD100K) and evaluated on unseen detectors (Faster R-CNN, CenterNet), datasets (nuScenes, KITTI), and attack types (RP2, CAPatch, SLAP) using medium patches. Bold indicates the source model performance.

Attack	Dataset	YOLOv8 (Source)				Faster R-CNN (Transfer)				CenterNet (Transfer)
Attack	Dataset	DA	CA	FPR	FNR	DA	CA	FPR	FNR	DA	CA	FPR	FNR
RP2	nuScenes	0.918	0.906	0.048	0.042	0.826	0.770	0.098	0.092	0.815	0.758	0.105	0.099
	BDD100K	0.936	0.928	0.039	0.035	0.842	0.789	0.089	0.085	0.832	0.775	0.095	0.092
	KITTI	0.968	0.960	0.021	0.018	0.871	0.816	0.071	0.068	0.860	0.805	0.078	0.075
CAPatch	nuScenes	0.905	0.894	0.051	0.046	0.815	0.760	0.101	0.093	0.805	0.748	0.108	0.102
	BDD100K	0.905	0.854	0.072	0.078	0.815	0.726	0.122	0.121	0.802	0.715	0.128	0.135
	KITTI	0.958	0.950	0.024	0.021	0.862	0.808	0.074	0.071	0.850	0.795	0.082	0.078
SLAP	nuScenes	0.875	0.852	0.065	0.058	0.788	0.729	0.115	0.101	0.774	0.713	0.122	0.115
	BDD100K	0.895	0.878	0.055	0.048	0.806	0.746	0.105	0.091	0.795	0.732	0.113	0.105
	KITTI	0.928	0.918	0.032	0.028	0.835	0.780	0.082	0.078	0.825	0.765	0.089	0.085

Table 5. Defense performance of the proposed method under adaptive attackers on nuScenes.

Attack Setting	DA	CA	FPR	FNR
Non-Adaptive RP2	0.935	0.921	0.041	0.035
Trajectory-Smoothing RP2	0.805	0.786	0.118	0.115
TAME-Aware Joint Optimization	0.761	0.724	0.125	0.169
Frequency-Suppression Attack	0.784	0.725	0.121	0.136

Table 6. Ablation study on architectural coupling mechanisms on nuScenes.

Architecture Variant	RP2				SLAP
Architecture Variant	DA	CA	FPR	FNR	DA	CA	FPR	FNR
Loose Coupling	0.790	0.772	0.190	0.180	0.758	0.642	0.200	0.195
Single Q	0.810	0.795	0.165	0.160	0.775	0.761	0.180	0.175
w/o Discrepancy	0.820	0.708	0.170	0.146	0.780	0.768	0.182	0.169
w/o Self-Attn	0.800	0.785	0.155	0.181	0.760	0.748	0.170	0.203
Full Arch (Ours)	0.935	0.921	0.041	0.035	0.895	0.878	0.055	0.048

Table 7. Ablation on kinematic embedding strategies on nuScenes.

Embedding Variant	RP2				SLAP
Embedding Variant	DA	CA	FPR	FNR	DA	CA	FPR	FNR
No Fourier	0.813	0.800	0.260	0.155	0.770	0.753	0.278	0.170
w/o High Freq	0.785	0.770	0.230	0.110	0.760	0.745	0.235	0.115
w/o Low Freq	0.818	0.805	0.215	0.132	0.786	0.772	0.225	0.145
Full Freq (Ours)	0.935	0.921	0.041	0.035	0.895	0.878	0.055	0.048
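
Table 7 ablates the low- and high-frequency branches of the kinematic embedding. A minimal sketch of a two-band random Fourier feature mapping follows, assuming the common form γ(s) = [cos(2πMs), sin(2πMs)] with Gaussian projection matrices. The bandwidths, the feature width, and the six-dimensional state layout are illustrative assumptions and are not values taken from the paper.

```python
import numpy as np


def make_projection(rng, num_features, state_dim, sigma):
    """Random Gaussian projection matrix; sigma sets the frequency band."""
    return rng.normal(0.0, sigma, size=(num_features, state_dim))


def fourier_features(state, projection):
    """Map a kinematic state s to [cos(2*pi*M s), sin(2*pi*M s)]."""
    phase = 2.0 * np.pi * projection @ state
    return np.concatenate([np.cos(phase), np.sin(phase)])


def kinematic_embedding(state, m_low, m_high):
    """Concatenate low-frequency (inertial) and high-frequency (jitter) features."""
    return np.concatenate([fourier_features(state, m_low),
                           fourier_features(state, m_high)])


rng = np.random.default_rng(0)
state_dim = 6       # assumed layout: (x, y, vx, vy, ax, ay)
num_features = 16   # assumed width per band
m_low = make_projection(rng, num_features, state_dim, sigma=0.1)   # assumed bandwidth
m_high = make_projection(rng, num_features, state_dim, sigma=5.0)  # assumed bandwidth

s_t = np.array([12.0, 3.5, 8.0, 0.2, -0.5, 0.0])
e_kin = kinematic_embedding(s_t, m_low, m_high)
print(e_kin.shape)  # (64,)
```

Dropping either projection matrix from `kinematic_embedding` gives a rough analogue of the "w/o Low Freq" and "w/o High Freq" variants in the table.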

Table 8. Ablation on TAME energy formulation and objectives.

Configuration	RP2				SLAP
Configuration	DA	CA	FPR	FNR	DA	CA	FPR	FNR
w/o Con-Reg ( $λ_{con} = 0$ )	0.812	0.800	0.285	0.162	0.753	0.716	0.214	0.273
w/o Adv-Calib ( $λ_{adv} = 0$ )	0.798	0.784	0.250	0.192	0.744	0.732	0.252	0.205
Asym. KL	0.822	0.810	0.226	0.150	0.686	0.671	0.288	0.193
L1 Distance	0.710	0.698	0.255	0.170	0.680	0.668	0.268	0.182
Full TAME (Ours)	0.935	0.921	0.041	0.035	0.895	0.878	0.055	0.048
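
Table 8 contrasts the full TAME energy with an asymmetric-KL variant and an L1 distance. The sketch below compares three candidate discrepancy measures between the visual and kinematic class distributions $P_{t}^{vis}$ and $P_{t}^{kin}$. It assumes that a symmetric form can be written as the mean of the two directed KL divergences; this is an illustrative choice rather than the paper's definition of $E_{t}$, and the example probabilities are invented.

```python
import numpy as np

EPS = 1e-12  # numerical floor to avoid log(0)


def kl(p, q):
    """Directed KL divergence D_KL(p || q) for discrete distributions."""
    p = np.clip(np.asarray(p, dtype=float), EPS, 1.0)
    q = np.clip(np.asarray(q, dtype=float), EPS, 1.0)
    return float(np.sum(p * np.log(p / q)))


def symmetric_kl(p, q):
    """Assumed symmetric form: mean of the two directed divergences."""
    return 0.5 * (kl(p, q) + kl(q, p))


def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return float(np.sum(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))))


# Invented class probabilities over (car, pedestrian, cyclist).
p_vis = [0.05, 0.90, 0.05]  # appearance says "pedestrian"
p_kin = [0.85, 0.05, 0.10]  # motion says "car"

print("KL(vis||kin):", kl(p_vis, p_kin))
print("KL(kin||vis):", kl(p_kin, p_vis))
print("symmetric   :", symmetric_kl(p_vis, p_kin))
print("L1          :", l1_distance(p_vis, p_kin))
```

The two directed divergences differ, so an asymmetric energy depends on which stream is treated as the reference, whereas the symmetric form does not. The L1 distance is bounded by 2 and therefore grows far less sharply when one stream assigns near-zero probability to the other's preferred class.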

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, Y.; Nie, Z.; Yu, T.; Chen, M.; Yao, Z.; Lu, J.; Peng, L.; Fan, F. Physics-Aware Spatiotemporal Consistency for Transferable Defense of Autonomous Driving Perception. Sensors 2026, 26, 835. https://doi.org/10.3390/s26030835

