4.1. Experimental Setup
Datasets. We evaluate the proposed defense on three widely used autonomous driving benchmarks: KITTI [42], nuScenes [43], and BDD100K [44]. These datasets span a range of driving complexity, from the structured urban scenarios in KITTI and the multimodal sensor data in nuScenes to the large-scale, heterogeneous traffic environments in BDD100K. We focus on five representative traffic participants: bicycle, bus, pedestrian, car, and truck. nuScenes and KITTI provide 21,763 and 5212 object-specific clips, respectively. For BDD100K, we use its Multi-Object Tracking subset (1600 videos) and select approximately 5000 instances from our target categories. Together, these datasets allow the defense to be evaluated rigorously under real-world conditions.
Implementation Details. For visual perception, we use a YOLOv8 [23] detector fine-tuned on BDD100K and keep its backbone frozen. During training, we run YOLOv8 and Simple Online and Realtime Tracking (SORT) [45] on BDD100K videos to obtain backbone feature maps, detection boxes/scores, and per-object trajectories
, and optimize the consistency encoder and TAME head end-to-end on top of these signals using AdamW with an initial learning rate of
, weight decay of
, and batch size of 32, which were optimized to ensure stable convergence on the validation set. We train for 60 epochs with a cosine learning-rate schedule and a warm-up of 5 epochs. The encoder has
layers, hidden dimension
, and 8 attention heads per layer. Hyperparameters were determined through sensitivity analysis on the validation set to balance defense effectiveness and training stability: the loss weights were set to
and
, assigning a lower weight to
to prevent regularization from dominating early training while ensuring sufficient penalty on adversarial samples via
. The TAME margin was set to
to enforce a significant energy gap between benign and adversarial manifolds. Finally, the decision threshold
was determined via a quantitative trade-off analysis that maximizes Detection Accuracy (DA) while keeping the false positive rate (FPR) strictly below 5% in benign scenarios. At inference time, this trained module is reused as a plug-in safety layer without retraining. All experiments were run on two NVIDIA RTX A6000 GPUs (NVIDIA Corporation, Santa Clara, CA, USA), each with 48 GB of memory.
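The threshold-selection criterion described above can be sketched as a sweep over candidate thresholds. This is a minimal illustration and not the authors' procedure: the function name `select_threshold`, the convention that a sample is flagged when its energy exceeds the threshold, and the use of the observed validation energies as candidates are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def select_threshold(
    benign_energy: Sequence[float],
    attack_energy: Sequence[float],
    max_fpr: float = 0.05,
) -> Optional[Tuple[float, float]]:
    """Return (threshold, detection_rate) with the highest detection rate
    among thresholds whose benign false positive rate is strictly below max_fpr.

    Assumption: a sample is flagged as attacked when energy > threshold.
    Returns None if no candidate threshold satisfies the FPR bound.
    """
    if not benign_energy or not attack_energy:
        raise ValueError("both energy lists must be non-empty")

    # Hypothetical choice: use every observed energy value as a candidate.
    candidates = sorted(set(benign_energy) | set(attack_energy))
    best: Optional[Tuple[float, float]] = None
    for tau in candidates:
        fpr = sum(e > tau for e in benign_energy) / len(benign_energy)
        if fpr >= max_fpr:
            continue  # violates the strict FPR bound
        detection_rate = sum(e > tau for e in attack_energy) / len(attack_energy)
        if best is None or detection_rate > best[1]:
            best = (tau, detection_rate)
    return best
```

Because raising the threshold can only lower both rates, the sweep settles on the smallest candidate that satisfies the FPR bound; ties are resolved in favor of the lower threshold.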
4.2. Attack Configuration
We focus on physically realizable adversarial attacks, as modifying the surfaces of traffic participants is a tangible threat. The perturbation mask is constrained within the target’s physical boundaries to ensure realism. We evaluate three patch scales, large (), medium (), and small (), optimized under -norm and NPS constraints.
We use three representative attack methods:
RP2 [8]: Generates robust physical perturbations to induce misclassification under varying conditions.
CAPatch [34]: Adapted from image captioning, it maximizes detection errors in autonomous driving contexts.
SLAP [35]: A projector-based optical attack simulating light-based perturbations.
These attacks are applied to the selected object categories. We simulate dynamic attacks using ground-truth 3D poses and adjust the patch’s homography frame-by-frame, ensuring realistic appearance changes during motion. Figure 4 shows visual examples of the RP2, CAPatch, and SLAP attacks applied to our target datasets.
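The frame-by-frame homography adjustment can be illustrated with a short sketch that maps the four corners of a patch through a per-frame 3×3 homography. How each homography is derived from the ground-truth 3D poses is not shown here; the matrices are taken as given, and the names `warp_points` and `warp_patch_over_time` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix3 = Sequence[Sequence[float]]


def warp_points(H: Matrix3, points: Sequence[Point]) -> List[Point]:
    """Apply a 3x3 homography to 2D points using homogeneous coordinates."""
    warped: List[Point] = []
    for x, y in points:
        u = H[0][0] * x + H[0][1] * y + H[0][2]
        v = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        if w == 0:
            raise ValueError("point maps to infinity under this homography")
        warped.append((u / w, v / w))  # perspective division
    return warped


def warp_patch_over_time(
    homographies: Sequence[Matrix3], corners: Sequence[Point]
) -> List[List[Point]]:
    """Return the warped patch corners for each frame's homography."""
    return [warp_points(H, corners) for H in homographies]
```

A full simulation would also resample the patch texture inside each warped quadrilateral; the sketch covers only the geometric step.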
4.3. Evaluation Metrics
To evaluate defense effectiveness, we use metrics that assess detection ability, correction ability, false alarms, and efficiency.
Detection Accuracy (DA). DA reflects the ability of a defense to identify misclassified instances caused by attacks:
Correction Accuracy (CA). CA measures the ability of a defense to recover the correct label once an attack has occurred:
False Positive Rate (FPR). FPR characterizes the risk that benign samples are incorrectly treated as attacked by the defense:
False Negative Rate (FNR). FNR measures the proportion of truly attacked samples that are still misclassified after applying the defense, i.e., the missed attacks of the defense:
Runtime Efficiency (RE). RE evaluates whether a defense satisfies real-time constraints. Let $t_i$ denote the end-to-end processing time of the $i$-th sample and $n$ the total number of samples. The average runtime per sample is:
\[
\mathrm{RE} = \frac{1}{n}\sum_{i=1}^{n} t_i
\]
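To make the metric definitions concrete, the following sketch computes all five quantities from per-sample outcomes. It follows a literal reading of the prose definitions and is not taken from the authors' code: the `Outcome` record, its field names, and the choice of all attacked samples as the denominator for DA, CA, and FNR are assumptions. Under this reading CA and FNR are complementary; formulas that use different denominators would give different values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    attacked: bool      # ground truth: the sample was attacked and misclassified
    flagged: bool       # the defense raised an alarm
    true_label: str     # ground-truth class
    final_label: str    # class reported after the defense's correction step
    runtime_s: float    # end-to-end processing time in seconds


def _rate(hits: int, total: int) -> float:
    """Fraction hits/total, or NaN when the denominator is empty."""
    return hits / total if total else float("nan")


def evaluate(outcomes: List[Outcome]) -> Dict[str, float]:
    attacked = [o for o in outcomes if o.attacked]
    benign = [o for o in outcomes if not o.attacked]
    return {
        # attacked samples that the defense flags
        "DA": _rate(sum(o.flagged for o in attacked), len(attacked)),
        # attacked samples whose correct label is recovered
        "CA": _rate(sum(o.final_label == o.true_label for o in attacked), len(attacked)),
        # benign samples incorrectly flagged
        "FPR": _rate(sum(o.flagged for o in benign), len(benign)),
        # attacked samples still misclassified after the defense
        "FNR": _rate(sum(o.final_label != o.true_label for o in attacked), len(attacked)),
        # mean end-to-end runtime per sample
        "RE": sum(o.runtime_s for o in outcomes) / len(outcomes) if outcomes else float("nan"),
    }
```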
4.4. Baselines
To validate the effectiveness of the proposed defense, we compare it with five representative defenses covering input purification, certified robustness, and spatiotemporal consistency modeling. These baselines include both state-of-the-art general defense strategies and physics-aware approaches in autonomous driving.
DiffPure [13] is an input purification method that uses pre-trained diffusion models to sanitize adversarial examples. While effective in removing perturbations, it may degrade high-frequency semantic details necessary for small object recognition.
PatchGuard [11] provides certified robustness against localized adversarial patches. It uses small receptive fields and robust aggregation mechanisms to limit feature corruption, but its high computational overhead restricts its use in real-time object detection.
DetectorGuard [46] secures object detectors against patch-hiding attacks. It cross-references the detector’s output with a robust objectness predictor to detect inconsistencies. However, it focuses more on object presence than spatiotemporal dynamics.
PercepGuard [16] uses spatiotemporal consistency to detect misclassification attacks. It employs a Recurrent Neural Network (RNN) to classify 2D bounding boxes and raises an alarm when the trajectory-inferred class contradicts the visual detection. However, it filters out high-frequency jitter, limiting robustness against adaptive attacks.
PhySense [17] is a physics-aware defense that integrates features like texture, dynamic behavior, and inter-object interactions. While comprehensive, its loose coupling of feature extraction modules leads to significant latency and fails to fully capture correlations between visual and kinematic modalities.
4.5. Defense Performance
We first evaluate the proposed defense against RP2, CAPatch, and SLAP on nuScenes, KITTI, and BDD100K, each with three patch scales (large, medium, small). As shown in Table 2, the proposed defense outperforms PhySense across almost all attack types, patch sizes, and datasets. In most configurations, our DA is comparable to or slightly higher than that of PhySense, while CA improves by a clear margin and FPR/FNR are typically reduced. In a few relatively easy KITTI settings, PhySense attains marginally higher DA, but ours still achieves much higher CA and significantly lower FPR/FNR, indicating a better overall robustness–utility trade-off.
Effect of patch size and attack type. As the patch size shrinks from large to small, both ours and PhySense exhibit the expected degradation in DA and CA due to the increased visual stealthiness and reduced footprint of the adversarial patch. CA is consistently higher and FPR/FNR are generally lower than PhySense across datasets and patch sizes, with only minor deviations in a few easy settings. This trend is especially salient under SLAP, the projector-based optical attack that induces rapid, transient appearance changes. On nuScenes with small SLAP patches, for instance, our method raises CA from to and cuts FNR by more than half, showing that the TAME energy is sensitive to physically inconsistent motion even when visual perturbations are small and short-lived.
Comparison with baselines. Table 3 further positions our method against a broader spectrum of defenses on nuScenes under RP2 with large patches. Input purification (DiffPure) and certified patch defenses (PatchGuard) provide useful robustness guarantees but either incur high false alarms on benign samples or struggle to maintain correction performance in realistic detection settings. Detector-oriented defenses (DetectorGuard) and trajectory-only methods (PercepGuard) capture parts of the physical picture but still leave a considerable gap in DA, CA, or FPR. PhySense, as a strong physics-aware baseline, narrows this gap by integrating multiple hand-crafted physical cues, yet it still operates under a loosely coupled, modular architecture. In contrast, our method achieves leading performance across all metrics, supporting the benefit of deeply coupled, frequency-guided trajectory–appearance reasoning.
Runtime analysis. In terms of RE, we reuse the frozen detector backbone and rely only on Transformer-style operations without external hand-crafted feature extractors. As shown in Table 2, the per-frame overhead of PhySense ranges from about
s to
s across datasets, whereas our method remains in the
–
s range. Thus, our method achieves stronger robustness and better calibration of physical inconsistency while still meeting real-time constraints in autonomous driving deployments.
4.6. Black-Box Transferability
We further examine how well the proposed defense transfers in a realistic setting, where the safety module is trained once and then reused across heterogeneous detectors, attacks, and datasets. We take the defense module trained as described in Section 4.1 and evaluate this single model under three settings: (i) changing the base detector to Faster R-CNN [47] or CenterNet [48], (ii) changing the dataset to nuScenes or KITTI, and (iii) changing the attack family to CAPatch or SLAP, still with medium patches.
Table 4 summarizes the results. The YOLOv8–BDD100K–RP2 configuration corresponds to the training setting, while all other entries represent zero-shot transfer without any retraining of the defense module.
Cross-detector transfer. On BDD100K under RP2, replacing YOLOv8 with Faster R-CNN or CenterNet leads to only a small drop in DA and CA, and a slight increase in FPR/FNR. The overall performance remains in a range similar to that of the original YOLOv8-based configuration. This indicates that the dual-stream spatiotemporal encoder and TAME head behave as a detector-agnostic safety layer: as long as bounding boxes, labels, and trajectories are available, the module can be plugged behind different detectors without retraining, while still providing substantial gains over PhySense and other baselines (Table 2).
Cross-attack and cross-dataset transfer. Using the same model and threshold, we then change both the dataset and the attack type. Across nuScenes and KITTI, and for RP2, CAPatch, and SLAP, YOLOv8-based results show only modest degradation in DA/CA compared with the in-domain BDD100K–RP2 configuration, while FPR/FNR remain low. The trends are similar when switching to Faster R-CNN or CenterNet: although absolute performance slightly decreases due to detector- and domain-specific differences, the defense remains effective across all combinations. Notably, the model retains strong correction ability against CAPatch and SLAP even though it was adversarially calibrated on RP2, suggesting that the frequency-domain kinematic embedding and TAME-based inconsistency reasoning capture generic trajectory–appearance discrepancies instead of overfitting to a single patch pattern or dataset.
Overall, the results in Table 4 show that a single trained module can be transferred across heterogeneous perception stacks and deployment scenarios, with only limited loss of robustness. This transferability is particularly attractive for large-scale autonomous driving systems, where maintaining one bespoke safety module per detector or per fleet would be impractical.
4.7. Defense Against Adaptive Attackers
We finally evaluate the proposed defense against adaptive attackers that are aware of the trajectory–appearance consistency checks and attempt to jointly fool both the detector and the defense.
4.7.1. Attacker Knowledge and Goals
We consider a strong white-box threat model in which the attacker has access to the architecture and parameters of both the base detector and the defense module. We assume that the attacker has no access to the validation set used to select the TAME threshold and no control over the tracking pipeline. The adversary optimizes a physically realizable patch as in Section 4.2, under the same constraints on patch size, location, and NPS. The goal is two-fold: (i) induce a targeted misclassification by the detector and (ii) keep the TAME energy
below the detection threshold
, so that the defense neither raises an alarm nor corrects the label. In other words, the attacker seeks perturbations that jointly maximize detector loss on the target class and minimize
or its contributing terms.
4.7.2. Adaptive Attack Strategies
We instantiate this threat model with three representative strategies that exploit progressively more internal details:
Trajectory-Smoothing RP2. The standard RP2 loss is augmented with a smoothness regularizer on the sequence of 2D/3D bounding boxes, penalizing frame-to-frame variations in velocity and acceleration. This encourages low-frequency, inertial-like trajectories but does not directly optimize TAME.
TAME-Aware Joint Optimization. The attacker differentiates through the dual-stream encoder and TAME head. The patch is optimized to (a) drive the visual head
toward a target class
and (b) reduce the symmetric TAME energy so that
and
agree on
:
where
balances misclassification and energy suppression.
Frequency-Suppression Attack. Assuming knowledge of the frequency-decoupling mechanism, the attacker penalizes the magnitude of the high-frequency component
:
aiming to suppress jitter-related responses in the kinematic stream while still fooling the detector.
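The two trajectory-side penalties used by these strategies can be sketched as follows. This is an illustration under stated assumptions, not the attack code used in the paper: the trajectory is reduced to 2D box centers, "variations in velocity and acceleration" are taken as squared second and third finite differences, and the high-frequency component is approximated by the energy of discrete Fourier bins at or above a hypothetical `cutoff_bin`, standing in for the frequency-decoupling mechanism. The function names and the weighting helper are likewise hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _diff(seq: Sequence[Point]) -> List[Point]:
    """Frame-to-frame finite difference of a 2D sequence."""
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(seq, seq[1:])]


def smoothness_penalty(centers: Sequence[Point]) -> float:
    """Sum of squared changes in velocity and in acceleration."""
    velocity = _diff(centers)
    accel = _diff(velocity)   # frame-to-frame variation in velocity
    jerk = _diff(accel)       # frame-to-frame variation in acceleration
    return float(sum(x * x + y * y for x, y in accel)
                 + sum(x * x + y * y for x, y in jerk))


def high_freq_energy(signal: Sequence[float], cutoff_bin: int) -> float:
    """Energy of DFT bins from cutoff_bin up to the Nyquist bin, divided by n."""
    n = len(signal)
    if n == 0 or not 0 <= cutoff_bin <= n // 2:
        raise ValueError("cutoff_bin must lie in [0, n // 2] for a non-empty signal")
    energy = 0.0
    for k in range(cutoff_bin, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        energy += abs(coeff) ** 2
    return energy / n


def combined_objective(attack_loss: float, penalty: float, weight: float) -> float:
    """Attack loss plus a weighted evasion penalty (the weight plays the role of the trade-off coefficient)."""
    return attack_loss + weight * penalty
```

A constant-velocity track incurs zero smoothness penalty, while a track that stalls for one frame does not; similarly, a constant signal has no energy above bin 0, whereas an alternating signal concentrates its energy at the Nyquist bin.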
4.7.3. Results and Analysis
Table 5 summarizes the defense performance on nuScenes under adaptive attackers.
The Trajectory-Smoothing strategy reduces CA from to by making 3D box sequences closer to the ideal inertial motion, but the drop is moderate, as the frequency-domain embedding still captures residual discrepancies. The TAME-aware attack is the most effective, lowering CA to and increasing FNR to , showing that a fully informed attacker can sometimes force the two heads to agree on wrong labels. The Frequency-Suppression attack achieves similar CA (): suppressing jitter weakens the high-frequency cue but inevitably distorts low-frequency motion, which remains detectable.
Overall, these results expose a fundamental dilemma for adaptive attackers. To reliably fool the base detector, the patch must introduce persistent appearance changes that create additional jitter and trajectory–appearance mismatch, pushing the TAME energy upward. To evade TAME, the attacker must instead smooth motion and suppress jitter, which weakens the perturbation and undermines the misclassification. Because our frequency-domain kinematic embedding defines robustness in terms of the contrast between inertia and jitter rather than any single trajectory statistic, lowering by manipulating one band typically worsens the other; so in practice, adaptive optimization can at best move sequences from the high-energy region to a narrow band around , rather than back to the benign low-energy manifold.
4.8. Scene-Level Behavior and Consistency Landscape
Beyond aggregate metrics, we analyze how the proposed trajectory–appearance consistency behaves at the scene and trajectory level. All visualizations in this subsection are produced on held-out nuScenes sequences; the observations are representative of the trends seen on other datasets.
Frame-wise energy evolution. Figure 5 plots the TAME energy
over time for three typical sequences under RP2, SLAP, and adaptive attacks, together with the benign counterpart. For benign trajectories (green curves),
stays close to a low baseline around
and rarely approaches the decision threshold
, indicating that appearance and motion remain compatible over the whole sequence. Once an RP2 patch becomes effective (frames 15–35), the energy quickly rises into a high plateau (≈
–
) and remains above the shaded alarm region, clearly separating attacked frames from clean ones. SLAP produces a similar but more oscillatory plateau, reflecting the transient nature of projector-based perturbations. In the adaptive case, where the attacker explicitly tries to keep
small, the curve oscillates tightly around
instead of returning to the benign baseline, showing that it is difficult to simultaneously fool the detector and keep the trajectory on the low-energy manifold defined in Section 3.4.
To examine potential false alarms, Figure 6 compares a benign trajectory, a “hard benign” case with sharp braking, and an RP2 attack. Sharp braking temporarily increases
and produces a short bump that touches or slightly crosses the threshold, but quickly falls back to the benign band. In contrast, RP2 induces a long, high plateau that stays far above
. This difference explains why the defense maintains a low FPR while still detecting physically inconsistent attacks.
Consistency vs. detector confidence. Figure 7 presents scatter plots of TAME energy versus detector confidence for benign and attacked samples under RP2, SLAP, and adaptive attacks. Benign detections (green dots) cluster in the lower-right region: high confidence and low energy, which corresponds to predictions that are both visually confident and physically plausible. RP2 and SLAP attacks (red crosses) mainly occupy the upper-right and upper-middle areas: the base detector is still reasonably confident, but the TAME energy is well above
, revealing strong trajectory–appearance conflict. Under adaptive attacks, adversarial samples move closer to the threshold and their confidence decreases slightly, yet they still form a distinct high-energy cloud separated from benign points. These plots confirm that
provides information complementary to detector confidence: it exposes “high-confidence but physically inconsistent” cases that cannot be filtered by confidence alone.
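The complementarity between energy and confidence can be illustrated with a small triage rule that checks the two signals separately. This is a sketch for intuition only: the `Detection` record, the `min_confidence` cutoff, and the three outcome names are hypothetical and are not part of the decision rule defined in the paper.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    confidence: float  # detector confidence in [0, 1]
    energy: float      # TAME energy of the associated trajectory


def triage(det: Detection, tau: float, min_confidence: float) -> str:
    """Classify a detection using confidence and energy as separate checks."""
    if det.confidence < min_confidence:
        return "low_confidence"   # would already be filtered by confidence alone
    if det.energy > tau:
        return "inconsistent"     # confident but physically implausible
    return "accept"               # confident and physically plausible
```

The "inconsistent" branch corresponds to the upper-right region of the scatter plots: samples that pass any confidence filter and are caught only by the energy check.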
Energy distributions across patch size and object class. Figure 8 reports the marginal distributions of
for benign and attacked frames under large, medium and small patches. For large patches, benign and attack distributions are almost disjoint: benign frames concentrate well below
, whereas attacks form a broad peak around
–
. As the patch shrinks, the attack distribution gradually shifts towards the threshold and slightly overlaps with the benign tail, reflecting the increased visual stealthiness of smaller perturbations. Even for small patches, however, the main attack mass remains on the high-energy side of
, which is consistent with the low FNR observed in Table 2.
Finally, Figure 9 decomposes the TAME distributions by object category (bicycle, bus, pedestrian, car, and truck). Across all classes, benign samples exhibit a sharp peak near zero and only a light tail around the threshold, indicating that the consistency prior is not biased towards a specific category. Attack distributions are shifted to higher energies, with large separation for buses and trucks (whose motion is more inertial) and slightly broader overlap for bicycles and pedestrians (which naturally move more erratically). Importantly, a single global threshold
still separates most benign and adversarial frames in every class, supporting the use of a class-agnostic decision rule in Equation (13) and explaining why the defense achieves stable performance across heterogeneous traffic participants.