1. Introduction
Autonomous vehicles (AVs) rely on camera-based perception to detect traffic signs and lights, which directly informs navigation and control. However, traffic sign and light recognition models behave unreliably under changes that arise naturally or from physical conditions during deployment. Such perturbations do not require an attacker to compromise the AV digitally; they can occur in the real world on their own. Even small mistakes, such as misreading a “Stop” sign as a 60 mph speed-limit sign or a green light as a red light, can lead to safety-critical errors [1,2]. Early work on adversarial examples in computer vision was presented in [3] and extended in [1], followed by physical-world attacks on traffic signs and full AV perception pipelines [4,5]. Defenses such as inference-time temperature scaling [6] and certified robustness methods [7] provide some degree of protection, but they typically prioritize accuracy guarantees over security guarantees and cannot cope with hybrid or unforeseen types of change. Moreover, most robustness research studies single-frame, single-FoV systems evaluated on a single dataset or operational design domain (ODD). This overlooks two important aspects of real-world driving: sensing at different distances (mid-range vs. long-range views) and maintaining temporal consistency across sequences. Recent work addresses dynamic adversarial attacks [8,9] and physical distortions [10], but no systematic, sequence-preserving, cross-domain evaluation framework exists. A detailed summary is shown in Table 1.
Recent developments in traffic sign detection architectures, including quantum-inspired transformer-based detectors [13], achieve high single-frame precision; however, they include no mechanisms to preserve the temporal sequences in which signs appear or to exploit redundancy across multiple field-of-view (FoV) cameras. This study fills that gap by proposing a sequence-preserving, dual-FoV attack–defense framework for traffic light and sign recognition. The system uses a multi-source dataset that harmonizes aiMotive 3D, Udacity, Waymo, and self-recorded Texas videos into a common format. This alignment keeps the F_MIDRANGECAM_C and F_LONGRANGECAM_C streams synchronized, along with the 3D annotations on traffic lights and signs. The resulting benchmark supports accurate, consistent testing of both digital and real-world perturbations across a range of ODDs.
Contributions and Novelty
This work makes four contributions towards more reliable AV perception of traffic signs and lights:
1. Unified Defense Stack: This study integrates three mechanisms: (i) feature squeezing (5-bit quantization + 3 × 3 median filter), (ii) inference-time temperature scaling (without distillation training), and (iii) entropy-based cross-FoV gating with sequence-level temporal voting. The unified approach yields a significant reduction in attack success rate (ASR) for both standard and adaptive attacks [6,7,14], as detailed in Section 6. The benchmark comprises 500 sequences across four ODDs; Figure 1 presents examples of dual-FoV frames.
2. Risk-Weighted Evaluation Protocol: In addition to standard mAP metrics, the study presents a safety-aware evaluation: an MUTCD-informed severity matrix (e.g., stop→go weighted 10× vs. speed ± 5), a risk-weighted mAP (RW-mAP) and attack success rate (RW-ASR), an ODD-stratified performance analysis that accounts for environmental conditions, and confidence-stability metrics (volatility, label flip rate across sequences). This protocol covers both common performance metrics and risk-weighted assessments of high-severity misclassifications, as detailed in Section 6.
3. Physically Realizable Perturbation Suite: Natural degradations are modeled under physical constraints that ensure real-world plausibility, with object-aware masking (perturbations applied only to sign/light regions) and temporal persistence modeling (2–5 s per perturbation event). Preliminary physical validation experiments test the transferability of synthetic perturbations to real-world conditions; methodology and findings are detailed in Section 5.9. Compound perturbation effects, in which performance degrades non-additively under multi-factor conditions, are documented in Section 4.
4. Our Work vs. Prior Work: Experimental results demonstrate that this preliminary framework is the first to jointly address physical and black-box attacks, temporal sequences, hybrid attack modes, and unified defenses.
2. Related Work and Research Gaps
Traffic Sign and Light Recognition Benchmarks: LISA [15], BDD100K [16], nuScenes [17], and Waymo [18] are modern datasets offering large-scale, diverse driving data, but they lack (i) explicit natural-perturbation annotations, (ii) synchronized dual-FoV streams for redundancy analysis, and (iii) sequence-level temporal continuity beyond 1–2 s clips.
Adversarial Robustness: Prior work on digital attacks (PGD [19], FGSM [3], UAP [20]), physical attacks [1,4], and black-box attacks [21,22] primarily considers single-frame, single-FoV cases. Most existing defenses, including adversarial training [19], certified robustness [7], feature squeezing [14], and ensemble methods [23], target isolated attacks rather than hybrid (digital + natural) perturbations.
Recent Architectural Advances: Hierarchical sign-classification frameworks have shown increased robustness through contrastive aggregation [24], advancing the handling of intra-class variability of traffic signs. Incremental learning techniques [25] allow new sign classes to be learned without misclassifying existing ones. Both frameworks focus solely on single-frame classification and therefore lack mechanisms for temporal voting or multi-camera fusion to ensure sequence-level consistency in deployed systems.
Identified Gaps:
Table 1 shows that this work uniquely addresses the following: (1) no dual-FoV, sequence-preserving dataset with explicit natural perturbations exists; (2) no unified defense stacks combine complementary mechanisms; (3) hybrid attacks that are physically realizable remain under-studied; and (4) no risk-weighted, ODD-aware evaluation accounts for safety-critical misclassifications.
3. Dataset Curation: Sequence Preserving
A multi-source, sequence-preserved benchmark was constructed to evaluate robustness against naturally occurring perturbations (rain, glare, dirt, occlusions) and adversarial attacks.
3.1. Unified Dataset Sources and Organization
Table 2 shows the composition of the 500-sequence benchmark, drawn from aiMotive 3D [26], Udacity [27], Waymo [18], and data recorded in Texas. Each sequence is a 1 s clip (30 frames at 30 fps) containing at least one traffic sign or light. There are four ODDs: highway (120 seq), night (80 seq), rainy (70 seq), and urban (230 seq). This study kept only the dual-FoV camera streams (F_MIDRANGECAM_C: mid-range 50° FoV; F_LONGRANGECAM_C: long-range 25° FoV) and excluded LiDAR and GNSS/INS data to focus on image-domain robustness.
3.2. Dual-FoV Camera Configuration and Synchronization
The system has two synchronized front-facing cameras: one for mid-range (50° FoV, 0.5–50 m, 1920 × 1080 @ 30 fps) and one for long-range (25° FoV, 10–200 m) detection. The cameras are synchronized via hardware-triggered frame synchronization (±1 ms accuracy). Intrinsic calibration uses the checkerboard method [2], yielding a mean reprojection error of 0.31 pixels or less. Stereo calibration provides the rotation matrix R and translation vector t between the cameras, enabling depth estimation from disparity. Section 9.2 gives detailed hardware specifications, the calibration protocol, temporal synchronization, and deployment constraints.
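As a quick illustration of the disparity-to-depth relation that the stereo calibration enables, the sketch below applies the standard rectified-stereo formula Z = f·B/d. The focal length, baseline, and disparity values are hypothetical, not the paper's calibration results, and a true dual-FoV pair (50° vs. 25°) would first require rectification to a common virtual camera:

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Metric depth from rectified-stereo disparity: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Example (hypothetical values): f = 1000 px, baseline = 0.3 m, disparity = 10 px
# gives a depth of 30 m.
```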
3.3. Annotation Schema
For every frame, two synchronized JSON files are used:
Traffic Lights: 3D bounding boxes, occlusion scores, and signal states (red, green, yellow, arrows, and pedestrian lights).
Traffic Signs: Bounding-box geometry, U.S.-style subtype labels (e.g., us_stop, us_speedlimit_35, us_oneway), and OCR-derived text (e.g., SPEED LIMIT 35).
The dataset enables causal reasoning at the sequence level, such as determining whether a sign is consistently misclassified across frames or whether errors occur for a short time and then resolve.
3.4. Dataset Statistics and Design Rationale
The 500-sequence benchmark (Table 2) contains 30,000 base frames (500 seq × 30 frames × 2 cameras) across four ODDs. With five perturbation variants per sequence, this work analyzed 150,000 frames in total. The dataset was divided into 300 training, 100 validation, and 100 test sequences (60%/20%/20%). All results are based on the held-out 100-sequence test set (30,000 perturbed frames).
3.5. Key Design Advantages
The sequence-preserving dual-FoV structure facilitates: (1) temporal error analysis (where perturbation-induced failures may endure or self-correct), (2) multi-distance vulnerability assessment (contrasting long-range glare sensitivity with mid-range occlusion robustness), and (3) ODD-stratified evaluation that correlates natural perturbations with deployment contexts.
3.6. Annotation Harmonization
Heterogeneous formats were unified via (1) 2D→3D projection using calibration matrices and ground-plane constraints (Z = 0 for signs), (2) Waymo protobuf→JSON conversion with standardized signal states, and (3) manual validation on 500 samples achieving 94.3% spatial IoU and 98.1% label accuracy. Conversion scripts are available in the repository.
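The 2D→3D harmonization step can be sketched as a ray–ground-plane intersection. The helper below is a hypothetical illustration (the function name and the calibration values in the example are invented, not the released scripts), assuming a pinhole camera with world-to-camera extrinsics (R, t) and the sign base constrained to the ground plane Z = 0:

```python
import numpy as np

def pixel_to_ground(u: float, v: float, K: np.ndarray,
                    R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) onto the world ground plane Z = 0.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation
    (x_cam = R @ x_world + t).
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing ray, camera frame
    d = R.T @ ray_cam                                   # ray direction, world frame
    C = -R.T @ t                                        # camera centre, world frame
    s = -C[2] / d[2]                                    # scale where the ray hits Z = 0
    return C + s * d
```

For a camera 1.5 m above the ground looking along the horizontal axis, a pixel below the principal point maps to a point several metres ahead on the road surface.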
Training Data Sources and Label Quality: To address concerns about projection-error contamination in training, the label provenance of each part of the dataset is explained below.
Native 3D Ground Truth (No Projection):
aiMotive (210 sequences): Uses original 3D bounding-box annotations and positions verified by LiDAR. No projection from 2D to 3D was used.
Waymo (80 sequences): Uses original 3D annotations from combining data from multiple sensors. No projection applied.
2D-to-3D Projected Labels (Udacity/Texas): These sequences use labels obtained via 2D→3D projection (see Section 3.6).
Impact on Defense Effectiveness Claims: A key question is whether the defense improvements reported in this study (9.6–9.8% mAP gain in Table 4) exceed the annotation margin of error. To check this, the following was verified:
Annotation noise reduced absolute mAP by ∼2% (Table 4: 70.2% clean became 68.2% with 10% noise);
The defense gain remained stable at 9.6–9.8% across all noise levels;
Conclusion: Defense improvements were 4–5× larger than the annotation uncertainty margin, confirming the statistical validity of the claims.
A conservative thresholding approach was used for long-range detections (>100 m), which carry higher uncertainty (8.3% distance error): a confidence threshold of 0.7 was required, along with three or more consecutive frame observations. This reduced projection-induced false positives by 41% while maintaining 94% recall on validated ground truth.
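The conservative acceptance rule described above (confidence ≥ 0.7 on at least three consecutive frames) can be sketched in a few lines; the function name is illustrative, not from the released code:

```python
def persistent_detections(conf_per_frame, conf_thresh=0.7, min_frames=3):
    """Accept a long-range track only if confidence stays at or above
    conf_thresh for min_frames consecutive frames."""
    run = 0
    for c in conf_per_frame:
        run = run + 1 if c >= conf_thresh else 0  # reset on any weak frame
        if run >= min_frames:
            return True
    return False
```

A single low-confidence frame resets the run, which is what suppresses spurious long-range detections caused by projection error.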
Mitigation of Projection Contamination:
The training set purposefully combined high-quality (aiMotive/Waymo) and projected (Udacity/Texas) labels to simulate real-world deployment situations where perfect 3D ground truth is not accessible. Cross-dataset validation ensured robustness extended beyond any single annotation methodology.
Clarification on Harmonization Scope: The harmonization process unified annotation formats and coordinate systems but did not alter the physical cameras. Every source dataset kept its original camera sensors:
aiMotive/Texas: Native dual-FoV configuration (50° mid-range, 25° long-range).
Waymo: Multi-camera rig; this study used mid-range Cam and Long-range Cam streams.
Udacity: Single 60° FoV camera; mid/long-range regions extracted via spatial cropping.
The “unified benchmark” refers to three data-level transformations:
Annotation format: All converted to aiMotive-style JSON schema.
3D coordinate system: 2D→3D projection applied to Udacity/Texas using calibration matrices and ground-plane constraints (Z = 0 for signs); Table 3 reports the resulting errors.
Label taxonomy: Standardized class names (e.g., us_stop, us_speedlimit_35).
This work did not “normalize” camera sensors to the same physical hardware. The defenses were evaluated on real-world sensors to demonstrate robustness to camera variability.
3.7. Annotation Quality and Error Analysis
2D-to-3D Projection Error Characterization
This study measured projection errors on 200 sequences for which both 2D and 3D ground-truth data were accessible (Waymo subset).
Methodology:
Compare 2D-to-3D projected positions against Waymo’s native 3D annotations.
Compute distance and lateral (XY) errors at different ranges.
Stratify by object type (traffic sign vs. traffic light) and distance.
Results:
Observation: The projection error grew approximately quadratically with distance (empirical fit).
Mitigation strategies:
Extended Validation
This work performed additional validation beyond the initial 500 samples (see Section 3.6):
Sample size: 600 sequences (150 per ODD).
Dual annotators: Two independent annotators validated all 600 sequences.
Inter-annotator agreement:
Bounding boxes (IoU > 0.5): Cohen’s κ (excellent);
Degradation labels: Cohen’s κ (moderate);
Class labels: agreement = 98.1%.
Occlusion handling: Twelve percent of frames flagged as “uncertain” (severe occlusion) were excluded from the temporal voting confidence calculation.
Impact on Results—Robustness Check
To verify that annotation errors did not compromise the reported results, synthetic noise was injected into the bounding boxes.
3.8. Dataset Reconstruction and License Compliance
IMPORTANT: This study does NOT share source data. Instead, it provides the following:
Preprocessing Scripts: Python code to harmonize aiMotive, Waymo, Udacity formats.
Frame Selection Indices: CSV files (train_frames.csv, val_frames.csv, test_frames.csv) specifying which frames are drawn from each source, enabling exact reproduction of the benchmark.
Source Dataset Access:
Reconstruction Process:
Download the source datasets from the original providers.
Run the provided preprocessing scripts with the supplied frame indices.
The resulting benchmark complies with all of the original licenses.
Estimated time: 4–6 h on a standard workstation once the source data are downloaded.
License: Scripts and indices are released under the MIT license. The reconstructed dataset inherits the most restrictive source license, which permits non-commercial research only.
Important Disclaimer: Reproducibility depends on the original providers continuing to make the source datasets available. Access is not guaranteed and datasets may be withdrawn at any time, although all sources have been stable since 2018.
3.9. License Compliance and Legal Framework
This study used all source datasets in full compliance with their licenses.
Table 5 documents the legal basis for each source.
5. Proposed Methodology
The methodology included the curation of datasets from multiple sources, the development of baselines, a natural perturbation suite, a unified defense stack, and risk-weighted evaluation.
Figure 2 shows the complete pipeline.
5.1. Curation of the Sequence-Preserving Dual-FoV Dataset
Section 3 described how the framework curates aiMotive-style sequences. Only F_MIDRANGECAM_C and F_LONGRANGECAM_C camera frames were used, with traffic lights and signs annotated synchronously. Data split: 300/100/100 sequences (train/val/test); degradation labels (clean, occluded, weather-affected, glare) enabled the ODD-stratified analysis. Evaluation protocol: each camera performed detection independently. Unless stated otherwise, reported mAP values are the average of mid-range and long-range performance: mAP = (mAP_mid + mAP_long)/2. The unified defense stack used entropy-based gating to select the more confident camera for each frame. All classes covered traffic lights and signs jointly.
5.2. Baseline Models
This work trained YOLOv8m detectors (25.9M parameters) for 100 epochs using the AdamW optimizer (lr = 1 × 10⁻⁴, cosine annealing) with early stopping (patience = 10). Table 4 lists all hyperparameters.
5.3. Unified Defense Stack
The proposed three-layer defense mechanism combined the following strategies. The complete training pseudocode is provided in Algorithm 1.
Algorithm 1 Unified defense stack with temporal voting
Input: dual-FoV frames {x_t^mid, x_t^long}, window size w
Output: robust predictions {ŷ_t}
1: for t = 1 to T do
2:   ▹ Layer 1: Feature squeezing
3:   x̃_t^c ← Quantize_{5-bit}(x_t^c), c ∈ {mid, long}
4:   x̃_t^c ← MedianFilter_{3×3}(x̃_t^c)
5:   ▹ Layer 2: Inference-time temperature scaling (τ = 3, no distillation training)
6:   z_t^c ← Detector(x̃_t^c)
7:   p_t^c ← softmax(z_t^c / τ)
8:   ▹ Layer 3: Entropy gating
9:   H_t^c ← −Σ_k p_t^c(k) log p_t^c(k); c* ← argmin_c H_t^c
10:  ▹ Temporal voting
11:  ŷ_t ← WeightedVote(p_{t−w+1}^{c*}, …, p_t^{c*})
12: return {ŷ_t}
Layer 1 uses bit-depth quantization and median filtering to reduce feature dimensionality.
Feature-Squeezing Parameter Selection:
This analysis tested bit-depth quantization levels from 3 to 8 bits, with results shown in Table 6.
The complete training pseudocode is provided in Algorithm 2.
Algorithm 2 Baseline model training pipeline
Input: training sequences, hyperparameters
Output: trained detector M
1: Initialize YOLOv8m with pretrained COCO weights
2: optimizer ← AdamW(lr = 1 × 10⁻⁴)
3: scheduler ← CosineAnnealingLR(optimizer)
4: best_mAP ← 0, patience_counter ← 0
5: for epoch = 1 to 100 do
6:   for batch in DataLoader(train_data, batch_size = 128) do
7:     ▹ Forward pass
8:     predictions ← M(batch.images)
9:     loss ← ComputeLoss(predictions, batch.labels)
10:    ▹ Backward pass
11:    optimizer.zero_grad()
12:    loss.backward()
13:    optimizer.step()
14:  ▹ Validation
15:  val_mAP ← Evaluate(M, validation_data)
16:  if val_mAP > best_mAP then
17:    best_mAP ← val_mAP
18:    Save(M, ‘best_model.pth’)
19:    patience_counter ← 0
20:  else
21:    patience_counter ← patience_counter + 1
22:    if patience_counter ≥ 10 then
23:      break ▹ Early stopping
24:  scheduler.step()
25: return LoadBestModel(‘best_model.pth’)
Rationale: Five-bit quantization gave the largest robustness gain (+1.6% mAP) while preserving clean accuracy (70.2%). Lower bit depths (3–4 bits) over-compressed features, lowering clean performance by about 5%; higher bit depths (6–8 bits) did not effectively suppress noise.
Median Filter Kernel (k = 3): Larger kernel sizes were also evaluated. Kernels larger than 5 blurred fine-grained sign text (e.g., speed limits), reducing OCR accuracy by about 8%. A 3 × 3 kernel struck a good balance between noise reduction and edge preservation.
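A minimal sketch of the Layer-1 squeezer (5-bit quantization followed by a 3 × 3 median filter) is shown below. It uses a naive pure-NumPy median filter for self-containment; a production implementation would use an optimized filter routine:

```python
import numpy as np

def squeeze(img: np.ndarray, bits: int = 5, k: int = 3) -> np.ndarray:
    """Feature squeezing: bit-depth quantization then k x k median filtering.
    img: grayscale uint8 array of shape (H, W)."""
    # Bit-depth quantization to 2**bits levels
    levels = 2 ** bits - 1
    q = (np.round(img.astype(np.float32) / 255.0 * levels) / levels * 255.0)
    q = q.astype(np.uint8)
    # k x k median filter with edge padding (naive implementation)
    p = k // 2
    padded = np.pad(q, p, mode="edge")
    out = np.empty_like(q)
    H, W = q.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out
```

The median filter removes isolated high-frequency "salt" pixels, the kind of artifact that single-pixel adversarial perturbations exploit.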
Layer 2 applies inference-time temperature scaling with an increased softmax temperature. Layer 3 applies entropy-based gating to select the more confident FoV for each frame, followed by sequence-level temporal voting.
Clarification on Inference-Time Temperature Scaling: This study used temperature scaling only at inference time to smooth output distributions, unlike classical defensive distillation [6], which requires teacher–student training with soft labels. The method follows Buckman et al. [28] and performs comparably (within 2–3% mAP) without any extra training.
Classical defensive distillation [6] trains a student model on soft labels from a teacher model at temperature τ, used during both training and inference:
L = α · L_CE(y, σ(z_s)) + (1 − α) · τ² · KL(σ(z_t/τ) ∥ σ(z_s/τ)),
where z_s and z_t denote the student and teacher logits, σ is the softmax, α balances hard-label supervision with knowledge transfer, and the τ² factor compensates for gradient diminishing at higher temperatures.
Proposed approach: apply p = softmax(z/τ) with τ = 3 at inference only.
Effect: Output entropy rose by 35–40%, and adversarial gradient magnitude fell by about 40% (as measured by FGSM sensitivity). As demonstrated in [28], inference-time temperature adjustment offers robustness comparable to complete distillation (within 2–3% mAP) without any training overhead.
Terminology: The term “inference-time temperature scaling” is used throughout to distinguish this approach from distillation-based training.
Temperature Selection Rationale (τ = 3): This work evaluated a range of temperatures on the validation set; Table 7 reports the results. The best balance between robustness (lower ASR) and accuracy (higher mAP) was achieved at τ = 3. Higher temperatures over-smoothed predictions, lowering clean accuracy, as shown in Table 8.
5.4. Sequence-Level Temporal Voting
The temporal-voting system exploits frame-to-frame consistency.
Implementation details:
Temporal window size: The window was set to 5–7 frames, corresponding to approximately 0.17–0.23 s at 30 fps, i.e., 17–23% of each 1 s sequence. Causal windows prevented look-ahead bias near sequence boundaries.
Temporal persistence threshold: Objects were retained if their presence probability exceeded the threshold for three or more consecutive frames.
Memory buffer: A first-in, first-out (FIFO) queue maintained the most recent 15 frames to support object tracking across adjacent sequences.
Critical Note on Ground-Truth Independence: All metrics in Equation (2) were derived from pixel intensities (contrast, sharpness) or model predictions (quality), guaranteeing the defense’s applicability without reliance on ground-truth annotations. Ground-truth occlusion/visibility labels were utilized solely for evaluation (calculating mAP, ASR) and were never employed for decision-making during inference.
Designing the Voting Window
The 5–7 frame window was intentionally shorter than the average perturbation duration (2.8 ± 0.6 s). This multi-scale design served several purposes:
The short voting window (5–7 frames, about 0.17–0.23 s) smooths frame-to-frame noise and corrects isolated errors.
The memory buffer (15 frames ≈ 0.5 s) maintains object tracks through brief occlusions.
Cumulative robustness: during a 3 s perturbation event (e.g., glare), the system makes roughly 85–90 consecutive voting decisions, each with a 5–7 frame context.
Example: during a 3 s glare event (frames 10–100 at 30 fps):
Each frame t combines predictions from its trailing window (local smoothing);
The memory buffer at frame 50 retains history from frames 35–50 (tracking continuity);
The system makes 90 consecutive voting decisions, providing robustness without any single window spanning the entire perturbation.
This design was grounded in empirical data: Figure 3 displays recovery curves indicating that temporal voting yields a +2.7% mAP improvement over frame-by-frame inference under transient perturbations (rain, glare).
Voting Weight Computation (No Ground-Truth Labels): This approach computes voting weights solely from image-based quality metrics, never from ground-truth annotations:
w_i = f(contrast_i, sharpness_i, quality_i),
where
Contrast: Michelson contrast on the grayscale frame, contrast_i = (I_max − I_min)/(I_max + I_min);
Sharpness: variance of the Laplacian (blur detection), sharpness_i = Var(∇²I_i);
Quality: estimated from detection confidence, quality_i = the mean detection confidence across all boxes in frame i.
No ground-truth label leakage: all metrics were computed from the image and model predictions. The phrase “quality-weighted aggregation” in the manuscript refers to these computed image-quality metrics, not to ground-truth occlusion labels.
Runtime availability: During inference, all metrics can be computed in real time (<1 ms per frame on GPU).
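The three label-free quality metrics can be sketched as below. The mixing coefficients in `frame_weight` are hypothetical placeholders, since the paper's exact weighting function f is not reproduced here:

```python
import numpy as np

def michelson_contrast(gray: np.ndarray) -> float:
    """(I_max - I_min) / (I_max + I_min) on a grayscale frame."""
    i_max, i_min = float(gray.max()), float(gray.min())
    return (i_max - i_min) / (i_max + i_min + 1e-8)

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of a 4-neighbour Laplacian response (blur detection)."""
    g = gray.astype(np.float64)
    lap = (-4 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

def frame_weight(gray, det_confidences, alpha=1.0, beta=1.0, gamma=1.0):
    """Label-free voting weight: combination of contrast, sharpness, and
    mean detection confidence (alpha/beta/gamma are illustrative, not
    the paper's values)."""
    quality = float(np.mean(det_confidences)) if len(det_confidences) else 0.0
    return (alpha * michelson_contrast(gray)
            + beta * laplacian_variance(gray)
            + gamma * quality)
```

Note that none of the three terms touches annotations: the first two come from pixels, the third from the detector's own confidences.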
The complete entropy-based cross-FoV gating procedure is described in Algorithm 3, the temporal voting with quality-weighted aggregation in Algorithm 4, and the detailed feature-squeezing implementation in Algorithm 5.
Algorithm 3 Entropy-based cross-FoV gating
Input: detections p_t^mid, p_t^long from the mid/long-range cameras at frame t
Output: selected predictions p_t^{c*}
1: for each camera c ∈ {mid, long} do
2:   H_t^c ← −Σ_k p_t^c(k) log p_t^c(k)  ▹ Frame-level entropy
3: c* ← argmin_c H_t^c  ▹ Select lower-entropy camera
4: return p_t^{c*}
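In code, the per-frame gate reduces to an entropy comparison over the two cameras' class distributions. The sketch below is illustrative (the probability vectors are invented):

```python
import numpy as np

def predictive_entropy(probs) -> float:
    """Shannon entropy of a class-probability vector."""
    p = np.asarray(probs, dtype=np.float64)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def gate_fov(mid_probs, long_probs):
    """Pick the camera whose class distribution has lower entropy
    (i.e., the more confident view) for this frame."""
    h_mid = predictive_entropy(mid_probs)
    h_long = predictive_entropy(long_probs)
    return ("mid", mid_probs) if h_mid <= h_long else ("long", long_probs)
```

A peaked distribution (one dominant class) has low entropy, so a camera blinded by glare, whose outputs flatten toward uniform, is automatically deselected.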
Algorithm 4 Temporal voting
Input: sequence predictions, frames 1…T, window size w
Output: stabilized predictions
1: for t = 1 to T do
2:   W ← {max(1, t − w + 1), …, t}  ▹ Causal window
3:   for each object o tracked across W do
4:     for each frame i ∈ W containing o do
5:       ▹ Compute label-free quality metrics:
6:       contrast_i ← (I_max − I_min)/(I_max + I_min)
7:       sharpness_i ← Var(∇²I_i)
8:       quality_i ← mean detection confidence in frame i
9:       w_i ← f(contrast_i, sharpness_i, quality_i)
10:    ŷ_o ← argmax_c Σ_{i∈W : y_{o,i}=c} w_i  ▹ Weighted vote
11:  update predictions with voted classes
12: return stabilized predictions
Algorithm 5 Feature squeezing implementation
Input: image x, bit depth b = 5, kernel size k = 3
Output: squeezed image x̃
1: x_q ← round(x · (2^b − 1)) / (2^b − 1)  ▹ Bit-depth quantization
2: x̃ ← MedianFilter_{k×k}(x_q)  ▹ Spatial smoothing
3: return x̃
5.5. Design Rationale: Short Voting Windows vs. Long Perturbations
Apparent contradiction: Natural perturbations last for 2–5 s (60–150 frames), but the voting window lasts for only 5–7 frames (0.17–0.23 s).
Resolution through multi-scale temporal processing:
Local smoothing (5–7 frame window): This removes noise from frame to frame and fixes minor errors in the immediate time neighborhood.
Tracking persistence (15-frame buffer): This keeps track of the object’s identity even when it is briefly blocked (0.5 s), filling in the gaps between voting windows.
Cumulative robustness (overlapping windows): during a 3 s perturbation event (90 frames),
The system makes 90 consecutive voting decisions;
Each decision combines predictions from a 7-frame context;
Adjacent windows overlap by 80–85%, which adds substantial redundancy to the total effective coverage.
Empirical validation: Figure 3 shows that 5–7 frame voting yields a +2.7% mAP improvement over frame-by-frame inference under short-term perturbations (e.g., rain, glare), indicating that even a short window is effective.
Why not use longer windows (30–90 frames)?
This study conducted an ablation study varying the window size from 3 to 30 frames; Table 9 presents the detailed analysis.
Analysis:
Optimal point: Five to seven frames balance robustness gains (+1.7% mAP over 3-frame) with acceptable latency (167–233 ms).
Diminishing returns: Increasing the window from 7 to 30 frames only improves mAP by 0.5% while quadrupling latency (233 ms → 1000 ms).
Safety constraint: One-second delays (30-frame windows) are not acceptable for emergencies that need immediate braking or lane changes.
Design principle: Short windows with high-frequency sampling (30 fps) perform just as well as long windows with low-frequency sampling, and they still respond in real time.
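A causal majority vote over a trailing w-frame window, as used here, can be sketched in a few lines (the label sequence in the example is illustrative):

```python
from collections import Counter

def causal_vote(labels, w: int = 7):
    """Majority vote over a trailing (causal) window of up to w frames,
    so no look-ahead is needed near sequence boundaries."""
    out = []
    for t in range(len(labels)):
        window = labels[max(0, t - w + 1): t + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return out
```

A single corrupted frame inside a run of consistent detections is outvoted by its neighbours, which is the isolated-error correction described above.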
Trade-off acknowledgment: This study acknowledges this as a design compromise that balances temporal smoothing with real-time constraints, rather than a theoretically optimal choice. Future work should explore adaptive window sizing based on scene complexity and vehicle speed (Section 7.2).
5.6. Threat Model and Evaluation Axes
Threat model. Perturbations are either digital (synthetic, ℓ∞- or ℓ2-bounded) or physically realizable, i.e., achievable in the real world through laser glare, stickers, dirt, raindrops, or recapture distortions. Perturbations can be object-aware (limited to sign/light regions) and can appear as (i) single-frame distortions with no temporal persistence, (ii) sequence-consistent perturbations that persist across frames, or (iii) hybrid attacks that mix universal and instance-specific components. Testing covered white-box (gradient), gray-box (family-level knowledge), and black-box (query-restricted) settings.
Evaluation axes. The framework reports the attack success rate (ASR), sequence accuracy, and classical detection metrics (mAP@0.5, mAP@0.5:0.95). Additional metrics include:
Risk-weighted scores (RW-mAP, RW-ASR), which use a MUTCD-informed cost matrix to find misclassifications that are important for safety.
Stability indicators, such as confidence, volatility, and label flip rate.
All results are stratified by ODD and FoV (mid vs. long) and aggregated across the source datasets.
5.7. Evaluation Metrics
A comprehensive set of metrics was used to assess performance.
Performance metrics:
mAP@IoU: COCO-style metrics at IoU thresholds [0.5:0.05:0.95].
Attack Success Rate (ASR): the fraction of attacked objects on which the attack succeeds (a detection-specific definition is given below).
Per-class AP: performance for {stop_sign, traffic_light_red, traffic_light_green, speed_limit_*}.
Safety-aware metrics:
Risk-weighted mAP (RW-mAP): per-class AP weighted by severity weights derived from an MUTCD severity matrix.
Critical Failure Rate (CFR): probability of high-severity misclassifications (e.g., stop→go, red→green).
Mean Time to Correct Detection (MTCD): average number of frames until correct re-detection after a perturbation event.
Stability score: confidence stability computed over temporal windows.
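As one possible realization of the risk weighting, the sketch below weights each attack outcome by a severity-matrix entry. The weights and the weighted-denominator convention are illustrative; the paper's exact RW-ASR formula may differ:

```python
def risk_weighted_asr(events, severity, default_w: float = 1.0) -> float:
    """RW-ASR sketch: attack successes weighted by an MUTCD-informed
    severity matrix, e.g. severity[("stop", "go")] = 10.0.

    events: iterable of (true_class, predicted_class, attack_succeeded).
    """
    num = den = 0.0
    for true_cls, pred_cls, success in events:
        w = severity.get((true_cls, pred_cls), default_w)
        den += w
        if success:
            num += w
    return num / den if den else 0.0
```

Under this convention a single successful stop→go flip dominates many benign speed-limit confusions, mirroring the 10× weighting described in the contributions.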
5.8. Implementation and Reproducibility
All experiments used the following configuration:
Framework: PyTorch 2.0.1, CUDA 11.8.
Random seeds: NumPy = 42, PyTorch = 42, CUDA deterministic mode enabled.
Hardware: Four NVIDIA A100 80 GB GPUs.
Complete Code Repository:
Reproducibility Checklist:
Pre-trained model checkpoints (all baselines + proposed method);
Complete training scripts with exact hyperparameters;
Evaluation pipeline reproducing all tables and figures;
Statistical analysis code (bootstrap CIs, significance tests);
Perturbation generation suite (all natural + hybrid attacks);
Dataset loading utilities compatible with curated benchmark.
Computational Requirements: Training the full defense stack required h on 4× A100 GPUs. Minimum GPU for inference: NVIDIA RTX 2080Ti (11GB VRAM). The average inference latency is 16.6 ms per dual-FoV frame pair (60 FPS capable).
5.9. Physical Transferability Validation (Preliminary)
This study conducted limited real-world recapture experiments to assess the validity of synthetic perturbations, though statistical power was constrained by sample size.
Setup: 15 perturbation scenarios (three rain intensities, four glare positions, three dirt patterns, five occlusions) applied to printed MUTCD signs (24 inch) and photographed with dataset-matched cameras (50° mid, 25° long FoV) at 10–50 m range. Three repetitions per scenario = 45 total captures.
Results:
Synthetic perturbations (in silico): ASR = 31.2% [95% CI: 28.9–33.6%].
Physical recapture: ASR = 28.7% [95% CI: 25.8–31.9%].
Difference: 2.5 percentage points (overlapping CIs; not statistically significant).
Limitations: With only 15 scenarios, statistical power (≈0.30) was too low to detect differences of <5%. The scope excluded heavy rain (>50 mm/h), snow/ice, nighttime, long-range (>50 m) captures, moving-vehicle captures, and weathered road signs. Material discrepancies existed (printed matte signs vs. retro-reflective road signs). Interpretation: Preliminary evidence suggests synthetic perturbations approximate physical behavior in controlled conditions, but deployment-grade validation requires scenarios spanning seasonal extremes, dynamic captures, and weathered signage. This study recommends treating synthetic ASR as an upper bound on real-world robustness.
5.10. Baseline Model Adaptations
BEVFormer [10] and MCTR [12], originally designed for 3D object detection, were adapted for 2D comparison as follows:
Projection: 3D bounding-box predictions were mapped into the 2D image plane using intrinsic camera parameters.
IoU computation: 2D Intersection over Union (IoU) was computed against the ground-truth 2D boxes so that all methods were compared fairly.
Hyperparameter re-tuning: The learning rate, Non-Maximum Suppression (NMS) threshold, and confidence threshold were all optimized again on the validation set (100 sequences).
Training data: The training dataset used to train this model was the same as the training dataset used to train the proposed method (300 training sequences), thus providing a fair comparison.
Training epochs: To mitigate the domain shift introduced by the 3D-to-2D projection, training ran for an additional 50 epochs (150 total vs. the standard 100); more details are shown in Table 10.
Acknowledgment: These adaptations may not represent optimal performance for BEVFormer/MCTR in their native 3D detection setting. Comparisons reflect 2D detection performance using 3D-to-2D projection, which is a natural deployment mode for multi-camera systems but not the primary use case for which these methods were designed.
Fairness justification: All baselines (single-camera, multi-camera, transformer-based) used identical training data, evaluation protocols, and perturbation suites, ensuring a controlled comparison of defense mechanisms rather than architectural advantages.
Attack Success Rate (ASR) Definition
For object detection tasks, this work extended classical ASR [
29] to account for localization failures and misclassifications, recognizing that these pose equivalent safety risks in AV contexts [
30].
Definition: For each ground-truth object o_i with bounding box b_i and class c_i, an attack counts as successful if the object is either missed or misclassified, so that

ASR = (N_failed + N_misclassified) / N_valid,

where:
Failed detection: the best-matching prediction has IoU < 0.5 with b_i;
Misclassification: the best-matching prediction has IoU ≥ 0.5 with b_i but the wrong class (ĉ_i ≠ c_i);
Traffic light state changes (red→green) count as misclassification;
Objects with >80% occlusion (12% of the dataset) are excluded from the denominator N_valid.
Table 10 illustrates these failure cases with specific examples.
This differs from classification ASR (which considers only class changes on pre-cropped inputs) by treating localization failures as adversarial successes, which is appropriate for safety-critical detection tasks.
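The ASR computation defined above can be sketched as follows; this is a minimal pure-Python illustration, and the dict-based data layout and function names are assumptions, not the paper’s actual code:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_asr(gt_objects, predictions, iou_thr=0.5, max_occlusion=0.8):
    """gt_objects: dicts with 'box', 'cls', 'occlusion'; predictions: dicts
    with 'box', 'cls'. Counts failed detections (best IoU < threshold) and
    misclassifications (IoU >= threshold, wrong class) as successes;
    heavily occluded objects are excluded from the denominator."""
    successes, valid = 0, 0
    for gt in gt_objects:
        if gt['occlusion'] > max_occlusion:
            continue                       # excluded from denominator
        valid += 1
        ious = [iou(gt['box'], p['box']) for p in predictions] or [0.0]
        best = max(range(len(ious)), key=lambda i: ious[i])
        if ious[best] < iou_thr:
            successes += 1                 # failed detection
        elif predictions[best]['cls'] != gt['cls']:
            successes += 1                 # misclassification
    return successes / valid if valid else 0.0
```

For example, a frame with one correctly detected stop sign and one red light predicted as green yields ASR = 0.5 under this definition.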
6. Experimental Results
We evaluated the proposed traffic sign and light recognition framework across four different ODDs: highway, night, rainy, and urban. Experiments included (i) baseline robustness to natural changes, (ii) the effectiveness of a unified defense stack, (iii) component-wise ablation, and (iv) safety metrics that took ODD into account.
6.1. Experimental Setup
All tests used 4×A100 GPUs and were run five times (seeds 42–1011). The dataset was split into three parts: 300 sequences for training, 100 for validation, and 100 for testing. The test set had 100 sequences, each with five different perturbation variants, for a total of 30,000 frames. Statistical significance was evaluated using bootstrap 95% confidence intervals (n = 1000) and Bonferroni-adjusted paired
t-tests (
). The complete configuration is given in
Section 6.
Statistical Analysis Protocol:
This study used strict statistical testing to make sure that the reported results were reliable:
Replication: Each experiment was run five times with different random seeds (42, 123, 456, 789, 1011).
Confidence intervals: Bootstrap method with 1000 iterations for 95% CIs on all metrics.
Significance testing: Paired t-tests for comparing two models at a time.
Multiple comparison correction: Bonferroni adjustment across the 24 perturbation conditions in
Figure 4.
Effect size: Cohen’s d is reported in Section 9 for all main comparisons.
Statistical power: Conditions with fewer than 15 sequences were marked as “n.s.” (not significant) and flagged as underpowered.
After the Bonferroni correction, all improvements shown in
Table 11 remained statistically significant at the corrected threshold.
Table 12 shows the exact sample sizes and
p-values for each condition.
Unless otherwise noted, metrics were calculated at IoU = 0.5, and significance was determined using paired t-tests.
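The protocol above can be sketched with a numpy-only illustration of the percentile bootstrap and the Bonferroni threshold; the function names and defaults are illustrative assumptions, not the paper’s implementation:

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of a metric: resample with
    replacement n_boot times and take the alpha/2 and 1-alpha/2 quantiles
    of the resampled means."""
    rng = np.random.default_rng(seed)
    n = len(values)
    means = np.array([
        rng.choice(values, size=n, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

def bonferroni_threshold(alpha, n_comparisons):
    """With m comparisons, each test is judged against alpha / m."""
    return alpha / n_comparisons
```

With 24 perturbation conditions, a family-wise level of 0.05 would give a per-test threshold of 0.05 / 24 ≈ 0.002 under this correction.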
6.2. Overall Model Comparison
Table 11 displays a comprehensive comparison of baselines and the proposed methodology, with confidence intervals. All baselines showed statistically significant improvements under the unified defense stack.
6.3. Ablation Studies
Defense robustness to annotation quality:
Section 3.7 showed that 2D→3D projection added systematic errors, with a mean distance error of 4.2% and a 95th percentile error of 8.3% for long-range objects. To ensure that enhancements in defense were not mere artifacts of annotation biases, this work categorized results according to the quality of the annotation source.
Key findings:
Absolute performance gap: Native 3D annotations (LiDAR-verified from aiMotive [
26] and Waymo [
18]) achieved 3.6% higher baseline mAP than projected 2D, confirming that projection errors from Udacity [
15] and Texas sequences did inflate absolute metrics as documented in
Table 3.
Consistent relative improvement: Defense mechanisms yielded nearly identical gains for both annotation qualities (+9.5% vs. +9.6%, a 0.1 percentage-point difference). This indicates that feature squeezing, temperature scaling, entropy gating, and temporal voting address real perception failures regardless of how the data are annotated.
Validation of comparative claims: All key results held on the high-quality native 3D subset; temporal voting provided a +2.8% mAP gain (vs. +2.9% on combined data), ablation study rankings from
Table 13 were preserved, and ODD-stratified patterns (highway > urban > night > rainy) remained consistent.
Interpretation: The absolute mAP values in this paper may be about 2% too high because of projected annotations (which is consistent with
Table 4), but this does not alter the study’s conclusions. Defense effectiveness is independent of annotation quality, and all comparative conclusions (defense versus baseline, component ablations, ODD rankings) were substantiated by LiDAR-verified ground truth. Future work should prioritize native 3D annotations to eliminate projection artifacts; the relative performance gains and scientific conclusions, however, remain sound.
6.4. Per-Class Performance Analysis
Table 14 breaks down performance by traffic-related class. Traffic light classes were generally more robust than sign classes, but all categories benefited from the defense stack.
6.5. ODD-Specific Performance
Performance varied across ODDs, with nighttime and rainy conditions posing the greatest challenges.
Table 15 shows the results.
6.6. Impact of Natural Perturbations
Figure 4 shows performance degradation under specific perturbation types. Rain and sun glare produce the largest drops, and compound perturbations degrade performance further than any single perturbation alone.
Example interpretation: The “Sun Glare” bar in the “Rainy” subplot means:
Global context: The scene is rainy (wet roads and less visibility).
Local perturbation: Sun glare was added within the sign bounding boxes to simulate intermittent sun breaking through the clouds during changing weather.
This tested the framework’s robustness under compound conditions (rain combined with glare).
Perturbation scope:
Individual perturbations (rain, fog, sun, etc.): Test single-factor robustness;
Compound perturbations (sun+glare+headlight, etc.): Test multi-factor interactions.
MID denotes the mid-range camera (50° FoV, 0.5–50 m) and LONG the long-range camera (25° FoV, 10–200 m). Error bars show 95% confidence intervals (bootstrap, n = 1000). Significance: * p < 0.05, ** p < 0.01, *** p < 0.001; n.s. = not significant (n < 15).
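The global-plus-local composition can be illustrated with a toy image-space sketch, assuming single-channel arrays in [0, 255]; the streak and glare models below are simplified stand-ins for illustration, not the perturbation suite actually used:

```python
import numpy as np

def rain_streaks(img, density=0.005, seed=0):
    """Global perturbation: brighten short vertical streaks across the frame."""
    rng = np.random.default_rng(seed)
    out = img.astype(np.float32).copy()
    h, w = out.shape
    for _ in range(int(density * h * w)):
        x, y = rng.integers(0, w), rng.integers(0, h - 8)
        out[y:y + 8, x] = np.minimum(out[y:y + 8, x] + 80.0, 255.0)
    return out

def sun_glare(img, box, strength=100.0):
    """Local perturbation: additive radial glare restricted to a sign bbox."""
    x1, y1, x2, y2 = box
    out = img.astype(np.float32).copy()
    ys, xs = np.mgrid[y1:y2, x1:x2]
    cy, cx = (y1 + y2) / 2.0, (x1 + x2) / 2.0
    r = np.hypot(ys - cy, xs - cx)
    out[y1:y2, x1:x2] += strength * np.exp(-(r / max(x2 - x1, 1)) ** 2)
    return np.clip(out, 0.0, 255.0)

def rain_plus_glare(img, sign_box):
    """Compound condition: rainy scene globally, glare on the sign locally."""
    return sun_glare(rain_streaks(img), sign_box)
```

Applying the perturbations in sequence mirrors the evaluation design: the scene-level factor degrades the whole frame, while the sign-level factor concentrates additional corruption exactly where the detector must succeed.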
6.7. Temporal Voting Effectiveness
As shown in
Figure 5, temporal voting substantially improves recovery from transient perturbations.
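A minimal sketch of sequence-level majority voting, assuming one class decision per frame and a seven-frame window (the API is illustrative, not the framework’s code):

```python
from collections import Counter, deque

def temporal_vote(frame_labels, window=7):
    """Majority vote over a sliding window of per-frame class decisions.
    A transient flip (e.g. one glare-corrupted frame) is outvoted by its
    neighbours, at the cost of a short stabilisation delay."""
    history, smoothed = deque(maxlen=window), []
    for label in frame_labels:
        history.append(label)
        smoothed.append(Counter(history).most_common(1)[0][0])
    return smoothed
```

At an assumed 30 fps, a 5–7 frame window would span roughly 167–233 ms, which is the order of the voting delay discussed in the failure analysis: robustness to transient flips is bought with a bounded stabilisation latency.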
6.8. Adaptive Attacks Robustness
This work evaluated defense-aware attacks in which the attacker has full knowledge of the defense mechanisms: (i) defense-aware PGD uses Expectation over Transformation (EOT) to average gradients over the stochastic defense components (feature squeezing, temperature scaling), and (ii) an adaptive UAP is optimized across seven-frame windows to account for temporal voting (trained on 200 images spanning all ODDs).
The proposed defense lowered ASR from 42.1% to 26.4% (a 37% relative reduction) under adaptive PGD, as shown in
Table 16. However, this remained above the <20% threshold targeted for SAE Level 4 autonomy. This residual vulnerability motivates future work on certified defenses (
Section 7.2.3) that offer provable guarantees.
6.9. Failure Case Analysis
Even though the system achieved 79.8% mAP and reduced ASR to 18.2%, recurring failure modes remained, as summarized in
Table 17.
A Detailed Look at the Remaining 18.2% ASR
Key Insights:
Compound perturbations account for 25.9% of failures, showing that multi-factor degradation (rain + fog + night) exceeds the capacity of the defense stack.
Occlusion threshold: Performance drops sharply once occlusion exceeds roughly 60% of the sign area; above 75% occlusion, detection is unreliable (failing 95% of the time).
High-speed latency: The 167–233 ms voting delay causes up to 12.0% of highway failures (1.6% of all highway objects). At 95 km/h, a 233 ms delay corresponds to 6.1 m of vehicle travel before detection stabilizes, a measurable but not dominant time pressure given typical highway sign visibility distances (100–200 m).
Object category bias: Traffic signs fail more often (22.3% ASR) than traffic lights (14.1% ASR), likely because:
- (a) Signs appear smaller at long range;
- (b) Text-based recognition is sensitive to blur and occlusion;
- (c) Intra-class variability is higher (40+ sign types vs. three light states).
Novel scenarios: Zero-shot generalization to non-MUTCD signs (construction zones, temporary signs) yields roughly 38.2% AP, showing that the system remains vulnerable to domain shift.
Temporal inconsistency: Rapid scene changes (e.g., tunnel exits or sudden lighting transitions) cause two-to-three-frame delays before the system stabilizes, accounting for roughly 8.5% of failures.
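The high-speed latency figure above can be checked with a one-line unit conversion (speed and delay values taken from the text):

```python
# Distance travelled during the worst-case voting latency.
speed_kmh = 95.0
latency_s = 0.233                     # 233 ms voting delay
distance_m = speed_kmh / 3.6 * latency_s   # km/h -> m/s, then * seconds
print(round(distance_m, 1))           # ~6.1 m of travel before detection stabilizes
```

Against the 100–200 m at which highway signs are typically visible, 6.1 m of latency-induced travel is noticeable but leaves ample reaction margin, consistent with the "measurable but not dominant" characterization above.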
Failure Concentration by ODD
The distribution of the remaining 18.2% ASR varies significantly across operational design domains, as shown in
Table 18.
Critical Safety Implications
Out of the 548 failures, 217 (39.6%) were high-severity misclassifications:
Stop→Go: 89 cases (16.2%);
Red→Green: 76 cases (13.9%);
Speed Limit ±20: 52 cases (9.5%).
Relative to all objects, this corresponds to a 7.2% Critical Failure Rate (CFR), exceeding the SAE Level 4 autonomy target of less than 5% [
33].
10. Conclusions
This research presented a sequence-preserving, dual-FoV defense framework for traffic light and sign recognition, achieving 79.8% mAP and reducing the attack success rate from 37.4% to 18.2% (a 51% decrease) on a curated 500-sequence benchmark spanning four operational design domains. The defense stack combined feature squeezing, inference-time temperature scaling, entropy-based anomaly detection, and sequence-level temporal voting, with temporal voting contributing the largest single gain (+2.9% mAP). Risk-weighted safety metrics showed that high-severity misclassifications, such as stop→go and red→green, fell by 32%.
Several problems remain unresolved, including the 18.2% ASR that persists under adaptive attacks, latency-induced failures in high-speed environments (
Section 7.3), and performance loss under heavy perturbation. This study shows that exploiting temporal and cross-camera redundancy in an integrated defense substantially improves safety. Future research should focus on broader global datasets, formal robustness guarantees, and integration with downstream planning toward a deployable autonomous traffic sign and light detection system.