3.1. CNN-Based Image Recognition Model Construction
In this study, a convolutional neural network (CNN) was employed to perform sensitivity analysis on the fundamental construction elements of autonomous driving scenarios at urban tunnel portals. The objective was to simulate the first stage of visual perception in autonomous vehicles—namely, scene information classification. The key task at this stage is to design a feature extraction network framework suitable for the characteristics of image data from tunnel portal scenes. Considering the specific visual characteristics of the tunnel portal environment, this study integrates the pyramid pooling module of the Pyramid Scene Parsing Network (PSPNet) with the Residual Network (ResNet) architecture. The constructed network model consists of three main components: the backbone for feature extraction, the neck for multi-scale feature integration, and the head responsible for the final output and classification.
First, tunnel image data collected by vehicle-mounted cameras under autonomous driving mode from ten tunnels at different time periods were organized and pre-processed. The processed data were then fed into the network, where the backbone performed initial feature extraction to prepare for subsequent processing stages. In this study, ResNet50 was adopted as the backbone network, followed by a pyramid pooling module to capture multi-scale contextual information. The extracted global and local features were upsampled and fused through a feature fusion module (FFM), while skip connections were employed to concatenate features from different layers to enrich semantic representation. Finally, through the output network (head), the model accomplished both image segmentation and classification tasks (as illustrated in Figure 3).
The image parameter dataset was first fed into the front-end backbone of the network. According to the principle of backpropagation, when the input values are excessively large, the gradient magnitude during backward propagation also increases, which may reduce the effective learning rate. Since the parameter weights and gradients across different neural network layers can vary by several orders of magnitude, this imbalance significantly increases computational cost and search time. To address this issue, batch normalization (BN) was applied to the input data, and BN layers were also introduced into the intermediate hidden layers. After BN processing, the nonlinear representation capability of the network was enhanced, computational efficiency was optimized, and a stable learning process was ensured.
The batch normalization (BN) algorithm, applied to a mini-batch input x = {x₁, …, xₘ}, can be summarized as follows:
① Calculate the mean of the batch data: μ_B = (1/m) Σᵢ xᵢ.
② Calculate the variance of the batch data: σ_B² = (1/m) Σᵢ (xᵢ − μ_B)².
③ Normalize each input: x̂ᵢ = (xᵢ − μ_B) / √(σ_B² + ε), where ε is a small constant for numerical stability.
④ Scale and shift: yᵢ = γ·x̂ᵢ + β.
⑤ Return value: the outputs yᵢ together with the learned scale factor γ and shift factor β.
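The BN steps above can be sketched in a few lines of NumPy (a minimal illustration with our own variable names, not the implementation used in the study):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch x of shape (m, n_features)."""
    mu = x.mean(axis=0)                      # (1) batch mean
    var = x.var(axis=0)                      # (2) batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # (3) normalize
    return gamma * x_hat + beta              # (4) scale and shift

# toy mini-batch: two features whose magnitudes differ by an order of magnitude
x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
# each column of y now has (approximately) zero mean and unit variance
```

With γ = 1 and β = 0 the transform is a pure standardization; in training, γ and β are learned so the network can recover any scale it needs.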
The three-channel input data are first processed by the designed front-end backbone module ResNet50. This module employs a 7 × 7 convolution kernel with 64 filters to generate a 64-channel feature map, followed by pooling using a 3 × 3 pooling window. Subsequently, after 48 convolution operations, the number of channels expands to 2048, fully extracting the features of the data information. The data then proceed through forward propagation into the pyramid pooling module, where pooling operations are performed under 1 × 1, 2 × 2, 3 × 3, and 6 × 6 pooling windows to obtain multi-scale feature maps. Next, 1 × 1 convolution is used to reduce the dimensionality of the channels, and bilinear interpolation upsampling is applied to restore the feature maps to an appropriate size before entering the feature fusion module. In the feature fusion module, two fusion methods are typically employed, i.e., one is channel concatenation (abbreviated as Concat), and the other is pixel-wise addition followed by convolution operations. Assuming the input channels are x₁, x₂, …, xₙ and y₁, y₂, …, yₙ, the concatenation method performs feature fusion through convolution operations (Equation (7)) to enhance the model’s ability to represent features at different scales.
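The difference between the two fusion modes can be shown with a short NumPy sketch (shapes and channel counts are hypothetical examples, not the study's actual configuration):

```python
import numpy as np

def fuse_concat(x, y):
    """Channel concatenation: (C1, H, W) + (C2, H, W) -> (C1 + C2, H, W)."""
    return np.concatenate([x, y], axis=0)

def fuse_add(x, y):
    """Pixel-wise addition: requires identical shapes, keeps channel count."""
    return x + y

x = np.random.rand(64, 32, 32)   # e.g. local features
y = np.random.rand(64, 32, 32)   # e.g. upsampled global features

concat = fuse_concat(x, y)       # shape (128, 32, 32)
added = fuse_add(x, y)           # shape (64, 32, 32)
```

Concat preserves both feature sets at the cost of doubling the channel count (a following convolution then mixes them); addition is cheaper but forces the two inputs to share one channel space.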
During the forward propagation process, it is necessary to perform semantic segmentation on the output, dividing it into different categories. The choice of activation function in classification tasks depends on the number of categories: binary classification problems typically use the Sigmoid function, while multi-class classification problems employ the Softmax function. This study selects the Softmax activation function to map the output into a probability distribution of discrete categories, which aligns with the discrete frequency characteristics of the input data. Compared to regression problems that produce continuous outputs, classification modeling better fits the characteristics of the research problem and helps improve prediction accuracy. To further enhance the model’s generalization capability and stability, the mean squared error (L2 norm), commonly used in regression tasks, is introduced into the objective function as the loss function.
The definition of the Softmax function is as follows:

aⱼ = e^(zⱼ) / Σᵢ e^(zᵢ)

Here, zⱼ represents the input of the j-th neuron in the final layer, and aⱼ denotes the output of the j-th neuron in the final layer. The exponential function e^(zⱼ) amplifies the differences between inputs, while Σᵢ e^(zᵢ) is the sum of the exponentiated inputs of all neurons in the final layer, which normalizes the outputs. The purpose of using the Softmax function is to evaluate the output of the neurons in the final layer in the form of a probability distribution: the higher the output probability of a specific neuron, the more likely it corresponds to the true class.
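A numerically stable Softmax subtracts the maximum logit before exponentiation; the shift cancels in the ratio, so the result is unchanged while overflow is avoided (a minimal sketch, not the study's code):

```python
import numpy as np

def softmax(z):
    """a_j = e^{z_j} / sum_i e^{z_i}, shifted by max(z) for numerical stability."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # shifting by the max avoids overflow in exp
    return e / e.sum()

a = softmax([2.0, 1.0, 0.1])
# outputs form a probability distribution: all positive, summing to 1,
# with the largest logit receiving the highest probability
```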
The CNN architecture adopted in this study is based on the standard ResNet50–PSPNet framework and does not represent a methodological contribution. It is employed as a well-established and widely validated baseline to ensure that the observed differences in inference latency can be attributed to scene-element characteristics rather than architectural novelty or optimization bias. Similar configurations have been extensively reported in the literature for scene parsing [13,14] and semantic segmentation [15,16] tasks.
The mathematical formulations of batch normalization, Softmax activation, and feature concatenation presented in this section follow standard definitions and are included for completeness and reproducibility. In the context of this study, these components serve a specific purpose: batch normalization stabilizes intermediate feature distributions to prevent layer-wise computational imbalance; Softmax ensures consistent multi-class probability normalization across element categories; and feature concatenation enables multi-scale feature aggregation required for accurate pixel-level segmentation.
3.2. Training Dataset Construction
The training data for autonomous driving scenarios in urban tunnel portal sections must simultaneously meet two requirements: “comparable feature sensitivity” and “balanced sample distribution.”
- (1) Combinatorial Generation Rules
According to national standards, each tunnel portal must appear, with the road, tunnel portal, and traffic markings being mandatory elements. The remaining six elements—traffic signs, signboards, portal vegetation, portal structures, cars, and two-wheeled motorcycles—are set as optional. The theoretical number of combinations is 2⁶ = 64. To avoid category imbalance caused by purely static scenes, 16 combinations featuring “only vegetation/structures without vehicles or signs” are excluded. The remaining 48 logical scenarios, labeled Scene-01 to Scene-48 (see Appendix A, Table A1), form the training framework.
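The combination rule can be reproduced with itertools. Here we read "purely static" as scenes containing neither cars nor motorcycles, which is the only reading that yields exactly 16 exclusions (2⁴ choices over the remaining four elements); the element names are our own shorthand:

```python
from itertools import product

OPTIONAL = ["traffic_sign", "signboard", "vegetation", "structure", "car", "motorcycle"]

# all 2**6 = 64 on/off combinations of the six optional elements
all_combos = list(product([0, 1], repeat=len(OPTIONAL)))

def is_purely_static(combo):
    """A scene is purely static if neither dynamic vehicle class is present."""
    flags = dict(zip(OPTIONAL, combo))
    return flags["car"] == 0 and flags["motorcycle"] == 0

scenes = [c for c in all_combos if not is_purely_static(c)]
scene_labels = [f"Scene-{i + 1:02d}" for i in range(len(scenes))]
# 64 total - 16 static = 48 logical scenarios, Scene-01 ... Scene-48
```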
- (2) Annotation System and Quality Control
To meet the dual requirements of label accuracy and scalability for “feature sensitivity evaluation,” this study adopts a “semantic-instance” two-tier annotation system and establishes a three-level quality control process to ensure the mask error remains below 0.5%, thereby preventing annotation noise from interfering with the comparison of CNN inference speeds.
(1) Two-tier labeling system
① Semantic level: Format—8-bit single-channel PNG, values 0–9; 0 represents background, and 1–9 correspond sequentially to road, tunnel portal, traffic sign, traffic marking, signboard, vegetation, structure, car, and motorcycle; resolution consistent with the original image (1024 × 512), using nearest-neighbor interpolation to avoid category aliasing caused by edge anti-aliasing.
② Instance level: Unique Instance IDs (starting from 1000 and incrementing) are assigned only to the dynamic elements of “cars” and “motorcycles” to support subsequent validation of instance segmentation model extensions; the ID encoding is written into an independent 16-bit PNG layer, sharing the same filename as the semantic label file but with a different suffix for easy reading and maintenance.
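A minimal sketch of the two-tier encoding, using in-memory NumPy arrays in place of the PNG files; the class code CAR = 8 assumes the sequential 1–9 scheme above, and the painted regions are hypothetical:

```python
import numpy as np

H, W = 512, 1024
semantic = np.zeros((H, W), dtype=np.uint8)    # 8-bit single-channel label, values 0-9
instance = np.zeros((H, W), dtype=np.uint16)   # 16-bit instance-ID layer

CAR = 8          # hypothetical class code for "car" in the 1-9 scheme
next_id = 1000   # instance IDs start from 1000 and increment

# paint two hypothetical car regions: (top, left, height, width)
for top, left, h, w in [(300, 400, 60, 120), (310, 700, 40, 80)]:
    semantic[top:top + h, left:left + w] = CAR
    instance[top:top + h, left:left + w] = next_id
    next_id += 1
# instance IDs exist only where dynamic classes appear; background stays 0
```

Writing `semantic` as an 8-bit PNG and `instance` as a 16-bit PNG with a shared filename stem reproduces the paired-layer layout described above.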
(2) Three-level quality control process
Step 1: Pre-labeling
Manually draw rough polygons using LabelMe, then invoke Meta AI’s Segment Anything Model (SAM, ViT-H checkpoint) to generate initial masks. For high-contrast areas (e.g., tunnel entrances with strong backlighting), apply CLAHE enhancement before inputting to SAM to improve edge recall. The average pre-labeling time is 2.1 s per frame, with 92% of elements achieving IoU ≥ 0.85, laying a solid foundation for subsequent manual correction.
Step 2: Cross-manual verification
Two annotators independently corrected the pre-annotated results without knowing each other’s outcomes. A pixel-level IoU was used as the consistency metric: elements with IoU ≥ 0.95 were directly approved, while those with IoU < 0.95 were forced into a “dispute zone” for online collaborative redrawing by both parties until IoU ≥ 0.95 was achieved. This process increased the average IoU from 0.87 to 0.97, with disputed elements accounting for 6% and an average redrawing time of 4.5 s per element.
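The pixel-level IoU consistency check can be sketched as follows (the two masks are synthetic examples standing in for the two annotators' corrections):

```python
import numpy as np

def mask_iou(a, b):
    """Pixel-level IoU between two boolean masks."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0          # two empty masks agree trivially
    return np.logical_and(a, b).sum() / union

ann1 = np.zeros((64, 64), dtype=bool); ann1[10:50, 10:50] = True
ann2 = np.zeros((64, 64), dtype=bool); ann2[13:50, 10:50] = True

iou = mask_iou(ann1, ann2)   # 1480 / 1600 = 0.925
needs_redraw = iou < 0.95    # the "dispute zone" rule from the text
```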
Step 3: Expert Sampling Inspection
Randomly select 10% of the frames that have passed cross-validation and submit them to engineers with over 3 years of perception algorithm experience for blind review. Review criteria: ① edge deviation > 2 pixels; ② incorrect category labeling; and ③ duplicate or skipped instance IDs. Any error is counted as a defect. The defect rate must be less than 0.5% for overall approval; otherwise, expand the sampling to 20% and return for re-labeling. The measured defect rate was 0.38%, meeting the quality threshold.
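The sampling-inspection rule can be expressed as a small decision procedure (a sketch under our own naming; `has_defect` stands in for the expert's blind review of one frame):

```python
import random

def sampling_inspection(frames, has_defect, threshold=0.005, seed=0):
    """Blind-review sampling: inspect 10% of frames; if the defect rate
    reaches the threshold, expand the sample to 20%; if that sample also
    fails, return the batch for re-labeling."""
    rng = random.Random(seed)
    for rate in (0.10, 0.20):
        sample = rng.sample(frames, max(1, int(len(frames) * rate)))
        defect_rate = sum(map(has_defect, sample)) / len(sample)
        if defect_rate < threshold:
            return "approved"
    return "re-label"
```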
(3) Quality Indicators and Results
Through the three-level process of “automatic initialization—manual cross-checking—expert sampling,” the pixel-level error of 3428 element masks across 376 images was controlled within 0.5% (Table 5), providing a high-consistency, low-noise ground truth foundation for the fair comparison of CNN sensitivity models. Although the final dataset consists of 376 images, its size is sufficient for the objective of this study, which focuses on element-level inference latency comparison rather than large-scale classification generalization. Unlike accuracy-driven perception benchmarks that require extensive data diversity, the proposed sensitivity analysis relies on controlled experimental conditions and low-variance measurements of inference time.
To ensure statistical reliability, all images undergo strict screening in terms of illumination range, motion blur, occlusion, and annotation quality, effectively reducing noise that would otherwise require larger sample sizes. Each of the 48 logical scene combinations is represented by multiple frames collected across different tunnels, and inference latency is measured repeatedly under identical network and hardware configurations.
- (3) Dataset Partitioning and Augmentation
To validate the generalization capability of the feature sensitivity model while avoiding evaluation bias caused by tunnel specificity or class imbalance in scenarios, this paper implements a rigorous hierarchical partitioning of 376 frames of “golden samples” and adopts a conservative photometric augmentation scheme. This ensures that differences in CNN inference speed solely reflect the “feature category” itself, without interference from image content shifts or geometric distortions.
A two-layer sampling strategy of “intra-class balancing and tunnel isolation” is adopted:
(1) Intra-class balancing: at least 6 frames are retained for each of the 48 logical scenarios, split in an 80%/12.5%/7.5% ratio so that the category distribution of the training, validation, and test sets remains consistent with that of the overall dataset (χ² test, p = 0.21, no significant deviation).
(2) Tunnel isolation: the tunnel numbers included in the test set do not appear in the training or validation sets, achieving “tunnel-level” out-of-domain validation to prevent inflated sensitivity scores due to the model memorizing background textures.
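The two-layer sampling strategy can be sketched as follows (function names and the representation of frames as (tunnel_id, path) pairs are our own assumptions):

```python
import random

def split_frames(frames, ratios=(0.80, 0.125, 0.075), seed=0):
    """Split one scene's frames into train/val/test at the stated ratio."""
    rng = random.Random(seed)
    frames = list(frames)
    rng.shuffle(frames)
    n_train = round(len(frames) * ratios[0])
    n_val = round(len(frames) * ratios[1])
    return (frames[:n_train],
            frames[n_train:n_train + n_val],
            frames[n_train + n_val:])

def isolate_tunnels(frames, test_tunnels):
    """Tunnel isolation: frames from held-out tunnels never enter train/val."""
    test = [f for f in frames if f[0] in test_tunnels]
    rest = [f for f in frames if f[0] not in test_tunnels]
    return rest, test
```

Applying `isolate_tunnels` first and then `split_frames` per scenario on the remainder reproduces the "intra-class balancing plus tunnel isolation" order described above.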
The final division results are shown in Table 7.
The core assumption of the sensitivity evaluation is that differences in inference speed are driven solely by feature categories. Therefore, only geometry-preserving photometric augmentations are applied during the training phase, with the following parameters: brightness shift of ±10% (0.9–1.1 times linear gain); contrast adjustment of ±5% (0.95–1.05 times the slope); and additive Gaussian noise with σ = 2, truncated at ±3σ. Augmentations are applied online only during training with a probability of 0.5, while the validation and test sets retain their original pixel values. This strategy expands the sample space for illumination robustness while avoiding the edge misalignment caused by geometric transformations such as rotation, scaling, and cropping, thereby ensuring that timing differences across features in the CNN are not influenced by the coupling of deformation and computational load. Pre-experimental results indicate that, within the above augmentation range, the inference time fluctuation of ResNet50 is <0.3 ms, significantly below the threshold for notable inter-feature differences (1 ms), meeting the requirements for fair comparison. Through the combined strategy of class-balanced, domain-isolated partitioning and geometry-invariant photometric augmentation, this dataset provides an unbiased and reproducible experimental foundation for the subsequent sensitivity ranking.
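The augmentation pipeline described above can be sketched as follows (a hypothetical NumPy implementation; the study's exact code is not given):

```python
import numpy as np

def photometric_augment(img, rng):
    """Train-time photometric augmentation, applied with probability 0.5:
    brightness +/-10% (linear gain 0.9-1.1), contrast +/-5% (slope 0.95-1.05),
    and additive Gaussian noise with sigma = 2 truncated at +/-3*sigma.
    Geometry is untouched, so label masks need no corresponding transform."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:
        out *= rng.uniform(0.9, 1.1)                          # brightness gain
        mean = out.mean()
        out = (out - mean) * rng.uniform(0.95, 1.05) + mean   # contrast slope
        out += np.clip(rng.normal(0.0, 2.0, out.shape), -6.0, 6.0)  # truncated noise
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
augmented = photometric_augment(np.full((8, 8), 128, dtype=np.uint8), rng)
```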
(4) Statistical Characteristics
The dataset resolution is uniformly scaled to 1024 × 512, with the pixel proportions of the elements consistent with the real vehicle perspective (road 40%, vehicles 6%, etc.); the average brightness is 78 ± 21, covering typical tunnel transition zones; and the class imbalance ratio of 1:14 is reduced to 1:1.8 using median-frequency weighting. Annotation errors across the 376 frames are below 0.5%, meeting the fairness requirement for sensitivity comparison.
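Median-frequency weighting can be sketched as follows (the per-class pixel counts are hypothetical; in the loss, each class's contribution is multiplied by its weight):

```python
import numpy as np

def median_frequency_weights(pixel_counts):
    """w_c = median(freq) / freq_c: down-weights dominant classes (e.g. road)
    and up-weights rare ones (e.g. motorcycle), compressing the effective
    imbalance ratio seen by the loss function."""
    freq = np.asarray(pixel_counts, dtype=float)
    freq /= freq.sum()
    return np.median(freq) / freq

w = median_frequency_weights([4000, 600, 280])  # hypothetical per-class pixel counts
# rare classes receive proportionally larger loss weights
```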