Article

The TDGL Module: A Fast Multi-Scale Vision Sensor Based on a Transformation Dilated Grouped Layer

1 School of Rail Transportation, Shandong Jiaotong University, Jinan 250357, China
2 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(11), 3339; https://doi.org/10.3390/s25113339
Submission received: 18 April 2025 / Revised: 15 May 2025 / Accepted: 20 May 2025 / Published: 26 May 2025
(This article belongs to the Special Issue AI and Smart Sensors for Intelligent Transportation Systems)

Abstract

Effectively capturing multi-scale object features is crucial for vision sensors used in road object detection tasks. Traditional spatial pyramid pooling methods fuse multi-scale feature information but lack adaptability in dynamically adjusting convolution operations based on their actual needs. This limitation prevents them from fully utilizing spatial hierarchies and contextual information. To address this challenge, we propose a Transformation Dilated Grouped Layer (TDGL) module, a fast multi-scale vision sensor based on deep learning, designed to enhance both efficiency and accuracy in road target feature extraction networks. The TDGL is built upon the Global Layer Normalization Convolution (GLConv) unit, which mitigates internal covariate shift by introducing scaling and offset parameters, modifying dilation strategies, and employing grouped convolution. These improvements enable the network to distinguish features at different scales effectively while optimizing spatial information processing and reducing computational costs. To validate its effectiveness, we integrate the TDGL module into the backbone of several YOLO models, forming the TDGL Net feature extractor. The experimental results obtained on the BDD100K dataset show that the mAP of TDGL Net reaches 40.3% with around 3.1M parameters. The inference speed of TDGL Net after transformation optimization reaches 58 FPS, which meets the requirement for the real-time detection of road obstacle targets by autonomous vehicles.

1. Introduction

Multi-scale vision sensors, as a fundamental and critical area of research, drive technological innovation in fields such as smart cities, autonomous driving, and intelligent surveillance. With the continuous development of convolutional neural networks (CNNs) [1], the performance of target detection based on vision sensors has improved significantly, but the computational burden has also grown substantially and cannot satisfy the real-time requirements of autonomous driving. The candidate-region-based two-stage detectors R-CNN [2], Fast R-CNN [3], and Faster R-CNN [4] can be regarded as high-performance versions of a CNN and have achieved higher detection accuracy on datasets such as KITTI [5], VOC 2012 [6], and MS COCO [7]. However, the main challenge they face is the strict limitation on input image size. Spatial pyramid pooling has been proposed to solve this problem, and a combination of classifiers can effectively detect and localize target objects. To improve detector accuracy, a deeper network is usually adopted. Ma et al. [8] introduced a pyramid dilated convolution module in parallel with the RESA module to enlarge the model’s receptive field and enrich global spatial feature information, but this substantially reduced detection speed. Malaiarasan et al. [9] used a deep convolutional neural network containing 12 nested processing layers for object detection. Wen et al. [10] proposed an MSADark module with global attention and the location-attention-weighted feature fusion network LAFFN to enhance network feature representation for target perception in autonomous driving; the main deficiency of the module is its heavy computational cost. Niu et al. [11] proposed a lightweight method based on an improved YOLOv8 that significantly improves detection performance through multidimensional optimization. The method introduces an efficient multi-scale attention mechanism combined with the SPD-Conv module to alleviate the loss of fine-grained target information. However, in real-time detection scenarios, this model still needs to be optimized to enable multi-target feature extraction in dynamic scenes. Figure 1 illustrates road object detection using a dual label assignment strategy, which balances precision and computation. Liu et al. [12] carried out a detailed analysis of the batch normalization layer and integrated low precision, range batch normalization, and block floating point technology to effectively reduce the running-time overhead of batch normalization. All these methods rely on increasing convolution depth, which greatly increases the computational burden of the model and reduces its inference speed.
To achieve an optimal balance between accuracy and real-time performance, this paper studies the operating mechanism of deep neural networks and focuses on improving and optimizing the basic convolutional units, considering that fine-tuning the parameters of any layer of a convolutional neural network may cause a significant shift in the data distribution of subsequent layers. After multiple parameter updates and training iterations, the changes in input distribution in later layers intensify [13]. A shifting data distribution not only makes the model behave unstably during training, affecting the quality and speed of convergence, but also alters the scale and variance of the data and tends to cause gradients to vanish or explode in multilayer networks, especially when saturated activation functions such as Sigmoid or Tanh are used. Therefore, effective strategies should be adopted to stabilize the data distribution and optimize the network structure to improve the performance and generalization ability of the model.
This paper builds a lightweight multi-scale vision sensor feature extraction network with better performance by using an optimized GLConv convolutional module as its basic unit and a multi-branch architecture design to obtain multiple receptive fields. In the network, the TDGL module replaces the spatial pyramid module of YOLO [14] to build the TDGL Net feature extractor, which is highly portable and can be used in multiple target detection models. The main contributions of this paper are as follows:
  • A new standard convolutional unit, GLConv, is designed, and non-zero values are added to the batch variance to enhance the stability of feature information extraction.
  • A multi-branch TDGL detection module based on normalized convolutional units is proposed. The TDGL adopts an integrated and more flexible convolutional kernel mechanism, which improves its performance in extracting feature information from targets of multiple sizes with limited computational resources and enhances the detection capability of visual sensors for multi-scale targets on roads.
  • Experiments are carried out on the BDD100K benchmark, and the effectiveness of our network is verified.
The rest of the main sections of this article are arranged as follows: Related work is shown in Section 2. Section 3 elaborates on the TDGL network used in multi-scale vision sensors and provides an in-depth analysis of the proposed strategy. Comparison experiments, ablation experiments, and relevant analyses are described in Section 4. Finally, the conclusion is presented in Section 5.

2. Related Work

Capturing multi-scale target features effectively is essential to improving the performance of vision sensors. To improve their feature extraction ability, the spatial pyramid pooling (SPP) proposed by He et al. [15] fuses features of different levels through multi-scale pooling operations, significantly improving the model’s adaptability to scale changes. However, the parallel pooling structure of SPP has high computational complexity and struggles to meet real-time requirements. Therefore, Liu et al. [16] introduced a simplified version of SPP (SPPF) into their model. SPPF enhances the original spatial pyramid pooling by transforming the computationally intensive parallel pooling layers into a more efficient serial structure, improving accuracy while increasing processing speed. SPPF employs a progressive pooling strategy that begins by pooling a large area before gradually reducing the pooling window size. This approach minimizes redundant operations while effectively capturing features at various scales. Nevertheless, the ability to adjust the receptive field is still limited by the fixed pooling window. Max pooling is calculated as follows:
$z_i = \mathrm{MaxPool2d}(z_{i-1},\ k=5,\ s=1,\ p=2)$
where $z_i$ is the result of the $i$-th pooling operation; $z_0 = Y$, where $Y$ is the feature map after the initial dimension-reducing convolution; and $p$ is the padding value, that is, the number of zero layers added to the edge of the input feature map to maintain the size of the output feature map.
We apply adaptive maximum pooling followed by adaptive average pooling to the output feature map in succession, as detailed below:
$a_{\max} = \mathrm{AdaptiveMaxPool2d}(z_3,\ o=1), \qquad a_{\mathrm{avg}} = \mathrm{AdaptiveAvgPool2d}(a_{\max},\ o=1)$
where $a_{\max}$ and $a_{\mathrm{avg}}$ are the processed results, $o$ is the output size, and $o = 1$ indicates that adaptive pooling converts input feature maps of any size into 1 × 1 feature maps.
The results of all pooling operations are concatenated along the channel dimension to obtain the concatenated feature map $f$:
$f = \mathrm{Concat}(Y,\ z_1,\ z_2,\ z_3,\ a_{\max},\ a_{\mathrm{avg}})$
We convolve $f$ and apply the activation function to obtain the final output feature map:
$Z[h, w, c] = \sum_{c' \in C_{\mathrm{in}}} W(c)[0, 0, c'] \times f[h, w, c'] + B[c], \qquad Y[h, w, c] = \mathrm{SiLU}(Z[h, w, c]), \qquad C_{\mathrm{in}} = c \times 6 + c/2$
where $Z[h, w, c]$ is the result of the convolution operation, $Y[h, w, c]$ is the value of the output feature map, $c$ is the output channel index, $C_{\mathrm{in}}$ is the number of channels in the feature map, $W(c)[0, 0, c']$ is the element of the weight matrix of the $c$-th output channel corresponding to input channel $c'$, $f[h, w, c']$ is the value of the feature map at the corresponding position, and $B[c]$ is the bias term of output channel $c$.
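To make the serial pooling chain above concrete, the following is a minimal PyTorch sketch of an SPPF-style block. The channel split, the expansion of the 1 × 1 adaptive-pooling outputs to the spatial size of the feature map, and the layer names are illustrative assumptions rather than the exact implementation used in the baseline:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SerialPoolingBlock(nn.Module):
    """Sketch of a serial spatial pyramid pooling block: a 1x1 convolution
    reduces channels, three successive 5x5 max-pooling layers (stride 1,
    padding 2) reuse each other's output, two adaptive pooling results are
    expanded and concatenated with the rest, and a final 1x1 convolution
    with SiLU fuses the channels."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_mid = c_in // 2                                   # assumed channel reduction
        self.reduce = nn.Conv2d(c_in, c_mid, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.fuse = nn.Conv2d(c_mid * 6, c_out, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)                                  # Y in the equations
        z1 = self.pool(y)                                   # z_1
        z2 = self.pool(z1)                                  # z_2
        z3 = self.pool(z2)                                  # z_3
        a_max = F.adaptive_max_pool2d(z3, 1)                # a_max, 1x1 map
        a_avg = F.adaptive_avg_pool2d(a_max, 1)             # a_avg, 1x1 map
        # Expand the 1x1 maps so channel concatenation is valid (an assumption;
        # the original layout is not specified in the text).
        a_max = a_max.expand_as(y)
        a_avg = a_avg.expand_as(y)
        f = torch.cat((y, z1, z2, z3, a_max, a_avg), dim=1)  # concatenated map f
        return self.act(self.fuse(f))                        # Z, then SiLU
```

A quick shape check: `SerialPoolingBlock(256, 256)(torch.randn(1, 256, 20, 20)).shape` gives `torch.Size([1, 256, 20, 20])`, since the 5 × 5 pooling with stride 1 and padding 2 preserves the spatial resolution.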
Some researchers have tried to optimize multi-scale feature fusion through dynamic convolution strategies. For example, Chen et al. [17] proposed atrous spatial pyramid pooling (ASPP), which uses dilated convolution to flexibly adjust receptive fields, but its fixed dilation rates limit its adaptability to complex scenes. Similarly, the Receptive Field Block network (RFB-Net) designed by Liu et al. [12] has an improved ability to detect small targets through multi-branch dilated convolution, but the balance between computational efficiency and accuracy remains unresolved.
In the process of feature fusion in multi-scale vision sensors, as feature information propagates deeper into the network, the distribution of the fused feature maps becomes more complex under the influence of multiple factors, such as network parameter updates, activation functions, and pooling operations, which may aggravate internal covariate shift. Huang et al. [18] proposed a method for measuring internal covariate shift using the EM distance, derived its upper and lower bounds, and combined the output with adjustable parameters to further constrain the data distribution and reduce information loss. This approach brings the performance of the detection model close to that of two-stage schemes, sometimes even exceeding them, but it greatly reduces processing speed.
To solve these problems, this paper reconstructs the multi-scale feature convolution module and designs a new normalized convolution unit. By introducing normalized parameters and a targeted expansion strategy, multi-scale feature fusion is realized efficiently and spatial information processing is optimized to reduce the computational overhead.

3. Methods

The road detection task for vision sensors involves multiple detection targets, which places a large computational burden on edge devices. This study aims to enhance the performance of single-stage detection models without increasing the computational load. To improve training efficiency and accelerate convergence, we avoid simply increasing the convolution depth of the backbone network. Instead, we focus on optimizing the convolution units, implementing batch normalization, adjusting the weight of historical data when updating the running mean and variance, and incorporating learnable affine transformation parameters during normalization. The TDGL feature extraction module utilizes a multi-branch convolutional network structure with GLConv as its foundational unit. Each branch employs a different spatial dilation rate to capture features at multiple scales. Below, we provide specific parameter information and testing solutions. This structure is inspired by the Receptive Field Block, which allows for the efficient extraction of structural and texture information from multi-scale objects in images by adjusting the receptive field of the detection module. As a result, it demonstrates better performance than traditional spatial pyramid networks. The TDGL module is integrated at the end of the YOLOv8 backbone network, creating an efficient TDGL Net feature extractor, as illustrated in Figure 2. In the figure, a sketch of the structure of the TDGL module is presented in the light yellow box on the left-hand side, while the structure of the CSPLayer in the backbone network is shown in the light yellow box on the right-hand side. The blue part in the top right corner contains the structural diagram of DarknetBottleneck, the dark green box in the middle contains the backbone structure of the network, and the bright green box at the bottom displays the neck and detection head of the network.

3.1. TDGL Feature Extraction Module

The TDGL module significantly enhances the feature extraction capabilities of convolutional neural networks and consists of a multi-branch convolutional block. Its structure comprises two main components: a multi-branch convolutional layer with various kernel sizes and an extended pooled convolution layer, which is added at the end. This design effectively implements a multi-receptive field structure using multiple kernel sizes, outperforming fixed-size shared convolution kernels.
The specific design of the TDGL module is inspired by Inception-ResNet V2 [19]. First, to reduce the number of channels in the feature map, a bottleneck structure with 1 × 1 GLConv convolutional layers is employed in each branch. Second, two stacked 3 × 3 convolutional layers replace larger convolution kernels, reducing the depth of nonlinear layers and the computational complexity. Finally, a 1 × n convolutional layer followed by an n × 1 convolutional layer is used instead of the traditional n × n convolutional layer, with a shortcut connection at the end. Assuming the input of a branch has C channels, each pixel at position (y, x) of the C-channel output map is computed as follows:
$O_{y,x} = \sum_{i=-k_h}^{k_h} \sum_{j=-k_w}^{k_w} W_{k_h+i,\ k_w+j} \cdot I_{y+i,\ x+j}$
where $x$ and $y$ represent the x and y axes of the output map; $k_h$ and $k_w$ specify the size of the convolution kernel (the filter); and $I_{y+i,\,x+j} \in \mathbb{R}^C$ and $O_{y,x} \in \mathbb{R}^C$ are the input and output, respectively. The bias term of the convolution is ignored in this equation for simplicity of presentation. The primary goal of employing this convolutional layer structure is to generate higher-resolution feature maps that capture a broader range of contextual information while maintaining a constant number of parameters. This design is also evident in the single-stage detector SSD, which enhances its detection speed effectively.
The detailed structure of the TDGL module is illustrated in Figure 3. Following the research of Liu et al. [12], dilated convolution is employed to simulate the eccentricity-dependent receptive fields of the human visual cortex. This design enables the module to capture a diverse range of features across various spatial levels, thereby enhancing the detection of objects of different sizes across contextual scenes. In each branch, convolutional layers with specific kernel sizes are combined with corresponding dilation layers, and the outputs of these branches are fused to create a richer feature representation. The three branches extract features at different scales using different convolution kernel sizes or different expansion rates. Finally, the output feature maps of the branches are merged along the channel dimension through the Concat (channel concatenation) operation, as shown in Equation (4). Compared with element-by-element addition, Concat does not disturb the feature distribution of each branch and directly retains all the output information, avoiding information loss. The kernel size and expansion rate exhibit a positive correlation with the size and eccentricity of the visual cortex’s receptive field. Adjusting the expansion rate effectively enlarges the kernel size from $k \times k$ to $k_\varepsilon$ without increasing the number of parameters or the computational load. The expression is as follows:
$k_\varepsilon = k + (k - 1)(r - 1)$
where $k_\varepsilon$ represents the size of the equivalent dilated convolution kernel and $r$ represents the expansion rate. The formula for calculating the receptive field of the current layer is as follows:
$RF_i = RF_{i-1} + (k - 1) \times \prod_{j=1}^{i-1} \mathrm{Stride}_j$
where $RF_i$ is the receptive field of the current layer, $RF_{i-1}$ is the receptive field of the previous layer, and $\prod_{j=1}^{i-1} \mathrm{Stride}_j$ is the product of the strides of all previous layers.
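As a concrete illustration of these two formulas (the numbers are chosen purely for illustration, and the equivalent kernel size $k_\varepsilon$ is substituted for $k$ in the receptive field formula): a 3 × 3 kernel with expansion rate $r = 5$, as used in the third branch described below, gives
$k_\varepsilon = 3 + (3 - 1)(5 - 1) = 11$
and, assuming the previous layer has a receptive field of $RF_{i-1} = 7$ with all preceding strides equal to 1,
$RF_i = 7 + (11 - 1) \times 1 = 17$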
Finally, each module merges the original input with the output from feature fusion through a residual connection, forming a spatial convolution array that enhances network information transfer and training convergence. The TDGL module consists of three parallel branches. In this paper, the default stride of the convolution operations is set to 1, and the reduction factor for the number of channels in the middle layer is set to 8 to minimize parameters. The scaling factor for shortcut connections is also set to 0.1. Each branch has a different convolution expansion rate: the last layer of Branch 1 uses a 3 × 3 convolution with an expansion rate of 1, focusing on smaller image regions to preserve detailed features. Branch 2’s final convolutional layer is set to an expansion rate of 3, allowing it to capture medium-scale features such as partial targets or backgrounds. Meanwhile, the last layer of Branch 3 has an expansion rate of 5, enabling it to capture large-scale global information and better understand the overall structure and background of the image.
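The following PyTorch sketch mirrors this three-branch layout: a 1 × 1 bottleneck per branch, stacked 3 × 3 convolutions ending in a dilated convolution with rate 1, 3, or 5, channel concatenation, and a residual shortcut. The number of stacked layers per branch, the simplified glconv helper, which term carries the 0.1 scale, and the channel arithmetic are assumptions made for illustration rather than the exact published configuration:

```python
import torch
import torch.nn as nn

def glconv(c_in: int, c_out: int, k: int = 3, dilation: int = 1) -> nn.Sequential:
    """Simplified stand-in for the GLConv unit: Conv2d + BatchNorm2d + Leaky ReLU
    (see the GLConv sketch in Section 3.2 for the normalization details)."""
    pad = dilation * (k - 1) // 2
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(c_out, eps=1e-4, affine=True),
        nn.LeakyReLU(0.01, inplace=True),
    )

class TDGL(nn.Module):
    """Three-branch dilated module: each branch reduces channels with a 1x1
    bottleneck, stacks 3x3 convolutions, and ends with a dilated 3x3 convolution
    (dilation 1, 3, or 5). Branch outputs are concatenated along channels,
    fused by a 1x1 convolution, and combined with a shortcut of the input."""

    def __init__(self, c_in: int, c_out: int, reduction: int = 8, scale: float = 0.1):
        super().__init__()
        c_mid = max(c_in // reduction, 8)
        self.branches = nn.ModuleList([
            nn.Sequential(
                glconv(c_in, c_mid, k=1),               # bottleneck
                glconv(c_mid, c_mid, k=3),              # stacked 3x3
                glconv(c_mid, c_mid, k=3, dilation=d),  # dilated 3x3 (1 / 3 / 5)
            )
            for d in (1, 3, 5)
        ])
        self.fuse = glconv(3 * c_mid, c_out, k=1)
        self.shortcut = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = torch.cat([branch(x) for branch in self.branches], dim=1)  # Concat fusion
        # Residual combination; scaling the fused branch output by 0.1 follows the
        # Inception-ResNet/RFB convention and is an assumption here.
        return self.shortcut(x) + self.scale * self.fuse(f)
```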
The TDGL module is designed to capture multi-scale contextual information by combining more flexible and efficient mechanisms. It allows us to avoid replacing the SPPF module with a deeper or denser network that incurs significant computational costs. This paper proposes a deep learning-based single-level framework for a visual sensor feature extraction network and integrates the TDGL module to enhance the lightweight feature extraction backbone. This approach maintains fast detection while improving accuracy. The TDGL module boasts high compatibility and has been successfully applied to multiple versions of YOLO, with the primary modification being the replacement of the spatial pyramid module at the end of the backbone with TDGL.
The SPPF module is situated in layer 9 of the backbone network. Its input is typically processed through three max pooling layers with a kernel size of 5, following a 1 × 1 convolution. In this paper, we remove the max pooling layers and restrict the convolution kernel size to less than 3 to reduce the computational burden. Max pooling decreases the size of the feature map by selecting the maximum value from each pooling window, which can result in a loss of detailed information. Additionally, gradient propagation is limited to the maximum position within the pooling window, leading to an uneven gradient flow and potentially hindering model training. By carefully selecting the stride and padding strategy, we can retain more information from the input feature map. This approach allows for more uniform gradient propagation back to the earlier layers of the network, facilitating efficient parameter tuning.

3.2. GL Feature Extraction Convolution

Changes in the distribution of network layer activation inputs can slow learning and destabilize the model’s performance during training. The GLConv module is designed to enhance the speed and accuracy of feature extraction in convolutional neural networks while minimizing the impact of internal covariate shift on training stability. Its structure is illustrated in Figure 4.
We improve the feature output after the two-dimensional convolution layer by incorporating an integrated batch normalization layer. A small non-zero value is added to the batch variance to enhance numerical stability, and the parameters are adjusted accordingly. Batch normalization standardizes the features of each instance using the mean and variance statistics of the batch data, ensuring that the mean is 0 and the variance is 1. This keeps the inputs to each layer relatively stable and reduces internal covariate shift. Relying on batch-level statistics also improves network convergence, accelerating training and improving the model’s generalization ability. Input image feature values $x = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$ are normalized using the following steps:
$\hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu_B \right)^2$
where $m$ represents the number of samples in a batch, $x$ is the input feature, $i$ indexes the samples, $\mu_B$ is the mean of the data features of the training batch, $\sigma_B^2$ is the variance of the data features, $\epsilon$ is a small positive value that prevents the denominator from being zero during normalization, and $\hat{x}^{(i)}$ is the result of the normalization.
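As a small numeric illustration (values chosen only for illustration), for a batch of $m = 4$ scalar features $x = (1, 2, 3, 4)$ and $\epsilon = 10^{-5}$:
$\mu_B = 2.5, \quad \sigma_B^2 = 1.25, \quad \hat{x}^{(1)} = \frac{1 - 2.5}{\sqrt{1.25 + 10^{-5}}} \approx -1.342$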
The value of $\epsilon$ usually defaults to $10^{-5}$. Considering that BDD100K is a dataset containing various target classes with complex features, conditions such as bad weather or uneven lighting in the images may result in high or low variance in the input data. In this paper, the value of $\epsilon$ is adjusted to $10^{-4}$. Increasing $\epsilon$ improves numerical stability and helps to deal with the differences in image data captured under different environmental conditions. To minimize the impact of noise on the normalization process, we made slight adjustments to the momentum value used for computing the running average. While normalized data enhance training stability by maintaining zero mean and unit variance, these properties may not always align with the feature expression needs of the current layer. Therefore, this paper sets the affine Boolean parameter to true and introduces scaling and offset parameters to perform an affine transformation on the normalized data, allowing for the recovery or retention of certain characteristics of the original data. The transformation formula is as follows:
$y_i = \gamma \hat{x}_i + \beta$
where γ is a scaling parameter that directly affects the scale of the gradient, β is an offset parameter that regulates the bias of the gradient, and y i represents the final output feature.
After the convolution operation, the ReLU activation function is applied to the output feature maps of each convolutional layer and the ReLU forward propagation formula is as follows:
$x_j^l = f\!\left( \sum_{i \in M_j} x_i^{l-1} * w_{ij}^l + b_j^l \right)$
$y_k(i, j) = \begin{cases} x_k^l(i, j), & x_k^l(i, j) > 0 \\ 0, & x_k^l(i, j) \le 0 \end{cases}$
where $x_j^l$ represents the $j$-th feature map in layer $l$, $f(\cdot)$ is the nonlinear activation function, $M_j$ is the set of input feature maps, $*$ represents the convolution operation, $w_{ij}^l$ is the weight matrix of the convolution kernel, $b_j^l$ is the bias value, and $y_k(i, j)$ is the output.
For backward propagation, ReLU is treated as one of the layers of the network. Taking $x^l$ as the output of layer $l$, the partial derivative $\delta^l$ of the loss function $L$ with respect to the output of layer $l$ is formulated as follows:
$\delta^l = \frac{\partial L}{\partial x^l} = \delta^{l+1} \cdot \frac{\partial\, \mathrm{ReLU}(x^l)}{\partial x^l} = \delta^{l+1} \cdot \begin{cases} 1, & \text{if } x^l > 0 \\ 0, & \text{if } x^l \le 0 \end{cases}$
One major drawback of ReLU is that the gradient is zero when the input is negative. This leads to neuron death [20], which prevents these neurons from being updated during the training process. To solve this problem, Leaky ReLU [21] was introduced into the convolutional unit. Leaky ReLU allows a small negative slope when the input x is negative, thus alleviating the problem of neuron death, and its forward and backward propagation formulas are as follows:
$\mathrm{Leaky}(x) = \begin{cases} x, & x > 0 \\ \mathrm{leak} \cdot x, & x \le 0 \end{cases}$
$\delta^l = \delta^{l+1} \cdot \frac{\partial\, \mathrm{Leaky}(x^l)}{\partial x^l} = \delta^{l+1} \cdot \begin{cases} 1, & \text{if } x^l > 0 \\ \mathrm{leak}, & \text{if } x^l \le 0 \end{cases}$
where $\cdot$ denotes multiplication and $\mathrm{leak}$ is a small positive constant, usually around 0.01 [22].
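Putting the pieces of this subsection together, a minimal GLConv unit could be sketched as follows. The momentum value, the group count, and the default kernel size are assumptions (the text only states that the momentum was slightly adjusted):

```python
import torch
import torch.nn as nn

class GLConv(nn.Module):
    """Sketch of the GLConv unit: an (optionally grouped and dilated) 2D convolution,
    batch normalization with an enlarged epsilon (1e-4 instead of the default 1e-5)
    and learnable affine parameters (gamma, beta), followed by Leaky ReLU."""

    def __init__(self, c_in: int, c_out: int, kernel_size: int = 3, stride: int = 1,
                 dilation: int = 1, groups: int = 1, leak: float = 0.01):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        self.conv = nn.Conv2d(c_in, c_out, kernel_size, stride=stride, padding=padding,
                              dilation=dilation, groups=groups, bias=False)
        # affine=True enables the gamma/beta transform y_i = gamma * x_hat_i + beta;
        # momentum=0.03 (the weight given to new batch statistics in the running mean
        # and variance) is an assumed value for the "slightly adjusted" momentum.
        self.bn = nn.BatchNorm2d(c_out, eps=1e-4, momentum=0.03, affine=True)
        self.act = nn.LeakyReLU(negative_slope=leak, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))
```

For example, `GLConv(64, 128, kernel_size=3, dilation=3)` preserves the spatial resolution while enlarging the receptive field, matching the dilated branches of the TDGL module.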

3.3. Training Strategies

The TDGL multi-scale vision sensor was developed using the PyTorch framework and integrates various components from the YOLO multi-version open-source repository. This paper’s training strategy primarily follows that of the baseline model, which includes Mosaic data augmentation, anchor box selection and matching strategies, multi-scale prediction, and a specific loss function. The TDGL network’s loss function comprises three main components: border loss, inter-class loss, and target loss, with the CIoU loss function specifically used for the border loss. This loss function takes into account the centroid distance, overlap ratio, and aspect ratio of the detection frame and is defined as follows:
$\mathrm{CIoU} = \mathrm{IoU} - \left( \frac{\rho^2\!\left(b,\ b^{gt}\right)}{c^2} + \alpha v \right)$
$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$
$v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^2$
where $b$ is the center point of the predicted box, $c$ is the diagonal length of the smallest enclosing box covering the predicted box and the ground-truth box, $b^{gt}$ is the center point of the ground-truth box, $\alpha$ is the trade-off coefficient, $v$ measures the consistency of the aspect ratio, and $\rho$ is the Euclidean distance between the center points of the ground-truth box and the predicted box. $w$, $h$, $w^{gt}$, and $h^{gt}$ denote the width and height of the predicted box and the ground-truth box, respectively. The AdamW optimizer is employed during training to adjust the model weights and minimize the loss function. The learning rate is set to 0.01 for v8 and v5 and 0.001 for v3, enhancing the effectiveness of the TDGL module’s embedding. Only the v8 model utilizes the official weights; the others do not use initial weights. More detailed experimental parameters are provided in the Experiments section.
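For reference, a direct translation of the CIoU term above into PyTorch might look like the following sketch; the (x1, y1, x2, y2) box format and the small stability constants are assumptions, and the border loss would then be 1 minus this value:

```python
import math
import torch

def ciou(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU between boxes given as (x1, y1, x2, y2) tensors of shape (N, 4):
    IoU minus the normalized center distance and the aspect-ratio term alpha * v."""
    # Intersection area
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)

    # Union area and IoU
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Squared center distance rho^2 and squared diagonal c^2 of the enclosing box
    cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency v and trade-off coefficient alpha
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return iou - (rho2 / c2 + alpha * v)   # border loss = 1 - ciou(...)
```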

4. Experiments

To comprehensively evaluate the improvement in performance created by the proposed TDGL feature extraction module when used in YOLO series object detection models, this section tests these models on the BDD100K [5] dataset. In this experiment, we randomly selected 4220 images as the training set and 600 images as the validation set. Various original algorithm models and their improved versions were compared. Before the experiment, a small number of training classes and two categories of non-signal traffic light targets were removed to balance the dataset. A traffic scene dataset was constructed, featuring 10 target types: person, rider, car, bus, truck, bike, tl_green, tl_red, tl_none, and traffic_sign. The visualized data information from the BDD100K dataset is presented in Figure 5. Here, “quantity” represents the number of categories in the dataset, while the x and y axes depict the distribution of the labeled box positions. The label box size distribution graph in Figure 5c clearly shows that target sizes are primarily concentrated around 200 pixels, indicating a high prevalence of small targets within the dataset.

4.1. Training Settings

The experimental environment used in this article is the PyTorch 1.10.1 deep learning framework and the Python 3.8 programming language. The graphics card used in the hardware setup is the NVIDIA GeForce GTX 1660 Ti, the CPU is an Intel(R) Core(TM) i7-9700, and the system memory size is 16.0 GB. The detailed hyperparameter settings used for training are shown in Table 1, and the rest of the parameters are consistent with the official model.
The models’ detection performance is evaluated using the mean average precision (mAP), which provides a comprehensive assessment of the algorithms by calculating their mean accuracy at various confidence thresholds. The primary evaluation metrics used include the recall rate, mAP@0.5, and mAP@0.5:0.95. The key difference between these mAP metrics lies in their IoU thresholds; mAP@0.5 considers only an IoU threshold of 0.5, while mAP@0.5:0.95 averages the mAP across IoU thresholds from 0.5 to 0.95. Additionally, speed of performance is measured using frames per second (FPS), indicating the number of images the model can detect per second. The mAP calculation formula is as follows:
$\mathrm{mAP} = \frac{1}{C} \sum_{i=1}^{C} AP_i$
where $C$ is the number of categories and $AP_i$ denotes the average precision of the $i$-th category, which is calculated as follows:
$AP_i = \int_0^1 P_i(R_i)\, \mathrm{d}R_i$
where $P_i$ and $R_i$ denote the precision and recall of detection category $i$. The methods used to calculate precision and recall are shown in Formulas (9) and (10), respectively:
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
where FP denotes the number of negative samples incorrectly identified as positive, TP denotes the number of positive samples correctly identified, and FN denotes the number of positive samples incorrectly identified as negative; recall thus measures the proportion of actual positive samples that are correctly detected.
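These definitions translate directly into code; the following minimal Python sketch (with illustrative counts, not results from the paper) shows the computation:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def mean_average_precision(ap_per_class: list) -> float:
    """mAP is the mean of the per-class average precision values AP_i."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative values only:
p, r = precision_recall(tp=120, fp=30, fn=40)          # p = 0.8, r = 0.75
m = mean_average_precision([0.55, 0.41, 0.32, 0.25])   # mAP = 0.3825
```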

4.2. Experiments into Replacing Backbone

The TDGL module was integrated into a series of YOLO releases. Specifically, in v5, TDGL took the place of the spatial pyramid structure at layer 10 at the end of the backbone network, with the same backbone depth as in Figure 2. In v6, TDGL is embedded in ERBlock5 of the backbone network, one layer below the RepVGG Block and RepBlock. Due to hardware limitations, TDGL was configured in YOLOv3-tiny for testing, and the standard v3 structure was not evaluated; its replacement location differs from the others, being situated in layers 8 and 10 of the backbone. Additionally, a replacement was made in layer 1 of the HEAD section by removing the original convolution layer. The weight files chosen for the experiments are based on the n (nano) variants, which offer a balanced trade-off between size and detection performance.
Table 2 shows the detection results when replacing the GLConv in TDGL with a standard Conv + BN in v6 and v8, as well as the test results after integrating the TDGL module into the YOLO series. For a fair comparison, this paper maintains consistency in the training parameters used for the models before and after the improvement, and all experiments are conducted in the same environment. During the experimental process of integrating the TDGL module into YOLOv3, v5, and v6, this paper uses the same training strategy, including consistent data augmentation methods, optimizer selection, and learning rate scheduling. The original models and improved models used the same training strategy and hyperparameter settings during the training process.
The data indicate that the improved models achieve higher detection accuracy, with only a slight increase in the number of parameters and floating-point computations. The module constructed using a standard Conv + BN is similar to the original model in terms of its number of parameters, but its detection results are slightly lower than those of the original model. The improvement to v3 reduces its number of parameters and computations, yielding a lighter model. By integrating the TDGL layer, the improved model YOLOv8-TDGL outperforms the other models with a 40.3% mAP while maintaining a real-time speed of 58 frames per second. Its accuracy is even higher than that of the advanced two-stage model Faster R-CNN [4]. The feature information extraction capability of TDGL is superior to that of the SPPF module, which relies on max pooling layers. Its performance also exceeds that of the v6 model, which deepens the base backbone network, while retaining an excellent processing speed.
The improved model was trained alongside the original v3, v5, v6, and v8 models for 100 epochs on the BDD100K dataset. The change curve of the loss function is depicted in Figure 6, illustrating the model’s convergence during training. Notably, all loss values in the improved models show a reduction compared to their original models, indicating that the TDGL Net feature extractor enhances the networks’ ability to learn more complex and abstract features. This reduction in loss helps decrease the deviation between the predicted and actual bounding boxes, thereby improving the model’s generalization capability. Specifically, the loss function value for YOLOv8-TDGL, which achieved the best detection performance, dropped to 0.68. The loss function value for the v6 model also significantly decreased after the improvements, likely due to the original model’s inability to fully learn the features of the image targets during training.
Figure 7 shows the detection results for five representative scenes (Sunny, Tunnel, Night, Snow, and Rainy) before and after model optimization. Blue arrows indicate misdetected objects, while red arrows mark missed detections. The original v5 model exhibits a higher incidence of missed detections for small targets, such as traffic signs and distant vehicles, and it mistakenly identifies buildings as traffic signs. The same issues are observed in the v6 model. In contrast, the improved models effectively address these problems, enhancing detection accuracy. Notably, YOLOv8-TDGL demonstrates the most precise localization of target edges and overall detection performance. The integration of the TDGL Net feature extractor allows the model to more accurately locate targets, resulting in fewer false positives and missed detections. Additionally, it provides a higher confidence level for detected targets, significantly improving the models’ overall detection performance.
Multiple sets of weights were used to generate heat maps for the same road image before and after enhancement to illustrate the TDGL module’s improvement of the vision sensor’s feature extraction efficiency. The results are presented in Figure 8. The intensity of the red color in the heat map indicates the level of attention the model pays to specific regions, with darker red signifying greater computational weight and a more significant impact on detection outcomes. In the original v3 plot, the red areas are small and light, indicating that the model only sporadically focuses on a limited portion of the target, while neglecting the center of the image. After integrating TDGL, the red region covering the target is larger, demonstrating that the model more accurately concentrates on the feature information of road targets. These results confirm that the TDGL module enhances the extraction and learning of target feature information, effectively improving the model’s detection performance.

4.3. Experiments Comparing Our Approach to SOTA Methods

To evaluate the performance of the improved model, this paper experimentally compares the YOLOv8 model with the embedded TDGL module against various mainstream target detection models on the BDD100K dataset. The results are presented in Table 3. YOLOv8-TDGL demonstrates an outstanding detection performance across multiple indicators. Compared to the SSD, which also employs a single-stage strategy, YOLOv8-TDGL shows significant improvements in various metrics. Notably, its mAP@0.5 surpasses that of the two-stage method Fast R-CNN, which utilizes a deeper network, and it significantly outperforms other methods in terms of precision.
Overall, the detection frame rate of the single-stage YOLOv8 model reaches 58 FPS, far exceeding that of the two-stage detection models. Although RetinaNet, which uses ResNet-18 as its base network, exhibits faster inference speeds due to it having fewer convolutional layers, YOLOv8-TDGL maintains a competitive edge. The recall metrics for v3 and v5 are lower than those of the two-stage approach, likely due to the fixed number of anchor boxes used for target prediction. This design and matching strategy may lack flexibility, resulting in poor alignment between real targets and predefined anchor boxes, especially when dealing with objects of varying scales and shapes. Consequently, this leads to an increased number of missed detections.

4.4. Ablation Experiments

This section includes a series of ablation experiments. This study investigated the effects of the expansion rate, module depth, channel reduction factor, and activation function on the detection performance of the normalized convolution units employed in the TDGL module. Experiments were conducted on a v8 network, with results presented using various evaluation metrics. Table 4 displays the experimental results for different convolutional kernel sizes. A total of 15 groups of expansion rates were compared to assess how varying sizes influence the feature extraction capabilities of the normalized convolution units. Notably, these experimental improvements do not alter the model’s parameter count. Three of them, 1\2\6, 1\2\5, and 1\3\6, produced unsatisfactory performance test indicators and their results are not shown in the table. The data in the table show that the Leaky ReLU activation function produces the best detection results when the expansion rate is set to 1, 3, or 5, the three-branch depth is 3, 3, and 4, and the channel reduction factor is 4.
On a sunny day, the details of the near targets are clear, and the small receptive field, with an expansion rate of 1, can accurately capture these details, such as the text on a license plate and the pattern of a traffic sign. Distant vehicle targets occupy a small area in the image, and the large receptive field of expansion rate 5 captures the vehicle’s overall shape and positional information. The expansion rate of 3 in the medium receptive field strikes a balance between local details and global information, which helps to differentiate between the target and the background and reduces background interference, such as the misdetection of buildings. On the other hand, the light inside the tunnel is dim, and the contour of near vehicles may not be clear enough for detection. A small receptive field can focus on nearby areas and enhance the detection of the vehicles’ contours. At the tunnel’s exit, the light varies greatly, and a large receptive field can capture a wider range of information about the light variations, which helps the model adapt to such variations and maintains its stable detection performance. In the night scene, headlights and pedestrian edges generate noise, and a small receptive field can locate these areas more accurately and suppress the effect of that noise through appropriate processing. At night, the overall light is weak, but large obstacles may create obvious light intensity distribution regions in the image, meaning the large receptive field can capture information about these regions to realize the localization of large obstacles.
Simply increasing the receptive field by raising the expansion rate does not enhance model performance. While a larger expansion size extracts a broader range of features, it also introduces excessive redundant pixel information from the background. This can cause the response region to spread across the entire feature map, impairing the model’s ability to accurately localize targets and increasing its overall computational demands. The Leaky ReLU activation function helps maintain a smooth gradient interval, facilitating easier backpropagation and gradient updates, which ultimately improves the model’s performance.

5. Summary and Conclusions

In this paper, we propose the TDGL, a fast multi-scale vision sensor module based on deep learning and designed for accurate and efficient road object detection. Instead of deepening the backbone network, we introduce the Transformation Dilated Grouped Layer feature extraction module, which adopts a custom lightweight three-branch structure. This design applies three different expansion rates at the ends of the branches to effectively capture both detailed and global information, and the normalized convolution unit enhances the model’s robustness. This multi-scale feature extraction method is particularly well suited to the complexities of autonomous driving environments, which often feature intricate backgrounds and targets of varying scales. By incorporating the TDGL module into the lightweight YOLO family of models, we significantly enhanced the performance of the detection model on the BDD100K dataset. The proposed TDGL-based model surpasses leading two-stage detectors in accuracy while maintaining the high-speed inference advantage of lightweight architectures, making it an ideal vision sensor for real-time autonomous driving applications.

Author Contributions

F.Z. designed the project, proposed the main design ideas, and worked out all the details of the model improvements. L.X. was responsible for the design of the experimental program, including the dataset selection, the training and testing of the model, and the writing. Z.W. contributed to the concept and design of the study and to the statistical analysis of the results. All authors discussed the results, commented on the manuscript, and approved the final manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Beijing Natural Science Foundation under grant 8252019, the Jiangxi Provincial Natural Science Foundation under grant No. 20232ABC03A07, and the Natural Science Foundation of China under grant U24A20277.

Institutional Review Board Statement

All data used in this study were obtained from legitimate and publicly available sources and did not violate any individual’s or organization’s privacy or intellectual property rights. We have adopted strict anonymization and de-identification measures when handling data involving personal privacy. We are willing to share data to an appropriate extent in order to comply with relevant laws, regulations, and ethical guidelines. We confirm that no part of the research process raised ethical or moral concerns.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon reasonable request from the corresponding author.

Conflicts of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

References

  1. Zhan, J.; Yang, Y.; Jiang, W.; Jiang, K.; Shi, Z.; Zhuo, C. Fast multi-lane detection based on cnn differentiation for adas/ad. IEEE Trans. Veh. Technol. 2023, 72, 15290–15300. [Google Scholar] [CrossRef]
  2. Song, F.; Zhong, H.; Li, J.; Zhang, H. Multi-point rcnn for predicting deformation in deep excavation pit surrounding soil mass. IEEE Access 2023, 11, 124808–124818. [Google Scholar] [CrossRef]
  3. Duth, S.; Vedavathi, S.; Roshan, S. Herbal leaf classification using rcnn, fast rcnn, faster rcnn. In Proceedings of the IEEE 2023 7th International Conference on Computing, Communication, Control and Automation (ICCUBEA), Pune, India, 18–19 August 2023; pp. 1–8. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  5. Yu, F.; Xian, W.; Chen, Y.; Liu, F.; Liao, M.; Madhavan, V.; Darrell, T. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv 2018, arXiv:1805.04687. [Google Scholar]
  6. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  7. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  8. Ma, S.; Jiang, H.; Chang, L.; Zheng, C. Based on attention mechanism and characteristics of the polymerization of the lane line detection. J. Microelectron. Comput. 2022, 39, 40–46. [Google Scholar] [CrossRef]
  9. Malaiarasan, S.; Ravi, R.; Maheswari, D.; Rubavathi, C.Y.; Ramnath, M.; Hemamalini, V. Towards enhanced deep cnn for early and precise skin cancer diagnosis. In Proceedings of the IEEE 2023 International Conference on Networking and Communications (ICNWC), Chennai, India, 5–6 April 2023; pp. 1–7. [Google Scholar]
  10. Wen, H.; Tong, M. Autopilot based on a semi-supervised learning situations of target detection. J. Microelectron. Comput. 2023, 40, 22–36. [Google Scholar] [CrossRef]
  11. Niu, S.; Xu, X.; Liang, A.; Yun, Y.; Li, L.; Hao, F.; Bai, J.; Ma, D. Research on a Lightweight Method for Maize Seed Quality Detection Based on Improved YOLOv8. IEEE Access 2024, 12, 32927–32937. [Google Scholar] [CrossRef]
  12. Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  13. Awais, M.; Iqbal, M.T.B.; Bae, S.H. Revisiting internal covariate shift for batch normalization. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 5082–5092. [Google Scholar] [CrossRef] [PubMed]
  14. Huang, T.Y.; Lee, M.C.; Yang, C.H.; Lee, T.S. Yolo-ore: A deep learning-aided object recognition approach for radar systems. IEEE Trans. Veh. Technol. 2023, 72, 5715–5731. [Google Scholar] [CrossRef]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  16. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  17. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  18. Huang, Y.; Yu, Y. An internal covariate shift bounding algorithm for deep neural networks by unitizing layers’ outputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8465–8473. [Google Scholar]
  19. Wang, J.; Li, X.; Wang, J. Energy saving based on transformer models with leakyrelu activation function. In Proceedings of the IEEE 2023 13th International Conference on Information Science and Technology (ICIST), Cairo, Egypt, 8–14 December 2023; pp. 623–631. [Google Scholar]
  20. Zhang, T.; Yang, J.; Song, W.; Song, C. Research on improved activation function trelu. Small Microcomput. Syst. 2019, 40, 58–63. [Google Scholar]
  21. El Mellouki, O.; Khedher, M.I.; El-Yacoubi, M.A. Abstract layer for leakyrelu for neural network verification based on abstract interpretation. IEEE Access 2023, 11, 33401–33413. [Google Scholar] [CrossRef]
  22. Deng, G.; Zhao, Y.; Zhang, L.; Li, Z.; Liu, Y.; Zhang, Y.; Li, B. Image classification and detection of cigarette combustion cone based on inception resnet v2. In Proceedings of the IEEE 2020 5th International Conference on Computer and Communication Systems (ICCCS), Shanghai, China, 15–18 May 2020; pp. 395–399. [Google Scholar]
  23. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
  24. Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. Autoassign: Differentiable label assignment for dense object detection. arXiv 2020, arXiv:2007.03496. [Google Scholar]
  25. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  26. Zheng, Z.; Ye, R.; Wang, P.; Ren, D.; Zuo, W.; Hou, Q.; Cheng, M.M. Localization distillation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9407–9416. [Google Scholar]
  27. Liu, Y.; Liu, R.; Wang, S.; Yan, D.; Peng, B.; Zhang, T. Video face detection based on improved ssd model and target tracking algorithm. J. Web Eng. 2022, 21, 545–568. [Google Scholar] [CrossRef]
  28. Gao, P.; Tian, T.; Zhao, T.; Li, L.; Zhang, N.; Tian, J. Double fcos: A two-stage model utilizing fcos for vehicle detection in various remote sensing scenes. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4730–4743. [Google Scholar] [CrossRef]
Figure 1. Illustration of road object detection framework.
Figure 2. Improved one-stage network architecture.
Figure 3. Network structure of TDGL feature extraction module.
Figure 4. GLConv structure in terms of normalized convolutional units.
Figure 5. Visualization of BDD100K dataset. (a) Number of targets in each category. (b) Distribution diagram of label frame center points. (c) Distribution diagram of label frame sizes.
Figure 6. Loss function value change curve.
Figure 7. Comparison of multi-scale detection results.
Figure 8. Comparison of model heat map effects.
Table 1. Hyperparameter settings.
Hyperparameter | Values
epochs | 100
optimizer | Auto
optimizer weight decay | 0.0005
initial learning rate | 0.01
final learning rate | 0.01
imgsz | 640
box loss gain | 7.5
cls loss gain | 0.5
dfl loss gain | 1.5
warmup epochs | 3.0
warmup momentum | 0.8
warmup bias lr | 0.10
mosaic augmentation | 10
image HSV saturation augmentation | 0.7
image HSV value augmentation | 0.4
image HSV hue augmentation | 0.015
number of images per batch | 16
Table 2. Comparison of improved results for multi-class models.
Methods | Conv | #Param | FLOPs | Precision/% | Recall/% | mAP@0.5/% | FPS
YOLOv3-tiny | - | 12.1M | 18.9G | 38.7 | 21.3 | 19.7 | 61
YOLOv3-TDGL | GLConv | 9.7M | 18.2G | 44.2 | 21.9 | 21.0 | 55
YOLOv5 | - | 2.5M | 7.1G | 36.6 | 27.3 | 25.7 | 50
YOLOv5-TDGL | GLConv | 2.6M | 7.2G | 42.3 | 27.6 | 26.6 | 43
YOLOv6 | - | 4.2M | 11.8G | 39.7 | 26.2 | 25.1 | 64
YOLOv6-TDGL | Conv + BN | 4.3M | 11.9G | 38.6 | 25.0 | 24.8 | 54
YOLOv6-TDGL | GLConv | 4.4M | 11.9G | 39.3 | 26.3 | 26.6 | 53
YOLOv8 | - | 3.0M | 8.1G | 55.5 | 35.5 | 39.0 | 65
Faster-RCNN [4] | - | 37.3M | 92.8G | 26.4 | 33.4 | 26.8 | 27
YOLOv8-TDGL | Conv + BN | 3.0M | 8.1G | 55.1 | 34.3 | 38.5 | 57
YOLOv8-TDGL | GLConv | 3.1M | 8.2G | 55.9 | 37.3 | 40.3 | 58
Table 3. Comparative experiments with SOTA object detection methods.
Methods | Backbone | #Param | FLOPs | Precision/% | Recall/% | mAP@0.5/% | FPS
ATSS [23] | ResNet-50 | 30.7M | 92.1G | 24.6 | 35.4 | 22.9 | 34
AutoAssign [24] | ResNet-50 | 31.4M | 95.2G | 28.3 | 34.8 | 27.0 | 33
YOLOv3-tiny | Tiny-Darknet | 12.1M | 18.9G | 38.7 | 21.3 | 19.7 | 61
Faster-RCNN [4] | ResNet-50 | 37.3M | 92.8G | 26.4 | 33.4 | 26.8 | 27
CenterNet [25] | ResNet-50 | 25.8M | 45.6G | 29.9 | 37.4 | 29.0 | 32
Localization Distillation [26] | ResNet-18 | 10.1M | 18.2G | 22.6 | 30.4 | 20.7 | 38
YOLOv5 | CSPDarknet | 2.5M | 7.1G | 36.6 | 27.3 | 25.7 | 50
SSD300 [27] | VGG | 21.6M | 31.2G | 24.3 | 31.9 | 18.6 | 45
FCOS [28] | ResNet-50 | 27.2M | 38.9G | 27.1 | 39.7 | 25.6 | 36
Generalized Focal Loss | ResNet-50 | 23.8M | 25.8G | 27.0 | 34.3 | 25.3 | 29
RetinaNet | ResNet-18 | 12.2M | 16.8G | 24.6 | 33.7 | 22.7 | 40
Ours | CSPDarknet | 3.1M | 8.2G | 55.9 | 37.3 | 40.3 | 58
Table 4. Ablation experiment.
Groups | Expansion Rate | Relu | Softmax | L_relu | Depth | Channel | Precision/% | Recall/% | mAP@0.95/% | mAP@0.5/%
1 | | | | | 3\3\4 | 4 | 55.5 | 35.5 | 19.3 | 39.0
2 | 1\4\6 | | | | 3\3\4 | 4 | 51.1 | 38.1 | 19.4 | 39.8
3 | 1\4\6 | | | | 3\3\4 | 4 | 57.3 | 36.3 | 19.6 | 39.9
4 | 1\4\6 | | | | 3\3\4 | 4 | 51.7 | 37.6 | 19.9 | 39.4
5 | 1\2\4 | | | | 3\3\4 | 4 | 55.7 | 35.5 | 19.1 | 38.9
6 | 1\2\4 | | | | 3\3\4 | 4 | 56.3 | 36.7 | 19.3 | 39.4
7 | 1\2\4 | | | | 3\3\4 | 4 | 56.2 | 37.5 | 19.3 | 39.8
8 | 1\3\5 | | | | 3\3\4 | 4 | 53.7 | 37.2 | 19.2 | 39.3
9 | 1\3\5 | | | | 3\3\4 | 4 | 54.2 | 37.2 | 19.0 | 39.1
10 | 1\4\6 | | | | 4\4\5 | 8 | 54.9 | 35.3 | 18.7 | 38.2
11 | 1\2\4 | | | | 4\4\5 | 8 | 53.3 | 34.4 | 18.2 | 37.3
12 | 1\2\4 | | | | 4\4\5 | 16 | 54.9 | 35.3 | 18.9 | 39.4
13 | 1\3\5 | | | | 3\3\4 | 16 | 50.5 | 31.3 | 17.9 | 35.8
14 | 1\3\5 | | | | 4\4\5 | 8 | 56.2 | 37.1 | 19.2 | 40.0
15 | 1\3\5 | | | | 3\3\4 | 4 | 55.9 | 37.3 | 19.7 | 40.3