CenterPNets: A Multi-Task Shared Network for Traffic Perception

The importance of panoramic traffic perception tasks in autonomous driving is increasing, so shared networks with high accuracy are becoming increasingly important. In this paper, we propose a multi-task shared perception network, called CenterPNets, that can perform the three major detection tasks of traffic perception—target detection, driveable area segmentation, and lane detection—in one pass, and we propose several key optimizations to improve the overall detection performance. First, this paper proposes an efficient detection head and segmentation head based on a shared path aggregation network to improve the overall reuse rate of CenterPNets, together with an efficient multi-task joint training loss function to optimize the model. Secondly, the detection head branch uses an anchor-free mechanism to automatically regress target location information, improving the inference speed of the model. Finally, the segmentation head branch fuses deep multi-scale features with shallow fine-grained features, ensuring that the extracted features are rich in detail. CenterPNets achieves an average detection accuracy of 75.8% on the publicly available large-scale Berkeley DeepDrive dataset, with intersection-over-union of 92.8% and 32.1% for driveable areas and lane areas, respectively. Therefore, CenterPNets is a precise and effective solution to the multi-task detection problem.


Introduction
In recent years, the rapid development of embedded systems and neural networks has made autonomous driving a popular field in computer vision, where panoramic traffic perception systems play a crucial role in autonomous driving. Research has shown that vehicle onboard camera image processing enables scene understanding, including road target detection, driveable area detection, and lane detection, which greatly reduces overhead compared to the traditional approach of using LIDAR and millimeter wave radar to establish the vehicle's surroundings.
The traffic panorama perception system's detection precision and decision-making speed significantly influence the vehicle's judgment and determine the safety of autonomous vehicles. However, actual vehicle driver-assistance systems, such as the Advanced Driver Assistance System, have limited computing power and are expensive. Therefore, achieving a good balance between detection accuracy and model complexity from a practical application perspective is a challenge for decision-makers.
Current target detection can be broadly divided into one-stage detection models and two-stage detection models. The two-stage detection approach usually starts by acquiring candidate regions and then performs regression prediction on those regions to ensure detection accuracy. However, this step-by-step detection approach is not friendly to embedded systems. The end-to-end, one-stage detection model has the advantage of fast inference speed and is gaining more attention in the field of detection; the SSD [1] series, the YOLO [2] series, and similar detectors regress bounding boxes directly from the feature maps.

To sum up, the main contributions of this research are: (1) This paper proposes an effective end-to-end shared multi-task network structure that can jointly handle three important traffic sensing tasks: lane detection, driveable area segmentation, and road target detection. The network's encoders and decoders are shared to fully exploit the correlation between each task's semantic features, which can help the network reduce model redundancy. (2) The detection head adopts an anchor-free mechanism to directly regress the target key point information, size, and offset, without the need for pre-clustering anchor box ratios and tedious subsequent processing, thus enhancing the overall inference speed of the network. (3) In the segmentation head section, similar features from the shared detection task are used, and it is proposed to fuse multi-scale, deep semantic information with shallow features so that the feature information extracted for the segmentation task is rich in fine-grained information, thus enhancing the detail segmentation capability of the model.

Network Architecture
This paper proposes a multi-task traffic panorama perception architecture that can be jointly trained, called CenterPNets. As shown in Figure 1, the structure mainly contains encoders, decoders, and task-independent detection heads to handle the corresponding detection tasks, and there are no redundant parts between the modules, which reduces computational consumption to a certain extent.
In the encoder part, feature extraction is the core structure of the network, which directly determines the accuracy of network detection. Many modern networks currently extract features using networks that have good detection performance on the ImageNet dataset. One of the most traditional deep networks, Darknet, combines ResNet [19] features to ensure excellent feature representation while avoiding the gradient issues that come with overly deep networks. CenterPNets uses CSPDarkNet as the backbone, which combines the advantages of the CSPNet and SPP [20] modules to maximize the difference in gradient combination; its use of gradient stream splitting and merging prevents different layers from learning duplicate gradient information. As a result, the backbone network of CenterPNets can extract crucial feature information while lowering the network's computational cost.
The feature map extracted by the encoder is passed to the neck structure of the network. The Feature Pyramid Network (FPN) module [21] is a feature-extractor design that generates multi-scale feature maps to obtain richer information. However, the limitation of FPN is that feature information is passed along a uni-directional flow. As a result, the CenterPNets neck network uses the PANet module, which adds a bottom-up feature pyramid behind the FPN layer. Through its structural properties, it effectively compensates for the fact that FPN only enhances the semantic information of the feature pyramid and lacks localization information.
A. Anchor-free detection head
As shown in Figure 2, in the detection head section, CenterPNets integrates information from the P3_out, P4_out, and P5_out multi-level feature maps of the neck network at the same resolution in order to obtain multi-level semantic features, followed by pyramid pooling and attention mechanisms to reinforce the relevant feature information, which is recovered by upsampling to a feature map at 1/4 of the input image resolution. CenterPNets uses an anchor-free mechanism [22] for direct regression prediction, eliminating the need for K-means clustering to determine pre-defined anchor box proportions and tedious NMS post-processing, allowing direct regression of key point heatmaps, size predictions, and offset predictions, thus improving the overall inference speed of the network.
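As a concrete illustration of the anchor-free pipeline described above, the following NumPy sketch decodes a keypoint heatmap, size map, and offset map into bounding boxes. The function name, score threshold, and 3×3 local-maximum rule are illustrative assumptions, not the paper's exact post-processing.

```python
import numpy as np

def decode_centers(heatmap, sizes, offsets, stride=4, threshold=0.5):
    """Sketch of anchor-free decoding: every heatmap cell that is a local
    maximum above `threshold` becomes a detection; its box is recovered from
    the predicted size and sub-pixel offset, scaled by the output stride."""
    H, W = heatmap.shape
    boxes = []
    for y in range(H):
        for x in range(W):
            v = heatmap[y, x]
            if v < threshold:
                continue
            # a 3x3 local-maximum check replaces NMS over dense anchor boxes
            patch = heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if v < patch.max():
                continue
            w, h = sizes[y, x]
            ox, oy = offsets[y, x]
            cx, cy = (x + ox) * stride, (y + oy) * stride
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, v))
    return boxes
```

Because peaks are extracted directly from the heatmap, no anchor-box ratios need to be pre-clustered and no NMS pass over thousands of candidate boxes is required.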

Keypoint heatmap: Assume that the input image is I ∈ R^(W×H×3), where W and H are the width and height of the input image, C is the number of detected categories, and R is the output stride. In this paper, only the car category label is detected, so C = 1. We use the default output stride of R = 4 and scale the output prediction by R. For a ground-truth category labeling point P ∈ R^2, a low-resolution equivalent point P̃ = ⌊P/R⌋ is used instead. A heatmap of key points Ŷ ∈ [0, 1]^((W/R)×(H/R)×C) is generated during model training. Ŷ = 1 means that a target to be measured is detected at (x, y), and Ŷ = 0 indicates a background area. For each ground-truth keypoint P̃, we splat it onto the heatmap using a Gaussian kernel Y_xyc = exp(−((x − P̃_x)² + (y − P̃_y)²)/(2σ_p²)), where σ_p is an object size-adaptive standard deviation [23].
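The Gaussian splatting step can be sketched as follows; this is a minimal NumPy illustration (the function name and signature are assumptions), with overlapping kernels resolved by an element-wise maximum.

```python
import numpy as np

def splat_keypoint(heatmap, center, sigma):
    """Render one ground-truth keypoint as a 2D Gaussian on the heatmap,
    keeping the element-wise maximum where Gaussians overlap."""
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cx, cy = center
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)  # resolve overlaps by max, not sum
    return heatmap
```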
When two Gaussian kernels overlap, the element-wise maximum is taken in this paper [24]. The difference between the predicted and real heatmaps is measured by the pixel-wise focal loss [25]:

L_k = −(1/N) Σ_xyc { (1 − Ŷ_xyc)^α log(Ŷ_xyc), if Y_xyc = 1; (1 − Y_xyc)^β (Ŷ_xyc)^α log(1 − Ŷ_xyc), otherwise }

where α and β are hyperparameters of the focal loss and N is the number of key points. In our experiments, we used α = 2, β = 4 [23].
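A NumPy sketch of this pixel-wise focal loss follows; it is a hedged re-implementation of the penalty-reduced form cited above, not the authors' code.

```python
import numpy as np

def centernet_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
    """Pixel-wise focal loss between predicted and ground-truth heatmaps,
    normalized by the number of keypoints N (locations where gt == 1)."""
    pred = np.clip(pred, eps, 1 - eps)          # numerical safety for log
    pos = (gt == 1)
    pos_loss = ((1 - pred) ** alpha * np.log(pred))[pos].sum()
    neg_loss = ((1 - gt) ** beta * pred ** alpha * np.log(1 - pred))[~pos].sum()
    n = max(pos.sum(), 1)
    return -(pos_loss + neg_loss) / n
```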
Size prediction: Assume that the kth bounding box has coordinates (x1_k, y1_k, x2_k, y2_k). Its width and height are s_k = (x2_k − x1_k, y2_k − y1_k), and the coordinates of its center point are p_k = ((x1_k + x2_k)/2, (y1_k + y2_k)/2). We calculate the predicted size loss using an L1 loss only at the center of each target:

L_size = (1/N) Σ_k |Ŝ_(p_k) − s_k|

where Ŝ is the size prediction of the network.
Offset prediction: The output feature map contains quantization errors when remapped to the original image size, because the decoder outputs features at a resolution that is one-fourth that of the original input image. As a result, an extra local offset is predicted for each key point to compensate for the inaccuracy:

L_off = (1/N) Σ_P |Ô_P̃ − (P/R − P̃)|

where Ô_P̃ denotes the offset predicted by the network, P denotes the target center-point coordinates, and R denotes the heatmap scaling factor.
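The size and offset terms can be illustrated together; the sketch below (names and array layout are assumptions) evaluates both L1 losses only at the downsampled ground-truth centers.

```python
import numpy as np

def l1_losses_at_centers(pred_size, pred_off, gt_boxes, stride=4):
    """L1 size and offset losses, evaluated only at the (downsampled) centers
    of the ground-truth boxes; pred_* maps are shaped (H/R, W/R, 2)."""
    size_loss, off_loss = 0.0, 0.0
    for x1, y1, x2, y2 in gt_boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2   # box center in pixels
        px, py = cx / stride, cy / stride       # low-resolution center
        ix, iy = int(px), int(py)               # integer heatmap cell
        gt_size = np.array([x2 - x1, y2 - y1])
        gt_off = np.array([px - ix, py - iy])   # sub-pixel remainder
        size_loss += np.abs(pred_size[iy, ix] - gt_size).sum()
        off_loss += np.abs(pred_off[iy, ix] - gt_off).sum()
    n = max(len(gt_boxes), 1)
    return size_loss / n, off_loss / n
```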

B. Segmentation heads incorporating fine-grained features
As shown in Figure 3, the segmentation head outputs three category labels, namely background, road driveable area, and road lane lines. There is a correlation between the feature information of the detection task and the segmentation task, so CenterPNets shares the same feature mapping between the two and fuses the upsampled features with the shallow, fine-grained P1 feature layer, which is rich in localization information, thus enhancing the network's ability to segment image edge details. Finally, we recover the output features to the original image resolution (W, H, 3), storing the probability value of each pixel's category label.
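The fusion idea can be sketched in NumPy as a nearest-neighbour upsample of the deep feature map followed by channel concatenation; a real segmentation head would use learned upsampling and convolutions, so this is only a shape-level illustration.

```python
import numpy as np

def fuse_with_shallow(deep_feat, shallow_feat):
    """Upsample the deep multi-scale feature map to the shallow (P1)
    resolution by nearest-neighbour repetition, then concatenate along the
    channel axis.  Arrays are shaped (C, H, W)."""
    c, h, w = deep_feat.shape
    _, H, W = shallow_feat.shape
    fy, fx = H // h, W // w
    up = deep_feat.repeat(fy, axis=1).repeat(fx, axis=2)  # upsample deep map
    return np.concatenate([up, shallow_feat], axis=0)     # fuse channels
```

The concatenated tensor carries both deep semantic and shallow fine-grained information, which is what lets the head segment edge details more precisely.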

Loss Function for Joint Multi-Task Training
CenterPNets trains the end-to-end network using a multi-task loss function, which sums two components to form the total loss:

L_all = α L_det + β L_seg

where L_det is the target detection loss and L_seg is the semantic segmentation loss, and α, β are balance factors that keep the detection task in the same order of magnitude as the segmentation task.
The detection loss combines the three detection-head terms:

L_det = L_k + λ_size L_size + λ_off L_off

where L_size and L_off use the ordinary L1 loss function to regress the width/height and center-point offsets, respectively, and λ_size, λ_off weight the two terms. The heatmap loss L_k is calculated by the focal loss.
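A minimal sketch of how the terms compose; the λ, α, and β values below are illustrative placeholders, not the paper's tuned settings.

```python
def total_loss(l_k, l_size, l_off, l_seg,
               lam_size=0.1, lam_off=1.0, alpha=1.0, beta=1.0):
    """Joint objective: the detection loss combines heatmap, size, and offset
    terms; the total sums detection and segmentation losses with balance
    factors alpha and beta (all weights here are placeholder values)."""
    l_det = l_k + lam_size * l_size + lam_off * l_off
    return alpha * l_det + beta * l_seg
```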
A multi-class mixture loss is used for the segmentation of background, driveable areas, and lane lines. Semantic segmentation is difficult due to the uneven distribution of the data. Therefore, CenterPNets combines the Tversky loss L_Tversky [26] and the focal loss L_Focal [27] to predict the class to which each pixel belongs. L_Tversky performs well on the class-imbalance problem and is optimized for score maximization, while L_Focal aims to minimize classification errors between pixels and focuses on hard labels:

L_Tversky = Σ_c [1 − TP_p(c) / (TP_p(c) + α FN_p(c) + β FP_p(c))]    (8)

L_Focal = −(1/N) Σ_n Σ_c g_n(c) (1 − p_n(c))^γ log(p_n(c))    (9)

where TP_p(c), FN_p(c), and FP_p(c) are the true positives, false negatives, and false positives of class c, p_n(c) is the predicted probability that pixel n belongs to class c, and g_n(c) is the ground-truth annotation of class c for pixel n. C is the number of classes in Equation (8), and N is the total number of pixels in the input image in Equation (9).
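Hedged NumPy sketches of the two components, assuming soft predictions p and one-hot labels g of shape (N, C); the α/β/γ defaults are generic, not the paper's settings.

```python
import numpy as np

def tversky_loss(p, g, alpha=0.5, beta=0.5, eps=1e-7):
    """Per-class Tversky loss; alpha and beta weight false negatives and
    false positives, addressing class imbalance."""
    tp = (p * g).sum(axis=0)
    fn = ((1 - p) * g).sum(axis=0)
    fp = (p * (1 - g)).sum(axis=0)
    return (1 - tp / (tp + alpha * fn + beta * fp + eps)).sum()

def focal_seg_loss(p, g, gamma=2.0, eps=1e-12):
    """Multi-class focal loss that down-weights easy pixels and focuses
    training on hard ones."""
    p = np.clip(p, eps, 1 - eps)
    return -((g * (1 - p) ** gamma * np.log(p)).sum()) / p.shape[0]
```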

Dataset Setting
The experiments in this paper use image data from the Berkeley DeepDrive dataset (BDD100K) to train and validate the model. Existing multi-task networks are trained on the three-task annotations of BDD100K to allow performance comparison with other models. In the target detection task, "car, truck, bus, train" are merged into a single category label, "car," as MultiNet, YOLOP, and HybridNets can only detect vehicle category labels. Basic augmentations such as rotation and scaling are used in image pre-processing.

Evaluation Indicators
In the traffic target detection task, performance is evaluated with mAP50, which averages the per-category average precision at a single IoU threshold of 0.5:

AP = Σ_i (r_(i+1) − r_i) · p_interp(r_(i+1))

where r_1, r_2, . . . , r_n are the recall values, in ascending order, at which the interpolated precision p_interp changes.
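An all-point interpolated AP can be computed as follows; this is a generic sketch, not the exact BDD100K evaluation code.

```python
def average_precision(recalls, precisions):
    """All-point interpolated AP: precision is first made monotonically
    non-increasing in recall, then AP is the area under the resulting
    precision-recall steps.  mAP50 averages this over classes at IoU 0.5."""
    pts = sorted(zip(recalls, precisions))
    r = [0.0] + [x for x, _ in pts]
    p = [0.0] + [y for _, y in pts]
    # interpolate: p_interp(r) = max precision at any recall >= r
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))
```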
In the semantic segmentation task, the IoU metric is used to evaluate driveable-area and lane-line segmentation. In this paper, mIoU denotes the average IoU over classes, reported alongside the IoU of individual classes. To better illustrate the validity of the experiments, accuracy is added as an additional criterion:

IoU = |B_p ∩ B_t| / |B_p ∪ B_t|

where B_p is the predicted bounding box and B_t is the ground-truth bounding box.
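For axis-aligned boxes, the IoU computation is straightforward; a minimal sketch (the function name is an assumption):

```python
def box_iou(bp, bt):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(bp[0], bt[0]), max(bp[1], bt[1])
    ix2, iy2 = min(bp[2], bt[2]), min(bp[3], bt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((bp[2] - bp[0]) * (bp[3] - bp[1])
             + (bt[2] - bt[0]) * (bt[3] - bt[1]) - inter)
    return inter / union if union > 0 else 0.0
```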

Experimental Analysis of Multi-Tasking Networks
In this section, we first train the model end-to-end, then compare it with other representative models in the corresponding tasks and illustrate the effect of each module on the network and the effectiveness of multi-task network learning by means of ablation experiments and freeze-out training, respectively.

Road Target Detection Tasks
The CenterPNets algorithm is tested for vehicle target detection on the BDD100K dataset and compared with MultiNet, Faster R-CNN, YOLOP, and HybridNets; the experimental results are shown in Table 1. CenterPNets uses detection accuracy (mAP50) and recall as evaluation metrics. The CenterPNets model outperforms the MultiNet and Faster R-CNN networks in terms of detection accuracy but falls short of YOLOP and HybridNets. Since YOLOP uses a network structure based on the anchor-box mechanism of YOLOv4, it achieves a high recall in feature regression by generating dense anchor boxes, which allows the network to perform target classification and bounding-box coordinate regression directly on this basis; HybridNets uses a similar mechanism. CenterPNets, on the other hand, uses an anchor-free mechanism, which results in average regression-box quality because the anchor-free mechanism only predicts at locations close to the center of the target.
In this study, we use the same image and video data to verify the inference speed of the model. As can be seen from Table 2, the number of model parameters in this study has increased compared to the benchmark algorithm, due to the deeper network structure used to ensure detection performance. Secondly, the overall structure of the network in this study is more integrated and eliminates tedious post-processing, which optimizes the network to a certain extent. As can be seen from Table 2, the CenterPNets algorithm reaches an inference speed of 8.709 FPS in unified image inference, compared to 5.719 FPS for HybridNets, a roughly 1.5-times improvement, as the anchor-free mechanism eliminates tedious post-processing.
To further verify the reliability of the experiments, CenterPNets was tested uniformly using video data for inference, and it can be seen that its inference performance remains very good. To further evaluate the effectiveness of CenterPNets in real road-traffic scenarios, images of road scenes at different times of day are selected from the BDD100K test set for testing. The YOLOP, HybridNets, and CenterPNets algorithms for the traffic target recognition task at various times of day are visually compared in Figure 4. The first row displays the YOLOP results, the second row the HybridNets results, and the third row the CenterPNets results. Orange circles denote false negatives, and red circles false positives. The CenterPNets shared network architecture is a further improvement over YOLOP and HybridNets. As shown, YOLOP and HybridNets both exhibit a certain degree of missed and false vehicle detections, while the CenterPNets algorithm shows better vehicle detection capability and more accurate bounding boxes in different environments.

A. Travelable area segmentation tasks
CenterPNets uses the IoU metric to evaluate the driveable area segmentation capability and is compared with the algorithms MultiNet, PSPNet, YOLOP, and HybridNets, whose experimental results are shown in Table 3.
The driveable portion of the image and the background are the only things the CenterPNets model needs to differentiate between. Comparing the five driveable-area detection networks, Table 3 demonstrates that the CenterPNets algorithm had the highest mIoU performance of 92.8%, an improvement of 1.3% and 2.3% over YOLOP and HybridNets, respectively. Due to the feature correlation between the road-vehicle detection task and the driveable-area segmentation task, the CenterPNets shared network can effectively exploit the information correlation between the two; secondly, CenterPNets first fuses deep multi-scale features and then combines shallow feature information so that the extracted semantic features also carry local fine-grained information, smoothing the road-edge segmentation.
For the driveable-area segmentation task, in Figure 5, red denotes a false positive and orange a false negative. The CenterPNets method is more precise than YOLOP and HybridNets at region segmentation, as demonstrated by a visual comparison of the three networks. YOLOP considers the intersection of bounding boxes while concentrating on determining the class to which each pixel belongs; as a result, the YOLOP model suffers from some lane-line and road-area misdetection and cannot precisely segment the driveable portion of the road. For the neck network, HybridNets uses a BiFPN architecture, in which information from different receptive fields is combined across feature-map levels by weighted parameters, an improvement over the YOLOP segmentation structure but still with regional under-detection. The CenterPNets algorithm uses the PANet architecture in the neck network to fuse features at different scales, making the global information richer, while exploiting the correlation between multi-task features and combining it with rich shallow fine-grained feature information to ensure that the network captures more detail. The driveable-area segmentation task is therefore effectively improved by the CenterPNets network. The CenterPNets method exhibits some inadequate area segmentation at complicated junctions, as seen in the figure, but overall the highway driveable area is more precisely separated from the background and lane lines.


B. Lane area splitting task
Lane detection is one of the main challenges for autonomous driving. CenterPNets uses accuracy and IoU as evaluation metrics for lane detection and compares the algorithms with ENet, SCNN, YOLOP, and HybridNets, whose experimental results are shown in Table 4. As shown in Table 4, CenterPNets' shared network multi-tasking architecture accomplished both the driving area and lane line segmentation tasks in the segmentation head section, with the CenterPNets algorithm achieving the best performance results of 86.20% accuracy and 32.1% IoU, an improvement in performance compared to other detection networks.
Figure 6 shows a visual comparison of the lane-line segmentation results.


C. Split task ablation experiment
CenterPNets is further used to analyze the impact of modules such as multi-scale feature information (MFI), spatial pyramid pooling (SPP), the attention mechanism (Attention), and shallow feature information (SCI) on the segmentation task. As can be seen from Experiments 1 and 2 in Table 5, introducing multi-scale information fusion improved the driveable-area IoU and accuracy by 3.8% and 1.9%, respectively, and the lane-detection IoU and accuracy by 2.8% and 2.3%, respectively, demonstrating the effectiveness of multi-level contextual feature information for the segmentation task. Experiments 2, 3, 4, and 5 show that spatial pyramid pooling and the attention mechanism effectively enhance road-related area features, with 1.0% and 1.8% improvements in the IoU and accuracy of the lane lines, respectively.

Training Method Comparison Experiment
In order to verify the effectiveness of joint multi-task training, this paper compares the impact of the multi-task training approach and the single-task training approach on the overall performance of the network. Table 6 compares the performance of these two schemes on their specific tasks. It can be seen that the overall performance of the model using the multi-task training scheme outperforms that of the individual tasks. More importantly, the multi-task model saves a significant amount of inference time compared to performing each task individually. Figure 7 shows some of the CenterPNets test results, where yellow is the lane line, red is the driveable area, and the green border is the traffic vehicle target. As can be seen, CenterPNets performed relatively well in most cases. CenterPNets exploits the correlation between the detection task and the segmentation task based on contextual information, helping the training model converge more quickly; it can therefore perform the traffic perception task more easily. In general, CenterPNets performs the detection task well in the vast majority of scenarios. However, there are still some lane-prediction interruptions and missed detections at complex intersections.

Conclusions
In this paper, the effectiveness of multi-task network detection is systematically described, and a perception structure with a shared encoder-decoder, called CenterPNets, is proposed. It integrates multi-scale feature information through a path aggregation network and directly regresses target key points. In the semantic segmentation task, the detailed information of the image is enhanced by fusing the multi-level features of the path aggregation network with shallow fine-grained information, and an effective training loss function is built to improve accuracy and performance. CenterPNets achieved an average detection accuracy of 75.8% on the publicly available large-scale Berkeley DeepDrive dataset, with intersection-over-union of 92.8% and 32.1% for the driveable area and lane area, respectively. Compared to the baseline algorithm, CenterPNets showed a 2.3% and 0.5% improvement in IoU for the driveable-area and lane-line segmentation tasks, respectively. More importantly, CenterPNets achieved more accurate traffic segmentation with relatively fast inference compared to other multi-task detection networks.