Improved Mask R-CNN Multi-Target Detection and Segmentation for Autonomous Driving in Complex Scenes

Vision-based target detection and segmentation has been an important research content for environment perception in autonomous driving, but the mainstream target detection and segmentation algorithms have the problems of low detection accuracy and poor mask segmentation quality for multi-target detection and segmentation in complex traffic scenes. To address this problem, this paper improved the Mask R-CNN by replacing the backbone network ResNet with the ResNeXt network with group convolution to further improve the feature extraction capability of the model. Furthermore, a bottom-up path enhancement strategy was added to the Feature Pyramid Network (FPN) to achieve feature fusion, while an efficient channel attention module (ECA) was added to the backbone feature extraction network to optimize the high-level low resolution semantic information graph. Finally, the bounding box regression loss function smooth L1 loss was replaced by CIoU loss to speed up the model convergence and minimize the error. The experimental results showed that the improved Mask R-CNN algorithm achieved 62.62% mAP for target detection and 57.58% mAP for segmentation accuracy on the publicly available CityScapes autonomous driving dataset, which were 4.73% and 3.96%% better than the original Mask R-CNN algorithm, respectively. The migration experiments showed that it has good detection and segmentation effects in each traffic scenario of the publicly available BDD autonomous driving dataset.


Introduction
With the rapid development of deep learning and the advancement of intelligent control technology, autonomous driving is gradually becoming the focus of research in various fields. For autonomous driving, three key technologies need to be improved, namely, perception, decision making, and execution [1]. Of these technologies, environment perception is considered to be the eyes of the self-driving vehicle and is the basis and prerequisite for behavioral decisions and executive control. The environment perception of self-driving vehicles on the road with the help of cameras can be seen as the detection and segmentation of multiple targets in front of the vehicle. Therefore, deep learningbased target detection algorithms and instance segmentation algorithms can be applied to environment perception in autonomous driving to help self-driving vehicles obtain information about their surroundings through visual information [2].
Based on the network structure as well as the detection principle, deep learning-based target detection algorithms can be divided into one-stage and two-stage target detection methods. One-stage target detection algorithms achieve higher prediction speed, while twostage target detection algorithms have region proposal steps with higher localization and object recognition accuracy [3]. In 2016, Redmon et al. proposed a one-stage target detection framework, YOLO, which transforms the target detection problem into a regression problem to realize fast real-time target detection [4]. Liu W et al. introduced multi-scale detection and proposed the single shot multibox detector (SSD) [5]. By detecting multiple targets at different scales through different layers of the neural network to suggest the accuracy of small targets, two-stage target detection divides target detection into two parts, namely, one for proposal generation and one for target classification within the proposal. Girshick et al. proposed R-CNN in 2014, which became the basis of research on two-stage target detection algorithms [6]. By generating proposals through selective searching, the images are then classified by support vector machines. Kai-Ming He et al. optimized R-CNN for speedup and proposed SPPNet by introducing the spatial pyramid pooling model in 2015 [7]. Girshick et al. propose Fast R-CNN on the basis of R-CNN [8], and later with Kai-Ming He proposed Faster R-CNN, which becomes a two-stage target detection model with detection speed closest to real-time in the R-CNN family [9].
Instance segmentation algorithms are different compared to other algorithms in the field of computer vision such as target detection, image classification, etc. Instance segmentation can be seen as labeling out each instance in an image with the aim of labeling out its instances for each pixel. Bai et al. used FCN lattices to learn the energy of the watershed transform and segment the image into several regions by energy segmentation [10]. Gao et al. used pixel-pair affinity pyramids to learn the probability of two pixels belonging to the same instance and generate instances sequentially by cascaded graph partitioning [11]. In 2015, Dai et al. proposed an instance segmentation model using a cascade structure of shared convolutional features by breaking the instance segmentation into three different subtasks to achieve a better effect of instance segmentation [12]. In 2017, Kai-Ming He et al. proposed Mask R-CNN, which achieves instance segmentation by adding a parallel third mask output through an extension for Faster R-CNN [13]. Yi Li et al. presented the first fully convolutional end-to-end solution for instance-aware semantic segmentation tasks. It inherits the merits of FCNs for semantic segmentation and instance mask proposals [14]. Daniel Bolya et al. proposed YOLACT in 2019, with the main idea of directly adding a mask branch to the one-stage target detection algorithm without adding any RoI pooling operation to split the instance segmentation into two parallel branches [15]. In 2020, Wang et al. proposed SOLO, which introduces instance categories and enables direct instance segmentation by quantifying target centers and target sizes [16]. In 2021, Lei Ke et al. proposed BCNet, a two-layer instance segmentation network under masking perception, to explicitly model the masking relationship using a two-layer structure that naturally decouples the boundaries of occluded instances [17]. The interactions during mask regression between them are considered, and the interactions during mask regression between them are considered. In 2022, Tao Zhang et al. proposed a new contour-based image instance segmentation algorithm, E2EC, for high-quality segmentation [18]. Their use of a learnable contour initialization architecture with a multi-directional alignment label sampling method allows the E2EC model to exhibit more advanced segmentation performance. In 2023, He et al. proposed Fastlnst, a query-based real-time instance segmentation model, which mainly includes real-time activation-guided query, dual-path update strategy, and true mask-guided learning for better segmentation performance [19]. In 2023, Zhang et al. proposed a mask-piloted Transformer MP-Former, which additionally adds real masks with noise to the mask attention and trains the model to reconstruct the original masks [20]. It effectively improves the mask prediction accuracy and significantly optimizes the model performance.
Although Mask R-CNN shows good instance segmentation effect, it still has some shortcomings because it does not have good generalization ability when applied to complex traffic scenes. To address the shortcomings of Mask R-CNN when it is difficult to perform instance segmentation and localization for dense multiple targets in autonomous driving scenarios, this paper improved on the basis of Mask R-CNN to adapt it to environment perception under autonomous driving.

Mask R-CNN Instance Segmentation Algorithm Model
The detection of targets such as vehicles, pedestrians, and non-motorized vehicles is one of the core problems of environment perception in self-driving cars. Most traffic Sensors 2023, 23, 3853 3 of 20 accidents are caused by the collision of vehicles, pedestrians, and non-motorized vehicles, so the detection and segmentation of such objects in the traffic scene is helpful to avoid the occurrence of traffic accidents [21]. As a two-stage instance segmentation model, Mask R-CNN extends Faster R-CNN and adds a parallel mask branch to achieve end-to-end multi-target detection and segmentation.

The Model Structure of Mask R-CNN
Mask R-CNN is based on the Faster R-CNN with some improvements, and the structure is shown in Figure 1. The backbone of the feature extraction network is changed from VGG to ResNet and combined with the structure of the Feature Pyramid Network (FPN) to add multiscale information while solving network degradation. The pooling of the region of interest uses Region of Interesting Align (RoI Align) to replace RoI Pooling. The pooling of the region of interest using bilinear interpolation solved the quantization mismatch problem caused by rounding off during pooling and improves the localization accuracy. It also generates a high-quality mask for detecting objects, which can obtain a perfect effect of segmentation.
one of the core problems of environment perception in self-driving cars. Most traffic accidents are caused by the collision of vehicles, pedestrians, and non-motorized vehicles, so the detection and segmentation of such objects in the traffic scene is helpful to avoid the occurrence of traffic accidents [21]. As a two-stage instance segmentation model, Mask R-CNN extends Faster R-CNN and adds a parallel mask branch to achieve end-to-end multitarget detection and segmentation.

The Model Structure of Mask R-CNN
Mask R-CNN is based on the Faster R-CNN with some improvements, and the structure is shown in Figure 1. The backbone of the feature extraction network is changed from VGG to ResNet and combined with the structure of the Feature Pyramid Network (FPN) to add multi-scale information while solving network degradation. The pooling of the region of interest uses Region of Interesting Align (RoI Align) to replace RoI Pooling. The pooling of the region of interest using bilinear interpolation solved the quantization mismatch problem caused by rounding off during pooling and improves the localization accuracy. It also generates a high-quality mask for detecting objects, which can obtain a perfect effect of segmentation.
Mask R-CNN uses a two-stage target detection method. In the first stage, the proposals are obtained through the Region Proposal Network (RPN). In the second stage, the features of target locations are obtained based on the proposal and regions of interest generated in the first stage, and finally classification, localization, and mask prediction are performed.

Backbone Feature Extraction Network
The feature extraction network of Mask R-CNN is divided into two paths: bottomup and top-down. The bottom-up path is a module composed of residual structures of different sizes, which is responsible for feature extraction of the image. Taking ResNet-101 as an example, as shown in Figure 2, the residual modules are noted as C1, C2, C3, C4, and C5. The output of each module contains different levels of feature information, and the size of the mapping map output by the modules is halved from the bottom-up module by module through the residual structure and the convolution with step size of 2 between Mask R-CNN uses a two-stage target detection method. In the first stage, the proposals are obtained through the Region Proposal Network (RPN). In the second stage, the features of target locations are obtained based on the proposal and regions of interest generated in the first stage, and finally classification, localization, and mask prediction are performed.

Backbone Feature Extraction Network
The feature extraction network of Mask R-CNN is divided into two paths: bottom-up and top-down. The bottom-up path is a module composed of residual structures of different sizes, which is responsible for feature extraction of the image. Taking ResNet-101 as an example, as shown in Figure 2, the residual modules are noted as C1, C2, C3, C4, and C5. The output of each module contains different levels of feature information, and the size of the mapping map output by the modules is halved from the bottom-up module by module through the residual structure and the convolution with step size of 2 between modules [23]. The top-down path fuses the abstract features of high-level semantic information with low-level detailed features. The high-level feature maps are reduced to the resolution corresponding to the bottom-up branched feature maps by 2-fold upsampling. The new Sensors 2023, 23, 3853 4 of 20 feature maps P2, P3, P4, and P5 are obtained by adding them pixel by pixel and convolving 3 × 3 to make the high-level semantic information better represented in the mapped images. modules [23]. The top-down path fuses the abstract features of high-level semantic information with low-level detailed features. The high-level feature maps are reduced to the resolution corresponding to the bottom-up branched feature maps by 2-fold upsampling.
The new feature maps P2, P3, P4, and P5 are obtained by adding them pixel by pixel and convolving 3 × 3 to make the high-level semantic information better represented in the mapped images. Figure 2. Backbone feature extraction network. Adapted from Ref. [24].

Regional Proposal Network-RPN
After the input images are extracted by the backbone feature extraction network, a series of feature maps is obtained. These feature maps are then sent into RPN, which performs preliminary classification and prediction on them, predicting the presence or absence of targets as well as bounding box regression prediction. Here the preliminary RoIs are obtained, and the RoIs include the bounding box regression values of the region of interest in the original image.
The RPN network takes anchors on the feature map to calculate the positions of proposals, and the principle of anchor generation is shown in Figure 3. A sliding window of n × n is set on the feature map to generate k identically centered anchors corresponding to each point in the feature map. To fit various sizes of targets, the anchor is set to 3 area sizes as well as 3 aspect ratios, for a total of 9 sizes; thus k = 9. Then, the sliding window is used to convolute the feature map, and the classification layer and regression layer composed of a fully connected network are respectively input through the middle layer to describe the position and type of information of the anchor. The classification layer outputs 2k fore-and-aft probability values, and the regression layer outputs 4k coordinate values. The RPN corrects the length, width, and center of each anchor based on the correction values calculated by the regression, and the corrected RoIs are sent to the subsequent network after filtering.

Regional Proposal Network-RPN
After the input images are extracted by the backbone feature extraction network, a series of feature maps is obtained. These feature maps are then sent into RPN, which performs preliminary classification and prediction on them, predicting the presence or absence of targets as well as bounding box regression prediction. Here the preliminary RoIs are obtained, and the RoIs include the bounding box regression values of the region of interest in the original image.
The RPN network takes anchors on the feature map to calculate the positions of proposals, and the principle of anchor generation is shown in Figure 3. A sliding window of n × n is set on the feature map to generate k identically centered anchors corresponding to each point in the feature map. To fit various sizes of targets, the anchor is set to 3 area sizes as well as 3 aspect ratios, for a total of 9 sizes; thus k = 9. Then, the sliding window is used to convolute the feature map, and the classification layer and regression layer composed of a fully connected network are respectively input through the middle layer to describe the position and type of information of the anchor. The classification layer outputs 2k fore-and-aft probability values, and the regression layer outputs 4k coordinate values. The RPN corrects the length, width, and center of each anchor based on the correction values calculated by the regression, and the corrected RoIs are sent to the subsequent network after filtering.

RoI Align
RoI Align serves to pool the corresponding regions in the feature maps into fixedsize feature maps based on the location coordinates of the RoIs for subsequent classification and regression operations. The whole feature aggregation process is transformed into a continuous operation by canceling the quantization operation and using bilinear inter-

RoI Align
RoI Align serves to pool the corresponding regions in the feature maps into fixed-size feature maps based on the location coordinates of the RoIs for subsequent classification and regression operations. The whole feature aggregation process is transformed into a continuous operation by canceling the quantization operation and using bilinear interpolation [25] to obtain the values on the pixel points with floating point coordinates. The operation process is as follows: RoI Align crops the candidate regions from the feature maps and pools them into 7 × 7 and 14 × 14 feature maps. These two sizes are the specified input sizes for the target classification, and localization and mask generation stages, respectively, as shown in Figure 4. The region proposals are first kept floating-point without quantization. Then it is partitioned into k × k cells with no quantization of edges. Finally, each cell is quadratically divided, and four centers are found. The coordinate values of these four locations are calculated by bilinear interpolation, and maximum pooling is performed. RoI Align thus transforms the whole feature aggregation process from a discrete to a continuous operation, solving the problem of errors caused by mismatched quantization twice in the RoI Pooling operation.

RoI Align
RoI Align serves to pool the corresponding regions in the feature maps into fixedsize feature maps based on the location coordinates of the RoIs for subsequent classification and regression operations. The whole feature aggregation process is transformed into a continuous operation by canceling the quantization operation and using bilinear interpolation [25] to obtain the values on the pixel points with floating point coordinates. The operation process is as follows: RoI Align crops the candidate regions from the feature maps and pools them into 7 × 7 and 14 × 14 feature maps. These two sizes are the specified input sizes for the target classification, and localization and mask generation stages, respectively, as shown in Figure 4. The region proposals are first kept floating-point without quantization. Then it is partitioned into k × k cells with no quantization of edges. Finally, each cell is quadratically divided, and four centers are found. The coordinate values of these four locations are calculated by bilinear interpolation, and maximum pooling is performed. RoI Align thus transforms the whole feature aggregation process from a discrete to a continuous operation, solving the problem of errors caused by mismatched quantization twice in the RoI Pooling operation.    [13]. The orange arrow points to the center point obtained after the region has been quadratically divided.

Mask Branch
Mask R-CNN adds the "head" part after RoI Align in order to make the predicted masks more accurate, thus expanding the output dimension of RoI Align to achieve more accurate predicted masks. In the training session of the mask branch, k mask prediction maps are output (one for each class), and average binary cross-entropy loss is used for training. The mask branch uses a fully connected layer for each RoI with a segmentation output dimension of k × m × m: a binary mask of m × m for k categories. The use of RoI Align in Mask R-CNN ensures that RoIs can be aligned when mapped to the original map, so that pixel-to-pixel operations can be accomplished, and pixel-level alignment errors can be reduced.

Improved Mask R-CNN Model
Mask R-CNN can achieve better detection and segmentation of targets, but there are still the following main problems for the detection and segmentation of multiple targets in complex traffic scenes: (1) In urban traffic scenarios, there are multiple targets in front of the car, such as cars, pedestrians, trucks, and so on. When there is traffic congestion or a large number of pedestrians in front, the problem of overlapping targets is likely to occur. This will lead to individual targets in front of the car not being accurately detected and segmented. (2) In complicated and variable weather, the detection and segmentation of targets are easily affected by light and weather, which leads to false detection or low detection accuracy. In this paper, we improved on the Mask R-CNN model to make it more suitable for multi-target detection and segmentation in complex traffic scenes. The improved network model is shown in Figure 5.
(1) In urban traffic scenarios, there are multiple targets in front of the car, such as cars, pedestrians, trucks, and so on. When there is traffic congestion or a large number of pedestrians in front, the problem of overlapping targets is likely to occur. This will lead to individual targets in front of the car not being accurately detected and segmented. (2) In complicated and variable weather, the detection and segmentation of targets are easily affected by light and weather, which leads to false detection or low detection accuracy.
In this paper, we improved on the Mask R-CNN model to make it more suitable for multi-target detection and segmentation in complex traffic scenes. The improved network model is shown in Figure 5  In the ResNeXt section, the input image is first passed through a convolution kernel of size 7 * 7 with a convolutional stride of 2 for the conv operation. The input image is fixed as a feature layer of size 112 * 112 * 64. Then it enters the first block module, in which group convolution is used to increase the diagonal correlation between the convolution kernels and reduce their training parameters. A total of 32 groups are divided to do the convolution operation on the input separately, and then the output of the 32 groups is combined and used as the input of the next block. There are 16 block modules in ResNeXt, divided into four parts. The first part contains 3 blocks with 112 * 112 * 64 inputs and 56 * 56 * 256 outputs. The second part contains 4 blocks with 56 * 56 * 256 inputs and 28 * 28 * 512 outputs. The third part contains 6 blocks with 28 * 28 * 512 inputs and 14 * 14 * 1024 In the ResNeXt section, the input image is first passed through a convolution kernel of size 7 * 7 with a convolutional stride of 2 for the conv operation. The input image is fixed as a feature layer of size 112 * 112 * 64. Then it enters the first block module, in which group convolution is used to increase the diagonal correlation between the convolution kernels and reduce their training parameters. A total of 32 groups are divided to do the convolution operation on the input separately, and then the output of the 32 groups is combined and used as the input of the next block. There are 16 block modules in ResNeXt, divided into four parts. The first part contains 3 blocks with 112 * 112 * 64 inputs and 56 * 56 * 256 outputs. The second part contains 4 blocks with 56 * 56 * 256 inputs and 28 * 28 * 512 outputs. The third part contains 6 blocks with 28 * 28 * 512 inputs and 14 * 14 * 1024 outputs. The fourth part contains 3 blocks with 14 * 14 * 1024 inputs and 7 * 7 * 2048 outputs. The outputs of four components are obtained by ResNeXt: C2, C3, C4, and C5. They are input into the ECA attention mechanism module for each channel weight allocation to obtain CE2, CE3, CE4, and CE5 and input into the improved FPN structure.
CE2-CE5 are fused with the features generated by the top-down layer structure based on the output of the CE5 layer through a lateral connection to obtain four fused multi-scale features of P2-P5, with the width or height of adjacent feature layers differing by a factor of 2. CE2-CE5 are fused with the features generated by the top-down layer structure based on the output of the CE5 layer through a lateral connection to obtain four fused multi-scale features of P2-P5, with the width or height of adjacent feature layers differing by a factor of 2. In addition, a bottom-up feature enhancement branch is added. First, the bottom features are convolved by a convolutional layer with a convolutional step size of 2 and convolutional kernel size of 3 * 3 to change the width and height of the input feature map to half of the original one. Next, a nonlinear layer (Relu [26]) is adopted to reduce the problem of overfitting. Then, the top layer features are passed through a convolutional layer with a convolutional step of 1 and convolutional kernel of 1 * 1 to match the number of channels of the bottom layer features. Finally, the Ni and Pi + 1 layers are summed by elements, and the result is passed through a convolutional layer with a convolution step of 1 and convolutional kernel size of 3 * 3 to obtain the new fused features N2-N5.
The feature maps obtained after ResNeXt and the improved FPN network are sent to the RPN network. The RPN network performs preliminary classification prediction and bounding-box regression prediction on the feature map and then obtains preliminary RoIs, where the loss function of bounding-box regression prediction is CIoU. A large number of RoIs is obtained after the classification prediction and bounding-box regression prediction of the RPN network, and each RoI contains its coordinate position in the original map. The feature map is pooled into a fixed size of k * k by RoI Align to fit the size requirement of the feature map in the final classification prediction and bounding-box regression prediction of the model.
The fixed size feature maps are obtained in RoI Align, and each target corresponds to one feature map. These feature maps are sent to the fully connected layer for final classification prediction and bounding-box regression prediction, where the loss function for bounding-box regression prediction is CIoU. Finally, the feature maps with targets are input into the FCN network to obtain the final mask information of the targets and to perform pixel-by-pixel prediction [27].

ResNet Backbone Network Improvements
In order to balance the model complexity and model accuracy, ResNeXt-50 is adopted as the backbone network in this paper.
The ResNeXt network proposes a parallel stacking of blocks with the same topology instead of the original ResNet block with three convolutional layers [28]. The concept of cardinality is introduced in the block, which refers to the number of identical branches in a block. The accuracy of the model is improved without significantly increasing the number of parameters, and the hyperparameters are also reduced due to the same topology, which makes it easier to port the model. As shown in Figure 6, the ResNet network structure is shown on the left and the ResNeXt structure is shown on the right.  Unlike Resnet, ResneXt first groups the input feature maps into 32 groups and downscales them with a 1 * 1 convolution operation. Then it performs a convolution operation with a convolution kernel size of 3 * 3 for each group of branches, and finally it up-dimensions them by a 1 * 1 convolution operation in which the grouped convolution can reduce the computation and parameter amount of the network to 1/g of the normal convolution with the same input and output, where g is the number of groupings. However, to improve the feature extraction effect of the backbone network, ResNeXt increases the number of feature layer channels per branch after grouping. Finally, the number of network parameters of Resnet-50 and ResNeXt-50 is kept almost the same, but the feature extraction effect of the backbone network is effectively improved.

Feature Pyramid Network Improvements
Feature Pyramid Network (FPN) is a network topology that aggregates higher-level features with lower-level features, allowing the model to learn information that retains both location and stronger semantic information [24]. Its network topology is shown in Figure 7. It can be observed that the FPNs are predicted by up-sampling the higher-level feature maps and then superimposing them horizontally with the semantic information at lower levels than them. This results in a feature map with strong semantic information and thus achieves the effect of feature fusion. Unlike Resnet, ResneXt first groups the input feature maps into 32 groups and downscales them with a 1 * 1 convolution operation. Then it performs a convolution operation with a convolution kernel size of 3 * 3 for each group of branches, and finally it updimensions them by a 1 * 1 convolution operation in which the grouped convolution can reduce the computation and parameter amount of the network to 1/g of the normal convolution with the same input and output, where g is the number of groupings. However, to improve the feature extraction effect of the backbone network, ResNeXt increases the number of feature layer channels per branch after grouping. Finally, the number of network parameters of Resnet-50 and ResNeXt-50 is kept almost the same, but the feature extraction effect of the backbone network is effectively improved.

Feature Pyramid Network Improvements
Feature Pyramid Network (FPN) is a network topology that aggregates higher-level features with lower-level features, allowing the model to learn information that retains both location and stronger semantic information [24]. Its network topology is shown in Figure 7. It can be observed that the FPNs are predicted by up-sampling the higher-level feature maps and then superimposing them horizontally with the semantic information at

Feature Pyramid Network Improvements
Feature Pyramid Network (FPN) is a network topology that aggregates higher-level features with lower-level features, allowing the model to learn information that retains both location and stronger semantic information [24]. Its network topology is shown in Figure 7. It can be observed that the FPNs are predicted by up-sampling the higher-level feature maps and then superimposing them horizontally with the semantic information at lower levels than them. This results in a feature map with strong semantic information and thus achieves the effect of feature fusion.  [29], whose backbone  [29], whose backbone feature extraction framework is shown in Figure 8. From the figure, we can see that PANet improves the structure of FPN by adopting a top-down and bottom-up bidirectional feature pyramid fusion structure. The bottom-up path enhancement strategy is added on the basis of the original FPN top-down horizontal stacking combined with downsampling, which aims to shorten the information path and utilize the shallow-level accurate localization, effectively improving the utilization of the underlying features of the network.

Efficient Channel Attention Module ECA
In the inference process of the Mask R-CNN algorithm, the recognition of medium and large targets is conducted by the high-level feature maps, which are also subject to targeting missed detection due to its low output resolution. In this paper, we introduced the ECA module to optimize the low-resolution semantic information map at a high level of output of the Mask R-CNN algorithm and then obtain the information map with lowresolution and high semantic features to better identify medium and large targets [30] so as to achieve the mean Average Precision (mAP) improvement.
The attention mechanism is a signal processing mechanism that mimics the human brain, and the strategy is well adapted and gainful for computer vision tasks [31]. The ECA structure is also used to obtain the importance of different feature channels by means of supervised learning. The nonlinear fully connected layer in SENet is improved to avoid the effect of the dimensionality reduction operation on channel attention and to ensure the efficiency and computational effectiveness of the network [32]. The module does not require dimensionality reduction to achieve cross-channel interaction; after the global averaging pooling operation on the output feature map, a one-dimensional convolution operation with convolution kernel size k is used to achieve the interaction of different chan-

Efficient Channel Attention Module ECA
In the inference process of the Mask R-CNN algorithm, the recognition of medium and large targets is conducted by the high-level feature maps, which are also subject to targeting missed detection due to its low output resolution. In this paper, we introduced the ECA Sensors 2023, 23, 3853 9 of 20 module to optimize the low-resolution semantic information map at a high level of output of the Mask R-CNN algorithm and then obtain the information map with low-resolution and high semantic features to better identify medium and large targets [30] so as to achieve the mean Average Precision (mAP) improvement.
The attention mechanism is a signal processing mechanism that mimics the human brain, and the strategy is well adapted and gainful for computer vision tasks [31]. The ECA structure is also used to obtain the importance of different feature channels by means of supervised learning. The nonlinear fully connected layer in SENet is improved to avoid the effect of the dimensionality reduction operation on channel attention and to ensure the efficiency and computational effectiveness of the network [32]. The module does not require dimensionality reduction to achieve cross-channel interaction; after the global averaging pooling operation on the output feature map, a one-dimensional convolution operation with convolution kernel size k is used to achieve the interaction of different channel information [33]. The ECA module topology is shown in Figure 9. The relevance of the feature channels can be expressed as follows: where CID denotes a one-dimensional convolution operation, y represents the input weight value, and σ represents the sigmoid function. The parameter krepresents the size of the convolution kernel in the one-dimensional convolution, and the value of k is mapped to the number of channels, C , which can be expressed as where k denotes the convolution kernel size, C is the number of channels, and odd is the nearest odd number taken. Parameters γ and b are set to 2 and 1, respectively, for changing the ratio between the number of channels C and the convolution kernel size k.

Optimization of Loss Functions
The bounding-box regression loss function smooth L1 loss is replaced with the CIoU loss function to speed up model convergence.
In  The relevance of the feature channels can be expressed as follows: where CID denotes a one-dimensional convolution operation, y represents the input weight value, and σ represents the sigmoid function. The parameter k represents the size of the convolution kernel in the one-dimensional convolution, and the value of k is mapped to the number of channels, C, which can be expressed as where k denotes the convolution kernel size, C is the number of channels, and odd is the nearest odd number taken. Parameters γ and b are set to 2 and 1, respectively, for changing the ratio between the number of channels C and the convolution kernel size k.

Optimization of Loss Functions
The bounding-box regression loss function smooth L1 loss is replaced with the CIoU loss function to speed up model convergence.
In the entire network structure, there are five predictors. The loss functions corresponding to these five predictors are the classification result L rc of the RPN, the bounding box regression prediction L rb of the RPN, the final classification result L cls , the final bounding box regression prediction L box , and the final mask image prediction L mask , respectively. The combination of the above loss functions gives L: where the bounding-box regression prediction L rb for RPN and the final bounding-box regression prediction L box are calculated using the smooth L1 loss function as follows: where G i is the position (coordinates and dimensions) of the ground truth box, p i is the position (coordinates and dimensions) of the prediction box, and x is the difference between the ground truth box and the prediction box. When smooth L1 loss is used to calculate the bounding box loss for target detection, the losses of the four points are derived independently and then summed to obtain the final bounding box loss [8]. The premise of this approach is that the four points are independent of each other, but there is actually some correlation. Furthermore, evaluating the metrics of the detection box is carried out using IoU, which is not equivalent to the loss. Multiple detection boxes may have the same loss, but the IoU may vary greatly. In order to solve this problem, CIoU loss is introduced as the loss function of the bounding box regression algorithm in this paper. CIoU loss takes the distance, overlap rate, and scale between the ground truth box and the prediction box as the judging criteria. Moreover, the penalty term is added to improve the target regression accuracy, and the loss value converges faster [34]. The CIoU is defined as follows: where b is the center of the prediction box, b gt is the center of the ground truth box, ρ is the Euclidean distance between the two centroids, c2 is the area of the smallest enclosed area that can contain both the prediction box and the ground truth box, v is the weighting factor, and α is the similarity ratio used to measure the length and width. Parameters α and v are defined as follows: The CIoU bounding box loss function is defined as follows: It can be seen that the advantage of the CIoU loss function used in this paper is that the loss factors such as overlap area, centroid distance and aspect ratio, and other bounding box regressions are taken into account, which makes the model converge faster and with higher accuracy and will be more reasonable and flexible in optimizing the network error.

Experimental Environment Configuration
This experimental training platform used a desktop computer with a Windows 10 operating system, and the main hardware configuration was as follows: AMD Ryzen7 5700X CPU, NVIDIA GeForce GTX 3080Ti. The Mask R-CNN algorithm was implemented using the deep learning framework Pytorch 1.13.0 and Python 3.8.

Dataset
The COCO dataset is a general dataset for studying instance segmentation [35], but there is fewer data for road environments, which is not suitable for relevant application scenarios such as autonomous driving. The datasets that provide segmentation data for road environments currently include KITTI, CityScapes, Apllo, etc. The CityScapes dataset contains a large number of complex street view data with good data quality, which is suitable for the application scenario of this study [36]. Therefore, the CityScapes dataset was chosen for model training in this study.
For this study, we selected part of the CityScapes dataset and divided the dataset into a training set, validation set, and test set according to the ratio of 7:1.5:1.5. Among them, the training set included 2334 images, the validation set included 500 images, and the test set included 500 images. The CityScapes dataset included 30 categories, considering the application scenario of this study was a traffic scenario; therefore, this study only detected 5 categories in the dataset, which were car, pedestrian, truck, bus, and rider. The CityScapes dataset is shown in Figure 10.

Dataset
The COCO dataset is a general dataset for studying instance segmentation [35], but there is fewer data for road environments, which is not suitable for relevant application scenarios such as autonomous driving. The datasets that provide segmentation data for road environments currently include KITTI, CityScapes, Apllo, etc. The CityScapes dataset contains a large number of complex street view data with good data quality, which is suitable for the application scenario of this study [36]. Therefore, the CityScapes dataset was chosen for model training in this study.
For this study, we selected part of the CityScapes dataset and divided the dataset into a training set, validation set, and test set according to the ratio of 7:1.5:1.5. Among them, the training set included 2334 images, the validation set included 500 images, and the test set included 500 images. The CityScapes dataset included 30 categories, considering the application scenario of this study was a traffic scenario; therefore, this study only detected 5 categories in the dataset, which were car, pedestrian, truck, bus, and rider. The City-Scapes dataset is shown in Figure 10.

Evaluation Indicators
As a standard measure of the model metrics, the datasets in this paper were converted to the COCO dataset format. AP and mAP were used as metrics to measure the performance of target detection and instance segmentation of the model, respectively. In addition to AP and mAP, our metrics also included 50

Model Training and Experimental Parameter Configuration
The parameters of the model were set as follows: the number of samples selected for one training (batch size) was 4, the total number of training rounds was 30, and the total number of iterations was 22,200. The initial learning rate was set to ,0.002 and the learning

Evaluation Indicators
As a standard measure of the model metrics, the datasets in this paper were converted to the COCO dataset format. AP and mAP were used as metrics to measure the performance of target detection and instance segmentation of the model, respectively. In addition to AP and mAP, our metrics also included AP 50 , AP car , AP person , AP truck , AP bus , and AP rider . AP 50 represents the IOU threshold of 0.5, and AP car , AP truck , AP bus , and AP rider represent the AP values for the five categories of car, pedestrian, truck, bus and rider, respectively. Such a standard setting can adequately evaluate the segmentation performance of the model.

Model Training and Experimental Parameter Configuration
The parameters of the model were set as follows: the number of samples selected for one training (batch size) was 4, the total number of training rounds was 30, and the total number of iterations was 22,200. The initial learning rate was set to 0.002 and the learning rate was divided by 10 at the 10th, 20th, and 25th epochs. To make the network training process more stable, the idea of migration learning was used to load the weight parameters from the Resnet network that had already been trained on the coco dataset to accelerate the training of the network.
When training the improved Mask R-CNN model, the change curve of the loss function was plotted by the information of the result of each training round, as shown in Figure 11.
As can be seen in the figure, the total loss values of the five groups of models converged at 22,200 training iterations. Compared with the other four groups of models, the loss function curve (yellow curve) of the improved Mask R-CNN model proposed in this paper significantly improved in convergence. rate was divided by 10 at the 10th, 20th, and 25th epochs. To make the network training process more stable, the idea of migration learning was used to load the weight parameters from the Resnet network that had already been trained on the coco dataset to accelerate the training of the network.
When training the improved Mask R-CNN model, the change curve of the loss function was plotted by the information of the result of each training round, as shown in Figure  11. As can be seen in the figure, the total loss values of the five groups of models converged at 22,200 training iterations. Compared with the other four groups of models, the loss function curve (yellow curve) of the improved Mask R-CNN model proposed in this paper significantly improved in convergence.

Ablation Experimental Results and Analysis
We conducted ablation experiments to explore the effects of the ResNeXt backbone network, the improved FPN module, the ECA attention module, and the CIoU loss function on model performance. The ablation experiments are shown in Tables 1 and 2. Five groups of models were trained separately, and each module was added sequentially on the basis of Mask R-CNN, and the five groups of models were tested sequentially on the same test set. Tables 1 and 2 show the AP and mAP of the five groups of models for each category when IoU was taken as 0.5.

Ablation Experimental Results and Analysis
We conducted ablation experiments to explore the effects of the ResNeXt backbone network, the improved FPN module, the ECA attention module, and the CIoU loss function on model performance. The ablation experiments are shown in Tables 1 and 2. Five groups of models were trained separately, and each module was added sequentially on the basis of Mask R-CNN, and the five groups of models were tested sequentially on the same test set.  Tables 1 and 2 show the AP and mAP of the five groups of models for each category when IoU was taken as 0.5.
In Tables 1 and 2, the improvement of the backbone network ResNeXt resulted in some improvement in the detection accuracy for all five types of targets. The improvement of detection accuracy for relatively small targets such as pedestrian and rider was more obvious, and the segmentation accuracy of these two types of targets was also significantly improved. The improved FPN structure showed that the detection accuracy for all five types of targets was significantly improved compared with the original model. It can be shown that in complex traffic scenes with more targets in front of the road, the improved FPN structure of the model can effectively fuse the low-level location information with the high-level semantic information, so that the denser multiple targets and small targets can be detected accurately. In terms of segmentation accuracy, the improved FPN structure resulted in a significant improvement of pedestrian, truck, and bus targets. The introduction of the ECA attention module slightly improved the detection accuracy of the model. It can be shown that ECA can effectively optimize the model performance and improve the detection accuracy of the model by optimizing the high-level low-resolution semantic information graph output from the backbone network, and it also has some improvement on the segmentation accuracy. Additionally, the use of the CIoU loss function for bounding box regression was able to optimize the model performance slightly. The detection accuracy for the truck targets was reduced, and the segmentation accuracy for the truck and rider targets was also slightly reduced. This situation was considered to be the result of the small number of truck and rider targets in the dataset, which in turn caused instability in the detection of such targets. Overall, the modules in our proposed improved Mask R-CNN algorithm effectively enhanced the recognition accuracy of the model. The detection accuracy was improved by 4.73%, and the segmentation accuracy was improved by 3.96% compared with the original Mask R-CNN algorithm, which verified the feasibility of the model.
For the Mask R-CNN model, we added improvement points individually to the network to perform experiments to evaluate the impact of each of its improvement points on the model in Table 3. Among them, the addition of the channel attention mechanism ECA had a minimal impact on the number of network parameters, but it helped a lot to improve the network. The addition of Ciou did not bring any change in the number of parameters, but it improved the accuracy by obtaining a more accurate bounding box. The group convolution of the ResNeXt network reduced the number of parameters and improved the performance of the model, while the FPN structure increased the computational complexity and the number of parameters to some extent, but it is crucial to improve the performance of the model. In order to more intuitively evaluate the target detection and segmentation effects of the improved Mask R-CNN algorithm in different complex scenes, we selected the CityScapes test dataset to test five groups of models, and the experimental results are shown in Figures 12-14. In order to more intuitively evaluate the target detection and segmentation effects of the improved Mask R-CNN algorithm in different complex scenes, we selected the City-Scapes test dataset to test five groups of models, and the experimental results are shown in Figures 12-14.   Comparing the experimental results of the CityScapes test dataset, we found that in the original Mask R-CNN test results, although the model could detect different kinds of targets well, there were some false detections and poor mask quality. In the test results of the improved Mask R-CNN, it was able to detect different kinds of dense targets better. The mask segmentation results with good quality were also presented for the targets with serious occlusion. In the left experimental group, the dense targets in the middle and the right side of the picture had partial false detection. In the right experimental group, the masks of the two dense small targets on the left side of the picture were not of high quality, and although the targets were correctly identified as cars, the masks showed a confusing situation. In order to more intuitively evaluate the target detection and segmentation effects of the improved Mask R-CNN algorithm in different complex scenes, we selected the City-Scapes test dataset to test five groups of models, and the experimental results are shown in Figures 12-14.   Comparing the experimental results of the CityScapes test dataset, we found that in the original Mask R-CNN test results, although the model could detect different kinds of targets well, there were some false detections and poor mask quality. In the test results of the improved Mask R-CNN, it was able to detect different kinds of dense targets better. The mask segmentation results with good quality were also presented for the targets with serious occlusion. In the left experimental group, the dense targets in the middle and the right side of the picture had partial false detection. In the right experimental group, the masks of the two dense small targets on the left side of the picture were not of high quality, and although the targets were correctly identified as cars, the masks showed a confusing situation. In order to more intuitively evaluate the target detection and segmentation effects of the improved Mask R-CNN algorithm in different complex scenes, we selected the City-Scapes test dataset to test five groups of models, and the experimental results are shown in Figures 12-14.   Comparing the experimental results of the CityScapes test dataset, we found that in the original Mask R-CNN test results, although the model could detect different kinds of targets well, there were some false detections and poor mask quality. In the test results of the improved Mask R-CNN, it was able to detect different kinds of dense targets better. The mask segmentation results with good quality were also presented for the targets with serious occlusion. In the left experimental group, the dense targets in the middle and the right side of the picture had partial false detection. In the right experimental group, the masks of the two dense small targets on the left side of the picture were not of high quality, and although the targets were correctly identified as cars, the masks showed a confusing situation. Comparing the experimental results of the CityScapes test dataset, we found that in the original Mask R-CNN test results, although the model could detect different kinds of targets well, there were some false detections and poor mask quality. In the test results of the improved Mask R-CNN, it was able to detect different kinds of dense targets better. The mask segmentation results with good quality were also presented for the targets with serious occlusion. In the left experimental group, the dense targets in the middle and the right side of the picture had partial false detection. In the right experimental group, the masks of the two dense small targets on the left side of the picture were not of high quality, and although the targets were correctly identified as cars, the masks showed a confusing situation.

Migration Experiment Results and Analysis
In order to verify the generalization ability of the improved Mask R-CNN model, this subsection conducted migration experiments with the BDD autonomous driving dataset, which is a real-time image captured under driving conditions of real traffic scenes. The experiment was conducted with randomly selected sample images from different scenes in the dataset. The detection and segmentation results of the migration experiment are shown in Figures 15-23.
In order to verify the generalization ability of the improved Mask R-CNN model, this subsection conducted migration experiments with the BDD autonomous driving dataset, which is a real-time image captured under driving conditions of real traffic scenes. The experiment was conducted with randomly selected sample images from different scenes in the dataset. The detection and segmentation results of the migration experiment are shown in Figures 15-23.   In this group of experiments, the traffic scene pictures in the BDD dataset were selected as the test dataset. From the left experimental group, we can see that there were three false detections in the original Mask R-CNN test results. On the leftmost side of the image, the model detected two pedestrians as three pedestrians and the fence in front of the tree as a pedestrian. The right side of the picture also had the problem of poor-quality masking of the pedestrian. From the experimental group on the right, we can see that there were four false detections in the original Mask R-CNN test results: two pedestrian targets on the right side of the image, one car target in the middle of the image far away, and the In order to verify the generalization ability of the improved Mask R-CNN model, this subsection conducted migration experiments with the BDD autonomous driving dataset, which is a real-time image captured under driving conditions of real traffic scenes. The experiment was conducted with randomly selected sample images from different scenes in the dataset. The detection and segmentation results of the migration experiment are shown in Figures 15-23.   In this group of experiments, the traffic scene pictures in the BDD dataset were selected as the test dataset. From the left experimental group, we can see that there were three false detections in the original Mask R-CNN test results. On the leftmost side of the image, the model detected two pedestrians as three pedestrians and the fence in front of the tree as a pedestrian. The right side of the picture also had the problem of poor-quality masking of the pedestrian. From the experimental group on the right, we can see that there were four false detections in the original Mask R-CNN test results: two pedestrian targets on the right side of the image, one car target in the middle of the image far away, and the subsection conducted migration experiments with the BDD autonomous driving dataset, which is a real-time image captured under driving conditions of real traffic scenes. The experiment was conducted with randomly selected sample images from different scenes in the dataset. The detection and segmentation results of the migration experiment are shown in Figures 15-23.   In this group of experiments, the traffic scene pictures in the BDD dataset were selected as the test dataset. From the left experimental group, we can see that there were three false detections in the original Mask R-CNN test results. On the leftmost side of the image, the model detected two pedestrians as three pedestrians and the fence in front of the tree as a pedestrian. The right side of the picture also had the problem of poor-quality masking of the pedestrian. From the experimental group on the right, we can see that there were four false detections in the original Mask R-CNN test results: two pedestrian targets on the right side of the image, one car target in the middle of the image far away, and the truck target in the middle of the image. Moreover, comparing the two sets of experiments in this scene, it can be seen that the overall mask quality was higher in the improved Mask R-CNN test results, and the details of each target could be accurately segmented.     The detection and segmentation of targets in front of vehicles in dark scenes is difficult due to the influence of light. There may be a large number of false detections and missed detections in the recognition process, and the segmentation accuracy of the model is required to be very high. Therefore, in this set of experiments, we selected the images of darker night scenes in the BDD dataset for testing. Compared with the experimental results, we can see that there were more false detections and missed detections in the original Mask R-CNN test results, and the mask quality was not high as well. The improved Mask R-CNN could better detect multiple targets in front of the road in the dark scene, and the mask quality of each target was better.    The detection and segmentation of targets in front of vehicles in dark scenes is difficult due to the influence of light. There may be a large number of false detections and missed detections in the recognition process, and the segmentation accuracy of the model is required to be very high. Therefore, in this set of experiments, we selected the images of darker night scenes in the BDD dataset for testing. Compared with the experimental results, we can see that there were more false detections and missed detections in the original Mask R-CNN test results, and the mask quality was not high as well. The improved Mask R-CNN could better detect multiple targets in front of the road in the dark scene, and the mask quality of each target was better.       While vehicle driving is affected by light, drivers also need to deal with various complex weather conditions. Therefore, in this set of experiments, we selected the images of rain and snow scenes from the BDD dataset for testing. Compared with the experiments on the left side, it can be seen that there were two false detections in the Mask R-CNN test results during rainy and snowy weather with a lot of fog in front. It was considered that the fog affected the feature extraction of the model for the front targets, which led to the low detection and segmentation accuracy. In the improved Mask R-CNN test results, the model could better detect all kinds of targets ahead. In the right experimental group, although the scene was clearer, the mask quality of each target was not high due to the influence of snow on the ground, and the details of each target could not be segmented better. The improved Mask R-CNN could accurately identify each target ahead and segment each target more completely.
The migration experimental results showed that the improved network model could still adapt and complete the detection and segmentation of multiple targets such as dense pedestrians and vehicles in various complex traffic scenarios for the new untrained In this group of experiments, the traffic scene pictures in the BDD dataset were selected as the test dataset. From the left experimental group, we can see that there were three false detections in the original Mask R-CNN test results. On the leftmost side of the image, the model detected two pedestrians as three pedestrians and the fence in front of the tree as a pedestrian. The right side of the picture also had the problem of poor-quality masking of the pedestrian. From the experimental group on the right, we can see that there were four false detections in the original Mask R-CNN test results: two pedestrian targets on the right side of the image, one car target in the middle of the image far away, and the truck target in the middle of the image. Moreover, comparing the two sets of experiments in this scene, it can be seen that the overall mask quality was higher in the improved Mask R-CNN test results, and the details of each target could be accurately segmented.
The detection and segmentation of targets in front of vehicles in dark scenes is difficult due to the influence of light. There may be a large number of false detections and missed detections in the recognition process, and the segmentation accuracy of the model is required to be very high. Therefore, in this set of experiments, we selected the images of darker night scenes in the BDD dataset for testing. Compared with the experimental results, we can see that there were more false detections and missed detections in the original Mask R-CNN test results, and the mask quality was not high as well. The improved Mask R-CNN could better detect multiple targets in front of the road in the dark scene, and the mask quality of each target was better.
While vehicle driving is affected by light, drivers also need to deal with various complex weather conditions. Therefore, in this set of experiments, we selected the images of rain and snow scenes from the BDD dataset for testing. Compared with the experiments on the left side, it can be seen that there were two false detections in the Mask R-CNN test results during rainy and snowy weather with a lot of fog in front. It was considered that the fog affected the feature extraction of the model for the front targets, which led to the low detection and segmentation accuracy. In the improved Mask R-CNN test results, the model could better detect all kinds of targets ahead. In the right experimental group, although the scene was clearer, the mask quality of each target was not high due to the influence of snow on the ground, and the details of each target could not be segmented better. The improved Mask R-CNN could accurately identify each target ahead and segment each target more completely.
The migration experimental results showed that the improved network model could still adapt and complete the detection and segmentation of multiple targets such as dense pedestrians and vehicles in various complex traffic scenarios for the new untrained sample dataset. It was proved that the proposed method had good migration performance, and the algorithm improvement was effective and feasible.

Conclusions
To address the problem of low detection and segmentation accuracy in the detection and segmentation of multiple targets in front of a vehicle in the environment perception of autonomous driving, this paper improved on the basis of Mask R-CNN. We improved the feature extraction capability of the backbone network by changing the backbone network ResNet to ResNeXt without increasing the complexity of the model. We also improved the FPN structure by adding a bottom-up path enhancement strategy. It can effectively shorten the information path and utilize the low-level accurate localization to improve the utilization of the low-level features of the network and facilitate the network to identify various targets more accurately. We also added the ECA attention module to the backbone feature extraction network to optimize the high-level, low resolution semantic information map output from the backbone network, which in turn led to a low resolution, high semantic feature information map for better identification of various types of dense medium and large targets. Finally, we replaced the loss function smooth L1 as CIoU for the two bounding box regressions in the model to improve the model convergence speed and optimize the network error. The experimental data showed that the loss function of the improved Mask R-CNN algorithm proposed in this paper had a smaller convergence value, and the mAP of target detection and instance segmentation were improved by 4.73% and 3.96%, respectively, compared with the original Mask R-CNN algorithm. The experimental results showed that the algorithm was more accurate and robust for the detection and segmentation of multiple targets ahead in traffic scenes. In the migration experimental test, the model could effectively identify various dense targets with good mask quality for different kinds of road traffic conditions and complex weather. In summary, the improved Mask R-CNN algorithm can be better applied to environment perception of autonomous driving with higher robustness and good generalization ability, which also has practical application prospects.
Although the improved Mask R-CNN proposed in this paper can better solve the problem of detecting and segmenting multiple targets in front of vehicles in complex scenes, there are still many areas to be improved in the future. For example, we will expand the number of targets in the dataset for road scenes and balance the number of targets in each category to avoid the problem of low target detection accuracy due to the small number of truck category samples in this study. The target detection and segmentation model proposed in this study will face a variety of traffic environments under diverse conditions in practical applications. The types of scenes provided by the dataset are still lacking compared to the actual situation. At the same time, it is still difficult to detect multiple targets in front of the vehicle under severe weather conditions, and the overall detection accuracy needs to be improved. Therefore, image quality assessment techniques will be introduced in future work to evaluate the quality of the dataset images. The low-quality images will be image-enhanced to improve the performance of the model. By continuously improving the detection and segmentation capabilities of the model, expanding the dataset, and supplementing with image quality assessment techniques, the future multi-objective recognition techniques for traffic scenes will be better applied to the environment perception of autonomous driving.

Conflicts of Interest:
The authors declare no conflict of interest.