DPNet: Dual-Pyramid Semantic Segmentation Network Based on Improved Deeplabv3 Plus

Abstract: Semantic segmentation finds wide-ranging applications and stands as a crucial task in the realm of computer vision. It holds significant implications for scene comprehension and decision-making in unmanned systems, including domains such as autonomous driving, unmanned aerial vehicles, robotics, and healthcare. Consequently, there is a growing demand for high precision in semantic segmentation, particularly in these contexts. This paper introduces DPNet, a novel image semantic segmentation method based on the Deeplabv3 plus architecture. (1) DPNet utilizes ResNet-50 as the backbone network to extract feature maps at various scales. (2) Our proposed method employs the BiFPN (Bi-directional Feature Pyramid Network) structure to fuse multi-scale information, in conjunction with the ASPP (Atrous Spatial Pyramid Pooling) module, to handle information at different scales, forming a dual pyramid structure that fully leverages the effective features obtained from the backbone network. (3) The Shuffle Attention module is employed in our approach to suppress the propagation of irrelevant information and enhance the representation of relevant features. Experimental evaluations on the Cityscapes dataset and the PASCAL VOC 2012 dataset demonstrate that our method outperforms current approaches, showcasing superior semantic segmentation accuracy.


Introduction
Semantic segmentation has found extensive applications in various fields [1], such as autonomous driving [2][3][4][5][6][7] and medical image processing [8][9][10], and it is also an important technique for UAV remote sensing image analysis [11][12][13]. A robust segmentation model should not only accurately capture the semantic information of objects but also improve the delineation of boundaries between neighboring objects. Deeplabv3 plus [14] has emerged as a classical architecture that leverages dilated convolutions to extract multi-scale features, thereby avoiding redundant upsampling operations and achieving a balance between model accuracy and computational efficiency. The ASPP module plays a vital role in aggregating contextual information from diverse regions, facilitating the exploration of global image characteristics. It is worth noting that the adoption of conventional single-scale convolutional kernels may limit the effective scope of feature extraction.
Based on the Deeplabv3 plus baseline model, a novel semantic segmentation method with a dual feature pyramid and attention mechanism is proposed. This method fully utilizes the feature information obtained from the backbone network, enhancing the model's generalization ability and enabling better handling of these problems.

Semantic Segmentation
FCN (Fully Convolutional Network) [15] revolutionized the field of semantic segmentation by introducing the use of convolutional neural networks. It replaces the last fully connected layer of the classification network with convolutional layers and enables end-to-end pixel-wise prediction of class labels in images. However, its effectiveness is hindered by the limited receptive field, which restricts the efficient utilization of multi-scale contextual information. Recently, some algorithms, such as UPerNet [16], have been developed based on the FPN (Feature Pyramid Network) structure, utilizing top-down lateral connections to fuse multi-scale information. HRNet [17] is designed to maintain high resolution even as the network becomes deeper, highlighting the importance of high-resolution feature maps for semantic segmentation. DANet [18] introduces a dual attention module that combines spatial attention and channel attention to improve feature representation, effectively integrating local and global features.

Multi-Scale Information Fusion
Previous works have discussed how to integrate multi-scale information and demonstrated the benefits of incorporating multi-scale information for semantic segmentation. Similar to the parallel aggregation in PSPNet [19], the Deeplab series [14,20,21] introduced the ASPP module to capture contextual information. It utilizes atrous convolution with different rates to construct multi-scale semantic information, allowing for a larger receptive field to capture multi-scale information.
In the shallow layers of the network, feature maps contain detailed information that effectively distinguishes small objects. However, due to the limited receptive field, they face difficulties in differentiating larger objects. On the other hand, in the deep layers of the network, feature maps undergo multiple downsampling operations, posing challenges for distinguishing small objects. Nevertheless, deep-level feature maps often contain rich semantic information that enables the discrimination of larger objects.
Zhu et al. [22] used a non-local method to fuse features from different scales. They proposed a Feature Pyramid Transformer for multi-scale feature fusion, transforming all feature maps to the same size or scale for fusion and abandoning the traditional top-down pathway. The CHASPP (Cascaded Hierarchical Atrous Spatial Pyramid Pooling) module [23] introduced a new hierarchical structure consisting of multiple convolutional layers, increasing sampling density and effectively addressing the weak representation issue caused by sparse sampling in the ASPP module.
Jiang et al. [24] proposed MSCB (Multi-Scale Context Block) to aggregate features from different scales. Chen et al. [25] proposed a method to preserve multi-scale features by learning spatial localization through multiple parallel paths. Dai et al. [26] designed a parallel, dual-branch network to extract information at different scales. Tan et al. [27] proposed a weighted bidirectional pyramid structure with discriminative fusion of different input features. Ou et al. [28] proposed a pyramid decoder structure to obtain multi-scale feature maps generated by the ASPP module at different stages. Lin et al. [29] introduced a multi-path semantic segmentation structure that integrates semantic information from three pathways.
These approaches and techniques demonstrate various strategies for integrating multiscale information and leveraging it for improved semantic segmentation.

Attention Module
The attention mechanism imitates the cognitive process of the human brain, which has limited processing capacity for the information presented. Therefore, it requires a focus on specific regions to acquire more crucial information while filtering out irrelevant data. In certain mobile devices, such as UAVs [11], small robots [30], and augmented reality devices [31], it is not feasible to incorporate large computational devices. In medical image analysis, the datasets often have a large volume and exhibit high resolution and complexity [32]. The inclusion of attention mechanisms can enhance the efficiency of extracting critical lesion areas in these datasets. In neural networks, particularly in scenarios with limited computational resources, extracting relatively more significant and valuable information assumes great importance [33]. This ability aids the model in achieving enhanced performance in the respective tasks. Notably, this mechanism has found successful applications across diverse computer vision tasks. Existing attention modules, such as SENet [34] (Squeeze and Excitation Net), ECANet [35] (Efficient Channel Attention Net), and CBAM [36] (Convolutional Block Attention Module), have been developed. The SENet module is a representative channel attention architecture that employs GAP (Global Average Pooling) and fully connected layers to recalibrate feature responses across channels, thereby reshaping the interdependencies among channels. However, the SE module solely focuses on capturing variations in pixel importance across different channels, disregarding distinctions in pixel importance within the same channel. ECANet, built upon SENet, introduces a one-dimensional convolutional filter to generate channel weights, replacing the fully connected layer and reducing module complexity. Additionally, differing from the aforementioned two structures, Woo et al. proposed CBAM, a convolutional block attention network that combines channel and spatial attention. 
CBAM additionally incorporates max pooling to further enhance the model's performance.
To integrate spatial and channel attention, challenges such as increased computational burden and convergence difficulties are encountered. To address this issue, a recent efficient module called Shuffle Attention [37] has been proposed. The Shuffle Attention module divides the feature map into multiple groups of sub-features along the channel dimension and leverages the Shuffle Unit to integrate complementary channel and spatial attention for each sub-feature. This approach offers a lightweight and efficient solution.
Our contributions are as follows: (1) We propose an improved semantic segmentation model based on Deeplabv3 plus, called DPNet. DPNet is a dual-pyramid semantic segmentation network. (2) We introduce a method that leverages a dual-feature pyramid to integrate feature maps at various resolutions, effectively amalgamating contextual information and enhancing accuracy. The model's capacity to capture multi-scale information and fuse diverse-scale information is strengthened while fully exploiting the effective features extracted by the backbone network. Compared with previous methods, our method improves on the classic Deeplabv3 plus algorithm by processing the feature map through two feature pyramids. Unlike the method proposed by Ou et al. [28], our method uses two feature pyramids as the encoder, which forms a dual-branch structure. While DANet [18] also employs a dual-branch structure, our method differs in that we do not separately compute spatial attention and channel attention. Instead, we utilize a lightweight attention module.

Methods
Deeplabv3 plus is an extension of the Deeplabv3 architecture, featuring a simple and efficient decoder module that refines feature information and enhances semantic segmentation performance. In the encoder, an improved Xception model serves as the backbone network, using depthwise separable convolutions to extract image features. Subsequently, the ASPP module employs three parallel 3 × 3 atrous convolutions with dilation rates of 6, 12, and 18, along with global average pooling, to capture high-level semantic information. In the decoder, the low-level features extracted from a shallow layer of the backbone network are reduced in channel dimension using 1 × 1 convolutions and fused with the high-level features obtained from the encoder. Multiple 3 × 3 convolutions are then applied to restore spatial information within the feature maps, followed by bilinear upsampling for precise boundary adjustment of the target objects. The segmentation results are derived from this process.
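To make the effect of the ASPP rates concrete, the following minimal sketch computes the spatial span covered by a single 3 × 3 atrous kernel at each rate, using the standard relation k + (k − 1)(r − 1). The function name `effective_kernel_size` is an illustrative assumption and is not part of the original method description.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span in pixels covered by one side of an atrous (dilated) kernel.

    A dilation rate r inserts (r - 1) gaps between adjacent kernel taps,
    so a k x k kernel covers k + (k - 1) * (r - 1) pixels per side.
    """
    return kernel_size + (kernel_size - 1) * (rate - 1)


# Dilation rates of the three parallel 3 x 3 ASPP branches described above.
aspp_rates = (6, 12, 18)
spans = {rate: effective_kernel_size(3, rate) for rate in aspp_rates}
print(spans)  # {6: 13, 12: 25, 18: 37}
```

Each branch keeps the same nine weights per kernel while covering a progressively larger window, which is why the parallel branches can capture context at several scales without extra downsampling.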
The algorithm proposed in this paper selects ResNet as the backbone network due to its strong feature extraction capability. Extending the framework of Deeplabv3 plus, a dual-path, dual-feature pyramid structure is introduced to facilitate multi-scale feature fusion. This approach effectively leverages the feature maps at different scales obtained from the backbone network.
In comparison to the Xception network model, ResNet introduces a residual structure with identity mapping, enabling smooth gradient propagation from shallow to deep layers. With this residual structure, it becomes feasible to train very deep neural networks, enhancing the capability of feature extraction and capturing the fine details and characteristics of input data. The ResNet network model with residual mechanisms can better retain the features of different scales for the later part of the network model.
The overall structure of the parallel dual pyramid is illustrated in Figure 1. In the algorithm proposed in this paper, ResNet is employed to obtain four layers of feature maps. The highest-level feature map is fed into the ASPP module, while the remaining three shallow feature maps are passed through BiFPN to enhance the network model's capability for multi-scale feature fusion. The BiFPN module and ASPP module are integrated as parallel dual-branch structures, serving as the encoder of the overall architecture. The decoder follows a structure similar to that of Deeplabv3 plus: the high-level feature map from the ASPP module is upsampled to match the size of the feature map output by the BiFPN module, and the two are stacked together. Two depthwise separable convolutions are then applied to obtain the final effective feature map, which serves as a condensed representation of the entire image. Finally, the feature map is upsampled to the size of the input image.

Multi-Scale Information Feature Fusion in the Dual-Pyramid Structure
In the original algorithm of Deeplabv3 plus, the decoder part simply used a single low-level feature map to combine with the high-level features from the ASPP module for multi-scale feature fusion. This led to the underutilization of the multi-scale information obtained from the backbone network. To address this issue, we introduce the BiFPN module as a component for processing shallow features. It incorporates a weight selection mechanism that helps preserve more relevant and valuable features.
BiFPN is a structure proposed by the Google team in 2020 for fusing multi-scale information, originally introduced in the EfficientDet model [27]. The inspiration for BiFPN comes from the PANet [38] structure (Figure 2).
BiFPN is a weighted bi-directional feature pyramid network consisting of a top-down pathway to propagate semantic information from higher layers and a bottom-up pathway to convey positional information from lower layers. It differs from the PANet structure in several aspects. Firstly, nodes with only one input edge are removed since they contribute minimally to the overall fusion of multi-scale features. This removal simplifies the bi-directional network structure. Secondly, if an original input node and an output node are in the same layer, an additional edge is inserted between them. This allows the combination of more features without significantly increasing the data volume. Unlike traditional feature fusion approaches that often rely on simple feature map concatenation or shortcut operations, BiFPN takes into account the varying resolutions and contributions of feature maps from different layers. Simple stacking is not the optimal solution.
In the BiFPN structure [27], learnable weights are introduced to determine the importance of different input features. Each input feature only needs to be multiplied by its learnable weight, and the weights are normalized in a manner similar to the softmax method so that they are scaled to the range of [0, 1], as shown in Equation (1):
$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i \qquad (1)$$
In Equation (1), $I_i$ is the $i$-th input feature and $w_i$ is a learnable weight. The ReLU activation function is applied after each $w_i$ to ensure that its value is greater than or equal to 0. The small value $\epsilon = 0.0001$ is used to prevent numerical instability.
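The weighting in Equation (1) can be illustrated with a minimal pure-Python sketch. It operates on flat lists of numbers rather than real feature maps, and the function name `fast_normalized_fusion` and the sample values are illustrative assumptions rather than the authors' implementation.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse equally sized features with ReLU-clipped, normalized weights.

    features:    list of feature vectors (lists of floats of equal length)
    raw_weights: one learnable scalar per feature (may be negative)
    eps:         small constant that avoids division by zero
    """
    # ReLU keeps every weight non-negative.
    weights = [max(w, 0.0) for w in raw_weights]
    total = eps + sum(weights)
    # Each normalized weight falls in [0, 1]; together they sum to just under 1.
    normalized = [w / total for w in weights]
    length = len(features[0])
    fused = [
        sum(n * feature[k] for n, feature in zip(normalized, features))
        for k in range(length)
    ]
    return fused, normalized


# Hypothetical example: the second input has a negative raw weight,
# so ReLU suppresses it and the output follows the first input.
fused, normalized = fast_normalized_fusion(
    features=[[1.0, 2.0], [3.0, 4.0]],
    raw_weights=[1.0, -0.5],
)
print(normalized, fused)
```

Unlike softmax, this normalization needs no exponentials, which is the reason given in [27] for preferring it.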
In the proposed structure of this paper, three groups of features, $P_1 \sim P_3$, are obtained by the backbone feature extraction network at different scales. The channel numbers are adjusted using 1 × 1 convolutional operations. $P_3$ is then transformed into $P_3'$ via upsampling, matching the scale of $P_2$. The result is stacked and convolved with feature $P_2$. The intermediate calculation involves computing the weight selection via Equation (1), which is used to learn the importance of different input features.
As shown in Figure 3, taking $P_2$ and $P_3$ as examples, the expressions for $P_2^{td}$ and $P_2^{out}$ are given by Equations (2) and (3). The intermediate weight mechanism determines whether to focus more on $P_2'$ or $P_3'$. After the weight mechanism selection, the other features undergo similar stacking and weight selection operations to form a BiFPN module. Accuracy can be improved by repeatedly stacking BiFPN modules, but doing so increases the data volume. In the proposed structure of this paper, the module is therefore stacked only twice, which already achieves a good improvement.
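For orientation, the corresponding intermediate and output nodes in EfficientDet [27] have the form sketched below. Applying that form to the three-level pyramid used here is an assumption; the symbols $P^{in}$ (input feature), $\mathrm{Resize}$ (up- or downsampling to the target scale), $\mathrm{Conv}$, and the primed weights $w'$ follow the notation of [27] and are not defined in the text above.

```latex
P_2^{td}  = \mathrm{Conv}\!\left( \frac{w_1 \cdot P_2^{in} + w_2 \cdot \mathrm{Resize}(P_3^{in})}{w_1 + w_2 + \epsilon} \right)

P_2^{out} = \mathrm{Conv}\!\left( \frac{w_1' \cdot P_2^{in} + w_2' \cdot P_2^{td} + w_3' \cdot \mathrm{Resize}(P_1^{out})}{w_1' + w_2' + w_3' + \epsilon} \right)
```

Both expressions are instances of Equation (1): each sums its inputs with non-negative weights normalized by their total plus $\epsilon$, and then applies a convolution.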
Electronics 2023, 12, x FOR PEER REVIEW
By fully utilizing the feature maps of different scales obtained from the backbone network, the shallow low-level feature maps in Deeplabv3 plus are replaced with the fusion results of the BiFPN structure. This enhances the network's representation capability for features of different resolutions, resulting in improved pixel classification and more accurate details.

Shuffle Attention Module
For different computer vision tasks, attention modules are used to build the interdependencies between features. There are two widely used types: channel attention and spatial attention. The former aggregates information along the channel dimension of the feature map, while the latter aggregates information within each channel's spatial dimensions. These attention mechanisms can enhance the model's representation capability and improve its accuracy.
The most commonly used ones are the SE attention block and the CBAM. In recent years, the SE attention block has been widely used in semantic segmentation tasks. However, the SE block only considers the influence of channel relationships on features, ignoring spatial positional information. While the CBAM block takes into account both spatial and channel dimensions, it introduces additional parameters, thereby increasing the network's parameter count. Moreover, it requires the calculation of additional attention scores, which adds to the model's complexity and computational cost. As a result, it may not be suitable for scenarios with limited computing resources. Additionally, the performance of the CBAM module highly depends on the input of convolutional layers, making it less effective for special or irregular image inputs, leading to no improvement or even a decrease in segmentation performance.
In our approach, we integrated the lightweight attention mechanism called Shuffle Attention, which was originally proposed in SA-Net [37]. This attention mechanism efficiently combines both spatial and channel attention mechanisms to enhance the representation of features in our method.
The structure of the Shuffle Attention module is illustrated in Figure 4. The overall structure is divided into four sub-modules:
(1) Feature Grouping: The input feature map $X \in \mathbb{R}^{C \times W \times H}$ is divided into $G$ groups of sub-feature maps along the channel dimension. Here, $C$ represents the number of channels in the feature map, while $W$ and $H$ represent the width and height of the feature map, respectively. It can be expressed as Equation (4).
$$X = [X_1, \ldots, X_G], \quad X_i \in \mathbb{R}^{\frac{C}{G} \times W \times H} \qquad (4)$$
Next, each group of sub-feature maps $X_i$ is further divided into two branches, $X_{i1}, X_{i2} \in \mathbb{R}^{\frac{C}{2G} \times W \times H}$. These branches correspond to channel attention and spatial attention, respectively. The channel attention branch utilizes the relationships between feature channels to generate channel attention maps, while the other branch utilizes the spatial relationships between features to generate spatial attention maps.
(2) Channel Attention: For the channel attention branch, the input $X_{i1}$ is first subjected to global average pooling to obtain channel-wise statistics. Subsequently, a linear transformation function is employed to enhance the feature representation. The resulting features are then activated using a sigmoid function. Finally, element-wise multiplication is performed between the activated features and the original input, enabling the incorporation of global information. This process generates a feature representation with channel attention weights, denoted as $X_{i1}'$, thereby enhancing the features. The detailed calculation process is defined by Equation (5).
$$X_{i1}' = \sigma(F_c(S_1)) \cdot X_{i1} = \sigma(W_1 S_1 + b_1) \cdot X_{i1} \qquad (5)$$
Here, $F_c$ denotes the linear function, $\sigma(\cdot)$ denotes the sigmoid activation function, $S_1$ denotes the average-pooled feature, and the two parameters $W_1, b_1 \in \mathbb{R}^{\frac{C}{2G} \times 1 \times 1}$ are linear transformation parameters used to scale and translate $S_1$.
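The channel attention branch can be sketched in a few lines of pure Python. The sketch treats a sub-feature as a list of channels, each an H × W nested list, and uses one scale and one shift value per channel in the role of $W_1$ and $b_1$. The function names and the sample numbers are illustrative assumptions, not the authors' code.

```python
import math


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def channel_attention(sub_feature, scale, shift):
    """Gate each channel by a sigmoid of its scaled, shifted global average.

    sub_feature: list of channels, each an H x W nested list of floats
    scale, shift: one float per channel (playing the roles of W_1 and b_1)
    """
    gated = []
    for channel, w_c, b_c in zip(sub_feature, scale, shift):
        values = [v for row in channel for v in row]
        pooled = sum(values) / len(values)      # global average pooling (S_1)
        gate = sigmoid(w_c * pooled + b_c)      # sigma(W_1 * S_1 + b_1)
        gated.append([[gate * v for v in row] for row in channel])
    return gated


# Hypothetical 2-channel, 2 x 2 input. With scale 0 and shift 0 the gate
# is sigmoid(0) = 0.5, so the first channel is simply halved.
x = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
out = channel_attention(x, scale=[0.0, 1.0], shift=[0.0, 0.0])
print(out[0])  # [[0.5, 1.5], [2.5, 3.5]]
```

Because the gate is a single scalar per channel, this branch reweights whole channels and leaves the spatial pattern inside each channel unchanged.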
(3) Spatial Attention: Spatial attention can be seen as complementary information to channel attention, focusing on the spatial aspects of the features. For the spatial attention branch $X_{i2}$, the features are first normalized using the GN (Group Norm) function to obtain spatial statistics. Then, a linear transformation function is applied to enhance the feature representation. The resulting features are activated using a sigmoid function and element-wise multiplied with the original features, yielding the spatial attention-weighted features $X_{i2}'$. This process highlights the importance of specific regions by emphasizing their contribution. The specific calculation procedure is defined by Equation (6).
$$X_{i2}' = \sigma(W_2 \cdot GN(X_{i2}) + b_2) \cdot X_{i2} \qquad (6)$$
where $GN(\cdot)$ denotes the group norm normalization function, and the two parameters $W_2, b_2 \in \mathbb{R}^{\frac{C}{2G} \times 1 \times 1}$ are linear transformation parameters.
(4) Aggregation: To combine the channel attention features $X_{i1}'$ and spatial attention features $X_{i2}'$, they are weighted and concatenated to form the aggregated features $X_i'$. This is achieved by connecting the two branches and ensuring that the resulting feature has the same number of channels as the input. It can be expressed as Equation (7).
$$X_i' = [X_{i1}', X_{i2}'] \in \mathbb{R}^{\frac{C}{G} \times W \times H} \qquad (7)$$
Then, all the sub-features are aggregated: once the feature maps have been computed via channel attention and spatial attention, respectively, all generated sub-features are fused. Finally, a "channel shuffle" operation rearranges the channels, enabling information interaction between different groups of channels and enhancing the model's expressive power.
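The channel shuffle step can be sketched as a pure index permutation: the channels are viewed as a (groups, channels-per-group) grid, the grid is transposed, and the result is flattened. The function name `channel_shuffle` and the choice of two groups in the example are illustrative assumptions.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    channels: list of per-channel items (indices here, feature maps in practice)
    groups:   number of groups; must divide the number of channels
    """
    count = len(channels)
    if count % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = count // groups
    # View as (groups, per_group), transpose to (per_group, groups), flatten.
    return [
        channels[g * per_group + k]
        for k in range(per_group)
        for g in range(groups)
    ]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle(list(range(6)), groups=2)
print(shuffled)  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, neighboring channels come from different groups, so a subsequent grouped operation sees information from every group.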
The Shuffle Attention module is an efficient attention mechanism based on channel shuffling. Unlike global attention mechanisms, it offers improved performance without significantly increasing computational costs. This makes it particularly advantageous when dealing with a large number of channels, as it achieves better results with fewer resources. Its implementation is straightforward, requiring only convolution operations on the reshuffled feature maps. In this paper, the Shuffle Attention module is positioned after the ASPP module, leveraging the multi-scale features fused by the ASPP module while further enhancing the representation capabilities via the Shuffle Attention module. Moreover, this approach prevents premature compression of the features strengthened by the Shuffle Attention module.

Experiments and Results
In this section, we first describe the configuration of the experimental environment, then introduce the datasets and metrics used for evaluation, and finally present a series of experimental results obtained on each of the two datasets.

Experimental Environment
The experimental platform in this paper is Ubuntu 18.04 with an Nvidia RTX 3090 GPU (24 GB of memory), and the deep learning framework is PyTorch 1.8. We use ResNet as the backbone network, initialized with weights pre-trained on ImageNet, and train the network with SGD (Stochastic Gradient Descent) using a momentum of 0.9 and a weight decay of $1 \times 10^{-4}$. We employ the typical "poly" learning rate strategy, in which the learning rate (lr) is calculated by Equation (8).
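The poly rule is commonly written as lr = base_lr × (1 − iter / max_iter)^power. A minimal sketch of that form follows. The base learning rate and the power used in the experiments are not stated in this section, so the values below (0.01 and 0.9) are placeholders, and the function name `poly_lr` is hypothetical:

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial ("poly") learning rate decay.

    The rate starts at base_lr and decays smoothly to 0 at max_iterations.
    power=0.9 is a placeholder default, not a value taken from the paper.
    """
    if max_iterations <= 0:
        raise ValueError("max_iterations must be positive")
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Placeholder base learning rate, sampled at a few points of a 100-step run.
for step in (0, 25, 50, 75, 100):
    print(step, poly_lr(0.01, step, 100))
```

With a power below 1, the rate falls slowly at first and more steeply toward the end of training.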
We trained for 100 epochs with a mini-batch size of 8 on both the Cityscapes dataset [39] and the Pascal VOC 2012 dataset [40].
We compared our proposed method with several state-of-the-art networks, which have been open-sourced, on the Cityscapes dataset and the Pascal VOC 2012 dataset. For fair comparison, we set the backbone network of all these methods to ResNet-50 to minimize the influence of the backbone network on the experiments. We also conducted ablation experiments on the Cityscapes dataset to further evaluate our approach.

Cityscapes
Cityscapes is a popular dataset for semantic segmentation tasks. It consists of high-resolution images and corresponding semantic annotations from 50 cities in Germany. The dataset includes 19 classes and covers various urban scenes such as roads, pedestrians, vehicles, and buildings. It is commonly used for semantic segmentation in the context of autonomous driving. The dataset contains 5000 finely annotated images and 19,998 coarsely annotated images. The finely annotated images are divided into three subsets: 2975 images for training, 500 images for validation, and 1525 images for testing. For the Cityscapes dataset, we evaluated our method on the validation set only.

Pascal VOC 2012
The Pascal VOC 2012 dataset is also a popular general-purpose semantic segmentation dataset that can be used to evaluate the performance of algorithms. It covers 21 classes: 20 common object categories, such as cars, people, cats, and dogs, plus the background. It contains 1464 training images, 1449 validation images, and 1456 test images. Additionally, an augmented set of 10,582 images is available for training purposes. For the Pascal VOC 2012 dataset, we evaluated our method solely on the validation set.

Evaluation Metrics
To evaluate DPNet, this study analyzes the experimental results from both subjective and objective perspectives. The subjective evaluation compares certain targets and objects, such as the segmentation performance on smaller objects and object boundaries, based on the visual results of semantic segmentation. For objective evaluation, this study employs two metrics: mIoU (Mean Intersection over Union) and mPA (Mean Pixel Accuracy). In semantic segmentation tasks, the intersection over union for a single class is the ratio of the intersection between the ground truth label and the predicted label to their union.
The mIoU is the average of the intersection over union values for each class in the dataset. Equation (9) illustrates the calculation method for mIoU. Specifically, TP represents true positive, indicating the model's prediction is positive and the ground truth is also positive. FP represents false positive, indicating the model's prediction is positive while the ground truth is negative. FN represents false negative, indicating the model's prediction is negative while the ground truth is positive. TN represents true negative, indicating both the model's prediction and the ground truth are negative.
The mPA calculates the percentage of pixels correctly classified for each class separately, and then these values are summed and averaged as shown in Equation (10).
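Both metrics can be computed from a confusion matrix. The sketch below assumes the per-class forms IoU = TP / (TP + FP + FN) and pixel accuracy = TP / (TP + FN), which match the verbal definitions above, and then averages over classes. The function names are hypothetical, and skipping classes that never occur is one possible convention, not necessarily the one used in the experiments:

```python
def per_class_counts(confusion):
    """Return (TP, FP, FN) per class.

    confusion[t][p] is the number of pixels whose ground-truth class is t
    and whose predicted class is p.
    """
    n = len(confusion)
    counts = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c pixels predicted as another class
        fp = sum(confusion[t][c] for t in range(n)) - tp   # other pixels predicted as class c
        counts.append((tp, fp, fn))
    return counts


def mean_iou(confusion):
    """Average of TP / (TP + FP + FN) over classes that occur."""
    ious = [tp / (tp + fp + fn)
            for tp, fp, fn in per_class_counts(confusion)
            if tp + fp + fn > 0]
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)


def mean_pixel_accuracy(confusion):
    """Average of TP / (TP + FN) over classes present in the ground truth."""
    accs = [tp / (tp + fn)
            for tp, fp, fn in per_class_counts(confusion)
            if tp + fn > 0]
    if not accs:
        raise ValueError("confusion matrix contains no pixels")
    return sum(accs) / len(accs)


# Toy two-class example: rows are ground truth, columns are predictions.
toy = [[3, 1],
       [1, 5]]
print(mean_iou(toy), mean_pixel_accuracy(toy))
```

Under these forms TN does not enter either metric, which is why mIoU is not inflated by large, easily classified background regions.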

Ablation Study on Cityscapes Dataset
We conducted ablation experiments on the Cityscapes dataset, as shown in Table 1, which reports (1) the results of the baseline; (2) the results after adding BiFPN; (3) the results after adding Shuffle Attention; and (4) the results of our proposed dual-path dual-feature pyramid semantic segmentation method with both modules. These experiments were performed using ResNet-50 as the backbone network. All values are reported as percentages (%). '√' indicates that the module is added.
After adding BiFPN, the mIoU increased by 1.24 percentage points over the baseline, reaching 77.83%, which indicates that this module effectively selects better features and outputs them as useful features. Adding Shuffle Attention alone increased the mIoU by 1.2 percentage points, to 77.79%; the attention mechanism combines channel attention and spatial attention and suppresses the transmission of irrelevant features. When both modules were added simultaneously, the mIoU reached 78.69%, a 2.1-percentage-point improvement over the baseline, and the mPA also improved by a small margin. These results indicate that the multi-scale fusion method and the attention module each have a positive impact on scene segmentation, and that combining them improves the segmentation capability of the model further.
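As a quick consistency check on the figures quoted above, each reported mIoU minus its reported gain should point back to the same baseline. The baseline value itself is not quoted in the text; the number computed below is inferred from the quoted figures only, and the variable and function names are illustrative:

```python
# mIoU values (in %) quoted in the text for each ablation setting.
reported_miou = {
    "BiFPN": 77.83,
    "Shuffle Attention": 77.79,
    "BiFPN + Shuffle Attention": 78.69,
}
# Gains over the baseline quoted in the text, in percentage points.
reported_gain = {
    "BiFPN": 1.24,
    "Shuffle Attention": 1.20,
    "BiFPN + Shuffle Attention": 2.10,
}


def implied_baseline(setting):
    """Baseline mIoU implied by one setting's reported value and gain."""
    return round(reported_miou[setting] - reported_gain[setting], 2)


baselines = {name: implied_baseline(name) for name in reported_miou}
for name, value in baselines.items():
    print(f"{name}: implied baseline {value:.2f}%")
```

All three settings imply the same baseline, so the quoted gains and absolute values are mutually consistent.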
We also conducted experiments to examine the impact of noise on the model. We preprocessed the validation set images by introducing slight Gaussian noise (with a mean of 0 and a standard deviation of 0.3). The addition of Gaussian noise resulted in blurred edges and details in the images. Figure 5 showcases the images after adding Gaussian noise.
Figure 5. Slight Gaussian noise added to the validation set images: (a) RGB images; (b) RGB images processed with Gaussian noise.
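The noise preprocessing can be sketched as follows. The sketch assumes pixel intensities normalized to [0, 1] and clips the result back into that range; neither the intensity scale nor the clipping is stated in the text, and the function name `add_gaussian_noise` is hypothetical:

```python
import random


def add_gaussian_noise(pixels, mean=0.0, std=0.3, seed=None):
    """Add independent Gaussian noise to each intensity value.

    `pixels` is a flat list of intensities assumed to lie in [0, 1].
    The noisy values are clipped back into [0, 1] (an assumption).
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    noisy = []
    for p in pixels:
        value = p + rng.gauss(mean, std)
        noisy.append(min(1.0, max(0.0, value)))
    return noisy


# A tiny 2 x 2 single-channel "image", flattened row by row.
image = [0.10, 0.50, 0.90, 0.30]
print(add_gaussian_noise(image, seed=0))
```

Fixing the seed keeps the perturbed validation set identical across the compared models, so differences in mIoU reflect the models rather than the noise draw.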
In Table 1, the data in the last two columns indicate that compared to the images without preprocessing, the baseline method's mIoU decreased by 5.66% after adding Gaussian noise, while our proposed method only experienced a 3.91% decrease. This suggests that our method exhibits some performance degradation when dealing with slight noise, but it performs better compared to the baseline algorithm.
In Table 2, we present the changes in memory usage for single-image inference, showing how the memory footprint varies after incorporating different modules. We also calculated the FLOPs (Floating Point Operations) to reflect the changes in computational complexity. As shown in Table 2, adding the BiFPN module increased memory usage by 195 MB, while adding the Shuffle Attention module increased memory usage by 48 MB. When both modules were added simultaneously, the memory usage increased by 242 MB. Regarding FLOPs, our method experienced a 4.4% increase after adding the BiFPN module, while there was only a minor change after adding the Shuffle Attention module. A small amount of additional computational resources thus yields an overall performance improvement.
The visualization results in Figure 6 show that our method segments some objects more completely, whereas Deeplabv3 plus outputs incomplete objects in some cases, as shown in the first two rows. For some object edges, our proposed method obtains more accurate results and improves the segmented boundaries, as shown in the third and fourth rows. For smaller or distant objects, our proposed method also obtains visibly better results. However, the segmentation contours of these smaller or distant objects still leave room for improvement, as shown in the last three rows of Figure 6.

Results on Cityscapes Dataset
To validate the effectiveness and rationality of our proposed method, we compared it with widely used methods such as FCN, Deeplabv3, UPerNet, PSPNet, and DANet. On the Cityscapes dataset, we set the training input size to 768 × 768 and conducted experiments using the same data augmentation methods. Table 3 presents the comparative experimental results: we quantitatively evaluated the methods on the validation set under the same environment and configuration, and our method outperforms the other popular methods. In previous work, DANet utilized a self-attention mechanism to capture long-range dependencies, while PSPNet adopted a similar parallel aggregation structure and achieved some results. Different from these methods, our method is based on the parallel structure of the dual pyramid, making full use of the effective feature maps obtained from the backbone network. Adding the Shuffle Attention module improves the performance further. It can also be observed that several classes exhibit poor segmentation performance across the compared methods, such as walls, fences, poles, and riders. Our model still requires further enhancement for thin objects or objects with intricate details.

Results on Pascal VOC 2012 Dataset
To further validate the effectiveness and rationality of our proposed method, we compared its structure with several widely used methods. On the Pascal VOC 2012 dataset, we set the training input size to 512 × 512 and conducted experiments using the same data augmentation methods. Table 4 presents the comparative experimental results on the Pascal VOC 2012 dataset. All values are reported as percentages (%). BG stands for background.
As seen from Table 4, among these popular methods (FCN, HRNet, UPerNet, PSPNet, and Deeplabv3 plus), our method achieves the best performance on the validation set, with a final mIoU of 79.51% and an mPA of 94.77%. By comparison, the classic Deeplabv3 plus method reaches a final mIoU of 78.26% and an mPA of 94.55%, which is 1.25 and 0.22 percentage points lower, respectively.
We also show the results of the visualization and make an intuitive comparison, as shown in Figure 7.
As shown in Figure 7, we compare our proposed method visually with the Deeplabv3 plus method; the segmentation results improve after adding the BiFPN module and the Shuffle Attention module. For example, in the first row, our method segmented the details of the rear tire of the bicycle more accurately; in the second row, our method segmented the human legs more clearly; in the third row, our method segmented the motorcycle bracket fully. In the last three rows, our method segments the contours of objects more accurately. However, for objects such as bicycles, chairs, and plants, further improvements are needed in the segmentation results. While our method has shown some improvements compared to other methods, we still face challenges in accurately segmenting objects with intricate details.

Discussion
This paper proposes an improved method based on Deeplabv3 plus, changing the encoder part to a dual-pyramid structure. The parallel dual-pyramid structure makes full use of the effective feature layers obtained from the backbone while still extracting global information, which enriches the information in the feature map. On this basis, we also added an attention mechanism. The Shuffle Attention module considers not only the connection between channels but also the relationship between spatial positions. Under limited hardware resources, it suppresses the transmission of useless features and preserves the expression of effective features. Together, these changes improve the segmentation results. Our method also degrades less than the baseline approach when slight noise is added. For smaller objects, objects with finer details, or objects located at a greater distance in the images, such as bicycle tires, fences, poles, and people in the distance, our method shows some advancement over the other methods, although there is still room for improvement. Experiments show that the proposed DPNet model segments images with high precision and performs well on two public datasets.