E-HRNet: Enhanced Semantic Segmentation Using Squeeze and Excitation

Abstract: In the field of computer vision, convolutional neural network (CNN)-based models have demonstrated high accuracy and good generalization performance. However, in semantic segmentation, CNN-based models lose spatial and global context information owing to the decrease in resolution during feature extraction. High-resolution networks (HRNets) mitigate this problem by keeping high-resolution processing layers in parallel, but information loss still occurs. Therefore, in this study, we propose an HRNet combined with an attention module to address this information loss. The attention module is placed immediately after each convolution to alleviate information loss by emphasizing the information retained at each stage. To achieve this, we employ a squeeze-and-excitation (SE) block as the attention module, which integrates seamlessly into any model and enhances performance without a significant increase in parameters. It emphasizes spatial and global context information by compressing and recalibrating features through global average pooling (GAP). A performance comparison between the existing HRNet model and the proposed model on various datasets shows that the mean class-wise intersection over union (mIoU) and mean pixel accuracy (MeanACC) improve with the proposed model at the cost of a small increase in the number of parameters. With the Cityscapes dataset, MeanACC decreased by 0.1% with the proposed model compared to the baseline model, but mIoU increased by 0.5%. With the LIP dataset, MeanACC and mIoU increased by 0.3% and 0.4%, respectively. With the PASCAL Context dataset, the mIoU decreased by 0.1%, whereas MeanACC increased by 0.7%. Overall, the proposed model showed improved performance compared to the existing model.


Introduction
Studies on convolutional neural networks (CNNs) in the field of computer vision have demonstrated high accuracy and good generalization performance across various tasks and open datasets, including image classification, semantic segmentation, object detection, and human pose estimation. However, capturing complex relationships between channels or between pixel positions in space remains challenging because the extracted features carry insufficient global context and spatial information.
To solve this problem, studies have combined CNNs with attention modules in residual connection architectures to emphasize global context information. In semantic segmentation, fully convolutional approaches replaced the fully connected layers so that all layers are configured as convolution layers. Subsequently, SegNet [14] and UNet [15], which use an encoder-decoder structure, were proposed. In addition, noting that spatial information at different resolutions is important for performance improvement, DeepLabv3 [16] and PSPNet [17], which use atrous convolution and ASPP, were proposed. RefineNet [18], which combines feature maps of various resolutions using a refine block, showed improved performance because high resolution provides rich spatial information. As another method using multiple resolutions, research on transmitting and exchanging low-resolution information with a residual connection structure [19] has been published; in addition, several other studies have been published, such as combining multi-scale pyramid representations [20,21].
Existing CNN-based models have a pyramid structure in which the size of the convolution feature maps decreases as the depth increases [1,13]. In contrast, HRNet maintains high-resolution feature maps in parallel with branches whose feature maps are smaller than those of the high-resolution branch. A new feature map is generated by merging the feature maps of branches with different resolutions, which yields richer information through the exchange of information across resolutions. Feature maps containing information at multiple resolutions allow high-quality upsampling, resulting in more accurate segmentation. The structure of HRNet is shown in Figure 1.
The HRNet consists of four stages. The first stage is a bottleneck structure with 64 channels, like ResNet-50 [22]. The second, third, and fourth stages consist of transition and exchange units. The transition unit fuses the feature maps of different branches to generate a new feature map, and the exchange unit exchanges information between the feature maps of different branches. In the overall structure of HRNet, the exchange unit is repeated four times after the transition unit and convolution. In HRNetV2-W18, W30, and W48, W denotes the number of channels of the highest-resolution convolution. The convolution size was 3 × 3, and the size of the first input feature map differed for each dataset, as explained in detail in Section 3. When generating a new feature map from feature maps of different resolutions, downsampling or upsampling was performed to match the resolutions. For downsampling to 1/2 resolution, a stride-2 convolution was performed; for 1/4 resolution, a stride-2 convolution was performed again. When upsampling by 2× or 4×, the maximum value was used, and upsampling was performed in one step without intermediate steps. The number of channels in a parallel branch doubles when the resolution is halved: if the original-resolution branch has 32 channels, the 1/2-resolution branch has 64 channels, the 1/4-resolution branch has 128, and the 1/8-resolution branch has 256.
Despite the aforementioned efforts, HRNet continues to experience information loss during feature extraction, owing to the inherent characteristics of convolution-based models. These factors contribute to a decrease in resolution, which is a significant concern in semantic segmentation because it adversely affects segmentation accuracy. To address this issue, this study proposes inserting an attention module immediately after each convolution to mitigate information loss and alleviate the resolution reduction problem.

Attention Module
The basic idea of attention in natural language processing is that the encoder refers to the entire input sentence once again at each timestep at which the decoder predicts an output word. Rather than referencing the entire input sentence in equal proportions, it attends to the parts of the input related to the word to be predicted at that time. The basic concept of the attention technique, which is applied in many fields of computer engineering, resembles a dictionary data type consisting of key-value pairs, as shown in (1):

Attention(Q, K, V) = Σ softmax(sim(Q, K)) · V (1)

In Equation (1), attention calculates the similarity between a given query Q and a key K. The resulting similarity is multiplied by each value V mapped to a key, and the sum of all similarity-weighted values is obtained. Self-attention is an expanded form of attention [23]. The query, key, and value of conventional attention are different values, whereas those of self-attention come from the same input. Self-attention recalibrates the channels by passing the input query and key through a 1 × 1 convolution. Subsequently, the keys are transposed and multiplied with the queries to obtain the cosine similarity, and the attention map is output using a softmax. Finally, a self-attention feature map is generated by multiplying the attention map with the values that have undergone a 1 × 1 convolution. Self-attention has expanded to various fields such as reinforcement learning, image captioning, and natural language processing [24,25]. It is also used to emphasize the relationship between context information and pixels [26]. Attention mechanisms have been used in many computer vision tasks to address the limitations of standard convolutions [27-30]. In some computer vision tasks, multi-head self-attention with a sufficient number of heads produced notable results in a study by Cordonnier et al. [31]. In addition, a standalone self-attention model in which all layers are composed of self-attention achieved excellent performance [32].
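As a hedged illustration only, the following PyTorch sketch implements a self-attention layer of the kind described above, with 1 × 1 convolutions producing the query, key, and value and a softmax producing the attention map; the module name and the channel-reduction factor of 8 are our assumptions and not part of the original:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1 x 1 convolutions project the input into query, key, and value
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # N x HW x C/8
        k = self.key(x).flatten(2)                    # N x C/8 x HW
        attn = F.softmax(q @ k, dim=-1)               # N x HW x HW attention map
        v = self.value(x).flatten(2)                  # N x C x HW
        out = v @ attn.transpose(1, 2)                # values weighted by similarity
        return out.view(n, c, h, w)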
The attention module is mainly used in tasks where context information is important, such as visual question answering (VQA), image captioning, and scene character recognition [33,34]. However, when the concept of attention was expanded to self-attention, it began to be used in CNNs. SENet uses an attention mechanism that captures the interactions between channels so that each channel can be assigned a different weight. This allows the model to improve performance through per-channel weighting: a channel with a large weight is interpreted as an important feature, whereas a channel with a small weight is interpreted as containing less important information. The weights are assigned to the different channels of the feature map and multiplied with it. The SE Block consists of two stages: squeeze and excitation. In the squeeze stage, global average pooling (GAP) compresses each channel of the image into a single value. In the excitation stage, the squeezed vector passes through two fully connected layers with a rectified linear unit (ReLU) and a sigmoid. Finally, the resulting weight vector, which carries the squeezed information, is multiplied by the features that have passed through a 1 × 1 convolution to emphasize the important information. Figure 2 illustrates an SE Block [6]. Figure 3 shows the detailed architecture of the SE Block inserted into ResNet.
The hyperparameter of the SE Block is the reduction ratio, which controls the number of nodes in the fully connected layers around the ReLU. As the reduction ratio decreases, the number of parameters increases; as the reduction ratio increases, the number of parameters decreases. That is, it is a hyperparameter that trades off capacity against computational cost.
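To make the squeeze and excitation stages concrete, the following is a minimal PyTorch sketch of an SE Block under the standard formulation of [6]; the default reduction ratio of 16 matches the value used in the experiments in Section 3, and the class name is illustrative:

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: GAP to one value per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck set by the reduction ratio
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        w = self.pool(x).view(n, c)        # squeeze each channel to a scalar
        w = self.fc(w).view(n, c, 1, 1)    # excitation: recalibrated channel weights
        return x * w                       # reweight the input feature map

For example, with 256 input channels and a reduction ratio of 16, the two fully connected layers contain roughly 2 × 256 × 16 = 8192 weights, which explains the small parameter overhead reported later.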


Proposed Method
Section 3 explains the structure of the proposed model and how it is combined with the attention module. The proposed method focuses on improving the upsampling performance in semantic segmentation by adding an attention module to HRNet. Because adding an attention module necessarily increases the number of parameters, an SE Block with a low computational load was used. Figure 4 presents an overview of E-HRNet, in which an SE Block serving as the attention module is inserted at the end of each convolution block. All convolution blocks, except for the bottleneck block, have the same structure.

Details of HRNet Architecture
The detailed architecture of the existing HRNet is shown in Figure 1. The baseline model used was HRNetV2-W48, in which the highest-resolution branch has 48 convolutional channels, with input resolutions of 1024 × 512 for Cityscapes, 473 × 473 for LIP, and 480 × 480 for PASCAL Context. Feature maps with 1/2, 1/4, and 1/8 resolutions were used only to exchange information across resolutions. Therefore, attention modules can easily be added to all resolution branches per convolution block unit. There are four stages; transition and exchange units are repeated to form the second, third, and fourth stages. The transition and exchange units consist of a multi-resolution group convolution and a multi-resolution convolution, as shown in Figure 5a,b. Figure 5a shows a simple extension of convolution to multiple resolutions: the multi-resolution group convolution divides the input channels into subsets of several channels and performs each convolution separately at a different spatial resolution.

Figure 5b illustrates the multi-resolution convolution, which exchanges and fuses features extracted from parallel branches carrying information at different resolutions. Multi-resolution convolution is similar to the multi-branch, fully connected form of a normal convolution, as shown in Figure 5c. A normal convolution can be divided into several small convolutions: the input channels are divided into several subsets, and the output channels are likewise divided into several subsets. The input and output subsets are connected in a fully connected manner, with each connection being a normal convolution, and each subset of the output channels is the sum of the convolution outputs over the subsets of the input channels.
The difference from a normal convolution is that in the multi-resolution convolution, each subset of channels has a different resolution. In addition, to reduce the resolution during downsampling, a stride-2 3 × 3 convolution was used to connect the input and output channels, and bilinear upsampling was performed when upsampling the downsampled feature maps.
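As a sketch of this fusion rule for the simplest case of two branches with widths C and 2C (a simplified illustration, not the full HRNet exchange unit), the downsampling path uses a stride-2 3 × 3 convolution and the upsampling path uses a 1 × 1 convolution followed by bilinear interpolation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    """Exchange unit for two branches with widths C and 2C (illustrative)."""
    def __init__(self, c: int):
        super().__init__()
        # high -> low: stride-2 3 x 3 convolution halves the resolution, doubles channels
        self.down = nn.Conv2d(c, 2 * c, kernel_size=3, stride=2, padding=1)
        # low -> high: 1 x 1 convolution matches channels; bilinear upsampling follows
        self.up = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, x_high: torch.Tensor, x_low: torch.Tensor):
        up = F.interpolate(self.up(x_low), size=x_high.shape[-2:],
                           mode="bilinear", align_corners=False)
        new_high = x_high + up               # fuse into the high-resolution branch
        new_low = x_low + self.down(x_high)  # fuse into the low-resolution branch
        return new_high, new_low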

E-HRNet Architecture
SE blocks were added to the existing HRNet model, which consists of 307 convolution layers, 306 batch normalizations, 269 ReLU layers, 4 bottleneck layers, 104 basic blocks, and 8 high-resolution modules. This added a total of 71 ReLU layers and 108 GAP layers for compressing each channel into one feature, along with 206 fully connected layers, 108 sigmoid layers, and 108 SE Blocks. The number of parameters increased by 0.4 M, from 65.8 M to 66.2 M on the Cityscapes dataset, an increase of less than 1%, and the giga floating point operations per second (GFLOPs) increased slightly, by 0.0004. Figure 6 illustrates E-HRNet. The existing HRNet efficiently extracts features by fusing the features between parallel branches; however, information loss still occurs during downsampling. In the proposed architecture, the global context information within the object domain can be recalibrated by adding an attention module at the end of every convolution block to reduce information loss.
The SE Block is used in this context because of its ease of integration into any model and its ability to address the issue of information loss by recalibrating features with a minimal parameter increase. Specifically, the SE-Block-based attention module passes the input features through the GAP and squeezes each channel into one feature, that is, a scalar value. Subsequently, as shown in Figure 2, the importance of the squeezed feature is calculated through the fully connected layers and sigmoid as a probability value between 0 and 1 for each channel. The calculated importance is normalized as a weight and multiplied by the features that have undergone a 1 × 1 convolution to readjust the feature values.
In this study, the SE block was selected to recalibrate the feature values of the global context information for each channel while keeping the increase in the number of parameters modest. In addition, by adding an SE block to every convolution process, information can be extracted uniformly at both high and low resolutions.

Instantiation
To check the effect of the attention module on segmentation accuracy, this study was implemented in a manner similar to HRNetV2. The network starts with a stem of two stride-2 3 × 3 convolutions that reduces the feature-map resolution to 1/4. Stage 1 consists of four convolutional blocks, each comprising a 64-channel bottleneck, followed by one 3 × 3 convolution that reduces the width of the feature map to C, where C is 48 for HRNetV2-W48. Stages 2, 3, and 4 include 1, 4, and 3 multi-resolution blocks, respectively. The widths of the four resolution branches are C, 2C, 4C, and 8C, doubling each time the resolution is halved. Each branch of the multi-resolution group convolution contains four convolution blocks, and each resolution contains two 3 × 3 convolutions. In Figure 7, the middle box enlarges the input size four times through bilinear upsampling of the feature maps extracted from the four resolution branches. The outputs of all resolutions are then mixed using a 1 × 1 convolution to generate a 15C-dimensional representation, from which a segmentation map at the original resolution is finally generated. Based on this architecture, an SE Block is added to every convolution block unit. Algorithm 1 shows the pseudocode of E-HRNet. The code is written in Python, and the deep learning library used was PyTorch. The SE Block was inserted at the end of the Basic Block, which performs two convolutions. The SE Block learns the nonlinearity between channels through the fully connected layer and ReLU after each channel is squeezed into a scalar value through adaptive average pooling. Finally, important information is emphasized through the sigmoid, and other information is zeroed out.
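The representation head just described can be sketched as follows (a hedged illustration assuming an HRNetV2-style head; the mixing width and names are our assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    def __init__(self, c: int, num_classes: int):
        super().__init__()
        width = 15 * c                           # C + 2C + 4C + 8C = 15C
        self.mix = nn.Conv2d(width, width, kernel_size=1)
        self.classify = nn.Conv2d(width, num_classes, kernel_size=1)

    def forward(self, branches):
        # upsample every branch to the highest resolution and concatenate
        size = branches[0].shape[-2:]
        ups = [branches[0]] + [F.interpolate(b, size=size, mode="bilinear",
                                             align_corners=False)
                               for b in branches[1:]]
        x = F.relu(self.mix(torch.cat(ups, dim=1)))  # mix into a 15C representation
        return self.classify(x)  # logits; resized to the original resolution afterwards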
Algorithm 1 presents the pseudocode of E-HRNet, where the variables N, C, H, and W denote the number of samples in a mini-batch, the number of feature channels, the image height, and the image width, respectively.
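A minimal PyTorch sketch consistent with this description is given below; it reuses the SEBlock sketched in the Attention Module section and appends it to a residual Basic Block, with names and defaults that are our assumptions rather than the authors' exact code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBasicBlock(nn.Module):
    """Basic Block (two 3 x 3 convolutions) with an SE Block appended at the end."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = SEBlock(channels, reduction)  # channel recalibration (see earlier sketch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: N x C x H x W
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)          # emphasize informative channels, suppress others
        return F.relu(out + x)      # residual connection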

Experiments
Semantic segmentation is the task of assigning a label to each pixel. In this study, to verify the effect of the attention module on segmentation accuracy in semantic segmentation, the parameters, datasets, and training rules were set to be the same as those of the existing HRNetV2, except for the attention module. Cityscapes [9], a representative scene-parsing dataset, and LIP [10], a human-parsing dataset, were used. In addition, PASCAL Context [11], a general image dataset that extends PASCAL VOC 2010, was used. The HRNet-based models were pre-trained on ImageNet. Tables 1 and 2 list the hardware specifications and software versions used for development and testing.

Cityscapes
The Cityscapes dataset consists of 5000 high-resolution, finely annotated scene images, divided into 2975 training, 500 validation, and 1525 testing images. There are 30 classes in total; in this study, 19 classes, excluding the empty and sparse categories, were used for training and evaluation for efficient learning.
The batch size was set to six. The same training protocol as HRNetV2 [17,35] was used, except that a single GPU was used instead of multiple GPUs. Images with a resolution of 1024 × 2048 pixels were randomly cropped to 512 × 1024 pixels, and the data were augmented using random scaling in the range of 0.5-2 and random horizontal flipping. The optimizer was stochastic gradient descent (SGD) with an initial learning rate of 0.01, momentum of 0.9, dampening of 0, and weight decay of 0.0005; Nesterov momentum was disabled, and the maximize, foreach, and differentiable options were left at their defaults of false, none, and false, respectively. The learning rate schedule followed a poly learning rate policy with a power of 0.9, and the reduction ratio of the SE Block was 16. The performance of the model was evaluated at a single scale without flipping.

Table 3 presents a comparison of the number of parameters, GFLOPs, mIoU, and MeanACC of HRNet and the proposed model on the Cityscapes validation set. The number of parameters increased by 0.4 M, and the GFLOPs increased by 0.001 compared with HRNetV2-W48, the baseline model. MeanACC, the average pixel accuracy, decreased by 0.1%; however, the mIoU increased by 0.5% owing to the improved performance in segmenting the regions of objects corresponding to pixel classes.

Table 4 shows the mIoU comparison of existing models and the proposed model on the Cityscapes validation set. The proposed model achieved 4.2% higher performance than UNet++ [23], a relatively lightweight model, and 1.2% and 0.1% higher performance than DeepLabv3 [16] and DeepLabv3+ [20], models of similar weight, respectively.

Table 5 compares the mIoU, instance intersection over union (iIoU) class, IoU category, and iIoU category of HRNet and the proposed model on the Cityscapes test set. While IoU evaluates how well a model segments an entire class, iIoU evaluates segmentation accuracy at the instance level, determining how well a model distinguishes individual objects within the same class; using both metrics together provides a comprehensive picture of how well the model segments individual objects. The difference between class and category lies in the scope of consideration: a class refers to each individual class, whereas a category groups similar classes, so that 'Bus', 'Car', and 'Truck' all fall under the 'Vehicles' category. Overall, the proposed model demonstrated strong performance across all metrics, with a particularly noticeable improvement in iIoU, suggesting that it is more adept at segmenting individual objects than entire classes.

Table 6 shows the class-wise IoU results on the Cityscapes test set. The proposed model demonstrated segmentation performance similar to existing models for large objects such as 'sky' and 'building', but excelled at segmenting relatively small and complex objects such as 'traffic light', 'traffic sign', and 'fence'. The results in Tables 5 and 6 show that emphasizing channel information can mitigate the confusion among objects that belong to the same class or are small and easily confused.

Figure 8 shows the semantic segmentation prediction maps of the models trained on the Cityscapes dataset. HRNetV2-W18 exhibited relatively more misclassifications owing to unclear boundaries between objects.
On the other hand, HRNetV2-W48 demonstrated clearer boundaries between objects and fewer misclassifications compared to HRNetV2-W18. Our proposed model shares similarities with HRNetV2-W48; however, it displayed superior capabilities in accurately segmenting small and intricate objects that are easily overlooked. From these results, we can infer that the number of channels in convolution plays a significant role in segmentation performance. Additionally, we observed that information emphasis through attention modules has a meaningful impact on accurately segmenting intricate objects.
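For reproducibility, the optimizer and learning-rate settings described above can be sketched in PyTorch as follows (the model and the number of training iterations are placeholders, as the iteration count is not stated here):

import torch

model = torch.nn.Conv2d(3, 19, kernel_size=1)  # placeholder standing in for E-HRNet
max_iters = 120_000                            # assumed value; not given in the text

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, dampening=0,
    weight_decay=0.0005, nesterov=False,
)
# poly policy: lr = base_lr * (1 - iter / max_iters) ** 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1 - it / max_iters) ** 0.9,
)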

LIP
The LIP dataset consists of 50,462 carefully annotated images of human body parts. The dataset was divided into 30,462 images for training and 10,000 images for validation. It consisted of 19 classes related to human parts and one background class.
The images were resized to 473 × 473 according to the training and test settings in [36], and the performance was evaluated as the average of the segmentation maps of the original and flipped images. The settings for data augmentation, the learning rate schedule, and the reduction ratio of the SE Block were the same as those for Cityscapes, and the training settings were the same as those in [26]. The optimizer was SGD with an initial learning rate of 0.01, momentum of 0.9, dampening of 0, and weight decay of 0.0005; Nesterov momentum was disabled, and the maximize, foreach, and differentiable options were left at their defaults of false, none, and false, respectively. The batch size was 8. The performance of the model was evaluated at a single scale without flipping.

Table 7 compares the parameters, GFLOPs, mIoU, and MeanACC of the existing HRNet model and the proposed model on the LIP validation set. The number of parameters increased by 0.4 M, and the GFLOPs increased by 0.0004 compared with HRNetV2-W48, the baseline model. Both the object-region and pixel-class classification accuracies improved, with MeanACC increasing by 0.3% and mIoU by 0.4%.

Table 8 shows the mIoU comparison of several models and the proposed model on the LIP validation set. Among the models trained without additional data, the proposed model achieved the best overall performance.

PASCAL Context
The PASCAL Context dataset consists of 4998 scene images for training and 5105 test images, with 59 object classes and one background class.
The settings for data augmentation, the learning rate schedule, and the reduction ratio of the SE Block were the same as those for Cityscapes. The optimizer was SGD. Following the training strategy in [41,42], the images were resized to 480 × 480, and the initial learning rate was set to 0.004. The momentum was set to 0.9, the dampening to 0, and the weight decay to 0.001; Nesterov momentum was disabled, and the maximize, foreach, and differentiable options were left at their defaults of false, none, and false, respectively. The batch size was 13. The test strategy followed a previously described procedure [41,42]: the test images were resized to 480 × 480 pixels and input into the model, and the output 480 × 480 segmentation maps were resized to the original image size. The performance of the model was evaluated at a single scale without flipping.

Table 9 compares the parameters, GFLOPs, mIoU, and MeanACC of the HRNet model and the proposed model on the PASCAL Context test set. The number of parameters increased by 0.5 M, and the GFLOPs increased by 0.0004 compared with HRNetV2-W48, the baseline model. The mIoU fell by 0.1%, whereas MeanACC increased by 0.7%, indicating improved pixel-class classification accuracy despite slightly weaker object-region segmentation.

Table 10 shows the mIoU comparison of several models and the proposed model on the PASCAL Context test set. As in Table 9, 60 classes were evaluated, and the proposed model achieved the best performance.

HRNet-Based Model Performance Comparison Results
In the mIoU comparison, performance on the PASCAL Context dataset, which targets the segmentation of small objects, decreased by 0.1%, whereas performance on the Cityscapes validation set, intended for scene parsing, improved by 0.5%. In the experiments on the Cityscapes test set, the mIoU, iIoU class, IoU category, and iIoU category improved by 0.3%, 0.5%, 0.1%, and 0.8%, respectively. Additionally, performance on the LIP dataset, designed for body-part parsing, increased by 0.4%. The MeanACC comparison showed that the proposed model exhibited a decrease of 0.1% on Cityscapes, a scene-understanding dataset, while showing increases of 0.3% on LIP and 0.7% on PASCAL Context, compared with the existing HRNetV2-W48. Therefore, emphasizing global context information can influence the performance of segmenting boundaries between objects in scene-understanding tasks, and it affects pixel classification accuracy more when segmenting small objects than relatively larger ones. Additional experiments providing further evidence are included in Appendix A.

Conclusions
In this study, we proposed an HRNet model combined with an attention module. The proposed method uses the SE Block as the attention module to reduce the loss of global context information. An attention module is introduced in each convolution block to mitigate the information loss that occurs at every convolution. This approach emphasizes and preserves crucial information throughout the network, thereby effectively addressing the issue of information loss. The performance experiments compared the existing HRNet model and the proposed model using the learning strategy appropriate to each dataset. The number of parameters increased by 0.4 M on Cityscapes and LIP and by 0.5 M on PASCAL Context, and the GFLOPs increased by 0.001 on Cityscapes and by 0.0004 on LIP and PASCAL Context. On the Cityscapes dataset, the pixel-class classification accuracy decreased slightly, but the object-region segmentation performance improved. On the LIP dataset, all performance metrics improved. On PASCAL Context, the object-region segmentation performance decreased slightly, whereas the pixel-class classification performance improved. Compared with several other models, the proposed model achieved the best performance. Consequently, the attention module improved performance without excessively increasing the complexity of the model. Furthermore, we observed that emphasizing global context information has a significant effect on performance.
In the future, higher performance is expected to be obtainable by precisely tuning the optimizer, learning policy, and hyperparameter values for the proposed model. Future research should also develop an attention module optimal for the proposed model, as well as a new method for combining the extracted features for upsampling.

Conflicts of Interest:
The authors declare no conflict of interest.

ADE20K
We conducted additional experiments with the ADE20K dataset [46] for several reasons. First, ADE20K encompasses various scene categories and annotated objects, enabling the evaluation of models even in complex scenarios. Second, in the experiments using the Cityscapes dataset, there was little difference in segmentation performance for relatively large objects such as 'buildings' and 'sky' compared with other models, and in the experiments using the PASCAL Context dataset, the mIoU actually decreased, necessitating further investigation. Lastly, the granularity of the ADE20K annotations is particularly suitable for underscoring the strengths we claim for the proposed model.
The ADE20K dataset was used in the ImageNet Scene Parsing Challenge 2016. It contains 150 classes and diverse scenes with 1038 image-level labels, divided into 20,210 training, 4002 validation, and 3352 testing images. Since the test set does not provide labels, the model's performance was evaluated on the validation set. The batch size was set to nine. The same training protocol as HRNetV2 + OCR [47] was used, except that a single GPU was used instead of multiple GPUs. The images were resized to 520 × 520. The settings for data augmentation, the learning rate schedule, and the reduction ratio of the SE Block were the same as those for Cityscapes. The optimizer was SGD with an initial learning rate of 0.02, momentum of 0.9, dampening of 0, and weight decay of 0.0001; Nesterov momentum was disabled, and the maximize, foreach, and differentiable options were left at their defaults of false, none, and false, respectively.

Table A1 presents a comparison of the number of parameters, GFLOPs, mIoU, and MeanACC of HRNet and the proposed model on the ADE20K validation set. The number of parameters increased by 0.4 M, and the GFLOPs increased by 0.004 compared with HRNetV2-W48, the baseline model. The mIoU was maintained, and the MeanACC improved by 0.4%. Together with the enhanced fine-object segmentation on the Cityscapes dataset and the slightly decreased boundary segmentation but improved pixel-class accuracy on PASCAL Context, this indicates that the channel-information emphasis of the proposed model is effective in mitigating the characteristics that make small objects easily confused.

Table A1. Results of HRNetV2-based semantic segmentation models on the ADE20K validation set (single scale, no flipping). GFLOPs are calculated on an RTX 3090 with input size 520 × 520. The backbone of the proposed method is HRNetV2-W48.