Article

Dual Attention Equivariant Network for Weakly Supervised Semantic Segmentation

by Guanglun Huang 1,2, Zhaohao Zheng 1, Jun Li 1, Minghe Zhang 1, Jianming Liu 1,* and Li Zhang 1,*
1 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
2 Nanning New Technology Entrepreneur Center, Nanning 530007, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6474; https://doi.org/10.3390/app15126474
Submission received: 16 April 2025 / Revised: 28 May 2025 / Accepted: 5 June 2025 / Published: 9 June 2025

Abstract

Image-level weakly supervised semantic segmentation is a challenging problem in computer vision and has attracted considerable attention in recent years. Most existing models utilize class activation mapping (CAM) to generate initial pseudo-labels for each image pixel. However, CAM usually focuses only on the most discriminative regions of target objects and treats each channel feature map independently, which may overlook some important regions due to the lack of accurate pixel-level labels, leading to the underactivation of target objects. In this paper, we propose a dual attention equivariant network (DAEN) model to address this problem by considering both the channel and spatial information of different feature maps. Specifically, we first design a channel–spatial attention module (CSM) for DAEN that accurately extracts the features of target objects by considering the correlation among feature maps in different channels, and then integrate the CSM with equivariant regularization and pixel-correlation modules to achieve more accurate and effective pixel-level semantic segmentation. Extensive experimental results show that the DAEN model achieves mIoU scores 2.1% and 1.3% higher than existing weakly supervised semantic segmentation models on the PASCAL VOC 2012 and LUAD-HistoSeg datasets, respectively, validating the effectiveness and efficiency of the DAEN model.

1. Introduction

Semantic segmentation is a fundamental and critical task in the field of computer vision [1]; its purpose is to make accurate pixel-level classification predictions for images. With the rapid development of deep learning in recent years, semantic segmentation models have made great progress and have been widely applied in fields such as intelligent driving, intelligent security, satellite remote sensing imagery, and medical image analysis [2,3,4,5]. However, fully supervised semantic segmentation requires manual labeling of each pixel, which is not only expensive but also time-consuming.
Recently, many studies have adopted weakly supervised semantic segmentation to avoid pixel-by-pixel manual annotation by using weak supervision such as bounding boxes [6,7,8], scribbles [9,10], and image-level classification labels [11,12,13,14,15]. Among these methods, weakly supervised semantic segmentation using image-level classification labels has garnered significant attention from academia and industry for the following reasons: (1) by providing high-level semantic information and semantic categories of images, it makes the segmentation results easy to interpret and understand; (2) it can be applied to large-scale semantic segmentation scenarios using only a small amount of annotation information; (3) it correlates strongly with image-level labels and can produce more intuitive outcomes. Therefore, we focus on weakly supervised semantic segmentation based on image-level classification labels in this paper.
Many weakly supervised semantic segmentation methods based on image-level classification labels use CAM to generate initial pseudo-labels [16,17,18,19,20]. However, CAM utilizes global average pooling to convert feature maps into class activation values, which causes the loss of some pixel-level spatial information. As a result, the generated CAM fails to accurately capture the boundaries and details of the target objects. Some work [21,22] has improved CAM by taking account of cross-image relationships, equivariance across transformed images, and pixel correlation. However, these works mainly focus on the discriminative regions provided by the classification network and do not fully consider the correlation between features in the channel and spatial dimensions; as a result, they may overlook some important features, leading to the underactivation of target objects.
Motivated by the above, in this paper, we investigate improving the accuracy of CAM by considering the correlation between different feature maps, in both the channel and spatial dimensions, provided by classification networks. Specifically, we propose a DAEN model that improves CAM accuracy by using a channel–spatial attention module, which accurately extracts categorical features from both the spatial and channel dimensions, and a pixel-correlation module, which calculates the normalized feature similarity between pixels in different feature maps. To further improve CAM, DAEN also applies equivariance regularization to transformed images to generate additional supervision. The contributions of this paper can be summarized as follows:
  • We propose a dual attention equivariant network model that combines the channel–spatial attention module, equivariance regularization, and pixel-correlation module to effectively improve the accuracy of CAM by considering both channel and spatial information of different feature maps.
  • We design a channel–spatial attention module for DAEN that accurately extracts the features of target objects by considering the correlation among feature maps in different channels.
  • Extensive experiments on PASCAL VOC 2012 and LUAD-HistoSeg datasets demonstrate that our proposed model outperforms existing state-of-the-art models for image-level weakly supervised semantic segmentation.

2. Related Work

In this section, we review some weakly supervised semantic segmentation models based on image-level classification labels and some attention mechanisms for feature extraction.

2.1. Weakly Supervised Semantic Segmentation Based on Image-Level Classification Labels

Many of the existing studies in this area first locate target objects using CAM and then generate pseudo-labels. Kolesnikov et al. proposed three principles that initialize seeds from weak localization cues, expand them to cover objects, and constrain the segmentation to align with object boundaries, thereby refining CAM [23]. Lee et al. proposed capturing object information by randomly selecting hidden units of the object feature map [24]. Fan et al. proposed a cross-image affinity module to obtain additional semantic information from images of the same category [21]. Araslanov et al. proposed using local consistency, semantic fidelity, and completeness as training guidance to obtain accurate semantic masks in a single stage [25]. Shimoda et al. proposed a self-supervised difference detection method to refine pixel-level semantic affinity [26]. Wang et al. proposed capturing features from transformed images through a pixel-correlation module and equivariant cross-regularization [22]. Zhang et al. proposed a pseudo-label updating mechanism and a customized super-pixel-based random walk mechanism to rectify mislabeled regions within objects and refine the boundaries of the original pseudo-labels so that they align more accurately with the physical structure of each image [27]. Jiang et al. proposed utilizing the relationship between feature maps and their corresponding gradients to generate more fine-grained object positioning information for determining the target object [28]. Zhang et al. proposed a reliable region mining method to find target objects or background regions with high responses on the class activation map [29]. Li et al. proposed an iterative dCRF method based on graph convolution for semantic propagation by using the positional relationship between adjacent pixels [30]. Zeng et al. proposed employing a saliency map generator and a globally consistent super-pixel module to enhance localization accuracy and segmentation edge precision [31]. Wang et al. proposed a method that effectively addresses incomplete background and incomplete object representation by capturing multiscale contextual information formed from adjacent spatial feature grids and encoding fine-grained low-level features into high-level representations [32]. Ahn et al. proposed a straightforward yet highly effective method that leverages the AffinityNet model and semantic affinities within images to compensate for the absence of object shape information without the need for external data or additional supervision [33].
The models mentioned above have made significant improvements in weakly supervised semantic segmentation; however, they do not fully consider the correlation between features in the channel and spatial dimensions and may overlook some important features, leading to the underactivation of target objects.

2.2. Attention Mechanism

Some studies have also been carried out on the feature extraction of regions of interest by using attention mechanisms. Wang et al. proposed capturing long-distance dependencies and positional interactions in the input data by introducing non-local operations, enabling global relationship modeling of the entire input space [34]. Jaderberg et al. proposed extracting important features by applying spatial transformations to the spatial-domain information in the image [35]. Hu et al. proposed adaptively adjusting channel feature responses by exploiting the interaction between channels [36]. Qin et al. proposed recalibrating feature maps by leveraging frequency information and channel attention to selectively enhance or suppress specific frequency components [37]. Woo et al. proposed extracting important features in the channel and spatial dimensions by connecting channel attention and spatial attention modules [38]. However, few works consider integrating attention mechanisms with weakly supervised semantic segmentation. This limitation inspired our work in this paper.

3. Model

3.1. Motivation

Current models using CAM for weakly supervised semantic segmentation rely mainly on the regional discrimination capability provided by convolutional classification networks and treat each channel feature map independently, which may overlook some important regions due to the lack of accurate pixel-level labeling. Generally, different feature maps have rich correlations in the channel and spatial dimensions. Specifically, different channel feature maps can complement and influence each other, with certain channels playing pivotal roles in recognizing and localizing specific targets, while others capture detailed information from different regions. Moreover, information exchange between different channels can also occur through the spatial dimensions. For instance, in target localization tasks, different channel feature maps can correspond to different regions of the target in the spatial domain, and their spatial correlations may be crucial for precise localization.
Based on the above insights, we design a channel–spatial attention module (CSM) to fully leverage the correlation among feature maps in different channels provided by the classification network. The CSM module can adaptively adjust the weights and spatial distributions of channel feature maps to enable the network to focus more on features of target objects. Then, by integrating the CSM module with equivariant regularization and pixel-level correlation modules, we propose the DAEN model, as shown in Figure 1, to achieve more accurate and effective pixel-level semantic segmentation. In the following, we shall introduce the implementation of DAEN in detail, including the CSM module, equivariant regularization, the pixel-correlation module, and a loss function.

3.2. CSM Module

The CSM module, as shown in Figure 2, can extract important but easily neglected features in both the channel and spatial dimensions. Specifically, the channel attention module first learns the inter-channel relationships within feature maps to capture essential feature mappings in the channel space; the CSM then integrates the original feature maps with the output of the channel attention module via element-wise multiplication to further enhance the quality of the extracted feature maps. Next, the spatial attention module learns weights for each spatial position within the feature maps to capture salient spatial features. Finally, the CSM conducts element-wise multiplication between the output of the spatial attention module and the preceding feature maps, followed by the ReLU activation function, to obtain the final feature representation. These operations can be represented as follows:
$$CS_a(f) = \sigma\big((C_a(f) \otimes f) \otimes S_a(C_a(f) \otimes f)\big),$$
where $f$ represents the output of the classification network, $C_a$ represents the channel attention module, $S_a$ represents the spatial attention module, $\sigma$ represents the ReLU function, and $\otimes$ represents element-wise multiplication.
Since the channel attention and spatial attention modules are important for the implementation of the CSM module, we will look at their workflow in detail next.

3.2.1. Channel Attention Module

The channel attention module can be used to learn the interdependency and importance of channels to aid the backbone network in extracting the requisite regions for the semantic segmentation tasks in the channel dimension; accordingly, this enhances the model’s capability of capturing some important but easily overlooked features. As shown in Figure 3, the channel attention module first passes the features to both the max pooling and average pooling layers to obtain the global information of the features. This can capture the statistical characteristics of different channels in the feature map and provide insights into the importance and contribution of each channel within the entire feature map. Then, a fully connected network is used to learn the weights of channels to enable the weighted fusion and adjustment of the features from different channels. The two outputs of the fully connected network module are then added in an element-wise manner to facilitate the full utilization of the global contextual information and local detailed information to extract the features of target objects accurately. Finally, the fused result is fed into a sigmoid function. The above operations can be expressed as follows:
$$C_a(f) = \varphi\big(\mathrm{MLP}(\mathrm{Avg}(f)) + \mathrm{MLP}(\mathrm{Max}(f))\big),$$
where $f$ is the output of the classification network, $\mathrm{Avg}$ denotes average pooling, $\mathrm{Max}$ denotes maximum pooling, $\mathrm{MLP}$ is the fully connected layer, and $\varphi$ represents the sigmoid function.
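To make the workflow concrete, the following is a minimal PyTorch sketch of such a channel attention module; the 1×1-convolution MLP and the reduction ratio of 16 are our own illustrative assumptions rather than settings reported in this paper.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention: C_a(f) = sigmoid(MLP(Avg(f)) + MLP(Max(f)))."""

    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumption
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.max_pool = nn.AdaptiveMaxPool2d(1)  # global max pooling
        # Shared two-layer MLP (implemented with 1x1 convolutions) applied to both descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W) feature maps from the classification network
        avg_out = self.mlp(self.avg_pool(f))
        max_out = self.mlp(self.max_pool(f))
        return self.sigmoid(avg_out + max_out)  # (B, C, 1, 1) per-channel weights
```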

3.2.2. Spatial Attention Module

The spatial attention module can be used to learn the spatial correlation between different pixels in channel feature maps to capture crucial spatial information. As shown in Figure 4, the input features are first integrated in the spatial dimension by passing them to both the max pooling and average pooling layers to extract the global statistical information of the features. This aids in capturing the contribution of different spatial positions and provides a more comprehensive spatial context for subsequent processing steps. Then, the features obtained from the max pooling and average pooling layers are fused through a convolutional operation to allow for the full utilization of both the global statistical information and original spatial details, further enhancing the features of the target objects. After completing the preceding operations, the output is passed as an input into a sigmoid function. The above operations can be expressed as:
$$S_a(f) = \varphi\big(C_{7\times 7}(\mathrm{Avg}(f) + \mathrm{Max}(f))\big),$$
where $f$ is the output of the classification network, $\mathrm{Avg}$ is average pooling, $\mathrm{Max}$ is maximum pooling, $\varphi$ represents the sigmoid function, and $C_{7\times 7}$ represents a convolution operation with a filter size of $7 \times 7$.
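Continuing the sketch above, a spatial attention module and a wrapper that composes it with the channel attention into the CSM could be written as follows; the channel-wise pooling and the wiring of the two modules are illustrative assumptions on our part, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Spatial attention: S_a(f) = sigmoid(Conv7x7(Avg(f) + Max(f)))."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Pool along the channel dimension to obtain two (B, 1, H, W) spatial descriptors
        avg_out = f.mean(dim=1, keepdim=True)
        max_out = f.amax(dim=1, keepdim=True)
        return self.sigmoid(self.conv(avg_out + max_out))  # (B, 1, H, W) spatial weights


class CSM(nn.Module):
    """CS_a(f) = ReLU((C_a(f) * f) * S_a(C_a(f) * f)), with * as element-wise multiplication."""

    def __init__(self, channel_att: nn.Module, spatial_att: nn.Module):
        super().__init__()
        self.channel_att = channel_att  # e.g., the ChannelAttention sketch above
        self.spatial_att = spatial_att
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f_c = self.channel_att(f) * f      # channel-refined features
        out = f_c * self.spatial_att(f_c)  # spatially re-weighted features
        return self.relu(out)


# Usage (the channel count is an arbitrary example):
# csm = CSM(ChannelAttention(512), SpatialAttention())
```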

3.3. Equivariant Regularization

In fully supervised semantic segmentation, the same affine transformation is often applied to both the pixel-level labels and the input image to enhance the generalization ability of the model. This operation introduces an equivariant constraint on the network; that is, the output of the network for the affine-transformed input image should be consistent with the result of applying the same transformation to the network output of the original image. However, due to the lack of pixel-level labels in weakly supervised semantic segmentation, this equivariant constraint is missing. To impose the equivariant constraint, as achieved in SEAM [22], we introduce a twin structure with shared weights in the DAEN model. One branch applies a spatial affine transformation to the network output, while the other branch applies the same affine transformation to the image before it is fed through the network. This design obtains the output CAMs of both branches at the same time, and these CAMs can be used to constrain the learning of the model. This process can be expressed as:
$$E = \|N(A(i)) - A(N(i))\|_1,$$
where $N(\cdot)$ represents the network and $A(\cdot)$ represents the spatial transformation.
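As an illustration only, the sketch below measures this equivariance term with bilinear rescaling standing in for the affine transformation $A$; the choice of transformation and the averaging over elements are assumptions on our part.

```python
import torch
import torch.nn.functional as F


def equivariance_error(network, image: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """E = || N(A(i)) - A(N(i)) ||_1, with A taken here to be bilinear down-scaling."""
    def A(x: torch.Tensor) -> torch.Tensor:
        return F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)

    cam_of_transformed = network(A(image))  # branch 1: transform the image, then run the network
    transformed_cam = A(network(image))     # branch 2: run the network, then transform its CAMs
    # For a fully convolutional network both tensors have the same spatial size,
    # so the L1 distance (averaged over elements here) can be computed directly.
    return (cam_of_transformed - transformed_cam).abs().mean()
```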

3.4. Pixel Correlation Module

To further improve CAM accuracy using contextual information, we embed the pixel-correlation module (PCM) proposed by SEAM [22] at the end of the network (see Figure 1) to fuse the shallow features of each pixel. Specifically, the pixel-correlation module, as shown in Figure 5, quantifies the correlation between pixels of the feature maps of the convolutional classification network by calculating their cosine similarity, which provides more comprehensive contextual information. The non-negative similarity values obtained after the ReLU activation function are then used to aggregate the original CAM values into refined CAMs, enhancing the model's accuracy in pixel-level classification prediction. This process can be represented as:
$$c_i = \frac{1}{O(x_i)} \sum_j \mathrm{ReLU}\big(F(x_i, x_j)\big)\, \hat{c}_j,$$
$$F(x_i, x_j) = \frac{\lambda(x_i)^{\top} \lambda(x_j)}{\|\lambda(x_i)\| \cdot \|\lambda(x_j)\|},$$
where $x$ represents the input feature, $\lambda(x)$ represents the result of applying a linear transformation to $x$, $\hat{c}$ represents the original CAM, and $c$ represents the refined CAM, with $i$ and $j$ denoting spatial indices. The output is normalized by $O(x_i) = \sum_j F(x_i, x_j)$.
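A minimal sketch of this refinement, assuming the pixel features have already been projected by the linear transformation $\lambda$ and that the normalization sums the rectified similarities (a common implementation choice), might look like this:

```python
import torch
import torch.nn.functional as F


def pixel_correlation_refine(feat: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
    """Refine CAMs with pixel-level cosine similarity.

    feat: (B, C, H, W) projected pixel features lambda(x).
    cam:  (B, K, H, W) original CAMs c_hat.
    Returns refined CAMs c of shape (B, K, H, W).
    """
    b, c, h, w = feat.shape
    k = cam.shape[1]

    f = F.normalize(feat.view(b, c, h * w), dim=1)        # unit-norm features: dot product = cosine
    sim = torch.relu(torch.bmm(f.transpose(1, 2), f))     # (B, N, N) non-negative similarities
    sim = sim / (sim.sum(dim=-1, keepdim=True) + 1e-5)    # row-normalize by O(x_i)

    cam_flat = cam.view(b, k, h * w)                       # (B, K, N)
    refined = torch.bmm(cam_flat, sim.transpose(1, 2))     # c_i = sum_j sim(i, j) * c_hat_j
    return refined.view(b, k, h, w)
```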

3.5. Loss Function

The loss function consists of three parts: the classification loss, the equivariant regularization loss, and the cross-equivariant regularization loss. Since the DAEN model uses the image-level classification labels $t$ as weak supervision and outputs the classification prediction vector $p$ of the image, the classification loss $L_{cla}$ can be expressed as follows:
$$l_{cla}(p, t) = -\frac{1}{k_1} \sum_{k=1}^{k_1} \left[ t_k \log\left(\frac{1}{1 + e^{-p_k}}\right) + (1 - t_k) \log\left(\frac{e^{-p_k}}{1 + e^{-p_k}}\right) \right],$$
$$L_{cla} = \frac{1}{2}\big(l_{cla}(p^0, t) + l_{cla}(p^1, t)\big),$$
where $k_1$ represents the number of categories excluding the background, $l_{cla}$ represents the multi-label classification loss of a single branch, and $p^0$ and $p^1$ represent the prediction vectors of the two affine branches, each aggregated by a global average pooling layer.
To ensure equivariance, the CAM generated from the affine-transformed input image should undergo the same transformation as the CAM of the original image; the equivariant loss $L_{eq}$ can therefore be expressed as follows:
$$L_{eq} = \|A(\hat{c}^0) - \hat{c}^1\|_1,$$
where $A$ represents the affine transformation function, and $\hat{c}^0$ and $\hat{c}^1$ represent the original CAM outputs of the two branches of the network, respectively.
Minimizing the cross-equivariant regularization loss makes the PCM output in each branch of the DAEN model supervised by the CAM output of the other branch, which not only prevents the PCM output from falling into a local optimum but also avoids CAM degradation during the PCM refinement. Therefore, the cross-equivariant regularization loss $L_{ce}$ can be expressed as follows:
$$L_{ce} = \|A(c^0) - \hat{c}^1\|_1 + \|A(\hat{c}^0) - c^1\|_1,$$
where $c^0$ and $c^1$ represent the PCM outputs of the two branches, respectively. As a result, the overall loss function $L$ of the network is expressed as:
$$L = L_{cla} + L_{eq} + L_{ce}.$$
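Putting the pieces together, a hedged sketch of the overall training loss is given below; it assumes the logits and CAMs of the two branches have already been computed, uses PyTorch's built-in multi-label soft-margin loss for the classification term, and again treats the affine transformation A as a caller-supplied function.

```python
import torch.nn.functional as F


def daen_loss(p0, p1, t, cam0, cam1, pcm0, pcm1, A):
    """Total loss L = L_cla + L_eq + L_ce.

    p0, p1:     classification logits of the two branches, shape (B, k1)
    t:          image-level labels in {0, 1}, shape (B, k1)
    cam0, cam1: original CAMs (c_hat) of the two branches
    pcm0, pcm1: PCM-refined CAMs (c) of the two branches
    A:          spatial affine transformation applied to branch-0 outputs
    """
    # Multi-label classification loss averaged over the two branches
    l_cla = 0.5 * (F.multilabel_soft_margin_loss(p0, t) +
                   F.multilabel_soft_margin_loss(p1, t))

    # Equivariant regularization between the original CAMs of the two branches
    l_eq = (A(cam0) - cam1).abs().mean()

    # Cross-equivariant regularization between PCM outputs and the opposite branch's CAM
    l_ce = (A(pcm0) - cam1).abs().mean() + (A(cam0) - pcm1).abs().mean()

    return l_cla + l_eq + l_ce
```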

4. Experiment

4.1. Experiment Setup

We evaluate our proposed DAEN model on the PASCAL VOC 2012 dataset. The dataset contains 21 annotation classes, including 20 foreground object classes and 1 background class. In our experiments, we use 10,582 images for training, 1449 images for validation, and 1456 images for testing. During training, only image-level classification labels are available. In addition, we use the mean intersection over union (mIoU) over all classes to evaluate performance. The backbone network is Resnet38, as in SEAM [22], with the number of epochs set to 8 and the batch size set to 8. The model is run on three TITAN XP GPUs, and a stochastic gradient descent algorithm is adopted to train the network. The learning rate is adaptively updated using the polynomial-decay "poly" strategy, shown as follows:
$$lr = irl \times \left(1 - \frac{it}{max\_it}\right)^{time},$$
where $irl$ is the initial learning rate, which is set to 0.01, $time$ is the decay power, which is set to 0.9, $it$ represents the number of iterations, and $max\_it$ is the maximum number of iterations.
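For reference, a small sketch of this "poly" schedule with the stated values (initial learning rate 0.01, power 0.9); the maximum iteration count used in the example is arbitrary:

```python
def poly_lr(initial_lr: float, it: int, max_it: int, power: float = 0.9) -> float:
    """Polynomial decay: lr = initial_lr * (1 - it / max_it) ** power."""
    return initial_lr * (1.0 - it / max_it) ** power


# Example: the learning rate decays smoothly from 0.01 towards 0 over max_it iterations.
for it in range(0, 10001, 2500):
    print(it, round(poly_lr(0.01, it, max_it=10000), 6))
```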

4.2. Evaluation of Experimental Results

We selected the SEAM model as the baseline to validate the effectiveness of our proposed model and first conducted experiments on the PASCAL VOC 2012 dataset. For our CSM module, we conducted a series of ablation experiments in which we replaced the CSM module in the DAEN model with only a channel attention module (called CM) and with only a spatial attention module (called SM), respectively. Table 1 shows the mIoU scores of DAEN, CM, SM, and SEAM over five rounds of experiments. It can be seen that the mIoU scores of the CAMs generated by SM, CM, and DAEN outperform SEAM by 0.41%, 0.56%, and 0.72%, respectively, indicating that both the channel attention module and the spatial attention module can effectively activate important regions of target objects that are easily neglected by the SEAM model. Furthermore, the fact that the proposed DAEN model achieves the highest mIoU validates the effectiveness of combining the channel attention and spatial attention modules.
We also investigated the role of each loss function in the DAEN model, as presented in Table 2. The mIoU for the classification loss alone is 50.44%, indicating its pivotal role in enabling the model to recognize target objects. The mIoU for the equivariant regularization loss alone is 13.04%. Despite its relatively low mIoU compared to the classification loss, the equivariant regularization loss maintains the model's spatial consistency for target objects, thereby enhancing its ability to learn target shape and structure. The mIoU for the cross-equivariant regularization loss alone is 50.58%, demonstrating its positive impact on accurate target classification while preserving consistent deformations of target shapes. Combining the three loss functions allows their respective advantages to be fully exploited, thereby enhancing the model performance.
Table 3 shows the average mIoU obtained by feeding the CAMs generated by the SEAM and DAEN models, respectively, into AffinityNet [33] for further refinement. It can be seen from Table 3 that the average mIoU of DAEN is 64.36%, which is 1.7% higher than that of the SEAM model. Figure 6c,d also show the pseudo-labels generated by combining the SEAM and DAEN models with AffinityNet, respectively. It can be seen that our DAEN yields a significant improvement over SEAM. This is because, in the initial CAM generation stage, the proposed DAEN model can accurately activate important but easily neglected regions that SEAM cannot, thus providing AffinityNet with richer semantic structures. We then use these pseudo-labels as full supervision labels to train the segmentation model DeepLab [39], which uses Resnet38 as the backbone, to obtain the final segmentation results. Table 4 shows the per-category segmentation results of the DAEN and SEAM models. It can be seen that, in 15 of the 21 categories, the proposed DAEN achieves a higher IoU score than the SEAM model, which validates the effectiveness of the proposed DAEN model.
Table 5 shows the performance comparison of the proposed DAEN model with other weakly supervised semantic segmentation models. For a fair comparison, all models employ the same Resnet38 network as the backbone. It can be seen that the proposed DAEN model achieves the highest mIoU score among all models. The reason is that the proposed DAEN model can better capture the features of important but easily ignored regions of target objects, leading to better class activation maps for the segmentation task. This is also illustrated by Figure 7, which shows the segmentation results of the proposed DAEN and SEAM models; the proposed DAEN model produces better semantic segmentation labels than the SEAM model.
Furthermore, we conducted extensive experiments on the LUAD-HistoSeg dataset. LUAD-HistoSeg is a tissue section image segmentation dataset of lung adenocarcinoma, which is usually used for medical image analysis and computer-aided diagnosis research. The categories contained in the LUAD-HistoSeg dataset include tumor epithelial cells (TE), necrosis (NEC), lymphocytes (LYM), and tumor-associated stroma (TAS). The experimental results are shown in Table 6. It is seen that, compared with the HistoSegNet and SEAM models, our model has a higher mIoU score, which further verifies the effectiveness of our model.

5. Conclusions

In this paper, we proposed a dual attention equivariant network model to effectively improve the accuracy of CAM by considering both the channel and spatial information of different feature maps. Specifically, we first designed a channel–spatial attention module for DAEN that accurately extracts the features of target objects by considering the correlation among feature maps in different channels, and then integrated the CSM with equivariant regularization and pixel-correlation modules to achieve more accurate and effective pixel-level semantic segmentation. Extensive experiments on the PASCAL VOC 2012 and LUAD-HistoSeg datasets demonstrate that our proposed model outperforms existing state-of-the-art models for image-level weakly supervised semantic segmentation.

Author Contributions

Conceptualization, G.H.; methodology, Z.Z.; data curation, M.Z.; formal analysis, L.Z.; funding acquisition, G.H., J.L. (Jianming Liu), and J.L. (Jun Li); investigation, Z.Z.; resources, G.H.; software, Z.Z.; project administration, L.Z.; visualization, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Guangxi Natural Science Foundation Program under Grant No. 2025GXNSFBA069090, the Guangxi Key Laboratory of Trusted Software under Grant No. KX202324, the Key Research and Development Program of Guangxi under Grant No. GuiKeAD22035118, and the Key Laboratory of Equipment Data Security and Guarantee Technology, Ministry of Education under Grant No. GDZB2024060600.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/HAO668/WSSS-DAEN.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  2. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015: 18th International Conference), Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  3. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  4. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  5. Zhu, L.; Ji, D.; Zhu, S.; Gan, W.; Wu, W.; Yan, J. Learning statistical texture for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 12537–12546. [Google Scholar]
  6. Dai, J.; He, K.; Sun, J. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 1635–1643. [Google Scholar]
  7. Zhu, Y.; Zhou, Y.; Xu, H.; Ye, Q.; Doermann, D.; Jiao, J. Learning instance activation maps for weakly supervised instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3116–3125. [Google Scholar]
  8. Khoreva, A.; Benenson, R.; Hosang, J.; Hein, M.; Schiele, B. Simple does it: Weakly supervised instance and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 876–885. [Google Scholar]
  9. Vernaza, P.; Chandraker, M. Learning random-walk label propagation for weakly-supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7158–7166. [Google Scholar]
  10. Lin, D.; Dai, J.; Jia, J.; He, K.; Sun, J. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3159–3167. [Google Scholar]
  11. Wei, Y.; Feng, J.; Liang, X.; Cheng, M.-M.; Zhao, Y.; Yan, S. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1568–1576. [Google Scholar]
  12. Hou, Q.; Jiang, P.; Wei, Y.; Cheng, M.-M. Self-erasing network for integral object attention. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montreal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  13. Huang, Z.; Wang, X.; Wang, J.; Liu, W.; Wang, J. Weakly-supervised semantic segmentation network with deep seeded region growing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7014–7023. [Google Scholar]
  14. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef] [PubMed]
  15. Zhang, T.; Lin, G.; Cai, J.; Shen, T.; Shen, C.; Kot, A.C. Decoupled spatial neural attention for weakly supervised semantic segmentation. IEEE Trans. Multimed. 2019, 21, 2930–2941. [Google Scholar] [CrossRef]
  16. Chaudhry, A.; Dokania, P.K.; Torr, P.H.S. Discovering class-specific pixels for weakly-supervised semantic segmentation. arXiv 2017, arXiv:1707.05821. [Google Scholar]
  17. Sun, K.; Shi, H.; Zhang, Z.; Huang, Y. ECS-net: Improving weakly supervised semantic segmentation by using connections between class activation maps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7283–7292. [Google Scholar]
  18. Wang, X.; You, S.; Li, X.; Ma, H. Weakly-supervised semantic segmentation by iteratively mining common object features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1354–1362. [Google Scholar]
  19. Wang, Y.; Zhang, J.; Kan, M.; Shan, S.; Chen, X. Self-supervised scale equivariant network for weakly supervised semantic segmentation. arXiv 2019, arXiv:1909.03714. [Google Scholar]
  20. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
  21. Fan, J.; Zhang, Z.; Tan, T.; Song, C.; Xiao, J. CIAN: Cross-image affinity net for weakly supervised semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10762–10769. [Google Scholar]
  22. Wang, Y.; Zhang, J.; Kan, M.; Shan, S.; Chen, X. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12275–12284. [Google Scholar]
  23. Kolesnikov, A.; Lampert, C.H. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 695–711. [Google Scholar]
  24. Lee, J.; Kim, E.; Lee, S.; Lee, J.; Yoon, S. Ficklenet: Weakly and semi-supervised semantic image segmentation using stochastic inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5267–5276. [Google Scholar]
  25. Araslanov, N.; Roth, S. Single-stage semantic segmentation from image labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4253–4262. [Google Scholar]
  26. Shimoda, W.; Yanai, K. Self-supervised difference detection for weakly-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5208–5217. [Google Scholar]
  27. Zhang, Z.; Peng, Q.; Fu, S.; Wang, W.; Cheung, Y.-M.; Zhao, Y.; Yu, S.; You, X. A Componentwise Approach to Weakly Supervised Semantic Segmentation Using Dual-Feedback Network. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 7541–7554. [Google Scholar] [CrossRef] [PubMed]
  28. Jiang, P.-T.; Zhang, C.-B.; Hou, Q.; Cheng, M.-M.; Wei, Y. LayerCAM: Exploring Hierarchical Class Activation Maps for Localization. IEEE Trans. Image Process. 2021, 30, 5875–5888. [Google Scholar] [CrossRef] [PubMed]
  29. Zhang, B.; Xiao, J.; Wei, Y.; Huang, K.; Luo, S.; Zhao, Y. End-to-end weakly supervised semantic segmentation with reliable region mining. Pattern Recognit. 2022, 128, 108663. [Google Scholar] [CrossRef]
  30. Li, Y.; Sun, J.; Li, Y. Weakly-Supervised Semantic Segmentation Network With Iterative dCRF. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25419–25426. [Google Scholar] [CrossRef]
  31. Zeng, X.; Wang, T.; Dong, Z.; Zhang, X.; Gu, Y. Superpixel Consistency Saliency Map Generation for Weakly Supervised Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  32. Wang, C.; Zhang, D.; Zhang, L.; Tang, J. Coupling Global Context and Local Contents for Weakly-Supervised Semantic Segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 13483–13495. [Google Scholar] [CrossRef] [PubMed]
  33. Ahn, J.; Kwak, S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4981–4990. [Google Scholar]
  34. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  35. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  36. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  37. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 783–792. [Google Scholar]
  38. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  39. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  40. Stammes, E.; Runia, T.F.H.; Hofmann, M.; Ghafoorian, M. Find it if you can: End-to-end adversarial erasing for weakly-supervised semantic segmentation. In Proceedings of the Thirteenth International Conference on Digital Image Processing (ICDIP 2021), SPIE, Singapore, 20–23 May 2021; Volume 11878, pp. 610–619. [Google Scholar]
  41. Zhang, B.; Xiao, J.; Wei, Y.; Sun, M.; Huang, K. Reliability does matter: An end-to-end weakly supervised semantic segmentation approach. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12765–12772. [Google Scholar]
  42. Chan, L.; Hosseini, M.S.; Rowsell, C.; Plataniotis, K.N.; Damaskinos, S. HistoSegNet: Semantic segmentation of histological tissue type in whole slide images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10662–10671. [Google Scholar]
Figure 1. DAEN network structure: the DAEN incorporates the CSM module designed in Section 3.2, the equivariant regularization (EQ) designed in Section 3.3, the pixel-correlation module (PCM) designed in Section 3.4, and the loss functions designed in Section 3.5. $c^0$, $c^1$, $\hat{c}^0$, and $\hat{c}^1$ are the final outputs of each branch, respectively.
Figure 2. CSM module.
Figure 3. Channel attention module.
Figure 4. Spatial attention module.
Figure 5. PCM structure, where H, W, and C/$C_1$/$C_2$ represent the height, width, and number of channels of the feature map, respectively.
Figure 6. Comparison of pseudo-labels generated on PASCAL VOC 2012. (a) Original images. (b) Ground truth labels. (c) Pseudo-labels generated by the SEAM model and AffinityNet. (d) Pseudo-labels generated by the DAEN model and AffinityNet.
Figure 7. Comparison of segmentation results on PASCAL VOC 2012 verification set. (a) Original images. (b) Ground truth labels. (c) Segmentation labels produced by the SEAM model. (d) Segmentation labels produced by the DAEN model.
Table 1. The mIoU(%) scores of DAEN, CM, SM, and SEAM in five rounds of experiments with different random seeds.
Model    1         2         3         4         5         Average Result
SEAM     54.362    54.726    54.563    54.607    54.598    54.57
SM       55.086    54.933    54.949    55.018    54.963    54.98
CM       55.129    55.236    55.100    55.160    55.040    55.13
DAEN     55.360    55.302    55.351    55.244    55.211    55.29
Table 2. The ablation experiments on each loss function of the DAEN. CLA: classification loss. EQ: equivariant loss. CE: cross-equivariant loss.
CLA    EQ     CE     mIoU
✓                    50.44%
       ✓             13.04%
              ✓      50.58%
✓      ✓             51.43%
       ✓      ✓      18.35%
✓             ✓      54.38%
✓      ✓      ✓      55.36%
Table 3. Average mIoU(%) of feeding, respectively, the CAMs generated by the SEAM and DAEN models into AffinityNet for further refinement.
Model                 1         2         3         4         5         Average Result
SEAM + AffinityNet    62.434    62.914    62.578    62.694    62.605    62.65
DAEN + AffinityNet    64.461    64.373    64.390    64.298    64.288    64.36
Table 4. IoU(%) of each category for DAEN and SEAM models on PASCAL VOC 2012.
Category      SEAM    DAEN    Category       SEAM    DAEN
background    87.3    87.7    diningtable    51.6    54.6
aeroplane     67.6    63.8    dog            68.9    76.4
bicycle       40.2    39.1    horse          75.7    77.4
bird          81.8    83.3    motorbike      79.1    77.5
boat          42.5    36.3    person         52.1    60.8
bottle        56.2    65.6    pottedplant    44.0    48.1
bus           72.9    75.3    sheep          89.9    90.5
car           75.9    77.6    sofa           51.9    57.3
cat           60.5    71.4    train          67.0    64.0
chair         33.5    33.7    tv             55.2    55.2
cow           78.8    79.5    mIoU           63.5    65.5
Table 5. The performance comparison in terms of mIoU (%) of the proposed DAEN model with other weakly supervised semantic segmentation models on PASCAL VOC 2012. Note: * indicates the results obtained from running with the hyperparameters provided by SEAM on our server.
Methods             Backbone    val     Test
AffinityNet [33]    Resnet38    61.7    63.7
SSDD [26]           Resnet38    64.9    65.5
Araslanov [25]      Resnet38    62.7    64.3
PSA [40]            Resnet38    62.8    63.8
RRM [41]            Resnet38    62.6    62.9
SEAM [22]           Resnet38    64.5    65.7
WS-FCN [32]         Resnet38    65.0    64.2
SEAM * [22]         Resnet38    63.5    64.7
Our DAEN *          Resnet38    65.5    66.8
Table 6. The performance comparison in terms of mIoU (%) of the proposed DAEN model with other weakly supervised semantic segmentation models on LUAD-HistoSeg dataset.
Methods             TE      NEC     LYM     TAS     Average Result
HistoSegNet [42]    45.6    36.3    58.3    50.8    47.7
SEAM                47.6    49.5    48.3    56.9    50.6
Our DAEN            48.2    51.1    49.4    58.7    51.9
