First, we employ a local superpixel sampling weighting strategy to enhance local feature representation. By modeling pixels with soft membership relationships and generating weighted features that preserve boundary information at the pixel level, this method improves the model’s sensitivity to object boundaries while effectively capturing local details. Second, we introduce the concept of class centers to model global features. This approach transforms the traditional self-attention mechanism from modeling pixel-to-pixel interactions into modeling relationships between pixels and class centers, thereby reducing computational complexity and information redundancy. To capture global context, we update the pre-segmentation results using the similarity between pixels and class centers and then concatenate this information with the original feature map to enrich the feature representation. Finally, an adaptive feature fusion module is designed to effectively integrate local and global features, enhancing both segmentation accuracy and the network’s generalization capability.
2.1. Superpixel Sampling Weighting Module
The superpixel sampling weighting module is designed to capture local features of the image, providing fine-grained information for subsequent segmentation tasks. By leveraging the inherent advantages of superpixel segmentation in preserving object boundaries, this module incorporates superpixel-based local soft clustering into the deep network. This integration enhances the network’s focus on boundary information while effectively extracting local contextual features.
The process begins with feature extraction from the input image, capturing essential low-level information such as edges and textures, which serves as input to the superpixel sampling stage. As illustrated in Figure 2, the feature extraction network is composed of a series of convolutional layers, batch normalization (BN) layers, and ReLU activation functions. To increase the receptive field, max-pooling operations are applied after the third and fifth down-sampling stages. Furthermore, to integrate multi-scale information, low-level and high-level feature maps are concatenated to form enriched pixel-level features that support a more accurate superpixel-based representation.
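As a concrete illustration of this encoder, the following PyTorch-style sketch stacks conv-BN-ReLU blocks with max-pooling and concatenates upsampled deep features with shallow ones. The layer counts, channel widths, and pooling positions are illustrative assumptions rather than the exact configuration shown in Figure 2.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    """A single conv -> BN -> ReLU block, as used throughout the encoder."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class PixelFeatureEncoder(nn.Module):
    """Illustrative encoder: stacked conv-BN-ReLU blocks with max-pooling,
    whose low- and high-level outputs are upsampled and concatenated into
    enriched pixel-level features for the superpixel sampling stage."""
    def __init__(self, in_ch=3, width=64):
        super().__init__()
        self.stage1 = nn.Sequential(conv_bn_relu(in_ch, width), conv_bn_relu(width, width))
        self.pool1 = nn.MaxPool2d(2)   # down-sampling by max-pooling
        self.stage2 = nn.Sequential(conv_bn_relu(width, width), conv_bn_relu(width, width))
        self.pool2 = nn.MaxPool2d(2)   # further down-sampling by max-pooling
        self.stage3 = nn.Sequential(conv_bn_relu(width, width), conv_bn_relu(width, width))

    def forward(self, x):
        low = self.stage1(x)                         # low-level edges/textures
        mid = self.stage2(self.pool1(low))
        high = self.stage3(self.pool2(mid))
        # Upsample deeper maps back to the input resolution and concatenate,
        # fusing multi-scale information into per-pixel features.
        size = low.shape[-2:]
        mid_up = nn.functional.interpolate(mid, size=size, mode="bilinear", align_corners=False)
        high_up = nn.functional.interpolate(high, size=size, mode="bilinear", align_corners=False)
        return torch.cat([low, mid_up, high_up], dim=1)
```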
Furthermore, to better utilize the relationship between pixels and superpixels, the hard association is transformed into a soft association, replacing nearest-neighbor constraints with weight computation. This approach provides a more flexible way to capture the mapping Q between pixel and superpixel features. Specifically, for the t-th iteration, the mapping relationship between pixel p and superpixel i is given by (1). Additionally, during the iteration process, the distance calculation is constrained to the nine superpixels surrounding each pixel, so the corresponding Q becomes a sparse matrix, improving the computational efficiency of the algorithm while enhancing the capture of local pixel correlations.
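To make the soft association concrete, the sketch below follows an SSN-style formulation under the constraints described above: each pixel is compared only with the nine surrounding superpixels on the sampling grid, and the weight decays with feature distance. The function names, the exponential distance form, and the tensor shapes are assumptions for illustration; the paper's exact expression is given in (1).

```python
import torch

def soft_association(pixel_feats, sp_feats, sp_grid_index):
    """Compute the soft pixel-to-superpixel mapping Q for one iteration.

    pixel_feats:   (N, C)  features of the N pixels
    sp_feats:      (M, C)  features of the M superpixel centers from iteration t-1
    sp_grid_index: (N, 9)  indices of the nine superpixels surrounding each pixel
    Returns Q as a sparse (N, 9) weight matrix over the local candidates.
    """
    # Gather the 9 candidate superpixel features for every pixel: (N, 9, C)
    candidates = sp_feats[sp_grid_index]
    # Soft membership: the weight decays with squared feature distance
    # (an SSN-style choice standing in for the paper's Eq. (1)).
    dist2 = ((pixel_feats.unsqueeze(1) - candidates) ** 2).sum(dim=-1)   # (N, 9)
    return torch.exp(-dist2)

def update_superpixels(pixel_feats, Q, sp_grid_index, num_superpixels):
    """Re-estimate superpixel features as the Q-weighted average of their pixels."""
    N, C = pixel_feats.shape
    sp_feats = torch.zeros(num_superpixels, C)
    weight_sum = torch.zeros(num_superpixels, 1)
    flat_idx = sp_grid_index.reshape(-1)                                  # (N*9,)
    flat_q = Q.reshape(-1, 1)                                             # (N*9, 1)
    flat_px = pixel_feats.unsqueeze(1).expand(N, 9, C).reshape(-1, C)     # (N*9, C)
    sp_feats.index_add_(0, flat_idx, flat_q * flat_px)
    weight_sum.index_add_(0, flat_idx, flat_q)
    return sp_feats / weight_sum.clamp_min(1e-8)
```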
Finally, Q is normalized row-wise and column-wise. The two normalized matrices are then used to weight the superpixel features separately, followed by feature fusion as in (2). The row-normalized matrix standardizes the pixel-to-superpixel membership relationships, representing the relationship between each pixel and the superpixels, while the column-normalized matrix standardizes the superpixel-to-pixel membership relationships, indicating the relationship between each superpixel and its constituent pixels. This weighted fusion approach not only balances the relationships between pixels and superpixels but also addresses the feature neglect caused by overly sparse matrices, enriching the local feature information.
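A minimal sketch of this normalization and weighted fusion step, continuing the assumed shapes from the previous sketch: Q is normalized along its rows (per pixel) and along its columns (per superpixel), each normalized matrix weights the superpixel features, and the two weighted views are fused. The simple additive fusion here stands in for the paper's Eq. (2).

```python
import torch

def weighted_local_features(Q, sp_feats, sp_grid_index):
    """Fuse superpixel features back onto pixels using row- and column-normalized Q.

    Q:             (N, 9) soft association between each pixel and its 9 candidates
    sp_feats:      (M, C) superpixel features
    sp_grid_index: (N, 9) candidate superpixel index per pixel
    """
    # Row normalization: each pixel's memberships over its candidate superpixels sum to 1.
    Q_row = Q / Q.sum(dim=1, keepdim=True).clamp_min(1e-8)
    # Column normalization: each superpixel's weights over its member pixels sum to 1.
    col_sum = torch.zeros(sp_feats.shape[0]).index_add_(
        0, sp_grid_index.reshape(-1), Q.reshape(-1))
    Q_col = Q / col_sum[sp_grid_index].clamp_min(1e-8)

    candidates = sp_feats[sp_grid_index]                       # (N, 9, C)
    feats_row = (Q_row.unsqueeze(-1) * candidates).sum(dim=1)  # pixel-to-superpixel view
    feats_col = (Q_col.unsqueeze(-1) * candidates).sum(dim=1)  # superpixel-to-pixel view
    # Simple fusion of the two weighted views (the paper's fusion is given in Eq. (2)).
    return feats_row + feats_col
```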
2.2. Class Center Attention Module
Self-attention mechanisms are effective in capturing long-range dependencies between pixels, enabling the model to develop a more comprehensive understanding of image content and thereby improving semantic segmentation accuracy. However, their high computational complexity introduces a significant trade-off between accuracy and inference speed when applied to convolutional neural networks.
To address this challenge, we propose a Class-Center Attention Module, illustrated in Figure 3, specifically designed to reduce computational overhead while preserving the benefits of global context modeling. This module utilizes pre-segmentation results to generate class center features, which serve as category-level representations. By transforming pixel-to-pixel associations into relationships between pixels and class centers, the module shifts the modeling perspective from the spatial domain to the semantic category domain. This transformation significantly reduces redundancy in the attention matrix and improves computational efficiency, while still enabling effective global feature extraction.
Firstly, feature information F is extracted with an encoding network, and pre-segmentation results S are obtained from a fully convolutional network, serving as prior information for subsequent operations. To address the loss of detail during downsampling, we modify ResNet50, which consists of five stages. Stages 1 to 3 are kept consistent with the original ResNet50, while in stages 4 and 5 the stride is set to 1, preserving the resolution of the feature maps. Additionally, dilated convolutions replace the ordinary convolutions, increasing the receptive field without adding extra parameters. The dilation factors in these two stages are set to {1, 2} and {2, 4}, progressively expanding the receptive field to capture more feature information.
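The backbone modification can be sketched as below using torchvision's ResNet-50, where layer3 and layer4 correspond to stages 4 and 5: strides are reset to 1 and dilation factors are applied to the 3 × 3 convolutions. The exact per-block assignment of the {1, 2} and {2, 4} factors is an assumption.

```python
from torchvision.models import resnet50

def build_backbone():
    """Illustrative construction of the modified ResNet-50 encoder: stages 1-3
    are left unchanged, while stages 4 and 5 keep stride 1 and use dilated
    convolutions so that the feature-map resolution is preserved."""
    net = resnet50(weights=None)

    def dilate_stage(stage, dilations):
        """Set stride to 1 and apply the given dilation factors to a ResNet stage."""
        for block, d in zip(stage, dilations):
            if block.downsample is not None:
                block.downsample[0].stride = (1, 1)
            block.conv2.stride = (1, 1)
            block.conv2.dilation = (d, d)
            block.conv2.padding = (d, d)

    # torchvision's layer3/layer4 correspond to stages 4 and 5 in the text.
    # The dilation factors {1, 2} and {2, 4} are cycled over the blocks of each
    # stage; the exact per-block assignment is an assumption.
    dilate_stage(net.layer3, dilations=[1, 2, 1, 2, 1, 2])
    dilate_stage(net.layer4, dilations=[2, 4, 2])
    return net
```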
Secondly, to reduce computational costs, the feature map F is dimensionally reduced using a 1 × 1 convolution that decreases the number of channels. Simultaneously, F and the pre-segmentation map S are reshaped so that the spatial dimensions are flattened into N = H × W, where H denotes height and W denotes width. The pre-segmentation result S serves as the similarity between pixels and classes. Matrix multiplication is then performed between S and the transpose of F, followed by normalization, yielding the global class center features. The calculation formula is as follows:

$$ c_i = \frac{\sum_{j=1}^{N} s_{ij}\, f_j}{\sum_{j=1}^{N} s_{ij}} $$

where $s_{ij}$ represents the probability that pixel j belongs to class i, $f_j$ is the feature vector of pixel j, and $c_i$ is the center feature of class i.
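A hedged sketch of this class-center computation: the pre-segmentation is converted to per-class pixel probabilities, normalized over pixels, and multiplied with the flattened feature map. The tensor names and the softmax over classes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def class_centers(feat, seg):
    """Compute global class-center features from pixel features and pre-segmentation.

    feat: (B, C', H, W) reduced pixel features (after the 1x1 convolution)
    seg:  (B, K,  H, W) pre-segmentation logits for K classes
    Returns class centers of shape (B, K, C'): each center is the normalized,
    probability-weighted sum of the features of the pixels assigned to that class.
    """
    B, C, H, W = feat.shape
    K = seg.shape[1]
    feat = feat.view(B, C, H * W)                     # (B, C', N), N = H*W
    prob = F.softmax(seg.view(B, K, H * W), dim=1)    # s_ij: probability pixel j is class i
    # Normalize each class row so its weights over pixels sum to 1,
    # then aggregate the pixel features into one center per class.
    weights = prob / prob.sum(dim=2, keepdim=True).clamp_min(1e-8)    # (B, K, N)
    centers = torch.bmm(weights, feat.transpose(1, 2))                # (B, K, C')
    return centers
```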
Finally, the pre-segmentation results are updated using the class center features (depicted in Figure 4), as shown in (4).
As evident from the formulation, the updated segmentation results rely on two key components: the initial pre-segmentation outputs and the feature similarity between class center features and individual pixel features. When the pre-segmentation result indicates a high probability of a pixel belonging to a specific class, that pixel contributes more strongly to the corresponding class center, and the similarity between the pixel and the class center feature is higher. As a result, the updated segmentation is more likely to assign that pixel to the same class.
Conversely, when the class probabilities for a pixel are relatively uniform—indicating weak feature discrimination—the introduction of long-range contextual information via class center features helps enhance class-specific distinctiveness. This, in turn, improves the reliability and accuracy of pixel-level classification.
To reduce the dependency on coarse pre-segmentation results, we introduce class center attention feature maps, computed by weighting the class center features with the updated segmentation representation. These attention feature maps are then reshaped back to the spatial resolution of the feature map and passed through a convolution to align with the dimensions of the original feature map F. Finally, a straightforward element-wise addition fuses the original feature map with the refined class center attention map, yielding semantic features enriched with global contextual information.
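The update-and-fusion step might be sketched as follows, assuming a scalar fusion weight for Eq. (4) and a 1 × 1 convolution for channel alignment; the actual weighting scheme in the paper may be learned rather than fixed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassCenterUpdate(nn.Module):
    """Update the pre-segmentation with pixel/class-center similarity and fuse
    the resulting attention features back into the original feature map F."""
    def __init__(self, in_channels, reduced_channels, alpha=0.5):
        super().__init__()
        self.alpha = alpha                      # assumed scalar fusion weight for Eq. (4)
        self.align = nn.Conv2d(reduced_channels, in_channels, kernel_size=1)

    def forward(self, feat_full, feat_reduced, seg, centers):
        # feat_full:    (B, C,  H, W) original features F
        # feat_reduced: (B, C', H, W) reduced features used for similarity
        # seg:          (B, K,  H, W) pre-segmentation probabilities
        # centers:      (B, K,  C')   class-center features
        B, Cr, H, W = feat_reduced.shape
        pix = feat_reduced.view(B, Cr, H * W)                  # (B, C', N)
        sim = torch.bmm(centers, pix)                          # (B, K, N) pixel/center similarity
        sim = F.softmax(sim, dim=1)
        # Adaptive weighted fusion of the prior segmentation and the similarity term.
        updated = self.alpha * seg.view(B, -1, H * W) + (1 - self.alpha) * sim
        # Class-center attention map: weight the centers by the updated distribution.
        attn = torch.bmm(centers.transpose(1, 2), updated)     # (B, C', N)
        attn = self.align(attn.view(B, Cr, H, W))              # match the channels of F
        return feat_full + attn                                # element-wise fusion
```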
It is worth emphasizing that the design of this update mechanism deliberately avoids rigid reliance on the quality of the initial pre-segmentation, thereby alleviating the potential “circular dependency” issue. In this design, the initial pre-segmentation result S only serves as a dynamic “class prior” rather than the sole basis for the decision. As shown in (4), the updated segmentation result is obtained through adaptive weighted fusion of the initial pre-segmentation and a term based on feature similarity. This means that even in the early stages of training, when the pre-segmentation results are relatively coarse, the similarity computed between the pixel features F and their corresponding class-center features can still provide a strong correction signal. If the feature of a certain pixel is closer to the feature of another class center, this term will dominate the update process, effectively correcting the initial misclassification. More importantly, this mechanism forms a positive feedback loop during training. As training progresses, the feature representation capability of the encoder continues to improve, and the quality of the pre-segmentation results is enhanced accordingly. Higher-quality pre-segmentation generates more accurate class centers, and these more accurate class centers provide more reliable global contextual information through the attention mechanism, further optimizing the feature representation. This iterative self-improvement enables the module to converge stably and gradually improve the accuracy and robustness of segmentation.
2.5. Evaluation Metrics
In the field of image segmentation, several metrics are employed to evaluate model performance. This paper adopts a set of widely used metrics, including Pixel Accuracy (PA), Mean Intersection over Union (mIoU), Boundary Recall (BR), and Achievable Segmentation Accuracy (ASA). These metrics comprehensively assess the segmentation effectiveness of the algorithm.
Pixel Accuracy (PA) represents the ratio of correctly segmented pixels to the total number of pixels; a higher PA value indicates better overall performance. The formula for PA is

$$ \mathrm{PA} = \frac{\sum_{i=0}^{n} p_{ii}}{\sum_{i=0}^{n}\sum_{j=0}^{n} p_{ij}} \tag{7} $$

where n is the total number of semantic categories excluding the background, i represents the ground truth class, j represents the predicted class, and $p_{ij}$ denotes the number of pixels of class i that are predicted as class j.
Mean Intersection over Union (mIoU) measures the degree of overlap between the segmentation result and the ground truth. It is defined as the average, over all classes, of the ratio of intersection to union between the predicted segmentation and the ground truth. A higher mIoU value signifies that the algorithm’s segmentation result aligns more closely with the ground truth image. The formula for mIoU is

$$ \mathrm{mIoU} = \frac{1}{n+1}\sum_{i=0}^{n}\frac{p_{ii}}{\sum_{j=0}^{n} p_{ij} + \sum_{j=0}^{n} p_{ji} - p_{ii}} $$

In the above formula, n + 1 is the total number of semantic categories including the background, the term $p_{ii}$ represents the true positives (pixels of class i correctly predicted as class i), and the denominator $\sum_{j} p_{ij} + \sum_{j} p_{ji} - p_{ii}$ is the union of the predicted and ground truth pixels for class i. This can also be expressed for each class using True Positives (TP), False Positives (FP), and False Negatives (FN), with mIoU being the mean of this value across all classes:

$$ \mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}, \qquad \mathrm{mIoU} = \frac{1}{n+1}\sum_{i=0}^{n}\mathrm{IoU}_i $$
Unlike PA in (7), which treats the contribution of each pixel equally, mIoU provides a distinct evaluation of performance for each class, offering a more comprehensive and objective reflection of the algorithm’s capabilities.
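For reference, both PA (7) and mIoU can be computed from a single confusion matrix; the sketch below is a straightforward NumPy implementation of the formulas above.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Confusion matrix whose entry (i, j) counts pixels of class i predicted as class j."""
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """PA: correctly classified pixels over all pixels, Eq. (7)."""
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    """mIoU: per-class TP / (TP + FP + FN), averaged over all classes (incl. background)."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    return iou.mean()
```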
Boundary Recall (BR) and Achievable Segmentation Accuracy (ASA) are primarily used to evaluate the performance of superpixel segmentation algorithms. BR measures the effectiveness of boundary segmentation, representing the proportion of ground truth boundaries that are correctly detected (or “recalled”) by the predicted segmentation boundaries within a given tolerance:

$$ \mathrm{BR} = \frac{TP(G, S)}{TP(G, S) + FN(G, S)} $$

where G represents the set of ground truth superpixel boundaries, S is the set of predicted superpixel boundaries, TP(G, S) denotes the number of ground truth boundary pixels in G that are correctly identified by a predicted boundary pixel in S within a specified neighborhood, and FN(G, S) denotes the number of ground truth boundary pixels in G that are not detected by any predicted boundary pixel in S within that neighborhood.
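A simple implementation of BR dilates the predicted boundary map by the matching tolerance and counts which ground-truth boundary pixels fall inside it; the default tolerance value below is an assumption.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def boundary_recall(gt_boundary, pred_boundary, tolerance=2):
    """BR: fraction of ground-truth boundary pixels matched by a predicted
    boundary pixel within `tolerance` pixels."""
    # Dilate the predicted boundary so matches within the tolerance count as hits.
    struct = np.ones((2 * tolerance + 1, 2 * tolerance + 1), dtype=bool)
    pred_dilated = binary_dilation(pred_boundary.astype(bool), structure=struct)
    tp = np.logical_and(gt_boundary.astype(bool), pred_dilated).sum()
    fn = np.logical_and(gt_boundary.astype(bool), ~pred_dilated).sum()
    return tp / max(tp + fn, 1)
```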
Achievable Segmentation Accuracy (ASA) evaluates the accuracy of superpixel segmentation from a pixel-level perspective by measuring the similarity between the predicted superpixels and the ground truth superpixels. It calculates the upper-bound accuracy achievable by labeling each predicted superpixel with the ground truth label that has the maximum overlap:

$$ \mathrm{ASA} = \frac{\sum_{j}\max_{i}\left|S_j \cap G_i\right|}{N} $$

where S is the set of predicted superpixel blocks ($S_j$ being the j-th superpixel), G is the set of ground truth superpixels ($G_i$ being the i-th ground truth superpixel), and N is the total number of pixels in the image. The term $\left|S_j \cap G_i\right|$ represents the count of pixels in the intersection of the predicted superpixel $S_j$ and the ground truth superpixel $G_i$.
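ASA can be computed by accumulating the overlap counts $\left|S_j \cap G_i\right|$ and taking, for each predicted superpixel, its best-overlapping ground truth segment, as in the following sketch.

```python
import numpy as np

def achievable_segmentation_accuracy(sp_labels, gt_labels):
    """ASA: label every predicted superpixel with its best-overlapping ground-truth
    segment and report the resulting upper-bound pixel accuracy."""
    sp_labels = sp_labels.ravel()
    gt_labels = gt_labels.ravel()
    n_pixels = sp_labels.size
    # Overlap counts |S_j ∩ G_i| for every (superpixel, ground-truth segment) pair.
    overlap = np.zeros((sp_labels.max() + 1, gt_labels.max() + 1), dtype=np.int64)
    np.add.at(overlap, (sp_labels, gt_labels), 1)
    # Each superpixel contributes its maximum overlap with any ground-truth segment.
    return overlap.max(axis=1).sum() / n_pixels
```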