1. Introduction
Instance segmentation is a fundamental task in computer vision that combines the characteristics of object detection and semantic segmentation: it not only identifies the category of each object but also generates a corresponding pixel-level mask for every instance [1]. When deployed in critical domains such as autonomous driving, medical imaging, and remote sensing, instance segmentation is typically subject to stringent requirements on both accuracy and processing speed [2]. These metrics are not merely key indicators of system performance but also critical determinants of operational stability and safety.
Traditional instance segmentation can be traced back to the interactive graph-cut method proposed by Rother et al. [3] for extracting foreground objects from images, which corresponds to the foreground-extraction aspect of instance segmentation. Traditional methods mainly rely on classical image processing and hand-crafted feature extraction. However, they face several limitations: insufficient feature representation capability, difficulty handling complex scenes, low accuracy, and a tendency to generate substantial redundant information. With the advancement of deep learning, significant breakthroughs have been achieved in instance segmentation; in particular, convolutional neural networks (CNNs) have been widely adopted, spawning a series of methods with markedly improved performance. Current deep learning instance segmentation methods are mainly divided into two categories according to their detection structure: one-stage methods and two-stage methods.
Two-stage methods, represented by Faster R-CNN [4] and Cascade R-CNN [5], first generate candidate regions and then classify and segment each region, which yields higher precision. One-stage methods, represented by the YOLO series [6,7,8,9] and SSD [10], skip the proposal step and perform dense prediction directly on the feature map, which yields higher speed. Mask R-CNN, the most classical two-stage model, was proposed by He et al. [11]. By adding a segmentation mask branch to Faster R-CNN, it outputs a pixel-level mask for each candidate region and realizes multi-task learning of classification, bounding-box regression, and mask prediction. Cascade Mask R-CNN [12] is an instance segmentation model extended from Cascade R-CNN; it achieves instance segmentation by adding a mask branch to the cascade structure. Compared with Mask R-CNN, such methods achieve higher accuracy but suffer from relatively slower inference. Notable refinements include PointRend, proposed by Kirillov et al. [13], and RefineMask, developed by Zhang et al. [14]. PointRend prioritizes high-uncertainty regions to enhance focus on object edges, while RefineMask explicitly supervises boundary areas for edge refinement. Both approaches optimize the mask edges generated by Mask R-CNN through cascaded coarse-to-fine upsampling, effectively addressing the subpar edge prediction of standard Mask R-CNN outputs.
One-stage real-time segmentation began to rise in 2018. Feature pyramid enhancement methods focus on improving multi-scale representation, with PANet [
15] pioneering bidirectional feature pyramid networks that enhance multi-scale object segmentation capabilities and subsequently being adopted as the neck network in YOLOv4/v5 for improved small object segmentation. Prototype-based approaches revolutionized real-time segmentation efficiency, exemplified by YOLACT [
16], the first real-time one-stage instance segmentation framework that generates masks through shared prototypes and linear combinations, with subsequent frameworks like EmbedMask [
17] and CondInst [
18] building upon this foundation.
Dynamic kernel prediction methods address mask quality issues, as demonstrated by SOLOv2 [
19], which improved upon SOLOv1 [
20] by introducing dynamic convolution kernel prediction to solve YOLACT’s mask coarseness problem, while BlendMask [
21] further enhanced instance discrimination in complex scenes through attention-guided feature fusion. Contour-based segmentation approaches offer an alternative to traditional mask-level characterization, with PolarMask [
22] extending the FCOS detection framework to unify object detection and instance segmentation architectures, and AdaptIS [
23] enabling instance mask generation from single-point annotations, closely mimicking manual annotation processes.
Unified architecture methods demonstrate versatility across segmentation tasks, with K-Net [
24] handling semantic, instance, and panoptic segmentation simultaneously through learnable dynamic kernels without relying on traditional region proposals. Transformer-based approaches have significantly advanced the field, with Swin Transformer [
25] introducing hierarchical window self-attention mechanisms widely adopted in segmentation backbones, Mask DINO [
26] enhancing DETR-like models through dynamic matching and mask decoding modules, and YOLO-SegNet, proposed by Yang et al. [27], seamlessly fusing YOLOv8 with SegFormer for refined spatio-temporal feature expression. Real-time optimization methods balance performance with efficiency, as exemplified by RTMDet-Ins [28], which maintains strong feature representation while achieving excellent real-time performance on COCO benchmarks, and by YOLO-CORE [29], proposed by Liu et al., which improves boundary description and segmentation precision through contour modeling within the YOLO framework. The continuous evolution of one-stage instance segmentation toward efficient, deployment-friendly architectures, end-to-end inference, and decoupled mask generation, combined with the expanding YOLO ecosystem and the integration of dynamic masks, lightweight backbones, and Transformer modules, promises more practical solutions for high-performance instance segmentation systems in real-world applications.
In recent years, with the continuous development of autonomous driving technology, the application scenarios of visual perception systems have become increasingly diverse. Providing real-time feedback in complex street scenes with large numbers of dense targets is a major challenge, and instance segmentation methods that accurately separate each target object and delineate its boundaries are key to semantic understanding, path planning, and obstacle avoidance. However, although advanced instance segmentation models such as Mask R-CNN achieve high precision, they are difficult to deploy on resource-constrained platforms such as on-board GPUs or edge devices. Reducing the complexity of instance segmentation models while preserving segmentation performance is therefore a current research hotspot in instance segmentation.
To this end, this paper proposes a dual-path enhanced lightweight scheme based on YOLO11. The contributions of this paper are as follows:
A dual-model (YOLO-SA and YOLO-SD) collaborative lightweight enhancement scheme is proposed. Without changing the main structure or decoding heads of YOLO11, two enhancement modules are designed to improve segmentation precision while controlling model complexity.
The SimAM attention mechanism is introduced into the C3k2 module (forming C3k2SA). With almost no additional parameters or computational cost, the model's perception of small targets and locally salient regions is effectively enhanced.
DySample and SPDConv modules are combined to optimize the upsampling and convolution paths, improving the accuracy of feature reconstruction without introducing large-scale structures, suppressing interference from redundant information, and improving overall segmentation performance.
Multiple sets of ablation experiments and comparative verification are carried out. The empirical results show that YOLO-SA improves segmentation performance with negligible additional computational cost, and YOLO-SD maintains excellent segmentation precision while reducing both parameters and FLOPs.
The dual-path enhancement models achieve a good trade-off between lightweight design and precision, providing a practical reference solution for single-stage instance segmentation in edge-deployment and real-time applications.
The remainder of this paper is organized as follows.
Section 2 introduces the network structure of YOLO11 and the basic principles of the SimAM, SPDConv, and DySample modules.
Section 3 describes the fundamental framework and specific procedures of the proposed method. The datasets, comparative experimental results, and analysis are reported in
Section 4. Finally,
Section 5 summarizes the algorithm and outlines directions for further work.
3. Architecture
To balance the precision of instance segmentation models against the difficulty of deployment in autonomous driving, we choose YOLO11 as the base model. While preserving its basic structure, two lightweight and efficient enhancements, SimAM and DySample + SPDConv, are added to improve model performance.
Figure 5 illustrates the YOLO-SA and YOLO-SD architectures.
The C3k2 structure is widely used as a feature extraction module in YOLO11. To enhance the model's response to object-edge and small-target regions, a lightweight attention mechanism, SimAM (Simple Attention Module) [31], is introduced into the original C3k2 module to construct the C3k2SA module, as illustrated in Figure 5a. SimAM is a parameter-free attention mechanism that strengthens the response at important spatial positions by weighting features according to neuron activity, improving the model's ability to attend to salient visual regions. It adds no extra weights or convolution operations and can directly replace the activation at the corresponding position without affecting the original structure. Applied to the features produced after the split-and-fuse path inside the block, SimAM re-weights the fused information so that the model attends to useful cues earlier, prevents large amounts of useless or redundant information from propagating from the outset, and lets features from different levels interact in a more targeted way, which improves overall performance. The introduction of C3k2SA adds almost no parameters, yet it better exploits the model's selective response to key features; in particular, the mask quality in complex backgrounds and boundary regions is noticeably higher than that of other models.
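As a concrete illustration, the energy-based weighting that SimAM applies can be sketched in a few lines of PyTorch. This is a minimal sketch under common assumptions: the class and argument names are ours, and the regularizer e_lambda uses the value 1e-4 often quoted for SimAM rather than a value specified in this paper.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: weight each activation by an energy-based saliency score."""
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w - 1
        # squared deviation of each activation from its per-channel spatial mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # per-channel variance estimate over the spatial dimensions
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse energy: larger for neurons that stand out within their channel
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)
```

In a C3k2SA-style block, such a module can simply be applied to the fused features after the split branches are concatenated, so no learnable weights are added to the original C3k2 structure.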
The original YOLO11 structure uses nearest-neighbor interpolation for upsampling. Although this method is computationally cheap, it cannot accurately preserve the detailed and semantic information of the features, which degrades performance on dense prediction tasks. In YOLO-SD, the four standard convolutions (Conv) at each stage of the original backbone are replaced with the lightweight spatially aware convolution SPDConv to improve parameter utilization and spatial perception in the feature extraction stage. Guided by local spatial information, SPDConv performs effective convolution while avoiding many redundant convolution operations, enabling low-cost modeling of complex targets across the P2–P5 scales. The C3k2 module, SPPF, and C2PSA of the original YOLO11 structure are retained to preserve the expression of deep semantic information. Moreover, the original Upsample operation in the neck is replaced with DySample upsampling, which reconstructs features using local dynamic weights. This effectively alleviates the spatial information loss introduced by interpolation and is better suited to recovering small targets, as shown in Figure 5b. The two DySample modules act on the P5→P4 and P4→P3 upsampling paths, respectively, and cross-layer feature fusion is then completed through Concat. The three-scale output mechanism of YOLO11 is retained, so instance segmentation masks are finally output at the P3, P4, and P5 scales, keeping the same effective layer depth as the original trunk. In this way, the model size is compressed while the perception of small targets in dense scenes is enhanced.
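To make the two building blocks concrete, the sketch below shows (i) a space-to-depth convolution in the spirit of SPDConv, which folds each 2 × 2 neighbourhood into the channel dimension before a stride-1 convolution, and (ii) a simplified DySample-style upsampler that predicts per-pixel sampling offsets and resamples the feature map with grid_sample. This is an illustrative approximation under our own naming and default hyperparameters, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPDConv(nn.Module):
    """Space-to-depth convolution: downsample by 2x without discarding pixels."""
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        # space-to-depth multiplies the channel count by 4
        self.conv = nn.Conv2d(4 * c_in, c_out, k, stride=1, padding=k // 2)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rearrange each 2x2 spatial block into the channel dimension (H and W are halved)
        x = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.act(self.bn(self.conv(x)))

class DySample(nn.Module):
    """Content-aware upsampling via learned sampling offsets."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # predict a (dx, dy) offset pair for every output position
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        s = self.scale
        # offsets at the output resolution, scaled down for stable training
        off = F.pixel_shuffle(self.offset(x) * 0.25, s)               # (b, 2, h*s, w*s)
        # base sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1.0, 1.0, h * s, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
        # convert pixel offsets to normalized units and resample the input features
        norm = torch.tensor([w, h], dtype=x.dtype, device=x.device).view(1, 2, 1, 1)
        grid = (base + 2.0 * off / norm).permute(0, 2, 3, 1)           # (b, h*s, w*s, 2)
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```

For example, DySample(256)(torch.randn(1, 256, 40, 40)) returns a 1 × 256 × 80 × 80 tensor, while SPDConv(64, 128) halves the spatial resolution without a strided convolution or pooling.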
4. Experimental Results
In this study, YOLO11l is selected as the baseline architecture because it achieves the best precision among the YOLO11 variants during initial training, making it a suitable candidate for demonstrating the effectiveness of the proposed enhancement modules. Although our method is compatible with other versions such as YOLO11s and YOLO11m, the large variant provides a better foundation for performance-improvement analysis due to its superior baseline accuracy.
4.1. Dataset
The Cityscapes dataset [
35] holds significant value for computer vision research. Primarily designed for semantic segmentation tasks, it also supports the development of application systems such as autonomous driving and ADAS. The dataset comprises 5000 images captured from a driver's perspective in urban driving scenarios, divided into 2975 training images, 500 validation images, and 1525 test images. These images are densely annotated at the pixel level (fine annotations), and eight categories, including pedestrians and vehicles, are additionally labeled at the instance level. The data were collected from street views of approximately 50 cities in Germany and neighboring European countries across three seasons (spring, summer, and autumn). The collection scope mainly covers large urban road scenes, encompassing traffic flow on ordinary urban roads, street-side restaurants and shops, and road infrastructure such as trees and transit stops. Each image has a resolution of 2048 × 1024, with annotations for up to 19 object classes (dense labels covering most of each image) and 8 classes with precise instance-level annotations. Cityscapes provides two annotation qualities: “fine” and “coarse.” The fine subset contains 5000 meticulously annotated images, while the coarse subset supplements these 5000 fine annotations with 20,000 coarsely annotated images.
The COCO dataset is a large-scale, comprehensive resource for object detection, segmentation, and captioning tasks [
36]. Focused on scene understanding, it draws primarily from complex daily-life scenarios, with objects in images annotated via accurate segmentation. The dataset includes 91 object categories, 328,000 images, and 2.5 million labels. As one of the largest datasets for semantic segmentation to date, it provides annotations for 80 categories across over 330,000 images, with 200,000 of these images labeled. The total number of individual instances in the dataset exceeds 1.5 million.
In this work, although all experiments are conducted on the Cityscapes dataset, the backbone networks of all models are initialized with pretrained weights from COCO2017. This leverages the diverse annotations in COCO to improve generalization and stability during fine-tuning.
4.2. Training Details of Model
The experiments are carried out in an environment of Python 3.8, Torch 2.0.0, CUDA 11.8, and Torchvision 0.15.1. The hardware configuration comprises an Intel Core i9-12900K CPU (Intel, Santa Clara, CA, USA) with 16 GB of RAM and an NVIDIA GeForce RTX 3090 Ti GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of video memory. The equipment was provided by the laboratory of Jiangsu University of Science and Technology, China. The specific parameters are shown in
Table 1.
In all experiments, a transfer learning strategy is used for model training: pre-trained weights obtained on a large-scale dataset serve as initialization, and the model is fine-tuned on the target task. To improve training stability and efficiency, model parameters are initialized with pre-trained weights from the COCO2017 dataset. This strategy not only helps the model converge quickly in the early stage but also enhances its generalization ability, showing stronger robustness under non-ideal data such as small samples and pseudo-labels. All models are trained under a unified training strategy to ensure fairness and comparability of the results. Training images are uniformly rescaled to 640 × 640, the optimizer is AdamW, the batch size is 16, and the maximum number of training epochs is 100. The early-stopping strategy is not enabled; instead, the full schedule is run to ensure that the model converges completely. Mixed-precision training (AMP) is enabled in all experiments to improve computational efficiency. Apart from loading COCO pre-trained weights into the backbone, all other training parameters are kept consistent across experiments.
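As a reproducibility aid, the settings above map roughly onto the Ultralytics training API as sketched below. The dataset YAML name is a placeholder for a Cityscapes configuration in YOLO segmentation format, and the exact argument set should be checked against the installed Ultralytics version.

```python
from ultralytics import YOLO

# start from COCO-pretrained YOLO11l segmentation weights (transfer learning)
model = YOLO("yolo11l-seg.pt")

model.train(
    data="cityscapes-seg.yaml",  # hypothetical dataset config in YOLO format
    imgsz=640,                   # images rescaled to 640 x 640
    epochs=100,                  # full schedule
    batch=16,
    optimizer="AdamW",
    amp=True,                    # mixed-precision training
    patience=100,                # patience >= epochs effectively disables early stopping
)
```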
4.3. Evaluation Indicators of the Model
Four evaluation metrics, namely mean average precision (mAP), FLOPs, model parameters, and FPS, are adopted to compare and evaluate the different segmentation models. Among them, mAP@0.5 denotes the average precision at an intersection-over-union (IoU) threshold of 0.5 and measures the basic segmentation ability of a model, while mAP@0.5:0.95 averages the precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05; it reflects more strictly how accurately the model fits target boundaries and is the more authoritative evaluation standard for current instance segmentation tasks. The mAP calculation relies on two basic concepts: precision and recall. Precision represents how many of the samples predicted as positive are truly positive and is defined as

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

Recall represents how many of all true positive samples are correctly predicted by the model and is defined as

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]

On this basis, the precision and recall values under a given IoU threshold are computed, and the average precision (AP) is obtained as the area under the precision–recall curve. In integral form,

\[ AP = \int_{0}^{1} P(r)\, dr \]

The mean average precision (mAP) is the average of the AP values over all categories:

\[ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \]

where TP (true positives) is the number of correctly predicted positive instances; FP (false positives) is the number of negative instances incorrectly predicted as positive; FN (false negatives) is the number of positive instances incorrectly predicted as negative; P(r) is the precision value at a specific recall level r; AP is the area under the precision–recall curve for a single class; N is the total number of object categories; and mAP is the average of AP across all classes.
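To make the AP integral concrete, the short sketch below (ours, not from the paper) computes AP for a single class from confidence-sorted detections by accumulating TP/FP counts and numerically integrating the precision–recall curve. COCO-style evaluators additionally apply a 101-point interpolation to a monotone precision envelope, which is omitted here for brevity.

```python
import numpy as np

def average_precision(tp_flags: np.ndarray, num_gt: int) -> float:
    """AP for one class; tp_flags marks confidence-sorted detections as TP (1) or FP (0)."""
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1 - tp_flags)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # area under the precision-recall curve: AP = integral of P(r) dr (trapezoid rule)
    return float(np.trapz(precision, recall))

# toy usage: 5 detections sorted by confidence, 4 ground-truth objects
print(average_precision(np.array([1, 1, 0, 1, 0], dtype=float), num_gt=4))
```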
FLOPs quantify the computational resources required by the model at the inference stage; the smaller the FLOPs, the higher the computational efficiency. The parameter count reflects the storage requirements and size of the model, which is particularly critical for edge-device deployment.
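For reference, both complexity metrics can be measured with a profiler such as thop; the snippet below is an illustrative sketch, since the paper does not state which tool was used, and the MACs-to-FLOPs convention varies between reports.

```python
import torch
from thop import profile  # pip install thop

def complexity(model: torch.nn.Module, imgsz: int = 640):
    dummy = torch.randn(1, 3, imgsz, imgsz)
    macs, params = profile(model, inputs=(dummy,), verbose=False)
    # thop counts multiply-accumulate operations; FLOPs are often reported as 2 * MACs
    return 2 * macs / 1e9, params / 1e6  # GFLOPs, parameters in millions
```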
4.4. Experiment Results and Analysis
4.4.1. Results of the Improved Models
The experimental results are presented in
Table 2. The YOLO-SA model achieves 0.410 on the mAP(mask)@0.5 metric, 2.2% higher than the baseline's 0.401. It also reaches 0.208 under the stricter mAP(mask)@0.5:0.95 criterion, 2.0% higher than the baseline's 0.204. The YOLO-SD model likewise shows competitive performance, reaching 0.406 on mAP(mask)@0.5, 1.2% higher than the baseline and slightly lower than YOLO-SA. On the comprehensive mAP(mask)@0.5:0.95 metric, however, YOLO-SD matches YOLO-SA at 0.208, indicating that both models deliver stable detection and segmentation across the IoU threshold range. The floating-point operations (FLOPs) of YOLO-SA are 142.5 G, only 0.4% above the baseline's 142.0 G; this small increase in computational overhead is acceptable. Its parameter count remains 27.6 M, essentially identical to the baseline, showing that introducing the SimAM attention mechanism does not significantly increase storage requirements. In contrast, YOLO-SD performs better in efficiency: its FLOPs drop to 140.5 G, a 1.1% decrease relative to the baseline, and its parameters drop to 26.0 M, a 5.8% decrease. This shows that the YOLO-SD design not only maintains detection accuracy but also effectively reduces computational complexity and storage requirements, offering better feasibility for practical deployment.
Considering both accuracy and efficiency, the two improved schemes each outperform the baseline YOLO11l-seg. YOLO-SA balances accuracy and speed and obtains a stable gain without major modification of the model structure, whereas YOLO-SD is better suited to lightweight deployment: although its accuracy gain is smaller, it achieves worthwhile operator-level savings. Combining the two requires further optimization; there is still considerable room for improvement by adding feature alignment and designing the fusion mechanism more carefully.
4.4.2. Comparison with Other Models
To further validate the effectiveness of the proposed method, this study compares the improved YOLO11 model with state-of-the-art instance segmentation algorithms, including YOLOv8l, YOLACT, RTMDet-ins-tiny, Mask R-CNN, Cascade Mask R-CNN, and the original YOLO11 baseline. The evaluation metrics include mAP@0.5, inference time, FPS, FLOPs, and model parameters, with detailed results presented in
Table 3 and
Figure 6.
The comparative results reveal that YOLO-SA achieves the highest segmentation accuracy among all tested models, with a mean average precision of 0.410, an improvement of approximately 2.24% over YOLO11l. YOLO-SD also performs favorably, attaining a mean average precision of 0.406, which is 1.25% higher than the baseline. In terms of inference speed, YOLO-SA reduces latency from 28.6 ms to 24.5 ms, a 14.3% gain, while YOLO-SD achieves a latency of 26.3 ms, corresponding to an 8% acceleration. Notably, YOLO-SD has the smallest parameter count and the lowest computational cost of all models. Compared to YOLOv8l, YOLO-SA improves accuracy by 1.74%, reduces parameters by over 40%, and lowers FLOPs by nearly 36%. Although the segmentation accuracy of YOLO-SD is slightly lower than that of YOLO-SA, it still exceeds that of YOLOv8l. Importantly, compared with the 45.94 M parameters of YOLOv8l, YOLO-SD requires only 26.0 M parameters and far less computation (140.5 GFLOPs versus 220.7 GFLOPs). These results highlight the advantages of YOLO-SD in both efficiency and accuracy, making it a more balanced and practical model for real-time deployment. Among models designed for real-time scenarios, YOLACT and RTMDet-ins-tiny achieve mean average precisions of only 0.229 and 0.368, respectively, which is significantly lower than the proposed models. In contrast, while Mask R-CNN and Cascade Mask R-CNN attain higher accuracies of 0.454 and 0.490, this comes with a substantial increase in model size and computational cost: their inference times are markedly longer (53.4 ms and 69.9 ms), and their parameter counts exceed 44 M and 77 M, respectively.
To provide a more intuitive performance comparison, line plots are visualized for key evaluation metrics including accuracy (mAP), inference latency, FPS, and parameter count. These visualizations clearly demonstrate that YOLO-SA and YOLO-SD consistently achieve strong segmentation accuracy while maintaining low latency, high throughput, and compact model size. Compared to mainstream real-time and high-accuracy models, our proposed methods offer a favorable trade-off between precision and efficiency, making them well-suited for practical deployment.
Figure 7 shows the instance segmentation results of multiple models on complex urban street scenes from the Cityscapes dataset. To highlight the performance differences more clearly, yellow arrows mark common error types such as instance fragmentation and missed detection. In the predictions of YOLACT, RTMDet-ins-tiny, and Mask R-CNN, some objects (such as vehicles or pedestrians) are wrongly split into multiple discontinuous masks, showing that these models struggle to maintain instance integrity when handling occluded or dense targets. In the results of YOLOv8 and YOLO11, redundant detection boxes appear in the regions indicated by the yellow arrows: the models sometimes generate two overlapping boxes for the same object, which not only hurts interpretability but also reduces the efficiency of subsequent processing.
In contrast, YOLO-SA and YOLO-SD show more stable segmentation performance. They produce a single coherent mask for each object, avoid duplicate boxes, and maintain high accuracy, especially in crowded and occluded scenes.
4.4.3. Ablation Experiment
To verify the effectiveness of individual components and their synergistic performance, we conducted ablation experiments by incrementally incorporating SimAM, DySample, and SPDConv into the baseline model. The resulting performance metrics (based on the baseline YOLO11l) are summarized in
Table 4.
As presented in
Table 4, integrating SimAM alone elevates the mask mAP@0.5 to 0.410, demonstrating that this module enhances the model’s ability to focus on critical features and precisely localize object boundaries. Adding DySample independently yields a mask mAP@0.5 of 0.405, indicating its utility in improving segmentation accuracy with minimal additional computational overhead. In contrast, introducing SPDConv alone leads to a noticeable drop in mask mAP@0.5 to 0.365, suggesting that SPDConv may compromise the model’s capacity to capture local edge information during feature extraction, resulting in accuracy degradation.
When combining DySample and SPDConv (YOLO11 + DS + SPD), the mask mAP@0.5 rebounds to 0.406, with parameters reduced to 26.0 M (from 27.5 M) and FLOPs slightly lowered to 140.5 G, confirming that their joint application achieves a degree of lightweight optimization without sacrificing accuracy. However, integrating all three modules (SimAM, DySample, and SPDConv) does not yield additive performance gains; instead, the mask mAP@0.5 drops to 0.405, indicating suboptimal synergy between the components. This suggests that further refinement of the fusion strategy is required to leverage the strengths of each module effectively.