YOLOv8-seg is available at several scales, including YOLOv8n-seg, YOLOv8s-seg and YOLOv8m-seg. As the scale grows, the parameter count and the inference accuracy increase, but so does the inference time, which makes the larger models poorly suited to deployment at the edge. Therefore, in this study, the model structure was optimized on the basis of the YOLOv8n-seg model as follows:
  2.3.1. MSDA-CBAM
Woo et al. proposed the Convolutional Block Attention Module (CBAM) in 2018, which focuses the network's learning on important features along the channel dimension and the spatial dimension [24]. In this study, the weights of important features are increased before the features are passed deeper into the network, so that the deeper layers can learn and classify more accurately. In particular, CBAM is a lightweight, general-purpose module that can be seamlessly integrated into any CNN architecture with negligible additional overhead and can be trained end-to-end alongside the base CNN. CBAM is constructed by combining a channel attention mechanism with a spatial attention mechanism, and its structure is shown in Figure 3.
The channel attention mechanism applies global average pooling and global max pooling to the input feature layer, passes both pooled results through a shared fully connected layer and adds them, and finally obtains the channel attention map with the Sigmoid activation function. These operations generate a weight for each channel, which highlights the important feature channels.
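The channel attention steps described above can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in this study: the reduction ratio `r`, the ReLU in the hidden layer, the tensor sizes and the random weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """CBAM-style channel attention for a single feature map.

    feat: (C, H, W); w1: (C // r, C) and w2: (C, C // r) are the weights of
    the shared fully connected layers (assumed two-layer MLP with ReLU).
    Returns one weight per channel, each in (0, 1).
    """
    avg = feat.mean(axis=(1, 2))   # global average pooling -> (C,)
    mx = feat.max(axis=(1, 2))     # global max pooling -> (C,)

    def shared_mlp(v):
        # The same weights process both pooled vectors.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(shared_mlp(avg) + shared_mlp(mx))

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2            # hypothetical sizes; r is the reduction ratio
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

weights = channel_attention(feat, w1, w2)
refined = feat * weights[:, None, None]   # reweight every channel
```

The pooled vectors share one set of weights, so the module adds only the two small matrices `w1` and `w2` to the network.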
The spatial attention mechanism takes the maximum and the average over the channels at each feature point and stacks the two results, then reduces the number of channels to 1 with a convolution, and finally obtains the spatial attention map with the Sigmoid activation function. These operations generate a weight for each spatial position, which highlights the important regions of the feature space.
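A minimal NumPy sketch of the spatial attention steps is given below. It is illustrative only: the loop-based convolution, the zero padding, the feature sizes and the random kernel are assumptions, and the 7 × 7 kernel size follows the convolution named for the spatial attention later in this section.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, kernel):
    """CBAM-style spatial attention for a single feature map.

    feat: (C, H, W); kernel: (2, k, k) with odd k, mapping the two stacked
    maps to one output channel. Returns weights of shape (H, W) in (0, 1).
    """
    # Channel-wise max and mean at every position, stacked to (2, H, W).
    stacked = np.stack([feat.max(axis=0), feat.mean(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, H, W = stacked.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # Single-output-channel convolution over the k x k window.
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 6, 6))            # hypothetical feature map
kernel = rng.standard_normal((2, 7, 7)) * 0.05   # hypothetical 7 x 7 kernel

attn = spatial_attention(feat, kernel)
refined = feat * attn[None, :, :]                # reweight every position
```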
For the task of segmenting tea garden roads in hilly areas, the region to be segmented often lies in the middle of the image, so the spatial attention mechanism can effectively improve the segmentation accuracy of the model.
Multi-Scale Dilated Attention (MSDA), derived from the study of Jiao et al. on Transformers, is an attention mechanism that models local and sparse interactions between image patches within a small range. They found that in the shallow layers of the Vision Transformer (ViT), the attention matrix exhibits locality and sparsity. This shows that in shallow semantic modeling, the patches far from the query patch are largely irrelevant, so the global attention module contains a great deal of redundancy. Building on this finding, they constructed DilateFormer, which stacks MSDA blocks in the lower stages and uses global multi-head self-attention blocks in the higher stages. This design allows the model to exploit locality and sparsity when processing low-level information while modeling long-range dependencies when processing high-level information [25]. The basic structure of the MSDA module is shown in Figure 4.
The MSDA module also adopts a multi-head design: it divides the channels of the feature map among different heads and performs Sliding Window Dilated Attention (SWDA) in each head with a different dilation rate. In this way, semantic information can be aggregated at several scales within the attended receptive field, and the redundancy of the self-attention mechanism can be effectively reduced without complex operations or additional computational cost. The specific operations are as follows.
Each head has its own dilation rate. For dilation rates of 1, 2 and 3, the receptive fields are 3 × 3, 5 × 5 and 7 × 7, respectively.
The slices for each head are taken from the feature map, and SWDA is executed on them to obtain the output of that head. Q, K and V stand for query, key and value and are used to compute the correlation between different positions of the input sequence; each is obtained by a different linear transformation of the embedded vectors of the input sequence. The query and key are used to compute the similarity score of each position with the others, and the resulting weight matrix is normalized by SoftMax. The value represents the information of the input sequence and is multiplied by the weight matrix to obtain the weighted output sequence.
The outputs of all the heads are concatenated, and the features are then aggregated through a linear layer.
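The three steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: the function names `receptive_field`, `swda` and `msda` are hypothetical, a 3 × 3 window with zero padding at the borders is assumed, and the projection matrices are random.

```python
import numpy as np

def receptive_field(kernel, dilation):
    # A k x k window with dilation r spans k + (k - 1)(r - 1) positions.
    return kernel + (kernel - 1) * (dilation - 1)

def swda(q, k, v, dilation, kernel=3):
    """Sliding window dilated attention for one head; q, k, v: (d, H, W)."""
    d, H, W = q.shape
    pad = dilation * (kernel // 2)
    kp = np.pad(k, ((0, 0), (pad, pad), (pad, pad)))   # assumed zero padding
    vp = np.pad(v, ((0, 0), (pad, pad), (pad, pad)))
    offs = [dilation * o for o in range(-(kernel // 2), kernel // 2 + 1)]
    # Gather the kernel * kernel dilated neighbours of every position.
    keys = np.stack([kp[:, pad + dy:pad + dy + H, pad + dx:pad + dx + W]
                     for dy in offs for dx in offs])   # (k*k, d, H, W)
    vals = np.stack([vp[:, pad + dy:pad + dy + H, pad + dx:pad + dx + W]
                     for dy in offs for dx in offs])
    scores = (keys * q[None]).sum(axis=1) / np.sqrt(d)  # (k*k, H, W)
    scores -= scores.max(axis=0, keepdims=True)         # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=0, keepdims=True)
    return (attn[:, None] * vals).sum(axis=0)           # (d, H, W)

def msda(feat, wq, wk, wv, wo, dilations=(1, 2, 3)):
    """feat: (C, H, W); wq, wk, wv, wo: (C, C); C divisible by the head count."""
    C, H, W = feat.shape
    flat = feat.reshape(C, -1)
    q, k, v = ((w @ flat).reshape(C, H, W) for w in (wq, wk, wv))
    d = C // len(dilations)
    heads = [swda(q[i * d:(i + 1) * d], k[i * d:(i + 1) * d],
                  v[i * d:(i + 1) * d], r)
             for i, r in enumerate(dilations)]           # one dilation per head
    merged = np.concatenate(heads, axis=0)               # join the heads
    return (wo @ merged.reshape(C, -1)).reshape(C, H, W) # linear aggregation

rng = np.random.default_rng(0)
C, H, W = 6, 5, 5                                        # hypothetical sizes
feat = rng.standard_normal((C, H, W))
wq, wk, wv, wo = (rng.standard_normal((C, C)) * 0.1 for _ in range(4))
out = msda(feat, wq, wk, wv, wo)
fields = [receptive_field(3, r) for r in (1, 2, 3)]
```

Each query attends to only nine keys regardless of the dilation rate, so the cost per position stays constant while the receptive field grows from 3 × 3 to 7 × 7.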
Because this study uses a small, depth-scaled variant of the model, the network is relatively shallow, so MSDA can be applied to achieve efficient feature extraction.
This study replaces the channel attention module in CBAM with the MSDA module to produce the MSDA-CBAM. The MSDA-CBAM combines the advantages of CBAM and MSDA to provide multi-scale channel attention as well as spatial attention, and it is also a combination of CNN and Transformer. Finally, the improved attention module is added to the SPPF pooling layer of the backbone to improve the feature extraction capability of the backbone network. The structure of the improved attention module is shown in Figure 5.
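For MSDA-CBAM, if the input feature matrix is $F$, then, following the CBAM formulation with the channel attention replaced by MSDA, the output matrix $F''$ satisfies the following:

$$
\begin{aligned}
F' &= M_{\mathrm{MSDA}}(F) \otimes F, \\
F'' &= M_{s}(F') \otimes F', \\
M_{s}(F') &= \sigma\left(f^{7\times 7}\left(\mathrm{Concat}\left[\mathrm{AvgPool}(F'),\ \mathrm{MaxPool}(F')\right]\right)\right), \\
M_{\mathrm{MSDA}}(F) &= \mathrm{Linear}\left(\mathrm{Concat}\left[h_1, \ldots, h_n\right]\right), \\
h_i &= \mathrm{SWDA}(Q_i, K_i, V_i, r_i), \quad 1 \le i \le n, \\
Q &= F W_Q, \quad K = F W_K, \quad V = F W_V. \qquad (1)
\end{aligned}
$$

As shown in Formula (1), $M_{s}$ is the weight transformation matrix of spatial attention and $M_{\mathrm{MSDA}}$ is the weight transformation matrix of the MSDA mechanism. $W_Q$, $W_K$ and $W_V$ are the linear transformation matrices obtained after model training. $\otimes$ denotes element-wise multiplication. $\sigma$ is the activation function, $f^{7\times 7}$ is a 7 × 7 convolution, $\mathrm{AvgPool}$ is the average pooling operation, $\mathrm{MaxPool}$ is the maximum pooling operation, $\mathrm{Linear}$ is a linear transformation and $\mathrm{Concat}$ is the concatenation operation. SWDA is the sliding window dilated attention operator from the Transformer. $n$ means that the feature map is sliced $n$ times according to the QKV matrices of the self-attention mechanism, $Q_i$, $K_i$ and $V_i$ are the $i$-th slices, and $r_i$ is the dilation rate of the corresponding head in the $i$-th slice.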
  2.3.2. DR-Neck Feature Extraction Network
The DR-Neck network is proposed by drawing on several well-performing feature fusion network structures and combining them with practical application requirements. Structurally, the network gives full weight to deep network features, applies multiple up-sampling steps to enrich the edge information of the segmentation, uses CBAM to correct the feature distortion introduced by the up-sampling operator and adopts a Semantics and Detail Infusion (SDI) multi-level fusion module to achieve efficient cross-scale feature fusion.
SDI, meaning the infusion of semantics and details, is a multi-level fusion module derived from the UNet v2 network [26]. SDI strengthens feature extraction through an attention mechanism, preprocesses cross-scale information through up-sampling, down-sampling and other operations, and fuses the features through Hadamard product operations to enhance the semantic and detailed information in images.
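The resize-then-multiply core of this fusion can be sketched in NumPy as below. This is a deliberately reduced illustration, not the SDI module itself: the attention step and any convolutions are omitted, all levels are assumed to share the same channel count, and the helper names `resize` and `sdi_fuse`, the average-pool down-sampling and the nearest-neighbour up-sampling are assumptions made for the example.

```python
import numpy as np

def resize(feat, size):
    """Integer-factor resize of a square (C, H, H) map to (C, size, size)."""
    C, H, _ = feat.shape
    if H == size:
        return feat
    if H > size:
        f = H // size                     # down-sample by average pooling
        return feat.reshape(C, size, f, size, f).mean(axis=(2, 4))
    f = size // H                         # up-sample by nearest neighbour
    return feat.repeat(f, axis=1).repeat(f, axis=2)

def sdi_fuse(features, target_index):
    """Bring every level to the target resolution, then multiply element-wise."""
    size = features[target_index].shape[1]
    fused = np.ones_like(features[target_index])
    for feat in features:
        fused = fused * resize(feat, size)   # Hadamard product
    return fused

# Three hypothetical levels with 4 channels at resolutions 8, 4 and 2.
levels = [np.full((4, 8, 8), 2.0), np.full((4, 4, 4), 3.0), np.full((4, 2, 2), 5.0)]
fused = sdi_fuse(levels, target_index=1)
```

Because the product is taken element-wise, a position is emphasized only when the detailed (high-resolution) and semantic (low-resolution) levels agree on it.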
The attention mechanism is added to YOLOv8-seg, and the feature fusion network is replaced with DR-Neck, thus forming the DR-YOLO instance segmentation algorithm. The network structure is shown in Figure 6. Lines of different colors indicate feature paths that neither intersect nor fuse.
The basic idea of DR-Neck is as follows. First, the feature information of C2f_1 and C2f_2, and of C2f_2 and C2f_3, is fused across scales through the SDI module. Next, the feature information of C2f_4 is up-sampled twice and then also fused across scales through the SDI module. The CBS_1 and CBS_2 convolution modules are both 3 × 3 convolutions; they keep the number of feature channels unchanged and down-sample the feature information. Up-sampling raises the resolution of the feature map and thereby improves the precision of the segmentation edges, but the interpolation it relies on causes feature loss. Therefore, this study adopts CBAM to refine the feature information after up-sampling. As a CNN-based attention mechanism, CBAM has the advantages of being lightweight and easily portable. DR-Neck improves the segmentation of large targets by repeatedly reusing the feature information generated by C2f_4.
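The resolution bookkeeping behind these down-sampling and up-sampling steps can be checked with a few lines of Python. The sketch rests on assumptions that are not stated in the text: a 640 × 640 input, stride 2 and padding 1 for the 3 × 3 CBS convolutions, a factor of 2 per up-sampling step, and the usual YOLOv8 backbone strides of 4, 8, 16 and 32 for C2f_1 to C2f_4.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output side length of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def upsample_out(size, factor=2):
    """Output side length of an interpolation-based up-sampling step."""
    return size * factor

# Assumed backbone strides for the four C2f outputs.
strides = {"C2f_1": 4, "C2f_2": 8, "C2f_3": 16, "C2f_4": 32}
input_size = 640                                   # assumed input resolution
sizes = {name: input_size // s for name, s in strides.items()}

down_from_c2f_1 = conv_out(sizes["C2f_1"])         # 3 x 3, stride-2 convolution
up_once = upsample_out(sizes["C2f_4"])             # one up-sampling step
up_twice = upsample_out(up_once)                   # two up-sampling steps

print(sizes)
print(down_from_c2f_1, up_once, up_twice)
```

Under these assumptions, a stride-2 convolution halves the side length and each up-sampling step doubles it, which is what allows features from different levels to be brought to a common resolution before fusion.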
The original YOLOv8-seg feature fusion network does not directly use the feature information extracted by the C2f_1 module, yet the feature map of this module has a higher resolution and clearer edges, which can be used to improve the accuracy of the instance segmentation edges. Therefore, DR-Neck additionally draws on the output of the C2f_1 module.
After model structure optimization, the DR-YOLO model is obtained, and its network structure is shown in Table 2.