YOLO-Sp: A Novel Transformer-Based Deep Learning Model for Achnatherum splendens Detection

Abstract: The growth of Achnatherum splendens (A. splendens) inhibits the growth of dominant grassland herbaceous species, resulting in a loss of grassland biomass and a worsening of the grassland ecological environment. Therefore, it is crucial to monitor the dynamic development of A. splendens adequately. This study set out to offer a transformer-based A. splendens detection model named YOLO-Sp, trained on ground-based visible-spectrum proximal sensing images. YOLO-Sp achieved 98.4% and 95.4% AP values in object detection and image segmentation for A. splendens, respectively, outperforming previous SOTA algorithms. The research indicated that the Transformer had great potential for monitoring A. splendens. Under identical training settings, the AP value of YOLO-Sp was greater than that of YOLOv5 by more than 5%. The model's average accuracy was 98.6% in trials conducted at genuine test sites. The experiments revealed that factors such as the amount of light, the degree of grass growth, and the camera resolution affected detection accuracy. This study could contribute to monitoring and assessing grass plant biomass in grasslands.


Introduction
Typical grasslands are indispensable for their global ecological, economic, and social values [1]. In a healthy grassland ecosystem, all types of livestock can flourish, producing high-quality meat and milk. As the largest green vegetation and biological resource, natural grassland covers about 400 million hm² of land in China [2]. However, the ecology of grassland in China is facing challenges. For instance, knowledge about grassland resource protection is lacking, a vicious cycle of destruction and degradation of grassland is ongoing, and livestock production efficiency needs to be improved [3,4]. In typical Inner Mongolia grasslands, A. splendens grows in large quantities [5]. The stress resistance of A. splendens to drought, cold, alkalinity, and salt is exceptional [6]. Due to its robust root system, A. splendens has an advantage in absorbing water. Nevertheless, A. splendens is unsuitable for animals as a primary food source because of its low nutritional content and high leaf fiber content [7]. Jiang et al. discovered that A. splendens could affect soil hydrological parameters and the dynamics of soil water and salt [8]. The water absorption mechanism of the A. splendens root system facilitates the passage of salt ions and leads to salt buildup. Yang et al. [9] also found that A. splendens can alter soil microbiological properties and harm plant production. This results in excessive growth of A. splendens at the expense of other forage grasses. In other words, the proliferation of A. splendens could potentially influence the development of dominant forage grasses in grasslands, eventually leading to a decrease in grassland biomass and deterioration of grasslands. However, if the growth of A. splendens is effectively understood and controlled, its positive effects will become more prominent. A. splendens can serve as emergency feed and as a guide to water sources for cattle searching in pastoral grasslands [10], and it has become an indicator of watershed climate change, human activity, and the biological environment of grasslands [6]. Therefore, effectively quantifying the biomass of A. splendens and monitoring its dynamic changes is very meaningful to studying grassland degradation.
Using UAV remote sensing to acquire spectral images and laser point cloud data has been proven effective and efficient in detecting and estimating vegetation height and coverage [11][12][13]. Some researchers have also conducted studies on assessing grass plant biomass based on remote sensing data. Guo Y et al. investigated the capacity of hyperspectral measures to quantify plant invader coverage and the impact of senescent plant coverage [14]. In addition to using satellite remote sensing imagery data directly, vegetation indices such as NDVI and EVI produced from MODIS data are commonly used to estimate vegetation coverage. Zha Y et al. used satellite remote sensing data to systematically monitor grassland cover changes near Qinghai Lake in Western China and derived the NDVI to quantify grassland cover variations between 1987 and 2000 [15]. Converse RL et al. evaluated the grassland cover of the Sevilleta National Wildlife Refuge in 2009, 2014, and 2019 using satellite remote sensing data and confirmed that multi-endmember spectral mixture analysis could be utilized effectively to monitor semiarid grassland and shrub systems in New Mexico [16]. Although UAV and satellite remote sensing can measure vegetation height and coverage quickly and efficiently, shortcomings still exist. For instance, aerial remote sensing is restricted by weather conditions: excessive cloud coverage can abort a survey mission. For satellite data, clouds can block the reflectance received by the sensor from the ground, while clouds can introduce noise and color variance into UAV data [15]. Especially for UAV remote sensing, precipitation and wind can also affect the mission. Moreover, the sensors carried by UAVs, such as multi-line laser radar, multispectral, and hyperspectral sensors, are prohibitively expensive for grassland management and research, limiting their utility [17,18].
In addition, it is difficult to statistically detect changes in overall biomass and partial grassland cover under the same conditions, such as atmospheric and seasonal conditions, because two sets of satellite images must be employed [15]. Although regular RGB cameras can be utilized in some situations, researchers have yet to use them to produce good detection results for detecting and estimating the height of grass growing on grasslands, including A. splendens. Typical grasslands often have a temperate, semiarid continental climate with frequent periods of severe wind [19], which makes UAV surveys very challenging. In addition, assessing grassland in a particular area generally involves delineating individual sample points for estimating the height and cover of grass. Thus, ground proximal sensing-based detection is preferable for this investigation.
With image pre-processing and data fusion, RGB imagery data can be used to derive biomass information [20][21][22][23]. Therefore, delineating the outer contours of A. splendens can also prepare for subsequent biomass estimation. Gebhardt S et al. detected the broad-leaved dock (Rumex obtusifolius L., R.o.), a weed in European grasslands, by converting RGB images into grayscale images and segmenting them [24]. In their experiments, the average R.o. detection rate ranged from 71% to 95% for 108 images containing more than 3600 objects. Petrich L et al. proposed a method based on UAV visible-light images to locate and detect the poisonous Colchicum autumnale on grassland [25]. This method relied on a convolutional neural network to find flowers and achieved an accuracy rate of 88.6% in test experiments. Wang L et al. used four semantic segmentation algorithms to detect woody plants that invaded grassland ecosystems in images collected by UAVs [26]. Their research showed that the ResNet algorithm had the highest comprehensive accuracy and that the segmentation performance for eastern redcedar (ERC) decreased as resolution decreased. In Gallmann J et al.'s study, drone-based images were used to effectively detect different flowering plants in a meadow, with experimental precision and recall close to 90% using the Faster R-CNN algorithm [27]. The above research indicates that detecting A. splendens using visible-light images and deep learning is practical.
Deep learning has become a prevalent approach in image processing tasks, and convolutional neural networks (CNNs), with their good robustness, are widely used in deep learning algorithms [28][29][30][31][32]. Nonetheless, when the Transformer idea was introduced to the realm of computer vision, there was much expectation about its development, and its applicability has proven comparable to that of CNN models. In 2020, the Facebook AI-proposed DETR (DEtection TRansformer) system established a new paradigm for object detection [33]. This approach performs end-to-end detection without requiring non-maximum suppression (NMS) post-processing. Several specialists and academics have explored the possibilities of the Transformer in vision and offered novel detection techniques. Afterwards, the transformer-based Swin Transformer network obtained the top performance (state of the art, SOTA) on various vision benchmarks [34]. Several specialists have started incorporating the Transformer model into deep learning algorithms to accomplish detection. Lin et al. [35] conducted rapid and accurate monitoring of the emergence rate of peanut seedlings by integrating the Transformer and CSNet models. Wang Dandan and He Dongjian [36] counted apple instances against a complicated backdrop by proposing an attention method. Olenskyj AG et al. [37] compared object detection, CNN regression, and Transformer models, demonstrating that the Transformer can provide accurate grape production estimates. These professionals and academics have used the Transformer for visual inspection tasks in the industrial and agricultural sectors. However, there are few related studies on the application of the Transformer in detecting grassland plants, especially A. splendens. Consequently, this research investigated whether the Transformer model is superior to the CNN for detecting images of A. splendens.
The flexibility of the tracked robot's movement and the ability to acquire target information from multiple angles align with this paper's research. This study equipped the crawler robot with a camera to achieve ground detection. To sum up, this study aims to realize the ground monitoring of A. splendens and propose an effective detection method based on deep learning. This study proposed a new deep learning detection method for A. splendens based on improving the backbone, neck, and head parts of the traditional YOLOv5. In addition, this study verified whether the Transformer model was superior to the CNN model in grassland vegetation detection through the detection and segmentation of A. splendens.

Materials
The trial location was situated in Baiyinxile Ranch, Xilinhot City, Xilin Gol League, Inner Mongolia, China, at 116°42′ E longitude and 43°37′ N latitude (Figure 1). This region had a semiarid grassland climate. Roughly 80% of the yearly precipitation in this region occurred between June and September, amounting to about 350 mm of precipitation annually. During this period, the high-temperature season created the warm, humid circumstances favorable for plant development. The development of A. splendens in the experimental region is shown in Figure 2. In the testing stage, the phenological phase of A. splendens was the flowering and fruiting stage. At this stage, A. splendens had a high vertical growth ability, and its height was usually above 150 cm. Its leaves were linear or tubular, hard in texture, long, and thin. The leaves were dark green with a glossy surface. Its stalks grew prostrate or erect, usually light green, with fine hairs on the surface.

Data Collection
In August 2022, data were gathered in the experimental area. As a result of the grassland's challenging topography, the crawler chassis was chosen as the mobile robot carrier (Figure 3a). STM32 (STMicroelectronics, Geneva, Switzerland) was responsible for the mobile robot's basic control; it transmitted signals to the motor to drive the crawler wheels. The motherboard used NVIDIA Jetson TX1 (NVIDIA Corp., Santa Clara, CA, USA) to acquire inertial measurement unit (IMU), odometer information, and different sensor data sent by STM32. In this research, NVIDIA Jetson TX1 primarily acquired picture data sent by the camera, processed and evaluated the image, and then calculated the grass height of A. splendens in this region. The camera used in this work was the MYNT AI D100-50 (Slightech, Wuxi, China; Figure 3b), a binocular stereo-depth camera capable of obtaining color depth information. The camera's product parameters are listed in Table 1. The camera can obtain clear images, which is conducive to subsequent detection tasks, and then input depth information to the main control terminal of the mobile robot. The mobile robot captured one thousand photographs in the testing area. Nine test sites of different coverage in the experimental area were selected for the experiment to evaluate the proposed detection models.
In the test experiment, the camera angle was changed ten times at the same test site to obtain the data, and each set of data was repeated ten times for the experiment. At each test site, we collected images from different orientations. A total of 4500 images were contained in all test sites, and the ratio of the training set to the verification set in the data set was 7:3. Finally, after image enhancement, the training and verification set images were increased to 5000.
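The 7:3 train/validation split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline; the file names and the fixed seed are placeholders.

```python
import random

def split_dataset(image_paths, train_ratio=0.7, seed=42):
    """Shuffle and split image paths into training and validation sets (7:3)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle for reproducibility
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]

# Placeholder file names standing in for the 4500 collected images.
images = [f"img_{i:04d}.jpg" for i in range(4500)]
train_set, val_set = split_dataset(images)
print(len(train_set), len(val_set))  # 3150 1350
```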

The Research Method for the Detection of A. splendens

Model Preparation
As shown in Figure 4, the Transformer encoder block consisted of Embedded Patches, LayerNorm, Dropout, Multi-Head Attention, and MLP. Each Transformer encoder block contained two sublayers. The first sublayer was a Multi-Head Attention layer, and the second sublayer (MLP) was a fully connected layer. A residual dropout connection was used between each sublayer. The Transformer encoder block added the ability to capture different local information. It could also utilize the self-attention mechanism for mining feature representation potential.
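The two-sublayer pattern described above (attention plus residual, then MLP plus residual) can be sketched framework-free. This is a simplified, single-head illustration with identity Q/K/V projections; LayerNorm, multi-head splitting, dropout, and the MLP weights are omitted, so it shows the data flow rather than the paper's actual block.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    k_t = [list(r) for r in zip(*tokens)]                      # K^T
    scores = matmul(tokens, k_t)                               # Q . K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, tokens)                             # weights . V

def encoder_block(tokens):
    """Sublayer 1: attention + residual connection (MLP sublayer elided)."""
    attended = self_attention(tokens)
    return [[t + a for t, a in zip(tr, ar)] for tr, ar in zip(tokens, attended)]

out = encoder_block([[1.0, 0.0], [0.0, 1.0]])
print(len(out), len(out[0]))  # 2 2
```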

After optimizing and improving the Transformer encoder block, the Swin block achieved the SOTA effect in the image segmentation and detection field. The overall flow of the module is shown in Figure 5. The module first performed LayerNorm on the feature map. It determined whether the feature map needed to be shifted through the shift_size parameter and divided the map into windows. The module calculated the attention and used a mask to distinguish window attention from shifted window attention, which limited the content that could be seen at each position in the attention. After merging the windows, the previous shift operation was restored by a reverse shift. Finally, the module process was completed through the LayerNorm, the fully connected layer, and the residual dropout connection.
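The shift, window-partition, and reverse-shift bookkeeping described above can be illustrated on a toy 4×4 feature map. This is a sketch of the index manipulation only (the attention computation and mask are omitted, and `cyclic_shift` plays the role that `torch.roll` plays in the actual Swin implementation).

```python
def cyclic_shift(grid, shift):
    """Roll a 2-D map up-left by `shift` rows/cols (cyclic shift before attention)."""
    h = len(grid)
    rows = [grid[(i + shift) % h] for i in range(h)]
    w = len(rows[0])
    return [[row[(j + shift) % w] for j in range(w)] for row in rows]

def window_partition(grid, win):
    """Split an H x W map into non-overlapping win x win windows."""
    h, w = len(grid), len(grid[0])
    windows = []
    for i in range(0, h, win):
        for j in range(0, w, win):
            windows.append([row[j:j + win] for row in grid[i:i + win]])
    return windows

feat = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 feature map
shifted = cyclic_shift(feat, 2)           # shift_size = window_size // 2
windows = window_partition(shifted, 2)    # four 2x2 windows for windowed attention
restored = cyclic_shift(shifted, 2)       # reverse shift restores the original map
print(len(windows))  # 4
```

Attention would be computed inside each window (with a mask separating regions that were not originally adjacent), after which the windows are merged and the reverse shift is applied, as the restored map shows.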

CBAM was a straightforward yet efficient attention module (Figure 6). It was a lightweight, plug-and-play module that could be trained end-to-end and was compatible with CNN architectures. Given a feature map, CBAM sequentially inferred an attention map along two independent dimensions of channel and space and then multiplied the attention map with the input feature map to conduct adaptive feature refinement.
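The channel-attention half of the refinement described above can be sketched as follows. This is a strongly simplified, framework-free illustration: the shared MLP is omitted (pooled statistics are summed directly), and the subsequent spatial-attention stage of CBAM is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_map):
    """One weight per channel from avg- and max-pooled statistics (MLP elided)."""
    weights = []
    for channel in feature_map:
        flat = [v for row in channel for v in row]
        avg_pool = sum(flat) / len(flat)
        max_pool = max(flat)
        weights.append(sigmoid(avg_pool + max_pool))
    return weights

def apply_channel_attention(feature_map):
    """Multiply each channel by its attention weight (adaptive refinement)."""
    w = channel_attention(feature_map)
    return [[[v * w[c] for v in row] for row in feature_map[c]]
            for c in range(len(feature_map))]

# Two 2x2 channels; the stronger channel retains more of its signal.
fmap = [[[1.0, 1.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.0, 0.1]]]
refined = apply_channel_attention(fmap)
```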

The Proposal of the YOLO-Sp Model in This Study
This study proposed a new deep-learning model called the YOLO-Sp model. By introducing the Swin block, the CBAM module, CIoU Loss, VFL Loss, and the decoupled-head method, this study completed the optimization and adjustment of the classic YOLOv5 model. As seen in Figure 7, this study made network adjustments in the backbone, neck, and head parts of YOLOv5. Some convolution blocks and CSP bottleneck blocks in YOLOv5 were replaced with Transformer encoder blocks, which could employ the self-attention mechanism to exploit the potential of feature representation and enhance the capacity to gather diverse local information. The Swin block in the backbone could therefore accommodate the size-adaptive output of SPP. In the neck section, the network also implemented the Transformer concept. The CBAM module successively inferred the attention map in two independent dimensions (channel and space), which was multiplied with the input feature map for adaptive feature optimization. The connection between the Swin block and the CBAM module could efficiently convey robust semantic features through the attention mechanism during upsampling and downsampling. The coupled head describes the head of YOLOv5. Referring to the head concept of YOLOX, in this study, the coupled head was replaced with the decoupled head to increase convergence speed, while the number of channels of the regression head was altered as necessary.
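The difference between the coupled and decoupled heads can be made concrete by comparing their output channel layouts. This is a channel-count sketch under common YOLOv5/YOLOX conventions; the anchor counts and the exact YOLO-Sp head configuration are assumptions, not taken from the paper.

```python
def coupled_head_channels(num_classes, num_anchors=3):
    """YOLOv5-style coupled head: one conv jointly predicts box, objectness, cls."""
    return num_anchors * (5 + num_classes)  # 4 box coords + 1 objectness + classes

def decoupled_head_channels(num_classes, num_anchors=1):
    """YOLOX-style decoupled head: separate classification and regression branches."""
    cls_channels = num_anchors * num_classes  # classification branch
    reg_channels = num_anchors * 4            # box regression branch
    obj_channels = num_anchors * 1            # objectness branch
    return cls_channels, reg_channels, obj_channels

# Single-class detection, as in A. splendens detection.
print(coupled_head_channels(1))    # 18
print(decoupled_head_channels(1))  # (1, 4, 1)
```

Splitting classification from regression lets each branch specialize, which is the source of the faster convergence mentioned above.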

CIoU Loss was used as the regression loss of the designed model. CIoU took the aspect ratio of the bounding box into the loss function, further improving the regression accuracy. The penalty term of CIoU added an impact factor, αv, to the penalty term of DIoU. This factor considered how well the aspect ratio of the predicted box fitted the aspect ratio of the ground-truth box. The penalty term is described in Equation (1), where α is a parameter used for trade-off and v is a parameter used to measure the consistency of the aspect ratio.
Their expressions were described in Equations (2) and (3), respectively. CIoU was calculated by Equation (5).
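The equation numbers above refer to formulas not reproduced in this excerpt; the standard CIoU formulation (Zheng et al.), which the surrounding definitions appear to follow, is:

```latex
% Penalty term (Equation (1)): distance term plus aspect-ratio term
\mathcal{R}_{CIoU} = \frac{\rho^{2}\!\left(b, b^{gt}\right)}{c^{2}} + \alpha v

% Aspect-ratio consistency and trade-off parameter (Equations (2) and (3))
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2},
\qquad
\alpha = \frac{v}{(1 - IoU) + v}

% CIoU and the corresponding regression loss
CIoU = IoU - \frac{\rho^{2}\!\left(b, b^{gt}\right)}{c^{2}} - \alpha v,
\qquad
\mathcal{L}_{CIoU} = 1 - CIoU
```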
where b and b^gt represent the center points of the predicted box and the ground-truth box, respectively; α is the parameter used for trade-off; v is a parameter used to measure the consistency of the aspect ratio; ρ stands for the Euclidean distance between the two center points; and c represents the diagonal distance of the smallest closure area that can contain both the predicted box and the ground-truth box.

VFL Loss was used as the classification loss, calculated by Equation (6). The main improvement of VFL was to propose an asymmetric weighting operation. For positive samples, q was the IoU of the predicted bounding box and the ground truth, and for negative samples, q = 0. For a positive sample, FL was not used; instead, ordinary BCE with an adaptive IoU weight was applied to highlight the primary examples. For a negative sample, it remained the standard FL. It could be found that VFL is simpler than QFL, and its main features were the asymmetric weighting of positive and negative samples and the prominence of positive samples as the primary samples. Thus, this study introduced VFL Loss into the model to calculate the classification loss.
where q is the target label (the IoU for positive samples, 0 for negative samples); p is the predicted probability; α is the weight parameter; and γ is an adjustable focusing parameter.
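Equation (6) is not reproduced in this excerpt; the standard Varifocal Loss formulation (Zhang et al.), which the description above appears to follow, is:

```latex
% Asymmetric weighting: BCE weighted by the target IoU q for positives,
% focal down-weighting for negatives
VFL(p, q) =
\begin{cases}
-q\left(q \log p + (1 - q)\log(1 - p)\right), & q > 0 \\[4pt]
-\alpha\, p^{\gamma} \log(1 - p), & q = 0
\end{cases}
```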

Experimental Evaluation Index
The test set was used to evaluate the performance of the detection model once training was complete. The precision, recall, and F1-score were used as evaluation indicators and are described in Equations (7)-(9). AP and mAP could reflect the average accuracy and effect of the model, which were calculated by Equations (10) and (11).
where n is the number of IoU thresholds, and T_positive, F_positive, and F_negative are true positives (correct detections), false positives (false detections), and false negatives (misses), respectively; A_c represents the area of the smallest box that contains both the predicted box and the actual box.
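Equations (7)-(11) are not reproduced in this excerpt; the standard definitions the text refers to, written with the symbols above, are presumably:

```latex
% Precision and recall (Equations (7) and (8))
Precision = \frac{T_{positive}}{T_{positive} + F_{positive}},
\qquad
Recall = \frac{T_{positive}}{T_{positive} + F_{negative}}

% F1-score (Equation (9)): harmonic mean of precision and recall
F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}

% AP and mAP (Equations (10) and (11)): area under the precision-recall
% curve, averaged over the n IoU thresholds
AP = \int_{0}^{1} P(R)\, dR,
\qquad
mAP = \frac{1}{n}\sum_{i=1}^{n} AP_{i}
```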

Results
The model's epoch count was set to 200 to obtain the ideal model weights, and the optimal weights were updated every ten iterations. After completing the tuning test, this study set the parameters at all levels (learning rate = 0.01, batch size = 8, momentum = 0.94, mosaic = 1.0, warmup momentum = 0.8, weight decay = 0.0005, and warmup epochs = 3.0). Figure 8 depicts the training outcomes of the model's first 50 epochs following parameter optimization and tweaking. The first row in Figure 8 shows the loss during the training phase, and the second shows the loss during the validation phase. The third and fourth rows illustrate the evolution of the indicators during the procedure. The loss decreased abruptly as training approached round 40 and approached zero thereafter. In the validation procedure, the loss converged toward zero after 20 rounds. Before the fifth round, there were significant swings, which stabilized progressively after the tenth round. The other indicators shared the same behavior as the validation loss before five rounds. From val Box and Box, it could be determined that the bounding boxes derived from the model's weights after 20 rounds of training were already rather precise. Precision and recall fluctuated greatly before 100 epochs but converged to 1.0 after 100 epochs. After approximately 10 epochs, mAP-50 converged to 1.0. The model appeared generalizable and stable. As the threshold increased, the mAP value of the model varied strongly before epoch 40 but converged to 1.0 after epoch 40. In conclusion, the experimental training results demonstrated that the model's average accuracy was outstanding. To comprehensively assess the performance of the YOLO-Sp model presented in this study, this study compared and evaluated the mainstream image segmentation models indicated in the Introduction and image segmentation models using a CNN as their backbone (Table 2).
The parameters of each model followed those used to train the YOLO-Sp model in the previous step, and the best weights after iteration were selected for comparative analysis. Since the properties of grassland vegetation in RGB images were difficult to identify, the extracted characteristics of A. splendens were often confused with the surrounding environment. The AP values of BlendMask and similar algorithms did not surpass 90%, though BlendMask fared better than the other CNN algorithms, with an AP value of 86.3%. As demonstrated in Table 2, the AP value of a conventional model after transplanting the Transformer was greatly enhanced. Although Swin increased the model size and inference time, the AP value after transplanting it into the backbone exceeded that of most conventional CNN models. The AP value of the YOLO-Sp-X proposed in this research was 95.4%, giving it the best segmentation performance on A. splendens images. The model with the best overall performance was YOLO-Sp-M, whose AP value reached 95.2% despite its small size. This also indirectly showed that the Swin block was suitable for transplanting into conventional image segmentation models. Additionally, the AP value of Cascade R-CNN with Swin surpassed 90%. However, Cascade R-CNN's model size was enormous, and the model obtained by training with Swin-Small as the backbone was over 1 GB. This study compared Mask R-CNN with Cascade R-CNN and concluded that the performance of conventional networks might be enhanced by transplanting the Swin block. In this study, the addition of the CBAM module as a complement to the attention mechanism did an outstanding job of ensuring compatibility with CNN networks. To report the performance of the different YOLO-Sp models in terms of mean average precision (mAP) at a confidence threshold of 0.5 (Table 3), this study evaluated each model on a test dataset and recorded the precision and recall for each class. The mAP was calculated as the average of the precision-recall curve across all categories.
The results show that the YOLO-Sp models with more extensive backbones (i.e., YOLO-Sp-M, YOLO-Sp-L, and YOLO-Sp-X) achieved higher mAP scores than those with smaller backbones (i.e., YOLO-Sp-N and YOLO-Sp-S). Specifically, YOLO-Sp-X achieved the highest mAP score of 95.4%, followed by YOLO-Sp-M at 95.2%, YOLO-Sp-L at 94.6%, YOLO-Sp-S at 86.7%, and YOLO-Sp-N at 81.4%. Overall, these results suggest that using more extensive backbones can significantly improve the performance of the YOLO-Sp object detection algorithm, especially for detecting small or low-contrast objects. However, it is essential to balance the trade-off between accuracy and computational cost when selecting a model for a specific application. Lastly, the convergence speed of the decoupled head had been enhanced, improving its overall performance. Figure 9 shows the image segmentation effect of YOLO-Sp-M. This study compared YOLO-Sp to the existing popular YOLO series algorithm models in terms of object detection (Table 4). All model parameters correspond to the YOLO-Sp model trained above and, following iteration, the appropriate weights were determined for comparative analysis. The bounding box could detect the location of A. splendens with reasonable accuracy, a task simpler than image segmentation. The YOLO-Sp models outperformed the other YOLO models. YOLO-Sp-X had an AP value of 98.4%, making it the model with the highest AP value. YOLO-Sp-M demonstrated the greatest overall performance, its AP value achieving a superior balance between model size and processing time. Based on YOLOv5, YOLO-Sp was enhanced such that the average AP value of each model was boosted by more than 5%. As demonstrated in Figure 10, the detection effect of YOLO-Sp-X was illustrated by the bounding box surrounding the entire A. splendens.
Finally, the model's detection accuracy was evaluated at real test sites. The best weights produced by YOLO-Sp-X training, which showed good comprehensive performance in the experiments above, were selected for testing. A total of 1800 images were chosen and separated into nine groups for detection. As illustrated in Table 5, results were computed by counting the positive and negative samples in each group. The precision values achieved by the algorithm ranged from 97.5% to 99.5%, with an average precision of 98.6%. This indicates that the algorithm could accurately detect objects in the images with high confidence. However, it is worth noting that precision was not fully consistent, with some test groups achieving higher values than others. To better understand the algorithm's performance, it is also necessary to evaluate its recall and F1-score. Recall measures the ability of the algorithm to detect all positive instances, while the F1-score is the harmonic mean of precision and recall.
The recall values achieved by the algorithm ranged from 97.5% to 99.5%, with an average recall of 98.7%. The F1-scores ranged from 98.2% to 99.5%, with an average F1-score of 98.9%. Overall, these results suggest that the detection algorithm could detect objects in the images with high precision and recall. However, it is essential to note that various factors, such as lighting, image quality, and object orientation, may influence the algorithm's performance. In the nine experimental groups, the accuracy of data gathered under insufficient sunshine or camera backlight was lower than under normal conditions (sufficient light).
Therefore, further testing and analysis may be necessary to fully evaluate the algorithm's performance in real-world scenarios.
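The per-group figures in Table 5 follow the standard definitions of precision, recall, and F1. A small sketch of how they are derived from true-positive (tp), false-positive (fp), and false-negative (fn) counts:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from per-group detection counts.

    tp: correctly detected A. splendens instances
    fp: background regions wrongly flagged as A. splendens
    fn: A. splendens instances the detector missed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a group with 99 true positives, 1 false positive, and 1 false negative yields precision, recall, and F1 of 0.99 each, in line with the ranges reported above.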

Discussion
It can be seen from Figure 8 that both loss and metrics fluctuated greatly in the early training rounds. This may be because the modules introduced in this study needed to adapt to the original network. While these modules are designed to enhance specific aspects of the model's functionality, they also introduce new parameters or alter existing ones, potentially disrupting the balance between the model's components. As a result, the initial stages of training may involve a period of instability as the model adjusts to these changes and attempts to optimize its performance. Early in training, the model may make significant changes to these parameters to find an optimal configuration, resulting in large fluctuations in loss and metrics. As training progresses and the model becomes more familiar with the data, it converges to a more stable configuration with smaller changes between iterations. We believe that compatibility with the CNN network has been greatly enhanced by integrating the CBAM module. Simultaneously, the decoupled head boosts the generalization introduced by the Swin block and increases the convergence speed of the model.
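For readers unfamiliar with CBAM, the module reweights a feature map with channel attention followed by spatial attention. The following is a simplified NumPy sketch of the idea only, not the YOLO-Sp implementation: the real CBAM uses a learned shared MLP for the channel branch and a 7x7 convolution for the spatial branch, and the weight shapes here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """CBAM-style channel attention on a (C, H, W) feature map.

    w1 (C//r, C) and w2 (C, C//r) form the shared bottleneck MLP
    applied to both the average- and max-pooled channel descriptors.
    """
    avg = feat.mean(axis=(1, 2))  # (C,) average-pooled descriptor
    mx = feat.max(axis=(1, 2))    # (C,) max-pooled descriptor
    scale = sigmoid(w2 @ np.maximum(w1 @ avg, 0)
                    + w2 @ np.maximum(w1 @ mx, 0))
    return feat * scale[:, None, None]  # reweight each channel

def spatial_attention(feat, w):
    """Simplified spatial attention: a weighted sum of the channel-wise
    average and max maps, passed through a sigmoid gate (the real module
    uses a 7x7 convolution here)."""
    avg = feat.mean(axis=0)  # (H, W)
    mx = feat.max(axis=0)    # (H, W)
    gate = sigmoid(w[0] * avg + w[1] * mx)
    return feat * gate[None, :, :]
```

Both branches preserve the feature-map shape, which is what makes the module easy to drop into an existing CNN backbone.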
From the multiple experiments conducted in this paper, the YOLO-Sp model produced superior results for A. splendens object detection and image segmentation compared to existing SOTA deep learning models [28][29][30][31][32][33][34][39][40][41]. This demonstrates that the Transformer has greater potential for identifying A. splendens. This advantage may also hold for other plants native to temperate meadow steppe or mild temperate grassland. The diverse field conditions of grasslands necessitate substantial training data for feature extraction. The Transformer adapts effectively to large datasets, and it is evident from this study that the model's performance improves as the data grow. The Transformer gains a deeper understanding of the interactions between the learnt characteristics, making it better suited to grassland environments. The current study investigated the distribution of attention in the model, finding that each attention head may learn its task differently while sensing A. splendens with varying basal coverage. The Swin block module is highly portable and compatible with the conventional Conv module (the Swin block and Conv extract global data in the upper and lower layers of the YOLO-Sp feature hierarchy). CNNs are well suited to image-based tasks that involve identifying spatial patterns, while Transformers are better suited to tasks involving sequential data and long-range dependencies. Combining the two allows the resulting hybrid model to capture both spatial and sequential information, so it performs well across a wide range of tasks. The combination of Transformer and CNN can produce better network-structure generalization. The proposed YOLO-Sp incorporates the locality, translation invariance, and hierarchy of CNNs, while retaining the Transformer's ability to capture semantic dependencies over longer ranges.
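The local-window mechanism that keeps the Swin block efficient can be illustrated by its partitioning step alone. A minimal NumPy sketch of non-overlapping window partitioning, with illustrative shapes (actual Swin blocks also shift the windows between successive layers and apply self-attention inside each window):

```python
import numpy as np

def window_partition(feat, win):
    """Split a (H, W, C) feature map into non-overlapping win x win windows.

    Returns an array of shape (num_windows, win * win, C), the token
    layout on which window-local self-attention operates.
    """
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "pad the map first in practice"
    # Group rows and columns into window-sized blocks, then flatten
    # each block into a sequence of win * win tokens.
    x = feat.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)
```

Restricting attention to these windows reduces the cost from quadratic in H * W to quadratic only in the (small, fixed) window size, which is why the Swin backbone stays tractable on high-resolution field imagery.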
Future advancements in detecting temperate grassland vegetation represented by A. splendens may hinge on developing a model that can more effectively integrate Transformer and CNN. The YOLO-Sp proposal is a specific example of this approach to detecting temperate grassland vegetation, suggesting potential for future advancements in this area by further integrating these models.
In this study, after labelling the data, it was found that slender herbaceous plants such as A. splendens make this task challenging. Rectangle labelling for object detection on millions of images is arduous, but rectangular boxes are relatively precise and easy to draw. In contrast, polygon labelling for segmentation of slender plants is difficult and imprecise, resulting in low labour efficiency. This study also investigated the use of other techniques (Figure 11). Consequently, research into algorithms that can achieve effective detection results through simple labelling may become popular. Deep learning needs large amounts of data and improved labelling techniques.
Using intelligent robots for grassland resource surveys could significantly improve the efficiency and accuracy of these surveys, leading to better management of grassland resources and, ultimately, a more sustainable ecosystem. However, factors such as camera resolution, shooting distance, and weather can all affect image capture quality, further complicating plant detection tasks. Table 5 shows that the lowest accuracy was obtained under insufficient sunlight and backlight conditions. This finding may be useful for grassland plant detection, where detecting slender plants such as A. splendens is challenging because plant pixels intermingle with pixels of the surrounding environment. While the results show some potential for real-time intelligent estimation of grass plant biomass, further research is needed to improve the accuracy and detection speed of the model. By testing the performance of different object detection models in grassland environments, the study sheds light on the strengths and limitations of various models, highlighting areas for future research and improvement. These findings may help inform the development of more robust and adaptable object detection models for use in diverse environmental settings.
One way to improve detection speed is to use lightweight networks, such as ShuffleNet [42] and MobileNets [43], as the backbone network. Some scholars have applied these networks to large-scale detection algorithms to improve detection speed [44,45]. However, how best to combine a lightweight network with the Transformer still requires extensive experimentation. A. splendens is distributed in the Northwest and Northeastern provinces of China, Inner Mongolia, Shanxi, and Hebei. It grows on slightly alkaline grasslands and sandy slopes at altitudes of 500-900 m. A. splendens has strong vitality, is resistant to drought, salt, and alkali, and can still grow vigorously on barren land where other plants cannot survive [5]. The research results of this paper can be used to locate A. splendens and monitor its growth.
Mechanized root cutting and harvesting are the tasks that follow detection, and they are not easy to control precisely for A. splendens. We also note the importance of interdisciplinary research, combining expertise in computer science, ecology, and agriculture. For example, this study focuses on the flowering and fruiting stage of A. splendens, whose distinctive external phenotype may make detection easier. Other grassland vegetation passes through multiple phenological phases of development, so different algorithms may be needed for detection at the various stages of different plants. By working collaboratively, researchers can develop solutions to complex environmental challenges, such as monitoring and managing grassland resources. This type of research provides practical benefits and contributes to a more holistic understanding of the interactions between technology and the environment.
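As a rough illustration of why backbones like MobileNets are lighter, a depthwise separable convolution replaces one standard k x k convolution with a depthwise k x k convolution plus a pointwise 1 x 1 convolution. A parameter-count sketch (bias terms ignored; this is the general MobileNets idea, not a measurement of any model in this paper):

```python
def conv_params(cin, cout, k):
    """Weights in a standard k x k convolution: one k x k kernel
    per (input channel, output channel) pair."""
    return cin * cout * k * k

def depthwise_separable_params(cin, cout, k):
    """Weights in a depthwise separable convolution: one k x k kernel
    per input channel, plus a 1 x 1 pointwise mixing convolution."""
    return cin * k * k + cin * cout
```

For a 256-to-256-channel 3 x 3 layer this is 589,824 versus 67,840 weights, roughly an 8.7x reduction, which is the kind of saving that makes such backbones attractive for on-robot inference.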
The results demonstrate the potential for real-time intelligent estimation of grass plant biomass, offering hope for more efficient and accurate grassland resource surveys. Moreover, the study highlights the importance of interdisciplinary research, which can lead to more effective solutions to environmental challenges. Overall, this research represents a modest step forward in developing intelligent robots for grassland resource surveys, and it may contribute to the broader field of computer vision and machine learning.

Conclusions
To intelligently monitor the growth of A. splendens, the monitoring method proposed in this paper can be implemented on a ground mobile robot with a binocular depth camera. Ground-based detection is more weatherproof than drones, and binocular depth cameras are less expensive than multithreaded lidar and hyperspectral sensors. This paper proposed a new Transformer-based model called YOLO-Sp, which improved and optimized the backbone, neck, and head of YOLOv5. Although the Swin block increases the computational complexity of the original model, the Transformer has a better ability to acquire global information. The introduction of the CBAM module integrates the model better with the CNN network, fully absorbing the advantages of CNNs' locality, translation invariance, and hierarchy. The improved convergence speed of the decoupled head also gives YOLO-Sp the best overall performance. For detecting A. splendens, YOLO-Sp achieved better AP values than contemporary SOTA deep models under the same training conditions, as well as better measured accuracy. However, the model still has limitations: it does not adapt well to varying light intensity, and its inference speed should be improved. A. splendens appears relatively slender in images, and its pixels easily mix with the surrounding environment, which is closely related to its growth condition. Our follow-up work will improve accuracy and extend the method to the detection of other plants in grassland ecosystems. This work may contribute to the development of intelligent detection of grassland plant biomass.
Data Availability Statement: Data will be made publicly available when the article is accepted for publication.