1. Introduction
China is the world’s third largest sugar producer, and sugarcane is China’s main sugar crop. Cane sugar accounts for more than 90% of China’s total sugar production and occupies an important position in the agricultural economy [1]. Guangxi is the most important sugarcane planting area in China; in 2020, the area of sugarcane planted in Guangxi reached 64.7% of the total area planted nationwide [2]. However, the adoption of mechanized sugarcane harvesting in Guangxi lags seriously behind. Although the comprehensive mechanization level in Guangxi has reached 60% [3], the rate of mechanized sugarcane harvesting remains very low, and harvest mechanization has become the main bottleneck restricting the full mechanization of sugarcane production in Guangxi. Sugarcane tips have a low sugar content and absorb sugar solution during the extraction process, which is then discharged along with the bagasse and waste, lowering the sugar yield. Leafy cane tips entering the flow channel also increase the channel pressure, aggravate blockage, impair leaf stripping, and raise the impurity rate [4]. In traditional sugarcane harvesting machinery, the cutting position of the cane tip is determined by the operator’s naked-eye judgment, and the height of the top cutter is not adjusted during cutting, which introduces considerable subjectivity and arbitrariness: if the cutting position is too low, sugarcane is wasted; if it is too high, the harvested raw cane contains a large amount of impurities [5]. Therefore, it is of great significance to achieve high-precision recognition of sugarcane tail tips in complex field environments and to strengthen research on intelligent impurity removal for sugarcane harvesters. The main contributions of this paper are as follows:
To address the challenge of identifying the cutting point of sugarcane tail tips—caused by variations in sugarcane morphology, extensive leaf wrapping around the cutting area, and potential occlusion—we propose a method for recognizing the entire tail tip region instead of a single cutting point.
We construct a custom sugarcane tail tip dataset collected from real-field complex environments and further enhance it using data augmentation techniques to simulate natural factors such as lighting variations, thereby improving the model’s generalization ability and robustness.
We propose Slim-YOLO, an improved YOLO11n-based algorithm for sugarcane tail tip recognition in complex environments. This model achieves accurate detection while meeting real-time processing requirements for embedded deployment, striking an optimal balance between detection accuracy and inference speed.
2. Related Works
In recent years, as traditional agriculture has evolved toward intelligence, an increasing number of researchers have integrated information technology with conventional machinery to achieve precision agriculture. Shen et al. [
6] employed traditional image segmentation and semantic segmentation methods for sugarcane tip recognition and proposed an improved PSO + OTSU algorithm. By integrating an asymmetric acceleration factor with a nonlinearly decreasing inertia weight W into the PSO algorithm and utilizing the H component in the HSV color space, the method applies median filtering to remove small isolated noise areas, effectively segmenting the sugarcane stalk and tip regions. The improved algorithm achieved recognition rates of 88.33% and 91.67% for sugarcane images, with an average accuracy of 90.0%, and an average processing time of 0.3687 s per image. Li et al. [
7] proposed a bifurcation point recognition method for sugarcane tips based on an improved YOLOv5s model. This model introduces a BiFPN feature fusion structure and a CA attention mechanism into YOLOv5s while incorporating GSConv convolution and a Slim-Neck paradigm. Additionally, a Ghost module replaces the standard convolution in the backbone network to reduce computational complexity and parameter size. The improved model achieved an average precision of 92.3%, a recall rate of 89.3%, and a detection time of 19.3 ms. Wen et al. [
8] proposed a method for estimating the cutting position height of sugarcane tips based on multimodal registration and depth image fusion. The measured cutting height closely matched the manually measured actual cutting height, with a root mean square error (RMSE) ranging from 1.22 cm to 1.78 cm and an R2 value between 0.79 and 0.86. Beyond the aforementioned studies on adaptive adjustment of sugarcane tail tip cutting positions, significant research has been conducted both domestically and internationally on the application of deep learning in intelligent sugarcane processing. Militante et al. [
9] employed a deep learning-based convolutional neural network (CNN) to identify sugarcane leaf diseases in the field. Zhou et al. [
10] developed a sugarcane cutting system based on machine vision to achieve bud protection and automatic single-bud segment cutting. The system integrates mechanical, electrical, and visual processing components, with machine vision serving as the core technology for identifying sugarcane stalk segments. The experiments showed a recognition accuracy of 93% and an average processing time of 0.539 s. The throughput capacity of the cutting unit reached 2400 buds per hour, and the cutting point precision met agricultural requirements, with the bud damage rate reduced to zero, providing a practical foundation for intelligent sugarcane seed cutting machinery. Huang et al. [
11] developed a machine vision-based system for the automatic cutting of sugarcane single-bud segments. The system uses image processing techniques such as mean filtering, thresholding, morphological operations, and maximum area selection to identify the position of sugarcane internodes in the H component of the HSV color space. The method also utilizes a rectangular template to move along the G-B color difference component map of the image, calculating the average grayscale value for each step, and identifying the position with the highest average grayscale value as the internode location. Experimental results showed that, in 36 combination tests, the optimal combination, with a template width and step size of 6, achieved a recognition rate of 90.77% and an average processing time of 0.481539 s.
The above research has greatly advanced the intelligentization of sugarcane production and is of considerable practical significance. However, deficiencies remain in sugarcane tail tip recognition: the accuracy and quantity of labeled data for complex environments are still insufficient, which limits model generalization, and deep learning models adapt poorly to complex environmental factors such as weather and lighting changes, reducing accuracy in practical applications. To this end, this paper proposes Slim-YOLO, an improved YOLO11n-based algorithm for recognizing sugarcane tail tips in complex environments, aiming to achieve accurate recognition while meeting the real-time requirements of embedded deployment and thereby striking an optimal balance between recognition accuracy and inference speed. The lightweight RepViT network is used as the model backbone, the ELANSlimNeck neck structure is designed, and the Unified-IoU (UIoU) loss function is introduced. Considering that sugarcane varies in morphology, that the cutting position is wrapped in many leaves, and that the tail tip cutting point is easily obscured by leaves, making the cutting point difficult to recognize, this paper proposes recognizing the whole tail tip region instead of a single cutting point. We also collected image data of sugarcane in complex field environments and applied data augmentation to construct a whole-region dataset of sugarcane tail tips. The experimental results show that, on this dataset, the improved model achieves mAP50 and mAP50:95 of 92.2% and 48.2%, respectively, which are 8.2% and 6.1% higher than those of the original YOLO11n model, while the number of model parameters is reduced by 48.4%.
3. Materials and Methods
3.1. Data Acquisition
The sugarcane variety targeted in this study was Zhongcane No. 1, the most widely planted variety in Guangxi, and the data were collected at the sugarcane planting site of Guangxi University’s New City of Agricultural Science in Dupang Village, Quli Town, Fusui County, Chongzuo City, Guangxi Zhuang Autonomous Region. Images were acquired with an iPhone 13 at a resolution of 1920 × 1080. Because the field environment is complex and detection accuracy is susceptible to factors such as light intensity, the dataset was collected under different lighting conditions, shooting angles, and shooting distances. Taking into account the harsh conditions that may be encountered in the natural field environment and the mutual occlusion among sugarcane plants, the dataset covers typical scenarios such as dim light, backlight, strong light at noon, weak light in the evening, and mutual occlusion; a total of 2300 representative image samples were collected. This diverse dataset enhances the model’s ability to adapt to various complex environments and improves its robustness. Some samples of the dataset are shown in
Figure 1.
3.2. Data Augmentation and Preprocessing
In order to simulate sugarcane harvesting scenarios in complex environments and further improve the generalization performance of the sugarcane tail tip detection model, we performed data augmentation on the dataset. The original images were augmented by random combinations of rotation, cropping, mirroring, brightness adjustment, saturation adjustment, and noise addition. Among these operations, rotation simulates the multi-angle tilting of sugarcane plants caused by differences in growth pattern or by wind; cropping simulates mutual occlusion between sugarcane tail tips, allowing the model to better adapt to scenarios in which only part of the target is visible; mirror flipping enhances the model’s adaptability to the left–right symmetry of sugarcane growth; brightness adjustment simulates the effect of different light intensities (e.g., strong light at noon and low light at dusk) on the target’s appearance; saturation change corresponds to color distortion caused by rainy weather or soil reflection; and added noise is equivalent to imaging noise of the equipment or interference from field dust. Samples of the original and augmented images are shown in
Figure 2.
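As an illustration of how such an augmentation pipeline can be assembled, the following sketch uses the Albumentations library; the operators mirror the transformations described above, but the parameter ranges, crop size, and probabilities are assumptions rather than the exact values used in this study.

```python
import albumentations as A

# Augmentation pipeline mirroring the operations described above.
# Parameter ranges and probabilities are illustrative assumptions.
augment = A.Compose(
    [
        A.Rotate(limit=20, p=0.5),                                  # plant tilting
        A.RandomSizedBBoxSafeCrop(height=640, width=640, p=0.3),    # partial visibility
        A.HorizontalFlip(p=0.5),                                    # left-right symmetry
        A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.2, p=0.7),
        A.HueSaturationValue(hue_shift_limit=5, sat_shift_limit=30, val_shift_limit=20, p=0.5),
        A.GaussNoise(p=0.3),                                        # sensor / dust noise
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# usage (image: HxWx3 uint8 array, boxes in normalized YOLO format):
# out = augment(image=image, bboxes=boxes, class_labels=labels)
# aug_image, aug_boxes = out["image"], out["bboxes"]
```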
On the basis of the 2300 originally collected images of sugarcane tail tips, two different images were generated from each image by the above augmentation methods, yielding 4600 augmented images and a total of 6900 images. The images were randomly divided into training, validation, and test sets at a ratio of 8:1:1, containing 5520, 690, and 690 images, respectively. We manually annotated the dataset using the open-source tool LabelImg, marking the whole tail tip region of each sugarcane plant with a rectangular bounding box. Each bounding box provides the top-left and bottom-right coordinates of the target object for precise localization. The annotation information for each image is stored in a corresponding TXT file, and all label files are saved in the Labels folder. An example of the labeling is shown in
Figure 3.
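The following minimal sketch illustrates how corner-style annotations can be converted into the normalized YOLO TXT format and how an 8:1:1 random split can be produced; the directory layout and file names are hypothetical placeholders, not the exact scripts used in this study.

```python
import random
from pathlib import Path

def corners_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert a top-left/bottom-right box into the normalized
    (x_center, y_center, width, height) form stored in YOLO TXT labels."""
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    return xc, yc, (x2 - x1) / img_w, (y2 - y1) / img_h

# 8:1:1 random split of the 6900 images (5520 / 690 / 690).
images = sorted(Path("dataset/images").glob("*.jpg"))
random.seed(0)
random.shuffle(images)
n = len(images)
splits = {
    "train": images[: int(0.8 * n)],
    "val": images[int(0.8 * n): int(0.9 * n)],
    "test": images[int(0.9 * n):],
}
for name, files in splits.items():
    Path(f"dataset/{name}.txt").write_text("\n".join(str(f) for f in files))
```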
3.3. YOLO11 Original Model
YOLO11 [
12] is the latest generation of the YOLO family, released by the Ultralytics team on 30 September 2024. It builds on the success of previous YOLO releases and introduces new features and improvements that further enhance performance and flexibility. YOLO11 is designed to be fast, accurate, and easy to use, making it a good choice for tasks such as object detection and tracking, instance segmentation, image classification, and pose estimation. Its network structure is similar to that of YOLOv8 and consists of three main parts: the backbone, the neck, and the detection head. The backbone is responsible for feature extraction; it employs a series of convolutional layers, with residual connections and bottleneck structures used to reduce network size and improve performance. The neck is mainly responsible for multi-scale feature fusion; YOLO11 uses a feature pyramid network structure that enhances feature representation by fusing feature maps from different stages of the backbone. As in YOLOv8, cross-scale information transfer is realized by fusing feature maps at different scales through bottom-up and top-down paths, which allows the model to detect targets at different scales more efficiently. In the detection head, two depthwise separable convolutions (DWConv) are added to the classification branch of the original YOLOv8 detection head, making it more lightweight; this reduces the amount of computation while maintaining accuracy and efficiency, enabling faster inference and higher computational efficiency. The YOLO11 network structure is shown in
Figure 4.
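For reference, a baseline YOLO11n model can be trained and evaluated with the Ultralytics API roughly as follows; the dataset configuration file "sugarcane.yaml" and the hyperparameters are illustrative assumptions, not the authors' published training settings.

```python
from ultralytics import YOLO

# Train a YOLO11n baseline on the sugarcane tail tip dataset.
# "sugarcane.yaml" and the hyperparameters below are illustrative placeholders.
model = YOLO("yolo11n.pt")
model.train(data="sugarcane.yaml", epochs=200, imgsz=640, batch=16)

# Validation reports precision, recall, mAP50, and mAP50-95 on the val split.
metrics = model.val()
print(metrics.box.map50, metrics.box.map)   # mAP50, mAP50-95
```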
3.4. Model Improvement
3.4.1. Replacing the YOLO11 Backbone
Target detection for sugarcane in complex field environments faces a number of challenges, including complex backgrounds, variable illumination, target occlusion, significant deformation, multi-scale target distribution, and real-time constraints with limited device resources. Backgrounds in field environments often contain weeds, soil, and other objects similar in color and texture to sugarcane, which can easily lead to false or missed detections. In addition, unstable lighting conditions, such as glare, shadows, or low light, can distort or blur parts of the target, further increasing the difficulty of detection. Occlusion and deformation of the target, such as overlapping or inverted sugarcane leaves, also make accurate recognition difficult for traditional detection models. The multi-scale distribution of sugarcane targets requires the model to detect targets of different sizes and at different distances, while in dense scenes the model needs stronger discrimination ability to avoid misjudgment. These issues, combined with the real-time requirements and limited equipment resources of practical agricultural applications, mean that the model must not only be highly accurate but also lightweight and suitable for deployment on edge devices such as drones or autonomous farm machines.
To address these problems, replacing the backbone network of YOLO11 with RepViT [
13] is an ideal solution. The design of RepViT combines the global feature modeling capability of a vision transformer with the local feature extraction capability of a convolutional neural network, accurately separating targets from complex backgrounds while preserving the detailed features of sugarcane leaves. Its adaptability to illumination variation and its robustness to target deformation and occlusion significantly improve detection performance. Its lightweight, structurally re-parameterized design gives RepViT a low computational overhead in the inference phase, making it suitable for real-time operation on resource-constrained devices. In addition, RepViT’s multi-scale feature modeling capability enhances the detection of small targets and improves the discrimination of dense targets. Overall, RepViT offers an ideal balance of performance and efficiency, providing strong support for sugarcane target detection in complex field environments.
RepViT is a lightweight CNN (Convolutional Neural Network) designed for computer vision tasks. Inspired by RepVGG, it aims to maintain or enhance model performance while retaining its lightweight nature. The RepViT structure is shown in
Figure 5 and consists of four stages. With an input image of size H × W and batch size B, the feature maps processed by the four stages have resolutions of H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32, with channel sizes C1, C2, C3, and C4, respectively. The Stem module preprocesses the input image. Stages 1 to 4 consist of multiple RepViTBlock and optional RepViTSEBlock modules, together with depthwise separable convolutions (3 × 3 DW), 1 × 1 convolutions, SE (Squeeze-and-Excitation) modules, and FFNs (feedforward networks). Each stage reduces the spatial dimension by downsampling. In addition, the Pooling module performs global average pooling to further reduce the spatial dimensionality of the feature map, and the FC module consists of a fully connected layer for final category prediction. Inspired by RepVGG, RepViT utilizes structural re-parameterization to enhance learning during training: a multi-branch structure improves the model’s representation capacity during training and is then re-parameterized into an equivalent single-branch structure for inference. The resulting reduction in computational complexity improves inference efficiency and is particularly beneficial for mobile devices, since it eliminates the computational and memory costs associated with skip connections.
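The following minimal sketch illustrates the structural re-parameterization idea for a depthwise 3 × 3 branch plus an identity (BN-only) branch, in the spirit of a RepViT block; the layer layout is simplified and is not the exact RepViT implementation.

```python
import torch
import torch.nn as nn

class RepDWBlock(nn.Module):
    """Simplified two-branch block: depthwise 3x3 conv + BN, and an identity branch
    with BN only. Both branches can be folded into one 3x3 depthwise conv."""
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.bn_dw = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)   # identity branch: BN only

    def forward(self, x):
        return self.bn_dw(self.dw(x)) + self.bn_id(x)

    @torch.no_grad()
    def reparameterize(self):
        """Fold both branches into a single 3x3 depthwise conv for inference."""
        def fuse(conv_w, bn):
            std = (bn.running_var + bn.eps).sqrt()
            w = conv_w * (bn.weight / std).reshape(-1, 1, 1, 1)
            b = bn.bias - bn.running_mean * bn.weight / std
            return w, b

        w_dw, b_dw = fuse(self.dw.weight, self.bn_dw)
        # the identity branch equals a depthwise 3x3 kernel with 1 at the centre
        c = self.dw.weight.shape[0]
        w_id = torch.zeros_like(self.dw.weight)
        w_id[:, 0, 1, 1] = 1.0
        w_id, b_id = fuse(w_id, self.bn_id)

        fused = nn.Conv2d(c, c, 3, padding=1, groups=c, bias=True)
        fused.weight.copy_(w_dw + w_id)
        fused.bias.copy_(b_dw + b_id)
        return fused
```

With the module in eval mode, `block(x)` and `block.reparameterize()(x)` produce the same output, while the fused single-branch form avoids the memory traffic of the skip connection at inference time.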
To illustrate the contribution of the RepViT backbone to the network more effectively, this paper uses Grad-CAM to visualize and compare the models’ attention. Grad-CAM visualizes target identification by means of heat maps. The heat maps produced by the original YOLO11 network and by the YOLO11-RepViT network (with the replaced backbone) for target identification are shown in Figure 6. It can be observed that, compared with the original YOLO11 structure, the network with the RepViT backbone produces a darker color over the whole cane tail tip region, indicating a higher thermal value, and its red area is more concentrated on the tail tip region, suggesting that the improved model focuses better on the features of the whole tail tip region. Moreover, the heat map generated by the improved model distinguishes the contour of the whole tail tip region more clearly than that of the original model, and the identified position is more accurate and reliable.
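A generic Grad-CAM routine of the kind used for such visualizations can be sketched as follows; the choice of target layer and of the scalar score to back-propagate (supplied here by `score_fn`, e.g., the maximum objectness/class score) are assumptions, since the exact visualization setup is not detailed above.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, score_fn):
    """Generic Grad-CAM sketch using forward/backward hooks on a conv layer."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    model.eval()
    out = model(x)
    score = score_fn(out)          # scalar to back-propagate
    model.zero_grad()
    score.backward()

    h1.remove(); h2.remove()
    act, grad = feats[0], grads[0]                   # (B, C, h, w)
    weights = grad.mean(dim=(2, 3), keepdim=True)    # channel importance
    cam = F.relu((weights * act).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam                                       # (B, 1, H, W) heat map in [0, 1]
```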
3.4.2. Proposing an Efficient Feature Fusion Neck Network ELANSlimNeck to Solve the Mutual Occlusion Problem
For later deployment of the model, we note that field applications usually offer only limited computing power, that targets in the whole sugarcane tail tip region may be partially occluded by the plant itself, and that the scale of the region changes with shooting angle and distance; the presence of multi-scale targets further increases the difficulty of detection. This places higher demands on the feature extraction capability of the model. To this end, we propose the ELANSlimNeck neck network, whose structure is shown in
Figure 7, using the feature extraction-fusion module RepNCSPELAN4 from YOLOv9 [
14] to replace the VoVGSCSP module in the original Slim-Neck [
15]. The part labeled in red in
Figure 7 is the replacement part. Compared to the VoVGSCSP module, RepNCSPELAN4 has significant advantages in computational efficiency, multi-scale feature processing, lightweight design, adaptability, and the balance between performance and efficiency. The RepNCSPELAN4 module enhances the expressive power of the model in the training phase through structural re-parameterization and significantly reduces computational complexity in the inference phase by simplifying the structure, meeting the demand for high efficiency in field equipment. The Spatial Pyramid Enhancement (SPE) mechanism introduced in the module effectively captures multi-scale target features and further enhances the model’s ability to perceive the whole sugarcane tail tip region at different scales. The efficient feature fusion mechanism ensures accurate recognition of whole tail tip regions against complex backgrounds, while the design of normalization and convolution (NC) stabilizes model performance under complex lighting conditions. These features make RepNCSPELAN4 more advantageous than VoVGSCSP in resource-constrained scenarios, such as real-time target detection in complex field environments. The structure of the RepNCSPELAN4 module is shown in
Figure 8.
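To make the aggregation pattern concrete, the following simplified sketch shows an ELAN-style split–transform–concatenate block in the spirit of RepNCSPELAN4; the branch contents are reduced to plain convolutions and do not reproduce the exact YOLOv9 module.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class ELANBlockSketch(nn.Module):
    """Split -> two successive transform branches -> concatenate all -> fuse.
    Illustrative only; the real RepNCSPELAN4 branches contain RepNCSP blocks."""
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.stem = conv_bn_act(c_in, 2 * c_mid, 1)     # split into two c_mid halves
        self.branch1 = conv_bn_act(c_mid, c_mid, 3)     # stand-in for RepNCSP + 3x3 conv
        self.branch2 = conv_bn_act(c_mid, c_mid, 3)
        self.fuse = conv_bn_act(4 * c_mid, c_out, 1)    # aggregate all intermediate features

    def forward(self, x):
        a, b = self.stem(x).chunk(2, dim=1)
        c = self.branch1(b)
        d = self.branch2(c)
        return self.fuse(torch.cat([a, b, c, d], dim=1))
```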
At the same time, GSConv and channel compression are used to reduce interference from redundant information and improve the detection of targets at different scales, maintaining detection accuracy while significantly increasing detection speed and achieving a balance between performance and efficiency. The GSConv structure diagram is shown in
Figure 9.
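A minimal sketch of the GSConv idea, assuming the formulation of the Slim-Neck paper (a standard convolution producing half of the output channels, a cheap depthwise convolution on that result, concatenation, and a channel shuffle); kernel sizes and other details are illustrative.

```python
import torch
import torch.nn as nn

class GSConvSketch(nn.Module):
    """Sketch of GSConv: dense conv to c_out/2 channels, depthwise conv on the
    result, concatenation, and a channel shuffle to mix the two groups."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU(inplace=True))

    def forward(self, x):
        a = self.dense(x)
        b = self.cheap(a)
        y = torch.cat([a, b], dim=1)           # (B, c_out, H, W)
        # channel shuffle: interleave the dense and cheap groups
        B, C, H, W = y.shape
        return y.view(B, 2, C // 2, H, W).transpose(1, 2).reshape(B, C, H, W)
```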
3.4.3. Improvement of the Loss Function
The CIOU (Complete IoU) [
16] loss function used in the original YOLO11 takes into account the center-point distance, overlap area, and aspect ratio, but it still has limitations when optimizing high-quality predicted boxes and is especially prone to unstable convergence and regression bias when detecting dense targets. The introduction of the Unified-IoU (UIoU) [
17] loss function can effectively address these problems. UIoU adopts a dynamic weight adjustment mechanism to balance the trade-off between model attention and convergence speed; a dynamic hyperparameter “ratio” is designed to adjust the scaling of the bounding boxes. Specifically, at the beginning of training, the bounding boxes are enlarged (ratio > 1), which is equivalent to reducing the IoU loss of high-quality predicted boxes and lets the model focus on low-quality anchor boxes, accelerating convergence. As training proceeds, the “ratio” gradually decreases, and the model’s attention gradually shifts toward high-quality predicted boxes. In the later stages of training, the bounding boxes are shrunk (ratio < 1), which increases the IoU loss of high-quality predicted boxes and focuses the model on them, improving the final detection accuracy. The value of “ratio” depends on the current training epoch: its initial value is set to 2 so that a certain accuracy is reached as quickly as possible at the start of training, and its final value is set to 0.5 so that more attention is paid to high-quality predicted boxes in the later stages, improving the final quality of object detection. Three reduction modes are formulated for the hyperparameter “ratio” as a function of the training epoch; a simplified sketch of this scheduling and the associated box scaling is given below.
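The sketch below illustrates the dynamic “ratio” schedule and the box-scaling step on which the UIoU loss is built; the decay curves shown (linear and cosine) and the implementation details are simplifications of [17], not the exact formulation.

```python
import math
import torch

def ratio_schedule(epoch, total_epochs, start=2.0, end=0.5, mode="linear"):
    """Dynamic 'ratio': starts at 2 (enlarged boxes, focus on low-quality
    predictions) and decays to 0.5 (shrunken boxes, focus on high-quality
    predictions). The exact UIoU decay curves may differ from these."""
    t = epoch / max(total_epochs - 1, 1)
    if mode == "cosine":
        t = 0.5 * (1 - math.cos(math.pi * t))
    return start + (end - start) * t

def scaled_iou(pred, target, ratio):
    """IoU after scaling both boxes' width/height by `ratio` about their centres.
    Boxes are (x1, y1, x2, y2) tensors of shape (N, 4)."""
    def scale(b):
        cx, cy = (b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2
        w, h = (b[:, 2] - b[:, 0]) * ratio, (b[:, 3] - b[:, 1]) * ratio
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

    p, t = scale(pred), scale(target)
    lt = torch.max(p[:, :2], t[:, :2])
    rb = torch.min(p[:, 2:], t[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (p[:, 2] - p[:, 0]) * (p[:, 3] - p[:, 1])
    area_t = (t[:, 2] - t[:, 0]) * (t[:, 3] - t[:, 1])
    return inter / (area_p + area_t - inter + 1e-8)
```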
For the field sugarcane detection task, this mechanism helps the model fit target boundaries more accurately and localize targets more robustly, even in the face of complex backgrounds or partially occluded canes. In addition, UIoU introduces a dual attention mechanism that enables the model to adaptively adjust its attention to targets of different scales during optimization and performs particularly well on small targets. Compared with CIOU, Unified-IoU also reduces jitter in box regression and improves the stability of the predicted boxes, avoiding detection errors caused by inaccurate bounding box regression. It therefore adapts better to complex field environments, improves the accuracy and robustness of detecting the whole sugarcane tail tip region, and optimizes detection performance, especially at high IoU thresholds, making the model more reliable in real application scenarios. The diagram of the improved Slim-YOLO model is shown in
Figure 10. The part labeled in the red box in the figure is the improved part.
3.5. Experimental Platforms
The hardware and software environment of the PC platform used for dataset training and testing in this paper is as follows: CPU, 16 vCPU Intel(R) Xeon(R) Platinum 8352V @ 2.10 GHz; GPU, NVIDIA GeForce RTX 4090; RAM, 24 GB; operating system, Ubuntu 22.04; CUDA 12.1; Python 3.10; PyTorch 2.1.0.
3.6. Evaluation Indicators
Target detection algorithms can be evaluated with a set of general metrics to assess model performance. The evaluation metrics used in this paper include precision (P), recall (R), mean average precision (mAP50, mAP50:95), number of parameters (Params), and giga floating-point operations (GFLOPs). The meaning of each indicator and how it is calculated are set out below.
(1) Precision (P):
$$P = \frac{TP}{TP + FP}$$
where TP is the number of target objects correctly identified by the algorithm, and FP is the number of cases in which non-target objects are incorrectly identified as targets. Precision indicates the proportion of correct predictions among all predicted boxes detected as targets and reflects the reliability of the model’s predictions; a high precision indicates fewer false positives.
(2) Recall (R):
$$R = \frac{TP}{TP + FN}$$
where FN is the number of instances in which the algorithm incorrectly classifies a target object as a non-target. Recall indicates the proportion of all real targets that are correctly detected and measures the model’s ability to find real targets; a high recall indicates fewer missed detections.
(3) Mean Average Precision (mAP):
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where AP denotes the average precision of the model over different levels of recall (the area under the precision–recall curve), and N is the number of object classes. mAP50 is computed at an IoU threshold of 0.5, while mAP50:95 averages the results over IoU thresholds from 0.5 to 0.95. Since different IoU thresholds reflect different detection accuracy requirements, mAP comprehensively evaluates the performance of the model at different accuracies.
4. Results
4.1. Comparison Experiment
4.1.1. Comparative Experiments on the Structure of the Neck Network
In a target detection model, the neck network is an important module connecting the backbone network and the detection head; its main role is to further process, fuse, and enhance the features extracted by the backbone in order to improve detection accuracy and robustness. For the proposed lightweight target detection model, the design of the neck network is one of the important factors affecting performance, and its structure and optimization are especially critical for small target detection against complex backgrounds, multi-scale feature fusion, and the balance of computational efficiency. In order to evaluate the performance of ELANSlimNeck fully, we designed comparative experiments to systematically compare it with several mainstream neck network architectures. The results are shown in
Table 1.
The experimental results show that the proposed ELANSlimNeck achieves the best performance on both key detection accuracy metrics, mAP50 and mAP50-95, reaching 0.921 and 0.471, respectively. These values are 9.6% and 11.9% higher than PAFPN [18] (0.84 and 0.421) and 2.3% and 6.3% higher than HSFPN [19] (0.9 and 0.443), and they also exceed BiFPN [20] (0.911 and 0.461) and Slim-Neck (0.91 and 0.459). This shows that ELANSlimNeck detects targets more accurately under different IoU thresholds and has stronger robustness and generalization ability in complex environments. In terms of computation, the GFLOPs of ELANSlimNeck is only 6.1, 13% lower than BiFPN (7.0), lower than PAFPN (6.3) and Slim-Neck (6.3), and close to HSFPN (5.6), indicating that it effectively reduces computation and improves inference efficiency while maintaining high detection accuracy, making it suitable for deployment on resource-constrained devices. In terms of parameter count, ELANSlimNeck has only 2,119,891 parameters, significantly fewer than Slim-Neck (2,734,843) and BiFPN (2,670,343), reductions of 22.5% and 20.6%, respectively, and 17.9% fewer than PAFPN (2,582,347); it is slightly larger than HSFPN (1,845,435) but achieves a greater performance gain. In terms of frames per second (FPS), ELANSlimNeck shows a clear advantage in inference speed: compared with BiFPN (105.1 FPS) and HSFPN (110.3 FPS), it improves inference speed by 26.3% and 20.3%, respectively, and it is 40.2% faster than Slim-Neck (94.6 FPS). This indicates that the improvement significantly enhances computational efficiency while maintaining high accuracy, making it practical for real-time applications; in particular, ELANSlimNeck offers greater deployment advantages on resource-constrained embedded devices. Therefore, by optimizing feature fusion and the channel attention mechanism, ELANSlimNeck achieves better detection performance with a lighter structure while maintaining computational efficiency, and it outperforms PAFPN, BiFPN, HSFPN, and Slim-Neck on several metrics, demonstrating its suitability for target detection in complex environments. The results of the visualization experiment are shown in
Figure 11.
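For reproducibility of the speed comparison, FPS can be measured roughly as follows; the weight file name, the dummy inputs, and the warm-up/timing protocol are assumptions for illustration, not the exact benchmarking procedure used in this work.

```python
import time
import torch
from ultralytics import YOLO

model = YOLO("slim_yolo.pt")          # hypothetical trained weights
imgs = torch.rand(110, 3, 640, 640)   # dummy inputs in [0, 1], BCHW

def sync():
    if torch.cuda.is_available():
        torch.cuda.synchronize()

# warm-up runs so that model loading / kernel initialization is excluded
for i in range(10):
    model.predict(imgs[i:i + 1], verbose=False)

sync()
t0 = time.perf_counter()
for i in range(10, 110):
    model.predict(imgs[i:i + 1], verbose=False)
sync()
fps = 100 / (time.perf_counter() - t0)
print(f"FPS: {fps:.1f}")
```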
4.1.2. Comparative Experiments Between Different Mainstream Models
In this section, in order to verify the effectiveness of our proposed Slim-YOLO model in the task of sugarcane tail tip target detection in complex environments, we conducted comparative experiments between it and the current mainstream target detection models. The comparison models mainly include RTDETR-l, YOLOv8n, YOLOv9t, YOLOv10n, YOLO11n, YOLO11s, YOLO11m, and YOLO11l. The comparison results are shown in
Table 2. The results of the visualization experiment are shown in
Figure 12.
The experimental results show that our proposed Slim-YOLO achieves the best performance on both key detection accuracy metrics, mAP50 and mAP50-95, reaching 0.922 and 0.482, respectively, a slight improvement over RTDETR-l (0.916 and 0.435). Moreover, Slim-YOLO has only 1.33M parameters, 95.8% fewer than RTDETR-l (31.99M), indicating that our model is far more lightweight and has lower computational and storage costs while achieving similar or even higher detection accuracy. Compared with the lightweight models YOLOv8n, YOLOv9t, YOLOv10n, and YOLO11n, Slim-YOLO performs better on all metrics, especially mAP50-95, where it is 6.1% higher than YOLO11n and 7% higher than YOLOv9t. In terms of computational complexity, Slim-YOLO requires only 3.5 GFLOPs, 44.4% less than YOLO11n (6.3) and 56.8% less than YOLOv8n (8.1), further demonstrating its efficiency. Compared with the larger YOLO11s, YOLO11m, and YOLO11l, Slim-YOLO still holds an accuracy advantage, exceeding YOLO11m by 1.6% and YOLO11l by 3.1% in mAP50, while its parameter count is only 14.1% of YOLO11s (9.41M), 6.6% of YOLO11m (20.03M), and 5.3% of YOLO11l (25.28M), significantly reducing computational requirements. In terms of frames per second (FPS), Slim-YOLO shows an outstanding inference speed advantage: at 189.1 FPS, it is 7.4%, 10.5%, and 37.1% faster than YOLO11n (176 FPS), YOLO11s (171.1 FPS), and YOLO11m (137.9 FPS), respectively. The optimized model has a compact size of only 3.0 MB and can process 189 images per second, making it well suited for deployment in complex sugarcane field environments. Overall, Slim-YOLO achieves significant lightweighting and computational efficiency gains while maintaining high detection accuracy, demonstrating excellent comprehensive performance and application value compared with other mainstream models. The results of the visualization comparison of the different models are shown in
Figure 13.
4.2. Ablation Experiment
4.2.1. Verifying the Validity of the Modules
In this section, in order to further validate the effectiveness of each improved module of Slim-YOLO, we design a series of ablation experiments to analyze the effects of different components on the model performance. Also, to validate the effectiveness of the modules in the ELANSlimNeck neck structure, we added the core modules RepNCSPELAN4 and GSConv in the ELANSlimNeck neck structure to the ablation experiments. The following groups of experiments are conducted on the basis of the improved loss function UIoU, and the results are shown in
Table 3.
From the results of the ablation experiments, it can be seen that each improvement module contributes significantly to the performance of Slim-YOLO. First, introducing RepViT alone (Model A) raises mAP50 and mAP50-95 from 0.84 and 0.421 to 0.906 and 0.46 while reducing the GFLOPs to 4.1, indicating that RepViT, as a lightweight backbone, not only enhances feature extraction but also improves computational efficiency. Second, RepNCSPELAN4 (Model B) raises mAP50-95 to 0.47, indicating that this module optimizes feature fusion and yields more accurate detection. GSConv (Model C) slightly improves recall (R) while improving mAP50-95, indicating that it optimizes the efficiency of feature extraction. Combining any two modules (A+B, A+C, and B+C) brings further gains, while the full Slim-YOLO (A+B+C) improves almost all metrics: mAP50-95 increases to 0.482, GFLOPs drops to 3.5, and the number of parameters falls to 1,331,835. This verifies the synergistic effect of the modules in improving detection accuracy while keeping the computational overhead low.
4.2.2. Verifying the Validity of the Loss Function
In order to verify the effectiveness of the loss function alone, we conducted experiments on the Slim-YOLO model using the loss function CIoU of the original YOLO11 model and the improved loss function UIoU. The experimental results are shown in
Table 4 below.
The experimental results show that, after the loss function is changed to UIoU, Slim-YOLO improves in precision (P), recall (R), mAP50, and mAP50-95, adapting better to the complex field environment and improving the accuracy of detecting the whole sugarcane tail tip region.
5. Discussion
In this study, an improved YOLO11n-based sugarcane tail tip recognition algorithm, Slim-YOLO, is proposed, with optimizations of the backbone network, neck structure, and loss function to address the challenges of target detection in complex field environments. The experimental results show that Slim-YOLO achieves a good balance of detection accuracy, inference speed, and computational efficiency, and it outperforms traditional methods especially when dealing with complex backgrounds, multi-scale targets, and illumination variations. Compared with the original YOLO11n, we adopt RepViT as the backbone network; by virtue of its structural re-parameterization design, it significantly reduces the computational overhead in the inference phase while improving the model’s expressive power and feature extraction capability, enabling high detection performance under lightweight conditions. In addition, the proposed ELANSlimNeck neck structure further enhances feature extraction and fusion by replacing the VoVGSCSP module in the original Slim-Neck with RepNCSPELAN4. RepNCSPELAN4’s improvements in feature pyramid fusion and the channel attention mechanism make the model more stable in multi-scale target detection and improve accuracy on small targets and in complex backgrounds. Meanwhile, for the loss function, we introduce Unified-IoU (UIoU), which lets the model focus on optimizing low-quality predicted boxes in the early stage of training to speed up convergence and shift to high-quality predicted boxes in the late stage to improve the final detection accuracy. Compared with the original CIOU loss, the regression accuracy of UIoU is superior at high IoU thresholds, which further enhances the bounding box stability and detection robustness of Slim-YOLO.
Despite improvements in several areas, Slim-YOLO still has some limitations. For example, although RepViT as a backbone reduces the computational load, there is still room to further optimize its feature extraction capability under extreme lighting conditions. In addition, although RepNCSPELAN4 performs well on small target detection, false detections may still occur in heavily occluded scenes. Future research can further optimize the model in the following directions: first, a Transformer mechanism or a more efficient attention mechanism can be introduced to further enhance the model’s feature representation in complex scenarios; second, an adaptive data augmentation strategy can be combined so that the model generalizes better to variable field environments; finally, model quantization or pruning techniques can be explored to meet deployment requirements, making Slim-YOLO more suitable for resource-constrained edge devices and enabling more efficient real-time detection.
6. Conclusions
This study addresses the challenge of sugarcane tail tip detection in complex field environments by proposing a lightweight object detection algorithm, Slim-YOLO, based on an improved YOLO11n. The experimental results show that the improved model achieves a mean average precision (mAP50) of 92.2% and a mAP50:95 of 48.2% on the sugarcane tail tip dataset, representing an increase of 8.2% and 6.1%, respectively, compared to the original YOLO11n. At the same time, the number of parameters is reduced by 48.4%. While maintaining high detection accuracy, this model significantly reduces computational complexity and parameter count, demonstrating superior real-time performance and potential for mobile deployment. It provides both theoretical and technical references for the automation control of sugarcane harvester cutting devices.
Although Slim-YOLO performs well in complex sugarcane tail tip detection tasks, there is still room for further optimization, such as improving detection stability under extreme lighting conditions or in higher-density scenarios. In the future, we plan to integrate multimodal information (e.g., LiDAR point cloud data) to further enhance detection robustness, while also exploring more efficient model compression and acceleration techniques to reduce computational costs and advance the development of intelligent detection technologies in smart agriculture.
Author Contributions
C.W.: Conceptualization, Data curation, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing—original draft, Writing—review and editing. Y.C.: Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing—original draft, Writing—review and editing. S.L.: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Visualization. L.L.: Conceptualization, Data curation, Investigation, Methodology, Validation. Q.L.: Conceptualization, Data curation, Formal analysis, Investigation, Methodology. K.L.: Conceptualization, Data curation, Investigation, Validation. Y.H.: Conceptualization, Data curation, Funding acquisition, Investigation. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Guangxi Science and Technology Major Project (Guike AA22117006), the Open Project of Key Laboratory of Artificial Intelligence and Information Processing in Guangxi Colleges and Universities (2024GXZDSY017), the Key Project of Teaching Reform of the Guangxi Zhuang Autonomous Region in 2022 (2022JGZ121), the Project of Reform of Degree and Postgraduate Education in Guangxi in 2023 (JGY2023115), and 2020 Guangxi Degree and Postgraduate Education Reform Project (JGY2020066).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The raw/processed data required to reproduce the above findings cannot be shared at this time, as the data also form part of an ongoing study.
Acknowledgments
The authors would like to express their sincere thanks to Nanning Taiyin Technology Co., Ltd. for their great support during the experimental period. We also sincerely appreciate the valuable contributions of Yongle Hu, Xiaozhu Long, and Hongliang Nong to this work. Their assistance in data collection, technical support, and discussions has been greatly beneficial.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Ou, Y.; Yang, D. Several Issues in the Mechanization of Sugarcane Production in Main Producing Areas. Guangxi Agric. Mech. 2010, 4, 8–10. [Google Scholar]
- Department of Development Planning. Compilation of the New Round of Advantageous Agricultural Products Regional Layout Planning; China Agriculture Press: Beijing, China, 2009; pp. 1–4. ISBN 978-7-109-13451-5. [Google Scholar]
- Lu, P. Investigation on the Development Status of Sugarcane Mechanized Harvest in Guangxi and Influencing Factors. Master’s Thesis, Guangxi University, Nanning, China, 2022. [Google Scholar]
- Ma, F.; Lin, Y.; Dong, C.; Shen, K.; Cai, L.; Gao, J. Design and Test of the Hydraulic System of the Top Cutting Device for Small Sugarcane Harvester. J. Agric. Mech. Res. 2017, 39, 55–61. [Google Scholar] [CrossRef]
- Xia, A. Research on Cane Tip Recognition Method Based on Machine Vision. Master’s Thesis, Guilin University of Technology, Guilin, China, 2023. [Google Scholar]
- Shen, Z.; Xia, A.; Dong, Z.; Chen, W.; Cao, W. Study on Image Segmentation Method of Sugarcane Tip in Complex Environment. J. Chin. Agric. Mech. 2023, 44, 113–118. [Google Scholar] [CrossRef]
- Li, S.; Bian, J.; Li, K.; Ren, H. Identification and Height Localization of Sugarcane Tip Bifurcation Points in Complex Environments Based on Improved YOLO V5s. Trans. Chin. Soc. Agric. Mach. 2023, 54, 247–258. [Google Scholar]
- Wen, C.; Hou, B.; Li, J.; Wu, W.; Yan, Y.; Cui, W.; Huang, Y.; Long, X.; Nong, H.; Lu, Y. Height Estimation of Sugarcane Tip Cutting Position Based on Multimodal Alignment and Depth Image Fusion. Biosyst. Eng. 2024, 243, 93–105. [Google Scholar] [CrossRef]
- Militante, S.V.; Gerardo, B.D.; Medina, R.P. Sugarcane Disease Recognition Using Deep Learning. In Proceedings of the 2019 IEEE Eurasia Conference on IOT, Communication and Engineering (ECICE), IEEE, Yunlin, China, 3–6 October 2019; pp. 575–578. [Google Scholar]
- Zhou, D.; Fan, Y.; Deng, G.; He, F.; Wang, M. A New Design of Sugarcane Seed Cutting Systems Based on Machine Vision. Comput. Electron. Agric. 2020, 175, 105611. [Google Scholar] [CrossRef]
- Huang, Y.; Huang, T.; Huang, M.; Yin, K.; Wang, X. Sugarcane Internode Recognition Based on Local Mean Values. Chin. J. Agric. Mach. Chem. 2017, 38, 76–80. [Google Scholar]
- Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. 2024. Available online: https://doi.org/10.48550/arXiv.2410.17725 (accessed on 3 March 2025).
- Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting Mobile CNN From ViT Perspective. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle, WA, USA, 16 June 2024; pp. 15909–15920. [Google Scholar]
- Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the Computer Vision—ECCV 2024, Seattle, WA, USA, 16–22 June 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature: Cham, Switzerland; Milan, Italy, 2025; Volume 15089, pp. 1–21. [Google Scholar]
- Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-Neck by GSConv: A Lightweight-Design for Real-Time Detector Architectures. J. Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
- Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2022, 52, 8574–8586. [Google Scholar] [CrossRef] [PubMed]
- Luo, X.; Cai, Z.; Shao, B.; Wang, Y. Unified-IoU: For High-Quality Object Detection. 2024. Available online: https://doi.org/10.48550/arXiv.2408.06636 (accessed on 3 March 2025).
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
- Chen, Y.; Zhang, C.; Chen, B.; Huang, Y.; Sun, Y.; Wang, C.; Fu, X.; Dai, Y.; Qin, F.; Peng, Y.; et al. Accurate Leukocyte Detection Based on Deformable-DETR and Multi-Level Feature Fusion for Aiding Diagnosis of Blood Diseases. Comput. Biol. Med. 2024, 170, 107917. [Google Scholar] [CrossRef] [PubMed]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
- Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
- Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A review on yolov8 and its advancements. In International Conference on Data Intelligence and Cognitive Informatics; Springer: Singapore, 2024; pp. 529–545. [Google Scholar]
- Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 1–21. [Google Scholar]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).