Article

Real-Time Detection and Validation of a Target-Oriented Model for Spindle-Shaped Tree Trunks Leveraging Deep Learning

1 College of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
2 Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
3 Intelligent Equipment Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
4 College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
* Authors to whom correspondence should be addressed.
Agronomy 2026, 16(2), 210; https://doi.org/10.3390/agronomy16020210
Submission received: 6 December 2025 / Revised: 28 December 2025 / Accepted: 13 January 2026 / Published: 15 January 2026

Abstract

To enhance the automation and intelligence of trenching fertilization operations, this research proposes a real-time trunk detection model (Trunk-Seek) designed for spindle-shaped orchards. The model employs a customized data augmentation strategy and integrates the YOLO deep learning framework to effectively address visual challenges such as lighting variation, occlusion, and motion blur. Multiple object tracking algorithms were evaluated, and ByteTrack was selected for its superior performance in dynamic trunk tracking. In addition, a Positioning and Triggering Algorithm (PTA) was developed to enable precise localization and triggering for target-oriented fertilization. The system was deployed on an edge device, a test bench was established, and both laboratory and field experiments were conducted to validate its performance. Experimental results demonstrated that the detection model achieved an mAP50 of 98.9% and maintained a stable 32.53 FPS on the edge device, fulfilling real-time detection requirements. Test bench analysis revealed that variations in trunk diameter and operation speed affected triggering accuracy, with an average dynamic localization error of ±1.78 cm. An empirical model (T) was developed to describe the time-delay behavior associated with positioning errors. Field verification in orchards confirmed that Trunk-Seek achieved a triggering accuracy of 91.08%, representing a 24.08% improvement over conventional training methods. Combining high accuracy with robust real-time performance, Trunk-Seek and the proposed PTA provide essential technical support for the development of a visual target-oriented fertilization system in modern orchards.

1. Introduction

Intelligent mechanized operations in orchard management can reduce labor costs and improve the efficiency of orchard production [1]. Detecting tree targets is essential for various tasks, including intelligent fertilization [2], automatic pruning [3], plant phenotyping [4], smart harvesting [5], and unmanned navigation [6]. Generally, the fertilization techniques in orchards include trenching, broadcasting, and holing fertilization. Trenching fertilization has emerged as the predominant horticultural practice in modern orchards owing to its high efficiency and compatibility with mechanized operations [7]. Typically, trenching fertilization is conducted between tree rows or individual trees in a single row. Therefore, employing target detection technology to locate tree trunks for automated trenching fertilization represents a critical step in transitioning from mechanized to intelligent fertilization systems in orchards [8,9]. Common detection technologies include ultrasonic sensors, LiDAR, and computer vision [10]. Although ultrasonic sensors and LiDAR can detect the presence of tree trunks via reflected signals, they struggle to acquire semantic information about the trees and face challenges in distinguishing the trunk from other targets [11]. Computer vision technology, with its capacity to capture vast amounts of information accurately and intelligently, holds a pivotal role in horticultural intelligent equipment [12]. Developing a detection model using computer vision technology for tracking and positioning during trenching fertilization is essential for improving fertilizer efficiency and minimizing chemical fertilizer residue in orchards [13].
Traditional detection algorithms struggle to effectively extract representative semantic information and exhibit low recognition accuracy in orchard scenarios characterized by uneven lighting, natural motion, and occlusion from leaves and branches [14]. Additionally, the process of generating candidate boxes produces numerous redundant regions, which slows down overall detection, and these algorithms no longer satisfy the demands of orchard applications [15]. Deep learning-based object detection algorithms, ranging from the two-stage R-CNN series to prominent single-stage algorithms like YOLO and SSD, have been extensively utilized in agricultural detection owing to their robust feature representation capabilities, strong generality, and rapid, high-precision detection performance [16,17,18]. Cao proposed a YOLOv8-Trunk method for detecting orchard tree trunks and extracting navigation paths [19]. This method enhances the detection accuracy and speed of the YOLOv8 network by incorporating the minimum point distance intersection-over-union loss function and an efficient multi-scale attention mechanism. Reference lines along both sides of the tree trunks are fitted using the least squares method to facilitate navigation path planning for orchard robots. The experimental results demonstrated that the method attained a tree trunk detection accuracy of 92.7%, and the navigation path reliability was high. Brown proposed a particle filtering-based framework for orchard robot localization, which segments tree trunks using deep learning, estimates their width, and integrates this width information into the particle filtering algorithm, thereby substantially enhancing the robot’s localization accuracy and convergence speed between rows [20]. Gao developed a method for apple detection and counting using YOLOv4-tiny and CSR-DCF algorithms. 
This method mitigates the problem of duplicate counting in multi-object tracking by focusing on tracking the tree trunk rather than the fruit, thereby achieving high-precision fruit counting [14].
The methods proposed by the aforementioned researchers have achieved detection, tracking, and localization to varying degrees, yet the effects of lighting and occlusion from branches and leaves in orchards continue to present challenges in developing detection models [21]. The inference process of deep learning demands significant computational resources. Balancing real-time performance with precision, while ensuring deployment on embedded devices appropriate for outdoor operations, represents a critical research issue [22]. More importantly, fertilization operations are invariably conducted in dynamic environments, where machine movement and vibration inevitably induce motion blur in detection models, resulting in decreased accuracy [23]. Therefore, examining the operational speed appropriate for vision systems and analyzing the dynamic localization errors of detection models has also emerged as a key objective of this research. Based on the foregoing, this research develops a trunk detection model (Trunk-Seek) suitable for standardized spindle-shaped orchards (Figure 1). This model can be utilized for trunk detection, tracking, and dynamic localization, and it supports real-time detection on embedded devices. During model training, customized data augmentation is employed to enhance the generalization capability of the model in various scenarios. Furthermore, tracking and localization algorithms are integrated to accomplish dynamic localization of trunks during motion, to analyze localization errors, and to define the application scope and compensation conditions for the detection model in orchard target-oriented fertilization contexts. The primary contributions of this work are as follows. First, we propose Trunk-Seek, a real-time trunk detector tailored for spindle-shaped apple orchards, and demonstrate its deployment on edge devices with stable real-time performance.
Second, a dedicated data augmentation pipeline is designed to improve the model’s robustness to lighting variations, occlusion, and motion blur. Third, multiple multi-object tracking algorithms are evaluated, and ByteTrack is integrated to enhance trunk association continuity in dynamic scenes. Fourth, a Positioning and Triggering Algorithm (PTA) is developed to convert continuous detections into reliable triggering events for targeted trench fertilization. Finally, an experimental platform is constructed to quantify dynamic localization errors, and an empirical temporal delay model is calibrated to enable real-time compensation.

2. Materials and Methods

The key components of this work include: robustness enhancement strategies against motion blur, occlusion, and lighting variations; a real-time detection model deployable on edge devices; a stable trunk ID association method based on multi-target tracking; and a dedicated positioning and triggering algorithm integrating dynamic localization error quantification and an empirical delay model, which will be validated through both controlled test benches and orchard experiments.
Figure 2 illustrates the proposed framework for model training. First, data processing is applied to the collected trunks, including converting the annotation structure to YOLO format, adjusting image pixels to 1920 × 1080, and proportionally dividing the Dataset into training, validation, and test sets. Secondly, the YOLOv8L model is utilized for initial training on the prepared data, while the algorithm modifies the original images to expand and augment the Dataset. Then, the trained YOLOv8L model is employed to semi-automatically label the Original and Combined Datasets, followed by manual refinement of the semi-automated labels. During the training phase, a comparative experiment among YOLO series models is designed, and a comprehensive performance analysis is performed based on appropriate evaluation metrics to identify the optimal detection model for dynamic detection. On this basis, multiple object tracking (MOT) and the PTA are incorporated. Finally, the validation on both the test bench and in orchard conditions for the deep learning-based detection model is established.

2.1. Construction of Trunk-Seek

2.1.1. Data Collection and Customized Dataset

As shown in Figure 3, the research was conducted at the Forestry Research Institute of the Beijing Academy of Agriculture and Forestry Sciences (116.227117° E, 39.974562° N) during tree fertilization operations. A total of 200 static images and 6 dynamic videos of trees were collected, with the static images stored in JPEG format. Data collection was performed using an Insta360 Go 3S camera (Insta360 Technologies Inc., Shenzhen, China), which was mounted on a remote-controlled rover in the orchard via a clamping device. The images were captured at a resolution of 1920 × 1080 pixels. Tree heights were approximately 2.0–3.0 m, with plant spacing of 1.5 m and row spacing of 3.6 m. During data collection, the rover operated at a constant speed along the fruit tree row paths. The video collection was conducted across three time periods—morning, noon, and afternoon—to ensure variability in the tree backgrounds within the videos. The total number of trunks in the videos was counted by the data collection team using a mechanical counter, and this count was validated as the ground truth.
Subsequently, the constructed dataset was randomly divided into training, validation, and test sets at a ratio of 8:1:1. For each image, a YOLO annotation file is generated, containing image attributes (e.g., name, width, height) and object attributes (e.g., class name, bounding box), with trunk objects annotated using rectangular boxes and unified class labels in the YOLO format. To achieve accurate detection and mitigate overfitting in the supervised learning algorithm during model construction, as illustrated in Figure 4, data augmentation techniques are applied, including the use of Motion Blur simulation to mimic motion blur in images, Copy Cut augmentation to introduce random occlusions of tree trunks, and Brightness Adjustment to vary image brightness, as well as Convolution operations to enhance tree trunk contours. Consequently, the original 3677 images are expanded to a total of 18,385 images [24].
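The augmentation operations above (motion-blur simulation, brightness adjustment, and Copy Cut occlusion masking) can be sketched with plain NumPy on a grayscale image. The function names and parameter choices below (kernel width, brightness factor, mask size and grey value) are illustrative assumptions, not the authors' exact pipeline:

```python
import numpy as np

def motion_blur(img: np.ndarray, k: int = 9) -> np.ndarray:
    """Simulate horizontal motion blur: average each pixel over a k-wide window."""
    pad = k // 2
    padded = np.pad(img.astype(np.float32), ((0, 0), (pad, pad)), mode="edge")
    out = np.zeros_like(img, dtype=np.float32)
    for i in range(k):
        out += padded[:, i:i + img.shape[1]]
    return (out / k).astype(np.uint8)

def adjust_brightness(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel intensities, e.g. factor=0.7 for the paper's -30% setting."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def copy_cut(img: np.ndarray, rng: np.random.Generator, size: int = 40) -> np.ndarray:
    """Paste a grey rectangle at a random position to simulate trunk occlusion."""
    out = img.copy()
    y = rng.integers(0, img.shape[0] - size)
    x = rng.integers(0, img.shape[1] - size)
    out[y:y + size, x:x + size] = 128
    return out
```

Applying each operation to every original image, plus a contour-enhancing convolution, yields the five-fold expansion (3677 to 18,385 images) reported above.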

2.1.2. Construction of Target-Oriented Model

The research employs a transfer learning approach and selects the widely used YOLOv8n model as the primary framework for trunk detection. Compared to other detection models, YOLOv8 employs a single-stage detector that combines classification and localization functions for tree trunks within a single neural network, necessitating only one network inference to determine the bounding box location and object class. YOLOv8 builds upon the improvement strategies of the YOLO series [25]. First, YOLOv8 deactivates Mosaic augmentation during the final ten epochs of training. This modification is based on the observation that employing Mosaic augmentation throughout the entire training process could diminish performance. Secondly, the C3 module is substituted with the C2f module. Figure 5 illustrates the structures of the C3 and C2f modules. The C2f module is designed based on the C3 module and incorporates an efficient layer aggregation network, enabling YOLOv8 to capture richer gradient flow information while preserving lightweight performance. Additionally, an anchor-free mechanism is introduced, which directly predicts object centers instead of relying on offsets from predefined anchor boxes. This enhancement effectively reduces the number of candidate boxes, thereby accelerating the non-maximum suppression post-processing step [25]. Lastly, these modifications are applied to the model’s backbone network, key building blocks, and fusion layers, rendering the model more compact and efficient.
We further addressed the need for real-time detection and positioning by introducing a tracker built on the object detection algorithm to resolve the association problem of trunks in video streams. Currently, multiple object tracking (MOT) methods involve two primary functions: object detection and Re-Identification (Re-ID). The detector is employed to identify object positions in each frame, while Re-ID is used to establish associations between the object and its positions in previous frames. As illustrated in Figure 6a, this research adopts a modular design and incorporates tracking modules, which primarily include the Hungarian matching algorithm, Kalman filtering, and tracking management for state updates [26]. The Hungarian matching algorithm addresses the cascading matching issues between consecutive frames, while the Kalman filter provides predictions for the trunk’s expected position and velocity. In this framework, the ByteTrack algorithm utilizes the intersection-over-union (IoU) between the detection box and the Kalman prediction box as the cost function, ensuring seamless continuity of trunks across frames [27]. As depicted in Figure 6b, ByteTrack incorporates a mechanism that retains low-confidence detection boxes. This is primarily accomplished by classifying detection boxes into high-confidence and low-confidence categories, seamlessly integrating high-confidence boxes into trajectories, while re-matching low-confidence boxes with unmatched tracking entities to re-establish trajectories. This approach effectively minimizes the frequency of ID switches in complex scenarios, thereby mitigating challenges such as missed boxes caused by occlusion and motion blur.
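The IoU cost construction and Hungarian assignment described above can be illustrated in a few lines. This is a minimal sketch of the association step only (no Kalman prediction, confidence tiers, or track management), with hypothetical function names and an assumed IoU gate of 0.3:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(dets: np.ndarray, preds: np.ndarray) -> np.ndarray:
    """Pairwise IoU between detection boxes and Kalman-predicted boxes,
    with boxes given as (x1, y1, x2, y2)."""
    ious = np.zeros((len(dets), len(preds)))
    for i, d in enumerate(dets):
        for j, p in enumerate(preds):
            x1, y1 = max(d[0], p[0]), max(d[1], p[1])
            x2, y2 = min(d[2], p[2]), min(d[3], p[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_d = (d[2] - d[0]) * (d[3] - d[1])
            area_p = (p[2] - p[0]) * (p[3] - p[1])
            ious[i, j] = inter / (area_d + area_p - inter)
    return ious

def associate(dets, preds, iou_thresh=0.3):
    """Hungarian assignment on a (1 - IoU) cost matrix; pairs whose IoU
    falls below the gate are discarded rather than matched."""
    cost = 1.0 - iou_matrix(np.asarray(dets, float), np.asarray(preds, float))
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]
```

Matched pairs keep their track IDs across frames; unmatched detections start new tracks, which is where ByteTrack's two-tier (high/low confidence) re-matching takes over.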
Based on the aforementioned methods, we proposed a dynamic PTA. The algorithm involves setting a single line in the video captured by the camera to detect tree trunks and trigger positioning. The general principle for setting this line is to place a centrally aligned virtual line segment within the detection frame. By monitoring the relative position changes with passing tree trunks, the IDs of the trunks are tracked. During detection, the center points of bounding boxes in two consecutive frames are connected. If this connecting line intersects the detection line, localization is triggered; otherwise, the status remains unchanged. As illustrated in Figure 7, let AB denote the virtual detection line, and let Ci and Ci+1 denote the bounding-box centers in framei and framei+1, which form the trajectory segment CiCi+1. The cross product of vector CiA with vector CiCi+1 and the cross product of vector CiB with vector CiCi+1 are calculated. If the two results have opposite signs, A and B lie on opposite sides of the trajectory line, indicating that the segment CiCi+1 has crossed the detection line AB and localization is triggered. If the results have the same sign, no crossing has occurred and the status remains unchanged. Thus, the position of the tree trunk relative to the detection line can be determined using the cross product method for positioning and triggering (P&T) the signal.
The mathematical expressions for this method are shown in Equations (1) and (2).
$$L: (x_A, y_A)\,(x_B, y_B), \qquad d_1 = \overrightarrow{C_i A} \times \overrightarrow{C_i C_{i+1}}, \qquad d_2 = \overrightarrow{C_i B} \times \overrightarrow{C_i C_{i+1}} \tag{1}$$
$$I_{count} = \begin{cases} 1, & d_1 \cdot d_2 < 0 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
where d1 and d2 are the results of the two cross products, and x and y are the pixel coordinates of each point in the image.
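Equations (1) and (2) can be implemented directly with 2-D cross products. As a robustness addition that goes beyond the paper's stated equations, the sketch below also performs the complementary orientation check (whether Ci and Ci+1 lie on opposite sides of AB), which a full segment-segment intersection test requires; function names are hypothetical:

```python
def cross(o, p, q):
    """2-D cross product of vectors OP and OQ (positive if OQ is CCW of OP)."""
    return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])

def segments_cross(a, b, c_i, c_next) -> bool:
    """PTA trigger test for detection line AB and trajectory C_i -> C_{i+1}.
    d1, d2 follow Eq. (1); the trigger fires when d1*d2 < 0 (Eq. 2).
    d3, d4 additionally check that C_i and C_{i+1} straddle AB, completing
    a standard segment-intersection test (an assumption added here)."""
    d1 = cross(c_i, a, c_next)      # side of A relative to the trajectory
    d2 = cross(c_i, b, c_next)      # side of B relative to the trajectory
    d3 = cross(a, c_i, b)           # side of C_i relative to line AB
    d4 = cross(a, c_next, b)        # side of C_{i+1} relative to line AB
    return d1 * d2 < 0 and d3 * d4 < 0
```

For a vertical detection line and a trunk center moving horizontally past it, the test fires exactly on the frame pair that straddles the line, producing one trigger per tracked ID.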

2.2. Model Deployment, Training, and Evaluation Metrics

This research deployed the relevant training models for the experiment on the desktop workstation and subsequently deployed the trained models on the edge device. The detailed hardware is presented in Table 1.
All models were initialized from pretrained weights to enable transfer learning. We trained for up to 500 epochs using AdamW with an initial learning rate of 0.001. A learning rate schedule with 3.0 warmup epochs was applied. We used early stopping based on the validation mAP50. Other hyperparameters are listed in Table 2.
In the detection section, the research comprehensively evaluates the performance of the trained model through an assessment based on precision (P), recall (R), mean average precision (mAP50), weight size, and detection speed (FPS). P, R, and mAP50 collectively indicate the overall performance of the detector (Equations (3)–(5)), with higher values signifying improved performance. Weight size, as a metric of detector efficiency, indicates that smaller values correspond to higher efficiency [28].
$$\mathrm{mAP}_{50} = \int_0^1 P(R)\,dR \tag{3}$$
$$P = \frac{TP}{TP + FP} \tag{4}$$
$$R = \frac{TP}{TP + FN} \tag{5}$$
In the equations, TP is True Positives (correctly predicted trunks); FP is False Positives (other objects misidentified as trunks); FN is False Negatives (trunks misidentified as other objects).
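Equations (3)–(5) can be sketched as follows. The trapezoidal integration over a sampled precision-recall curve is an illustrative choice (practical mAP implementations typically use interpolated precision), and the function names are assumptions:

```python
import numpy as np

def precision(tp: int, fp: int) -> float:
    """Eq. (4): fraction of predicted trunks that are real trunks."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Eq. (5): fraction of real trunks that were detected."""
    return tp / (tp + fn)

def ap50(recalls, precisions) -> float:
    """Eq. (3): area under the P-R curve via trapezoidal integration,
    assuming recall values are sorted in ascending order over [0, 1]."""
    r = np.asarray(recalls, float)
    p = np.asarray(precisions, float)
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))
```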
In the tracking section, the research evaluates performance using higher-order tracking accuracy (HOTA), multi-object tracking accuracy (MOTA), identification F1 score (IDF1), and ID switches (IDSW) [21]. These metrics are essential for assessing the performance of tracking algorithms, with higher values of HOTA, MOTA, and IDF1 indicating superior algorithm performance, whereas a lower value of IDSW denotes improved tracking accuracy.
Additionally, we introduce two metrics (Equations (6) and (7)), triggering accuracy (Ptrig) and dynamic localization error (Err), to explore the performance of PTA. Theoretically, the model detection error arises from the discrepancy in the horizontal distance between the bounding box width and the trunk contour width. However, because the triggering base of the PTA is situated at the center of the trunk, the Err is defined as the difference between the distance from the center of the bounding box to its edge and the distance from the center of the target to its edge.
$$P_{trig} = T_{trig} / T_{total} \tag{6}$$
In Equation (6), Ttrig is the actual triggered count and Ttotal is the total count.
$$Err = \frac{1}{2}\left|L_1 - L_2\right| \tag{7}$$
In Equation (7), L1 is trunk length (Ground Truth) detected by the laser sensors, L2 is P&T length detected by vision, and Err represents the difference in the length from each signal’s center point to the rising edge.
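Equations (6) and (7) reduce to a few lines; the function names below are hypothetical:

```python
def trigger_accuracy(triggered: int, total: int) -> float:
    """Eq. (6): P_trig = T_trig / T_total."""
    return triggered / total

def localization_error(l1: float, l2: float) -> float:
    """Eq. (7): Err = |L1 - L2| / 2, i.e. half the difference between the
    laser-measured trunk length L1 and the vision-detected P&T length L2,
    which equals the offset between the two signals' center-to-rising-edge
    distances."""
    return abs(l1 - l2) / 2.0
```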

2.3. Test Bench Overview

To verify the dynamic positioning and triggering accuracy of the Trunk-Seek, a detection and positioning test bench was constructed. The structure of the test bench is shown in Figure 8, consisting of a camera (RealSense D455, Intel Co., Santa Clara, CA, USA), an edge device (Jetson Orin NX, NVIDIA Co., Santa Clara, CA, USA), an encoder, a conveyor belt, tissue tubes (as the target), laser sensors (A0620, Suzhou Guhan Mechanical Technology Co., Suzhou, China, with response time < 10 ms, beam spacing of 20 mm, and normally open), and an oscilloscope (TDS2000, Tektronix Inc., Beaverton, OR, USA). The RealSense D455 and Jetson Orin NX were used to detect the target, while laser sensors mounted on both sides of the conveyor belt also detected the target. The oscilloscope was employed to acquire detection signals from the RealSense D455 and laser sensors, while the encoder provided real-time feedback on the conveyor belt speed. The conveyor belt could operate at a set speed, with an adjustable range of 0 to 1 m/s. The laser sensors installed on both sides of the conveyor belt were positioned parallel to the RealSense D455. When a target blocked the beam, the laser sensors output a 12 V high-level signal; otherwise, a low-level signal was generated and recorded by the oscilloscope. Meanwhile, the RealSense D455, positioned on one side of the conveyor belt, collected video information of the target, while the Trunk-Seek algorithm, deployed on the Jetson Orin NX, performed real-time recognition and positioning of the moving target. The PTA described in Figure 7 used the center point of the target bounding box (Ci(xi, yi)) as the triggering reference. For experimental purposes, the algorithm was modified accordingly: when the virtual positioning line touched one side of the bounding box (Ci(xi2, yi)), a high-level signal was output; conversely, when the other side of the trunk bounding box (Ci(xi1, yi)) moved away from the positioning line, a low-level signal was output. 
These signals were stored in the oscilloscope. Furthermore, the oscilloscope recorded the laser sensor signal and the PTA-generated digital signal on the same time base, eliminating clock drift between devices. A fixed system latency was calibrated by repeatedly passing a target at a known speed and aligning the rising edges of the two signals. The residual offset caused by installation geometry was converted to an equivalent time shift and compensated when computing Err.
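The rising-edge alignment used for latency calibration can be sketched as follows, assuming both signals are sampled on the shared oscilloscope time base at a known interval dt. The threshold value and function names are assumptions:

```python
import numpy as np

def rising_edges(signal: np.ndarray) -> np.ndarray:
    """Sample indices where a thresholded signal transitions low -> high."""
    s = (np.asarray(signal) > 0.5).astype(int)
    return np.where(np.diff(s) == 1)[0] + 1

def calibrate_latency(laser: np.ndarray, vision: np.ndarray, dt: float) -> float:
    """Mean time offset (s) between matched rising edges of the laser
    reference signal and the PTA-generated signal, both sampled every
    dt seconds on the same time base."""
    e_laser, e_vision = rising_edges(laser), rising_edges(vision)
    n = min(len(e_laser), len(e_vision))
    return float(np.mean((e_vision[:n] - e_laser[:n]) * dt))
```

The mean offset obtained from repeated passes at a known speed gives the fixed system latency; subtracting it (plus the geometry-derived time shift) from each measurement isolates the dynamic Err.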

2.4. Comprehensive Validation

In the validation, the research divides the trunk detection model validation into five components: (1) conducting experiments using the YOLOv5n, YOLOv6n, YOLOv8n, and YOLOv11n models for detection; (2) based on the detection model selected from the previous experiment, employing SORT, Strong-SORT, BoT-SORT, and ByteTrack for tree trunk tracking analysis; (3) using the object’s ID in combination with the PTA to validate the P&T of tree trunks; (4) deploying the Trunk-Seek on the test bench to evaluate the Ptrig and Err of the PTA; and (5) testing the tracking results of the Trunk-Seek on video streams and conducting practical validation of the transferred model in an orchard environment.

2.4.1. Object Detection Model Confirmation

To compare the detection performance of the YOLO nano series on the Dataset, the researchers trained four network models—YOLOv5n, YOLOv6n, YOLOv8n, and YOLOv11n—on the Dataset. Model evaluation metrics, including mAP50 in trunk recognition, weight size, and average frame processing time, were employed for a comprehensive assessment to select the most suitable transfer learning model.

2.4.2. Tracking Algorithm Analysis

To enhance the accuracy and stability of object tracking and to determine positional changes of tree trunks across consecutive frames, this research implemented the tracking algorithms SORT [27], StrongSORT [29], BoT-SORT [30], and ByteTrack [26]. Two sets of experiments were designed based on Unoccluded and Occlusion orchard scenarios, and dynamic data for MOT were generated using the Dark-Label v2.4 annotation software [31]. The TrackEval framework [32] was employed to assess metrics including higher-order HOTA, MOTA, IDF1, IDSW, and overall Ptrig.

2.4.3. The PTA Evaluation

Based on the test bench, we designed an experiment to assess the impact of target diameter and motion speed on the model’s Err. By adjusting the conveyor belt speed, three speed levels were established: 0.3 m/s, 0.5 m/s, and 0.7 m/s. Additionally, tissue tubes with diameters of 4 cm, 7 cm, and 10 cm were utilized as tree trunk targets. Targets were placed at intervals of 1.5 m along the conveyor belt, moving in conjunction with the belt. Once the conveyor belt stabilized, output signals from the laser sensor and detection model were recorded using an oscilloscope. The direction of the conveyor belt’s motion was defined as the positive direction.

2.4.4. Orchard Validation

To verify the Ptrig of Trunk-Seek under orchard conditions, the research also designed offline detection and orchard validation experiments. The offline detection experiment utilized video streams of tree trunks under various lighting conditions, collected by the Forestry Research Institute of the Beijing Academy of Forestry and Agricultural Sciences, as the experimental material. The videos were input into the trunk detection model, and the offline detection accuracy of orchard tree trunks was statistically analyzed. Subsequently, to explore the model’s generalization ability in non-prioritized scenes, this research deployed the experimental system on a tractor and conducted online validation in another orchard.

3. Results and Discussion

3.1. Analysis of Detection Model Results

Table 3 presents the results of various models on the Original and Combined Datasets. The evaluation metrics include precision (P), recall (R), mean average precision (mAP50), weight size, and frames per second (FPS). The mAP50 represents the area under the precision-recall curve, ranging from 0 to 1, with higher values indicating superior model performance. Thus, mAP50 serves as a more comprehensive metric for assessing model accuracy [28]. Overall, within the same Dataset, the differences in mAP50 scores are not significant, suggesting comparable performance across models. When comparing the average mAP50 scores of the Original and Combined groups (0.930 and 0.974, respectively), models in the Combined group exhibited an improvement of 4.4 percentage points over the Original group, clearly demonstrating enhanced accuracy through data augmentation. Since this research utilized YOLO nano series models for training, these models maintain high accuracy while featuring smaller sizes. Models trained on the Combined Dataset exhibited smaller sizes compared to those trained on the Original Dataset, and these results satisfy the deployment requirements for an edge device [14]. Additionally, the FPS for each model was recorded, revealing that in the Original Dataset group, all models except YOLOv11n achieved FPS values greater than 30. Models trained on the Combined Dataset showed a slight decrease in FPS. Specifically, YOLOv6n and YOLOv8n recorded 37.77 FPS and 38.11 FPS, respectively. After deploying the models on an edge device, FPS decreased due to computational resource limitations, ranging from a maximum of 32.53 FPS to a minimum of 25.74 FPS. Since the fixed frame rate of the RealSense D455 used in the experiment is 30 FPS, the FPS of all models on the edge device is compatible with the camera, supporting real-time applications [22]. We selected YOLOv8n as the detector considering a trade-off among accuracy, runtime on the edge device, and P&T stability.
Compared with other nano models, YOLOv8n achieved the best mAP50 on the Combined dataset while maintaining real-time inference on Jetson Orin NX compatible with the camera frame rate. In addition, YOLOv8n adopts an anchor-free design and an efficient feature aggregation module, which reduces post-processing overhead and facilitates stable deployment on embedded platforms.
Figure 9 illustrates the detection results of various models on the Combined Dataset. These images encompass 3 to 4 trees within the field of view, with some trees affected by leaf occlusion and varying lighting conditions. In comparison with Original images, all YOLO nano series models displayed bounding boxes, but YOLOv8n and YOLOv11n outperformed YOLOv5n and YOLOv6n in confidence scores. In the Motion Blur evaluation, the YOLO nano series mitigated the adverse effects of motion blur, with YOLOv8n and YOLOv11n demonstrating superior confidence scores compared to YOLOv5n and YOLOv6n. In the Copy Cut operation, two gray rectangular masks were randomly added to the image, with one obscuring a tree. In Figure 9(c1), YOLOv5n exhibited a missed detection, and the overall confidence scores of bounding boxes in the occluded images decreased. Figure 9(d1–d4) depicts an image captured under backlighting conditions, and after the Brightness Adjustment operation, the brightness was reduced by 30%. The YOLO nano series algorithms successfully detected the tree trunks in these conditions. Figure 9(e1–e4) was subjected to a Convolution operation to enhance the distinctness of object contours, but missed detections were observed in the results for YOLOv5n and YOLOv11n.

3.2. Tracking Algorithm Evaluation

Figure 10 shows the object tracking visualization results for the SORT, StrongSORT, BoT-SORT, and ByteTrack algorithms under Unoccluded and Occlusion orchard scenarios. These algorithms display information including counting, labels, confidence scores, historical trajectories of object centers, and assigned IDs. In the Unoccluded orchard video, 72 trunks appeared, whereas in the Occlusion orchard video, 40 trunks appeared [21]. Based on this, the research labeled 2080 and 1711 consecutive frames, along with 72 and 40 ground truth IDs, using the Dark-Label V2.4 annotation software. In the Unoccluded orchard experiment, the IDSW scores for SORT, StrongSORT, BoT-SORT, and ByteTrack were 163, 48, 136, and 136, respectively, indicating that StrongSORT performed better in trunk association in this scenario. Meanwhile, in the occluded orchard scenario, the IDSW scores for SORT, StrongSORT, BoT-SORT, and ByteTrack were 80, 95, 99, and 63, respectively, demonstrating that ByteTrack offered more stable ID tracking in occluded environments (Table 4). Additionally, the historical trajectories of object centers for BoT-SORT and ByteTrack were relatively stable, and their predictions of object positions in consecutive frames were more accurate, contributing to higher triggering scores (Figure 10, Table 4).
Table 4 presents the experimental results for the SORT, StrongSORT, BoT-SORT, and ByteTrack trackers. From the perspectives of HOTA, MOTA, IDF1, and IDSW, in the Unoccluded scenario, ByteTrack achieved scores of 95.17%, 94.329%, 96.532%, and 19, demonstrating excellent performance. In the Occlusion scenario, the metrics for HOTA, MOTA, and IDF1 across all tracking algorithms decreased significantly. In comparison, BoT-SORT and ByteTrack exhibited relatively better results for HOTA, MOTA, and IDF1. ByteTrack's IDSW of 9 was the best among the algorithms, indicating that it maintains smooth tracking during the consecutive frame process and alleviates ID switching caused by changes in detection label coordinates due to occlusion. Additionally, in terms of triggering statistics, the results for all algorithms were similar in the Unoccluded scenario. In the Occlusion scenario, BoT-SORT and ByteTrack significantly outperformed SORT and StrongSORT in triggering statistics, indicating that SORT and StrongSORT are less effective at addressing occlusion issues in orchard environments [33]. In both scenarios, BoT-SORT and ByteTrack performed well across performance metrics, but ByteTrack excelled in IDSW, which is crucial for mitigating occlusion issues. Overall, ByteTrack demonstrated significant advantages across the tracking metrics. Considering all factors, this research selected the ByteTrack tracking algorithm to address trunk object association and identification issues in real-world scenarios.

3.3. P&T Results

Figure 11 illustrates the P&T experiment. L1 represents the ground-truth target signal detected by the laser sensors, and L2 represents the trunk signal detected by the PTA. ΔL denotes the installation offset between the laser sensors and the camera (Figure 11d). The P&T Err is defined as the distance from the center of each detected signal to the rising edge, which equals the duration T of the error multiplied by the conveyor belt speed v.
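Under this definition, converting the measured time offset between the two oscilloscope signals into a spatial error is a single unit conversion. A minimal helper, with an illustrative function name:

```python
def pt_error_cm(t_ms: float, v_mps: float) -> float:
    """Spatial P&T error (cm) from the time offset T (ms) between the laser
    ground-truth signal L1 and the PTA trunk signal L2, at belt speed v (m/s)."""
    return t_ms / 1000.0 * v_mps * 100.0
```

For example, the Table 5 entry with T = 141 ms at v = 0.3 m/s corresponds to an Err of 4.23 cm.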
Table 5 presents the experimental statistics for Ptrig and Err of the targets. As target movement speed increased, Err decreased. At speeds of 0.3 m/s and 0.5 m/s, the impact on triggering accuracy was minimal, but at 0.7 m/s, detection accuracy decreased significantly for targets of all diameters. As target diameter increased, the average Err in dynamic detection also increased, while for a fixed diameter, Err gradually decreased with increasing speed. This may be because higher target speeds increased the tracking algorithm's target-association frequency and reduced the relative position difference of the bounding box, thereby partially compensating for the detection error. When the target diameter was 10 cm and the speed was 0.3 m/s, the average Err was the largest, at 4.23 cm; when the diameter was 4.0 cm and the speed was 0.7 m/s, it was the smallest, at 0.58 cm. This research also analyzed the duration T (Equation (8)) of the Err under different target diameters and speeds. The Pearson correlation coefficients showed that d and T were positively correlated (coefficient of 0.563), while v and T were negatively correlated (coefficient of −0.758). A quadratic polynomial regression model for T in terms of d and v was then established, with an R² of 0.996.
From the 3D surface regression model illustrated in Figure 12, it is evident that d is positively correlated with T; thus, larger diameters correspond to longer T values. Speed v is strongly negatively correlated with T; consequently, higher speeds result in shorter T values. This suggests that larger targets are associated with greater errors during movement, whereas faster speeds may enable the detection model to associate targets more rapidly, thereby reducing T.
T = −1.52 + 38.58d − 247.71v − 28.38dv − 1.08d² + 257.92v²
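As a consistency check, the fitted polynomial can be evaluated directly against the bench measurements in Table 5. The sketch below takes the coefficients verbatim from the equation above; the function name and the sample of bench points are illustrative.

```python
def predict_T_ms(d_cm: float, v_mps: float) -> float:
    """Regression estimate of the error duration T (ms) for trunk diameter
    d (cm) and operating speed v (m/s), per the quadratic model above."""
    d, v = d_cm, v_mps
    return (-1.52 + 38.58 * d - 247.71 * v
            - 28.38 * d * v - 1.08 * d ** 2 + 257.92 * v ** 2)

# A few bench conditions from Table 5 (measured T in ms)
bench = {(10.0, 0.3): 141, (7.0, 0.5): 56, (4.0, 0.7): 8}
for (d, v), t_meas in bench.items():
    print(f"d={d} cm, v={v} m/s: predicted {predict_T_ms(d, v):.1f} ms, measured {t_meas} ms")
```

For these points the predictions land within about 1 ms of the measured values, consistent with the reported R² of 0.996.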

3.4. The Results of Orchard Validation

As shown in Table 6, offline validation of the PTA was conducted over a full day in the orchard. The orchard scenes encompassed various conditions, including sunny, cloudy, shaded, and occluded environments, with light intensity ranging from 1.53 × 10⁴ to 9.95 × 10⁴ Lux. Of the 1979 trunks present, the model trained on the Original Dataset detected 1326 trunks with a Ptrig of 67.00%, while the model trained on the Combined Dataset detected 1794 trunks with a Ptrig of 91.08%. The Combined Dataset thus improved the algorithm's Ptrig by 24.08 percentage points, significantly enhancing its robustness.
Figure 13 illustrates the trunk detection results. Under sunny conditions, with high light intensity and bright illumination, the trunk bark exhibits reflections and the contour features are relatively distinct, enabling the detection model to identify trunks accurately. The frames from sunny days were further categorized into sun-facing and backlit conditions: under sun-facing conditions, the model trained on the Combined Dataset detected 245 trunks with a Ptrig of 91.08%, while under backlit conditions it detected 379 trunks with a Ptrig of 92.44%. These statistics indicate that the light variation of sun-facing conditions has a slightly negative impact on validation results. Under cloudy conditions, the lighting is more uniform, with similar color tones, and there is a noticeable texture difference between the tree trunks and the background. Under tree-shade conditions, back-lighting and branch obstruction produce uneven lighting; however, the contours remain clear and the background is distinct, enabling Trunk-Seek to perform effectively. All three conditions demonstrate good detection results. The missed detections occurred where trunks in later growth stages were obstructed by branches and weeds, resulting in a more complex background, and where the trunk and the concrete frame in the orchard exhibited very similar colors and contour features. This was particularly true for some trees in the early fruiting stage, whose thinner trunks appeared as small targets, increasing detection difficulty.
To investigate the Ptrig of Trunk-Seek under dynamic conditions, this research conducted a transfer validation in Jinzhou City, Hebei Province (Figure 14). The experimental system was mounted on a tracked tractor, with the camera capturing images and the Jetson Orin NX processing data in real time. The algorithm assigned a fixed ID to each trunk. When a trunk entered the center of the field of view, Trunk-Seek output the P&T signal, the speed v, and the current time, which were synchronized and saved to a TXT file. The trunks in the experimental area were marked with fixed numbers to facilitate synchronization of the experimental information. Trunk diameters, measured with a caliper, served as the ground truth for validation; each trunk was measured three times and the average value recorded. The Err was then calculated.
Figure 15 shows the results of the Trunk-Seek transfer orchard validation. The experiment involved measurements of 27 trunks at three predefined speed levels (0.26 m/s, 0.41 m/s, and 0.59 m/s). The true average d was 4.28 cm. At a speed of 0.26 m/s, the average detection Err was 1.52 cm; at 0.41 m/s it was 1.34 cm, and at 0.59 m/s it was 0.69 cm. The detection model's average Err thus decreased with increasing speed, a trend consistent with the conclusions of the test bench experiments. Furthermore, the regression model T from Section 3.3 was employed to predict the Err in trunk detection. The predicted average Err values were 1.69 cm, 1.46 cm, and 0.92 cm at speeds of 0.26 m/s, 0.41 m/s, and 0.59 m/s, respectively. Compared against the measured Err values from the orchard, the absolute deviations were 0.17 cm, 0.12 cm, and 0.23 cm. These deviations may be attributed to the image annotation process, since laboratory conditions are more conducive to precise trunk-contour matching within the bounding box; interference from the outdoor environment likely also affected the detection model's performance. Moreover, the deviations are minor relative to the precision requirements for positioning trunks with diameters from 4.0 to 11.0 cm. Therefore, the trunk-diameter and speed statistics provided by the model T can be used to estimate detection and localization errors in real time, guiding the future development of target-oriented fertilization systems [34].
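The real-time Err estimate described above amounts to Err = T(d, v) · v, with a unit conversion from ms·(m/s) to cm. A short sketch, using the regression coefficients from Section 3.3 and the measured mean diameter of 4.28 cm (function name illustrative):

```python
def predict_err_cm(d_cm: float, v_mps: float) -> float:
    """Estimated localization error (cm): regression T (ms) times speed v (m/s)."""
    t_ms = (-1.52 + 38.58 * d_cm - 247.71 * v_mps - 28.38 * d_cm * v_mps
            - 1.08 * d_cm ** 2 + 257.92 * v_mps ** 2)
    return t_ms * v_mps / 10.0  # ms * (m/s) gives mm; /10 converts to cm

for v in (0.26, 0.41, 0.59):
    print(f"v = {v} m/s -> predicted Err = {predict_err_cm(4.28, v):.2f} cm")
```

The computed values agree with the reported predictions of 1.69, 1.46, and 0.92 cm to within about 0.01 cm of rounding.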

3.5. Discussion

Overall, this research focused on spindle-shaped tree trunks in orchard environments, achieving real-time detection and localization, and investigated the impact of trunk diameter and operating speed on the model's detection and positioning performance. Through a revised data augmentation strategy and a comparative analysis of models, mAP50 was improved by 6.2%, leading to the selection of the YOLOv8n architecture for the implementation pipeline. The selected model was deployed on a Jetson Orin NX platform, where the inference time stabilized at an average of 32.53 milliseconds per frame, satisfying real-time detection requirements. Building on the principles of detection and positioning, various combinations of detection and tracking algorithms were implemented, culminating in the proposed PTA. Analysis revealed a positive correlation between d and T and a negative correlation between v and T. These relationships informed a quadratic polynomial regression model for T which, together with the relation T = Err/v, can be utilized for time-delay compensation in target-oriented fertilization systems within orchards.
Currently, real-time detection methods in practical applications predominantly rely on algorithmic optimization. Common strategies include enhancing backbone feature-extraction networks and incorporating attention mechanisms, which aim to improve detection speed and mitigate the accuracy loss caused by unstructured environments. In line with this approach, we integrated lightweight backbone networks, including GhostNetv3, StarNet, MobileNetv4, and FasterNet [22], and subsequently incorporated attention mechanisms such as the Convolutional Block Attention Module (CBAM), Efficient Multi-scale Attention (EMA), and Squeeze-and-Excitation (SE). Among the configurations tested, the YOLOv8n-GhostNetv3-SE model performed best, with a model size of 3.8 MB, a reduction from the 5.3 MB of the baseline YOLOv8n. Although both models achieved real-time processing on edge devices, the practical value of this lightweight improvement was deemed marginal. Furthermore, the YOLOv8n-GhostNetv3-SE model achieved an mAP50 of 93.33%, which was not a significant improvement over the baseline YOLOv8n. Consequently, the training strategy was altered to employ more extensive data augmentation, designed to enhance the model's robustness to occlusion, varying lighting conditions, and motion blur, as well as its ability to extract trunk contours, ultimately yielding a superior detection model. Compared with the YOLOv8m-vine-classes model (mAP50: 94.4%) reported by Saha and Noguchi [35], our model exhibited a 4.5% improvement in mAP50 and maintained an advantage in weight size. During practical validation in an orchard environment, the model trained on the Combined Dataset achieved an accuracy of 91.08%, a 24.08 percentage-point improvement over the earlier training iterations, further validating the efficacy of the trunk detection model.
The research also conducted an ablation test to quantify the contribution of each augmentation component to YOLOv8n. Starting from the Original Dataset, the Copy Cut, Motion Blur, Brightness Adjustment, and Convolution operations were added incrementally. Relative to the preceding stage, the (Original + Copy Cut), (Original + Copy Cut + Motion Blur), (Original + Copy Cut + Motion Blur + Brightness Adjustment), and Combined datasets achieved mAP50 improvements of 0.97%, 3.7%, 1.1%, and 0.7%, respectively. Motion Blur boosted model accuracy most markedly, particularly in dynamic environments. Overall, as data augmentation was progressively applied, the model's performance metrics (P (%), R (%), mAP50) showed consistent improvement.
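A Motion Blur augmentation of the kind ablated above can be approximated by convolving each image row with a normalized horizontal box kernel, mimicking camera motion along the direction of travel. This is a hedged, grayscale-only sketch; the paper's exact kernel shape and length are not specified here.

```python
import numpy as np

def motion_blur(img, k=9):
    """Horizontal motion blur: convolve each row of a 2D grayscale image
    with a normalized 1D box kernel of odd length k (edge-padded so the
    output keeps the input's shape)."""
    kernel = np.ones(k) / k
    pad = k // 2
    padded = np.pad(img.astype(float), ((0, 0), (pad, pad)), mode="edge")
    out = np.empty(img.shape, dtype=float)
    for r in range(img.shape[0]):
        out[r] = np.convolve(padded[r], kernel, mode="valid")
    return out

# Demo: a vertical edge (trunk boundary) is smeared horizontally
img = np.zeros((8, 20))
img[:, 10:] = 1.0
blurred = motion_blur(img, k=5)
```

Larger k simulates faster travel speeds or longer exposure times, which is consistent with the observed accuracy drop at 0.7 m/s.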
Correspondingly, a series of tracker evaluations was conducted, leading to the selection of ByteTrack as the tracking module owing to its superior performance. Originally designed for multi-object pedestrian tracking, ByteTrack effectively mitigates trunk occlusion issues [26], a capability that aligns well with the technical requirements of trunk positioning and triggering. This research represents an innovative application of ByteTrack in orchard environments, thereby expanding its scope of application. To evaluate dynamic P&T precision, an integrated algorithm combining object detection, tracking, and the PTA was designed. The P&T precision remained relatively stable at speeds up to 0.5 m/s; however, further increases in speed caused triggering accuracy to decline. Notably, for a trunk diameter of 10 cm at 0.7 m/s, the Ptrig dropped to 84%, consistent with the findings of Zhai et al. [28]. This phenomenon can be attributed to motion blur in the trunk contour features induced by the camera's exposure time at high speeds (e.g., 0.7 m/s); the detection model consequently extracted fewer effective features, impairing recognition and ultimately reducing accuracy.
Nevertheless, this research has certain limitations. On one hand, lighting changes are the primary factor degrading system performance: lighting variations (such as specular reflections) reduce the contrast between tree trunks and backgrounds, which may decrease detection confidence. On the other hand, while the detection–tracking–PTA workflow proposed in this work performs well when trunks are clearly visible within the camera's field of view, performance may decline for heavily occluded training structures (e.g., open-center or trellis cultivation systems) or for trunks with significantly different geometric and textural characteristics, owing to complex backgrounds and partial trunk visibility. Future work will therefore focus on optimizing the system through hardware stabilization and multi-sensor fusion, while employing more representative datasets combined with data augmentation and transfer learning to enhance detector performance and maintain PTA recognition stability.

4. Conclusions

To achieve efficient trenching and target-oriented fertilization in spindle-shaped orchards, this research proposes Trunk-Seek, a real-time vision-based model for tree trunk detection, tracking, and dynamic positioning. Unlike conventional orchard perception methods that focus primarily on frame-level detection accuracy, Trunk-Seek is designed as a trigger-oriented perception framework that can directly support fertilization control actions during dynamic field operations.
First, the core of Trunk-Seek lies in a robustness-driven data augmentation strategy designed for orchard-specific visual degradation factors. Through targeted augmentation, the training strategy improves the detection performance of YOLOv8n by 6.2% in mAP50, significantly enhancing the model's generalization under complex field conditions. Second, the detection model is deployed on an edge device and achieves real-time inference with a stable processing time of 32.53 ms per frame, meeting the requirements for on-board real-time operation and ensuring compatibility with agricultural cameras and control systems. Third, Trunk-Seek integrates object detection, tracking, and the PTA into a unified pipeline. By evaluating multiple trackers in both unoccluded and occluded orchard scenarios, ByteTrack is identified as the optimal solution for maintaining trunk identity continuity; this trigger-oriented integration transforms continuous visual perception into reliable, discrete fertilization trigger events. Finally, a data-driven dynamic error and delay model is constructed through bench tests under varying trunk diameters and operating speeds. A quadratic polynomial regression model for the time-delay parameter T is derived, providing a quantitative basis for online estimation and compensation of localization errors in practical fertilization systems. Field validation in orchards demonstrates that Trunk-Seek achieves a triggering accuracy of 91.08%, and successful transfer experiments confirm its applicability to heterogeneous orchard environments. In future work, Trunk-Seek will be integrated with the fertilization control module to build a comprehensive intelligent target-oriented fertilization system, enabling closed-loop verification of vision perception and agricultural machinery control.

Author Contributions

Resources, funding acquisition, supervision, C.Z., X.W. and L.C.; writing, formal analysis, conceptualization, K.Z.; validation, investigation, formal analysis, S.Y., Z.W. and W.Z.; writing—review and editing, H.F. All authors have read and agreed to the published version of the manuscript.

Funding

Support was provided by (1) The National Key Research and Development Program of China (2022YFD2001402); (2) The Yunnan Provincial Science and Technology Department (202302AE0900200202); (3) The China Agriculture Research System (CARS-30-4-01); (4) The Program for Cultivating Outstanding Scientists of Beijing Academy of Agriculture and Forestry Sciences (JKZX202212).

Data Availability Statement

The data presented in this study are available upon request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, C.; Valente, J.; Kooistra, L.; Guo, L.; Wang, W. Orchard Management with Small Unmanned Aerial Vehicles: A Survey of Sensing and Analysis Approaches. Precis. Agric. 2021, 22, 2007–2052. [Google Scholar] [CrossRef]
  2. Bai, Q.; Luo, H.; Fu, X.; Zhang, X.; Li, G. Design and Experiment of Lightweight Dual-Mode Automatic Variable-Rate Fertilization Device and Control System. Agriculture 2023, 13, 1138. [Google Scholar] [CrossRef]
  3. You, A.; Parayil, N.; Krishna, J.G.; Bhattarai, U.; Sapkota, R.; Ahmed, D.; Whiting, M.; Karkee, M.; Grimm, C.M.; Davidson, J.R. Semiautonomous Precision Pruning of Upright Fruiting Offshoot Orchard Systems: An Integrated Approach. IEEE Robot. Autom. Mag. 2023, 30, 10–19. [Google Scholar] [CrossRef]
  4. Zahid, A.; Mahmud, M.S.; He, L.; Heinemann, P.; Choi, D.; Schupp, J. Technological Advancements towards Developing a Robotic Pruner for Apple Trees: A Review. Comput. Electron. Agric. 2021, 189, 106383. [Google Scholar] [CrossRef]
  5. Lei, X.; Liu, J.; Jiang, H.; Xu, B.; Jin, Y.; Gao, J. Design and Testing of a Four-Arm Multi-Joint Apple Harvesting Robot Based on Singularity Analysis. Agronomy 2025, 15, 1446. [Google Scholar] [CrossRef]
  6. Zhang, L.; Li, M.; Zhu, X.; Chen, Y.; Huang, J.; Wang, Z.; Hu, T.; Wang, Z.; Fang, K. Navigation Path Recognition between Rows of Fruit Trees Based on Semantic Segmentation. Comput. Electron. Agric. 2024, 216, 108511. [Google Scholar] [CrossRef]
  7. Zheng, K.; Yang, S.; Gao, Y.; Wang, X.; Wang, J.; Song, S.; Zhai, C.; Chen, L. Numerical Simulation and Optimization Design of a Novel Longitudinal-Flow Online Fertilizer Mixing Device. Comput. Electron. Agric. 2025, 237, 110546. [Google Scholar] [CrossRef]
  8. Sun, J.; Chen, Z.; Song, R.; Fan, S.; Han, X.; Zhang, C.; Wang, J.; Zhang, H. An Intelligent Self-Propelled Double-Row Orchard Trenching and Fertilizing Machine: Modeling, Evaluation, and Application. Comput. Electron. Agric. 2025, 229, 109818. [Google Scholar] [CrossRef]
  9. Liu, H.; Wang, L.; Shi, Y.; Wang, X.; Chang, F.; Wu, Y. A Deep Learning-Based Method for Detecting Granular Fertilizer Deposition Distribution Patterns in Centrifugal Variable-Rate Spreader Fertilization. Comput. Electron. Agric. 2023, 212, 108107. [Google Scholar] [CrossRef]
  10. Ozdarici-Ok, A.; Ok, A.O. Using Remote Sensing to Identify Individual Tree Species in Orchards: A Review. Sci. Hortic. 2023, 321, 112333. [Google Scholar] [CrossRef]
  11. Tang, Y.; Qiu, J.; Zhang, Y.; Wu, D.; Cao, Y.; Zhao, K.; Zhu, L. Optimization Strategies of Fruit Detection to Overcome the Challenge of Unstructured Background in Field Orchard Environment: A Review. Precis. Agric. 2023, 24, 1183–1219. [Google Scholar] [CrossRef]
  12. Huang, Y.; Qian, Y.; Wei, H.; Lu, Y.; Ling, B.; Qin, Y. A Survey of Deep Learning-Based Object Detection Methods in Crop Counting. Comput. Electron. Agric. 2023, 215, 108425. [Google Scholar] [CrossRef]
  13. Gongal, A.; Amatya, S.; Karkee, M.; Zhang, Q.; Lewis, K. Sensors and Systems for Fruit Detection and Localization: A Review. Comput. Electron. Agric. 2015, 116, 8–19. [Google Scholar] [CrossRef]
  14. Gao, F.; Fang, W.; Sun, X.; Wu, Z.; Zhao, G.; Li, G.; Li, R.; Fu, L.; Zhang, Q. A Novel Apple Fruit Detection and Counting Methodology Based on Deep Learning and Trunk Tracking in Modern Orchard. Comput. Electron. Agric. 2022, 197, 107000. [Google Scholar] [CrossRef]
  15. Nan, Y.; Zhang, H.; Zeng, Y.; Zheng, J.; Ge, Y. Intelligent Detection of Multi-Class Pitaya Fruits in Target Picking Row Based on WGB-YOLO Network. Comput. Electron. Agric. 2023, 208, 107780. [Google Scholar] [CrossRef]
  16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  19. Cao, Z.; Gong, C.; Meng, J.; Liu, L.; Rao, Y.; Hou, W. Orchard Vision Navigation Line Extraction Based on YOLOv8-Trunk Detection. IEEE Access 2024, 12, 104126–104137. [Google Scholar] [CrossRef]
  20. Brown, J.; Paudel, A.; Biehler, D.; Thompson, A.; Karkee, M.; Grimm, C.; Davidson, J.R. Tree Detection and In-Row Localization for Autonomous Precision Orchard Management. Comput. Electron. Agric. 2024, 227, 109454. [Google Scholar] [CrossRef]
  21. Tu, S.; Huang, Y.; Huang, Q.; Liu, H.; Cai, Y.; Lei, H. Estimation of Passion Fruit Yield Based on YOLOv8n + OC-SORT + CRCM Algorithm. Comput. Electron. Agric. 2025, 229, 109727. [Google Scholar] [CrossRef]
  22. Karim, M.J.; Nahiduzzaman, M.; Ahsan, M.; Haider, J. Development of an Early Detection and Automatic Targeting System for Cotton Weeds Using an Improved Lightweight YOLOv8 Architecture on an Edge Device. Knowl.-Based Syst. 2024, 300, 112204. [Google Scholar] [CrossRef]
  23. Sanchez, P.R.; Zhang, H. Precision Spraying Using Variable Time Delays and Vision-Based Velocity Estimation. Smart Agric. Technol. 2023, 5, 100253. [Google Scholar] [CrossRef]
  24. Zhong, Z.; Zheng, L.; Zheng, Z.; Li, S.; Yang, Y. Camera Style Adaptation for Person Re-Identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  25. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In Proceedings of the Computer Vision—ECCV, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  27. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  28. Zhai, C.; Fu, H.; Zheng, K.; Zheng, S.; Wu, H.; Zhao, X. Establishment and experimental verification of deep learning model for on-line recognition of field cabbage. Trans. Chin. Soc. Agric. Mach. 2022, 53, 293–303. [Google Scholar] [CrossRef]
  29. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make DeepSORT Great Again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  30. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  31. DarkLabel: Video/Image Labeling and Annotation Tool. Available online: https://github.com/darkpgmr/DarkLabel (accessed on 5 August 2025).
  32. TrackEval: HOTA (and Other) Evaluation Metrics for Multi-Object Tracking (MOT). Available online: https://github.com/JonathonLuiten/TrackEval (accessed on 5 August 2025).
  33. Ma, B.; Hua, Z.; Wen, Y.; Deng, H.; Zhao, Y.; Pu, L.; Song, H. Using an Improved Lightweight YOLOv8 Model for Real-Time Detection of Multi-Stage Apple Fruit in Complex Orchard Environments. Artif. Intell. Agric. 2024, 11, 70–82. [Google Scholar] [CrossRef]
  34. Pinto de Aguiar, A.S.; Neves dos Santos, F.B.; Feliz dos Santos, L.C.; de Jesus Filipe, V.M.; Miranda de Sousa, A.J. Vineyard Trunk Detection Using Deep Learning—An Experimental Device Benchmark. Comput. Electron. Agric. 2020, 175, 105535. [Google Scholar] [CrossRef]
  35. Saha, S.; Noguchi, N. Smart Vineyard Row Navigation: A Machine Vision Approach Leveraging YOLOv8. Comput. Electron. Agric. 2025, 229, 109839. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of trenching fertilization principle by vision detection.
Figure 2. Framework for Data Collection, Pre-processing and Training.
Figure 3. The Trunk Data Collection Process.
Figure 4. Customized Data Preparation.
Figure 5. YOLOv8n network architecture.
Figure 6. Implementation Mechanism of Trackers and ByteTrack Algorithm. (a) Tracking diagram; (b) ByteTrack diagram.
Figure 7. Schematic diagram of principle for Positioning and Triggering Algorithm (PTA).
Figure 8. Test Bench.
Figure 9. Detection Results in Data Augmentation. (a1) YOLOv5n-Original; (a2) YOLOv6n-Original; (a3) YOLOv8n-Original; (a4) YOLOv11n-Original; (b1) YOLOv5n-Motion Blur; (b2) YOLOv6n-Motion Blur; (b3) YOLOv8n-Motion Blur; (b4) YOLOv11n-Motion Blur; (c1) YOLOv5n-Copy Cut; (c2) YOLOv6n-Copy Cut; (c3) YOLOv8n-Copy Cut; (c4) YOLOv11n-Copy Cut; (d1) YOLOv5n-Brightness Adjustment; (d2) YOLOv6n-Brightness Adjustment; (d3) YOLOv8n-Brightness Adjustment; (d4) YOLOv11n-Brightness Adjustment; (e1) YOLOv5n-Convolution; (e2) YOLOv6n-Convolution; (e3) YOLOv8n-Convolution; (e4) YOLOv11n-Convolution.
Figure 10. Tracking Visualizations of SORT, StrongSORT, BoT-SORT, and ByteTrack. (a1) Unocclude_SORT; (a2) Unocclude_Strong-SORT; (a3) Unocclude_BoT-SORT; (a4) Unocclude_ByteTrack; (b1) Occlusion_SORT; (b2) Occlusion_Strong-SORT; (b3) Occlusion_BoT-SORT; (b4) Occlusion_ByteTrack.
Figure 11. Positioning & Triggering (P&T) Experiment. (a) The Sensors; (b) The Cylinders; (c) Test Bench; (d) The Oscillograph; (e) The Screenshot of the Oscillograph.
Figure 12. The 3D Surface Response Diagram of T.
Figure 13. The Dynamic Detection Results under Different Scenarios. (a) Sunny scene; (b) Cloudy scene; (c) Shade scene; (d) Scene of missed detection.
Figure 14. The Trunk-Seek Transfer Orchard Validation. (a) The Measurement of ground truth; (b) The Validation of detection; (c) The Screen Shot of validation.
Figure 15. The Err at different operation speeds. (a) 0.59 m/s; (b) 0.41 m/s; (c) 0.26 m/s.
Table 1. Workstation Specifications.
Configuration             Parameters
CPU                       Intel Xeon Gold 5218R
GPU                       NVIDIA RTX 3090
Operating system          Windows 10 Pro
GPU computing platform    CUDA 11.6
Library                   PyTorch 1.13.1
Table 2. Configuration of Model Training Parameters.
Hyper Parameters    Values
Optimizer           AdamW
Learning rate       0.001
Momentum            0.937
Weight decay        0.0005
Batch size          16
Epochs              500
Table 3. Comparison of Detection Model Performance.
Model      Dataset    P (%)   R (%)   mAP50   Weight Size (MB)   Inference Time (ms)   Inference Time on Edge Device (ms)
YOLOv5n    Original   90.9    89.5    0.931   5.30               30.43                 -
YOLOv6n    Original   89.9    88.8    0.932   8.70               38.01                 -
YOLOv8n    Original   88.0    88.1    0.927   6.23               38.03                 -
YOLOv11n   Original   89.1    88.9    0.933   5.52               29.71                 -
YOLOv5n    Combined   95.2    92.0    0.971   5.18               32.78                 27.21
YOLOv6n    Combined   97.6    92.8    0.961   8.32               37.77                 31.93
YOLOv8n    Combined   97.8    95.8    0.989   6.01               38.11                 32.53
YOLOv11n   Combined   98.2    95.0    0.974   5.23               30.41                 25.74
Table 4. Tracker experimental Results.
Scene        Method        HOTA (%) ↑   MOTA (%) ↑   IDF1 (%) ↑   IDSW ↓   Triggering/Total
Unoccluded   SORT          71.883       -            88.768       163      66/72
             Strong-SORT   74.896       89.433       92.362       48       68/72
             BoT-SORT      93.977       94.033       96.281       136      70/72
             ByteTrack     95.17        94.329       96.532       136      70/72
Occlusion    SORT          40.525       -            45.624       80       35/40
             Strong-SORT   37.897       38.104       38.79        95       32/40
             BoT-SORT      51.241       39.266       68.407       99       37/40
             ByteTrack     51.274       41.896       68.279       63       39/40
Table 5. The Err Statistics of PTA.
d (cm)   v (m/s)   Triggering/Total   Ptrig (%)   Err (cm)   T (ms)
4.0      0.3       50/50              100         1.42       47
4.0      0.5       47/50              94          1.18       24
4.0      0.7       44/50              88          0.58       8
7.0      0.3       48/50              96          3.24       108
7.0      0.5       48/50              96          2.80       56
7.0      0.7       44/50              88          1.97       28
10.0     0.3       49/50              98          4.23       141
10.0     0.5       49/50              98          3.60       72
10.0     0.7       42/50              84          2.37       34
Table 6. The Orchard Validation Results.
Group   Dataset    Light Intensity (×10⁴ Lux)   Count/Total   Ptrig (%)
1       Original   1.53–9.95                    1326/1979     67.00
2       Combined   1.53–9.95                    1794/1979     91.08