Article

Visual Perception and Robust Autonomous Following for Orchard Transportation Robots Based on DeepDIMP-ReID

1 State Key Laboratory of Efficient Utilization of Arid and Semi-Arid Arable Land in Northern China, Institute of Agricultural Resources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing 100086, China
2 School of Mechanical and Electrical Engineering, Soochow University, Suzhou 215000, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Mach. Learn. Knowl. Extr. 2026, 8(2), 39; https://doi.org/10.3390/make8020039
Submission received: 19 January 2026 / Revised: 6 February 2026 / Accepted: 6 February 2026 / Published: 8 February 2026

Abstract

Dense foliage, severe illumination variations, and interference from multiple individuals with similar appearances in complex orchard environments pose significant challenges for vision-based following robots in maintaining persistent target perception and identity consistency, thereby compromising the stability and safety of fruit transportation operations. To address these challenges, we propose a novel framework, DeepDIMP-ReID, which integrates the Deep Implicit Model Prediction (DIMP) tracker with a person re-identification (ReID) module based on EfficientNet. This visual perception and autonomous following framework is designed for differential-drive orchard transportation robots, aiming to achieve robust target perception and reliable identity maintenance in unstructured orchard settings. The proposed framework adopts a hierarchical perception–verification–control architecture. Visual tracking and three-dimensional localization are jointly achieved using synchronized color and depth data acquired from a RealSense camera, where target regions are obtained via the DIMP tracker and refined through an elliptical-mask-based depth matching strategy. Front obstacle detection is performed using DBSCAN-based point cloud clustering. To suppress erroneous following caused by occlusion, target switching, or target reappearance after occlusion, an enhanced HOReID person re-identification module with an EfficientNet backbone is integrated for identity verification at critical decision points. Based on the verified perception results, a state-driven motion control strategy is employed to ensure safe and continuous autonomous following. Extensive long-term experiments conducted in real orchard environments demonstrate that the proposed system achieves a correct tracking rate exceeding 94% under varying human walking speeds, with an average localization error of 0.071 m. In scenarios triggering re-identification, a target discrimination success rate of 93.3% is obtained. These results confirm the effectiveness and robustness of the proposed framework for autonomous fruit transportation in complex orchard environments.

1. Introduction

With the continuous expansion of agricultural production scales and the sustained rise in labor costs, the demand for automation and intelligent equipment in orchard operations has become increasingly urgent. Transportation is a high-frequency, labor-intensive component of orchard production that has long relied on manual labor, resulting in limited efficiency and a considerable physical workload. In recent years, mobile robots, unmanned operating platforms, and intelligent agricultural machinery have attracted significant attention in the agricultural domain and are regarded as key technological solutions for improving operational efficiency and alleviating labor shortages [1,2,3,4]. However, compared with industrial or other structured environments, orchard settings typically have constrained spaces and dense vegetation. They also experience severe illumination variations and dynamic interactions with humans and obstacles. These characteristics pose substantial challenges to stable and reliable autonomous robotic operation in real-world orchard applications [5,6,7,8].
As an important category of intelligent agricultural equipment, orchard transportation robots are commonly required to perform cooperative carrying tasks alongside human workers, which places stringent requirements on the robustness of target perception, identity discrimination, and motion control. Existing studies have explored orchard transportation and following tasks, for example, by employing vision-based perception to recognize and follow harvesting workers, thereby preliminarily demonstrating the feasibility of human–robot collaborative transportation in orchard environments [9]. To address the instability of GNSS signals, some studies have introduced multi-sensor fusion schemes incorporating UWB, LiDAR, and odometry to improve localization accuracy [10] or have developed SLAM systems tailored to orchard environments to enhance environmental modeling capabilities [11,12]. Although these approaches have achieved certain improvements in localization and navigation accuracy, their relatively high hardware costs and system complexity significantly limit large-scale deployment on lightweight orchard transportation platforms.
In contrast, vision-based perception methods exhibit strong practical potential in orchard scenarios due to their low cost and deployment flexibility. In recent years, the application of deep learning to target detection, tracking, and environmental understanding has significantly enhanced the perceptual capabilities of visual systems [13,14,15]. Nevertheless, orchard environments commonly involve severe foliage occlusion, background similarity, and dramatic illumination changes, making traditional vision-based tracking methods prone to target loss and erroneous following in multi-person or dynamically disturbed scenarios [9,14]. In particular, when the target temporarily leaves the field of view or reappears after occlusion, tracking algorithms that rely solely on temporal association struggle to preserve identity consistency.
Person re-identification (ReID) techniques offer a promising solution to these challenges. By modeling discriminative identity features, ReID methods enable target identity re-validation after tracking interruptions, thereby enhancing the stability of long-term following tasks [16]. While existing ReID approaches have achieved satisfactory performance on standard pedestrian datasets, their generalization capability in real orchard environments remains limited due to significant differences in illumination conditions, occlusion patterns, and clothing characteristics compared with urban scenes [16,17]. Moreover, integrating ReID mechanisms into real-time following systems requires careful consideration of computational overhead and system responsiveness, as excessive latency may adversely affect real-time performance at critical moments [18,19,20,21].
Regarding motion control and following strategies, most existing studies adopt path-tracking-based or simple feedback control methods to achieve target following [13,22,23]. Although these approaches can provide basic following functionality in structured environments, they are susceptible to accumulated localization errors, target motion uncertainty, and sudden obstacle appearances in space-constrained and dynamically disturbed orchard scenarios [24,25,26]. Some studies have further explored multi-robot cooperation or predictive control strategies [27,28]; however, the associated system complexity often hinders their direct application to single-robot following transportation tasks.
To address the aforementioned challenges, this paper proposes and implements a vision-guided autonomous following system, termed DeepDIMP-ReID (Deep Implicit Model Prediction tracker combined with an EfficientNet-based person re-identification module), for trellised orchard transportation scenarios, as schematically illustrated in Figure 1. The proposed framework employs a RealSense camera as the primary sensor and integrates a robust visual tracking algorithm with depth information matching to achieve real-time three-dimensional localization of the following target. Building upon this perception pipeline, an enhanced person re-identification model is introduced to perform identity verification at critical moments, reducing interference from visually similar individuals and ensuring identity consistency throughout the following process. Meanwhile, obstacle perception and a state-driven motion control strategy are incorporated to enable safe and stable robot following in complex and dynamically changing orchard environments.
The main contributions of this work are summarized as follows. First, a vision-guided following system tailored for trellised orchards is developed, achieving tight integration of target localization, identity discrimination, obstacle perception, and motion control. Second, to address occlusion and multi-person interference in orchard environments, a person re-identification mechanism is incorporated into the following state to enhance robustness against target disappearance and reappearance. Finally, extensive long-term field experiments conducted in real orchard environments systematically validate the effectiveness and engineering feasibility of the proposed approach under complex operating conditions, providing a deployable technical solution for autonomous following in orchard transportation tasks.

2. Materials and Methods

2.1. Orchard Transport Robot System Architecture

The orchard transportation robot system is primarily composed of four components. These include: (1) a perception sensor, represented by a RealSense depth camera (Intel Corporation, Santa Clara, CA, USA); (2) controllers, consisting of a high-performance industrial computer (upper-level controller) running Ubuntu 22.04 with ROS 2 framework, and an on-board controller (lower-level controller); (3) motion execution devices, including motor drivers and servo motors integrated on a mobile chassis platform (Shihe Robotics Co., Ltd., Hefei, Anhui, China); and (4) fruit placement devices, composed of a support frame and a fruit basket (Figure 2).
To support subsequent experiments on target tracking and localization accuracy, as well as to acquire trajectories for analysis, a 16-line LiDAR (RoboSense Technology Co., Ltd., Shenzhen, China) and an RTK-GPS module (Shanghai Huace Navigation Technology Ltd., Shanghai, China) are additionally installed, as shown in Figure 3d,e. The physical prototype of the orchard transportation robot is presented in Figure 3.

2.2. DeepDIMP-ReID-Based Target Perception and Discrimination

2.2.1. DIMP-Based Stereo Tracking for Target Localization

In orchard environments, illumination conditions vary significantly, and multiple targets may appear within the sensor’s field of view. Moreover, the visual appearance of the tracking object can easily deform due to motion. These factors demand that single-object visual tracking algorithms exhibit strong robustness. When selecting the baseline single-object tracking algorithm, several commonly used deep-learning-based trackers are evaluated on the NFS [29], OTB-100 [30], and UAV123 [31] datasets, using the AUC [30] metric for performance comparison. The results are summarized in Table 1.
As shown in Table 1, the DIMP algorithm demonstrates strong performance across all three datasets. Therefore, DIMP is adopted as the visual tracking algorithm for the transportation robot, using the officially pre-trained model weights.
As illustrated in Figure 4, the DIMP algorithm performs visual tracking of the following target. After the initial frame is manually annotated (red bounding box), DIMP predicts the target’s bounding box (green) in subsequent frames, thereby achieving continuous visual localization of the following target.
After obtaining the bounding box of the target in each frame, depth matching within the corresponding region of the depth image is required to estimate the target’s depth. However, in outdoor environments, the RealSense camera is sensitive to strong illumination, which often results in missing depth information in local regions. Under such conditions, directly selecting the depth value at the center pixel of the target bounding box may yield zero or invalid values, leading to target localization failures, as illustrated in Figure 5 and Figure 6.
To address this issue, a robust depth matching method based on an elliptical mask is proposed. The elliptical mask is geometrically better aligned with the human body contour, effectively reducing the influence of background regions on depth estimation. Specifically, an elliptical mask corresponding to the target bounding box is first generated on the depth image (Figure 7). Depth pixels within the mask are then filtered to remove invalid zero-value points, and the median of the remaining valid depth pixels is used as the target’s depth estimate, thereby improving the stability and robustness of depth estimation.
In implementation, the center coordinates (x, y) of the target bounding box are obtained from the RGB image based on the bounding box predicted by the DIMP algorithm. The depth and RGB images are spatially aligned using the RealSense SDK to ensure pixel-level correspondence. An elliptical mask is constructed on the aligned depth image with (x, y) as the center, and the ellipse parameters are calculated according to Equations (1) and (2), where a and b denote the minor and major axes, respectively, and width and height represent the dimensions of the target bounding box.
a = width/8, (1)
b = height/8, (2)
A new depth image, new_depth, is obtained by performing a bitwise AND operation between the original depth image and the generated mask (Equation (3)), where depth is the original depth image and mask is the elliptical mask. Depth values within the elliptical region are then extracted and collected into a depth set. Zero-value pixels are removed to handle missing depth information, and the median of the remaining valid depth values is taken as the depth measurement of the target’s center.
new_depth = depth ⊙ mask, (3)
where ⊙ denotes element-wise multiplication between the depth image and the elliptical mask.
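The masking-and-median procedure above can be sketched in Python (NumPy only). The bounding-box format and the zero-as-invalid depth convention are assumptions; a real pipeline would use the aligned depth frame from the RealSense SDK.

```python
import numpy as np

def estimate_target_depth(depth, bbox):
    """Median depth (in depth-image units) inside an elliptical mask
    fitted to the target bounding box.

    depth : HxW array aligned to the RGB frame, with 0 marking missing pixels.
    bbox  : (x, y, width, height) of the tracker prediction (assumed format).
    """
    x, y, width, height = bbox
    cx, cy = x + width / 2.0, y + height / 2.0            # bounding-box center
    a, b = max(width / 8.0, 1.0), max(height / 8.0, 1.0)  # Eqs. (1)-(2)
    rows, cols = np.indices(depth.shape)
    mask = ((cols - cx) / a) ** 2 + ((rows - cy) / b) ** 2 <= 1.0
    new_depth = depth * mask                 # Eq. (3): element-wise masking
    valid = new_depth[new_depth > 0]         # drop invalid zero-depth pixels
    return float(np.median(valid)) if valid.size else None
```

Taking the median rather than the center pixel is what makes the estimate robust to locally missing depth under strong illumination.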
After estimating the target’s depth, its pixel coordinates are mapped to the robot coordinate system using the pinhole camera model and the intrinsic and extrinsic parameters of the RGB-D camera, achieving three-dimensional spatial localization. Figure 8 illustrates the tracking and localization results. By transforming between the pixel and robot coordinate systems, the robot can continuously acquire the relative position of the following target in real time.

2.2.2. DBSCAN-Based Obstacle Perception

In orchard environments, common obstacles encountered during the following process include four categories: fruit crates, humans, fruit trees, and buildings. This study focuses on perceiving these four types of obstacles. To prevent the following robot from colliding with obstacles, it is necessary not only to perceive the following target but also to detect obstacles ahead. An emergency stop is triggered if an obstacle is closer than a predefined safety distance threshold.
Traditional methods for detecting nearby obstacles mainly rely on ultrasonic sensors, LiDAR, or vision-based object detection. However, incorporating additional sensors or new detection algorithms increases both cost and computational requirements. To address this, a DBSCAN-based obstacle detection method is proposed. Depth images from the RealSense camera are first used to generate pseudo 2D laser scan data. Points with depth less than 1 m are filtered and clustered using DBSCAN. The number of points in each cluster is then compared with a threshold to determine the presence of obstacles.
The obstacle perception procedure is as follows:
1. Parameter initialization: A 1280 × 720 depth image is sampled at intervals to obtain 640 points. The safety distance threshold is set to 1 m, and the minimum number of points per cluster is set to 20. The sampling range along the y-axis is restricted to 0–600 pixels to discard points near the ground, as illustrated in Figure 9.
2. Pseudo-2D laser point generation: The depth image is cropped according to the y-axis sampling range. For each column, the point with the minimum depth is selected, and its pixel coordinates are converted to the robot coordinate system, forming the pseudo-2D laser point set.
3. DBSCAN clustering: The pseudo-2D laser points are clustered using DBSCAN to obtain all clusters.
4. Obstacle determination: The number of points in each cluster is counted. Clusters with point counts exceeding the threshold are identified as obstacles. If at least one obstacle cluster exists, the presence of an obstacle ahead is confirmed, and the safety strategy is activated.
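Steps 3 and 4 can be sketched as follows. The DBSCAN routine here is a minimal NumPy stand-in for illustration, and the neighborhood radius `eps` and core-point count are assumed values not given in the text; only the 20-point cluster threshold comes from the procedure above.

```python
import numpy as np

def dbscan_labels(points, eps, min_samples):
    """Minimal DBSCAN (NumPy only); returns one label per point, -1 = noise."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(row <= eps) for row in dist]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_samples:
            continue                      # already labeled, or not a core point
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:                      # expand the cluster from core points
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_samples:
                    stack.extend(neighbors[j])
        cluster += 1
    return labels

def obstacle_ahead(points_xy, eps=0.10, min_cluster_pts=20):
    """points_xy: (N, 2) pseudo-2D laser points in the robot frame (m),
    already filtered to depth < 1 m (step 2 above)."""
    pts = np.asarray(points_xy, dtype=float)
    if pts.size == 0:
        return False
    labels = dbscan_labels(pts, eps, min_samples=5)
    # Step 4: any sufficiently dense cluster is treated as an obstacle.
    return any(np.sum(labels == c) >= min_cluster_pts for c in set(labels) - {-1})
```

In a deployed system the same logic would typically call an optimized implementation such as scikit-learn's `DBSCAN`.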

2.2.3. Human ReID-Based Similar-Target Discrimination

As described in Section 2.2.1, the DIMP single object tracking algorithm can be divided into two stages: target localization and bounding-box refinement. During the localization stage, when the correct target temporarily disappears from the field of view due to fruit carrying or occlusion by obstacles, the algorithm may erroneously capture a visually similar object appearing in front and incorporate its samples into the model update pool. This can weaken the discriminative capability toward the correct target, leading to incorrect following or unintended actions by the visual following robot in complex orchard environments, as shown in Figure 10.
To address this issue, a human re-identification (ReID) module is integrated to assist DIMP tracking, enabling long-term and stable tracking of dynamic targets. The proposed method consists of three stages:
1. Human feature encoding: Images of all workers in the orchard are sampled from multiple viewpoints (four directions). Each image is encoded using the modified HOReID model, and the resulting feature vectors are stored in a database, with each vector labeled by the person ID and image index (Figure 11). The HOReID [38] model is modified as follows:
(1) The backbone ResNet-50 is replaced with EfficientNet-B4 to improve feature extraction efficiency. The stride of the sixth block is changed from 2 to 1, and the output channels of the final 1 × 1 convolution are increased from 1792 to 2048, enhancing feature map resolution and feature granularity.
(2) A CBAM (comprising a channel attention module (CAM) and a spatial attention module (SAM)) is added after the seventh block to adaptively enhance key feature channels and spatial locations, improving ReID accuracy under occlusion.
The modified backbone network is illustrated in Figure 12, with the green regions indicating modified or added components.
2. Target appearance sampling: When the target reappears, complete human keypoints (head and feet) are extracted using a human keypoint detection model. Appearance features are then sampled from the DIMP bounding box region, ensuring that incomplete sampling does not compromise matching accuracy (Figure 13).
3. Target matching and re-identification: The sampled appearance features are encoded and compared with all features in the database using cosine similarity. The most similar feature is selected, and its ID is assigned to the target, thereby achieving target re-identification (Figure 14).
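The cosine-similarity matching in stage 3 might look like the following sketch. The database layout and the acceptance `threshold` are illustrative assumptions; the text itself only specifies nearest-neighbor matching by cosine similarity.

```python
import numpy as np

def match_identity(query, database, threshold=0.5):
    """Cosine-similarity matching of a sampled appearance feature against
    the enrollment database built in stage 1.

    query    : (D,) feature vector of the reappeared target.
    database : list of (person_id, feature) pairs.
    Returns (person_id, similarity), or (None, similarity) when no entry
    clears the (assumed) acceptance threshold.
    """
    q = query / np.linalg.norm(query)
    best_id, best_sim = None, -1.0
    for person_id, feat in database:
        sim = float(np.dot(q, feat / np.linalg.norm(feat)))
        if sim > best_sim:
            best_id, best_sim = person_id, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)
```

Rejecting low-similarity matches (rather than always taking the nearest neighbor) is what allows the system to report "match failed" and keep searching, as used by the strategy in Section 2.3.2.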

2.3. Pure-Pursuit-Based Following Control with State-Driven Strategy

2.3.1. Pure-Pursuit-Based Trajectory Following

In the robot following system shown in Figure 3, the RealSense camera mounted on the experimental platform first captures the RGB and depth images in front of the robot. The target tracking algorithm takes the RGB images as input and outputs the location of the following target within the image. By combining depth matching and coordinate transformation, the target’s three-dimensional position is obtained in the robot coordinate system. Sudden changes in the target’s position between consecutive frames trigger the human re-identification module to distinguish visually similar pedestrians, ensuring target consistency and robustness during long-term following.
Based on this, the relative position of the target in the robot coordinate system is combined with the robot chassis kinematic model to generate motion control commands. These commands are then sent to the lower-level controller. The lower-level controller interprets and executes the commands, driving the actuators to move the robot and achieve stable following. To simplify control design, the robot chassis is modeled as a differential drive system with two active wheels, ignoring the influence of the chassis weight and passive wheels. The following control method is built upon this simplified kinematic model.
Traditional following control methods typically rely on numerous hyperparameters or human experience, leading to complex control flows and limited kinematic derivation. In contrast, a Pure-Pursuit-based motion control method is adopted here, which requires minimal computation and only a single speed hyperparameter to achieve stable control. By setting a look-ahead distance, the method continuously selects the target point and computes the robot’s required angular velocity and turning radius based on geometric relationships, enabling smooth and continuous following motion.
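The geometric relationships above can be captured in a short sketch of one Pure-Pursuit step. The robot-frame axis convention and the numeric look-ahead and speed values are assumptions for illustration; the single speed hyperparameter matches the description in the text.

```python
import math

def pure_pursuit_cmd(target_xy, lookahead=1.5, v=0.5):
    """One Pure-Pursuit step for a differential-drive base.

    target_xy : (x, y) of the followed person in the robot frame,
                x forward, y to the left (assumed convention).
    Returns (v, omega): linear and angular velocity commands.
    """
    x, y = target_xy
    d = math.hypot(x, y)
    # Project the target onto the look-ahead circle when it is farther away.
    gx, gy = (x, y) if d <= lookahead else (x * lookahead / d, y * lookahead / d)
    L = math.hypot(gx, gy)
    if L == 0.0:
        return 0.0, 0.0             # target at the robot origin: no motion
    kappa = 2.0 * gy / (L * L)      # curvature from Pure-Pursuit geometry
    return v, v * kappa             # omega = v * kappa; turning radius = 1 / kappa
```

The commanded (v, omega) pair is then converted to left/right wheel speeds by the differential-drive kinematic model before being sent to the lower-level controller.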

2.3.2. Follow-State Recognition and Strategy Switching

During fruit carrying and cargo transport tasks, the orchard transportation robot may encounter complex situations such as temporary target disappearance, suspected target switching, or obstacles appearing ahead. To ensure safe navigation while maintaining long-term stable following of the correct target, a motion control strategy switching mechanism based on following-state recognition is designed.
Specifically, for each image frame, the system first checks for potential target switching using the proposed re-identification trigger mechanism. If the mechanism is activated, the robot immediately pauses. Once the human re-identification module completes the sampling and matching of the target’s appearance features, a decision is made: if the match is successful, following resumes; if the match fails, the DIMP filter update is suspended, and target matching continues until the correct following target is relocked.
Meanwhile, the system continuously monitors the distance to obstacles and the target. The robot stops if an obstacle is detected ahead or if the target distance is less than 1 m to ensure operational safety. This obstacle distance threshold is determined by considering the robot’s maximum operating speed, braking capability, and perception–control latency, such that the available stopping distance exceeds the estimated braking distance plus system response delay under worst-case conditions. This setting is also consistent with commonly adopted safety distances in close-proximity human–robot collaborative scenarios.
When the target distance falls within 1–1.5 m, the robot switches to an in-place steering mode: the left and right drive wheels rotate with identical angular velocities but opposite directions to realign the camera with the following target. This switching range is selected to balance responsiveness and stability during following. Distances below 1 m may induce abrupt steering commands and increase the risk of oscillations or collisions, whereas distances exceeding 1.5 m lead to overly conservative behavior and reduced following accuracy in space-constrained orchard rows. Empirical tuning during extensive field experiments confirmed that this range provides stable and smooth following performance in real orchard environments.
In the absence of the above abnormal states, the robot employs the Pure-Pursuit-based motion control method to drive the wheels, achieving smooth and stable following motion.
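The state-switching logic of this subsection can be summarized as a small state machine. The state names and the priority ordering of the checks are assumptions for illustration; the 1 m and 1.5 m thresholds follow the text.

```python
from enum import Enum, auto

class FollowState(Enum):
    FOLLOWING = auto()    # normal Pure-Pursuit following
    REID_CHECK = auto()   # suspected target switch: pause and verify identity
    STOPPED = auto()      # obstacle ahead, or target closer than 1 m
    STEERING = auto()     # target within 1-1.5 m: rotate in place to re-center

def next_state(reid_triggered, reid_matched, obstacle, target_dist):
    """One per-frame decision step of the state-driven strategy."""
    if reid_triggered and not reid_matched:
        return FollowState.REID_CHECK   # keep matching until the target is relocked
    if obstacle or target_dist < 1.0:
        return FollowState.STOPPED      # safety stop
    if target_dist < 1.5:
        return FollowState.STEERING     # in-place rotation to re-center the camera
    return FollowState.FOLLOWING
```

Placing the re-identification check first mirrors the text: identity verification takes precedence, and motion resumes only once the correct target is confirmed.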

3. Results

3.1. Experimental Scenario

The experimental scenario is set in a small vineyard and surrounding areas in Kunshan, Suzhou, Jiangsu Province, China. Seven participants with varying clothing, including visually similar outfits, took part in the experiments. The orchard environment mainly consists of rows of fruit trees, main orchard pathways, and open grassy areas, as shown in Figure 15. The experimental platform uses the transport robot prototype constructed in Section 2.1.

3.2. Performance Evaluation and Field Validation of an Orchard Target Following System

3.2.1. Tracking and Localization Accuracy Experiment

To facilitate the evaluation of target localization accuracy, a LiDAR sensor is additionally mounted on the robot. The LiDAR-measured target position serves as the ground truth, while the position obtained by the proposed target localization method serves as the estimated value. Target localization experiments are conducted along the main orchard pathways, between rows of fruit trees, and in grassy areas, as illustrated in Figure 16a–c, where red points indicate the positions obtained by the proposed method and green points indicate the actual target positions. Euclidean distance is used as the deviation metric, as defined in Equation (4):
e = √((x_pred − x_actual)² + (y_pred − y_actual)²), (4)
where (x_pred, y_pred) denotes the position obtained by the proposed method and (x_actual, y_actual) denotes the actual target position.
To quantitatively evaluate the localization performance, statistical analysis was conducted based on the localization errors recorded across all tracking trials. Specifically, the mean error, maximum error, standard deviation, and 95th percentile of the localization error were computed. Across the three tracking trials, the mean localization error is 0.071 m, while the maximum deviation reaches 0.217 m. The standard deviation of the localization error is 0.051 m, indicating limited dispersion of the errors around the mean. Furthermore, 95% of the localization errors are below 0.172 m, demonstrating that the proposed system maintains stable localization accuracy under most operating conditions.
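Given logged predicted and ground-truth positions, the per-frame error of Equation (4) and the summary statistics above can be computed with a short sketch (the array layout is an assumption):

```python
import numpy as np

def localization_error_stats(pred, actual):
    """Per-frame Euclidean error (Eq. (4)) and summary statistics.
    pred and actual are (N, 2) arrays of (x, y) positions in metres."""
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    e = np.linalg.norm(pred - actual, axis=1)   # Eq. (4), one value per frame
    return {
        "mean": float(e.mean()),
        "max": float(e.max()),
        "std": float(e.std()),
        "p95": float(np.percentile(e, 95)),
    }
```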
Due to the use of an elliptical mask for depth extraction, no failures caused by missing depth values are observed. The main sources of deviation include errors in the ground-truth measurement process, RealSense depth measurement errors, errors in computing human coordinates, and inaccuracies in depth extraction caused by imperfect bounding-box predictions of the DIMP algorithm. Although a certain deviation exists between the measured and estimated target positions, its impact on the overall system performance is negligible. Therefore, the proposed target tracking and localization method satisfies the requirements for tracking target localization accuracy.

3.2.2. Similar-Target Disambiguation via Person Re-Identification

To evaluate the effectiveness and applicability of the proposed similar-target discrimination method based on human re-identification (ReID) in orchard following tasks, experiments are conducted at three levels: model performance, target discrimination in static scenarios, and system performance in dynamic tracking scenarios. First, the improved HOReID model is quantitatively evaluated on standard pedestrian ReID datasets to verify its feature discrimination capability under occlusion. Second, in real orchard static scenarios, target discrimination experiments are performed under varying numbers of personnel and environmental conditions to assess the reliability of the ReID method in practical applications. Finally, in dynamic tracking scenarios that include multiple typical interference factors, the proposed target discrimination strategy is comprehensively tested, with particular emphasis on maintaining identity consistency during target switching, short-term disappearance, and occlusion. This layer-by-layer experimental design provides a comprehensive evaluation of the method’s practicality and robustness in complex orchard environments.
Experiment I: Experimental Setup and Performance Evaluation of the Improved HOReID Model
To validate the effectiveness of the improved HOReID model, training is conducted on the Occluded-Duke [39] and DukeMTMC [40] datasets. Data augmentation techniques including random horizontal flipping, random erasing, and random cropping are applied during training. To adapt to varying illumination conditions in orchards, random brightness adjustment is also introduced. The model is trained for 120 epochs. The performance comparison on the test sets of Occluded-Duke and DukeMTMC is shown in Table 2.
As shown in Table 2, the EfficientNet-HOReID model achieves a 3.5% improvement in Rank-1 and a 4.4% improvement in mAP on the Occluded-Duke dataset. On the DukeMTMC dataset, Rank-1 and mAP improve by 1.2% and 0.8%, respectively, demonstrating the effectiveness of the model modifications.
Experiment II: Reliability Analysis of Similar-Target Discrimination in Static Multi-Person Orchard Scenarios
To verify the proposed ReID-based target discrimination method in orchard environments, experiments are conducted in static scenarios with varying numbers of personnel and environmental conditions, comprising a total of nine test groups. Before each experiment, images of each participant are collected from four different angles, and the extracted features are stored in a database. For each group, 20 ReID trials are conducted, with the candidate target randomly selected from the participants present. A trial is considered successful if the candidate correctly matches the corresponding ID in the database. The results are summarized in Table 3.
Table 3 shows that the proposed ReID method achieves an average matching success rate of 92.8%. The success rate is relatively lower in the between-tree-row scenario. Main causes of mismatches include insufficient performance of the ReID network itself, limited feature extraction under low illumination, and interference from spotty lighting in tree-row areas. Despite occasional mismatches, the frequency is low, and orchard operations typically occur during daytime under non-extreme lighting conditions. Therefore, the proposed ReID method meets the requirements for personnel re-identification in orchard tasks. The variability in success rates across groups can be explained by the binomial nature of the 20 ReID trials per group, with slightly higher variability for larger group sizes and low illumination conditions.
Experiment III: Impact of Similar-Target Interference on Following Identity Consistency in Dynamic Tracking Scenarios
To assess the effectiveness of the ReID-based target discrimination method in dynamic tracking, offline video sequences are collected from multiple scenarios with various interference factors, as shown in Figure 17. Four typical interferences are introduced: multiple people present simultaneously (seven participants in total, with one target and six distractors; four-angle images of each participant are collected in advance to construct the ReID feature database), target pose variations, short-term disappearance and reappearance, and target switching due to occlusion. The test data are collected in the Bacheng vineyard and surrounding areas of Kunshan, Suzhou, Jiangsu Province, covering main orchard pathways, rows of fruit trees, indoor aggregation areas, and open grassy areas. The robot followed the target under human operation, with the above interferences intentionally introduced.
The proposed method is compared with the original DIMP, SORT [45], and DeepSORT [46] algorithms, as shown in Table 4. For quantitative evaluation, frame-by-frame results are classified into four states: correct tracking (CT), indicating correct following of the target; correct loss (CL), indicating the target leaves the field of view and the system correctly identifies it as lost; wrong tracking (WT), indicating tracking of an incorrect target; and wrong loss (WL), indicating the target remains visible but is not tracked.
Although the proposed method does not always achieve the highest CT score in individual scenarios, its CT + CL (the system correct decision rate) is significantly higher than that of DIMP, SORT, and DeepSORT across all five test scenarios. A Wilcoxon signed-rank test over the five scenarios indicates that this CT + CL improvement over the baseline trackers is consistent rather than an artifact of a single scenario. This reflects a system-level design choice: in complex orchard environments, the priority is to ensure reliable and safe following decisions rather than to maximize frame-level continuous tracking accuracy.
In orchard operations, frequent occlusions, high visual similarity, drastic pose changes, and rapid illumination variations are common. Under such conditions, WT poses a higher risk to the system than CL. Therefore, the proposed ReID-based following framework adopts a conservative discrimination strategy when target identity is uncertain, favoring correct loss detection rather than riskily following a potential incorrect target. This strategy effectively suppresses wrong following, explaining the relatively limited CT but significantly reduced WT.
Furthermore, the ReID mechanism is not applied continuously to every frame. Instead, it is triggered only at critical events, such as suspected target switching, target disappearance, or reappearance, for identity verification and correction. Its main effect is to enhance identity consistency and robustness during long-term following rather than to improve short-term frame-level tracking accuracy. This design allows the system to avoid template contamination and the accumulation of identity-switch errors under multi-person interference and prolonged operation.
Overall, the experimental results indicate that, although the proposed method involves some trade-offs in CT, it achieves superior overall decision correctness and following stability in complex, multi-interference orchard dynamic scenarios, demonstrating higher practical value for real-world human–robot collaborative following tasks.

3.2.3. Follow Control and Decision-Making Evaluation

To validate the proposed robot following control method and its decision-making strategy, following control experiments are designed to cover a variety of typical operating conditions. The robot’s linear velocity is preset and is adjusted in real time according to the target’s depth information. Based on practical orchard operation requirements, the following process is divided into five decision states:
  • Emergency stop: triggered when an obstacle is detected within 1 m in front of the robot.
  • Stop motion: executed when the distance to the target is less than 1 m.
  • In-place rotation: when the target distance is 1–1.5 m, the robot performs rotation in place by driving the two wheels in opposite directions at equal speed to maintain alignment.
  • Re-identification-triggered decision: when a suspected target switch is detected, the human ReID module is invoked to verify identity and suppress incorrect following.
  • Normal following: when the target distance exceeds 1.5 m and no abnormal conditions occur, the robot follows the target using the Pure Pursuit algorithm.
The safety thresholds and distance-based decision boundaries follow the control strategy defined in Section 2.3.2, where their kinematic and empirical rationale is discussed.
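The five decision states above can be sketched as a priority-ordered selector. The thresholds follow the list above, while the evaluation order (emergency stop first, then identity verification, then the distance bands) is an assumption for illustration, and all names are hypothetical:

```python
def select_state(target_dist, obstacle_dist, switch_suspected):
    """Map perception outputs to one of the five decision states.

    target_dist: distance to the followed person in meters.
    obstacle_dist: distance to the nearest frontal obstacle in meters,
        or None when no obstacle is detected.
    switch_suspected: True when a suspected target switch should
        trigger ReID-based identity verification.
    """
    if obstacle_dist is not None and obstacle_dist < 1.0:
        return "EMERGENCY_STOP"      # obstacle within 1 m ahead
    if switch_suspected:
        return "REID_VERIFY"         # verify identity before acting
    if target_dist < 1.0:
        return "STOP"                # target closer than 1 m
    if target_dist <= 1.5:
        return "ROTATE_IN_PLACE"     # 1-1.5 m band: align heading only
    return "FOLLOW"                  # > 1.5 m: Pure Pursuit following
```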
To evaluate the robot’s ability to track the target trajectory under the normal following state, an RTK-GPS module is installed on top of the robot, and a base station is deployed to record the motion trajectories of both the target and the robot. As shown in Figure 18, trajectories are recorded under three different following scenarios.
Analysis of the recorded trajectories shows that the robot can closely follow the target during straight walking, turning, and free walking, demonstrating high accuracy and real-time performance in following motion. However, small deviations are observed at some slight turns. Possible causes of these deviations include: (1) the target’s lateral speed during turns is relatively high, which delays the robot’s motion control adjustment; (2) slippage of the differential-drive robot chassis during turning, which causes trajectory offset.
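The normal-following state uses the Pure Pursuit algorithm. A minimal sketch of the standard curvature law for a differential-drive base, assuming a single look-ahead point expressed in the robot frame (x forward, y left); the function name is illustrative:

```python
def pure_pursuit_omega(dx, dy, v):
    """Angular-velocity command toward a look-ahead point (dx, dy).

    Standard pure-pursuit curvature: kappa = 2 * y / Ld^2, where
    Ld is the distance to the look-ahead point; omega = v * kappa.
    """
    ld_sq = dx * dx + dy * dy
    if ld_sq == 0.0:
        return 0.0               # look-ahead point coincides with robot
    kappa = 2.0 * dy / ld_sq     # signed curvature toward the point
    return v * kappa
```

The lag at tight turns noted above is an inherent property of this law: a laterally fast target briefly increases the look-ahead offset dy faster than the commanded omega can compensate.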

3.2.4. Long-Term Field Experiments in Orchard Environments

To evaluate the long-term following performance of the robot in real operational environments, a field following experiment is conducted in a vineyard and surrounding areas in Bacheng, Kunshan, Suzhou, Jiangsu Province, China, lasting 25 min 6 s. During the experiment, the robot follows three different participants at walking speeds of approximately 1.5 m/s, 1.0 m/s, and 0.6 m/s, maintaining a human–robot distance of ≤6 m for effective perception by the RealSense camera. Six additional participants act as distractors, and one participant is the designated target. Before the experiment, four-view images of all participants are collected and pedestrian re-identification features are extracted to construct the matching database. Figure 19a–e illustrate the field experiment scenarios. When the robot loses the target, it stops and waits until the correct target reappears before resuming motion (Figure 20).
The results of target tracking and localization are summarized in Table 5, with the LiDAR-measured target positions treated as ground truth to compute the average localization error. The average localization error is also reported along with its standard deviation to reflect stability over time.
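The mean ± standard deviation errors in Table 5 can be reproduced from per-frame position pairs as follows. This is a generic sketch with a hypothetical helper, assuming planar positions and LiDAR measurements as ground truth:

```python
import math

def localization_error_stats(est, gt):
    """Mean and population standard deviation of Euclidean localization error.

    est: list of estimated (x, y) target positions, one per frame.
    gt:  list of LiDAR ground-truth (x, y) positions, same length.
    """
    errs = [math.hypot(ex - gx, ey - gy)
            for (ex, ey), (gx, gy) in zip(est, gt)]
    mean = sum(errs) / len(errs)
    std = math.sqrt(sum((e - mean) ** 2 for e in errs) / len(errs))
    return mean, std
```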
During the experiment, obstacles appear in front of the robot eight times for each followed participant; the corresponding obstacle perception results are presented in Table 6.
In addition, five events per followed participant require triggering the re-identification module, with the target recognition results shown in Table 7.
Throughout the following process, the system successfully handles sudden obstacle intrusion, temporary target disappearance and reappearance, and interference from visually similar individuals. Statistical results indicate that, when the target is within the field of view, the system achieves a correct tracking rate exceeding 94% under all speed conditions, with a maximum of 98.5%, and an average localization error below 0.08 m. Based on the long-term field experiment data, more than 95% of the localization errors are estimated to remain below 0.12 m, demonstrating stable localization performance. Obstacle perception achieves a 100% success rate across all speeds, with an average per-frame processing time of approximately 20 ms. For critical events requiring re-identification, the average target recognition success rate reaches 93.3%. These results demonstrate that the proposed following system provides stable long-term tracking performance and high safety in complex orchard environments, meeting the practical requirements of orchard transportation operations.

4. Discussion

This study investigates the stability of orchard transport robots during following tasks in trellised vineyard environments by developing a vision-guided autonomous following framework that integrates target localization, obstacle perception, similar-target discrimination, and motion control. Systematic experiments conducted in real orchard environments demonstrate that the framework achieves stable long-term tracking of designated targets under complex conditions, including vegetation occlusion, illumination changes, and multi-person interference, thereby validating its engineering feasibility for orchard transport operations.
The proposed DIMP-based visual tracking combined with depth information maintains low localization errors in complex orchard backgrounds. The obstacle perception method, which leverages pseudo-laser representation and DBSCAN clustering, effectively detects common close-range obstacles encountered in orchard operations. The pedestrian re-identification mechanism enhances target identity consistency under scenarios of temporary target disappearance, reappearance, or interference from visually similar individuals. Meanwhile, the lightweight following control strategy imposes low computational overhead, satisfying real-time system requirements.
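A minimal sketch of the DBSCAN-based obstacle clustering on 2D pseudo-laser points, using scikit-learn; the eps and min_samples values here are illustrative assumptions, not the paper's tuned parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_obstacles(points_xy, eps=0.15, min_samples=8):
    """Cluster 2D pseudo-laser points and summarize each obstacle.

    points_xy: (N, 2) array of points in the robot frame (meters).
    Returns a list of dicts with the cluster centroid and the range
    of its nearest point; DBSCAN label -1 (noise) is discarded.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_xy)
    obstacles = []
    for lab in set(labels) - {-1}:
        cluster = points_xy[labels == lab]
        obstacles.append({
            "centroid": cluster.mean(axis=0),
            "range": float(np.linalg.norm(cluster, axis=1).min()),
        })
    return obstacles
```

The nearest-point range of each cluster is what a distance-threshold rule (e.g., an emergency stop inside 1 m) would consume.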
Nevertheless, the system has several limitations. In cases of complete target occlusion or dense multi-person interactions, the re-identification trigger may introduce latency. Under low-light conditions, the feature extraction capability of the re-identification module decreases. Additionally, triggering the re-identification module temporarily increases computational load. The current obstacle perception module does not support autonomous obstacle avoidance, and the following control strategy is primarily open-loop, which may lead to cumulative errors over prolonged operation.
Another important limitation lies in the deployment dependency of the ReID module. The current system relies on a pre-constructed ReID database, which requires prior multi-view image acquisition of target workers. This assumption may restrict scalability in real-world orchard scenarios, especially when new workers are introduced, clothing changes occur, or protective equipment is worn. In addition, the present operating mode primarily focuses on tracking known individuals, and the handling of completely unknown persons remains limited. Although this design ensures high identity consistency in controlled deployments, it may reduce flexibility in dynamic labor environments.
Furthermore, the experimental validation was conducted at a single orchard site with seven participants under relatively favorable environmental conditions. The system performance under adverse weather (e.g., rain, dust), challenging illumination (e.g., low sunlight, backlight, twilight, or nighttime), and reflective or highly dynamic backgrounds has not been fully evaluated. These factors may affect both visual tracking and re-identification reliability, thereby limiting the generalization of the current experimental results to broader operational scenarios.
Future work will focus on improving the robustness of re-identification under low-light and appearance-variant conditions by incorporating clothing-invariant features, online or incremental identity updating strategies, and more efficient matching mechanisms. In addition, multi-site experiments under diverse environmental conditions, including varying illumination and weather, will be conducted to further evaluate system generalization. Multi-sensor fusion-based obstacle avoidance and closed-loop control strategies will also be integrated to enhance long-term stability and operational safety.

5. Conclusions

This study investigates the stability and accuracy of transport robots during following tasks in trellised vineyard environments by proposing and implementing the DeepDIMP-ReID framework, which has been validated through hardware platform deployment and real-world experiments in orchard scenarios. The system combines the DIMP algorithm with elliptical-mask depth matching to achieve real-time 3D localization of the following target. In complex orchard environments, including main orchard roads, inter-row areas, and open lawns, the average localization error is only 0.071 m, effectively ensuring spatial accuracy for following tasks.
The obstacle detection method, based on pseudo-laser representation and DBSCAN clustering, enables high-precision perception of close-range obstacles such as fruit boxes, pedestrians, trees, and buildings, providing safety assurance for robot motion in dynamic environments. For target discrimination, the modified EfficientNet-HOReID model is employed for pedestrian re-identification. On the Occluded-REID and DukeMTMC datasets, Rank-1 accuracy is improved by 3.5% and 1.2% and mAP by 4.4% and 0.8%, respectively, demonstrating the effectiveness of the model enhancements. In long-term orchard following experiments, the target re-identification success rate at critical moments reaches 93.3%, significantly improving identity consistency and robustness under multi-person interference.
In terms of motion control, the system integrates the Pure Pursuit algorithm with a state-driven control strategy, achieving stable trajectory tracking. Under various target speeds, the correct tracking rate consistently exceeds 94%, effectively handling temporary target disappearance and reappearance under occlusion, thereby ensuring continuity and reliability of following tasks.
Overall, the experimental results confirm that the proposed framework achieves high-precision, robust, and stable autonomous following in complex orchard environments. The system supports long-term continuous operation of transport robots and provides a practical and deployable solution for autonomous following in challenging agricultural field scenarios.

Author Contributions

Conceptualization, R.S. and C.G.; methodology, R.S. and Y.W.; software, R.S.; validation, H.L., R.S. and Y.W.; formal analysis, R.S. and H.G.; investigation, R.S. and H.G.; resources, C.G., Y.S. and H.L.; data curation, R.S.; writing—original draft preparation, R.S. and Y.W.; writing—review and editing, Y.W. and H.L.; visualization, R.S.; supervision, C.G. and Y.S.; project administration, R.S. and H.G.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovation Program of the Chinese Academy of Agricultural Sciences (CAAS-CAE-202302 and CAAS-CAE-202301).

Data Availability Statement

The raw data are not publicly available due to laboratory policies and confidentiality agreements.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DeepDIMP-ReID: Deep Discriminative Model Prediction with Re-Identification
DIMP: Discriminative Model Prediction
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
HOReID: Higher-Order Relation-Based Person Re-Identification
GNSS: Global Navigation Satellite System
UWB: Ultra-Wideband
LiDAR: Light Detection and Ranging
SLAM: Simultaneous Localization and Mapping
ReID: Re-Identification
RTK: Real-Time Kinematic
GPS: Global Positioning System
NFS: Need for Speed
OTB-100: Object Tracking Benchmark-100
UAV123: Unmanned Aerial Vehicle 123
AUC: Area Under the Success Plot Curve
CCOT: Continuous Convolution Operator Tracker
DaSiam-RPN: Distractor-Aware Siamese Region Proposal Network
SiamRPN++: Siamese Region Proposal Network++
UPDT: Unconstrained Pseudo-Dynamic Tracker
ATOM: Accurate Tracking by Overlap Maximization
CBAM: Convolutional Block Attention Module
SAM: Spatial Attention Module
Occluded-REID: Occluded Person Re-Identification Dataset
DukeMTMC: Duke Multi-Target, Multi-Camera Tracking Dataset
Rank-1: Rank-1 Accuracy
mAP: Mean Average Precision
DSR: Deep Spatial Feature Reconstruction
SFR: Similarity Feature Re-Ranking
PCB: Part-Based Convolutional Baseline
Ad-Occluded: Adversarially Occluded Samples
SORT: Simple Online and Realtime Tracking
DeepSORT: Deep Simple Online and Realtime Tracking
PGFA: Pose-Guided Feature Alignment
CT: Correct Tracking
CL: Correct Loss
WT: Wrong Tracking
WL: Wrong Loss

References

  1. Fontani, M.; Luglio, S.M.; Gagliardi, L.; Peruzzi, A.; Frasconi, C.; Raffaelli, M.; Fontanelli, M. A Systematic Review of 59 Field Robots for Agricultural Tasks: Applications, Trends, and Future Directions. Agronomy 2025, 15, 2185. [Google Scholar] [CrossRef]
  2. Jiang, D.; Shen, Z.; Zheng, Q.; Zhang, T.; Xiang, W.; Jin, J. Farm-LightSeek: An Edge-Centric Multimodal Agricultural IoT Data Analytics Framework with Lightweight LLMs. IEEE Internet Things Mag. 2025, 8, 72–79. [Google Scholar] [CrossRef]
  3. Tan, H.; Zhao, X.; Fu, H.; Yang, M.; Zhai, C. A novel fusion positioning navigation system for greenhouse strawberry spraying robot using LiDAR and ultrasonic tags. Agric. Commun. 2025, 3, 100087. [Google Scholar] [CrossRef]
  4. Yang, J.; Zulkarnain, N.; Ibrahim, M.F.; Nordin, I.N.A.M.; Vaghefi, S.A. A comprehensive review of agricultural ground automatic navigation systems based on multi-sensor fusion. IEEE Access 2025, 13, 168159–168182. [Google Scholar] [CrossRef]
  5. Fei, Z.; Vougioukas, S.G. A robotic orchard platform increases harvest throughput by controlling worker vertical positioning and platform speed. Comput. Electron. Agric. 2024, 218, 108735. [Google Scholar] [CrossRef]
  6. He, Z.; Liu, Z.; Zhou, Z.; Karkee, M.; Zhang, Q. Improving picking efficiency under occlusion: Design, development, and field evaluation of an innovative robotic strawberry harvester. Comput. Electron. Agric. 2025, 237, 110684. [Google Scholar] [CrossRef]
  7. Sun, H.; Zhang, S.; Quan, Q. Multi-feature fusion and memory-based mobile robot target tracking system. IET Cyber-Syst. Robot. 2024, 6, e12119. [Google Scholar] [CrossRef]
  8. Syed, T.N.; Zhou, J.; Lakhiar, I.A.; Marinello, F.; Gemechu, T.T.; Rottok, L.T.; Jiang, Z. Enhancing Autonomous Orchard Navigation: A Real-Time Convolutional Neural Network-Based Obstacle Classification System for Distinguishing ‘Real’ and ‘Fake’ Obstacles in Agricultural Robotics. Agriculture 2025, 15, 827. [Google Scholar] [CrossRef]
  9. Huang, Z.; Ou, C.; Guo, Z.; Ye, L.; Li, J. Human-Following Strategy for Orchard Mobile Robot Based on the KCF-YOLO Algorithm. Horticulturae 2024, 10, 348. [Google Scholar] [CrossRef]
  10. Jia, L.; Wang, Y.; Ma, L.; He, Z.; Li, Z.; Cui, Y. Integrated positioning system of kiwifruit orchard mobile robot based on UWB/LiDAR/ODOM. Sensors 2023, 23, 7570. [Google Scholar] [CrossRef]
  11. Tan, H.; Zhao, X.; Zhai, C.; Fu, H.; Chen, L.; Yang, M. Design and experiments with a SLAM system for low-density canopy environments in greenhouses based on an improved Cartographer framework. Front. Plant Sci. 2024, 15, 1276799. [Google Scholar] [CrossRef]
  12. Zhang, W.; Gong, L.; Huang, S.; Wu, S.; Liu, C. Factor graph-based high-precision visual positioning for agricultural robots with fiducial markers. Comput. Electron. Agric. 2022, 201, 107295. [Google Scholar] [CrossRef]
  13. De Silva, R.; Cielniak, G.; Gao, J. Vision based crop row navigation under varying field conditions in arable fields. Comput. Electron. Agric. 2024, 217, 108581. [Google Scholar] [CrossRef]
  14. Khan, N.; Rahi, A.; Rajendran, V.P.; Al Hasan, M.; Anwar, S. Real-time crop row detection using computer vision- application in agricultural robots. Front. Artif. Intell. 2024, 7, 1435686. [Google Scholar] [CrossRef]
  15. Li, B.; Li, D.; Wei, Z.; Wang, J. Rethinking the crop row detection pipeline: An end-to-end method for crop row detection based on row-column attention. Comput. Electron. Agric. 2024, 225, 109264. [Google Scholar] [CrossRef]
  16. Syed, M.A.; Ou, Y.; Li, T.; Jiang, G. Lightweight multimodal domain generic person reidentification metric for person-following robots. Sensors 2023, 23, 813. [Google Scholar] [CrossRef]
  17. Seol, J.; Park, Y.; Pak, J.; Jo, Y.; Lee, G.; Kim, Y.; Ju, C.; Hong, A.; Son, H.I. Human-Centered Robotic System for Agricultural Applications: Design, Development, and Field Evaluation. Agriculture 2024, 14, 1985. [Google Scholar] [CrossRef]
  18. Murcia, H.F.; Tilaguy, S.; Ouazaa, S. Development of a low-cost system for 3D Orchard Mapping Integrating UGV and LiDAR. Plants 2021, 10, 2804. [Google Scholar] [CrossRef] [PubMed]
  19. Nkwocha, C.L.; Adewumi, A.; Folorunsho, S.O.; Eze, C.; Jjagwe, P.; Kemeshi, J.; Wang, N. A Comprehensive Review of Sensing, Control, and Networking in Agricultural Robots: From Perception to Coordination. Robotics 2025, 14, 159. [Google Scholar] [CrossRef]
  20. Shamshiri, R.R.; Navas, E.; Dworak, V.; Cheein, F.A.A.; Weltzien, C. A modular sensing system with CANBUS communication for assisted navigation of an agricultural mobile robot. Comput. Electron. Agric. 2024, 223, 109112. [Google Scholar] [CrossRef]
  21. Wu, M.; Yeong, C.F.; Su, E.L.M.; Holderbaum, W.; Yang, C. A review on energy efficiency in autonomous mobile robots. Robot. Intell. Autom. 2023, 43, 648–668. [Google Scholar] [CrossRef]
  22. Huan, Z.L.; Tomohiro, T.; Tofael, A. Leader-follower tracking system for agricultural vehicles: Fusion of laser and odometry positioning using extended kalman filter. IAES Int. J. Robot. Autom. 2015, 4, 1–18. [Google Scholar] [CrossRef]
  23. Li, J.; Wang, S.; Zhang, W.; Li, H.; Zeng, Y.; Wang, T.; Fei, K.; Qiu, X.; Jiang, R.; Mai, C.; et al. Research on Path Tracking for an Orchard Mowing Robot Based on Cascaded Model Predictive Control and Anti-Slip Drive Control. Agronomy 2023, 13, 1395. [Google Scholar] [CrossRef]
  24. Katona, K.; Neamah, H.A.; Korondi, P. Obstacle avoidance and path planning methods for autonomous navigation of mobile robot. Sensors 2024, 24, 3573. [Google Scholar] [CrossRef] [PubMed]
  25. Li, H.; Huang, K.; Sun, Y.; Lei, X.; Yuan, Q.; Zhang, J.; Lv, X. An autonomous navigation method for orchard mobile robots based on octree 3D point cloud optimization. Front. Plant Sci. 2025, 15, 1510683. [Google Scholar] [CrossRef] [PubMed]
  26. Shen, Y.; Shen, Y.; Zhang, Y.; Huo, C.; Shen, Z.; Su, W.; Liu, H. Research Progress on Path Planning and Tracking Control Methods for Orchard Mobile Robots in Complex Scenarios. Agriculture 2025, 15, 1917. [Google Scholar] [CrossRef]
  27. Chen, M.; Huang, Y.; Wang, W.; Zhang, Y.; Xu, L.; Pan, Z. Model inductive bias enhanced deep reinforcement learning for robot navigation in crowded environments. Complex Intell. Syst. 2024, 10, 6965–6982. [Google Scholar] [CrossRef]
  28. Hu, J.; Bhowmick, P.; Lanzon, A. Group coordinated control of networked mobile robots with applications to object transportation. IEEE Trans. Veh. Technol. 2021, 70, 8269–8274. [Google Scholar] [CrossRef]
  29. Galoogahi, H.K.; Fagg, A.; Huang, C.; Ramanan, D.; Lucey, S. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1134–1143. [Google Scholar]
  30. Wu, Y.; Lim, J.; Yang, M. Object Tracking Benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848. [Google Scholar] [CrossRef]
  31. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  32. Danelljan, M.; Robinson, A.; Khan, F.S.; Felsberg, M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 472–488. [Google Scholar] [CrossRef]
  33. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 103–119. [Google Scholar]
  34. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291. [Google Scholar]
  35. Danelljan, M.; Hager, G.; Khan, F.S.; Felsberg, M. Convolutional features for correlation filter based visual tracking. In Proceedings of the 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Santiago, Chile, 7–13 December 2015; pp. 621–629. [Google Scholar]
  36. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. ATOM: Accurate tracking by overlap maximization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4655–4664. [Google Scholar]
  37. Bhat, G.; Danelljan, M.; Van Gool, L.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6181–6190. [Google Scholar]
  38. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 6448–6457. [Google Scholar]
  39. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551. [Google Scholar]
  40. Ristani, E.; Solera, F.; Zou, R.; Rita, C.; Carlo, T. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 17–35. [Google Scholar] [CrossRef]
  41. He, L.; Liang, J.; Li, H.; Sun, Z. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7073–7082. [Google Scholar]
  42. He, L.; Sun, Z.; Zhu, Y.; Wang, Y. Recognizing partial biometric patterns. arXiv 2018, arXiv:1810.07399. [Google Scholar] [CrossRef]
  43. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 501–518. [Google Scholar]
  44. Huang, H.; Li, D.; Zhang, Z.; Chen, X.; Huang, K. Adversarially occluded samples for person re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 5098–5107. [Google Scholar]
  45. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  46. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
Figure 1. DeepDIMP-ReID framework.
Figure 2. Schematic diagram of the hardware system.
Figure 3. Orchard transportation robot hardware components, (a) RealSense camera, (b) industrial computer, (c) electronic control unit, (d) RTK-GPS module, (e) 16-line LiDAR, (f) fruit basket and support frame.
Figure 4. Illustration of DIMP algorithm output.
Figure 5. Noise points in the depth image.
Figure 6. Holes in the depth image.
Figure 7. Illustration of elliptical mask generation.
Figure 8. Flow of tracking and localization.
Figure 9. Field-of-view range considered for obstacle perception.
Figure 10. Example of incorrect following.
Figure 11. Feature extraction.
Figure 12. Structure of the improved HOReID network.
Figure 13. Appearance sampling combined with human keypoint detection.
Figure 14. Illustration of the target re-identification process.
Figure 15. Field experiment scenarios for target following, (a) main orchard pathway (soil surface); (b) main orchard pathway (concrete surface); (c) between rows of fruit trees (low illumination); (d) between rows of fruit trees (high illumination); (e) open grassy area; (f) some experiment participants.
Figure 16. The robot used in the target tracking and positioning experiment and some experimental results data, (a) orchard main road; (b) tree row; (c) lawn.
Figure 17. Experimental scenarios covered by the video dataset, (a) between rows of fruit trees; (b) orchard main road; (c) trellis-row roadway; (d) lawn area; (e) fruit storage warehouse.
Figure 18. Partial tracking trajectory data, (a) straight following trajectory; (b) turn following track; (c) following trajectory of free walking 1; (d) following trajectory of free walking 2.
Figure 19. Real-world following experiment scenarios, (a) orchard main road; (b) between rows of fruit trees; (c) turning section of the orchard main roadway; (d) fruit storage warehouse; (e) entrance of the fruit storage warehouse.
Figure 20. Screenshots of the following process.
Table 1. AUC comparison of single-object trackers across datasets.

| Datasets | CCOT [32] | DaSiam-RPN [33] | SiamRPN++ [34] | UPDT [35] | ATOM [36] | DIMP [37] |
| NFS | 48.8 | / | / | 53.6 | 58.4 | 61.9 |
| OTB-100 | 68.2 | 65.8 | 69.6 | 70.4 | 66.3 | 68.4 |
| UAV123 | 51.3 | 57.7 | / | 54.5 | 64.2 | 65.3 |
Table 2. Results of the improved HOReID model on the datasets.

| Methods | Occluded-REID Rank-1 | Occluded-REID mAP | DukeMTMC Rank-1 | DukeMTMC mAP |
| DSR [41] | 40.8 | 30.4 | / | / |
| SFR [42] | 42.3 | 32.0 | / | / |
| PCB [43] | 42.6 | 33.7 | 81.8 | 66.1 |
| Ad-Occluded [44] | 44.5 | 32.2 | / | / |
| PGFA [39] | 51.4 | 37.3 | 82.6 | 65.5 |
| HOReID | 55.1 | 43.8 | 86.9 | 75.6 |
| Our work | 58.6 | 48.2 | 88.1 | 76.4 |
Table 3. Human re-identification success rate across environments and group sizes.

| Experiment ID | Environment | Persons in Scene | Human ReID Accuracy |
| 1 | Orchard main road (medium illumination) | 3 | 100% |
| 2 | Orchard main road (medium illumination) | 5 | 100% |
| 3 | Orchard main road (medium illumination) | 7 | 95% |
| 4 | Between tree rows (low illumination) | 3 | 100% |
| 5 | Between tree rows (low illumination) | 5 | 80% |
| 6 | Between tree rows (low illumination) | 7 | 75% |
| 7 | Lawn area (high illumination) | 3 | 100% |
| 8 | Lawn area (high illumination) | 5 | 95% |
| 9 | Lawn area (high illumination) | 7 | 90% |
Table 4. Following performance evaluation results.

| Scenario | Methods | CT/% | CL/% | WT/% | WL/% | CT + CL/% | Total Frames |
|---|---|---|---|---|---|---|---|
| 1 | DIMP | 45.89 | 17.66 | 35.22 | 1.23 | 63.55 | 1948 |
| | SORT | 7.29 | 38.66 | 0.00 | 54.05 | 45.95 | |
| | DeepSORT | 7.80 | 39.12 | 0.00 | 53.08 | 46.92 | |
| | Our work | 41.89 | 38.40 | 12.32 | 7.39 | 80.29 | |
| 2 | DIMP | 51.07 | 13.39 | 34.76 | 0.78 | 64.46 | 1539 |
| | SORT | 8.06 | 29.37 | 1.04 | 61.53 | 37.43 | |
| | DeepSORT | 7.92 | 30.00 | 1.49 | 60.59 | 37.92 | |
| | Our work | 40.29 | 28.07 | 8.12 | 23.52 | 68.36 | |
| 3 | DIMP | 45.00 | 17.29 | 36.88 | 0.83 | 62.29 | 1920 |
| | SORT | 6.93 | 37.08 | 6.88 | 49.11 | 44.01 | |
| | DeepSORT | 6.82 | 37.29 | 6.98 | 48.91 | 44.11 | |
| | Our work | 44.79 | 36.88 | 3.23 | 15.10 | 81.67 | |
| 4 | DIMP | 30.14 | 9.59 | 59.81 | 0.46 | 39.73 | 1752 |
| | SORT | 15.75 | 25.51 | 3.54 | 55.20 | 41.26 | |
| | DeepSORT | 16.10 | 25.46 | 4.10 | 54.34 | 41.56 | |
| | Our work | 45.21 | 25.23 | 23.63 | 5.94 | 70.44 | |
| 5 | DIMP | 49.09 | 9.02 | 40.98 | 0.91 | 58.11 | 1862 |
| | SORT | 6.23 | 33.62 | 0.00 | 60.15 | 39.85 | |
| | DeepSORT | 6.77 | 33.94 | 0.00 | 59.29 | 40.71 | |
| | Our work | 45.81 | 33.51 | 5.53 | 15.15 | 79.32 | |
| Mean value | DIMP | 44.24 | 13.39 | 41.53 | 0.68 | 57.63 | / |
| | SORT | 8.85 | 32.85 | 2.29 | 56.01 | 41.70 | |
| | DeepSORT | 9.08 | 33.16 | 2.51 | 55.24 | 42.24 | |
| | Our work | 43.60 | 32.42 | 10.57 | 13.42 | 76.02 | |
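The four outcomes in Table 4 can be computed frame by frame once each frame is labeled with the ground-truth target visibility and the tracker's output. A minimal sketch, assuming CT/CL/WT/WL denote correctly tracked, correctly lost, wrongly tracked, and wrongly lost frames; the paper's exact frame-labeling criteria may differ:

```python
def classify_frame(target_visible, tracker_output):
    """Classify one frame into CT / CL / WT / WL.

    target_visible: ground truth, True if the followed person is in view.
    tracker_output: 'correct' (box on the right person), 'wrong'
                    (box on a distractor), or 'lost' (no box reported).
    """
    if target_visible:
        if tracker_output == "correct":
            return "CT"  # correctly tracked
        if tracker_output == "wrong":
            return "WT"  # wrongly tracked (identity switch)
        return "WL"      # wrongly lost: target visible but reported lost
    # Target not in view: reporting 'lost' is correct, any box is wrong.
    return "CL" if tracker_output == "lost" else "WT"

def summarize(frames):
    """Percentage of each outcome over (target_visible, tracker_output) pairs."""
    counts = {"CT": 0, "CL": 0, "WT": 0, "WL": 0}
    for visible, output in frames:
        counts[classify_frame(visible, output)] += 1
    return {k: 100.0 * v / len(frames) for k, v in counts.items()}
```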
Table 5. Statistics of target tracking and localization results.

| Followed Person | Walking Speed | Frames with Target in Camera View | Correctly Tracked Frames | Correct Tracking Rate | Average Localization Error (m) |
|---|---|---|---|---|---|
| Person 1 | 1.5 m/s | 10,025 | 9420 | 94.0% | 0.0792 ± 0.012 |
| Person 2 | 1 m/s | 10,302 | 9703 | 94.2% | 0.0722 ± 0.009 |
| Person 3 | 0.6 m/s | 11,027 | 10,865 | 98.5% | 0.0525 ± 0.007 |
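The per-person statistics in Table 5 reduce to a correct-tracking rate over in-view frames and the mean and standard deviation of the per-frame localization error. A minimal sketch of that aggregation; the error values in the example are illustrative, not the paper's raw data:

```python
import math

def tracking_stats(frames_in_view, frames_correct, errors_m):
    """Return (correct tracking rate in %, mean error, std of error).

    frames_in_view: frames where the target is in the camera view.
    frames_correct: frames among those that were correctly tracked.
    errors_m: per-frame 3D localization errors in meters.
    """
    rate = 100.0 * frames_correct / frames_in_view
    mean = sum(errors_m) / len(errors_m)
    var = sum((e - mean) ** 2 for e in errors_m) / len(errors_m)
    return rate, mean, math.sqrt(var)
```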
Table 6. Statistics of obstacle detection results.

| Followed Person | Walking Speed | Obstacle Encounters | Successful Detections | Detection Success Rate | Average Processing Time per Frame (ms) |
|---|---|---|---|---|---|
| Person 1 | 1.5 m/s | 8 | 8 | 100% | 22 |
| Person 2 | 1 m/s | 8 | 8 | 100% | 21 |
| Person 3 | 0.6 m/s | 8 | 8 | 100% | 18 |
Table 7. Statistical results of similar-target discrimination based on person re-identification.

| Followed Person | Walking Speed | ReID Trigger Events | Successful ReID Triggers | Correct Target Identifications | Target Identification Rate |
|---|---|---|---|---|---|
| Person 1 | 1.5 m/s | 5 | 5 | 4 | 80% |
| Person 2 | 1 m/s | 5 | 5 | 5 | 100% |
| Person 3 | 0.6 m/s | 5 | 5 | 5 | 100% |
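The ReID trigger events in Table 7 correspond to critical decision points (reappearance after occlusion, a suspected identity switch) at which the robot verifies identity before resuming following. A minimal sketch of such a verification gate using cosine similarity between ReID embeddings; the threshold value and function names are illustrative assumptions, not the paper's implementation:

```python
def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

def verify_target(candidate_feat, reference_feat, threshold=0.7):
    """Confirm the candidate as the followed person only if its ReID
    embedding is sufficiently close to the stored reference embedding;
    otherwise the robot should keep waiting rather than follow a distractor."""
    return cosine_similarity(candidate_feat, reference_feat) >= threshold
```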
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Shen, R.; Wang, Y.; Liu, H.; Gu, H.; Geng, C.; Shi, Y. Visual Perception and Robust Autonomous Following for Orchard Transportation Robots Based on DeepDIMP-ReID. Mach. Learn. Knowl. Extr. 2026, 8, 39. https://doi.org/10.3390/make8020039
