Review

Research Status and Development Trends of Deep Reinforcement Learning in the Intelligent Transformation of Agricultural Machinery

1 School of Agricultural Equipment Engineering, Jiangsu University, Zhenjiang 212013, China
2 School of Technology, Beijing Forestry University, Beijing 100083, China
3 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210031, China
4 Key Laboratory for Theory and Technology of Intelligent Agricultural Machinery and Equipment of Jiangsu University, Zhenjiang 212013, China
* Authors to whom correspondence should be addressed.
Agriculture 2025, 15(11), 1223; https://doi.org/10.3390/agriculture15111223
Submission received: 3 May 2025 / Revised: 1 June 2025 / Accepted: 2 June 2025 / Published: 4 June 2025

Abstract

With the acceleration of agricultural intelligent transformation, deep reinforcement learning (DRL), leveraging its adaptive perception and decision-making capabilities in complex environments, has emerged as a pivotal technology in advancing the intelligent upgrade of agricultural machinery and equipment. For example, in UAV path optimization, DRL can help UAVs plan more efficient flight paths that cover more area in less time. To enhance the systematicity and credibility of this review, this paper systematically examines the application status, key issues, and development trends of DRL in agricultural scenarios, based on research literature from mainstream Chinese and English databases spanning 2018 to 2024. From the perspective of algorithm–hardware synergy, the article provides an in-depth analysis of DRL’s specific applications in agricultural ground platform navigation, path planning for intelligent agricultural end-effectors, and autonomous operations of low-altitude unmanned aerial vehicles. It highlights the technical advantages of DRL by integrating typical experimental outcomes, such as improved path-tracking accuracy and optimized spraying coverage. Meanwhile, this paper identifies three major challenges facing DRL in agricultural contexts: the difficulty of dynamic path planning in unstructured environments, constraints imposed by edge computing resources on algorithmic real-time performance, and risks to policy reliability and safety under human–machine collaboration. Looking forward, the DRL-driven smart transformation of agricultural machinery will focus on three aspects: (1) developing a hybrid decision-making architecture based on model predictive control (MPC) to enhance the strategic stability and decision-making interpretability of agricultural machinery (such as unmanned tractors, harvesters, and drones) in complex and dynamic field environments, which is essential for safe and reliable autonomous operation; (2) designing lightweight models that support edge-cloud collaborative deployment to meet the low-latency and low-power requirements of edge computing during field operations, providing the computational basis for real-time intelligent decision-making; and (3) integrating meta-learning with self-supervised mechanisms to improve fast generalization across crop types, climates, and geographical regions, ensuring that smart agricultural machinery systems remain broadly adaptable and robust and accelerating their application in diverse agricultural settings. This paper proposes research directions along three key dimensions (algorithm capability enhancement, deployment architecture optimization, and generalization ability improvement), offering theoretical references and practical pathways for the continuous evolution of intelligent agricultural equipment.

1. Introduction

With advancements in cloud positioning and collaborative precision positioning (collaborative precision positioning refers to the technology permitting multiple positioning devices or systems to work together to achieve more accurate positioning information) technologies, agricultural automation is transitioning from single-dimensional navigation tasks to multidimensional intelligent operations, significantly reducing labor intensity and enhancing production efficiency [1,2,3]. Modern agricultural automation has evolved under the impetus of precision agriculture [4], which is closely tied to efficient and intelligent navigation systems for agricultural machinery. High-precision positioning and navigation are prerequisites for agricultural robots to perform autonomous field operations [5]. During navigation, robots must first determine their absolute or relative positions to facilitate subsequent path planning and trajectory tasks [6,7]. Unlike navigation scenarios for pedestrians, mobile robots, vehicles, or UAVs [8,9,10,11,12], agricultural environments feature unstructured terrain, including uneven soil, dynamic crop distributions (e.g., varying plant heights during growth cycles), and transient obstacles (e.g., fallen branches or irrigation tools); unmanned agricultural machinery imposes higher demands on navigation due to operational speed and equipment action requirements [13]. Consequently, there is a pressing need to develop instruction-based navigation methods for agricultural machinery to achieve efficient full-field coverage [14]. Current autonomous driving technologies for agricultural machinery, leveraging high-precision positioning and path planning, have markedly improved efficiency in tasks such as plowing and sowing. Furthermore, meeting the diverse demands of delicate operations, such as fruit picking and field weeding, relies on the flexible control of intelligent agricultural end-effectors. Recent breakthroughs in sensor technology, materials science, and computer science have driven innovations in decision-making, perception and localization, structural optimization, intelligent control, and operational management for these end-effectors, which are critical to enhancing harvesting efficiency [15,16,17,18]. Additionally, the application of low-altitude UAV technology in agriculture has emerged as a transformative practice, injecting new momentum into agricultural production through data-driven models [19]. In scenarios such as precision pesticide spraying and crop growth monitoring, UAVs have demonstrated remarkable advantages in autonomous decision-making [20]. However, the unique complexity of agricultural production environments poses multidimensional challenges to traditional automation technologies:
(1) Dynamic environmental factors (e.g., changing crop distributions, sudden obstacles) demand real-time perception and adaptive responses.
(2) Hybrid task requirements (e.g., coupling continuous control with discrete decision-making) necessitate the development of hybrid intelligent decision systems.
(3) Resource constraints (e.g., limited computational power and energy consumption) impose stricter demands on system efficiency.
In agricultural environments, traditional methods struggle to handle real-time environmental changes and are highly sensitive to sensor noise and uncertainty. To address these limitations, researchers are actively exploring the adaptability of intelligent algorithms in agricultural scenarios. Deep learning (DL), a representative perception-intelligence technology, excels at feature extraction and complex data processing, but its reliance on extensive training data and its high computational cost in dynamic environments (e.g., orchards and crop fields) limit its practicality. In contrast, reinforcement learning (RL), which learns through environmental interaction and trial and error, demonstrates unique advantages in addressing dynamic and stochastic challenges [21,22,23]. Nevertheless, traditional RL algorithms encounter the curse of dimensionality in real-world control tasks, because the state and action spaces of complex agricultural environments (e.g., narrow passages, dense crop obstructions, and dynamic obstacles) are too large to be explored and sampled efficiently. To bridge these gaps, researchers proposed deep reinforcement learning (DRL) [24], which integrates the perceptual capabilities of DL with the decision-making strengths of RL in unknown environments. In summary, compared with traditional rule-based automation, DRL couples dynamic perception with autonomous decision-making, overcoming the rigidity of rule systems in the face of real-time environmental change and adapting to the complexity, uncertainty, and resource constraints of agricultural scenarios. Through this deep coupling of perception and decision-making, DRL offers innovative strategies for resolving the dilemma of “environmental dynamism, task complexity, and resource constraints” in agricultural scenarios.
This paper reviews the recent literature to provide a comprehensive understanding of DRL advancements in agricultural production environments. We first introduce DRL fundamentals and classifications, then analyze its applications, challenges, and future directions, aiming to guide further research and deployment of DRL in agriculture.

2. Deep Reinforcement Learning Technology Framework and Algorithm Architecture

2.1. Deep Reinforcement Learning Algorithm Framework

DRL, the integration of deep learning (DL) and reinforcement learning (RL), aims to address intelligent decision-making in complex environments. Its core lies in end-to-end learning, which eliminates the need for labeled data. Instead, agents learn action policies by interacting with the environment through raw input information, refining strategies via trial-and-error to develop highly adaptive intelligence [25].
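To make this trial-and-error interaction loop concrete, the sketch below shows the generic agent–environment cycle using the open-source Gymnasium API; the environment name and the random placeholder policy are illustrative assumptions rather than components of any system reviewed here. In an agricultural setting, the observation would encode machine and field state, and the action would drive steering, spraying, or manipulation commands.

```python
import gymnasium as gym

# Minimal agent-environment interaction loop (illustrative only).
# "CartPole-v1" stands in for an agricultural task environment.
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)

for step in range(200):
    action = env.action_space.sample()  # placeholder for a learned policy
    observation, reward, terminated, truncated, info = env.step(action)
    # A DRL agent would store (s, a, r, s') here and update its networks.
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```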
DRL encompasses a wide range of algorithms [26], including classical frameworks such as Deep Q-Networks (DQNs) [27], Deep Deterministic Policy Gradient (DDPG) [28], and Twin Delayed Deep Deterministic Policy Gradient (TD3) [29]. DQNs use deep networks to handle high-dimensional states (limited to discrete actions); DDPG outputs continuous actions via Actor–Critic but suffers from overestimation; TD3 introduces dual Critics, delayed updates, and policy smoothing to reduce variance and error, resulting in greater stability. Depending on whether environmental models are explicitly constructed during training, DRL algorithms can be categorized into model-based and model-free approaches [30,31,32,33] (Figure 1) (Table 1). Model-based DRL algorithms seek optimal policies based on learned environmental dynamics, encompassing methods like fine-tuning algorithms [34] and augmented intelligence frameworks [35]. In contrast, model-free DRL algorithms acquire optimal policies through direct agent–environment interactions, broadly divided into value function-based and policy gradient-based methods [36]. Compared to model-based approaches, model-free DRL avoids constructing explicit environmental dynamics, learning directly from interactions, making it more suitable for high-dimensional, dynamic, or unknown agricultural scenarios. This section focuses on model-free DRL algorithms, specifically value function and policy gradient variants.

2.1.1. Value Function-Based DRL Algorithms

The DQN algorithm, a widely adopted RL variant, has been successfully applied to robot path planning and similar fields [40,41,42]. In DQNs, agents utilize neural networks to approximate the optimal action value function, mapping states to the expected long-term rewards of actions. The network is trained through a variant of Q-learning, where agents learn from state transitions and rewards [43,44,45]. When applying a DQN to robot path planning, agents are first trained on a set of sample environments. During training, agents explore the environment and learn to select optimal actions for each state. Performance is evaluated based on their ability to navigate to target locations while avoiding obstacles [46,47,48]. However, a traditional DQN’s single-network architecture is prone to Q-value overestimation, potentially destabilizing policy convergence. This limitation has driven researchers to propose improved frameworks like Double Q-learning.
To mitigate the overestimation caused by a single Q-network, Hasselt et al. [49] introduced a Double Q-learning structure combined with deep learning, outperforming standard DQNs. Building on Double-DQNs, Zhang et al. [50] developed a path smoothing and tracking control method capable of tracking linear and polygonal paths. Compared to traditional pure pursuit control (PPC), their approach significantly reduced stabilization time and corner overshoot in high-speed scenarios. Yue et al. [51] employed a Double-DQN architecture to design a reinforcement learning-based obstacle avoidance controller for agricultural robots, achieving efficient and reliable autonomous navigation in complex farm environments. Ren et al. [52] implemented a Double-DQN-based path-tracking control algorithm for orchard traction spray robots, demonstrating superior path-tracking accuracy and stability compared to conventional control algorithms. These studies confirm that value function-based DRL algorithms effectively reduce errors caused by inaccurate environmental modeling in agriculture, thereby enhancing operational efficiency.
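The distinction between the two estimators is easiest to see in how the training target is formed. The following PyTorch sketch contrasts the standard DQN target with the Double-DQN target; the network sizes, batch contents, and variable names are assumptions made for illustration and are not drawn from the cited studies.

```python
import torch
import torch.nn as nn

# Illustrative Q-networks (dimensions are arbitrary, not from the cited studies).
state_dim, n_actions, gamma = 4, 3, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # target starts as a copy of the online net

# A dummy mini-batch of transitions (s, a, r, s', done flag).
s  = torch.randn(32, state_dim)
a  = torch.randint(0, n_actions, (32,))
r  = torch.randn(32)
s2 = torch.randn(32, state_dim)
d  = torch.zeros(32)

with torch.no_grad():
    # Standard DQN target: the target network both selects and evaluates
    # the next action, which tends to overestimate Q-values.
    dqn_target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values

    # Double-DQN target: the online network selects the action, while the
    # target network evaluates it, decoupling selection from evaluation.
    best_a = q_net(s2).argmax(dim=1, keepdim=True)
    ddqn_target = r + gamma * (1 - d) * target_net(s2).gather(1, best_a).squeeze(1)

# Temporal-difference loss on the actions actually taken.
q_taken = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = nn.functional.smooth_l1_loss(q_taken, ddqn_target)
loss.backward()
```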

2.1.2. Policy Gradient-Based DRL Algorithms

Policy gradient-based DRL algorithms primarily include DDPG, Trust Region Policy Optimization (TRPO) [53], and Asynchronous Advantage Actor–Critic (A3C) [54] (Figure 2). The core idea of these algorithms is to adjust policy parameters via gradient ascent to maximize the expected long-term cumulative reward. Owing to their flexibility and adaptability, policy gradient-based DRL has become a critical branch of DRL, particularly excelling in complex control tasks.
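For reference, the gradient-ascent idea behind this family can be illustrated with the simplest policy gradient estimator (REINFORCE). The sketch below is a generic toy example in PyTorch, not an implementation of DDPG, TRPO, or A3C; all dimensions and data are placeholders.

```python
import torch
import torch.nn as nn

# A small categorical policy network for a discrete toy problem.
state_dim, n_actions = 4, 3
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One dummy episode of (state, action, reward) data stands in for
# real agent-environment interaction.
states  = torch.randn(20, state_dim)
actions = torch.randint(0, n_actions, (20,))
rewards = torch.randn(20)

# Discounted returns-to-go G_t = sum_k gamma^k * r_{t+k}.
gamma, G, returns = 0.99, 0.0, []
for rwd in reversed(rewards.tolist()):
    G = rwd + gamma * G
    returns.append(G)
returns = torch.tensor(list(reversed(returns)))

# Gradient ascent on E[log pi(a|s) * G]: minimize the negative objective.
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * returns).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```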
Designing effective reward mechanisms and strategies for multi-agent collaboration and competition remains challenging due to intricate relationships between agents. For instance, Hu et al. [55] enhanced the DDPG algorithm by designing reward functions based on COLREGs (Collision Avoidance Regulations) and introducing potential-based reward shaping to guide agents toward optimal policies. Takahashi et al. [56] proposed a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method to learn near-optimal solutions from prior knowledge. Ye et al. [57] integrated an ε-greedy exploration strategy into MADDPG to balance exploration and exploitation during training, accelerating convergence and improving performance. Dynamic environments pose another significant challenge, as traditional deep learning models trained in static settings struggle to adapt to real-time changes. To address this, Chen et al. [58] improved the reward–penalty mechanism of DDPG by incorporating adaptive artificial potential fields (APF-DDPG). This enhancement boosted learning speed, obstacle avoidance success rates, and continuous information utilization during training. For efficient computational resource utilization, Sasaki et al. [59] developed an A3C-based motion learning algorithm for autonomous mobile robots, reducing training costs and simplifying collision avoidance design. In summary, researchers have made notable progress in reward design for multi-agent systems, policy stability, and dynamic environment adaptability. These advancements not only enhance learning efficiency and decision-making precision in complex scenarios but also provide foundational insights for future explorations in multi-objective collaborative optimization and robust policy design under non-stationary conditions.
Compared to value function-based DRL algorithms, policy gradient-based approaches directly search for optimal policies in the policy space through end-to-end learning. Their simpler architecture makes them more suitable for handling continuous and high-dimensional action spaces.

3. Applications of DRL in Agricultural Production Environments

With the rapid advancement of artificial intelligence, DRL has been widely adopted in agricultural production environments due to its robust learning and decision-making capabilities [60]. For instance, DRL-powered agricultural transplanting robots have reduced mechanical seedling damage rates to 2.82% through multi-agent collaborative path optimization. Meanwhile, DRL-enhanced plant protection drones achieved a 41.68% improvement in pesticide deposition coverage rate while reducing operational path overlap to 5.56%, enabled by dynamic environment modeling and action space adaptation. This section explores its applications in three key areas: agricultural ground platform navigation, motion planning for intelligent agricultural end-effectors, and low-altitude UAV operations, highlighting the technical strengths and operational advantages of smart agricultural systems to support the development of an intelligent agricultural ecosystem.

3.1. Agricultural Ground Platform Navigation

3.1.1. Application Background

Agricultural ground platforms, including autonomous tractors and combine harvesters, face significant navigation challenges in unstructured farmland environments characterized by uneven crop distributions, soil undulations, and transient obstacles. These dynamic, unstructured environments demand robust navigation systems to handle variable conditions while improving operational efficiency and reducing energy consumption. Although traditional navigation technologies have made progress through sensor fusion and algorithm enhancements, their inherent limitations become evident in agricultural scenarios. For example, Freitas et al. [61] developed an orchard obstacle detection system that identifies obstacles over 15 cm when unobscured by grass and adjusts vehicle speed based on proximity yet fails to detect smaller obscured obstacles. Blok et al. [62] achieved lateral deviations below 5 cm at 0.5 m/s speeds using a particle filter with laser beam modeling but showed limited adaptability in high-speed or sudden obstacle scenarios. Wang et al. [63] improved the A*-SVR method to reduce lateral deviations by 29.57% to 6.90 cm between tree rows, though performance degrades significantly under model mismatches due to reliance on precise environmental modeling. These methods generally suffer from static environmental perception and rigid algorithmic adjustments, struggling to adapt to real-time agricultural dynamics. In contrast, DRL overcomes dependence on preset models through autonomous environmental interaction and policy optimization, offering a novel pathway to break through traditional navigation limitations in unstructured agricultural settings.

3.1.2. DRL-Based Solution and Key Advantages

In the field of precision agriculture navigation, traditional end-to-end decision-making frameworks based on multi-sensor fusion (e.g., vision/LiDAR/GNSS integration) can achieve basic path tracking but face challenges such as complex parameter tuning and insufficient environmental adaptability, particularly under dynamic scenarios like uneven crop distributions, undulating terrain, or sudden obstacles [64]. To address these complex control issues, DRL is emerging as a core technology for intelligent navigation in agricultural machinery, leveraging its autonomous decision-making advantages through environmental interactions. In agricultural ground vehicle control, innovations in DRL algorithms primarily focus on dynamic performance optimization and system lightweighting. For instance, Zhang et al. [50] pioneered the application of Double-DQNs to agricultural vehicle navigation. By integrating nonlinear normalization and reward function design, they optimized dynamic path-tracking performance, overcoming the limitations of traditional methods in sharp-turning and high-speed scenarios. Yan et al. [65] proposed a lightweight, portable DRL control algorithm. By incorporating path curvature states, the algorithm enhanced control accuracy in curved and high-curvature paths, demonstrating centimeter-level precision and robust performance in both simulations and field trials. This method has been successfully applied to unmanned agricultural machinery for straight-line operations and headland turning. To meet the multidimensional demands of smart agriculture, DRL technology is expanding into multi-agent collaboration and specialized environmental adaptation. Examples of this include the following: Fan et al. [66] developed an SMO-Rainbow strategy for UAV path planning in smart tourism agriculture. By combining hierarchical reinforcement learning (HRL) with techniques like Double-DQN and Dueling DQN, they reduced model complexity and improved performance, addressing inefficiencies in data collection, training complexity, and dynamic adaptability. Hu [67] introduced a DRL-based navigation method for orchard inspection robots using the Soft Actor–Critic (SAC) algorithm. Ordered stochastic curriculum learning was employed to resolve GPS inaccuracy, sparse rewards, and environmental adaptability challenges. Furthermore, DRL is demonstrating significant potential beyond physical navigation, extending into the core decision-making and resource optimization layers of smart agriculture systems. Devarajan et al. [68] proposed a two-stage DRL framework (DONSA/MACO-DQN + RL-DQN) for intelligent agricultural management. The first stage employs an Ant Colony Optimization-enhanced Deep Q-Network (MACO-DQN) to optimally offload diverse monitoring tasks (e.g., fire/pest detection, irrigation scheduling, soil/climate monitoring) to edge, fog, or cloud computing nodes based on latency, energy consumption, and computing power. The second stage utilizes a reinforcement learning-enabled DQN (RL-DQN) model for the actual prediction and monitoring of these agricultural activities. This approach demonstrated superior performance (98.5% precision, 99.1% recall, 98.1% F-measure, and 98.5% accuracy) and faster convergence compared to traditional methods, highlighting DRL’s capability in optimizing complex system-level tasks and resource allocation within the smart agriculture paradigm. Table 2 compares agricultural machinery navigation methods. Next, we introduce several typical agricultural ground platform algorithms:
  • Double-DQN algorithm (Figure 3a): The Double-DQN is an enhanced algorithm developed from the foundation of DQNs, with the primary objective of addressing the issue of Q-value overestimation inherent in DQNs [49]. The central principle involves decoupling the evaluation of Q-values from the selection of actions, employing two distinct neural networks to execute these tasks independently. This separation enhances the precision of Q-value estimation, thereby enabling the agent to learn an optimal policy with greater stability throughout the decision-making process. As depicted in the leftmost input section of the accompanying figure, the input comprises speed and two error terms. These elements collectively characterize the agent’s current environmental state, serving as the foundational basis for subsequent decision-making. The central component of the figure, the deep network, constitutes the critical processing unit of the Double-DQN. This network accepts the input state and processes it through a multi-layer neural network architecture, performing feature extraction and transformation to map the relationship between states and Q-values. The network’s parameters are iteratively updated to progressively approximate the true Q-value function. In the output phase, the action yielding the highest Q-value is designated as the final decision action. This selection is executed via a “max” operation, identifying the optimal action that drives the agent to perform the corresponding operation within the environment, subsequently receiving a reward, and transitioning to the next state. The figure provides a clear illustration of this workflow: from the input state, through deep network processing to derive Q-values, to the ultimate selection of actions based on these Q-values. This visualization encapsulates the fundamental operational framework and decision-making paradigm of Double-DQNs.
  • Soft Actor–Critic (SAC) algorithm (Figure 3b): The SAC algorithm is a policy gradient method rooted in maximum entropy reinforcement learning [69]. Its core principle involves not only pursuing high cumulative rewards during policy optimization but also maximizing the entropy (uncertainty) of the policy. This encourages the agent to thoroughly explore the environment while maintaining effective control. During training, the agent executes actions determined by the current policy (output by the Actor network), interacts with the environment, and collects experiential data (such as states, actions, and rewards), which is stored in a replay buffer. Data sampled from this replay buffer is then used to update the networks.
  • SMO-Rainbow strategy (Figure 3c): The Rainbow algorithm integrates multiple improvements into the DQN algorithm [70]. The network architecture is a fully connected neural network with two hidden layers followed by a dueling layer. The network input is the environment state S, with 1024 and 512 neurons in the hidden layers, respectively. The dueling layer comprises a value function network and an advantage function network, which together output the predicted Q-values for each option. The hidden layers and the dueling layer use ReLU (rectified linear unit) activation functions, and the loss function is the Huber loss. A minimal code sketch of this dueling architecture is given after this list.
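The following PyTorch sketch illustrates a dueling Q-network of the kind described for the SMO-Rainbow strategy above (two ReLU hidden layers of 1024 and 512 units feeding separate value and advantage heads); the input dimension, action count, and combination rule shown are common conventions assumed for illustration rather than details taken from the original work.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture following the description above: two ReLU hidden
    layers (1024 and 512 units) feeding separate value and advantage heads,
    whose combination yields one Q-value per action."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
        )
        self.value_head = nn.Linear(512, 1)               # state value V(s)
        self.advantage_head = nn.Linear(512, n_actions)   # advantages A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.backbone(state)
        value = self.value_head(h)
        advantage = self.advantage_head(h)
        # Standard dueling combination: subtract the mean advantage so that
        # the value and advantage streams remain identifiable.
        return value + advantage - advantage.mean(dim=1, keepdim=True)

# Example forward pass with a dummy environment state (dimensions assumed).
q_values = DuelingQNetwork(state_dim=16, n_actions=8)(torch.randn(2, 16))
# Training would regress these Q-values toward TD targets with a Huber loss,
# e.g., nn.functional.smooth_l1_loss(predicted, target).
```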
These studies collectively reveal the technological breakthrough trajectory of DRL in agricultural navigation: through adaptive state representation construction (such as the introduction of curvature parameters), hierarchical policy optimization (such as the SMO-Rainbow framework), and innovative training mechanisms (such as curriculum learning), they systematically address inherent deficiencies of traditional methods in dynamic environment perception, multi-objective optimization, and hardware adaptation. With the integration and evolution of lightweight deployment and edge computing technologies, DRL-driven high-precision navigation is reshaping the operational paradigms of precision agriculture, providing core technological support for low-consumption and efficient unmanned farm operations.

3.2. Motion Planning for Intelligent Agricultural End-Effectors

3.2.1. Application Background

In recent years, DRL has demonstrated significant advantages in motion planning for intelligent agricultural end-effectors. However, fruit harvesting and weeding require precise manipulator control in dynamic environments with occlusions and soft targets, challenges where rule-based control falls short. By leveraging autonomous agent–environment interaction mechanisms, DRL effectively addresses the limitations of traditional methods in dynamic environmental adaptability and complex constraint handling. From fruit picking to field weeding, and from multi-end-effector collaboration to dynamic path planning, DRL is rapidly advancing agricultural robots toward intelligent operation.

3.2.2. DRL-Based Solution and Key Advantages

As a core driver for the intelligent upgrade of agricultural robotic arms, DRL’s capabilities in autonomous environmental interaction and dynamic policy optimization are progressively overcoming the technical bottlenecks of traditional motion planning in complex farmland scenarios. In perception modeling, researchers construct high-dimensional state spaces through multi-source heterogeneous data fusion. For example, Zhang et al. [71] employed the SAC algorithm to develop a 6-degree-of-freedom (6-DOF) robotic arm assembly system. By integrating multi-source data such as end-effector pose, joint angles, and contact forces into the state space, they achieved a 93% success rate in flexible assembly tasks for 3C products. To address the morphological diversity of crops, Sheng et al. [72] innovated a joint framework combining point cloud feature extraction and grasp evaluation networks, enabling the stable grasping of irregular targets like tomatoes and peppers. This advances DRL’s adaptation from general-purpose robotic manipulation to specialized agricultural scenarios. As perception modeling matures, real-time responsiveness in dynamic environments has become a focal point. Xuan et al. [73] modeled dynamic target grasping using Markov decision processes, incorporating Kalman filtering for trajectory prediction and Proximal Policy Optimization (PPO) for 6-DOF robotic arm control. Their approach improved success rates by 21% compared to traditional methods. Simultaneously, Liu et al. [74] proposed a safety verification mechanism, building a closed-loop system (trajectory prediction-collision detection-dynamic replanning) on the TD3 framework. This enabled robotic arms to generate dynamic obstacle avoidance paths within 12.1 milliseconds, achieving a 3.7-fold speed improvement over Rapidly Exploring Random Trees (RRTs). For multi-robot collaborative tasks with complex constraints, DRL exhibits robust optimization capabilities. With advancements in single-robot intelligence, distributed decision-making frameworks are emerging to address multi-agent coordination challenges. Xie et al. [75] constructed a multi-agent Markov game model, coordinating a four-robot harvesting system via self-attention mechanisms. This reduced task completion time by 10.7% in scenarios involving 50 targets. Bu et al. [76] combined artificial potential fields with DDPG, designing a multi-level reward function incorporating obstacle avoidance penalties and contact force constraints. Their method reduced seedling damage rates to 2.82% in inter-row weeding tasks. These studies demonstrate that explicit constraint integration into reward functions allows DRL to balance operational efficiency and safety metrics effectively. To tackle the inherent uncertainties of agricultural environments, successful deployment relies on breakthroughs in simulation-to-reality transfer. Current research utilizes high-fidelity digital twin systems built on physics engines like MuJoCo and PyBullet. By introducing over 20 domain randomization parameters—such as wind disturbances and soil adhesion—the generalization capabilities of robotic arms trained in virtual environments are significantly enhanced. From multimodal perception to swarm collaboration, and from dynamic obstacle avoidance to virtual–real migration, DRL not only elevates robotic arm precision by orders of magnitude but also redefines the technological paradigm of agricultural automation. 
It shifts mechanical systems from rigid program execution to autonomous environmental cognition, injecting core momentum into fully autonomous operations for fruit picking, precision spraying, and beyond. A representative framework diagram of motion planning algorithms for intelligent agricultural robotic end-effectors is illustrated in Figure 4:
  • Twin Delayed Deep Deterministic Policy Gradient (TD3) (Figure 4a): TD3 is an improved deterministic policy gradient algorithm. Its core idea is to enhance training stability by combining dual critic networks, target policy smoothing, and delayed policy updates. During execution, the agent takes actions (“a”) according to the policy output by the Actor network π0, interacts with the environment, and observes the next state (“s′”). These experiences, comprising the current state, action, and next state, are stored (typically in a replay buffer) for later updates. A schematic of the TD3 target computation is sketched after this list.
  • A deep reinforcement learning network with self-attention (Figure 4b): This deep reinforcement learning network incorporates a self-attention mechanism. Its core idea is to leverage self-attention to capture global dependencies among different features within the input observations, thereby extracting more effective features for value estimation and action selection. The agent acquires observational data from the environment and inputs it into a feature extractor. The data first pass through fully connected (FC) and ReLU layers, which perform preliminary feature extraction and nonlinear transformation on the raw observations. The self-attention mechanism then processes these initially extracted features, computing correlation weights between them. By comparing each feature against the others, the self-attention mechanism determines the importance of each feature and generates a new feature representation that highlights the features most critical to the task.
  • Deep Deterministic Policy Gradient (DDPG) algorithm (Figure 4c): The DDPG algorithm is a reinforcement learning method that combines the Q-learning and policy gradient techniques [28]. It employs two neural networks—a policy network (Actor) and a value network (Critic)—to represent the policy and value function, respectively, utilizing experience replay and target networks to stabilize the training process. At its core, DDPG optimizes the value function via a deterministic policy to identify optimal policies in continuous action spaces. The agent executes actions generated by the online policy network πθ (st) during environmental interaction, observing the next state (st+1) and receiving an immediate reward (rt). These experiences (st, at, rt, and st+1) are stored in an experience replay buffer. Through continuous updates to the online networks and soft updates to the target networks, the agent learns an effective policy to maximize cumulative rewards.
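As a concrete illustration of the TD3 ingredients listed above (twin critics, target policy smoothing, and delayed updates), the PyTorch sketch below shows how the TD3 training target is typically computed; the network sizes, noise limits, and batch data are illustrative assumptions, not parameters of any cited agricultural system.

```python
import torch
import torch.nn as nn

# Illustrative target actor and twin target critics; sizes and names are assumed.
state_dim, action_dim, gamma = 12, 4, 0.99
actor_target = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                             nn.Linear(64, action_dim), nn.Tanh())
critic1_target = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                               nn.Linear(64, 1))
critic2_target = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                               nn.Linear(64, 1))

# Dummy batch of rewards, done flags, and next states.
r, d = torch.randn(32, 1), torch.zeros(32, 1)
s2 = torch.randn(32, state_dim)

with torch.no_grad():
    # Target policy smoothing: add clipped noise to the target action.
    noise = (0.2 * torch.randn(32, action_dim)).clamp(-0.5, 0.5)
    a2 = (actor_target(s2) + noise).clamp(-1.0, 1.0)

    # Clipped double-Q: take the minimum of the two target critics to
    # counteract value overestimation.
    q1 = critic1_target(torch.cat([s2, a2], dim=1))
    q2 = critic2_target(torch.cat([s2, a2], dim=1))
    td_target = r + gamma * (1 - d) * torch.min(q1, q2)

# Both online critics are regressed toward td_target every step, while the
# actor and the target networks are updated only every few critic updates
# ("delayed" updates), the third stabilizing ingredient of TD3.
```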
DRL overcomes the limitations of traditional agricultural robotic arms by enabling autonomous environmental interaction and dynamic policy optimization, which is crucial for complex farmland scenarios. Its multidimensional perception decision loop integrates robot dynamics with environmental physics, achieving real-time responses such as millisecond obstacle avoidance. Table 3 summarizes the performance comparison between DRL schemes and traditional methods in robot task scenarios.

3.3. Agricultural Low-Altitude Drone Operations

3.3.1. Application Background

As aerial platforms for agricultural intelligence, unmanned aerial vehicles (UAVs) have witnessed rapid adoption in modern farming systems, driven by their low-cost deployment and operational flexibility. Currently, UAVs are undergoing a paradigm shift—transitioning from mechanized task execution to autonomous decision-making—through the integration of DRL. This evolution enables UAVs to dynamically optimize mission parameters (e.g., flight altitude, spraying intensity, and obstacle avoidance) in response to real-time field conditions, thereby enhancing both task adaptability and resource-use efficiency [77,78] (Figure 5). By leveraging the state representation capabilities of deep neural networks and the dynamic policy optimization mechanisms of reinforcement learning, DRL has successfully addressed core challenges faced by traditional UAVs in complex farmland scenarios, such as delayed environmental responses and difficulties in multi-objective coordination. Significant breakthroughs have been achieved in critical areas, including pesticide spraying, path planning, and dynamic perception.

3.3.2. DRL-Based Solution and Key Advantages

In pesticide spraying scenarios, DRL has enabled a transformative leap from static path execution to dynamic trajectory optimization. For instance, Hu et al. [80] designed a wireless sensor network (WSN)-assisted UAV trajectory correction system that integrates real-time wind speed and direction data from ground sensors. By employing a dual-algorithm framework combining a DQN and Particle Swarm Optimization (PSO), they generated optimal spraying paths. Experimental results demonstrated a 41.68% increase in pesticide settlement coverage and a reduction in overlapping coverage to 5.56%. Fu et al. [81] further enhanced traditional DQNs by introducing a Bidirectional Long Short-Term Memory (Bi-LSTM) network to construct a BL-DQN model. This approach achieved 19.9% fewer steps for efficient coverage in a 10 × 10 grid map, validating DRL’s robustness in pesticide spraying across complex terrains. Path planning technologies have also evolved significantly, showcasing DRL’s advantages in multi-physics coupled environments. Kang et al. [82] proposed a control frequency adaptive (CFA) scheduling method, where a reinforcement learning agent dynamically adjusts PID-based attitude and position control frequencies. Under sudden wind disturbances, this method, combined with quadrotor dynamics and Q-learning, reduced waypoint tracking time by 12.8%. Huang et al. [83] addressed obstacle avoidance in mountainous orchards by developing a 3D path planning system using deep Q-learning. A multi-tiered reward mechanism—incorporating terrain slope, crop density, and pest distribution weights—reduced path redundancy by 31.4% and saved battery consumption by 18.2%. Agricultural UAVs not only address the limitations of satellite-based remote sensing (e.g., temporal resolution constraints and cloud interference) but also demonstrate superior operational advantages, including real-time responsiveness, cost-effectiveness, and operator-friendly simplicity [84]. Furthermore, DRL integration is driving the evolution of agricultural UAVs toward embodied intelligence—a paradigm characterized by self-adaptive decision loops and enhanced environment–body–task integration through sensorimotor learning. Almalki et al. [79] developed a dual-cognition UAV system that integrates DRL for autonomous navigation and Faster R-CNN for vegetation detection, achieving 98% detection accuracy in Saudi Arabia’s arid valleys. Chen et al. [85] proposed the MTPI-MTSA hybrid algorithm, which optimizes network initialization parameters through transfer learning and automatically enhances datasets using multi-Thompson sampling reinforcement learning. This method reduced vegetation segmentation network training time by 93.7% and improved the Intersection over Union (IoU) metric to 90.9%. Additionally, Gözen and Özer [86] innovated action-sequence reward functions and introduced diagonal movement commands for UAV visual tracking, boosting tracking accuracy by 3.87% on the VisDrone dataset and effectively resolving challenges such as small-target occlusion and scale variations. Collectively, these advancements outline a clear trajectory for DRL-powered agricultural UAVs: transitioning from single-task optimization (e.g., spray coverage) to multi-objective synergy (e.g., balancing energy efficiency and precision) and evolving from discrete functional modules (perception, planning, control) to intelligent closed-loop systems. 
This progression establishes a comprehensive technical framework that integrates environmental perception, dynamic decision-making, and precise execution into a unified continuum. With ongoing breakthroughs in 3D path planning and multimodal perception fusion, DRL-driven UAVs are redefining operational paradigms in precision plant protection and ecological monitoring, offering scalable, intelligent solutions to support global agricultural transition toward low-carbon practices. A representative algorithmic framework for low-altitude agricultural UAV operations is depicted in Figure 6:
  • Neural network-based value function approximator of reinforcement learning (Figure 6a): Hu et al.’s network consists of two fully connected layers instead of the convolutional layers found in typical Deep Q-Networks. The rectified linear unit (ReLU) is used as the activation function, and each layer has 32 neurons. The input is the state S = (U, Xw, w), and the output is the corresponding Q-value. A minimal sketch of such a compact value network follows this list.
  • QFP and CFA models (Figure 6b): QFP is trained to predict the next position of a quadcopter controlled by PID in an environment free from external influences. After training, the estimated next state by QFP is used as input for CFA. Then, CFA is trained to adjust the position and attitude control frequencies against external influences by maximizing a given reward function. The approximate external influence is calculated as the difference between the current state (enclosed in a red dashed box) and the estimated state (enclosed in a green dashed box). CFA is trained to balance the position and attitude control frequencies to achieve time-optimal waypoint tracking control.
  • Multi-Task State Aggregation (MTSA) algorithm (Figure 6c): The MTSA algorithm aims to enhance the efficiency and performance of reinforcement learning by combining multi-task learning with state aggregation techniques. Its core principle leverages shared information across tasks to accelerate learning while reducing state space complexity through state aggregation, enabling more effective handling of complex environments. The agent interacts with the environment, selecting actions based on the current state. These actions are generated by a Q-network, which guides the agent’s behavior by estimating the value of state–action pairs. During training, state-aggregated training is accelerated using MPI (Message Passing Interface). This approach enables efficient processing of large-scale state spaces.
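For illustration, a compact value network of the kind described in Figure 6a can be written in a few lines. The snippet below is a hedged PyTorch sketch: the three-dimensional state encoding, the reading of "two fully connected layers with 32 neurons each" as two hidden layers plus an output layer, and the number of candidate actions are all assumptions made for demonstration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the fully connected value-function approximator described
# above: two hidden layers of 32 units with ReLU activations, mapping the
# state S = (U, Xw, w) to Q-values (action count assumed for illustration).
n_actions = 5
q_network = nn.Sequential(
    nn.Linear(3, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, n_actions),   # one Q-value per candidate spraying action
)

state = torch.tensor([[0.4, 0.1, 2.5]])  # dummy (U, Xw, w) reading
q_values = q_network(state)
greedy_action = q_values.argmax(dim=1)   # action selected during deployment
```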

4. Key Challenges of DRL in Agricultural Applications

Although DRL has shown significant advantages in agricultural ground platform navigation, robotic arm motion planning, and low-altitude drone operations, it still faces numerous severe challenges in practical implementation. Overall, these challenges manifest in three dimensions: environmental complexity, limitations of algorithms and computational resources, and issues of safety and reliability. The following analysis examines each dimension in detail.

4.1. Environmental Complexity Issues

Agricultural production sites are typically unstructured environments characterized by high dynamism and uncertainty. Regarded as semi-natural systems, agricultural settings inherently involve numerous stochastic and seasonal factors [15], posing significant challenges for autonomous agricultural systems [87]. For instance, variations in crop growth cycles, undulating terrain, climatic conditions, and sudden weather disturbances impose heightened adaptability requirements on automated equipment for navigation and task execution. Particularly in highly dynamic and unstructured agricultural environments—which may include obstacles of varying sizes (e.g., branches, leaves), gusts of wind, multi-obstacle occlusions, and fluctuating light conditions [88]—the unpredictable distribution of crops, diverse topography, and erratic obstacles render traditional experience- or rule-based methods inadequate in addressing these environmental complexities.
The technical bottlenecks of path planning in dynamic environments are further exacerbated under such conditions. In dynamic scenarios, if sensors detect unexpected obstacles, path planning systems must rapidly recalculate trajectories to avoid collisions while progressing toward the target location [89]. Conventional path planning approaches often struggle to deliver satisfactory performance under these constraints. Although DRL technology offers adaptive learning capabilities, continuously evolving dynamic factors in real-world environments (e.g., crop growth stages, sudden weather changes) can cause trained policies to become obsolete quickly, compromising long-term stability. Table 4 summarizes the strengths and weaknesses of contemporary path planning methods.
The above content categorizes and summarizes various path planning methods, but their application advantages, limitations, and suitability in specific agricultural scenarios still require further exploration. Below, we analyze the applicability and challenges of different methods by combining three typical scenarios proposed in this paper: navigation for agricultural ground platforms, motion planning for intelligent agricultural end-effectors, and low-altitude drone operations. For agricultural ground platform navigation, the core requirements include dynamic path planning, real-time obstacle avoidance, high-precision path tracking, and adaptability to unstructured farmland environments (such as crop growth interference and terrain undulations), as shown in Table 5. However, its limitations lie in the fact that dynamic farmland environments (such as crop growth interference and sudden obstacles) cause traditional methods (like RRT and A*) to fail; while DRL methods (such as TD3 and SAC) are more adaptable, they rely on high-performance computing devices, making it difficult to deploy them in edge environments in real time. For intelligent agricultural end-effector motion planning, the core requirements include high-precision trajectory control, dynamic obstacle avoidance (such as moving crops), multi-arm collaboration, and real-time response (<50 ms), as shown in Table 6. However, its limitations are that agricultural end-effectors need to balance efficiency and safety at millimeter-level precision, which traditional methods (like APF) cannot meet under dynamic constraints; although DRL methods (such as MADDPG) support collaborative optimization, they require high-precision simulation environment migration support. For low-altitude drone operations, the core requirements include multi-objective coordination (such as balancing coverage and energy consumption), 3D path planning, wind disturbance resistance, and real-time decision-making (<100 ms), as shown in Table 7. However, its limitations are that drones need to achieve multi-objective optimization under resource constraints, which traditional methods (like genetic algorithms) cannot respond to online; hybrid architectures (such as DQN + PSO) improve performance but depend on high-precision environmental modeling and edge-cloud collaborative computing.
In summary, in terms of agricultural ground platforms, DRL methods such as TD3 and SAC are gradually replacing traditional rule-based algorithms but need to address edge computing bottlenecks. For agricultural end-effectors, multi-agent DRL like MADDPG is driving collaborative automation but relies on high-fidelity simulation-to-reality transfer. In the case of low-altitude agricultural drones, hybrid architectures combining DRL with optimization algorithms have become mainstream but require breakthroughs in lightweight deployment and cross-scenario generalization. Notably, even if path planning algorithms can handle dynamic environments, agricultural automation systems face another fundamental challenge: technical limitations in data acquisition and processing. Agricultural operations rely heavily on data from vision systems, LiDAR, GNSS, and environmental sensors, which are often plagued by noise and latency. Moreover, significant disparities in data formats, accuracy, and update frequencies across different sensors complicate multi-sensor data fusion. While DRL algorithms approximate Q-value functions via deep neural networks during training, enabling agents to learn and optimize path planning strategies [101], they must process high-dimensional and imperfect state information. This compromises model convergence and decision robustness, particularly in scenarios involving transient obstacles or abrupt local environmental changes. Future research must therefore prioritize breakthroughs in sensor fusion and noise-resistant algorithm design to enhance model adaptability.

4.2. Algorithm and Computational Resource Limitations

Despite demonstrating theoretical potential in agricultural automation, the practical deployment of DRL faces systematic challenges arising from computational constraints, data acquisition, and real-time responsiveness.
First, computational resource bottlenecks directly limit algorithmic performance. DRL’s training and inference heavily rely on GPU parallel computing capabilities [102]. However, field operation platforms must simultaneously process multimodal data streams, including multispectral imaging (crop vitality assessment [103]), thermodynamic analysis (water stress detection [104]), computer vision (pest/disease detection [105]), and LiDAR point clouds (vegetation density reconstruction [106]). These demands impose stringent requirements on edge computing devices, with existing agricultural machinery often suffering from delayed policy updates due to insufficient computational power. For instance, during fruit harvesting operations using agricultural robotic end-effectors, oscillations of target fruits during the grasping process may significantly increase the computational load of motion planning algorithms. This elevated demand for real-time processing could induce mechanical blockages (e.g., delayed actuator responses or trajectory conflicts), thereby compromising the practical utility of automated harvesting systems in field applications [107].
Second, high data acquisition costs and simulation-to-reality transfer hurdles exacerbate deployment difficulties. DRL’s reliance on extensive trial-and-error interactions compels researchers to use virtual simulation platforms like MuJoCo and PyBullet as substitutes for costly field trials. However, discrepancies persist between simulated and real-world environments in terms of lighting randomness, soil physical properties, and other factors. Even with domain randomization strategies to inject perturbed parameters, path planning errors in simulation-to-reality transfer remain unresolved. Furthermore, the massive computational resources required for simulation further elevate technical barriers to application.
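A common way to narrow this simulation-to-reality gap is to re-sample the physical parameters of the simulator before every training episode. The sketch below shows the general pattern of such domain randomization; the parameter names and ranges are purely illustrative assumptions, and the simulator call is left as a placeholder.

```python
import random

# Schematic domain-randomization loop: before each simulated training episode,
# physical parameters are re-sampled so that the learned policy does not
# overfit to a single idealized farm environment. Names and ranges below are
# illustrative assumptions, not values from the cited studies.
def sample_randomized_environment():
    return {
        "wind_speed_mps":      random.uniform(0.0, 8.0),
        "soil_friction":       random.uniform(0.4, 1.2),
        "soil_adhesion":       random.uniform(0.0, 0.3),
        "light_intensity_lux": random.uniform(5_000, 100_000),
        "sensor_noise_std":    random.uniform(0.0, 0.05),
        "crop_row_spacing_m":  random.uniform(0.6, 1.0),
    }

for episode in range(1000):
    params = sample_randomized_environment()
    # sim.reset(**params)   # placeholder: apply the parameters to the simulator
    # ...run one training episode and update the DRL policy as usual...
```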
More critically, real-time decision-making poses a pressing contradiction. Traditional DRL frameworks, designed to balance multi-objective optimization (e.g., waypoint tracking and sudden obstacle avoidance), often employ complex neural networks and multi-level reward functions. This leads to latency in online computation and decision-making during real-world operations. In high-speed scenarios, such as hilly terrain operations, such delays can trigger cumulative trajectory deviations or even cascading control failures. Thus, reducing computational complexity and energy consumption while maintaining decision accuracy is a critical breakthrough point for future research.

4.3. Safety and Reliability Issues

Agricultural automation equipment must not only perform tasks efficiently but also ensure the safety of on-site personnel and crops. Consequently, safety and reliability issues have emerged as a core challenge in technology promotion. In recent years, agricultural machinery accidents have occurred frequently in China. Statistical data indicate that 220 agricultural machinery incidents were reported nationwide in 2021, resulting in 46 fatalities and significant economic losses [108]. These recurring accidents highlight that the safety and reliability of automated agricultural equipment have become critical factors constraining the development of agricultural modernization.
In practical applications, agricultural robots often operate in human–robot collaborative environments. For example, when deploying spraying drones in complex agricultural settings, it is critical to account for obstacles such as heavy machinery, moving vehicles, and coexisting robotic agents to prevent collisions while simultaneously fulfilling the dual requirements of “precision spraying” and “safety” [109]. In tasks such as autonomous navigation and robotic arm grasping, inaccurate risk quantification in decision-making mechanisms may lead to erroneous device actuation, potentially causing crop damage or operator injuries. Therefore, safety verification and confidence estimation for DRL strategies, combined with real-time fault detection modules, are imperative for reliable operations. Furthermore, long-term deployed agricultural equipment often exhibits policy drift—a phenomenon where initially trained strategies gradually deteriorate due to dynamic environmental changes and evolving task requirements. This necessitates continuous online learning or periodic policy updates. Achieving adaptive adjustments while maintaining system stability (i.e., avoiding cascading erroneous decisions) has emerged as a key research frontier. Potential solutions include structured stochastic curriculum learning, transfer learning, and online adaptation algorithms; however, their practical efficacy remains to be empirically validated under real-world agricultural scenarios.

5. Future Research Directions for DRL in Agricultural Applications

With AI’s evolution, DRL applications in agriculture will rapidly advance toward higher efficiency, intelligence, and integration. Although DRL has demonstrated significant potential in the intelligentization of agricultural machinery, its practical deployment still faces three major technical barriers: First, the contradiction between complex neural network structures and the edge computing resources of agricultural machinery results in existing systems struggling to meet real-time response requirements in actual operations. Second, the black-box nature of strategies fundamentally conflicts with the safety requirements of human–machine coexistence scenarios in farmland, and the lack of quantitative verification mechanisms may lead to catastrophic failures. Third, the differences in physical properties between simulation environments and real farmland make virtual training strategies frequently ineffective in field applications, as evidenced by repeated cases. To overcome these bottlenecks, coordinated innovation across the three dimensions of algorithm–architecture–verification must be pursued:
(1) Complex Hybrid Architectures and Algorithmic Innovations: Develop hybrid decision frameworks combining model predictive control with DRL to enhance stability and interpretability; innovate algorithms merging value and policy methods (e.g., SAC + DQN) to boost convergence and generalization; introduce attention mechanisms and multi-scale state representations to improve strategy expressiveness in complex scenarios.
(2) Edge-Cloud Collaborative Computing and Lightweight Deployment: Establish edge-cloud collaborative architectures, training complex models in the cloud while deploying lightweight, quantized models (e.g., TinyDRL, ONNX-quantized) on edge devices for low latency and energy efficiency. Leveraging 5G/6G networks enables real-time model updates and strategy synchronization, reconstructing the IT infrastructure of intelligent farms. A minimal deployment sketch follows this list.
(3) Meta-Learning and Self-Supervised DRL for Rapid Adaptation: Integrate meta-learning and self-supervised tasks to enable few-shot policy generalization across crop varieties, climates, and regions. Construct self-supervised tasks from historical trajectories and visual representations to achieve plug-and-play agricultural agents.
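As an illustration of point (2), the sketch below shows two widely used lightweight-deployment steps in PyTorch, post-training dynamic quantization and ONNX export; the policy architecture, file name, and tensor shapes are assumptions for demonstration and do not correspond to any specific deployed system.

```python
import torch
import torch.nn as nn

# Illustrative policy network trained in the cloud; the architecture is assumed.
policy = nn.Sequential(nn.Linear(16, 128), nn.ReLU(),
                       nn.Linear(128, 128), nn.ReLU(),
                       nn.Linear(128, 6))

# 1) Post-training dynamic quantization: linear-layer weights are stored in
#    int8, shrinking the model and speeding up CPU inference on edge devices.
quantized_policy = torch.quantization.quantize_dynamic(
    policy, {nn.Linear}, dtype=torch.qint8
)

# 2) Export the trained policy to ONNX so it can run in lightweight runtimes
#    (e.g., ONNX Runtime) on in-field controllers.
dummy_state = torch.randn(1, 16)
torch.onnx.export(policy, dummy_state, "policy.onnx",
                  input_names=["state"], output_names=["action_logits"])
```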
In summary, DRL’s future in agriculture lies in simultaneous breakthroughs in algorithmic depth and application breadth. Through algorithm innovation, system integration, and computing architecture collaboration, DRL will become the core engine driving smart agriculture toward higher efficiency, lower carbon emissions, and increased precision, providing key technology support for sustainable global agriculture.

6. Conclusions

DRL, as a cutting-edge artificial intelligence technology, has emerged as a core driver of the intelligent transformation of agricultural machinery, demonstrating significant potential for decision optimization in complex and dynamic farmland environments. This paper systematically reviews the research progress of DRL in the autonomous navigation of agricultural machinery, motion planning for intelligent robotic arms, and optimization of low-altitude unmanned aerial vehicle (UAV) operations, highlighting its technical advantages in enhancing agricultural automation efficiency and adaptability. Key challenges and future directions for practical applications are also identified. Key findings include the following:
  • Autonomous Navigation and Path Planning: Hybrid DRL algorithms integrating value function and policy gradient approaches (e.g., Double-DQN and DDPG) have achieved centimeter-level precision in agricultural vehicle path tracking (e.g., reducing errors to ±3 cm in [44]). Dynamic state representations (e.g., path curvature parameters) significantly improved robustness in complex farmland scenarios. However, the long-term stability of high-precision path tracking requires further validation, particularly under dynamic conditions such as crop growth interference and sudden obstacles.
  • UAV Operation Optimization: Multi-objective reward mechanisms (e.g., balancing coverage efficiency and energy consumption) have shown preliminary success in pesticide spraying tasks (e.g., a 41.68% increase in coverage in [67]); an illustrative reward formulation follows this list. However, their generalization across heterogeneous farmland scenarios still needs to be validated through cross-regional and multi-crop experiments.
  • Multi-Agent Collaboration: Distributed DRL frameworks with attention mechanisms (e.g., MADDPG) reduced task completion time by 10.7% in collaborative harvesting scenarios ([63]). Yet, real-time responsiveness and safety of multi-agent systems require systematic optimization under hardware constraints.
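To make the navigation finding concrete, the fragment below shows the core Double-DQN target computation credited with curbing Q-value overestimation: the online network selects the greedy next action while the target network evaluates it. It is a generic PyTorch sketch with placeholder tensor names, not code from the cited studies.

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Generic Double-DQN TD loss. q_net and target_net map a state batch to Q-values over
    a discrete action set (e.g., candidate steering increments for a field vehicle);
    actions is a LongTensor of shape (batch,)."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # select with online net
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluate with target net
        td_target = rewards + gamma * (1.0 - dones) * next_q
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.smooth_l1_loss(q_taken, td_target)
```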
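Likewise, the multi-objective reward mechanisms used for UAV spraying can be summarized as a weighted combination of coverage gain and operational penalties. The terms and weights below are illustrative assumptions, not the values used in the cited experiments.

```python
def spraying_step_reward(new_coverage_m2, energy_wh, overlap_m2, drift_m2,
                         w_cov=1.0, w_energy=0.05, w_overlap=0.3, w_drift=0.5):
    """Illustrative per-step reward for a spraying UAV: reward newly covered crop area and
    penalize energy use, redundant overlap, and off-target (drift) deposition."""
    return (w_cov * new_coverage_m2
            - w_energy * energy_wh
            - w_overlap * overlap_m2
            - w_drift * drift_m2)
```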
Nevertheless, the practical deployment of DRL in agriculture faces persistent challenges, including unstructured environmental dynamics, real-time algorithmic constraints, and safety risks in human–machine collaboration. Future research should prioritize the following:
  • Hybrid DRL architectures coupled with model predictive control to enhance decision interpretability and robustness;
  • Edge-cloud collaborative computing frameworks for deploying lightweight models with low-latency field response and high energy efficiency;
  • Cross-scene adaptive agents developed through meta-learning and self-supervised mechanisms to advance agricultural machinery from task-specific to generalized intelligence.
In conclusion, by synergizing algorithmic depth and systemic breadth, DRL is poised to resolve the trilemma of “complex environmental conditions, multitask coordination, and computing resources” in agriculture, serving as a core engine to propel smart agriculture toward sustainable, low-carbon, and fully autonomous practices. This evolution will provide critical technological support for global food security and agricultural modernization.

Author Contributions

Conceptualization, Q.Z. and J.Z.; methodology, J.Z. and S.F.; validation, B.Z.; formal analysis, A.W. and L.Z.; investigation, J.Z. and B.Z.; resources, S.F. and Q.Z.; data curation, Q.Z. and S.F.; writing—original draft preparation, J.Z.; writing—review and editing, Q.Z. and S.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Jiangsu Province (Grant No. BK20230548), the National Natural Science Foundation of China (Grant No. 32301712), and the Project of Faculty of Agricultural Equipment of Jiangsu University (Grant No. NGXB20240203).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yin, X.; Wang, Y.X.; Chen, Y.L.; Jin, C.Q.; Du, J. Development of autonomous navigation controller for agricultural vehicles. Int. J. Agric. Biol. Eng. 2020, 13, 70–76. [Google Scholar] [CrossRef]
  2. Wu, C.C.; Fang, X.M. Development of precision service system for intelligent agriculture field crop production based on BeiDou system. Smart Agric. 2019, 1, 83–90. [Google Scholar]
  3. Sun, J.L.; Wang, Z.; Ding, S.H.; Xia, J.; Xing, G.Y. Adaptive disturbance observer-based fixed time nonsingular terminal sliding mode control for path-tracking of unmanned agricultural tractors. Biosyst. Eng. 2024, 246, 96–109. [Google Scholar] [CrossRef]
  4. Hu, J.T.; Gao, L.; Bai, X.P.; Li, T.C.; Liu, X.G. Review of research on automatic guidance of agricultural vehicles. Trans. Chin. Soc. Agric. Eng. 2015, 31, 1–10. [Google Scholar]
  5. Cui, B.; Cui, X.; Wei, X.; Zhu, Y.; Ma, Z.; Zhao, Y.; Liu, Y. Design and Testing of a Tractor Automatic Navigation System Based on Dynamic Path Search and a Fuzzy Stanley Model. Agriculture 2024, 14, 2136. [Google Scholar] [CrossRef]
  6. Duan, J.; Wang, M.L.; Jiang, Y.; Zhao, J.B.; Tang, Y.W. A Compound Fuzzy Control Method for Agricultural Machinery Automatic Navigation Based on Genetic Algorithm. In Proceedings of the IEEE 2nd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chengdu, China, 15–17 December 2017; pp. 1663–1666. [Google Scholar]
  7. Hameed, I.A. Intelligent Coverage Path Planning for Agricultural Robots and Autonomous Machines on Three-Dimensional Terrain. J. Intell. Robot. Syst. 2014, 74, 965–983. [Google Scholar] [CrossRef]
  8. Gong, J.L.; Wang, W.; Zhang, Y.F.; Lan, Y.B. Cooperative working strategy for agricultural robot groups based on farmland environment. Trans. Chin. Soc. Agric. Eng. 2021, 37, 11–19. [Google Scholar]
  9. Li, Y.W.; Xu, J.J.; Wang, M.F.; Liu, D.X.; Sun, H.W.; Wang, X.J. Development of autonomous driving transfer trolley on field roads and its visual navigation system for hilly areas. Trans. Chin. Soc. Agric. Eng. 2019, 35, 52–61. [Google Scholar]
  10. Wan, J.; Sun, W.; Ge, M.; Wang, K.H.; Zhang, X.Y. Robot Path Planning Based on Artificial Potential Field Method with Obstacle Avoidance Angles. J. Agric. Mach. 2024, 55, 409–418. [Google Scholar]
  11. Zheng, Z.; Yang, S.H.; Zheng, Y.J.; Liu, X.X.; Chen, J.; Su, D.B.L.G. Obstacle avoidance path planning algorithm for multi-rotor UAVs. Trans. Chin. Soc. Agric. Eng. 2020, 36, 59–69. [Google Scholar]
  12. Zhang, X.H.; Fan, C.G.; Cao, Z.Y.; Fang, J.L.; Jia, Y.J. Novel obstacle-avoiding path planning for crop protection UAV using optimized Dubins curve. Int. J. Agric. Biol. Eng. 2020, 13, 172–177. [Google Scholar] [CrossRef]
  13. Wu, C.C.; Wang, D.X.; Chen, Z.B.; Song, B.B.; Yang, L.L.; Yang, W.Z. Autonomous driving and operation control method for SF2104 tractors. Trans. Chin. Soc. Agric. Eng. 2020, 36, 42–48. [Google Scholar]
  14. Lu, E.; Xu, L.Z.; Li, Y.M.; Tang, Z.; Ma, Z. Modeling of working environment and coverage path planning method of combine harvesters. Int. J. Agric. Biol. Eng. 2020, 13, 132–137. [Google Scholar] [CrossRef]
  15. Jin, Y.C.; Liu, J.Z.; Xu, Z.J.; Yuan, S.Q.; Li, P.P.; Wang, J.Z. Development status and trend of agricultural robot technology. Int. J. Agric. Biol. Eng. 2021, 14, 1–19. [Google Scholar] [CrossRef]
  16. Kang, H.W.; Chen, C. Fruit detection, segmentation and 3D visualisation of environments in apple orchards. Comput. Electron. Agric. 2020, 171, 105302. [Google Scholar] [CrossRef]
  17. Kim, W.S.; Lee, D.H.; Kim, Y.J.; Kim, T.; Hwang, R.Y.; Lee, H.J. Path detection for autonomous traveling in orchards using patch-based CNN. Comput. Electron. Agric. 2020, 175, 105620. [Google Scholar] [CrossRef]
  18. Li, T.; Yu, J.P.; Qiu, Q.; Zhao, C.J. Hybrid Uncalibrated Visual Servoing Control of Harvesting Robots With RGB-D Cameras. IEEE Trans. Ind. Electron. 2023, 70, 2729–2738. [Google Scholar] [CrossRef]
  19. Aguilar, W.G.; Luna, M.A.; Moya, J.F.; Abad, V.; Parra, H.; Ruiz, H. Pedestrian detection for UAVs using cascade classifiers with Meanshift. In Proceedings of the 11th IEEE International Conference on Semantic Computing (ICSC), San Diego, CA, USA, 30 January–1 February 2017; pp. 509–514. [Google Scholar]
  20. Liao, M.; Sen, T.; Elasser, Y.; Al Hassan, H.A.; Pigney, A.; Knapp, E.; Chen, M.J. UAV Fleet Charging on Telecom Towers with Differential Capacitive Wireless Power Transfer. IEEE Trans. Power Electron. 2025, 40, 6370–6384. [Google Scholar] [CrossRef]
  21. Zhao, D.B.; Shao, K.; Zhu, Y.H.; Li, D.; Chen, Y.R.; Wang, H.T.; Liu, D.R.; Zhou, T.; Wang, C.H. Review of deep reinforcement learning and discussions on the development of computer Go. Control Theory Appl. 2016, 33, 701–717. [Google Scholar]
  22. Farias, G.; Garcia, G.; Montenegro, G.; Fabregas, E.; Dormido-Canto, S.; Dormido, S. Reinforcement Learning for Position Control Problem of a Mobile Robot. IEEE Access 2020, 8, 152941–152951. [Google Scholar] [CrossRef]
  23. Gao, J.L.; Ye, W.J.; Guo, J.; Li, Z.J. Deep Reinforcement Learning for Indoor Mobile Robot Path Planning. Sensors 2020, 20, 5493. [Google Scholar] [CrossRef] [PubMed]
  24. Liu, Q.; Zhai, J.W.; Zhang, Z.Z.; Zhong, S.; Zhou, Q.; Zhang, P.; Xu, J. A Survey on Deep Reinforcement Learning. Chin. J. Comput. 2018, 41, 1–27. [Google Scholar]
  25. Wen, H.; Li, H.; Wang, Z.; Hou, X.; He, K. Application of DDPG-based Collision Avoidance Algorithm in Air Traffic Control. In Proceedings of the 2019 12th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 14–15 December 2019; pp. 130–133. [Google Scholar]
  26. Zhang, J.; Lu, S.; Zhang, Z.; Yu, J.; Gong, X. Survey on Deep Reinforcement Learning Methods Based on Sample Efficiency Optimization. J. Softw. 2022, 33, 4217–4238. [Google Scholar]
  27. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  28. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  29. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  30. Luo, L.; Li, L.; Yang, R.; Chen, L.L. Vector Antenna Neighborhood Search Tuning Algorithm. Commun. Technol. 2019, 52, 1800–1803. [Google Scholar]
  31. Yan, D.; Weng, J.Y.; Huang, S.Y.; Li, C.X.; Zhou, Y.C.; Su, H.; Zhu, J. Deep reinforcement learning with credit assignment for combinatorial optimization. Pattern Recogn. 2022, 124, 108466. [Google Scholar] [CrossRef]
  32. Shi, X.T.; Li, Y.J.; Du, C.L.; Shi, Y.; Yang, C.H.; Gui, W.H. Fully Distributed Event-Triggered Control of Nonlinear Multiagent Systems Under Directed Graphs: A Model-Free DRL Approach. IEEE Trans. Autom. Control 2025, 70, 603–610. [Google Scholar] [CrossRef]
  33. Zhou, Y.; Zhou, L.C.; Yi, Z.H.; Shi, D.; Guo, M.J. Leveraging AI for Enhanced Power Systems Control: An Introductory Study of Model-Free DRL Approaches. IEEE Access 2024, 12, 98189–98206. [Google Scholar] [CrossRef]
  34. Herrera, E.M.; Calvet, L.; Ghorbani, E.; Panadero, J.; Juan, A.A. Enhancing carsharing experiences for Barcelona citizens with data analytics and intelligent algorithms. Computers 2023, 12, 33. [Google Scholar] [CrossRef]
  35. Liu, J.W.; Gao, F.; Luo, X.L. Survey of Deep Reinforcement Learning Based on Value Function and Policy Gradient. Chin. J. Comput. 2019, 42, 1406–1438. [Google Scholar]
  36. Bravo-Arrabal, J.; Toscano-Moreno, M.; Fernández-Lozano, J.J.; Mandow, A.; Gomez-Ruiz, J.A.; García-Cerezo, A. The internet of cooperative agents architecture (X-ioca) for robots, hybrid sensor networks, and mec centers in complex environments: A search and rescue case study. Sensors 2021, 21, 7843. [Google Scholar] [CrossRef]
  37. Deisenroth, M.P.; Rasmussen, C.E. PILCO: A Model-Based and Data-Efficient Approach to Policy Search; Omnipress: Madison, WI, USA, 2011. [Google Scholar]
  38. Janner, M.; Fu, J.; Zhang, M.; Levine, S. When to Trust Your Model: Model-Based Policy Optimization. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  39. Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. The Arcade Learning Environment: An Evaluation Platform for General Agents. J. Artif. Intell. Res. 2013, 47, 253–279. [Google Scholar] [CrossRef]
  40. Bae, S.-Y.; Lee, J.; Jeong, J.; Lim, C.; Choi, J. Effective data-balancing methods for class-imbalanced genotoxicity datasets using machine learning algorithms and molecular fingerprints. Comput. Toxicol. 2021, 20, 100178. [Google Scholar] [CrossRef]
  41. Choi, S.; Park, S. Development of Smart Mobile Manipulator Controlled by a Single Windows PC Equipped with Real-Time Control Software. Int. J. Precis. Eng. Manuf. 2021, 22, 1707–1717. [Google Scholar] [CrossRef]
  42. Liu, Z.T.; Shi, Y.H.; Chen, H.W.; Qin, T.X.; Zhou, X.J.; Huo, J.; Dong, H.; Yang, X.; Zhu, X.D.; Chen, X.N.; et al. Machine learning on properties of multiscale multisource hydroxyapatite nanoparticles datasets with different morphologies and sizes. npj Comput. Mater. 2021, 7, 142. [Google Scholar] [CrossRef]
  43. Chen, K.Y.; Liang, Y.F.; Jha, N.; Ichnowski, J.; Danielczuk, M.; Gonzalez, J.; Kubiatowicz, J.; Goldberg, K. FogROS: An Adaptive Framework for Automating Fog Robotics Deployment. In Proceedings of the 17th IEEE International Conference on Automation Science and Engineering (CASE), Lyon, France, 23–27 August 2021; pp. 2035–2042. [Google Scholar]
  44. Ou, J.J.; Guo, X.; Lou, W.J.; Zhu, M. Learning the Spatial Perception and Obstacle Avoidance with the Monocular Vision on a Quadrotor. In Proceedings of the IEEE International Conference on Mechatronics and Automation (IEEE ICMA), Takamatsu, Japan, 8–11 August 2021; pp. 582–587. [Google Scholar]
  45. Saeedvand, S.; Mandala, H.; Baltes, J. Hierarchical deep reinforcement learning to drag heavy objects by adult-sized humanoid robot. Appl. Soft. Comput. 2021, 110, 107601. [Google Scholar] [CrossRef]
  46. Jembre, Y.Z.; Nugroho, Y.W.; Khan, M.T.R.; Attique, M.; Paul, R.; Shah, S.H.A.; Kim, B. Evaluation of reinforcement and deep learning algorithms in controlling unmanned aerial vehicles. Appl. Sci. 2021, 11, 7240. [Google Scholar] [CrossRef]
  47. Timmis, I.; Paul, N.; Chung, C.J. Teaching Vehicles to Steer Themselves with Deep Learning. In Proceedings of the IEEE International Conference on Electro Information Technology (EIT), Mount Pleasant, MI, USA, 14–15 May 2021; pp. 419–421. [Google Scholar]
  48. Yuhas, M.; Feng, Y.L.; Ng, D.J.X.; Rahiminasab, Z.; Easwaran, A. Embedded Out-of-Distribution Detection on an Autonomous Robot Platform. In Proceedings of the Design Automation for CPS and IoT, Nashville, TN, USA, 18 May 2021; pp. 13–18. [Google Scholar]
  49. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar]
  50. Zhang, W.Y.; Gai, J.Y.; Zhang, Z.G.; Tang, L.; Liao, Q.X.; Ding, Y.C. Double-DQN based path smoothing and tracking control method for robotic vehicle navigation. Comput. Electron. Agric. 2019, 166, 104985. [Google Scholar] [CrossRef]
  51. Yu, Y.; Liu, Y.F.; Wang, J.C.; Noguchi, N.; He, Y. Obstacle avoidance method based on double DQN for agricultural robots. Comput. Electron. Agric. 2023, 204, 107546. [Google Scholar] [CrossRef]
  52. Ren, Z.G.; Liu, Z.J.; Yuan, M.X.; Liu, H.; Wang, W.; Qin, J.F.; Yang, F.Z. Double-DQN-Based Path-Tracking Control Algorithm for Orchard Traction Spraying Robot. Agronomy 2022, 12, 2803. [Google Scholar] [CrossRef]
  53. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1889–1897. [Google Scholar]
  54. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.P. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016. [Google Scholar]
  55. Hu, Z.Y.; Wang, Y. A deep deterministic policy gradient method for collision avoidance of autonomous ship. Command Control Simul. 2024, 46, 37–44. [Google Scholar]
  56. Takahashi, K.; Tomah, S. Online optimization of AGV transport systems using deep reinforcement learning. Bull. Netw. Comput. Syst. Softw. 2020, 9, 53–57. [Google Scholar]
  57. Ye, X.F.; Deng, Z.Y.; Shi, Y.J.; Shen, W.M. Toward Energy-Efficient Routing of Multiple AGVs with Multi-Agent Reinforcement Learning. Sensors 2023, 23, 5615. [Google Scholar] [CrossRef]
  58. Chen, Y.; Schomaker, L.; Cruz, F. Boosting Reinforcement Learning Algorithms in Continuous Robotic Reaching Tasks Using Adaptive Potential Functions. In Proceedings of the Australasian Joint Conference on Artificial Intelligence, Melbourne, Australia, 25–29 November 2024; pp. 52–64. [Google Scholar]
  59. Sasaki, Y.; Matsuo, S.; Kanezaki, A.; Takemura, H. A3C Based Motion Learning for an Autonomous Mobile Robot in Crowds. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 1036–1042. [Google Scholar]
  60. Lyu, S.; Gong, X.Y.; Zhang, Z.H.; Han, S.; Zhang, J.W. Survey of Deep Reinforcement Learning Methods with Evolutionary Algorithms. Chin. J. Comput. 2022, 45, 1478–1499. [Google Scholar]
  61. Freitas, G.; Hamner, B.; Bergerman, M.; Singh, S. A practical obstacle detection system for autonomous orchard vehicles. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots & Systems, Tokyo, Japan, 3–7 November 2013. [Google Scholar]
  62. Blok, P.M.; Kwon, S.H.; Boheemen, K.v.; Hakjin, K. Autonomous In-Row Navigation of an Orchard Robot with a 2D LIDAR Scanner and Particle Filter with a Laser-Beam Model. Inst. Control Robot. Syst. 2018, 24, 726–735. [Google Scholar] [CrossRef]
  63. Wang, M.; Xu, J.; Zhang, J.C.; Cui, Y. An autonomous navigation method for orchard rows based on a combination of an improved a-star algorithm and SVR. Precis. Agric. 2024, 25, 1429–1453. [Google Scholar] [CrossRef]
  64. Kiumarsi, B.; Vamvoudakis, K.G.; Modares, H.; Lewis, F.L. Optimal and Autonomous Control Using Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2042–2062. [Google Scholar] [CrossRef]
  65. Yang, Y.; Zhang, R.R.; Zhang, L.H.; Chen, L.P.; Yi, T.C.; Wu, M.Q.; Yue, X.L. DQN-based Path Tracking Control for Intelligent Agricultural Machinery. J. Agric. Mech. Res. 2025, 47, 28–34. [Google Scholar] [CrossRef]
  66. Zhang, F.; Wan, X.F.; Cui, J.; Liu, H.D.; Cai, T.T.; Yang, Y. UAV path planning strategy for smart tourism agriculture. Comput. Eng. Des. 2022, 43, 1905–1914. [Google Scholar] [CrossRef]
  67. Hu, G.M. Research on Navigation of Orchard Inspection Robot Based on Deep Reinforcement Learning. Mod. Inf. Technol. 2021, 5, 154–156+160. [Google Scholar] [CrossRef]
  68. Devarajan, G.G.; Nagarajan, S.M.; Alnumay, G.W. DDNSAS: Deep reinforcement learning based deep Q-learning network for smart agriculture system. Sustain. Comput. Inform. Syst. 2023, 39, 100890. [Google Scholar] [CrossRef]
  69. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  70. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining Improvements in Deep Reinforcement Learning. In Thirty-Second AAAI Conference on Artificial Intelligence; AAAI: Philadelphia, PA, USA, 2017; Volume 32. [Google Scholar]
  71. Zhang, Y.H.; Chen, W.B.; Zhang, J.Q.; Ma, H. A flexible assembly strategy for a 6-DOF robotic arm based on deep reinforcement learning. J. Chongqing Univ. Technol. (Nat. Sci.) 2024, 38, 148–154. [Google Scholar]
  72. Zhang, S.; Shen, J.; Cao, K.; Dai, H.S.; Li, T. Research on 6D Robotic Arm Grasping Method Based on Improved DDPG. Comput. Eng. Appl. 2025, 1–10. [Google Scholar]
  73. Zhang, X.; Lu, H.M.; Ren, J.K.; Mo, X.M.; Xiao, H.R.; Zhang, W.J.; Yang, X. A Dynamic Target Grasping Method for Manipulator Based on Deep Reinforcement Learning. Ordnance Ind. Autom. 2024, 43, 91–96. [Google Scholar]
  74. Liu, J.; Yap, H.J.; Khairuddin, A.S.M. Path Planning for the Robotic Manipulator in Dynamic Environments Based on a Deep Reinforcement Learning Method. J. Intell. Robot. Syst. 2024, 111, 3. [Google Scholar] [CrossRef]
  75. Xie, F.; Guo, Z.; Li, T.; Feng, Q.; Zhao, C. Dynamic Task Planning for Multi-Arm Harvesting Robots Under Multiple Constraints Using Deep Reinforcement Learning. Horticulturae 2025, 11, 88. [Google Scholar] [CrossRef]
  76. Yang, B.; Wu, X.; Zhang, M.L.; Feng, S.K. Path Planning of Weeding Robot Arm Based on Deep Reinforcement Learning. J. Agric. Mech. Res. 2025, 47, 15–21. [Google Scholar] [CrossRef]
  77. Zhang, S.C.; Xue, X.Y.; Chen, C.; Sun, Z.; Sun, T. Development of a low-cost quadrotor UAV based on ADRC for agricultural remote sensing. Int. J. Agric. Biol. Eng. 2019, 12, 82–87. [Google Scholar] [CrossRef]
  78. Shi, Q.; Liu, D.; Mao, H.; Shen, B.; Li, M. Wind-induced response of rice under the action of the downwash flow field of a multi-rotor UAV. Biosyst. Eng. 2021, 203, 60–69. [Google Scholar] [CrossRef]
  79. Almalki, F.A.; Salem, S.M.; Fawzi, W.M.; Alfeteis, N.S.; Esaifan, S.A.; Alharthi, A.S.; Alnefaiey, R.Z.; Naith, Q.H. Coupling an Autonomous UAV With a ML Framework for Sustainable Environmental Monitoring and Remote Sensing. Int. J. Aerosp. Eng. 2024, 2024. [Google Scholar] [CrossRef]
  80. Hu, J.; Wang, T.; Yang, J.C.; Lan, Y.B.; Lv, S.L.; Zhang, Y.L. WSN-Assisted UAV Trajectory Adjustment for Pesticide Drift Control. Sensors 2020, 20, 5473. [Google Scholar] [CrossRef] [PubMed]
  81. Fu, H.T.; Li, Z.; Zhang, W.J.; Feng, Y.X.; Zhu, L.; Fang, X.; Li, J. Research on Path Planning of Agricultural UAV Based on Improved Deep Reinforcement Learning. Agronomy 2024, 14, 2669. [Google Scholar] [CrossRef]
  82. Kang, C.; Park, B.; Choi, J. Scheduling PID Attitude and Position Control Frequencies for Time-Optimal Quadrotor Waypoint Tracking under Unknown External Disturbances. Sensors 2022, 22, 150. [Google Scholar] [CrossRef]
  83. Huang, Y.Y.; Li, Z.W.; Yang, C.H.; Huang, Y.M. Automatic Path Planning for Spraying Drones Based on Deep Q-Learning. J. Internet Technol. 2023, 24, 565–575. [Google Scholar] [CrossRef]
  84. Zhu, W.; Feng, Z.; Dai, S.; Zhang, P.; Wei, X. Using UAV Multispectral Remote Sensing with Appropriate Spatial Resolution and Machine Learning to Monitor Wheat Scab. Agriculture 2022, 12, 1785. [Google Scholar] [CrossRef]
  85. Chen, Y.Y.; Zhou, B.B.; Chen, X.P.; Ma, C.K.; Cui, L.; Lei, F.; Han, X.J.; Chen, L.J.; Wu, S.S.; Ye, D.P. A method of deep network auto-training based on the MTPI auto-transfer learning and a reinforcement learning algorithm for vegetation detection in a dry thermal valley environment. Front. Plant Sci. 2025, 15, 1448669. [Google Scholar] [CrossRef] [PubMed]
  86. Gözen, D.; Özer, S. Visual Object Tracking in Drone Images with Deep Reinforcement Learning. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 10082–10089. [Google Scholar]
  87. Lytridis, C.; Kaburlasos, V.G.; Pachidis, T.; Manios, M.; Vrochidou, E.; Kalampokas, T.; Chatzistamatis, S. An Overview of Cooperative Robotics in Agriculture. Agronomy 2021, 11, 1818. [Google Scholar] [CrossRef]
  88. Delavarpour, N.; Koparan, C.; Nowatzki, J.; Bajwa, S.; Sun, X. A Technical Study on UAV Characteristics for Precision Agriculture Applications and Associated Practical Challenges. Remote Sens. 2021, 13, 1204. [Google Scholar] [CrossRef]
  89. Miao, H.; Tian, Y.C. Dynamic robot path planning using an enhanced simulated annealing approach. Appl. Math. Comput. 2013, 222, 420–437. [Google Scholar] [CrossRef]
  90. Yang, K.J.; Sukkarieh, S. REAL-TIME Continuous Curvature Path Planning of Uavs in Cluttered Environments. In Proceedings of the 5th International Symposium on Mechatronics and its Applications, Amman, Jordan, 27–29 May 2008; pp. 1–6. [Google Scholar]
  91. Hart, P.E.; Nilsson, N.J.; Raphael, B. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107. [Google Scholar] [CrossRef]
  92. Chen, X.; Zhang, J. The Three-Dimension Path Planning of UAV Based on Improved Artificial Potential Field in Dynamic Environment. In Proceedings of the 2013 5th International Conference on Intelligent Human-Machine Systems and Cybernetics, Hangzhou, China, 26–27 August 2013; pp. 144–147. [Google Scholar]
  93. Dijkstra, E.W. A note on two problems in connexion with graphs. Numer. Math. 1959, 1, 269–271. [Google Scholar] [CrossRef]
  94. Yan, F.; Liu, Y.S.; Xiao, J.Z. Path Planning in Complex 3D Environments Using a Probabilistic Roadmap Method. Int. J. Autom. Comput. 2013, 10, 525–533. [Google Scholar] [CrossRef]
  95. Tsai, J.T.; Chou, J.H.; Liu, T.K. Tuning the structure and parameters of a neural network by using hybrid Taguchi-genetic algorithm. IEEE Trans. Neural Netw. 2006, 17, 69–80. [Google Scholar] [CrossRef]
  96. Wu, K.Y.; Esfahani, M.A.; Yuan, S.H.; Wang, H. TDPP-Net: Achieving three-dimensional path planning via a deep neural network architecture. Neurocomputing 2019, 357, 151–162. [Google Scholar] [CrossRef]
  97. Guo, T.; Jiang, N.; Li, B.; Zhu, X.; Du, W. UAV navigation in high dynamic environments: A deep reinforcement learning approach. Chin. J. Aeronaut. 2020, 34, 479–489. [Google Scholar] [CrossRef]
  98. Tian, S.S.; Li, Y.X.; Zhang, X.; Zheng, L.; Cheng, L.H.; She, W.; Xie, W. Fast UAV path planning in urban environments based on three-step experience buffer sampling DDPG. Digit. Commun. Netw. 2024, 10, 813–826. [Google Scholar] [CrossRef]
  99. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  100. Baumgartner, J.; Petric, T.; Klancar, G. Potential Field Control of a Redundant Nonholonomic Mobile Manipulator with Corridor-Constrained Base Motion. Machines 2023, 11, 293. [Google Scholar] [CrossRef]
  101. Li, L.Y.; Wu, D.F.; Huang, Y.Q.; Yuan, Z.M. A path planning strategy unified with a COLREGS collision avoidance function based on deep reinforcement learning and artificial potential field. Appl. Ocean Res. 2021, 113, 102759. [Google Scholar] [CrossRef]
  102. Cao, K.; Zhu, Y.; Gao, Q.; Liu, J.H. Research status and prospect of deep reinforcement learning in automatic control. J. Drain. Irrig. Mach. Eng. 2023, 41, 638–648. [Google Scholar]
  103. Choudhury, M.R.; Das, S.; Christopher, J.; Apan, A.; Chapman, S.; Menzies, N.W.; Dang, Y.P. Improving Biomass and Grain Yield Prediction of Wheat Genotypes on Sodic Soil Using Integrated High-Resolution Multispectral, Hyperspectral, 3D Point Cloud, and Machine Learning Techniques. Remote Sens. 2021, 13, 3482. [Google Scholar] [CrossRef]
  104. Awais, M.; Li, W.; Cheema, M.J.M.; Zaman, Q.U.; Shaheen, A.; Aslam, B.; Zhu, W.; Ajmal, M.; Faheem, M.; Hussain, S.; et al. UAV-based remote sensing in plant stress imagine using high-resolution thermal sensor for digital agriculture practices: A meta-review. Int. J. Environ. Sci. Technol. 2023, 20, 1135–1152. [Google Scholar] [CrossRef]
  105. Velusamy, P.; Rajendran, S.; Mahendran, R.K.; Naseer, S.; Shafiq, M.; Choi, J.G. Unmanned Aerial Vehicles (UAV) in Precision Agriculture: Applications and Challenges. Energies 2022, 15, 217. [Google Scholar] [CrossRef]
  106. Gao, T.; Gao, Z.H.; Sun, B.; Qin, P.Y.; Li, Y.F.; Yan, Z.Y. An Integrated Method for Estimating Forest-Canopy Closure Based on UAV LiDAR Data. Remote Sens. 2022, 14, 4317. [Google Scholar] [CrossRef]
  107. Zhao, D.A.; Lv, J.D.; Ji, W.; Zhang, Y.; Chen, Y. Design and control of an apple harvesting robot. Biosyst. Eng. 2011, 110, 112–122. [Google Scholar] [CrossRef]
  108. Feng, R.G.; Hu, J.P.; Wang, M.J.; Niu, X.S.; Li, H.T. Key technologies and application analysis of intelligent agricultural machinery autonomous driving. Agric. Equip. Veh. Eng. 2024, 62, 15–18. [Google Scholar]
  109. Ahmed, S.; Qiu, B.J.; Ahmad, F.; Kong, C.W.; Xin, H. A State-of-the-Art Analysis of Obstacle Avoidance Methods from the Perspective of an Agricultural Sprayer UAV’s Operation Scenario. Agronomy 2021, 11, 1069. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of DRL algorithm classification.
Figure 2. Classification of DRL algorithms based on gradient policies.
Figure 3. Diagram of the algorithm framework for agricultural ground platform navigation. (a) Double-DQN algorithm topology diagram; (b) Actor–Critic network architecture; (c) SMO-Rainbow strategy network architecture.
Figure 4. Diagram of the motion planning algorithm framework for intelligent agricultural end-effectors. (a) TD3 framework; (b) a deep reinforcement learning network with self-attention mechanism; (c) flowchart of the DDPG algorithm.
Figure 5. Common drone system architectures [79].
Figure 6. Agricultural low-altitude drone algorithm framework diagram. (a) Neural network-based value function approximator of reinforcement learning. (b) The proposed QFP and CFA models [82]. (c) The block diagram of MTSA reinforcement learning.
Table 1. Comparison of model-based vs. model-free deep reinforcement learning algorithms.
Attribute | Model-Based DRL | Model-Free DRL
Core Concept | Builds a model of environment dynamics for policy optimization | Learns the optimal policy directly through interaction with the environment
Sample Efficiency | High: can generate abundant data via the learned model | Low: requires large amounts of real-world interaction
Generalization Ability | Strong: suitable for transfer and multi-task learning | Moderate: relies on experience replay or exploration mechanisms
Typical Application Scenarios | Structured environments (e.g., greenhouse climate control) | Complex, unstructured environments (e.g., field navigation)
Representative Algorithms | PILCO [37], MBPO [38], Dreamer [39] | DQN, DDPG, TD3
Major Challenges | Model error accumulation; increased training complexity | High data demand; unstable convergence
Agricultural Use Cases | Crop growth modeling, irrigation/fertilization control | Tractor path tracking, UAV spraying missions
Table 2. Comparison of performance of agricultural machinery path planning methods.
Method | Path Accuracy | Adaptability | Computational Load | Suitable Terrain
PPC | Medium | Low | Low | Flat fields
MPC | High | Medium | High | Greenhouses
DRL (DQN, PPO) | High | High | Medium | Unstructured fields
Table 3. DRL gains in agricultural manipulator tasks.
Metric | Traditional | DRL Solution
Obstacle reaction time | 42 ms | 12.1 ms
Multi-target success rate | 65% | 87%
Multi-arm task time | 120 s | 107 s
Table 4. Advantages and disadvantages of path planning methods.
Method | Advantages | Disadvantages | Reference
RRT | Simple, low computation | Static-only; non-optimal paths | Yang et al. [90]
A* | Fast search | Static-only; high computational burden; non-smooth paths | Hart et al. [91]
APF | Fast convergence | Local minima problem | Chen and Zhang [92]
Dijkstra's algorithm | Easy implementation | High complexity; static-only | Dijkstra [93]
PRM | Handles complex spaces | Static-only; non-optimal; time-consuming | Yan et al. [94]
Genetic algorithm | Solves NP-hard and multi-objective problems | High complexity; premature convergence | Tsai et al. [95]
CNN | Handles dynamic, unknown environments | High complexity; many hyperparameters; non-optimal | Wu et al. [96]
DQN | Adaptive | Non-optimal; hard to implement | Guo et al. [97]
DDPG | Supports continuous action spaces; suitable for high-dimensional control | Unstable training; high computational resource demands | Tian et al. [98]
TD3 | Improved stability over DDPG; reduces overestimation | Complex parameter tuning; insufficient real-time performance | Liu et al. [74]
SAC | Balances exploration and exploitation; adapts to complex dynamic environments | High algorithmic complexity; long training time | Haarnoja et al. [99]
PPO | Stable training; suitable for multi-task scenarios | Requires fine hyperparameter tuning; slow convergence | Baumgartner et al. [100]
Table 5. Advantages and limitations of agricultural ground platform navigation.
Method | Advantages | Limitations | Applicability Case
RRT | Simple, low computation | Only suitable for static environments; the path is not optimal; cannot handle dynamic obstacles | Initial path exploration around static obstacles in farmland (e.g., avoiding fixed gullies); best combined with other methods to optimize the path
A* | Fast search | Static-only; high computational burden; non-smooth paths | Initial global path planning for agricultural machinery; needs to be combined with local dynamic adjustment
DDPG | Supports continuous action spaces; suitable for high-dimensional control | Complex parameter tuning; poor hardware compatibility | Zhang et al. used DDPG to optimize dynamic tractor path tracking, improving response speed by 30%
TD3 | Improved stability over DDPG; reduces overestimation | High algorithmic complexity; long training cycles | Liu et al. implemented dynamic obstacle-avoidance path generation for agricultural machinery based on TD3, reducing the response time to 12.1 ms
SAC | Balances exploration and exploitation; adapts to complex dynamic environments | High algorithmic complexity; long training time; high computational burden | Hu used SAC to design an orchard inspection robot navigation strategy to cope with sparse GPS signals
Table 6. Advantages and limitations of motion planning for intelligent agricultural end-effectors.
Method | Advantages | Limitations | Applicability Case
PPO | Stable training; suitable for multi-task scenarios | Requires fine hyperparameter tuning; slow convergence | Xu et al. achieved dynamic grasping with a 6-degree-of-freedom robotic arm using PPO, improving the success rate by 21%
DQN | Strong adaptive learning ability; suited to discrete action spaces (e.g., decision-making) | Difficult to handle continuous actions | Discrete task selection for end-effectors (e.g., spray switch control); needs to be combined with policy-gradient methods
MADDPG | Supports multi-agent collaboration | High risk of strategy drift; requires complex reward function design | Xie et al. developed a multi-robotic-arm Markov game model, reducing task completion time by 10.7%
APF | Converges quickly; suitable for local obstacle avoidance | Prone to local minima; unable to handle complex constraints | Chen et al. combined APF with DDPG to design an obstacle-avoidance reward mechanism, reducing the seedling damage rate to 2.82%
Table 7. Advantages and limitations of agricultural low-altitude drone operations.
Method | Advantages | Limitations | Applicability Case
Genetic algorithm | Suitable for multi-objective optimization | High computational complexity; prone to premature convergence | Offline global path planning (e.g., partitioning large fields); needs to be combined with online DRL dynamic adjustment
DQN + PSO | Combines the adaptability of DQN with the global search capability of PSO to improve path optimality | Complex to implement; requires fusion of data from multiple sensors | Hu et al. designed a WSN-assisted UAV trajectory correction system, increasing pesticide deposition coverage by 41.68%
TD3 | Adapted to continuous action spaces | Insufficient real-time performance; difficult to deploy at the edge | UAV attitude control in simulation environments; a lightweight model such as TinyDRL is needed to fit actual hardware
SAC | Supports complex dynamic environments (e.g., sudden wind-speed changes) with high efficiency | Requires extensive training data, making rapid transfer to new scenarios difficult | Almalki et al. combined SAC with Faster R-CNN to achieve vegetation detection in arid river valleys with an accuracy of 98%
