Article

A DDPG-LSTM Framework for Optimizing UAV-Enabled Integrated Sensing and Communication

School of Electronic Engineering, Soongsil University, Seoul 06978, Republic of Korea
*
Author to whom correspondence should be addressed.
Drones 2025, 9(8), 548; https://doi.org/10.3390/drones9080548
Submission received: 16 June 2025 / Revised: 26 July 2025 / Accepted: 31 July 2025 / Published: 1 August 2025


Highlights

  What are the main findings?
  • A novel integrated sensing and communication (ISAC)-enabled unmanned aerial vehicle (UAV) architecture is proposed, enabling a single UAV to jointly perform uplink communication and radar sensing.
  • A long short-term memory (LSTM)-enhanced deep deterministic policy gradient (DDPG)-based deep reinforcement learning (DRL) framework is developed to optimize UAV trajectory and uplink power control.
  • The proposed approach effectively adapts to dynamic target movements, enhancing both sensing accuracy and communication reliability.
  What is the implication of the main finding?
  • The framework enables energy-efficient UAV operation by minimizing movement energy consumption while maintaining sensing performance.
  • The framework enables UAVs to autonomously and efficiently balance sensing and communication tasks in dynamic environments.
  • The solution supports real-time adaptation to target mobility, making it suitable for practical ISAC-UAV applications such as surveillance and smart city monitoring.

Abstract

This paper proposes a novel dual-functional radar-communication (DFRC) framework that integrates unmanned aerial vehicle (UAV) communications into an integrated sensing and communication (ISAC) system, termed the ISAC-UAV architecture. In this system, the UAV’s mobility is leveraged to simultaneously serve multiple single-antenna uplink users (UEs) and perform radar-based sensing tasks. A key challenge stems from the target position uncertainty due to movement, which impairs matched filtering and beamforming, thereby degrading both uplink reception and sensing performance. Moreover, UAV energy consumption associated with mobility must be considered to ensure energy-efficient operation. We aim to jointly maximize radar sensing accuracy and minimize UAV movement energy over multiple time steps, while maintaining reliable uplink communications. To address this multi-objective optimization, we propose a deep reinforcement learning (DRL) framework based on a long short-term memory (LSTM)-enhanced deep deterministic policy gradient (DDPG) network. By leveraging historical target trajectory data, the model improves prediction of target positions, enhancing sensing accuracy. The proposed DRL-based approach enables joint optimization of UAV trajectory and uplink power control over time. Extensive simulations validate that our method significantly improves communication quality and sensing performance, while ensuring energy-efficient UAV operation. Comparative results further confirm the model’s adaptability and robustness in dynamic environments, outperforming existing UAV trajectory planning and resource allocation benchmarks.

1. Introduction

The rapid expansion of the Internet of Things (IoT) and the emergence of intelligent applications, such as autonomous driving, have significantly increased the demand for high-speed wireless communication and precise radar sensing [1]. Traditionally, communication and sensing functions have been designed and operated separately, resulting in inefficient spectrum utilization and redundant hardware resources. To overcome these limitations, integrated sensing and communication (ISAC) has emerged as a promising paradigm that co-designs communication and sensing functionality through shared waveforms and hardware [1,2]. This integration not only improves spectral efficiency and reduces system costs but also supports the development of versatile, multifunctional wireless systems.
ISAC has gained considerable attention as a key enabler for future sixth-generation (6G) wireless systems, which aim to deliver ultra-reliable, low-latency communications alongside high-resolution environmental awareness through joint use of infrastructure and spectrum [3,4]. Leveraging cutting-edge technologies such as multiple-input–multiple-output (MIMO) antenna systems and access to high-frequency bands including millimeter-wave (mmWave) and terahertz (THz), ISAC enables the simultaneous delivery of broadband communication and fine-grained sensing capabilities within a unified system [5]. A prominent implementation of ISAC is the dual-functional radar-communication (DFRC) framework, which supports concurrent sensing and communication using a common waveform. Recent research has made significant advances in this area, including coding techniques [6], waveform design methodologies [7,8], and joint beamforming techniques [9,10], all aimed at enhancing signal reuse efficiency across radar and communication subsystems. In particular, recent studies have introduced advanced hybrid beamforming methods tailored for multi-user ISAC systems, including partially connected analog–digital architectures for Cramér–Rao bound (CRB)-optimal design [11], prior-information-aware beamforming that minimizes the posterior CRB [12], and communication-centric hybrid designs that jointly optimize multi-user transmission and sensing constraints [13]. These contributions significantly improve the practicality and performance of ISAC systems in real-world deployments. Despite these advancements, terrestrial ISAC systems remain limited by factors such as fixed infrastructure and environmental obstructions. Static base station (BS) deployments often suffer from unreliable line-of-sight (LoS) links, which degrade sensing accuracy and reduce operational coverage [14].
To overcome these challenges, unmanned aerial vehicles (UAVs) have been increasingly explored as mobile ISAC platforms capable of augmenting both communication and sensing performance. UAVs offer three-dimensional mobility [15,16] allowing real-time adjustment of altitude and trajectory to optimize channel conditions and sensing geometry [17]. This spatial flexibility significantly enhances link reliability and sensing accuracy, especially in environments where terrestrial infrastructure is insufficient or obstructed. UAV-based systems, with their rapid deployment capabilities and high probability of maintaining LoS connections, are thus strong candidates for airborne ISAC applications [18]. Their mobility enables context-aware adaptation to changing user demands and environmental dynamics, effectively balancing communication throughput and sensing resolution [19]. Importantly, UAVs excel in real-time tracking of moving targets, something static ground-based ISAC systems often struggle with, thereby enabling persistent sensing and reliable operation in mobility-constrained scenarios.

1.1. Related Work and Motivation

The synergistic integration of ISAC technologies into UAV systems holds substantial promise for enhancing both wireless communication and sensing capabilities. Consequently, this research area has attracted significant attention in recent years [20,21,22]. In [20], the authors deployed a UAV-enabled ISAC system, where the UAV senses ground users and forwards the information to the BS, with the dual objective of maximizing energy efficiency and minimizing the radar mutual information. In [21], the authors introduced a massive MIMO-based UAV-enabled ISAC system for radar probing tasks with various optimization schemes. In [22], the authors proposed a multi-stage UAV-enabled ISAC system for target localization, where the UAV first performs wide-beam sensing while stationary to acquire a coarse estimate of the target’s position, followed by narrow-beam sensing with jointly optimized trajectory and beamforming to refine localization accuracy. While these studies primarily focus on fixed targets, the emergence of autonomous vehicular applications has prompted a shift toward ISAC systems capable of real-time tracking and detection of dynamic targets. In response, the authors of [23] proposed a globally optimal trajectory design for UAV-enabled ISAC with moving ground users by projecting the trajectories into a user-relative frame. The problem was reformulated as a shape optimization involving a density-varying catenary within an artificial potential field. To enhance ISAC performance with limited UAV resources, ref. [24] examined an orthogonal frequency division multiple access (OFDMA)-based cooperative multi-UAV system that enables joint multi-static sensing and coordinated multipoint transmission. However, many existing studies assume known target positions when designing UAV trajectories, which limits their practicality in dynamic environments.
To address this limitation, recent research has introduced advanced signal processing techniques to estimate parameters such as target position, angle of arrival (AoA), angle of departure (AoD), time delay, and Doppler shift under uncertainty. For instance, ref. [25] proposed a networked ISAC-based UAV tracking and handover scheme using a virtual sensing cell, where one primary BS transmits while multiple BSs receive echoes. Each BS removes clutter and estimates the UAV’s position and velocity, which are then fused by a centralized extended Kalman filter (EKF) for accurate multi-UAV tracking and handover. In a similar direction, ref. [26] presented a UAV-enabled ISAC system for joint tracking and communication with a moving target, employing an EKF-based scheme and optimizing the UAV trajectory to minimize a weighted sum of the predicted posterior Cramér–Rao bound (PCRB) for position and velocity estimation. Building on this, the authors in [27] proposed an online energy-aware UAV trajectory optimization method, further minimizing the weighted sum of the predicted PCRB while satisfying UAV energy and location constraints. In [28], a UAV-enabled ISAC system was designed for vehicular networks, introducing a three-stage approach: initial state estimation, wide-beam EKF-based tracking, and adaptive transmission based on real-time sensing and communication performance. Additionally, ref. [29] developed a UAV-assisted ISAC system for IoT networks, where a UAV senses IoT nodes with unknown positions and forwards the collected data to the BS. To cope with location uncertainty and limited resources, a trajectory optimization problem was formulated to maximize the radar estimation rate and minimize flight time, using a deep reinforcement learning (DRL) approach. The authors in [30] proposed a UAV-ISAC-assisted IoT system in which a UAV detects unknown IoT node locations and relays the data to a ground BS. This task presents several challenges, including unknown node positions, non-convex optimization, limited resources, dynamic environmental conditions, and complex resource allocation. To address these issues, they proposed a DRL approach based on a multi-step dueling double deep Q-network (DDQN). In [31], the authors proposed an integrated sensing, communication, and computation (ISCC) system utilizing collaborative UAVs equipped with edge servers to support intelligent transportation systems. To address the architectural and coordination challenges in UAV networks, they developed the multi-UAV collaborative air-ISCC (MCAI) algorithm, which employs an asynchronous advantage actor–critic framework to collaboratively train a DRL model across UAVs. This approach aims to improve ISCC service reliability while enhancing UAV energy efficiency. In [32], an ISAC system designed for the low-altitude economy (LAE) was investigated, where a ground BS simultaneously provided UAVs with communication and navigation services while sensing the airspace to detect unauthorized mobile targets. Building on this model, the authors proposed a novel LAE-specific ISAC framework, termed DeepLSC, which leverages a DRL approach to address multi-objective optimization and satisfy diverse system constraints effectively. While these studies represent significant progress, many DRL-based ISAC-UAV frameworks overlook valuable historical information, such as past target positions, which are strongly correlated with future behavior.
This motivates us to integrate predictive methods into DRL frameworks to more effectively exploit this information and further enhance the learning process, aiming to achieve superior performance in ISAC-based UAV systems.

1.2. Contribution and Outline

This study tackles the challenging task of maximizing radar sensing performance while minimizing the UAV movement energy consumption over multiple time steps, all while ensuring reliable uplink communication quality. To achieve this, we jointly optimize the UAV’s trajectory and resource allocation within a UAV-enabled ISAC framework. A central challenge lies in the uncertainty of the target’s position, which undermines traditional methods such as convex optimization that require prior knowledge of the target’s location. Effective optimization in this setting demands both the predictive capability to estimate future target positions and the decision-making capability to optimize actions over time. To address these challenges, we propose a novel DRL framework that integrates long short-term memory (LSTM) units into the deep deterministic policy gradient (DDPG) architecture. This LSTM-DDPG network is designed to capture temporal dependencies in historical trajectory data, thereby improving both target prediction and long-term optimization performance. The main contributions of this study can be summarized as follows:
  • We introduce a new DFRC framework that incorporates UAV-based communication into an ISAC system. The proposed architecture enables the UAV to simultaneously support uplink communication for ground users and perform radar-based sensing.
  • To reflect realistic operating conditions, we incorporate a detailed UAV movement energy consumption model. This component directly addresses the constraint of limited onboard energy in UAV systems.
  • The impact of target mobility and position uncertainty on matched filtering and beamforming in ISAC systems is analyzed and discussed. Accordingly, we formulate a multi-step joint optimization problem, where the UAV’s trajectory, uplink transmit power, and predicted target positions are optimized to balance communication efficiency and sensing performance.
  • A novel DRL method, termed LSTM-DDPG, is proposed by embedding LSTM modules into both the actor and critic networks of DDPG. This integration facilitates the model to capture temporal correlations in historical target trajectories, enhancing prediction accuracy and enabling more effective long-term UAV control and resource allocation.
  • Simulation results demonstrate that the proposed method significantly enhances communication efficiency, sensing robustness, and energy conservation. Comparative studies further highlight the framework’s ability to leverage historical target position data, leading to more efficient learning and superior performance compared to existing benchmarks.
Notation: Scalars are denoted by lowercase italic letters (e.g., $x$, $y$), and column vectors are represented by lowercase boldface letters (e.g., $\mathbf{x}$, $\mathbf{y}$). For a vector $\mathbf{x}$, the Hermitian transpose and the conjugate are denoted by $\mathbf{x}^{H}$ and $\mathbf{x}^{*}$, respectively. The norm, expectation, and variance are represented by $\|\cdot\|$, $\mathbb{E}[\cdot]$, and $\mathrm{Var}[\cdot]$, respectively.

2. System Model and Problem Formulation

We consider an ISAC system built on a full-duplex (FD) UAV platform, where the UAV simultaneously performs target motion sensing and uplink communication with multiple ground users, as depicted in Figure 1. The UAV is equipped with two uniform linear arrays (ULAs): one with $N_t$ transmit antennas and the other with $N_r$ receive antennas. Leveraging its FD capabilities, the UAV can simultaneously perform uplink communication with $K_{\mathrm{ul}}$ single-antenna user equipments (UEs) and conduct radar-based sensing by transmitting probing signals, even in the presence of $C$ clutter sources. This architecture enables concurrent communication and sensing operations, offering a flexible and mobile solution for intelligent transportation systems and enhancing terrestrial communication, particularly in applications such as smart cities and highway monitoring. We formulate a joint optimization problem over a finite time horizon consisting of $S$ discrete time steps, indexed by the set $\mathcal{S} = \{1, 2, \ldots, S\}$, where $s \in \mathcal{S}$. The set of uplink UEs is denoted by $\mathcal{K}_{\mathrm{ul}} = \{1, 2, \ldots, K_{\mathrm{ul}}\}$, where each $u \in \mathcal{K}_{\mathrm{ul}}$ represents a single-antenna uplink UE, $\mathrm{UE}_u^{\mathrm{ul}}$. Let $\mathcal{T} = \{1, \ldots, 1+C\}$ denote the set of targets, including the desired target and the $C$ clutter sources. Each element $t \in \mathcal{T}$ corresponds to a specific target or clutter object, where $t = 1$ denotes the desired target and $t \neq 1$ denotes a clutter source.

2.1. Channel Model

Due to the elevated altitude of the UAV, the channels from both the UEs and the target/clutter to the UAV, denoted as $\mathbf{h}_u^{\mathrm{ul}}$ and $\mathbf{h}_t^{\mathrm{ut}}$, respectively, are assumed to be dominated by LoS propagation. Each channel is modeled by incorporating both large-scale path loss and small-scale fading effects. Accordingly, the channel vectors $\mathbf{h}_u^{\mathrm{ul}}$ and $\mathbf{h}_t^{\mathrm{ut}}$ at each time step are given by
$$\mathbf{h}_u^{\mathrm{ul}} = \beta_u^{\mathrm{ul}}\, \mathbf{a}(\phi_u^{\mathrm{ul}}), \qquad (1)$$
$$\mathbf{h}_t^{\mathrm{ut}} = \beta_t^{\mathrm{ut}}\, \mathbf{a}(\phi_t^{\mathrm{ut}}), \qquad (2)$$
where $\mathbf{a}(\phi)$ is the steering vector representing the small-scale fading, defined as
$$\mathbf{a}(\phi) = \left[1, \ldots, e^{j\frac{2\pi d}{\lambda}(N-1)\sin(\phi)}\right], \qquad (3)$$
which depends on the AoA $\phi_t^{\mathrm{ut}}$ or AoD $\phi_u^{\mathrm{ul}}$ between the UAV and a target/clutter or UE, the wavelength $\lambda$, and the antenna spacing $d$. The number of antenna elements $N$ corresponds to $N_t$ or $N_r$, depending on the array. The large-scale path loss between the UAV and a target/clutter is modeled as $\beta_t^{\mathrm{ut}} = C_0 (d_t^{\mathrm{ut}})^{-\alpha}$, where $C_0$ is the path loss at a reference distance $D_0 = 1$ m, $d_t^{\mathrm{ut}}$ is the 3D distance between the UAV and the target/clutter, and $\alpha$ is the path loss exponent. Similar notation applies for $\beta_u^{\mathrm{ul}}$ between the UAV and the uplink UE $\mathrm{UE}_u^{\mathrm{ul}}$.
The UAV’s position is denoted by $\mathbf{u} = (x_{\mathrm{uav}}, y_{\mathrm{uav}}, z_{\mathrm{uav}})$, with the altitude constrained by $h_{\min} \le z_{\mathrm{uav}} \le h_{\max}$. The ground-level positions of the $u$-th UE and the $t$-th target/clutter are denoted as $\mathbf{a}_u^{\mathrm{ul}} = (x_u^{\mathrm{ul}}, y_u^{\mathrm{ul}}, 0)$ and $\mathbf{a}_t^{\mathrm{target}} = (x_t^{\mathrm{target}}, y_t^{\mathrm{target}}, 0)$, respectively. Accordingly, the distances $d_u^{\mathrm{ul}}$ and $d_t^{\mathrm{ut}}$ can be computed as
$$d_u^{\mathrm{ul}} = \|\mathbf{u} - \mathbf{a}_u^{\mathrm{ul}}\|, \qquad (4)$$
$$d_t^{\mathrm{ut}} = \|\mathbf{u} - \mathbf{a}_t^{\mathrm{target}}\|. \qquad (5)$$
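To make the channel model concrete, the following minimal NumPy sketch evaluates the steering vector (3), the path loss, and the distances (4)-(5); the function names, the elevation-angle geometry, and the example numbers are illustrative assumptions rather than the paper's code.

```python
# A minimal NumPy sketch of the LoS channel model (1)-(5): steering vector,
# path loss, and 3D distance. Parameter values are placeholders.
import numpy as np

def steering_vector(phi, n_antennas, d_over_lambda=0.5):
    """ULA steering vector a(phi) as in (3), with antenna spacing d = lambda / 2."""
    n = np.arange(n_antennas)
    return np.exp(1j * 2.0 * np.pi * d_over_lambda * n * np.sin(phi))

def los_channel(uav_pos, node_pos, n_antennas, c0_db=-30.0, alpha=2.2):
    """LoS channel beta * a(phi) as in (1)-(2), with beta = C0 * d^(-alpha)."""
    uav_pos, node_pos = np.asarray(uav_pos, float), np.asarray(node_pos, float)
    diff = node_pos - uav_pos
    dist = np.linalg.norm(diff)                     # 3D distance, as in (4)-(5)
    phi = np.arcsin(-diff[2] / dist)                # elevation angle seen from the UAV (assumed geometry)
    beta = 10.0 ** (c0_db / 10.0) * dist ** (-alpha)
    return beta * steering_vector(phi, n_antennas), dist

# Example: UAV hovering at 100 m serving one ground UE.
h_ul, d_ul = los_channel(uav_pos=(0.0, 0.0, 100.0), node_pos=(30.0, 40.0, 0.0), n_antennas=8)
print(d_ul, np.linalg.norm(h_ul))
```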
A key challenge in enabling FD communication lies in self-interference (SI), the leakage of the UAV’s transmit signal into its own receiver, which can degrade the desired signal quality. Recent advances in both active and passive SI cancellation techniques have enabled suppression levels that approach the thermal noise floor. Accordingly, this work adopts the widely used assumption of perfect SI cancellation at the UAV receiver. This assumption is supported by recent FD communication systems, where experimental setups have demonstrated SI suppression exceeding 110 dB, sufficient to render residual interference negligible in many practical scenarios [33]. Such suppression is typically achieved through a combination of passive methods (e.g., antenna isolation), analog cancellation at the RF stage, and digital cancellation at the baseband level.

2.2. Communication Model

During the downlink sensing phase at the $s$-th time step, the transmitted probing sensing signal $\mathbf{x}^{\mathrm{rad}}(s) \in \mathbb{C}^{N_t \times 1}$ at the UAV is expressed as
$$\mathbf{x}^{\mathrm{rad}}(s) = \mathbf{z}^{\mathrm{rad}}(s)\, b^{\mathrm{rad}}(s), \qquad (6)$$
where $b^{\mathrm{rad}}(s)$, satisfying $\mathbb{E}\big[|b^{\mathrm{rad}}(s)|^2\big] = 1$, represents the radar probing symbol, and $\mathbf{z}^{\mathrm{rad}}(s) \in \mathbb{C}^{N_t \times 1}$ is the radar beamforming vector designed to focus on the desired target at the $s$-th time step.
In the uplink communication phase at the $s$-th time step, all UEs simultaneously transmit to the UAV. To decode the signal from $\mathrm{UE}_u^{\mathrm{ul}}$, the UAV applies a receive beamforming vector $\mathbf{o}_u(s)$ (i.e., a matched filter) to the received signal $\mathbf{y}_m^{\mathrm{ul}}(s)$. The resulting signal for $\mathrm{UE}_u^{\mathrm{ul}}$ is expressed as
$$\hat{y}_u^{\mathrm{ul}}(s) = \mathbf{o}_u(s)\,\mathbf{y}_m^{\mathrm{ul}}(s) = \underbrace{\mathbf{o}_u(s)\,\mathbf{h}_u^{\mathrm{ul}}(s)\, p_{\max}^{\mathrm{ul}}\,\eta_u^{\mathrm{ul}}(s)\, v_u(s)}_{\text{Desired signal } \mathrm{DU}_u^{\mathrm{ul}}(s)} + \underbrace{\sum_{\substack{u'=1 \\ u'\neq u}}^{K_{\mathrm{ul}}} \mathbf{o}_u(s)\,\mathbf{h}_{u'}^{\mathrm{ul}}(s)\, p_{\max}^{\mathrm{ul}}\,\eta_{u'}^{\mathrm{ul}}(s)\, v_{u'}(s)}_{\text{Inter-user interference } \mathrm{IUI}_{u',u}^{\mathrm{ul}}(s)} + \underbrace{\sum_{t=1}^{C+1} \mathbf{o}_u(s)\,\beta_t^{\mathrm{ut}}(s)\,\alpha_t\,\mathbf{A}\big(\phi_t^{\mathrm{ut}}(s)\big)\,\mathbf{x}^{\mathrm{rad}}(s)}_{\text{Echo signal interference } \mathrm{ESI}_{t,u}(s)} + \underbrace{\mathbf{o}_u(s)\,\mathbf{n}_0(s)}_{\text{Noise } N_u^{\mathrm{ul}}(s)}, \qquad (7)$$
where $\mathbf{A}(\phi_t^{\mathrm{ut}}) = \mathbf{a}_r(\phi_t^{\mathrm{ut}})\,\mathbf{a}_t(\phi_t^{\mathrm{ut}})^H$ represents the target response matrix, constructed from the transmit steering vector $\mathbf{a}_t(\phi_t^{\mathrm{ut}})$ and the receive steering vector $\mathbf{a}_r(\phi_t^{\mathrm{ut}})$ corresponding to the direction $\phi_t^{\mathrm{ut}}$. The parameter $\alpha_t$ denotes the complex reflection coefficient, dependent on the radar cross-section (RCS) of the target/clutter. $\eta_u^{\mathrm{ul}}$ denotes the power control coefficient for $\mathrm{UE}_u^{\mathrm{ul}}$, and $v_u(s)$, with $\mathbb{E}\big[|v_u(s)|^2\big] = 1$, is the normalized data symbol, subject to the maximum uplink power constraint $p_{\max}^{\mathrm{ul}}$. $\mathbf{n}_0(s) \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I})$ is the additive white Gaussian noise vector at the UAV.
The signal-to-interference-plus-noise ratio (SINR) for uplink user $\mathrm{UE}_u^{\mathrm{ul}}$ at the $s$-th time step, denoted as $\gamma_u^{\mathrm{ul}}(s)$, is derived in (10). Consequently, the achievable uplink rate (in bps/Hz) for $\mathrm{UE}_u^{\mathrm{ul}}$ at the $s$-th time step is given by
$$R_u^{\mathrm{ul}}(s) = \log_2\!\big(1 + \gamma_u^{\mathrm{ul}}(s)\big). \qquad (8)$$

2.3. Sensing Radar Model

During the sensing phase, the UAV applies matched filtering by correlating the received signal with a beamforming vector $\mathbf{r}_t(s)$, which is designed to suppress the interference from clutter sources and maximize the radar SINR. The received echo signal corresponding to the desired target $t = 1$ at the $s$-th time step is expressed as
$$\hat{y}_t^{\mathrm{rad}}(s) = \mathbf{r}_t(s)\,\mathbf{y}_m^{\mathrm{ul}}(s) = \underbrace{\mathbf{r}_t(s)\,\beta_t^{\mathrm{ut}}(s)\,\alpha_t\,\mathbf{A}\big(\phi_t^{\mathrm{ut}}(s)\big)\,\mathbf{x}^{\mathrm{rad}}(s)}_{\text{Desired radar signal } \mathrm{DU}_{t,m}^{\mathrm{rad}}(s)} + \underbrace{\sum_{t'=2}^{C+1} \mathbf{r}_t(s)\,\beta_{t'}^{\mathrm{ut}}(s)\,\alpha_{t'}\,\mathbf{A}\big(\phi_{t'}^{\mathrm{ut}}(s)\big)\,\mathbf{x}^{\mathrm{rad}}(s)}_{\text{Clutter interference } \mathrm{CCI}_{t,t'}(s)} + \underbrace{\sum_{u=1}^{K_{\mathrm{ul}}} \mathbf{r}_t(s)\,\mathbf{h}_u^{\mathrm{ul}}(s)\, p_{\max}^{\mathrm{ul}}\,\eta_u^{\mathrm{ul}}(s)\, v_u(s)}_{\text{Inter-user interference } \mathrm{IUI}_{u,t}^{\mathrm{rad}}(s)} + \underbrace{\mathbf{r}_t(s)\,\mathbf{n}_0(s)}_{\text{Noise } N_t^{\mathrm{rad}}(s)}. \qquad (9)$$
The corresponding uplink and radar SINRs are given by
$$\gamma_u^{\mathrm{ul}}(s) = \frac{\big|\mathrm{DU}_u^{\mathrm{ul}}(s)\big|^2}{\sum_{\substack{u'=1 \\ u'\neq u}}^{K_{\mathrm{ul}}} \big|\mathrm{IUI}_{u',u}^{\mathrm{ul}}(s)\big|^2 + \sum_{t=1}^{C+1} \big|\mathrm{ESI}_{t,u}(s)\big|^2 + \big|N_u^{\mathrm{ul}}(s)\big|^2}, \qquad (10)$$
$$\gamma_t^{\mathrm{rad}}(s) = \frac{\big|\mathrm{DU}_t^{\mathrm{rad}}(s)\big|^2}{\sum_{u=1}^{K_{\mathrm{ul}}} \big|\mathrm{IUI}_{u,t}^{\mathrm{rad}}(s)\big|^2 + \sum_{\substack{t'=1 \\ t'\neq t}}^{C+1} \big|\mathrm{CCI}_{t,t'}(s)\big|^2 + \big|N_t^{\mathrm{rad}}(s)\big|^2}. \qquad (11)$$
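For illustration, the short Python sketch below evaluates the generic SINR structure of (10)-(11) and the resulting uplink rate (8), assuming the desired-signal and interference terms have already been computed from the channels and beamformers; the helper names and example numbers are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the SINR expressions (10)-(11) and the rate (8), assuming
# the scalar desired-signal and interference terms are already available.
import numpy as np

def sinr(desired, interference_terms, noise_power):
    """|DU|^2 divided by the sum of squared interference magnitudes plus the noise power."""
    num = np.abs(desired) ** 2
    den = sum(np.abs(term) ** 2 for term in interference_terms) + noise_power
    return num / den

def uplink_rate(gamma_ul):
    """Achievable uplink rate (8) in bps/Hz."""
    return np.log2(1.0 + gamma_ul)

# Example with placeholder complex amplitudes for one UE.
gamma = sinr(desired=0.8 + 0.1j, interference_terms=[0.05 + 0.02j, 0.03j], noise_power=0.01)
print(gamma, uplink_rate(gamma))
```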
From an information-theoretic perspective, the mutual information (MI) between the transmitted radar waveform and the received echo signal quantifies the amount of information shared between two random variables by measuring the reduction in uncertainty of one given knowledge of the other. In radar systems, MI serves as a fundamental metric to evaluate target detection performance by capturing the correlation between the transmitted radar signal and the target response. A higher MI indicates a stronger correlation, leading to improved detection and estimation accuracy. Prior studies have demonstrated that MI effectively characterizes the accuracy of radar target impulse response estimation [34,35]. Factors such as waveform design, target scattering characteristics, clutter, and path loss all influence the achievable MI. Consequently, maximizing MI plays a critical role in enhancing radar estimation and unifying traditional detection metrics with information-theoretic insights. In particular, in MIMO radar systems detecting point targets, the detection probability is typically proportional to the radar output SINR. Accordingly, the radar performance between the UAV and the target in time step s is characterized using the radar SINR, denoted as γ t rad , which is defined in (11).

2.4. UAV Movement Energy Consumption

We adopt an analytical energy consumption model for rotary-wing UAVs based on actuator disc theory and blade element theory, as established in the classical aeronautical literature [36,37]. This energy consumption model captures the propulsion energy required for rotary-wing UAV flight, which is a critical factor in trajectory design, battery management, and overall mission planning. The same model has been widely adopted and validated in recent studies on UAV-enabled wireless communication systems [38,39,40], further supporting its reliability for performance evaluation and optimization. In this study, we focus on a rotary-wing UAV whose movement energy consumption between two consecutive time steps, i.e., from time step s 1 to s, is characterized following the approach in [41] as
$$E_{\mathrm{uav}}^{\mathrm{mov}}(s) = \tau(s) \times P_{\mathrm{uav}}^{\mathrm{mov}} = \tau(s)\, A_0 \left(1 + \frac{3 v_h^2}{\Omega_{\mathrm{tip}}^2}\right) + \tau(s)\, A_1 \left(\sqrt{1 + \frac{v_h^4}{4 v_0^4}} - \frac{v_h^2}{2 v_0^2}\right)^{1/2} + \tau(s)\, \frac{1}{2}\, d_{\mathrm{fus}}\, \rho_{\mathrm{air}}\, s_{\mathrm{sol}}\, A_{\mathrm{disc}}\, v_h^3 + \tau(s)\, W_{\mathrm{wei}}\, v_t. \qquad (12)$$
We consider the flight duration $\tau(s)$ required for the UAV to travel from its previous position at the $(s-1)$-th time step, $\mathbf{u}(s-1) = (x_{\mathrm{uav}}(s-1), y_{\mathrm{uav}}(s-1), z_{\mathrm{uav}}(s-1))$, to its current position at the $s$-th time step, $\mathbf{u}(s) = (x_{\mathrm{uav}}(s), y_{\mathrm{uav}}(s), z_{\mathrm{uav}}(s))$, which is given by $\tau(s) = \frac{d_{\mathrm{uav}}(s)}{v_{\mathrm{uav}}} = \frac{\|\mathbf{u}(s) - \mathbf{u}(s-1)\|}{v_{\mathrm{uav}}}$, where $d_{\mathrm{uav}}$ denotes the Euclidean distance traveled and $v_{\mathrm{uav}}$ is the UAV’s constant flight speed. The UAV’s energy consumption is modeled using the analytical framework for rotary-wing UAVs, incorporating actuator disc and blade element theories. In this model, the constants $A_0$ and $A_1$ in (12) represent the blade profile power and the induced power, respectively, and are defined as $A_0 = \frac{\delta_{\mathrm{drag}}}{8}\, \rho_{\mathrm{air}}\, s_{\mathrm{sol}}\, A_{\mathrm{disc}}\, \omega_{\mathrm{ang}}^3\, r_{\mathrm{rad}}^3$ and $A_1 = (1 + k_{\mathrm{incre}})\, \frac{W_{\mathrm{wei}}^{3/2}}{\sqrt{2\, \rho_{\mathrm{air}}\, A_{\mathrm{disc}}}}$. Here, $\Omega_{\mathrm{tip}}$ is the tip speed of the rotor blade, while $d_{\mathrm{fus}}$, $\rho_{\mathrm{air}}$, $s_{\mathrm{sol}}$, $A_{\mathrm{disc}}$, and $W_{\mathrm{wei}}$ represent the fuselage drag ratio, air density, rotor solidity, rotor disc area, and UAV weight, respectively. The parameters $\delta_{\mathrm{drag}}$, $\omega_{\mathrm{ang}}$, and $r_{\mathrm{rad}}$ denote the profile drag coefficient, blade angular velocity, and rotor radius, respectively, and $k_{\mathrm{incre}}$ is the incremental correction factor to the induced power. The variable $v_0$ denotes the mean rotor-induced velocity during hover, and $v_{\mathrm{uav}}$ is the UAV’s flight velocity. A maximum velocity constraint $v_{\max}$ is incorporated to ensure realistic operation. Additionally, the UAV’s velocity is decomposed into horizontal and vertical components as $v_h(s) = \frac{\|\mathbf{x}(s) - \mathbf{x}(s-1)\|}{\tau(s)} = v_{\mathrm{uav}} \times \frac{\|\mathbf{x}(s) - \mathbf{x}(s-1)\|}{d_{\mathrm{uav}}(s)}$ and $v_t(s) = \frac{|z_{\mathrm{uav}}(s) - z_{\mathrm{uav}}(s-1)|}{\tau(s)} = v_{\mathrm{uav}} \times \frac{|z_{\mathrm{uav}}(s) - z_{\mathrm{uav}}(s-1)|}{d_{\mathrm{uav}}(s)}$, where $\mathbf{x}(s) = (x_{\mathrm{uav}}(s), y_{\mathrm{uav}}(s))$.
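The energy model (12) translates directly into code. The sketch below follows the closed-form propulsion power expression under the symbol reconstruction used above; the function signature and any values passed to it are illustrative assumptions rather than the paper's implementation.

```python
# A hedged Python sketch of the movement energy model (12); argument names
# mirror A_0, A_1, Omega_tip, v_0, d_fus, rho_air, s_sol, A_disc, and W_wei.
import numpy as np

def movement_energy(p_prev, p_curr, v_uav, A0, A1, omega_tip, v0,
                    d_fus, rho_air, s_sol, A_disc, W_wei):
    """Propulsion energy spent flying from p_prev to p_curr at constant speed v_uav."""
    p_prev, p_curr = np.asarray(p_prev, float), np.asarray(p_curr, float)
    dist = np.linalg.norm(p_curr - p_prev)
    if dist == 0.0:
        return 0.0                                                 # hovering energy is not modeled here
    tau = dist / v_uav                                             # flight duration tau(s)
    v_h = v_uav * np.linalg.norm(p_curr[:2] - p_prev[:2]) / dist   # horizontal speed component
    v_t = v_uav * abs(p_curr[2] - p_prev[2]) / dist                # vertical (climb) speed component
    blade = A0 * (1.0 + 3.0 * v_h**2 / omega_tip**2)               # blade profile power
    induced = A1 * np.sqrt(np.sqrt(1.0 + v_h**4 / (4.0 * v0**4)) - v_h**2 / (2.0 * v0**2))
    parasite = 0.5 * d_fus * rho_air * s_sol * A_disc * v_h**3     # fuselage drag power
    climb = W_wei * v_t                                            # rate of change of potential energy
    return tau * (blade + induced + parasite + climb)
```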

3. Problem Formulation

Before presenting the proposed optimization problem, we first design the beamforming vectors $\mathbf{o}_u$ and $\mathbf{r}_t$ for uplink communication and sensing, respectively. Given that the UAV is equipped with an antenna array whose number of receive antennas exceeds the number of users ($N_r > K_{\mathrm{ul}}$), we employ the zero-forcing (ZF) beamforming technique to effectively exploit the spatial degrees of freedom for interference mitigation and improved radar sensing accuracy. ZF effectively maximizes the SINR by mitigating multi-user interference, thereby enhancing both uplink decoding and sensing performance. Accordingly, the receive beamforming vectors $\mathbf{o}_u \in \mathbb{C}^{1 \times N_r}$ and $\mathbf{r}_t \in \mathbb{C}^{1 \times N_r}$ correspond to the $u$-th and the $(K_{\mathrm{ul}}+1)$-th row of the ZF receiver $\mathbf{A}_{\mathrm{ZF}}$ at the $s$-th time step, which is given by
$$\mathbf{A}_{\mathrm{ZF}} = \left(\big(\mathbf{G}^{\mathrm{ul}}(s)\big)^H \mathbf{G}^{\mathrm{ul}}(s)\right)^{-1} \big(\mathbf{G}^{\mathrm{ul}}(s)\big)^H, \qquad (13)$$
where $\mathbf{G}^{\mathrm{ul}}(s) \triangleq \big[\mathbf{h}_1^{\mathrm{ul}}, \ldots, \mathbf{h}_{K_{\mathrm{ul}}}^{\mathrm{ul}}, \mathbf{h}_1^{\mathrm{ut}}, \ldots, \mathbf{h}_{C+1}^{\mathrm{ut}}\big]$. Moreover, to maximize the transmit gain toward the target during the radar sensing phase, we employ a maximum ratio transmission (MRT) precoder at the UAV transmitter. Accordingly, the radar beamforming vector $\mathbf{z}^{\mathrm{rad}}$ is given by
$$\mathbf{z}^{\mathrm{rad}}\big(\mathbf{u}(s), \mathbf{a}_t^{\mathrm{target}}(s)\big) = \frac{\mathbf{a}_t^H\big(\phi_t^{\mathrm{ut}}(s)\big)}{\big\|\mathbf{a}_t^H\big(\phi_t^{\mathrm{ut}}(s)\big)\big\|}. \qquad (14)$$
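The ZF receiver (13) and the MRT precoder (14) can be computed directly from the stacked channels and the target steering vector, as in the following hedged NumPy sketch; the function names, the stacked-matrix layout, and the row-indexing comment are assumptions made for illustration.

```python
# A minimal NumPy sketch of the ZF receiver (13) and the MRT precoder (14).
import numpy as np

def zf_receiver(G_ul):
    """A_ZF = (G^H G)^(-1) G^H; for a full-column-rank G this equals the pseudo-inverse."""
    return np.linalg.pinv(G_ul)

def mrt_precoder(a_t):
    """z_rad = a_t^H / ||a_t^H||, steering the probing power toward the target."""
    z = np.conj(a_t)
    return z / np.linalg.norm(z)

# Usage sketch: stack the K_ul uplink channels and the C + 1 target/clutter
# channels column-wise, then take the u-th row to decode UE u and the
# (K_ul + 1)-th row as the sensing filter for the desired target.
# G_ul = np.column_stack([h_ul_1, ..., h_ul_K, h_ut_1, ..., h_ut_C1])
# A_zf = zf_receiver(G_ul); o_u = A_zf[u]; r_t = A_zf[K_ul]
```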
Notably, the normalization factor in (14) ensures compliance with the radar transmit power constraint. The radar beamforming and matched filter design at the UAV transceiver relies on steering vectors derived from the relative positions of the UAV, UEs, and targets. The assumption of ideal beamforming using ZF and MRT is justified by the LoS-dominant nature of UAV communication environments. UAVs typically operate at high altitudes, ensuring strong LoS links and enabling accurate channel estimation based on geometric information. Furthermore, since UEs are assumed to be equipped with positioning capabilities (e.g., GPS), their location information can be reliably shared with the UAV to construct precise steering vectors. However, in scenarios involving target mobility, location uncertainty can significantly challenge this beamforming design, potentially degrading both sensing and communication performance. To address this, we formulate an optimization problem aimed at jointly optimizing the UAV trajectory and uplink power control at the UEs while achieving optimal beamforming and matched filtering designs based on the estimated target position. The objective is to maximize the radar SINR across the $S$ time steps while minimizing the UAV movement energy consumption, subject to constraints on the uplink communication rate. The problem is formulated as
$$\max_{\boldsymbol{\eta},\, \mathbf{u},\, \bar{\mathbf{a}}_t^{\mathrm{target}}(s)} \quad \sum_{s=1}^{S} \gamma_t^{\mathrm{rad}}(s) - \epsilon_0 \sum_{s=1}^{S} E_{\mathrm{uav}}^{\mathrm{mov}}(s) \qquad (15a)$$
$$\text{s.t.} \quad 0 \le \eta_u(s) \le 1, \quad \forall u \in \mathcal{K}_{\mathrm{ul}}, \qquad (15b)$$
$$\|\mathbf{u}(s) - \mathbf{u}(s-1)\| \le L_{\max}, \quad \forall s \in \mathcal{S}, \qquad (15c)$$
$$h_{\min} \le z_{\mathrm{uav}}(s) \le h_{\max}, \quad \forall s \in \mathcal{S}, \qquad (15d)$$
$$R_u^{\mathrm{ul}}(s) \ge R_{\mathrm{QoS}}, \quad \forall u \in \mathcal{K}_{\mathrm{ul}},\ \forall s \in \mathcal{S}, \qquad (15e)$$
$$\gamma_t^{\mathrm{rad}}(s) \ge \gamma_0^{\mathrm{rad}}, \quad \forall s \in \mathcal{S}, \qquad (15f)$$
where the scalar $\epsilon_0$ is a weighting coefficient (or trade-off parameter) that balances radar performance and energy efficiency. Constraint (15b) ensures compliance with the uplink power budget, constraint (15c) limits UAV mobility per time step, and constraint (15d) ensures that the UAV’s altitude remains within an allowable range. Constraint (15e) guarantees the quality of service (QoS) requirements for uplink UEs at each time step, and constraint (15f) ensures a minimum radar SINR for reliable sensing. The optimization problem (15) is inherently non-convex due to the complex nature of the objective function (15a) and constraints (15e) and (15f), which include exponential terms, high-order non-linear functions, and fractional expressions. These features make it challenging for traditional convex optimization methods to find globally optimal solutions, even when convex approximations are applied. Moreover, convex optimization techniques lack the ability to utilize historical data for predictive learning, which is essential in our dynamic, time-varying environment. Furthermore, the continuous and high-dimensional action space of problem (15) complicates the application of heuristic methods, which are generally better suited for discrete or simplified search spaces. Consequently, DRL provides a more flexible and effective framework to handle these challenges, enabling adaptive policy learning over continuous action domains.
While many prior studies often assume perfect target location estimates (e.g., using the last known position), such assumptions fail under fast-moving targets. To address this limitation, the proposed method exploits historical trajectory information for enhanced target position prediction. This data-driven approach enables more accurate estimation of future target locations, thereby enhancing robustness and reliability of sensing in dynamic movement scenarios. To tackle the complex, sequential decision-making problem under target location uncertainty, the LSTM-DDPG learning framework is employed. This model integrates the strengths of DRL for continuous control of trajectory and power and LSTM networks for modeling temporal dependencies and predicting future target positions. The proposed LSTM-DDPG framework is trained across diverse target movement scenarios to ensure robust generalization and effective joint optimization of UAV trajectory and UE power control over multiple time steps, while maintaining energy efficiency and radar reliability.

4. LSTM-DDPG-Based Approach

In this section, we present the LSTM-DDPG framework, developed to solve the joint trajectory and power control problem formulated in (15). The proposed method builds upon the DDPG algorithm, a reinforcement learning technique well-suited for continuous action spaces, by integrating it with LSTM networks to capture temporal dependencies in the environment. We begin by briefly reviewing the DDPG algorithm, followed by a description of how LSTM is incorporated to enhance learning performance under target mobility.

4.1. Background on DDPG

The optimization problem formulated in (15) involves continuous action spaces, rendering conventional deep reinforcement learning algorithms such as the deep Q-network (DQN) ineffective, since such methods are designed for discrete action spaces and typically exhibit poor convergence behavior when extended to continuous domains [42]. To address this, we adopt the DDPG algorithm, a model-free, off-policy actor–critic method that leverages deep neural networks to approximate both the policy and the value functions in continuous control settings. The DDPG framework consists of four neural networks: two evaluation networks, the actor ($\mu$) and the critic ($Q$), and two target networks, the target actor ($\mu'$) and the target critic ($Q'$). The actor networks generate actions and interact with the environment, while the critic networks assess the actor’s performance to guide policy updates. The goal is to optimize the parameters $(\theta_\mu, \theta_q, \theta_{\mu'}, \theta_{q'})$ of these networks to learn the optimal policy. Since the critic networks estimate continuous action-value functions, the actor networks directly output optimal continuous actions without requiring a discrete action–value search, making DDPG well suited for continuous control problems. Additionally, DDPG leverages experience replay to handle correlated sequential data, enhancing training stability and performance in continuous-action reinforcement learning tasks.

4.2. Background on LSTM

LSTM networks are a specialized form of recurrent neural network (RNN) designed to capture long-term dependencies in sequential data. Traditional RNNs often struggle with retaining information over long time horizons due to issues such as vanishing or exploding gradients during training. To overcome these limitations, Hochreiter and Schmidhuber introduced LSTM architecture [43], which incorporates a memory cell and three gating mechanisms, namely input gate, forget gate, and output gate, that collectively control the flow of information through the network. This design enables LSTM networks to selectively retain, update, or discard information, making them highly effective for learning temporal patterns and dependencies in complex time-series data.
In the context of the ISAC-UAV system, where the target’s trajectory is uncertain and dynamically evolving, we employ an LSTM structure to predict future positions of the target by learning patterns from historical trajectory data. In this study, LSTM is selected over gated recurrent unit (GRU) and attention-based models based on three key considerations. First, given the limited onboard computational resources of UAV systems, LSTM provides an effective trade-off between modeling capacity and computational efficiency, enabling real-time inference with moderate complexity, unlike attention-based models, which typically require higher memory and processing power. Second, LSTM’s well-established gating mechanism ensures robust training and stable performance in dynamic and partially observable environments, which is crucial for reliable UAV trajectory control. Third, compared to GRU, LSTM is better equipped to capture long-term temporal dependencies, leading to improved decision-making in a sequential state setting. While GRU is more lightweight, it may sacrifice learning accuracy in complex scenarios. Therefore, LSTM was selected as a practical and effective solution for integration with the DDPG framework in our ISAC-UAV system. Integrating LSTM into the DDPG algorithm enables the model to adapt more effectively to dynamic environments, enhance its ability to learn from temporally correlated data, and improve generalization performance across diverse target movement scenarios.

4.3. Proposed LSTM-DDPG-Based Algorithm Architecture

A block diagram of the proposed LSTM-DDPG-based algorithm architecture for the ISAC-UAV system with four neural networks, denoted as $Q$, $\mu$, $Q'$, and $\mu'$, is presented in Figure 2. These networks are initialized with the corresponding parameters $\theta_q$, $\theta_\mu$, $\theta_{q'}$, and $\theta_{\mu'}$, serving as the initial weights of the respective networks. The input to both the evaluation and target actor networks is the environment state observed at each time step. Specifically, the state space $\mathbf{s}_s$ at the $s$-th time step in the actor networks $\mu(\mathbf{s}_s|\theta_\mu)$ and $\mu'(\mathbf{s}_s|\theta_{\mu'})$ is defined as
$$\mathbf{s}_s = \big\{\mathbf{u}(s-1),\ \mathbf{a}^{\mathrm{ul}},\ \mathbf{a}_t^{\mathrm{target}}(s),\ \mathbf{a}_t^{\mathrm{target}}(s-1),\ \ldots,\ \mathbf{a}_t^{\mathrm{target}}(s-\ell)\big\}, \qquad (16)$$
where $\ell$ represents the history window size used as input to the LSTM network, and $\mathbf{a}^{\mathrm{ul}} \triangleq [\mathbf{a}_1^{\mathrm{ul}}, \ldots, \mathbf{a}_{K_{\mathrm{ul}}}^{\mathrm{ul}}]$ represents the uplink UE positions. To effectively exploit the temporal dependencies in the target’s position history, the LSTM network is leveraged in the actor network of the DRL framework, as illustrated in Figure 3. The historical sequence of target positions at the $s$-th time step within the history window is fed into the LSTM module to capture temporal dynamics, while the current UAV and UE positions are processed through a separate feed-forward neural network. The outputs of both the LSTM and the feed-forward network are then concatenated and fed into another fully connected network that generates the final action output of the actor.
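As an illustration of this actor structure, the following hedged TensorFlow/Keras sketch combines an LSTM branch for the target-position history with a fully connected branch for the UAV/UE positions, mirroring Figure 3 and the layer widths reported in Section 5.1.2; the input and output dimensions are assumptions.

```python
# A hedged Keras sketch of the LSTM-based actor in Figure 3. Layer widths (128)
# and activations follow Section 5.1.2; input/output dimensions are assumed.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_actor(history_len=5, target_dim=3, spatial_dim=3 + 2 * 3, action_dim=3 + 3 + 2):
    # Branch 1: historical target positions -> LSTM (temporal features).
    hist_in = layers.Input(shape=(history_len, target_dim), name="target_history")
    temporal = layers.LSTM(128, activation="relu")(hist_in)
    # Branch 2: current UAV position and UE positions -> fully connected layer (spatial features).
    spatial_in = layers.Input(shape=(spatial_dim,), name="uav_ue_positions")
    spatial = layers.Dense(128, activation="relu")(spatial_in)
    # Fusion: concatenate, two FC layers, sigmoid output in [0, 1] (rescaled to the action ranges afterwards).
    x = layers.Concatenate()([temporal, spatial])
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    action = layers.Dense(action_dim, activation="sigmoid", name="action")(x)
    return Model([hist_in, spatial_in], action)

actor = build_actor()   # e.g., UAV 3D position + predicted target position + K_ul = 2 power coefficients
```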
The outputs of the evaluation actor $\mu(\mathbf{s}_s|\theta_\mu)$ and target actor $\mu'(\mathbf{s}_s|\theta_{\mu'})$ networks represent the actions selected by the agent based on its observations of the environment. The corresponding action space $\mathbf{a}_s$ at the $s$-th time step is defined as
$$\mathbf{a}_s = \big\{\mathbf{u}(s),\ \bar{\mathbf{a}}_t^{\mathrm{target}}(s),\ \boldsymbol{\eta}(s)\big\}. \qquad (17)$$
At each time step $s$, the agent utilizes the observed state $\mathbf{s}_s$ to determine the corresponding action $\mathbf{a}_s$, which includes the UAV’s optimal 3D position $\mathbf{u}(s)$, the uplink power control coefficients $\boldsymbol{\eta}(s)$, and the estimated target position $\bar{\mathbf{a}}_t^{\mathrm{target}}(s)$. Based on this estimate, the UAV employs $\bar{\mathbf{a}}_t^{\mathrm{target}}(s)$ to configure the matched filter $\mathbf{r}_t\big(\mathbf{u}(s), \bar{\mathbf{a}}_t^{\mathrm{target}}(s)\big)$ and the radar beamforming vector $\mathbf{z}^{\mathrm{rad}}\big(\mathbf{u}(s), \bar{\mathbf{a}}_t^{\mathrm{target}}(s)\big)$. The objective is to maximize the radar sensing performance while minimizing the UAV movement energy consumption. Consequently, the reward $r_s$ at the $s$-th time step in the ISAC-UAV system is defined as
$$r_s = \gamma_t^{\mathrm{rad}}(s) - \epsilon_0\, E_{\mathrm{uav}}^{\mathrm{mov}}(s). \qquad (18)$$
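A minimal sketch of the per-step reward (18), including the constraint-violation handling later used in Algorithm 1, is given below; the argument names and thresholds are assumptions.

```python
# A minimal sketch of the reward (18) with constraint-violation handling.
def step_reward(gamma_rad, e_mov, eps0, rates_per_ue, r_qos, gamma_rad_min):
    reward = gamma_rad - eps0 * e_mov
    # Zero the reward if any uplink QoS constraint (15e) or the radar SINR constraint (15f) is violated.
    if any(r < r_qos for r in rates_per_ue) or gamma_rad < gamma_rad_min:
        reward = 0.0
    return reward
```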
During the training phase, the agent collects experience tuples of the form $(\mathbf{s}_s, \mathbf{a}_s, r_s, \mathbf{s}_{s+1})$, which are stored in a replay buffer $\mathcal{B}$ with a fixed capacity. This buffer enables the agent to sample past experiences uniformly, thereby reducing the correlation between sequential data and improving training stability. The critic networks utilize samples from $\mathcal{B}$ to learn the mapping between states and actions. Specifically, the input to both the evaluation and target critic networks is the concatenation of the current state $\mathbf{s}$ and the corresponding action $\mathbf{a}$. Similar to the actor network, the critic network is enhanced with an LSTM module to capture temporal dependencies in the target’s trajectory, enabling more accurate state–action value estimation, as shown in Figure 4. Concurrently, the current positions of the UAV and uplink UEs are processed through a dedicated feed-forward neural network to extract spatial features. Additionally, the action taken by the agent is processed separately through another neural network designed to extract action-specific features.
The outputs of the LSTM module, the spatial feature extractor, and the action feature extractor are concatenated and passed through a final fusion neural network to produce the Q-value $Q(\mathbf{s}_s, \mathbf{a}_s)$. This architecture enhances the critic’s ability to accurately estimate value functions by learning complex interactions among temporal dynamics, spatial configurations, and action decisions in highly dynamic environments.
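A companion sketch of the critic in Figure 4 is given below; as before, the input dimensions are assumptions, and the output is a single linear Q-value.

```python
# A hedged Keras sketch of the LSTM-based critic in Figure 4: the target history,
# the UAV/UE positions, and the action are embedded separately, then fused.
from tensorflow.keras import layers, Model

def build_critic(history_len=5, target_dim=3, spatial_dim=3 + 2 * 3, action_dim=3 + 3 + 2):
    hist_in = layers.Input(shape=(history_len, target_dim), name="target_history")
    spatial_in = layers.Input(shape=(spatial_dim,), name="uav_ue_positions")
    action_in = layers.Input(shape=(action_dim,), name="action")
    temporal = layers.LSTM(128, activation="relu")(hist_in)        # temporal features of the target trajectory
    spatial = layers.Dense(128, activation="relu")(spatial_in)     # spatial features of UAV and UEs
    act_feat = layers.Dense(128, activation="relu")(action_in)     # action-specific features
    x = layers.Concatenate()([temporal, spatial, act_feat])
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    q_value = layers.Dense(1, activation="linear", name="q_value")(x)
    return Model([hist_in, spatial_in, action_in], q_value)
```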

4.4. Proposed DDPG-Based Optimization Procedure

Given a mini-batch of $N_B$ transitions $\{(\mathbf{s}_i, \mathbf{a}_i, r_i, \mathbf{s}_{i+1})\}_{i=1}^{N_B}$ sampled from the replay buffer $\mathcal{B}$, the target Q-value $y_i$ is computed based on the Markov decision process framework as
$$y_i = r_i + \xi\, Q'\big(\mathbf{s}_{i+1},\ \mu'(\mathbf{s}_{i+1}|\theta_{\mu'})\,\big|\,\theta_{q'}\big), \qquad (19)$$
where $\xi$ denotes the discount factor applied to future rewards. The following loss function is minimized to update the parameters $\theta_q$ of the evaluation critic network:
$$L(\theta_q) = \frac{1}{N_B} \sum_{i=1}^{N_B} \big(y_i - Q(\mathbf{s}_i, \mathbf{a}_i\,|\,\theta_q)\big)^2. \qquad (20)$$
At each time step, the parameters $\theta_q$ of the evaluation critic network are updated using the gradient descent method:
$$\theta_q \leftarrow \theta_q - \epsilon_2\, \nabla_{\theta_q} L(\theta_q), \qquad (21)$$
where $\epsilon_2 \in [0, 1]$ denotes the learning rate of the critic networks, and $\nabla(\cdot)$ represents the first-order partial derivative (gradient) operator.
Once the critic network is updated, the policy network updates the parameters of the evaluation actor network by using the observed input states $\mathbf{s}_i$ from the mini-batch of size $N_B$ to maximize the expected reward. Based on the deterministic policy gradient theorem [44], the following gradient is applied to update the weights $\theta_\mu$ of the evaluation actor network:
$$\nabla_{\theta_\mu} J(\theta_\mu) \approx \frac{1}{N_B} \sum_{i=1}^{N_B} \nabla_{\theta_\mu}\, \mu(\mathbf{s}_i|\theta_\mu) \times \nabla_{\mathbf{a}}\, Q(\mathbf{s}_i, \mathbf{a}\,|\,\theta_q)\,\big|_{\mathbf{a} = \mu(\mathbf{s}_i|\theta_\mu)}, \qquad (22)$$
where $\nabla_{\mathbf{a}} Q(\mathbf{s}_i, \mathbf{a}\,|\,\theta_q)$ is obtained through backpropagation in the evaluation critic network $Q$ using the actions predicted by the evaluation actor network $\mu$. The parameters $\theta_\mu$ of the evaluation actor network are updated using the gradient ascent algorithm as
$$\theta_\mu \leftarrow \theta_\mu + \epsilon_1\, \nabla_{\theta_\mu} J(\theta_\mu), \qquad (23)$$
where $\epsilon_1 \in [0, 1]$ represents the learning rate for the actor networks. To enhance training stability and mitigate divergence issues in Q-learning, DDPG employs soft updates by gradually adjusting the parameters of the target networks to follow those of the learned evaluation networks. Specifically, the target critic network $Q'(\mathbf{s}, \mathbf{a}\,|\,\theta_{q'})$ and the target actor network $\mu'(\mathbf{s}\,|\,\theta_{\mu'})$ are cloned to compute their respective target values. Their parameters are then updated as
$$\theta_{q'} \leftarrow \tau\, \theta_q + (1 - \tau)\, \theta_{q'}, \qquad \theta_{\mu'} \leftarrow \tau\, \theta_\mu + (1 - \tau)\, \theta_{\mu'}, \qquad (24)$$
where $\tau \ll 1$ is the soft update coefficient that controls the adaptation speed of the target networks. Exploring continuous action spaces poses unique challenges [45]. To promote effective exploration, DDPG introduces a simple yet effective method by adding random noise to the action space. The exploration action $\mathbf{a}_s$ is formed by injecting noise sampled from $\mathcal{N}_s$ into the evaluation actor policy $\mu$, enhancing exploration throughout the learning process:
$$\mathbf{a}_s = \mu(\mathbf{s}_s\,|\,\theta_\mu) + \mathcal{N}_s, \qquad (25)$$
where the noise sample $\mathcal{N}_s$ follows an Ornstein–Uhlenbeck (OU) process that generates temporally correlated random noise, as described in [42]. This inclusion of time-correlated noise plays a crucial role in facilitating more efficient and effective exploration throughout the learning process.
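The update rules (19)-(25) can be summarized in a single hedged TensorFlow sketch of one LSTM-DDPG training step; the network, optimizer, and replay-buffer objects are assumed to exist, and the state/action tensors are treated generically rather than in the exact multi-input format of the actor and critic sketches above.

```python
# A hedged sketch of one LSTM-DDPG training step, following (19)-(25): target
# Q-value, critic loss, deterministic policy gradient, soft target update, and
# OU exploration noise. Object names and shapes are assumptions.
import numpy as np
import tensorflow as tf

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, xi=0.99, tau=0.005):
    states, actions, rewards, next_states = batch
    rewards = tf.reshape(tf.cast(rewards, tf.float32), (-1, 1))   # align shapes with the critic output
    # (19): bootstrapped target using the target actor and target critic.
    y = rewards + xi * target_critic([next_states, target_actor(next_states)])
    # (20)-(21): critic update by minimizing the mean-squared Bellman error.
    with tf.GradientTape() as tape:
        critic_loss = tf.reduce_mean(tf.square(y - critic([states, actions])))
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                                   critic.trainable_variables))
    # (22)-(23): actor update along the deterministic policy gradient (gradient ascent on Q).
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
    actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                  actor.trainable_variables))
    # (24): soft update of the target networks.
    for tgt, src in zip(target_critic.variables, critic.variables):
        tgt.assign(tau * src + (1 - tau) * tgt)
    for tgt, src in zip(target_actor.variables, actor.variables):
        tgt.assign(tau * src + (1 - tau) * tgt)

def ou_noise(prev, theta=0.15, sigma=0.2, dt=1.0):
    """(25): temporally correlated Ornstein-Uhlenbeck exploration noise."""
    return prev + theta * (0.0 - prev) * dt + sigma * np.sqrt(dt) * np.random.normal(size=prev.shape)
```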
Eventually, the training procedure for the proposed LSTM-DDPG framework is detailed in Algorithm 1.
Algorithm 1 Proposed LSTM-DDPG-based Framework to Solve (15).
Initialization: Initialize $Q(\mathbf{s}_s, \mathbf{a}_s\,|\,\theta_q)$, $\mu(\mathbf{s}_s\,|\,\theta_\mu)$, $Q'(\mathbf{s}_s, \mathbf{a}_s\,|\,\theta_{q'})$, and $\mu'(\mathbf{s}_s\,|\,\theta_{\mu'})$ with $\theta_{q'} = \theta_q$ and $\theta_{\mu'} = \theta_\mu$. Set up $\mathcal{B}$, $\epsilon_1$, $\epsilon_2$, $\xi$, $N_B$, $S$, and $ep_{\max}$.
Set $\widehat{ep} \leftarrow 1$, $ep \leftarrow 1$, $s \leftarrow 1$.
for $ep = 1$ to $ep_{\max}$ do
    Sample a dataset $\mathcal{D}_i \in \{\mathcal{D}_1, \ldots, \mathcal{D}_N\}$.
    for $\overline{ep} = 1$ to $\overline{ep}_{\max}$ do
        Set the UAV position $\mathbf{u}(0)$ at time step 0 and gather the historical target positions.
        Set the initial state $\mathbf{s}_1$.
        for $s = 1$ to $S$ do
            Select action $\mathbf{a}_s = \mu(\mathbf{s}_s\,|\,\theta_\mu) + \mathcal{N}_s$.
            Extract $\mathbf{u}(s)$, $\bar{\mathbf{a}}_t^{\mathrm{target}}(s)$, and $\boldsymbol{\eta}(s)$.
            Calculate $\gamma_t^{\mathrm{rad}}(s)$ and $E_{\mathrm{uav}}^{\mathrm{mov}}(s)$ using (11) and (12).
            Determine the next state $\mathbf{s}_{s+1}$ based on (16).
            Calculate the reward $r_s$ using (18).
            If constraints (15e) and (15f) are violated, set $r_s \leftarrow 0$.
            Store $(\mathbf{s}_s, \mathbf{a}_s, r_s, \mathbf{s}_{s+1})$ in $\mathcal{B}$.
            Randomly sample a mini-batch of size $N_B$ from $\mathcal{B}$.
            Update the evaluation critic network using (21).
            Update the evaluation actor network using (23).
            Update the target actor and target critic networks using (24).
        end for
    end for
end for

5. Numerical Results

5.1. Simulation Setup

5.1.1. ISAC-UAV System Setup

To realistically simulate the UAV’s physical movement, we adopt the mechanical parameters listed in Table 1, following the setup in [41]. Specifically, the rotor disc area, tip speed, rotor solidity, fuselage drag ratio, and mean rotor-induced velocity are computed as follows: $A_{\mathrm{disc}} = \pi r_{\mathrm{rad}}^2$, $\Omega_{\mathrm{tip}} = \omega_{\mathrm{ang}}\, r_{\mathrm{rad}}$, $s_{\mathrm{sol}} = \frac{n_{\mathrm{num}}\, c_{\mathrm{aer}}}{\pi\, r_{\mathrm{rad}}}$, $d_{\mathrm{fus}} = \frac{S_{\mathrm{flat}}}{s_{\mathrm{sol}}\, A_{\mathrm{disc}}}$, and $v_0 = \left(\frac{W_{\mathrm{wei}}}{2\, \rho_{\mathrm{air}}\, A_{\mathrm{disc}}}\right)^{0.5}$, where $n_{\mathrm{num}}$ and $c_{\mathrm{aer}}$ denote the number of rotor blades and the aerofoil chord length, respectively, and $S_{\mathrm{flat}}$ is the fuselage equivalent flat-plate area. All relevant parameters are provided in detail in Table 1.
The UAVs, UEs, targets, and clutter sources are initially randomly distributed within a square area of size $L \times L$, with $L = 100$ m, for simulation convenience. However, the UAV’s position is optimized dynamically over the entire time horizon without spatial constraints, allowing it to adapt its trajectory based on the target’s movement. Following the configuration in [46], the radar SINR threshold $\gamma_0^{\mathrm{rad}}$ is set to 12 dB to ensure a target detection probability above 90% while maintaining a false alarm rate of $10^{-4}$. The simulation parameters for wireless communication are listed in Table 2, based on [47].

5.1.2. LSTM-DDPG-Based Algorithm Configuration

Figure 5 presents the overall architecture of the actor and critic networks in the proposed LSTM-DDPG framework. The actor network processes the state $\mathbf{s}_s$, which includes the historical target positions and the current positions of the UAV and UEs. The target’s position history is passed through an LSTM layer, while the spatial features of the UAV/UEs are processed via a fully connected layer, both using ReLU activation. The outputs are concatenated and passed through two additional FC layers with ReLU activation, followed by a final sigmoid-activated output layer to generate the action $\mathbf{a}_s$. The critic network takes both $\mathbf{s}_s$ and $\mathbf{a}_s$ as inputs, processes them through similar structures with ReLU activation, concatenates the outputs, and passes them through two fully connected layers, concluding with a linear-activated output to estimate the Q-value.
The key hyperparameters of the proposed LSTM-DDPG-based framework are as follows. The soft update coefficient $\tau$ is set to 0.005. The learning rates for the actor and critic models, $\epsilon_1$ and $\epsilon_2$, are both set to 0.001. The discount factor $\xi$ is 0.99. The replay buffer memory size $\mathcal{B}$ is 1,000,000, and the mini-batch size $N_B$ is 32. The number of time steps $S$ is 30. The history window size $\ell$ is 5. The total number of episodes $ep_{\max}$ is 2000, with a pre-training phase of $\overline{ep}_{\max} = 100$ episodes. The sizes of the fully connected layers $L_1$ through $L_9$ are all set to 128. Simulations were performed on a desktop computer equipped with an Intel Core i9-13900KF CPU running at 3.40 GHz and 32 GB of RAM, using TensorFlow 2.9.1 and Python via Visual Studio Code.
To train the proposed DRL framework, we generate a dataset consisting of 500 diverse training samples. Each sample includes a unique set of zigzag target trajectories, as well as randomized initial positions for the UAV and ground UEs within a square area of size $L \times L$. This diversity enables the model to learn robust policies that generalize well to unseen scenarios. Specifically, the zigzag movement model is employed to more accurately capture the non-linear and dynamic motion patterns commonly observed in real-world ISAC-UAV applications such as surveillance, target tracking, and mobile sensing [48]. Unlike straight-line motion, this model simulates more realistic mobility by having the target move at a constant speed while periodically changing direction by a fixed angle $\theta_z$, resulting in a piecewise linear trajectory. Straight-line motion is a special case when $\theta_z = 0$. By incorporating this model, we aim to evaluate the ISAC-UAV system under more challenging and realistic conditions, thereby improving the practical relevance and robustness of the proposed approach.
In reinforcement learning, particularly in continuous control tasks such as UAV navigation and beamforming, dataset size alone does not directly determine training effectiveness as it does in supervised learning [49]. Instead, the diversity and representativeness of the experience replay buffer, combined with exploration strategies, play a critical role in learning performance. Our dataset of 500 samples was carefully generated to cover a wide range of UAV states, target positions, and movement patterns, including the zigzag mobility model, thereby ensuring sufficient coverage of the state–action space. Furthermore, the DDPG algorithm continuously interacts with the environment during training, updating policies based on newly collected experience, which enables the model to adapt well beyond the initial dataset limitations. All numerical results are averaged over 100 independent testing datasets.
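For reference, a small Python sketch of the zigzag mobility model described above is given below; the switching period, heading, and speed values are illustrative assumptions.

```python
# A small sketch of the zigzag target mobility model: the target moves at a
# constant speed and alternates its heading by +/- theta_z every few steps.
import numpy as np

def zigzag_trajectory(start, speed, heading, theta_z, n_steps, switch_every=5):
    pos = np.asarray(start, dtype=float)
    traj, sign = [pos.copy()], 1.0
    for s in range(1, n_steps + 1):
        if s % switch_every == 0:                  # periodic direction change by the fixed angle theta_z
            heading += sign * theta_z
            sign = -sign
        pos = pos + speed * np.array([np.cos(heading), np.sin(heading), 0.0])
        traj.append(pos.copy())
    return np.stack(traj)                          # (n_steps + 1) x 3 ground-level positions

# Example: 30 time steps, 2 m per step, zigzag angle of 30 degrees (theta_z = 0 gives a straight line).
track = zigzag_trajectory(start=(10.0, 10.0, 0.0), speed=2.0,
                          heading=0.4, theta_z=np.deg2rad(30.0), n_steps=30)
```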
To comprehensively evaluate the performance of the proposed method, we compare it with several benchmark schemes:
(1)
DDPG-based approach: This approach employs the standard DDPG algorithm to solve problem (15), without leveraging historical target position data.
(2)
KF-DDPG-based approach: In this approach, the LSTM module in the DDPG-based framework is replaced with a Kalman filter (KF) to predict the target’s future position. The KF is commonly used in traditional ISAC systems due to its simplicity and effectiveness in estimating dynamic states under linear system dynamics and Gaussian noise assumptions.
(3)
LSTM-DDPG-based approach in 2D: This approach employs the proposed LSTM-DDPG method in a fixed-altitude (2D) UAV configuration to assess the effect of UAV altitude on system performance.
(4)
LSTM-DDPG-based approach in 3D: This approach corresponds to the full implementation of the proposed LSTM-DDPG framework, solving the complete 3D optimization problem defined in (15), including the UAV altitude.

5.2. Simulation Results

Figure 6a,b show the cumulative reward curves achieved by the LSTM-DDPG model and the baseline DDPG model without LSTM integration, respectively, under the parameter setting $K_{\mathrm{ul}} = 2$ and $C = 4$ for solving problem (15) in the ISAC-UAV system. In Figure 6a, the LSTM-DDPG exhibits rapid convergence and stable performance, achieving near-optimal cumulative rewards within approximately 1000 episodes. The curve remains smooth with minimal variance, indicating the effectiveness of the LSTM module in capturing temporal dependencies and enhancing the agent’s decision-making capabilities across sequential states. In contrast, the baseline DDPG model in Figure 6b exhibits a much slower learning process, requiring over 8000 episodes to reach a comparable reward level. Moreover, its learning trajectory is marked by pronounced instability and oscillations throughout training. These findings clearly demonstrate that the proposed LSTM-DDPG framework outperforms the conventional DDPG approach in terms of both convergence speed and policy robustness. The integration of LSTM layers enables the agent to retain and leverage historical state information, thereby significantly enhancing its ability to predict future target positions and adapt to dynamic environmental conditions. As a result, the proposed learning framework enables the UAV to perform more efficient trajectory planning and resource allocation, thereby reducing movement-related energy consumption while simultaneously enhancing radar performance.
Figure 7 presents a 3D visualization of the UAV’s trajectory generated by the trained LSTM-DDPG-based learning model. The target’s movement is illustrated from the first to the $S$-th time step. In response, the UAV adapts its trajectory, including altitude adjustments, based on the learned policy to effectively track the target, avoid clutter, and maintain reliable uplink communication with the UEs. The visualization highlights the UAV’s ability to dynamically plan efficient paths in complex and time-varying environments. By continuously adjusting its position to preserve sensing quality and communication reliability, the UAV minimizes unnecessary movements and energy consumption while ensuring high radar performance. These results confirm the effectiveness of the proposed LSTM-DDPG framework in enabling energy-efficient and adaptive trajectory planning in dynamic scenarios.
To evaluate energy efficiency, we compare the results obtained by varying the weighting coefficient $\epsilon_0$, as shown in Figure 8. The coefficient $\epsilon_0$ in the reward function governs the trade-off between sensing performance and UAV movement energy consumption. As $\epsilon_0$ increases, the model increasingly prioritizes energy savings, potentially at the expense of radar sensing accuracy, and vice versa. Figure 8 shows the UAV trajectories for $\epsilon_0 = 0.1$ and $\epsilon_0 = 1$. It can be observed that as $\epsilon_0$ increases, the trajectory between the initial and final steps becomes shorter, reflecting the model’s preference for energy-efficient paths. This behavior confirms the model’s adaptability in adjusting to different optimization objectives. Moreover, these results emphasize the importance of carefully selecting $\epsilon_0$ to achieve a desirable balance between radar performance and energy efficiency. Figure 9 further supports this finding by showing a clear reduction in UAV energy consumption as $\epsilon_0$ increases from 0.1 to 1. The model increasingly favors shorter trajectories, thereby minimizing movement-related energy usage.
Figure 10 illustrates the convergence behavior of the proposed LSTM-DDPG framework for solving problem (15) under various learning rates $\epsilon_1$ and $\epsilon_2$ over 1600 training episodes. The learning rate plays a crucial role in controlling the step size of weight updates during training, thereby significantly influencing both convergence speed and training stability. Among the configurations tested, $\epsilon_1 = 0.001$ and $\epsilon_2 = 0.001$ yielded the most stable and efficient convergence. These values were therefore adopted as the default learning rates for all subsequent experiments.
To demonstrate the superior performance of the proposed framework, we conducted extensive comparisons with several benchmark algorithms. As discussed earlier, increasing the value of $\epsilon_0$ places greater emphasis on energy efficiency, which may lead to degraded radar sensing performance, and vice versa. This trade-off is illustrated in Figure 11, which presents the effects of varying $\epsilon_0$ from 0.1 to 1. The proposed LSTM-DDPG framework consistently outperforms the benchmark approaches, achieving the highest radar sensing performance while simultaneously minimizing UAV movement energy consumption. Among the baselines, the KF-DDPG-based approach performs the worst in radar performance, particularly when compared to the framework that employs DDPG alone. This stems from its reliance on a KF to predict the target’s next position. While the KF is effective for linear motion under Gaussian noise, it is ill-suited for non-linear target trajectories, resulting in large prediction errors. These inaccuracies hinder the learning process when integrated with DDPG, ultimately degrading overall system performance. In contrast, the integration of LSTM into the DDPG-based model allows the learning framework to effectively capture and adapt to complex, non-linear motion patterns. Furthermore, the proposed framework also outperforms the variant of the LSTM-DDPG model that does not optimize the UAV’s altitude, underscoring the importance of full 3D trajectory optimization. By enabling the UAV to dynamically adjust its altitude, the proposed model achieves substantial performance gains over all benchmark algorithms. A similar trend is observed in Figure 12, which presents the comparison results for different values of the number of UEs $K_{\mathrm{ul}}$. As $K_{\mathrm{ul}}$ increases, the radar sensing performance of the ISAC-UAV system tends to degrade due to increased uplink interference. Despite this, the proposed LSTM-DDPG framework consistently achieves the best performance, demonstrating its robustness and effectiveness in more densely populated and interference-limited scenarios.
Finally, we evaluate the model complexity by measuring the processing time required for UAV trajectory and resource allocation optimization in a representative test scenario. As shown in Table 3, the results highlight a trade-off between performance and computational complexity.

6. Conclusions

In this study, we investigated an ISAC-UAV system and designed a novel DRL framework to jointly optimize the UAV's trajectory and uplink power, alongside beamforming and matched filtering design, with the goal of maximizing performance while minimizing the UAV's energy consumption. To address the complexity of this optimization problem, we adopted a DDPG-based DRL approach. A primary challenge in this setting lies in the uncertainty of the target's position due to its movement, which can hinder the effectiveness of the learning process. To overcome this limitation, we integrated an LSTM model into both the actor and critic networks of the DDPG framework, enabling the system to better capture temporal dependencies and more accurately estimate the target's motion. The resulting LSTM-DDPG framework demonstrated strong robustness and significantly improved learning stability and performance compared with benchmark algorithms. Our experimental results show that the proposed framework effectively handles non-linear target trajectories, ensuring reliable uplink communication for the UEs while optimizing radar sensing and minimizing the UAV's energy consumption in a dynamic environment. Future work will proceed along several directions. First, we will extend the learning model to handle more complex target motion patterns and scale the ISAC-UAV system to multiple UAVs and multiple moving targets, which raises new challenges in UAV coordination, interference mitigation, and distributed cooperative trajectory optimization, while improving scalability, coverage, and robustness for real-world deployments. Second, we will account for residual SI after cancellation and for beamforming imperfections caused by practical factors such as channel estimation errors and hardware impairments, enabling a more realistic and robust performance analysis for the practical design and deployment of ISAC-enabled UAV systems. Finally, we will develop a theoretical analysis to characterize the convergence and performance bounds of the proposed LSTM-DDPG framework, a task made challenging by the inherent non-convexity of DRL algorithms and the non-Markovian properties of LSTM networks.

Author Contributions

Conceptualization, X.-T.D. and O.-S.S.; methodology, X.-T.D. and O.-S.S.; validation, X.-T.D. and O.-S.S.; investigation, X.-T.D., J.-S.E., B.-M.V. and O.-S.S.; writing—original draft preparation, X.-T.D., J.-S.E., B.-M.V. and O.-S.S.; writing—review and editing, X.-T.D. and O.-S.S.; visualization, O.-S.S.; supervision, O.-S.S.; project administration, O.-S.S.; funding acquisition, O.-S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) under Grants RS-2023-00208995 and RS-2025-02214082.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, J.A.; Rahman, M.L.; Wu, K.; Huang, X.; Guo, Y.J.; Chen, S.; Yuan, J. Enabling joint communication and radar sensing in mobile networks—A survey. IEEE Commun. Surv. Tutor. 2021, 24, 306–345. [Google Scholar] [CrossRef]
  2. Zhou, Y.; Rao, B.; Wang, W. UAV swarm intelligence: Recent advances and future trends. IEEE Access 2020, 8, 183856–183878. [Google Scholar] [CrossRef]
  3. Wen, D.; Zhou, Y.; Li, X.; Shi, Y.; Huang, K.; Letaief, K.B. A survey on integrated sensing, communication, and computation. IEEE Commun. Surv. Tutor. 2024. [Google Scholar] [CrossRef]
  4. Zhou, Y.; Liu, X.; Zhai, X.; Zhu, Q.; Durrani, T.S. UAV-enabled integrated sensing, computing, and communication for Internet of things: Joint resource allocation and trajectory design. IEEE Internet Things J. 2023, 11, 12717–12727. [Google Scholar] [CrossRef]
  5. Meng, K.; Wu, Q.; Ma, S.; Chen, W.; Wang, K.; Li, J. Throughput maximization for UAV-enabled integrated periodic sensing and communication. IEEE Trans. Wirel. Commun. 2022, 22, 671–687. [Google Scholar] [CrossRef]
  6. Memisoglu, E.; Yılmaz, T.; Arslan, H. Waveform design with constellation extension for OFDM dual-functional radar-communications. IEEE Trans. Veh. Technol. 2023, 72, 14245–14254. [Google Scholar] [CrossRef]
  7. Hassanien, A.; Amin, M.G.; Zhang, Y.D.; Ahmad, F. Dual-function radar-communications: Information embedding using sidelobe control and waveform diversity. IEEE Trans. Signal Process. 2015, 64, 2168–2181. [Google Scholar] [CrossRef]
  8. Liu, R.; Li, M.; Liu, Q.; Swindlehurst, A.L. Dual-functional radar-communication waveform design: A symbol-level precoding approach. IEEE J. Sel. Top. Signal Process. 2021, 15, 1316–1331. [Google Scholar] [CrossRef]
  9. Wang, X.; Fei, Z.; Zhang, J.A.; Huang, J.; Yuan, J. Constrained utility maximization in dual-functional radar-communication multi-UAV networks. IEEE Trans. Commun. 2020, 69, 2660–2672. [Google Scholar] [CrossRef]
  10. Lu, Z.; Zhai, L.; Zhou, W.; Xue, K.; Gao, X. Beamforming design and trajectory optimization for integrated sensing and communication supported by multiple UAVs based on DRL. Veh. Commun. 2025, 54, 100932. [Google Scholar] [CrossRef]
  11. Wang, X.; Fei, Z.; Zhang, J.A.; Xu, J. Partially-connected hybrid beamforming design for integrated sensing and communication systems. IEEE Trans. Commun. 2022, 70, 6648–6660. [Google Scholar] [CrossRef]
  12. Wang, Y.; Zhang, S. Hybrid beamforming design for integrated sensing and communication exploiting prior information. In Proceedings of the GLOBECOM 2024-2024 IEEE Global Communications Conference, Cape Town, South Africa, 8–12 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4576–4581. [Google Scholar]
  13. Leyva, L.; Castanheira, D.; Silva, A.; Gameiro, A. Hybrid Beamforming Design for Communication-Centric ISAC. IEEE Sens. J. 2024, 24, 21179–21190. [Google Scholar] [CrossRef]
  14. Hua, H.; Xu, J.; Han, T.X. Optimal transmit beamforming for integrated sensing and communication. IEEE Trans. Veh. Technol. 2023, 72, 10588–10603. [Google Scholar] [CrossRef]
  15. Liu, Z.; Liu, X.; Liu, Y.; Leung, V.C.; Durrani, T.S. UAV assisted integrated sensing and communications for Internet of things: 3D trajectory optimization and resource allocation. IEEE Trans. Wirel. Commun. 2024, 23, 8654–8667. [Google Scholar] [CrossRef]
  16. Dang, X.T.; Nguyen, H.V.; Shin, O.S. Physical Layer Security for IRS-UAV-Assisted Cell-Free Massive MIMO Systems. IEEE Access 2024, 12, 89520–89537. [Google Scholar] [CrossRef]
  17. Wu, J.; Yuan, W.; Hanzo, L. When UAVs meet ISAC: Real-time trajectory design for secure communications. IEEE Trans. Veh. Technol. 2023, 72, 16766–16771. [Google Scholar] [CrossRef]
  18. Yilmaz, M.B.; Xiang, L.; Klein, A. Joint beamforming and trajectory optimization for UAV-aided ISAC with dipole antenna array. In Proceedings of the 2024 27th International Workshop on Smart Antennas (WSA), Dresden, Germany, 17–19 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7. [Google Scholar]
  19. Lyu, Z.; Zhu, G.; Xu, J. Joint maneuver and beamforming design for UAV-enabled integrated sensing and communication. IEEE Trans. Commun. 2022, 22, 2424–2440. [Google Scholar] [CrossRef]
  20. Liu, Y.; Liu, S.; Liu, X.; Liu, Z.; Durrani, T.S. Sensing fairness-based energy efficiency optimization for UAV enabled integrated sensing and communication. IEEE Wirel. Commun. Lett. 2023, 12, 1702–1706. [Google Scholar] [CrossRef]
  21. Liu, Y.; Mao, W.; He, B.; Huangfu, W.; Huang, T.; Zhang, H.; Long, K. Radar probing optimization for joint beamforming and UAV trajectory design in UAV-enabled integrated sensing and communication. IEEE Trans. Commun. 2024, 73, 4469–4485. [Google Scholar] [CrossRef]
  22. Xu, L.; Zhu, Q.; Xia, W.; Wang, Z.; Quek, T.Q.; Zhu, H. Joint placement and beamforming design in UAV enabled multi-stage ISAC system. IEEE Trans. Commun. 2025. [Google Scholar] [CrossRef]
  23. Li, Y.; Yuan, X.; Hu, Y.; Yang, J.; Schmeink, A. Optimal UAV trajectory design for moving users in integrated sensing and communications networks. IEEE Trans. Intell. Transp. Syst. 2023, 24, 15113–15130. [Google Scholar] [CrossRef]
  24. Pan, Y.; Li, R.; Da, X.; Hu, H.; Zhang, M.; Zhai, D.; Cumanan, K.; Dobre, O.A. Cooperative trajectory planning and resource allocation for UAV-enabled integrated sensing and communication systems. IEEE Trans. Veh. Technol. 2023, 73, 6502–6516. [Google Scholar] [CrossRef]
  25. Feng, Y.; Zhao, C.; Luo, H.; Gao, F.; Liu, F.; Jin, S. Networked ISAC based UAV tracking and handover towards low-altitude economy. IEEE Trans. Wirel. Commun. 2025. [Google Scholar] [CrossRef]
  26. Jiang, Y.; Wu, Q.; Chen, W.; Meng, K. UAV-enabled integrated sensing and communication: Tracking design and optimization. IEEE Commun. Lett. 2024, 28, 1024–1028. [Google Scholar] [CrossRef]
  27. Jiang, Y.; Wu, Q.; Chen, W.; Hui, H. Energy-aware UAV-enabled target tracking: Online optimization with location constraints. IEEE Trans. Veh. Technol. 2024, 74, 6668–6673. [Google Scholar] [CrossRef]
  28. Pang, X.; Guo, S.; Tang, J.; Zhao, N.; Al-Dhahir, N. Dynamic ISAC beamforming design for UAV-enabled vehicular networks. IEEE Trans. Wirel. Commun. 2024, 23, 16852–16864. [Google Scholar] [CrossRef]
  29. Liu, X.; Wu, J.; Zhao, C.; Liu, Z. Integrated sensing and communications for UAV assisted Internet of things based on deep reinforcement learning. IEEE Trans. Veh. Technol. 2025, 74, 9604–9616. [Google Scholar] [CrossRef]
  30. Qin, Y.; Zhang, Z.; Li, X.; Huangfu, W.; Zhang, H. Deep reinforcement learning based resource allocation and trajectory planning in integrated sensing and communications UAV network. IEEE Trans. Wirel. Commun. 2023, 22, 8158–8169. [Google Scholar] [CrossRef]
  31. Hou, P.; Huang, Y.; Zhu, H.; Lu, Z.; Huang, S.C.; Yang, Y.; Chai, H. Distributed DRL-based integrated sensing, communication and computation in cooperative UAV-enabled intelligent transportation systems. IEEE Internet Things J. 2024, 12, 5792–5806. [Google Scholar] [CrossRef]
  32. Ye, X.; Mao, Y.; Yu, X.; Sun, S.; Fu, L.; Xu, J. Integrated sensing and communications for low-altitude economy: A deep reinforcement learning approach. IEEE Trans. Wirel. Commun. 2025. [Google Scholar] [CrossRef]
  33. Smida, B.; Wichman, R.; Kolodziej, K.E.; Suraweera, H.A.; Riihonen, T.; Sabharwal, A. In-Band Full-Duplex: The Physical Layer. Proc. IEEE 2024, 112, 433–462. [Google Scholar] [CrossRef]
  34. Chiriyath, A.R.; Paul, B.; Jacyna, G.M.; Bliss, D.W. Inner bounds on performance of radar and communications co-existence. IEEE Trans. Signal Process. 2016, 64, 464–474. [Google Scholar] [CrossRef]
  35. Zhang, Q.; Wang, X.; Li, Z.; Wei, Z. Design and performance evaluation of joint sensing and communication integrated system for 5G mmWave enabled CAVs. IEEE J. Sel. Top. Signal Process. 2021, 15, 1500–1514. [Google Scholar] [CrossRef]
  36. Bramwell, A.R.S.; Balmford, D.; Done, G. Bramwell’s Helicopter Dynamics; Elsevier: Amsterdam, The Netherlands, 2001. [Google Scholar]
  37. Filippone, A. Flight Performance of Fixed and Rotary Wing Aircraft; Elsevier: Amsterdam, The Netherlands, 2006. [Google Scholar]
  38. Wang, J.; Zhang, H.; Zhou, X.; Liu, W.; Yuan, D. Joint resource allocation and trajectory design for energy-efficient UAV assisted networks with user fairness guarantee. IEEE Internet Things J. 2024, 11, 23835–23849. [Google Scholar] [CrossRef]
  39. Dai, X.; Duo, B.; Yuan, X.; Di Renzo, M. Energy-efficient UAV communications with directional antennas: Tilting effect modeling and trajectory optimization. IEEE Trans. Veh. Technol. 2025, 74, 11194–11206. [Google Scholar] [CrossRef]
  40. Pan, H.; Liu, Y.; Sun, G.; Wu, Q.; Gong, T.; Wang, P.; Niyato, D.; Yuen, C. Cooperative UAV-mounted RISs-assisted energy-efficient communications. IEEE Trans. Mobile Comput. 2025, 1–18. [Google Scholar] [CrossRef]
  41. Zeng, Y.; Xu, J.; Zhang, R. Energy minimization for wireless communication with rotary-wing UAV. IEEE Trans. Wirel. Commun. 2019, 18, 2329–2345. [Google Scholar] [CrossRef]
  42. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  43. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  44. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; Volume 32, pp. 387–395. [Google Scholar]
  45. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
  46. De Maio, A.; De Nicola, S.; Huang, Y.; Zhang, S.; Farina, A. Code design to optimize radar detection performance under accuracy and similarity constraints. IEEE Trans. Signal Process. 2008, 56, 5618–5629. [Google Scholar] [CrossRef]
  47. He, Z.; Xu, W.; Shen, H.; Ng, D.W.K.; Eldar, Y.C.; You, X. Full-duplex communication for ISAC: Joint beamforming and power optimization. IEEE J. Sel. Areas Commun. 2023, 41, 2920–2936. [Google Scholar] [CrossRef]
  48. Mbam, C.J. Fixed-Wing UAV Tracking of Evasive Targets in 3-Dimensional Space. Ph.D. Thesis, University of Leeds, Leeds, UK, 2024. [Google Scholar]
  49. de Froissard de Broissia, A.; Sigaud, O. Actor-critic versus direct policy search: A comparison based on sample complexity. arXiv 2016, arXiv:1606.09152. [Google Scholar] [CrossRef]
Figure 1. ISAC-UAV system model for simultaneous target motion sensing and uplink communication with ground UEs.
Figure 2. Block diagram of the proposed LSTM-DDPG-based algorithm architecture for the ISAC-UAV system.
Figure 3. Block diagram of the actor network architecture in the proposed LSTM-DDPG-based framework.
Figure 4. Block diagram of the critic network architecture in the proposed LSTM-DDPG-based framework.
Figure 5. Architecture of the actor and critic networks in the proposed LSTM-DDPG-based framework.
Figure 6. Cumulative reward curves under parameter settings ( K ul = 2 and C = 4 ). (a) The proposed LSTM-DDPG-based approach. (b) The DDPG-based approach without LSTM integration.
Figure 7. The 3D trajectory of the UAV optimized by the proposed LSTM-DDPG framework ( K ul = 2 and C = 4 ).
Figure 8. UAV trajectory optimized by the proposed LSTM-DDPG framework for different values of ϵ 0 ( K ul = 2 and C = 4 ). (a) ϵ 0 = 0.1. (b) ϵ 0 = 1.
Figure 9. UAV movement energy consumption of the proposed LSTM-DDPG-based approach for varying values of ϵ 0 in the ISAC-UAV system ( K ul = 2 and C = 4 ).
Figure 10. Performance comparison of the proposed LSTM-DDPG-based approach under different learning rates in the ISAC-UAV system ( K ul = 2 and C = 4 ).
Figure 11. Performance comparison of the proposed and benchmark approaches under varying values of ϵ 0 in the ISAC-UAV system ( K ul = 2 and C = 4 ).
Figure 12. Performance comparison of the proposed and benchmark approaches under varying numbers of UEs (K) in the ISAC-UAV system ( K ul = 2 and C = 4 ).
Table 1. Simulation parameters for UAV movement.
Tip speed of rotor blade ( tip ): 120 m/s
Fuselage equivalent flat plate area ( flat ): 0.0151 m^2
Fuselage drag ratio ( fus ): 0.6
Air density ( air ): 1.225 kg/m^3
Rotor solidity ( sol ): 0.05
Rotor disc area ( disc ): 0.503 m^2
Weight of UAV ( wei ): 20 N
Profile drag coefficient ( drag ): 0.012
Blade angular velocity ( ang ): 300 rad/s
Blade or aerofoil chord length ( aer ): 0.0157 m
Number of blades ( num ): 4
Rotor radius ( rad ): 0.4 m
Incremental correction factor ( incre ): 0.1
Mean rotor-induced velocity in hover ( v 0 ): 4.03 m/s
Table 2. Simulation parameters for the ISAC-UAV system.
Path loss at reference distance ( C 0 ): 30 dB
Path loss exponent ( α ): 2
Noise power ( σ 2 ): −104 dBm
Maximum flight speed of UAV ( v max ): 20 m/s
Target velocity ( v target ): 20 m/s
Time step duration: 300 ms
Required QoS threshold ( R QoS ): 0.5 bps/Hz
Power budget at UL UEs ( p max ul ): 23 dBm
Power budget at the DL UAV ( p max dl ): 30 dBm
Radar SINR threshold ( γ 0 rad ): 12 dB
Table 3. The processing time for the methods.
DDPG approach: 12.21 ms
KF-DDPG approach: 16.36 ms
LSTM-DDPG approach in 2D: 40.23 ms
LSTM-DDPG approach in 3D: 41.76 ms