Article

RSMA-Assisted Fluid Antenna ISAC via Hierarchical Deep Reinforcement Learning

Centre for Smart Systems and Automation, CoE for Robotics and Sensing Technologies, Faculty of Artificial Intelligence and Engineering, Multimedia University, Cyberjaya 63100, Malaysia
* Author to whom correspondence should be addressed.
Telecom 2026, 7(2), 41; https://doi.org/10.3390/telecom7020041
Submission received: 16 January 2026 / Revised: 20 March 2026 / Accepted: 2 April 2026 / Published: 9 April 2026

Abstract

Integrated sensing and communications (ISAC) requires tight coordination between spatial signal design and multiple-access strategies to balance communication throughput and sensing accuracy under shared spectral and hardware constraints. However, existing ISAC frameworks with rate-splitting multiple access (RSMA) typically rely on fixed antenna arrays and decoupled optimization, which fundamentally limit their ability to adapt to fast channel variations and dynamic sensing requirements. This paper introduces a fluid antenna-enabled RSMA-assisted ISAC architecture, in which movable antenna ports are exploited as a new spatial degree of freedom to enhance adaptability in both communication and sensing operations. Fluid antenna systems (FAS) are deployed at both the base station and user terminals, allowing dynamic port selection that reshapes the effective channel and sensing beampattern in real time. We formulate a joint sum-rate maximization problem subject to explicit sensing-quality constraints, capturing the coupled impact of antenna port selection, RSMA rate allocation, and multi-beam transmit design. The proposed framework maximizes the communication sum-rate while ensuring that the sensing functionality satisfies a predefined sensing quality constraint. This constraint-based ISAC formulation guarantees that sufficient sensing power is directed toward the target while optimizing communication performance. The resulting optimization involves strongly coupled discrete and continuous decision variables, rendering conventional optimization methods ineffective. To address this challenge, a hierarchical deep reinforcement learning (HDRL) framework is developed, where an upper-layer deep Q-network (DQN) determines discrete antenna port selection and a lower-layer twin delayed deep deterministic policy gradient (TD3) algorithm optimizes continuous beamforming and rate-splitting parameters. 
Numerical results demonstrate that the proposed approach significantly improves system performance, achieving higher communication sum-rate while satisfying sensing requirements under dynamic propagation conditions.

1. Introduction

The emergence of sixth-generation (6G) wireless networks marks a paradigm shift toward massive connectivity, which is expected to be a cornerstone of future communication systems [1]. However, the persistent challenge of spectrum scarcity continues to hinder the realization of these next-generation networks. To address this limitation, Integrated Sensing and Communications (ISAC) has emerged as a compelling paradigm that enables the joint utilization of radio frequency (RF) resources for both communication and sensing tasks, particularly through the reuse of radar spectrum [2,3,4,5]. ISAC has garnered significant attention in recent years due to its potential to substantially enhance spectral efficiency, thus mitigating spectrum congestion and minimizing resource wastage, while concurrently decreasing hardware and signaling overhead [6]. Consequently, it has been incorporated into diverse wireless architectures, such as ISAC-enabled wireless power transfer (WPT) systems [7], ISAC-based non-orthogonal multiple access (NOMA) systems [8], and ISAC-assisted physical layer security (PLS) frameworks [9]. Furthermore, the inherent performance tradeoffs between sensing and communication functionalities in ISAC systems have been rigorously analyzed in [10].
With the growing number of users, inter-user interference has emerged as a critical factor constraining communication performance. Rate-splitting multiple access (RSMA) has been developed as an efficient mechanism to manage interference and enhance network efficiency [11]. In the RSMA approach, the transmitter partitions information into a common stream and multiple private streams via linearly precoded rate-splitting. The common stream is intended for decoding by all receivers, while each private stream is independently encoded and is decoded at the corresponding receiver after the common stream has been removed using successive interference cancellation (SIC) [12]. By enabling flexible resource allocation and effective interference mitigation, RSMA has the potential to address the limitations imposed by scarce wireless resources and multi-user communication requirements, thereby significantly improving the performance of communication systems, including ISAC networks [13].
Consequently, RSMA-assisted ISAC provides a promising framework for next-generation wireless networks, offering improved spectral efficiency, robust interference mitigation, and enhanced overall system performance. However, the full potential of RSMA-assisted ISAC may be restricted due to the limited spatial flexibility of conventional fixed antennas, which can constrain interference management and overall communication performance. In this regard, movable antennas (MAs) offer a solution by enabling dynamic spatial repositioning of antennas to synthesize flexible radiation patterns, thereby introducing additional spatial degrees of freedom (DoF) that enhance desired links and effectively suppress interference [14].
Recently, MA technology has emerged as a promising paradigm to further exploit spatial DoFs through antenna position reconfiguration. Traditionally, this is achieved via mechanically movable antenna elements, where antennas are physically repositioned within a predefined region to exploit channel variations and improve system performance.
Beyond mechanical movement, electrically reconfigurable implementations of movable antennas have also been proposed, where antenna movement is emulated without physical displacement. In particular, dense antenna arrays integrated with reconfigurable devices such as pixel-based antenna architectures enable different radiating elements or ports to be activated electronically, thereby realizing virtual antenna movement. Such electrically movable antenna systems offer reduced latency and enhanced adaptability, especially in scenarios where physical movement is constrained [15].
However, electrically reconfigurable movable antenna architectures typically rely on dense arrays of reconfigurable elements and complex control circuitry, which significantly increases hardware complexity and energy consumption. In addition, antenna position adaptation is restricted to discrete switching among predefined ports or pixels, limiting spatial resolution. Furthermore, the dense arrangement of radiating elements may introduce mutual coupling effects that complicate system design and degrade performance. These limitations motivate the development of alternative architectures that can achieve efficient spatial adaptability with reduced implementation overhead.
Building on these observations, the fluid antenna system (FAS) has emerged as a practical and efficient realization of movable antenna concepts. Unlike mechanically movable antennas that rely on physical displacement and electrically reconfigurable architectures that require dense arrays and complex control circuitry, FAS achieves spatial adaptability through a single antenna element that switches among multiple predefined ports within a compact region. This approach significantly reduces hardware complexity and energy consumption, while mitigating mutual coupling effects by avoiding densely packed radiating elements.
In FAS, antenna adaptation is realized through discrete port selection, where the system leverages channel state information (CSI) at each port to determine the optimal antenna position [16], thereby enhancing communication reliability and efficiency. Each antenna element is connected to its respective RF chain via flexible cabling, allowing real-time repositioning of the antenna within a designated region. This spatial mobility introduces a new degree of freedom for system-level optimization. By adaptively configuring antenna locations based on instantaneous channel conditions, FAS technology enables dynamic channel shaping, thereby enhancing both communication and sensing performance. Hence, the integration of FAS with RSMA-assisted ISAC emerges as a natural and powerful evolution for next-generation wireless systems. By combining the near-continuous spatial flexibility of FAS with the interference-mitigation capabilities of RSMA, the system can dynamically optimize antenna positions and rate-splitting strategies to enhance communication performance while preserving sensing accuracy. This integrated framework has the potential to achieve higher spectral efficiency, improved reliability, and enhanced network performance.
The integration of FAS into ISAC gives rise to a mixed discrete–continuous optimization problem that is particularly challenging. Specifically, the antenna port selection involves discrete spatial decisions, while the beamforming and rate-splitting parameters are continuous variables. These two types of decisions are strongly coupled, as any change in antenna port configuration alters the instantaneous channel states, thereby affecting the optimal beamforming and power allocation strategies. This coupling leads to a high-dimensional, non-convex, and combinatorial optimization landscape, rendering traditional convex or alternating optimization methods ineffective. Deep reinforcement learning (DRL) approaches, while capable of model-free optimization, often struggle to efficiently explore and balance actions across heterogeneous discrete–continuous decision spaces, leading to suboptimal convergence and instability [17,18]. Therefore, it is imperative to adopt a hybrid deep reinforcement learning approach that can effectively handle both discrete and continuous action spaces, enabling a more flexible and efficient decision-making mechanism [19].

1.1. Related Works

In [20], the authors investigate ISAC networks and address the challenge of jointly optimizing communication and sensing performance under practical constraints, including beamforming, transmit power, and QoS requirements. To tackle the resulting non-convex and high-dimensional optimization problem, a mixture-of-experts (MoE)-based DRL framework is proposed to enable efficient policy learning, where expert networks specialize in different regions of the state space and a gating network dynamically integrates their outputs, thereby promoting stable convergence and effective exploration of complex state-action spaces. The reported results show that the proposed DRL framework achieves significant improvements in sensing and communication performance, highlighting the critical role of DRL in adaptive and efficient ISAC system design.
Researchers have explored multiple-access schemes for ISAC to enhance spectrum efficiency. RSMA is particularly effective, as it enables partial decoding of interference while treating the rest as noise, offering flexible interference management. This improves spectral efficiency and robustness, making RSMA well suited for multiuser ISAC scenarios with heterogeneous channels and strict sensing–communication requirements. In [21], the authors address the challenge of interference management in multi-antenna dual-functional radar-communication (DFRC) systems. The authors propose an RSMA-assisted framework that splits messages into common and private streams and jointly optimizes message splitting, precoding, and radar sequence design. The obtained results demonstrate that RSMA achieves high spectral efficiency by enabling the common stream to function simultaneously as a radar sequence and an interference mitigation tool. In [22], the authors investigate RSMA-assisted ISAC transmission design under realistic channel fading conditions. The authors develop a geometry-based 3D channel model incorporating a dual-functional base station, multiple communication users, and moving targets such as UAVs, capturing distance-, velocity-, and angle-dependent fading effects. The authors formulate an energy efficiency maximization problem that jointly considers transceiver beamforming, phase shifts, and quality of service (QoS) constraints for communication and sensing. To solve this highly non-convex problem, the authors deploy a DRL approach based on the proximal policy optimization (PPO) algorithm, enabling efficient joint optimization of beamforming and phase shifts. Simulation results demonstrate that RSMA effectively manages interference and improves spectral efficiency in multiuser ISAC scenarios.
In multiuser ISAC systems, interference between communication users and sensing signals can limit both data transmission and sensing accuracy, highlighting the need for advanced multiple access schemes that can jointly optimize these conflicting objectives. In [23], the authors propose an RSMA-enabled ISAC (RISAC) framework, where RSMA precoding and rate-splitting parameters are jointly optimized using a combination of particle swarm optimization and semidefinite relaxation (SDR) techniques. Simulation results show that by exploiting RSMA’s common stream, the proposed framework effectively mitigates both inter-user and sensing–communication interference, achieving a superior trade-off between communication rates and sensing performance and demonstrating the potential of RSMA to enhance the overall efficiency and flexibility of ISAC systems.
The integration of sensing and communication has created a demand for architectures capable of flexibly managing their strong spatial coupling. FAS-assisted ISAC has emerged as a promising solution because the spatial agility of fluid antennas enables more effective optimization of sensing and communication links than fixed-antenna systems. In [24], the authors tackle the complexity of jointly satisfying sensing and communication signal-to-noise ratio (SNR) requirements under limited antenna ports, which is an NP-hard problem arising from the strong coupling between port selection and beamforming. The authors propose an FAS-enabled ISAC framework that exploits dynamic port switching to unlock additional spatial DoFs. Simulation results demonstrate that the proposed approach achieves a 33% reduction in transmit power while ensuring sensing and communications requirements, outperforming conventional and uniformly distributed antenna schemes. However, port selection remains a fundamental challenge in FAS-assisted ISAC systems due to the combinatorial complexity of choosing the optimal subset of ports under joint sensing and communication constraints. This challenge is addressed in [25]. The authors study joint multi-port selection and precoder design for FAS-assisted multiuser MIMO downlink ISAC systems to maximize the sum-rate under sensing power constraints. Since the problem is NP-hard due to strong coupling between port selection and precoding, they propose a DRL-based solution. A constraint-aware neural precoding network is trained via primal–dual unsupervised learning, while port selection is handled by a pointer-network-based A2C framework using sum-rate as the reward. Results show near-optimal performance close to exhaustive search, over two-fold gains versus random selection, and strong robustness even with only 15% CSI, highlighting the effectiveness of DRL-enabled FAS for ISAC optimization under limited CSI.
In [26], the authors address the limitations of fixed-position antenna (FPA)-based ISAC systems in exploiting spatial degrees of freedom by proposing a FAS-enhanced ISAC framework. The authors jointly optimize FAS positions and dual-functional beamforming to maximize sensing SNR while ensuring minimum communication SINR per user, under FAS movement and port separation constraints. For perfect CSI, an alternating optimization (AO) algorithm using semidefinite relaxation (SDR) and successive convex approximation (SCA) is developed, while for imperfect CSI, an AO-based design with the S-Procedure and SCA handles uncertainty. Simulation results demonstrate sensing SNR gains of 8.72–178.85% while satisfying QoS. In another research effort [27], the authors integrate FAS with RSMA for downlink multiuser communications to improve reliability and outage performance. Users dynamically adjust FAS positions, and the BS employs RSMA signaling to flexibly manage interference. The channel gain distributions are modeled via a joint multivariate t-distribution with a copula-based formulation, enabling accurate analytical and asymptotic outage probability expressions. The reported results show that FAS-RSMA significantly lowers outage probability and achieves higher spectral efficiency. Thus, integrating FAS and RSMA can provide a robust and efficient solution for multi-user ISAC systems.

1.2. Contributions

Motivated by the above discussion, this paper investigates the integration of FAS into RSMA-assisted ISAC networks to overcome the challenges of spectrum scarcity, inter-user interference, and limited spatial flexibility in next-generation wireless systems. Although ISAC enhances spectrum reuse and RSMA provides robust interference management, their joint performance is still constrained by the fixed nature of conventional antennas. By offering adaptive spatial diversity through rapid port switching, FAS enables dynamic channel reshaping, stronger desired links, and more effective interference suppression. To fully exploit these advantages, we develop an FAS-enabled RSMA-ISAC framework supported by a hierarchical deep reinforcement learning (HDRL) approach that jointly optimizes beamforming, rate-splitting, and antenna port selection, thereby improving both communication efficiency and sensing performance. The key technical contributions are summarized as follows:
  • We propose an FAS-assisted RSMA-enabled ISAC framework in which both the base station (BS) and users leverage FAS to dynamically optimize antenna port locations, thereby enhancing spatial diversity and interference management.
  • We formulate a joint optimization problem to maximize the communication sum-rate while satisfying sensing beampattern gain and transmit power constraints by jointly designing RSMA rate-splitting ratios, transmit beamforming, and FAS port selection.
  • To solve the mixed discrete–continuous optimization challenge, an HDRL framework is developed that integrates a deep Q-network (DQN) for discrete port selection with a twin delayed deep deterministic policy gradient (TD3) algorithm for continuous beamforming and rate-splitting optimization.
  • Simulation results demonstrate that the proposed RSMA-based joint design substantially improves the sum rate, highlighting the powerful synergy between RSMA’s flexible interference management and the spatial adaptability of FAS in enhancing ISAC performance compared to state-of-the-art benchmark approaches.

1.3. Organization

The rest of the paper is organized as follows. Section 2 introduces the system model and the objective function. Section 3 presents the proposed algorithm. Section 4 provides analysis and insightful discussion. Finally, Section 5 concludes the paper and highlights the future research direction.

2. System Formulation

A single-cell ISAC system is considered, as illustrated in Figure 1, comprising a dual-function BS that performs both communication and sensing services, a set of $K$ users, and a sensing target. The BS is equipped with $N \geq 2$ fluid antennas, each capable of switching among $\varpi_t$ discrete ports, denoted by $\varsigma_n = \{1, 2, \ldots, \varpi_t\}$. Moreover, each user is equipped with a single fluid antenna consisting of $\varpi_r$ ports, where $\varsigma_k = \{1, 2, \ldots, \varpi_r\}$. Both the BS and users can instantly switch their radiating elements among these ports, enabling spatial reconfigurability and improved link reliability. These fluid antennas are connected to a dedicated RF chain via integrated waveguides or flexible cables, enabling continuous reconfiguration and movement within predefined spatial regions. Let $\mathcal{S}_t \subset \mathbb{R}^2$ and $\mathcal{S}_r \subset \mathbb{R}^2$ denote the feasible positioning regions for the BS and users' fluid antennas, respectively. The position of the $n$-th BS fluid antenna at port $\mu \in \varsigma_n$ is denoted as $\mathbf{t}_n^{\mu} = [x_n^{\mu}, y_n^{\mu}]^T$, and the positions of all BS antennas are collectively represented as $\bar{\mathbf{t}} = [\mathbf{t}_1^{\mu_1}, \mathbf{t}_2^{\mu_2}, \ldots, \mathbf{t}_N^{\mu_N}] \in \mathbb{R}^{2 \times N}$. Similarly, the single fluid antenna at user $k$ can switch among $\varpi_r$ discrete ports located within the feasible region $\mathcal{S}_r$, where the position corresponding to port $\eta \in \varsigma_k$ is denoted by $\mathbf{r}_k^{\eta} = [x_k^{\eta}, y_k^{\eta}]^T$.

2.1. Channel Model

A far-field channel model is adopted, under the assumption that the movement range of the fluid antennas is significantly smaller than the distance between the transmitter and receiver [28]. The angles of departure (AoD) and angles of arrival (AoA) for each multipath component are considered constant.
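The variation in propagation distance for the $n$-th BS antenna at port $\mu \in \varsigma_n$, along the $i$-th transmission path, $i \in \{1, \ldots, p_t^{c_k}\}$, is determined by
$$\rho_{t,c_k}^{i}(\mathbf{t}_n^{\mu}) = x_n^{\mu} \sin\phi_{t,c_k}^{i} \cos\psi_{t,c_k}^{i} + y_n^{\mu} \cos\phi_{t,c_k}^{i},$$
where $\phi_{t,c_k}^{i} \in [0, \pi]$ and $\psi_{t,c_k}^{i} \in [0, \pi]$ are the elevation and azimuth AoDs for the $i$-th path to user $k$, and $p_t^{c_k}$ is the total number of transmission paths from the BS to user $k$.
The transmit response vector for the $n$-th BS antenna at port $\mu$ is:
$$\mathbf{e}_{c_k}(\mathbf{t}_n^{\mu}) \triangleq \left[ e^{j\frac{2\pi}{\lambda}\rho_{t,c_k}^{1}(\mathbf{t}_n^{\mu})}, \ldots, e^{j\frac{2\pi}{\lambda}\rho_{t,c_k}^{p_t^{c_k}}(\mathbf{t}_n^{\mu})} \right]^T \in \mathbb{C}^{p_t^{c_k} \times 1}, \quad \forall n \in \mathcal{N}.$$
After combining the response vectors of all $N$ BS antennas, the transmit response matrix for user $k$ is:
$$\mathbf{E}_{c_k}(\bar{\mathbf{t}}) \triangleq \left[ \mathbf{e}_{c_k}(\mathbf{t}_1^{\mu_1}), \mathbf{e}_{c_k}(\mathbf{t}_2^{\mu_2}), \ldots, \mathbf{e}_{c_k}(\mathbf{t}_N^{\mu_N}) \right] \in \mathbb{C}^{p_t^{c_k} \times N},$$
where, as before, $p_t^{c_k}$ is the total number of transmission paths from the BS to user $k$.
For the $l$-th receive path at user $k$, the propagation distance offset for user port $\eta \in \varsigma_k$ is:
$$\rho_{r}^{k,l}(\mathbf{r}_k^{\eta}) = x_k^{\eta} \sin\phi_{r}^{k,l} \cos\psi_{r}^{k,l} + y_k^{\eta} \cos\phi_{r}^{k,l}, \quad l \in \{1, \ldots, p_r\},$$
where $\phi_{r}^{k,l} \in [0, \pi]$ and $\psi_{r}^{k,l} \in [0, \pi]$ denote the elevation and azimuth AoA for the $l$-th receive path of user $k$, and $p_r$ is the total number of receive paths.
The corresponding receive response vector for user $k$ at port $\eta$ is:
$$\mathbf{f}_k(\mathbf{r}_k^{\eta}) \triangleq \left[ e^{j\frac{2\pi}{\lambda}\rho_{r}^{k,1}(\mathbf{r}_k^{\eta})}, \ldots, e^{j\frac{2\pi}{\lambda}\rho_{r}^{k,p_r}(\mathbf{r}_k^{\eta})} \right]^T \in \mathbb{C}^{p_r \times 1}.$$
Let $\boldsymbol{\Sigma}_k \in \mathbb{C}^{p_r \times p_t^{c_k}}$ denote the path response matrix from the BS reference point $\mathbf{t}_0$ to user $k$'s reference point $\mathbf{r}_0$. The overall channel vector from the BS with port selection to user $k$ at port $\eta$ is:
$$\mathbf{h}_k^{\eta,\mu} = \mathbf{f}_k^H(\mathbf{r}_k^{\eta}) \, \boldsymbol{\Sigma}_k \, \mathbf{E}_{c_k}(\bar{\mathbf{t}}),$$
where $\mathbf{f}_k^H(\mathbf{r}_k^{\eta})$ is the receive steering vector and $\mathbf{E}_{c_k}(\bar{\mathbf{t}})$ is the BS transmit steering matrix for the selected ports.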
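The channel construction above can be traced numerically. The following is a minimal sketch using only the Python standard library; the function names, the nested-list layout of the path response matrix, and the example geometry are illustrative assumptions rather than the authors' implementation.

```python
import cmath
import math

def path_offset(pos, elev, azim):
    """Path-length offset x*sin(elev)*cos(azim) + y*cos(elev) for a port at pos=(x, y)."""
    x, y = pos
    return x * math.sin(elev) * math.cos(azim) + y * math.cos(elev)

def response_vector(pos, angles, wavelength):
    """One phase term exp(j*2*pi/lambda*rho) per path; angles is a list of (elev, azim)."""
    return [cmath.exp(1j * 2 * math.pi / wavelength * path_offset(pos, e, a))
            for e, a in angles]

def channel_vector(user_pos, bs_positions, aoa, aod, sigma, wavelength):
    """Evaluate f^H * Sigma * E and return one complex entry per BS antenna.

    sigma is the p_r x p_t path response matrix as a nested list (assumed layout).
    """
    f = response_vector(user_pos, aoa, wavelength)
    # Row vector f^H * Sigma, one entry per transmit path.
    row = [sum(f[l].conjugate() * sigma[l][i] for l in range(len(f)))
           for i in range(len(aod))]
    h = []
    for pos in bs_positions:
        e = response_vector(pos, aod, wavelength)  # column of E for this BS port
        h.append(sum(row[i] * e[i] for i in range(len(e))))
    return h

# Hypothetical single-path example: two BS ports half a wavelength apart along x.
wavelength = 0.1
angles = [(math.pi / 2, 0.0)]
h = channel_vector((0.0, 0.0), [(0.0, 0.0), (wavelength / 2, 0.0)],
                   angles, angles, [[1.0 + 0j]], wavelength)
```

In this toy geometry, moving the second BS port by half a wavelength flips the sign of its channel coefficient, which is the channel-reshaping effect that port selection exploits.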

2.2. Communication Model

To enhance spectral efficiency and user fairness in the downlink, the system incorporates RSMA at the BS. RSMA enables joint communication and sensing by allowing flexible interference management and partial interference decoding at the receivers.
In this framework, the message $W_k$ intended for user $k$ is split into two parts: a common message $W_{c,k}$, decodable by all users, and a private message $W_{p,k}$, $k \in \mathcal{K}$, intended specifically for user $k$. All common messages $\{W_{c,1}, \ldots, W_{c,K}\}$ are combined into $W_c$ and encoded as $s_c$, while the private messages $\{W_{p,1}, \ldots, W_{p,K}\}$ are independently encoded into private streams $\{s_{p,1}, \ldots, s_{p,K}\}$.
Then, the transmit signal is defined as $\mathbf{s} = [s_{p,1}, \ldots, s_{p,K}, s_c]^T \in \mathbb{C}^{(K+1) \times 1}$, where $\mathbb{E}[\mathbf{s}\mathbf{s}^H] = \mathbf{I}_{K+1}$. The signal transmitted by the BS is $\mathbf{x} = \mathbf{W}\mathbf{s}$, where $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_K, \mathbf{w}_c] \in \mathbb{C}^{N \times (K+1)}$ denotes the beamforming matrix, which depends on the selected ports. The received signal at port $\eta$ of user $k$ is $y_k^{\eta} = (\mathbf{h}_k^{\eta,\mu})^H \mathbf{x} + n_k$, where $\mathbf{h}_k^{\eta,\mu}$ is the channel vector from the BS to user $k$ for the selected user port $\eta$ and BS ports $\mu$. For ease of notation, the channel vector $\mathbf{h}_k^{\eta,\mu}$ is denoted as $\mathbf{h}_k$ throughout the remainder of the paper.
Consequently, the received signal at user $k$ can be expressed as
$$y_k = \mathbf{h}_k^H \mathbf{x} + n_k,$$
where $\mathbf{h}_k \in \mathbb{C}^{N \times 1}$ represents the channel between the BS and user $k$, and $n_k \sim \mathcal{CN}(0, \sigma_k^2)$ represents the AWGN at user $k$.
According to the RSMA decoding mechanism [29], user $k$ first decodes the common stream $s_c$ by treating all private streams as noise. Thus, the SINR of the common stream at user $k$ is evaluated by
$$\gamma_{c,k} = \frac{|\mathbf{h}_k^H \mathbf{w}_c|^2}{\sum_{i=1}^{K} |\mathbf{h}_k^H \mathbf{w}_i|^2 + \sigma_k^2}.$$
Then, the achievable rate of the common stream at user $k$ is evaluated by
$$R_{c,k} = \log_2\left(1 + \gamma_{c,k}\right).$$
After successfully decoding and removing $s_c$ via SIC, user $k$ can decode the private data stream $s_{p,k}$. The SINR of user $k$'s private stream is determined as:
$$\gamma_{p,k} = \frac{|\mathbf{h}_k^H \mathbf{w}_k|^2}{\sum_{i=1, i \neq k}^{K} |\mathbf{h}_k^H \mathbf{w}_i|^2 + \sigma_k^2}.$$
Then, the achievable rate of user $k$'s private stream is determined by
$$R_{p,k} = \log_2\left(1 + \gamma_{p,k}\right).$$
To ensure successful decoding of the common stream $s_c$ by all users, the transmission rate of the common stream must not exceed the achievable decoding rate at any user. For instance, when decoding $s_c$, user $k$ achieves a rate $R_{c,k}$ determined by its received SINR. Since the same common stream is decoded by all users, its transmission rate must satisfy
$$R_c \leq R_{c,k}, \quad \forall k \in \mathcal{K}.$$
Let $r_{c,k}$ denote the portion of the common-stream rate that carries the common message $W_{c,k}$ of user $k$. All such portions are multiplexed into the common stream $s_c$, whose total transmission rate is determined by
$$R_c = \sum_{k=1}^{K} r_{c,k}.$$
Accordingly, to guarantee successful decoding of the common stream at all users, the common-rate allocation must satisfy
$$\sum_{j=1}^{K} r_{c,j} \leq R_{c,k}, \quad \forall k \in \mathcal{K}.$$
The achievable rate of user $k$ is therefore expressed as
$$R_k = R_{p,k} + r_{c,k}.$$
We adopt the sum rate as the communication performance metric, which is given by
$$R = \sum_{k=1}^{K} \left(R_{p,k} + r_{c,k}\right) = \sum_{k=1}^{K} R_{p,k} + R_c.$$
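The rate expressions of this subsection can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, channels and beams are plain lists of complex numbers, and the common rate is set to the smallest per-user common rate and divided by fixed `shares`, which is an assumed splitting rule rather than the optimized allocation.

```python
import math

def inner(h, w):
    """Inner product h^H w for two equal-length complex lists."""
    return sum(a.conjugate() * b for a, b in zip(h, w))

def rsma_rates(H, W_priv, w_c, noise):
    """Return per-user common-stream rates R_{c,k} and private rates R_{p,k}.

    H[k] is the channel of user k, W_priv[i] the private beam of user i,
    w_c the common beam, and noise[k] the noise power sigma_k^2.
    """
    K = len(H)
    R_c, R_p = [], []
    for k in range(K):
        priv = [abs(inner(H[k], W_priv[i])) ** 2 for i in range(K)]
        # Common stream: all private streams are treated as noise.
        gamma_c = abs(inner(H[k], w_c)) ** 2 / (sum(priv) + noise[k])
        # Private stream: the common stream has been removed by SIC.
        gamma_p = priv[k] / (sum(priv) - priv[k] + noise[k])
        R_c.append(math.log2(1 + gamma_c))
        R_p.append(math.log2(1 + gamma_p))
    return R_c, R_p

def sum_rate(R_c, R_p, shares):
    """Sum rate with R_c = min_k R_{c,k} split as r_{c,k} = shares[k] * R_c.

    shares must be non-negative and sum to one (assumed splitting rule).
    """
    common = min(R_c)
    r_c = [s * common for s in shares]
    return sum(R_p) + sum(r_c), r_c

# Hypothetical two-user, two-antenna example with orthogonal channels.
H = [[1 + 0j, 0j], [0j, 1 + 0j]]
W_priv = [[1 + 0j, 0j], [0j, 1 + 0j]]
w_c = [1 + 0j, 1 + 0j]
R_c, R_p = rsma_rates(H, W_priv, w_c, [1.0, 1.0])
total, r_c = sum_rate(R_c, R_p, [0.5, 0.5])
```

Taking the minimum over users is the simplest way to satisfy the common-stream decodability condition; any allocation whose total does not exceed that minimum is equally valid.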

2.3. Sensing Model

As shown in Figure 1, the detection signal reaches a single target through the direct channel $\mathbf{h}_r \in \mathbb{C}^{N \times 1}$. The echo signal returns to the BS along the same path. Since the BS employs an FAS with $N$ antennas, and each antenna can select a port $\mu_n \in \varsigma_n$, the equivalent target channel depends on the BS port configuration. Hence, the received signal at the BS is determined by
$$\mathbf{y}_s = \mathbf{A}_r^{\mu} \mathbf{x} + \mathbf{n},$$
where $\mathbf{A}_r^{\mu} = \mathbf{h}_r^{\mu} (\mathbf{h}_r^{\mu})^H \in \mathbb{C}^{N \times N}$ represents the sensing channel for the selected BS ports, $\mathbf{h}_r^{\mu}$ represents the BS-to-target channel for the selected port configuration $\mu$, and $\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma_s^2 \mathbf{I}_N)$ represents the additive Gaussian noise with covariance $\sigma_s^2 \mathbf{I}_N$. The probability of target detection improves monotonically with the sensing signal-to-noise ratio (SNR), which is evaluated as
$$\gamma_r = \frac{\mathrm{Tr}\left(\mathbf{W}^H (\mathbf{A}_r^{\mu})^H \mathbf{A}_r^{\mu} \mathbf{W}\right)}{N \sigma_s^2}.$$
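Because the sensing channel is the rank-one matrix formed from the target channel, the trace in the sensing SNR collapses to the squared norm of the target channel times the total beam power projected onto it. The sketch below evaluates both forms so they can be compared; the function names and the toy values are illustrative assumptions.

```python
def sensing_snr(h_r, beams, noise_var):
    """Closed form: ||h||^2 * sum_j |h^H w_j|^2 / (N * sigma_s^2)."""
    N = len(h_r)
    norm_sq = sum(abs(a) ** 2 for a in h_r)
    beam_gain = sum(abs(sum(a.conjugate() * b for a, b in zip(h_r, w))) ** 2
                    for w in beams)
    return norm_sq * beam_gain / (N * noise_var)

def sensing_snr_direct(h_r, beams, noise_var):
    """Direct form: build A = h h^H and sum ||A w_j||^2 over all beams."""
    N = len(h_r)
    A = [[h_r[i] * h_r[j].conjugate() for j in range(N)] for i in range(N)]
    total = 0.0
    for w in beams:
        Aw = [sum(A[i][j] * w[j] for j in range(N)) for i in range(N)]
        total += sum(abs(v) ** 2 for v in Aw)
    return total / (N * noise_var)

# Hypothetical example: two antennas, two unit beams, unit noise variance.
h_r = [1 + 0j, 1j]
beams = [[1 + 0j, 0j], [0j, 1 + 0j]]
snr_fast = sensing_snr(h_r, beams, 1.0)
snr_check = sensing_snr_direct(h_r, beams, 1.0)
```

The closed form avoids building the N-by-N matrix, which matters when the SNR is re-evaluated for every candidate port configuration.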

2.4. Objective Function

The objective is to maximize the system sum-rate in an FAS-enabled RSMA-assisted ISAC system while ensuring that the sensing power toward the target direction satisfies a minimum required threshold. This design reflects a practical ISAC operation in which communication performance is optimized under sensing quality constraints. The optimization variables include the transmit beamforming matrix $\mathbf{W}$, the BS port-selection vector $\boldsymbol{\mu}$, and the user port-selection vector $\boldsymbol{\eta}$. These variables jointly determine the allocation of spatial resources and the decoding performance of both the common and private RSMA streams. Accordingly, the joint optimization problem is formulated as:
$$
\begin{align}
\max_{\mathbf{W}, \boldsymbol{\mu}, \boldsymbol{\eta}} \quad & R \tag{19} \\
\text{s.t.} \quad & \mu_n \in \varsigma_n, \quad \forall n \in \mathcal{N}, \tag{19a} \\
& \eta_k \in \varsigma_k, \quad \forall k \in \mathcal{K}, \tag{19b} \\
& \left\| \mathbf{t}_n^{\mu_n} - \mathbf{t}_v^{\mu_v} \right\|_2 \geq D, \quad \forall n, v \in \mathcal{N}, \; n \neq v, \tag{19c} \\
& \mathrm{Tr}\left(\mathbf{W}\mathbf{W}^H\right) \leq P_{\max}, \tag{19d} \\
& \gamma_r \geq \gamma_{\min}, \tag{19e} \\
& \sum_{j \in \mathcal{K}} r_{c,j} \leq R_{c,k}, \quad \forall k \in \mathcal{K}, \tag{19f} \\
& r_{c,k} \geq 0, \quad \forall k \in \mathcal{K}. \tag{19g}
\end{align}
$$
The optimization problem in (19) aims to maximize the system sum-rate by jointly optimizing the transmit beamforming matrix $\mathbf{W}$, the BS port-selection vector $\boldsymbol{\mu}$, and the user port-selection vector $\boldsymbol{\eta}$. Constraint (19a) ensures that each BS fluid antenna selects a valid port from its available discrete port set $\varsigma_n = \{1, 2, \ldots, \varpi_t\}$, while constraint (19b) enforces that each user selects one of its available ports from $\varsigma_k = \{1, 2, \ldots, \varpi_r\}$. Constraint (19c) guarantees a minimum Euclidean separation $D$ between any two active BS antenna ports to limit mutual coupling effects and maintain antenna performance. Constraint (19d) enforces the maximum transmit power budget $P_{\max}$ at the BS by limiting the total radiated power of all beams. Constraint (19e) ensures that the sensing functionality of the ISAC system is preserved by enforcing a minimum sensing quality requirement. Constraint (19f) enforces the RSMA common-stream decodability condition by ensuring that the total allocated common rate does not exceed the achievable common-stream decoding rate at any user. This guarantees that the common stream can be successfully decoded by all users. Constraint (19g) ensures the non-negativity of the common-rate allocation variables. Due to the discrete port selections at the BS and users, combined with the non-linear dependence of the sum-rate on beamforming and port indices as well as the coupled rate and separation constraints, the optimization problem is highly non-convex. To address this challenge, an HDRL algorithm is deployed to efficiently learn policies for maximizing the system sum-rate while satisfying both communication and sensing requirements.
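A candidate solution can be screened against constraints (19c) to (19g) with a simple feasibility check. The sketch below is a minimal illustration under stated assumptions: the function name, argument layout, and numerical tolerance are hypothetical, and constraints (19a) and (19b) are assumed to hold by construction because ports are drawn from the valid index sets.

```python
import math

def is_feasible(bs_positions, beams, gamma_r, r_c, R_c_users,
                D, P_max, gamma_min, tol=1e-9):
    """Check constraints (19c)-(19g) for one candidate configuration."""
    # (19c): every pair of active BS ports is at least D apart.
    for n in range(len(bs_positions)):
        for v in range(n + 1, len(bs_positions)):
            if math.dist(bs_positions[n], bs_positions[v]) < D - tol:
                return False
    # (19d): Tr(W W^H) equals the sum of squared magnitudes of all beam entries.
    power = sum(abs(x) ** 2 for w in beams for x in w)
    if power > P_max + tol:
        return False
    # (19e): minimum sensing SNR.
    if gamma_r < gamma_min - tol:
        return False
    # (19g): non-negative common-rate portions.
    if any(r < 0 for r in r_c):
        return False
    # (19f): total common rate decodable by every user.
    if sum(r_c) > min(R_c_users) + tol:
        return False
    return True

# Hypothetical example values.
ok = is_feasible([(0.0, 0.0), (0.5, 0.0)], [[1.0, 0.0], [0.0, 1.0]],
                 gamma_r=1.0, r_c=[0.2, 0.3], R_c_users=[0.5, 0.8],
                 D=0.5, P_max=2.0, gamma_min=1.0)
```

Comparing the total common rate against the minimum over users is equivalent to checking (19f) for each user separately.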

3. Hierarchical Deep Reinforcement Learning (HDRL) Approach

Conventional RL algorithms are insufficient in this setting due to their limited scalability when dealing with high-dimensional state–action spaces and their inability to simultaneously manage discrete and continuous decision variables. To address these limitations, we adopt an HDRL approach that integrates DQN with the TD3 algorithm. In this design, DQN handles the discrete actions, while TD3 manages the continuous, high-dimensional action space. The DQN manages FAS port selection by outputting Q-values for all candidate ports and selecting the port with the largest Q-value, i.e., the largest estimated long-term reward, while its exploration mechanism enables efficient and robust port switching. In contrast, the RSMA–ISAC resource-allocation variables are continuous and therefore benefit from deterministic policy-gradient optimization. For this purpose, TD3 is employed to improve stability through its clipped double-Q learning, target policy smoothing, and delayed actor updates, enabling efficient refinement of continuous physical-layer parameters. The combination of DQN and TD3 enables shared experience replay, improved sample efficiency, and coordinated updates across decision layers. This hierarchical decomposition reduces action dimensionality, preserves the mathematical structure of both discrete and continuous decisions, stabilizes training, and prevents the quantization loss or infeasible actions that arise when continuous-control algorithms are forced onto discrete problems.
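The two layers rely on a few standard building blocks that can be written without any deep-learning library. The sketch below shows epsilon-greedy selection over port Q-values for the upper layer and, for the lower layer, the TD3 target computed with clipped double-Q learning and target policy smoothing. Function names, hyperparameter values, and the use of epsilon-greedy exploration are illustrative assumptions; the neural networks themselves and the delayed actor update are omitted.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Upper layer: pick a random port with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def smoothed_target_action(action, noise_std, noise_clip, low, high, rng=random):
    """Target policy smoothing: add clipped Gaussian noise, then clamp to bounds."""
    smoothed = []
    for a in action:
        noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
        smoothed.append(max(low, min(high, a + noise)))
    return smoothed

def td3_target(reward, q1_next, q2_next, gamma, done):
    """Clipped double-Q target: bootstrap from the smaller of the two critics."""
    if done:
        return reward
    return reward + gamma * min(q1_next, q2_next)

# Hypothetical usage with made-up Q-values and critic outputs.
port = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
target = td3_target(reward=1.0, q1_next=2.0, q2_next=3.0, gamma=0.9, done=False)
```

Bootstrapping from the smaller critic value counteracts the overestimation bias of a single critic, which is the main reason TD3 tends to train more stably than its predecessor DDPG.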
The optimization problem is reformulated as a Markov decision process (MDP), which facilitates its solution within a reinforcement learning framework. Each interaction with the environment is described by the transition 4-tuple $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ denotes the current state, $a_t$ the selected action, $r_t$ the immediate reward, and $s_{t+1}$ the resulting state. At each time step $t$, the agent observes $s_t \in \mathcal{S}$ and selects $a_t$ according to its policy to interact with the environment.
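Transitions of this form are typically stored in a replay buffer from which both learners draw mini-batches. The following is a minimal sketch of such a shared buffer; the class and field names are hypothetical, and splitting the action into a discrete port part and a continuous part is an assumption made here for illustration.

```python
import random
from collections import deque, namedtuple

# One stored transition; the action is kept as two fields (assumed layout).
Transition = namedtuple(
    "Transition", ["state", "port_action", "cont_action", "reward", "next_state"])

class ReplayBuffer:
    """Fixed-capacity buffer that discards the oldest transition when full."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, port_action, cont_action, reward, next_state):
        self.buffer.append(
            Transition(state, port_action, cont_action, reward, next_state))

    def sample(self, batch_size):
        """Draw a mini-batch uniformly at random without replacement."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Hypothetical usage with placeholder states and actions.
buf = ReplayBuffer(capacity=2, seed=0)
for step in range(3):
    buf.push([float(step)], [step], [0.5], 1.0, [float(step + 1)])
batch = buf.sample(2)
```

With a capacity of two, the third push evicts the first transition, so only the two most recent transitions remain available for sampling.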
State: Owing to the large number of available fluid antenna ports at both the BS and the users, directly incorporating the channel gains of all ports into the state representation would result in a prohibitively high-dimensional state space, thereby hindering the training efficiency and convergence of the DRL model. To mitigate this issue, we exploit the inherent spatial correlation among adjacent ports and construct a reduced state representation by uniformly sampling the channel gains from a representative subset of ports. This approach significantly lowers the dimensionality of the state space while preserving the essential channel characteristics required for accurate decision-making. The state space at time $t$ includes the distances between the BS and users, the SINR of the common and private streams, and the active antenna port positions, and is represented as $s_t = \{[d_1, \ldots, d_K], [\gamma_{c,1}, \ldots, \gamma_{c,K}], [\gamma_{p,1}, \ldots, \gamma_{p,K}], [x_n^{\mu_n}, y_n^{\mu_n}], [x_k^{\eta_k}, y_k^{\eta_k}]\}$, where $\mu_n \in \varsigma_n = \{1, 2, \ldots, \varpi_t\}$ denotes the selected port index of the $n$-th BS fluid antenna, and $\eta_k \in \varsigma_k = \{1, 2, \ldots, \varpi_r\}$ denotes the selected port index of user $k$'s fluid antenna. The corresponding 2D positions of the active ports are given by $\mathbf{t}_n^{\mu_n} = [x_n^{\mu_n}, y_n^{\mu_n}]^T$ for the BS antennas and $\mathbf{r}_k^{\eta_k} = [x_k^{\eta_k}, y_k^{\eta_k}]^T$ for the users.
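One simple way to realize the port subsampling described above is to keep evenly spaced port indices. The sketch below is only one possible reading of "uniformly sampling": the function names are hypothetical, ports are indexed from 1 as in the port sets, and the first and last ports are always kept.

```python
import math

def uniform_port_subset(num_ports, num_samples):
    """Return evenly spaced 1-based port indices, including the first and last port."""
    if num_samples >= num_ports:
        return list(range(1, num_ports + 1))
    if num_samples == 1:
        return [1]
    step = (num_ports - 1) / (num_samples - 1)
    return [1 + math.floor(i * step + 0.5) for i in range(num_samples)]

def reduced_gains(gains, num_samples):
    """Keep only the channel gains of the sampled ports (gains[0] is port 1)."""
    return [gains[p - 1] for p in uniform_port_subset(len(gains), num_samples)]

# Hypothetical example: 9 ports reduced to 3 representative gains.
subset = uniform_port_subset(9, 3)
sampled = reduced_gains([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 3)
```

Because neighboring ports are spatially correlated, the dropped ports carry largely redundant information, which is the justification given in the text for the reduced state.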
Action: The action space is designed to jointly capture the port selection of the fluid antennas at both the BS and users, as well as the transmit power allocation strategy. The first component corresponds to the port indices of the BS fluid antennas, denoted by $\boldsymbol{\mu}_{BS} = [\mu_1, \ldots, \mu_N]$, where $\mu_n \in \varsigma_n = \{1, 2, \ldots, \varpi_t\}$ indicates the active port of the $n$-th BS antenna. The second component, $\boldsymbol{\eta}_u = [\eta_1, \ldots, \eta_K]$, represents the port indices of the user-side fluid antennas, where $\eta_k \in \varsigma_k = \{1, 2, \ldots, \varpi_r\}$ selects the active port of user $k$. The third component, $[p_1^c, \ldots, p_N^c]$, specifies the power allocated to the common stream from each BS antenna port $\mu_n$, while the fourth component, $[p_{11}, \ldots, p_{NK}]$, defines the power allocation from each BS antenna port $\mu_n$ to each user port $\eta_k$ for the private streams. Therefore, the overall action space is given by
$$a_t = \left\{ [\mu_1, \ldots, \mu_N],\ [\eta_1, \ldots, \eta_K],\ [p_1^c, \ldots, p_N^c],\ [p_{11}, \ldots, p_{NK}] \right\},$$
which allows the agent to autonomously adapt both antenna port selections and power allocation, thereby enhancing spectral efficiency and overall system performance.
Reward: The reward function is designed to be directly aligned with the system objective, namely, the maximization of the sum-rate. In reinforcement learning, the reward serves as the feedback signal that guides the agent’s policy updates; hence, it must be strongly correlated with the performance metric of interest. To this end, the agent receives a positive reward equal to the achievable system sum-rate whenever all optimization constraints (19a)–(19g) are satisfied. Conversely, if any constraint is violated, the agent is penalized with a zero reward. Formally, the reward at time step $t$ is expressed as follows:
$$r_t = \begin{cases} R, & \text{if (19a)–(19g) are satisfied}, \\ 0, & \text{otherwise}. \end{cases}$$
This formulation ensures that the agent is incentivized to learn policies that not only maximize the sum-rate but also strictly adhere to the feasibility conditions of the optimization problem. The sensing requirement is enforced through the constraint-aware reward design. Specifically, the agent receives the achieved communication sum-rate when all system constraints, including the sensing power constraint, are satisfied; otherwise, the reward is set to zero. This mechanism ensures that the learning agent favors actions that simultaneously satisfy both communication and sensing requirements while maximizing communication performance.
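As a minimal sketch of this constraint-aware reward, the function below returns the sum-rate only when every feasibility check passes. The name `constraint_aware_reward` and the list of boolean flags are illustrative assumptions; in practice each flag would come from evaluating one constraint of the optimization problem.

```python
def constraint_aware_reward(sum_rate, constraints_satisfied):
    """Return the sum-rate R if every constraint holds, otherwise zero."""
    return float(sum_rate) if all(constraints_satisfied) else 0.0


# Hypothetical feasibility flags, e.g. power budget, sensing power, per-user QoS.
feasible = constraint_aware_reward(6.5, [True, True, True])
infeasible = constraint_aware_reward(6.5, [True, False, True])
print(feasible, infeasible)
```

Because an infeasible action earns the same zero reward regardless of how close it is to feasibility, the agent receives no gradient toward the feasible region from violated steps; this is a property of the zero-reward design rather than of the sketch.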
The core idea of the proposed HDRL approach is to decompose the system action space into discrete and continuous components. The discrete action $a_d$, corresponding to FAS port selection, is optimized using a DQN. Specifically, the DQN approximates the discrete action-value function $Q(s, a_d)$ with a neural network parameterized by $\theta_d$, enabling the agent to evaluate and select the most promising port configuration. At each time step, the DQN updates its parameters by minimizing a TD loss that measures the discrepancy between predicted and target Q-values. The loss function for training the value network is defined as
$$L_d(\theta_d) = \frac{1}{N} \sum_{i=1}^{N} \left( Q_d(s_i, a_{d,i}; \theta_d) - y_i \right)^2,$$
where $y_i$ denotes the TD target constructed from sampled transitions in the replay buffer. This loss represents the mean squared error (MSE) between the predicted Q-values from the online network and the target values $y_i$. By penalizing large deviations between predicted and target Q-values, it encourages the network to iteratively approximate the optimal action-value function.
The target value $y_i$ is evaluated using the target Q-network with parameters $\theta_d'$, which is a delayed copy of the online Q-network:
$$y_i = r_i + \gamma \max_{a_d} Q_d(s_{i+1}, a_d; \theta_d'),$$
where $\theta_d'$ is updated as $\theta_d' \leftarrow \theta_d$ every $t_d$ steps, with $t_d$ being the delay interval. This delay mechanism reduces the risk of divergence by decoupling target computation from rapid online weight updates.
The TD error, defined as the discrepancy between the target and predicted Q-values, is expressed as:
$$\delta_i = y_i - Q_d(s_i, a_{d,i}; \theta_d).$$
The gradient of the loss function with respect to the network parameters $\theta_d$ is obtained as:
$$\nabla_{\theta_d} L_d(\theta_d) = -\frac{2}{N} \sum_{i=1}^{N} \delta_i \, \nabla_{\theta_d} Q_d(s_i, a_{d,i}; \theta_d).$$
The weight update rule for the Q-network follows the standard stochastic gradient descent (SGD) form:
$$\theta_d \leftarrow \theta_d - \eta_d \nabla_{\theta_d} L_d(\theta_d),$$
where $\eta_d$ denotes the learning rate of the Q-network.
Consequently, the policy of the outer-loop DQN procedure is evaluated as:
$$\pi_d(s_t) = \arg\max_{a_d} Q_d(s_t, a_d; \theta_d),$$
which ensures that the selected discrete action maximizes the estimated long-term return at the current state. This discrete action serves as the input for the inner-loop TD3 module, enabling coordinated optimization over both discrete and continuous control spaces.
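The DQN quantities above (TD target, TD error, MSE loss, and greedy port choice) can be traced numerically with a small sketch. It works on plain lists of Q-values instead of a neural network, and all function names and numbers are illustrative assumptions rather than the authors' implementation.

```python
def dqn_targets(rewards, next_q_values, gamma):
    """TD targets y_i = r_i + gamma * max_a Q_target(s_{i+1}, a)."""
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_values)]


def dqn_loss(q_pred, targets):
    """Return the mean squared TD error and the per-sample TD errors delta_i."""
    td_errors = [y - q for q, y in zip(q_pred, targets)]
    loss = sum(d * d for d in td_errors) / len(td_errors)
    return loss, td_errors


def greedy_port(q_values):
    """Index of the port with the highest estimated Q-value (arg max)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Toy mini-batch: two transitions, two candidate ports (made-up numbers).
rewards = [1.0, 0.0]
next_q_values = [[0.5, 2.0], [1.0, 0.0]]  # target-network outputs at s_{i+1}
q_pred = [2.0, 1.0]                       # online outputs Q_d(s_i, a_{d,i})
targets = dqn_targets(rewards, next_q_values, gamma=0.9)
loss, td_errors = dqn_loss(q_pred, targets)
print(targets, td_errors, loss)
print(greedy_port([0.1, 0.7, 0.3]))
```

In a network-based implementation the same loss would be differentiated with respect to the online parameters only, with the targets treated as constants.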
The TD3 algorithm is an enhanced variant of the Deep Deterministic Policy Gradient (DDPG) method, designed to improve stability and reduce overestimation bias in continuous action space learning.

3.1. Critic Network Update

During training, a mini-batch of $N_B$ transition tuples $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{N_B}$ is sampled uniformly from the experience replay buffer $\mathcal{D}$. For the $i$-th tuple, the target Q-value is computed as:
$$y_i = r_i + \gamma \, Q'\!\left(s_{i+1}, \pi'(s_{i+1}; \theta_\pi'); \theta_Q'\right),$$
where $r_i$ denotes the immediate reward obtained by executing action $a_i$ in state $s_i$, $\gamma \in [0, 1]$ is the discount factor, and $\pi'(s_{i+1}; \theta_\pi')$ denotes the deterministic target policy (actor) parameterized by $\theta_\pi'$. The function $Q'(\cdot, \cdot; \theta_Q')$ represents the target critic network parameterized by $\theta_Q'$. The parameters $\theta_\pi'$ and $\theta_Q'$ are delayed (soft-updated) copies of the corresponding online actor and critic network parameters, respectively.
The critic network parameters $\theta_Q$ are updated by minimizing the mean squared Bellman error (MSBE) between the predicted Q-value and the target value:
$$L_Q(\theta_Q) = \frac{1}{N_B} \sum_{i=1}^{N_B} \left( Q(s_i, a_i; \theta_Q) - y_i \right)^2.$$
The gradient of the loss with respect to $\theta_Q$ is:
$$\nabla_{\theta_Q} L_Q = \frac{2}{N_B} \sum_{i=1}^{N_B} \left( Q(s_i, a_i; \theta_Q) - y_i \right) \nabla_{\theta_Q} Q(s_i, a_i; \theta_Q),$$
and the parameters are updated via gradient descent:
$$\theta_Q \leftarrow \theta_Q - \eta_Q \nabla_{\theta_Q} L_Q,$$
where $\eta_Q$ is the learning rate of the critic network.

3.2. Actor Network Update

To improve the quality of the learned policy, the actor network is optimized such that its output action maximizes the expected Q-value as estimated by the critic network. The actor parameters $\theta_\pi$ are updated by performing gradient ascent on the expected return:
$$\nabla_{\theta_\pi} J(\theta_\pi) = \frac{1}{N_B} \sum_{i=1}^{N_B} \nabla_a Q(s_i, a; \theta_Q) \Big|_{a = \pi(s_i; \theta_\pi)} \nabla_{\theta_\pi} \pi(s_i; \theta_\pi),$$
where:
  • $\pi(s_i; \theta_\pi)$ is the deterministic policy output by the actor network,
  • $Q(s_i, a; \theta_Q)$ is the critic’s Q-value estimation,
  • $\theta_Q$ and $\theta_\pi$ are the critic and actor network parameters, respectively.
The actor parameters are then updated using gradient ascent:
$$\theta_\pi \leftarrow \theta_\pi + \eta_\pi \nabla_{\theta_\pi} J(\theta_\pi),$$
where $\eta_\pi$ is the actor learning rate.
To ensure stable learning, the target critic and target actor networks are updated via a soft update mechanism, preventing large oscillations in target values. The target critic parameters $\theta_Q'$ are updated as:
$$\theta_Q' \leftarrow \tau \theta_Q + (1 - \tau) \theta_Q',$$
and the target actor parameters $\theta_\pi'$ are updated analogously:
$$\theta_\pi' \leftarrow \tau \theta_\pi + (1 - \tau) \theta_\pi',$$
where $\tau \in (0, 1]$ is the Polyak averaging coefficient.
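The soft (Polyak) update can be sketched on flat parameter lists as follows. The helper name `soft_update` is an assumption for illustration; real networks would apply the same rule tensor by tensor.

```python
def soft_update(target_params, online_params, tau):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


# Made-up weights; tau = 0.005 matches the soft update rate reported later.
target = [0.0, 1.0, -2.0]
online = [1.0, 1.0, 2.0]
target = soft_update(target, online, tau=0.005)
print(target)
```

Setting `tau=1.0` reduces the rule to a hard copy of the online parameters, which corresponds to the upper end of the admissible range for the averaging coefficient.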

3.3. Exploration Strategy

To ensure adequate exploration in the continuous action space, the deterministic policy output is perturbed with zero-mean Gaussian noise:
$$a_t = \pi(s_t; \theta_\pi) + \mathcal{N}(0, \sigma_t^2),$$
where $\sigma_t = \upsilon e^{-\kappa t}$ controls the exploration variance, $\upsilon$ is the initial noise scale, and $\kappa$ is a decay constant controlling the rate at which exploration noise decreases over time.
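A minimal sketch of this decaying exploration noise is given below. The function names, the example values of the noise scale and decay constant, and the clipping of the noisy action to $[-1, 1]$ are assumptions for illustration; the clipping range mirrors a Tanh-bounded actor output.

```python
import math
import random


def noise_std(t, upsilon, kappa):
    """Exploration standard deviation sigma_t = upsilon * exp(-kappa * t)."""
    return upsilon * math.exp(-kappa * t)


def explore(action, t, upsilon, kappa, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise with std sigma_t, then clip to the bounds."""
    sigma = noise_std(t, upsilon, kappa)
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for t in (0, 100, 500):
    noisy = explore([0.2, -0.4], t, upsilon=0.2, kappa=0.01, rng=rng)
    print(t, round(noise_std(t, 0.2, 0.01), 4), noisy)
```

With these example values the standard deviation shrinks monotonically with the step index, so late-stage actions stay close to the deterministic policy output.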
Overestimation bias can lead to the accumulation of approximation errors over time. Specifically, inaccuracies in the Q-value function may result in the agent assigning artificially high values to suboptimal state–action pairs, thereby producing a suboptimal policy. The primary objective of TD3 is to mitigate such function approximation errors, which can lead to overestimation of Q-value functions and degrade policy performance in the DDPG algorithm [30]. To address this issue, TD3 employs a dual-critic architecture consisting of three main neural networks: one actor network and two critic networks. The actor network comprises:
  • Policy network parameterized by $\theta_\pi$, outputs the continuous control action $a_c = \pi(s; \theta_\pi)$ for a given state $s$.
  • Target policy network parameterized by $\theta_\pi'$, is a delayed copy of the policy network used for stable target value computation.
Furthermore, the critic network consists of:
  • Main Q-networks $Q_1(s, a; \theta_{Q_1})$ and $Q_2(s, a; \theta_{Q_2})$, estimate the action-value function for a given state–action pair.
  • Target Q-networks $Q_1'(s, a; \theta_{Q_1}')$ and $Q_2'(s, a; \theta_{Q_2}')$ are delayed copies of the main Q-networks and used for computing target values in the Bellman updates.
This network structure decouples target Q-value computation from action selection, thereby reducing the correlation between estimated and target values. To further suppress overestimation bias, TD3 introduces several key modifications:
  • Clipped Double Q-Learning: Uses the minimum value of the two target Q-networks, i.e.,
    $$y_i = r_i + \gamma \min\left( Q_1'(s_{i+1}, \tilde{a}_{i+1}),\ Q_2'(s_{i+1}, \tilde{a}_{i+1}) \right),$$
    where $\tilde{a}_{i+1}$ is the target action from the smoothed target policy.
  • Target Policy Smoothing: Adds small clipped noise to the target action before evaluating it with the target critics to prevent exploitation of Q-function errors.
  • Delayed Policy Updates: Updates the actor and target networks less frequently than the critics, stabilizing policy learning.
TD3 maintains two independent critic networks, $Q_1(s, a; \theta_{Q_1})$ and $Q_2(s, a; \theta_{Q_2})$, to reduce overestimation. The target Q-value is computed using the smaller of the two critic estimates:
$$y_i = r_i + \gamma \min_{j=1,2} Q_j'\!\left(s_{i+1}, \pi'(s_{i+1}; \theta_\pi'); \theta_{Q_j}'\right),$$
where $(\theta_{Q_1}', \theta_{Q_2}')$ are target critic parameters.
To prevent the actor from being trained on inaccurate value estimates, TD3 updates the policy network and target networks less frequently than the critics. Specifically, for every $d$ critic updates, one policy update is performed.
To reduce exploitation of Q-function peaks caused by function approximation errors, TD3 applies clipped Gaussian noise to the target policy actions:
$$\tilde{a} = \pi'(s_{i+1}; \theta_\pi') + \mathrm{clip}(\epsilon, -c, c),$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise and $c > 0$ is the noise clipping threshold. This regularization smooths the Q-value landscape by preventing abrupt policy changes in narrow action regions.
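The three TD3 mechanisms described above can be sketched with scalar stand-ins for the network outputs. All names are hypothetical, and the final clipping of the smoothed action to the action bounds is a common implementation choice that is assumed here rather than stated in the text.

```python
import random


def smoothed_target_action(target_action, sigma, c, low, high, rng=random):
    """Target policy smoothing: add clip(eps, -c, c), eps ~ N(0, sigma^2)."""
    smoothed = []
    for a in target_action:
        eps = max(-c, min(c, rng.gauss(0.0, sigma)))
        smoothed.append(max(low, min(high, a + eps)))  # assumed bound clipping
    return smoothed


def td3_target(reward, gamma, q1_next, q2_next):
    """Clipped double-Q target: y = r + gamma * min(Q1', Q2')."""
    return reward + gamma * min(q1_next, q2_next)


def is_actor_update_step(critic_step, d):
    """Delayed policy updates: one actor update for every d critic updates."""
    return critic_step % d == 0


rng = random.Random(1)
a_tilde = smoothed_target_action([0.0, 0.5], sigma=0.2, c=0.1,
                                 low=-1.0, high=1.0, rng=rng)
y = td3_target(reward=1.0, gamma=0.9, q1_next=2.0, q2_next=3.0)
schedule = [is_actor_update_step(k, d=2) for k in range(1, 5)]
print(a_tilde, y, schedule)
```

Taking the minimum of the two critic estimates biases the target downward, which is the intended counterweight to the overestimation discussed above.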
The proposed HDRL algorithm is summarized in Algorithm 1. The training procedure begins with the initialization phase, where the maximum number of episodes and time steps per episode are specified, a unified replay buffer for both the DQN and TD3 modules is established, and the parameters of the DQN, actor, and critic networks, together with their corresponding target networks, are randomly initialized to enable stable learning. Training proceeds over multiple episodes, each of which resets the environment with new channel conditions, antenna positions, and RSMA variables. Within each episode, the algorithm iterates through time steps, where in every step, the outer DQN module first selects a discrete action corresponding to the BS antenna port based on the observed system state by greedily choosing the action with the highest Q-value. Conditioned on this discrete choice, the inner TD3 module generates continuous control actions, including user FAS displacements, RSMA power allocation, and beamforming vectors, with additional Gaussian noise injected for exploration. These actions are executed in the environment, which produces the next state, achievable communication rates, and sensing metrics. A reward is then computed based on the achieved communication sum-rate, and it is set to zero if any feasibility constraints, such as transmit power or antenna separation, are violated. The resulting transition is stored in the replay buffer, enabling off-policy learning. Subsequently, the DQN network is updated by minimizing the temporal-difference error using sampled mini-batches, while the TD3 actor and critic networks are trained with mini-batch updates incorporating clipped double Q-learning, target policy smoothing, and delayed policy updates to enhance stability. Target networks for both modules are softly updated to track the learned networks. This process continues until the episode concludes, after which the algorithm advances to the next episode. 
Upon completion of all episodes, the algorithm yields a hierarchical policy consisting of the trained DQN for BS antenna port selection and the TD3 for continuous optimization of user antenna positions and RSMA power allocation, thereby solving the hybrid ISAC optimization problem.

3.4. Complexity Analysis

The computational complexity of the proposed HDRL algorithm can be analyzed by considering both the DQN and TD3 components. For the DQN, which handles discrete BS antenna port selection, the state dimension includes channel and system information, and the action space consists of selecting one of $M$ antenna ports. Each forward pass through the DQN network has a complexity of $\mathcal{O}(|\theta_d|)$, where $|\theta_d|$ denotes the total number of network parameters. Performing a gradient update using a mini-batch of size $B$ requires $\mathcal{O}(B \cdot |\theta_d|)$. Across $N$ time steps in an episode, the total DQN computational cost is therefore $\mathcal{O}(N B |\theta_d|)$. Furthermore, the TD3 component, responsible for continuous control of the users’ FAS positions and RSMA variables, processes a state vector including the positions of all $K$ users and relevant RSMA parameters. The action dimension reflects user FAS displacements and power allocation variables. Each forward pass through the actor and twin critics has a complexity of $\mathcal{O}(|\theta_\pi| + |\theta_{Q_1}| + |\theta_{Q_2}|)$, and gradient updates on a mini-batch of size $B$ require $\mathcal{O}(B \cdot (|\theta_\pi| + |\theta_{Q_1}| + |\theta_{Q_2}|))$. Over $N$ time steps, this results in a TD3 computational cost of $\mathcal{O}(N B (|\theta_\pi| + |\theta_{Q_1}| + |\theta_{Q_2}|))$ per episode. Combining both components, the overall computational complexity of HDRL per episode is $\mathcal{O}\big(N B (|\theta_d| + |\theta_\pi| + |\theta_{Q_1}| + |\theta_{Q_2}|)\big)$, and for $T_{\max}$ episodes it scales as $\mathcal{O}\big(T_{\max} N B (|\theta_d| + |\theta_\pi| + |\theta_{Q_1}| + |\theta_{Q_2}|)\big)$. The complexity of the DQN network scales linearly with the number of candidate antenna ports $M$ due to the output layer, while the TD3 network scales roughly linearly with the number of users $K$ because the action dimensions increase with user FAS displacements and power allocations. The mini-batch size $B$ dominates the gradient computation cost in both networks.
This analysis demonstrates how system parameters such as K, M, and network sizes influence the overall computational effort required for training the HDRL algorithm.
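To make the scaling concrete, the sketch below counts the parameters of fully connected networks and evaluates the per-episode cost expression up to a constant factor. The input and output dimensions (state size 10, 16 candidate ports, 8 continuous actions) are hypothetical placeholders; only the hidden-layer widths, the 12 steps per episode, and the mini-batch size of 64 follow values reported in the simulation setup.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with these layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def training_cost_per_episode(steps, batch, dqn_params, actor_params, critic_params):
    """N * B * (|theta_d| + |theta_pi| + 2 * |theta_Q|), constants dropped."""
    return steps * batch * (dqn_params + actor_params + 2 * critic_params)


state_dim, num_ports, action_dim = 10, 16, 8  # hypothetical dimensions
dqn = mlp_param_count([state_dim, 128, 128, num_ports])
actor = mlp_param_count([state_dim, 256, 256, action_dim])
# Assumption: each critic takes the state and action and uses 256-wide layers.
critic = mlp_param_count([state_dim + action_dim, 256, 256, 1])
cost = training_cost_per_episode(12, 64, dqn, actor, critic)
print(dqn, actor, critic, cost)
```

Changing `num_ports` only affects the last DQN layer, which is why the DQN parameter count grows linearly in the number of candidate ports.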
Algorithm 1 Hierarchical Deep Reinforcement Learning (HDRL) Approach
  1: Initialize: DQN network weights $\theta_d$; TD3 actor network $\pi$ weights $\theta_\pi$; TD3 twin critic networks $Q_1, Q_2$ weights $\theta_{Q_1}, \theta_{Q_2}$; target networks $\theta_d', \theta_\pi', \theta_{Q_1}', \theta_{Q_2}'$; replay buffer $\mathcal{D}$; exploration parameters: $\epsilon$ for DQN, noise process for TD3
  2: for each episode do
  3:    Initialize environment and get initial state $s_0$
  4:    for each time step do
  5:       Compute Q-values: $Q_{\text{values}} = \mathrm{DQN}(s_t; \theta_d)$
  6:       Select discrete action $a_d = \arg\max_a Q_{\text{values}}[a]$ (port selection)
  7:       Actor continuous action $a_c = \pi(s_t, a_d; \theta_\pi)$
  8:       Add exploration noise: $a_c^{\text{noisy}} = a_c + \text{noise}_t$
  9:       Full action $a_t = (a_d, a_c^{\text{noisy}})$
 10:       Apply antenna port switching per $a_d$
 11:       Apply power, beamforming, RSMA parameters per $a_c^{\text{noisy}}$
 12:       Observe next state $s_{t+1}$ and reward $r_t$
 13:       Store $(s_t, a_t, r_t, s_{t+1})$ in replay buffer $\mathcal{D}$
 14:       Sample minibatch $\{(s_i, a_i, r_i, s_{i+1})\}$ from $\mathcal{D}$
 15:       for each sample do
 16:          Compute target: $y_i = r_i + \gamma \max_{a_d} Q_d(s_{i+1}, a_d; \theta_d')$
 17:       end for
 18:       Update DQN weights $\theta_d$ by minimizing loss: $L_d = \frac{1}{N} \sum_i \left( Q_d(s_i, a_{d,i}; \theta_d) - y_i \right)^2$
 19:       for each sample do
 20:          Compute target actions with target actor + clipped noise: $a_{c,i}' = \pi'(s_{i+1}, a_{d,i+1}; \theta_\pi') + \text{clipped\_noise}$
 21:          Compute target Q-values: $y_i = r_i + \gamma \min\left( Q_1'(s_{i+1}, a_{d,i+1}, a_{c,i}'),\ Q_2'(s_{i+1}, a_{d,i+1}, a_{c,i}') \right)$
 22:       end for
 23:       Update critics $\theta_{Q_1}, \theta_{Q_2}$ by minimizing MSE loss on $y_i$
 24:       Update actor $\theta_\pi$ by policy gradient to maximize $Q_1(s_i, a_{d,i}, \pi(s_i, a_{d,i}))$
 25:       Update target networks softly:
             $\theta_d' \leftarrow \tau \theta_d + (1 - \tau) \theta_d'$
             $\theta_\pi' \leftarrow \tau \theta_\pi + (1 - \tau) \theta_\pi'$
             $\theta_{Q_1}' \leftarrow \tau \theta_{Q_1} + (1 - \tau) \theta_{Q_1}'$
             $\theta_{Q_2}' \leftarrow \tau \theta_{Q_2} + (1 - \tau) \theta_{Q_2}'$
 26:       Update state $s_t \leftarrow s_{t+1}$
 27:       Decay $\epsilon$ over time
 28:    end for
 29: end for
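Two supporting pieces of the algorithm, the shared replay buffer and the discrete port choice with a decaying exploration rate, can be sketched as follows. The class and function names are hypothetical, and the ε-greedy rule is an assumption about how the exploration parameter for the DQN enters the otherwise greedy selection step.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer shared by the DQN and TD3 updates."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions drop out

    def store(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        """Uniformly sample up to batch_size stored transitions."""
        k = min(batch_size, len(self.storage))
        return rng.sample(list(self.storage), k)

    def __len__(self):
        return len(self.storage)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random port with probability epsilon, otherwise the arg-max port."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
buffer = ReplayBuffer(capacity=500)
epsilon = 1.0
for t in range(12):                       # one episode of 12 steps
    a_d = epsilon_greedy([0.1, 0.9, 0.3], epsilon, rng)
    buffer.store((t, a_d, 0.0, t + 1))    # placeholder (s, a, r, s') tuple
    batch = buffer.sample(64, rng)        # smaller than 64 while filling up
    epsilon = max(0.05, epsilon * 0.9)    # hypothetical decay schedule
print(len(buffer), len(batch), round(epsilon, 3))
```

Sampling fewer than the nominal batch size while the buffer is still filling is one possible convention; waiting until enough transitions are stored is an equally common alternative.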
Finally, it is important to note that the inference stage is significantly less computationally demanding than training. During deployment, the learned policies are executed through forward passes of the trained networks without performing gradient updates or replay buffer operations. Specifically, the DQN module selects the antenna port configuration, and the TD3 actor network generates the continuous control variables. Therefore, the inference-time complexity can be approximated as $\mathcal{O}(|\theta_d| + |\theta_\pi|)$, which is sufficiently lightweight for real-time decision-making in practical wireless communication systems.

4. Performance Evaluation

In the simulations, the elevation and azimuth angles are modeled as independent and identically distributed random variables uniformly drawn from $[0, \pi]$. The distance between adjacent fluid antenna ports is fixed to $\lambda/2$, while the fluid antenna movement is constrained within a square region of size $\varpi \times \varpi$, where $\varpi = 4\lambda$. The path response matrix is assumed to be diagonal, i.e., $\Sigma = \mathrm{diag}(\Sigma_{1,1}, \Sigma_{2,2}, \ldots, \Sigma_{p_r,p_r})$. The diagonal elements follow complex Gaussian distributions such that $\Sigma_{1,1} \sim \mathcal{CN}\left(0, \frac{\tau}{\tau+1}\right)$ and $\Sigma_{l,l} \sim \mathcal{CN}\left(0, \frac{1}{(\tau+1)(p_t-1)}\right)$ for $l = 2, 3, \ldots, p_t$, where $\tau \geq 0$ denotes the Rician factor representing the power ratio between the line-of-sight (LoS) and non-line-of-sight (NLoS) components. In this work, $\tau$ is set to 1. The numbers of transmit and receive paths are both set to 3. The maximum transmit signal-to-noise ratio (SNR) at the base station (BS) is defined as $P_{\max}/\sigma_w^2 = 5$ dB, while the sensing constraint is characterized by $\Gamma/\sigma_w^2 = 9$ dB. Moreover, the number of BS antennas is set to $N = 4$. The proposed HDRL framework integrates a DQN and a TD3 algorithm in a hierarchical manner. The DQN consists of a three-layer fully connected neural network with two hidden layers of 128 neurons each and ReLU activation functions. The TD3 actor network employs two hidden layers with 256 neurons each, followed by a Tanh activation to constrain the continuous action space. The TD3 critic adopts twin Q-networks with identical architectures to mitigate overestimation bias. The system is trained for 1000 episodes, each consisting of 12 interaction steps. A replay buffer of size 500 is utilized, and mini-batches of size 64 are sampled for network updates. The learning rates for the actor and critic networks are set to 0.02 and 0.001, respectively, while the learning rate for the DQN is set to 0.001. The soft update rate is fixed at 0.005, and the discount factor is set to 0.9.
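As a quick sanity check on these settings, the sketch below computes the per-path variances of the path response matrix for a given Rician factor and converts the dB-valued ratios to linear scale. The function names are illustrative assumptions; the variance expressions are the ones stated above.

```python
def path_variances(rician_factor, num_paths):
    """Variances of the diagonal path responses: one LoS path, the rest NLoS."""
    los = rician_factor / (rician_factor + 1.0)
    nlos = 1.0 / ((rician_factor + 1.0) * (num_paths - 1))
    return [los] + [nlos] * (num_paths - 1)


def db_to_linear(value_db):
    """Convert a power ratio from decibels to linear scale."""
    return 10.0 ** (value_db / 10.0)


variances = path_variances(rician_factor=1.0, num_paths=3)
print(variances, sum(variances))   # total average path power
print(db_to_linear(5.0))           # maximum transmit SNR, linear scale
print(db_to_linear(9.0))           # sensing threshold ratio, linear scale
```

For any non-negative Rician factor the variances sum to one, so the normalization keeps the total average path power fixed while the factor only shifts power between the LoS and NLoS components.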
State normalization is performed using logarithmic scaling of channel gains to improve training stability, while continuous actions are clipped within predefined bounds to satisfy system constraints. The simulation parameters are presented in Table 1. To evaluate the effectiveness of the proposed HDRL approach, its performance is compared against the following three widely used benchmarks:
  • Proximal Policy Optimization (PPO): PPO is a policy gradient method that balances exploration and stability by constraining policy updates within a trust region. It has been widely adopted due to its strong empirical performance and relative simplicity, making it a standard benchmark for continuous and discrete action-space problems.
  • Advantage Actor-Critic (A2C): A2C is a synchronous actor-critic algorithm that utilizes the advantage function to reduce the variance of policy gradient estimates. It is valued for its training efficiency and its ability to stabilize learning compared to basic policy gradient methods, serving as a strong baseline for policy optimization tasks.
  • Asynchronous Advantage Actor-Critic (A3C): A3C runs multiple parallel agents asynchronously. This approach decorrelates experiences and accelerates learning, particularly in complex or high-dimensional environments.
Figure 2 illustrates the convergence behavior of the proposed approach compared to the benchmark schemes. The results clearly show that the proposed hierarchical DRL framework outperforms all baselines, achieving higher rewards throughout the training process and stabilizing around episode 185 with an average reward of approximately 665, whereas the benchmarks converge more slowly and reach significantly lower reward levels. The hierarchical design of the proposed framework effectively balances exploration of promising strategies with exploitation of well-learned policies for efficient execution. The integration of TD3 further enhances stability by addressing overestimation bias through its twin-critic network, delayed policy updates, and target smoothing, thereby reducing oscillations and promoting smoother reward progression. As training progresses, this suppresses the early over-optimistic actions, so the reward converges smoothly. In contrast, the benchmarks face inherent limitations that prevent them from achieving the performance of the proposed approach. PPO, although more stable due to its clipped surrogate objective, updates conservatively and therefore struggles to achieve peak performance in tasks requiring fine-grained exploration. A2C improves stability but remains constrained by biased value estimation and the absence of hierarchical abstraction. A3C encourages exploration via asynchronous updates but suffers from high gradient variance, which slows convergence; furthermore, its lack of experience replay and target networks results in unstable learning in non-convex environments with complex constraints, limiting its optimization capability. Thus, the proposed approach achieves more efficient learning, smoother convergence, and superior asymptotic performance compared to the benchmark approaches.
Figure 3 illustrates the training time required for convergence of different reinforcement learning algorithms, including the proposed HDRL framework, PPO, A2C, and A3C. It can be observed that the proposed HDRL requires the longest training time, approximately 1820 s, due to its hierarchical structure that integrates both discrete and continuous optimization via DQN and TD3 components. This increased complexity leads to higher computational overhead during training. In contrast, PPO converges faster with a training time of 1490 s, as it relies on a single policy network with clipped updates, reducing computational burden. The A2C and A3C algorithms exhibit significantly lower convergence times of 980 s and 760 s, respectively. This is mainly because these methods employ simpler architectures and fewer parameters, resulting in faster updates but comparatively lower learning capability. Despite the higher training time, the proposed HDRL framework achieves superior performance, as demonstrated in the sum-rate results. This highlights a trade-off between computational complexity and performance, where HDRL provides enhanced optimization capability at the cost of increased training time.
The ablation results in Figure 4 clearly show that the full HDRL framework achieves the highest average sum-rate among all considered variants. This is because the complete scheme jointly exploits the benefits of FAS, RSMA, and HDRL, where the discrete and continuous decision spaces are optimized in a coordinated manner. In particular, FAS provides additional spatial flexibility by enabling adaptive antenna/port selection, RSMA improves interference management through message splitting and partial interference decoding, and the hierarchical DQN-TD3 structure allows the system to handle hybrid optimization variables efficiently. Since the considered problem contains both discrete and continuous control dimensions, the full HDRL framework is able to search a richer solution space and identify better transmission configurations, which directly translates into improved sum-rate performance.
When FAS is removed, the performance decreases because the system loses its ability to adapt the antenna configuration according to the channel conditions. In the full scheme, FAS introduces additional spatial DoF that can be used to strengthen desired links and reduce inter-user interference by selecting more favorable antenna positions or ports. Without this spatial adaptability, the transceiver operates with a fixed antenna structure, which limits channel exploitation capability and reduces beamforming flexibility. Consequently, even if the remaining optimization and interference-management mechanisms are still active, the achievable sum-rate is lower than that of the full HDRL scheme.
A performance reduction is also observed in the “without RSMA” case. This degradation is expected because RSMA plays an important role in multiuser interference mitigation. By splitting each user message into common and private parts, RSMA provides an additional degree of freedom for balancing interference suppression and information delivery. This is particularly beneficial in overloaded or interference-limited multiuser downlink scenarios, where purely private-stream transmission is often less efficient. When RSMA is disabled, the system must rely only on conventional signaling, which reduces flexibility in handling coupled user interference and therefore leads to a lower sum-rate than the complete framework.
The “only DQN” variant further illustrates the importance of hierarchical optimization. DQN is well suited for handling discrete decision variables, such as selecting transmission modes, ports, or index-based actions. Therefore, it can still provide meaningful gains by identifying favorable structural configurations of the system. However, when only DQN is used, the framework loses the ability to finely optimize continuous control variables, such as power allocation, rate-splitting coefficients, or other analog parameters. Consequently, the obtained solution is only partially optimized and lacks continuous refinement, preventing the system from reaching its full performance potential. Nevertheless, this variant still performs better than the “only TD3” variant.
Among all ablated variants, the “only TD3” case exhibits the lowest performance. The main reason is that TD3 is fundamentally designed for continuous optimization, whereas the considered joint design problem is inherently hybrid, involving both discrete and continuous decisions. Even if the continuous variables are optimized effectively, the system still depends on appropriate discrete decisions to determine the underlying transmission structure. Without DQN, those discrete choices cannot be properly explored or optimized, and the continuous controller is forced to operate over a restricted or suboptimal structural configuration. Hence, continuous optimization alone cannot compensate for the lack of discrete adaptation, because the quality of the final solution strongly depends on first selecting a good discrete operating point. This variant therefore achieves the lowest sum-rate, which confirms that the discrete decision-making layer is indispensable in this problem.
Overall, the ablation study demonstrates that the performance gain of the proposed method does not come from a single module in isolation. Rather, it originates from the synergistic interaction of spatial adaptability provided by FAS, interference management enabled by RSMA, and hybrid discrete–continuous optimization realized through the HDRL framework. The superiority of full HDRL therefore validates the need for jointly optimizing all components in an integrated manner.
Figure 5 illustrates the impact of increasing the user rate threshold on the sum-rate. The results show that the sum-rate is high at low rate thresholds because the BS can flexibly allocate transmit power between the common and private streams, allowing even weak users to decode reliably through the common stream. The deployment of FAS at both the BS and user sides further enhances spatial diversity, as the fluid antenna position can be dynamically adjusted to select more favorable channels. However, as the rate threshold increases, each user’s QoS demand becomes more stringent, forcing the BS to dedicate more transmit power to guarantee per-user requirements and thereby reducing the DoF for flexible resource balancing. This leads to a saturation effect where sum-rate growth slows and eventually declines with higher thresholds. For instance, at a rate threshold of 0.8, the proposed method achieves a 31.6%, 42.9%, and 47.1% higher sum-rate than PPO, A2C, and A3C, respectively. This performance gain results from the proposed HDRL approach, which combines the discrete action-handling capability of DQN for FAS position selection and RSMA mode selection with the continuous control strength of TD3 for beamforming, power allocation, and common-rate optimization. Hence, the proposed approach efficiently adapts to the hybrid action space, effectively balancing exploration and exploitation, and converges toward better strategies. In contrast, the benchmarks struggle with dynamic adaptation in mixed action environments, leading to suboptimal allocation strategies and reduced performance.
The impact of an increasing number of users on the system sum-rate is evaluated in Figure 6. The results show a consistent increase in sum-rate as the number of users grows. This improvement is mainly attributed to the capability of RSMA to effectively manage multiuser interference through partial interference decoding and flexible message splitting. In addition, the deployment of FAS at both the BS and user sides enables dynamic adjustment of antenna positions to better exploit favorable channel conditions, thereby reducing interference, enhancing spatial diversity, and improving the overall system performance. It can be observed that the proposed approach consistently outperforms PPO, A2C, and A3C across all user settings. For instance, with 20 users, the proposed approach achieves performance gains of approximately 9.0% over PPO, 14.9% over A2C, and 14.9% over A3C. This clearly highlights the effectiveness of the proposed framework in handling interference while leveraging spatial adaptability. Furthermore, as the number of users increases, the advantages of RSMA become more pronounced in the downlink, since its flexible message-splitting strategy efficiently accommodates multiple user transmissions without significant degradation in performance. Overall, the proposed framework effectively mitigates the adverse effects of user densification, ensuring scalable and high-throughput downlink operation.
The impact of the BS transmit power on the system sum-rate is evaluated in Figure 7. The results reveal a significant increase in sum-rate with higher BS transmit power, primarily due to the enhanced channel adaptability provided by the FAS deployed at both the BS and user terminals. FAS enables dynamic reconfiguration of antenna positions, offering increased spatial diversity and improved beamforming precision, which significantly boosts the SINR across all users. Coupled with RSMA’s rate-splitting mechanism, which allows users to partially decode the common data stream and efficiently manage interference, the system can exploit FAS-enabled channel gains to maximize downlink transmission. For instance, at 8 dB transmit power, the proposed approach achieves a 4.2%, 6.4%, and 7.8% higher sum-rate than PPO, A2C, and A3C, respectively. This gain is achieved through efficient exploration of the hybrid action space: the proposed approach dynamically balances interference management and spatial diversity, enabling the agent to achieve higher SINR and more efficient resource utilization. Consequently, the sum-rate of the proposed approach grows more rapidly with increasing transmit power than that of the benchmarks, whose less adaptive policy updates yield suboptimal resource allocation.
In Figure 8, the sensing SINR decreases as the communication rate threshold increases. At low rate thresholds, communication QoS requirements can be satisfied with a small fraction of the transmit power, leaving more power for sensing, which enhances the sensing SINR. In the proposed approach, the common message stream acts as an interference suppression layer, allowing the system to exploit additional sensing DoF without compromising communication reliability. Private streams require less aggressive power boosting in this regime, further conserving resources for sensing. Moreover, the FAS deployed at the BS and users dynamically reconfigures antenna positions to select favorable channel states, minimizing inter-user interference and maximizing spatial diversity, which further improves sensing performance. The proposed approach achieves 31.2%, 74.1%, and 88.0% higher sensing SINR than PPO, A2C, and A3C, respectively, effectively balancing the trade-off between communication and sensing. When the rate threshold increases, more power must be allocated to the RSMA common and private streams to meet communication demands, reducing the power available for sensing. FAS can partly compensate by adjusting antenna positions, but its effectiveness is limited by the reduced sensing power budget. Consequently, low rate thresholds yield higher sensing SINR and larger sensing DoF, whereas high rate thresholds prioritize communication at the expense of sensing, with FAS still providing some resilience by improving spatial channel conditions.
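The power-budget argument behind this trend can be sketched with a deliberately simplified single-link model. The function `sensing_sinr_after_rate_constraint`, the scalar gains, and the assumption that sensing only uses the power left over after the rate constraint is met are hypothetical simplifications introduced here; the system model of this paper uses joint multi-beam design, so the numbers below are purely illustrative.

```python
def sensing_sinr_after_rate_constraint(rate_threshold, total_power, comm_gain,
                                        comm_noise, sensing_gain, sensing_noise):
    """Sensing SINR left once a single link meets its rate threshold (bit/s/Hz).

    Toy model: communication and sensing draw from one power budget and do not
    reuse each other's signal. Returns None if the threshold is infeasible.
    """
    # Smallest power p with log2(1 + p * comm_gain / comm_noise) >= rate_threshold.
    comm_power = (2.0 ** rate_threshold - 1.0) * comm_noise / comm_gain
    if comm_power > total_power:
        return None
    sensing_power = total_power - comm_power
    return sensing_power * sensing_gain / sensing_noise

# Assumed values: 10 W budget, unit gains and unit noise powers.
for threshold in (1, 2, 3, 4):
    print(threshold, sensing_sinr_after_rate_constraint(threshold, 10.0, 1.0, 1.0, 1.0, 1.0))
```

Under these assumptions the required communication power grows exponentially with the rate threshold, so the residual sensing SINR falls faster at higher thresholds and the constraint eventually becomes infeasible.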
In Figure 9, the impact of varying bandwidth on the system sum-rate is analyzed. The results demonstrate that the sum-rate increases with bandwidth, consistent with the theoretical relationship between throughput and spectrum size. At low bandwidths, the sum-rate rises rapidly, as additional spectrum allows more bits to be transmitted per unit time while maintaining relatively high SINR per subcarrier. As the bandwidth increases further, the growth rate slows due to the fixed total transmit power being spread across more subcarriers, reducing the per-subcarrier SINR and increasing inter-user interference. The proposed RSMA-FAS framework effectively addresses these limitations. FAS at both the BS and user sides dynamically reconfigures antenna positions to exploit favorable spatial channels, reducing interference and enhancing channel diversity. Moreover, the RSMA common stream acts as an interference mitigation layer, allowing the system to leverage additional spatial DoF while meeting user-specific communication requirements. In addition, the proposed HDRL approach enables autonomous and adaptive optimization of both antenna configurations and power allocation, allowing the agent to dynamically balance spatial diversity, interference management, and per-user rate requirements. At a bandwidth of 6 MHz, the proposed method achieves 12%, 13.3%, and 14.2% higher sum-rates than PPO, A2C, and A3C, respectively, with the most pronounced gains observed at low-to-moderate bandwidths. This performance advantage arises because HDRL jointly optimizes the hybrid action space, effectively exploiting FAS-enabled channel variations and RSMA’s interference mitigation to maximize the SINR across subcarriers. These results underscore the effective interplay between dynamic antenna positioning, RSMA-based message splitting, and HDRL-driven adaptive control in enhancing spectral efficiency and sum-rate under varying bandwidth conditions.
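The diminishing return with bandwidth described above follows from the Shannon rate under a fixed power budget, where the noise power grows with the bandwidth. The sketch below shows this for a single link; the function name `rate_vs_bandwidth` and all parameter values are assumptions chosen for illustration and are not taken from the simulation setup.

```python
import math

def rate_vs_bandwidth(bandwidth_hz, total_power_w, channel_gain, noise_psd_w_per_hz):
    """Single-link Shannon rate in bit/s when a fixed power is spread over the band."""
    snr = total_power_w * channel_gain / (noise_psd_w_per_hz * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)

# Assumed values: 1 W power, unit channel gain, noise PSD of 1e-6 W/Hz.
for bandwidth_mhz in (1, 2, 4, 8):
    rate = rate_vs_bandwidth(bandwidth_mhz * 1e6, 1.0, 1.0, 1e-6)
    print(bandwidth_mhz, "MHz ->", round(rate / 1e6, 3), "Mbit/s")
```

In this toy setting the rate keeps increasing with bandwidth but approaches the limit P·g/(N0·ln 2), which mirrors the slower growth at larger bandwidths reported for Figure 9.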

5. Conclusions

In this work, an FAS-assisted ISAC system is investigated that integrates the advantages of RSMA for efficient interference management. The communication sum-rate between the BS and users is maximized under constraints on BS transmit power and sensing beampattern gain by jointly optimizing the beamforming vectors, fluid antenna positions, and RSMA common rate, so that sensing and communication requirements are balanced. To solve this complex problem efficiently, a novel HDRL approach is proposed, which integrates the DQN and the TD3 in a hierarchical manner, enabling effective exploration of the solution space and convergence to high-performance configurations. Simulation results confirm that the proposed approach significantly outperforms benchmark approaches by achieving higher sum-rates while maintaining robust sensing capabilities, demonstrating the synergy between fluid antenna flexibility and RSMA’s interference management. In future work, we will extend the proposed framework to more realistic ISAC scenarios by incorporating imperfect CSI conditions, including partial CSI for fluid antenna ports. We will also consider mobility-aware environments, where dynamic user and target movements together with CSI acquisition uncertainty require rapid and continuous adaptation of beamforming strategies and fluid antenna port selection to maintain both sensing accuracy and communication quality. In addition, we will investigate more sophisticated sensing models that account for multi-target environments, clutter effects, and bistatic or monostatic sensing configurations.

Author Contributions

Conceptualization and methodology, M.S.; software and implementation, M.S.; formal analysis, M.S.; supervision, T.C.C.; writing – original draft preparation, M.S.; writing – review and editing, M.S. and I.E.L.; funding acquisition, T.C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Multimedia University under the Research Fellow Grant MMUI/250008, and in part by Telekom Research & Development Sdn. Bhd. under Grant RDTC/241149.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article. Further enquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. An, J.; Yuen, C.; Guan, Y.L.; Di Renzo, M.; Debbah, M.; Poor, H.V.; Hanzo, L. Two-Dimensional Direction-of-Arrival Estimation Using Stacked Intelligent Metasurfaces. IEEE J. Sel. Areas Commun. 2024, 42, 2786–2802. [Google Scholar] [CrossRef]
  2. Liu, F.; Masouros, C.; Petropulu, A.P.; Griffiths, H.; Hanzo, L. Joint Radar and Communication Design: Applications, State-of-the-Art, and the Road Ahead. IEEE Trans. Commun. 2020, 68, 3834–3862. [Google Scholar] [CrossRef]
  3. Lu, S.; Liu, F.; Li, Y.; Zhang, K.; Huang, H.; Zou, J.; Li, X.; Dong, Y.; Dong, F.; Zhu, J. Integrated Sensing and Communications: Recent Advances and Ten Open Challenges. IEEE Internet Things J. 2024, 11, 19094–19120. [Google Scholar] [CrossRef]
  4. Yao, J.; Mai, L.; Zhang, Q. Approximate Capacity–Distortion Region of Joint State Sensing and Communication in MIMO Real Gaussian Channels. IEEE Trans. Commun. 2023, 72, 2625–2638. [Google Scholar] [CrossRef]
  5. Wu, T.; Pan, C.; Pan, Y.; Hong, S.; Ren, H.; Elkashlan, M.; Shu, F.; Wang, J. Joint Angle Estimation Error Analysis and 3-D Positioning Algorithm Design for mmWave Positioning Systems. IEEE Internet Things J. 2023, 11, 2181–2197. [Google Scholar] [CrossRef]
  6. Chen, Z.; Zheng, T.; Hu, C.; Cao, H.; Yang, Y.; Jiang, H.; Luo, J. ISACoT: Integrating Sensing with Data Traffic for Ubiquitous IoT Devices. IEEE Commun. Mag. 2022, 61, 98–104. [Google Scholar] [CrossRef]
  7. Chen, Y.; Hua, H.; Xu, J.; Ng, D.W.K. ISAC Meets SWIPT: Multifunctional Wireless Systems Integrating Sensing, Communication, and Powering. IEEE Trans. Wirel. Commun. 2024, 23, 8264–8280. [Google Scholar] [CrossRef]
  8. Cui, Z.; Hu, J.; Cheng, J.; Li, G. Multi-Domain NOMA for ISAC: Utilizing the Degrees of Freedom in the Delay–Doppler Domain. IEEE Commun. Lett. 2022, 27, 726–730. [Google Scholar] [CrossRef]
  9. Su, N.; Liu, F.; Masouros, C. Sensing-Assisted Eavesdropper Estimation: An ISAC Breakthrough in Physical Layer Security. IEEE Trans. Wirel. Commun. 2023, 23, 3162–3174. [Google Scholar] [CrossRef]
  10. An, J.; Li, H.; Ng, D.W.K.; Yuen, C. Fundamental Detection Probability vs. Achievable Rate Tradeoff in Integrated Sensing and Communication Systems. IEEE Trans. Wirel. Commun. 2023, 22, 9835–9853. [Google Scholar] [CrossRef]
  11. Yang, Z.; Chen, M.; Saad, W.; Shikh-Bahaei, M. Downlink Sum-Rate Maximization for Rate-Splitting Multiple Access. In Proceedings of the IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020; pp. 1–6. [Google Scholar]
  12. Rimoldi, B.; Urbanke, R. A Rate-Splitting Approach to the Gaussian Multiple-Access Channel. IEEE Trans. Inf. Theory 1996, 42, 364–375. [Google Scholar] [CrossRef]
  13. Can, M.; Ilter, M.C.; Altunbas, I. Data-Oriented Downlink RSMA Systems. IEEE Commun. Lett. 2023, 27, 2812–2816. [Google Scholar] [CrossRef]
  14. Zhu, L.; Ma, W.; Zhang, R. Modeling and Performance Analysis for Movable Antenna Enabled Wireless Communications. IEEE Trans. Wirel. Commun. 2023, 23, 6234–6250. [Google Scholar] [CrossRef]
  15. Chen, K.; Qi, C.; Hong, Y.; Yuen, C. REMAA: Reconfigurable Pixel Antenna-Based Electronic Movable-Antenna Arrays for Multiuser Communications. IEEE Trans. Commun. 2025, 73, 12913–12928. [Google Scholar] [CrossRef]
  16. Chai, Z.; Wong, K.-K.; Tong, K.-F.; Chen, Y.; Zhang, Y. Port Selection for Fluid Antenna Systems. IEEE Commun. Lett. 2022, 26, 1180–1184. [Google Scholar] [CrossRef]
  17. Naderializadeh, N.; Sydir, J.J.; Simsek, M.; Nikopour, H. Resource Management in Wireless Networks via Multi-Agent Deep Reinforcement Learning. IEEE Trans. Wirel. Commun. 2021, 20, 3507–3523. [Google Scholar] [CrossRef]
  18. Kim, K.; Tun, Y.K.; Munir, M.S.; Saad, W.; Hong, C.S. Deep Reinforcement Learning for Channel Estimation in RIS-Aided Wireless Networks. IEEE Commun. Lett. 2023, 27, 2053–2057. [Google Scholar] [CrossRef]
  19. Rabee, A.; Barhumi, I. Toward Energy-Efficient Dynamic Resource Allocation in Uplink NOMA Systems. IEEE Trans. Veh. Technol. 2025, 74, 9313–9327. [Google Scholar] [CrossRef]
  20. Ma, Z.; Liang, Y.; Zhu, Q.; Zheng, J.; Lian, Z.; Zeng, L.; Fu, C.; Peng, Y.; Ai, B. Hybrid-RIS-Assisted Cellular ISAC Networks for UAV-Enabled Low-Altitude Economy via Deep Reinforcement Learning with Mixture-of-Experts. IEEE Trans. Cogn. Commun. Netw. 2025, 12, 3875–3888. [Google Scholar] [CrossRef]
  21. Xu, C.; Clerckx, B.; Chen, S.; Mao, Y.; Zhang, J. Rate-Splitting Multiple Access for Multi-Antenna Joint Radar and Communications. IEEE J. Sel. Top. Signal Process. 2021, 15, 1332–1347. [Google Scholar] [CrossRef]
  22. Ma, Z.; Zhang, R.; Ai, B.; Lian, Z.; Zeng, L.; Niyato, D.; Peng, Y. Deep Reinforcement Learning for Energy Efficiency Maximization in RSMA-IRS-Assisted ISAC System. IEEE Trans. Veh. Technol. 2025, 74, 18273–18278. [Google Scholar] [CrossRef]
  23. Liu, Z.; Jint, Y.; Cao, B.; Lu, R. RISAC: Rate-Splitting Multiple Access Enabled Integrated Sensing and Communication Systems. In Proceedings of the IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023; pp. 6449–6454. [Google Scholar]
  24. Zou, J.; Xu, H.; Wang, C.; Xu, L.; Sun, S.; Meng, K.; Masouros, C.; Wong, K.-K. Shifting the ISAC Trade-Off with Fluid Antenna Systems. IEEE Wirel. Commun. Lett. 2024, 13, 3479–3483. [Google Scholar] [CrossRef]
  25. Wang, C.; Li, G.; Zhang, H.; Wong, K.-K.; Li, Z.; Ng, D.W.K.; Chae, C.-B. Fluid Antenna System Liberating Multiuser MIMO for ISAC via Deep Reinforcement Learning. IEEE Trans. Wirel. Commun. 2024, 23, 10879–10894. [Google Scholar] [CrossRef]
  26. Hao, T.; Shi, C.; Wu, Q.; Xia, B.; Guo, Y.; Ding, L.; Yang, F. Fluid-Antenna Enhanced ISAC: Joint Antenna Positioning and Dual-Functional Beamforming Design. IEEE Trans. Veh. Technol. 2025, 74, 17204–17219. [Google Scholar] [CrossRef]
  27. Ghadi, F.R.; Wong, K.-K.; López-Martínez, F.J.; Hanzo, L.; Chae, C.-B. Fluid Antenna-Aided Rate-Splitting Multiple Access. IEEE Trans. Veh. Technol. 2025, 75, 3417–3422. [Google Scholar] [CrossRef]
  28. Ma, W.; Zhu, L.; Zhang, R. MIMO Capacity Characterization for Movable Antenna Systems. IEEE Trans. Wirel. Commun. 2023, 23, 3392–3407. [Google Scholar] [CrossRef]
  29. Clerckx, B.; Mao, Y.; Jorswieck, E.A.; Yuan, J.; Love, D.J.; Erkip, E.; Niyato, D. A Primer on Rate-Splitting Multiple Access: Tutorial, Myths, and Frequently Asked Questions. IEEE J. Sel. Areas Commun. 2023, 41, 1265–1308. [Google Scholar] [CrossRef]
  30. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
Figure 1. Illustration of an FAS-assisted RSMA–ISAC framework.
Figure 2. Convergence of the proposed approach and benchmarks: rewards versus training episodes.
Figure 3. Comparison of training time to convergence.
Figure 4. Ablation study: Average sum-rate performance of different variants, including Full HDRL, without FAS, without RSMA, only DQN, and only TD3.
Figure 5. Sum-rate versus varying user rate.
Figure 6. Sum-rate versus varying number of users.
Figure 7. Sum-rate versus varying transmit power.
Figure 8. Sensing SINR versus varying rate threshold.
Figure 9. Sum-rate versus varying bandwidth.
Table 1. Parameter values for system setup.

| Parameter | Value |
|---|---|
| Cell size | 250 × 250 m² |
| Bandwidth | 6 MHz |
| Number of users | 7 |
| Target position | [89 36 0] |
| SNR of the BS | 5 dB |
| BS power | 10 W |
| User noise power | 10⁻¹⁰ W |
| Target noise power | 10⁻¹² W |
| Soft update rate | 0.005 |
| Carrier frequency | 3.5 GHz |
| Discount rate | 0.9 |
| Exploration noise | 0.2 |
| Mini-batch | 64 |
| Actor and critic networks learning rates | 0.02, 0.001 |
| Learning rate for Q-network | 0.001 |
| Experience replay buffer | 500 |
Share and Cite

MDPI and ACS Style

Sheraz, M.; Chuah, T.C.; Lee, I.E. RSMA-Assisted Fluid Antenna ISAC via Hierarchical Deep Reinforcement Learning. Telecom 2026, 7, 41. https://doi.org/10.3390/telecom7020041
