1. Introduction
The emergence of sixth-generation (6G) wireless networks marks a paradigm shift toward massive connectivity, which is expected to be a cornerstone of future communication systems [1]. However, the persistent challenge of spectrum scarcity continues to hinder the realization of these next-generation networks. To address this limitation, Integrated Sensing and Communications (ISAC) has emerged as a compelling paradigm that enables the joint utilization of radio frequency (RF) resources for both communication and sensing tasks, particularly through the reuse of radar spectrum [2,3,4,5]. ISAC has garnered significant attention in recent years due to its potential to substantially enhance spectral efficiency, thus mitigating spectrum congestion and minimizing resource wastage, while concurrently decreasing hardware and signaling overhead [6]. Consequently, it has been incorporated into diverse wireless architectures, such as ISAC-enabled wireless power transfer (WPT) systems [7], ISAC-based non-orthogonal multiple access (NOMA) systems [8], and ISAC-assisted physical layer security (PLS) frameworks [9]. Furthermore, the inherent performance tradeoffs between sensing and communication functionalities in ISAC systems have been rigorously analyzed in [10].
With the growing number of users, inter-user interference has emerged as a critical factor constraining communication performance. Rate-splitting multiple access (RSMA) has been developed as an efficient mechanism to manage interference and enhance network efficiency [11]. In the RSMA approach, the transmitter partitions information into a common stream and multiple private streams via linearly precoded rate-splitting. The common stream is intended for decoding by all receivers, while each private stream is independently encoded and decoded at the corresponding receiver using successive interference cancellation (SIC) [12]. By enabling flexible resource allocation and effective interference mitigation, RSMA has the potential to address the limitations imposed by scarce wireless resources and multi-user communication requirements, thereby significantly improving the performance of communication systems, including ISAC networks [13].
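The decoding order described above can be made concrete with a small numerical sketch. It assumes a single-layer RSMA downlink with unit noise power; the function name `rsma_rates` and the array layout are illustrative choices, not notation taken from this paper.

```python
import numpy as np


def rsma_rates(H, p_c, P, noise_power=1.0):
    """Rates of a single-layer RSMA downlink (illustrative sketch).

    H   : (K, N) complex array, row k is the channel h_k of user k
    p_c : (N,)   complex precoder of the common stream
    P   : (N, K) complex array, column k is the private precoder of user k
    """
    H, p_c, P = np.asarray(H), np.asarray(p_c), np.asarray(P)
    common_gain = np.abs(H.conj() @ p_c) ** 2      # |h_k^H p_c|^2 for each user k
    private_gain = np.abs(H.conj() @ P) ** 2       # entry [k, j] = |h_k^H p_j|^2
    total_private = private_gain.sum(axis=1)       # all private streams seen by user k
    own_private = np.diag(private_gain)            # user k's own private stream

    # The common stream is decoded first, treating every private stream as noise.
    sinr_common = common_gain / (total_private + noise_power)
    # After SIC removes the common stream, only the other private streams interfere.
    sinr_private = own_private / (total_private - own_private + noise_power)

    # Every user must decode the common stream, so its rate is set by the weakest user.
    rate_common = float(np.log2(1.0 + sinr_common).min())
    rate_private = np.log2(1.0 + sinr_private)
    return rate_common, rate_private


# Toy example: two users, two BS antennas, orthogonal channels.
H = np.eye(2, dtype=complex)
rate_c, rate_p = rsma_rates(H, p_c=np.array([1.0, 1.0]), P=np.eye(2, dtype=complex))
sum_rate = rate_c + rate_p.sum()
```

Taking the minimum over users for the common rate reflects the requirement that all receivers decode the common stream before applying SIC.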
Consequently, RSMA-assisted ISAC provides a promising framework for next-generation wireless networks, offering improved spectral efficiency, robust interference mitigation, and enhanced overall system performance. However, the full potential of RSMA-assisted ISAC may be restricted by the limited spatial flexibility of conventional fixed antennas, which can constrain interference management and overall communication performance. In this regard, movable antennas (MAs) offer a solution by enabling dynamic spatial repositioning of antennas to synthesize flexible radiation patterns, thereby introducing additional spatial degrees of freedom (DoF) that enhance desired links and effectively suppress interference [14].
Recently, MA technology has emerged as a promising paradigm to further exploit spatial DoFs through antenna position reconfiguration. Traditionally, this is achieved via mechanically movable antenna elements, where antennas are physically repositioned within a predefined region to exploit channel variations and improve system performance.
Beyond mechanical movement, electrically reconfigurable implementations of movable antennas have also been proposed, where antenna movement is emulated without physical displacement. In particular, dense antenna arrays integrated with reconfigurable devices such as pixel-based antenna architectures enable different radiating elements or ports to be activated electronically, thereby realizing virtual antenna movement. Such electrically movable antenna systems offer reduced latency and enhanced adaptability, especially in scenarios where physical movement is constrained [15].
However, electrically reconfigurable movable antenna architectures typically rely on dense arrays of reconfigurable elements and complex control circuitry, which significantly increases hardware complexity and energy consumption. In addition, antenna position adaptation is restricted to discrete switching among predefined ports or pixels, limiting spatial resolution. Furthermore, the dense arrangement of radiating elements may introduce mutual coupling effects that complicate system design and degrade performance. These limitations motivate the development of alternative architectures that can achieve efficient spatial adaptability with reduced implementation overhead.
Building on these observations, the fluid antenna system (FAS) has emerged as a practical and efficient realization of movable antenna concepts. Unlike mechanically movable antennas that rely on physical displacement and electrically reconfigurable architectures that require dense arrays and complex control circuitry, FAS achieves spatial adaptability through a single antenna element that switches among multiple predefined ports within a compact region. This approach significantly reduces hardware complexity and energy consumption, while mitigating mutual coupling effects by avoiding densely packed radiating elements.
In FAS, antenna adaptation is realized through discrete port selection, where the system leverages channel state information (CSI) at each port to determine the optimal antenna position [16], thereby enhancing communication reliability and efficiency. Each antenna element is connected to its respective RF chain via flexible cabling, allowing real-time repositioning of the antenna within a designated region. This spatial mobility introduces a new degree of freedom for system-level optimization. By adaptively configuring antenna locations based on instantaneous channel conditions, FAS enables dynamic channel shaping, thereby enhancing both communication and sensing performance. Hence, the integration of FAS with RSMA-assisted ISAC is a natural evolution for next-generation wireless systems. By combining the near-continuous spatial flexibility of FAS with the interference-mitigation capabilities of RSMA, the system can dynamically optimize antenna positions and rate-splitting strategies to enhance communication performance while preserving sensing accuracy. This integrated framework has the potential to achieve higher spectral efficiency, improved reliability, and enhanced overall network performance.
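As a minimal illustration of CSI-based port selection, the sketch below picks the port with the largest channel gain for a single fluid antenna. The max-gain criterion and the name `select_port` are assumptions made for illustration; the joint multi-user selection considered in this paper is more involved.

```python
import numpy as np


def select_port(port_channels):
    """Return (index, gain) of the port with the largest channel gain |h|^2.

    port_channels: 1-D complex array with one channel coefficient per candidate port.
    """
    gains = np.abs(np.asarray(port_channels)) ** 2
    best = int(np.argmax(gains))
    return best, float(gains[best])


# Three candidate ports with hypothetical channel coefficients.
best_port, best_gain = select_port(np.array([0.1 + 0.2j, 1.0 - 1.0j, 0.5j]))
```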
The integration of FAS into ISAC gives rise to a mixed discrete–continuous optimization problem that is particularly challenging. Specifically, the antenna port selection involves discrete spatial decisions, while the beamforming and rate-splitting parameters are continuous variables. These two types of decisions are strongly coupled, as any change in antenna port configuration alters the instantaneous channel states, thereby affecting the optimal beamforming and power allocation strategies. This coupling leads to a high-dimensional, non-convex, and combinatorial optimization landscape, rendering traditional convex or alternating optimization methods ineffective. Conventional deep reinforcement learning (DRL) approaches, while capable of model-free optimization, often struggle to explore efficiently and to balance actions across heterogeneous discrete–continuous decision spaces, leading to suboptimal convergence and instability [17,18]. Therefore, a hybrid DRL approach that can handle both discrete and continuous action spaces is needed, enabling a more flexible and efficient decision-making mechanism [19].
1.1. Related Works
In [20], the authors investigate ISAC networks and address the challenge of jointly optimizing communication and sensing performance under practical constraints, including beamforming, transmit power, and QoS requirements. To tackle the resulting non-convex and high-dimensional optimization problem, a mixture-of-experts (MoE)-based DRL framework is proposed to enable efficient policy learning, where expert networks specialize in different regions of the state space and a gating network dynamically integrates their outputs, thereby promoting stable convergence and effective exploration of complex state-action spaces. The results show that the proposed DRL framework achieves significant improvements in sensing and communication performance, highlighting the role of DRL in adaptive and efficient ISAC system design.
Researchers have explored multiple-access schemes for ISAC to enhance spectrum efficiency. RSMA is particularly effective, as it enables partial decoding of interference while treating the rest as noise, offering flexible interference management. This improves spectral efficiency and robustness, making RSMA well suited for multiuser ISAC scenarios with heterogeneous channels and strict sensing–communication requirements. In [21], the authors address the challenge of interference management in multi-antenna dual-functional radar-communication (DFRC) systems. They propose an RSMA-assisted framework that splits messages into common and private streams and jointly optimizes message splitting, precoding, and radar sequence design. The results demonstrate that RSMA achieves high spectral efficiency by enabling the common stream to function simultaneously as a radar sequence and an interference mitigation tool. In [22], the authors investigate RSMA-assisted ISAC transmission design under realistic channel fading conditions. They develop a geometry-based 3D channel model incorporating a dual-functional base station, multiple communication users, and moving targets such as UAVs, capturing distance-, velocity-, and angle-dependent fading effects. They formulate an energy efficiency maximization problem that jointly considers transceiver beamforming, phase shifts, and quality of service (QoS) constraints for communication and sensing. To solve this highly non-convex problem, they deploy a DRL approach based on the proximal policy optimization (PPO) algorithm, enabling efficient joint optimization of beamforming and phase shifts. Simulation results demonstrate that RSMA effectively manages interference and improves spectral efficiency in multiuser ISAC scenarios.
In multiuser ISAC systems, interference between communication users and sensing signals can limit both data transmission and sensing accuracy, highlighting the need for advanced multiple access schemes that can jointly optimize these conflicting objectives. In [23], the authors propose an RSMA-enabled ISAC (RISAC) framework, where RSMA precoding and rate-splitting parameters are jointly optimized using a combination of particle swarm optimization and semidefinite relaxation (SDR) techniques. Simulation results show that by exploiting RSMA’s common stream, the proposed framework effectively mitigates both inter-user and sensing–communication interference, achieving a superior trade-off between communication rates and sensing performance and demonstrating the potential of RSMA to enhance the overall efficiency and flexibility of ISAC systems.
The integration of sensing and communication has created a demand for architectures capable of flexibly managing their strong spatial coupling. FAS-assisted ISAC has emerged as a promising solution because the spatial agility of fluid antennas enables more effective optimization of sensing and communication links than fixed-antenna systems. In [24], the authors tackle the complexity of jointly satisfying sensing and communication signal-to-noise ratio (SNR) requirements under limited antenna ports, which is an NP-hard problem arising from the strong coupling between port selection and beamforming. They propose an FAS-enabled ISAC framework that exploits dynamic port switching to unlock additional spatial DoFs. Simulation results demonstrate that the proposed approach achieves a 33% reduction in transmit power while meeting the sensing and communication requirements, outperforming conventional and uniformly distributed antenna schemes. However, port selection remains a fundamental challenge in FAS-assisted ISAC systems due to the combinatorial complexity of choosing the optimal subset of ports under joint sensing and communication constraints. This challenge is addressed in [25], where the authors study joint multi-port selection and precoder design for FAS-assisted multiuser MIMO downlink ISAC systems to maximize the sum-rate under sensing power constraints. Since the problem is NP-hard due to strong coupling between port selection and precoding, they propose a DRL-based solution. A constraint-aware neural precoding network is trained via primal–dual unsupervised learning, while port selection is handled by a pointer-network-based A2C framework using the sum-rate as the reward. Results show near-optimal performance close to exhaustive search, over two-fold gains versus random selection, and strong robustness even with only 15% CSI, highlighting the effectiveness of DRL-enabled FAS for ISAC optimization under limited CSI.
In [26], the authors address the limitations of fixed-position antenna (FPA)-based ISAC systems in exploiting spatial degrees of freedom by proposing an FAS-enhanced ISAC framework. They jointly optimize FAS positions and dual-functional beamforming to maximize sensing SNR while ensuring a minimum communication SINR per user, under FAS movement and port separation constraints. For perfect CSI, an alternating optimization (AO) algorithm using semidefinite relaxation (SDR) and successive convex approximation (SCA) is developed, while for imperfect CSI, an AO-based design with the S-Procedure and SCA handles uncertainty. Simulation results demonstrate sensing SNR gains of 8.72–178.85% while satisfying QoS. In another research effort [27], the authors integrate FAS with RSMA for downlink multiuser communications to improve reliability and outage performance. Users dynamically adjust FAS positions, and the BS employs RSMA signaling to flexibly manage interference. The channel gain distributions are modeled via a joint multivariate t-distribution with a copula-based formulation, enabling accurate analytical and asymptotic outage probability expressions. The results show that FAS-RSMA significantly lowers outage probability and achieves higher spectral efficiency. Thus, integrating FAS and RSMA can provide a robust and efficient solution for multi-user ISAC systems.
3. Hierarchical Deep Reinforcement Learning (HDRL) Approach
Conventional RL algorithms are insufficient in this setting due to their limited scalability in high-dimensional state–action spaces and their inability to handle discrete and continuous decision variables simultaneously. To address these limitations, we adopt an HDRL approach that integrates a deep Q-network (DQN) with the twin delayed deep deterministic policy gradient (TD3) algorithm. In this design, the DQN handles the discrete actions, while TD3 manages the continuous, high-dimensional action space. The DQN performs FAS port selection by outputting Q-values for all candidate ports and selecting the port with the highest estimated Q-value, while its exploration mechanism enables efficient and robust port switching. The RSMA–ISAC resource-allocation variables, in contrast, are continuous and therefore benefit from deterministic policy-gradient optimization. For this purpose, TD3 is employed to improve stability through its clipped double-Q learning, target policy smoothing, and delayed actor updates, enabling efficient refinement of continuous physical-layer parameters. The combination of DQN and TD3 enables shared experience replay, improved sample efficiency, and coordinated updates across decision layers. This hierarchical decomposition reduces action dimensionality, preserves the mathematical structure of both discrete and continuous decisions, stabilizes training, and prevents the quantization loss or infeasible actions that arise when continuous-control algorithms are forced onto discrete problems.
The optimization problem is reformulated as a Markov decision process (MDP), which facilitates its solution within a reinforcement learning framework. The MDP is modeled as a 4-tuple $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ denotes the current state, $a_t$ the selected action, $r_t$ the immediate reward, and $s_{t+1}$ the resulting state. At each time step $t$, the agent observes $s_t$ and selects $a_t$ based on its policy to interact with the environment.
State: Owing to the large number of available fluid antenna ports at both the BS and the users, directly incorporating the channel gains of all ports into the state representation would result in a prohibitively high-dimensional state space, thereby hindering the training efficiency and convergence of the DRL model. To mitigate this issue, we exploit the inherent spatial correlation among adjacent ports and construct a reduced state representation by uniformly sampling the channel gains from a representative subset of ports. This approach significantly lowers the dimensionality of the state space while preserving the essential channel characteristics required for accurate decision-making. The state at time t comprises the distances between the BS and the users, the SINRs of the common and private streams, and the active antenna port positions. The port information consists of the selected port index of the n-th BS fluid antenna and the selected port index of user k’s fluid antenna, together with the corresponding 2D positions of the active ports at the BS antennas and at the users.
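The port subsampling described above can be sketched as follows. The number of sampled ports, the small offset `eps`, and the helper name `reduced_state` are assumptions for illustration; the logarithmic scaling follows the state normalization mentioned in the simulation setup.

```python
import numpy as np


def reduced_state(port_gains, num_samples, eps=1e-12):
    """Uniformly sample port gains and apply log scaling.

    port_gains : 1-D array of channel gains |h|^2, one per port
    num_samples: number of representative ports kept in the state
    Returns the sampled port indices and their log10-scaled gains.
    """
    gains = np.asarray(port_gains, dtype=float)
    # Evenly spaced port indices, including the first and the last port.
    idx = np.linspace(0, len(gains) - 1, num_samples).round().astype(int)
    # eps avoids log(0) for ports in a deep fade.
    return idx, np.log10(gains[idx] + eps)


# Nine ports with gains spanning eight orders of magnitude, reduced to three features.
idx, features = reduced_state(10.0 ** np.arange(9), num_samples=3)
```

Because adjacent ports are spatially correlated, the skipped ports carry largely redundant information, which is what makes this reduction reasonable.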
Action: The action space is designed to jointly capture the port selection of the fluid antennas at both the BS and the users, as well as the transmit power allocation strategy. The first component is the vector of port indices of the BS fluid antennas, whose n-th entry indicates the active port of the n-th BS antenna. The second component is the vector of port indices of the user-side fluid antennas, whose k-th entry selects the active port of user k. The third component specifies the power allocated to the common stream from each BS antenna port, while the fourth component defines the power allocation from each BS antenna port to each user port for the private streams. The overall action is the collection of these four components, which allows the agent to autonomously adapt both antenna port selections and power allocation, thereby enhancing spectral efficiency and overall system performance.
Reward: The reward function is designed to be directly aligned with the system objective, namely, the maximization of the sum-rate. In reinforcement learning, the reward serves as the feedback signal that guides the agent’s policy updates; hence, it must be strongly correlated with the performance metric of interest. To this end, the agent receives a positive reward proportional to the achievable system sum-rate whenever all optimization constraints (19a)–(19e) are satisfied. Conversely, if any constraint is violated, the agent is penalized with a zero reward. Formally, the reward at time step $t$ is expressed as follows:
$$r_t = \begin{cases} R_{\mathrm{sum}}(t), & \text{if constraints (19a)–(19e) are satisfied}, \\ 0, & \text{otherwise}, \end{cases}$$
where $R_{\mathrm{sum}}(t)$ denotes the achievable system sum-rate at time step $t$.
This formulation ensures that the agent is incentivized to learn policies that not only maximize the sum-rate but also strictly adhere to the feasibility conditions of the optimization problem. The sensing requirement is enforced through the constraint-aware reward design. Specifically, the agent receives the achieved communication sum-rate when all system constraints, including the sensing power constraint, are satisfied; otherwise, the reward is set to zero. This mechanism ensures that the learning agent favors actions that simultaneously satisfy both communication and sensing requirements while maximizing communication performance.
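A minimal sketch of this constraint-aware reward is given below. Only two feasibility checks are shown (a transmit power budget and a minimum sensing power) as stand-ins for the full constraint set (19a)–(19e); the argument names are hypothetical.

```python
def constraint_aware_reward(sum_rate, tx_power, p_max, sensing_power, sensing_min):
    """Return the sum-rate if the action is feasible, and zero otherwise.

    The two checks below stand in for the paper's full constraint set.
    """
    feasible = (tx_power <= p_max) and (sensing_power >= sensing_min)
    return float(sum_rate) if feasible else 0.0


# Feasible action: the reward equals the achieved sum-rate.
r_ok = constraint_aware_reward(7.5, tx_power=0.9, p_max=1.0, sensing_power=0.3, sensing_min=0.2)
# Sensing requirement violated: the reward collapses to zero.
r_bad = constraint_aware_reward(9.0, tx_power=0.9, p_max=1.0, sensing_power=0.1, sensing_min=0.2)
```

Note that an infeasible action with a higher raw sum-rate still earns less than a feasible one, which is what steers the agent toward the feasible region.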
The core idea of the proposed HDRL approach is to decompose the system action space into discrete and continuous components. The discrete action $a_t^{d}$, corresponding to FAS port selection, is optimized using a DQN. Specifically, the DQN approximates the discrete action-value function $Q(s_t, a_t^{d}; \theta)$ with a neural network parameterized by $\theta$, enabling the agent to evaluate and select the most promising port configuration. At each time step, the DQN updates its parameters by minimizing a TD loss that measures the discrepancy between predicted and target Q-values. The loss function for training the value network is defined as
$$L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t^{d}; \theta)\right)^{2}\right],$$
where $y_t$ denotes the TD target constructed from sampled transitions in the replay buffer. This loss represents the mean squared error (MSE) between the predicted Q-values from the online network and the target values $y_t$. By penalizing large deviations between predicted and target Q-values, it encourages the network to iteratively approximate the optimal action-value function.
The target value $y_t$ is evaluated using the target Q-network with parameters $\theta^{-}$, which is a delayed copy of the online Q-network:
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}),$$
where $\gamma$ is the discount factor and $\theta^{-}$ is updated as $\theta^{-} \leftarrow \theta$ every $C$ steps, with $C$ being the delay interval. This delay mechanism reduces the risk of divergence by decoupling target computation from rapid online weight updates.
The TD error, defined as the discrepancy between the target and predicted Q-values, is expressed as:
$$\delta_t = y_t - Q(s_t, a_t^{d}; \theta).$$
The gradient of the loss function with respect to the network parameters $\theta$ is obtained as:
$$\nabla_{\theta} L(\theta) = -2\,\mathbb{E}\left[\delta_t \nabla_{\theta} Q(s_t, a_t^{d}; \theta)\right].$$
The weight update rule for the Q-network follows the standard stochastic gradient descent (SGD) form:
$$\theta \leftarrow \theta - \alpha_{Q} \nabla_{\theta} L(\theta),$$
where $\alpha_{Q}$ denotes the learning rate of the Q-network.
Consequently, the policy of the outer-loop DQN procedure is evaluated as:
$$a_t^{d} = \arg\max_{a} Q(s_t, a; \theta),$$
which ensures that the selected discrete action maximizes the estimated long-term return at the current state. This discrete action serves as the input for the inner-loop TD3 module, enabling coordinated optimization over both discrete and continuous control spaces.
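The DQN quantities described above (TD target, TD error, and MSE loss) can be sketched for a sampled mini-batch as follows. The Q-values are passed in as plain arrays rather than produced by neural networks, and the function name `dqn_td_loss` and the optional `done` mask are illustrative assumptions.

```python
import numpy as np


def dqn_td_loss(q_online, q_target_next, actions, rewards, gamma=0.9, done=None):
    """TD target, TD error, and MSE loss for one mini-batch.

    q_online      : (B, A) Q-values of the online network at the sampled states
    q_target_next : (B, A) Q-values of the target network at the next states
    actions       : (B,)   indices of the discrete actions (ports) that were taken
    rewards       : (B,)   immediate rewards
    done          : (B,)   1.0 where the episode ended, else 0.0 (defaults to all zeros)
    """
    q_online = np.asarray(q_online, dtype=float)
    q_target_next = np.asarray(q_target_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    done = np.zeros_like(rewards) if done is None else np.asarray(done, dtype=float)

    # Target: reward plus discounted best next-state value from the target network.
    y = rewards + gamma * (1.0 - done) * q_target_next.max(axis=1)
    # TD error: target minus the online estimate for the action actually taken.
    td_error = y - q_online[np.arange(len(rewards)), np.asarray(actions)]
    loss = float(np.mean(td_error ** 2))
    return y, td_error, loss


y, delta, loss = dqn_td_loss(
    q_online=[[1.0, 2.0], [3.0, 4.0]],
    q_target_next=[[0.0, 1.0], [2.0, 0.0]],
    actions=[1, 0],
    rewards=[1.0, 0.0],
)
```

In a full implementation the loss would be differentiated with respect to the online network's weights by an automatic differentiation library; only the arithmetic of the target and loss is shown here.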
The TD3 algorithm is an enhanced variant of the Deep Deterministic Policy Gradient (DDPG) method, designed to improve stability and reduce overestimation bias in continuous action space learning.
3.3. Exploration Strategy
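To ensure adequate exploration in the continuous action space, the deterministic policy output is perturbed with zero-mean Gaussian noise:
$$a_t^{c} = \pi_{\phi}(s_t) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_t^{2}),$$
where $\pi_{\phi}(s_t)$ is the deterministic policy output, $\sigma_t$ controls the exploration variance, $\sigma_0$ is the initial noise scale from which $\sigma_t$ starts, and a decay constant controls the rate at which the exploration noise decreases over time.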
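One way to realize this decaying exploration noise is sketched below. The exponential decay law, the default values of `sigma0` and `decay`, and the action bounds of [-1, 1] (chosen to match a Tanh-bounded actor output) are assumptions, since the exact schedule is not specified in the text.

```python
import numpy as np


def noisy_action(action, step, sigma0=0.2, decay=500.0, low=-1.0, high=1.0, rng=None):
    """Perturb a deterministic action with zero-mean Gaussian noise of decaying scale.

    The exponential schedule sigma_t = sigma0 * exp(-step / decay) is an assumed
    form; any monotonically decreasing schedule would fit the description.
    """
    rng = np.random.default_rng() if rng is None else rng
    action = np.asarray(action, dtype=float)
    sigma_t = sigma0 * np.exp(-step / decay)
    noise = rng.normal(0.0, sigma_t, size=action.shape)
    # Clip so the perturbed action stays inside the valid action range.
    return np.clip(action + noise, low, high), float(sigma_t)


rng = np.random.default_rng(0)
a_early, sigma_early = noisy_action([0.9, -0.9], step=0, rng=rng)
a_late, sigma_late = noisy_action([0.9, -0.9], step=500, rng=rng)
```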
In DDPG, overestimation bias can lead to the accumulation of approximation errors over time. Specifically, inaccuracies in the Q-value function may result in the agent assigning artificially high values to suboptimal state–action pairs, thereby producing a suboptimal policy. The primary objective of TD3 is to mitigate such function approximation errors, which can lead to overestimation of Q-value functions and degrade policy performance in the DDPG algorithm [30]. TD3 employs a dual-critic architecture to address this issue, built around one actor network and two critic networks, each with a corresponding target network. The actor comprises:
Policy network $\pi_{\phi}$, parameterized by $\phi$, which outputs the continuous control action for a given state $s$.
Target policy network $\pi_{\phi'}$, parameterized by $\phi'$, which is a delayed copy of the policy network used for stable target value computation.
Furthermore, the critic consists of:
Main Q-networks $Q_{\omega_1}$ and $Q_{\omega_2}$, which estimate the action-value function for a given state–action pair.
Target Q-networks $Q_{\omega_1'}$ and $Q_{\omega_2'}$, which are delayed copies of the main Q-networks and are used for computing target values in the Bellman updates.
This network structure decouples target Q-value computation from action selection, thereby reducing the correlation between estimated and target values. To further suppress overestimation bias, TD3 introduces several key modifications:
Clipped Double Q-Learning: Uses the minimum value of the two target Q-networks, i.e., $\min_{i=1,2} Q_{\omega_i'}(s', \tilde{a})$, where $s'$ is the next state and $\tilde{a}$ is the target action from the smoothed target policy.
Target Policy Smoothing: Adds small clipped noise to the target action before evaluating it with the target critics to prevent exploitation of Q-function errors.
Delayed Policy Updates: Updates the actor and target networks less frequently than the critics, stabilizing policy learning.
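TD3 maintains two independent critic networks, $Q_{\omega_1}$ and $Q_{\omega_2}$, to reduce overestimation. The target Q-value is computed using the smaller of the two critic estimates:
$$y = r + \gamma \min_{i=1,2} Q_{\omega_i'}(s', \tilde{a}),$$
where $\omega_1'$ and $\omega_2'$ are target critic parameters, $\gamma$ is the discount factor, and $\tilde{a}$ is the smoothed target action at the next state $s'$.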
To prevent the actor from being trained on inaccurate value estimates, TD3 updates the policy network and target networks less frequently than the critics. Specifically, for every d critic updates, one policy update is performed.
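To reduce exploitation of Q-function peaks caused by function approximation errors, TD3 applies clipped Gaussian noise to the target policy actions:
$$\tilde{a} = \pi_{\phi'}(s') + \mathrm{clip}(\epsilon, -c, c), \qquad \epsilon \sim \mathcal{N}(0, \tilde{\sigma}^{2}),$$
where $\pi_{\phi'}$ is the target policy network, $\epsilon$ is Gaussian noise, and $c$ is the noise clipping threshold. This regularization smooths the Q-value landscape by preventing abrupt policy changes in narrow action regions.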
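Target policy smoothing and the clipped double-Q target can be combined in a short sketch. The critic outputs are supplied as arrays, and the noise scale, clipping threshold, and action bounds are illustrative defaults rather than values reported in this paper.

```python
import numpy as np


def smoothed_target_action(target_action, sigma=0.2, clip=0.5, low=-1.0, high=1.0, rng=None):
    """Add clipped Gaussian noise to the target policy's action (target policy smoothing)."""
    rng = np.random.default_rng() if rng is None else rng
    target_action = np.asarray(target_action, dtype=float)
    noise = np.clip(rng.normal(0.0, sigma, size=target_action.shape), -clip, clip)
    return np.clip(target_action + noise, low, high)


def td3_target(rewards, q1_next, q2_next, gamma=0.9, done=None):
    """Clipped double-Q target: use the smaller of the two target-critic estimates."""
    rewards = np.asarray(rewards, dtype=float)
    done = np.zeros_like(rewards) if done is None else np.asarray(done, dtype=float)
    q_min = np.minimum(np.asarray(q1_next, dtype=float), np.asarray(q2_next, dtype=float))
    return rewards + gamma * (1.0 - done) * q_min


rng = np.random.default_rng(1)
a_tilde = smoothed_target_action(np.zeros(4), rng=rng)
y = td3_target(rewards=[1.0, 0.0], q1_next=[2.0, 5.0], q2_next=[3.0, 4.0])
```

In practice `q1_next` and `q2_next` would be the two target critics evaluated at the next state and the smoothed action `a_tilde`; taking their element-wise minimum is what counteracts overestimation.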
The proposed HDRL algorithm is summarized in Algorithm 1. The training procedure begins with the initialization phase, where the maximum number of episodes and time steps per episode are specified, a unified replay buffer for both the DQN and TD3 modules is established, and the parameters of the DQN, actor, and critic networks, together with their corresponding target networks, are randomly initialized to enable stable learning. Training proceeds over multiple episodes, each of which resets the environment with new channel conditions, antenna positions, and RSMA variables. Within each episode, the algorithm iterates through time steps, where in every step, the outer DQN module first selects a discrete action corresponding to the BS antenna port based on the observed system state by greedily choosing the action with the highest Q-value. Conditioned on this discrete choice, the inner TD3 module generates continuous control actions, including user FAS displacements, RSMA power allocation, and beamforming vectors, with additional Gaussian noise injected for exploration. These actions are executed in the environment, which produces the next state, achievable communication rates, and sensing metrics. A reward is then computed based on the achieved communication sum-rate, and it is set to zero if any feasibility constraints, such as transmit power or antenna separation, are violated. The resulting transition is stored in the replay buffer, enabling off-policy learning. Subsequently, the DQN network is updated by minimizing the temporal-difference error using sampled mini-batches, while the TD3 actor and critic networks are trained with mini-batch updates incorporating clipped double Q-learning, target policy smoothing, and delayed policy updates to enhance stability. Target networks for both modules are softly updated to track the learned networks. This process continues until the episode concludes, after which the algorithm advances to the next episode. 
Upon completion of all episodes, the algorithm yields a hierarchical policy consisting of the trained DQN for BS antenna port selection and the TD3 for continuous optimization of user antenna positions and RSMA power allocation, thereby solving the hybrid ISAC optimization problem.
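The per-step action selection and the soft target update summarized above can be sketched as follows. The Q-network and actor are replaced by placeholder callables, and conditioning the actor on the discrete choice by appending a one-hot port vector to the state is an assumed mechanism, since the text does not state how the conditioning is implemented.

```python
import numpy as np


def hierarchical_action(state, q_values_fn, actor_fn, num_ports, noise_std, rng):
    """Outer DQN picks a port greedily; inner actor outputs the continuous action."""
    port = int(np.argmax(q_values_fn(state)))          # greedy discrete decision
    one_hot = np.eye(num_ports)[port]
    actor_input = np.concatenate([state, one_hot])     # assumed conditioning scheme
    continuous = np.asarray(actor_fn(actor_input), dtype=float)
    # Gaussian exploration noise, clipped to the actor's Tanh output range.
    continuous = np.clip(continuous + rng.normal(0.0, noise_std, continuous.shape), -1.0, 1.0)
    return port, continuous


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: move each target parameter a fraction tau toward the online one."""
    return {k: tau * online_params[k] + (1.0 - tau) * target_params[k] for k in target_params}


# Placeholder networks, for illustration only.
state = np.array([0.5, -0.5])
port, cont = hierarchical_action(
    state,
    q_values_fn=lambda s: np.array([0.1, 0.9, 0.3]),
    actor_fn=lambda x: np.tanh(x[:2]),
    num_ports=3,
    noise_std=0.0,
    rng=np.random.default_rng(0),
)
new_target = soft_update({"w": np.zeros(2)}, {"w": np.ones(2)})
```

The default `tau=0.005` matches the soft update rate reported in the simulation setup.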
4. Performance Evaluation
In the simulations, the elevation and azimuth angles are modeled as independent and identically distributed random variables uniformly drawn from
. The distance between adjacent fluid antenna ports is fixed to
, while the fluid antenna movement is constrained within a square region of size
, where
. The path response matrix is assumed to be diagonal, i.e.,
. The diagonal elements follow complex Gaussian distributions such that
and
for
, where
denotes the Rician factor representing the power ratio between the line-of-sight (LoS) and non-line-of-sight (NLoS) components. In this work,
is set to 1. The numbers of transmit and receive paths are both set to 3. The maximum transmit signal-to-noise ratio (SNR) at the base station (BS) is defined as
, while the sensing constraint is characterized by
. Moreover, the number of BS antennas is set to
. The proposed HDRL framework integrates a DQN and a TD3 algorithm in a hierarchical manner. The DQN consists of a three-layer fully connected neural network with two hidden layers of 128 neurons each and ReLU activation functions. The TD3 actor network employs two hidden layers with 256 neurons each, followed by a Tanh activation to constrain the continuous action space. The TD3 critic adopts twin Q-networks with identical architectures to mitigate overestimation bias. The system is trained for 1000 episodes, each consisting of 12 interaction steps. A replay buffer of size 500 is utilized, and mini-batches of size 64 are sampled for network updates. The learning rates for the actor and critic networks are set to 0.02 and 0.001, respectively, while the learning rate for the DQN is set to 0.001. The soft update rate is fixed at 0.005, and the discount factor is set to 0.9. State normalization is performed using logarithmic scaling of channel gains to improve training stability, while continuous actions are clipped within predefined bounds to satisfy system constraints. The simulation parameters are presented in Table 1. To evaluate the effectiveness of the proposed HDRL approach, its performance is compared against the following three widely used benchmarks.
Proximal Policy Optimization (PPO): PPO is a policy gradient method that balances exploration and stability by limiting the size of each policy update through a clipped surrogate objective, which approximates a trust-region constraint. It has been widely adopted due to its strong empirical performance and relative simplicity, making it a standard benchmark for continuous and discrete action-space problems.
Advantage Actor-Critic (A2C): A2C is a synchronous actor-critic algorithm that utilizes the advantage function to reduce the variance of policy gradient estimates. It is valued for its training efficiency and its ability to stabilize learning compared to basic policy gradient methods, serving as a strong baseline for policy optimization tasks.
Asynchronous Advantage Actor-Critic (A3C): A3C runs multiple parallel agents asynchronously. This approach decorrelates experiences and accelerates learning, particularly in complex or high-dimensional environments.
Figure 2 illustrates the convergence behavior of the proposed approach compared to the benchmark schemes. The results show that the proposed hierarchical DRL framework outperforms all baselines, achieving higher rewards throughout the training process and stabilizing around episode 185 with an average reward of approximately 665, whereas the benchmarks converge more slowly and reach significantly lower reward levels. The hierarchical design of the proposed framework effectively balances exploration of promising strategies with exploitation of well-learned policies. The integration of TD3 further enhances stability by addressing overestimation bias through its twin-critic network, delayed policy updates, and target smoothing, thereby reducing oscillations and promoting smoother reward progression. As training progresses, this suppresses early over-optimistic actions, so the reward converges smoothly. In contrast, the benchmarks face inherent limitations that prevent them from matching the performance of the proposed approach. PPO, although stable due to its clipped surrogate objective, updates conservatively and therefore struggles to achieve peak performance in tasks requiring fine-grained exploration. A2C improves stability but remains constrained by biased value estimation and the absence of hierarchical abstraction. A3C encourages exploration via asynchronous updates but suffers from high gradient variance, which slows convergence. In addition, its lack of experience replay and target networks results in unstable learning in non-convex environments with complex constraints, limiting its optimization capability. Thus, the proposed approach achieves more efficient learning, smoother convergence, and superior asymptotic performance compared to the benchmark approaches.
Figure 3 illustrates the training time required for convergence of different reinforcement learning algorithms, including the proposed HDRL framework, PPO, A2C, and A3C. It can be observed that the proposed HDRL requires the longest training time, approximately 1820 s, due to its hierarchical structure that integrates both discrete and continuous optimization via DQN and TD3 components. This increased complexity leads to higher computational overhead during training. In contrast, PPO converges faster with a training time of 1490 s, as it relies on a single policy network with clipped updates, reducing computational burden. The A2C and A3C algorithms exhibit significantly lower convergence times of 980 s and 760 s, respectively. This is mainly because these methods employ simpler architectures and fewer parameters, resulting in faster updates but comparatively lower learning capability. Despite the higher training time, the proposed HDRL framework achieves superior performance, as demonstrated in the sum-rate results. This highlights a trade-off between computational complexity and performance, where HDRL provides enhanced optimization capability at the cost of increased training time.
The ablation results in Figure 4 clearly show that the full HDRL framework achieves the highest average sum-rate among all considered variants. This is because the complete scheme jointly exploits the benefits of FAS, RSMA, and HDRL, where the discrete and continuous decision spaces are optimized in a coordinated manner. In particular, FAS provides additional spatial flexibility by enabling adaptive antenna/port selection, RSMA improves interference management through message splitting and partial interference decoding, and the hierarchical DQN-TD3 structure allows the system to handle hybrid optimization variables efficiently. Since the considered problem contains both discrete and continuous control dimensions, the full HDRL framework is able to search a richer solution space and identify better transmission configurations, which directly translates into improved sum-rate performance.
When FAS is removed, the performance decreases because the system loses its ability to adapt the antenna configuration according to the channel conditions. In the full scheme, FAS introduces additional spatial DoF that can be used to strengthen desired links and reduce inter-user interference by selecting more favorable antenna positions or ports. Without this spatial adaptability, the transceiver operates with a fixed antenna structure, which limits channel exploitation capability and reduces beamforming flexibility. Consequently, even if the remaining optimization and interference-management mechanisms are still active, the achievable sum-rate is lower than that of the full HDRL scheme.
A performance reduction is also observed in the without RSMA case. This degradation is expected because RSMA plays an important role in multiuser interference mitigation. By splitting each user message into common and private parts, RSMA provides an additional degree of freedom for balancing interference suppression and information delivery. This is particularly beneficial in overloaded or interference-limited multiuser downlink scenarios, where purely private-stream transmission is often less efficient. When RSMA is disabled, the system must rely only on conventional signaling, which reduces flexibility in handling coupled user interference and therefore leads to a lower sum-rate than the complete framework.
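The role of the common and private parts can be illustrated with the textbook one-layer RSMA rate expressions. The sketch below assumes that the effective gains (the squared magnitudes of channel-precoder products) are already given; the function name `rsma_sum_rate`, its arguments, and the toy numbers are hypothetical and do not reproduce the system model or results of this work.

```python
import math

def rsma_sum_rate(common_gain, private_gain, noise=1.0):
    """One-layer RSMA rates in bit/s/Hz (illustrative).

    common_gain[k]     : gain of the common stream at user k
    private_gain[k][j] : gain of user j's private stream at user k
    """
    common_rates, private_rates = [], []
    for k in range(len(common_gain)):
        private_total = sum(private_gain[k])
        # The common stream is decoded first, treating all private streams as noise.
        sinr_c = common_gain[k] / (private_total + noise)
        common_rates.append(math.log2(1.0 + sinr_c))
        # After SIC removes the common stream, only other users' private streams interfere.
        interference = private_total - private_gain[k][k]
        sinr_p = private_gain[k][k] / (interference + noise)
        private_rates.append(math.log2(1.0 + sinr_p))
    # Every user must decode the common stream, so its rate is set by the weakest user.
    common_rate = min(common_rates)
    return common_rate, private_rates, common_rate + sum(private_rates)

rc, rp, total = rsma_sum_rate([2.0, 12.0], [[1.0, 0.0], [0.0, 3.0]])
print(rc, rp, total)
```

Setting the common-stream gains to zero recovers purely private-stream transmission, which corresponds to the kind of conventional signaling the without-RSMA variant falls back on.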
The only-DQN variant further illustrates the importance of hierarchical optimization. DQN is well suited for handling discrete decision variables, such as selecting transmission modes, ports, or index-based actions. Therefore, it can still provide meaningful gains by identifying favorable structural configurations of the system. However, when only DQN is used, the framework loses the ability to finely optimize continuous control variables, such as power allocation, rate-splitting coefficients, or other analog parameters. Consequently, the obtained solution is only partially optimized and lacks continuous refinement, preventing the system from reaching its full performance potential. Nevertheless, because the discrete structural choices are still optimized, this variant performs better than the only-TD3 variant.
Among all ablated variants, the only TD3 case exhibits the lowest performance. The main reason is that TD3 is fundamentally designed for continuous optimization, whereas the considered joint design problem is inherently hybrid, involving both discrete and continuous decisions. Even if the continuous variables are optimized effectively, the system still depends on appropriate discrete decisions to determine the underlying transmission structure. Without DQN, those discrete choices cannot be properly explored or optimized, and the continuous controller is forced to operate over a restricted or suboptimal structural configuration. Hence, continuous optimization alone cannot compensate for the lack of discrete adaptation, because the quality of the final solution strongly depends on first selecting a good discrete operating point. Therefore, it achieves the lowest sum-rate, which confirms that the discrete decision-making layer is indispensable in this problem.
Overall, the ablation study demonstrates that the performance gain of the proposed method does not come from a single module in isolation. Rather, it originates from the synergistic interaction of spatial adaptability provided by FAS, interference management enabled by RSMA, and hybrid discrete–continuous optimization realized through the HDRL framework. The superiority of full HDRL therefore validates the need for jointly optimizing all components in an integrated manner.
Figure 5 illustrates the impact of increasing the user rate threshold on the sum-rate. The sum-rate is high at low rate thresholds because the BS can flexibly allocate transmit power between the common and private streams, allowing even weak users to reliably decode through the common stream. The deployment of FAS at both the BS and user sides further enhances spatial diversity, as the fluid antenna position can be dynamically adjusted to select more favorable channels. However, as the rate threshold increases, each user’s QoS demand becomes more stringent, forcing the BS to dedicate more transmit power to guarantee per-user requirements and thereby reducing the DoF for flexible resource balancing. This leads to a saturation effect where sum-rate growth slows and eventually declines with higher thresholds. For instance, at a rate threshold of 0.8, the proposed method achieves a 31.6%, 42.9%, and 47.1% higher sum-rate than PPO, A2C, and A3C, respectively. This performance gain results from the proposed HDRL approach, which combines the discrete action-handling capability of DQN for FAS position selection and RSMA mode selection with the continuous control strength of TD3 for beamforming, power allocation, and common-rate optimization. Hence, the proposed approach efficiently adapts to the hybrid action space, effectively balancing exploration and exploitation, and converges toward better strategies. In contrast, the benchmarks struggle with dynamic adaptation in mixed action environments, leading to suboptimal allocation strategies and reduced performance.
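The division of labor between the two levels can be sketched as a two-stage action selection. This is a simplified illustration under stated assumptions, not the authors' architecture: `q_values_fn` and `actor_fn` are hypothetical stand-ins for a trained DQN and a TD3 actor, and conditioning the continuous actor on the chosen discrete index is an assumption made for this example.

```python
import random

def select_hybrid_action(state, q_values_fn, actor_fn,
                         epsilon=0.1, noise_std=0.1, rng=random):
    """Pick a discrete index first, then continuous parameters for it."""
    q_values = q_values_fn(state)
    # Upper level (DQN-style): epsilon-greedy over discrete choices,
    # e.g. a FAS port index or an RSMA mode.
    if rng.random() < epsilon:
        discrete = rng.randrange(len(q_values))
    else:
        discrete = max(range(len(q_values)), key=q_values.__getitem__)
    # Lower level (TD3-style): deterministic actor output plus Gaussian
    # exploration noise, clipped to the normalized range [-1, 1].
    continuous = [max(-1.0, min(1.0, a + rng.gauss(0.0, noise_std)))
                  for a in actor_fn(state, discrete)]
    return discrete, continuous
```

In such a scheme the continuous outputs would still need to be mapped to feasible beamformers, power levels, and common-rate shares; that mapping depends on the system model and is omitted here.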
The impact of an increasing number of users on the system sum-rate is evaluated in Figure 6. The results show a consistent increase in sum-rate as the number of users grows. This improvement is mainly attributed to the capability of RSMA to effectively manage multiuser interference through partial interference decoding and flexible message splitting. In addition, the deployment of FAS at both the BS and user sides enables dynamic adjustment of antenna positions to better exploit favorable channel conditions, thereby reducing interference, enhancing spatial diversity, and improving the overall system performance. It can be observed that the proposed approach consistently outperforms PPO, A2C, and A3C across all user settings. For instance, with 20 users, the proposed approach achieves performance gains of approximately 9.0% over PPO, 14.9% over A2C, and 14.9% over A3C. This clearly highlights the effectiveness of the proposed framework in handling interference while leveraging spatial adaptability. Furthermore, as the number of users increases, the advantages of RSMA become more pronounced in the downlink, since its flexible message-splitting strategy efficiently accommodates multiple user transmissions without significant degradation in performance. Overall, the proposed framework effectively mitigates the adverse effects of user densification, ensuring scalable and high-throughput downlink operation.
The impact of the BS transmit power on the system sum-rate is evaluated in Figure 7. The results reveal a significant increase in sum-rate with higher BS transmit power, primarily due to the enhanced channel adaptability provided by the FAS deployed at both the BS and user terminals. FAS enables dynamic reconfiguration of antenna positions, offering increased spatial diversity and improved beamforming precision, which significantly boosts the SINR across all users. Coupled with RSMA’s rate-splitting mechanism, which allows users to partially decode the common data stream and efficiently manage interference, the system can exploit FAS-enabled channel gains to maximize downlink transmission. For instance, at 8 dB transmit power, the proposed approach achieves a 4.2%, 6.4%, and 7.8% higher sum-rate than PPO, A2C, and A3C, respectively. This gain is achieved by the proposed approach through efficient exploration of the hybrid action space. The proposed approach therefore dynamically balances interference management and spatial diversity, enabling the agent to achieve higher SINR and more efficient resource utilization. Consequently, its sum-rate grows more rapidly with increasing transmit power than that of the benchmarks, whose less adaptive policy updates yield suboptimal resource allocation.
In Figure 8, the sensing SINR decreases as the communication rate threshold increases. At low rate thresholds, communication QoS requirements can be satisfied with a small fraction of the transmit power, leaving more power for sensing, which enhances the sensing SINR. In the proposed approach, the common message stream acts as an interference suppression layer, allowing the system to exploit additional sensing DoF without compromising communication reliability. Private streams require less aggressive power boosting in this regime, further conserving resources for sensing. Moreover, the deployment of FAS at the BS and users dynamically reconfigures antenna positions to select favorable channel states, minimizing inter-user interference and maximizing spatial diversity, which further improves sensing performance. It can be seen that the proposed approach achieves 31.2%, 74.1%, and 88.0% higher sensing SINR than PPO, A2C, and A3C, respectively, effectively balancing the trade-off between communication and sensing. When the rate threshold increases, more power must be allocated to the RSMA common and private streams to meet communication demands, reducing the power available for sensing. FAS can partially compensate by adjusting antenna positions, but its effectiveness is limited by the reduced sensing power budget. Consequently, low rate thresholds yield higher sensing SINR and larger sensing DoF, whereas high rate thresholds prioritize communication at the expense of sensing, with FAS still providing some resilience by improving spatial channel conditions.
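The power-budget argument can be illustrated with a deliberately simplified single-user model. The assumptions are ours and not part of the system model: one interference-free link with effective gain `gain`, a fixed total power, and sensing performance that scales with the power left over after the rate threshold is met. The function names are hypothetical.

```python
def min_comm_power(rate_threshold, gain=1.0, noise=1.0):
    """Smallest power p such that log2(1 + gain * p / noise) >= rate_threshold."""
    return (2.0 ** rate_threshold - 1.0) * noise / gain

def sensing_power_left(rate_threshold, total_power, gain=1.0, noise=1.0):
    """Power remaining for sensing once the communication threshold is satisfied."""
    required = min_comm_power(rate_threshold, gain, noise)
    return max(0.0, total_power - required)

# The required communication power grows exponentially with the threshold,
# so the sensing budget shrinks quickly as the threshold rises.
for r in (0.5, 1.0, 2.0):
    print(r, sensing_power_left(r, total_power=4.0))

# A better effective channel gain (the role attributed to FAS) lowers the
# required communication power and returns some budget to sensing.
print(min_comm_power(1.0, gain=2.0))
```

In this toy model the sensing budget reaches zero once the threshold exceeds log2(1 + gain × total power / noise), which mirrors the qualitative trend described for Figure 8 without reproducing its values.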
In Figure 9, the impact of varying bandwidth on the system sum-rate is analyzed. The results demonstrate that the sum-rate increases with bandwidth, consistent with the theoretical relationship between throughput and spectrum size. At low bandwidths, the sum-rate rises rapidly, as additional spectrum allows more bits to be transmitted per unit time while maintaining relatively high SINR per subcarrier. As the bandwidth increases further, the growth rate slows due to the fixed total transmit power being spread across more subcarriers, reducing the per-subcarrier SINR and increasing inter-user interference. The proposed RSMA-FAS framework effectively addresses these limitations. FAS at both the BS and user sides dynamically reconfigures antenna positions to exploit favorable spatial channels, reducing interference and enhancing channel diversity. Moreover, the RSMA common stream acts as an interference mitigation layer, allowing the system to leverage additional spatial DoF while meeting user-specific communication requirements. In addition, the proposed HDRL approach enables autonomous and adaptive optimization of both antenna configurations and power allocation, allowing the agent to dynamically balance spatial diversity, interference management, and per-user rate requirements. At a bandwidth of 6 MHz, the proposed method achieves 12%, 13.3%, and 14.2% higher sum-rates than PPO, A2C, and A3C, respectively, with the most pronounced gains observed at low-to-moderate bandwidths. This performance advantage arises because HDRL jointly optimizes the hybrid action space, effectively exploiting FAS-enabled channel variations and RSMA’s interference mitigation to maximize the SINR across subcarriers. These results underscore the effective interplay between dynamic antenna positioning, RSMA-based message splitting, and HDRL-driven adaptive control in enhancing spectral efficiency and sum-rate under varying bandwidth conditions.
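The diminishing return with bandwidth follows from the Shannon rate under a fixed power budget. The sketch below assumes a single interference-free link with flat noise power spectral density, so it captures only the power-spreading effect and not the multiuser interference of the actual system; the function name and numerical values are illustrative.

```python
import math

def shannon_rate(bandwidth_hz, power_w, noise_psd):
    """Rate in bit/s for fixed total power spread evenly over the bandwidth."""
    snr = power_w / (noise_psd * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)

power, noise_psd = 1.0, 1.0  # normalized, illustrative values
r1, r2, r4 = (shannon_rate(b, power, noise_psd) for b in (1.0, 2.0, 4.0))
# As bandwidth grows, the rate approaches power / (noise_psd * ln 2) from below.
limit = power / (noise_psd * math.log(2.0))
print(r1, r2, r4, limit)
```

Each doubling of the bandwidth adds less rate than the previous one, because the SNR per unit bandwidth halves while the total power stays fixed.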