Engineering Proceedings
  • Proceeding Paper
  • Open Access

6 February 2026

Subchannel Allocation in Massive Multiple-Input Multiple-Output Orthogonal Frequency-Division Multiple Access and Hybrid Beamforming Systems with Deep Reinforcement Learning †

and
Department of Communication Engineering, National Central University, Taoyuan 320, Taiwan
*
Author to whom correspondence should be addressed.
Presented at 8th International Conference on Knowledge Innovation and Invention 2025 (ICKII 2025), Fukuoka, Japan, 22–24 August 2025.
Eng. Proc. 2025, 120(1), 55; https://doi.org/10.3390/engproc2025120055
(registering DOI)
This article belongs to the Proceedings 8th International Conference on Knowledge Innovation and Invention

Abstract

In this study, we emphasize that the maximum sum rate can be achieved through AI-based subchannel allocation, while taking into account all users’ quality of service (QoS) requirements in data rates for hybrid beamforming systems. We assume a limited number of radio frequency (RF) chains in practical hybrid beamforming architectures. This constraint makes subchannel allocation a critical aspect of hybrid beamforming in massive multiple-input multiple-output (MIMO) systems with orthogonal frequency division multiple access (MIMO-OFDMA), as it enables the system to serve more users within a single time slot. Unlike conventional subcarrier allocation methods, we employ a deep reinforcement learning (DRL)-based algorithm to address real-time decision-making challenges. Specifically, we propose a dueling double deep Q-network (Dueling-DDQN) to implement dynamic subchannel allocation. Simulation results demonstrate that the performance of the proposed algorithm gradually approaches that of the greedy method. Furthermore, both the average sum rate and the average spectral efficiency per user improve with a reasonable variation in outage probability.

1. Introduction

To meet users’ requirements, millimeter-wave (mmWave) technology, operating in the 30 to 300 GHz frequency band, has emerged as a promising candidate and has been widely discussed in 5G communication systems [1,2]. However, due to its high carrier frequency, mmWave signals suffer from severe path loss. Fortunately, the short wavelength of mmWave allows for dense antenna array configurations within a limited physical space. This enables significant beamforming gain through large antenna arrays, which can compensate for the high path loss [3,4,5], making massive multiple-input multiple-output (MIMO) an attractive solution for mmWave systems.
Traditional MIMO systems typically employ a fully digital architecture at the baseband for beamforming. While effective, this approach is power-hungry and costly, particularly for massive MIMO and mmWave systems, as it requires a dedicated radio frequency (RF) chain for each antenna element [6]. To address this issue, hybrid beamforming architectures have been proposed [7,8].
In addition, the limited bandwidth must be shared among multiple users. To enhance data rates in wireless communication, orthogonal frequency division multiple access (OFDMA) combined with space division multiple access (SDMA) has been proposed [9]. OFDMA with hybrid beamforming can select the best set of users to serve within the same resource block (RB) and adjust the beamforming weight vectors to minimize interference among users sharing the same subcarrier.
Undoubtedly, optimizing resource allocation across frequency, space, and time is a highly complex problem. To tackle this, various AI-based approaches have been explored. However, implementing real-time subchannel allocation in complex communication systems remains an open challenge. Motivated by the capabilities of AI, we propose an AI-based approach to solve the dynamic subchannel allocation problem. Specifically, we employ a deep reinforcement learning (DRL) algorithm to handle subchannel scheduling in a massive MIMO-OFDMA system with hybrid beamforming, aiming to maximize the overall data rate while also satisfying each user’s quality of service (QoS) requirements. Nevertheless, designing an appropriate network architecture and tuning hyperparameters to suit the specific scenario remains a significant challenge.

2. System Model

We consider the downlink of an mmWave large-scale MIMO-OFDMA system with a fully connected hybrid beamforming design over a frequency-selective channel. The base station (BS) is equipped with $N_t$ antennas and $N_t^{RF}$ RF chains and transmits $N_t^{s}$ data streams simultaneously on each subcarrier [10,11]. The total number of subcarriers is denoted by $K$. A total of $N$ users are served in each time slot, each equipped with a single antenna. The number of RF chains cannot exceed the number of transceiver antennas, i.e., $N_t^{s} \le N_t^{RF} \le N_t$ and $N_r^{s} \le N_r^{RF} \le N_r$. We assume that the number of scheduled users $U$ per resource block (RB) equals $N_t^{RF}$ and that these users are selected from the set of all $N$ users, i.e., $U = N_t^{RF} < N$. Let $\mathcal{F}_b$ denote the set of users scheduled in the $b$th RB, where $b = 1, \dots, B$ and $|\mathcal{F}_b| = U$. Furthermore, all $N$ users must be served in at least one RB per time slot. Therefore, in our simulation setup, we assume that $U = N_t^{s} = N_t^{RF} \le N_t$, with the hybrid beamforming structure employed at the transmitter and single-antenna multi-user terminals at the receiver.
Although subchannel allocation is implemented in units of RBs indexed by $k$, the digital precoder is designed for each subcarrier. At each subcarrier $k$, where $k = 1, \dots, K$, the $N_t^{s}$ data symbols $\mathbf{s}[k] \in \mathbb{C}^{N_t^{s} \times 1}$ are first processed by a low-dimensional digital precoder $\mathbf{F}_{BB}[k] \in \mathbb{C}^{N_t^{RF} \times N_t^{s}}$. The digitally precoded signals are then transformed into the time domain using $N_t^{RF}$ $K$-point inverse fast Fourier transforms (IFFTs). After this, cyclic prefixes (CPs) are added to each stream to mitigate inter-symbol interference (ISI). Following the CP addition, the signal is passed through a high-dimensional analog precoder $\mathbf{F}_{RF} \in \mathbb{C}^{N_t \times N_t^{RF}}$. In a fully connected structure, each RF chain is connected to all antennas via phase shifters. Since the analog precoder operates in the time domain, it remains identical across all subcarriers. Finally, the transmitted signal at subcarrier $k$ is expressed as:
$$\mathbf{x}[k] = \mathbf{F}_{RF}\,\mathbf{F}_{BB}[k]\,\mathbf{s}[k] = \sum_{u=1}^{U} \mathbf{F}_{RF}\,\mathbf{f}_{BB_u}[k]\,s_u[k] \tag{1}$$
where $\mathbf{s}[k] = [s_1[k], s_2[k], \dots, s_U[k]]^T$ collects the data symbols of the $U$ served users at the $k$th subcarrier/RB. We assume that the total transmit power $P$ is equally divided among the symbols, i.e., $\mathbb{E}\{\mathbf{s}[k]\,\mathbf{s}^H[k]\} = \frac{P}{UK}\,\mathbf{I}_{N_s}$; $s_u[k] \in \mathbb{C}$ is the data symbol of user $u$ in $\mathcal{F}_b$ at the subcarrier/RB $k$. Next, $\mathbf{F}_{BB}[k] = [\mathbf{f}_{BB_1}[k], \mathbf{f}_{BB_2}[k], \dots, \mathbf{f}_{BB_U}[k]]$ is the digital precoder matrix at the $k$th subcarrier/RB for the $U$ served users, and $\mathbf{f}_{BB_u}[k] \in \mathbb{C}^{N_t^{RF} \times 1}$ is the digital precoder for user $u$. The received signal for the $n$th user at the $k$th subcarrier/RB is given by
$$y_n[k] = \mathbf{h}_n^H[k]\,\mathbf{x}[k] + z_n[k] \tag{2}$$
where $z_n[k] \sim \mathcal{CN}(0, \sigma^2)$ and $\mathbf{h}_n^H[k] \in \mathbb{C}^{1 \times N_t}$ are the additive white Gaussian noise (AWGN) and the channel vector at the $k$th subcarrier for user $n$, respectively.
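As a rough illustration of the signal model in (1) and (2), the following NumPy sketch builds one subcarrier's transmit vector and the received samples for the served users. The random channel, the random digital precoder, the noise variance, and the helper `crandn` are placeholder assumptions for illustration only; they are not the channel model or the precoder design used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_t, N_rf, U = 16, 2, 2      # antennas, RF chains, users per RB (values from Section 4)
K = 48                       # subcarriers: assumed 4 RBs x 12 subcarriers
P, sigma2 = 1.0, 1e-3        # total power and noise variance (illustrative assumptions)

def crandn(*shape):
    """Hypothetical helper: unit-variance circularly symmetric complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# Analog precoder: phase-only entries, scaled so every column has unit norm as in (3b).
F_rf = np.exp(1j * rng.uniform(0, 2 * np.pi, (N_t, N_rf))) / np.sqrt(N_t)

F_bb = crandn(N_rf, U)                  # placeholder digital precoder for one subcarrier
s = np.sqrt(P / (U * K)) * crandn(U)    # symbols with E[s s^H] = P/(UK) I
h = crandn(U, N_t)                      # row n plays the role of h_n^H[k] (placeholder)

x = F_rf @ F_bb @ s                                       # Eq. (1), matrix form
x_sum = sum(F_rf @ F_bb[:, u] * s[u] for u in range(U))   # Eq. (1), per-user sum
z = np.sqrt(sigma2) * crandn(U)                           # AWGN samples
y = h @ x + z                                             # Eq. (2) for the U served users
```

The two forms of (1) agree numerically, which is a quick sanity check on the dimensions of the analog and digital precoders.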
The main goal of this paper is to maximize the sum rate while considering the quality of service (QoS) for each user's data rate through dynamic subcarrier/RB allocation. The problem formulation can be expressed as:
$$\max_{\rho_{n,k},\ \mathbf{F}_{RF},\ \mathbf{F}_{BB}[k]} \ \sum_{k=1}^{K} \sum_{n=1}^{N} R_n[k] \tag{3a}$$
$$\text{s.t.} \quad \mathbf{a}_r^H \mathbf{a}_r = 1, \quad r = 1, \dots, N_t^{RF}; \tag{3b}$$
$$\left\| \mathbf{F}_{RF}\,\mathbf{f}_{BB_n}[k] \right\|^2 = 1, \quad k = 1, \dots, K, \quad n = 1, \dots, N; \tag{3c}$$
$$\Omega \le \sum_{k=1}^{K} R_n[k], \quad n = 1, \dots, N; \tag{3d}$$
where (3b) represents the hardware constraint ensuring that each column of the analog precoder has unit norm, and $\mathbf{F}_{RF} = [\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_{N_t^{RF}}]$. Equation (3c) is the transmit power budget constraint for each subcarrier and user, and $\Omega$ is the data rate constraint for each user in (3d). In (3a), it is important to note that $\mathbf{F}_{RF}$ and $\mathbf{F}_{BB}[k]$ are designed as in [12]. Furthermore, $R_n[k]$ denotes the data rate for user $n$ at subcarrier $k$, and the rate can be expressed as
$$R_n[k] = B_{SCS} \log_2\left(1 + \mathrm{SINR}_n[k]\right) \tag{4}$$
where $B_{SCS}$ is the subcarrier spacing (SCS). In the subchannel allocation strategy for maximizing the sum rate under deep fading, we define the outage probability as follows.
$$p_{out}(t) = \mathrm{Prob}\{ N_{out}(t) \ge 1 \} \tag{5}$$
where $N_{out}(t)$ is the number of users whose achievable data rates fall below the target constraint $\Omega$ at slot $t$.
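The rate in (4), the QoS check in (3d), and the outage event in (5) can be sketched as follows. This is a minimal illustration under stated assumptions: the allocation is passed as a 0/1 matrix (our reading of the indicator $\rho_{n,k}$, which the text does not define explicitly), the outage probability is estimated as the empirical fraction of slots with at least one user below $\Omega$, and the function names are hypothetical.

```python
import numpy as np

def user_rates(sinr, alloc, b_scs=60e3):
    """Per-user rate in bit/s, summing Eq. (4) over the allocated subcarriers.

    sinr  : (N, K) linear SINR per user and subcarrier
    alloc : (N, K) 0/1 matrix, 1 if the subcarrier is assigned to the user (assumed form)
    """
    return (alloc * b_scs * np.log2(1.0 + sinr)).sum(axis=1)

def outage_count(rates, omega):
    """N_out: number of users whose rate falls below the target omega."""
    return int(np.sum(np.asarray(rates) < omega))

def outage_probability(rate_history, omega):
    """Empirical estimate of Eq. (5): fraction of slots with N_out >= 1."""
    flags = [outage_count(r, omega) >= 1 for r in rate_history]
    return float(np.mean(flags))

# Toy usage: two users, two subcarriers, SINR of 1 (0 dB) everywhere.
sinr = np.ones((2, 2))
alloc = np.array([[1, 1], [1, 0]])
rates = user_rates(sinr, alloc)
```

With an SINR of 1, each allocated subcarrier contributes exactly one bit per second per hertz of subcarrier spacing, so the toy rates are easy to check by hand.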

3. Dynamic Subchannel Allocation with DRL-Based Method

We define the states, actions, rewards, and next states for our proposed method. The downlink transmissions are organized into frames of 10 ms, and each frame contains 10 subframes [13].
  • States ($s_t$): In the proposed algorithm, $s_t$ is a set of vectors and scalars carrying time-variant features that serves as the input to the network. We define $s_t = \{\mathbf{pl}_t, \mathbf{g}_t, \mathbf{dr}_t, \mathbf{d}_t, o_t, a_t, sr_t\}$, where $\mathbf{pl}_t = [1/pl_1(t), \dots, 1/pl_n(t), \dots, 1/pl_N(t)]$ and $pl_n(t)$ represents the path loss of the $n$th user at the $t$th time slot. We assume that each user's path loss stays constant over all time slots of an episode and is updated only at the beginning of the next episode. Next, $\mathbf{g}_t = [\mathbf{g}_1^T(t), \dots, \mathbf{g}_n^T(t), \dots, \mathbf{g}_N^T(t)] \in \mathbb{C}^{1 \times KN}$ is the vector representing each user's channel state, where $\mathbf{g}_n^T(t) = [g_{n,1}(t), \dots, g_{n,k}(t), \dots, g_{n,K}(t)] \in \mathbb{C}^{1 \times K}$. We preprocess the channel gains of all subcarriers of each user before feeding them into the neural network; $g_{n,k}(t)$ is computed as
$$g_{n,k}(t) = 10 \log_{10} \left\| \mathbf{h}_n^H[k] \right\|^2 - \mathrm{avg\_p}_n \tag{6}$$
where $\mathrm{avg\_p}_n$ is the average power across all subcarriers for the $n$th user, expressed as $\sum_{k=1}^{K} 10 \log_{10} \left\| \mathbf{h}_n^H[k] \right\|^2 / K$. Thus, $g_{n,k}(t)$ represents the power difference between subcarrier $k$ of the $n$th user and the average power of all subcarriers. $\mathbf{dr}_t$ is the data rate of each user after the agent takes the action at the $t$th time slot, denoted as $\mathbf{dr}_t = [dr_1(t), \dots, dr_n(t), \dots, dr_N(t)]$. At the beginning of each episode, all elements of $\mathbf{dr}_t$ are initialized to 0, as no action has been taken yet. $\mathbf{d}_t$ is an indicator vector that records whether each user satisfies the QoS requirement, denoted by $\mathbf{d}_t = [d_1(t), \dots, d_n(t), \dots, d_N(t)]$. If $d_n(t) = 1$, the data rate constraint of the $n$th user is satisfied under the current subchannel allocation; otherwise, $d_n(t) = 0$. All elements of $\mathbf{d}_t$ are set to −1 at the beginning of the episode. $o_t$ is the outage probability of an episode. An outage is counted when one of the served users cannot meet the data rate constraint at the $t$th time slot. $a_t$ and $sr_t$ are the current action chosen by the agent and the sum rate of all users after taking $a_t$, respectively. The action $a_t$ is a number representing a particular subchannel allocation, and $a_t = -1$ corresponds to a non-allocation state. At the start of the episode, $sr_t = 0$ because the agent has not taken any action.
  • Actions ($a_t$): We treat each possible subchannel assignment as an action and apply the allocation at each time slot. For partial allocation, the action set is a randomly generated subset of allocations rather than all possible subchannel allocations. For full allocation, the action set includes all possible subchannel allocations. According to [13], an RB consists of 12 consecutive subcarriers in the frequency domain. For simplicity in this simulation, we assume that each RB group (RBG) contains only one RB, rather than multiple RBs as defined in the specification. We assume that each user in the set $\mathcal{F}_b$ is served by at least one RB in each time slot. Therefore, we eliminate the options in which a user is not allocated any RB. In summary, we represent $a_t$ as a number corresponding to a bitmap that indicates the subchannel allocation, and the agent selects $a_t$ based on the current $s_t$ at each time slot $t$.
  • Rewards ($r_t$): We define a reward function to guide the agent, rewarding correct actions and penalizing irregular ones. The agent's goal is to maximize the long-term expected cumulative reward. The reward function is expressed as:
$$r_t = \begin{cases} \sum_{n=1}^{N} f_{pr}\left(dr_n(t) - \Omega\right), & \text{if } \sum_{n=1}^{N} d_n(t) = N \\ f_{nr}(o_t), & \text{if } \sum_{n=1}^{N} d_n(t) < N \end{cases} \tag{7}$$
where $f_{pr}(\cdot)$ and $f_{nr}(\cdot)$ are the positive reward and negative reward functions, respectively. In (7), if the data rates of all users reach $\Omega$ after the agent takes $a_t$, the function $f_{pr}(\cdot)$ generates a positive reward. Each user's positive reward increases with that user's data rate, and the outputs of $f_{pr}(\cdot)$ for all users are summed to give the current reward $r_t$. If a user's data rate does not meet the constraint after taking action $a_t$, the function $f_{nr}(\cdot)$ generates a negative reward. The negative reward depends on the outage probability $o_t$, and the reward becomes more negative as the outage probability increases.
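The state preprocessing in (6), the action set, and the reward in (7) can be sketched together as below. Several pieces are assumptions rather than details from the paper: the shapes of $f_{pr}$ and $f_{nr}$ are not given in the text, so the defaults here are hypothetical placeholders that only respect the stated monotonicity; the action enumeration assumes that every RB serves exactly $U$ users; and all function names are illustrative.

```python
from itertools import combinations, product
import numpy as np

def normalized_gains_db(h_power):
    """Eq. (6): per-subcarrier power in dB minus the user's mean dB power.

    h_power : (N, K) array holding ||h_n^H[k]||^2 for each user and subcarrier
    """
    p_db = 10.0 * np.log10(h_power)
    return p_db - p_db.mean(axis=1, keepdims=True)

def enumerate_actions(n_users, n_rbs, users_per_rb):
    """Full action set: one user group per RB, keeping only allocations
    in which every user is served on at least one RB."""
    groups = list(combinations(range(n_users), users_per_rb))
    actions = []
    for choice in product(groups, repeat=n_rbs):
        served = set().union(*choice)
        if len(served) == n_users:
            actions.append(choice)
    return actions

def reward(rates, omega, outage_prob,
           f_pr=lambda margin: 1.0 + margin,   # hypothetical: grows with the rate margin
           f_nr=lambda o: -(1.0 + o)):         # hypothetical: more negative as outage grows
    """Eq. (7): summed positive reward if all users meet omega, else a penalty."""
    rates = np.asarray(rates, dtype=float)
    d = (rates >= omega).astype(int)           # indicator vector d_t
    if d.sum() == len(rates):
        return float(sum(f_pr(r - omega) for r in rates))
    return float(f_nr(outage_prob))

# Setting from Section 4: N = 3 users, BWP = 4 RBs, U = 2 users per RB.
actions = enumerate_actions(n_users=3, n_rbs=4, users_per_rb=2)
```

Under these assumptions there are 3 candidate user pairs per RB, giving 3^4 = 81 allocations, of which the 3 that repeat the same pair on every RB leave one user unserved, so 78 actions remain. A partial action set would then be a random subset of this list.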

Dueling-DDQN Algorithm and Network Structure

In the proposed Dueling-DDQN structure, the weights and biases of the neural network are updated via backpropagation (BP) to minimize the loss value in the loss function. The specific structure of the proposed Dueling-DDQN is shown in Figure 1, where “Act” is an activation function. The activation function used is the rectified linear unit (ReLU). The number of outputs corresponds to the size of the action set, and each output represents the expected cumulative reward for that action.
Figure 1. Neural network structure of the proposed Dueling-DDQN.
The output layer of the Dueling-DDQN has two streams: one for estimating the state value (a scalar) and another for estimating the advantages of each action. The blue line in Figure 1 combines these two streams using the following equation:
$$Q(s_t, a_t; \theta) = V(s_t; \theta) + \left( A(s_t, a_t; \theta) - \frac{1}{|\mathcal{A}|} \sum_{a_t' \in \mathcal{A}} A(s_t, a_t'; \theta) \right) \tag{8}$$
where $V(s_t; \theta)$ estimates the state value and outputs a scalar, and $A(s_t, a_t; \theta)$ estimates the advantage of each action and outputs a vector whose size equals that of the action set. Further, $\theta$ denotes the parameters of the Q-network, and $|\mathcal{A}|$ is the total number of actions. The specific process of the complete model is shown in Figure 2. Due to limited space, we omit the detailed discussion.
Figure 2. Workflow of Dueling-DDQN algorithm.
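The aggregation of the value and advantage streams can be sketched in a few lines of NumPy. The second function shows the standard double DQN target, in which the online network selects the next action and the target network evaluates it; the paper does not spell out its target computation, so that part, the discount factor `gamma`, and the function names are assumptions for illustration.

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine the two streams: Q = V + (A - mean of A over actions).

    value      : (batch, 1) state-value stream output
    advantages : (batch, n_actions) advantage stream output
    """
    return value + advantages - advantages.mean(axis=1, keepdims=True)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.9, done=False):
    """Standard double DQN target (assumed here, not taken from the paper)."""
    a_star = int(np.argmax(q_online_next))     # action chosen by the online network
    bootstrap = 0.0 if done else gamma * q_target_next[a_star]
    return r + bootstrap

v = np.array([[2.0]])
adv = np.array([[1.0, 0.0, -1.0, 4.0]])
q = dueling_q(v, adv)
```

Subtracting the mean advantage makes the decomposition identifiable: the Q-values average to the state value, while the ranking of actions is still determined by the advantage stream.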
For the hybrid beamforming design, we apply the signal-to-leakage-plus-noise ratio (SLNR) approach, which is commonly used in MU-MIMO beamformer design to mitigate the challenges of non-convex design problems [12]. We use the SLNR method to design an analog precoder and the zero-forcing (ZF) method to design a digital precoder. Due to space constraints, we also omit the details of this part, which will be discussed in the full version of the journal paper.
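Since the precoder details are omitted, the following is only a generic zero-forcing sketch for the digital stage, not the authors' design. It assumes the effective baseband channel is the product of the user channels and the analog precoder, uses a random phase-only analog precoder in place of the SLNR-based one, and rescales each column to satisfy the power constraint in (3c). The function name is hypothetical.

```python
import numpy as np

def zf_digital_precoder(H, F_rf):
    """Generic zero-forcing digital precoder on the effective channel.

    H    : (U, N_t) matrix whose row u is h_u^H[k]
    F_rf : (N_t, N_rf) analog precoder
    Returns F_bb of shape (N_rf, U) with ||F_rf f_bb_u||^2 = 1 for every user.
    """
    H_eff = H @ F_rf                 # effective baseband channel, (U, N_rf)
    F_bb = np.linalg.pinv(H_eff)     # right inverse when H_eff has full row rank
    # Rescale each column so the combined precoder meets constraint (3c).
    F_bb = F_bb / np.linalg.norm(F_rf @ F_bb, axis=0, keepdims=True)
    return F_bb

rng = np.random.default_rng(1)
U, N_t, N_rf = 2, 16, 2
H = (rng.standard_normal((U, N_t)) + 1j * rng.standard_normal((U, N_t))) / np.sqrt(2)
F_rf = np.exp(1j * rng.uniform(0, 2 * np.pi, (N_t, N_rf))) / np.sqrt(N_t)  # placeholder
F_bb = zf_digital_precoder(H, F_rf)
G = H @ F_rf @ F_bb                  # off-diagonal entries are inter-user interference
```

With zero forcing, the off-diagonal entries of `G` vanish up to numerical precision, so users sharing a subcarrier do not interfere with one another in this idealized setting.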

4. Simulation Results

The bandwidth part (BWP) size refers to the number of resource blocks (RBs) for a given bandwidth range, while the RBG size represents the number of RBs in a group. We assume a subcarrier spacing (SCS) of 60 kHz, as this value is suitable for both frequency range 1 (FR1) and frequency range 2 (FR2) [14]. Additionally, subchannel allocation is performed at each time slot, with a total of 40 slots in one radio frame. The environment is modeled with two clusters and ten rays, where each cluster has an angular spread of 10 degrees, and the cluster angle range is [0, 2π). The path loss $pl_n$ for the $n$th user is generated using the non-line-of-sight (NLOS) urban microcell (UMi)-Street Canyon scenario described in [15]. Unless otherwise stated, the following parameters are used in the numerical simulations: $N_t = 16$, $N_t^{RF} = U = 2$, $N = 3$, and $BWP = 4$.
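The slot count follows directly from the 5G NR numerology, as the short calculation below shows. The subcarrier total is an assumption derived from the stated BWP of 4 RBs and 12 subcarriers per RB; the text does not state $K$ explicitly for the simulation.

```python
scs_khz = 60
# NR numerology: SCS = 15 * 2^mu kHz, with 2^mu slots per 1 ms subframe.
mu = {15: 0, 30: 1, 60: 2, 120: 3}[scs_khz]
slots_per_subframe = 2 ** mu
slots_per_frame = 10 * slots_per_subframe      # 10 subframes per 10 ms frame

bwp_rbs = 4
subcarriers = bwp_rbs * 12                     # assumed K: 12 subcarriers per RB
rb_bandwidth_hz = 12 * scs_khz * 1e3           # bandwidth occupied by one RB

print(slots_per_frame, subcarriers, rb_bandwidth_hz)
```

One episode (one radio frame) therefore corresponds to 40 allocation decisions, each over 4 RBs of 720 kHz.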
Figure 3 illustrates the average data rate based on the agent’s policy in a fixed path loss environment. The average data rate for each user is computed for one episode, i.e., one radio frame. In this experiment, a random episode is selected from the testing data, and this selection is fixed for each test, as previously mentioned. The testing results are recorded after 5, 50, and 100 training epochs. As shown in Figure 3, the sum rate increases as the training steps progress for the observed episode. However, this allocation may slightly increase the outage probability, as the spectrum resources available to disadvantaged users may not be sufficient to meet their data rate constraints. Even when the policy allocates most of the subchannels to these users, they may still fail to achieve the required data rate in some cases. As a result, the agent prioritizes increasing spectral efficiency to raise the sum rate, rather than allocating most resources to disadvantaged users.
Figure 3. Comparison among all the users’ testing performance after different training epochs in the fixed path loss value case.
The outage probability during testing is shown in Figure 4, where each circular marker on the curve represents the outage probability of a single test run. The average curve is computed by averaging the outage probabilities over every 10 test runs. It should be noted that the experiments shown in Figure 3 focus on the state of a single radio frame. To demonstrate the stable performance of the proposed algorithm, we test the well-trained network on identical testing data.
Figure 4. Outage probability of the proposed algorithm in testing.
The performance of the proposed algorithm and the compared methods is shown in Figure 5. The greedy line and the random line are computed over the testing set. The greedy method tests all subchannel allocations and selects the action that avoids outage. It then chooses the action that results in the highest sum rate from the non-outaging actions. In contrast, the random method randomly selects an action from all possible actions. If the channel state is suboptimal and none of the allocations can meet each user’s QoS in a radio frame, the greedy-allocation method selects an action from the ineligible options, while the random-allocation method picks an action at random. Similarly, for both methods, the channel parameters vary with testing steps, though the path loss values remain fixed. As observed, the performance of the partial allocation method is comparable to that of the full allocation method, with both curves closely matching the greedy line after training on the testing set. This suggests that the algorithm is capable of handling large subchannel allocations in real-world scenarios and demonstrates good generalization.
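The two baselines described above can be sketched as follows. The fallback used when no action avoids outage is an assumption: the text says only that the greedy method then selects among the ineligible options, so this sketch picks the one with the highest sum rate. Function names are illustrative.

```python
import numpy as np

def greedy_action(rates_per_action, omega):
    """Pick the non-outage action with the highest sum rate.

    rates_per_action : (A, N) per-user rates that each candidate action would yield
    """
    rates = np.asarray(rates_per_action, dtype=float)
    feasible = np.all(rates >= omega, axis=1)   # actions where every user meets omega
    sum_rate = rates.sum(axis=1)
    if feasible.any():
        return int(np.argmax(np.where(feasible, sum_rate, -np.inf)))
    # Assumed fallback: no action avoids outage, so take the best sum rate overall.
    return int(np.argmax(sum_rate))

def random_action(n_actions, rng):
    """Uniformly random choice over the action set."""
    return int(rng.integers(n_actions))

candidate_rates = [[5.0, 0.5], [2.0, 2.0], [3.0, 1.5]]
best = greedy_action(candidate_rates, omega=1.0)
```

In the toy example, the first action has the largest sum rate but leaves one user below the target, so the greedy rule skips it in favor of the best feasible action. Because the greedy method evaluates every action in every slot, its cost grows with the size of the action set, which is the motivation for a learned policy.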
Figure 5. Testing curve of different methods in average data rate.

5. Conclusions

In this study, we consider hybrid beamforming and AI-based subchannel allocation for mmWave massive MIMO-OFDMA systems. The simulation results verify that the proposed algorithm gradually approaches the performance of greedy allocation in testing for both architectures. The results also demonstrate its stability and generalization. In addition, we evaluate both full and partial assignment for subchannel allocation; the results show that the proposed algorithm achieves comparable performance even when only a subset of all possible assignments is used as the action set. Finally, based on the scenarios considered, we infer that subchannel allocation with the Dueling-DDQN algorithm could be applied to practical mmWave massive MIMO-OFDMA systems with hybrid beamforming.

Author Contributions

Conceptualization, Y.-F.C.; methodology, J.-W.L. and Y.-F.C.; software, J.-W.L.; validation, J.-W.L. and Y.-F.C.; formal analysis, J.-W.L.; investigation, J.-W.L.; resources, J.-W.L. and Y.-F.C.; data curation, Y.-F.C.; writing—original draft preparation, J.-W.L.; writing—review and editing, Y.-F.C.; visualization, J.-W.L.; supervision, Y.-F.C.; project administration, Y.-F.C.; funding acquisition, Y.-F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Science and Technology Council, Taiwan, R.O.C., under Grant NSTC 114-2221-E-008-057.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are generated through computer simulations. The original contributions presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Andrews, J.G.; Buzzi, S.; Choi, W.; Hanly, S.V.; Lozano, A.; Soong, A.C.K. What Will 5G Be? IEEE J. Sel. Areas Commun. 2014, 32, 1065–1082.
  2. Rappaport, T.S.; Sun, S.; Mayzus, R.; Zhao, H.; Azar, Y.; Wang, K. Millimeter Wave Mobile Communications for 5G Cellular: It Will Work! IEEE Access 2013, 1, 335–349.
  3. Hoseini, S.A.; Ding, M.; Hassan, M. Massive MIMO performance comparison of beamforming and multiplexing in the Terahertz band. In Proceedings of the 2017 IEEE Globecom Workshops (GC Wkshps), Singapore, 4–8 December 2017; pp. 1–6.
  4. Swindlehurst, A.; Ayanoglu, E.; Heydari, P.; Capolino, F. Millimeter-wave massive MIMO: The next wireless revolution? IEEE Commun. Mag. 2014, 52, 56–62.
  5. Lu, L.; Li, G.Y.; Swindlehurst, A.L.; Ashikhmin, A.; Zhang, R. An Overview of Massive MIMO: Benefits and Challenges. IEEE J. Sel. Top. Signal Process. 2014, 8, 742–758.
  6. Bogale, T.E.; Le, L.B.; Haghighat, A.; Vandendorpe, L. On the number of RF chains and phase shifters, and scheduling design with hybrid analog–digital beamforming. IEEE Trans. Wirel. Commun. 2016, 15, 3311–3326.
  7. Zhang, J.; Yu, X.; Letaief, K.B. Hybrid Beamforming for 5G and Beyond Millimeter-Wave Systems: A Holistic View. IEEE Open J. Commun. Soc. 2020, 1, 77–91.
  8. Molisch, A.F.; Ratnam, V.V.; Han, S.; Li, Z.; Le Hong Nguyen, S.; Li, L. Hybrid beamforming for massive MIMO: A survey. IEEE Commun. Mag. 2017, 55, 134–141.
  9. Maciel, T.F.; Klein, A. A resource allocation strategy for SDMA/OFDMA systems. In Proceedings of the 2007 16th IST Mobile and Wireless Communications Summit, Budapest, Hungary, 1–5 July 2007.
  10. Tseng, H.-H.; Chen, Y.F.; Tseng, S.-M. Hybrid Beamforming and Resource Allocation Designs for mmWave Multi-User Massive MIMO-OFDM Systems on Uplink. IEEE Access 2023, 11, 133070–133085.
  11. Chen, B.Y.; Chen, Y.F.; Tseng, S.M. Hybrid Beamforming and Data Stream Allocation Algorithms for Power Minimization in Multi-User Massive MIMO-OFDM Systems. IEEE Access 2022, 10, 101898–101912.
  12. Ha, V.N.; Nguyen, D.H.N.; Frigon, J. Subchannel allocation and hybrid precoding in millimeter-wave OFDMA systems. IEEE Trans. Wirel. Commun. 2018, 17, 5900–5914.
  13. 3GPP Technical Specification 38.211 v15.2.0, Physical channels and modulation (Release 15), 2018. Available online: https://www.3gpp.org/ftp/Specs/archive/38_series/38.211/ (accessed on 1 May 2024).
  14. 3GPP Technical Specification 38.104 v16.5.0, Base Station (BS) radio transmission and reception (Release 16), 2020. Available online: https://www.3gpp.org/ftp/Specs/archive/38_series/38.104 (accessed on 1 May 2024).
  15. 3GPP Technical Report 38.901 v16.1.0, Study on channel model for frequencies from 0.5 to 100 GHz (Release 16), 2020. Available online: https://www.3gpp.org/ftp/Specs/archive/38_series/38.901 (accessed on 1 May 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
