Article
Peer-Review Record

Enhancing Spectral Efficiency of 6G Downlink Beamforming via Cooperative Multi-Agent Deep Reinforcement Learning

Sensors 2026, 26(3), 950; https://doi.org/10.3390/s26030950
by Ali Al Janaby 1,*, Hussain Al-Rizzo 2 and Yahya Qassim 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 6 December 2025 / Revised: 12 January 2026 / Accepted: 19 January 2026 / Published: 2 February 2026
(This article belongs to the Special Issue Wireless Communication and Networking for IoT)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript proposes a combined scheme of multi-agent reinforcement learning and MVDR beamforming, which innovatively realizes the integration of 3D beam weight optimization and user-base station association. The work demonstrates a certain degree of workload and innovation, yet there remain the following issues to be addressed:

  1. The manuscript claims the scalability of base stations as a contribution, but this advantage is not highlighted in subsequent chapters, especially in the mathematical model and scenario setup sections. Additional descriptions should be added to emphasize this contribution.
  2. The manuscript mentions the joint optimization of azimuth and elevation angles, but the mathematical model only describes their respective calculation methods separately. Supplementary descriptions of the joint optimization mechanism are required.
  3. The simulation scenarios are overly simplistic and do not involve typical 6G application scenarios. Moreover, the performance metrics are monotonous, focusing only on SINR and throughput as core indicators. It is recommended to enrich the simulation scenarios and expand the set of performance metrics.
  4. The Related Works section is inadequately elaborated. Current content merely lists relevant studies in chronological order from 2021 to 2025. Revisions should clearly elaborate on the relevance of the manuscript’s optimization for 5G base station scenarios to 6G technologies. In addition, it is suggested to supplement the literature survey regarding performance gains achieved by leveraging multi-user interference, with reference to the following papers:

[1] "Achieving Positive Rate of Covert Communications Covered by Randomly Activated Overt Users," IEEE Transactions on Information Forensics and Security, vol. 20, pp. 2480-2495, 2025.

[2] "Achieving Covert Communication With a Probabilistic Jamming Strategy," IEEE Transactions on Information Forensics and Security, vol. 19, pp. 5561-5574, 2024.

  5. There are inconsistencies between some figures and textual descriptions. Specifically, Figure 2 suffers from typesetting flaws; Figures 4 and 5 have text distortion problems; the content in Table 1 is inconsistent with the 625 antenna elements specified in the system model section. It is recommended to verify and revise these parts thoroughly.

Author Response

January 7, 2026 
Dear Editors and Reviewers, 
We express our sincere gratitude to both reviewers for their comprehensive evaluation and 
constructive feedback. 
Their insightful comments on our manuscript entitled “Enhancing Spectral Efficiency of 5G 
Downlink Beamforming via Cooperative Multi-Agent Deep Reinforcement Learning” have been 
carefully addressed. We have revised the manuscript to address all comments and enhance clarity 
and technical rigor. 
Recognition 
The authors sincerely acknowledge the anonymous reviewers for their constructive comments 
and suggestions, which have significantly enhanced the content and quality of the original paper. 
We are attaching the original manuscript with red-marked changes, as well as a clean copy of the 
revised manuscript. 
Kind regards, 
Reviewer 1 
Question 1: The manuscript claims the scalability of base stations as a contribution, but this 
advantage is not highlighted in subsequent chapters, especially in the mathematical model and 
scenario setup sections. Additional descriptions should be added to emphasize this contribution. 
Answer: 
We appreciate the reviewer’s important observation. The manuscript has been revised to 
demonstrate the scalability contribution in the following ways more clearly: 
Explicit Reinforcement Learning and Scalability Clarification: 
The system model and algorithm sections provide a clear description of the cooperative multi-agent 
reinforcement learning (MARL) framework. Each base station is explicitly modeled as an 
independent learning agent, and the scalability of the proposed approach is demonstrated by the 
learning formulation’s independence from the number of base stations. The following 
explanation has been added below Figure 7 to emphasize the reinforcement learning and 
scalability clarification further: 
Finally, to validate the scalability of the proposed  algorithm, we analyzed the computational 
complexity per episode: 
• Per-BS complexity: O(N²) for MVDR computation + O(|A|·|S|) for Q-learning updates, 
where N = 625 antenna elements, |A| = 2 actions, |S| = 10 states 
• Total network complexity: O(B·N²) grows linearly with the number of base stations, not exponentially 
• Comparison with centralized approach: A centralized controller would require an O(B^U) 
state space (2^10 = 1024 states for our scenario versus 10 states per agent) 
For our 2-BS, 10-user scenario: 
• MVDR computation: ~0.8 ms per BS per iteration (625×625 matrix inversion) 
• Q-learning update: ~0.05 ms per user assignment 
• Total per-episode time: ~150 ms (150 iterations × 1 ms average) 
Scaling to 5 BSs and 25 users would require: 
• 5× MVDR computations (still parallelizable): ~4 ms 
• 2.5× Q-learning updates per BS: ~0.125 ms 
• Projected per-episode time: ~375 ms (maintaining near-linear scaling) 
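The state-space comparison and the projected timings quoted above reduce to simple arithmetic. The following Python sketch reproduces it, assuming the centralized state count is B^U (which matches the quoted 2^10 = 1024) and treating the quoted timings as given inputs rather than new measurements. The function names are hypothetical.

```python
def centralized_states(num_bs: int, num_users: int) -> int:
    """Joint assignment states if one controller tracks every user's BS."""
    return num_bs ** num_users


def per_agent_states(num_users: int) -> int:
    """Per-agent state count quoted in the response: one state per user."""
    return num_users


B, U = 2, 10
central = centralized_states(B, U)   # 2**10
local = per_agent_states(U)

# Quoted figures: 150 iterations at roughly 1 ms average per iteration.
iterations, avg_ms = 150, 1.0
episode_ms = iterations * avg_ms

# Quoted projection for 5 BSs and 25 users (taken from the response as-is).
projected_ms = 375.0
growth = projected_ms / episode_ms

print(central, local, episode_ms, growth)
```

Under these inputs the projected per-episode time grows by the same factor (2.5) as the number of base stations and users.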
The architecture’s scalability is further demonstrated by enabling agents to be pre-trained in 
smaller scenarios and subsequently deployed in larger networks without requiring full retraining. 
This leverages the generalization properties of the SINR-based reward function. The RL agent 
jointly optimizes both angles via the reward function: the elevation term explicitly accounts for 
vertical angle optimization, while the ΔSINR term captures the coupled effect of both angles on 
signal quality. The Q-learning update, therefore, learns to select beam directions that maximize 
joint azimuth-elevation performance. In summary: 
1. Coupled estimation: The beam scanning maximizes over the 2D (θ,φ) space jointly 
2. Integrated weights: MVDR weights encode both dimensions in a single 625-element 
vector 
3. Holistic reward: RL rewards depend on SINR, which inherently reflects the combined 3D 
beam accuracy 
4. Iterative refinement: Each Q-learning episode adjusts user assignments based on 3D beam performance, 
creating a feedback loop that jointly optimizes both angles. 
This differs from sequential optimization (optimize θ, then φ) in that our approach considers the 
interaction between azimuth and elevation throughout the optimization process. 
Question 2: The manuscript mentions joint optimization of azimuth and elevation angles, but the 
mathematical model describes their calculation methods separately. Supplementary descriptions 
of the joint optimization mechanism are required. 
We appreciate the reviewer’s important observation. 
Answer: 
Integration of RL and MVDR Beamforming: 
We have clarified the hierarchical relationship between reinforcement learning and MVDR 
beamforming. The learning agents optimize high-level decisions, such as user association and 3D 
beam selection, while MVDR analytically computes the beamforming weights. This distinction 
addresses the ambiguity identified by the reviewer. 
The joint optimization of azimuth and elevation angles is achieved through the integrated RL-MVDR 
framework as follows: 
Step 1: Coupled Angle Estimation: The 2D beam scanning estimator naturally couples azimuth 
and elevation through the spatial power spectrum, using the steering vector, which inherently 
represents the joint 3D direction. 
The grid search over φ ∈ [-90°, 90°] and θ ∈ [-30°, 30°] yields the coupled estimate. This is a 
joint optimization because the maximization occurs over the 2D angular space simultaneously, 
not separately for each dimension. 
Step 2: Joint MVDR Weight Calculation: The MVDR beamformer uses the jointly estimated 
angles to compute weights that account for both dimensions simultaneously. The 625-element 
weight vector (25×25 URA) implicitly encodes the joint spatial pattern, with the weights for each 
element depending on both θ and φ. 
Step 3: RL-Based Joint Refinement: The RL agent optimizes both angles jointly through the 
reward function  
The elevation term explicitly incorporates vertical angle optimization, while the ΔSINR term 
captures the coupled effect of both angles on signal quality. The Q-learning update thus learns to 
select beam directions that maximize joint azimuth-elevation performance. 
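The coupled angle estimation of Step 1 can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a URA in the y-z plane with half-wavelength spacing, an 8×8 array instead of 25×25 to keep it fast, a single source at 20 dB SNR, and the conventional beam-scan spectrum P_B = aᴴRa / aᴴa. The names `ura_steering` and `beam_scan` are hypothetical.

```python
import numpy as np


def ura_steering(az_deg, el_deg, n_y=8, n_z=8, spacing=0.5):
    """Steering vectors for an n_y x n_z URA in the y-z plane (assumed layout).

    az_deg, el_deg: arrays of equal shape, in degrees.
    spacing: element spacing in wavelengths (assumed 0.5).
    Returns an array of shape (n_y * n_z, number_of_directions).
    """
    az = np.deg2rad(np.ravel(az_deg))
    el = np.deg2rad(np.ravel(el_deg))
    m = np.repeat(np.arange(n_y), n_z)[:, None]  # y index of each element
    n = np.tile(np.arange(n_z), n_y)[:, None]    # z index of each element
    phase = 2 * np.pi * spacing * (m * np.sin(az) * np.cos(el) + n * np.sin(el))
    return np.exp(1j * phase)


def beam_scan(snapshots, az_grid, el_grid, n_y=8, n_z=8):
    """Return the (azimuth, elevation) pair maximizing P_B over the 2D grid."""
    num_elem, num_snap = snapshots.shape
    cov = snapshots @ snapshots.conj().T / num_snap   # sample covariance
    az_mesh, el_mesh = np.meshgrid(az_grid, el_grid, indexing="ij")
    steer = ura_steering(az_mesh, el_mesh, n_y, n_z)
    # a^H R a for every grid point at once, normalized by a^H a = num_elem
    power = np.real(np.sum(steer.conj() * (cov @ steer), axis=0)) / num_elem
    best = np.argmax(power)                           # joint maximum, not per axis
    return az_mesh.ravel()[best], el_mesh.ravel()[best]


rng = np.random.default_rng(0)
true_az, true_el = 20.0, -10.0
a0 = ura_steering(np.array([true_az]), np.array([true_el]))   # shape (64, 1)
num_snap = 100                                                # N_scan in the response
signal = (rng.standard_normal(num_snap) + 1j * rng.standard_normal(num_snap)) / np.sqrt(2)
noise = 0.1 * (rng.standard_normal((64, num_snap))
               + 1j * rng.standard_normal((64, num_snap))) / np.sqrt(2)
x = a0 * signal + noise

az_hat, el_hat = beam_scan(x, np.arange(-90, 91), np.arange(-30, 31))
print(az_hat, el_hat)
```

The single `argmax` over the flattened 2D grid is what makes the estimate joint: azimuth and elevation are selected together rather than one after the other.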
Why This Constitutes Joint Optimization: 
1. Coupled estimation: The beam scanning maximizes over the 2D (θ,φ) space jointly 
2. Integrated weights: MVDR weights encode both dimensions in a single 625-element 
vector 
3. Holistic reward: RL rewards depend on SINR, which inherently reflects the combined 3D 
beam accuracy 
4. Iterative refinement: Each Q-learning episode adjusts user assignments based on 3D 
beam performance, creating a feedback loop that jointly optimizes both angles 
This differs from sequential optimization (optimize θ, then φ) in that our approach considers the 
interaction between azimuth and elevation throughout the optimization process. 
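The MVDR weight calculation of Step 2 has the standard closed form w = R⁻¹a / (aᴴR⁻¹a). The sketch below shows it under stated assumptions: a 5×5 URA in the y-z plane with half-wavelength spacing, one desired user and one stronger interferer, and a small diagonal loading term added for numerical stability (the loading is an assumption, not something the response mentions). All names are hypothetical.

```python
import numpy as np


def ura_steering_vector(az_deg, el_deg, n_y=5, n_z=5, spacing=0.5):
    """Steering vector of an n_y x n_z URA in the y-z plane (assumed layout)."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    m, n = np.meshgrid(np.arange(n_y), np.arange(n_z), indexing="ij")
    phase = 2 * np.pi * spacing * (m * np.sin(az) * np.cos(el) + n * np.sin(el))
    return np.exp(1j * phase).ravel()


def mvdr_weights(cov, steer, loading=1e-3):
    """w = R^-1 a / (a^H R^-1 a); diagonal loading is an added assumption."""
    n = cov.shape[0]
    loaded = cov + loading * np.real(np.trace(cov)) / n * np.eye(n)
    r_inv_a = np.linalg.solve(loaded, steer)   # avoids forming R^-1 explicitly
    return r_inv_a / (steer.conj() @ r_inv_a)


rng = np.random.default_rng(1)


def cgauss(shape):
    """Unit-variance circular complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)


a_user = ura_steering_vector(20.0, -10.0)   # both angles enter one vector
a_intf = ura_steering_vector(-35.0, 5.0)
num_snap = 200
x = (np.outer(a_user, cgauss(num_snap))
     + np.outer(a_intf, 3.0 * cgauss(num_snap))
     + 0.1 * cgauss((a_user.size, num_snap)))
cov = x @ x.conj().T / num_snap

w = mvdr_weights(cov, a_user)
gain_user = abs(w.conj() @ a_user)   # distortionless constraint: w^H a = 1
gain_intf = abs(w.conj() @ a_intf)   # response toward the interferer
print(gain_user, gain_intf)
```

The weight vector has one entry per array element, and each entry depends on both angles through the steering vector, which is the sense in which the weights encode the joint spatial pattern.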
Question 3: The simulation scenarios are overly simplistic and do not involve typical 6G 
application scenarios. Moreover, the performance metrics are monotonous, focusing only on 
SINR and throughput as core indicators. It is recommended to enrich the simulation scenarios 
and expand the set of performance metrics. 
We appreciate the reviewer’s observation. 
Answer: 
This question will be addressed in future research, which is the focus of a forthcoming separate 
paper. We have added the following statement to the Conclusion section: 
Future research will extend this framework to 6G application scenarios: 
• Scenario 1: High-Mobility Vehicular Communications 
• Scenario 2: Dense Urban Hotspot (6G XR Applications) 
• Scenario 3: Multi-Tier Heterogeneous Network 
Expanded performance metrics: beyond SINR and throughput, we will evaluate Energy Efficiency, 
Fairness Index (Jain's Fairness), Handoff Rate and Latency, Convergence Speed, Beam Alignment 
Accuracy, and Outage Probability. 
Comparison baselines: we will compare against: 
• Random BS assignment with fixed MVDR 
• Greedy nearest-BS assignment with MVDR 
• Centralized exhaustive search 
• Deep learning baseline: DQN without MVDR integration 
• State-of-the-art: Coordinated beamforming 
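Of the metrics proposed for future evaluation, Jain's fairness index has a compact closed form, J = (Σx)² / (n·Σx²). A minimal Python sketch follows, assuming it is applied to non-negative per-user throughputs (not to SINR in dB, which can be negative). The function name and sample values are illustrative only.

```python
def jain_fairness(values):
    """Jain's fairness index: 1.0 for equal shares, 1/n if one user gets all."""
    n = len(values)
    total = sum(values)
    sum_sq = sum(v * v for v in values)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero value")
    return (total * total) / (n * sum_sq)


equal_share = jain_fairness([1.0, 1.0, 1.0, 1.0])   # all users equal
single_user = jain_fairness([1.0, 0.0, 0.0, 0.0])   # one user takes everything
print(equal_share, single_user)
```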
Question 4: The Related Works section is inadequately elaborated. Current content merely lists 
relevant studies in chronological order from 2021 to 2025. Revisions should clearly elaborate on 
the relevance of the manuscript’s optimization for 5G base station scenarios to 6G technologies. 
In addition, it is suggested that the literature survey on performance gains achieved by leveraging 
multi-user interference be supplemented with references to the following papers.  
We appreciate the reviewer’s important observation. 
Answer: 
The Introduction section has been revised to establish a more apparent connection between prior 
5G learning-based beamforming research and emerging 6G requirements, with emphasis on 
distributed intelligence, interference management, and hybrid machine learning–signal 
processing approaches: 
Existing works address parts of 6G beamforming, including DRL for adaptation, multi-agent 
coordination, 3D beamforming, and generative AI for security. However, no prior work 
integrates: 
1. Multi-agent RL with MVDR for joint user assignment and 3D beamforming 
2. Scalable architecture suitable for dense 6G deployments 
3. Explicit joint azimuth-elevation optimization 
4. Validation showing 5G techniques that scale to 6G requirements 
Our contribution addresses this gap by presenting an end-to-end framework validated in 5G-realistic 
scenarios (28 GHz, 625 elements, Rayleigh fading) while incorporating essential 6G 
features such as 3D beamforming, multi-agent coordination, and AI-native design. 
Scalable architecture: The proposed framework is inherently scalable due to its decentralized 
architecture. For a network with B base stations and U users, each base station operates an 
independent RL agent with its own Q-network. The state space for each agent scales linearly as 
O(U), while the action space remains constant at O(B), regardless of network size. This is 
because each agent only needs to decide user assignments within its coverage area, avoiding the 
exponential state-space growth (O(B^U)) that would occur in a centralized approach. The 
coordination mechanism scales efficiently through: 
1. Shared network topology: All agents access a standard graph G = (V, E) where V 
represents base stations and E represents coverage overlaps. 
2. Distributed covariance computation: Each BS independently computes its 625×625 
covariance matrix using only local observations 
3. Coverage-based pre-filtering: The coverage check reduces the adequate action space from 
B to typically 1-2 neighboring BSs per user 
Adding k new base stations requires only: 
• Instantiating k new identical Q-networks (no architectural redesign) 
• Updating the shared topology graph with k new nodes 
• No retraining of existing agents due to the transfer learning properties of the reward 
function. 
Question 4 (continued):  
The Related Works section is inadequately elaborated. Current content merely lists relevant 
studies in chronological order from 2021 to 2025. Revisions should clearly elaborate on the 
relevance of the manuscript’s optimization for 5G base stations  
Answer: 
The “Related Works” section covers relevant studies in chronological order from 2021 to 2025. 
We have clearly explained the relevance of the manuscript’s optimization to 5G. An extensive 
review of other beamforming approaches is available in several surveys and review papers. 
These are not related to or within the scope of the research reported in this manuscript, which is 
why we did not use a “literature review” but rather focused on “related work.” 
The literature review has been supplemented by adding the suggested two references, though 
they are not directly related to the research reported in our manuscript. Hence, no further action 
has been added to “it is suggested to supplement the literature survey regarding performance 
gains achieved by leveraging multi-user interference, with reference to the following papers:” 
Question 5: There are inconsistencies between some figures and textual descriptions. 
Specifically, Figure 2 suffers from typesetting flaws; Figures 4 and 5 exhibit text distortion; and 
the content in Table 1 is inconsistent with the 625 antenna elements specified in the system 
model section. It is recommended to verify and revise these parts thoroughly.  
We appreciate the reviewer’s important observation. 
Answer: 
All inconsistencies in antenna configuration, figures, and tables have been corrected, and all 
formatting issues have been resolved. 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors present a cooperative multi-agent RL approach for downlink beamforming and user assignment in a 2-BS, 10-UE mmWave scenario with a 25 × 25 URA and MVDR beamforming, reporting SINR and throughput gains after appropriate training. The following comments are intended to help revise the manuscript.

1). The authors describe MVDR weights computed from the sample covariance and steering vector, while also stating that the RL agent optimizes the beamforming weights. Please specify how RL optimization compares with MVDR and how the two are related.

2). 2000 samples are used for covariance estimation. However, the math derivation uses N_s = 1000 snapshots, while elsewhere N_s = 100 is used for the beam-scanning step. Please clarify this inconsistency.

3). Also, Table 1 only shows ULA = 256 elements, while the manuscript presents 625 elements.

4). Fig. 6 compares the RL policy against a "conventional MVDR" baseline at 0.75 dB. How was this baseline chosen, given that MVDR performance is typically scenario dependent?

5). A 25 × 25 URA MVDR requires inversion of a 625 × 625 matrix. What is the computational cost?

6). Several typos should be fixed for better reading experience.

Author Response

Reviewer 2 
Question 1: The authors discuss MVDR weights computed from the sample covariance and 
steering vector, while also discussing the RL agent that optimizes the beamforming weights. 
Please specify the comparison between RL optimization v.s. MVDR and its relations. 
We appreciate the reviewer’s important observation. 
Answer: 
Hybrid Approach - Role Separation: 
MVDR Role: Computes optimal continuous weights for given user-BS assignment (closed-form 
solution) 
RL Role: Learns optimal discrete user-BS assignment policy (which users to which BS) 
Two-Stage Process: 
1. RL Decision: Select BS via Q-learning (discrete: 2 actions) 
2. MVDR Execution: Compute beamforming weights (continuous: 625 weights) 
3. RL Update: Receive SINR reward, improve assignment policy 

Approach     Action Space   Training   SINR 
MVDR Only    N/A            None       0.75 dB 
Pure RL      1250 dims      Weeks      Won't converge 
RL + MVDR    2 dims         31 min     1.15 dB 
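The decision, execution, and update steps listed above can be sketched as a tabular Q-learning loop. This is a simplified illustration, not the authors' code: the MVDR stage is replaced by a hypothetical lookup table of per-(user, BS) rewards, the hyperparameters are assumed, and the state is taken to be the user index (10 states, 2 actions, as quoted elsewhere in the response).

```python
import numpy as np

NUM_USERS, NUM_BS = 10, 2               # |S| = 10 states, |A| = 2 actions
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed hyperparameters

rng = np.random.default_rng(0)
# Hypothetical stand-in for "run MVDR and measure SINR" for each (user, BS) pair.
reward_table = rng.normal(size=(NUM_USERS, NUM_BS))


def sinr_reward(user: int, bs: int) -> float:
    """Placeholder for the SINR obtained after MVDR beamforming."""
    return float(reward_table[user, bs])


q = np.zeros((NUM_USERS, NUM_BS))
for episode in range(200):
    for user in range(NUM_USERS):
        # Step 1, RL decision: epsilon-greedy choice of serving BS.
        if rng.random() < EPSILON:
            bs = int(rng.integers(NUM_BS))
        else:
            bs = int(np.argmax(q[user]))
        # Step 2, MVDR execution: stubbed by the reward lookup.
        reward = sinr_reward(user, bs)
        # Step 3, RL update: standard Q-learning temporal-difference step.
        next_user = (user + 1) % NUM_USERS
        td_target = reward + GAMMA * np.max(q[next_user])
        q[user, bs] += ALPHA * (td_target - q[user, bs])

assignment = np.argmax(q, axis=1)       # learned BS index per user
print(assignment)
```

Keeping the learned action discrete (which BS serves a user) and leaving the continuous weights to the closed-form beamformer is what keeps the action space at 2 rather than 1250 dimensions.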
Question 2: 2000 samples are used for covariance estimation. However, the math derivation uses 
N_s = 1000 snapshots while elsewhere N_s = 100 for the beam-scanning step. Please clarify why 
there is an inconsistency: 
We appreciate the reviewer’s important observation 
Answer: 
We have added before Section 5: 
Our system uses three different sets of signal samples at various stages with different purposes: 
1. N_covariance = 2000 samples (for MVDR covariance matrix estimation) 
Purpose: Estimating the spatial covariance matrix for the MVDR beamformer 
Location used: Table 1, Section 4.3 parameter list 
Justification: MVDR requires accurate covariance estimation for optimal performance. For an 
N=625 element array, the covariance matrix C{625×625} has 625² = 390,625 unique entries 
(exploiting Hermitian symmetry). The rule of thumb for covariance estimation is N_samples ≥ 
2N to 3N for stability. We use 2000 > 3×625 = 1875 to ensure: 
• Low estimation variance 
• Accurate interference characterization 
• Stable matrix inversion 
Time duration: At 1 MHz symbol rate, 2000 samples = 2 ms, which is within the channel 
coherence time (Tc ≈ 1/(2fD) = 1/(2×50Hz) = 10 ms). 
2. N_scan = 100 snapshots (for beam-scanning direction estimation) 
Purpose: Quick direction-of-arrival (DOA) estimation via 2D beam scanning  
Why fewer samples? 
• Speed requirement: Beam scanning performs a grid search over θ ∈ [-90°, 90°] (180 
points) × φ ∈ [-30°, 30°] (60 points) = 10,800 evaluations of P_B(θ,φ) 
• Accuracy sufficiency: We only need rough DOA estimates (within ~1° resolution) to 
initialize the steering vector for MVDR. Fine-grained SINR optimization happens 
through the MVDR weights themselves. 
• Computational trade-off: Using 100 samples instead of 2000 reduces beam scanning time 
by 20×, from 40 ms to 2 ms per user, while still providing adequate DOA accuracy. 
Time duration: 100 samples at 1 MHz = 0.1 ms, much faster than covariance estimation. 
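The sample-count justification above can be checked with a few lines of arithmetic. The Python sketch below uses only the values quoted in the response (625 elements, 2000 and 100 samples, 1 MHz symbol rate, 50 Hz Doppler, 180 × 60 grid). The variable names are illustrative.

```python
N_ELEMENTS = 625
N_COV, N_SCAN = 2000, 100        # covariance samples, beam-scan snapshots
SYMBOL_RATE_HZ = 1e6
DOPPLER_HZ = 50.0

meets_rule_of_thumb = N_COV >= 3 * N_ELEMENTS        # 2000 vs 1875
cov_duration_ms = N_COV / SYMBOL_RATE_HZ * 1e3       # about 2 ms
scan_duration_ms = N_SCAN / SYMBOL_RATE_HZ * 1e3     # about 0.1 ms
coherence_ms = 1.0 / (2.0 * DOPPLER_HZ) * 1e3        # Tc = 1/(2 fD), about 10 ms
grid_points = 180 * 60                               # beam-scan evaluations
sample_ratio = N_COV / N_SCAN                        # factor between the two budgets

print(meets_rule_of_thumb, cov_duration_ms, scan_duration_ms,
      coherence_ms, grid_points, sample_ratio)
```

Both acquisition windows are shorter than the quoted coherence time, and the 20× ratio in sample count matches the quoted 40 ms to 2 ms reduction in scanning time.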
Question 3: Also, Table 1 only shows ULA = 256 elements, while the manuscript presents 625 
elements 
We appreciate the reviewer’s important observation 
Answer: 
Corrected. Also, all mentions of 1000 snapshots are changed to 2000 samples. 
Question 4: Fig. 6 compares the RL policy against a "conventional MVDR" baseline at 0.75dB. 
How does this baseline choose? Since MVDR performance is typically scenario dependent. 
We appreciate the reviewer’s important observation. 
Answer: 
Excellent question.  
The 'conventional MVDR' baseline (dashed red line at 0.75 dB in Figure 6) represents current 
state-of-practice beamforming without intelligent user assignment. Its configuration is: 
User Assignment Rule: 
• Nearest base station selection: Each user k is assigned to the BS i that minimizes the Euclidean 
distance: arg min_i ||p_k - q_i||, 
where p_k is the user position and q_i is the BS position 
• This is a static, deterministic rule requiring no learning or optimization. 
• Commonly used in practical systems due to simplicity. 
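The nearest-BS rule above amounts to an arg-min over Euclidean distances. A minimal NumPy sketch follows, using the two BS positions stated in the response, (0, 0, 0) and (100, 0, 0). The four user positions are hypothetical and chosen only to illustrate the rule; they are not the authors' scenario.

```python
import numpy as np


def nearest_bs_assignment(user_pos, bs_pos):
    """Index of the closest BS for each user: arg min_i ||p_k - q_i||."""
    diffs = user_pos[:, None, :] - bs_pos[None, :, :]   # shape (users, BSs, 3)
    dists = np.linalg.norm(diffs, axis=2)
    return np.argmin(dists, axis=1)


bs_pos = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
# Hypothetical user positions in metres (illustration only).
user_pos = np.array([
    [15.0, 5.0, 0.0],
    [80.0, 20.0, 0.0],
    [49.0, 10.0, 0.0],
    [51.0, 40.0, 0.0],
])

assignment = nearest_bs_assignment(user_pos, bs_pos)
print(assignment.tolist())
```

The two users near x = 50 m end up on different base stations even though they are close to each other, which is the kind of boundary case where a distance-only rule ignores mutual interference.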
Beamforming Method: 
• Uses the equations listed for DOA estimation and the equation for weight computation 
• Same parameters: 2000 samples for covariance, 28 GHz carrier, 625-element URA 
• Identical channel conditions: Same Rayleigh fading, AWGN (3 dB SNR), path loss 
model 
Why 0.75 dB specifically? 
This value is not arbitrarily chosen but rather the measured performance of the nearest-BS + 
MVDR policy in our specific scenario: 
Scenario details affecting baseline performance: 
1. User distribution: 10 users randomly placed in a 100m × 50m area (seed fixed for 
reproducibility) 
2. BS positions: BS1 at (0, 0, 0), BS2 at (100, 0, 0) 
3. Resulting assignments: Nearest-BS rule assigns six users to BS1, four users to BS2 
4. Interference level: User 3 and User 7 experience substantial mutual interference (both 
near x=50m boundary) 
5. Path loss impact: User 10 assigned to BS2 has 65m distance → 36 dB path loss 
Measured SINR distribution (baseline): 
• Best user (User 1, 15 m from BS1): 3.2 dB 
• Worst user (User 10, 65 m from BS2): -2.1 dB 
• Average across 10 users: 0.75 dB 
Why this is a fair baseline: 
1. Scenario-dependent acknowledgment: You are correct that MVDR performance varies 
with scenario. We report the baseline SINR specific to our scenario (Figure 1 topology). 
Different user placements would yield different baseline values. 
2. Reproducible comparison: Both the baseline and RL policy are evaluated in the same 
scenario (same user positions, channel realizations, noise). The RL improvement (0.75 
dB → 1.15 dB) reflects learning better assignments for this specific topology. 
3. Standard practice: The Nearest-BS assignment is widely used in the literature and in 
practical systems (3GPP standards specify similar criteria). It represents a strong baseline 
that accounts for path loss but ignores interference and load. 
4. Apples-to-apples MVDR: Both baseline and RL use an identical MVDR implementation. 
The only difference is the assignment policy (deterministic vs. learned). 
Baseline variations across episodes: 
We note that the baseline SINR (0.75 dB) remains constant across episodes (flat red line in Fig. 
6) because: 
• User positions are static in our scenario. 
• Nearest-BS assignment is deterministic (no randomness) 
• Channel statistics are stationary (Rayleigh fading averaged over 2000 samples) 
In contrast, the RL policy SINR varies during training as the exploration-exploitation balance 
evolves. 
Question 5: A 25 x 25 URA MVDR requires inversion of a 625 x 625 matrix. What is the 
computational cost?  
Excellent question.  
Answer: 
Computational Analysis (Intel Xeon Gold 6248R, 3.0 GHz): 
Operation          Complexity          Measured Time 
Covariance         O(N²×Nsamples)      12.3 ms 
Matrix inversion   O(N³/3) Cholesky    8.7 ms 
MVDR weights       O(N²)               0.4 ms 
Total per user     -                   21.4 ms 

System-level (2 BSs, 10 users): 
• Per iteration: 63 ms (42.8 ms MVDR parallelizable + 20 ms beam scanning) 
• Per episode (150 iterations): 9.45 seconds 
• Full training (200 episodes): 31.5 minutes 
• Real-time inference: <1 ms (Q-table lookup) 
With GPU + incremental updates: <2 ms per user 
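The timing figures above are internally consistent, as the following arithmetic sketch shows. It uses only the numbers quoted in the response; the variable names are illustrative, and the flop count is the leading-order N³/3 term for a Cholesky factorization, not a measurement.

```python
COVARIANCE_MS, INVERSION_MS, WEIGHTS_MS = 12.3, 8.7, 0.4
NUM_BS = 2
BEAM_SCAN_MS = 20.0
ITERATIONS_PER_EPISODE = 150
EPISODES = 200
N = 625

per_user_ms = COVARIANCE_MS + INVERSION_MS + WEIGHTS_MS       # about 21.4 ms
mvdr_ms = NUM_BS * per_user_ms                                # about 42.8 ms
per_iteration_ms = mvdr_ms + BEAM_SCAN_MS                     # about 62.8 ms, quoted as 63 ms

episode_s = 63.0 * ITERATIONS_PER_EPISODE / 1e3               # about 9.45 s
training_min = episode_s * EPISODES / 60.0                    # about 31.5 min

cholesky_flops = N ** 3 / 3                                   # roughly 8.1e7 operations

print(per_user_ms, mvdr_ms, per_iteration_ms, episode_s, training_min, cholesky_flops)
```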
Question 6: Several typos should be fixed for a better reading experience. 
Answer: 
We sincerely apologize for the typographical errors and have conducted a thorough proofreading 
of the entire manuscript. 
We believe that these revisions substantially strengthen the manuscript and comprehensively 
address all reviewer concerns. We appreciate the opportunity to revise our work and hope that 
the improved version will be suitable for publication. 
Thank you for your time and consideration. 
Sincerely, 
Dr. Ali Othman 
(on behalf of all authors) 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for the authors' detailed response; all issues have been resolved. I have no further concerns.

Comments on the Quality of English Language

English is satisfactory.

Reviewer 2 Report

Comments and Suggestions for Authors

The revised manuscript looks good to me.
