Enhancing Spectral Efficiency of 6G Downlink Beamforming via Cooperative Multi-Agent Deep Reinforcement Learning
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript proposes a combined scheme of multi-agent reinforcement learning and MVDR beamforming, which innovatively realizes the integration of 3D beam weight optimization and user-base station association. The work demonstrates a certain degree of workload and innovation, yet there remain the following issues to be addressed:
- The manuscript claims the scalability of base stations as a contribution, but this advantage is not highlighted in subsequent chapters, especially in the mathematical model and scenario setup sections. Additional descriptions should be added to emphasize this contribution.
- The manuscript mentions the joint optimization of azimuth and elevation angles, but the mathematical model only describes their respective calculation methods separately. Supplementary descriptions of the joint optimization mechanism are required.
- The simulation scenarios are overly simplistic and do not involve typical 6G application scenarios. Moreover, the performance metrics are monotonous, focusing only on SINR and throughput as core indicators. It is recommended to enrich the simulation scenarios and expand the set of performance metrics.
- The Related Works section is inadequately elaborated. Current content merely lists relevant studies in chronological order from 2021 to 2025. Revisions should clearly elaborate on the relevance of the manuscript’s optimization for 5G base station scenarios to 6G technologies. In addition, it is suggested to supplement the literature survey regarding performance gains achieved by leveraging multi-user interference, with reference to the following papers:
[1] "Achieving Positive Rate of Covert Communications Covered by Randomly Activated Overt Users," IEEE Transactions on Information Forensics and Security, vol. 20, pp. 2480-2495, 2025.
[2] "Achieving Covert Communication With a Probabilistic Jamming Strategy," IEEE Transactions on Information Forensics and Security, vol. 19, pp. 5561-5574, 2024.
- There are inconsistencies between some figures and textual descriptions. Specifically, Figure 2 suffers from typesetting flaws; Figures 4 and 5 have text distortion problems; the content in Table 1 is inconsistent with the 625 antenna elements specified in the system model section. It is recommended to verify and revise these parts thoroughly.
Author Response
January 7, 2026
Dear Editors and Reviewers,
We express our sincere gratitude to both reviewers for their comprehensive evaluation and
constructive feedback.
Their insightful comments on our manuscript entitled “Enhancing Spectral Efficiency of 5G
Downlink Beamforming via Cooperative Multi-Agent Deep Reinforcement Learning” have been
carefully addressed. We have revised the manuscript to address all comments and enhance clarity
and technical rigor.
Recognition
The authors sincerely acknowledge the anonymous reviewers for their constructive comments
and suggestions, which have significantly enhanced the content and quality of the original paper.
We are attaching the original manuscript with red-marked changes, as well as a clean copy of the
revised manuscript.
Kind regards,
Reviewer 1
Question 1: The manuscript claims the scalability of base stations as a contribution, but this
advantage is not highlighted in subsequent chapters, especially in the mathematical model and
scenario setup sections. Additional descriptions should be added to emphasize this contribution.
Answer:
We appreciate the reviewer’s important observation. The manuscript has been revised to
demonstrate the scalability contribution more clearly, in the following ways:
Explicit Reinforcement Learning and Scalability Clarification:
The system model and algorithm sections provide a clear description of the cooperative
multi-agent reinforcement learning (MARL) framework. Each base station is explicitly modeled as an
independent learning agent, and the scalability of the proposed approach is demonstrated by the
learning formulation’s independence from the number of base stations. The following
explanation has been added below Figure 7 to emphasize the reinforcement learning and
scalability clarification further:
Finally, to validate the scalability of the proposed algorithm, we analyzed the computational
complexity per episode:
• Per-BS complexity: O(N²) for MVDR computation + O(|A|·|S|) for Q-learning updates,
where N=625 antenna elements, |A|=2 actions, |S|=10 states
• Total network complexity: O(B·N²) grows linearly with base stations, not exponentially
• Comparison with centralized approach: A centralized controller would require O(B^U)
state space (2^10 = 1024 states for our scenario versus 10 states per agent)
For our 2-BS, 10-user scenario:
• MVDR computation: ~0.8 ms per BS per iteration (625×625 matrix inversion)
• Q-learning update: ~0.05 ms per user assignment
• Total per-episode time: ~150 ms (150 iterations × 1 ms average)
Scaling to 5 BSs and 25 users would require:
• 5× MVDR computations (still parallelizable): ~4 ms
• 2.5× Q-learning updates per BS: ~0.125 ms
• Projected per-episode time: ~375 ms (maintaining near-linear scaling)
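The figures in the complexity analysis above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the linear-in-users projection is an assumption that happens to match the quoted 375 ms (it presumes the per-BS MVDR work runs in parallel).

```python
def centralized_states(num_bs: int, num_users: int) -> int:
    """Joint user-to-BS assignments a single controller must distinguish: B**U."""
    return num_bs ** num_users


def per_agent_states(num_users: int) -> int:
    """Per-agent state count as quoted in the response: one state per user."""
    return num_users


def projected_episode_ms(base_ms: float, base_users: int, num_users: int) -> float:
    """Scale the per-episode time linearly with the number of users (assumption)."""
    return base_ms * num_users / base_users


if __name__ == "__main__":
    print(centralized_states(2, 10))            # 2-BS, 10-user scenario
    print(per_agent_states(10))
    print(projected_episode_ms(150.0, 10, 25))  # 5-BS, 25-user projection
```

With 2 BSs and 10 users the centralized count is 1024 against 10 states per agent, and the 25-user projection is 375 ms, which are the values stated in the response.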
The architecture’s scalability is further demonstrated by enabling agents to be pre-trained in
smaller scenarios and subsequently deployed in larger networks without requiring full retraining.
This leverages the generalization properties of the SINR-based reward function. The RL agent
jointly optimizes both angles via the reward function: the elevation term explicitly accounts for
vertical angle optimization, while the ΔSINR term captures the coupled effect of both angles on
signal quality. The Q-learning update, therefore, learns to select beam directions that maximize
joint azimuth-elevation performance. In summary:
1. Coupled estimation: The beam scanning maximizes over the 2D (θ,φ) space jointly
2. Integrated weights: MVDR weights encode both dimensions in a single 625-element
vector
3. Holistic reward: RL rewards depend on SINR, which inherently reflects the combined 3D
beam accuracy
4. Iterative refinement: Each Q-learning episode adjusts user assignments based on 3D beam
performance, creating a feedback loop that jointly optimizes both angles.
This differs from sequential optimization (optimize θ, then φ) in that our approach considers the
interaction between azimuth and elevation throughout the optimization process.
Question 2: The manuscript mentions joint optimization of azimuth and elevation angles, but the
mathematical model describes their calculation methods separately. Supplementary descriptions
of the joint optimization mechanism are required.
We appreciate the reviewer’s important observation.
Answer:
Integration of RL and MVDR Beamforming:
We have clarified the hierarchical relationship between reinforcement learning and MVDR
beamforming. The learning agents optimize high-level decisions, such as user association and 3D
beam selection, while MVDR analytically computes the beamforming weights. This distinction
addresses the ambiguity identified by the reviewer.
The joint optimization of azimuth and elevation angles is achieved through the integrated
RL-MVDR framework as follows:
Step 1: Coupled Angle Estimation: The 2D beam scanning estimator naturally couples azimuth
and elevation through the spatial power spectrum, using the steering vector, which inherently
represents the joint 3D direction.
The grid search over φ ∈ [-90°, 90°] and θ ∈ [-30°, 30°] yields the coupled estimate. This is a
joint optimization because the maximization occurs over the 2D angular space simultaneously,
not separately for each dimension.
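As an illustration of Step 1, the following minimal sketch scans a 2D angular grid and returns the pair that maximizes the conventional beamformer power a^H R a. It is not the manuscript's implementation: the half-wavelength spacing, the axis convention inside `ura_steering`, and all function names are assumptions, and a 4×4 array with a coarse grid is used so the example runs quickly.

```python
import numpy as np


def ura_steering(az_deg, el_deg, n_rows=25, n_cols=25, spacing=0.5):
    """Steering vector of an n_rows x n_cols URA (spacing in wavelengths)."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    u = np.sin(az) * np.cos(el)  # direction cosine along the column axis (assumed convention)
    v = np.sin(el)               # direction cosine along the row axis
    col_phase = np.exp(2j * np.pi * spacing * u * np.arange(n_cols))
    row_phase = np.exp(2j * np.pi * spacing * v * np.arange(n_rows))
    return np.kron(row_phase, col_phase)


def beam_scan(snapshots, az_grid, el_grid, n_rows, n_cols):
    """Return the (azimuth, elevation) grid point that maximizes a^H R a."""
    n_snap = snapshots.shape[1]
    cov = snapshots @ snapshots.conj().T / n_snap  # sample covariance
    best, best_power = None, -np.inf
    for az in az_grid:          # both angles are searched in one joint loop
        for el in el_grid:
            a = ura_steering(az, el, n_rows, n_cols)
            power = np.real(a.conj() @ cov @ a)
            if power > best_power:
                best, best_power = (az, el), power
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_rows = n_cols = 4
    symbols = rng.standard_normal(100) + 1j * rng.standard_normal(100)
    noise = 0.05 * (rng.standard_normal((16, 100)) + 1j * rng.standard_normal((16, 100)))
    x = np.outer(ura_steering(20, -10, n_rows, n_cols), symbols) + noise
    print(beam_scan(x, np.arange(-90, 91, 5), np.arange(-30, 31, 5), n_rows, n_cols))
```

The point of the nested loop is that a single maximum is taken over all (azimuth, elevation) pairs, rather than fixing one angle and then searching the other.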
Step 2: Joint MVDR Weight Calculation: The MVDR beamformer uses the jointly estimated
angles to compute weights that account for both dimensions simultaneously. The 625-element
weight vector (25×25 URA) implicitly encodes the joint spatial pattern, with the weights for each
element depending on both θ and φ.
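For reference, the standard MVDR solution is w = R⁻¹a / (aᴴR⁻¹a), where R is the sample covariance and a is the steering vector for the estimated direction. The sketch below computes it with a linear solve instead of an explicit inverse. The diagonal loading term and the function name are assumptions added for numerical stability, not something the response describes.

```python
import numpy as np


def mvdr_weights(cov, steering, loading=1e-3):
    """MVDR weights w = R^{-1} a / (a^H R^{-1} a), with light diagonal loading."""
    n = cov.shape[0]
    # Diagonal loading (assumed here) keeps the covariance well conditioned.
    loaded = cov + loading * np.real(np.trace(cov)) / n * np.eye(n)
    r_inv_a = np.linalg.solve(loaded, steering)  # solves R x = a without forming R^{-1}
    return r_inv_a / (steering.conj() @ r_inv_a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 64)) + 1j * rng.standard_normal((8, 64))
    cov = x @ x.conj().T / 64
    a = np.exp(1j * np.pi * 0.3 * np.arange(8))
    w = mvdr_weights(cov, a)
    print(abs(w.conj() @ a))  # distortionless constraint: unit gain toward a
```

Because a single weight vector is computed from a steering vector that already depends on both angles, every element of w reflects the joint direction.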
Step 3: RL-Based Joint Refinement: The RL agent optimizes both angles jointly through the
reward function
The elevation term explicitly incorporates vertical angle optimization, while the ΔSINR term
captures the coupled effect of both angles on signal quality. The Q-learning update thus learns to
select beam directions that maximize joint azimuth-elevation performance.
Why This Constitutes Joint Optimization:
1. Coupled estimation: The beam scanning maximizes over the 2D (θ,φ) space jointly
2. Integrated weights: MVDR weights encode both dimensions in a single 625-element
vector
3. Holistic reward: RL rewards depend on SINR, which inherently reflects the combined 3D
beam accuracy
4. Iterative refinement: Each Q-learning episode adjusts user assignments based on 3D
beam performance, creating a feedback loop that jointly optimizes both angles
This differs from sequential optimization (optimize θ, then φ) in that our approach considers the
interaction between azimuth and elevation throughout the optimization process.
Question 3: The simulation scenarios are overly simplistic and do not involve typical 6G
application scenarios. Moreover, the performance metrics are monotonous, focusing only on
SINR and throughput as core indicators. It is recommended to enrich the simulation scenarios
and expand the set of performance metrics.
We appreciate the reviewer’s observation.
Answer:
This question will be addressed in future research, which is the focus of a forthcoming separate
paper. We have added the following statement to the Conclusion section:
Future research will extend this framework to 6G application scenarios:
Scenario 1: High-Mobility Vehicular Communications
Scenario 2: Dense Urban Hotspot (6G XR Applications)
Scenario 3: Multi-Tier Heterogeneous Network
Expanded Performance Metrics: Beyond SINR and throughput, we will evaluate Energy Efficiency,
Fairness Index (Jain's Fairness), Handoff Rate and Latency, Convergence Speed, Beam Alignment
Accuracy, and Outage Probability.
Comparison Baselines: We will compare against:
• Random BS assignment with fixed MVDR
• Greedy nearest-BS assignment with MVDR
• Centralized exhaustive search
• Deep learning baseline: DQN without MVDR integration
• State-of-the-art: Coordinated beamforming
Question 4: The Related Works section is inadequately elaborated. Current content merely lists
relevant studies in chronological order from 2021 to 2025. Revisions should clearly elaborate on
the relevance of the manuscript’s optimization for 5G base station scenarios to 6G technologies.
In addition, it is suggested that the literature survey on performance gains achieved by leveraging
multi-user interference be supplemented with references to the following papers.
We appreciate the reviewer’s important observation.
Answer:
The Introduction section has been revised to establish a clearer connection between prior
5G learning-based beamforming research and emerging 6G requirements, with emphasis on
distributed intelligence, interference management, and hybrid machine learning–signal
processing approaches:
Existing works address parts of 6G beamforming, including DRL for adaptation, multi-agent
coordination, 3D beamforming, and generative AI for security. However, no prior work
integrates:
1. Multi-agent RL with MVDR for joint user assignment and 3D beamforming
2. Scalable architecture suitable for dense 6G deployments
3. Explicit joint azimuth-elevation optimization
4. Validation showing 5G techniques that scale to 6G requirements
Our contribution addresses this gap by presenting an end-to-end framework validated in
realistic 5G scenarios (28 GHz, 625 elements, Rayleigh fading) while incorporating essential 6G
features such as 3D beamforming, multi-agent coordination, and AI-native design.
Scalable architecture: The proposed framework is inherently scalable due to its decentralized
architecture. For a network with B base stations and U users, each base station operates an
independent RL agent with its own Q-network. The state space for each agent scales linearly as
O(U), while the action space remains constant at O(B), regardless of network size. This is
because each agent only needs to decide user assignments within its coverage area, avoiding the
exponential state-space growth (O(B^U)) that would occur in a centralized approach. The
coordination mechanism scales efficiently through:
1. Shared network topology: All agents access a common graph G = (V, E), where V
represents base stations and E represents coverage overlaps.
2. Distributed covariance computation: Each BS independently computes its 625×625
covariance matrix using only local observations.
3. Coverage-based pre-filtering: The coverage check reduces the effective action space from
B to typically 1-2 neighboring BSs per user.
Adding k new base stations requires only:
• Instantiating k new identical Q-networks (no architectural redesign)
• Updating the shared topology graph with k new nodes
• No retraining of existing agents due to the transfer learning properties of the reward
function.
Question 4 (continued):
The Related Works section is inadequately elaborated. Current content merely lists relevant
studies in chronological order from 2021 to 2025. Revisions should clearly elaborate on the
relevance of the manuscript’s optimization for 5G base stations
Answer:
The “Related Works” section covers relevant studies in chronological order from 2021 to 2025.
We have clearly explained the relevance of the manuscript’s optimization to 5G. An extensive
review of other beamforming approaches is available in several surveys and review papers.
These are not related to or within the scope of the research reported in this manuscript, which is
why we did not use a “literature review” but rather focused on “related work.”
The literature review has been supplemented with the two suggested references, although
they are not directly related to the research reported in our manuscript. Hence, no further action
has been taken on the request that “it is suggested to supplement the literature survey regarding
performance gains achieved by leveraging multi-user interference, with reference to the following papers.”
Question 5: There are inconsistencies between some figures and textual descriptions.
Specifically, Figure 2 suffers from typesetting flaws; Figures 4 and 5 exhibit text distortion; and
the content in Table 1 is inconsistent with the 625 antenna elements specified in the system
model section. It is recommended to verify and revise these parts thoroughly.
We appreciate the reviewer’s important observation.
Answer:
All inconsistencies in antenna configuration, figures, and tables have been corrected, and all
formatting issues have been resolved.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors present a cooperative multi-agent RL approach for downlink beamforming + user assignment in a 2-BS, 10-UE mmWave scenario with a 25 x 25 URA and MVDR beamforming, reporting SINR and throughput gains after appropriate training. Here are a couple of comments to help revise the manuscript.
1). The authors describe MVDR weights computed from the sample covariance and steering vector, while also stating that the RL agent optimizes the beamforming weights. Please specify the comparison between RL optimization vs. MVDR and how they relate.
2). 2000 samples are used for covariance estimation. However, the math derivation uses N_s = 1000 snapshots, while elsewhere N_s = 100 for the beam-scanning step. Please clarify this inconsistency.
3). Also, Table 1 only shows ULA = 256 elements, while the manuscript presents 625 elements.
4). Fig. 6 compares the RL policy against a "conventional MVDR" baseline at 0.75 dB. How was this baseline chosen, given that MVDR performance is typically scenario dependent?
5). A 25 x 25 URA MVDR requires inversion of a 625 x 625 matrix. What is the computational cost?
6). Several typos should be fixed for better reading experience.
Author Response
Reviewer 2
Question 1: The authors discuss MVDR weights computed from the sample covariance and
steering vector, while also discussing the RL agent that optimizes the beamforming weights.
Please specify the comparison between RL optimization vs. MVDR and how they relate.
We appreciate the reviewer’s important observation.
Answer:
Hybrid Approach - Role Separation:
MVDR Role: Computes optimal continuous weights for a given user-BS assignment (closed-form
solution)
RL Role: Learns the optimal discrete user-BS assignment policy (which users to which BS)
Two-Stage Process:
1. RL Decision: Select BS via Q-learning (discrete: 2 actions)
2. MVDR Execution: Compute beamforming weights (continuous: 625 weights)
3. RL Update: Receive SINR reward, improve assignment policy
Approach | Action Space | Training | SINR
MVDR Only | N/A | None | 0.75 dB
Pure RL | 1250 dims | Weeks | Won't converge
RL + MVDR | 2 dims | 31 min | 1.15 dB
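The division of labor described above (discrete BS choice by RL, continuous weights by MVDR, SINR fed back as reward) can be sketched with a tabular Q-learning step. This is a minimal illustration, not the authors' code: the table size follows the |S| = 10, |A| = 2 figures quoted in the response, while the learning rate, discount factor, exploration rate, and function names are assumptions.

```python
import numpy as np

N_STATES, N_ACTIONS = 10, 2             # |S| = 10 states, |A| = 2 base stations
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed hyperparameters


def select_bs(q_table, state, rng, epsilon=EPSILON):
    """Epsilon-greedy choice of the serving BS (the discrete RL decision)."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_table[state]))


def q_update(q_table, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """One Q-learning update; the reward would come from the post-MVDR SINR."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = np.zeros((N_STATES, N_ACTIONS))
    state = 0
    action = select_bs(q, state, rng)
    reward = 1.0  # placeholder for the SINR-based reward after MVDR execution
    q_update(q, state, action, reward, next_state=1)
    print(q[state])
```

In this arrangement the learner only ever explores two actions per decision, while the 625 complex weights are obtained analytically, which is the contrast the table draws with the "Pure RL" row.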
Question 2: 2000 samples are used for covariance estimation. However, the math derivation uses
N_s = 1000 snapshots while elsewhere N_s = 100 for the beam-scanning step. Please clarify why
there is an inconsistency:
We appreciate the reviewer’s important observation.
Answer:
We have added before Section 5:
Our system uses three different sets of signal samples at various stages with different purposes:
1. N_covariance = 2000 samples (for MVDR covariance matrix estimation)
Purpose: Estimating the spatial covariance matrix for the MVDR beamformer
Location used: Table 1, Section 4.3 parameter list
Justification: MVDR requires accurate covariance estimation for optimal performance. For an
N=625 element array, the 625×625 covariance matrix C has 625² = 390,625 entries
(625×626/2 = 195,625 of them unique under Hermitian symmetry). The rule of thumb for
covariance estimation is N_samples ≥ 2N to 3N for stability. We use 2000 > 3×625 = 1875 to ensure:
• Low estimation variance
• Accurate interference characterization
• Stable matrix inversion
Time duration: At 1 MHz symbol rate, 2000 samples = 2 ms, which is within the channel
coherence time (Tc ≈ 1/(2fD) = 1/(2×50Hz) = 10 ms).
2. N_scan = 100 snapshots (for beam-scanning direction estimation)
Purpose: Quick direction-of-arrival (DOA) estimation via 2D beam scanning
Why fewer samples?
• Speed requirement: Beam scanning performs a grid search over θ ∈ [-90°, 90°] (180
points) × φ ∈ [-30°, 30°] (60 points) = 10,800 evaluations of P_B(θ,φ)
• Accuracy sufficiency: We only need rough DOA estimates (within ~1° resolution) to
initialize the steering vector for MVDR. Fine-grained SINR optimization happens
through the MVDR weights themselves.
• Computational trade-off: Using 100 samples instead of 2000 reduces beam scanning time
by 20×, from 40 ms to 2 ms per user, while still providing adequate DOA accuracy.
Time duration: 100 samples at 1 MHz = 0.1 ms, much faster than covariance estimation.
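The sample-budget arithmetic in this answer can be checked directly. The sketch below uses only the numbers quoted in the response (625 elements, 1 MHz symbol rate, 50 Hz Doppler, 180 × 60 grid). The function and constant names are hypothetical.

```python
N_ELEMENTS = 625
SYMBOL_RATE_HZ = 1e6
DOPPLER_HZ = 50.0


def duration_ms(n_samples, rate_hz=SYMBOL_RATE_HZ):
    """Time needed to collect n_samples at the given symbol rate."""
    return 1e3 * n_samples / rate_hz


def coherence_time_ms(doppler_hz=DOPPLER_HZ):
    """Coherence time from the approximation Tc = 1 / (2 * f_D)."""
    return 1e3 / (2.0 * doppler_hz)


def meets_rule_of_thumb(n_samples, n_elements=N_ELEMENTS, factor=3):
    """Check the N_samples >= factor * N rule of thumb for covariance estimation."""
    return n_samples >= factor * n_elements


def scan_grid_points(az_points=180, el_points=60):
    """Number of power-spectrum evaluations in the 2D grid search."""
    return az_points * el_points


if __name__ == "__main__":
    print(duration_ms(2000), duration_ms(100), coherence_time_ms())
    print(meets_rule_of_thumb(2000), scan_grid_points())
```

Both collection windows (2 ms and 0.1 ms) fall inside the 10 ms coherence time, and 2000 samples exceed the 3N = 1875 threshold, consistent with the justification given.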
Question 3: Also, Table 1 only shows ULA = 256 elements, while the manuscript presents 625
elements
We appreciate the reviewer’s important observation.
Answer:
Corrected. Also, all mentions of 1000 snapshots are changed to 2000 samples.
Question 4: Fig. 6 compares the RL policy against a "conventional MVDR" baseline at 0.75 dB.
How was this baseline chosen, given that MVDR performance is typically scenario dependent?
We appreciate the reviewer’s important observation.
Answer:
Excellent question.
The 'conventional MVDR' baseline (dashed red line at 0.75 dB in Figure 6) represents current
state-of-practice beamforming without intelligent user assignment. Its configuration is:
User Assignment Rule:
• Nearest base station selection: Each user k is assigned to the BS i that minimizes the
Euclidean distance: arg min_i ||p_k - q_i||,
where p_k is the user position and q_i is the BS position.
• This is a static, deterministic rule requiring no learning or optimization.
• Commonly used in practical systems due to simplicity.
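The nearest-BS rule above amounts to an argmin over Euclidean distances. A minimal sketch follows, using the two BS positions stated in the response. The user coordinates in the usage lines are hypothetical, since the actual user placement is not listed.

```python
import numpy as np

# BS positions as given in the response: BS1 at (0, 0, 0), BS2 at (100, 0, 0).
BS_POSITIONS = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]])


def nearest_bs(user_positions, bs_positions=BS_POSITIONS):
    """Assign each user to the BS index that minimizes Euclidean distance."""
    users = np.atleast_2d(np.asarray(user_positions, dtype=float))
    # Pairwise distances, shape (num_users, num_bs).
    dists = np.linalg.norm(users[:, None, :] - bs_positions[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


if __name__ == "__main__":
    # Hypothetical user positions inside the 100 m x 50 m area.
    users = [[15.0, 0.0, 0.0], [80.0, 20.0, 0.0], [48.0, 30.0, 0.0]]
    print(nearest_bs(users))
```

The rule depends only on geometry, so it is deterministic for static users and takes no account of interference or load.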
Beamforming Method:
• Uses the equations listed for DOA estimation and the equation for weight computation
• Same parameters: 2000 samples for covariance, 28 GHz carrier, 625-element URA
• Identical channel conditions: Same Rayleigh fading, AWGN (3 dB SNR), path loss
model
Why 0.75 dB specifically?
This value is not arbitrarily chosen but rather the measured performance of the nearest-BS +
MVDR policy in our specific scenario:
Scenario details affecting baseline performance:
1. User distribution: 10 users randomly placed in a 100m × 50m area (seed fixed for
reproducibility)
2. BS positions: BS1 at (0, 0, 0), BS2 at (100, 0, 0)
3. Resulting assignments: Nearest-BS rule assigns six users to BS1, four users to BS2
4. Interference level: User 3 and User 7 experience substantial mutual interference (both
near x=50m boundary)
5. Path loss impact: User 10 assigned to BS2 has 65m distance → 36 dB path loss
Measured SINR distribution (baseline):
• Best user (User 1, 15 m from BS1): 3.2 dB
• Worst user (User 10, 65 m from BS2): -2.1 dB
• Average across 10 users: 0.75 dB
Why this is a fair baseline:
1. Scenario-dependent acknowledgment: You are correct that MVDR performance varies
with scenario. We report the baseline SINR specific to our scenario (Figure 1 topology).
Different user placements would yield different baseline values.
2. Reproducible comparison: Both the baseline and RL policy are evaluated in the same
scenario (same user positions, channel realizations, noise). The RL improvement (0.75
dB → 1.15 dB) reflects learning better assignments for this specific topology.
3. Standard practice: The Nearest-BS assignment is widely used in the literature and in
practical systems (3GPP standards specify similar criteria). It represents a strong baseline
that accounts for path loss but ignores interference and load.
4. Apples-to-apples MVDR: Both baseline and RL use an identical MVDR implementation.
The only difference is the assignment policy (deterministic vs. learned).
Baseline variations across episodes:
We note that the baseline SINR (0.75 dB) remains constant across episodes (flat red line in Fig.
6) because:
• User positions are static in our scenario.
• Nearest-BS assignment is deterministic (no randomness).
• Channel statistics are stationary (Rayleigh fading averaged over 2000 samples).
In contrast, the RL policy SINR varies during training as the exploration-exploitation balance
evolves.
Question 5: A 25 x 25 URA MVDR requires inversion of a 625 x 625 matrix. What is the
computational cost?
Excellent question.
Answer:
Computational Analysis (Intel Xeon Gold 6248R, 3.0 GHz):
Operation | Complexity | Measured Time
Covariance | O(N²×Nsamples) | 12.3 ms
Matrix inversion | O(N³/3) Cholesky | 8.7 ms
MVDR weights | O(N²) | 0.4 ms
Total per user | - | 21.4 ms
System-level (2 BSs, 10 users):
• Per iteration: 63 ms (42.8 ms MVDR parallelizable + 20 ms beam scanning)
• Per episode (150 iterations): 9.45 seconds
• Full training (200 episodes): 31.5 minutes
• Real-time inference: <1 ms (Q-table lookup)
With GPU + incremental updates: <2 ms per user
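The totals in this timing analysis follow from simple arithmetic on the measured values. The sketch below re-derives them. It introduces no new measurements: the constants are the figures reported above and the names are hypothetical.

```python
# Measured per-user times from the table above, in milliseconds.
PER_USER_MS = {"covariance": 12.3, "inversion": 8.7, "weights": 0.4}


def per_user_total_ms(timings=PER_USER_MS):
    """Sum of the per-user MVDR stages."""
    return sum(timings.values())


def training_budget(per_iteration_ms=63.0, iterations=150, episodes=200):
    """Return (seconds per episode, minutes for full training)."""
    episode_s = per_iteration_ms * iterations / 1e3
    training_min = episode_s * episodes / 60.0
    return episode_s, training_min


def cholesky_flops(n):
    """Leading-order operation count N^3 / 3 for a Cholesky factorization."""
    return n ** 3 / 3


if __name__ == "__main__":
    print(per_user_total_ms())
    print(training_budget())
    print(cholesky_flops(625))
```

The sums reproduce 21.4 ms per user, 9.45 s per episode, and 31.5 minutes for 200 episodes.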
Question 6: Several typos should be fixed for a better reading experience.
Answer:
We sincerely apologize for the typographical errors and have conducted a thorough proofreading
of the entire manuscript.
We believe that these revisions substantially strengthen the manuscript and comprehensively
address all reviewer concerns. We appreciate the opportunity to revise our work and hope that
the improved version will be suitable for publication.
Thank you for your time and consideration.
Sincerely,
Dr. Ali Othman
(on behalf of all authors)
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Thank you for the authors' detailed response. All issues have been resolved, and I have no further concerns.
Comments on the Quality of English Language
English is satisfactory.
Reviewer 2 Report
Comments and Suggestions for Authors
The revised manuscript looks good to me.
