Article

Hierarchical MAB Framework for Energy-Aware Beam Training for Near-Field Communications

by Yunxing Xiang 1, Yi Yan 1, Yunchao Song 2,*, Jing Gao 1, Xiaohui You 1, Jun Wang 1, Huibin Liang 2 and Yixin Jiang 2
1 Supply Service Management Center, State Grid Jiangxi Electric Power Co., Ltd., Nanchang 330001, China
2 College of Electronic and Optical Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(1), 60; https://doi.org/10.3390/s26010060
Submission received: 10 November 2025 / Revised: 5 December 2025 / Accepted: 18 December 2025 / Published: 21 December 2025

Abstract

For XL-MIMO multi-user frequency division duplex systems, this paper proposes a near-field beam training scheme using a two-phase combinatorial multi-armed bandit (MAB) framework. This scheme leverages the MAB framework, integrating energy-aware user scheduling and hierarchical beam training to balance communication quality and device battery level, thereby effectively enhancing system energy efficiency and extending the device’s lifespan. Specifically, in the first phase, we account for user battery levels by designing an energy-aware upper confidence bound (UCB) algorithm for user scheduling. This algorithm effectively balances exploration and exploitation, prioritizing users with higher achievable rates and sufficient battery level. In the second phase, based on the scheduled users, two UCB algorithms are employed for beam training. In the first layer, discrete Fourier transform codebook-based beam scanning is utilized, and a UCB algorithm is applied to initially acquire angle information for scheduled users. In the second layer, based on the obtained angle information, a candidate set of polar-domain codewords is constructed. Another UCB algorithm is then employed to select the optimal polar-domain codewords. The effectiveness of our scheme is confirmed by simulations, demonstrating notable achievable rate gains for multi-user communications.

1. Introduction

The expanded spatial degrees of freedom in Extremely Large-Scale Multiple-Input Multiple-Output (XL-MIMO) facilitate simultaneous improvements in frequency utilization, channel capacity, and transmission delays, paving the way for advanced wireless architectures [1,2,3]. However, the substantial increase in array scale also presents new theoretical and practical challenges. In particular, as users move into the near-field region, the planar wave approximation underlying traditional far-field schemes no longer holds, and wavefront curvature must be modeled explicitly. This shift requires adaptations in critical processes such as channel modeling, beam training, and user scheduling [4,5,6].
To tackle the excessive complexity and overhead associated with near-field beam training, researchers have recently proposed various schemes aimed at reducing resource consumption while maintaining accuracy. For example, by modeling inter-user relationships with a Graph Neural Network, the three-phase method proposed in [7] for multi-user XL-MIMO systems achieves higher beam estimation accuracy and substantially lower pilot overhead. Another widely studied approach adopts a two-stage hierarchical search framework, such as using a central sub-array for coarse angular search to determine the general user direction, followed by refined estimation in the polar domain that jointly optimizes angle and distance [8]. Building on this, Lu et al. proposed a multi-resolution codebook design [9], enabling two-dimensional hierarchical near-field beam training through wide-beam coarse search followed by progressive refinement, thereby avoiding the high overhead of an exhaustive search. A substantial reduction in training overhead was achieved in [10] through an auxiliary codebook based on spatial chirp beams. This method analyzes beam patterns in the slant-intercept domain to capitalize on the coupling relationship between the angle and distance. More recently, a sparse discrete Fourier transform (DFT) codebook-based three-stage beam training method has been introduced: first performing sparse scanning in the angular subspace to select candidate angles, then activating the central sub-array to resolve angle ambiguity, and finally searching for the optimal distance within the estimated angle using a polar-domain codebook [11]. These studies have laid an important foundation for transitioning near-field XL-MIMO systems from theory to practice. However, most existing studies focus primarily on the beam training process within a single coherence block and fail to fully exploit the correlation of channel characteristics across different coherence blocks. In contrast, statistical channel state information (CSI) often remains relatively stable over longer time scales [12], offering new opportunities for designing low-overhead beam training schemes.
Meanwhile, within the context of near-field communications for XL-MIMO systems, the multi-user scheduling problem also requires re-evaluation. A variety of specialized strategies have been introduced for XL-MIMO systems, seeking to optimize system performance through different technical avenues. By integrating coalition-search-based scheduling with optimal power allocation, the work in [13] addresses joint user scheduling and power allocation with quality-of-service awareness. This approach improves user accommodation in dense networks while guaranteeing minimum user rates. Building upon this, to better adapt to channel uncertainties in practical systems, the authors of [14] considered the impact of imperfect CSI, employed statistical CSI (S-CSI) for user scheduling, and jointly designed precoding to improve overall spectral efficiency and reduce training overhead. Recently, user scheduling mechanisms have begun to integrate deeply with specific application scenarios. The work in [15] focuses on maritime unmanned aerial vehicle (UAV)-assisted communication, deriving orthogonal positions in the near-field region and jointly optimizing UAV altitude and user scheduling to effectively reduce channel interference and improve the system sum rate. Beyond channel-state-based methods, geometry-based scheduling strategies have garnered attention due to their effectiveness in reducing complexity. The authors of [16] proposed a distance-based scheduling mechanism that classifies users based on their equivalent distance relative to the antenna array, significantly reducing computational complexity while mitigating inter-user interference. A similar interference management approach has been extended to multi-cell systems. The authors of [17] proposed a novel user scheduling strategy for multi-cell near-field MIMO systems, clustering users from the perspectives of both inter-user interference and inter-cell interference, thereby achieving effective management of system interference and a reduction in scheduling complexity. However, these existing user scheduling schemes generally overlook the energy status of user equipment and fail to incorporate terminal battery level constraints into the scheduling decision process [18]. This limitation becomes particularly critical in practical energy-constrained communication scenarios. It may lead to frequent scheduling of users with an insufficient battery level, thereby accelerating their energy depletion, shortening device lifespan, and potentially causing connection interruptions, ultimately compromising overall system energy efficiency and user experience.
Existing beam training schemes for near-field XL-MIMO face several limitations. Symmetric beam training methods suffer from beam absence at certain angular regions [19]. Array partition techniques can mitigate this but require extra calibration beams, increasing training overhead [20]. For user scheduling, many approaches lack adaptability or key practical considerations. Supervised learning models [21] cannot handle dynamic environments, while deep reinforcement learning methods [22] often ignore terminal energy constraints. Even near-field scheduling schemes using statistical CSI [23] focus solely on received energy, lacking a joint energy-distance awareness.
In this context, this paper investigates integrated multi-user beam training and scheduling strategies tailored for XL-MIMO. We propose a two-phase combinatorial multi-armed bandit (MAB) beam training framework. In the first phase, we propose an energy-aware upper confidence bound (UCB) algorithm to address the user scheduling problem. This algorithm utilizes the average achievable rate as the reward, jointly considering user channel quality and battery level. The energy-aware UCB (EA-UCB) value consists of three components: the historical average achievable rate, an uncertainty term, and a penalty term related to the remaining battery level. Maximizing the EA-UCB value selects the users that are most likely to deliver superior rates within each coherence block, thereby facilitating efficient scheduling. In the first layer of the second phase, we employ DFT codebook-based beam scanning and apply a Kullback–Leibler UCB (KL-UCB) algorithm to obtain angle information for the scheduled users. Subsequently, with the acquired angle information, the second layer constructs a corresponding candidate set of polar-domain codewords. Again using a KL-UCB algorithm with beam gain as the reward, we select the polar-domain codeword that maximizes the average beam gain for the scheduled users. The contributions are summarized as follows:
  • An energy-aware user scheduling algorithm is developed that addresses the common oversight of terminal energy efficiency in existing mechanisms. By incorporating the residual energy status of devices into the UCB strategy as a weighting factor, our approach dynamically prioritizes users with favorable channel conditions and sufficient energy. This strategy not only balances exploration and exploitation, but also significantly enhances overall network energy efficiency, extends the operational lifespan of battery-constrained devices, and improves service fairness in energy-heterogeneous scenarios.
  • This paper formulates an integrated framework that co-optimizes user scheduling and near-field beam training (US-NFBT), overcoming the limitations of conventional decoupled designs. The proposed method employs a two-phase combinatorial MAB (C-MAB) model to first carry out energy-aware user selection, followed by a dual-layer beam training mechanism. This hierarchical process begins with coarse-grained angle estimation and advances to fine-grained polar-domain beam training, leading to a substantial reduction in training overhead. As a result, the proposed scheme effectively mitigates the resource consumption problem introduced by the additional ranging requirement inherent to near-field communications.
  • A codebook-based non-reciprocal beam training scheme is designed. By efficiently acquiring downlink channel information using a triple UCB strategy with integrated device battery awareness, a practical and energy-efficient multi-user near-field communication solution is offered. This approach not only improves beam training performance, but also extends device battery life, thereby facilitating the real-world deployment of frequency division duplex (FDD) XL-MIMO systems.
The proposed framework directly addresses these limitations by integrating several synergistic innovations. A dual-layer KL-UCB beam training structure progressively narrows the search space, effectively pruning redundant beams to reduce overhead. Simultaneously, a novel energy-aware UCB scheduler incorporates a distance–energy penalty term, dynamically balancing the communication load to prolong network lifetime. Furthermore, by leveraging the temporal stability of statistical CSI instead of instantaneous CSI, the framework minimizes the need for frequent and costly channel estimation. Together, these design choices form a cohesive and practical solution for low-overhead, energy-efficient multi-user communications in the near-field.

2. System Model

2.1. Communication Scenario and Channel Model

As shown in Figure 1, this work studies an FDD XL-MIMO downlink system. The base station (BS) is equipped with a uniform linear array (ULA) with $N$ antennas and $M_{\mathrm{RF}}$ RF chains. The ULA is aligned with the y-axis, where the coordinate of the n-th antenna element (for $n = 1, \ldots, N$) is $(0, \delta_n d)$, with antenna spacing $d = \lambda/2$ defined by the carrier wavelength $\lambda$. The BS communicates with $K$ users, all located within the array near-field, i.e., their distances from the BS are less than the Rayleigh distance [9]. Each user has one antenna. The channel model adopts the spherical wave assumption. Considering a channel with multiple scattering clusters, the channel vector $\mathbf{h}_t^k$ from the BS to user k during the t-th coherence block is [24]
$$\mathbf{h}_t^k = a_k \sum_{s=1}^{S} \sum_{l=1}^{L_s} \frac{\lambda\, g_l^s\, e^{-j\frac{2\pi}{\lambda}(r_l^s + \mu_l^s) + j w_l^s}}{(4\pi)^{3/2}\, r_l^s\, \mu_l^s}\, \mathbf{b}(\theta_l^s, r_l^s), \tag{1}$$
where $a_k = \sqrt{N / \sum_{s=1}^{S} L_s}$ is a normalization constant that ensures per-user power normalization by accounting for the total number of paths across all clusters associated with user k, $S$ is the number of clusters, $L_s$ is the number of paths within the s-th cluster, $\mu_l^s$ is the distance between the reference point and the user, and $w_l^s$ is a phase shift, modeled as a random variable uniformly distributed over $[-\pi, \pi)$. $g_l^s \sim \mathcal{CN}(0, (\sigma_l^s)^2)$ is the l-th path gain in the s-th cluster, where $\sigma_l^s$ is uniformly distributed in $[0, 1]$. The channel model in Equation (1) is adapted from Equations (33)–(34) in [24], with modifications to accommodate single-antenna users in our system. Unlike [24], where users employ multiple antennas, our model eliminates the user-side steering vector, thus simplifying the expression.
The spherical wave array steering vector in the near-field, denoted as $\mathbf{b}(\theta_l^s, r_l^s)$, is
$$\mathbf{b}(\theta_l^s, r_l^s) = \frac{1}{\sqrt{N}} \left[ e^{-j\frac{2\pi}{\lambda}\left(r_l^{s,(0)} - r_l^s\right)}, \ldots, e^{-j\frac{2\pi}{\lambda}\left(r_l^{s,(N-1)} - r_l^s\right)} \right]^T, \tag{2}$$
where $r_l^{s,(n)} = \sqrt{\left(r_l^s \sqrt{1 - (\sin\theta_l^s)^2} - 0\right)^2 + \left(r_l^s \sin\theta_l^s - nd\right)^2} = \sqrt{(r_l^s)^2 + (nd)^2 - 2 n d\, r_l^s \sin\theta_l^s}$ gives the propagation path length from the n-th BS antenna to the scatterer in the s-th cluster, while $r_l^s$ and $\theta_l^s$ represent the l-th path’s distance and angle parameters, respectively. This differs from the cosine term in ([24], Equation (2)), which describes the distance from the user’s receiving antenna to the cluster.
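As an illustration of Equation (2), the following minimal sketch builds the spherical-wave steering vector with the Python standard library and compares it with its planar-wave limit. The function names, the antenna count, the wavelength, and the two test ranges are hypothetical choices made for this example, and the negative sign in the phase is an assumed convention.

```python
import cmath
import math


def element_distance(r, theta, n, d):
    """Path length from the n-th antenna (offset n*d on the array axis) to a point at (r, theta)."""
    return math.sqrt(r * r + (n * d) ** 2 - 2 * n * d * r * math.sin(theta))


def near_field_steering(theta, r, num_antennas, wavelength):
    """Spherical-wave steering vector b(theta, r); the minus sign in the phase is an assumption."""
    d = wavelength / 2  # half-wavelength spacing, as in the system model
    scale = 1 / math.sqrt(num_antennas)
    k0 = 2 * math.pi / wavelength
    return [
        scale * cmath.exp(-1j * k0 * (element_distance(r, theta, n, d) - r))
        for n in range(num_antennas)
    ]


def far_field_steering(theta, num_antennas, wavelength):
    """Planar-wave limit, using r^(n) - r ~ -n*d*sin(theta) for large r."""
    d = wavelength / 2
    scale = 1 / math.sqrt(num_antennas)
    k0 = 2 * math.pi / wavelength
    return [scale * cmath.exp(1j * k0 * n * d * math.sin(theta)) for n in range(num_antennas)]


def correlation(u, v):
    """Magnitude of the inner product u^H v."""
    return abs(sum(a.conjugate() * b for a, b in zip(u, v)))


# Hypothetical parameters: 64 antennas, 1 cm wavelength, 30 degree angle.
N, WAVELENGTH, THETA = 64, 0.01, math.pi / 6
far = far_field_steering(THETA, N, WAVELENGTH)
corr_far_range = correlation(near_field_steering(THETA, 1e5, N, WAVELENGTH), far)
corr_near_range = correlation(near_field_steering(THETA, 3.0, N, WAVELENGTH), far)
```

At a very large range the two vectors are almost perfectly correlated, whereas at a range well inside the Rayleigh distance the correlation drops sharply, which is why wavefront curvature has to be modeled in the near-field.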
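The received signal $\mathbf{y}_t = [y_{1,t}, \ldots, y_{K,t}]^T \in \mathbb{C}^{K \times 1}$ in the t-th coherence block is
$$\mathbf{y}_t = \mathbf{H}_t^H \mathbf{W}_t \mathbf{s}_t + \mathbf{n}_t, \tag{3}$$
where $\mathbf{H}_t = [\mathbf{h}_t^1, \ldots, \mathbf{h}_t^K]$ is the channel matrix, $\mathbf{W}_t = \mathbf{D}_t \mathbf{V}_t = [\mathbf{w}_{1,t}, \ldots, \mathbf{w}_{K,t}] \in \mathbb{C}^{N \times K}$ is the beamforming matrix, $\mathbf{D}_t \in \mathbb{C}^{N \times |\mathcal{G}_t|}$ is the codebook-based analog beamforming matrix designed through beam training, $\mathbf{V}_t$ is a digital precoding matrix, which can be designed based on the zero-forcing criterion [25], $\mathbf{s}_t = [s_1, \ldots, s_K]^T \in \mathbb{C}^{K \times 1}$ is the signal vector transmitted from the base station obeying $\mathbb{E}(\mathbf{s}_t \mathbf{s}_t^H) = \mathbf{I}$, and $\mathbf{n}_t \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I})$ is the additive white Gaussian noise (AWGN).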

2.2. Energy Model

The system-wide power consumption comprises two components: the transmit power consumption and the fixed circuit power consumption. Specifically, under the considered communication scenario and channel model, the system-wide power consumption for user k follows the model in [26]
$$\xi_k^{\mathrm{total}} = \gamma \|\mathbf{w}_{k,t}\|_2^2 + \xi_{\mathrm{system}}, \tag{4}$$
where $1/\gamma$ denotes the power amplifier efficiency, $\mathbf{w}_{k,t}$ represents the beamforming vector for user k at the t-th coherence block, and $\xi_{\mathrm{system}}$ denotes the system circuit power.
The user scheduling decision is represented by a binary variable $G_{k,t}$ [27]
$$G_{k,t} = \begin{cases} 1, & \text{if user } k \text{ is selected}; \\ 0, & \text{otherwise}. \end{cases} \tag{5}$$
The subset of scheduled users is denoted by $\mathcal{G}_t = \{G_{\phi_{1,t},t}, \ldots, G_{\phi_{j,t},t}\}$, where $\phi_{j,t} \in \{1, 2, \ldots, K\}$, $|\mathcal{G}_t| \le M$, and $M$ is the maximum number of schedulable users.
The residual energy of a scheduled user k is updated according to [18]
$$\Psi_{k,t} = \Psi_{k,t-1} - \xi_k^{\mathrm{total}} \frac{L_{\mathrm{data}}}{W_{\mathrm{bw}} E_{k,t}}, \tag{6}$$
where $L_{\mathrm{data}}$ denotes the number of bits of the requested data obtained by scheduled user k per coherence block, $W_{\mathrm{bw}}$ denotes the bandwidth, and $E_{k,t}$ represents the effective achievable rate (EAR) [28], which is
$$E_{k,t} = G_{k,t} \left( \frac{T_{\mathrm{total}} - T_{B,t}}{T_{\mathrm{total}}} \right) \log_2\left( 1 + \frac{P_{\mathrm{signal}}}{P_{\mathrm{interference}} + \sigma^2} \right), \tag{7}$$
where $T_{B,t}$ and $T_{\mathrm{total}}$ denote the beam training overhead and the total symbol count per coherence block, respectively. $P_{\mathrm{signal}} = |(\mathbf{h}_t^k)^H \mathbf{w}_{k,t}|^2$ represents the received signal power, and $P_{\mathrm{interference}} = \sum_{n \neq k} |(\mathbf{h}_t^k)^H \mathbf{w}_{n,t}|^2$ denotes the aggregate interference power from other users. Then, the total EAR is $E_t = \sum_{k=1}^{K} E_{k,t}$. The term $\xi_k^{\mathrm{total}} \frac{L_{\mathrm{data}}}{W_{\mathrm{bw}} E_{k,t}}$ models the energy consumed to transmit $L_{\mathrm{data}}$ bits of data. Note that if $\Psi_{k,t} < \Psi_{\mathrm{th}}$, user k will not be scheduled in subsequent coherence blocks, where $\Psi_{\mathrm{th}}$ is a predefined energy threshold. For the subsequent analysis, we set $\tilde{E}_{k,t} = \log_2\left( 1 + \frac{P_{\mathrm{signal}}}{P_{\mathrm{interference}} + \sigma^2} \right)$.
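The effective achievable rate in Equation (7) can be sketched as follows. This is a minimal illustration only: the function name and the numerical values are hypothetical, and the powers are passed in directly rather than computed from channel and beamforming vectors.

```python
import math


def effective_achievable_rate(scheduled, t_train, t_total, p_signal, p_interference, noise_power):
    """EAR of one user: scheduling indicator x overhead factor x log2(1 + SINR)."""
    if not scheduled:
        return 0.0  # G_{k,t} = 0 for users that are not scheduled
    overhead_factor = (t_total - t_train) / t_total  # fraction of the block left for data
    sinr = p_signal / (p_interference + noise_power)
    return overhead_factor * math.log2(1 + sinr)


# Hypothetical example: 10 of 100 symbols spent on beam training, SINR of 3.
ear_low_overhead = effective_achievable_rate(True, 10, 100, 3.0, 0.0, 1.0)
ear_high_overhead = effective_achievable_rate(True, 50, 100, 3.0, 0.0, 1.0)
```

With the same SINR, a larger training overhead directly scales down the rate, which is the trade-off the beam training design has to manage.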
The energy consumption model employed in this work follows established formulations adopted in the literature [18,29,30]. As described by Equation (4), the system-wide power consumption per user incorporates a transmit power term $\gamma \|\mathbf{w}_{k,t}\|_2^2$, proportional to the beamforming gain and directly modeling the dominant consumption of the radio frequency power amplifier [31], alongside a constant circuit power component $\xi_{\mathrm{system}}$ accounting for the static power draw of essential baseband and frequency synthesis circuits. This physical consistency is maintained in the user-side residual energy update of Equation (6), where the total energy consumed for transmitting a fixed data volume $L_{\mathrm{data}}$ depends on the required transmission time $\frac{L_{\mathrm{data}}}{W_{\mathrm{bw}} E_{k,t}}$. Since this transmission time is inversely proportional to the effective achievable rate $E_{k,t}$, the model captures a key practical energy efficiency principle: poor channel conditions (lower rates) necessitate longer transmission times and thus higher energy consumption, accelerating battery drain, whereas favorable conditions enable more energy-efficient data transfer.
The model is both physically grounded and adaptable to various terminal types through parameter adjustment. For battery-limited IoT devices [29], it accurately reflects their typical energy profile, enabling scheduling algorithms to extend network lifetime by managing the energy expenditure of nodes. For more complex devices like smartphones, the circuit power parameter ξ system can be calibrated to represent their higher active power, allowing the scheduler to balance consumption across diverse users. Overall, this formulation uses a concise set of meaningful parameters to facilitate energy-aware scheduling across a wide range of practical wireless scenarios.
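The energy model of Equations (4) and (6) can be illustrated with a short sketch. All names and numerical values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def total_power(gamma, beam_norm_sq, circuit_power):
    """Equation (4): amplifier-scaled transmit power plus fixed circuit power."""
    return gamma * beam_norm_sq + circuit_power


def update_residual_energy(prev_energy, power, data_bits, bandwidth_hz, ear):
    """Equation (6): subtract power x transmission time, with time = L_data / (W_bw * E)."""
    if ear <= 0:
        raise ValueError("the effective achievable rate must be positive for a scheduled user")
    tx_time = data_bits / (bandwidth_hz * ear)
    return prev_energy - power * tx_time


# Hypothetical values: 3 W total power, 1 Mbit of data, 1 MHz bandwidth.
power = total_power(gamma=2.5, beam_norm_sq=1.0, circuit_power=0.5)
energy_good_channel = update_residual_energy(100.0, power, 1e6, 1e6, ear=4.0)
energy_poor_channel = update_residual_energy(100.0, power, 1e6, 1e6, ear=1.0)
```

The user with the lower rate needs four times the transmission time for the same data volume and therefore loses four times as much energy, matching the principle described above.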

2.3. Problem Formulation

Most conventional near-field beam training methods overlook user energy disparities and lack integrated scheduling. To address this, we introduce a scheduling strategy that selects users for service during each coherence block. This selection jointly considers two factors: channel quality and residual energy levels. The aim is to maximize the total EAR throughout the S-CSI invariance interval.
To this end, it is imperative to have an effective user scheduling strategy coupled with a well-designed beamforming matrix. To mitigate CSI acquisition overhead, the adoption of codebook-based analog beamforming is widespread in terrestrial systems [32]. Typically, such an analog beamforming matrix is constructed from a predefined codebook. The polar-domain codebook, as a predefined codebook choice, offers a distinct advantage over its DFT-based counterpart for the near-field model [33]. In the polar-domain codebook $\mathcal{D}^{\mathrm{n}}$, the i-th codeword $\mathbf{b}(\theta_i, r_i)$ is formulated using the near-field steering vector.
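Assuming S-CSI remains constant over $T$ coherence blocks, the problem is formulated as
$$\begin{aligned} \mathcal{P}: \max_{\mathbf{W}_t,\, G_{k,t}} \quad & \sum_{t=1}^{T} \sum_{k=1}^{K} E_{k,t} && (8) \\ \text{s.t.} \quad & \sum_{k} G_{k,t} \le M, && (8\text{a}) \\ & \Psi_{k,t} \ge \Psi_{\mathrm{th}}, \ \forall k \in \{1, 2, \ldots, K\}, && (8\text{b}) \\ & \mathbf{d}_{k,t} \in \mathcal{D}^{\mathrm{n}}, \ \forall k \in \{1, 2, \ldots, K\}. && (8\text{c}) \end{aligned}$$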
To reduce computational complexity and training overhead, we propose a two-phase scheme that jointly performs user scheduling and optimal codeword selection. In the first phase, a user subset is selected through scheduling. In the second phase, the codebook-based analog beamforming matrix $\mathbf{D}_t$ is optimized for this subset. We formulate this joint design problem and introduce a MAB-based beam training scheme as our solution, detailed in the subsequent section.

3. Two-Phase Near-Field Beam Training Scheme with User Scheduling

The MAB is a reinforcement learning framework for sequential decision-making that maximizes the cumulative reward through an explore–exploit trade-off. This framework has been successfully applied to various challenges in wireless communications, such as user scheduling and beam training. Examples include using deep contextual bandits for near-optimal mmWave beam selection [34], MAB-based pilot allocation for efficient beam alignment [35], and contextual MAB for dynamic user scheduling in massive MIMO with fairness considerations [36]. A recent advancement employs a hierarchical MAB with a dual-UCB strategy for joint beamforming and user scheduling in low earth orbit satellite networks, enabling pilot-free acquisition of S-CSI from historical data to greatly enhance net spectral efficiency [23]. The aforementioned works demonstrate the potential of MAB-based solutions in addressing key challenges within the field of wireless communications.
Inspired by these developments, we introduce a C-MAB-based beam training scheme that operates in two phases. Our scheme, depicted in Figure 2, operates within each coherence block through a sequence of four steps: user scheduling, beam training, digital precoding, and data transmission. Specifically, in the first phase, the scheduling strategy is determined by both the channel quality and the residual energy levels of the users. The second phase subsequently performs angular scanning followed by polar-domain scanning. This decomposition breaks down the original optimization problem P into three coupled subproblems: user scheduling, angular scanning, and polar-domain scanning.

3.1. EA-UCB-Based User Scheduling

In recent years, UCB and its variants have found widespread applicability in wireless network optimization. Examples include correlation-aware link rate selection (related to Min-UCB) [37], outage-based meta-scheduling for downlink systems [38], and multi-agent task offloading algorithms in edge computing [39]. Motivated by these advances, in the first phase, we propose an EA-UCB algorithm for user scheduling. Our EA-UCB algorithm dynamically selects an optimal user subset in each coherence block, achieving joint optimization of communication performance and energy consumption. Unlike conventional UCB methods that focus solely on channel state or throughput, the proposed EA-UCB incorporates multiple dimensions, including instantaneous achievable rate, historical scheduling experience, and residual battery level into a unified decision-making framework.
For each user k, the scheduling merit metric (EA-UCB value) [40] is defined as follows
$$\mathrm{EU}_k(t) = \underbrace{\bar{E}_{k,t} + \sqrt{\frac{B \ln t}{C_{k,t}}}}_{(a)} - \underbrace{\frac{r_k}{\Psi_{k,t}}}_{(b)}, \tag{9}$$
where $\bar{E}_{k,t}$ represents the historical average achievable rate (excluding the overhead term) of user k up to the t-th coherence block, $C_{k,t}$ gives the total scheduling count for user k up to the t-th coherence block, $B$ is an exploration coefficient that controls the degree of exploration, and $r_k$ indicates a normalized parameter based on the distance between user k and the BS.
In Equation (9), component (a) embodies the core principle of the classical UCB algorithm, which seeks to balance exploitation of known information and exploration of new alternatives during decision-making. Here, $\bar{E}_{k,t}$ favors users with a historically good performance, representing exploitation. The second term, $\sqrt{B \ln t / C_{k,t}}$, serves as an exploration bonus, where $C_{k,t}$ denotes the number of times user k has been scheduled so far. Because this term shrinks as the scheduling count grows, it encourages sampling less-scheduled users, thereby helping to discover potentially superior options. Collectively, component (a) acts as an optimistic estimate of user performance, promoting both the selection of top-performing users and sufficient exploration of less frequently scheduled ones, which enables the method to effectively converge to an optimal user combination in dynamic propagation conditions.
This subsection formally defines the distance–energy ratio as a key metric, denoted as component (b): $r_k / \Psi_{k,t}$. This metric integrates energy availability and distance-related path loss into a proportional penalty term. A higher value of $r_k / \Psi_{k,t}$ indicates that a user is either severely energy-constrained or located farther from the base station, resulting in a lower scheduling priority. This strategy enhances energy efficiency and extends network longevity without sacrificing system throughput. By jointly considering near-field path loss and energy consumption, the approach reduces the priority given to users with poor channel conditions or insufficient energy, leading to more efficient resource allocation.
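Then, the user scheduling problem is formulated as follows
$$\begin{aligned} \mathcal{P}_1: \arg\max_{\mathcal{G}_t} \quad & \bar{E}_{k,t} + \sqrt{\frac{B \ln t}{C_{k,t}}} - \frac{r_k}{\Psi_{k,t}} && (10) \\ \text{s.t.} \quad & |\mathcal{G}_t| \le M, && (10\text{a}) \end{aligned}$$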
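where the number of users scheduled at the t-th coherence block is $M_t = |\mathcal{G}_t|$. Algorithm 1 details the proposed EA-UCB-based user scheduling strategy. Given that $\mathbf{R}_k = \mathbb{E}\{\mathbf{h}_t^k (\mathbf{h}_t^k)^H\}$, $\mathbb{E}\{P_{\mathrm{signal}}\} \le \mathrm{Trace}(\mathbf{R}_k) \le N$, and $\mathbb{E}\{P_{\mathrm{interference}}\} \ge 0$, it follows that $\mathbb{E}\{\tilde{E}_{k,t}\} \le \log_2\left(1 + \frac{N}{\sigma^2}\right)$. Thus, we set $\bar{E}_{k,1} = \log_2\left(1 + \frac{N}{\sigma^2}\right)$ for all k.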
Remark 1.
Users whose EA-UCB values are negative or whose residual energy falls below the threshold should be excluded to avoid inefficient scheduling of energy-depleted devices.
With pre-screening conditions that mandate non-negative EA-UCB values and residual energy above a threshold, the algorithm efficiently disqualifies users not meeting basic criteria, thus shrinking the solution space and boosting real-time performance. The embedding of energy state into the scheduling value function allows the EA-UCB method to achieve higher system energy efficiency and longer device operational life, without greatly increasing computational complexity.
Algorithm 1 EA-UCB-based user scheduling
Input: $\Psi_{\mathrm{th}}$, $M$, $r_k$, $\Psi_{k,1}$
1: Initialize $\bar{E}_{k,1} = \log_2(1 + N/\sigma^2)$ and $C_{k,1} = 1$ for all $k$
2: Initialize $\mathcal{G}_t \leftarrow \emptyset$, $\mathcal{V}_t \leftarrow \emptyset$ {Initialize valid user set}
3: for $t = 2$ to $T$ do
4:    for each user $k$ do
5:       if $\Psi_{k,t} \ge \Psi_{\mathrm{th}}$ then
6:          Compute $\mathrm{EU}_k(t) = \bar{E}_{k,t} + \sqrt{B \ln t / C_{k,t}} - r_k / \Psi_{k,t}$
7:          if $\mathrm{EU}_k(t) > 0$ then
8:             Add $k$ to $\mathcal{V}_t$
9:          end if
10:      else
11:         $\mathrm{EU}_k(t) \leftarrow -\infty$
12:      end if
13:   end for
14:   $M_t = \min(M, |\mathcal{V}_t|)$ {Determine number of users to schedule}
15:   Select top $M_t$ users from $\mathcal{V}_t$ with highest $\mathrm{EU}_k(t)$ to form $\mathcal{G}_t$
16:   Run Algorithm 2 to get the selected polar-domain codewords
17:   Compute achievable rate (excluding the overhead term) $\tilde{E}_{k,t}$ for each scheduled user
18:   Update for each scheduled user $k \in \mathcal{G}_t$:
19:      $C_{k,t+1} \leftarrow C_{k,t} + 1$
20:      $\bar{E}_{k,t+1} \leftarrow \dfrac{\bar{E}_{k,t} C_{k,t} + \tilde{E}_{k,t}}{C_{k,t+1}}$
21:      $\Psi_{k,t+1} \leftarrow \Psi_{k,t} - \xi_k^{\mathrm{total}} \dfrac{L_{\mathrm{data}}}{W_{\mathrm{bw}} E_{k,t}}$
22: end for
Output: $\mathcal{G}_t$
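The selection step of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and all numerical values are hypothetical, the square-root form of the exploration bonus is an assumption, and the beam training and rate measurement of steps 16 and 17 are left out.

```python
import math


def ea_ucb_value(avg_rate, count, t, dist_norm, energy, explore_coef):
    """EA-UCB value: average rate + exploration bonus - distance-energy penalty.

    The square-root form of the bonus is an assumption of this sketch.
    """
    bonus = math.sqrt(explore_coef * math.log(t) / count)
    return avg_rate + bonus - dist_norm / energy


def schedule_users(t, avg_rates, counts, dist_norms, energies,
                   energy_threshold, max_users, explore_coef=2.0):
    """Return indices of up to max_users users with the largest positive EA-UCB values."""
    valid = []
    for k, energy in enumerate(energies):
        if energy < energy_threshold:
            continue  # energy-depleted users are never scheduled
        value = ea_ucb_value(avg_rates[k], counts[k], t, dist_norms[k], energy, explore_coef)
        if value > 0:
            valid.append((value, k))
    valid.sort(key=lambda item: (-item[0], item[1]))  # best value first, ties by index
    return [k for _, k in valid[:max_users]]


def update_average(avg_rate, count, new_rate):
    """Running-average update applied after a user has been served."""
    return (avg_rate * count + new_rate) / (count + 1), count + 1


# Hypothetical scenario with four users and at most two scheduled per block.
selected = schedule_users(
    t=4,
    avg_rates=[6.0, 9.0, 5.0, 5.5],
    counts=[3, 3, 3, 3],
    dist_norms=[1.0, 1.0, 1.0, 40.0],
    energies=[10.0, 0.5, 10.0, 10.0],
    energy_threshold=1.0,
    max_users=2,
)
```

In this example the user with the highest average rate is skipped because its residual energy is below the threshold, and the distant user is ranked last because of its large distance–energy penalty.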
Algorithm 2 Two-layer MAB-based beam training
Input: $\mathcal{D}^{\mathrm{f}}$, $\mathcal{D}^{\mathrm{n}}$, $\mathcal{G}_t$
1: Initialize $C_{i,t}^{\mathrm{f}} = 0$ and $C_{i,t}^{\mathrm{n}} = 0$
2: Initialize the first- and second-layer base arms
3: for $t = 2$ to $T$ do
4:    Layer 1: Angular scanning (DFT codebook)
5:    Compute $\mathrm{KU}_i^{\mathrm{f}}(t)$ via the bisection method
6:    Select $M^{\mathrm{f}}$ codewords to constitute $\mathcal{A}_t^{\mathrm{f}}$ (Remark 2)
7:    Layer 2: Polar-domain scanning (constructed polar-domain codebook)
8:    Construct candidate angle set $\bar{\Theta}$ from $\mathcal{A}_t^{\mathrm{f}}$
9:    Construct candidate codebook $\bar{\mathcal{D}}^{\mathrm{n}}$ from $\bar{\Theta}$
10:   Compute $\mathrm{KU}_i^{\mathrm{n}}(t)$ via the bisection method
11:   Select $M^{\mathrm{n}}$ codewords to constitute $\mathcal{A}_t^{\mathrm{n}}$ (similar to Remark 2)
12:   Update $C_{i,t+1}^{\mathrm{f}}$, $C_{j,t+1}^{\mathrm{n}}$, $\bar{E}_{i,t+1}^{\mathrm{f}}$, $\bar{E}_{j,t+1}^{\mathrm{n}}$ using received signals
13: end for
Output: $\mathcal{A}_t^{\mathrm{f}}$, $\mathcal{A}_t^{\mathrm{n}}$

3.2. MAB-Based Angular Scanning

We note that for the DFT codebook, the near-field channel exhibits a comparable energy distribution within angular ranges where the energy spreads towards the neighboring angles. This allows the DFT codebook to effectively perform initial angular-domain sweeping in the near-field as a practical starting point. Figure 3 shows the beam gain of the DFT codebook under far-field and near-field paths, where $\Delta\theta = \cos(\theta_c) - \cos(\theta_p)$, $\theta_c$ is the angle corresponding to the codeword in the DFT codebook, and $\theta_p$ is the angle of the path. As illustrated in Figure 3, angular-domain sweeping with the DFT codebook yields an exact angle for users in the far-field; for near-field users, it provides a broad angular range, within which subsequent polar-domain sweeping using the polar-domain codebook can precisely locate the user.
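The energy-spread effect described above can be reproduced numerically with a small sketch. The array size, wavelength, ranges, codeword sampling grid, and function names are hypothetical choices for this example, and a single line-of-sight path at broadside stands in for the channel.

```python
import cmath
import math


def near_field_steering(theta, r, n_ant, wavelength):
    """Spherical-wave steering vector (sign convention assumed, half-wavelength spacing)."""
    d = wavelength / 2
    k0 = 2 * math.pi / wavelength
    scale = 1 / math.sqrt(n_ant)
    return [
        scale * cmath.exp(-1j * k0 * (math.sqrt(r * r + (n * d) ** 2
                                                - 2 * n * d * r * math.sin(theta)) - r))
        for n in range(n_ant)
    ]


def dft_codeword(index, n_ant):
    """DFT codeword sampling the spatial direction u = sin(theta) uniformly in [-1, 1)."""
    u = -1.0 + 2.0 * index / n_ant
    scale = 1 / math.sqrt(n_ant)
    return [scale * cmath.exp(1j * math.pi * n * u) for n in range(n_ant)]


def beam_gains(channel, n_ant):
    """Beam gain |d_i^H h|^2 for every codeword of the DFT codebook."""
    gains = []
    for i in range(n_ant):
        codeword = dft_codeword(i, n_ant)
        gains.append(abs(sum(c.conjugate() * h for c, h in zip(codeword, channel))) ** 2)
    return gains


# Hypothetical setup: 128 antennas, 1 cm wavelength, broadside path.
N, WAVELENGTH = 128, 0.01
far_gains = beam_gains(near_field_steering(0.0, 1e6, N, WAVELENGTH), N)
near_gains = beam_gains(near_field_steering(0.0, 5.0, N, WAVELENGTH), N)
```

For the distant path almost all energy falls on one codeword, while for the nearby path the strongest codeword captures only a fraction of the energy, so the first layer can only deliver a candidate angular range.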
In the first layer of the second phase, this paper adopts a DFT codebook $\mathcal{D}^{\mathrm{f}}$ for angular scanning. The goal is to identify the codewords in the codebook that maximize the sum performance gain at the scheduled users, thereby providing a candidate angle set for the subsequent polar-domain codebook construction.
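Let $M^{\mathrm{f}}$ denote the training overhead, and let $\mathcal{A}_t^{\mathrm{f}}$ represent the collection of indices for the selected DFT codewords within the t-th block. Denote $\mathbf{d}_i^{\mathrm{f}}$, $i = 1, \ldots, N$, as the i-th DFT codeword and $\mathcal{D}^{\mathrm{f}} = \{\mathbf{d}_1^{\mathrm{f}}, \ldots, \mathbf{d}_N^{\mathrm{f}}\}$. The optimization problem can be formulated as selecting a codeword set $\mathcal{A}_t^{\mathrm{f}}$ to maximize the beam gain for the scheduled users, expressed as
$$\begin{aligned} \mathcal{P}_2: \max_{\mathcal{A}_t^{\mathrm{f}}} \quad & \sum_{t=1}^{T} \sum_{k \in \mathcal{G}_t} \sum_{i \in \mathcal{A}_t^{\mathrm{f}}} \left| (\mathbf{h}_t^k)^H \mathbf{d}_i^{\mathrm{f}} \right| && (11) \\ \text{s.t.} \quad & \mathbf{d}_i^{\mathrm{f}} \in \mathcal{D}^{\mathrm{f}}. && (11\text{a}) \end{aligned}$$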
Within information theory and statistical learning, divergence functions serve as a fundamental tool for measuring differences between probability distributions. A prominent example is the asymmetric Kullback–Leibler (KL) divergence [41], commonly employed to evaluate the difference between a true distribution $Q$ and an approximating distribution $\tilde{Q}$, with its continuous form given by $f_{\mathrm{KL}}(Q \| \tilde{Q}) = \int Q(x) \log \frac{Q(x)}{\tilde{Q}(x)} \, dx$. Theoretically, KL-UCB leverages the KL divergence to construct tighter confidence bounds than the standard Hoeffding inequality-based UCB. This enables a more precise characterization of tail behavior in probability distributions, leading to superior asymptotic regret performance. For arm i, we define the UCB value as
$$\mathrm{KU}_i(t) = \max\left\{ \tilde{\mu} : C_{i,t}\, f_{\mathrm{KL}}\left( \bar{\mu}_i \| \tilde{\mu} \right) \le f_{\mathrm{expl}}(t) \right\}, \tag{12}$$
where $f_{\mathrm{expl}}(t) = \log t + c \cdot \log\log t$ is an exploration term, with $c$ as a tuning coefficient, $\bar{\mu}_i$ is the sample average reward of arm i, and $\tilde{\mu}$ is a candidate value for the theoretical expected reward to be solved for.
Conventional optimization methods based on channel covariance matrices (CCMs) are often ineffective in practice due to the lack of S-CSI. To address this challenge, this paper employs a MAB framework and adopts a KL-UCB strategy for efficient problem-solving. The CCM, a key component of S-CSI, varies at a much slower rate than instantaneous CSI (I-CSI) [12]. Thus, it is considered approximately constant over multiple channel coherence intervals. For each user, its corresponding CCM $\mathbf{R}_k$ remains unchanged over an extended period $T$.
We formulate the aforementioned optimization problem $\mathcal{P}_2$ as a C-MAB problem oriented towards the scheduled users, and employ the KL-UCB algorithm to solve it. In this model, each codeword $\mathbf{d}_i^f$ in the DFT codebook $\mathcal{D}^f$ is treated as a base arm, while the set of selected DFT codewords in each training round forms a super arm, denoted as $\mathcal{A}_t^f$. In the $t$-th coherence block, for any scheduled user $k$, the reward associated with the $i$-th base arm is set to user $k$'s detected signal power when the $i$-th codeword is used. This reward is expressed as $E_{i,t}^f = \left| y_{k,i,t}^f \right|^2 = \left| (\mathbf{h}_t^k)^H \mathbf{d}_i^f + n_{k,t} \right|^2$.
We now give lemmas that are useful to analyze the distribution of $\left| y_{k,i,t}^f \right|^2$ and the corresponding form of the KL divergence for this distribution.
Lemma 1.
For a random variable $A \sim \mathcal{CN}(0, \kappa)$, the squared modulus $|A|^2$ follows an exponential distribution, i.e., $|A|^2 \sim \mathrm{Exp}(1/\kappa)$.
Proof. 
See Appendix A. □
Therefore, $\left| y_{k,i,t}^f \right|^2$ follows an exponential distribution with parameter $\frac{1}{(\mathbf{d}_i^f)^H \mathbf{R}_k \mathbf{d}_i^f + \sigma^2}$. Based on Lemma 1, we can obtain the following lemma.
Lemma 2.
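The KL divergence between two exponential distributions is
$$f_{\mathrm{KL}}(q \,\|\, \tilde{q}) = \log \frac{\lambda_q}{\lambda_{\tilde{q}}} + \frac{\lambda_{\tilde{q}}}{\lambda_q} - 1,$$
where $q(x) = \lambda_q e^{-\lambda_q x}$ and $\tilde{q}(x) = \lambda_{\tilde{q}} e^{-\lambda_{\tilde{q}} x}$ are exponential densities with parameters $\lambda_q$ and $\lambda_{\tilde{q}}$, respectively.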
Proof. 
See Appendix B. □
Theorem 1.
The first layer of the proposed beam training method achieves an expected regret of $O(\ln t)$.
Proof. 
The proof follows a similar procedure to that in Section 3 of [41], and is therefore omitted here for brevity. □
Theorem 1 guarantees that, under the condition that CCM remains constant, the regret of the first layer grows logarithmically. This enables the algorithm to asymptotically converge to the optimal codewords in the DFT codebook, thereby maximizing the expected beamforming gain.
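As an illustration of how the index $KU_i(t)$ can be evaluated for exponentially distributed rewards, the following is a minimal Python sketch. It assumes that the KL divergence of Lemma 2 is rewritten in terms of means ($\lambda = 1/\mu$), that the bracket for the bisection is grown by doubling, and that a negative exploration budget at very small $t$ is clamped to zero. The function names `kl_exp` and `kl_ucb_index` and these numerical choices are ours and are not taken from the paper.

```python
import math


def kl_exp(mean_p, mean_q):
    """KL divergence between exponential laws with means mean_p and mean_q.

    This is Lemma 2 with lambda = 1 / mean:
    log(mean_q / mean_p) + mean_p / mean_q - 1.
    """
    ratio = mean_p / mean_q
    return ratio - 1.0 - math.log(ratio)


def kl_ucb_index(mean_hat, pulls, t, c=3.0, iters=50):
    """Largest candidate mean whose scaled KL stays within the exploration budget.

    mean_hat: sample average reward of the arm (must be positive)
    pulls:    number of times the arm has been selected so far
    t:        current block index
    c:        tuning coefficient of the exploration term
    """
    if pulls == 0:
        return math.inf  # assumption: unexplored arms are tried first
    log_t = math.log(max(t, 2))  # assumption: avoid log(log(1))
    # f_expl(t) = log t + c * log log t, clamped at zero (assumption)
    budget = max(log_t + c * math.log(log_t), 0.0) / pulls
    lo, hi = mean_hat, 2.0 * mean_hat
    # kl_exp(mean_hat, x) increases in x for x >= mean_hat, so grow the bracket
    while kl_exp(mean_hat, hi) < budget:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_exp(mean_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


# Toy usage: an arm with average reward 1.0 after 10 pulls, at block 100.
example_index = kl_ucb_index(1.0, pulls=10, t=100)
```

The fixed iteration count `iters` plays the role of the bisection iteration count that appears in the complexity analysis; an arm that has been selected more often receives a tighter index for the same sample mean.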
Remark 2.
We restrict the number of codewords chosen per scheduled user k to A t , k f = { j : 1 K U j ( t ) ρ · max i 1 K U i ( t ) } following the computation of the KL-UCB values, such that the training overhead can be reduced. Consequently, this leads to A t f = k G t A t , k f and M f = A t f .
In the proposed scheme, the system progressively learns the average beam gain of each DFT codeword. Specifically, the average beam gain of user $k$ at the $i$-th codeword $\mathbf{d}_i^f$ is denoted as $\varepsilon_{k,i}^f = \mathbb{E}\left\{ \left| (\mathbf{h}_t^k)^H \mathbf{d}_i^f \right|^2 \right\} = (\mathbf{d}_i^f)^H \mathbf{R}_k \mathbf{d}_i^f$. Note that codewords with higher average beam gains are more likely to provide high beamforming gains for the users across multiple coherence intervals. By prioritizing the selection of these high-gain codewords and reducing the exploration of low-gain ones, the scheme concentrates the training overhead on the beams that contribute most to system performance.

3.3. MAB-Based Polar-Domain Scanning

Building upon the angle information acquired in the previous Section 3.2, we filter candidate polar-domain codewords for polar-domain scanning, and then a C-MAB-based polar-domain scanning strategy is proposed.
In the second layer of the second phase, we leverage the angular scanning results from the first layer to extract angle information for each scheduled user and construct a refined set of angle indices. Based on this, we obtain the corresponding candidate set of polar-domain codeword angles $\bar{\Theta} = \{ \theta_i \mid i \in \mathcal{A}_t^f \}$, where $\theta_i = \frac{2i - N - 1}{N}$ denotes the angle parameter of the $i$-th polar-domain codeword.
In the polar-domain codebook, each codeword is defined by the coupling of angle and distance parameters. To adapt to near-field channel characteristics, multiple distance sampling points are associated with each angular direction [33]. For any given angle $\theta_i \in \bar{\Theta}$, the corresponding distance samples are $r_{i,z} = \frac{1}{z} Z_\Delta (1 - \theta_i)$, $z = 1, \ldots, Z$, where $Z_\Delta = \frac{N^2 \lambda}{8 \beta_\Delta^2}$ serves as the coherence threshold for the near-field response vectors, $Z$ denotes the number of distance samples, and the parameter $\beta_\Delta$ represents the coherence threshold between neighboring polar-domain codewords, used to partition codewords into polar domains. A larger $\beta$ indicates lower coherence among codewords within the polar codebook, as shown in Figure 4, where $|G(\beta)|$ denotes the codeword coherence. Denote the polar-domain codebook as $\bar{\mathcal{D}}^n$, containing a total of $\bar{M} \times Z$ codewords, where $\bar{M} = |\bar{\Theta}|$. Each codeword $\mathbf{b}(\theta_i, r_{i,z})$, $i \in \mathcal{A}_t^f$, $z = 1, \ldots, Z$, is defined by a specific angle-distance pair $(\theta_i, r_{i,z})$.
This layer aims to identify an optimal subset of codewords from D ¯ n , such that the total beam gains of all scheduled users are maximized, and the problem is written as
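$$\mathcal{P}_3: \max_{\mathcal{A}_t^n} \sum_{t=1}^{T} \sum_{k \in \mathcal{G}_t} \sum_{i \in \mathcal{A}_t^n} \left| (\mathbf{h}_t^k)^H \mathbf{d}_i^n \right| \tag{14}$$
$$\text{s.t.} \quad \mathbf{d}_i^n \in \bar{\mathcal{D}}^n. \tag{14a}$$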
To reduce training overhead, we set $M^n = |\mathcal{A}_t^n|$, where $\mathcal{A}_t^n$ is solved in a manner analogous to that used in the first layer. The two-phase MAB-based beam training procedure is detailed in Algorithm 2.
Adopting an approach analogous to the proof of Theorem 1, this work investigates the exploration–exploitation dilemma inherent to the polar-domain codeword selection process at the second layer. It can be shown that the expected regret in this layer also has an upper bound of $O(\ln t)$.

3.4. Design of Multi-User Digital Precoding

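In the second layer of the second phase, the BS transmits a pilot matrix $\mathbf{X}^H \in \mathbb{C}^{|\mathcal{A}_t^n| \times |\mathcal{A}_t^n|}$, and the signal detected at user $k$ ($k \in \mathcal{G}_t$) is given by
$$(\mathbf{y}_{k,t}^n)^H = (\mathbf{h}_t^k)^H \mathbf{D}_{\mathcal{A}_t^n} \mathbf{X}^H + \mathbf{n}_{k,t}.$$
To estimate $(\bar{\mathbf{h}}_t^k)^H = (\mathbf{h}_t^k)^H \mathbf{D}_{\mathcal{A}_t^n}$, the pilot matrix must satisfy the orthogonality condition $\mathbf{X}^H \mathbf{X} = \mathbf{I}$. Provided that this condition holds, least-squares channel estimation is employed, which gives
$$(\mathbf{y}_{k,t}^n)^H \mathbf{X} = (\mathbf{h}_t^k)^H \mathbf{D}_{\mathcal{A}_t^n} + \mathbf{n}_{k,t} \mathbf{X},$$
where the estimate of $(\bar{\mathbf{h}}_t^k)^H$ is $(\hat{\bar{\mathbf{h}}}_t^k)^H = (\mathbf{y}_{k,t}^n)^H \mathbf{X}$, and this estimate is fed back to the BS.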
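The role of the orthogonality condition in the estimation step can be illustrated with a small Python sketch. It assumes a unitary DFT matrix as one possible pilot, a noiseless received signal, and a toy three-entry beam-domain channel; the helper names (`unitary_dft`, `hermitian`, `row_times`, `ls_estimate`) and the numerical values are hypothetical and serve illustration only.

```python
import cmath


def unitary_dft(n):
    """n x n unitary DFT matrix, one possible pilot X with X^H X = I."""
    scale = n ** -0.5
    return [[scale * cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)]
            for r in range(n)]


def hermitian(mat):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[mat[r][c].conjugate() for r in range(len(mat))]
            for c in range(len(mat[0]))]


def row_times(row, mat):
    """Product of a row vector with a matrix."""
    return [sum(row[r] * mat[r][c] for r in range(len(row)))
            for c in range(len(mat[0]))]


def ls_estimate(received_row, pilot):
    """Least-squares estimate: right-multiply the received row y^H by X."""
    return row_times(received_row, pilot)


# Hypothetical beam-domain channel row (h_t^k)^H D, three selected codewords.
h_eff = [1 + 2j, -0.5j, 0.3 + 0j]
X = unitary_dft(3)
y = row_times(h_eff, hermitian(X))  # noiseless received row: h_eff X^H
h_hat = ls_estimate(y, X)           # recovers h_eff because X^H X = I
```

With noise present, the same multiplication leaves the residual term $\mathbf{n}_{k,t} \mathbf{X}$ in the estimate, as in the expression above.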
Following a procedure analogous to Remark 2, we select a set of $M^n$ polar-domain codewords. Among these codewords, the one with the maximum KL-UCB value for user $k$ ($k \in \mathcal{G}_t$) in the current coherence block must be included. We use these codewords to form $\mathcal{A}_t^1$. The BS utilizes $\mathcal{A}_t^1$ and $(\hat{\bar{\mathbf{h}}}_t^k)^H$ to obtain an estimate of $\bar{\mathbf{H}}_{\mathcal{G}_t}^H = \mathbf{H}_{\mathcal{G}_t}^H \mathbf{D}_{\mathcal{A}_t^1}$, denoted as $\hat{\bar{\mathbf{H}}}_{\mathcal{G}_t}^H$. To mitigate interference, the multi-user digital precoder $\mathbf{V}_t$ is constructed from $\hat{\bar{\mathbf{H}}}_{\mathcal{G}_t}^H$. While ideal zero-forcing requires perfect CSI, our approach follows practical implementations where estimated CSI is employed for precoder design [23,27,43]. This represents a trade-off between performance and practicality, and it suppresses interference effectively when the channel estimation quality is sufficiently high.

4. Complexity Analysis

The computational complexity of the proposed two-phase C-MAB framework originates from three parts: energy-aware user scheduling, angular scanning, and polar-domain scanning.
In the user scheduling part, in each coherence block, Algorithm 1 calculates the EA-UCB metric for $K$ users. The per-user computation, involving the historical average rate, the exploration term, and a penalty term, has constant complexity $O(1)$. Selecting the top-$M$ users requires sorting, with complexity $O(K \cdot M)$. Over $T$ coherence blocks, the total complexity is $O(T \cdot K \cdot (M + 1))$.
In the angular scanning part, Algorithm 2 employs a DFT codebook of size $N$. KL-UCB indices for the $N$ codewords are computed via a bisection method. The KL divergence for the exponential reward distribution is evaluated with complexity $O(1)$, and the bisection requires $C_a$ iterations. Selecting $M^f$ candidate beams has complexity $O(N \cdot M^f)$. The per-block complexity is, therefore, $O(N \cdot (C_a + M^f))$, leading to an overall complexity of $O(T \cdot N \cdot (C_a + M^f))$.
The polar-domain scanning part searches over a candidate set with $M^f$ angles, each associated with $Z$ distance samples, so the candidate codebook size is $O(M^f \cdot Z)$. The same KL-UCB logic as in angular scanning is used to select $M^n$ candidate beams, with the bisection requiring $C_p$ iterations. The bisection therefore costs $O(M^f \cdot Z \cdot C_p)$ per block, and the overall complexity, including the KL evaluation and the beam selection, is $O(T \cdot M^f \cdot Z \cdot (1 + C_p + M^n))$.
Integrating the three parts, the total time complexity of the proposed framework is $O(T \cdot [K \cdot (M + 1) + N \cdot (C_a + M^f) + M^f \cdot Z \cdot (1 + C_p + M^n)])$.
The graph-based I-CSI algorithm (ICSIG) [13] requires full instantaneous CSI acquisition and solves an NP-hard maximum-weight clique problem, resulting in a complexity of $O(T \cdot [K^2 + 1.1996^K + M (N \cdot Z)])$ [44]. The Thompson-sampling-based beam training (TPBT) [45,46] has a complexity of $O(K + K \cdot M)$ in the user scheduling part and $O(\log(NZ))$ in the two-stage beam training part, so its total time complexity is $O(T \cdot [K + K \cdot M + \log(NZ)])$.

5. Numerical Analysis

This section presents a performance evaluation of the introduced energy-aware US-NFBT scheme. The simulation parameters and channel configurations are summarized in Table 1. Let $r_{k,l}$ be the distance between the $l$-th cluster of user $k$ and the BS. The parameter $r_k$ is normalized to resolve magnitude discrepancies in the EA-UCB formulation, and $r_k = \frac{\mathrm{ave}(r_{k,l}) - \min(r_{k,l})}{\max(r_{k,l}) - \min(r_{k,l})}$, where $\mathrm{ave}(r_{k,l})$, $\min(r_{k,l})$ and $\max(r_{k,l})$ are the average, minimum, and maximum distances between the clusters of user $k$ and the BS, respectively. This normalization prevents any single component from dominating the scheduling decision and ensures balanced consideration of both distance and energy factors in the EA-UCB algorithm.
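The distance normalization above amounts to a min-max scaling of the average cluster distance, which can be sketched in a few lines of Python. The function name `normalized_distance` and the `fallback` value returned when all cluster distances coincide (for instance with a single cluster, where the expression becomes 0/0) are our assumptions; the text does not specify how that case is handled.

```python
def normalized_distance(cluster_distances, fallback=0.5):
    """Min-max normalized average cluster distance r_k of one user.

    cluster_distances: distances r_{k,l} between the user's clusters and the BS
    fallback: value used when max == min (assumption, not from the paper)
    """
    lo, hi = min(cluster_distances), max(cluster_distances)
    if hi == lo:
        return fallback
    avg = sum(cluster_distances) / len(cluster_distances)
    return (avg - lo) / (hi - lo)


# Toy usage with three hypothetical cluster distances.
r_k = normalized_distance([6.0, 8.0, 14.0])
```

By construction the result lies in the interval [0, 1], which keeps the distance term on a scale comparable with the other components of the scheduling metric.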
Figure 5 illustrates the performance of the proposed scheme when user energy is sufficient. The system can consistently schedule users with the best channel conditions, allowing the EAR to remain at a high level after convergence. The proposed hierarchical MAB framework has a theoretical logarithmic regret upper bound $O(\ln t)$ (Theorem 1), and in this simulation it converges to a near-optimal policy within 20–30 coherence blocks. This is significantly shorter than the 250-coherence-block S-CSI invariance interval, so the algorithm completes learning before the channel statistics change. Cumulative regret arises primarily from the initial exploration phase and grows slowly due to the KL-UCB and EA-UCB strategies.
The proposed budget-based MAB scheme is compared with two baseline schemes:
  • ICSIG [13]: This scheme is adapted from [13] and relies on I-CSI. It first obtains I-CSI through channel estimation, and then employs a clique-based approach from graph theory for user scheduling.
  • TPBT [45,46]: This scheme uses classical Thompson sampling [45] for user scheduling, combined with a two-stage beam training strategy [46]. Finally, the estimated S-CSI is utilized to design the digital precoding.
Two scattering scenarios are considered: single cluster ( S = 1 ) and triple clusters ( S = 3 ).
Figure 6 compares the EAR of all schemes at different signal-to-noise ratios (SNRs). The EAR increases with SNR for all schemes, which is consistent with theoretical expectations. The US-NFBT scheme consistently outperforms the baselines, owing to its low pilot overhead and efficient user scheduling mechanism, which adaptively selects users with good channel conditions and sufficient energy. In contrast, TPBT considers only channel quality and ignores energy constraints, while ICSIG may miss optimal users due to channel correlation and energy limitations. A more complex scattering environment ($S = 3$) offers greater channel diversity; nevertheless, the EAR is lower in the multi-cluster scenario, which can be attributed to the power normalization in the channel model. At high SNR, performance is predominantly governed by channel gain, allowing TPBT to be temporarily effective. The subsequent performance convergence indicates that the slight short-term penalty introduced by our energy-aware constraint is diminished in this regime, highlighting its advantage for long-term sustainability.
Figure 7 illustrates the EAR of all schemes with different M. As shown in Figure 7, the US-NFBT scheme maintains the highest EAR as the number of scheduled users grows. This advantage stems from its ability to dynamically balance channel quality and residual energy. When the best user depletes its energy, US-NFBT adaptively switches to energy-sufficient sub-optimal users, thereby sustaining system performance. Conversely, TPBT suffers from energy exhaustion, and ICSIG may form sub-optimal schedules due to its structural and energy constraints. This adaptability enables US-NFBT to perform well under different numbers of scheduled users. Moreover, these performance advantages are also evident in the single-cluster scenario with varying SNR levels, as shown in Table 2 where US-NFBT consistently outperforms other schemes across all SNR values.
Figure 8 presents the EAR performance of all schemes under different $K$. US-NFBT maintains the highest EAR as $K$ increases, demonstrating superior scalability and stability. This advantage stems from its MAB-based learning framework, which efficiently identifies and prioritizes users with favorable channel conditions and sufficient energy reserves. Consequently, US-NFBT avoids the performance degradation inherent to ICSIG, which suffers from high computational complexity and inflexible rules. It also overcomes the short-sightedness of TPBT in energy-limited scenarios, making it a more suitable solution for large-scale IoT applications.
To provide a more comprehensive performance evaluation, the impact of computational processing delay on the EAR is analyzed under a configuration with SNR = 20 dB, $S = 1$ scattering cluster, $K = 9$ users, and $M = 5$ maximum schedulable users. As shown in Figure 9, the system EAR exhibits a clear decreasing trend as the computational delay increases from 0 ms to 1 ms. This performance degradation is governed by a modified EAR formulation
$$E_{k,t}^{\mathrm{delay}} = G_{k,t} \, \frac{T_{\mathrm{total}} - T_{B,t} - T_{\mathrm{delay},t}}{T_{\mathrm{total}}} \log_2 \left( 1 + \frac{P_{\mathrm{signal}}}{P_{\mathrm{interference}} + \sigma^2} \right),$$
which, compared to the original expression in Equation (7), incorporates an additional term $T_{\mathrm{delay},t}$ to account for the overhead introduced by algorithmic processing [47], thereby directly reducing the effective symbols available for data transmission within a coherence block. Some minor fluctuations are observed in Figure 9, which stem from the exploration–exploitation dynamics of the multi-armed bandit framework as it adjusts to the changed delay. Overall, the performance decreases as the delay increases. Importantly, the proposed US-NFBT scheme maintains robust performance under moderate delays, validating its practical applicability in real-world scenarios where processing delays are present [47].
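The effect of the delay term on the pre-log factor of the EAR can be made concrete with a minimal Python sketch. It assumes that the training overhead and the processing delay are both expressed in symbols, and it clamps the useful fraction at zero when the overhead exceeds the block length; the function name `ear_with_delay`, the clamping, and the example values other than $T_{\mathrm{total}} = 1386$ are our assumptions.

```python
import math


def ear_with_delay(scheduled, t_total, t_train, t_delay,
                   p_signal, p_interference, noise_power):
    """Delay-adjusted effective achievable rate of one user in one block.

    scheduled:  scheduling indicator G_{k,t} (0 or 1)
    t_total:    symbols per coherence block
    t_train:    symbols spent on beam training, T_{B,t}
    t_delay:    symbols lost to processing delay, T_{delay,t}
    """
    # Fraction of the block left for data; clamped at 0 (assumption).
    useful = max(t_total - t_train - t_delay, 0) / t_total
    sinr = p_signal / (p_interference + noise_power)
    return scheduled * useful * math.log2(1.0 + sinr)


# Toy usage: 1386 symbols per block, hypothetical overheads and powers.
rate_no_delay = ear_with_delay(1, 1386, 100, 0, 3.0, 0.0, 1.0)
rate_delayed = ear_with_delay(1, 1386, 100, 50, 3.0, 0.0, 1.0)
```

Because the delay only enters through the pre-log factor, the rate decreases linearly in the number of symbols lost, which matches the overall downward trend described for Figure 9.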

6. Conclusions

This paper presents a C-MAB-driven, two-phase joint optimization framework that jointly designs the beam training and user scheduling in XL-MIMO systems, thereby bridging a significant research gap in FDD near-field communication mechanisms. By incorporating an EA-UCB user scheduling algorithm, terminal energy constraints are integrated into the scheduling decision process, enabling joint optimization of the communication performance and energy consumption. Subsequently, a dual-layer beam training mechanism, which starts with coarse-grained angle estimation and advances to fine-grained polar-domain beam training, is introduced to substantially cut the training overhead. Simulation results confirm its superior performance. In the future, we will extend the proposed scheme to non-stationary scenarios, such as those involving high-speed mobile users or time-varying scatterers. By incorporating adaptive learning mechanisms and dynamic model prediction, we will enhance the system's robustness and adaptability in these scenarios.

Author Contributions

Conceptualization, Y.X., Y.Y., H.L., and Y.J.; methodology, Y.X., Y.Y., and H.L.; software, J.G., X.Y., and J.W.; validation, Y.S. and H.L.; formal analysis, J.G., X.Y., and J.W.; investigation, H.L. and Y.J.; resources, Y.X., Y.Y., and Y.S.; writing—original draft preparation, H.L. and Y.J.; writing—review and editing, Y.X., Y.Y., J.G., X.Y., and J.W.; supervision, Y.S.; project administration, H.L.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62101282.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Acknowledgments

The authors would like to express their gratitude to DeepSeek-V3.2 for the language refinement.

Conflicts of Interest

Authors Yunxing Xiang, Yi Yan, Jing Gao, Xiaohui You and Jun Wang were employed by State Grid Jiangxi Electric Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UCB | upper confidence bound
XL-MIMO | extremely large-scale multiple-input multiple-output
CSI | channel state information
S-CSI | statistical channel state information
I-CSI | instantaneous channel state information
UAV | unmanned aerial vehicle
MAB | multi-armed bandit
EA-UCB | energy-aware upper confidence bound
KL-UCB | Kullback–Leibler upper confidence bound
US-NFBT | user scheduling and near-field beam training
C-MAB | combinatorial multi-armed bandit
BS | base station
ULA | uniform linear array
EAR | effective achievable rate
CCM | channel covariance matrix
ICSIG | I-CSI-based graph theory algorithm
TPBT | Thompson-sampling-based beam training
SNR | signal-to-noise ratio

Appendix A. (Proof of Lemma 1)

Proof. 
From the channel model, we know that $y_{k,i,t}^f \sim \mathcal{CN}(0, \kappa)$ with $\kappa = (\mathbf{d}_i^f)^H \mathbf{R}_k \mathbf{d}_i^f + \sigma^2$. Then, its real and imaginary parts, $\mathrm{Re}(y_{k,i,t}^f)$ and $\mathrm{Im}(y_{k,i,t}^f)$, are i.i.d. real Gaussian variables, i.e.,
$$\mathrm{Re}(y_{k,i,t}^f) \sim \mathcal{N}\left(0, \frac{(\mathbf{d}_i^f)^H \mathbf{R}_k \mathbf{d}_i^f + \sigma^2}{2}\right), \quad \mathrm{Im}(y_{k,i,t}^f) \sim \mathcal{N}\left(0, \frac{(\mathbf{d}_i^f)^H \mathbf{R}_k \mathbf{d}_i^f + \sigma^2}{2}\right). \tag{A1}$$
The squared modulus is given by
$$\left| y_{k,i,t}^f \right|^2 = \mathrm{Re}(y_{k,i,t}^f)^2 + \mathrm{Im}(y_{k,i,t}^f)^2. \tag{A2}$$
Let $X = \mathrm{Re}(y_{k,i,t}^f)$ and $Y = \mathrm{Im}(y_{k,i,t}^f)$. It follows that
$$\frac{X^2}{\kappa/2} \sim \chi^2(1), \quad \frac{Y^2}{\kappa/2} \sim \chi^2(1), \tag{A3}$$
where $\chi^2(1)$ denotes the chi-squared distribution with one degree of freedom. Therefore,
$$\left| y_{k,i,t}^f \right|^2 = X^2 + Y^2 = \frac{\kappa}{2} \left( \frac{X^2}{\kappa/2} + \frac{Y^2}{\kappa/2} \right) = \frac{\kappa}{2} \cdot \Omega, \tag{A4}$$
where $\Omega \sim \chi^2(2)$ is a chi-squared random variable with two degrees of freedom. The chi-squared distribution with two degrees of freedom is equivalent to an exponential distribution with rate parameter $\lambda = 1/2$, i.e.,
$$\Omega \sim \mathrm{Exp}(1/2). \tag{A5}$$
Then, $\left| y_{k,i,t}^f \right|^2 \sim \mathrm{Exp}(1/\kappa)$. Hence, Lemma 1 is proved. □
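Lemma 1 can also be checked empirically. The following Python sketch, given for illustration only, draws the real and imaginary parts with variance $\kappa/2$ each and compares the sample mean, the sample variance and one tail probability of the squared modulus with the values implied by $\mathrm{Exp}(1/\kappa)$, namely $\kappa$, $\kappa^2$ and $e^{-1}$. The value $\kappa = 2.5$, the sample size and the seed are arbitrary choices of ours.

```python
import math
import random


def squared_modulus_samples(kappa, n, seed=0):
    """Draw n samples of |A|^2 for A ~ CN(0, kappa).

    Real and imaginary parts are independent N(0, kappa / 2).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(kappa / 2.0)
    return [rng.gauss(0.0, sigma) ** 2 + rng.gauss(0.0, sigma) ** 2
            for _ in range(n)]


kappa = 2.5  # arbitrary illustrative variance
samples = squared_modulus_samples(kappa, 100_000)

# Moments of Exp(1 / kappa): mean kappa, variance kappa ** 2.
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

# Tail probability P(|A|^2 > kappa) should be close to exp(-1).
tail = sum(1 for s in samples if s > kappa) / len(samples)
```

A Monte Carlo check of this kind does not replace the proof; it only guards against slips in the variance bookkeeping between the complex and real representations.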

Appendix B. (Proof of Lemma 2)

Proof. 
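According to the definition, the KL divergence is
$$f_{\mathrm{KL}}(p \,\|\, q) = \int_0^\infty p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{A6}$$
Consider two exponential distributions, $p(x) = \lambda_p e^{-\lambda_p x}$ and $q(x) = \lambda_q e^{-\lambda_q x}$. Substituting them into the logarithmic function gives
$$\log \frac{p(x)}{q(x)} = \log \frac{\lambda_p e^{-\lambda_p x}}{\lambda_q e^{-\lambda_q x}} = \log \frac{\lambda_p}{\lambda_q} - (\lambda_p - \lambda_q) x. \tag{A7}$$
Finally, this expression is substituted into the KL divergence (Equation (A6)), i.e.,
$$\begin{aligned} f_{\mathrm{KL}}(p \,\|\, q) &= \int_0^\infty \lambda_p e^{-\lambda_p x} \left[ \log \frac{\lambda_p}{\lambda_q} - (\lambda_p - \lambda_q) x \right] dx \\ &= \log \frac{\lambda_p}{\lambda_q} \int_0^\infty \lambda_p e^{-\lambda_p x} \, dx - (\lambda_p - \lambda_q) \int_0^\infty x \lambda_p e^{-\lambda_p x} \, dx \\ &= \log \frac{\lambda_p}{\lambda_q} - (\lambda_p - \lambda_q) \frac{1}{\lambda_p} = \log \frac{\lambda_p}{\lambda_q} + \frac{\lambda_q}{\lambda_p} - 1. \end{aligned} \tag{A8}$$
Therefore, Lemma 2 is proved. □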
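The closed form of Lemma 2 can be compared against a direct numerical evaluation of the defining integral. The Python sketch below uses a plain trapezoidal rule on a truncated interval; the truncation point `upper`, the number of steps and the function names are our choices, and the logarithm of the density ratio is written in its simplified form to avoid numerical underflow.

```python
import math


def kl_exp_closed_form(lam_p, lam_q):
    """Closed-form KL divergence between Exp(lam_p) and Exp(lam_q)."""
    return math.log(lam_p / lam_q) + lam_q / lam_p - 1.0


def kl_exp_numeric(lam_p, lam_q, upper=60.0, steps=200_000):
    """Trapezoidal approximation of the integral of p * log(p / q) on [0, upper]."""
    h = upper / steps

    def integrand(x):
        # log(p / q) = log(lam_p / lam_q) - (lam_p - lam_q) * x
        log_ratio = math.log(lam_p / lam_q) - (lam_p - lam_q) * x
        return lam_p * math.exp(-lam_p * x) * log_ratio

    total = 0.5 * (integrand(0.0) + integrand(upper))
    total += sum(integrand(i * h) for i in range(1, steps))
    return total * h


# Toy usage with two arbitrary rate parameters.
closed = kl_exp_closed_form(2.0, 0.5)
numeric = kl_exp_numeric(2.0, 0.5)
```

The two values agree up to the discretization error of the trapezoidal rule, and swapping the arguments yields a different value, which reflects the asymmetry of the KL divergence.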

References

  1. Tariq, F.; Khandaker, M.R.A.; Wong, K.K.; Imran, M.A.; Bennis, M.; Debbah, M. A Speculative Study on 6G. IEEE Wirel. Commun. 2020, 27, 118–125.
  2. Zeng, Y.; Chen, J.; Xu, J.; Wu, D.; Xu, X.; Jin, S.; Gao, X.; Gesbert, D.; Cui, S.; Zhang, R. A Tutorial on Environment-Aware Communications via Channel Knowledge Map for 6G. IEEE Commun. Surv. Tutor. 2024, 26, 1478–1519.
  3. Parra-Ullauri, J.M.; Zhang, X.; Bravalheri, A.; Moazzeni, S.; Wu, Y.; Nejabati, R.; Simeonidou, D. Federated Analytics for 6G Networks: Applications, Challenges, and Opportunities. IEEE Netw. 2024, 38, 9–17.
  4. Na, M.; Lee, J.; Choi, G.; Yu, T.; Choi, J.; Lee, J.; Bahk, S. Operator’s Perspective on 6G: 6G Services, Vision, and Spectrum. IEEE Commun. Mag. 2024, 62, 178–184.
  5. González-Prelcic, N.; Furkan Keskin, M.; Kaltiokallio, O.; Valkama, M.; Dardari, D.; Shen, X.; Shen, Y.; Bayraktar, M.; Wymeersch, H. The Integrated Sensing and Communication Revolution for 6G: Vision, Techniques, and Applications. Proc. IEEE 2024, 112, 676–723.
  6. Psomas, C.; Ntougias, K.; Shanin, N.; Xu, D.; Mayer, K.; Tran, N.M.; Cottatellucci, L.; Choi, K.W.; Kim, D.I.; Schober, R.; et al. Wireless Information and Energy Transfer in the Era of 6G Communications. Proc. IEEE 2024, 112, 764–804.
  7. Liu, W.; Pan, C.; Ren, H.; Wang, J.; Schober, R. Near-Field Multiuser Beam-Training for Extremely Large-Scale MIMO Systems. IEEE Trans. Commun. 2025, 73, 2663–2679.
  8. Wu, C.; You, C.; Liu, Y.; Chen, L.; Shi, S. Two-Stage Hierarchical Beam Training for Near-Field Communications. IEEE Trans. Veh. Technol. 2024, 73, 2032–2044.
  9. Lu, Y.; Zhang, Z.; Dai, L. Hierarchical Beam Training for Extremely Large-Scale MIMO: From Far-Field to Near-Field. IEEE Trans. Commun. 2024, 72, 2247–2259.
  10. Shi, X.; Wang, J.; Sun, Z.; Song, J. Spatial-Chirp Codebook-Based Hierarchical Beam Training for Extremely Large-Scale Massive MIMO. IEEE Trans. Wirel. Commun. 2024, 23, 2824–2838.
  11. Zhou, C.; Wu, C.; You, C.; Zhou, J.; Shi, S. Near-Field Beam Training with Sparse DFT Codebook. IEEE Trans. Commun. 2025, 73, 4394–4408.
  12. Li, X.; Jin, S.; Suraweera, H.A.; Hou, J.; Gao, X. Statistical 3-D Beamforming for Large-Scale MIMO Downlink Systems Over Rician Fading Channels. IEEE Trans. Commun. 2016, 64, 1529–1543.
  13. de Souza, J.H.I.; Filho, J.C.M.; Amiri, A.; Abrão, T. QoS-Aware User Scheduling in Crowded XL-MIMO Systems Under Non-Stationary Multi-State LoS/NLoS Channels. IEEE Trans. Veh. Technol. 2023, 72, 7639–7652.
  14. González-Coma, J.P.; López-Martínez, F.J.; Castedo, L. Joint User Scheduling and Precoding for XL-MIMO Systems with Imperfect CSI. IEEE Wirel. Commun. Lett. 2023, 12, 1657–1661.
  15. Li, X.; Zhang, J.; Han, Y.; Jin, S.; Wu, Y. User Scheduling and Height Optimization in UAV-Assisted Maritime XL-MIMO Communications. In Proceedings of the 2023 International Conference on Wireless Communications and Signal Processing (WCSP), Hangzhou, China, 2–4 November 2023; pp. 966–971.
  16. González-Coma, J.P.; López-Martínez, F.J.; Castedo, L. Low-Complexity Distance-Based Scheduling for Multi-User XL-MIMO Systems. IEEE Wirel. Commun. Lett. 2021, 10, 2407–2411.
  17. Pérez-Adán, D.; González-Coma, J.P.; Javier López-Martínez, F.; Castedo, L. Interference-Aware Precoding and User Selection for Multi-Cell Near-Field XL-MIMO Systems. IEEE Wirel. Commun. Lett. 2025, 14, 1396–1400.
  18. Hashima, S.; Hatano, K.; Takimoto, E.; Mahmoud Mohamed, E. Neighbor Discovery and Selection in Millimeter Wave D2D Networks Using Stochastic MAB. IEEE Commun. Lett. 2020, 24, 1840–1844.
  19. Ni, Y.; Wang, T.; Tong, H.; Yin, C. A Fast Near-Field Beam Training Strategy based on Array Partition. In Proceedings of the 2025 IEEE/CIC International Conference on Communications in China (ICCC), Shanghai, China, 10–13 August 2025; pp. 1–6.
  20. Huang, K.; Guan, J.; Wang, Y.; Luo, X. A Symmetric Beam Training Method for Near-Field XL-MIMO Systems. In Proceedings of the 2025 6th International Conference on Electrical, Electronic Information and Communication Engineering (EEICE), Shenzhen, China, 18–20 April 2025; pp. 1491–1496.
  21. Grover, M.; R, S.; Ketha, R.; Chaudhary, P.; Juneja, B.; Sahoo, S.K. Optimal User Scheduling for Downlink Multi-User MIMO Systems Using Convolutional Neural Networks. In Proceedings of the 2025 International Conference on Networks and Cryptology (NETCRYPT), New Delhi, India, 29–31 May 2025; pp. 271–276.
  22. Zhu, Y.; Li, S.; Guo, L.; Ge, W.; Wei, L. Deep reinforcement learning-based user scheduling methods with low-complexity beamforming for massive MU-MIMO systems. J. Commun. Netw. 2025, 27, 369–385.
  23. Liang, H.; Liu, C.; Song, Y.; Yin, Z.; Wang, G. Joint Two Stage Beamforming and User Scheduling for LEO Satellite Communications via Hierarchical Multi-armed Bandit. IEEE Trans. Cogn. Commun. Netw. 2025; early access.
  24. Dong, Z.; Li, X.; Zeng, Y. Characterizing and Utilizing Near-Field Spatial Correlation for XL-MIMO Communication. IEEE Trans. Commun. 2024, 72, 7922–7937.
  25. Liang, H.; Liu, C.; Song, Y.; Gao, T.; Zou, Y. Neighbor-based joint spatial division and multiplexing in massive MIMO: User scheduling and dynamic beam allocation. EURASIP J. Adv. Signal Process. 2024, 2024, 1.
  26. You, L.; Qiang, X.; Li, K.X.; Tsinos, C.G.; Wang, W.; Gao, X.; Ottersten, B. Hybrid Analog/Digital Precoding for Downlink Massive MIMO LEO Satellite Communications. IEEE Trans. Wirel. Commun. 2022, 21, 5962–5976.
  27. Gao, T.; Liu, C.; Song, Y.; Yin, Z.; Liang, H.; Cheng, N. Spectral Efficient TSB Scheme with User Scheduling for FDD Massive MIMO Systems. IEEE Internet Things J. 2024, 11, 6084–6095.
  28. Liu, L.; You, C.; Zhang, Y.; Liu, T. Side Angle Information Assisted Near-Field Beam Training for XL-Array Communications. IEEE Commun. Lett. 2024, 28, 2201–2205.
  29. Hashima, S.; Fouda, M.M.; Sakib, S.; Fadlullah, Z.M.; Hatano, K.; Mohamed, E.M.; Shen, X. Energy-Aware Hybrid RF-VLC Multiband Selection in D2D Communication: A Stochastic Multiarmed Bandit Approach. IEEE Internet Things J. 2022, 9, 18002–18014.
  30. Mohamed, E.M.; Hashima, S.; Hatano, K. Energy Aware Multiarmed Bandit for Millimeter Wave-Based UAV Mounted RIS Networks. IEEE Wirel. Commun. Lett. 2022, 11, 1293–1297.
  31. Mohamed, E.M.; Ahmed Alnakhli, M.; Fouda, M.M. Joint UAV Trajectory Planning and LEO-Sat Selection in SAGIN. IEEE Open J. Commun. Soc. 2024, 5, 1624–1638.
  32. Zhang, X.; Sun, S.; Tao, M.; Huang, Q.; Tang, X. Multi-Satellite Cooperative Networks: Joint Hybrid Beamforming and User Scheduling Design. IEEE Trans. Wirel. Commun. 2024, 23, 7938–7952.
  33. Wei, X.; Dai, L. Channel Estimation for Extremely Large-Scale Massive MIMO: Far-Field, Near-Field, or Hybrid-Field? IEEE Commun. Lett. 2022, 26, 177–181.
  34. Mauricio, W.V.F.; Maciel, T.F.; Klein, A.; Lima, F.R.M. Scheduling for Massive MIMO with Hybrid Precoding Using Contextual Multi-Armed Bandits. IEEE Trans. Veh. Technol. 2022, 71, 7397–7413.
  35. Lee, H.S.; Kim, D.Y.; Min, K. Universal Dynamic Pilot Allocation for Beam Alignment Based on Multi-Armed Bandits. IEEE Wirel. Commun. Lett. 2024, 13, 756–760.
  36. Mohsenivatani, M.; Ali, S.; Rajatheva, N.; Latva-Aho, M. Deep Contextual Bandits Learning-Based Beam Selection for mmWave MIMO Systems. In Proceedings of the 2023 IEEE International Mediterranean Conference on Communications and Networking (MeditCom), Dubrovnik, Croatia, 4–7 September 2023; pp. 211–216.
  37. Manonmayee Bharatula, S.S.; Ramaiyan, V. Adapting UCB for Correlated Arms in Link Rate Selection for Wireless Channels. In Proceedings of the 2023 21st International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Singapore, 24–27 August 2023; pp. 1–8.
  38. Song, J.; de Veciana, G.; Shakkottai, S. Meta-Scheduling for the Wireless Downlink Through Learning with Bandit Feedback. IEEE/ACM Trans. Netw. 2022, 30, 487–500.
  39. Wu, B.; Chen, T.; Ni, W.; Wang, X. Multi-Agent Multi-Armed Bandit Learning for Online Management of Edge-Assisted Computing. IEEE Trans. Commun. 2021, 69, 8188–8199.
  40. Hashima, S.; Fouda, M.M.; Fadlullah, Z.M.; Mohamed, E.M.; Hatano, K. Improved UCB-based Energy-Efficient Channel Selection in Hybrid-Band Wireless Communication. In Proceedings of the 2021 IEEE Global Communications Conference (GLOBECOM), Madrid, Spain, 7–11 December 2021; pp. 1–6.
  41. Garivier, A.; Cappé, O. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. arXiv 2013, arXiv:1102.2490.
  42. Cui, M.; Dai, L. Channel Estimation for Extremely Large-Scale MIMO: Far-Field or Near-Field? IEEE Trans. Commun. 2022, 70, 2663–2677.
  43. Song, Y.; Liu, C.; Liu, Y.; Cheng, N.; Huang, Y.; Shen, X. Joint Spatial Division and Multiplexing in Massive MIMO: A Neighbor-Based Approach. IEEE Trans. Wirel. Commun. 2020, 19, 7392–7406.
  44. Xiao, M.; Nagamochi, H. Exact algorithms for maximum independent set. Inf. Comput. 2017, 255, 126–146.
  45. Wilhelmi, F.; Cano, C.; Neu, G.; Bellalta, B.; Jonsson, A.; Barrachina-Muñoz, S. Collaborative Spatial Reuse in wireless networks via selfish Multi-Armed Bandits. Ad Hoc Netw. 2019, 88, 129–141.
  46. Zhang, Y.; Wu, X.; You, C. Fast Near-Field Beam Training for Extremely Large-Scale Array. arXiv 2022, arXiv:2209.14798.
  47. Miuccio, L.; Panno, D.; Riolo, S. A Flexible Encoding/Decoding Procedure for 6G SCMA Wireless Networks via Adversarial Machine Learning Techniques. IEEE Trans. Veh. Technol. 2023, 72, 3288–3303.
Figure 1. Illustration of the XL-MIMO multi-user communication system.
Figure 2. The US-NFBT Scheme Architecture.
Figure 3. Comparison between the far-field and near-field channel.
Figure 4. The numerical results of | G ( β ) | against β [42].
Figure 5. Comparison of EARs under different slots. SNR = 20 dB, S = 1 , K = 9 , M = 5 .
Figure 6. The total EARs versus different SNRs for US-NFBT, TPBT [45,46], ICSIG [13] schemes. K = 12 , M = 5 .
Figure 7. Total EARs versus M for US-NFBT, TPBT [45,46], ICSIG [13] schemes. SNR = 10 dB, K = 12 .
Figure 8. The total EARs versus K for US-NFBT, TPBT [45,46], ICSIG [13] schemes. SNR = 20 dB, M = 5 .
Figure 9. Comparison of EAR under different computational delays.
Table 1. Simulation Parameters.

| Category | Parameter | Value |
| --- | --- | --- |
| System | Antenna count $N$ | 256 |
| System | Carrier frequency $f_c$ | 100 GHz |
| System | Antenna spacing $d$ | 1.5 mm |
| System | System circuit power $\xi_{\text{system}}$ | 44 dBm |
| System | Power amplifier efficiency $1/\gamma$ | 1/3 |
| System | Predefined energy threshold $\Psi_{\text{th}}$ | 10 J |
| System | Bandwidth $W_{\text{bw}}$ | 63 MHz |
| System | Initial user battery energy $\Psi_{k,1}$ | 30 J |
| System | Number of symbols per coherence block $T_{\text{total}}$ | 1386 |
| System | Number of users $K$ | 9, 14 |
| System | Required data per coherence block $L_{\text{data}}$ | 1.5 Mb |
| System | Maximum number of scheduled users $M$ | 3, 8 |
| Channel | User angle range $\theta_k$ | ±60, ±45, ±30, ±15, 0 |
| Channel | Number of clusters $S$ | 1, 3 |
| Channel | Scattering cluster distance range $r_{k,l}$ | 6, 8, 10, 12, 14 |
| Channel | S-CSI invariance interval | 250 coherence blocks |
| Channel | Concentration parameter of the von Mises PDF | 0, 4, 8 |
| MAB | Exploration coefficient of EA-UCB $B$ | $\log_2\left(1 + \max_k \frac{d_{i_n}^H R_k d_{i_n}}{\sigma^2}\right)$ |
| MAB | Tuning coefficient of KL-UCB $c$ | 3 |
| MAB | Coefficient $\rho$ (Remark 2) | 0.5 |
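As a quick sanity check on the system parameters in Table 1, the following minimal sketch derives the wavelength, the spacing-to-wavelength ratio, the array aperture, and the conventional Rayleigh distance 2D²/λ. It assumes a uniform linear array, which this part of the text does not state, and the function names `array_geometry` and `dbm_to_watts` are hypothetical helpers introduced only for illustration. The article may define the near-field boundary differently.

```python
# Sanity check of the Table 1 system parameters.
# Assumption: uniform linear array; the Rayleigh distance 2*D^2/lambda is the
# textbook near-field boundary, not necessarily the one used in the article.

C_LIGHT = 299_792_458.0  # speed of light in m/s


def array_geometry(num_antennas: int, carrier_hz: float, spacing_m: float) -> dict:
    """Return wavelength, spacing ratio, aperture and Rayleigh distance."""
    wavelength_m = C_LIGHT / carrier_hz
    aperture_m = (num_antennas - 1) * spacing_m  # end-to-end array length
    rayleigh_m = 2.0 * aperture_m**2 / wavelength_m
    return {
        "wavelength_m": wavelength_m,
        "spacing_over_wavelength": spacing_m / wavelength_m,
        "aperture_m": aperture_m,
        "rayleigh_m": rayleigh_m,
    }


def dbm_to_watts(power_dbm: float) -> float:
    """Convert a power level in dBm to watts."""
    return 10.0 ** (power_dbm / 10.0) / 1000.0


# Values copied from Table 1: N = 256, f_c = 100 GHz, d = 1.5 mm, 44 dBm.
geom = array_geometry(num_antennas=256, carrier_hz=100e9, spacing_m=1.5e-3)
circuit_power_w = dbm_to_watts(44.0)

for name, value in geom.items():
    print(f"{name}: {value:.4g}")
print(f"circuit_power_w: {circuit_power_w:.4g}")
```

Under these assumptions the spacing is close to half a wavelength and the Rayleigh distance is on the order of 100 m. If the scattering cluster distances in Table 1 are in metres (the unit is not given in the table), they would fall well inside that boundary.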
Table 2. EAR (bit/s/Hz) for different schemes at various SNRs in the single-cluster scenario.

| SNR (dB) | US-NFBT | TPBT | ICSIG |
| --- | --- | --- | --- |
| −5 | 13.17 | 8.93 | 4.14 |
| 0 | 20.19 | 13.63 | 8.86 |
| 5 | 26.54 | 19.93 | 15.04 |
| 10 | 33.50 | 28.53 | 22.42 |
| 15 | 42.45 | 37.56 | 28.86 |
| 20 | 48.57 | 43.21 | 32.47 |
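The EAR values in Table 2 can be turned into relative gains with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from the table, the third baseline is labelled ICSIG as in the figure captions, and the helper `relative_gain_percent` is a hypothetical name that does not come from the article.

```python
# Relative EAR gain of US-NFBT over the two baselines, using Table 2 values.
# Tuple order per SNR (dB): (US-NFBT, TPBT, ICSIG), all in bit/s/Hz.
ear = {
    -5: (13.17, 8.93, 4.14),
    0: (20.19, 13.63, 8.86),
    5: (26.54, 19.93, 15.04),
    10: (33.50, 28.53, 22.42),
    15: (42.45, 37.56, 28.86),
    20: (48.57, 43.21, 32.47),
}


def relative_gain_percent(proposed: float, baseline: float) -> float:
    """Percentage by which `proposed` exceeds `baseline`."""
    return 100.0 * (proposed - baseline) / baseline


# gains[snr] = (gain over TPBT, gain over ICSIG), both in percent.
gains = {
    snr: (
        relative_gain_percent(us_nfbt, tpbt),
        relative_gain_percent(us_nfbt, icsig),
    )
    for snr, (us_nfbt, tpbt, icsig) in ear.items()
}

for snr, (over_tpbt, over_icsig) in sorted(gains.items()):
    print(f"SNR {snr:>3} dB: +{over_tpbt:.1f}% vs TPBT, +{over_icsig:.1f}% vs ICSIG")
```

In these figures the relative gain over both baselines is largest at the lowest SNR and shrinks as SNR increases, while the absolute EAR gap over TPBT stays within roughly 4 to 7 bit/s/Hz.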
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xiang, Y.; Yan, Y.; Song, Y.; Gao, J.; You, X.; Wang, J.; Liang, H.; Jiang, Y. Hierarchical MAB Framework for Energy-Aware Beam Training for Near-Field Communications. Sensors 2026, 26, 60. https://doi.org/10.3390/s26010060


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
