Two-Stage Multiarmed Bandit for Reconfigurable Intelligent Surface Aided Millimeter Wave Communications

A reconfigurable intelligent surface (RIS) is a promising technology that can extend short-range millimeter wave (mmWave) communications coverage. However, phase shifts (PSs) of both mmWave transmitter (TX) and RIS antenna elements need to be optimally adjusted to effectively cover a mmWave user. This paper proposes codebook-based phase shifters for mmWave TX and RIS to overcome the difficulty of estimating their mmWave channel state information (CSI). Moreover, to adjust the PSs of both, an online learning approach in the form of a multiarmed bandit (MAB) game is suggested, where a nested two-stage stochastic MAB strategy is proposed. In the proposed strategy, the PS vector of the mmWave TX is adjusted in the first MAB stage. Based on it, the PS vector of the RIS is calibrated in the second stage and vice versa over the time horizon. Hence, we leverage and implement two standard MAB algorithms, namely Thompson sampling (TS) and upper confidence bound (UCB). Simulation results confirm the superior performance of the proposed nested two-stage MAB strategy; in particular, the nested two-stage TS nearly matches the optimal performance.


Introduction
A reconfigurable intelligent surface (RIS) is a promising technology to extend the coverage of the communication systems by means of passive antenna arrays [1]. This can be done by configuring the phase shifts (PSs) of the antenna elements to reflect the incoming electromagnetic wave towards an intended destination. Compared with the conventional amplify and forward (AF) and decode and forward (DF) relays, RIS has the advantages of low cost and ease of installation as no RF chains are needed [2]. Millimeter wave (mmWave) communication, i.e., 30~300 GHz band, is another promising technology for fifth-generation (5G) wireless communications and beyond due to its vacant frequencies enabling multi-Gbps transmissions [3][4][5][6]. However, due to its high operating frequencies, mmWave is characterized by a short-range transmission with increased susceptibility to path blockage [7]. This necessitates the use of directional antennas in the form of antenna beamforming training (BT) [8][9][10][11].
A symbiotic relationship exists between both technologies. On one side, RIS is considered an efficient solution for mmWave challenges, where RIS can extend the mmWave coverage and route around blockages. On the other side, mmWave can directly tune its beam direction towards the RIS location, and the RIS reflects this beam towards the intended mmWave receiver (RX) via adjusting its PSs. However, jointly optimizing the PSs of both mmWave transmitter (TX) and RIS antenna elements is challenging due to the

• The RIS-assisted mmWave communication system is considered, where an optimization problem is formulated to jointly adjust the PS vectors of both mmWave TX and RIS.
• Discrete PSs in the form of codebook design are suggested to relax the complicated CSI estimation problem at both mmWave TX and RIS. In this design, the PSs are assumed to be generated with 90-degree phase resolution and constant amplitude, like the codebook design used by WiGig standards [12,13].
• A stochastic single-player MAB game is constructed to jointly optimize the PS vectors of mmWave TX and RIS. This facilitates the adjustment of both PS vectors successively over time, which greatly reduces the required BT overhead. Typically, the only available information for a MAB player is its reward observation, without any details about the environment. Thus, considering mmWave PSs optimization as a MAB game eliminates the need for CSI estimation, as the observed achievable spectral efficiency, i.e., the reward of the game, is the only needed information. This information can be easily obtained via the feedback channel between the mmWave RX and TX. Moreover, the suggested codebook design facilitates the implementation of the MAB game, where the PS vectors are considered as its arms. To reduce the complexity of the arm optimization, as there are two sets of arms, one belonging to the mmWave TX and the other to the RIS, a nested two-stage MAB game is proposed in this paper. In this approach, the PS vector of mmWave TX is adjusted in the first MAB stage, and based on it, the PS vector of the RIS is modified in the next stage and vice versa over the time horizon. Thus, at each trial, the player only needs to explore one set of the PS vectors, either that belonging to the mmWave TX or that belonging to the RIS, which reduces the computational complexity of the constructed MAB game. Two common MAB algorithms, namely Thompson sampling (TS) [16] and upper confidence bound (UCB) [17], are used to implement the proposed nested two-stage MAB and compare their performances under the mmWave-RIS environment.
• Numerical analyses are conducted to demonstrate the effectiveness of the proposed mmWave-RIS communication system relative to the benchmarks and to the optimal performance under different simulation scenarios.
The rest of this paper is organized as follows: Section 2 summarizes the related works, and Section 3 discusses the system model and the problem formulation. Section 4 proposes the antenna codebook design and the nested two-stage MAB approach. Section 5 delivers the numerical analysis, followed by the concluding remarks in Section 6.

Literature Review
One way to meet the continuously increasing capacity demand in wireless communication systems is to control the channel itself, developing an intelligent radio environment alongside other existing solutions (diversity, high-frequency waves, etc.). RIS is a programmable arrangement that controls the propagation of electromagnetic waves by varying its surface's electric and magnetic characteristics. Furthermore, installing intelligent surfaces within the wireless system environment allows the features of the radio channels to be entirely or partially controlled. Hence, RIS-assisted systems can improve the reliability and energy efficiency of wireless systems [1]. Lately, RIS has drawn much attention as an up-and-coming technology that can meet the demands of future wireless systems [18], i.e., 6G and beyond. Hence, RIS has promoted wireless applications such as RIS-aided wireless power transfer [19], RIS-aided mobile edge computing [20], RIS-aided physical layer security [21,22], RIS-aided UAV communications [23,24], and mobility and handover management for RIS-aided wireless communications in high-speed trains [25].
There are limited related research works investigating the impact of RIS deployment in mmWave networks. A general tractable model for the coverage performance of the RIS-assisted mmWave networks focused on RIS and base station (BS) densities using stochastic geometry was proposed in [26]. A privacy-preserving design paradigm combining federated learning (FL) with RIS in the mmWave communication system was proposed in [27]. A deep learning algorithm was proposed in [28] to set up a relation between CSI and RIS configurations to improve the communication rate performance. An efficient cascaded channel estimation model for an RIS-assisted mmWave MIMO system, with the wideband effect on the transmission model, was considered in [29]. A hybrid precoding (HP) design for the RIS-aided multiuser (MU) mmWave communication systems was investigated in [30]. Artificial intelligence (AI)-empowered mmWave communications, especially using RIS, were studied in [31]. To the best of our knowledge, all the current research works on RIS-assisted mmWave assume perfect CSI. Based on it, the PS vectors of BS and RIS are adjusted to maximize the achievable spectral efficiency at the RX. Without this CSI, these PS vectors cannot be optimized due to the assumption of continuous PS. However, perfect CSI is a strong assumption violating the RIS hypothesis of being utterly passive without any channel estimation functionality.

System Model
Figure 1 shows the system model of the RIS-assisted mmWave communication, where the RIS is used to connect the mmWave BS with a single-antenna mmWave user equipment (UE) by routing around the blocker. The RIS is equipped with a uniform planar array (UPA) of M antenna elements, while the mmWave BS is equipped with a uniform linear array (ULA) of N antenna elements. An RIS controller is used to control the PSs of the RIS antenna elements based on the selected PS vector. In addition, the mmWave BS and RIS are connected through a dedicated communication link for control and information exchange. As a result, the received signal at the UE can be expressed as follows:
$$x = \mathbf{h}_{RU}^{H}\,\mathbf{\Phi}_i\,\mathbf{H}_{BR}\,\mathbf{f}_j\,s + n \qquad (1)$$
In (1), $s$ is the transmitted symbol and $x$ is the received one, where $E[ss^{H}] = P$ and $P$ is the TX power. $(\cdot)^{H}$ means Hermitian transpose, and $n \sim \mathcal{CN}(0, \sigma_0^2)$ is the complex additive white Gaussian noise (AWGN) with zero mean and variance of $\sigma_0^2$. $\mathbf{f}_j \in \mathbb{C}^{N \times 1}$ is the analog precoder vector of size N × 1 applied at the mmWave BS, and $\mathbf{\Phi}_i \in \mathbb{C}^{M \times M}$ is a diagonal matrix of size M × M containing the PSs of the RIS antenna elements in its diagonal. $\{i, j: 1 \le i \le |\mathcal{R}|, 1 \le j \le |\mathcal{F}|\}$ are the indices of the used $\mathbf{\Phi}$ and $\mathbf{f}$, where $\mathcal{R}$ and $\mathcal{F}$ are their finite sets. $\mathbf{H}_{BR} \in \mathbb{C}^{M \times N}$ is the channel matrix of size M × N between BS and RIS, while $\mathbf{h}_{RU} \in \mathbb{C}^{M \times 1}$ is the channel vector of size M × 1 between the RIS and UE.
Following the geometric channel models with limited scatterers given in [30], $\mathbf{H}_{BR}$ and $\mathbf{h}_{RU}$ are modeled as sums of $L_{BR}$ and $L_{RU}$ path contributions, respectively, where $L_{BR}$ and $L_{RU}$ are the numbers of channel paths between BS and RIS and between RIS and UE. $\xi_l \sim \mathcal{CN}(0, \sigma_{\xi_l}^2)$ and $\nu_l \sim \mathcal{CN}(0, \sigma_{\nu_l}^2)$ are the complex path gains of the $l$-th path among the $L_{BR}$ and $L_{RU}$ paths, respectively. Each path contributes through the array response vectors $\Lambda_R(\cdot)$ at the RIS and $\Lambda_B(\cdot)$ at the BS, evaluated at the corresponding azimuth and elevation angles of arrival (AoA) and departure (AoD). Generally, for any $\theta$ and $\varphi$, the UPA array response $\Lambda_R(\theta, \varphi)$ follows the form given in [30], where $d$ is the antenna spacing, $\lambda$ is the carrier wavelength, and $0 \le \{p, q\} \le \sqrt{M} - 1$ are the element indices.
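By analogy, the ULA array response $\Lambda_B(\chi_l^{(AoD)})$ follows the form given in [30], where $0 \le n \le N - 1$ is the element index. The mmWave-RIS optimization problem aims to jointly optimize $\mathbf{\Phi}$ and $\mathbf{f}$ to maximize the achievable spectral efficiency $\psi$ in bps/Hz at the UE. Mathematically, this can be expressed as follows:
$$\{i^*, j^*\} = \arg\max_{1 \le i \le |\mathcal{R}|,\; 1 \le j \le |\mathcal{F}|} \psi_{\Phi_i f_j},$$
where
$$\psi_{\Phi_i f_j} = \log_2\!\left(1 + \frac{P\,\left|\mathbf{h}_{RU}^{H}\,\mathbf{\Phi}_i\,\mathbf{H}_{BR}\,\mathbf{f}_j\right|^2}{\sigma_0^2}\right).$$
Herein, $\psi_{\Phi_i f_j}$ is the spectral efficiency at the UE resulting from using $\mathbf{\Phi}_i$ and $\mathbf{f}_j$, and the indices of the optimal values of $\mathbf{\Phi}$ and $\mathbf{f}$ are represented by $\{i^*, j^*\}$. Most of the existing literature assumes perfect CSI; i.e., $\mathbf{H}_{BR}$ and $\mathbf{h}_{RU}$ are well known at both BS and RIS. Based on that, both $\mathbf{\Phi}$ and $\mathbf{f}$ can be jointly adjusted using different iterative techniques [30][31][32]. However, this is a strong assumption, as it is very difficult to estimate $\mathbf{H}_{BR}$ and $\mathbf{h}_{RU}$ due to the use of massive antenna elements at both BS and RIS. Furthermore, RIS should be utterly passive without any channel estimation functionality.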
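The system model and the spectral efficiency objective above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' simulator: the function names, the half-wavelength spacing, the uniform angle draws, the unit-variance path gains, and the normalization factors sqrt(MN/L_BR) and sqrt(M/L_RU) are all hypothetical choices for a generic geometric channel with limited scatterers.

```python
import numpy as np

def ula_response(N, chi, d_over_lambda=0.5):
    """ULA array response at angle chi (rad), element index 0 <= n <= N-1."""
    n = np.arange(N)
    return np.exp(1j * 2 * np.pi * d_over_lambda * n * np.sin(chi)) / np.sqrt(N)

def upa_response(M, theta, phi, d_over_lambda=0.5):
    """Square UPA array response, element indices 0 <= p, q <= sqrt(M)-1."""
    side = int(round(np.sqrt(M)))
    assert side * side == M, "M must be a perfect square"
    p, q = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
    phase = 2 * np.pi * d_over_lambda * (p * np.sin(theta) * np.sin(phi) + q * np.cos(phi))
    return np.exp(1j * phase).ravel() / np.sqrt(M)

def geometric_channels(M, N, L_br, L_ru, rng):
    """Draw H_BR (M x N) and h_RU (M,) as sums of a few paths (assumed model)."""
    def cn(size):  # unit-variance circularly symmetric complex Gaussian gains
        return (rng.standard_normal(size) + 1j * rng.standard_normal(size)) / np.sqrt(2)
    def angle():   # assumed uniform angles in [-pi/2, pi/2]
        return rng.uniform(-np.pi / 2, np.pi / 2)
    H_br = np.zeros((M, N), dtype=complex)
    for xi in cn(L_br):
        H_br += xi * np.outer(upa_response(M, angle(), angle()), ula_response(N, angle()).conj())
    h_ru = np.zeros(M, dtype=complex)
    for nu in cn(L_ru):
        h_ru += nu * upa_response(M, angle(), angle())
    return np.sqrt(M * N / L_br) * H_br, np.sqrt(M / L_ru) * h_ru

def spectral_efficiency(h_ru, Phi, H_br, f, P=1.0, noise_var=1.0):
    """psi = log2(1 + P |h_RU^H Phi H_BR f|^2 / sigma_0^2) in bps/Hz."""
    gain = np.vdot(h_ru, Phi @ H_br @ f)  # np.vdot conjugates its first argument
    return float(np.log2(1.0 + P * abs(gain) ** 2 / noise_var))

def exhaustive_search(h_ru, H_br, Phis, fs, P=1.0, noise_var=1.0):
    """Test every (Phi_i, f_j) pair and return the best indices and the full table."""
    table = np.array([[spectral_efficiency(h_ru, Phi, H_br, f, P, noise_var) for f in fs]
                      for Phi in Phis])
    i_star, j_star = np.unravel_index(np.argmax(table), table.shape)
    return int(i_star), int(j_star), table

rng = np.random.default_rng(0)
M, N = 16, 8
H_br, h_ru = geometric_channels(M, N, L_br=5, L_ru=5, rng=rng)
# Toy candidate sets with 90-degree phase steps (illustrative only).
Phis = [np.diag(np.exp(1j * np.pi / 2 * k * np.arange(M))) for k in range(4)]
fs = [np.exp(1j * np.pi / 2 * k * np.arange(N)) / np.sqrt(N) for k in range(4)]
i_star, j_star, table = exhaustive_search(h_ru, H_br, Phis, fs)
```

The exhaustive search needs every entry of the |R| × |F| table, and each entry requires either the channels or one over-the-air measurement, which is the overhead the MAB approach of the next section aims to avoid.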

Proposed Antenna Codebook and MAB Approach
In this section, antenna codebook and MAB approach are suggested for the mmWave-RIS system to overcome mmWave CSI estimation and jointly adjust the PS vectors of BS and RIS.

Antenna Codebook Design
To eliminate the need for CSI estimation, discrete PSs are considered for both mmWave BS and RIS, where they constitute the antenna codebook of both. In this context, we utilize the antenna codebook of WiGig standards for PS design at both BS and RIS [12,13]. This codebook-based beam switching involves fixed beam patterns and can be realized using a predefined pool of antenna weight vectors maintained at TX and RX. Each column of a codebook matrix specifies the beamforming weight vector that corresponds to a unique beam pattern. The TX-RX beam pattern pair that optimizes a certain cost function is searched during beamforming according to an agreed criterion. Codebooks support a variety of antenna array geometries and offer flexibility in terms of the number, size, and spacing between antenna elements. For phased array antennas, the columns of the codebook matrix specify the discrete PSs applied to individual antenna elements. The patterns may be generated without amplitude adjustment to obtain processing power savings. As a guiding principle, the columns of the codebook are made orthogonal to each other so that multiple beams can be generated simultaneously without significant interference. These beams can also be synthesized to create a wider beam. Thus, in this codebook design, the PS vectors for K ≤ A, where A is the total number of antenna elements and K is the total number of PS vectors (i.e., beam directions), are given by the column vectors of the codebook matrix V specified in [12,13], with a dedicated form for the PS vector at k = 0 in the case that K = M/2. Thus, the columns of V are the available space for constructing f and the diagonal of Φ.
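As an illustration of such a codebook with 90-degree phase resolution and constant amplitude, the sketch below builds a matrix whose entry for antenna element a and beam k is j raised to floor(a · mod(k + K/2, K) / (K/4)). This closed form is an assumption based on the form commonly given for WiGig/IEEE 802.15.3c-style codebooks; it is not recovered from the text above, and the dedicated k = 0 case is deliberately not reproduced. The function name `ps_codebook` is hypothetical.

```python
import numpy as np

# Exact values of j**0, j**1, j**2, j**3 (avoids floating-point error of complex powers).
_J_POWERS = np.array([1, 1j, -1, -1j])

def ps_codebook(A: int, K: int) -> np.ndarray:
    """Return an A x K matrix V; column k is the PS vector of beam k.

    Assumed entry: V[a, k] = j ** floor(a * mod(k + K/2, K) / (K/4)).
    The exponent is evaluated in integer arithmetic as
    floor(2 * a * mod(2k + K, 2K) / K), which is the same quantity.
    """
    a = np.arange(A)[:, None]
    k = np.arange(K)[None, :]
    exponent = (2 * a * np.mod(2 * k + K, 2 * K)) // K
    return _J_POWERS[exponent % 4]

V = ps_codebook(A=4, K=4)
# Candidate precoders f_j (unit norm) and RIS matrices Phi_i (diagonal of phase shifts).
f_candidates = [V[:, k] / np.sqrt(V.shape[0]) for k in range(V.shape[1])]
Phi_candidates = [np.diag(V[:, k]) for k in range(V.shape[1])]
```

For A = K = 4 the columns coincide with the four DFT vectors, so they are mutually orthogonal, in line with the guiding principle stated above. For other (A, K) combinations orthogonality is not guaranteed by this sketch.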

Proposed Nested Two-Stage MAB Approach
Jointly optimizing the values of $\mathbf{\Phi}$ and $\mathbf{f}$ using the aforementioned codebook design will consume a considerable BT overhead due to the search over $|\mathcal{R}||\mathcal{F}|$ different candidate beams. Instead, an online learning approach is proposed to successively obtain the optimal solution over the time horizon. This considerably reduces the BT overhead, as only one pair $\mathbf{\Phi}_i$ and $\mathbf{f}_j$ is tested at a time. In this context, an online single-player MAB game is constructed to address this problem efficiently. In this formulation, the BS is considered as the player of the bandit game; the available joint values of $\mathbf{\Phi}_i$ and $\mathbf{f}_j$ are the arms of the bandit; and the achievable spectral efficiency at the UE, i.e., $\psi_{\Phi_i f_j}$, is the reward. This MAB-based optimization problem can be mathematically formulated as follows:
$$\max_{\{I_{i_t j_t}\}} \; \sum_{t=1}^{T_H} \sum_{i,j} I_{i_t j_t}\, \psi_{\Phi_{i_t} f_{j_t}}$$
s.t.
$$\sum_{i,j} I_{i_t j_t} = 1, \quad I_{i_t j_t} \in \{0, 1\}, \quad T_H \in \mathbb{Z}^{+},$$
where $T_H$ indicates the time horizon and $\mathbb{Z}^{+}$ is the set of all positive integers. $\psi_{\Phi_{i_t} f_{j_t}}$ is the spectral efficiency resulting from using the $\mathbf{\Phi}_i$ and $\mathbf{f}_j$ combination at time $t$, i.e., $\mathbf{\Phi}_{i_t}$ and $\mathbf{f}_{j_t}$. $I_{i_t j_t}$ is a selection indicator, which is equal to 1 if the combination $\mathbf{\Phi}_i$ and $\mathbf{f}_j$ is selected at time $t$ and 0 otherwise. The constraint $\sum_{i,j} I_{i_t j_t} = 1$ means that only one $\mathbf{\Phi}$ and $\mathbf{f}$ combination is allowed to be selected at time $t$. To reduce the computational complexity of the constructed MAB game, a nested two-stage MAB strategy is proposed. In the proposed algorithm, the value of $\mathbf{\Phi}$ is adjusted in the first MAB stage for a particular value of $\mathbf{f}$. Then, based on the adjusted value of $\mathbf{\Phi}$, the value of $\mathbf{f}$ is adjusted in the second MAB stage, and so on over the time horizon. In this context, two common MAB algorithms are used to implement the suggested nested two-stage MAB approach, namely the TS and UCB algorithms.

Proposed Nested Two-Stage TS Algorithm
TS is based on a pure Bayesian strategy [16], where prior/posterior distributions are considered for the arms' rewards. The parameters of the assumed probabilistic model are initialized for each arm at the beginning of the algorithm. Then, random samples are taken from the constructed distributions, and the arm related to the highest random sample is selected and played. After obtaining the reward corresponding to the played arm, its parameters are updated for the posterior distribution of the next round of the MAB game. In the proposed TS algorithm, normal distributions are considered for the spectral efficiency corresponding to each value of $\mathbf{\Phi}$ and $\mathbf{f}$ at time $t$, i.e., $\psi_{\Phi_{i_t} f_{j_t}} \sim \mathcal{N}\big(\bar{\psi}_{\Phi_{i_t} f_{j_t}}, \sigma^2_{\Phi_{i_t} f_{j_t}}\big)$, where $\bar{\psi}_{\Phi_{i_t} f_{j_t}}$ and $\sigma^2_{\Phi_{i_t} f_{j_t}}$ are the mean and variance of the assumed normal distribution, and $\bar{\psi}_{\Phi_{i_t} f_{j_t}}$ is the average value of $\psi_{\Phi_{i_t} f_{j_t}}$. This assumption comes from the AWGN term given in (1). Algorithm 1 gives the proposed nested two-stage TS algorithm, where the inputs to the algorithm are the spaces of codebooks $\mathcal{R}$ and $\mathcal{F}$, and the outputs are the adjusted values of $\mathbf{\Phi}_{i^*}$ and $\mathbf{f}_{j^*}$. At the beginning of the algorithm, i.e., $t = 1$, the average spectral efficiencies $\bar{\psi}_{\Phi_{i_t} f_{j_t}}$, the variances $\sigma^2_{\Phi_{i_t} f_{j_t}}$, and the numbers of selections $Z_{\Phi_{i_t} f_{j_t}}$ corresponding to all values of $\mathbf{\Phi}_i$ and $\mathbf{f}_j$ are set to 0, 1, and 0, respectively. In addition, a PS vector $\mathbf{f}$, i.e., $\mathbf{f}_{j^*_t}$, is initialized by picking it uniformly from its corresponding PS codebook. For $2 \le t \le T_H$, where $T_H$ is the total time horizon, the nested two-stage TS algorithm is performed as follows:

In the first stage and based on the value of $\mathbf{f}_{j^*_{t-1}}$, a value of $\mathbf{\Phi}_{i^*_t}$ is selected and its corresponding reward $\psi_{\Phi_{i^*_t} f_{j^*_{t-1}}}$ is obtained. This is done by sampling the prior distributions of $\psi_{\Phi_i f_{j^*_{t-1}}}$, i.e., $\tilde{\psi}_{\Phi_i f_{j^*_{t-1}}} \sim \mathcal{N}\big(\bar{\psi}_{\Phi_i f_{j^*_{t-1}}}, \sigma^2_{\Phi_i f_{j^*_{t-1}}}\big)$, $1 \le i \le |\mathcal{R}|$. Then, the index $i^*_t$ corresponding to the maximum sample is selected as follows:
$$i^*_t = \arg\max_{1 \le i \le |\mathcal{R}|} \tilde{\psi}_{\Phi_i f_{j^*_{t-1}}}.$$
Next, the value of the $\mathbf{\Phi}$ matrix corresponding to this index, i.e., $\mathbf{\Phi}_{i^*_t}$, is obtained. Afterward, its corresponding reward is achieved, and its model parameters are updated for its posterior distribution as given in Algorithm 1, where the methodology presented in [33] is used for updating the variance. In the second stage TS and based on $\mathbf{\Phi}_{i^*_t}$ coming from the first stage, a value of $\mathbf{f}_{j^*_t}$ is adjusted, and its corresponding reward $\psi_{\Phi_{i^*_t} f_{j^*_t}}$ is obtained. To this end, the samples $\tilde{\psi}_{\Phi_{i^*_t} f_j} \sim \mathcal{N}\big(\bar{\psi}_{\Phi_{i^*_t} f_j}, \sigma^2_{\Phi_{i^*_t} f_j}\big)$, $1 \le j \le |\mathcal{F}|$, are taken, and the index $j^*_t$ corresponding to the maximum sample value is chosen as follows:
$$j^*_t = \arg\max_{1 \le j \le |\mathcal{F}|} \tilde{\psi}_{\Phi_{i^*_t} f_j}.$$
Again, the value of the $\mathbf{f}$ vector corresponding to this index, i.e., $\mathbf{f}_{j^*_t}$, is obtained. Then, its corresponding reward $\psi_{\Phi_{i^*_t} f_{j^*_t}}$ is achieved, and its model parameters are updated for the posterior distribution of the next round of the MAB game as given in Algorithm 1.
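The two TS stages described above can be sketched as follows. This is a hedged illustration rather than a transcription of Algorithm 1: the variance update `1 / (Z + 1)` is a common Gaussian TS choice substituted here as an assumption, because the update rule of [33] is not given in the text, and the `reward` callback, the function name, and the toy reward table are hypothetical.

```python
import numpy as np

def nested_two_stage_ts(reward, n_phi, n_f, horizon, rng):
    """Alternate TS over the Phi index (stage 1) and the f index (stage 2).

    reward(i, j) must return the observed spectral efficiency of (Phi_i, f_j).
    """
    mean = np.zeros((n_phi, n_f))               # average spectral efficiency, set to 0
    var = np.ones((n_phi, n_f))                 # sampling variance, set to 1
    count = np.zeros((n_phi, n_f), dtype=int)   # number of selections Z, set to 0

    def play(i, j):
        r = reward(i, j)
        count[i, j] += 1
        mean[i, j] += (r - mean[i, j]) / count[i, j]   # running average
        var[i, j] = 1.0 / (count[i, j] + 1)            # ASSUMED update, not from [33]
        return r

    j = int(rng.integers(n_f))   # t = 1: pick f uniformly from its codebook
    i = 0
    history = []
    for _ in range(2, horizon + 1):
        # Stage 1: sample every Phi_i given the current f_j, play the best sample.
        i = int(np.argmax(rng.normal(mean[:, j], np.sqrt(var[:, j]))))
        play(i, j)
        # Stage 2: sample every f_j given the chosen Phi_i, play the best sample.
        j = int(np.argmax(rng.normal(mean[i, :], np.sqrt(var[i, :]))))
        history.append(play(i, j))
    return i, j, mean, count, history

rng = np.random.default_rng(0)
# Hypothetical reward table standing in for the unknown spectral efficiencies.
true_psi = np.outer([0.2, 0.5, 1.0], [0.3, 1.0, 0.6, 0.1])

def noisy_reward(i, j):
    return true_psi[i, j] + 0.05 * rng.standard_normal()

i_sel, j_sel, mean, count, history = nested_two_stage_ts(noisy_reward, 3, 4, horizon=200, rng=rng)
```

Each round draws only |R| + |F| samples and takes two measurements, instead of scoring all |R||F| joint arms. Because the search alternates between the two coordinates, it is a coordinate-wise procedure and is not guaranteed by this sketch to reach the jointly optimal pair for every reward table.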

Proposed Nested Two-Stage UCB Algorithm
UCB is based on increasing the confidence of the chosen arm by decreasing its uncertainty. This is done by exploiting the arm with the maximum achievable average reward so far while exploring the less selected ones. Algorithm 2 summarizes the proposed nested two-stage UCB algorithm. Like the TS algorithm, the inputs to the algorithm are the spaces of codebooks $\mathcal{R}$ and $\mathcal{F}$, and the outputs are the adjusted values of $\mathbf{\Phi}_{i^*}$ and $\mathbf{f}_{j^*}$. For initialization, at $t = 1$, the average spectral efficiencies $\bar{\psi}_{\Phi_{i_t} f_{j_t}}$ corresponding to all values of $\mathbf{\Phi}_i$ and $\mathbf{f}_j$ and their corresponding numbers of selections, i.e., $Z_{\Phi_{i_t} f_{j_t}}$, are set to 1. Moreover, a PS vector $\mathbf{f}$, i.e., $\mathbf{f}_{j^*_t}$, is picked uniformly from its corresponding PS codebook, i.e., $\mathcal{F}$. For $2 \le t \le T_H$, the nested two-stage UCB algorithm is conducted as follows: In the first UCB stage, based on the value of $\mathbf{f}_{j^*_{t-1}}$, the index of the $\mathbf{\Phi}$ matrix, $i^*_t$, is selected based on the UCB policy as follows [17]:
$$i^*_t = \arg\max_{1 \le i \le |\mathcal{R}|} \left( \bar{\psi}_{\Phi_i f_{j^*_{t-1}}} + \sqrt{\frac{2 \ln t}{Z_{\Phi_i f_{j^*_{t-1}}}}} \right),$$
where the term $\bar{\psi}_{\Phi_i f_{j^*_{t-1}}}$ represents the exploitation term, while the term $\sqrt{2 \ln t / Z_{\Phi_i f_{j^*_{t-1}}}}$ represents the exploration term of the UCB policy. After selecting $\mathbf{\Phi}_{i^*_t}$, its corresponding reward is obtained, and its average spectral efficiency and number of selections are updated as given in Algorithm 2. Based on the selected value of $\mathbf{\Phi}_{i^*_t}$, the value of $\mathbf{f}_{j^*_t}$ is adjusted in a nested manner via the second stage UCB as given in Algorithm 2. Then, its corresponding reward $\psi_{\Phi_{i^*_t} f_{j^*_t}}$ is obtained, and the related parameters are updated as given in Algorithm 2.
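A compact sketch of the nested two-stage UCB procedure is given below. It assumes the standard UCB1 index (running average plus sqrt(2 ln t / Z)) and treats the initial value of 1 for both the averages and the selection counts as one pseudo-observation per arm; the function name, the `reward` callback, and the toy reward table are hypothetical and not part of Algorithm 2.

```python
import numpy as np

def nested_two_stage_ucb(reward, n_phi, n_f, horizon, rng):
    """Alternate UCB over the Phi index (stage 1) and the f index (stage 2)."""
    mean = np.ones((n_phi, n_f))              # average spectral efficiency, set to 1
    count = np.ones((n_phi, n_f), dtype=int)  # number of selections Z, set to 1

    def ucb_index(mu, z, t):
        # Exploitation term (mu) plus exploration term (assumed UCB1 form).
        return mu + np.sqrt(2.0 * np.log(t) / z)

    def play(i, j):
        r = reward(i, j)
        count[i, j] += 1
        mean[i, j] += (r - mean[i, j]) / count[i, j]  # running average
        return r

    j = int(rng.integers(n_f))   # t = 1: pick f uniformly from its codebook
    i = 0
    history = []
    for t in range(2, horizon + 1):
        # Stage 1: choose Phi_i given the previous f_j.
        i = int(np.argmax(ucb_index(mean[:, j], count[:, j], t)))
        play(i, j)
        # Stage 2: choose f_j given the Phi_i just selected.
        j = int(np.argmax(ucb_index(mean[i, :], count[i, :], t)))
        history.append(play(i, j))
    return i, j, mean, count, history

rng = np.random.default_rng(1)
# Hypothetical reward table standing in for the unknown spectral efficiencies.
true_psi = np.outer([0.2, 0.5, 1.0], [0.3, 1.0, 0.6, 0.1])

def noisy_reward(i, j):
    return true_psi[i, j] + 0.05 * rng.standard_normal()

i_sel, j_sel, mean, count, history = nested_two_stage_ucb(noisy_reward, 3, 4, horizon=200, rng=rng)
```

Unlike TS, the arm choice here is deterministic given the observed rewards; randomness enters only through the initial pick of f and the noisy observations.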

Numerical Analysis
In this section, Monte Carlo (MC) numerical simulations are conducted to demonstrate the effectiveness of the proposed nested two-stage MAB algorithms compared to random PS selection, where the values of Φ and f are picked uniformly, and against the optimal performance. The optimal performance is obtained by testing all available candidate pairs of Φ and f and selecting the one maximizing ψ. Table 1 summarizes the utilized simulation parameters unless otherwise stated. In addition, it is assumed that the line-of-sight (LoS) path is 10 dB higher than the other paths [34]. Figure 2 shows the spectral efficiency of the compared schemes, i.e., nested two-stage TS, UCB, and random selection, at no blockage against the number of PS vectors (K), where K = M = N. Generally speaking, as K increases, the spectral efficiencies of all schemes increase due to the higher received power resulting from the larger beamforming gain. Although the proposed nested two-stage MAB algorithms do not need CSI estimation and only use the observed spectral efficiency, they perform well relative to the optimal performance compared to random PS selection. Moreover, the proposed nested two-stage TS algorithm outperforms all other compared schemes due to its Bayesian policy, which constructs prior/posterior distributions of the achievable spectral efficiency. As the assumed normal distribution closely matches the actual distribution of the attainable spectral efficiency, the proposed nested two-stage TS outperforms the UCB-based one. Random PS selection shows the worst performance because it selects the PS vectors arbitrarily without any optimization objective. At K = 4, about 98.5%, 97%, and 85.3% of the optimal performance are obtained by the proposed nested two-stage TS, UCB, and random selection, respectively. These values become 94.3%, 86%, and 71.7% when K = 64, where the number of alternative beam pairs is greatly increased. As TS is a Bayesian-based approach, its performance remains near the optimal one, while the performance of the other two schemes is considerably degraded compared to the case of K = 4.
Figure 3 shows the spectral efficiency of the schemes involved in the comparison at 80% blockage, which simulates a harsh blockage environment. In this context, four paths out of the five channel paths between BS and RIS and between RIS and UE undergo blockage, including the LoS path. Compared to Figure 2, the spectral efficiency decreases by more than 50% in this harsh blockage environment relative to the zero blockage case. This is due to the low power received from the only surviving path out of the five paths. Despite this harsh blockage environment, the proposed two-stage MAB algorithms still show good spectral efficiency performance against the optimal one. Yet, the proposed two-stage TS outperforms the other schemes due to the aforementioned reason. At K = 4, about 98.6%, 95.7%, and 72% of the optimal performance are obtained by the proposed nested two-stage TS, UCB, and random selection, respectively. These values become 95.35%, 84.4%, and 58.3% at K = 64, respectively. By comparing these ratios with those given in the previous paragraph and shown in Figure 2, it is clear that the performance of random PS selection is considerably degraded compared to the optimal performance due to the blockage effect. However, the proposed two-stage MAB algorithms achieve almost the same ratios of the optimal performance even in this harsh blockage environment.
Figures 2 and 3, respectively. However, the spectral efficiencies of K = 49 and 64 in Figures 4 and 5 are less than those given in Figures 2 and 3 due to the use of a lower number of antenna elements at both BS and RIS. Typically, the half-power beamwidth is inversely proportional to the number of antenna elements. Thus, increasing the number of antenna elements for generating the same codebook pattern, i.e., the same number of beams, has two opposite effects. On one hand, it generates narrower beams with larger antenna gains [12,13], and on the other hand, it is more vulnerable to phase shift errors. This increases the gain loss at the maximum gain [12] due to the increase in the interbeam null angles [13]. This is the reason why the spectral efficiency does not increase much when increasing the number of antenna elements for generating the same number of PS vectors. Interested readers can check the detailed analysis given in [12,13] in this regard. Still, the proposed nested two-stage TS has the best performance, which nearly matches the optimal performance due to the aforementioned reasoning. At K = 4 and zero (80%) blockage, about 99% (97.4%), 97.5% (92%), and 83.5% (63.6%) of the optimal performance are obtained by the proposed nested two-stage TS, UCB, and random selection, respectively. These values become 97% (95%), 87% (87%), and 74.5% (61%) when K = 64, respectively. Again, the random selection is affected by the blockage more than the proposed nested two-stage MAB schemes. This means that the proposed algorithm can efficiently withstand the blockage effect due to its hypothesis of maximizing the achievable spectral efficiency irrespective of the environmental conditions.
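The inverse relation between half-power beamwidth and the number of antenna elements can be checked numerically. The sketch below is only an illustration under simplifying assumptions (a uniformly weighted broadside ULA with half-wavelength spacing and ideal, error-free phase shifters); the function name `hpbw_deg` and the angular grid step are hypothetical, and the phase-error effects discussed in [12,13] are not modeled.

```python
import numpy as np

def hpbw_deg(n_elements, d_over_lambda=0.5, step_deg=0.005):
    """Half-power beamwidth (degrees) of a broadside, uniformly weighted ULA."""
    theta = np.deg2rad(np.arange(0.0, 90.0, step_deg))
    n = np.arange(n_elements)
    # Array factor: sum over elements of exp(j * 2*pi * (d/lambda) * n * sin(theta)).
    steering = np.exp(1j * 2 * np.pi * d_over_lambda * np.outer(np.sin(theta), n))
    power = np.abs(steering.sum(axis=1)) ** 2 / n_elements ** 2  # normalized to 1 at broadside
    first_below = int(np.argmax(power < 0.5))  # first grid angle past the -3 dB point
    return 2.0 * np.rad2deg(theta[first_below])

for n_el in (4, 16, 64):
    print(n_el, "elements ->", round(hpbw_deg(n_el), 2), "degrees")
```

Quadrupling the number of elements shrinks the beamwidth by roughly a factor of four, which is why a fixed number of K beams covers the angular space less densely as the arrays grow.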
Figures 6 and 7 show the spectral efficiency against the TX signal-to-noise ratio (SNR) in dB.
Figures 8 and 9 show the spectral efficiency convergence rate of the proposed nested two-stage MAB schemes against the time horizon t using K = N = M = 16 at zero blockage and at a moderate blockage probability of 50%, respectively. Because of blockage, the spectral efficiency in Figure 9 is lower than that in Figure 8. These figures show that the proposed nested two-stage TS converges faster than the UCB-based one owing to its Bayesian strategy. Interestingly, the proposed nested two-stage UCB converges better than the TS-based one at low values of t, while the TS algorithm is still building up the prior/posterior distributions of the achievable reward. Once these distributions are constructed, TS converges faster than UCB. At t = 400, about 99% (99%) and 95.12% (94.8%) of the optimal performance are obtained by the proposed nested two-stage TS and the UCB-based one at zero (50%) blockage, respectively.
Figures 10 and 11 show the spectral efficiency convergence rate using N = 36, M = 64, and K = 16 at zero and 50% blockage, respectively. Comparing Figure 8 with Figure 10 and Figure 9 with Figure 11 shows that both the spectral efficiency and the convergence rate in Figures 10 and 11 are better than those in Figures 8 and 9, respectively. This comes from the increased number of antenna elements. At t = 400, about 99.4% (99.1%) and 95.5% (95%) of the optimal performance are obtained by the proposed nested two-stage TS and the UCB-based one at zero (50%) blockage, respectively.
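To illustrate how a two-stage TS loop builds up its posterior estimates over the time horizon, the following toy sketch may help. It is not the authors' implementation: it assumes a hypothetical separable reward table `gain[f][r]` in place of the measured spectral efficiency, a standard Gaussian posterior with a unit-variance prior for each PS vector, and independent statistics per stage that are coupled only through the shared reward. All names are illustrative.

```python
import random


def nested_two_stage_ts(gain, horizon, noise_std=0.1, seed=0):
    """Toy two-stage Gaussian Thompson sampling.

    gain[f][r] is a hypothetical mean reward for TX PS vector f and
    RIS PS vector r. Returns the list of observed (noisy) rewards.
    """
    rng = random.Random(seed)
    n_tx, n_ris = len(gain), len(gain[0])
    # Per-arm reward sums and pull counts for each stage.
    tx_sum, tx_cnt = [0.0] * n_tx, [0] * n_tx
    ris_sum, ris_cnt = [0.0] * n_ris, [0] * n_ris

    def sample(total, count):
        # Posterior N(total / (count + 1), 1 / (count + 1)) under an N(0, 1) prior.
        return rng.gauss(total / (count + 1), (count + 1) ** -0.5)

    rewards = []
    for _ in range(horizon):
        # Stage 1: draw one sample per TX PS vector, play the largest.
        f = max(range(n_tx), key=lambda i: sample(tx_sum[i], tx_cnt[i]))
        # Stage 2: draw one sample per RIS PS vector, play the largest.
        r = max(range(n_ris), key=lambda j: sample(ris_sum[j], ris_cnt[j]))
        reward = gain[f][r] + rng.gauss(0.0, noise_std)
        # Both stages are updated with the same observed reward.
        tx_sum[f] += reward
        tx_cnt[f] += 1
        ris_sum[r] += reward
        ris_cnt[r] += 1
        rewards.append(reward)
    return rewards


# Hypothetical per-stage gains; the best pair has mean reward 1.0.
tx_gain = [0.2, 0.5, 1.0, 0.4]
ris_gain = [0.3, 1.0, 0.6, 0.1]
gain = [[a * b for b in ris_gain] for a in tx_gain]

rewards = nested_two_stage_ts(gain, horizon=2000)
tail = rewards[-500:]
print("mean reward over the last 500 rounds:", sum(tail) / len(tail))
```

Early rounds are dominated by wide posteriors, so the choices are close to random. As the counts grow, the samples concentrate around the empirical means and the loop settles on the best pair, which mirrors the slow start followed by fast convergence described above. A UCB variant would replace the sampling step with a deterministic index.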
The perfect CSI-based scheme presented in [30] reaches about 87~88% of the upper bound performance in its highest SNR scenario, and it does so while assuming perfect mmWave CSI, which is impractical in real scenarios. In contrast, the proposed nested two-stage TS reaches about 94~99% of the optimal performance across the different simulation scenarios. Figure 12 shows the spectral efficiency ratio of the proposed nested two-stage TS, nested two-stage UCB, and the scheme proposed in [30], each relative to the random selection performance, against TX SNR. For a fair comparison, we used the same simulation parameters given in [30], i.e., N = 48, M = 100, and K = 6, and the same TX SNR values. As the figure shows, the proposed nested two-stage TS achieves the highest spectral efficiency ratio, and both MAB schemes outperform the scheme presented in [30]. At SNR = −25 dB, the spectral efficiency ratios of the proposed nested two-stage TS, nested two-stage UCB, and the scheme given in [30] are 5.5, 4.8, and 4, respectively. This corresponds to improvements of about 37.5% and 20% in spectral efficiency for the proposed TS and UCB schemes, respectively, over the scheme presented in [30]. These gains are obtained without any knowledge of the CSI of either the mmWave BS or the RIS.
Figure 12. Spectral efficiency comparisons between the proposed nested two-stage MAB schemes and the scheme proposed in [30].
The complexity analysis clearly shows that the proposed nested two-stage MAB scheme has low BT and computational complexities compared to the optimal solution. This is because the optimal strategy explores all available {R, F} pairs, so its BT and computational complexities are both of order O(|R||F|). In the proposed MAB approach, however, the sets R and F are explored alternately at every time t. Thus, the BT complexity of the proposed schemes is of order O(1). Regarding the computational complexity, for the proposed nested two-stage TS, the primary source is sampling a one-dimensional Gaussian random variable and updating its related parameters, with a complexity of O(|R| + |F| + 1). Similarly, the computational complexity of the proposed nested two-stage UCB comes from selecting the optimal PS and updating its corresponding parameters, with the same order of O(|R| + |F| + 1). For example, when |F| = 36 and |R| = 64, the BT and computational complexities of the optimal solution are of order O(2304), while those of the proposed nested two-stage MAB approach are O(1) and O(101), respectively. This means that reductions of about 99.96% and 96% in BT and computational complexities are obtained, respectively. Consequently, the proposed nested two-stage MAB approach achieves near-optimal performance with much lower complexity.
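The arithmetic in the example above can be reproduced with a short calculation. This is only a sketch: the function name `complexity_summary` and its dictionary keys are hypothetical, and the complexity orders are treated as plain operation counts for the purpose of the comparison.

```python
def complexity_summary(num_tx_ps: int, num_ris_ps: int) -> dict:
    """Compare exhaustive search with the two-stage MAB complexity orders.

    num_tx_ps plays the role of |F| and num_ris_ps the role of |R|.
    """
    exhaustive = num_tx_ps * num_ris_ps        # O(|R||F|): every pair is tried
    mab_bt = 1                                 # O(1): one pair trained per time step
    mab_compute = num_tx_ps + num_ris_ps + 1   # O(|R| + |F| + 1)
    return {
        "exhaustive": exhaustive,
        "mab_bt": mab_bt,
        "mab_compute": mab_compute,
        "bt_reduction_pct": 100 * (1 - mab_bt / exhaustive),
        "compute_reduction_pct": 100 * (1 - mab_compute / exhaustive),
    }


summary = complexity_summary(num_tx_ps=36, num_ris_ps=64)
for key, value in summary.items():
    print(key, value)
```

With |F| = 36 and |R| = 64, the two reduction percentages agree with the approximately 99.96% and 96% quoted in the text.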

Conclusions
In this paper, we have explored RIS-assisted mmWave communications. To avoid estimating the massive mmWave CSI at both RIS and UE, we proposed using antenna codebooks. Moreover, the problem of jointly optimizing the PS vectors at both mmWave BS and RIS was formulated as a MAB game, which contributes to relaxing the required BT overhead. In this context, a nested two-stage MAB strategy was suggested, and nested two-stage TS and UCB algorithms were proposed to implement the proposed strategy. Simulation analyses confirm the superior performance of the proposed two-stage TS compared to the UCB-based one. Moreover, the proposed nested two-stage MAB schemes outperform random selection and other benchmarks with a high convergence rate and low BT and computational complexities.