Autonomous Power Decision for the Grant Free Access MUSA Scheme in the mMTC Scenario

Non-orthogonal multiple access schemes with grant free access have been recently highlighted as a prominent solution to meet the stringent requirements of massive machine-type communications (mMTCs). In particular, the multi-user shared access (MUSA) scheme has shown great potential to grant free access to the available resources. For the sake of simplicity, MUSA is generally conducted with the successive interference cancellation (SIC) receiver, which offers a low decoding complexity. However, this family of receivers requires sufficiently diversified received user powers in order to ensure the best performance and avoid the error propagation phenomenon. The power allocation has been considered as a complicated issue especially for a decentralized decision with a minimum signaling overhead. In this paper, we propose a novel algorithm for an autonomous power decision with a minimal overhead based on a tight approximation of the bit error probability (BEP) while considering the error propagation phenomenon. We investigate the efficiency of multi-armed bandit (MAB) approaches for this problem in two different reward scenarios: (i) in Scenario 1, each user reward only informs about whether its own packet was successfully transmitted or not; (ii) in Scenario 2, each user reward may carry information about the other interfering user packets. The performances of the proposed algorithm and the MAB techniques are compared in terms of the successful transmission rate. The simulation results prove that the MAB algorithms show a better performance in the second scenario compared to the first one. However, in both scenarios, the proposed algorithm outperforms the MAB techniques with a lower complexity at user equipment.


Introduction
The future radio access network of the fifth-generation is expected to support a variety of applications with different qualities of service (QoSs). These services are classified by the International Telecommunications Union and the Third-Generation Partnership Project into three main use cases with different stringent requirements, namely enhanced mobile broadband (eMBB), ultra reliable and low latency communications (uRLLC), and massive machine-type communications (mMTCs). The latter is also known as massive IoT as it is designed to mainly deal with a massive number of connected devices [1], i.e., one million connected devices per km$^2$. The mMTC use case is characterized by short packet communications, i.e., on the order of a few bytes, low system complexity, and low energy consumption, which leads to a battery life on the order of ten years. The conventional orthogonal multiple access (OMA) schemes are limited by the restricted number of available orthogonal resources, and thereby, they may not be suitable to handle the huge number of devices to be connected in the mMTC scenario. However, the non-orthogonal multiple access (NOMA) schemes have been underlined as a prominent solution the cost of a complex decoding process. In addition, one problem of grant free access is the estimation of the number of active users. This issue was addressed in [12] by proposing a deep learning algorithm, which uses the recorded user activities at the base station to predict their future behavior. This prediction is given as an input to a modified orthogonal matching pursuit algorithm to improve the multi-user detection and reduce the error probability. In [13], a sinusoidal code was proposed for the signals' separation in the context of the mMTC scenario with grant free access. The proposed spreading sequences permit using non-iterative algorithms for multi-user detection without prior knowledge of the channel state information and the number of active users.
The authors in [14] dealt with the problem of packet collisions in a grant free access context without a re-transmission opportunity. A novel grant free access framework was proposed where the non-decoded users considered the collisions occurring as interference. Moreover, the system performance was evaluated analytically, and the authors provided simplified expressions of the outage probability and the system throughput.
SCMA has particularly been studied with grant free access protocols. For instance, in [15], the authors studied the application of SCMA with faster-than-Nyquist signaling, which improved the spectral efficiency, but at the expense of a higher inter-symbol and inter-user interference. Therefore, a novel algorithm based on expectation propagation was proposed for the channel estimation, the detection of user activities, and the signal decoding. The work in [16] investigated an iterative message passing algorithm for grant free access SCMA, based on belief propagation. The proposed algorithm permits jointly estimating the channel coefficients, identifying the number of active users, and detecting the transmitted data while improving the bit error rate compared to the other techniques.
Regarding the system design, MUSA has the potential to enable grant free access with minimum signaling overhead in the context of mMTC applications. Unlike the SCMA scheme, which requires the assignment of the codebook beforehand, in MUSA, each user randomly and autonomously selects a spreading sequence within a predefined constellation. In other words, users can transmit their data at any moment without going through a resource allocation process with the base station, which minimizes the amount of signaling overhead. The MUSA scheme is typically used with a SIC receiver for multi-user detection, which provides a low decoding complexity. However, the SIC technique may suffer from the error propagation phenomenon when the received powers are similar [17]. The power allocation process is usually performed in a centralized manner [18,19] where the base station knows the channel state information of all users. For grant free access, each user performs a blind transmission with no information about its propagation environment and interfering users, which makes the power determination more complex.
The autonomous power decision for NOMA schemes with the grant free access strategy has recently been investigated in several works. An interesting solution is to use multi-armed bandit (MAB) algorithms, which belong to the broader reinforcement learning paradigm [20,21]. MAB techniques can be applied to the problem of dynamic resource allocation by balancing the exploration and exploitation phases. At each time slot, each agent selects an arm, i.e., the physical resource to be shared, from a set according to a predefined policy in order to maximize its cumulative reward and hence minimize its regret. MAB algorithms have been used in several applications such as marketing, advertising, and cellular communications. For instance, the authors in [22] applied MAB algorithms to the autonomous power decision problem in order to maximize the user rates for the PD-NOMA scheme. The user rewards are their rates. However, these rates may require many bits to encode, which increases the signaling overhead; hence, this approach may not be well suited to mMTC scenarios. MAB was also combined with NOMA schemes in [23], where the authors proposed a distributed NOMA-based MAB approach to handle the channel access problem in cognitive radio networks. Moreover, the authors in [24] applied MAB algorithms in the LTE cellular network for autonomous subcarrier allocation in a dense network while taking into consideration the dynamic resource occupation in each surrounding cell.
To the best of our knowledge, no work has investigated the problem of autonomous power decision for grant free access with the MUSA scheme. The characteristics of spreading sequences and the principle of the SIC receiver make the power decision more complex. Therefore, in this paper, we deal with this issue with minimum signaling overhead to address the mMTC requirements. The goal is to improve the system performance, measured with the successful transmission rate, in order to approach the performance of an optimal centralized power allocation. The latter is quite difficult to obtain, especially for SIC receivers with the error propagation problem. To do so, we start by proposing an approximated expression for the bit error probability (BEP) while considering the inter-user interference and the effect of error propagation. The optimal power value of users is obtained as the solution of the minimization of the global average BEP. Based on the derived BEP expression, we propose a novel algorithm for power selection for the MUSA scheme with a reduced signaling overhead. The proposed algorithm is compared with known index-based MAB algorithms adapted to the power selection by each user. In this part, we investigate two scenarios for selecting the best arm by each MAB algorithm: a scenario where the arm index computation by a user is only based on the decoding status of its own packet, i.e., success or failure, and another scenario where it depends on the decoding status of the other users' packets in addition to its own packet decoding status.
This paper is organized as follows. The system model and the fundamentals of MUSA are introduced in Section 2. The SIC receiver is revisited in Section 3, while a closed-form expression for users' bit error probability is derived in Section 4. Then, the proposed algorithm for autonomous power decision is described in Section 5. The multi-armed bandit algorithms and the studied scenarios are introduced in Section 6. A comparison of all power decision approaches is provided in Section 7. Numerical results and performance analysis are conducted in Section 8, and conclusions are drawn in Section 9.
Notations: Vectors and matrices are denoted in lowercase and uppercase, respectively, and in bold font, while scalars use a normal font weight. The complex and real number sets are denoted by $\mathbb{C}$ and $\mathbb{R}$, respectively. Moreover, $(\cdot)^T$ and $(\cdot)^H$ stand for the transpose and Hermitian operations. $\mathrm{diag}(\mathbf{a})$ represents the diagonal matrix created with the elements of vector $\mathbf{a}$ in the main diagonal.

System Model
An uplink communication system of $J$ users transmitting over $K$ orthogonal subcarriers is considered. The active users share the available resources using the MUSA scheme with grant free access. Each user's bits are mapped to a series of symbols through an $M$-ary modulation block. Then, the modulated symbols are multiplied by the users' spreading sequences and spread over the available subcarriers, as illustrated in Figure 1. User sequences $\mathbf{s}_j$, $\forall j \in \{1, \cdots, J\}$, are such that $\mathbf{s}_j \in \{a + \mathrm{j}\,b\}^K$, where $(a, b) \in \{-1, 0, 1\}^2$ and $\mathrm{j}$ here denotes the imaginary unit. The received signal on subcarrier $k$ of each OFDM symbol is:

$$y_k = \sum_{j=1}^{J} h_{kj}\, s_{kj}\, \sqrt{p_j}\, x_j + n_k, \qquad (1)$$

where $h_{kj}$ and $s_{kj}$ are the $k$-th component of the $j$-th user channel vector and spreading sequence, i.e., $\mathbf{h}_j$ and $\mathbf{s}_j$, respectively. Moreover, $x_j$ and $p_j$ are the transmitted symbol and the transmission power of the $j$-th user, respectively, and $n_k$ is the additive white Gaussian noise component on the $k$-th subcarrier with $\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}_K)$, where $\mathbf{I}_K$ is the $K$-by-$K$ identity matrix. The multiplexed received signals on all subcarriers can be written as:

$$\mathbf{y} = \mathbf{G}\,\mathbf{P}^{1/2}\,\mathbf{x} + \mathbf{n}, \qquad (2)$$

where $\mathbf{P} = \mathrm{diag}(p_1, p_2, \ldots, p_J) \in \mathbb{R}_+^{J \times J}$ is the transmission power matrix, $\mathbf{x} = [x_1, x_2, \ldots, x_J]^T$ is the vector of transmitted user symbols with $E[\mathbf{x}\mathbf{x}^H] = \mathbf{I}_J$, and $\mathbf{G}$ is the equivalent channel matrix including the spreading sequences such that:

$$\mathbf{G} = \mathbf{H} \odot \mathbf{S}, \qquad (3)$$

where $\mathbf{H} = [\mathbf{h}_1, \cdots, \mathbf{h}_J]$, $\mathbf{S} = [\mathbf{s}_1, \cdots, \mathbf{s}_J]$, and $\odot$ is the Hadamard product, i.e., $g_{kj} = h_{kj} s_{kj}$.
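As an illustration of the signal model above, the following minimal Python sketch draws MUSA spreading sequences from the alphabet $\{a + \mathrm{j}b\}$ and builds the per-subcarrier received signal. The function names, the rejection of the all-zero sequence, and the unit-energy normalization step are assumptions made for this example and are not taken from the paper's simulator.

```python
import math
import random

# Sequence elements a + jb with a, b in {-1, 0, 1}
ALPHABET = [complex(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)]

def draw_sequence(K, rng):
    """Draw one length-K spreading sequence and normalize it to unit energy."""
    while True:
        s = [rng.choice(ALPHABET) for _ in range(K)]
        energy = sum(abs(c) ** 2 for c in s)
        if energy > 0:  # assumption: reject the all-zero sequence
            return [c / math.sqrt(energy) for c in s]

def received_signal(H, S, p, x, sigma2, rng):
    """Return y with y[k] = sum_j H[k][j] * S[k][j] * sqrt(p[j]) * x[j] + n_k."""
    K, J = len(H), len(p)
    std = math.sqrt(sigma2 / 2)  # per-dimension std of circular complex noise
    y = []
    for k in range(K):
        n_k = complex(rng.gauss(0, std), rng.gauss(0, std))
        y.append(sum(H[k][j] * S[k][j] * math.sqrt(p[j]) * x[j]
                     for j in range(J)) + n_k)
    return y
```

Here `H` and `S` are indexed as `[subcarrier][user]`, so the product `H[k][j] * S[k][j]` plays the role of the entry $g_{kj}$ of the equivalent channel matrix.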

Multi-User Detection
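The SIC receiver offers a low decoding complexity compared to other MUD algorithms, namely the message passing algorithm or the maximum a posteriori algorithm [25]. However, SIC's performance depends on the user received powers, and the receiver performs better when the received powers are sufficiently different. MUSA is typically used with ordered-SIC jointly with a linear detection receiver such as the minimum mean squared error (MMSE). The MMSE matrix is calculated as in [26]:

$$\mathbf{W} = \left(\mathbf{G}\mathbf{P}\mathbf{G}^H + \sigma^2 \mathbf{I}_K\right)^{-1} \mathbf{G}\mathbf{P}^{1/2}. \qquad (4)$$

The main principle of the ordered-SIC technique is to successively estimate the user symbol, reconstruct the generated interference, and then subtract it from the received signal. User symbols are decoded in descending order of their SINRs. Assume that users are indexed in their decoding order and that the received signal at the $j$-th iteration is:

$$\mathbf{y}^{(j)} = \sum_{i=j}^{J} \mathbf{g}_i \sqrt{p_i}\, x_i + \mathbf{n}, \qquad (5)$$

where $\mathbf{g}_j$ is the $j$-th column of the matrix $\mathbf{G}$. Then, the SINR of the user $j$ picked to be decoded is:

$$\mathrm{SINR}_j = \frac{p_j \left|\mathbf{w}_j^H \mathbf{g}_j\right|^2}{\sum_{i=j+1}^{J} p_i \left|\mathbf{w}_j^H \mathbf{g}_i\right|^2 + \sigma^2 \left\|\mathbf{w}_j\right\|^2}, \qquad (6)$$

where $\mathbf{w}_j$ is the $j$-th column of the MMSE matrix $\mathbf{W}$. After that, the user symbol is estimated by multiplying the row vector $\mathbf{w}_j^H$ by the received column signal as follows:

$$\hat{x}_j = \mathbf{w}_j^H\, \mathbf{y}^{(j)}. \qquad (7)$$

The interference generated by the $j$-th user is reconstructed and then subtracted from the received signal, which is updated as follows:

$$\mathbf{y}^{(j+1)} = \mathbf{y}^{(j)} - \mathbf{g}_j \sqrt{p_j}\, \hat{x}_j. \qquad (8)$$

After each iteration, the $j$-th column of the matrix $\mathbf{G}$, corresponding to the decoded user $j$, is removed, and the MMSE matrix is recalculated as in (4). This process is repeated until all users are decoded.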
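The ordered MMSE-SIC loop described above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' receiver: the function name `mmse_sic` is hypothetical, the filter is taken to be the standard linear MMSE solution, and the soft estimate is rescaled to be unbiased before the hard decision (for QPSK this scaling does not change the decision).

```python
import numpy as np

def mmse_sic(y, G, p, sigma2, constellation):
    """Ordered MMSE-SIC; returns hard symbol decisions indexed by user."""
    y = np.array(y, dtype=complex)          # working copy of the received signal
    K, J = G.shape
    remaining = list(range(J))
    x_hat = np.zeros(J, dtype=complex)
    while remaining:
        A = G[:, remaining] * np.sqrt(p[remaining])   # columns g_i * sqrt(p_i)
        R = A @ A.conj().T + sigma2 * np.eye(K)       # signal-plus-noise covariance
        W = np.linalg.solve(R, A)                     # one MMSE filter per column
        sinr = np.empty(len(remaining))
        for c in range(len(remaining)):
            w = W[:, c]
            gains = np.abs(w.conj() @ A) ** 2         # |w^H g_i|^2 p_i for all i
            noise = sigma2 * np.linalg.norm(w) ** 2
            sinr[c] = gains[c] / (gains.sum() - gains[c] + noise)
        c = int(np.argmax(sinr))                      # decode the best SINR first
        j = remaining[c]
        w = W[:, c]
        z = (w.conj() @ y) / (w.conj() @ A[:, c])     # unbiased soft estimate
        x_hat[j] = constellation[np.argmin(np.abs(constellation - z))]
        y -= A[:, c] * x_hat[j]                       # cancel reconstructed signal
        remaining.pop(c)                              # drop the decoded user
    return x_hat
```

If a hard decision is wrong, the subtraction step injects a residual term into `y`, which is exactly the error propagation effect the BEP analysis has to account for.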

BEP Analysis
Error propagation is one of the critical issues of SIC receivers: it significantly deteriorates the system performance and makes the derivation of the BEP expression more complicated. For a Gray mapping, two adjacent symbols differ in only a single bit. Hence, treating the inter-user interference as noise, an erroneous detection most often leads to the detection of an adjacent symbol with only one wrong bit compared to the correct symbol [27]. Therefore, the average system BEP is well approximated as:

$$P_b \approx \frac{1}{J \log_2 M} \sum_{j=1}^{J} P_{e_j}, \qquad (9)$$

where $P_{e_j}$ is the symbol error probability (SEP) of the $j$-th user. In the following, we investigate the BEP of the MMSE-SIC receiver with two different hypotheses: (i) perfect SIC with no error propagation (NEP); (ii) imperfect SIC with error propagation (EP).

Perfect SIC without Error Propagation
In this case, since there is no error propagation in the receiver, the BEP is calculated similarly as for the MMSE receiver while updating the MMSE matrix at each iteration, and the SINRs are calculated as in (6). For a QPSK modulation and treating the inter-user interference as noise [28], the $j$-th user SEP is approximated as [27]:

$$P_{e_j} \approx 2\, Q\!\left(\sqrt{\mathrm{SINR}_j}\right), \qquad (10)$$

where $Q(\cdot)$ is the Gaussian tail function.
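To make the no-error-propagation case concrete, the short sketch below evaluates a QPSK SEP and the corresponding average BEP from a list of per-user SINRs. It assumes the common high-SNR approximation $2Q(\sqrt{\mathrm{SINR}})$ for the QPSK SEP and a Gray-mapped $M$-ary constellation; the function names are illustrative only.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) expressed through erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def qpsk_sep(sinr):
    """Approximate QPSK symbol error probability (assumed form 2*Q(sqrt(SINR)))."""
    return 2 * q_function(math.sqrt(sinr))

def average_bep(sinrs, M=4):
    """Average BEP under Gray mapping: mean SEP divided by log2(M)."""
    return sum(qpsk_sep(g) for g in sinrs) / (len(sinrs) * math.log2(M))
```

With this approximation and $M = 4$, the BEP of a single user reduces to $Q(\sqrt{\mathrm{SINR}})$, which is the exact per-bit error probability of Gray-mapped QPSK.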

Imperfect SIC with Error Propagation
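In that case, the BEP of each user depends on the previously decoded users. In this paper, we were inspired by the approach proposed in [28], and thereby, the SEP of the $j$-th user is calculated as:

$$P_{e_j} = \sum_{i=1}^{N_j} P\!\left(\varepsilon_j \mid \mathbf{b}_i^j\right) P\!\left(\mathbf{b}_i^j\right), \qquad (11)$$

where $N_j = 2^{j-1}$ is the number of possible $(j-1)$-dimensional binary sequences $\mathbf{b}_i^j = [b_{i,1}^j, \cdots, b_{i,j-1}^j]$, with $b_{i,k}^j = 0$ if the symbol of the $k$-th decoded user is correctly detected and one otherwise. Each sequence refers to the state, correctly decoded or not, of all the $(j-1)$ previously decoded users. The event $\varepsilon_j$ indicates an erroneous detection of the $j$-th user symbol. Hence, $P(\varepsilon_j \mid \mathbf{b}_i^j)$ is the error probability of the $j$-th user symbol conditioned on the sequence $\mathbf{b}_i^j$. Considering a possible error propagation occurrence, the received signal at the $j$-th SIC iteration is represented as:

$$\mathbf{y}^{(j)} = \sum_{m=j}^{J} \mathbf{g}_m \sqrt{p_m}\, x_m + \sum_{k=1}^{j-1} \mathbf{g}_k \sqrt{p_k}\, (x_k - \hat{x}_k) + \mathbf{n}, \qquad (12)$$

where $\hat{x}_k$ is the faulty estimation of $x_k$. The additional term compared to (5) is generated by the erroneous detection of the previous users. This may significantly affect the system performance. Therefore, the experienced noise and the new interference term can be combined in $\mathbf{n}_{eq} = \sum_{k=1}^{j-1} \mathbf{g}_k \sqrt{p_k}\, (x_k - \hat{x}_k) + \mathbf{n}$. The resulting term is approximated as a centered Gaussian random variable, where $E[\mathbf{n}_{eq}] = \mathbf{0}$ and $E[\mathbf{n}_{eq}\mathbf{n}_{eq}^H] = \left(\sum_{k=1}^{j-1} p_k\, d\, \delta_k\, \mathbf{g}_k \mathbf{g}_k^H\right) + \sigma^2 \mathbf{I}_K$. We define $d$ as the square of the Euclidean distance between the neighboring symbols and $\delta_k = 1$ if $x_k \neq \hat{x}_k$ and zero otherwise. As a consequence, the SINR of the $j$-th user, corresponding to the detection combination $\mathbf{b}_i^j$ (i.e., $\delta_k = b_{i,k}^j$), is calculated as follows:

$$\mathrm{SINR}_j\!\left(\mathbf{b}_i^j\right) = \frac{p_j \left|\mathbf{w}_j^H \mathbf{g}_j\right|^2}{\sum_{m=j+1}^{J} p_m \left|\mathbf{w}_j^H \mathbf{g}_m\right|^2 + \sum_{k=1}^{j-1} p_k\, d\, \delta_k \left|\mathbf{w}_j^H \mathbf{g}_k\right|^2 + \sigma^2 \left\|\mathbf{w}_j\right\|^2}. \qquad (13)$$

Two main terms should be calculated to obtain the user SEP. Starting with the conditional probability, which is calculated according to (10) and (13), we have:

$$P\!\left(\varepsilon_j \mid \mathbf{b}_i^j\right) \approx 2\, Q\!\left(\sqrt{\mathrm{SINR}_j\!\left(\mathbf{b}_i^j\right)}\right). \qquad (14)$$

The probability of the combination $\mathbf{b}_i^j$ is then readily calculated as:

$$P\!\left(\mathbf{b}_i^j\right) = \prod_{n=1}^{j-1} P\!\left(b_{i,n}^j \,\Big|\, \cap_{m=1}^{n-1} b_{i,m}^j\right), \qquad (15)$$

where $P\!\left(b_{i,n}^j \mid \cap_{m=1}^{n-1} b_{i,m}^j\right)$ is the probability that the $n$-th symbol is correctly decoded or not, i.e., $b_{i,n}^j = 0$ or $b_{i,n}^j = 1$, conditioned on the estimation of the $(n-1)$ previously decoded symbols. It is calculated as:

$$P\!\left(b_{i,n}^j \,\Big|\, \cap_{m=1}^{n-1} b_{i,m}^j\right) = \begin{cases} P\!\left(\varepsilon_n \mid b_{i,1}^j, \cdots, b_{i,n-1}^j\right), & b_{i,n}^j = 1, \\ 1 - P\!\left(\varepsilon_n \mid b_{i,1}^j, \cdots, b_{i,n-1}^j\right), & b_{i,n}^j = 0. \end{cases} \qquad (16)$$

For an uplink transmission, devices are restricted by a maximum transmission power, $p_U$, imposed by the regulation authorities and the equipment design restrictions. Therefore, an optimal centralized power allocation $\mathbf{p}_{opt}$ that minimizes the global average error probability can be obtained by solving the following problem:

$$(\mathrm{OP}_1): \quad \mathbf{p}_{opt} = \arg\min_{\mathbf{p}}\ \frac{1}{J} \sum_{j \in \mathcal{J}} P_{e_j} \quad \text{s.t.} \quad 0 \le p_j \le p_U,\ \forall j \in \mathcal{J}, \qquad (17)$$

where $\mathcal{J} = \{1, 2, \cdots, J\}$ is the set of active users. The derived expression of the user SEP is quite complicated to analyze theoretically with the Karush-Kuhn-Tucker (KKT) conditions. Therefore, we use an advanced optimization algorithm, i.e., particle swarm optimization [29], to solve the power allocation problem above. This algorithm is known to be efficient for complex problems [30].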
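The enumeration over error patterns used in the error-propagation analysis can be sketched as below. The sketch is deliberately generic: the conditional SEP is supplied by the caller as a hypothetical callback `cond_sep(n, b)`, so no particular SINR expression is assumed, and the $2^{j-1}$ patterns are enumerated by brute force.

```python
from itertools import product

def sep_with_error_propagation(j, cond_sep):
    """SEP of the j-th decoded user (1-based), averaged over all error patterns.

    cond_sep(n, b) returns the SEP of the n-th decoded user given the tuple b
    of error flags (0 = correct, 1 = wrong) of the n - 1 users decoded before.
    """
    total = 0.0
    for b in product((0, 1), repeat=j - 1):
        prob_b = 1.0
        for n in range(1, j):
            p_err = cond_sep(n, b[: n - 1])           # chain-rule factor
            prob_b *= p_err if b[n - 1] == 1 else 1.0 - p_err
        total += prob_b * cond_sep(j, b)              # total probability
    return total
```

The cost grows as $2^{j-1}$ evaluations for the $j$-th user, which is manageable for the moderate number of users considered here but would call for pruning of unlikely patterns in larger systems.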

Proposed Autonomous Power Decision Algorithm
Each user has to decide its transmission power autonomously with no information about the propagation environment and the interference. In this section, we propose an autonomous power decision algorithm for uplink communication. This allows each user to select an adequate power value close to the optimal one, $p_{opt}$, obtained by solving $\mathrm{OP}_1$.
The key idea is to perform an iterative algorithm that takes advantage of the natural base station acknowledgment (ACK). Each user gradually updates its transmitted power from the received ACK in order to converge toward the power level nearest to $p_{opt}$. For example, the $j$-th user initially transmits its data with a randomly selected power $p_j$ within the interval $[p_j^{\min}, p_j^{\max}]$, where $p_j^{\min}$ and $p_j^{\max}$ are respectively the initial minimum and maximum power values memorized in the $j$-th user equipment (UE). Then, the base station detects the user signal and compares its transmission power with $p_{j,opt}$, which the base station has computed on its own. An acknowledgment is sent back to each user to adjust its power. In order to minimize the signaling overhead, the acknowledgment is carried on two bits and can hence encode four possible states: (1) ACK = 3 if the user should simply transmit with its maximum authorized power $p_U$; this case may be beneficial for the cell edge users that experience bad propagation conditions; (2) ACK = 2 if $p_j > p_{j,opt}$; (3) ACK = 1 if $p_j < p_{j,opt}$; and (4) ACK = 0 if $p_j = p_{j,opt}$. Each user updates its interval by shifting the $p_j^{\min}$ and $p_j^{\max}$ values. After that, it picks another random value in the new power interval for the next packet transmission until it arrives at the appropriate power value. However, the channel conditions may change along the way. Hence, the algorithm must take this into consideration in order to ensure its convergence and the best performance. For that reason, the base station may sometimes send an extra bit, "Stat", to notify the user of this occurrence. In this case, the UE will try to re-initialize its power interval while taking advantage of the previously sent packets. This process is described in detail in Algorithm 1.
The channel should not change too fast in order to allow the convergence of the algorithm. However, as will be seen in the simulation results, the proposed algorithm converges to the near-optimal power value quite quickly. In addition, users' transmission powers must be known at the BS to perform the proposed algorithm. However, these power values are obviously needed in order to apply the SIC receiver properly. Therefore, a calibration phase between the BS and the UE should always be established.
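The ACK-driven interval update can be illustrated with the following sketch over a discrete grid of power levels. It is a simplified reading of the description above and not a transcription of Algorithm 1: the rule that triggers ACK = 3 (optimal level equal to the highest level), the one-step interval shifts, and the omission of the "Stat" bit are all assumptions made for this example.

```python
import random

def bs_ack(idx, opt_idx, n_levels):
    """Two-bit acknowledgment computed at the base station (assumed rules)."""
    if opt_idx == n_levels - 1:
        return 3                      # user should use its maximum power
    if idx > opt_idx:
        return 2                      # transmitted power too high
    if idx < opt_idx:
        return 1                      # transmitted power too low
    return 0                          # transmitted power is adequate

def ue_update(lo, hi, idx, ack, n_levels):
    """Shift the bounds of the level interval [lo, hi] after one packet."""
    if ack == 3:
        return n_levels - 1, n_levels - 1
    if ack == 2:
        return lo, idx - 1
    if ack == 1:
        return idx + 1, hi
    return idx, idx

def run_power_search(n_levels, opt_idx, rng, n_packets=30):
    """Return the sequence of level indices used over n_packets packets."""
    lo, hi = 0, n_levels - 1
    chosen = []
    for _ in range(n_packets):
        idx = rng.randint(lo, hi)     # random level inside the interval
        chosen.append(idx)
        ack = bs_ack(idx, opt_idx, n_levels)
        lo, hi = ue_update(lo, hi, idx, ack, n_levels)
    return chosen
```

Each packet either hits the target level or removes at least one level from the interval, so with $N$ levels this sketch settles within at most $N$ packets; the UE itself only stores two bounds and performs no computation beyond a random draw.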

Power Allocation with Multi-Armed Bandits
In this section, we revisit three known MAB algorithms, i.e., $\epsilon$-greedy, upper confidence bound (UCB1), and Thompson sampling (THS), which we apply to our autonomous power selection problem. An MAB is a model with $N$ resources, called arms, each of them being associated with a reward following a specific probability distribution. At each time slot $t$, each agent $j$ plays an arm $a_j$ according to its policy. Then, it receives the corresponding reward $r_j^t(a_j)$. Based on this and the number of times each arm has been played so far, $n_j^t(a_j)$, each agent chooses the appropriate arm for the next time slot $t + 1$, according to the calculated index that depends on each algorithm policy. Over time, these techniques will prioritize the arms showing the best performance and exclude the worst ones.
All MAB algorithms seek to maximize the cumulative reward of each agent over the time horizon $T$, i.e., $\sum_{t=1}^{T} r_j^t(a_j)$, and thereby to minimize its regret $R_j$, defined as the difference between the expected reward that would be obtained if the best arm were always played, i.e., $r_j^*$, and the rewards obtained using the chosen policy. The $j$-th user regret during a maximum period of $T$ slots is calculated as follows:

$$R_j = T\, r_j^* - \sum_{t=1}^{T} E\!\left[r_j^t(a_j)\right].$$

In our case, we consider a multi-agent system where the agent refers to the UE and the arms represent the power levels. At the $t$-th iteration, the successful transmission rate of the $j$-th user is defined as the ratio between the cumulative number of its correctly received packets during $t$ time slots and the total number of plays so far. The MAB algorithms are investigated in two different scenarios detailed hereafter.
(a) Scenario 1: The base station acknowledgment at the $t$-th iteration is carried on 1 bit representing the corresponding user reward, i.e., $r_j^t \in \{0, 1\}$. At each time slot $t$, $r_j^t(a_j) = 1$ if the packet of the $j$-th user is successfully decoded and $r_j^t(a_j) = 0$ otherwise. Therefore, the successful transmission rate of the $j$-th user at the $t$-th iteration is calculated as $Q_j^t = \frac{\sum_{i=1}^{t} r_j^i(a_j)}{t}$. In this scenario, the reward of each user only depends on the decoding status of its own packet without any consideration of the other users. However, the successful decoding of one packet depends on the successful decoding of the others, because of the SIC receiver. Hence, every user has an interest in a good power selection by the other users and not only by itself. Scenario 2, proposed hereafter, takes this fact into account.
(b) Scenario 2: The base station acknowledgment at the $t$-th iteration is now carried on two bits $\{b_{2,j}^t\, b_{1,j}^t\}$. The first bit indicates whether all users are correctly decoded, $b_{1,j}^t = 1$, or at least one packet is erroneously detected, $b_{1,j}^t = 0$. The second bit notifies each user whether its own packet is correctly received, $b_{2,j}^t = 1$, or not, $b_{2,j}^t = 0$. For a power $p_j$ picked by user $j$, there are three possible states for the $j$-th user acknowledgment: $\{b_{2,j}^t\, b_{1,j}^t\} \in \{11, 10, 00\} = \{3, 2, 0\}$. The case where $b_{2,j}^t\, b_{1,j}^t = 01$ is not possible because $b_{1,j}^t = 1$ means that all packets have been correctly decoded, including the $j$-th user packet, and hence, $b_{2,j}^t$ is automatically equal to one. In order to meet the conditions of the convergence theorems derived in [31], the rewards should be supported in $[0, 1]$. Therefore, user rewards are defined as a normalization of the associated acknowledgments, i.e., $r_j^t \in \{1, \frac{2}{3}, 0\}$. The successful transmission rate, at the $t$-th iteration, of the $j$-th user is then calculated based only on the second bit $b_{2,j}^t$, i.e., $Q_j^t = \frac{\sum_{i=1}^{t} b_{2,j}^i}{t}$. In this scenario, the inter-user dependence is involved in the associated rewards.
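The two reward definitions can be written compactly as follows. This is a small illustrative sketch; the function names and the boolean inputs `own_ok` (own packet decoded) and `all_ok` (all packets decoded) are assumptions introduced for the example.

```python
def scenario1_reward(own_ok):
    """One-bit feedback: 1 if the user's own packet was decoded, else 0."""
    return 1.0 if own_ok else 0.0

def scenario2_reward(own_ok, all_ok):
    """Two-bit feedback {b2 b1} read as an integer in {3, 2, 0}, scaled to [0, 1]."""
    b2 = 1 if own_ok else 0
    b1 = 1 if (own_ok and all_ok) else 0   # b1 = 1 is only possible when b2 = 1
    return (2 * b2 + b1) / 3

def success_rate(own_ok_history):
    """Successful transmission rate: decoded own packets over packets sent."""
    return sum(1 for ok in own_ok_history if ok) / len(own_ok_history)
```

In both scenarios the successful transmission rate is computed from the user's own decoding status only; Scenario 2 merely enriches the reward fed to the learning algorithm.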

UCB1
UCB1 was inspired by Agrawal's index-based policy [31]. This algorithm has a uniformly logarithmic regret over time. Generally, the UCB family of algorithms relies on a confidence interval on the average reward of each arm [32]. The UCB1 index combines two terms: the average reward and the exploration term. This index is an estimation of the upper bound of the true expectation of the arm reward. It is an upper bound because the square root term is an estimation of the variance of the expected return when playing the arm $a_j$. At time slot $t$, the index is defined as follows:

$$I_j^t(a_j) = \bar{r}_j^t(a_j) + \sqrt{\frac{\theta \ln t}{n_j^t(a_j)}}, \qquad (20)$$

where $\bar{r}_j^t(a_j)$ is the average reward obtained so far with arm $a_j$ and $\theta > 0$ is the exploration parameter. Originally, UCB1 was proposed with $\theta = 2$; however, the authors in [32] mentioned that $\theta = 0.5$ performs better empirically although $\theta > 0.5$ is strongly recommended for the theoretical analysis. At the initialization phase, UCB1 explores each arm once in order to have an estimation of the reward of each arm. Then, at each iteration, each user selects the arm with the highest index, as illustrated in Algorithm 2. The calculated index (20) ensures the balance between the exploration of the most uncertain arms and the exploitation of the best arm so far. UCB1 follows the principle of "optimism in the face of uncertainty", which means that a less visited arm is more uncertain, and thereby, it may optimistically be the best arm to play.

Algorithm 2, update step (a): $n_j^t(a_j) = n_j^{t-1}(a_j) + 1$
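A minimal single-agent sketch of the UCB1 policy described above is given below. The class and helper names are illustrative, the rewards in the helper are simulated Bernoulli draws rather than decoding outcomes, and the index uses the form average reward plus $\sqrt{\theta \ln t / n}$, which is the standard UCB1 expression with a tunable exploration parameter.

```python
import math
import random

class UCB1:
    def __init__(self, n_arms, theta=0.5):
        self.theta = theta
        self.counts = [0] * n_arms      # number of plays per arm
        self.means = [0.0] * n_arms     # running average reward per arm
        self.t = 0

    def select(self):
        self.t += 1
        for arm, n in enumerate(self.counts):
            if n == 0:                  # initialization: play each arm once
                return arm
        index = [m + math.sqrt(self.theta * math.log(self.t) / n)
                 for m, n in zip(self.means, self.counts)]
        return max(range(len(index)), key=index.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def simulate(agent, success_probs, horizon, rng):
    """Play `horizon` slots against Bernoulli arms; return the play counts."""
    for _ in range(horizon):
        arm = agent.select()
        agent.update(arm, 1.0 if rng.random() < success_probs[arm] else 0.0)
    return agent.counts
```

In the power decision setting, each arm would be one of the $N$ power levels and the reward would be the normalized acknowledgment of Scenario 1 or Scenario 2.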

$\epsilon$-Greedy
This algorithm deals with the exploration and exploitation dilemma randomly. At each iteration, each user either explores an arbitrary arm with probability $\epsilon$ or plays the best arm, i.e., the one with the highest average reward so far, with probability $1 - \epsilon$. However, for a constant exploration parameter $\epsilon$, the system regret grows linearly over time instead of being logarithmic. On the one hand, for a high value, i.e., $\epsilon \approx 1$, the user will continue to only explore random arms even after it has identified the best arm, and on the other hand, for a low value, i.e., $\epsilon \ll 1$, the algorithm will tend to exploit all the time even if it has not sufficiently explored the other arms. In both cases, an important performance loss will be experienced. Therefore, the $\epsilon$ value is a critical parameter. A revised version called $\epsilon$-decreasing greedy has been proposed, where the exploration probability decreases toward zero over time with a rate of $\frac{1}{t}$. This allows one to essentially explore at the beginning of the learning and mostly to exploit the best arm found so far after a certain amount of time. The new exploration probability is defined as [22,31]:

$$\epsilon(t) = \min\!\left(1, \frac{LN}{t}\right), \qquad (21)$$

where $L > 0$ is the exploration parameter. However, the main challenge of this policy is how to properly set the value of $L$. The $\epsilon$-decreasing greedy algorithm is described in detail in Algorithm 3.
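Algorithm 3: $\epsilon$-decreasing greedy algorithm.
Require: $L$ and $N$
for $t = 1 : T$ do
    for $j = 1 : J$ do
        Select a random arm with probability $\epsilon(t) = \min\left(1, \frac{LN}{t}\right)$
        Select with probability $1 - \epsilon(t)$ the best arm: the argmax of the average reward over the $N$ arms
        Update the following variables: the number of plays $n_j^t(a_j)$ and the average reward of the played arm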
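The $\epsilon$-decreasing greedy policy can be sketched for a single agent as follows. The class name and the tie-breaking rule (first arm with the highest average) are assumptions of this example; the exploration probability follows $\epsilon(t) = \min(1, LN/t)$.

```python
import random

def epsilon(t, L, n_arms):
    """Exploration probability at slot t (t starts at 1)."""
    return min(1.0, L * n_arms / t)

class EpsilonDecreasingGreedy:
    def __init__(self, n_arms, L=0.1, rng=None):
        self.n_arms = n_arms
        self.L = L
        self.rng = rng or random.Random()
        self.counts = [0] * n_arms      # number of plays per arm
        self.means = [0.0] * n_arms     # running average reward per arm
        self.t = 0

    def select(self):
        self.t += 1
        if self.rng.random() < epsilon(self.t, self.L, self.n_arms):
            return self.rng.randrange(self.n_arms)                 # explore
        return max(range(self.n_arms), key=self.means.__getitem__)  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

A small $L$ makes the policy almost purely greedy after the first few slots, whereas a large $L$ prolongs the exploration phase, which is why this parameter has to be tuned for the problem at hand.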

Thompson Sampling Algorithm
This approach shows a robust performance for stochastic problems and sometimes outperforms other MAB algorithms. The THS algorithm belongs to the Bayesian MAB family. The $j$-th user starts with a uniform prior beta distribution $\beta(\alpha_{j,k}, \gamma_{j,k})$ for all arms with initial values $\alpha_{j,k} = \gamma_{j,k} = 2$, $\forall j \in \{1, \cdots, J\}$ and $\forall k \in \{1, \cdots, N\}$, where $k$ refers to the arm index among the $N$ power levels. Then, inspired by the case where rewards follow a binomial distribution [33] and based on the observed reward, the parameters of the posterior beta distribution are updated such that $\alpha_{j,k} = \alpha_{j,k} + 3 r_j^t$ and $\gamma_{j,k} = \gamma_{j,k} + 3(1 - r_j^t)$. At the next time slot, each user draws a sampled index from the updated beta distribution for each arm, i.e., $i_{j,k} \sim \beta(\alpha_{j,k}, \gamma_{j,k})$, $\forall k \in \{1, \cdots, N\}$ and $\forall j \in \{1, \cdots, J\}$. The arm with the highest index, i.e., $\hat{i}_{j,k} = \max_{k \in \mathcal{N}} (i_{j,k})$, $\forall j \in \{1, \cdots, J\}$, is hence elected for this transmission attempt. Over time, Thompson sampling prioritizes the arm with the highest probability of being the optimal one and avoids other arms that have demonstrated poor performance so far. This algorithm is described in detail in Algorithm 4.

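Algorithm 4: Thompson sampling algorithm.
Require: $N$ and $\alpha_{j,k} = \gamma_{j,k} = 2$, $\forall k = 1 \ldots N$ and $\forall j = 1 \ldots J$
for $t = 1 : T$ do
    for $j = 1 : J$ do
        Select a sampled index from the beta distribution of each arm: $i_{j,k} \sim \beta(\alpha_{j,k}, \gamma_{j,k})$, $\forall k = 1, \cdots, N$
        Play the arm $a_j$ with the highest index: $\hat{i}_{j,q} = \max_{k \in \mathcal{N}} (i_{j,k})$
        Update the following variables: $\alpha_{j,q} = \alpha_{j,q} + 3 r_j^t$ and $\gamma_{j,q} = \gamma_{j,q} + 3(1 - r_j^t)$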
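A compact single-agent sketch of the Thompson sampling steps listed above is given below. The class name is illustrative; the initial parameters $\alpha = \gamma = 2$ and the update scaled by 3 follow the description in the text, and the sampling relies on `random.betavariate` from the Python standard library.

```python
import random

class ThompsonSampling:
    def __init__(self, n_arms, rng=None):
        self.rng = rng or random.Random()
        self.alpha = [2.0] * n_arms     # initial value taken from the text
        self.gamma = [2.0] * n_arms     # initial value taken from the text
        self.counts = [0] * n_arms      # number of plays per arm

    def select(self):
        # One posterior sample per arm; the largest sample wins
        samples = [self.rng.betavariate(a, g)
                   for a, g in zip(self.alpha, self.gamma)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # reward is expected in [0, 1], e.g. {0, 2/3, 1} in Scenario 2
        self.counts[arm] += 1
        self.alpha[arm] += 3 * reward
        self.gamma[arm] += 3 * (1 - reward)
```

With rewards in $\{1, \frac{2}{3}, 0\}$, the factor 3 makes each update add the integer acknowledgment value to $\alpha$ and its complement to $\gamma$.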

Complexity and Overhead Analysis
A quantitative comparison of all the examined techniques in the context of the mMTC scenario is summarized in Table 1. The random power selection and the centralized allocation are taken as reference scenarios. The centralized allocation is the reference in terms of performance, and the random selection is the simplest one.
The centralized power allocation algorithm computes, at the base station, the power to allocate to the users at each transmission attempt, based on the users' received SINRs. All the complexity is located at the base station, and users have to set their transmitting power at the values sent back from the BS; hence, the algorithm complexity at the user side is O(1). The signaling overhead of this scheme cannot be assessed precisely since it strongly depends on the downlink control information (DCI) format. However, the power computed is quantized over k bits, which would likely be much larger than one or two bits, for each user. Hence, for a large number of users, the signaling would be at least in O(J · k). Thus, it may be very expensive in terms of energy consumption, leading to a significant reduction of the battery lifetime.
The random power selection does not manifest any algorithmic complexity since the power selection is realized randomly. Therefore, the generated signaling overhead is minimal, i.e., 1 bit, as it only relies on the acknowledgment sent by the BS for each user's packet, whether it is successfully received or not. The proposed autonomous power decision algorithm is based on four acknowledgment levels, used to update the power at the user side, which can be carried with two bits. Moreover, one may add one additional bit if the BS detects a channel variation in order to notify the corresponding user of this event. The generated complexity is on the order of O(1) as no computation is required at the UE during this process.
All the MAB techniques have the same signaling overhead and algorithmic complexity for each transmission attempt. UCB1, $\epsilon$-decreasing greedy, and Thompson sampling can be seen as index-based policies. Hence, the algorithmic complexity consists of comparing $N$ indexes, representing the rating of the arms w.r.t. the objective of the agent, and taking the arm that corresponds to the highest index. Therefore, their complexity is on the order of $O(N)$. Furthermore, the generated signaling overhead depends particularly on the applied learning scenario. In Scenario 1, the index update by an agent is only based on the processing output of its own packet using a given power, i.e., either the packet is successfully received or not, and hence, it takes 1 bit. In Scenario 2, the update of an agent index is made by taking into account the decoding status of the other users' transmissions, in addition to that of its own packet, which is carried with two bits. It is worth noting that the computational complexity of the index calculation itself is not considered here. In particular, the complexity of drawing a sampled index from the beta distribution for each arm with the Thompson sampling algorithm is higher than that of the UCB1 and $\epsilon$-decreasing greedy indexes.

Numerical Results and Analysis
We consider an uplink system with 150% overload, where $J = 12$ and $K = 8$. Users are uniformly scattered in the cell while experiencing an AWGN channel with different path losses. Each user can pick its transmission power from a set of $N = 10$ possible power levels in order to select the appropriate value ensuring the best performance in both Scenarios 1 and 2. The user spreading sequences are normalized to unit energy. The algorithms are investigated in terms of the successful transmission rate, i.e., the total number of correctly decoded packets over the total number of sent packets. Simulations are averaged over 150 network realizations, i.e., the successful transmission rate is averaged over the path losses and the spreading sequences. Regarding the UCB1 algorithm, the exploration of new power values is governed by the parameter $\theta$. As mentioned above, this parameter was originally set to two, but in the literature, $\theta = 0.5$ is accepted empirically as it provides better performance. In order to choose the optimal value of $\theta$, the average transmission rate achieved by UCB1 was investigated w.r.t. $\theta$, and the value $\theta = 0.5$ is the one that achieves the best transmission rate. The figure is not reported here so as not to clutter the exposition. The other simulation parameters are given in Table 2.

Table 2. Simulation parameters.
Parameter | Value
Channel | AWGN with path losses
Users | J = 12
Maximum individual power | 20 dBm
N | 10 levels
Noise power | σ² = −14 dBm
T | 1000 slots
θ | 0.5

Figure 2 compares the simulated average BER, i.e., averaged over the spreading sequences and positions, and the analytical average BEP obtained by the proposed expression in (9) for an AWGN channel and uniformly distributed users over the cell w.r.t. the global received SNR. We remark that the expression that takes into account the error propagation phenomenon almost matches the simulated BER. However, removing the error propagation effect induces a wide gap in the performance because it is too optimistic. In addition, we notice that, for high SNR values, the BEP with EP gets closer to the simulated BER. This can be explained by the fact that the QPSK approximation in (10) is more robust for high SNR.
The performance of the ε-decreasing greedy algorithm depends on the ε value, which in turn depends on the coefficient L. It is important to choose the coefficient that allows the algorithm to achieve its best performance. Therefore, the main challenge of the ε-decreasing greedy approach is to handle the exploration and exploitation dilemma by properly setting the value of L in (21). Figure 3 investigates the performance of this algorithm for different L in Scenario 1 after T = 1000 iterations. We note that L = 0.1 gives the best performance in terms of the average transmission rate and hence it is kept for the rest of the simulations. The same behavior is observed in Scenario 2, but not reported here to limit the redundancy.
Figures 4 and 5 compare the successful transmission rate of the algorithms under study, i.e., the centralized power allocation, the proposed algorithm, the MAB algorithms (ε-decreasing greedy, UCB1, and THS), and the random power selection in Scenarios 1 and 2, respectively. The proposed algorithm outperforms all the MAB techniques with a faster convergence to the optimal power in both scenarios.
We also remark in Figure 4 that the ε-decreasing greedy algorithm converges faster than the THS and UCB1 algorithms. This can be explained by the optimal selection of the L value, which ensures a trade-off between the exploration and the exploitation phases in order to achieve the best performance. The ε-decreasing greedy and THS algorithms converge to the same successful transmission rate after 400 iterations. However, the gap between ε-decreasing greedy and THS is smaller in Scenario 2 in Figure 5. In fact, after T = 100 iterations, THS is slightly better than ε-decreasing greedy. THS seems to take advantage of the additional information carried by the feedback, i.e., whether there is a decoding error among the users or not. However, both algorithms, i.e., ε-decreasing greedy and THS, are far better than UCB1 in both scenarios. UCB1 takes more time to explore suboptimal powers, which slows down its convergence to the optimal power values and thereby induces more packet losses. The random power allocation gives the lowest performance in both scenarios since no strategy is applied for an adequate power selection, which induces error propagation and hence packet losses.
For a given number of iterations T, i.e., the number of packets sent, the figures show the successful transmission rate averaged over the 150 network realizations and spreading sequences. The performance achieved by the algorithms under fast variations of the propagation environment is directly obtained from [...]. A fading channel could also have been considered; however, this would only affect the absolute performance, as the statistics of the rewards would have changed, but not the relative behaviors of the algorithms. Therefore, in this paper and for the sake of simplicity, we consider only an AWGN channel with different path losses among users, and we show the behavior of the investigated techniques as the number of iterations increases, averaged over several network realizations.
Figure 6 shows the performance comparison of all algorithms in Scenarios 1 and 2 for 30 ≤ T ≤ 300. One can remark that all MAB techniques achieve better performance in Scenario 2 than in Scenario 1. For instance, after T = 50 iterations, the Thompson sampling algorithm achieves a successful transmission rate of ≈0.94 in Scenario 2, whereas, in Scenario 1, it attains ≈0.91. This may be explained by the fact that Scenario 2 conveys more information than Scenario 1 for selecting the best set of powers. In other words, the reward a user gets in Scenario 2 is not only a function of the successful decoding of its own packet, but also of whether all other users succeeded in their transmissions or not. This strategy allows each user to take into account a kind of global interest in the selection of its power. In addition, the successful transmission rate achieved with the proposed algorithm converges to the one obtained with the optimal centralized solution after only a few iterations, much faster than the MAB techniques.
For example, after T = 30 iterations, the proposed algorithm achieves a rate of 0.99 of correctly received packets, whereas the ε-decreasing greedy has a rate of 0.93. It should be noted that, after a large number of iterations, the performances of the MAB algorithms in Scenario 1 converge to those in Scenario 2.
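The difference between the two feedback scenarios can be illustrated with a minimal sketch. The function names are hypothetical, and the way the two bits of Scenario 2 are combined into a single reward is an assumption made purely for illustration (a plain average); the mapping actually used in the simulations may differ.

```python
def scenario1_feedback(decoded, user):
    """Scenario 1: one bit, the decoding status of the user's own packet."""
    return (int(decoded[user]),)


def scenario2_feedback(decoded, user):
    """Scenario 2: two bits, the user's own status and whether all the
    other users were decoded successfully."""
    own = int(decoded[user])
    others_ok = int(all(ok for j, ok in enumerate(decoded) if j != user))
    return (own, others_ok)


def reward(feedback):
    """ASSUMPTION: average of the feedback bits, for illustration only."""
    return sum(feedback) / len(feedback)


# Three users; the packet of user 1 is lost.
decoded = [True, False, True]
print(scenario1_feedback(decoded, 0), scenario2_feedback(decoded, 0))
```

In this example, user 0 sees a full reward in Scenario 1 although another user failed, whereas in Scenario 2 its reward is reduced, which reflects the global interest mentioned above.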

Conclusions
The autonomous power decision for NOMA schemes with a grant free access strategy remains an open issue in satisfying the mMTC requirements. To the best of our knowledge, no work has been done on this problem for the MUSA scheme in order to enhance user performance with a minimal signaling overhead. In this paper, we address this issue by proposing a novel algorithm for autonomous power decision based on the proposed BEP approximation and the base station acknowledgments. Moreover, we study the efficiency of some MAB algorithms for the power allocation with two different implementation scenarios, i.e., one where the rewards of a user depend only on the decoding status of its own packet and another one where they also depend on whether all users have successfully transmitted their packets or not. The proposed algorithm converges very fast to the solution obtained with a centralized resource allocation, which is considered as a baseline. The MAB algorithms have an acceptable performance, but at the cost of a longer convergence time and a higher UE complexity compared to the proposed algorithm. The latter shows the best performance with a faster convergence rate, but also with a slightly higher signaling overhead compared to the MAB algorithms, particularly for a time-varying propagation environment.