D2D Mobile Relaying Meets NOMA—Part I: A Biform Game Analysis

Structureless communications such as Device-to-Device (D2D) relaying are of paramount importance to improving the performance of today’s mobile networks. Such a communication paradigm requires a certain level of intelligence at the device level, allowing devices to interact with the environment and make proper decisions. However, decentralized decision making may sometimes induce paradoxical outcomes that result in a performance drop, which motivates the design of self-organizing yet efficient systems. Here, each device decides either to connect directly to the eNodeB or to get access via another device through a D2D link. Given the set of active devices and the channel model, we derive the outage probability for both the cellular link and the D2D link, and compute the system throughput. We capture the device behavior from a biform game perspective. In the first part of this article, we analyze the pure and mixed Nash equilibria of the induced game, where each device seeks to maximize its own throughput. Our framework allows us to analyze and predict the system’s performance. The second part of this article is devoted to implementing two Reinforcement Learning (RL) algorithms that enable devices to self-organize and learn their equilibrium pure/mixed strategies in a fully distributed fashion. Simulation results show that offloading the network by means of D2D relaying improves the per-device throughput. Moreover, a detailed analysis of how the network parameters affect the global performance is provided.


Motivations & New Trends
The last twenty years have seen a noteworthy growth in the demand for network capacity. This was mainly caused by the unprecedented Internet build-out and the huge traffic generated by a massive number of devices. To cope with this unprecedented demand, substantial research effort is being conducted to enhance the performance of next-generation mobile networks. The current fifth Generation (5G) of wireless networks addresses a wider range of applications and many innovative use cases [1]. It is expected that a single 5G tower will serve up to 1 million devices per km², which generates massive data traffic in cellular networks. The sixth Generation (6G) of wireless networks is foreseen to support novel data-hungry applications, a plethora of autonomous services and new communication scenarios around 2030. These technologies encompass holographic videos, flying networks and vehicles, teleoperated driving, telemedicine, haptics, human bond communications, brain-computer interfaces, connected autonomous systems, high-definition video streaming and the tactile Internet, to name a few. Thus, the volume of wireless data traffic is expected to keep growing. In this work, we analyze the network by means of game theory. In other words, a network designer (e.g., the network builder or the operator) analyzes the performance of the network and predicts its operating points using game theory before rolling it out. Game theory provides strong tools and a theoretical framework to analyze the interaction between agents/devices, which allows us to accurately predict the system performance while considering self-organized devices. Henceforth, the network designer can build efficient mechanisms that let the whole system run properly under an almost zero-touch paradigm.

Our Contributions
To allow User Equipment (UE) to communicate and connect to the Base Station (BS), a multiple access technology is used. Multiple access techniques can broadly be categorized into two approaches, namely Orthogonal Multiple Access (OMA) and Non-Orthogonal Multiple Access (NOMA). On the one hand, OMA lets UEs use orthogonal signals to eliminate interference, as in the Orthogonal Frequency-Division Multiple Access (OFDMA) used in 4G mobile networks. On the other hand, NOMA is envisioned as a candidate radio access technology for beyond-5G and 6G cellular systems. It allocates one frequency channel to multiple users at the same time within the same cell, either in the power domain or in the code domain. Moreover, NOMA offers a number of advantages, including improved spectral efficiency, enhanced resource allocation, higher cell-edge throughput, and lower latency (no scheduling request from users to the base station is required) [9][10][11].
In a nutshell, we use game theory in the first part of this article to analyze and solve the conflict of interest raised between self-organized devices. The individual average throughput is considered as the payoff function. More precisely, we build a biform game, for which we analyze the pure/mixed Nash equilibria. The second part of our work [12] presents two distributed reinforcement learning algorithms to be implemented at the device level in order to reach equilibrium strategies. Our mechanism is robust, as it is based on the Nash equilibrium concept, and it reduces the risk of bad decisions, thereby providing the desired self-organizing and self-configuring features.
The main contributions of this work are fivefold. Part I's contributions relate to the performance analysis of a self-organizing D2D relaying scheme:
1. We consider a hybrid two-tier scheme where cellular links use NOMA, whilst D2D links use OMA. This scheme is suitable for both inband and outband D2D schemes;
2. We fully characterize the Rayleigh channel model and derive closed forms for the outage probability of both OMA and NOMA links, and then compute the average throughput perceived by each device in the network;
3. To the best of our knowledge, this work is the first to implement a biform game to capture the devices' behaviors while deciding which Radio Access Network (RAN) to connect to. In order to evaluate the outcome of the game, a detailed analysis of pure and mixed Nash equilibria is provided for the 3-person game and generalized to the n-person game.
Part II's contributions relate to implementing a self-organized mode selection using RL:
4. We propose to empower devices with a self-organizing capability that allows them to reach pure Nash equilibria (Linear-Reward Inaction) and mixed Nash equilibria (Boltzmann-Gibbs dynamics) in a fully distributed manner;
5. We perform extensive simulations to analyze the effect of different parameters on the learning schemes. Insights on accuracy and convergence are also provided.
The rest of this article is organized as follows. A comprehensive literature review is presented in Section 2. The problem is formulated in Section 3. We provide a full equilibrium analysis for the 3-person game in Section 4. The general case of the n-player game is discussed in Section 5. Numerical investigations are presented in Section 6. Finally, we draw some concluding remarks and list future works in Section 7. Part II [12] of this research presents the proposed decentralized reinforcement learning algorithms and assesses their dynamics and performance.

Related Work
D2D communications are widely used to relay information and improve local/overall performance by offloading traffic to other devices in the network, extending system coverage, and mitigating wireless fading through an improved capture effect and the exploitation of spatial diversity. Also, reducing the transmit power lowers the impact of cross-interference, which helps to improve the network performance and enhance the QoS (improved throughput, reduced latency, and increased reliability) [13][14][15]. In recent research, and in most published D2D papers, great attention is given to enhancing the performance of other technologies by introducing D2D communication (e.g., IoT [13] and Massive MIMO systems [15]). In our article, we discuss the importance of strategically selecting the best RAN (i.e., either cellular or D2D) according to the network status, so as to improve the QoS experienced by the devices. Game theory is a set of applied mathematical tools aiming to understand and solve decision-making problems involving competing and independent actors in conflict. It has been extensively used in wireless networks [16,17], and more specifically in solving cooperation and competition problems between devices over limited resources [18][19][20][21][22][23][24][25][26].
In the last few years, a tremendous research effort has been conducted in order to adapt and adopt self-organized networks. Self-organizing resource management approaches have attracted attention because of their low complexity, their scalability and their important role in reducing information exchange [27]. They have been investigated for various networks from different perspectives, including learning mechanisms, heuristics and game-theoretic approaches [28][29][30]. The authors in [28] propose a distributed utility-based SINR adaptation at small cells that diminishes the cross-tier interference. The authors in [29] compare two decentralized heuristic algorithms, with no involvement of any centralized entity, for joint power assignment and resource allocation in small cells. In [30], the authors present an energy-efficient self-organized cross-layer optimization scheme where each D2D transmitter strategically selects the resource blocks and the power levels to improve its energy efficiency while maintaining a certain QoS requirement for the other tiers. However, the autonomy and self-organization of collaborative networks of devices make them especially vulnerable to attacks. Thus, such a network needs a dependable mechanism to detect and identify attackers and enable appropriate reactions. That is why the authors in [31] propose a scalable adversary detection scheme for autonomous networks, which efficiently identifies malicious devices within large networks of collaborating entities. It is designed to run in truly autonomous environments, i.e., without a central trusted entity. Unlike the related works on D2D mode selection summarized in Table 1, where the authors focus on optimizing the network performance, we aim to model and understand the interplay between D2D, NOMA and OMA. Then, we use biform game theory to predict, and decentralized machine learning schemes to learn, which options each device should pursue to earn the "best" long-term average profit.

System Model
Consider the uplink case of a single 4G/5G/6G cell, where a finite set of devices $N = \{1, 2, \ldots, n\}$ is randomly distributed around the serving BS. The devices communicate using NOMA in cellular links combined with conventional OMA for D2D links, as shown in Figure 1. We use a separate band for D2D users (i.e., D2D overlay mode). We use OMA for D2D links to (1) study a hybrid access system; and (2) eliminate the interference between cellular and D2D UEs. Here, we use stochastic geometry to estimate the performance of D2D users. Each device $i \in N$ transmits its data to the BS using power $P_i$ from a distance $d_i$ while experiencing a channel gain $h_i$. For better readability, the main notations and symbols used in this article are listed in Table 2.
Table 2 (excerpt). Main notations.
- $d_{i,d}$: distance between device $i$ and another D2D device
- $f$: orthogonality factor
- $\alpha_c$, $\alpha_d$: path-loss exponent in cellular and D2D, respectively
- $U_i(a_i)$: utility of device $i$, i.e., its throughput when choosing the action $a_i$

For the sake of simplicity and without loss of generality, device 1 is the closest device to the BS, with distance $d_1$. It transmits with the lowest power $P_1$ and experiences the strongest channel $h_1$. Device $n$ is the farthest, at distance $d_n$ from the BS; it uses the highest transmission power $P_n$ and experiences the poorest channel $h_n$. Namely, we have $|h_1|^2 \ge |h_2|^2 \ge \cdots \ge |h_{n-1}|^2 \ge |h_n|^2$. Let $w(t)$ be the received noise at the BS and assume each device $i$ transmits its individual signal $s_i(t)$. Then, the aggregate received signal at the BS is the superposition of the $n$ transmitted signals, each weighted by its transmit power, channel coefficient and path loss, plus the noise $w(t)$. The BS decodes the signals by applying the Successive Interference Cancellation (SIC) technique [41,42]. The received signal of the user with the strongest channel is likely the strongest at the BS and is therefore decoded first, so it experiences interference from all the remaining users with weaker channels in the cluster. Hence, the transmission of device 1 experiences interference from all the other devices, whereas the transmission of device $n$ experiences zero interference. In contrast to downlink NOMA, each user in uplink NOMA can independently use its battery power up to the maximum, since the channel gains of all the users are sufficiently distinct [43].

Channel Model
Within this article, the radio signal experiences attenuation due to the path loss with exponent $\alpha$ and Rayleigh fading. We denote by $\gamma_i$ the instantaneous Signal-to-Interference-and-Noise Ratio (SINR) of device $i$, which, following the SIC decoding order, is given by:
$$\gamma_i = \frac{P_i |h_i|^2 d_i^{-\alpha}}{\sum_{j=i+1}^{n} P_j |h_j|^2 d_j^{-\alpha} + \sigma_N^2}.$$
It is worth noting that the weakest device $n$ experiences no interference according to the NOMA operation, i.e., $\gamma_n = P_n |h_n|^2 d_n^{-\alpha} / \sigma_N^2$, where $\sigma_N^2$ denotes the variance of the thermal additive white Gaussian noise. Throughout this article, each device aims at guaranteeing an instantaneous SINR above a certain threshold $\gamma_{i,th}$ to have a successful communication. The outage probability is the probability that the SINR is less than or equal to the SINR threshold $\gamma_{i,th}$:
$$P^{out}_i(\gamma_i) = \Pr\left(\gamma_i \le \gamma_{i,th}\right).$$
Assuming that all channels undergo Rayleigh fading, the channel power gain $|h|^2$ is an exponential random variable with PDF $f_{|h|^2}(x, \lambda) = \lambda e^{-\lambda x}$, where $1/\lambda > 0$ is the mean (scale parameter) of the distribution, often taken equal to 1. Therefore, the outage probability can be expressed in closed form; for the interference-free device $n$, for instance, $P^{out}_n(\gamma_n) = 1 - \exp\left(-\lambda \gamma_{n,th} \sigma_N^2 d_n^{\alpha} / P_n\right)$.
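The outage computation described above can be sketched numerically. The snippet below is a minimal illustration, not the article's own closed form: it assumes independent exponential power gains with a common rate `lam`, a fixed SIC decoding order (list order, strongest first), and illustrative parameter values. Under these assumptions a standard closed form exists, and a Monte Carlo estimate is included as a sanity check. All function names are hypothetical.

```python
import math
import random


def rx_power(p_tx, dist, alpha):
    """Mean received power P * d^(-alpha), fading excluded."""
    return p_tx * dist ** (-alpha)


def outage_closed_form(i, p_tx, dist, alpha, noise, gamma_th, lam=1.0):
    """Outage probability of device i (0-based) under a fixed SIC order.

    Assumption: devices decoded after i (indices j > i) interfere with i,
    and all power gains are i.i.d. exponential with rate lam.
    """
    c_i = rx_power(p_tx[i], dist[i], alpha)
    # Noise term of the success probability.
    success = math.exp(-lam * gamma_th * noise / c_i)
    # Each exponential interferer contributes one rational factor.
    for j in range(i + 1, len(p_tx)):
        c_j = rx_power(p_tx[j], dist[j], alpha)
        success /= 1.0 + gamma_th * c_j / c_i
    return 1.0 - success


def outage_monte_carlo(i, p_tx, dist, alpha, noise, gamma_th, lam=1.0,
                       trials=100_000, seed=0):
    """Empirical outage frequency, using the same assumptions."""
    rng = random.Random(seed)
    n = len(p_tx)
    failures = 0
    for _ in range(trials):
        gains = [rng.expovariate(lam) for _ in range(n)]
        signal = rx_power(p_tx[i], dist[i], alpha) * gains[i]
        interference = sum(
            rx_power(p_tx[j], dist[j], alpha) * gains[j]
            for j in range(i + 1, n)
        )
        if signal / (interference + noise) <= gamma_th:
            failures += 1
    return failures / trials


if __name__ == "__main__":
    # Illustrative (normalized) values, not the article's setup.
    powers, dists = [1.0, 1.0, 1.0], [1.0, 1.5, 2.0]
    for dev in range(3):
        cf = outage_closed_form(dev, powers, dists, 3.0, 0.01, 0.5)
        mc = outage_monte_carlo(dev, powers, dists, 3.0, 0.01, 0.5)
        print(f"device {dev + 1}: closed form {cf:.4f}, Monte Carlo {mc:.4f}")
```

For the last decoded device the product over interferers is empty, so the expression reduces to the interference-free case discussed in the text.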

Average Throughput
In general, device $i$ transmits data with a rate $R_i$ in every channel use (i.e., in every packet or frame transmission), provided that $R_i$ does not exceed its channel capacity, i.e., $R_i \le \log(1 + \gamma_i)$. We define the throughput of the transmission as the rate of data bits that are successfully delivered to the destination over a communication channel. As the channel is variable, random and unknown, the throughput of device $i$ is a function of the outage probability $P^{out}_i(\gamma_i)$, which depends on the average channel gain:
$$Thp_i = \rho_i \left(1 - P^{out}_i(\gamma_i)\right), \quad \text{with } \rho_i = \frac{M}{L} R_i.$$
Here, $M$ is the data length, $H$ is the length of the header, and $L = M + H$ denotes the total number of bits in a frame.
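As a small worked example of the throughput definition, the sketch below assumes the usual goodput form, i.e., the effective rate $(M/L) R_i$ scaled by the success probability $1 - P^{out}_i$. The function names are hypothetical, and the numbers reuse the rate and frame size quoted later in the simulation setup ($R = 1$ Mbit/s, $L = M = 1024$ bits).

```python
def effective_rate(rate, payload_bits, header_bits):
    """rho = (M / L) * R with L = M + H (payload plus header)."""
    frame_bits = payload_bits + header_bits
    return payload_bits / frame_bits * rate


def average_throughput(rate, payload_bits, header_bits, p_out):
    """Assumed goodput form: rho * (1 - outage probability)."""
    if not 0.0 <= p_out <= 1.0:
        raise ValueError("p_out must be a probability")
    return effective_rate(rate, payload_bits, header_bits) * (1.0 - p_out)


if __name__ == "__main__":
    # R = 1 Mbit/s, M = 1024 bits, no header, 25% outage.
    print(average_throughput(1e6, 1024, 0, 0.25))
```

With no header, the effective rate equals the raw rate, so the throughput is simply the rate discounted by the outage probability.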

Biform Game Analysis
The main goal of game theory is to study the strategic relations between rational players that strive to maximize their payoffs, where the actions and choices of all the players affect the outcome of each player. In this work, the devices inside the cell decide either to communicate through the cellular link or to switch to D2D communication. Each device aims at making the decision that maximizes its throughput. However, since each device's decision influences the throughput of the other devices, we are concerned here with finding an equilibrium point and predicting which options the players may take to earn the best profit. For this purpose, we use biform game theory. A biform game is a two-stage game that combines a competitive and a cooperative game in one formal model. In the first stage, the players choose their strategies in a non-cooperative way to maximize their expected payoffs. Each profile of strategic choices at the first stage leads to the second stage, which is a cooperative game, where the actual payoff is realized. The second stage thus reflects the competitive environment created by the choices of the players in the first stage [44,45].
In the game $G$, $N$ is the set of players. $A_i$ is the set of actions of each player $i$: either to be a relay ($a_i = 0$) or to communicate through D2D ($a_i = 1$). $U_i$ is the payoff of each device $i$, which represents its throughput. There are two ways of modeling the problem:
- The first case is to consider the game from the perspective of one of the players, and to determine the action that each player needs to take to maximize its throughput, depending on the network parameters and on the other players' probabilities of relaying.
- The second case is to consider the problem from an equilibrium perspective. Here, we seek the equilibrium probability vector from which no player has an incentive to deviate unilaterally. In this case also, each player attains its maximum utility at the equilibrium, depending on its own strategy and the strategies of the other players.

Equilibrium Analysis for the Three-Player Game
Consider a three-device power-domain NOMA operation in a single-cell network. Each device communicates through the uplink, as shown in Figure 2. Each device $i \in \{1, 2, 3\}$ transmits its data to the BS with a power $P_i$, from a distance $d_i$, and with $h_i$ as the channel coefficient between device $i$ and the BS.

Channel Model
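Let us consider device 1 as the closest to the BS, with the lowest transmit power $P_1$, the smallest distance $d_1$ and the best channel condition $h_1$. Device 3 is the farthest from the BS, at distance $d_3$, with the highest transmit power $P_3$ and the weakest channel gain $h_3$.
Device 1 is thus the strongest device, experiencing the strongest channel, while device 3 is the weakest. According to the conventional uplink NOMA operation, the BS successively decodes and cancels the signal of device 1, which experiences interference from the two other devices, then the signal of device 2, which is affected only by the interference of device 3, and finally the signal of device 3, which experiences zero interference. Each device's SINR is then expressed as:
$$\gamma_1 = \frac{P_1 |h_1|^2 d_1^{-\alpha}}{P_2 |h_2|^2 d_2^{-\alpha} + P_3 |h_3|^2 d_3^{-\alpha} + \sigma_N^2}, \quad \gamma_2 = \frac{P_2 |h_2|^2 d_2^{-\alpha}}{P_3 |h_3|^2 d_3^{-\alpha} + \sigma_N^2}, \quad \gamma_3 = \frac{P_3 |h_3|^2 d_3^{-\alpha}}{\sigma_N^2}.$$
The outage probability of each device $i$, denoted $P^{out,c}_i(\gamma_i)$ (Equation (7)), follows by applying the definition $\Pr(\gamma_i \le \gamma_{i,th})$ to these SINRs under Rayleigh fading.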
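The SIC ordering described above can be illustrated with a short numerical sketch. It evaluates the three SINRs for one fixed channel realization, using the powers, distances, channel gains, path-loss exponent and noise level quoted in the article's simulation setup. The helper names `dbm_to_watt` and `sic_sinrs` are hypothetical, and the snippet is only an illustration of the decoding order, not the authors' simulator.

```python
def dbm_to_watt(dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((dbm - 30) / 10)


def sic_sinrs(p_tx, gains, dist, alpha, noise):
    """SINR of each device when the BS decodes them in list order.

    Device i is interfered by every device decoded after it (j > i);
    the last device in the list sees noise only.
    """
    rx = [p * g * d ** (-alpha) for p, g, d in zip(p_tx, gains, dist)]
    sinrs = []
    for i in range(len(rx)):
        interference = sum(rx[i + 1:])
        sinrs.append(rx[i] / (interference + noise))
    return sinrs


if __name__ == "__main__":
    sinrs = sic_sinrs(
        p_tx=[10e-3, 30e-3, 50e-3],   # W
        gains=[0.6, 0.5, 0.2],        # |h_i|^2
        dist=[100.0, 300.0, 500.0],   # m
        alpha=3.0,
        noise=dbm_to_watt(-116),      # W
    )
    for i, s in enumerate(sinrs, start=1):
        print(f"gamma_{i} = {s:.3e}")
```

Ordering the input lists from strongest to weakest device is what encodes the decoding order here; a different order would change which devices count as interferers.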

Throughput
At each time slot, each device can choose either to communicate through cellular and serve as a relay, or to communicate through D2D. D2D links use OMA as the multiple access method. Also, the D2D transmitters operate in overlay mode, where D2D and cellular devices are allocated distinct frequency resources, which suppresses the interference between cellular and D2D devices. Depending on the devices' choices, each device $i$ earns a throughput and experiences an outage probability as follows:
- If all the devices communicate through cellular mode, then the throughput of each device is:
$$Thp^{c}_i = \rho \left(1 - P^{out,c}_i(\gamma_i)\right), \quad \rho = \frac{M}{L} R.$$
We suppose that the BS allocates the same transmit rate $R$ to all devices. For each device $i$, $P^{out,c}_i(\gamma_i)$ is defined in Equation (7).
- If device $i$ decides to be a relay while devices $j$ and $k$ transmit through D2D, with $i, j, k \in \{1, 2, 3\}$, the throughput is split between the relay and the D2D group. If there is at least one device in the D2D group, then the relay device allocates a fraction of its throughput $x_i$ to that group. $x_i$ also defines the mode selection of device $i$. For instance, $x_i = 1$ means device $i$ fully opts for cellular mode, whereas $x_i = 0$ means device $i$ chooses to communicate through a D2D link. When $x_i \in (0, 1)$, device $i$ plays the role of a relay. Here, we assume that the fraction given by the relay is equally divided between the devices in D2D mode. $P_{j,d}$ and $d_{j,d}$ are the transmit power and the distance of the D2D device $j$, respectively. The transmit power in cellular communication is much higher than the D2D transmit power because of the short distances between D2D devices in comparison with the distances between a device and its serving BS.
Theoretically, if time and frequency are perfectly synchronized, there is no interference and the sub-carriers are orthogonal. However, in real networks, although frequency synchronization can be performed with a certain accuracy, small frequency synchronization errors can still cause significant interference among different users. $f_{j,k}$ is the orthogonality factor between device $j$ and device $k$.
- If devices $i$ and $j$ decide to act as relays while device $k$ transmits through a D2D link, device $i$ is considered the strongest ($d_i \le d_j$), and both relays share a fraction of their throughput with device $k$.
- If all devices decide to switch to D2D communication, each device earns a regret for being disconnected from the network, and the throughput is given by $Thp_i = -r_i$.

Biform Game Analysis
Consider a two-stage decision problem with three devices. The profit of each player $i \in \{1, 2, 3\}$ is its throughput, as presented in Figure 3. Recall that at each transmission, each device has the choice of staying connected to the BS or switching to D2D communication. A device may switch to the D2D side and go back to the cellular side whenever it wants; this is a random and reversible process. The players decide to cooperate and choose whether to be connected to the cellular or the D2D link to improve their throughput. If a device stays connected to the cellular link and there is at least one device on the D2D side, the cellular device must serve as a relay for the D2D devices.
There are $2^3 = 8$ different cooperation combinations between the three devices, as shown in Figure 4. Depending on the combination, the devices earn different throughputs:
- If all the devices decide to stay connected to the cellular link, each of them earns $Thp^{c}_i$ as throughput.
- If at least one player $j$ (but not all) switches to D2D mode, it earns $Thp^{d}_j$, while each device $i$ that stays connected to the BS earns $Thp^{c,d}_i$.
- If all the devices decide to switch to D2D communication, each of them earns $-r_i$, which represents the regret of being disconnected from the network.
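The payoff cases listed above can be sketched as a small lookup over the eight first-stage profiles. This is a simplified illustration: it assumes one constant value per device for each of $Thp^{c}_i$, $Thp^{c,d}_i$ and $Thp^{d}_i$, whereas in the full model these quantities depend on which devices sit in which group. The numerical values are hypothetical placeholders, and the convention is 0 for cellular/relay and 1 for D2D.

```python
from itertools import product


def payoff(i, profile, thp_c, thp_cd, thp_d, regret):
    """Second-stage profit of device i for a first-stage profile.

    profile[k] is 0 (cellular, relaying if needed) or 1 (D2D).
    """
    others_in_d2d = sum(a for k, a in enumerate(profile) if k != i)
    n_others = len(profile) - 1
    if profile[i] == 0:
        # Cellular: plain NOMA throughput, or relay throughput if
        # at least one other device uses D2D.
        return thp_c[i] if others_in_d2d == 0 else thp_cd[i]
    # D2D: served by a relay unless every device left the cell.
    return -regret[i] if others_in_d2d == n_others else thp_d[i]


if __name__ == "__main__":
    # Hypothetical normalized throughputs, for illustration only.
    thp_c = [0.6, 0.4, 0.2]
    thp_cd = [0.7, 0.45, 0.15]
    thp_d = [0.5, 0.5, 0.5]
    regret = [0.1, 0.1, 0.1]
    for profile in product((0, 1), repeat=3):
        row = [payoff(i, profile, thp_c, thp_cd, thp_d, regret)
               for i in range(3)]
        print(profile, row)
```

The loop prints one payoff vector per profile, which mirrors the payoff tree that a figure such as Figure 4 would display.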
As mentioned before, a biform game consists of two stages.
First Stage: This stage is a non-cooperative game. The decision of player $i \in \{1, 2, 3\}$ is either to communicate through the cellular link and serve as a relay, or to communicate through D2D. This is represented by a binary decision variable $a_i \in \{0, 1\}$, where:
- $a_i = 0$ refers to the action of being a relay;
- $a_i = 1$ refers to the action of communicating through D2D.
Second Stage: This stage is a cooperative game, where the value created $U(a)$ (i.e., the characteristic function) is investigated, and where $a = (a_1, a_2, a_3)$ refers to the decisions taken by the devices in the first stage. In other words, $U(a)$ is the value (i.e., the throughput profit) that the players gain as a result of cooperating in the second-stage game, given that the strategies $(a_1, a_2, a_3)$ were played in the first stage. To analyze the game, we start by analyzing the cooperative part and then work backwards to find the optimal strategy for the devices. Each second-stage cooperative game has a single-point core. For instance, the core of the game $a = (1, 1, 1)$ is the allocation in which each player $i$ earns a regret, because all the devices are totally disconnected from the BS.
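Hence, the second stage of each game is deterministic given the devices' first-stage decisions, as shown in Figure 4. Note that these choices are made simultaneously. The profit, which represents each device's throughput depending on the devices' choices, is expressed as follows:
$$U_i(a) = \begin{cases} Thp^{c}_i & \text{if } a_i = 0 \text{ and } a_j = 0 \;\; \forall j \ne i, \\ Thp^{c,d}_i & \text{if } a_i = 0 \text{ and } \sum_{j \ne i} a_j \ge 1, \\ Thp^{d}_i & \text{if } a_i = 1 \text{ and } \sum_{j \ne i} a_j \le 1, \\ -r_i & \text{if } a_i = 1 \text{ and } a_j = 1 \;\; \forall j \ne i. \end{cases}$$
As explained before, the problem can be analyzed in two ways, which are treated in turn below.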

First Case
The first-stage decision of player $i$ is represented by a binary decision variable $a_i \in \{0, 1\}$. After the first-stage switching choice $a$ has been made, the corresponding cooperative game is played in the second stage. Let $U_i(a)$ denote the second-stage profit of player $i$ given the first-stage choice $a$. The programming problem of player $i$ can be written as:
$$\max_{a_i \in \{0,1\}} U_i(a).$$
Here, player $i$ chooses the action $a_i$ that maximizes its second-stage profit. Take player 1 as an example. Let $a_1 \in \{0, 1\}$ be its binary decision variable, and let $\epsilon_2, \epsilon_3 \in \{0, 1\}$ be random variables representing the decisions of players 2 and 3. Let $U_1(a_1, \epsilon_2, \epsilon_3)$ represent the second-stage gain achievable by player 1 given its first-stage choice $a_1$ and the decisions $\epsilon_2$ and $\epsilon_3$ of players 2 and 3, respectively.
The problem of player 1 can be written as:
$$\max_{a_1 \in \{0,1\}} E_{\epsilon_2, \epsilon_3}\left[U_1(a_1, \epsilon_2, \epsilon_3)\right], \tag{16}$$
where $E_{\epsilon_2, \epsilon_3}[U_1(a_1, \epsilon_2, \epsilon_3)]$ is the expected utility of player 1 given the decisions of players 2 and 3.
In Equation (16), player 1 selects the action $a_1$ that maximizes its second-stage expected profit. Suppose that player 1 believes that players 2 and 3 will choose to communicate through the cellular link (i.e., $\epsilon_2 = 0$ and $\epsilon_3 = 0$) with belief probabilities $y_2 \ge 0$ and $y_3 \ge 0$, respectively. We can then rewrite the above problem as:
$$\max_{a_1 \in \{0,1\}} \; y_2 y_3\, U_1(a_1, 0, 0) + y_2 (1 - y_3)\, U_1(a_1, 0, 1) + (1 - y_2) y_3\, U_1(a_1, 1, 0) + (1 - y_2)(1 - y_3)\, U_1(a_1, 1, 1). \tag{17}$$
In Equation (17), device 1 chooses the action that yields its maximum expected throughput, given its beliefs about the actions of the other devices. The second-stage throughput profit of player 1 can be written as:
$$U_1(a_1, \epsilon_2, \epsilon_3) = \begin{cases} Thp^{c}_1 & \text{if } a_1 = 0 \text{ and } \epsilon_2 = \epsilon_3 = 0, \\ Thp^{c,d}_1 & \text{if } a_1 = 0 \text{ and } \epsilon_2 + \epsilon_3 \ge 1, \\ Thp^{d}_1 & \text{if } a_1 = 1 \text{ and } \epsilon_2 + \epsilon_3 \le 1, \\ -r_1 & \text{if } a_1 = 1 \text{ and } \epsilon_2 = \epsilon_3 = 1. \end{cases} \tag{18}$$
The result is that player 1 should switch to D2D if it believes that its expected profit in D2D is higher than its expected profit in cellular, and vice versa, while it is indifferent between the two options when the expected profits are equal.
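The belief-based best response of player 1 can be sketched as follows. The snippet enumerates the four joint decisions of players 2 and 3, weights each by the corresponding beliefs, and applies the piecewise payoff of Equation (18). It assumes that `y2` and `y3` are the believed probabilities that players 2 and 3 stay in cellular (decision 0), and it uses hypothetical constant throughput values; the function names are illustrative.

```python
def u1(a1, e2, e3, thp_c, thp_cd, thp_d, regret):
    """Piecewise second-stage payoff of player 1 (0 = cellular, 1 = D2D)."""
    if a1 == 0:
        return thp_c if e2 == e3 == 0 else thp_cd
    return -regret if e2 == e3 == 1 else thp_d


def expected_u1(a1, y2, y3, thp_c, thp_cd, thp_d, regret):
    """Expected payoff of player 1 under independent beliefs y2, y3."""
    total = 0.0
    for e2, w2 in ((0, y2), (1, 1.0 - y2)):
        for e3, w3 in ((0, y3), (1, 1.0 - y3)):
            total += w2 * w3 * u1(a1, e2, e3, thp_c, thp_cd, thp_d, regret)
    return total


def best_response_1(y2, y3, thp_c, thp_cd, thp_d, regret):
    """Return 0 (relay), 1 (D2D), or None when player 1 is indifferent."""
    relay = expected_u1(0, y2, y3, thp_c, thp_cd, thp_d, regret)
    d2d = expected_u1(1, y2, y3, thp_c, thp_cd, thp_d, regret)
    if relay > d2d:
        return 0
    if d2d > relay:
        return 1
    return None


if __name__ == "__main__":
    # Hypothetical normalized throughputs, for illustration only.
    values = dict(thp_c=0.3, thp_cd=0.5, thp_d=0.8, regret=0.2)
    for y in (0.0, 0.5, 1.0):
        print(y, best_response_1(y, y, **values))
```

With these placeholder values, player 1 relays when it believes nobody else will, and moves to D2D when it believes both others relay, which matches the qualitative rule stated in the text.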

Second Case
In this case, we aim to find both the pure and the mixed strategy Nash equilibria, at which the devices attain their equilibrium throughput. In game theory, if each player has chosen a strategy and no player can benefit by modifying its strategy while the other players keep theirs unchanged, then the current set of strategy choices and their corresponding payoffs form a Nash equilibrium. Moreover, every finite game admits at least one Nash equilibrium, which may be in pure or in mixed strategies.

Pure strategy Nash Equilibrium (PNE):
A pure strategy specifies an action that a device chooses with probability 1, every other action being chosen with probability 0, to attain its best profit.
One can clearly see that the action profile $(1, 1, 1)$ can never be a PNE. Since a throughput cannot be negative, any device that unilaterally switches back to the cellular link earns a non-negative throughput, which is strictly better than the regret $-r_i$.

Proof. See Appendix A.1
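A brute-force search for pure Nash equilibria can be sketched as below: every profile is tested against all unilateral deviations. As before, the payoff uses one constant value per device and case, which is a simplifying assumption, and the throughput values are hypothetical placeholders rather than results from the article.

```python
from itertools import product


def payoff(i, profile, thp_c, thp_cd, thp_d, regret):
    """Payoff of device i; profile entries are 0 (cellular) or 1 (D2D)."""
    others_in_d2d = sum(a for k, a in enumerate(profile) if k != i)
    if profile[i] == 0:
        return thp_c[i] if others_in_d2d == 0 else thp_cd[i]
    all_others_d2d = others_in_d2d == len(profile) - 1
    return -regret[i] if all_others_d2d else thp_d[i]


def pure_nash_equilibria(thp_c, thp_cd, thp_d, regret):
    """Profiles from which no device gains by a unilateral deviation."""
    n = len(thp_c)
    equilibria = []
    for profile in product((0, 1), repeat=n):
        stable = True
        for i in range(n):
            deviation = list(profile)
            deviation[i] = 1 - profile[i]
            current = payoff(i, profile, thp_c, thp_cd, thp_d, regret)
            alternative = payoff(i, deviation, thp_c, thp_cd, thp_d, regret)
            if alternative > current:
                stable = False
                break
        if stable:
            equilibria.append(profile)
    return equilibria


if __name__ == "__main__":
    # Hypothetical normalized throughputs, for illustration only.
    print(pure_nash_equilibria(
        thp_c=[0.6, 0.4, 0.2],
        thp_cd=[0.7, 0.45, 0.15],
        thp_d=[0.5, 0.5, 0.5],
        regret=[0.1, 0.1, 0.1],
    ))
```

The all-D2D profile is rejected whenever a device's relay throughput exceeds its negative regret, in line with the observation that $(1, 1, 1)$ cannot be a PNE.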
Unlike the pure equilibrium analysis, where we consider unknown, slow-fading and stationary channels, the mixed analysis considers random and fast-fading channels. In a fast-fading channel, a device may be unable to reach a pure equilibrium strategy in some situations, but it can attain the equilibrium by adopting each strategy with a certain probability.

Mixed strategy Nash Equilibrium (MNE):
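A mixed strategy assigns a probability to each pure strategy, i.e., a device chooses each action with a certain probability. A pure strategy can be considered a degenerate case of a mixed strategy. Let $p_i$ denote the probability of relaying of device $i$, so that $(1 - p_i)$ is its probability of choosing to communicate through D2D.
The equilibrium is characterized by the following indifference conditions:
- If player 1 is indifferent between being a relay and switching to D2D, then its expected utility is the same under both actions given $(p_2, p_3)$.
- If player 2 is indifferent between being a relay and switching to D2D, then its expected utility is the same under both actions given $(p_1, p_3)$.
- If player 3 is indifferent between being a relay and switching to D2D, then its expected utility is the same under both actions given $(p_1, p_2)$.
Each condition has the form $E[U_i(0, p_{-i})] = E[U_i(1, p_{-i})]$, where $p_{-i}$ collects the relaying probabilities of the two other players. The equilibrium probability vector $p^* = (p^*_1, p^*_2, p^*_3)$ is then obtained by solving the resulting system of three equations.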
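To illustrate how an indifference condition pins down a mixed equilibrium, the sketch below treats a deliberately simplified special case: three identical devices with the same four payoff constants, so that a symmetric equilibrium $p_1 = p_2 = p_3 = p$ reduces the system to a single equation, solved here by bisection. The heterogeneous system of the article requires a general nonlinear solver; this symmetric reduction, the payoff values and the function names are assumptions made for illustration only.

```python
def indifference_gap(p, thp_c, thp_cd, thp_d, regret):
    """E[U | relay] - E[U | D2D] when both other devices relay w.p. p."""
    all_relay = p ** 2            # both others stay in cellular
    all_d2d = (1.0 - p) ** 2      # both others switch to D2D
    u_relay = all_relay * thp_c + (1.0 - all_relay) * thp_cd
    u_d2d = (1.0 - all_d2d) * thp_d - all_d2d * regret
    return u_relay - u_d2d


def symmetric_mixed_equilibrium(thp_c, thp_cd, thp_d, regret, tol=1e-12):
    """Bisection on [0, 1]; returns None if the gap has no sign change."""
    lo, hi = 0.0, 1.0
    g_lo = indifference_gap(lo, thp_c, thp_cd, thp_d, regret)
    g_hi = indifference_gap(hi, thp_c, thp_cd, thp_d, regret)
    if g_lo * g_hi > 0:
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = indifference_gap(mid, thp_c, thp_cd, thp_d, regret)
        if g_lo * g_mid <= 0:
            hi = mid
        else:
            lo, g_lo = mid, g_mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical normalized throughputs, for illustration only.
    p_star = symmetric_mixed_equilibrium(0.3, 0.5, 0.8, 0.2)
    print(f"symmetric relaying probability: {p_star:.4f}")
```

At the returned probability, each device is indifferent between relaying and D2D, so no device can gain by shifting probability mass from one action to the other.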

Equilibrium Analysis for n-Person Game
Consider a two-stage decision problem with a fixed number $n$ of devices inside a single cell. At each step of the game, each player chooses an action. The result of each play is a random payoff, defined as the throughput of each player $i \in N$. Depending on the devices' choices of belonging to the cellular or the D2D group, each device earns a throughput as follows:
- If all the devices are in cellular, each device $i$ earns its cellular throughput $Thp^{c}_i$.
- If there are $n_c$ devices in cellular, $N_c = \{1, 2, \ldots, n_c\}$, and $n_d$ devices in D2D, $N_d = \{1, 2, \ldots, n_d\}$, then each device $i$ in cellular earns the relay throughput $Thp^{c,d}_i$, whereas each device $k$ in the D2D group earns $Thp^{d}_k$. We assume that the fraction of throughput given by the cellular devices is equally divided between the devices in D2D.
- If all devices decide to switch to D2D communication, each device earns a regret, because there is no link left with the BS and all transmissions fail: $Thp_i = -r_i$.
At each transmission, each device has the choice of staying connected to the BS and serving as a relay, or switching to D2D communication. A device may join either the cellular or the D2D group whenever it wants, in order to maximize its profit. Once in the cellular group, all the devices serve as relays for the D2D transmitters in the other group. There are $2^n$ different cooperation combinations between the $n$ devices inside the cell: either all of them communicate through cellular links, or all of them join the D2D group, or some devices communicate through cellular and serve as relays for the others in the D2D group.
In the first stage, the decision of player $i \in N$ is to choose its mode of communication. In the second stage, we investigate the throughput $U(a)$ that the players generate as a result of cooperating in the second-stage game, given that the strategies $a = (a_1, a_2, \ldots, a_n)$ were played in the first stage. We thus denote by $U(a)$ a second-stage cooperative game. For example, $U(0, 0, \ldots, 0)$ is the case where all devices are in the cellular group, while $U(1, 1, \ldots, 1)$ is the case where all devices choose to join the D2D group.

First Case
For each device $i$, $a_i \in \{0, 1\}$ is its binary decision variable. Let $\epsilon_k \in \{0, 1\}$ be the decision of device $k \in \{1, \ldots, n\} \setminus \{i\}$, and let $\epsilon_{-i}$ collect these decisions. $U_i(a_i, \epsilon_{-i})$ denotes the second-stage profit achievable by device $i$ given its first-stage action and the other devices' decisions.
The problem of device $i$ can be written as follows:
$$\max_{a_i \in \{0,1\}} E_{\epsilon_{-i}}\left[U_i(a_i, \epsilon_{-i})\right],$$
where $E_{\epsilon_{-i}}[U_i(a_i, \epsilon_{-i})]$ is the expected utility of device $i$ when choosing action $a_i$, depending on the other devices' decisions.
Here, player $i$ chooses the action $a_i$ that maximizes its expected second-stage earning.
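The n-device expectation above can be sketched by enumerating the joint decisions of the other devices. The snippet assumes independent beliefs, with `relay_beliefs[k]` being the believed probability that device `k` stays in cellular, and reuses the simplified constant-per-case payoff. All names and values are hypothetical, and the enumeration costs $2^{n-1}$ evaluations, so it is only practical for small cells.

```python
from itertools import product


def payoff(i, profile, thp_c, thp_cd, thp_d, regret):
    """Payoff of device i; profile entries are 0 (cellular) or 1 (D2D)."""
    others_in_d2d = sum(a for k, a in enumerate(profile) if k != i)
    if profile[i] == 0:
        return thp_c[i] if others_in_d2d == 0 else thp_cd[i]
    all_others_d2d = others_in_d2d == len(profile) - 1
    return -regret[i] if all_others_d2d else thp_d[i]


def expected_utility(i, a_i, relay_beliefs, thp_c, thp_cd, thp_d, regret):
    """Expected payoff of device i for action a_i under independent beliefs.

    relay_beliefs[k] is the believed probability that device k picks 0;
    the entry at index i is ignored.
    """
    n = len(relay_beliefs)
    others = [k for k in range(n) if k != i]
    total = 0.0
    for choices in product((0, 1), repeat=len(others)):
        weight = 1.0
        profile = [0] * n
        profile[i] = a_i
        for k, e_k in zip(others, choices):
            weight *= relay_beliefs[k] if e_k == 0 else 1.0 - relay_beliefs[k]
            profile[k] = e_k
        total += weight * payoff(i, profile, thp_c, thp_cd, thp_d, regret)
    return total


if __name__ == "__main__":
    # Four identical devices with hypothetical normalized throughputs.
    n = 4
    args = ([0.5] * n, [0.3] * n, [0.5] * n, [0.8] * n, [0.2] * n)
    print("relay:", expected_utility(0, 0, *args))
    print("D2D:  ", expected_utility(0, 1, *args))
```

Comparing the two printed values gives the best response of device 1 for the assumed beliefs, which is the n-device analogue of the three-device rule.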

Second Case
In this case, we aim to find the PNE and the MNE of the n-device game. The concept of NE is used to describe a strategy as the most rational behavior by players acting to maximize their gains.
Nonetheless, a finite game might not always have a PNE, but it always has an MNE.

Definition 2.
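A mixed action profile $p^* = (p^*_1, \ldots, p^*_n)$, with $p^*_i \in (0, 1)$, is a mixed Nash equilibrium if, for each player $i \in \{1, 2, \ldots, n\}$,
$$U_i(p^*_i, p^*_{-i}) \ge U_i(p_i, p^*_{-i}), \quad \forall p_i \in \Delta(A_i),$$
where $p_i$ is a mixed action for player $i$ and $p_{-i}$ is the profile of mixed actions for all players other than $i$. $\Delta(A_i)$ is the set of all probability distributions over $A_i$, which is the set of pure strategies of player $i$.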
From the network designer's perspective, the solutions produced by the biform game framework require complete network information, which may not scale well with the network size and might cause a high signaling overhead. Thus, for networks with incomplete information, the devices need to be self-organized and use decentralized learning algorithms to reach their equilibrium strategies. This requires only minimal signaling to the users, and no recommendation from the BS. Part II [12] of this work covers the distributed schemes enabling the devices to reach a Nash equilibrium based only on their local information and observations.

Performance Analysis
In this section, we evaluate the performance of the biform game using Mathworks Matlab R2020a. For illustrative purposes, we perform simulations for the three-device case. Figures are produced using the following setup: $P^c_1 = 10$ mW, $P^c_2 = 30$ mW, $P^c_3 = 50$ mW, $P_d = 5$ mW, $R = 1$ Mbit/s, $L = M = 1024$ bits, $\gamma_{th} = 40$ dB, $\alpha_c = \alpha_d = 3$, $\sigma_N^2 = -116$ dBm, $d_1 = 100$ m, $d_2 = 300$ m, $d_3 = 500$ m, $f = 10^{-5}$, $x_1 = x_2 = x_3 = 0.5$, $|h_1|^2 = 0.6$, $|h_2|^2 = 0.5$, and $|h_3|^2 = 0.2$. Figures 5-7 report the action that a device might choose to maximize its expected utility depending on its beliefs about its competitors.
Figure 5. Throughput of device 1 as a function of its beliefs on the relaying probabilities of device 2 ($y_2$) and device 3 ($y_3$), both when relaying ($a_1 = 0$) and not relaying ($a_1 = 1$).
Figure 6. Throughput of device 2 as a function of its beliefs on the relaying probabilities of device 1 ($y_1$) and device 3 ($y_3$), both when relaying ($a_2 = 0$) and not relaying ($a_2 = 1$).
Figure 7. Throughput of device 3 as a function of its beliefs on the relaying probabilities of device 1 ($y_1$) and device 2 ($y_2$), both when relaying ($a_3 = 0$) and not relaying ($a_3 = 1$).
Figure 5 shows that when the strongest device 1 believes devices 2 and 3 have a low chance of relaying data, it has an incentive to act as a relay to maximize its expected utility. Conversely, it is more likely to communicate over a D2D link when it believes its competitors are likely to serve as relays. We notice that the maximum throughput for device 1 is attained when it acts as a relay while the other devices have a high chance of communicating through D2D. In this way, it avoids the regret of being disconnected from the mobile service and gets rid of all interference from the other two devices. Moreover, device 1 chooses to communicate through D2D to maximize its utility if it believes one of its competitors might be a relay. This way, it gets rid of the cellular interference and transmits at a lower power.
Figure 6 depicts the average throughput of device 2 as its beliefs on the other devices' willingness to relay change. We notice that the relaying probability of device 2 increases when the relaying probability of the weakest device 3 decreases. This can be explained as follows: device 3 may harm the second strongest device when transmitting over the cellular link, whereas device 2 is indifferent to the strategy of device 1. Device 2 switches to D2D when the relaying probability of device 1 increases and that of device 3 decreases. Following this behavior, device 2 is able to get rid of the high interference in cellular, transmit at a lower power, use a better RAN and experience the satisfactory QoS brought by the strongest device.
Similarly, Figure 7 depicts the average throughput of the weakest device. When the latter decides to serve as a relay, it experiences a low QoS due to the long distance and the bad channel gain towards the BS. It also has to transmit with high power and share its throughput with other devices via D2D. However, switching to D2D allows it to benefit from a better channel quality, to transmit at a lower power and to experience the improved QoS offered by the stronger relays. We notice that device 3 might experience a high throughput when the strongest device serves as a relay and device 2 uses D2D. In this case, device 1 gets rid of interference and perceives a high throughput, while device 3 gets a fraction of that throughput, resulting in a win-win scenario.

Conclusions and Perspectives
In this article, we considered the uplink case of n devices, where each device chooses whether to communicate through cellular (e.g., 5G/6G) or via a D2D link to maximize its throughput. Cellular devices use NOMA, whilst they may serve neighboring devices using an orthogonal multiple access method (e.g., OFDMA/SC-FDMA). We formulated the problem as a biform game: in Step 1, the devices compete over two available radio access technologies (cellular and D2D); in Step 2, the devices connected to cellular cooperate with the other devices in order to provide access to the available services. Next, we analyzed the pure/mixed equilibria of the game. Simulation results show that D2D relaying improves the devices' average throughput. The second part of this article [12] deals with implementing distributed reinforcement learning to self-explore optimal strategies in a fully distributed manner.