Joint Resource Allocation and Drones Relay Selection for Large-Scale D2D Communication Underlaying Hybrid VLC/RF IoT Systems

Abstract: Relay-aided device-to-device (D2D) communication that combines visible light communication (VLC) with radio frequency (RF) is a promising paradigm for the Internet of Things (IoT). Static relays limit flexibility and make it hard to maintain connectivity in hybrid VLC/RF IoT systems. A drone acting as a relay station can avoid obstacles such as buildings and communicate over a line-of-sight (LoS) link, which naturally matches the requirements of VLC systems. Drone relay-aided D2D communication therefore addresses the constrained coverage, lack of flexibility, poor reliability, and weak connectivity that hinder VLC in the IoT, and it can be deployed cost-effectively in the large-scale IoT. This paper proposes a joint resource allocation and drone relay selection scheme that aims to maximize the D2D system sum rate while ensuring the quality of service (QoS) requirements of cellular users (CUs) and D2D users (DUs). First, we construct a two-phase coalitional game to tackle the resource allocation problem; the game exploits the combination of VLC and RF and incorporates a greedy strategy. Next, a distributed cooperative multi-agent reinforcement learning (MARL) algorithm, called WoLF policy hill-climbing (WoLF-PHC), is proposed to address the drone relay selection problem. To further reduce the computational complexity, we propose a lightweight neighbor-agent-based WoLF-PHC algorithm that utilizes only the historical information of neighboring DUs. Finally, we provide a theoretical analysis of the proposed schemes in terms of complexity and signaling overhead. Simulation results show that the proposed schemes improve the sum rate and outage probability relative to the benchmark algorithms.


Introduction
With the wide application of the Internet of Things (IoT) in fields such as cities, industry, and transportation, a growing number of IoT devices (IoDs) are connected via the Internet to exchange information about themselves and their surroundings. The number of IoDs is expected to increase to 75.4 billion by 2025, more than nine times the number in 2017 [1]. This proliferation places higher demands on the spectrum, data rate, and latency of IoT communications. In response, device-to-device (D2D) communication, in which two nearby devices exchange information directly, has been widely employed in IoT networks to improve spectrum efficiency and data rates while reducing transmission delays [2][3][4]. Depending on whether radio frequency (RF) resources are shared between D2D users (DUs) and traditional cellular users (CUs), D2D communication can be classified into two categories: underlay and overlay. In particular, underlay D2D communication has been shown to provide higher spectrum efficiency and to match the spectrum-sharing nature of IoT networks [5]. However, it inevitably leads to mutual interference between DUs and CUs. The emerging visible light communication (VLC) does not cause electromagnetic interference to existing RF systems and offers further advantages, such as a broad spectrum, innate security, and license-free deployment [6]. Yet VLC is susceptible to blockage and suffers severe path attenuation. Therefore, combining the RF and VLC bands for D2D communication is regarded as an attractive way to mitigate interference and overcrowding in the RF spectrum, thus boosting the system capacity [7][8][9].
However, the envisioned benefits of D2D communication may be limited by long distances, obstacles, and inferior channel conditions, especially for VLC-D2D communication. As a result, direct D2D communication is not well suited to IoT applications that require wide coverage and high reliability [10]. A promising response to this dilemma is relay-aided D2D communication, which extends the communication range and improves both reliability and flexibility [11]. That is, D2D communication can be extended to a relay-aided manner when the IoDs that need to communicate are far from each other or are blocked by obstacles. Such relay-aided systems are feasible for large-scale IoT applications, such as massive machine-type communication, the cognitive IoT, and wireless sensing, without extra construction costs, because a large number of available IoDs (e.g., sensors, actuators, drones) can act as relays [12][13][14]. Unmanned aerial vehicles (UAVs) have been widely used in both military and civilian applications [15,16]. The work in [17] considers UAVs that relay maintenance data via VLC under ultra-reliability and low-latency requirements. The works in [18] and [19] enable UAVs to determine their deployment and user association so as to minimize the total transmit power with VLC. In [20], the authors optimized a UAV-assisted VLC framework that first minimizes the required number of UAVs and then minimizes the total transmitted power. In [21], the authors consider UAVs equipped with a VLC access point and coordinated multipoint (CoMP) capability to simultaneously maximize the total data rate and minimize the total communication power consumption. In [22], the authors describe a UAV-assisted outdoor VLC system, in which the UAV serves as a communication relay, to provide high-speed and high-capacity communication for users who are blocked by natural disasters or mountains. However, to the best of our knowledge, there is no research on using drones as relays for joint resource allocation and relay selection in a hybrid VLC/RF IoT system for D2D communication.
Accordingly, we concentrate on a large-scale drone relay-aided D2D communication underlaying hybrid VLC/RF IoT system, where multiple CUs, DUs, and drone relays coexist. Different from existing research, we introduce drones as relay stations to address the challenges posed by constrained coverage, poor reliability, and the lack of flexibility. More concretely, each DU corresponds to a pair of IoDs with an inferior channel condition, so a drone relay is needed to aid the communication, and this relay can be selected from other idle IoDs. In addition, each DU and its relay are allowed to utilize either the VLC resource or the orthogonal RF resource of a certain CU. Two main variables determine the system performance. One is the resource allocation for each DU, which also decides the resource used by its corresponding relay. In the large-scale IoT, the number of DUs is typically higher than that of CUs. This means that, even with the VLC resource included, some DUs still share the same resource, causing mutual interference among the DUs, the corresponding relays, and possibly a certain CU. Hence, how to fully leverage the combination of VLC and RF to alleviate the mutual interference is a crucial issue. The other variable is the drone relay selection for each DU. In the large-scale IoT, a DU has multiple available relays. However, different relays bring not only different communication gains to the DU but also different interference to the users sharing the same resource. Thus, how to select the relays to improve the overall system performance is another important issue.
So far, there have been some works on resource allocation [23][24][25][26][27], relay selection [28][29][30], and their joint optimization [31][32][33][34][35][36][37] for relay-aided D2D communication. However, there is no research on using drones as relays for joint resource allocation and relay selection for large-scale D2D communication in hybrid VLC/RF IoT systems. Static relays limit flexibility and make it hard to maintain connectivity in hybrid VLC/RF IoT systems. A drone acting as a relay station can avoid obstacles such as buildings and communicate over a line-of-sight (LoS) link, which naturally matches the requirements of VLC systems. In addition, most works consider small-scale scenarios in which sharing resources among DUs is not required. Most of the optimization methods proposed in these works are not suitable for large-scale scenarios, where resource allocation and relay selection become more difficult for the following reasons.

1. Large-scale relay-aided D2D communication causes resource shortages, so each resource is shared by multiple DUs. The resulting complex interference relationships mean that the resource allocation for one DU also affects the performance of the other DUs that share the same resource. This motivates us to view the resource allocation process as finding, for each resource, the optimal set of DUs in which the mutual interference is minimal.

2. Similarly, the interference relationships make the relay selection for all DUs within the same set co-dependent. This prompts us to allow these DUs to cooperate with each other for a higher collective gain.

3. Large-scale IoD deployment inherently increases the time complexity and signaling overhead of the optimization process, especially for relay selection. For practical applications, the optimization schemes should have low complexity.
Against this background, we present a joint optimization of resource allocation and drone relay selection for a large-scale relay-aided D2D underlaying hybrid VLC/RF IoT system, aiming to maximize the D2D system sum rate while ensuring the quality of service (QoS) requirements of CUs and DUs. First, inspired by the aforementioned perspective of finding the optimal DU set for each resource, we model the resource allocation problem as a coalitional game. In particular, we construct a two-phase coalitional game that allows each DU to explore and finally join a coalition while guaranteeing QoS. The coalitions that eventually form are exactly the optimal sets of DUs for the different resources. Afterwards, given the large number of DUs and available relays, we regard each DU as an agent that can autonomously select a proper relay through learning. In this way, the relay selection problem is modeled as a multi-agent problem and can be solved in a distributed manner. Furthermore, given the aforementioned co-dependency in relay selection among the DUs in the same coalition, we propose two low-complexity cooperative relay selection schemes based on a multi-agent reinforcement learning (MARL) algorithm termed WoLF policy hill-climbing (WoLF-PHC). These two schemes not only overcome the inherent non-stationarity of the multi-agent environment but also encourage the DUs to cooperate for a higher system sum rate. The main contributions of this paper are summarized as follows:

1. The model of the drone relay-aided D2D communication underlaying hybrid VLC/RF system for the large-scale IoT is given. Aiming to maximize the sum rate of the D2D system while ensuring QoS, the joint optimization problem of resource allocation and drone relay selection is formulated. The problem has a non-convex and combinatorial structure that makes it difficult to solve directly. Thus, we divide it into two subproblems and solve them sequentially.

2. From the perspective of finding the optimal DU set for each resource, we construct a two-phase coalitional game to tackle the resource allocation problem. Specifically, we leverage the combination of VLC and RF to ensure QoS in the coalition initialization phase. We also incorporate a greedy strategy into the coalition formation phase to obtain the globally optimal sets of DUs.

3. To handle the co-dependency, we first propose a cooperative WoLF-PHC-based relay selection scheme, in which the agents in the same coalition share a common reward. Meanwhile, in any coalition, each agent's policy can use the historical action information of the other agents to overcome the non-stationarity of the environment. Interestingly, combining the results of the resource allocation, we find that the historical information of neighboring agents alone is sufficient to alleviate the instability. Hence, a lightweight neighbor-agent-based WoLF-PHC algorithm with reduced complexity is further proposed.

4. We provide a theoretical analysis of the proposed schemes in terms of complexity and signaling overhead. We also provide numerical results indicating that the proposed schemes outperform the considered benchmarks in terms of the sum rate and outage probability. Moreover, we investigate the trade-off between the sum rate performance and computational complexity.
The rest of this paper is organized as follows. Section 2 reviews related work. In Section 3, the system model is given and the problem is formulated. In Sections 4 and 5, we present the proposed resource allocation and relay selection schemes, respectively. The complexity and signaling overhead are analyzed in Section 6, and the simulation results are presented in Section 7. Finally, Section 8 concludes the paper.

Related Works
With the potential to substantially increase system capacity, the novel D2D concept combining VLC and RF communication was first proposed in [7]. In [38], a survey on D2D communication for 5GB/6G networks discusses the concept, applications, challenges, and future directions. In [39], the authors provide a V2I and V2V collaboration framework to support emergency communications in the ABS-aided internet of vehicles. Several works have studied resource allocation for D2D communication in hybrid VLC/RF systems. In [8], an iterative two-stage resource allocation algorithm was proposed based on an analysis of the interference generated by D2D transmitters and received by D2D receivers. With only limited channel state information (CSI), the authors in [9] implemented a quick band selection between VLC and RF using deep neural networks. On this basis, refs. [25,26] added the millimeter-wave band to the hybrid VLC/RF bands and formulated the multi-band selection problem as a multi-armed bandit problem. However, the above works only considered the overlay mode rather than multiple DUs coexisting in the underlay mode, which is an essential use case in future networks. Only our previous work [40] considered this use case for D2D underlay communication and solved the resource allocation problem using a coalitional game. The main difference between this work and our previous work is that D2D communication is extended to a relay-aided manner, which gives rise to new problems.
Relay-aided D2D communication emerged from the demand to extend the communication range and to enhance both reliability and flexibility. The joint optimization of resource allocation and relay selection for relay-aided D2D communication in traditional RF systems has been widely studied. In [31], the joint optimization problem of mode selection, power control, channel allocation, and relay selection was decomposed into four subproblems that were solved individually, aiming to maximize the total throughput. In contrast, the authors in [32] first addressed the power control problem separately and then solved the remaining joint problem using an improved greedy algorithm. Similarly, ref. [33] addressed the power control problem first so that the remaining joint problem could be converted into a tractable integer linear programming problem. In [34], taking into account both willingness and social attributes, a social-aware relay selection algorithm was proposed, followed by a greedy-based resource allocation scheme. Furthermore, in order to motivate users to act as relays, ref. [35] assumed that the relays assisting D2D communication could harvest energy from RF signals and formulated the optimization problem as a three-dimensional resource-power-relay problem. The authors in [36] focused on the energy efficiency optimization of relay-aided D2D communication under the simultaneous wireless information and power transfer paradigm. In addition, ref. [37] derived an energy-efficiency-oriented joint relay selection and resource allocation solution for mobile edge computing systems using convex optimization techniques. Nevertheless, these works considered neither large-scale nor hybrid VLC/RF scenarios. Moreover, most of the above works on relay selection adopted either a brute-force algorithm based on designated regions or a distance-based algorithm, which have high computational complexity and are not suitable for large-scale applications.
Given the dynamics of practical networks, reinforcement learning (RL) techniques have been introduced to solve the relay selection problem. Ref. [41] developed a centralized hierarchical deep RL-based relay selection algorithm to minimize the total transmission delay in mmWave vehicular networks. Ref. [42] presented a multi-featured actor-critic RL algorithm to maximize the data delivery ratio in energy-harvesting wireless sensor networks. Also, ref. [43] incorporated prioritized experience replay into a deep deterministic policy algorithm and minimized the outage probability without any prior knowledge of CSI. The above works modeled the policy search process as a Markov decision process, which is valid if different agents update their policies independently at different times. However, if two or more agents update their policies at the same time, the multi-agent environment may become non-stationary [44]. A key issue is therefore how to reduce the action space and computational complexity of multi-agent systems, and thus improve the training speed, while keeping the multi-agent environment stationary.
In summary, the above studies have four drawbacks: (1) They only considered the overlay mode rather than multiple DUs coexisting in the underlay mode, which is an essential use case in future networks. (2) Although some works jointly optimize resource allocation and relay selection for relay-aided D2D communication, they did not take the large-scale IoT or hybrid VLC/RF scenarios into consideration. (3) Existing research adopts static relays, which limit flexibility and make it hard to maintain connectivity in hybrid VLC/RF IoT systems. A dynamic relay-assisted D2D communication system with wide coverage, high flexibility, good reliability, and strong connectivity needs to be constructed. (4) Most of the joint optimization methods proposed in these works are not suitable for large-scale scenarios, so new methods with low complexity and signaling overhead need to be developed.

System Description
We consider a drone relay-aided D2D communication underlaying hybrid VLC/RF system for the large-scale IoT, as shown in Figure 1, which consists of $M$ CUs, $N$ DUs, and $R$ drone relays uniformly distributed in a square room. Note that a DU represents a D2D pair consisting of a transmitter (DU-TX) and a receiver (DU-RX). Let $\mathcal{N} = \{1, \ldots, n, \ldots, N\}$, $\mathcal{S} = \{1, \ldots, s, \ldots, N\}$, and $\mathcal{D} = \{1, \ldots, d, \ldots, N\}$ denote the sets of DUs, DU-TXs, and DU-RXs, respectively. Similarly, $\mathcal{M} = \{1, \ldots, m, \ldots, M\}$ and $\mathcal{R} = \{1, \ldots, r, \ldots, R\}$ represent the sets of CUs and drone relays, respectively. We assume that each CU has been pre-allocated an orthogonal uplink RF resource, i.e., CU $m$ occupies the RF resource $c_m$. Together with the VLC resource $c_{M+1}$, there are $M + 1$ available resources, whose set is denoted as $\mathcal{C} = \{c_1, \ldots, c_m, \ldots, c_M, c_{M+1}\}$. Meanwhile, each DU is allowed to reuse one resource from the set $\mathcal{C}$. To describe whether DU $n \in \mathcal{N}$ reuses resource $c_m \in \mathcal{C}$, we introduce a decision matrix for resource allocation:
$$\left[\beta_n^{c_m}\right]_{N \times (M+1)},$$
where $\beta_n^{c_m}$ is a binary variable. Specifically, $\beta_n^{c_m} = 1$ denotes that DU $n$ reuses resource $c_m$; otherwise, $\beta_n^{c_m} = 0$. For the sake of practicality, we suppose that each DU can select only one relay for assistance and that each relay can be attached to at most one DU at a time. To describe whether DU $n \in \mathcal{N}$ transmits data with the help of relay $r \in \mathcal{R}$, we introduce another decision matrix for relay selection:
$$\left[\alpha_{n,r}\right]_{N \times R},$$

where $\alpha_{n,r}$ is a binary variable. Specifically, $\alpha_{n,r} = 1$ denotes that relay $r$ aids DU $n$; otherwise, $\alpha_{n,r} = 0$. Furthermore, the drone relays involved use the mixed VLC/RF decode-and-forward protocol in half-duplex mode to transfer data, which divides the data transmission into two hops: (1) DU-TX $s$ transfers data to the corresponding relay $r$ by reusing resource $c_m \in \mathcal{C}$; (2) the drone relay $r$ forwards the data to the corresponding DU-RX $d$ by reusing $c_m$. In such a system, we focus on the analysis of the interference caused by resource sharing. On the one hand, the DUs using the VLC resource are exposed to interference from the other DUs and their corresponding relays operating in VLC. On the other hand, users who share the same RF resource, including one CU, several DUs, and their corresponding relays, interfere with each other. Note that users exploiting different resources do not interfere with each other. All types of interference are sketched in Figure 1. A detailed analysis of the interference in the two typical communication modes, VLC-D2D and RF-D2D, is presented in the following.
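The assignment rules stated above (one resource per DU, one relay per DU, at most one DU per relay) can be checked mechanically. The following is a minimal sketch, assuming the two decision matrices are stored as nested Python lists; the function name `is_feasible` and this data layout are illustrative assumptions, not part of the paper.

```python
def is_feasible(beta, alpha):
    """Check the assignment rules on the two binary decision matrices.

    beta[n][c]  == 1 if DU n reuses resource c (M RF resources plus one VLC resource).
    alpha[n][r] == 1 if relay r aids DU n.
    Both layouts are hypothetical; the paper only defines the binary variables.
    """
    # Every entry of both matrices must be 0 or 1.
    all_binary = all(v in (0, 1) for row in beta + alpha for v in row)
    # Each DU reuses exactly one resource.
    one_resource_per_du = all(sum(row) == 1 for row in beta)
    # Each DU selects exactly one relay.
    one_relay_per_du = all(sum(row) == 1 for row in alpha)
    # zip(*alpha) iterates over columns, i.e. over relays: at most one DU each.
    relay_used_at_most_once = all(sum(col) <= 1 for col in zip(*alpha))
    return (all_binary and one_resource_per_du
            and one_relay_per_du and relay_used_at_most_once)
```

For example, with two DUs, three resources, and three relays, an assignment in which both DUs pick the same relay is rejected, while one in which they pick different relays is accepted.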

VLC-D2D Communication Mode
In the VLC-D2D communication mode, we suppose that DU $n = (s, d)$, $n \in \mathcal{N}$, $s \in \mathcal{S}$, $d \in \mathcal{D}$, utilizes resource $c_{M+1}$ and communicates in VLC with the assistance of relay $r \in \mathcal{R}$.
According to [45], the VLC channel gain is given as: where $A$ is the detector area; $d_{i,j}$ denotes the distance between the source $i$ and destination $j$; $\phi$ and $\psi$ represent the angles of irradiance and incidence, respectively; $-\ln 2 / \ln(\cos \Phi_{1/2})$ is the Lambertian order and $\Phi_{1/2}$ is the half-intensity radiation angle; $g_f$ is the gain of the optical filter; and $g_c(\cdot)$ is the optical concentrator gain, which is a function of $\psi$ and is denoted as: where $l$ is the refractive index and $\Psi$ is the semi-angle of the field-of-view of the photodiode.
In the 1st-hop VLC-D2D link, the signal-to-interference-plus-noise ratio (SINR) of relay $r$ from DU-TX $s$ is expressed as: where $P_s^V$ is the transmitted optical power of $s$; $\kappa$ is the efficiency of converting the optical signal to an electrical signal; $N_V$ is the noise power spectral density in VLC and $B_V$ is the bandwidth of VLC; and $I_{s,r}^{c_{M+1}}$ is the interference at $r$ when receiving the signal from $s$, which comes from the other DU-TXs sharing the VLC resource $c_{M+1}$. However, $I_{s,r}^{c_{M+1}}$ is difficult to express exactly because the set of DUs sharing the same resource may differ from period to period. Inspired by [46], we replace the exact form $\gamma_{s,r}^{c_{M+1}}$ with the expected form $\bar{\gamma}_{s,r}^{c_{M+1}}$, which can be approximated as: where $E[\cdot]$ denotes the expectation, which can be formulated as: where $\mathcal{A}_{M+1}$ represents the set of DUs utilizing $c_{M+1}$, and the operator $|\mathcal{A}|$ denotes the cardinality of set $\mathcal{A}$. Since each DU $n$ corresponds one-to-one to its DU-TX $s$, $n \in \mathcal{A}$ and $s \in \mathcal{A}$ are regarded as equivalent hereinafter.
In the 2nd-hop VLC-D2D link, the expected SINR of the corresponding DU-RX $d$ from relay $r$ is expressed as: where $P_r^V$ is the transmitted optical power of $r$, and $E[I_{r,d}^{c_{M+1}}]$ is the expected interference generated by the relays assisting other DUs that reuse $c_{M+1}$, which is measured as: Due to the average, peak, and non-negativity constraints on the modulated optical signals, the classical Shannon equation cannot be applied to VLC. Although the exact capacity of the VLC channel remains unknown, the dual-hop achievable data rate of DU $n$ can be approximated as [47]:
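The parameters listed above match the widely used Lambertian line-of-sight model, so the channel gain can be sketched in code. The sketch below assumes that standard form, $\frac{(m_L+1)A}{2\pi d^2}\cos^{m_L}(\phi)\, g_f\, g_c(\psi)\cos(\psi)$ for $0 \le \psi \le \Psi$ and zero otherwise, with $g_c(\psi) = l^2/\sin^2\Psi$ inside the field of view. The exact expressions used in the paper are not recoverable from the text, and all function and argument names here are hypothetical.

```python
import math


def lambertian_order(half_angle):
    """Lambertian order from the half-intensity radiation angle (radians)."""
    return -math.log(2.0) / math.log(math.cos(half_angle))


def concentrator_gain(psi, refractive_index, fov):
    """Optical concentrator gain; zero outside the field of view (assumed form)."""
    if 0.0 <= psi <= fov:
        return refractive_index ** 2 / math.sin(fov) ** 2
    return 0.0


def vlc_channel_gain(area, distance, phi, psi, half_angle,
                     filter_gain, refractive_index, fov):
    """LoS VLC channel gain under the standard Lambertian model (an assumption)."""
    if not 0.0 <= psi <= fov:
        return 0.0  # receiver does not see the source
    m_l = lambertian_order(half_angle)
    return ((m_l + 1.0) * area / (2.0 * math.pi * distance ** 2)
            * math.cos(phi) ** m_l
            * filter_gain
            * concentrator_gain(psi, refractive_index, fov)
            * math.cos(psi))
```

With a half-intensity angle of 60 degrees the Lambertian order is 1, so a perfectly aligned link at 1 m with a 1 cm² detector, unit filter gain, refractive index 1.5, and a 60 degree field of view has a gain of about 9.5e-5 under these assumptions.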

RF-D2D Communication Mode
In the RF-D2D communication mode, we assume that DU $n = (s, d)$, $n \in \mathcal{N}$, $s \in \mathcal{S}$, $d \in \mathcal{D}$, is assisted by relay $r \in \mathcal{R}$ and reuses the RF resource $c_m$, $m \in \mathcal{M}$. Moreover, we follow the 3GPP recommendation for indoor D2D communication as defined in [48], i.e., the D2D indoor path loss model is formulated as: In the 1st-hop RF-D2D link, the SINR of relay $r$ from DU-TX $s$ is given as: where $P_s^R$ is the transmitted power of $s$; $N_R$ is the noise power spectral density in RF and $B_R$ is the bandwidth of RF; $H$ is the RF channel gain; and $I_{s,r}^{c_m}$ is the sum interference received by $r$, which contains two parts. The first part is the interference from the other DU-TXs sharing $c_m$, denoted as $I_{s,r}^{c_m}(D)$. Similar to the interference $I_{s,r}^{c_{M+1}}$, this part cannot be described accurately due to the dynamics of resource allocation. The second part is the interference from CU $m$, denoted as $I_{s,r}^{c_m}(C)$, which is also difficult to calculate exactly because CUs transmit data to the base station (BS) only probabilistically rather than continuously. Consequently, we use the expected form $\bar{\gamma}_{s,r}^{c_m}$ instead of $\gamma_{s,r}^{c_m}$, which is given by: where $E[I_{s,r}^{c_m}]$ denotes the expected sum interference and can be written as: where $P_m^R$ is the transmitted power of CU $m$; $\mu_m$ represents the communication activity probability of $m$; and $\mathcal{A}_m$ is the set of DUs exploiting $c_m$.
In the 2nd-hop RF-D2D link, the expected SINR of the corresponding DU-RX $d$ from relay $r$ is given as: where $P_r^R$ is the transmitted power of relay $r$, and $E[I_{r,d}^{c_m}]$ is the expected sum interference, which is calculated as: Here, the data rate of DU $n$ can be measured by Shannon's capacity formula: Similarly, the expected SINR at BS $b$ from CU $m$ in the 1st hop is given as: The expected SINR at BS $b$ from CU $m$ in the 2nd hop is given by: Therefore, the data rate of CU $m$ can be calculated as:
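To make the RF-D2D quantities concrete, the sketch below strings together an expected interference term, an expected SINR, and a dual-hop rate. It rests on several assumptions that the text does not confirm: the SINR has the form $P H / (N_R B_R + E[I])$, the co-channel DU interference is summed directly rather than averaged over possible DU sets, and the half-duplex decode-and-forward rate is taken as half the Shannon rate of the weaker hop. All names are hypothetical.

```python
import math


def expected_rf_interference(du_terms, cu_power, cu_gain, cu_activity):
    """Expected interference on one RF resource.

    du_terms: iterable of (transmit_power, channel_gain) for co-channel DU-TXs
              (or relays in the 2nd hop); summed directly as a simplification.
    The CU term is weighted by its activity probability, as described in the text.
    """
    return sum(p * g for p, g in du_terms) + cu_activity * cu_power * cu_gain


def expected_sinr(tx_power, gain, noise_psd, bandwidth, interference):
    """Expected SINR, assuming the form P*H / (N*B + E[I])."""
    return tx_power * gain / (noise_psd * bandwidth + interference)


def dual_hop_rate(bandwidth, sinr_hop1, sinr_hop2):
    """Half-duplex decode-and-forward rate: half the Shannon rate of the weaker hop
    (a common convention, assumed here rather than taken from the paper)."""
    return 0.5 * bandwidth * math.log2(1.0 + min(sinr_hop1, sinr_hop2))
```

The `min` reflects that a decode-and-forward link is limited by its weaker hop, and the factor 0.5 reflects that the two hops occupy separate time slots in half-duplex operation.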

Problem Formulation
Our goal is to maximize the D2D system sum rate while ensuring the QoS requirements of the CUs and DUs. Thus, the joint optimization problem of resource allocation and relay selection is formulated as: Constraint (21b) guarantees the QoS of the CUs, where $R_{th}^C$ denotes the rate threshold of the CU link. By analogy, constraint (21c) guarantees the QoS of the DUs, where $R_{th}^D$ denotes the rate threshold of the D2D link. Constraint (21d) states that the resource allocation decision $\beta_n^{c_m}$ is a binary variable. Constraint (21e) ensures that each DU reuses only one resource. Constraint (21f) states that the relay selection decision $\alpha_{n,r}$ is a binary variable. Constraint (21g) further ensures that each DU employs only one relay and that each relay aids at most one DU.
The formulated problem has a non-convex and combinatorial structure that makes it intractable to solve in polynomial time. Since both $\alpha_{n,r}$ and $\beta_n^{c_m}$ are 0-1 integer variables, an intuitive method is to enumerate all possible combinations and pick the optimal resource allocation and relay selection strategy. Nevertheless, the time complexity of this exhaustive method is $O\left(A_R^N (M + 1)^N\right)$, where $A_R^N = R!/(R-N)!$ is the number of ways to assign distinct relays to the $N$ DUs, which is impractical for large-scale scenarios. To address the problem efficiently, we decompose the optimization problem into two subproblems, i.e., resource allocation and relay selection, and tackle them sequentially.
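For orientation, the constraint descriptions above suggest a problem of the following shape. This is a hedged reconstruction from the prose only: the rate symbols $R_n^D$ (dual-hop rate of DU $n$) and $R_m^C$ (rate of CU $m$) are assumed names, and the exact objective and constraint expressions of the paper may differ.

```latex
\begin{align}
\max_{\{\alpha_{n,r}\},\,\{\beta_n^{c_m}\}} \quad
  & \sum_{n \in \mathcal{N}} R_n^{D} \tag{21a} \\
\text{s.t.} \quad
  & R_m^{C} \ge R_{th}^{C}, \quad \forall m \in \mathcal{M}, \tag{21b} \\
  & R_n^{D} \ge R_{th}^{D}, \quad \forall n \in \mathcal{N}, \tag{21c} \\
  & \beta_n^{c_m} \in \{0, 1\}, \quad \forall n \in \mathcal{N},\ \forall c_m \in \mathcal{C}, \tag{21d} \\
  & \sum_{c_m \in \mathcal{C}} \beta_n^{c_m} = 1, \quad \forall n \in \mathcal{N}, \tag{21e} \\
  & \alpha_{n,r} \in \{0, 1\}, \quad \forall n \in \mathcal{N},\ \forall r \in \mathcal{R}, \tag{21f} \\
  & \sum_{r \in \mathcal{R}} \alpha_{n,r} = 1,\ \forall n \in \mathcal{N}; \qquad
    \sum_{n \in \mathcal{N}} \alpha_{n,r} \le 1,\ \forall r \in \mathcal{R}. \tag{21g}
\end{align}
```

The labels (21b) to (21g) follow the constraint numbering used in the text; (21a) is assumed to be the objective.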
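The size of the exhaustive search space can be evaluated directly to see how quickly it grows. The sketch below counts the ordered assignments of distinct relays to DUs and multiplies by the number of per-DU resource choices; the function name is an illustrative assumption.

```python
import math


def exhaustive_search_size(num_dus, num_relays, num_cus):
    """Number of joint (relay selection, resource allocation) candidates.

    Relays: ordered choice of num_dus distinct relays out of num_relays.
    Resources: each DU picks one of num_cus RF resources or the VLC resource.
    """
    relay_assignments = math.perm(num_relays, num_dus)
    resource_assignments = (num_cus + 1) ** num_dus
    return relay_assignments * resource_assignments
```

Even a toy instance with 2 DUs, 3 relays, and 1 CU already has 24 candidates, and the count grows factorially in the number of relays and exponentially in the number of DUs, which motivates the decomposition into two subproblems.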

Coalitional Game Based Resource Allocation
Since an appropriate resource allocation solution has a large positive impact on the system throughput, we first address the resource allocation problem under random relay selection in order to approach the maximum throughput quickly. The relays are randomly selected from the candidate relays, which are described in Section 5.1. In this section, with random relays, a two-phase coalitional game is introduced to solve the resource allocation problem.

Coalitional Game Formulation
We model the resource allocation problem as a coalitional game. Specifically, each CU forms a coalition representing one RF resource, and an empty coalition is used to represent the VLC resource. Next, each DU independently and randomly chooses to join a coalition, which means that the DU shares the same resource with the other users in the chosen coalition.
In the game, $G = (\mathcal{I}, V, \mathcal{F})$ with a non-transferable utility is defined. The set of players $\mathcal{I} = \mathcal{M} \cup \mathcal{N}$ consists of both the CUs and DUs. The coalition structure is denoted by the set $\mathcal{F} = \{F_1, F_2, \ldots, F_m, \ldots, F_{M+1}\}$, where $F_m$ is the $m$-th coalition, and all coalitions are disjoint. That is, we have $F_i \cap F_j = \emptyset$ for any $i \neq j$, and $\cup_{m=1}^{M+1} F_m = \mathcal{I}$. The characteristic function $V$ denotes the coalition utility, which is expressed as: Given two coalitions $F_i$ and $F_j$, if the switch operation of DU $n$ can increase the total throughput of the system, DU $n$ will leave its current coalition $F_i$ and join the new coalition $F_j$. We say that DU $n$ prefers being a member of $F_j$ to $F_i$, which is denoted by $F_j \succ_n F_i$. Thereby, the transfer rule is:
$$F_j \succ_n F_i \Leftrightarrow CS(\mathcal{F}') > CS(\mathcal{F}),$$
where $CS(\mathcal{F}) = \sum_{m=1}^{M+1} V(F_m)$ denotes the sum rate of the current coalition structure $\mathcal{F} = \{F_1, \ldots, F_i, \ldots, F_j, \ldots, F_{M+1}\}$, and $\mathcal{F}' = \{F_1, \ldots, F_i \setminus \{n\}, \ldots, F_j \cup \{n\}, \ldots, F_{M+1}\}$ is the new coalition structure.
According to Equation (23), the D2D system reaches the maximum throughput when no DU performs any further switch operation. At this point, the final coalition structure $\mathcal{F}_{fin}$ is the solution of the resource allocation problem. More concretely, the coalitions in $\mathcal{F}_{fin}$ are exactly the optimal sets of DUs for the different resources.

Coalition Formation Algorithm
Based on the coalition structure and transfer rule described above, the QoS requirements of as many players as possible should be satisfied, so that more switch operations can be performed in the search for the globally optimal solution. Therefore, we construct a two-phase coalitional game composed of a coalition initialization phase and a coalition formation phase.
To ensure the QoS of the CUs and DUs, we design the following process for the coalition initialization phase by leveraging the combination of VLC and RF.
(1) Initialize coalitions. In the relay-aided D2D network, the advantage of using the VLC band is more prominent, as it can provide a high data rate. Specifically, VLC signals are strongly attenuated with distance, so the interference from other DU-TXs and relays operating in VLC is naturally suppressed. Moreover, VLC signals depend closely on the orientation of the D2D peers in terms of the irradiance and incidence angles, which reduces the amount of interference received. Accordingly, all DUs initially choose to be members of the coalition $F_{M+1}$.
(2) Environment sensing. DUs with a long distance or misaligned orientation are not good candidates for utilizing the VLC resource. In addition, DUs in close proximity to one another are also unsuitable for reusing the VLC resource due to the heavy interference generated. We can therefore identify the DUs that require reallocated resources by observing the data rate achieved by each DU, which directly reflects the above environmental factors.
(3) Guarantee the QoS. More concretely, we sort the data rates achieved by the DUs in descending order and filter out the DUs whose data rates are below the threshold $R_{th}^D$. In other words, these DUs are better suited to exploiting the RF resources. To this end, a priority sequence $S_n = (n_1, n_2, \ldots, n_k, \ldots, n_M)$, $n_k \in \mathcal{M}$, is designed to guide the switch operation of DU $n$, where a smaller subscript $k$ indicates that DU $n$ suffers less interference from CU $n_k$. For simplicity, the priority order is determined by the distance $d_{nm}$ between the $n$-th DU and the $m$-th CU: the larger $d_{nm}$ is, the higher the priority of DU $n$ sharing the resource of CU $m$. Note that if CU $n_k$ no longer meets the threshold $R_{th}^C$ because DU $n$ joins the coalition $F_{n_k}$, then DU $n$ should switch to the next coalition $F_{n_{k+1}}$.
In the traditional coalition formation algorithm [49], a randomly selected DU $n$ performs switch operations in a random order based on a randomly initialized coalition partition. According to the transfer rule in Equation (23), DU $n$ leaves the current coalition $F_i$ and joins the new coalition $F_j$ only when the system profit increases. However, this rule relies only on local information and may deviate from the globally optimal solution. Furthermore, the existence of users who do not satisfy the QoS demands compromises the coalition utility, which may adversely affect the switch decisions. To this end, by allowing DUs to carry out some exploratory operations with a chance probability, we introduce a greedy strategy to search for the global maximum throughput of the D2D system in our coalition formation phase. Considering the convergence rate of the algorithm, the chance probability should decrease gradually as the number of switch operations increases. Moreover, it should also depend on the system profit generated by the switch operation. More concretely, the probability of performing the exploratory operation should be reduced when the system penalty is high, i.e., when the system profit is highly negative. In this regard, the chance probability $P_c$ is designed as [50]: where $CS(\mathcal{F}') - CS(\mathcal{F})$ denotes the system profit, $L_t$ is a function of the current number of switch operations $t$, and $L_0$ is a constant.
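Step (3) of the initialization phase can be sketched as a sort followed by a first-fit scan. The sketch below assumes each DU and CU is represented by a single 2D position and that the CU QoS check is supplied as a callback; `priority_sequence`, `first_admissible_coalition`, and `cu_qos_ok` are hypothetical names introduced only for illustration.

```python
import math


def priority_sequence(du_position, cu_positions):
    """Return CU indices ordered from farthest to nearest to the DU.

    A farther CU is assumed to cause less interference, so it gets higher priority.
    """
    return sorted(range(len(cu_positions)),
                  key=lambda m: math.dist(du_position, cu_positions[m]),
                  reverse=True)


def first_admissible_coalition(sequence, cu_qos_ok):
    """Walk the priority sequence and return the first CU coalition whose CU
    still meets its rate threshold after the DU joins; None if there is none."""
    for m in sequence:
        if cu_qos_ok(m):
            return m
    return None
```

For a DU at the origin and CUs at distances 1, 5, and 3, the sequence is (CU 1, CU 2, CU 0). If joining CU 1 would violate that CU's threshold, the DU moves on to CU 2.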
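The qualitative requirements on the chance probability (it shrinks as the number of switch operations grows and as the loss becomes larger) are satisfied by a simulated-annealing-style rule. The sketch below assumes a Boltzmann form $\exp\big((CS(\mathcal{F}') - CS(\mathcal{F}))/L_t\big)$ with a logarithmic schedule $L_t = L_0/\ln(1+t)$; both choices are assumptions for illustration, since the exact expression from [50] is not recoverable from the text.

```python
import math


def temperature(t, l0):
    """Assumed cooling schedule L_t = L_0 / ln(1 + t), defined for t >= 1."""
    return l0 / math.log(1.0 + t)


def chance_probability(profit, t, l0):
    """Probability of accepting an exploratory switch with non-positive profit.

    profit: CS(F') - CS(F); t: number of switch operations so far (t >= 1).
    Returns 1.0 for a positive profit, which the transfer rule accepts anyway.
    """
    if profit > 0:
        return 1.0
    return math.exp(profit / temperature(t, l0))
```

Under these assumptions, a loss of 1 with $L_0 = 1$ is accepted with probability 0.5 at $t = 1$ and with probability 1/11 at $t = 10$, while doubling the loss at $t = 1$ lowers the probability to 0.25.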
The detailed process of the coalition formation algorithm for resource allocation is shown in Algorithm 1.

Theoretical Analysis
In this subsection, we provide the theoretical analysis in terms of convergence, stability, and optimality.
Convergence: Starting from any initial coalition structure $F_{in}$, the proposed coalition formation algorithm will always converge to a final coalition structure $F_{fin}$.
Proof: For a given number of CUs and DUs, the total number of possible coalition structures is finite. As stated before, to improve the D2D system sum rate, each switch operation is allowed to sacrifice the immediate profit with a chance probability. Nevertheless, this probability approaches zero as the number of switch operations increases, i.e., $\lim_{t \to +\infty} P_c(L_t) = 0$ if $CS(F') < CS(F)$. That is, every switch operation will eventually contribute to a higher profit, thus ensuring convergence to a final coalition structure.
Stability: The final coalition structure $F_{fin}$ of our algorithm is Nash-stable.
Proof: Suppose that $F_{fin}$ is not Nash-stable. Then there exist at least one DU $n \in \mathcal{N}$, $n \in F_m$, and a new coalition $F_{m'} \subset F_{fin}$, $F_{m'} \neq F_m$, that conform to the transfer rule $F_{m'} \succ F_m$, so that a new coalition structure $F'_{fin} \neq F_{fin}$ is formed. This contradicts the premise that $F_{fin}$ is the final coalition structure. Therefore, the final coalition structure $F_{fin}$ is Nash-stable.
Optimality: The Nash-stable coalition structure obtained by this algorithm corresponds to the optimal system solution.
Proof: Regarding the renewal of the coalition structure as the evolution of a Markov chain, we can prove that the Markov chain enters a stationary state as the number of iterations increases, which guarantees the optimality. The detailed proof can be found in [50].

Algorithm 1 (coalition formation steps)
1: Set the current structure to $F_{cur} \leftarrow F_{in}$, $t \leftarrow 0$;
2: repeat
3:   Uniformly randomly choose DU $n \in \mathcal{N}$ and denote its coalition as $F_i$;
4:   Uniformly randomly choose another coalition $F_j \subset F_{cur}$, $F_j \neq F_i$;
5:   if the switch operation from $F_i$ to $F_j$ satisfies $F_j \succ F_i$ then
6:     The chosen DU leaves coalition $F_i$ and joins coalition $F_j$;
7:     Update $t \leftarrow t + 1$ and the current structure as follows: $F_{cur} \leftarrow F_{cur} \setminus (F_i \cup F_j) \cup (F_i \setminus \{n\}) \cup (F_j \cup \{n\})$;
8:   else
9:     Draw a random number $P$ uniformly distributed in $(0, 1]$, and calculate the chance probability $P_c$;
10:    if $P < P_c$ then
11:      Allow DU $n$ to join $F_j$, update $t \leftarrow t + 1$ and the current structure as: $F_{cur} \leftarrow F_{cur} \setminus (F_i \cup F_j) \cup (F_i \setminus \{n\}) \cup (F_j \cup \{n\})$;
12:    end if
13:  end if
14: until the coalition structure converges to the final Nash-stable $F_{fin}$.
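The switch-operation loop can be sketched in a few lines. This is an illustrative toy rather than the authors' implementation: the utility `balance_utility` is a hypothetical stand-in for $CS(F)$, the chance probability uses an assumed annealing form, the transfer rule is simplified to "the system profit increases", and a fixed iteration budget replaces the convergence test.

```python
import math
import random
from typing import Callable, Dict

Assignment = Dict[int, int]  # DU index -> coalition index


def form_coalitions(
    initial: Assignment,
    num_coalitions: int,
    system_utility: Callable[[Assignment], float],
    max_iterations: int = 2000,
    l0: float = 1.0,
    seed: int = 0,
) -> Assignment:
    """Random switch operations with annealed exploratory moves (illustrative)."""
    if num_coalitions < 2:
        raise ValueError("need at least two coalitions to switch between")
    rng = random.Random(seed)
    current = dict(initial)
    t = 0  # number of switch operations performed so far
    for _ in range(max_iterations):  # fixed budget stands in for the convergence test
        du = rng.choice(sorted(current))
        target = rng.choice([c for c in range(num_coalitions) if c != current[du]])
        candidate = dict(current)
        candidate[du] = target
        profit = system_utility(candidate) - system_utility(current)
        if profit > 0:
            accept = True  # simplified transfer rule: the system profit increases
        else:
            # Assumed chance probability; the paper's exact expression is not shown.
            p_c = math.exp(profit * math.log(2 + t) / l0)
            accept = rng.random() < p_c
        if accept:
            current = candidate
            t += 1
    return current


def balance_utility(assignment: Assignment) -> float:
    """Toy stand-in for CS(F): penalise crowded coalitions."""
    sizes: Dict[int, int] = {}
    for coalition in assignment.values():
        sizes[coalition] = sizes.get(coalition, 0) + 1
    return -float(sum(size * size for size in sizes.values()))


start = {du: 0 for du in range(6)}  # all DUs begin in one coalition
final = form_coalitions(start, num_coalitions=3, system_utility=balance_utility, l0=1e-3)
print(balance_utility(start), balance_utility(final))
```

With a very small `l0` the exploratory branch is effectively switched off for losing moves, so the toy utility can never decrease; larger values of `l0` allow more early exploration.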

MARL-Based Relay Selection
After obtaining the resource allocation solution, we discuss how to select the optimal relay for each DU to further improve the D2D system sum rate. Considering the large number of DUs and relays, it may not be practical to accomplish relay selection with a centralized method because of its high signaling overhead. In this regard, each DU can be considered as an agent that independently selects a relay for assistance, which constitutes a multi-agent system. However, in the large-scale IoT, the interference among some DUs makes their relay selections co-dependent. That is, a DU needs to consider the relay selection behaviors of the other DUs within the same coalition when selecting a relay. In this section, to handle this co-dependency, we introduce a distributed cooperative MARL-based algorithm, named WoLF-PHC, which encourages the DUs to cooperate for a higher system sum rate.

Modeling of Multi-Agent Environments
By solving the resource allocation problem in Section 4, the $N$ DUs are grouped into $M + 1$ coalitions. Note that DUs in different coalitions do not interfere with each other, which implies that the DUs in coalition $F_m \subset F_{fin}$ do not need to consider the relay selection behaviors of the DUs in other coalitions $F_{m'}$, $\forall F_{m'} \subset F_{fin}$, $F_{m'} \neq F_m$. Hence, without loss of generality, we focus on the relay selection problem for the DUs in coalition $F_m$ and conduct the modeling analysis hereafter.
In the formulation of the MARL problem, all DUs act as agents and independently refine their relay selection policies according to their own observations of the global environment state. Nevertheless, if two or more agents update their policies at the same time, the multi-agent environment appears non-stationary, which violates the Markov hypothesis required for the convergence of RL [51]. Here, we model the problem as a partially observable Markov game. Formally, the multi-agent Markov game in $F_m$ is described by the 5-tuple $\langle \mathcal{N}_m, S^m, Z^{n_m}, A^{n_m}, R^m \rangle$, where $\mathcal{N}_m \subset \mathcal{N}$ is the set of DUs in $F_m$, and $|\mathcal{N}_m|$ is the total number of DUs in $F_m$; $S^m$ is the global environment state space; $Z^{n_m}$ is the local observation space for DU $n_m \in \mathcal{N}_m$, determined by the observation function $O$ as $Z^{n_m} = O(S^m, n_m)$; $A^{n_m}$ is the action space for DU $n_m$; and $R^m$ is the immediate reward that is shared by all DUs in $F_m$ to promote cooperative behavior among them. As depicted in Figure 2, at each step $t$, given the current environment state $S^m_t$, each DU $n_m$ takes an action $a^{n_m}_t$ from its action space $A^{n_m}$ according to the observation $Z^{n_m}_t$ and its current policy $\pi(a^{n_m}_t | Z^{n_m}_t)$, forming a joint action $a^m_t$. Thereafter, the environment generates an immediate reward $R^m_{t+1}$ and evolves to the next state $S^m_{t+1}$. Then, each DU receives a new observation $Z^{n_m}_{t+1}$. To be specific, at each step $t$, for DU $n_m$, the observation space $Z^{n_m}_t$, action space $A^{n_m}$, and reward $R^m_{t+1}$ are defined as follows:

Observation space: The state observed by $n_m$ can be described as $Z^{n_m}_t = a^m_{t-1}$, which comprises the actions taken by all DUs in $F_m$ at the previous step. One of the motivations behind this choice is that if the actions taken by all agents are known, the multi-agent environment becomes stationary [52]. Furthermore, in this way each DU can fully learn to cooperate with the other DUs to achieve the global optimal reward.
Action space: The action space of $n_m$ can be described as $A^{n_m} = [r : \forall r \in \mathcal{R}]$, which means that the DU can select a relay from the set of relays $\mathcal{R}$ for assistance (the terms "select a relay" and "take an action" are used interchangeably throughout the paper). Accordingly, the dimension of the action space is the total number of relays $R$. To reduce the computational complexity, we limit the number of available relays by delineating an area. For DU $n_m$, let the distance between DU-TX $s_m$ and DU-RX $d_m$ be $D^m_{sd}$. As shown in Figure 3, we create two circles of radius $D^m_{sd}$ centered at $s_m$ and $d_m$, respectively, thus forming an overlapping area. The relays located inside the overlapping area are considered the candidate relays. Subsequently, $A^{n_m}$ can be reduced to:

$$A^{n_m} = [r : D^m_{sr} \le D^m_{sd},\ D^m_{rd} \le D^m_{sd},\ \forall r \in \mathcal{R}],$$

where $D^m_{sr}$ denotes the distance between $s_m$ and $r$, and $D^m_{rd}$ is the distance between $r$ and $d_m$. Besides, we assume that the candidate relays for each DU do not overlap.

Reward: To encourage each DU to learn to collaborate with other DUs and thus maximize the D2D system throughput, the DUs in the coalition $F_m$ share a common reward $R^m_{t+1}$, which is defined as:

where $\alpha^m_t = (\alpha_{n_m, r})_{|\mathcal{N}_m| \times R}$ is the decision matrix for relay selection at the current step $t$ in coalition $F_m$, which depends on the joint action $a^m_t$. That is, if $a^{n_m}_t = r^*$, then we have $\alpha_{n_m, r^*} = 1$ and $\alpha_{n_m, r} = 0$, $\forall r \in \mathcal{R}$, $r \neq r^*$.
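Two pieces of the model above lend themselves to a short sketch: the delineated candidate-relay area (a relay qualifies when it lies within distance $D^m_{sd}$ of both the DU-TX and the DU-RX) and the one-hot decision matrix built from a joint action. The 2-D coordinates, the function names, and the use of a non-strict inequality at the circle boundary are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def candidate_relays(tx: Point, rx: Point, relays: Dict[int, Point]) -> List[int]:
    """Relays inside the overlap of the two circles of radius |tx - rx|."""
    d_sd = math.dist(tx, rx)
    return [
        relay for relay, pos in sorted(relays.items())
        if math.dist(tx, pos) <= d_sd and math.dist(pos, rx) <= d_sd
    ]


def decision_matrix(joint_action: Dict[int, int], relay_ids: List[int]) -> List[List[int]]:
    """One row per DU (sorted by DU index), one column per relay, one-hot rows."""
    rows = []
    for du in sorted(joint_action):
        chosen = joint_action[du]
        if chosen not in relay_ids:
            raise ValueError(f"DU {du} selected unknown relay {chosen}")
        rows.append([1 if relay == chosen else 0 for relay in relay_ids])
    return rows


# Toy geometry (hypothetical positions, in metres).
relays = {1: (3.0, 1.0), 2: (9.0, 0.0), 3: (-1.0, 0.0), 4: (3.0, 5.0)}
print(candidate_relays((0.0, 0.0), (6.0, 0.0), relays))   # [1, 4]
print(decision_matrix({7: 2, 3: 0}, [0, 1, 2]))           # [[1, 0, 0], [0, 0, 1]]
```

Restricting each agent to its candidate list keeps the per-agent action space small, which is what the complexity analysis later relies on.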

WoLF-PHC
In a multi-agent environment, each agent is part of the other agents' environment, which makes the environment non-stationary. Directly applying a classical single-agent RL method (e.g., Q-learning or policy gradient) in the multi-agent case may cause severe oscillations and eventually prevent convergence [53]. In contrast, WoLF-PHC, as an extension of Q-learning, adopts the principle of learning fast when losing and learning slowly when winning, which allows agents to learn moving targets with both guaranteed rationality and convergence [54]. Hence, we apply the WoLF-PHC to enable the DUs to learn their own relay selection decisions in a multi-agent system.
In the WoLF-PHC, each DU continuously interacts with the environment and the other DUs in the same coalition to update the Q-value. To simplify the notation, for DU $n_m \in F_m$, the local observation $Z^{n_m}_t$, action $a^{n_m}_t$, and action space $A^{n_m}$ at the current step $t$ are simply denoted as $Z$, $a$, and $A$, respectively; the reward received $R^m_{t+1}$, new observation $Z^{n_m}_{t+1}$, and action $a^{n_m}_{t+1}$ at the next step are denoted as $R'$, $Z'$, and $a'$, respectively. Let $Q(Z, a)$ be the estimated Q-value of action $a$ in state $Z$ during the learning process. As with the Q-learning algorithm, the update rule of the Q-value can be expressed as:

$$Q(Z, a) \leftarrow (1 - \delta)\, Q(Z, a) + \delta \left[ R' + \beta \max_{a'} Q(Z', a') \right],$$

where $\delta \in (0, 1]$ represents the learning rate, and $\beta \in (0, 1]$ is the discount factor.

To learn the optimal Q-value, the DU updates its own relay selection policy $\pi(Z, a)$, which describes the probability of taking action $a$ in state $Z$. As a generalization of the widely used greedy algorithm, the policy hill-climbing (PHC) algorithm increases the probability of taking the highest-valued action while decreasing the probability of the other actions according to the learning parameter $\theta$ [55]. Moreover, the policy should be restricted to a legal probability distribution. The resulting update rule of the policy $\pi(Z, a)$ involves the learning parameter $\theta$ and a constant coefficient $M$.
In essence, the key contribution of the WoLF-PHC is the variable learning parameter $\theta$, which takes one of two values, $\theta_w$ and $\theta_l$, with $\theta_w < \theta_l$. Which value is used to update the policy depends on whether the current policy $\pi(Z, a)$ is winning or losing. To this end, the average policy, denoted as $\bar{\pi}(Z, a)$, is introduced to judge whether the current policy is winning or losing, and can be formulated as:

$$\bar{\pi}(Z, a) \leftarrow \bar{\pi}(Z, a) + \frac{1}{C(Z)} \left( \pi(Z, a) - \bar{\pi}(Z, a) \right), \quad \forall a \in A,$$

where $C(Z)$ represents the number of occurrences of the state $Z$ observed by the DU, which is updated by:

$$C(Z) \leftarrow C(Z) + 1.$$

By comparing the expected payoff of the current policy with that of the average policy over time, the DU can choose its appropriate learning parameter $\theta$ from $\theta_w$ and $\theta_l$. If the expected value of the current policy is larger, $\theta_w$ is applied to update the policy cautiously; otherwise, $\theta_l$ is utilized to learn quickly. Accordingly, the learning parameter $\theta$ can be described as:

$$\theta = \begin{cases} \theta_w, & \text{if } \sum_{a \in A} \pi(Z, a)\, Q(Z, a) > \sum_{a \in A} \bar{\pi}(Z, a)\, Q(Z, a), \\ \theta_l, & \text{otherwise}. \end{cases}$$

The detailed process of the WoLF-PHC algorithm for relay selection is given in Algorithm 2.
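For orientation, a single WoLF-PHC agent can be sketched as below. The sketch follows the textbook form of the algorithm from the WoLF-PHC literature (Q-learning step, running average policy, win/lose test, hill-climbing step bounded so the policy stays a distribution). It is not Algorithm 2: the class name, the default hyperparameters, the epsilon exploration in `act`, and the step size $\theta / (|A| - 1)$ are assumptions, and the constant coefficient $M$ of the policy update above is not modelled.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence


class WolfPhcAgent:
    """Textbook-style WoLF-PHC learner for one DU (illustrative sketch)."""

    def __init__(self, actions: Sequence[int], lr: float = 0.1, discount: float = 0.9,
                 theta_win: float = 0.01, theta_lose: float = 0.04,
                 epsilon: float = 0.1, seed: int = 0) -> None:
        self.actions = list(actions)
        if len(self.actions) < 2:
            raise ValueError("need at least two candidate relays")
        n = len(self.actions)
        self.lr, self.discount = lr, discount
        self.theta_win, self.theta_lose = theta_win, theta_lose  # theta_w < theta_l
        self.epsilon = epsilon
        self.q = defaultdict(lambda: [0.0] * n)
        self.policy = defaultdict(lambda: [1.0 / n] * n)
        self.avg_policy = defaultdict(lambda: [1.0 / n] * n)
        self.visits = defaultdict(int)
        self.rng = random.Random(seed)

    def act(self, obs: Hashable) -> int:
        """Sample a relay from the current policy, with a little exploration."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.rng.choices(self.actions, weights=self.policy[obs])[0]

    def update(self, obs: Hashable, action: int, reward: float, next_obs: Hashable) -> None:
        i = self.actions.index(action)
        q, pi, avg = self.q[obs], self.policy[obs], self.avg_policy[obs]
        # 1) Q-learning update.
        target = reward + self.discount * max(self.q[next_obs])
        q[i] = (1 - self.lr) * q[i] + self.lr * target
        # 2) Running average of the policy in this observation.
        self.visits[obs] += 1
        for k in range(len(avg)):
            avg[k] += (pi[k] - avg[k]) / self.visits[obs]
        # 3) Winning if the current policy has the larger expected value.
        winning = sum(p * v for p, v in zip(pi, q)) > sum(p * v for p, v in zip(avg, q))
        theta = self.theta_win if winning else self.theta_lose
        # 4) Hill-climbing: shift probability mass toward the greedy action.
        best = max(range(len(q)), key=q.__getitem__)
        step = theta / (len(q) - 1)
        moved = 0.0
        for k in range(len(pi)):
            if k != best:
                delta = min(pi[k], step)  # never push a probability below zero
                pi[k] -= delta
                moved += delta
        pi[best] += moved


# Toy check: only relay 2 yields a reward, so the policy should concentrate on it.
agent = WolfPhcAgent(actions=[0, 1, 2])
for _ in range(200):
    for relay in agent.actions:
        agent.update("s", relay, 1.0 if relay == 2 else 0.0, "s")
print(agent.policy["s"])
```

In the paper's setting the observation `obs` would be the tuple of previous relay choices in the coalition, and `reward` the common coalition reward broadcast to all DUs.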

Neighbor-Agent-Based WoLF-PHC
In the WoLF-PHC, we define the observation space of agent $n_m$ as the past joint action of all agents within coalition $F_m$, so as to guarantee the stability of the multi-agent environment. For a setting with 10 DUs and four resources, Figure 4 visualizes the geographic locations of all DUs and the resource allocation result before relays are reselected, where different colors represent different resources. It can be seen that nearby DUs use different resources, while more distant DUs share the same resource. In other words, the DUs in a coalition are far apart from each other. In the case of limited-range D2D communication, the interference between any candidate relay of DU $n_m$ and a remote DU $n^r_m$ can be considered the same and negligible. Similarly, the interference between any candidate relay of $n^r_m$ and $n_m$ can be considered the same. Thus, the relay selection decisions of $n_m$ and $n^r_m$ are independent of each other. That is, it is not necessary to have the historical actions of all agents to ensure stability; the actions of neighboring agents are enough. Accordingly, we propose a lightweight algorithm, named neighbor-agent-based WoLF-PHC, that allows the target agent to observe the actions of a fixed number of neighboring agents. In the neighbor-agent-based WoLF-PHC, for the target agent $n_m$, we define the $\lambda$ agents nearest to $n_m$ within the coalition as the neighboring agents of $n_m$. Moreover, the observation space $Z^{n_m}$ is changed from the joint action $a^m_{t-1}$ to the joint action of neighbors $a^{m,nb}_{t-1}$, where $a^{m,nb}_{t-1} = \{a^i_{t-1}, i \in \mathcal{N}^{nb}_{n_m}\}$ comprises the actions of the $\lambda + 1$ agents in the neighboring set $\mathcal{N}^{nb}_{n_m} \subset \mathcal{N}_m$, which incorporates $n_m$ itself and its neighboring agents. Note that if $|\mathcal{N}^{nb}_{n_m}| = |\mathcal{N}_m|$, the neighbor-agent-based WoLF-PHC is the same as the WoLF-PHC.
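The neighbor set and the reduced observation can be sketched as follows. Representing each DU by a single 2-D position, breaking distance ties by DU index, and the function names are all assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def neighbor_set(target: int, positions: Dict[int, Point], lam: int) -> List[int]:
    """The target DU followed by its `lam` nearest DUs in the same coalition."""
    others = sorted(
        (du for du in positions if du != target),
        key=lambda du: (math.dist(positions[target], positions[du]), du),
    )
    return [target] + others[:lam]


def local_observation(target: int, positions: Dict[int, Point],
                      prev_actions: Dict[int, int], lam: int) -> Tuple[int, ...]:
    """Previous-step relay choices of the target DU and its neighbors."""
    return tuple(prev_actions[du] for du in neighbor_set(target, positions, lam))


# Toy coalition with four DUs (hypothetical positions and previous actions).
positions = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (5.0, 5.0), 4: (0.0, 2.0)}
prev_actions = {1: 7, 2: 3, 3: 9, 4: 5}

print(neighbor_set(1, positions, lam=2))                      # [1, 2, 4]
print(local_observation(1, positions, prev_actions, lam=2))   # (7, 3, 5)
```

When `lam` is at least the coalition size minus one, the observation covers every DU in the coalition, which matches the remark that the scheme then coincides with the full WoLF-PHC.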

Complexity Analysis
The complexity of the proposed joint resource allocation and relay selection algorithm can be analyzed from the following two parts.
One part of the complexity comes from the resource allocation scheme based on the coalitional game. The computational complexity of the resource allocation scheme is $O(I_{in})$, where $I_{in}$ is the number of inner iterations required to converge to the final coalition structure.
Another part of the complexity arises from the relay selection scheme based on the WoLF-PHC or the neighbor-agent-based WoLF-PHC. For each agent $n \in \mathcal{N}$, the computational complexity is $O(|Z_n|^2 |A_n|)$, where $|Z_n| < N$ is the observation space size of $n$, and $|A_n| < R$ is the action space size of $n$. For the WoLF-PHC, the overall complexity is at most $O(N (Z^*)^2 A^*)$, where $Z^* = \max_{n \in \mathcal{N}} |Z_n|$ denotes the largest size of the observation space, and $A^* = \max_{n \in \mathcal{N}} |A_n|$ is the largest size of the action space. For the neighbor-agent-based WoLF-PHC, the overall complexity is at most $O(N (\lambda + 1)^2 A^*)$, where the setting parameter $\lambda$ is in general much smaller than $Z^*$.
Therefore, the total complexity of our proposed algorithms is $O(I_{in} N (Z^*)^2 A^*)$ or $O(I_{in} N (\lambda + 1)^2 A^*)$. To obtain the global optimal solution, apart from solving the subproblems sequentially, an ideal algorithm usually requires multiple outer iterations until the D2D sum rate no longer rises. As a result, the complexity of the ideal proposed algorithm (IPA) additionally depends on $I_{ou}$, the number of outer iterations.
However, the relays reselected by any agent come from its corresponding delineated area, i.e., the candidate relays are close to each other, so reselecting relays has little impact on the resource allocation solution. In this way, the performance of our proposed algorithms, which have a lower complexity, is expected to approximate that of IPA. Hence, our proposed algorithms are more suitable than IPA for large-scale scenarios.
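A quick numeric reading of the two relay-selection bounds is given below. The sizes are hypothetical and chosen only to show how the $(Z^*)^2$ versus $(\lambda + 1)^2$ factor drives the gap; the helper name is likewise an assumption.

```python
def relay_selection_cost(num_agents: int, obs_size: int, action_size: int) -> int:
    """Evaluate the bound N * Z^2 * A for given sizes."""
    return num_agents * obs_size ** 2 * action_size


N, Z_STAR, A_STAR, LAM = 18, 8, 5, 3   # hypothetical sizes

full = relay_selection_cost(N, Z_STAR, A_STAR)      # WoLF-PHC bound
local = relay_selection_cost(N, LAM + 1, A_STAR)    # neighbor-agent-based bound
print(full, local, full / local)                    # 5760 1440 4.0
```

With these toy values the neighbor-agent-based variant is four times cheaper, since the ratio of the two bounds is $(Z^* / (\lambda + 1))^2$.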

Signaling Overhead Analysis
The signaling overhead of our proposed algorithm is also analyzed in two parts. On the one hand, since the resource allocation mechanism is implemented in a centralized manner, the signaling overhead mainly comes from the process of acquiring CSI, which can be classified into transmission CSI and interference CSI. Concretely, in the relay-aided D2D network, the transmission CSI covers the links from CUs to the BS, from DU-TXs to the corresponding relays, and from these relays to DU-RXs; the interference CSI covers the links from CUs to the relays and DU-RXs, and from DU-TXs to the BS. When the numbers of CUs, DUs, and relays are $M$, $N$, and $R$, respectively, the signaling overhead for the CSI measurement in a centralized manner is $O(2NR + MR + 2N + 2M)$ according to the evaluation method in [56]. In contrast, the signaling overhead for CSI measurements can be reduced to $O(2NR)$ in a distributed manner, which usually comes at the expense of the global system performance. Note that $R$ is generally assumed to be larger than $N$ and $M$, so as to ensure the reliability of relay-aided D2D communication. Thus, the difference in signaling overhead between these two manners is not significant.
On the other hand, the distributed relay selection mechanism is performed independently in each coalition without exchanging information among coalitions, which greatly reduces the signaling overhead. However, to encourage the DUs within a coalition to achieve the global optimal reward in a collaborative way, each DU needs to upload its own historical information to the BS, including the actions taken and the data rate obtained. Then, the BS broadcasts the actions of all DUs within the coalition along with a common reward. All of the information exchanged between the DUs and the BS is numerical data with a size of only a few kilobytes, which leads to a small signaling overhead. Consequently, this part of the overhead is negligible compared to that incurred by the centralized resource allocation mechanism.
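To put numbers on the CSI terms, the short sketch below evaluates both expressions for a setting of the size used later in the simulations (two CUs, 18 DUs, 50 relays). The function names are assumptions; the two formulas are the ones stated above.

```python
def centralized_csi_overhead(m: int, n: int, r: int) -> int:
    """2NR + MR + 2N + 2M link measurements."""
    return 2 * n * r + m * r + 2 * n + 2 * m


def distributed_csi_overhead(n: int, r: int) -> int:
    """2NR link measurements."""
    return 2 * n * r


M, N, R = 2, 18, 50
central = centralized_csi_overhead(M, N, R)
distributed = distributed_csi_overhead(N, R)
print(central, distributed, round(central / distributed, 3))   # 1940 1800 1.078
```

For these sizes the centralized count exceeds the distributed one by under 8%, consistent with the remark that the difference is not significant when $R$ dominates $N$ and $M$.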

Numerical Results
In this section, we present numerical results to evaluate our proposed algorithm. In our simulation, we consider a 30 m × 30 m room in which CUs utilize RF resources for uplink communication, and relay-assisted DUs run applications that require high-rate communication; the DUs can choose either the VLC-D2D or the RF-D2D communication mode. Furthermore, the distance between the transmitter and receiver of each DU is uniformly distributed with an upper bound of 10 m, which makes the cooperation gain obtained by combining VLC and RF the most notable [7]. Moreover, the idle relays available are evenly distributed and their number is fixed at 50 [57]. To model a realistic VLC-D2D communication channel, we assume that the irradiance and incidence angles follow a Gaussian distribution with a mean value of 0° and a standard deviation of 30° [7]. We repeat the simulations 200 times independently and average the results, thus mitigating the randomness of the above parameters. Considering the QoS requirements, the minimum rate thresholds of the CUs and DUs are set to 10 Mbps and 20 Mbps, respectively. Additional simulation parameters are listed in Table 1.

At first, by comparing with the exhaustive algorithm (EA), we further demonstrate the optimality of the proposed coalitional game (PCG)-based resource allocation in practice. Meanwhile, we compare the performance of the proposed joint resource allocation and relay selection algorithm, namely PCG-WP, with that of the corresponding IPA. In this case, we present the D2D system sum rate under the above algorithms for varying numbers of CUs and DUs. Given the high complexity of EA and IPA, we fix the number of DUs at eight in Figure 5 and the number of CUs at two in Figure 6. From these two figures, on the one hand, we can observe that the sum rate of the D2D system achieved by PCG is very close to that achieved by EA, which demonstrates that our proposed PCG can achieve a sum rate close to EA, but with a lower complexity. On the other hand, the sum rate gap between PCG-WP and IPA is insignificant. Concretely, the sum rate of IPA is at most 10% larger than that of PCG-WP.

Performance Analysis of WoLF-PHC for Relays Selection
Then, based on the final coalition structure obtained by PCG, we employ RL algorithms to reselect relays for the DUs in each coalition. In this simulation, we use Q-learning (QL) as a comparative algorithm to evaluate the convergence performance of our proposed WoLF-PHC (WP). In addition, we also show the convergence performance of the algorithms that exploit only local information, including the neighbor-agent-based WoLF-PHC (NWP) and the neighbor-agent-based Q-learning (NQL). For the sake of simplicity, we denote the NWP working with $\lambda$ neighboring users as N$\lambda$WP, and likewise N$\lambda$QL.
Figure 7 compares the convergence of the above four approaches in terms of the total reward when the number of DUs is 10 and the number of CUs is one. The total reward is the sum of the rewards received by all coalitions. From Figure 7, the proposed WP converges to the maximum total reward of about 1150 at nearly 2700 steps, while N3WP converges to a close-to-maximum reward of about 1070 after only around 1500 steps. Therefore, N3WP reduces the number of steps to converge by approximately 44.4% at the cost of a total reward loss of about 6.9%. By contrast, both QL and N3QL fail to converge and exhibit poor performance, although N3QL appears more stable (fewer fluctuations) than QL. On the one hand, capitalizing on the "win or learn fast" mechanism, the WP-based approaches present a much better convergence performance than the QL-based approaches. On the other hand, the approaches that utilize local information (N3WP and N3QL) can greatly reduce the state space, thereby accelerating convergence at the cost of a tolerable performance loss, while the complexity of IPA grows exponentially. This result further confirms the feasibility of replacing IPA with PCG-WP.

Performance Comparison
Next, we compare the two proposed schemes, PCG-WP and PCG-N3WP, with the following five baseline schemes:
(1) Random algorithm (RA). In this scheme, each D2D pair is assisted by a randomly selected relay and randomly uses either the VLC resource or the RF resource of any CU.
(2) PCG-based resource allocation and random relay selection (PCG-RD). To investigate the potential gain of jointly optimizing resource allocation and relay selection, the PCG-RD, which optimizes only the resource allocation, is considered as a comparative algorithm.
(3) Random resource allocation and WP-based relay selection (RD-WP). Similar to the PCG-RD above, the RD-WP, which optimizes only the relay selection, is regarded as a comparative algorithm to analyze the joint optimization gain.
(4) Traditional coalitional game [49] and WP-based relay selection (TCG-WP). In this scheme, the resource allocation problem is addressed by the traditional coalitional game with random initialization and formation, and the WP method is used for relay selection.
(5) Best response dynamics (BRD) in [58]. Compared with our proposed cooperative scheme, each DU in this scheme is selfish and aims at maximizing its own throughput. In both the resource allocation and relay selection stages, every DU simultaneously optimizes its actions with respect to the action profile composed of the actions played by the other DUs in the same coalition at the previous time.
In Figure 8, we evaluate the impact of the number of CUs on the D2D system sum rate under different schemes. Here, the number of DUs is enlarged to 14 and the number of CUs varies from one to eight. As the number of CUs increases, the performance of both RA and RD-WP declines slightly and then levels off, although the RA curve exhibits slight fluctuations due to randomness. The performance degradation is due to the short distance (up to 10 m) between the transceivers of each D2D pair, which makes VLC superior to RF. When the number of CUs equals one, the probability of randomly selecting the VLC resource for every DU is up to 50%, so the sum rate of both RA and RD-WP reaches its maximum. However, the performance of the remaining schemes improves as the number of CUs increases, thanks to the rational resource allocation. Among them, the selfish BRD exhibits the worst performance, while the cooperative PCG-WP obtains the best one. This is because each DU in BRD optimizes its own profit, regardless of the interference introduced to other DUs. When three CUs are involved, the sum rate of PCG-WP is larger than that of PCG-N3WP, TCG-WP, and BRD by about 5.2%, 13.3%, and 27.2%, respectively. As the number of CUs increases further, which implies that the number of DUs within a coalition decreases, PCG-N3WP becomes sufficient to characterize the global information and thus achieves almost the same throughput as PCG-WP. Meanwhile, the sum rate gap between PCG-WP and TCG-WP gradually narrows. This is because the switch operations in TCG-WP are no longer limited by QoS constraints when resources are adequate. In addition, when the number of CUs is five, PCG-WP outperforms PCG-RD and RD-WP by about 19.0% and 44.5%, respectively, which highlights the gain of joint optimization.
Moreover, in Figure 9, we focus on the system performance in terms of the outage probability, which is calculated as the ratio of users who do not meet the QoS demands to the total number of system users. As can be seen, the outage probability declines sharply as the number of CUs increases. The underlying reason is that more CUs naturally contribute to lower interference. When the number of CUs equals one, BRD performs worst because of the ping-pong effect between the VLC resource and the RF resource of the CU. As the number of CUs grows, however, its performance surpasses that of RD because the probability of the ping-pong effect decreases. In combination with Figure 8, it can be noticed that BRD outperforms RD-WP in terms of the sum rate, while its outage performance is slightly worse than that of RD-WP. This is due to the nature of BRD, i.e., improving the rate of some DUs at the expense of others. More importantly, PCG-N3WP achieves almost the same, and lowest, outage probability as PCG-WP. Note that TCG-WP initially exhibits a poor performance, and its performance exceeds that of our proposed PCG-WP and PCG-N3WP when the number of CUs is larger than seven. This makes sense because, when resources are sufficient, an affordable amount of individual DU performance can be sacrificed for the sake of the overall system performance in our schemes.

Figure 10 compares the D2D system sum rate of the different mechanisms in a resource-lacking system, with the number of CUs fixed at two and the number of DUs varying from four to 18. It is shown that increasing the number of DUs can boost the total throughput, and PCG-WP always achieves the highest total throughput. Moreover, as the traffic congestion worsens, the gap between PCG-WP and the other competitive schemes grows. When 18 DUs are involved, PCG-WP achieves a 129.2% higher total throughput than the baseline RA. Another observation is that when the number of DUs is larger than 12, the performance of all schemes except the proposed PCG-WP and PCG-N3WP shows little improvement. It can be inferred that, without the effective joint gain of the resource allocation and relay selection, the gain from increasing the number of DUs alone no longer compensates for the loss from the resulting severe interference. Concretely, in the context of insufficient resources, PCG has more prominent advantages over TCG in finding the optimal solutions. The reason is that the QoS requirements of users restrict the switch operations that TCG can perform, which leads to a deviation from the optimal solution. In contrast, PCG satisfies the QoS demands as much as possible in the initialization stage, and the greedy policy further allows the system to explore more operations in the formation stage, which enhances the sum rate performance. The last observation is that when the number of DUs increases, the advantage of exploiting global information for relay selection becomes obvious.
In Figure 11, we can observe that the outage probability goes up as the number of DUs increases because of the fierce competition for resources and relays. In contrast to Figure 9, the performance of BRD is slightly better than that of RD-WP, which suggests that an efficient resource allocation scheme may be more important than an appropriate relay selection scheme in the resource-scarce environment, and vice versa. In addition, increasing the number of DUs makes the gap between PCG-WP and the other algorithms more notable.

Moreover, we study the impact of the number of neighboring users $\lambda$ on the system performance in terms of the D2D system sum rate and the convergence rate. The convergence rate is indicated by the reciprocal of the number of iterations to converge. In Figure 12, the number of CUs remains two and the number of DUs equals 18. As expected, if $\lambda$ decreases, the sum rate decreases as well, while the convergence rate increases greatly. More specifically, decreasing $\lambda$ from seven to three reduces the sum rate by 10.3% and also reduces the number of iterations to converge by 82.9%. Evidently, PCG-NWP trades a small throughput loss for a significantly faster convergence rate. In this regard, users can make a trade-off between throughput and convergence performance according to their preferences and practical constraints.
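The outage probability reported in Figures 9 and 11 is described above as the ratio of users who miss their QoS demand to all system users. A minimal sketch of that ratio is shown below, using the 10 Mbps and 20 Mbps thresholds from the simulation setup; the function name and the toy rates are assumptions.

```python
from typing import Sequence


def outage_probability(cu_rates_mbps: Sequence[float], du_rates_mbps: Sequence[float],
                       cu_threshold: float = 10.0, du_threshold: float = 20.0) -> float:
    """Share of users (CUs and DUs together) whose rate is below its threshold."""
    total = len(cu_rates_mbps) + len(du_rates_mbps)
    if total == 0:
        raise ValueError("at least one user is required")
    failed = sum(rate < cu_threshold for rate in cu_rates_mbps)
    failed += sum(rate < du_threshold for rate in du_rates_mbps)
    return failed / total


# Toy rates (hypothetical): one CU and one DU miss their thresholds.
print(outage_probability([12.0, 9.0], [25.0, 18.0, 30.0, 21.0]))
```

Here two of the six users are in outage, so the ratio is one third.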

Summary of Main Results
To present the results of this article more clearly, this section summarizes the main conclusions as follows:
(1) In the resource allocation stage, our proposed PCG can achieve a sum rate close to that of EA, but with a lower complexity.
(2) Compared with WoLF-PHC (WP), the neighbor-agent-based WoLF-PHC (N3WP) reduces the number of steps to converge by approximately 44.4% at the cost of a total reward loss of about 6.9%.
(3) Our proposed WP presents a much better convergence performance than the QL-based approaches.
(4) The approaches that utilize local information (N3WP and N3QL) can greatly reduce the state space, thereby accelerating convergence.
(5) Optimizing only the resource allocation or only the relay selection, with the other chosen randomly, cannot maximize the overall performance. Appropriate methods applied to joint optimization are indispensable.
(6) In the resource-lacking system, our proposed WP or NWP shows greater advantages.

Conclusions
In this paper, we proposed an efficient joint resource allocation and drone relay selection algorithm with low complexity and signaling overhead for the large-scale IoT. With relays randomly selected from a delineated area, a two-phase coalitional game-based algorithm was proposed to solve the resource allocation problem. Then, a WoLF-PHC-based algorithm was proposed to solve the relay selection problem. Meanwhile, the lightweight neighbor-agent-based WoLF-PHC was introduced to further reduce the complexity. Simulation results demonstrated that our algorithms outperformed the considered benchmarks, especially in traffic congestion scenarios. Moreover, the appropriate number of neighboring users can be chosen based on preferences and practical constraints when applying our relay selection algorithm.

Figure 2. The agent-environment interaction in the MARL formulation of the relay selection in relay-aided networks.

Figure 3. The delineated area of candidate relays.

Figure 4. DUs geographic location and resource allocation results visualization.

Figure 5. Sum rate under EA/PCG and IPA/PCG-WP vs. number of CUs.

Figure 6. Sum rate under EA/PCG and IPA/PCG-WP vs. number of DUs.

Figure 7. Convergence performance of QL and WP based algorithms.

Figure 8. Sum rate of different methods vs. number of CUs.

Figure 9. Outage probability of different methods vs. number of CUs.

Figure 10. Sum rate of different methods vs. number of DUs.

Figure 11. Outage probability of different methods vs. number of DUs.

Figure 12. System performance for different number of neighboring users λ.

Algorithm 1: The Coalition Formation Algorithm for Resource Allocation
1: Initialize coalition structure $F_{in}$: $F_m = \{m\}$, $m \in \mathcal{M}$, $F_{M+1} = \mathcal{N}$;
2: Collect data rate R