Multi-Cell LTE-U/Wi-Fi Coexistence Evaluation Using a Reinforcement Learning Framework.

Cellular broadband Internet of Things (IoT) applications are expected to keep growing year-by-year, generating demand for high throughput services. Since some of these applications are deployed over licensed mobile networks, such as long term evolution (LTE), they face an already common problem: the scarcity of licensed spectrum to cope with the increasing demand for data rate. The LTE-Unlicensed (LTE-U) forum, aiming to tackle this problem, proposed LTE-U to operate in the 5 GHz unlicensed spectrum. However, Wi-Fi is already the consolidated technology operating in this portion of the spectrum, and new technologies entering the unlicensed band need mechanisms to promote fair coexistence with the legacy ones. In this work, we extend the literature by analyzing a multi-cell LTE-U/Wi-Fi coexistence scenario with a high interference profile and data rates targeting a cellular broadband IoT deployment. Then, we propose a centralized, coordinated reinforcement learning framework to improve LTE-U/Wi-Fi aggregate data rates. The added value of the proposed solution is assessed through ns-3 simulations, showing improvements not only in the overall system data rate but also in the average user data rate, even with the high interference of a multi-cell environment.


Introduction
The increasing use of the licensed wireless spectrum, as well as its scarcity, stems from the massive demand for mobile data, as the number of wireless devices connected to the Internet is expected to keep growing every day. According to [1], the number of mobile broadband subscriptions in the first quarter of 2019 reached 6 billion, corresponding to a growth rate of 15 percent year-on-year. For cellular Internet of Things (IoT) applications (services using a cellular infrastructure with no need for a new private network), the same report also estimates that 4.4 billion devices will be connected via broadband by the end of 2024. The services corresponding to these IoT deployments require a large amount of data, e.g., multimedia data, augmented reality (AR), and virtual reality (VR).
Historically, previous cellular broadband IoT applications have been enabled by the third generation (3G) of mobile communication. Still, with the increasing demand, the fourth generation (4G) rises as the primary enabler, since it can better cope with such demand. However, the spectrum of long term evolution (LTE) 4G deployments, or of LTE-M (a modified LTE for IoT applications introduced in 3rd generation partnership project (3GPP) Release 13 [2]), requires licenses earned in an auction by the operators, which is expensive. Additionally, most of the licensed bands already have an overused profile, which is impracticable for IoT deployment, especially for broadband services and a significant number of devices. Then, one alternative to overcome those drawbacks is to leverage the unlicensed spectrum, since it is free to use and offers cheap additional bandwidth.

The system model definition and preliminary results are presented in Section 4. The proposed framework is presented in Section 5. In Section 6, we present and discuss the results of our solution. Section 7 presents the conclusions and future perspectives of this work.

Related Works
We can find several works dealing with the coexistence of the new unlicensed LTE technologies and the consolidated Wi-Fi [11][12][13][14]. One common finding of these works is that the LTE solutions are, for the evaluated scenarios, a better neighbor for Wi-Fi than Wi-Fi itself in terms of interference and capacity. Another important finding is that this coexistence problem has characteristics that make it well suited to machine learning techniques for reducing the coexistence impact. In [15], the specific coexistence of LTE-U and Wi-Fi is analyzed; its main analysis concerns how adjusting the duty cycle parameter influences system throughput.
Medium access mechanisms are key blocks to allow coexistence between different technologies in the unlicensed spectrum. Thus, a mismatch between parameters can lead one technology to dominate others in accessing the communication channel [16]. Even with this concern, LTE-LAA, with its carrier-sense multiple access with collision avoidance (CSMA-CA)-like mechanism, can achieve a performance equal to or better than Wi-Fi coexisting with Wi-Fi [12]. For LTE-U, channel access alternatives based on channel sensing and dynamic adaptation of the duty cycle have been proposed [17,18]. The authors in [18] proposed an approach to disable standard LTE transmissions on specific subframes using the almost blank subframe (ABS) LTE functionality. Their results demonstrate that LTE-DC can operate harmoniously with Wi-Fi and even LTE-LAA by adequately adjusting the duty cycle [19][20][21].
Some classic optimization techniques are used to enhance coexistence metrics. Aiming at a better Quality of Service (QoS) level, a solution based on joint optimization of transmission time, carrier allocation and power is proposed in [22]. The authors also used throughput- and delay-based user association as the metric for LAA/Wi-Fi coexistence. They achieved the theoretical performance with their solution, although there is a trade-off between the desired QoS level and solution complexity. In addition to QoS, fairness is another metric exploited by classical optimization techniques. The first algorithm proposed in [23], based on connection admission control (CAC), is responsible for fairness in the sharing of resources. The second algorithm uses a duty cycle approach, similar to the one proposed for LTE-U but applied to LTE-LAA, to improve delay-related metrics. Still on LAA/Wi-Fi coexistence, the authors of [24] provide an analytical study of the influence of energy detection (ED) on the throughput of each technology. Their conclusions indicate that maximum system throughput can be achieved by adjusting the ED threshold. A branch of classical approaches called game theory is also used to solve the problem of LTE/Wi-Fi coexistence. In [25], a coalition game-based framework is evaluated in a real deployment of commercial routers serving as a testbed of an LTE-U/Wi-Fi scenario; in this scenario, with the proposed solution, the goal of the users in the game is to maximize system throughput (the sum over all users). Another game, cooperative and based on Nash bargaining, is presented in [26]. LTE-U users' QoS is enhanced by sharing resources between LTE-U and Wi-Fi while protecting Wi-Fi.
Coexistence solutions can be classified regarding the radio resource management (RRM) strategy that is modified to achieve coexistence. We summarize the following list of solutions:

• Carrier detection and sensing: The authors of [27] implement adjustments on the LTE MAC layer to support this technique. After evaluations, they concluded that LTE presents throughput gains without decreasing the performance of Wi-Fi systems. Aiming at channel access and QoS fairness, Li et al. [28] devised an enhanced LBT scheme to adaptively tune the clear channel assessment (CCA) threshold of LTE-LAA and avoid interference to Wi-Fi. For LTE-U coexistence, two channel sensing schemes are proposed in [29]: periodic sensing and persistent sensing. In the periodic scheme, the LTE-U APs sense the medium for a fraction of time within each subframe and then determine whether they may transmit data during the remaining subframe. In the persistent scheme, the LTE-U APs sense the channel during entire subframes until the channel becomes free, and then transmit data during the following several subframes.

• Link adaptation: Using already defined Wi-Fi features to support coexistence, the authors in [30] proposed a link-adaptation algorithm based on Wi-Fi's MAC distributed coordination function (DCF) protocol to enhance LTE-LAA's throughput performance. The authors in [31] use a scheme similar to Wi-Fi's request to send (RTS)/clear to send (CTS), enabling protection for the LTE cells and helping Wi-Fi stations save energy.

• Power control: In [32], the authors affirmed that the static power control used for Wi-Fi might be used in a similar form for LTE in unlicensed bands, controlling coexistence interference. The authors of [33] compared the duty cycle mechanism with the power control mechanism for the LTE-U uplink, which resulted in a higher average user throughput for both LTE-U and Wi-Fi compared to LTE-U with a fixed duty cycle of 80%. Regarding the ABS mechanism, they also affirm that it is a conservative solution, and that coexistence may be better addressed using some power control approach.

• Spectrum slicing: The authors of [34] adopt spectrum slicing, a dynamic and static spectrum sharing through division of the spectrum into several partitions. In their solution, each partition can be exclusively accessed by one mobile network operator with few concerns about power control, therefore enhancing the coexistence scenario. Furthermore, works like [35] propose the use of game theory frameworks to mitigate interference in LTE-U/Wi-Fi coexistence and show improvements in overall throughput.

• Channel selection: In scenarios where spectrum slicing is not possible, a solution can be to use channel selection algorithms to search for clean channels [15]. In [36], the authors propose an adaptive LBT mechanism that continually switches between channels, preventing any channel from being occupied for a long time. Adopting a priority access approach, the authors of [37] set LTE-U with a higher priority than Wi-Fi to access the channel and demonstrated that, with this approach, coexistence is enhanced, since LTE-U is more robust to interference. Another way is to limit the LTE presence to increase the chance of Wi-Fi transmitting, as presented in [38].
Most of the techniques mentioned so far are non-cooperative, since they do not rely on cooperation between the coexisting technologies to mitigate interference. However, the coexistence problem presents attributes that enable the application of cooperative techniques. The authors of [36] proposed a centralized cooperative control management that employs the virtualization of some network functions, allowing the continuous transfer of resources between unlicensed technologies using cloud control of the distributed access points. In [18,39], the proposed solution has networks with modes of operation in which messages are exchanged to negotiate which mode the networks should operate in and which parameters to use to enable better coexistence. If a coexisting system is detected, a synchronization phase is started and, subsequently, a negotiation phase to choose the coexistence parameters.
As previously mentioned, for regions where LBT is mandatory and transmissions are limited to bursts, coexistence can be enabled by disabling standard LTE transmissions on specific subframes using the almost blank subframe (ABS) LTE functionality [18]. In [40], the authors address coexistence by optimizing the probability of access to the channel. They proposed a proportional fair allocation scheme based on the equal distribution of channel access among competing entities, also considering idle periods, successful transmissions and collisions for Wi-Fi networks. In turn, the authors of [13] leveraged the time division duplex (TDD) functionality of LTE to support various numbers of uplink and downlink frames to grant fair coexistence between LTE-LAA and Wi-Fi. The authors of [41] proposed another technique to adapt the number of subframes according to the traffic load of WLAN systems. Still considering LAA/Wi-Fi coexistence, the work in [42] investigates the transmission time offered to LTE-LAA to evaluate its performance compared to Wi-Fi.
Regarding the use of machine learning techniques to improve the coexistence of access technologies in the unlicensed band, the authors of [43,44] survey use cases and appropriate techniques that can be leveraged for wireless communication problems in general. In [45], a reinforcement learning technique called double Q-Learning is proposed for LAA and Wi-Fi coexistence. This solution is based on discontinuous transmission (DTX) and transmit power control (TPC) to accurately select the channel, which leads to better coexistence of LAA and Wi-Fi, as stated by the authors. Also, the authors of [46] proposed two solutions, one with Q-Learning and the other with game theory, to tackle the problem of channel selection in an indoor scenario. Both solutions indeed enable better coexistence, providing improvements in data rate. The work in [47] proposes an algorithm to dynamically select the LTE-U duty cycle ratio in coexistence with Wi-Fi. The results demonstrate that system capacity increased for simulations with file transfer protocol (FTP) data traffic. Still using Q-Learning, an algorithm applied in a multichannel scenario dynamically chooses the least populated channel and also the duty cycle value of that channel to improve system performance [48]. Two other algorithms also use the dynamic choice of the duty cycle to achieve fairness in coexistence, differing only in the metric used [49,50].
In our previous contribution [10], we proposed a reinforcement learning framework based on Q-Learning applied to LTE-U to automatically adapt the duty cycle parameter to deal with the interference profile between the coexisting systems. However, the evaluated scenario presented only one device and one cell for each operator. So, to better evaluate the coexistence scenario between these two technologies with the proposed framework, as well as to extend our previous conclusions, we propose herein a multi-cell evaluation, with more devices (and cells) so that the interference profile is more realistic and more challenging to track. Thus, in this paper, we highlight the following contributions:

• A study of the coexistence problem between LTE-U and Wi-Fi in a multi-cell scenario with co-channel and inter-radio access technology (RAT) interference under a changing offered data rate that better approximates a possible cellular broadband IoT deployment. This evaluation differs from the previous one [10] due to the inclusion of multiple users and multiple interfering RATs.

• A new centralized Q-Learning mechanism to dynamically adjust the duty cycle pattern so that it can achieve higher aggregated data rates per user and operator. This centralized mechanism differs from the one proposed in [10], as it operates under a co-channel interference deployment, while the previous one, embedded in a single operator, only considers a single source of interference. Furthermore, the modeling of the proposed solution in [10] takes the operator's data rate to be the same as the user's data rate, since there is only one user per operator, whereas the modeling in this current work discriminates between these data rates. Also, there is no coordination in the previous mechanism, and consequently no communication among cells (base stations).

• Extended evaluations and conclusions regarding LTE-U/Wi-Fi coexistence, without and with the proposed framework, in the multi-cell scenario with a changing offered data rate and a more challenging interference profile.

Wi-Fi and LTE-U MAC Layer
Medium access control is the enabler for the communication of multiple stations in a network since it provides channel access control, besides addressing and power control. In a coexistence scenario, each operator can use a different mechanism to control the medium access. So, it is imperative to review the basics of Wi-Fi and LTE-U and show their differences. In this section, we review the 802.11 MAC layer, followed by some specifications of LTE-U.

The 802.11 MAC Layer
Since the first release of the 802.11 standards in 1997, several enhancements have been developed in the physical layer targeting higher data rates, besides amendments to support additional features. But even with these enhancements, the MAC layer of the subsequent standards still shares the same primary MAC mechanism, which is based on the distributed coordination function (DCF). This function defines how 802.11 devices access the wireless medium, as described next.
The DCF is classified as an LBT contention-based mechanism and operates in a distributed manner without any coordination. Whenever a station (STA) wishes to transmit, it first has to perform a clear channel assessment (CCA) for a fixed duration of time called distributed inter-frame spacing (DIFS) [51]. If the medium stays idle for a DIFS, then the station can take control of the medium and start transmitting for a period called channel occupation time (CoT). As the station transmits, it controls the channel by keeping a gap between each sent frame of a sequence, the short inter-frame spacing (SIFS), which is the time needed for the transmitter to receive an acknowledgment (ACK) frame indicating that the receiver successfully received the frame. To prevent other stations from accessing the channel during the transmission, SIFS is always smaller than DIFS, so that stations do not mistake a SIFS gap for a completely idle channel. But if the channel becomes busy before the DIFS completes, the station defers for another DIFS plus a random backoff time. If the medium remains idle for a DIFS duration plus the backoff time, the STA determines that it can take control of the channel and begins to transmit [9].
The random backoff time in the DCF approach provides the collision avoidance feature, since whenever the channel transitions from busy to idle, multiple STAs may have been waiting and may be ready to transmit simultaneously. Thus, allowing each station to select a random time decreases the probability of collisions when the channel goes idle. To initiate a random backoff procedure, a station waits for the channel to be idle for a DIFS, then it picks an integer random backoff count in the [0, CW] range, where CW is the contention window.
The CW value, used in the backoff procedure, always starts at a lower bound CW_min, and every time an unsuccessful transmission happens, the value of CW is updated according to CW = 2 · (CW + 1) − 1, with CW_max being the upper bound. Once a successful transmission occurs, CW is reset to the lower bound CW_min [9].
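The backoff and contention-window rules above can be sketched in a few lines. This is a minimal illustration using the common 802.11 defaults CW_min = 15 and CW_max = 1023 slots; the constants and function names are ours, not part of any standard API:

```python
import random

CW_MIN = 15    # typical 802.11 lower bound (slots)
CW_MAX = 1023  # typical 802.11 upper bound (slots)

def draw_backoff(cw):
    """Pick an integer random backoff count in [0, CW], as in DCF."""
    return random.randint(0, cw)

def update_cw(cw, success):
    """Reset CW to CW_min on success; otherwise double it via
    CW = 2 * (CW + 1) - 1, capped at CW_max."""
    if success:
        return CW_MIN
    return min(2 * (cw + 1) - 1, CW_MAX)
```

For example, after one failed transmission starting from CW_min, the window grows from 15 to 31, then 63, and so on up to the CW_max cap.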
DCF uses two types of carrier sense functions to check the state of the medium:

• Physical (PHY) carrier sense: Two PHY functions, based on preamble detection (from Wi-Fi transmissions) and on energy detection (from all transmissions in the bandwidth), check the channel's state. By measuring the energy level of the channel and comparing it against a threshold, these functions determine whether the medium is busy [52].

• Virtual carrier sense: A MAC function based on a field of the MAC header carrying the expected transmission duration. All stations that receive this information set their network allocation vector (NAV) and wait until this time has passed before sensing the channel again [52].
Important changes are provided in other 802.11-related amendments, such as: a 60 GHz PHY and a scheduled-fashion MAC access in IEEE 802.11ad [53,54] (for broadband access with limited coverage); a modified MAC to support a higher number of connected devices in IEEE 802.11ah [55] (achieving a trade-off between the goals of the wireless personal area network (WPAN) and the low-power wide-area network (LPWAN)); and the improvements of dual-band (2.4 and 5 GHz) operation in IEEE 802.11ax (Wi-Fi 6) [56,57], enabling bandwidth-hungry applications in dense deployment scenarios (4K/8K video, virtual and augmented reality (VR/AR) applications).

The LTE-U Coexistence Mechanisms
LTE-U is underpinned by three mechanisms to enable coexistence with Wi-Fi and other LBT-like technologies in the unlicensed band: supplemental downlink (SDL), channel selection and carrier-sense adaptive transmission (CSAT) [15]. These mechanisms are implemented in software and keep the LTE deployment compatible with 3GPP Releases 10 and 11 [15]. Figure 1 presents the relational flowchart of the three mechanisms that enable LTE-U coexistence. Supplemental downlink (SDL) was proposed in 3GPP Release 9 to enable the bonding of paired and unpaired spectrum bands for enhancing downlink transmissions over single-cell deployments [58]. This feature enables the use of one unpaired band to create a downlink-only carrier besides the traditional downlink/uplink frequency division duplex (FDD) scheme in the paired spectrum. Release 10 introduced the possibility of bonding up to three carriers by means of carrier aggregation, another feature introduced in that release. SDL with carrier aggregation allows the transmission of data to a station using more than one downlink carrier and presents a division into the primary cell (PCell) carrier and secondary cell (SCell) carrier. For LTE-U, the PCell is anchored in a licensed band and carries control and signaling information, while the SCell is anchored in the unlicensed band and is used to boost downlink capacity opportunistically. Depending on the traffic load, the LTE base station (eNodeB) can turn the SCell on or off. If the load is light, the eNodeB can use only the PCell and turn off the secondary one, thus mitigating the interference in the unlicensed band. But if the load is heavy, the SCell is enabled and operates following the channel selection coexistence mechanism.
Channel selection enables the LTE cell to choose the cleanest channel for the secondary cell based on measurements of the coexisting LTE and Wi-Fi technologies. It provides a behavior where LTE-U always tries to avoid interference, choosing the channel with the lowest interference profile and releasing the remaining channels to Wi-Fi or other LTE-U base stations. Channel selection operates on an ongoing basis: every time the transmitting station finds a cleaner channel, it begins an operation of switching to that channel. The interference measurements are based on energy detection, which does not discriminate the types of interfering sources. The interference measurement can also be technology-specific, based on the detection of Wi-Fi preambles or LTE-U synchronization signals. Furthermore, for some LTE-U deployments, the channel selection mechanism alone already satisfies coexistence requirements [15].
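The channel selection behavior described above can be sketched as follows. This is a simplified illustration; the hysteresis margin and all function names are our own assumptions, since the actual LTE-U algorithm details are vendor-specific:

```python
def select_cleanest_channel(energy_dbm):
    """energy_dbm: dict mapping channel number -> measured interference (dBm).
    Return the channel with the lowest measured energy (the 'cleanest')."""
    return min(energy_dbm, key=energy_dbm.get)

def maybe_switch(current, energy_dbm, hysteresis_db=3.0):
    """Switch to the cleanest channel only if it is at least hysteresis_db
    dB quieter than the current one, to avoid ping-ponging between channels
    (the hysteresis margin is an illustrative assumption)."""
    best = select_cleanest_channel(energy_dbm)
    if energy_dbm[best] + hysteresis_db < energy_dbm[current]:
        return best
    return current
```

For instance, with measurements {36: -62 dBm, 40: -80 dBm}, a SCell on channel 36 would switch to channel 40; with {36: -82 dBm, 40: -80 dBm} it would stay put.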
If channel selection does not find a clear channel, the carrier-sense adaptive transmission (CSAT) algorithm is used to enable coexistence. CSAT is a time division multiplexing (TDM) algorithm based on long-term medium sensing. Before any transmission attempt, the CSAT algorithm senses the medium for a longer duration than the other coexisting technologies and, according to the observed medium activity, it proportionally adopts a duty cycle behavior, activating and deactivating the SCell, as illustrated in Figure 2. During the activated periods, LTE-U transmits as standard LTE, and during the deactivated periods, the channel is released so that the coexisting technologies can transmit. LTE-U operating with the CSAT algorithm is also referred to in the literature as LTE-DC [10,12,21]. As CSAT was mainly proposed by Qualcomm, the details of its algorithm implementation are hidden, opening up the possibility of alternative implementations by the industrial and academic community.
After a successful CSAT procedure, the duty cycle behavior in LTE-U can be performed by the almost blank subframe (ABS) functionality. This solution, presented in [18], follows the feature of the same name defined in Release 10, but in the context of heterogeneous networks (HetNets). In LTE HetNets, cells have different sizes (macro, micro, pico and femto). In the essential ABS operation in this context, during the transmission time of the macro cell, data transmission only occurs in individual subframes; thus, the remaining subframes can be used by the other cells. In the context of LTE-U, the data can be transmitted in the defined subframes while Wi-Fi, for example, can exploit the blanked LTE-U subframes to perform its transmissions. The proportion of subframes to blank out is defined by the duty cycle ratio informed by the CSAT algorithm after its medium sensing procedure.
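Since the ns3-lbt implementation used later defines the ABS pattern over 40 subframes of 1 ms each, building the ON/OFF bitmask from a duty cycle ratio can be sketched as below. The contiguous ON block is a simplifying assumption of ours; real ABS patterns may interleave active and blanked subframes:

```python
def abs_pattern(duty_cycle, length=40):
    """Build a bitmask of `length` subframes (1 = LTE-U transmits,
    0 = blanked subframe released to coexisting technologies) for a
    given duty cycle ratio. Each subframe lasts 1 ms, so the default
    pattern spans 40 ms."""
    on = round(duty_cycle * length)
    return [1] * on + [0] * (length - on)
```

For a duty cycle of 0.6, the 40 ms pattern contains 24 active subframes and 16 blanked ones.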

System Model, Evaluation Scenario and Preliminary Results
Before any attempt to incorporate the proposed reinforcement learning framework into the coexistence problem, the evaluation scenario must be defined, and the preliminary results must show why it is possible and necessary to use the proposed framework. This section presents the scenario as well as the preliminary results without the proposed reinforcement learning algorithm.

System Model and Evaluation Scenario
To better evaluate the coexistence problem between Wi-Fi and LTE-U, 3GPP proposed evaluation scenarios in the technical report TR 36.889 [7]. Among the proposed scenarios, the indoor (hotspot) scenario is the most challenging and complex, since it comprises not only multiple users but also interference between the same and different access technologies. Figure 3 presents the 3GPP indoor scenario considered in this work. This scenario is composed of two operators, called operator A and operator B, in an unlicensed band, each deploying four small cells. The deployment is in a building room with dimensions of 50 × 120 m, no walls, and with the four base stations of each operator equally spaced along the X-axis, defined by the d and bs_space distances. Besides the four cells, forty stations are randomly distributed in this rectangular region with no mobility [7]. We use the ns-3 simulator [59] to model Wi-Fi, LTE-U and their coexistence. ns-3 is an open-source C++ network simulator based on discrete event simulation for research and educational purposes, complying with technical norms from standards organizations like IEEE, 3GPP and the Wi-Fi Alliance. As part of a project funded by the Wi-Fi Alliance, the Centre Tecnològic de Telecomunicacions de Catalunya (CTTC) and the University of Washington cooperated to create a branch of this simulator with the system modeling and coexistence evaluation of LTE-U and Wi-Fi, named ns3-lbt [60]. In this version, the implemented duty-cycled transmission of LTE-U follows the approach presented in [18], which is based on the LTE ABS functionality, where the ABS pattern is defined with a duration of 40 ms. Since each subframe of an LTE transmission lasts 1 ms, one bitmask of 40 values is defined (one per subframe) to control the ON-OFF pattern. Both the core and the discussed modifications of this simulator can be found at http://code.nsnam.org/laa/ns-3-lbt/file/9529febb7ebc.
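For illustration, the deployment geometry described above can be sketched as follows. The concrete d and bs_space values below are placeholder assumptions, not the exact 3GPP figures, and the uniform random drop is a simplification of the TR 36.889 station placement:

```python
import random

ROOM_X, ROOM_Y = 120.0, 50.0  # indoor room dimensions (m)

def place_base_stations(n=4, d=10.0, bs_space=25.0, y=ROOM_Y / 2):
    """Place n base stations of one operator on a line along the X-axis:
    the first at x = d, then one every bs_space meters (d and bs_space
    values here are illustrative)."""
    return [(d + i * bs_space, y) for i in range(n)]

def drop_stations(n=40, seed=1):
    """Uniformly drop n static stations inside the room."""
    rng = random.Random(seed)
    return [(rng.uniform(0, ROOM_X), rng.uniform(0, ROOM_Y)) for _ in range(n)]
```

With the placeholder values, one operator's four cells would sit at x = 10, 35, 60 and 85 m along the room's centerline.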
Since the evaluation scenario presents a challenging interference profile, simulations were performed for different offered data rates at each duty cycle to map possible patterns. Following this approach, the DC value used is in the range of 0.2 to 0.9 with a granularity of 0.1. The offered data rate follows the full buffer scheme, i.e., user datagram protocol (UDP) transmission with a constant bit rate, since we want to analyze the situation where there is always data to transmit, thus allowing an intense dispute for access to the medium. This UDP offered data rate represents the available data rate capacity for each station, and a variable called UDPRate controls it. The parameters used in the simulation, for both LTE-U and Wi-Fi, are presented in Table 1. The simulation is set to run for 20 s, a sufficient duration to observe the relation between the offered data rate and duty cycle values in this scenario. Figure 4 presents the relation between the transmitted data rate (per operator) and the duty cycle value for UDPRate = 2 Mbps. In theory, the expected transmitted data rate for each operator is the sum of its users' rates (without interference); thus, each operator would be expected to reach 20 · 2 = 40 Mbps. As we can see, this theoretical value is nearly reached by each operator for some DC values, but never at the same time. Due to this behavior, the aggregated transmitted data rate (LTE-U + Wi-Fi) presents its maximum only for a DC of 0.6, even though this DC value does not yield the best possible data rate for Wi-Fi. This duty cycle value indicates that, for this configuration and offered data rate, the system reaches its maximum rate with LTE-U having more channel access time than Wi-Fi. Furthermore, LTE-U's data rates for DC values of 0.6 and above do not present meaningful variations, indicating that, for this UDPRate, as the DC value keeps increasing, the most affected access technology is Wi-Fi.
The same pattern also occurs for Wi-Fi for DC values below 0.5. Figure 5 is similar to Figure 4, but now with UDPRate = 4 Mbps. For this offered data rate, the maximum aggregated transmitted data rate is reached for a DC value of 0.4, indicating that the system, in this situation, reaches its maximum with Wi-Fi having more channel access time than LTE-U. Furthermore, for this UDPRate, the transmitted data rate of each operator is more affected by the DC value. While for UDPRate = 2 Mbps and duty cycle values greater than 0.6 the LTE-U transmitted rate changes only slightly, for UDPRate = 4 Mbps the LTE-U data rates keep increasing with the DC value, and the Wi-Fi ones keep decreasing. As a general result for the multi-cell environment, the duty cycle value has a significant influence on the transmitted data rates in combination with the UDPRate used. This pattern also occurs in other scenarios and with different data rates, as presented in [10] for a single-cell scenario, showing a general pattern in this LTE-U/Wi-Fi coexistence problem. The problem also becomes more complicated if the UDPRate changes over time, as the best DC value changes with each offered data rate. Then, when setting the goal of maximizing the system throughput in a scenario with a dynamic offered data rate, it is necessary to use a tool for automatically choosing the optimal DC value on the fly. The next section presents our proposed reinforcement learning framework based on Q-Learning to address this problem.

Proposed Reinforcement Learning Framework
If we set the goal of maximizing the aggregated transmitted data rate, then as time passes and the offered data rate changes, a new DC value must be chosen to keep the performance at its maximum point, turning the problem into a sequential decision-making problem. In this decision-making problem, each newly chosen duty cycle value leads the system to a new state (a new value of aggregated transmitted data rate) that can change if the offered data rate also changes. So, to choose the best duty cycle value in each case, the correlated decision-making problem must be solved. We adopt Q-Learning as a solution to this problem.

Q-Learning
Q-Learning is a reinforcement learning algorithm proposed in [61] that combines dynamic programming and Monte Carlo concepts to solve sequential decision-making problems in the form of Markov decision processes (MDPs) [62]. In an MDP, the decision-maker is called the agent, and everything else is called the environment. As the agent interacts with the environment, it chooses actions, and in turn the environment responds to these actions in the form of observations or new situations. Among these observations returned by the environment are the rewards, numerical values that the agent uses to achieve a goal, i.e., maximizing the sum of all received rewards. Figure 6 presents the relationship between these concepts in a diagram. Practically all reinforcement learning algorithms involve the estimation of a value function [62]. The value function is a map from state-action pairs to a value that estimates how good or bad a given state is for the agent to be in. This notion of "how good" is defined in terms of the immediate and all expected future rewards that an agent receives as the learning process occurs, and since the future rewards depend on the actions taken, a track of actions defines a policy π. Formally, a policy is a mapping from states to actions, and it changes as a result of the agent's learning process.
If at time t the selected action is a and the system is in state s under policy π, the action-value function for policy π is defined as

$$ q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right], $$

where $\mathbb{E}_\pi$ is the expected value under policy π, $R_{t+k+1}$ are the rewards, and γ is the discount factor controlling the relation between immediate and future rewards. Solving this reinforcement learning task means finding a policy that returns the maximum possible reward over the long run. There is always an optimal policy, denoted π*, that is better than or equal to all other policies [63]. The optimal action-value function for this optimal policy is then defined as

$$ q_*(s, a) = \max_\pi q_\pi(s, a). $$

However, solving the reinforcement learning task for $q_*$ is not trivial, since it requires the transition probabilities associated with the states and actions taken. To solve this problem, Q-Learning defines a Q-value that approximates $q_*$ independently of the policy being followed. This Q-value is updated by

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right], \quad (1) $$

where $S_t$ is the current state; $A_t$ is the chosen action; α is the learning rate, indicating how strongly each new experience updates the Q-value; and γ is a parameter that controls how much future rewards influence the update of Q. If γ is set to 0, the agent learns to take into account only immediate rewards, while if γ is set to 1, the agent weighs future rewards as heavily as immediate ones. Both α and γ lie in the range 0 to 1. The agent's learning task consists of applying the update of Equation (1) for every state it reaches, paired with the corresponding action. As presented in [61,62], the only requisite for convergence is that all pairs (s, a) continue to be updated. Under this assumption, as each pair is visited several times and the corresponding Q-value is updated, the algorithm converges with probability 1 to the optimal $q_*$ [61,62].
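To make the update rule concrete, the following is a minimal tabular sketch of Equation (1), assuming a dictionary-based Q-table; the function name and toy values are illustrative, not part of the paper's implementation.

```python
# Minimal sketch of one tabular Q-Learning update (Equation (1)).
# alpha (learning rate) and gamma (discount factor) are both in [0, 1].
def q_update(Q, s, a, reward, s_next, actions, alpha=0.3, gamma=0.5):
    """Move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Toy usage: two states, two actions, Q-table initialized to zero.
actions = [0, 1]
Q = {(s, a): 0.0 for s in [0, 1] for a in actions}
q_update(Q, s=0, a=1, reward=10.0, s_next=1, actions=actions)
print(Q[(0, 1)])  # 0 + 0.3 * (10 + 0.5 * 0 - 0) = 3.0
```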

Proposed Framework
In our work, we propose a centralized, coordinated algorithm to maximize the system aggregated transmitted data rate, and on the basis of Q-Learning we formulate the sets of states S and actions A as follows: • The set of states S comprises four intervals of the aggregated transmitted data rate T_tx_wifi + T_tx_lteu, where T_tx_wifi is the sum of the transmitted data rates of all four Wi-Fi cells, and T_tx_lteu is the sum over all four LTE-U cells. The proposed algorithm then works in a centralized way, operating only on the per-operator data rate, LTE-U or Wi-Fi, and not on each cell. In this way, the state formulation can be simplified, since the algorithm operates on two operators rather than four cells; thus, the number of states can be reduced to only the four states presented. Furthermore, defining the algorithm to work in a centralized operation also softens issues related to energy consumption, since the algorithm runs in a single node in a coordinated way (or in a cloud server); • The reward for each chosen DC value is set to T_tx_wifi + T_tx_lteu, i.e., the aggregated transmitted data rate the system reaches for that DC. The general goal is to minimize the cost function C = T_target − (T_tx_wifi + T_tx_lteu), which can also be viewed as the maximization of the reward T_tx_wifi + T_tx_lteu over time. T_target is the desired aggregated transmitted data rate and is set as a parameter.

• The goal is set to maximize the system aggregated transmitted data rate.
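In code form, the reward and cost function above reduce to the following sketch (values in Mbps; T_target = 160 Mbps, as set in the evaluation; function names are illustrative):

```python
T_TARGET = 160.0  # desired aggregated transmitted data rate, in Mbps

def reward(t_tx_wifi, t_tx_lteu):
    """Reward: the aggregated transmitted data rate for the chosen DC."""
    return t_tx_wifi + t_tx_lteu

def cost(t_tx_wifi, t_tx_lteu):
    """Cost C = T_target - (T_tx_wifi + T_tx_lteu)."""
    return T_TARGET - reward(t_tx_wifi, t_tx_lteu)

print(reward(70.0, 85.0))  # 155.0 Mbps
print(cost(70.0, 85.0))    # 5.0 Mbps short of the target
```

Minimizing the cost toward zero is equivalent to maximizing the reward toward T_target.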
The choice of these DC values is based, firstly, on the trade-off between computational complexity and the number of actions. This trade-off is intrinsic to the Q-Learning algorithm, since the higher the number of actions and states, the longer the algorithm takes to converge. Secondly, these values represent four situations, ranging from Wi-Fi having more air time than LTE-U, for DC = 0.2, to the opposite situation of LTE-U having more air time than Wi-Fi, for DC = 0.8. Thus, after convergence, the best action (DC value) for the current state is determined by the optimal Q-value, indicating what percentage of air time for each technology is more suitable at that moment for achieving the optimum goal.
The intervals between states were chosen as equally spaced intervals from 0 to M, where M is the maximum aggregated transmitted data rate that the system can achieve; it depends on the parameters of each system and on the coexisting environment. The value of M differs from T_target: M is the real maximum the system can achieve with the parameters used and without the proposed solution, while T_target is the desired maximum we want to achieve using the proposed algorithm.
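As a small illustration, the mapping from an observed aggregated data rate to one of the four equally spaced state intervals in [0, M] can be sketched as follows (M = 160 Mbps, as used in the evaluation; the function name is illustrative):

```python
M = 160.0     # maximum aggregated transmitted data rate, in Mbps
N_STATES = 4  # four equally spaced intervals in [0, M]

def throughput_to_state(t_agg):
    """Map an aggregated transmitted data rate (Mbps) to a state index."""
    if t_agg >= M:
        return N_STATES - 1  # clamp values at or above M to the top state
    return int(t_agg / (M / N_STATES))

print(throughput_to_state(30.0))   # 0 (interval 0-40 Mbps)
print(throughput_to_state(100.0))  # 2 (interval 80-120 Mbps)
```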
Since the proposed solution operates in a centralized and coordinated manner, the new duty cycle value chosen by the algorithm, in each step, must be sent to all LTE-U cells by the central node. Regarding the simulation, as we use a discrete event simulator, the parameter parsing is done by function calls and occurs in the duty cycle updating phase, incurring no latency (and no overhead) for sending the new duty cycle value from the central node. However, in real deployments, this communication between nodes can be achieved by jointly utilizing the X2 interface and the X2AP services. Defined by 3GPP in [64], X2 is the denomination of the interface that connects one eNodeB to another for exchanging signaling messages, and X2AP is the related protocol. X2AP provides several functions, such as mobility management, cell configuration update, and error reporting, which can be used by our proposed Q-Learning mechanism for sending the updated DC value with controlled overhead. However, when it comes to latency, additional studies must be performed, considering that the latency over the X2 interface is highly influenced by the network topology, transport technology (e.g., optical fiber and copper-based), and capacity of the backhaul link, which can vary the latency from hundreds of microseconds to 20 ms [65,66]. Figure 7 shows a flowchart of the proposed solution. From a macro perspective, the proposed algorithm has two main steps. The first one, at the LTE-U cells, updates the current DC value and reports the reward (data rates) to the central Q-Learning node. The second step, at the central Q-Learning node, performs a Q-Learning cycle, yielding a new DC. This new DC value is reported back to the LTE-U cells, where step one starts again.
To complement, the pseudo-code of the proposed centralized, coordinated solution based on Q-Learning is presented in Algorithm 1.
Algorithm 1: Q-Learning application for dynamic duty-cycle selection in a multi-cell environment.
1  Initialize:
2  for s ∈ S, a ∈ A do
3      Start Q-table with random Q-values Q(s, a).
4  end
5  Estimate initial state s.
6  Learning:
7  Loop
8      Generate a random value r ∈ U(0, 1)
9      if r < ε then
10         Select an action a ∈ A randomly;
11     else
12         Select an action a ∈ A according to max_a Q(s, a);
13     end
14     Run a;
15     Take the immediate reward, which is T_tx_wifi + T_tx_lteu;
16     Observe the next state s' ∈ S;
17     Update the Q-table according to Equation (1).
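The steps of Algorithm 1 can be sketched in code as follows. This is a hedged, self-contained illustration: the action set and the `run_duty_cycle` toy environment (which stands in for the ns-3 simulation and the data rates reported by the LTE-U cells) are assumptions for demonstration, not the paper's exact values.

```python
import random

ACTIONS = [0.2, 0.4, 0.6, 0.8]   # assumed candidate duty cycle (DC) values
STATES = range(4)                # four aggregated-throughput intervals
ALPHA, GAMMA, EPSILON = 0.3, 0.5, 0.2

def run_duty_cycle(dc):
    """Toy stand-in for the environment: (reward in Mbps, next state)."""
    reward = 160.0 * (1.0 - abs(dc - 0.6))  # peaks at DC = 0.6 by construction
    return reward, min(int(reward / 40.0), 3)

def q_learning(steps=3000, seed=0):
    rng = random.Random(seed)
    q = {(s, a): rng.random() for s in STATES for a in ACTIONS}  # lines 1-4
    s = 0                                                        # line 5
    for _ in range(steps):                                       # line 7
        if rng.random() < EPSILON:                               # lines 8-10
            a = rng.choice(ACTIONS)                              # explore
        else:                                                    # lines 11-13
            a = max(ACTIONS, key=lambda x: q[(s, x)])            # exploit
        r, s_next = run_duty_cycle(a)                            # lines 14-16
        best_next = max(q[(s_next, x)] for x in ACTIONS)
        q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)]) # Equation (1)
        s = s_next
    return q

q = q_learning()
best_dc = max(ACTIONS, key=lambda a: q[(3, a)])
print(best_dc)
```

In this toy model the environment rewards DC = 0.6 the most, so the learned policy converges toward selecting it in the high-throughput state.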

Proposed Solution Evaluation
A new simulation to evaluate the proposed Q-Learning algorithm was performed following the same basic parameters for Wi-Fi and LTE-U presented in Table 1, but with modifications to the offered data rate of each station and to the total simulation duration. The proof-of-concept simulation was performed according to the following scheme: • The simulation has a duration of 250 s, during which the offered data rate changes at specific points as the simulation runs.

•
The specific points in time where the offered data rate changes are determined by drawing intervals uniformly from 10 to 15 s, which are then summed to the current time until the 250 s limit is reached. In this way, during the 250 s simulation, there will be between 16 and 25 data rate changes (total duration divided by the maximum and minimum intervals, respectively).
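The change-point generation described above can be sketched as follows (the 10-15 s interval and 250 s limit are from the text; the function name and seed are arbitrary):

```python
import random

def change_points(total=250.0, lo=10.0, hi=15.0, seed=42):
    """Accumulate uniformly drawn gaps until the simulation limit."""
    rng = random.Random(seed)
    t, points = 0.0, []
    while True:
        t += rng.uniform(lo, hi)
        if t >= total:
            break
        points.append(t)
    return points

pts = change_points()
print(len(pts))  # always between 16 and 24 change points for these parameters
```

Since every gap lies in [10, 15] s, a 250 s run contains at least 16 and at most 24 change points strictly before the limit.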

•
The Q-Learning algorithm always operates at the end of the current ABS interval, which is 40 ms, choosing the DC value for the next ABS interval until it converges.

•
To gain statistical confidence, 100 snapshots/repetitions of the 250 s simulation were run. In each of these snapshots, the stations are uniformly distributed in the scenario; thus, in each snapshot, the stations are in different positions, generating a different interference profile.

•
Regarding the states, M is set to 160 Mbps and T_target = M; • Regarding the Q-Learning parameters, the values of α and γ were chosen as 0.3 and 0.5, respectively. These values yielded the best convergence of the algorithm in the test phase of the proposed solution.

Figure 8 presents a bar graph of the simulation results for some fixed DC values and for our proposed framework. With the proposed framework, the maximum aggregated transmitted data rate is increased compared with all duty cycle values without the solution. When compared with the best static results, DC = 0.6 and DC = 0.7, Q-Learning increases the maximum aggregated data rate by around 3 Mbps, corresponding to a 5% increase. Compared with the lowest offered UDP rate (500 kbps), this gain corresponds to a six-fold increase. Focusing on each operator, the LTE-U and Wi-Fi transmitted data rates present gains over those for DC = 0.6. However, against the other best static DC value, 0.7, only Wi-Fi presents a gain over the static case, even though the system still reaches its maximum. Furthermore, the LTE-U data rate with Q-Learning does not reach the maximum it can achieve (at DC = 0.9), nor does the Wi-Fi data rate reach its maximum (at DC = 0.2). Still, the goal was set to maximize the aggregated T_tx_lteu + T_tx_wifi. The algorithm chose the best combination of duty cycle values for each state the system passed through in order to reach the goal, which led to a combination of per-operator data rates that was not the maximum possible, but the necessary one for the goal, as would be expected. The results so far have only shown the aggregated transmitted data rate perspective, but Q-Learning also indirectly operates on each user's transmitted data rate. Figure 9 presents a comparison between the cumulative density functions (CDF) of the user transmitted data rate for Q-Learning and for the best static duty cycle value, 0.6.
Observing the Q-Learning curves, the transmitted data rate of practically all Wi-Fi users improved, while all LTE-U users maintained their performance. The successive duty cycle values chosen by the Q-Learning algorithm led the system to a behavior where Wi-Fi users have more air time, and consequently more data rate, than with the static DC = 0.6, while the mean data rate of LTE-U users is maintained, improving the aggregated one. Furthermore, the gain of around 3 Mbps in the aggregated performance presented in Figure 8 is higher than the maximum user transmitted data rate for both Wi-Fi and LTE-U, which is 2.7 Mbps. Since the maximum user data rate is the same with or without the proposed solution, we conclude that the gain in the aggregated data rate is distributed over the users with less data rate. For the duty cycle value of 0.7, the CDF presents a different behavior, as shown in Figure 10. The Wi-Fi stations present even more gains with Q-Learning, although the gain comes at the cost of worse performance for the LTE-U stations. With Q-Learning, the 60th percentile shows that, compared with the case of DC = 0.7, the Wi-Fi gains affect the LTE-U stations with fewer data rates. Furthermore, the gain of 3 Mbps presented at the system level, in contrast to the DC = 0.6 case, is distributed over practically all Wi-Fi stations; however, the LTE-U stations with fewer data rates are severely affected. For the 10th percentile, the Wi-Fi Q-Learning results are better than those for all duty cycle values higher than 0.5, while the LTE-U results are better than those for duty cycle values lower than 0.5. For the 90th percentile, the Wi-Fi results with the proposed solution are better than those for all DC values higher than 0.4, while the LTE-U results are better than those for duty cycle values lower than 0.5. Still, the 90th-percentile analysis shows that the LTE-U values do not vary much for high duty cycle values: these duty cycles do not significantly affect this percentile for LTE-U, only for Wi-Fi.

Conclusions and Future Works
In this work, we propose and evaluate a centralized, coordinated reinforcement learning framework, based on the Q-Learning algorithm, to deal with a multi-cell coexistence scenario between LTE-U and Wi-Fi. This scenario, unlike those of previous works, is more challenging since it comprises not only more cells but also interference within the same and between different access technologies. Investigation results showed that each duty cycle value leads the system to a different maximum data rate, which also depends on the offered data rate provided to each user in the coexistence scenario.
The automatic duty cycle selection of our proposed solution is accomplished within a time window of 40 ms, achieving an improved aggregated transmitted data rate. This gain is distributed over the Wi-Fi users, with Q-Learning providing gains in both system and user transmitted data rates.
Regarding multimedia IoT deployments, since our solution adopts a centralized and coordinated approach, the issues related to battery consumption are diminished, because the additional energy consumption due to the Q-Learning algorithm applies only to the central node that runs it. Furthermore, the benefits of our framework match one of the goals of cellular broadband IoT deployments, namely data rates sufficient to transmit multimedia data while keeping a reasonable energy consumption. Thus, in the trade-off between energy consumption and increasing data rate, our solution achieves improvements in data rate without an energy consumption cost for each terminal.
One drawback of our solution is that, as modeled, the interference between the same access technologies is not explicitly taken into account, even though it is implicitly included in the collected T_tx_wifi and T_tx_lteu values due to lost packets. If we also explicitly considered interference measurements, the results could be even better, since the algorithm would use this information to improve the choice of the duty cycle value. We address this as an extension of our solution and future work, in which we could also explore modifications regarding QoS or a decentralized approach with minimal additional energy consumption.

Funding: This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.