Multi-objective UAV Positioning Mechanism for Sustainable Wireless Connectivity in Environments with Forbidden Flying Zones

Abstract: Unmanned aerial vehicle (UAV)-based communication systems are a promising solution to meet the coverage and capacity requirements of future wireless networks. However, UAV-enabled communication is constrained by its coverage, energy consumption, and flying regulations, and only a limited number of works in the literature have focused on the sustainability aspect of UAV-assisted networking. In this paper, we address this limitation; in particular, we design a Q-learning-based UAV positioning scheme for sustainable wireless connectivity that considers key constraints, namely altitude regulations, no-fly zones, and transmit power. The objective is to find the optimal position of the UAV base station (BS) that minimizes the energy consumption while maximizing the number of users covered. Moreover, a weighting mechanism is developed, whereby the energy consumption and the number of users covered can be prioritized according to network/battery conditions. The proposed Q-learning-based solution is compared to a baseline k-means clustering method, in which the UAV BS is positioned at the centroid location that minimizes the cumulative distance between the UAV BS and the users. The results demonstrate that the proposed solution outperforms the baseline k-means clustering-based method in terms of the number of users covered while achieving the desired minimization of the energy consumption.


I. INTRODUCTION
It is well established that the number of subscriptions to mobile communication networks has been increasing over the years, and that each new generation becomes more dominant than the legacy ones within a few years of its first deployment. A report by Ericsson projects that the number of subscriptions for the fifth generation of mobile communications (5G) will be around 3.5 billion worldwide [1]. Mobile data traffic has also been rising globally; the same report reveals that global mobile data traffic increased by 46% during 2020 and that the monthly total global data traffic exceeded 66 exabytes (EB) in the first quarter of 2021 [1].
All these statistics point to the same conclusion: the number of users and the amount of data consumed per user have been rising dramatically over the years. In 5G, this increase is even more pronounced because there are more demanding emerging applications, including tactile Internet, 4K video streaming, and online gaming, and because the Internet of things (IoT) has been proliferating and pervading our daily lives across various domains, such as healthcare and manufacturing [2]. This means that the two components of the increase in data traffic (i.e., the number of connected devices and the amount of data consumed per device) are more challenging issues in 5G, as the aforementioned data-hungry applications increase the amount of data consumed while IoT greatly contributes to the number of connected devices. Therefore, in 5G, the scale of the challenge is multiplied compared to that of the legacy networks, and thus more sophisticated solutions are needed to tackle this unprecedented challenge.
Various concepts and technologies have been proposed in the literature to address the aforementioned capacity issues. The use of millimeter-wave (mmWave) frequencies, massive multi-input multi-output (mMIMO), and network densification are some of the most popular and practical ones [3]-[6]. Each of these technologies has a different set of advantages and disadvantages; however, they mainly target capacity enhancement in mobile communication networks. With mmWave communications, for example, additional spectrum is added to 5G networks (it has already been included in 5G New Radio (NR) as frequency range 2 [7]), and thus the capacity is increased with this additional bandwidth. The use of a higher carrier frequency also enables smaller antenna sizes, which subsequently enables mMIMO antenna arrays, enhancing the capacity further [3], [5]. Network densification, on the other hand, deploys smaller base stations (BSs) with comparatively lower transmit power in order to reuse the frequency band, leading to considerable capacity enhancement [3].
Even though all these solutions are quite beneficial in enhancing the capacity of mobile communication networks, there is still room for improvement, since the spatio-temporal changes in wireless networks pose another type of challenge. More specifically, unusual circumstances, including expositions, sports competitions, and musical concerts, where many more people than normal gather together and significantly increase the demand for wireless communications¹, need to be tackled in a more sophisticated and intelligent manner. This is mainly because such events do not happen often (only a few times a year); hence, it is not a good idea to dimension the network for them. In this regard, a BS mounted on a UAV (referred to as a UAV BS hereafter) has been a promising solution to meet the strict user requirements of coverage, capacity, and quality of service (QoS). With the dawn of 5G networks and related technologies, user requirements are becoming more diverse, as the users come from diverse groups including conventional user equipments (UEs), IoT devices, machines, vehicles, etc. The UAV-assisted communication system is a solid use-case for the next generation of mobile communications given that UAVs are flexible, easy to deploy, and cost-efficient [8]. This provides a boost to the terrestrial cellular infrastructure, as UAVs can be deployed to provide extra coverage or to increase the capacity in a given area.
Indeed, terrestrial mobile BSs mounted on ground vehicles, including trucks, vans, and cars, would also offer flexibility to some extent, as they are also capable of moving according to the circumstances. However, UAV BSs are deemed more advantageous than their terrestrial counterparts due to multiple additional benefits [3], [9]. First, while terrestrial mobile BSs can move only in two dimensions, UAV BSs can also alter their altitude, giving them an extra degree of freedom [10]. Therefore, they can provide better connectivity, since line-of-sight (LoS) becomes more likely with appropriate altitude adjustments. Second, terrestrial mobile BSs are restricted by traffic regulations and by the layout of the environment (e.g., a city); UAV BSs, however, do not face such strict restrictions apart from some altitude restrictions. To this end, UAV BSs are more suitable and feasible solutions in many scenarios if they are managed properly.
The primary research challenges regarding UAV-assisted communication systems are as follows: i) determining the optimal positions of the UAVs; ii) finding the optimal UAV trajectory; and iii) meeting the regional regulations for UAVs. Energy-efficient networking with UAV BSs is at the core of the discussion, since UAVs are battery operated and have limited energy capacity. This is one of the most important limitations of UAV-assisted communications and requires proper management; otherwise, the concept can become infeasible if the flight time of UAVs cannot be sufficiently prolonged². Furthermore, UAV-based communication systems represent an even more complex case, because the total energy consumption depends on both the communication with the ground users and flying on a predefined trajectory or simply hovering over a fixed point. Besides, covering as many users as possible while maintaining energy efficiency is an important aspect of UAV-assisted networking, since the main objective is to enhance the capacity of the network in order to serve more users in temporarily dense networks. Put another way, energy efficiency is required to realize the main objective, which is capacity enhancement in this case, and thus the idea is to keep the UAV BSs in the air longer in order to serve more users in total. Therefore, in order to capture this phenomenon in our work, we consider both the energy consumption and the total number of users covered as objective functions of our novel problem formulation.
¹ It has become quite common for people to go online during these events and broadcast live videos over their social media applications, which critically affects the data traffic in those regions. In other words, the problem is not only that people accumulate around a specific region, increasing the population density; they also tend to consume more bandwidth with these online video streaming applications.
² The sufficiency mentioned here should be discussed according to the scenario, i.e., the conditions of the networks and the requirements of mobile network operators as well as users; hence, it is quite hard to put a formal and strict definition and/or a numerical value on it. However, the main idea is to maximize the flight time of UAVs as much as possible.
No-fly zones (NFZs), which are restricted or prohibited areas where UAVs are not allowed to fly (such as military bases), are considered as a practical constraint in the deployment phase of UAV BSs. This is because the optimal UAV trajectory and positioning are affected by NFZs, given that UAVs need to avoid those places even if they contain optimal positions. In other words, UAV BSs must be positioned subject to the NFZ constraints, which brings an additional challenge to the optimization process. In addition, as mentioned above, there are also regulations on the minimum and maximum altitude of UAVs, such that UAVs must stay within the allowed altitude range³. To capture NFZs and altitude regulations in this work, we consider them as constraints in the problem formulation.
Machine learning is a promising solution in various domains, such as agriculture [11], finance [12], and healthcare [13], due to its strong capabilities in terms of convergence, dynamism, and agility. It also has a considerable place in optimizing wireless communication networks [14]-[18]. Moreover, it is envisioned to play an even more critical role in the upcoming generations of mobile communication networks, such as the sixth generation of mobile communication (6G) [19]-[21]. Reinforcement learning (RL) has a special place in machine learning, as it is structurally quite different from supervised and unsupervised learning methods. RL consists of a set of policy-based goal-seeking algorithms in which an agent takes actions in a given environment in order to maximize its reward or minimize its penalty; therefore, RL is predominantly used in optimization problems rather than in time-series analysis, classification, or clustering, which are the typical tasks of supervised and unsupervised learning algorithms.
RL has unique advantageous characteristics that make it preferable to other types of optimization methodologies, including heuristics. First, RL algorithms, such as Q-learning and SARSA, are predominantly model-free, meaning that they do not require a model of the environment of interest in advance; instead, they interact with the environment in order to capture its dynamics [22]. Moreover, since RL algorithms learn as they operate, they do not have to start from scratch every time the environment changes; rather, they adapt to the changes, which allows optimization with reasonable time complexity. This is an essential feature for an optimization algorithm, especially in dynamic scenarios where network conditions change rapidly and frequently. To this end, we employ Q-learning, one of the most common RL algorithms, in this work in order to benefit from the above-mentioned features.

A. Related Work
The literature on UAV-assisted communication systems is reviewed in this subsection. In recent years, numerous studies have been conducted in the field of UAV-assisted wireless networking [23]-[32], but only a few have looked at flying regulations [33], [34]. In [8], the authors presented a survey on the most recent research opportunities and problems in the field of UAV-aided wireless networks. The key difficulties in UAV-assisted networking are investigated, including 3D deployment, performance analysis, channel modeling, and energy efficiency.
For the coexistence of UAVs and underlaid device-to-device (D2D) communication networks, a tractable analytical framework is proposed in [29]. The authors showed that flying a UAV at the ideal altitude can result in the highest system sum-rate and coverage probability. Furthermore, an optimal trajectory design can reduce transmit power; however, networking under UAV altitude regulations has not received enough attention in the literature. The minimum and maximum authorized altitudes of flying UAVs vary by country; for example, European laws for flying UAVs establish minimum and maximum allowed altitudes, which may differ from those in other regions of the world. In [33], the authors looked at the status of UAV-related regulations.
In the past few years, numerous surveys and tutorials have been published. The findings reveal that creating air route networks is a scientifically sound and efficient way to standardize and improve the efficiency of low-altitude UAV operations. The most significant approach for UAV regulation in urban regions, in terms of safety and efficiency, is to enhance research that heavily relies on urban remote sensing and geographic information system (GIS) technology, as well as application demonstrations of low-altitude public air route networks [33]. In [34], the authors discussed the standardization initiatives for UAV-assisted UEs, UAV-assisted BSs, UAV communication prototypes, and the cyber-physical security of UAV-assisted cellular communications. The use of UAV-assisted communication has been suggested in the literature as a possible approach for IoT networks [8], [35]-[37]. In [35], energy-efficient data collection for IoT networks was demonstrated, and the best way to deploy and move several UAVs was examined. The authors developed a framework for jointly optimizing UAV three-dimensional (3D) positioning and mobility, device-UAV association, and uplink power control. First, the ideal UAV location is identified based on the locations of active IoT devices at each time instant. Next, the optimal UAV mobility patterns are studied to dynamically serve the IoT devices in a time-varying network. The goal is to use as little energy as possible for the UAVs' mobility while serving IoT devices. For coverage and rate analyses, a tractable analytical framework is developed in [28], wherein the UAV's coexistence with a D2D communication network is taken into account.
Interfering UAVs are considered in [38], while in [39], the authors investigated the optimal 3D placement of multiple UAVs that use directional antennas to maximize the total coverage area. The authors in [40] analyzed the impact of a UAV's altitude on the sum-rate maximization of a UAV-assisted terrestrial wireless network, and the 3D placement of drones with the goal of maximizing the number of ground users covered by the drone was investigated in [41]. The minimum number of drones needed to serve all the ground users within a given area was determined in [42]. In [43], evolutionary algorithms were employed to find the optimal placement of low-altitude platforms (LAPs) and portable BSs for disaster relief scenarios; by deploying the UAVs at the optimal locations, the number of BSs required to completely cover the desired area was minimized. The authors in [44] determined the optimal location of the UAV by maximizing the average rate while ensuring that the bit error rate does not exceed a specified threshold.
Different considerations, such as flight time, energy limits, ground user demands, flying regulations, and NFZ avoidance, have a substantial impact on a UAV's trajectory. To maximize the minimum average rate among ground users, the authors in [45] proposed a joint optimization of user scheduling and UAV trajectory. For a setting in which a number of jammers with unknown locations send jamming signals, the authors in [45] presented a joint UAV and ground user scheduling and transmit power allocation optimization technique. The optimal trajectory of UAVs with multiple antennas for maximum sum-rate in uplink communication was studied in [46]. The throughput maximization problem in mobile relaying systems was investigated in [47] by optimizing the source/relay transmit power along with the relay trajectory, subject to practical mobility constraints such as the UAV's speed and relay locations. The authors in [48] proposed an E-Spiral algorithm for accurate photogrammetry that considers the camera sensor and the flight altitude to apply the overlap necessary to guarantee mission success. This technique used an energy model to determine different optimal speeds for straight parts of the path, thereby lowering energy consumption and improving the energy model's ability to estimate the overall path energy. To characterize the practical path planning requirements of UAVs in difficult situations, the authors in [49] developed an energy-aware multi-UAV multi-area coverage path planning model. A bipartite cooperative coevolution (BiCC) algorithm was proposed in this regard, which coevolves inter-area and intra-area path planning components to generate good solutions. In [50], the authors proposed a geometric planning-based iterative trajectory optimization technique. First, graph theory was used to generate all potential UAV-ground BS association sequences, and candidate association sequences were chosen based on the topological relationship between the UAV and the ground BSs. Following that, an iterative handover location design based on the triangle inequality property was used to calculate the shortest flying route with quick convergence and low computational complexity. After that, by comparing all of the possible trajectories, the optimal flight trajectory can be determined. The authors presented a tradeoff between mission completion time and flight energy usage [50].
In addition, recent research looks at multi-objective optimization of UAV-assisted communication [51], [52]. In [52], a multi-objective optimization problem is constructed to jointly optimize three objectives over the course of a flight: 1) maximization of the cumulative data rate, 2) maximization of the total gathered energy, and 3) reduction of the UAV energy consumption. Because these goals conflict with each other, the authors proposed an enhanced deep deterministic policy gradient (DDPG) technique for learning UAV control policies with multiple objectives. In [53], the authors developed a mathematical propulsion energy model for rotary-wing UAVs with the goal of minimizing the total energy consumption of the UAV while taking the data rates of all ground nodes into consideration. The authors proposed a new path discretization method for converting the original optimization problem into a discretized equivalent with a finite number of optimization variables, for which the successive convex approximation technique yielded a high-quality suboptimal solution.

B. Contributions
In this paper, a smart UAV positioning mechanism is proposed that takes such regulatory constraints into account in order to provide sustainable wireless coverage and services to ground users under more realistic conditions. In particular, we propose a Q-learning-based approach for UAV-assisted communication systems. The optimal position of the UAV is determined under the constraints of altitude regulations, NFZs, and transmit power. The main contributions of the paper are as follows:
• A smart UAV positioning mechanism for a sustainable UAV communication system is proposed under certain constraints.
• A multi-objective optimization model is formulated, that is, minimizing the energy consumption of the UAV while maximizing the number of users covered.
• A weighting mechanism is developed in order to prioritize the two objectives given in the previous item over each other for different scenarios.
• A Q-learning-based algorithm is used to find the optimal position of the UAV. The convergence of the developed algorithm is first tested, and its performance is then compared with the baseline k-means method in terms of the number of users covered and the energy consumption.

C. Organization of the Paper
The remainder of this paper is organized as follows. Section II describes the system model, including the propagation and energy consumption models, while Section III presents the problem formulation. Section IV presents the proposed Q-learning-based UAV positioning mechanism, followed by a discussion of the simulation scenario and the results in Section V. Section VI concludes the paper.

II. SYSTEM MODEL
In this section, we elaborate on the system model of this work, including the scenario, the propagation model, and the energy consumption model.

A. Scenario
We consider a UAV-mounted BS that provides coverage to $n_u$ ground users distributed over a rectangular geographical area of size $a \times b$ square meters. Let $U = \{1, 2, 3, \ldots, n_u\}$ be the set of $n_u$ users; the UAV can move in any direction ($x$, $y$, or $z$) to provide coverage to the ground users based on the user density. The total service time $T_t$ (in minutes) is divided into consecutive time slots of equal duration $T_d$ (in minutes), such that $n_{ts} = T_t / T_d$ is the number of time slots, and $T$ is a vector containing the consecutive time slots, $T = [t_0, t_1, \ldots, t_{n_{ts}}]$. The location of a user is represented by $(x_u, y_u, z_u)$, where $x_u \in \mathbb{R}^+$ lies in the range $[0, a]$ and, similarly, $y_u \in \mathbb{R}^+$ lies in the range $[0, b]$. $z_u \in \mathbb{R}^+$ is assumed to be constant, $z_u = h_u$, since we consider conventional mobile handsets, which are carried at a similar height. In this work, we assume the height of the UEs to be $h_u = 1.5$ meters.
The altitude of the UAV, $h_d \in \mathbb{R}^+$, lies in the range $[h_{\min}, h_{\max}]$, where $h_{\min}$ and $h_{\max}$ are the minimum and maximum allowed altitudes⁴ of the UAV, respectively. For instance, according to the European regulations for flying UAVs, $h_{\min}$ and $h_{\max}$ are 30 and 120 meters, respectively.
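As an illustration of the scenario described above, the following minimal sketch generates user locations and an initial UAV position and computes the number of time slots. The function name `make_scenario`, the uniform placement of users, and the random initial UAV position are assumptions made for illustration only; the text above does not specify the user distribution.

```python
import random

def make_scenario(n_u, a, b, t_total, t_slot, h_min, h_max, h_u=1.5, seed=0):
    """Build a toy scenario: users on an a x b area and one UAV (hypothetical helper)."""
    rng = random.Random(seed)
    # Users are placed uniformly at random (an assumption), all at height h_u.
    users = [(rng.uniform(0.0, a), rng.uniform(0.0, b), h_u) for _ in range(n_u)]
    # The UAV starts somewhere above the area, within the allowed altitude range.
    uav = (rng.uniform(0.0, a), rng.uniform(0.0, b), rng.uniform(h_min, h_max))
    # Number of time slots n_ts = T_t / T_d (integer part).
    n_ts = int(t_total // t_slot)
    return users, uav, n_ts

users, uav, n_ts = make_scenario(50, 1000.0, 800.0, 60, 5, 30.0, 120.0)
```

Here a 60-minute service time with 5-minute slots is an arbitrary example; the altitude bounds of 30 and 120 meters follow the European example given in the text.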

B. Propagation Model
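The propagation model is inspired by [26], [31], wherein the average path loss for air-to-ground communication is characterized in terms of LoS and non-LoS (NLoS) links, whose path losses are given as
$$PL_{\xi} = 20 \log_{10}\left(\frac{4 \pi f_c d_k}{c}\right) + \eta_{\xi}, \quad \xi \in \{\mathrm{LoS}, \mathrm{NLoS}\}, \tag{1}$$
where $f_c$ is the carrier frequency, $d_k$ is the Euclidean distance between the UAV and user $k$, $c$ is the speed of light, and $\eta_{\mathrm{LoS}}$ and $\eta_{\mathrm{NLoS}}$ are the mean values of the excessive path loss (in addition to the free-space path loss) for LoS and NLoS links, respectively. The LoS link probability is given as
$$P_{\mathrm{LoS}} = \frac{1}{1 + \psi \exp\left(-\varsigma \left[\vartheta_k - \psi\right]\right)}, \tag{2}$$
where $\psi$ and $\varsigma$ are constants that depend on the environment, and $\vartheta_k = \frac{180}{\pi} \arcsin\left(\frac{h_d}{d_k}\right)$ is the elevation angle. Besides, the NLoS link probability can be calculated as
$$P_{\mathrm{NLoS}} = 1 - P_{\mathrm{LoS}}. \tag{3}$$
Therefore, the average path loss can be expressed as
$$PL = P_{\mathrm{LoS}} \, PL_{\mathrm{LoS}} + P_{\mathrm{NLoS}} \, PL_{\mathrm{NLoS}}. \tag{4}$$
⁴ According to the regulations of the country/region concerned.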
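To make the computation concrete, the sketch below evaluates the average path loss under the assumption that the model takes the widely used probabilistic LoS/NLoS form: free-space path loss plus a mean excess loss for each link type, a sigmoid LoS probability in the elevation angle, and a probability-weighted average. The function name and the numerical parameter values are illustrative placeholders, not values taken from this paper.

```python
import math

def average_path_loss_db(h_d, d_k, f_c, eta_los, eta_nlos, psi, varsigma):
    """Average air-to-ground path loss in dB (assumed probabilistic LoS/NLoS form)."""
    c = 299_792_458.0  # speed of light in m/s
    # Elevation angle in degrees, as defined in the text.
    theta_k = (180.0 / math.pi) * math.asin(h_d / d_k)
    # Free-space path loss in dB, shared by the LoS and NLoS terms.
    fspl = 20.0 * math.log10(4.0 * math.pi * f_c * d_k / c)
    pl_los = fspl + eta_los
    pl_nlos = fspl + eta_nlos
    # LoS probability (sigmoid in the elevation angle) and its complement.
    p_los = 1.0 / (1.0 + psi * math.exp(-varsigma * (theta_k - psi)))
    p_nlos = 1.0 - p_los
    return p_los * pl_los + p_nlos * pl_nlos

# Illustrative placeholder parameters (assumed, not from the paper).
params = dict(f_c=2.0e9, eta_los=1.0, eta_nlos=20.0, psi=9.61, varsigma=0.16)
low_uav = average_path_loss_db(h_d=30.0, d_k=200.0, **params)
high_uav = average_path_loss_db(h_d=120.0, d_k=200.0, **params)
```

For the same link distance, the higher UAV sees a larger elevation angle and therefore a higher LoS probability, which lowers the average path loss in this sketch.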

C. Energy Consumption Model
The energy consumption model is inspired by [31], where it is modeled as a combination of the energy consumption resulting from communication, UAV hovering, and UAV mobility.

1) Communication Energy Consumption:
The communication energy is needed to communicate with the ground users, i.e., to transmit/receive signals to/from the users. As such, the communication energy consumption of the UAV, $E_C$, can be calculated as follows:
where $P_t$ is the transmission power, $P_{cu}$ is the on-board circuit power, $t_{cm}$ is the duration of the communication between the UAV and user $j$, and $n_{u,t_j}$ is the number of users served by the UAV during time slot $t_j$.

2) Hovering Energy Consumption:
The hovering energy is required to keep the UAV in the air at the right altitude. The hovering energy consumption of the UAV during time slot $t_j$ can be given as
where $t_H$ is the hovering duration of the UAV and $P_H$ (in watts) is the instantaneous hovering power consumption, which can be determined by
where $M$ is the number of rotors of the UAV, $G$ is the thrust (in newtons), $\rho$ is the density of the air, and $\beta$ is the rotor disk radius.

3) Mobility Energy Consumption:
The mobility energy is needed to move the UAV to the optimal position in order to serve the ground users. From [31], the mobility energy consumption of the UAV can be given as
where $P_h$ is the instantaneous power consumption for mobility in the horizontal direction, $P_a$ is the ascending power, and $P_d$ is the descending power. $d(t_j)$ is the horizontal moving distance at $t_j$, while $\Delta h(t_j)$ is the change in the altitude of the UAV at $t_j$. $v_h$, $v_a$, and $v_d$ are the horizontal, vertical (ascending), and vertical (descending) velocities of the UAV, respectively⁵. $I(\Delta h(t_j))$ is the indicator function, such that [31]
$$I(\Delta h(t_j)) = \begin{cases} 1, & \Delta h(t_j) \geq 0, \\ 0, & \Delta h(t_j) < 0. \end{cases} \tag{9}$$
Lastly, the power consumption in the horizontal direction is as follows:
where $P_P$ is the parasitic power for overcoming the parasitic drag due to the aircraft's skin friction [31].
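The three energy components can be sketched as follows. The functional forms used here are assumptions that are merely consistent with the variable definitions above (power multiplied by duration, with the indicator function selecting the ascending or descending term); they are not taken from [31], and all function names are hypothetical. Powers are in watts, durations in seconds, distances in meters, and velocities in meters per second, so the results are in joules.

```python
def indicator(delta_h):
    """I(delta_h): 1 when the altitude change is non-negative, 0 when the UAV descends."""
    return 1 if delta_h >= 0 else 0

def communication_energy(p_t, p_cu, t_cm, n_served):
    """Assumed form: (transmit power + circuit power) x duration x number of served users."""
    return (p_t + p_cu) * t_cm * n_served

def hovering_energy(p_hover, t_hover):
    """Assumed form: instantaneous hovering power x hovering duration."""
    return p_hover * t_hover

def mobility_energy(d, delta_h, p_h, p_a, p_d, v_h, v_a, v_d):
    """Assumed form: horizontal term plus either an ascending or a descending term."""
    i = indicator(delta_h)
    horizontal = p_h * d / v_h                       # time to cover d at speed v_h
    ascent = i * p_a * abs(delta_h) / v_a            # active only when ascending
    descent = (1 - i) * p_d * abs(delta_h) / v_d     # active only when descending
    return horizontal + ascent + descent

def total_energy(e_c, e_h, e_m):
    """Total energy as the sum of the three components named in the text."""
    return e_c + e_h + e_m

# Hypothetical numbers, for illustration only.
e_up = mobility_energy(d=100, delta_h=20, p_h=50, p_a=80, p_d=40, v_h=10, v_a=4, v_d=5)
e_down = mobility_energy(d=100, delta_h=-20, p_h=50, p_a=80, p_d=40, v_h=10, v_a=4, v_d=5)
```

With these placeholder values, ascending by 20 meters costs more than descending by the same amount because the assumed ascending power is higher and the assumed ascending speed is lower.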

III. PROBLEM FORMULATION
The primary objective of this work is to maximize the number of connected users while minimizing the total energy consumption of the UAV BS in order to prolong its flight time. In this regard, we aim to find the optimal position of the UAV BS and to associate with it the ground users that would otherwise be out of service due to congestion in the terrestrial network, so that the number of unconnected users is reduced. However, it is important to consider the total energy consumption of the UAV BS in order to maximize the service duration, given that it is battery operated and has limited flight time. We also consider certain constraints, including the NFZs (i.e., the UAV BS cannot fly over those forbidden regions) and the altitude regulations for UAVs; thus, determining the optimal position of the UAV BS by taking into account both the requirements and the constraints becomes a non-trivial task.
Theorem 1. The number of connected users, $n_c$, can be controlled by the altitude of the UAV BS, $h_d$.
Proof. Let $K$ be a rectangular prism with base area $A_K = x_K y_K$, where $x_K$ and $y_K$ are the $x$ and $y$ dimensions of the base of $K$, which is placed on the $z = 0$ plane. If we place the UAV BS, with a directivity angle of $\theta$, at any point inside $K$, the radius of the footprint of the UAV BS can be calculated as follows [23]:
$$r = h_d \tan\left(\frac{\theta}{2}\right), \tag{11}$$
where $h_d$ is determined by
$$h_d = |N|, \tag{12}$$
where $|N|$ is the length of the normal vector, $N$, from the UAV BS to the $z = 0$ plane. Then, the footprint of the UAV BS can be found as
$$A_f = \pi r^2. \tag{13}$$
If a random point is selected on the $z = 0$ plane, then the probability of it falling inside the footprint of the UAV BS can be given as
$$p_f = \frac{A_f}{A_K}, \tag{14}$$
where $A_K$ is the base area of the rectangular prism $K$. Let $p_q$ be the probability of receiving a sufficient signal-to-noise ratio (SNR⁶) for a UE, such that
$$p_q = P(S_r \geq S_{\min}), \tag{15}$$
where $S_r$ is the received SNR, while $S_{\min}$ is the minimum SNR required to establish a connection between the UAV BS and the UE. We assume that the ground BS uses a different frequency band than the UAV BS, so it does not create any interference to the UAV BS. Therefore, SNR is the appropriate metric here. Moreover, note that $T_s$ captures the receiver sensitivity of the UE, and $p_q$ encompasses small-scale and large-scale fading effects. Therefore, for a UE, the probability of being served by the UAV BS can be determined as
$$p_s = p_f \, p_q \, p_r, \tag{16}$$
where $p_r$ is the probability of having enough resources for the UE at the UAV BS, such that $p_r = P(B_L \geq B_R)$, where $B_L$ is the remaining radio resources at the UAV BS and $B_R$ is the radio resources required by the UE. By substituting (11), (13), and (14) into (16), we get
$$p_s = \frac{\pi h_d^2 \tan^2(\theta / 2)}{A_K} \, p_q \, p_r. \tag{17}$$
Hence, it is clear from (17) that the probability of being served by the UAV BS is a function of the height of the UAV BS and increases with it.
⁵ For the details on the calculations of these velocities, please refer to [31].
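The argument of the proof can be checked numerically with a short sketch. It assumes a circular footprint of radius $h_d \tan(\theta/2)$ for a directivity angle $\theta$, and a service probability equal to the product of the footprint-to-area ratio, the SNR probability $p_q$, and the resource probability $p_r$. The function names and the clipping of the area ratio at 1 are additions made for illustration.

```python
import math

def footprint_radius(h_d, theta):
    """Radius of the circular footprint for altitude h_d and directivity angle theta (rad)."""
    return h_d * math.tan(theta / 2.0)

def service_probability(h_d, theta, x_k, y_k, p_q, p_r):
    """Probability that a random UE in the x_k-by-y_k area is served (assumed product form)."""
    area_footprint = math.pi * footprint_radius(h_d, theta) ** 2
    # Clip at 1 so the ratio remains a probability when the footprint exceeds the area.
    p_f = min(1.0, area_footprint / (x_k * y_k))
    return p_f * p_q * p_r

p_low = service_probability(30.0, math.pi / 2, 1000.0, 1000.0, p_q=0.9, p_r=0.8)
p_high = service_probability(60.0, math.pi / 2, 1000.0, 1000.0, p_q=0.9, p_r=0.8)
```

Under these assumptions, and as long as the footprint stays within the area, doubling the altitude quadruples the footprint and hence the service probability when $p_q$ and $p_r$ are held fixed. In practice $p_q$ also depends on the altitude through the path loss, which this sketch deliberately ignores.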

Theorem 2. The total energy consumption of the UAV BS, $E_T$, can be controlled by the altitude of the UAV BS, $h_d$.
Proof. Let $E_T$ be the total energy consumption of the UAV BS, $E_C$ be the communication energy, and $E_M$ be the energy consumption during UAV mobility. Suppose the UAV moves to the optimal position and attains the optimal altitude to serve the ground users. The total energy consumption of the UAV BS can be calculated as follows:
By substituting $E_M$ from (8) into (18), we get
It is clear from (19) that the total energy consumption of the UAV depends directly on the change in the height of the UAV as well as on the movement in the horizontal direction.

A. Optimization Problem Formulation
There are two primary objective functions considered in this work, namely, i) maximization of the number of users served by the UAV BS ($n_c$) and ii) minimization of the total energy consumption of the UAV BS ($E_T$). These two objective functions are formulated as follows.

1) Maximization of Number of Served Users:
The number of users served by the UAV, $n_c$, is to be maximized at each time slot. Let $F$ be the NFZ, a non-self-intersecting convex quadrilateral defined by its vertices $V_i = (x_i, y_i, z_i)$, where $i = 1, 2, 3, 4$. Moreover, let $C_{3d} \in \mathbb{R}^3$ be a vector containing the three-dimensional (3D) coordinates of the UAV, let $C_{2d}$ be a point in the $xy$-plane representing the projection of the UAV onto the $xy$-plane, and imagine that we draw straight lines from each vertex of $F$ to the point $C_{2d}$. Then, the optimization problem can be modeled as follows:
where
• $C_1$: The altitude of the UAV ($h_d$) is regulated in many countries and regions, such that the maximum ($h_{\max}$) and minimum ($h_{\min}$) altitudes at which UAVs can fly are specified. Therefore, in this work, the UAV must obey these altitude limits.
• $C_2$: Since $F$ is defined as the NFZ, the UAV BS cannot fly over it. As such, this constraint ensures that the UAV BS flies outside $F$, i.e., that the projection of the UAV BS onto the $xy$-plane, $C_{2d}$, is not within $F$.
• $C_3$: The directivity angle of the antenna of the UAV BS can be $\pi$ at maximum⁷, but in practice it should be less than that in order to have a better antenna gain. Although this would not normally be a hard constraint, in this work we deal with the case where the antenna angle is less than $\pi$, so it becomes a constraint for the optimization problem.
• $C_4$: Given that the maximum transmit power of BSs is regulated, this constraint captures such regulations, meaning that the transmit power of the UAV BS has an upper bound.
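The four constraints can be checked for a candidate UAV position as in the sketch below. The NFZ test uses a standard same-side (cross-product) test for convex polygons, which is one common way to decide whether the projection $C_{2d}$ lies within $F$; it is not necessarily the formulation used in the optimization problem above. Only the $xy$ coordinates of the NFZ vertices are used, points on the NFZ boundary are treated as forbidden, and the function names and the symbol `p_max` for the transmit power bound are hypothetical.

```python
import math

def inside_convex_polygon(point, vertices):
    """True if point lies inside or on a convex polygon whose vertices are given in order."""
    px, py = point
    sign = 0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Cross product tells on which side of edge (V_i, V_{i+1}) the point lies.
        cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
        if cross != 0:
            s = 1 if cross > 0 else -1
            if sign == 0:
                sign = s
            elif s != sign:
                return False  # the point is on different sides of two edges: outside
    return True

def is_feasible(position, theta, p_t, nfz, h_min, h_max, p_max):
    """Check constraints C1-C4 for a UAV position (x, y, h_d)."""
    x, y, h_d = position
    c1 = h_min <= h_d <= h_max                      # C1: altitude regulation
    c2 = not inside_convex_polygon((x, y), nfz)     # C2: projection outside the NFZ
    c3 = 0.0 < theta < math.pi                      # C3: directivity angle below pi
    c4 = 0.0 <= p_t <= p_max                        # C4: transmit power upper bound
    return c1 and c2 and c3 and c4

# Hypothetical square NFZ and candidate positions, for illustration only.
nfz = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
ok = is_feasible((15.0, 5.0, 50.0), math.pi / 3, 1.0, nfz, 30.0, 120.0, 2.0)
blocked = is_feasible((5.0, 5.0, 50.0), math.pi / 3, 1.0, nfz, 30.0, 120.0, 2.0)
```

A search procedure can call such a check to discard or penalize candidate positions that violate any of the constraints.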

2) Minimization of Energy Consumption:
It is crucial to minimize the energy consumption of the UAV BS so that it can stay in the air for a longer time and thereby prolong the service that the ground users receive. Put another way, the optimization objective elaborated in Section III-A1 focuses on maximizing the number of connected users, $n_c$; however, that objective is instantaneous (i.e., it holds for the duration of a single time slot, $T_d$) and does not aim to maximize $n_c$ over a period of time. The total number of connected users over the period of time considered can be calculated by
where $n_{c,i}$ indicates the number of users served by the UAV BS during time slot $i$ of $T$. In (21), $n_{ts}$ is a function of $T_s$, such that $n_{ts} = f(T_s)$; although $T_s$ is assumed to be fixed here, it normally depends on the energy stored in the UAV battery (i.e., the battery capacity) as well as on the energy consumption of the UAV BS. Since the battery capacity is fixed⁸, the only way left to prolong the UAV flight time is to reduce the energy consumption. Therefore, the second objective of our problem formulation is the minimization of the total energy consumption of the UAV BS ($E_T$), which can be modeled as follows:
where $g: \mathbb{R}^3 \to \mathbb{R}$ is the objective function, and $g(C) = E_T$ in this case.

3) Multi-objective Problem Formulation:
As detailed in Sections III-A1 and III-A2, our problem includes two distinct objectives: maximization of the number of connected users, as given in (20), and minimization of the energy consumption of the UAV BS, as given in (22). In this work, we aim at optimizing both objectives, (20) and (22), simultaneously. In this regard, we developed the following optimization model:

$$\max_{C_{3d}} \; h(C) \quad \text{s.t.} \quad C_1, C_2, C_3, C_4, \qquad (23)$$

where $h : \mathbb{R}^3 \to \mathbb{R}$ is the objective function, and $h(C) = w_1 f(C) - w_2 g(C) = w_1 n_c - w_2 E_T$ in this case. Here, $w_1, w_2 \in \mathbb{R}$ are coefficients used for two purposes:
• To prioritize one objective over the other. For example, a mobile network operator may have little interest in the energy consumption and focus only on covering as many users as possible for a short duration, in which case it would choose $w_1 \gg w_2$. On the other hand, if the operator ranks both objectives equally, it would choose $w_1 = w_2$. Therefore, $w_1$ and $w_2$ allow operators to rank the objectives according to their requirements.
• To make the units of $f(C)$ (unitless) and $g(C)$ (in Joules) the same, since $h(C)$ combines $f(C)$ and $g(C)$ in a single sum. To this end, while $w_1$ is chosen to be unitless, $w_2$ is in 1/Joules.
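The effect of the weights can be seen in a small numerical sketch. The two candidate positions and their $(n_c, E_T)$ values below are made up purely for illustration, and the function names are hypothetical; only the form $h = w_1 n_c - w_2 E_T$ comes from the formulation above.

```python
def weighted_objective(n_c: int, e_t: float, w1: float, w2: float) -> float:
    """h = w1 * n_c - w2 * E_T, with w2 expressed in 1/Joule."""
    return w1 * n_c - w2 * e_t


# Hypothetical candidates: (covered users, energy in Joules).
candidates = {"A": (90, 400.0), "B": (70, 150.0)}


def best_candidate(w1: float, w2: float) -> str:
    """Return the candidate with the largest weighted objective."""
    return max(candidates, key=lambda k: weighted_objective(*candidates[k], w1, w2))


print(best_candidate(1.0, 0.0))  # coverage only
print(best_candidate(0.5, 0.5))  # coverage and energy ranked equally
```

With coverage-only weights the position covering more users wins, while equal weights favor the position that spends less energy, which is the prioritization behavior described in the first bullet.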

MECHANISM
In RL, an agent takes actions in order to find the optimum policy for a given problem. Following each action of the agent, the corresponding state is first observed, and the subsequent penalty/reward function is then evaluated. Then, the action-value function, which stores the calculated penalty/reward values for all the states and actions, is updated [22]. The agent takes actions in two different ways: exploration and exploitation. In the initial phases of the implementation, the agent is expected to explore more in order to discover the environment better. However, after sufficient exploration, the agent should start exploiting the available information in order to focus on finding the best policy. We adopted the OpenAI Gym [54] tool for building the environment for this study. It is based on episodic RL, where the experience of each agent is divided into episodes. In the initial state of each episode, we randomly place the UAV BS and the users in a grid, and learning proceeds until the environment reaches one of the stopping criteria (detailed in the following subsections). The main goal here is to maximize the total reward per episode and to decrease the number of episodes needed to achieve the desired performance. The RL steps in each episode are given in Algorithm 1, where $s_t$ and $s_{t+1}$ are the current and next states, respectively, $a_t$ is the current action, and $R_{t+1}$ is the expected value of the reward function.
In this study, states refer to the position of the UAV in the grid. The agent in the developed Q-learning algorithm has seven action values for each state, which denote the agent action $a_t$ in UAV state $s_t$ at time $t$. The possible actions for each state $s_t$ are hold, move up, move down, move left, move right, move forward, and move backward. The agent follows an ε-greedy [22] policy: it takes random actions initially, which is referred to as exploring, and decays ε through the iterations so that random actions become less frequent, which is referred to as exploiting. Given that the main goal of this study is to optimize the energy consumption of the UAV along with maximizing user coverage, the reward function in the proposed method is in line with the objective function in (23), and depends on the energy consumption and coverage.
The components of the developed Q-learning algorithm for the problem of UAV BS positioning are detailed in the following paragraphs.

A. Environment
We create a discrete environment of finite size (a grid) representing the state of the UAV in OpenAI Gym. The size of the grid in the environment of interest in this study is (25, 25, 12), which is simulated with a 10-meter resolution along each axis. Therefore, the real environment size is (250, 250, 120) in meters. These dimensions are chosen by considering both the computational burden and the realism of the work: the environment should have the size of a realistic area (and the UAV BS should have a sufficient degree of freedom in movement) without imposing a heavy computational burden (the simulation time should be short enough to allow tuning during the design of the algorithm). However, we expect the developed algorithm to work for any environment size, since the UAV BS can only move slightly in one iteration, so that extending the size of the environment would not affect the performance of the algorithm other than prolonging the simulation time.

B. Agent
The UAV BS in state $s_t$ corresponds to the agent in this study. It takes an action, $a_t$, in state $s_t$, and receives an observation and a reward from the environment. Accordingly, it updates the Q-table in order to learn the dynamics of the environment and to adapt itself to changes. Choosing the UAV BS as the agent in the developed Q-learning algorithm is natural, as it is the only entity that takes actions, e.g., moving in different directions.

C. Actions
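We consider seven different actions that the agent can take. Let $C_{3d} = (x_u, y_u, z_u)$ be the current position of the UAV BS and $\hat{C}_{3d}$ be the position after an action is taken, while $r$ (in meters) denotes the step size in any direction. The actions that the agent can take are hold, move up, move down, move left, move right, move forward, and move backward. The hold action leaves the position unchanged ($\hat{C}_{3d} = C_{3d}$), whereas each of the remaining actions displaces the UAV BS by $r$ along one axis.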
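A minimal sketch of the action set is given below. The assignment of left/right to the x-axis and forward/backward to the y-axis, the integer action codes, and the default step size of r = 10 m (equal to the grid resolution) are assumptions made for illustration and are not taken from the original formulation.

```python
# Action code -> unit displacement (dx, dy, dz).
# Axis conventions are assumed: left/right on x, forward/backward on y.
ACTIONS = {
    0: (0, 0, 0),    # hold
    1: (0, 0, 1),    # move up
    2: (0, 0, -1),   # move down
    3: (-1, 0, 0),   # move left
    4: (1, 0, 0),    # move right
    5: (0, 1, 0),    # move forward
    6: (0, -1, 0),   # move backward
}


def apply_action(c3d, action, r=10.0):
    """Return the position after taking `action` with step size r (meters)."""
    dx, dy, dz = ACTIONS[action]
    x, y, z = c3d
    return (x + r * dx, y + r * dy, z + r * dz)


print(apply_action((100.0, 100.0, 50.0), 1))
```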

D. States
We denote the state $s$ as the position of the UAV in 3D space. We divide the 3D space into grid cells (i.e., we discretize the state space) in order to obtain a finite set of states that can be used in Q-learning. This state selection is in line with the criterion given in [22] that the state should be affected by the actions that the agent takes. Indeed, the actions of the agent alter the 3D position of the UAV BS, which is exactly what defines the state of the agent.
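As an illustration of the discretized state space, the sketch below maps a grid cell to a row index of a Q-table and back, and converts cell indices to meters. The grid shape and resolution are those stated for the environment; the row-major flattening and the function names are assumptions.

```python
GRID_SHAPE = (25, 25, 12)  # cells along x, y, z
RESOLUTION_M = 10          # meters per cell


def state_to_index(cell):
    """Flatten a cell (i, j, k) into a single Q-table row index (row-major)."""
    i, j, k = cell
    _, ny, nz = GRID_SHAPE
    return (i * ny + j) * nz + k


def index_to_state(index):
    """Inverse of state_to_index."""
    _, ny, nz = GRID_SHAPE
    i, rem = divmod(index, ny * nz)
    j, k = divmod(rem, nz)
    return (i, j, k)


def cell_to_meters(cell):
    """Convert cell indices to coordinates in meters."""
    return tuple(c * RESOLUTION_M for c in cell)


n_states = GRID_SHAPE[0] * GRID_SHAPE[1] * GRID_SHAPE[2]
print(n_states, state_to_index((24, 24, 11)))
```

With seven actions per state, a tabular Q-function under these assumptions has 7500 × 7 entries.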

E. Reward
In order to respect the constraints of the problem, a penalty mechanism is developed, such that the agent obtains a reward of -1 when the UAV BS
• goes beyond the dimensions of the environment,
• flies over the NFZ,
• does not respect any other constraint in (23).
On the other hand, a reward function is designed for the cases where the UAV BS is not in one of the states listed above. Since the main goal is to optimize the energy consumption along with maximizing the number of users covered, the reward is defined in line with the optimization objective in (23), such that

$$R = h(C) = w_1 n_c - w_2 E_T. \qquad (24)$$

The selection of the reward function as in (24) (i.e., making it equal to $h(C)$) is justified because the objective of the developed Q-learning algorithm is to maximize the reward, $R$, and the objective in (23) is the maximization of $h(C)$. Thus, making the reward equal to $h(C)$ is fully in line with the model in (23).
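The penalty and reward logic described above can be sketched as a single function. The boolean flags are hypothetical inputs standing in for the environment's own checks, and any normalization of $n_c$ or $E_T$ that the implementation may apply is not modeled here.

```python
def reward(in_bounds: bool, in_nfz: bool, constraints_ok: bool,
           n_c: int, e_t: float, w1: float, w2: float) -> float:
    """Return -1 on any violation, otherwise h = w1 * n_c - w2 * E_T."""
    if not in_bounds or in_nfz or not constraints_ok:
        return -1.0  # penalty for leaving the grid, entering the NFZ, etc.
    return w1 * n_c - w2 * e_t


print(reward(True, False, True, n_c=50, e_t=10.0, w1=0.5, w2=0.5))
```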

F. Policy
We follow an ε-greedy policy [22] in order to explore the environment by taking random actions in the earlier iterations (exploration phase). As the iterations proceed (i.e., as the number of iterations gets larger), we turn the exploration phase into the exploitation phase by decreasing ε with a decay rate of 0.01. This allows the agent to explore and acquire new experience during the exploration phase, while in the exploitation phase it uses the acquired experience to converge to an optimal value.
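A minimal sketch of an ε-greedy rule with decaying ε follows. Only the decay rate of 0.01 comes from the text; the exponential decay schedule, the bounds `eps_max` and `eps_min`, and the function names are assumptions, since the exact schedule is not specified here.

```python
import math
import random


def epsilon_at(episode, eps_max=1.0, eps_min=0.01, decay_rate=0.01):
    """Assumed schedule: exponential decay from eps_max towards eps_min."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay_rate * episode)


def choose_action(q_row, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore
    return max(range(len(q_row)), key=q_row.__getitem__)  # exploit


rng = random.Random(0)
q_row = [0.0, 0.2, 0.9, 0.1, 0.0, 0.3, 0.4]  # hypothetical Q-values, 7 actions
print(choose_action(q_row, epsilon_at(500), rng))
```

Under this schedule early episodes are dominated by random moves, and later episodes mostly follow the largest Q-value of the current state.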

G. Q-table Update
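We update the Q-table according to the action $a_t$ in state $s_t$ using

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right], \qquad (25)$$

where $\alpha$ is the learning rate, and $\gamma$ is the discount rate. The Q-table update is crucial for storing the acquired experience as well as for revising it with new data.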
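The tabular update can be sketched in a few lines, assuming the standard Q-learning rule of [22]. The Q-table is represented here as a list of rows (one per state, seven entries per row); this layout and the default values of `alpha` and `gamma` are illustrative assumptions.

```python
def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step:
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])."""
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]


# Toy table with 2 states and 7 actions.
q = [[0.0] * 7 for _ in range(2)]
q[1][3] = 2.0
print(q_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.5))
```

In the toy call, the target is 1 + 0.5 × 2 = 2, and the entry moves halfway from 0 towards it.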

H. Initialization
In each episode, the UAV BS and the users are placed randomly in the grid, so that the agent does not "memorize" a certain environment (i.e., overfit to it) but instead produces a more generic model.
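The random placement at the start of an episode might look as follows. This is only a sketch: it draws uniformly over all grid cells and does not reject positions that fall inside an NFZ or below the minimum altitude, which an actual implementation would presumably handle; the function name and the use of a seeded generator are assumptions.

```python
import random

GRID = (25, 25, 12)  # cells along x, y, z


def init_episode(rng, n_users=100):
    """Draw a random UAV cell and random 2D user cells for a new episode."""
    uav = tuple(rng.randrange(n) for n in GRID)
    users = [(rng.randrange(GRID[0]), rng.randrange(GRID[1]))
             for _ in range(n_users)]
    return uav, users


uav, users = init_episode(random.Random(42))
print(uav, len(users))
```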

I. Episodes
An episode is considered as a snapshot of the environment in the problem formulation. The agent takes random actions in each episode and learns the environment by updating the Q-table with (25), evaluating the reward, $R$, through (24). When the agent reaches one of the stopping criteria, a new episode begins.

J. Stopping Criteria
If the predefined maximum number of iterations is reached or all the users are covered by the UAV BS, the current episode is terminated, and the algorithm starts a new episode. The maximum number of iterations is set to 2000 in this work.
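The two stopping criteria can be sketched as a simple loop. The callback `step_fn` is a hypothetical stand-in for one agent-environment interaction that reports how many users are covered afterwards; only the iteration limit of 2000 and the all-users-covered condition come from the text.

```python
def run_episode(step_fn, n_users, max_iterations=2000):
    """Run one episode until all users are covered or the limit is reached.

    step_fn(iteration) -> number of users covered after that iteration
    (hypothetical callback).
    Returns (iterations used, reason for stopping).
    """
    for iteration in range(1, max_iterations + 1):
        covered = step_fn(iteration)
        if covered >= n_users:
            return iteration, "all_users_covered"
    return max_iterations, "max_iterations"


# Toy coverage model: ten more users are covered at every iteration.
print(run_episode(lambda it: min(100, 10 * it), n_users=100))
```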

V. PERFORMANCE EVALUATION
In this section, we present the performance evaluation of the proposed methodology. After describing the simulation scenario, we introduce the benchmark method as well as the performance metrics, followed by the obtained results and the corresponding discussion. The parameters used in the simulation campaigns are given in Table I.

A. Simulation Scenario
We implement a simulation scenario in order to evaluate the proposed Q-learning algorithm. An urban area of 250 × 250 m², which is discretized by means of a square-shaped grid, and a total of $n_t = 100$ users (that are normally out of service from the terrestrial network) are considered. Consequently, the UAV can move in the discretized 3D space in terms of the x, y, and z coordinates. Furthermore, due to the regulations, we impose a minimum and a maximum altitude of $h_{\min} = 30$ m and $h_{\max} = 120$ m, respectively, and a certain number of NFZs, corresponding to specific grid cells that are not allowed. Regarding user mobility, we consider a random walk model, and the height above ground of all UEs is fixed to 1.5 m. Furthermore, we assume a directivity angle of θ = 60° and a carrier frequency of $f_c = 1$ GHz. An initialization procedure is performed in order to set the initial conditions. In particular, the values of all the involved parameters are determined, considering arbitrary positions for the UAV BS and the users. An outage threshold is calculated from the minimum received power, $P_{r\min}$, required to establish and maintain a connection between the UE and the UAV BS with a certain QoS. For a given transmit power of the UAV BS, $P_t$ (lower than the maximum allowed value, $P_{t\max}$), and the above-mentioned $P_{r\min}$, the path loss experienced by the UE, $L$, can be expressed as $L = P_t - P_{r\min}$. Considering the 2D position of the UAV BS ($C_{2d}$), the QoS constraint can be expressed as $L$ being lower than $L_{\max}$ [26], where $L_{\max}$ is the path loss experienced by edge users. The footprint of the UAV BS, on the other hand, can be considered as a circle centered at the 2D position of the UAV BS ($C_{2d}$).
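A rough sketch of the link budget and of counting users inside a circular footprint is given below. The relation $L = P_t - P_{r\min}$ (in dB) is from the text; the footprint radius $h_d \tan(\theta/2)$, which treats θ as the full aperture angle of the antenna, is an assumed geometry and may differ from the expression used in the paper. The power values and user positions are hypothetical.

```python
import math


def path_loss_budget_db(p_t_dbm: float, p_rmin_dbm: float) -> float:
    """L = P_t - P_rmin, with powers in dBm and the result in dB."""
    return p_t_dbm - p_rmin_dbm


def footprint_radius(h_d: float, theta_deg: float = 60.0) -> float:
    """Assumed footprint radius: h_d * tan(theta / 2), theta = full aperture."""
    return h_d * math.tan(math.radians(theta_deg) / 2.0)


def count_covered(c2d, radius, users):
    """Count users whose horizontal distance to c2d is within the radius."""
    return sum(1 for (x, y) in users
               if math.hypot(x - c2d[0], y - c2d[1]) <= radius)


radius = footprint_radius(120.0)
users = [(125.0, 125.0), (125.0, 190.0), (125.0, 200.0)]
print(path_loss_budget_db(30.0, -90.0), round(radius, 1),
      count_covered((125.0, 125.0), radius, users))
```

Because the radius grows linearly with altitude under this assumption, higher allowed altitudes enlarge the footprint, consistent with the dependence of coverage on altitude discussed in the results.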

B. Benchmark and metrics
In this work, the k-means algorithm is employed as a benchmark method, since it has been widely used in similar problems [23]. k-means is an unsupervised clustering algorithm, where the data points are clustered according to certain features. In k-means, a centroid is assigned to each cluster, and the objective is to place the centroids at the positions that yield the minimum cumulative distance to the data points. In particular, in the initialization of the algorithm, the centroids are placed randomly and the data points are assigned to the centroids to form clusters, such that each data point is assigned to the cluster whose centroid is closest to it in terms of Euclidean distance. Then, the centroids are moved towards the centers of their clusters, and this process continues iteratively until convergence, where the centroids can no longer be moved.
Therefore, as this algorithm finds the position of the centroid where the cumulative distance between the centroid and the data points is at its minimum, it serves as a strong benchmark for this problem. In particular, if the UAV BS is considered as the centroid and the ground users as the data points, the k-means algorithm positions the UAV BS at the point where it is closest to the ground users in terms of distance. Given that distance is the primary parameter affecting the link quality between the transmitter and the receiver, the k-means algorithm is an appropriate benchmark. With this algorithm, we compute the centroid position from the actual positions of the ground users. Consequently, the centroid corresponds to the best 2D position for the UAV in terms of the respective distances between the UAV and the ground users.
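For a single UAV BS (k = 1), all users belong to one cluster, and the k-means update reduces to taking the mean of the user positions, which is the point minimizing the sum of squared Euclidean distances to the users. The sketch below shows this special case only; the function name and the sample coordinates are hypothetical.

```python
def centroid(users):
    """Mean of 2D user positions: the k-means centroid when k = 1."""
    n = len(users)
    return (sum(x for x, _ in users) / n, sum(y for _, y in users) / n)


users = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
print(centroid(users))
```

The resulting 2D point still needs an altitude, which the benchmark fixes separately.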
Two different phases, one for training and one for testing, are performed in order to demonstrate the efficiency of the proposed Q-learning algorithm in terms of coverage and energy consumption. Regarding coverage, we count the number of ground users, which are normally out of service from the terrestrial network, connected to the UAV BS; the more users are covered, the better the coverage performance. Regarding energy consumption, we measure the total energy consumption, $E_T$, of the UAV BS while it is providing service to the users; a lower energy consumption means a better performance, as the flight time of the UAV BS is prolonged.
During the training phase of the developed Q-learning algorithm, the simulation is conducted over a certain number of episodes in order to populate the Q-table and thereby achieve the needed learning. A trade-off between the prioritization of coverage and of energy consumption is considered in both the training and testing phases. Specifically, five different experiments are performed with different values of the weights (i.e., $w_1$ and $w_2$) that are responsible for prioritizing the coverage or the energy consumption.

C. Simulation Results
Fig. 2 demonstrates the averaged and normalized results in terms of energy consumption (orange bars) and covered users (green bars) for different altitudes obtained with the k-means algorithm. It is worth noting that, since the k-means algorithm determines only the 2D position of the UAV BS, the altitude of the UAV BS must be determined separately. Although there are different methods to determine the altitude, such as the trigonometric approach used in [23], they usually do not consider the altitude regulations for UAVs. In this work, on the other hand, considering such regulations, we used different fixed levels for the altitude of the UAV BS. In particular, three different altitude levels are considered: i) the minimum allowed altitude ($h_{\min} = 30$ m), ii) the maximum allowed altitude ($h_{\max} = 120$ m), and iii) the middle point between the two, $(h_{\max} + h_{\min})/2 = 75$ m. From (11), the value in the second case, i.e., with the maximum allowed height of 120 m, can be taken as the upper bound in terms of coverage: in this case the UAV is placed at the best 2D position with the maximum allowed height, which yields the maximum achievable coverage area with respect to the size of the considered urban area and, consequently, the maximum number of covered users. A similar consideration applies to this case (i.e., $h_{\max} = 120$ m) in terms of energy consumption. Since all ground users are covered at the first iteration, one of the stopping criteria is met immediately, so the UAV does not move and the energy consumption due to mobility is zero. For the two remaining cases, with altitudes of 30 and 75 m, the results in terms of coverage can be considered as lower-bound and median values, respectively, since, as previously stated, the coverage area and, subsequently, the number of covered users are highly dependent on the altitude of the UAV BS. Lastly, considering the energy consumption results in these cases, the UAV BS uses the maximum number of allowed iterations while attempting to meet the stopping criterion for coverage, which results in the maximum value of the energy consumption due to mobility.
Fig. 3 presents the results in terms of achieved rewards. After an initial phase, the Q-learning algorithm converges, which demonstrates the effectiveness of the learning algorithm. One important takeaway from Fig. 3 is that, regardless of the weighting approach (i.e., for different $w_1$ and $w_2$ values), the designed Q-learning algorithm converges to a final reward. This supports the design of the algorithm and suggests that it can work in various scenarios.
Following the above assumptions, the efficiency of the proposed Q-learning algorithm is verified through the testing phase, with UAV BS positioning through k-means as a benchmark. In this phase, the UAV BS is located in the above-mentioned simulated scenario, and an arbitrary uniform distribution of the ground users is considered. The testing phase is conducted by performing a certain number of runs in order to average the results with regard to the specified parameters. Fig. 4 shows the single UAV BS position optimization for different sets of weights; the results are the average values normalized between 0 and 1. The benefit of considering a trade-off between coverage and energy consumption, achieved by means of the two weights, is most visible in two of the five experiments. In particular, it can be seen that for $w_1 = 0.2$ and $w_2 = 0.8$ the best average energy consumption is achieved, whereas for the weights $w_1 = 0.0$ and $w_2 = 1.0$ the best overall coverage and rewards are obtained. In other words, the intended behavior of the above-mentioned trade-off can be observed from the results in Fig. 4. Effectively, when energy consumption is not prioritized, the UAV BS finds the optimum position in fewer episodes but at the expense of a higher energy consumption, and vice versa in the remaining cases. Therefore, it is worth mentioning that the designed weighting mechanism works well, as the performance of the Q-learning algorithm is strongly affected by the numerical values of the weights. These results not only affirm that the weighting mechanism works, but also show an advantage of the proposed approach, as it can converge to a solution according to the requirements of the network operators.

VI. CONCLUSION
In this paper, a smart UAV BS positioning mechanism was proposed by taking altitude regulations as well as NFZs into account, along with some hard constraints, including the maximum transmit power and the directivity of the UAV BS antenna, in order to provide sustainable wireless coverage and services to the ground users under more realistic conditions. First, two different optimization models were developed for the minimization of the energy consumption and the maximization of the number of users covered. Then, these two distinct models were combined with a weighting mechanism, and a multi-objective optimization problem formulation was developed. With the developed weighting mechanism, wireless network operators become capable of positioning the UAV BSs according to their requirements by relatively ranking the energy consumption and the number of users covered. We proposed a Q-learning-based approach for UAV-assisted communication systems, and the OpenAI Gym tool was used to build the RL environment. The objective is to find the optimal position of the UAV and to minimize the energy consumption while maximizing the number of users covered. The results demonstrate that the proposed solution outperforms the baseline k-means method in terms of covered users, while achieving the desired minimization of the energy consumption.

Fig. 1. The considered scenario depicting a ground macro BS that provides wide-range coverage and a UAV BS that provides additional capacity to the cellular network. A no-fly zone (NFZ), over which the UAV BSs are prohibited from flying, is also illustrated.

Fig. 3. Q-learning algorithm convergence in terms of rewards for different sets of weights for 2000 episodes.

Fig. 4. Monte Carlo test results. Single UAV position optimisation comparison for different sets of weights.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 9 September 2021 doi:10.20944/preprints202109.0177.v1
The possible actions for each state s t are hold, move up, move down, move left, move right, move forward, and move backward.The agent follows ǫ-greedy