Deep Q-Learning-Based Transmission Power Control of a High Altitude Platform Station with Spectrum Sharing

A High Altitude Platform Station (HAPS) can facilitate high-speed data communication over wide areas using high-power line-of-sight communication; however, it can significantly interfere with existing systems. Under spectrum sharing with existing systems, the HAPS transmission power must be adjusted to satisfy the interference requirement for incumbent protection. However, excessive transmission power reduction can severely degrade the HAPS coverage. To solve this problem, we propose a multi-agent Deep Q-learning (DQL)-based transmission power control algorithm that minimizes the outage probability of the HAPS downlink while satisfying the interference requirement of an interfered system. In addition, a double DQL (DDQL) is developed to prevent the potential risk of action-value overestimation in the DQL. With a proper state, reward, and training process, all agents cooperatively learn a power control policy that achieves a near-optimal solution. The proposed DQL power control algorithm performs equally to or close to the optimal exhaustive search algorithm for varying positions of the interfered system. The proposed DQL and DDQL power control schemes yield the same performance, which indicates that action-value overestimation does not adversely affect the quality of the learned policy.


Introduction
A High Altitude Platform Station (HAPS) is a network node operating in the stratosphere at an altitude of approximately 20 km. The International Telecommunication Union (ITU) defines a HAPS in Article 1.66A of the Radio Regulations as "a station located on an object at an altitude of 20 to 50 km and at a specified, nominal, fixed point relative to the Earth". Various studies have been performed on HAPS in recent years, and the commercial applications of HAPS have increased significantly [1]. In addition, the HAPS has potential as a significant component of wireless network architectures [2]. It is also regarded as an essential component of next-generation wireless networks, with considerable potential as a wireless access platform for future wireless communication systems [3][4][5].
Because the HAPS is located at high altitudes ranging from 20 to 50 km, HAPS-to-ground propagation generally experiences lower path loss and a higher line-of-sight probability than typical ground-to-ground propagation. Thus, the HAPS can provide a high data rate over wide coverage; however, it is likely to interfere with various other terrestrial services, e.g., fixed, mobile, and radiolocation. The World Radiocommunication Conference 2019 (WRC-19) adopted the use of HAPS as IMT Base Stations (HIBS) in the frequency bands below 2.7 GHz previously identified for IMT by Resolution 247 [6], which addresses the potential interference of HAPS with an existing service. In such a situation, the optimal exhaustive search method requires an impractically long computation time to solve the multicell power optimization problem, whereas the proposed DQL algorithm performs comparably to an optimal exhaustive search with a feasible computation time. (4) Even for varying positions of the interfered system, the proposed DQL produces a proper power control policy, maintaining stable performance. (5) A comparison of the proposed DQL algorithm with the DDQL algorithm shows no performance degradation due to overestimation in the proposed DQL. The remainder of this paper is organized as follows.
Section 2 presents the system model, including the system deployment model, HAPS model, interfered system model, and path loss model. In Section 3, the downlink SINR and INR are calculated. In Section 4, a DQL-based HAPS power control algorithm is proposed. Section 5 presents the simulation results, and Section 6 concludes the paper.

System Deployment Model
HAPS communication networks are assumed to consist of a single HAPS, multiple ground user equipment (UE) devices (referred to as UEs hereinafter), and a ground interfered receiver. The HAPS, UEs, and interfered receiver are distributed in the three-dimensional Cartesian coordinate system, as shown in Figure 1. The coordinates of the HAPS antenna and the interfered receiver antenna are (0, 0, h HAPS) and (X, Y, h V), respectively. The N UE UE devices with an antenna height of h UE are uniformly distributed within the circular HAPS area.
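As a concrete illustration, the uniform drop of UEs over the circular HAPS area can be sketched as follows; the UE count, coverage radius, and antenna height used here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def drop_ues(n_ue=100, radius_km=50.0, h_ue_m=1.5, rng=None):
    """Uniformly drop UEs inside the circular HAPS coverage area.

    The sqrt on the radial draw keeps the density uniform in area
    (a plain uniform radius would cluster UEs near the centre).
    """
    rng = rng or np.random.default_rng(0)
    r = radius_km * np.sqrt(rng.uniform(0.0, 1.0, n_ue))
    phi = rng.uniform(0.0, 2.0 * np.pi, n_ue)
    x, y = r * np.cos(phi), r * np.sin(phi)
    z = np.full(n_ue, h_ue_m / 1000.0)  # antenna height converted to km
    return np.column_stack([x, y, z])

ues = drop_ues()
```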


HAPS Model
We modeled the HAPS cell deployment and system parameters with reference to the working document for a HAPS coexistence study performed in preparation for WRC-23 [18]. As shown in Figure 2, a single HAPS serves multiple cells that consist of one 1st layer cell denoted as Cell_1 and six 2nd layer cells denoted as Cell_2 to Cell_7. The six cells of the 2nd layer are arranged at intervals of 60° in the horizontal direction. Figure 3 presents a typical HAPS antenna design for seven-cell structures [4], where seven phased-array antennas conduct beamforming toward the ground to form seven cells, as shown in Figure 2. The 1st layer cell has an antenna tilt of 90°, i.e., perpendicular to the ground; the 2nd layer cell has an antenna tilt of 23°.
The antenna pattern of the HAPS was designed using the antenna gain formula presented in Recommendation ITU-R M.2101 [19]. The transmitting antenna gain is calculated as the sum of the gain of a single element and the beamforming gain of a multi-antenna array. The single element antenna gain is determined by the azimuth angle (φ) and the elevation angle (θ) between the transmitter and receiver and is calculated as follows: where G E,max represents the maximum antenna gain of a single element, A E,H (φ) represents the horizontal radiation pattern calculated using Equation (2), and A E,v (θ) represents the vertical radiation pattern calculated using Equation (3).
Here, φ 3dB represents the horizontal 3 dB beamwidth of a single element, and A m represents the front-to-back ratio. Similarly, θ 3dB represents the vertical 3 dB beamwidth of a single element, and SLA v represents the vertical sidelobe attenuation limit.
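The single-element pattern of Equations (1)-(3) can be sketched as follows, using the standard ITU-R M.2101 expressions; the parameter defaults (5 dBi element gain, 65° beamwidths, 30 dB limits) are illustrative assumptions rather than values from the paper.

```python
def element_gain_db(phi_deg, theta_deg, g_e_max=5.0,
                    phi_3db=65.0, theta_3db=65.0, a_m=30.0, sla_v=30.0):
    """Single-element gain per ITU-R M.2101, Equations (1)-(3) (sketch).

    phi_deg: azimuth offset from boresight; theta_deg: elevation angle
    measured from zenith (90 deg corresponds to boresight in M.2101).
    """
    # Horizontal radiation pattern, Eq. (2)
    a_h = -min(12.0 * (phi_deg / phi_3db) ** 2, a_m)
    # Vertical radiation pattern, Eq. (3)
    a_v = -min(12.0 * ((theta_deg - 90.0) / theta_3db) ** 2, sla_v)
    # Combined single-element gain, Eq. (1)
    return g_e_max - min(-(a_h + a_v), a_m)
```

At boresight (φ = 0°, θ = 90°) both attenuation terms vanish and the element gain equals g_e_max; far off axis the attenuation is capped by the front-to-back ratio A m.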
The transmitting antenna gain of the HAPS is calculated using the antenna arrangement and spacing, as well as the target beamforming direction. The gain for beam i is calculated as follows: where N H and N V represent the number of antennas in the horizontal and vertical directions, respectively. v n,m is the superposition vector that overlaps the beams of the antenna elements, which is calculated using Equation (5), and w i,n,m is the weight that directs the antenna element in the beamforming direction, which is calculated using Equation (6).
Here, d H and d V represent the intervals between the horizontal and vertical antenna arrays, respectively, and λ represents the wavelength.
Here, φ i,escan and θ i,etilt represent the φ and θ of the main beam direction, respectively. The 1st layer cell of the HAPS uses a 2 × 2 antenna array, and the 2nd layer cell uses a 4 × 2 antenna array. Figure 4 shows the antenna pattern of the 1st layer cell, and Figure 5 shows the antenna pattern of the 2nd layer cell.
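Under the same M.2101 conventions, the composite beam gain of Equations (4)-(6) can be sketched as follows; the element gain, array size, and half-wavelength spacing are illustrative assumptions.

```python
import numpy as np

def array_gain_db(phi, theta, phi_escan, theta_etilt,
                  n_h=4, n_v=2, d_h=0.5, d_v=0.5, element_db=5.0):
    """Composite beam gain per ITU-R M.2101, Equations (4)-(6) (sketch).

    phi/theta: direction of interest; phi_escan/theta_etilt: main-beam
    steering angles (all in degrees). d_h/d_v are element spacings in
    wavelengths, so the 2*pi*d terms below are already normalized by lambda.
    """
    p, t = np.radians(phi), np.radians(theta)
    ps, tt = np.radians(phi_escan), np.radians(theta_etilt)
    n = np.arange(n_v)[:, None]   # vertical element index
    m = np.arange(n_h)[None, :]   # horizontal element index
    # Superposition vector, Eq. (5)
    v = np.exp(1j * 2 * np.pi * (n * d_v * np.cos(t)
                                 + m * d_h * np.sin(t) * np.sin(p)))
    # Beamforming weights, Eq. (6), normalized by sqrt(N_H * N_V)
    w = np.exp(1j * 2 * np.pi * (n * d_v * np.sin(tt)
                                 - m * d_h * np.cos(tt) * np.sin(ps)))
    w /= np.sqrt(n_h * n_v)
    return element_db + 10 * np.log10(np.abs(np.sum(w * v)) ** 2)
```

With the beam steered onto the observation direction, the beamforming term reaches its maximum of 10·log10(N H · N V), here about 9 dB for the 4 × 2 array.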


Interfered System Model
Various interfered systems, e.g., fixed, mobile, and radiolocation services, can be considered for the interference scenario involving a HAPS. We adopted a ground IMT base station (BS) as the interfered system, referring to the potential interference scenario [6]. The antenna pattern of the interfered system was applied by referring to Recommendation ITU-R F.1336 [20]. The receiving antenna gain is calculated as follows: where G 0 represents the maximum gain in the azimuth plane; G hr (x h) represents the relative reference antenna gain in the azimuth plane in the normalized direction of (x h, 0), which is calculated using Equation (8); and G vr (x v) represents the relative reference antenna gain in the elevation plane in the normalized direction of (0, x v), which is calculated using Equation (9). R represents the horizontal gain compression ratio when the azimuth angle is shifted from 0° to φ, which is calculated using Equation (10).
Here, x h and λ kh are given by Equations (11) and (12), respectively; φ 3 represents the 3 dB beamwidth in the azimuth plane; and k h is an azimuth pattern adjustment factor based on the leaked power. The relative minimum gain G 180 was calculated using Equation (13).
Returning to Equation (9), x v is given by Equation (14), and the 3-dB beamwidth in the elevation plane θ 3 is calculated using Equation (15), where G 0 represents the maximum gain in the azimuth plane. In addition, x k is calculated using Equation (16), where k v is an elevation pattern adjustment factor based on the leaked power. λ kv was calculated using Equation (17), and the attenuation inclination factor C was calculated using Equation (18). Figure 6 shows the antenna pattern of the interfered system calculated using Equation (7), which is the pattern for a typical terrestrial BS with a broad beamwidth in the azimuth plane but a narrow beamwidth in the elevation plane.

Path Loss Model
The path loss model of Recommendation ITU-R P.619 [21] was applied, as in the working document for the HAPS coexistence study performed in preparation for WRC-23 [22]. The total path loss that occurs when the HAPS signal reaches the UE and the IMT BS is expressed as follows: where FSL represents the free-space path loss calculated using Equation (20), which occurs on a straight path from a transmitting antenna to a receiving antenna in a vacuum, and A xp is the depolarization attenuation, assumed to be 3 dB. A g represents the attenuation due to atmospheric gases, and A bs represents the beam-spreading loss due to the spreading of the antenna beam. A g and A bs were calculated using the formulae in P.619.
Here, f represents the carrier frequency (in GHz), and d represents the distance (in km) between the transmitter and receiver.
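A minimal sketch of Equations (19) and (20), assuming the standard free-space expression with f in GHz and d in km; the gaseous and beam-spreading terms are placeholders here, since the paper computes them with the formulae of P.619.

```python
import math

def free_space_loss_db(f_ghz, d_km):
    """Free-space path loss (Eq. (20)) with f in GHz and d in km."""
    return 92.45 + 20.0 * math.log10(f_ghz * d_km)

def total_path_loss_db(f_ghz, d_km, a_g_db=0.1, a_bs_db=0.0, a_xp_db=3.0):
    """Total loss of Equation (19): FSL plus gaseous attenuation A_g,
    beam-spreading loss A_bs, and depolarization attenuation A_xp
    (the last fixed at 3 dB in the paper). A_g and A_bs defaults are
    illustrative placeholders, not P.619 results."""
    return free_space_loss_db(f_ghz, d_km) + a_g_db + a_bs_db + a_xp_db
```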

Calculation of Downlink SINR and INR

Calculation of Downlink SINR
The signal received by the UE from the HAPS transmission for the ith cell (Cell_i) is calculated as follows: where P Cell_i represents the HAPS transmission power for Cell_i, G Cell_i represents the transmitting antenna gain of Cell_i, G p represents the polarization gain, G r,UE represents the receiving antenna gain, and L ohm represents the ohmic loss. The UE receives signals from all N cell cells and considers the remaining signals (except for the strongest signal, from Cell_j) as interference. Equation (22) is used to calculate the signal and interference, and the receiver noise is calculated using Equation (23).
Here, k and T represent the Boltzmann constant and noise temperature, respectively, and BW represents the channel bandwidth. N f represents the noise figure. Finally, the downlink SINR is calculated as follows:
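The noise and SINR calculation of Equations (23) and (24) can be sketched as follows; the bandwidth, noise figure, and temperature defaults are illustrative assumptions.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant k, J/K

def noise_power_dbm(bw_hz, noise_figure_db=5.0, temp_k=290.0):
    """Receiver noise of Equation (23): kTB in dBm plus the noise figure N_f.
    The 5 dB noise figure and 290 K temperature are illustrative values."""
    ktb_dbm = 10.0 * math.log10(BOLTZMANN * temp_k * bw_hz * 1e3)  # W -> mW
    return ktb_dbm + noise_figure_db

def downlink_sinr_db(signal_dbm, interference_dbm_list, noise_dbm):
    """Equation (24): strongest-cell signal over summed interference plus noise."""
    lin = lambda x_dbm: 10.0 ** (x_dbm / 10.0)
    denom_mw = sum(lin(i) for i in interference_dbm_list) + lin(noise_dbm)
    return signal_dbm - 10.0 * math.log10(denom_mw)
```

For a 10 MHz channel this gives the familiar thermal floor of about −104 dBm before the noise figure is added.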

Calculation of INR
The interference power received by the interfered receiver from the HAPS transmitter serving Cell_i is calculated as follows: where G r,V represents the antenna gain of the interfered receiver. The aggregated interference power at the interfered receiver is calculated as follows: Finally, after converting the aggregated interference into INR form in accordance with Equation (27) and comparing it with the protection criterion (INR th) of the interfered receiver, it is possible to check whether the interfered receiver is protected from HAPS interference.
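The aggregation and protection check of Equations (26) and (27) can be sketched as follows, with the −6 dB INR th used later in the paper as the default criterion.

```python
import math

def inr_db(interference_dbm_per_cell, noise_dbm):
    """Equations (26)-(27): aggregate the per-cell interference received by
    the interfered receiver and express it relative to its noise floor."""
    agg_mw = sum(10.0 ** (i / 10.0) for i in interference_dbm_per_cell)
    return 10.0 * math.log10(agg_mw) - noise_dbm

def receiver_protected(interference_dbm_per_cell, noise_dbm, inr_th_db=-6.0):
    """Protection check against the criterion INR_th (-6 dB in the paper)."""
    return inr_db(interference_dbm_per_cell, noise_dbm) <= inr_th_db
```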

Problem Formulation
To satisfy the INR th of the interfered system, the transmission power of the HAPS must be reduced. However, as the power of the HAPS is reduced, the η of the UE decreases, and the outage probability P out increases. Thus, the objective of this study was to find a HAPS transmission power set for each cell, i.e., P = {P Cell_i | i = 1, · · · , N cell}, that satisfies the INR th of the interfered system while minimizing P out. The optimization problem of the HAPS transmission power can be formulated as follows: where N UE,o (P) represents the number of UEs that do not satisfy the minimum required SINR η o for a given HAPS transmission power set P.

Proposed Algorithm
To control the HAPS transmission power, it is necessary to independently determine the power level of each cell. Accordingly, the total number of HAPS transmission power sets is N p raised to the power N cell, which grows exponentially as the number of selectable powers N p increases linearly. Although an exhaustive search algorithm can be used to find optimal solutions, this incurs excessive complexity and a long computation time. To solve this problem, we propose a DQL-based power optimization algorithm that can find a near-optimal P with low complexity. In the proposed DQL model, each agent functions as the power controller of a cell; accordingly, the number of agents is N cell.
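For the configuration used later in the paper (seven cells, five selectable power levels per cell), the size of the search space works out as follows:

```python
n_cell, n_p = 7, 5          # seven cells, five selectable power levels each
n_combinations = n_p ** n_cell
print(n_combinations)       # number of power sets an exhaustive search must evaluate
```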
The agent, the subject of learning, trains a deep neural network called a Deep Q-Network (DQN) and selects actions using this network. DQL is an improved Q-learning method. Q-learning selects the best action in a specific state through a Q-table of state-action pairs. As the state-action space grows in Q-learning, creating the Q-table and finding the best policy become highly complex. In addition, the use of Q-learning is limited because learning in the Q-table format becomes more complex when multiple agents are used. In contrast, DQL is a promising way to overcome this curse of dimensionality by approximating the Q-function with a deep neural network instead of a Q-table. To solve the multi-agent problem, the proposed algorithm uses a method in which each agent learns a policy based on its own observation and action while treating all other agents as part of the environment.
The basic DQL parameters (state, action, and reward) are presented below. Each agent learns its policy independently using the training data at each timestep t. The state space of the mth agent comprises the set of (N cell − 1) interferences that the agent causes to UEs located at the centers of the other cells, together with the agent's interference to the interfered receiver.

Two power sets configure the action space of an agent: A 1 = {29, 31, 33, 35, 37} and A 2 = {26, 28, 30, 32, 34} (unit: dBm). The agent of Cell_1 in the 1st layer selects an action from A 1, and the agents of the 2nd layer cells select an action from A 2. All agent actions are initialized to the minimum power value at the beginning of the learning process to minimize the interference to the interfered receiver.

The reward is calculated as follows. First, because the interfered receiver must be safe from HAPS interference, an agent receives a fixed r t of −100 (a highly deficient value) when INR > INR th. In contrast, for INR ≤ INR th, an agent receives an r t computed according to the lower 5% downlink SINR of each cell {η i | i = 1, 2, · · · , N cell} and the required SINR η o.

Figure 7 shows the structure of the proposed DQL-based HAPS transmission power control algorithm. Each agent learns its own DQN, and one DQN consists of the main network, the target network, and the replay memory. The main network estimates the Q-value Q(s, a; w) corresponding to a state-action pair through a deep neural network with weights w. The main network is a fully connected network consisting of an input layer with seven neurons, a hidden layer with 24 neurons, and an output layer with five neurons. w is updated at every t in the direction that minimizes the loss function L(w) = E[(y j − Q(s, a; w))^2].
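A sketch of the reward described above; the fixed −100 penalty and the 5% SINR inputs follow the text, but the exact form of the reward for INR ≤ INR th (here, a sum of clipped SINR margins) is an assumption, since the paper gives it in an equation not reproduced in this text.

```python
def reward(inr_db, inr_th_db, eta_db_per_cell, eta_o_db):
    """Sketch of the per-timestep reward r_t.

    A fixed penalty of -100 enforces the protection criterion. Otherwise
    the reward is driven by how far each cell's lower-5% SINR (eta) sits
    below the required SINR eta_o; the clipped-margin sum below is a
    hypothetical stand-in for the paper's exact expression.
    """
    if inr_db > inr_th_db:
        return -100.0
    return sum(min(eta - eta_o_db, 0.0) for eta in eta_db_per_cell)
```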
The target network calculates the target value y j = r j + γ max a′ Q̂(s′, a′; w−), where γ is the discount factor; s′ and a′ denote the state and action, respectively, in the next step; and Q̂(s′, a′; w−) is the Q-value estimated through the target network with weights w−. The agent's transition tuple (s t, a t, r t, s t+1) is stored in the replay memory, from which a minibatch (512 tuples) is randomly sampled at each step. The minibatch data are used to compute the target value y j. In DQL, learning is stabilized and the learning performance is improved through the replay memory and the separate target network [23].

Algorithm 1 describes the proposed DQL-based HAPS transmission power control algorithm. For DQN training, N was set as 100,000, and the minibatch size was set as 512. M was set as 500, and T was set as 10. The Adam optimizer was used to minimize L(w), and the learning rate and γ were 0.01 and 0.995, respectively. An ε-greedy policy was used to balance exploration and exploitation; ε was initially set as 1 and was reduced by 0.01 for every episode.

Algorithm 1: DQL-based HAPS transmission power control
1: Initialize the replay memory D to capacity N
2: Initialize the Q-function with random weights w
3: Initialize the target Q̂-function with the same weights: w− = w
4: for episode = 1, M do
5: Initialize the action a 0 = min A
6: for timestep t = 1, T do
7: if t = 1
8: Calculate s t via Equations (21) and (25)
9: end if
10: With probability ε, select a random action a t
11: Otherwise, select a t = argmax a Q(s t, a; w)
12: Assign the selected power to the mth cell and compute INR and η
13: Observe the reward r t and s t+1
14: Store the experience (s t, a t, r t, s t+1) in D
15: Sample a random minibatch of experiences from D
16: Set y j = r j + γ max a′ Q̂(s′, a′; w−)
17: Perform optimization on L(w) and update w
18: Update the target network Q̂ with w− = w every 4 steps
19: end for
20: end for

A DDQL is a reinforcement learning algorithm designed to remedy the performance degradation caused by the overestimation of the DQL. The action value can be overestimated by the maximization step in line 16 of Algorithm 1. Therefore, the DDQL calculates the target value as y j = r j + γ Q̂(s′, argmax a′ Q(s′, a′; w); w−) to eliminate the maximization step. The DDQL-based HAPS power control algorithm proceeds in the same way as Algorithm 1 except for the calculation of the target value.
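The difference between the two target values can be sketched as follows; `q_next_main` and `q_next_target` are hypothetical per-action Q-value vectors for the next state.

```python
import numpy as np

def dqn_target(r, gamma, q_next_target):
    """Standard DQL target (line 16 of Algorithm 1): max over the target network."""
    return r + gamma * np.max(q_next_target)

def ddqn_target(r, gamma, q_next_main, q_next_target):
    """Double DQL target: the main network selects the next action, and the
    target network evaluates it, removing the maximization bias."""
    a_star = int(np.argmax(q_next_main))
    return r + gamma * q_next_target[a_star]
```

When the two networks disagree on the best next action, the DDQL target is no larger than the DQL target, which is exactly how it curbs overestimation.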

Simulation Configuration
The simulation was conducted using MATLAB for three positions of the interfered receiver, and the learning order of the agents was randomly set for each t. Subsequently, the simulation proceeded according to Algorithm 1. When all M episodes were finished, the simulation ended, and the set P c composed of the power selected by each agent was obtained as the simulation result. Finally, the performance of the simulation was verified by comparing P c with the optimal power set P * obtained via an exhaustive search algorithm considering all combinations of the N p power levels over the N cell cells. The total elapsed time of the DQL and the exhaustive search was about 7500 s and 21,000 s, respectively. The total elapsed time of the exhaustive search increases exponentially with the number of cells and power levels, but that of the DQL does not. Therefore, the computational efficiency of the DQL becomes more remarkable as the number of cells and power levels increases. In this simulation, a performance comparison with the DDQL was additionally performed to check for performance degradation due to overestimation of the DQL.
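The exhaustive baseline can be sketched with the two action sets A 1 and A 2 from the paper; `evaluate` is a hypothetical stand-in for the full INR and outage computation.

```python
from itertools import product

# Action sets from the paper: Cell_1 draws from A1, Cells 2-7 from A2 (dBm).
A1 = [29, 31, 33, 35, 37]
A2 = [26, 28, 30, 32, 34]

def exhaustive_search(evaluate, inr_th=-6.0):
    """Return the power set with the lowest outage among those meeting INR_th.

    `evaluate(p)` must return (inr_db, p_out) for a 7-tuple power set; here
    it is a placeholder for the SINR/INR simulation described in the paper.
    """
    best, best_pout = None, float("inf")
    for p in product(A1, *[A2] * 6):   # 5**7 = 78,125 candidates
        inr, pout = evaluate(p)
        if inr <= inr_th and pout < best_pout:
            best, best_pout = p, pout
    return best, best_pout
```

With a toy `evaluate` that always satisfies the criterion and penalizes total power, the search returns the all-minimum power set, as expected.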
We applied the HAPS parameters and interfered system parameters with reference to the working document for the HAPS coexistence study performed in preparation for WRC-23 [18,24]. The simulation parameters of the two systems are presented in Tables 1 and 2, respectively.

Figure 8 shows the SINR maps obtained using P max = {37, 34, 34, 34, 34, 34, 34} and P min = {29, 26, 26, 26, 26, 26, 26} for all cells, that is, with no power control. We considered three positions of the interfered receiver that do not satisfy the INR th of −6 dB when P max is used. In addition, the three locations were chosen considering the representative interference power, which can accurately reflect the operating characteristics of the proposed power control algorithm. Interfered receiver 1 was located in the main beam direction of Cell_3 and received the highest interference from Cell_3; therefore, using the minimum power for Cell_3 alone satisfied the INR th of −6 dB. Interfered receiver 2 was placed on the boundary between Cell_3 and Cell_4 and thus received equal (and the strongest) interference from these two cells. Interfered receiver 3 was also located in the main beam direction of Cell_3; however, using the minimum power for Cell_3 alone could not satisfy the INR th of −6 dB, and at least one other cell had to use less than the maximum power.

Table 3 presents the INR and P out for P max and P min with varying interfered receiver locations. The results confirm that P out and INR have a tradeoff relationship. The same P out is observed regardless of the interfered receiver position because of the absence of power control. Next, we compared the simulation results of the optimal exhaustive search and the proposed DQL-based power control algorithm for the three positions of the interfered receiver.

Simulation Results for Interfered Receiver 1
Figure 9 shows the SINR map based on the P c acquired using the proposed DQL-based power control algorithm for interfered receiver 1. Table 4 presents a performance comparison of the P * values obtained via an exhaustive search and P c, as well as a comparison of the DQL and DDQL results. As shown, P c was equal to the optimal value P *, providing the same P out and INR performance. Because the interfered receiver was located in the azimuth main beam direction of Cell_3, the power of Cell_3 significantly affected the interfered receiver. Even though all other cells used the maximum power, their interference was negligible. Therefore, all the cells except for Cell_3 used the maximum power to minimize P out, as shown in Table 4.

Figure 10 presents the INR and P out for each learning episode. As shown, the INR and P out converged to the optimal values of the exhaustive search algorithm as the number of learning episodes increased. The INR started at −11.01 dB, which was the value for the use of P min, as shown in Table 3, and converged to the optimal value of −6.93 dB. Similarly, P out started at 43.7% and converged to 0.6%. A large variance due to frequent exploration was observed at the beginning of the learning, but it gradually decreased as the learning progressed. Figure 11 presents the cumulative and average rewards for each learning episode. As shown, the reward rapidly increased and then gradually converged at approximately 300 episodes, indicating that the proposed DQL training process allowed the agents to learn the power control policy quickly and stably.

We also compared the learning results of the DQL and DDQL. Even when the DDQL was used, the results were the same as in Table 4 and Figures 10 and 11, which shows that overestimation did not occur in the DQL. As a result, it was confirmed that performance degradation due to overestimation did not happen, and sufficient learning is possible with the DQL alone.

Simulation Results for Interfered Receiver 2
Figure 12 shows the SINR map based on the P c acquired using the proposed DQL-based power control algorithm for interfered receiver 2. Table 5 presents a performance comparison of the P * values obtained via an exhaustive search and P c, as well as a comparison of the DQL and DDQL results. As shown, P c was equal to the optimal value P *, providing the same P out and INR performance. The interfered receiver was located on the boundary between Cell_3 and Cell_4 and thus received equal (and the strongest) interference from these two cells. In addition, even though all the cells other than Cell_3 and Cell_4 used the maximum power, their interference was marginal. Therefore, in the optimal power control, Cell_3 and Cell_4 reduced their power as required to satisfy the INR th, whereas all the other cells used the maximum power to minimize P out, as shown in Table 5.

As shown in Figure 13, the INR and P out converged to the optimal values of the exhaustive search algorithm. Similar to the case of receiver 1, as the learning progressed, the INR converged from −12.08 to −6.08 dB, and P out converged from 43.7% to 0.2%. Figure 14 shows that the reward gradually converged at approximately 300 episodes, indicating that the proposed DQL training process allowed the agents to quickly and stably learn the power control policy. We also compared the learning results of the DQL and DDQL; even when the DDQL was used, the results were the same as in Table 5 and Figures 13 and 14, verifying that the desired learning is attainable with the DQL only.

Simulation Results for Interfered Receiver 3
Figure 15 shows the SINR map based on the P c obtained using the proposed DQL-based power control algorithm for interfered receiver 3. The interfered receiver was located in the azimuth main lobe direction of Cell_3.
It was closer to the HAPS than the receiver considered in Section 5.2.1 and was more severely affected by _3; was not satisfied even for the minimum power of _3. Thus, the optimal power control adjusted the power of _2 and _4, which caused the second-most interference. Table 6 presents a comparison of the * values obtained using an exhaustive search and and a comparison of the DQL and DDQL results. Although the of was 0.6% higher than that of * , it corresponded to the third-smallest value among the 78,125 values generated by the exhaustive search algorithm. In summary, the proposed power control algorithm achieved outstanding performance close to the optimal value.  Figure 15 shows the SINR map based on P c obtained using the proposed DQL-based power control algorithm for interfered receiver 3 . The interfered receiver was located in the azimuth main lobe direction of Cell_3. It was closer to the HAPS than the receiver considered in Section 5.2.1 and was more severely affected by Cell_3; I NR th was not satisfied even for the minimum power of Cell_3. Thus, the optimal power control adjusted the power of Cell_2 and Cell_4, which caused the second-most interference. Table 6 presents a comparison of the P * values obtained using an exhaustive search and P c and a comparison of the DQL and DDQL results. Although the p out of P c was 0.6% higher than that of P * , it corresponded to the third-smallest value among the 78,125 values generated by the exhaustive search algorithm. In summary, the proposed power control algorithm achieved outstanding performance close to the optimal value.  As shown in Figure 16, the and converged to the optimal values of the exhaustive search algorithm, with slight gaps. Similar to the results presented in Section 5.2.1, as the learning progressed, the converged from −6.19 to −6.06 dB, and the converged from 43.7% to 5.7%. Figure 17 shows the cumulative and average rewards for each learning episode. 
The reward exhibited no noticeable improvement until approxi- Figure 15. SINR map based on P c obtained using the proposed DQL-based power control algorithm for interfered receiver 3 . Table 6. Performance comparison for interfered receiver 3 . P Cell 1 (dBm) P Cell 2 (dBm) P Cell 3 (dBm) P Cell 4 (dBm) P Cell 5 (dBm) P Cell 6 (dBm) P Cell  As shown in Figure 16, the I NR and p out converged to the optimal values of the exhaustive search algorithm, with slight gaps. Similar to the results presented in Section 5.2.1, as the learning progressed, the I NR converged from −6.19 to −6.06 dB, and the p out converged from 43.7% to 5.7%. Figure 17 shows the cumulative and average rewards for each learning episode. The reward exhibited no noticeable improvement until approximately 130 episodes, after which it rapidly increased and then gradually converged at approximately 350 episodes. This is because to satisfy the I NR th , more agents had to take action, and the actions had to be more diverse. Nonetheless, the proposed DQL training process allowed the agent to learn the power control algorithm quickly and stably. We compared the learning results of the DQL and DDQL. Even when the DDQL was used, the results were the same as in Table 6 and Figures 16 and 17, verifying that the desired learning is attainable with the DQL only.
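The overestimation risk examined in these comparisons stems from the max operator in the standard DQL target, which both selects and evaluates the next action with the same network; the DDQL decouples action selection (online network) from action evaluation (target network). A minimal NumPy sketch of the two target computations, with illustrative Q-value vectors standing in for network outputs (all numbers are hypothetical, not values from the paper):

```python
import numpy as np

def dql_target(r, gamma, q_target_next):
    # Standard DQL: the same (target) network both selects and evaluates
    # the next action -> the max operator can overestimate action values.
    return r + gamma * np.max(q_target_next)

def ddql_target(r, gamma, q_online_next, q_target_next):
    # Double DQL: the online network selects the action,
    # and the target network evaluates it.
    a_star = int(np.argmax(q_online_next))
    return r + gamma * q_target_next[a_star]

# Illustrative Q-values over five power-level actions (hypothetical numbers).
q_online_next = np.array([0.2, 0.9, 0.4, 0.1, 0.3])
q_target_next = np.array([0.3, 0.7, 1.1, 0.0, 0.2])

y_dql = dql_target(1.0, 0.9, q_target_next)                    # 1.0 + 0.9*1.1 = 1.99
y_ddql = ddql_target(1.0, 0.9, q_online_next, q_target_next)   # 1.0 + 0.9*0.7 = 1.63
```

The DDQL target can only be lower than or equal to the DQL target for the same networks; the fact that both variants learned the same policy in Tables 4–6 suggests that overestimation was not significant for this action space.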

Conclusions
This paper proposed a DQL-based transmission power control algorithm for multicell HAPS communication that involved spectrum sharing with existing services. The proposed algorithm aimed to find a solution to the power control optimization problem for minimizing the outage probability of the HAPS downlink under the interference constraint to protect existing systems. We compared the solution with the optimal solution acquired using the exhaustive search algorithm. The simulation results confirmed that the proposed algorithm was comparable to the optimal exhaustive search.
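The exhaustive-search baseline enumerates every combination of candidate power levels; with seven cells and five levels per cell (consistent with the 78,125 = 5^7 combinations reported in the results), it keeps the vector with the lowest outage probability among those satisfying the interference constraint. A minimal sketch of such a search, where the power grid, threshold, and the toy_inr/toy_pout models are illustrative stand-ins rather than the paper's link-budget simulation:

```python
import itertools

POWER_LEVELS_DBM = [20, 25, 30, 35, 40]  # hypothetical 5-level power grid
NUM_CELLS = 7
INR_TH_DB = -6.0                         # example interference threshold

def toy_inr(p):
    # Placeholder aggregate-interference model, linear in total power (hypothetical).
    return -30.0 + sum(p) / 10.0

def toy_pout(p):
    # Placeholder outage model: more total power -> lower outage (hypothetical).
    return 1.0 - sum(p) / (NUM_CELLS * max(POWER_LEVELS_DBM))

def exhaustive_search():
    best_p, best_pout = None, float("inf")
    for p in itertools.product(POWER_LEVELS_DBM, repeat=NUM_CELLS):
        if toy_inr(p) > INR_TH_DB:  # violates the incumbent-protection constraint
            continue
        if toy_pout(p) < best_pout:
            best_p, best_pout = p, toy_pout(p)
    return best_p, best_pout

n_vectors = len(POWER_LEVELS_DBM) ** NUM_CELLS  # 5**7 = 78,125 candidate vectors
```

This brute-force enumeration is tractable here only because the action space is small; its exponential growth in the number of cells is precisely what motivates the learned power control policy.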
Future work will include supporting more transmission power levels and extending the framework to multiple-HAPS communication with spectrum sharing among multiple interfered systems. Because increasing the number of power levels enlarges the action space and could expose the limits of a value-based algorithm, a policy-based algorithm would be preferable in that case. Given that multiple-HAPS communication could introduce the non-stationarity problem of multi-agent reinforcement learning, solutions to this problem are worth studying.
