Deep Reinforcement Learning-Based Resource Management in Maritime Communication Systems

With the growing maritime economy, ensuring the quality of communication for maritime users has become imperative. The maritime communication system based on nearshore base stations enhances the communication rate of maritime users through dynamic resource allocation. A virtual queue-based deep reinforcement learning beam allocation scheme is proposed in this paper, aiming to maximize the communication rate. More particularly, to reduce the complexity of resource management, we employ a grid-based method to discretize the maritime environment. For the combinatorial optimization problem of grid and beam allocation under unknown channel state information, we model it as a sequential decision process of resource allocation. The nearshore base station is modeled as a learning agent, continuously interacting with the environment to optimize beam allocation schemes using deep reinforcement learning techniques. Furthermore, we guarantee that grids with poor channel state information can be serviced through the virtual queue method. Finally, the simulation results provided show that our proposed beam allocation scheme is beneficial in terms of increasing the communication rate.


Introduction
The sixth-generation (6G) wireless communication aims to expand network coverage and improve network performance [1,2].Maritime communication, as an important component of wireless communication, has received increasing attention with the growing maritime economy.
In general, maritime communication is composed of satellite communication and communication based on nearshore base stations (BSs).Satellite communication systems, such as the Global Maritime Distress and Safety System (GMDSS), Iridium system, and International Maritime Satellite System (Inmarsat) [3][4][5] can cover the large maritime environment, meeting the communication requirements of maritime users.However, the high cost and latency of satellite communication are the main challenges faced in maritime satellite communication [6].Communication systems based on nearshore BSs can be integrated with terrestrial communication systems, which effectively reduces cost and latency.Nevertheless, compared to terrestrial communication, maritime communication is subject to multiple factors.The refractive index fluctuation caused by the uneven atmospheric pressure and temperature, namely turbulence, reduces the performance of the communication system [7], and, due to the lack of scatterers in the vast maritime environment, the scattering of electromagnetic waves affects communication performance [8].Furthermore, maritime communication users exhibit locally dense, and overall sparse, distribution characteristics.These factors make it inappropriate to directly address the communication requirements of maritime users through traditional terrestrial communication.
Currently, there are some nearshore communication systems and networks designed for maritime communication.The Long Term Evolution (LTE)-Maritime project aims to meet the communication requirements of maritime users using ground infrastructure [9]; it can support a high data rate while providing coverage around 100 km from BSs.The mesh TRITON network, based on IEEE 802.16, focuses on dense wireless mesh networks in maritime nearshore areas for maritime users [10].The nearshore communication systems, such as the navigation telex (NAVTEX) system and the automatic identification system (AIS) provide services for information broadcasting, voice, and ship identification [11].However, the above communication systems and networks are merely direct applications of terrestrial communication systems in maritime environments.At present, there is still a lack of communication schemes based on the characteristics of the maritime environment and maritime users.To solve the above problems, beamforming technology can provide directional coverage, ensuring communication quality for users within the coverage area [12,13].
In maritime communication systems, beamforming technology can be used to solve the issue of communication distance and improve user communication quality [14,15]; it can further enhance communication system performance through its combined application with non-orthogonal multiple access (NOMA) technologies [16].However, further research is needed on how to manage beam resources, and when and where to make beam management decisions.On the one hand, maritime communication faces challenges such as long distances between BSs and users, significant transmission delays, and high interference [17].It is difficult to obtain the CSI in the maritime environment [18], and the traditional communication resource allocation schemes based on CSI are difficult to employ in the maritime environment [19].On the other hand, there is still a lack of relevant research on the distribution characteristics of communication users in maritime environments.So, it is necessary to design an efficient beam management scheme to cater to the characteristics of the maritime environment and users.
The essence of a beam management scheme is essentially a combinatorial optimization problem; its computational complexity increases continuously with the growth of the dimension of combinatorial optimization states.Traditional methods are not applicable when the state space is large.Reinforcement learning (RL) can solve this problem by structuring combinatorial optimization as a sequential decision process.It continuously interacts with the environment and updates iterations based on real-time data, optimizing the choices made.RL has been widely applied in the scenario of resource allocation in maritime environments [20,21].In a rapidly changing vehicular environment, Liang et al. [22] effectively improved the transmission rate in the end-to-end communication link.It has been shown that multi-agent RL can achieve significant performance gains by allocating appropriate resources in the face of uncertain environments [23].
Nevertheless, Shi et al. [24] also point out that, as the number of serviced users increases, which corresponds to the growth in the state space of RL, there is a higher requirement for online devices, which will reduce the efficiency of service devices.Meanwhile, due to the dynamic nature of the maritime environment, it is challenging to fully describe the state information of all communication users using a Q-network in RL.Hence, we consider using the method of deep reinforcement learning (DRL) to address the resource allocation problem.DRL as an effective method has achieved great performance in resource management [25,26].In [27], Hu et al. propose a DRL framework model to address the decision problem of dynamic resource allocation in satellite communication.In [28], Qian et al. effectively reduced the total energy consumption of the entire maritime Internet of Things network by utilizing relay devices for resource allocation through a DRL framework.All of these papers demonstrate the effectiveness of using a DRL method for resource allocation in communication networks.Furthermore, in order to further reduce the complexity of communication systems, an optimized algorithm using a grid-based method to tackle the resource allocation management problem is proposed in [29].Similarly, we can divide the sea area covered by the BS's transmitted signal into multiple grids.
In this paper, a virtual queue-based deep reinforcement learning (VQDRL) beam allocation scheme for maritime communication systems is proposed, wherein the CSI is unknown, aiming to maximize the communication rate of the maritime communication system.Due to the sparse features of maritime users, an optimized algorithm using a grid-based method is employed to tackle the resource allocation management problem.After modeling the maritime communication system and VQDRL framework, we employ neural networks to allocate beam resources.Furthermore, the neural network is trained based on the communication rate of maritime users to obtain the optimal beam allocation scheme.This paper has three main contributions which are as follows: 1.
Due to the complexity of the maritime environment, a grid-based method is adopted to discretize the coverage area, reducing the complexity of resource management.

2.
A VQDRL resources allocation scheme is employed in grids with unknown CSI.By continuously training the neural network to optimize its output, we obtain the most effective beam allocation scheme.

3.
A virtual queue method is employed in maritime communication, it can ensure that grids with poor channel states can be served.
The remainder of this paper is organized as follows.Section 2 introduces the system model.A VQDRL resources allocation scheme is proposed in Section 3, and the simulation results are given in Section 4. Finally, we conclude this paper in Section 5.

System Model
The maritime communication system model is shown in Figure 1.A uniform antenna array with a total of N t = N h t ×N v t antennas is deployed on the base station (BS).There are N h t antennas in the horizontal direction and N v t antennas in the vertical direction.Based on the characteristics of the antennas, a total of N beams are generated.After dividing the maritime environment using the grid-based method, we can get M grids.Now, N = {1, . . .n, . . .N} and M = {1, . . .m, . . .M} represent the set of beams and grids, respec- tively.Here, n and m denote the index of beams and grids.When we obtain the distance d m and horizontal angle θ m from grid m to the BS, we can describe the grid information by using the set s m = {d m , θ m }.And the received signal at the m-th grid with the n-th beam can be expressed as where P t represents the transmit beam power, G m denotes the antenna gain for grid m, α is the path loss coefficient, and h m and w n denote the channel from grid m to the BS and the unit-norm precoding vectors of beam n, respectively, which satisfy Let x m and x a denote the transmission signal from the BS to the grid m and a; moreover, the n 0 ∼ CN 0, 1 is the additive complex white Gaussian noise.

Problem Formulation
From Equation ( 1), when we transmit a signal from the BS to the grid m with beam n, we can get the received SINR of the users at time slot t using the following equation: The communication rate can be expressed as Consequently, the achievable communication sum-rate of the maritime communication system model is given by To maximize the accumulated communication throughput in a period set T = {1, 2, . . ., T}, the optimization problem can be expressed as where R th (t) denotes the minimum achievable communication rate for an m user in time slot t, and K m,n (t) represents whether the communication of beam n for grid m in the current time slot t was successful.Constraints (5a) ensure that the value of K m,n (t) can be either 0 or 1. Constraints (5b) and (5c) denote that a maximum of P beams can be used to serve P grids at any given time slot.Constraint (5d) means that when the current communication rate is greater than the minimum rate or equal to the minimum rate, the value of K m,n (t) is 1, otherwise, the value of K m,n (t) is 0.

Deep Reinforcement Learning Model
For the optimization equation described above, it is impossible to obtain real-time channel state information h m (t) to adjust beam allocation schemes.The beam allocation problem is constructed as an RL system model and the RL algorithm can optimize the action-choosing behavior through massive interactions between the agent and environment.However, when the dimensions of the state space and action space are too large, traditional tabular-based RL algorithms face issues such as being time-consuming.The DRL addresses the limitations of traditional RL algorithms by using deep neural networks to select actions; it updates the weight parameters by minimizing the loss function to optimize the actionchoosing behavior.The structure of the DRL model is shown in Figure 2.

State Definition
In our model, we can obtain the state information of the grid that needs communication, namely as s m = {d m , θ m }.The set of state information can be denoted as S = {s 1 , ...s m , ...s M }.Moreover, the BS can handle communication requirements from P grids at any given communication time slot t, and the system state s(t) ∈ R P×1 ⊆ S.

Action Definition
We define A(s(t)) as the set of available actions under state s(t).For any available action a * , we have a * ∈ A(s(t)).Additionally, we use a(t) to denote the beam chosen scheme at time slot t.

Reward Definition
We use reward r(t) to describe the degree of goodness or badness of taking an action a(t) in the current state s(t).In our model, we use r(t) to denote the condition of the communication rate.Typically, the reward value should be normalized in range r(t) ∈ [0, 1].

Action Selection
To avoid the local optimum of DRL, the resource allocation decision is made by adopting the ϵ-greedy strategy, where the action is randomly selected with a probability of ϵ, while the action with largest action-values is selected with a probability of 1 − ϵ.We use Q(s(t), a(t); θ) to denote the chosen action through the Q-network.Action selection with the ϵ-greedy strategy can be expressed as:

Replay Memory
To alleviate the problems of related data and non-stationary distributions in our system model, a replay memory technique, which randomly samples previous transitions, and, thereby, smooths the training distribution over many past behaviors, is adopted.The experience item is stored in the form of quad-tuples (s(t), a(t), r(t), s(t + 1)).

Loss Calculation
In order to improve the deep neural network performance, we use Q(θ − ) to denote the target network.During Q-network training, we update the weight parameters of the Q-network θ by minimizing the loss function, as given in Equation (3): where the Q target is the target value defined below: γ is the discount factor.

VQDRL Resources Allocation Scheme
In this section, a VQDRL resources allocation scheme is proposed to address the beam resource management problem in the maritime communication system.First, a virtual queue method is introduced to choose the grids of communication [30].Then, the BS allocates the beam to the selected communication grids by using the proposed VQDRL resources allocation scheme, which can maximize the communication rate.

Virtual Queue Method
In communication, we can use DRL to select beam and grid pairs to maximize the communication rate.However, each grid has a different CSI, and directly using DRL to allocate beams may result in some grids with poorer channel conditions being unable to communicate.To address this issue, the method of virtual queue is introduced.By recording communication requirements and queue lengths in different grids, this method can ensure that communication is implemented in grids with poor channel conditions.
When we transmit signals to maritime users with communication requirements, we also need to allocate beam resources as frequently as possible to ensure the communication quality of those users with higher requirements.Meanwhile, we aim to send signals fairly to each maritime area with communication requirements.By recording the requirements of different grids, the virtual queue method can more frequently select grids with higher communication requirements for communication; it also ensures fairness in the communication among grids.
Let r = [r 1 , . . ., r m , . . ., r M ] represents the vector of communication requirements of maritime users within each grid, and, for any grid m, we can calculate the proportion of a user's requirement to the total grid requirements as follows: Now, for each grid m, we create a virtual queue V m to denote the queue length at the beginning of the communication; it can be seen as the accumulated length of communication requirements up to the current communication time slot.The virtual queue length V m (t) is denoted according to the following dynamics: + (10) where x + ≜ max{x, 0}, d m (t − 1) denotes whether the communication rate in the current grid at time slot t − 1 meets the communication rate requirement; it can be denoted as , and η is a coefficient to avoid excessively queue length.At any given time slot t, the BS selects the grid to be served by beam resources and obtains the grid information s m using the following equation:

VQDRL Resources Allocation Scheme
After obtaining the communication grids for the current time slot using the virtual queue method, we aim to maximize the communication rate by applying the DRL algorithm for beam allocation in these grids.However, the obtained grid information cannot be directly used as inputs for the neural network.In this case, the VQDRL resources allocation scheme is proposed to effectively utilize the obtained grid information and employ it as input to the neural network to obtain beam allocation schemes that maximize the communication rate.
According to Equation ( 11), when we obtain the transmitted grid information s m in the current time slot t, the VQDRL algorithm uses s(t) = s m = {d m , θ m } as an input to the neural network to obtain the output, which is the index of the allocated beam.
We can get different Q(s(t), a * ; θ) values through the Q-network after inputting the state information s(t).In order to avoid the local optimum of VQDRL, we choose the action a(t) taken by the agent in the current state to allocate beam based on the probability of ϵ.When a randomly generated value is more than ϵ, we randomly select an action a, otherwise, we choose the action arg max a * Q(s(t), a * ; θ).
After selecting the action a(t) for the current time slot, when the BS sends beams to the grids, the maritime users will get their communication rates R m,n (t).The reward r(t) is used to describe the condition of the communication rate, and we can obtain the next state s(t + 1).It is obvious that when the constraint condition (5d) is satisfied under condition ∑ m,n R m,n (t)K m,n (t) = ∑ m,n R th (t)K m,n (t) = 0, the reward r(t) = 0. Otherwise, we define the function of reward r(t) as follows: To further train our neural network while avoiding the issue of local optimization, we often use the replay memory D to store training data.Let us use the buffer size to describe the maximum amount of data stored in D and store the data set (s(t), a(t), r(t), s(t + 1)) from the communication process into D. When the amount of data set stored is larger than the pre-defined batch size, we can randomly select a batch size of data from D to update the neural network.Otherwise, we continue the aforementioned communication process until the amount of data stored in D exceeds the batch size.
After selecting the data to be used for training, we update the weight parameters of the Q-network through the following process.We typically use the Bellman equation to update the Q-values like Equation ( 13) in RL: where β is the learning rate.Similar to RL, in DRL, based on the Bellman equation, we can obtain the weight update through gradient descent using the loss function.According to Equations ( 7) and ( 8), the gradient of loss function is calculated by calculating parameters as follows: In our model, we used the Adaptive Moment Estimation (Adam) optimizer to solve the gradient descent problem.It can adaptively adjust the learning rate to more effectively update the model's weights.As a hyperparameter, the learning rate has a significant impact on weight updates for different values.In this model, we used a learning rate of 10 −3 to train the model at the beginning of the training period and hyperparameter γ represents the discount factor.
After updating the weights of the Q-network, we determine whether it is time to update the weights of the target network for the current time slot.Typically, the target network parameter weights update every N t step.The parameter weights of the Q-network are assigned to the target network's parameters, completing the update of the neural network.

Simulation Result
In this section, we simulated a maritime environment and deployed an antenna array at the BS, constructing beam resources firstly, then we employed the VQDRL algorithm to obtain the beam allocation schemes and observed the communication rate within the beam coverage area through simulation results, further demonstrating the effectiveness of our algorithm.

Simulation Enviroment Configuration
We deployed Uniform Planar Array (UPA) antenna arrays at the BS, and considered taking the Kronecker product of the Discrete Fourier Transform (DFT) codebook in the horizontal direction and vertical direction by using the Kronecker-product method.The 3D Kronecker-product-based codebook can select the appropriate beam to enhance the channel gain for the grid in both the horizontal and vertical direction [31].It is generated as where m = 0, 1, . .., N v − 1, n = 0, 1, . .., N h − 1, N v , and N h are the number of codewords in the vertical direction and horizontal direction, respectively, ⊗ denotes the Kronecker product, and ξ is a parameter to adjust the proportion which is determined by the maximum downtilt.
According to Equations ( 3) and ( 12), we need to simulate the maritime communication channel model to obtain the communication rate and the reward value to train the neural network.The maritime communication channel is assumed to follow the Rician distribution, expressed as follows: where K is the rice factor, ĥm is the complex Gaussian random variables with zero mean and unit variance, which belong to a set of C 1×N t , and h m is the channel mean vector.When we consider the antenna array arranged in a UPA, the channel mean vector of the m-th grid vector is expressed by [32,33]: where λ is the wavelength, d is the inter-antenna spacing, and θ m is the horizontal angle of the BS to the m-th grid, while ϕ m is the vertical angle of the BS to the m-th grid, which can be calculated based on the distance from the grid to the BS under the condition of determining the height of the BS and the antenna, n h t = 0, 1, . . ., N h t , n v t = 0, 1, . . ., N v t .For the convenience of distinction, the above model parameters and channel transmission parameters are represented in Table 1.Communication rate of grid m with beam n Q(θ); Q(θ − ) Q-network and target network r m

Requirements of maritime users within grid m p m
Communication requirement of grid m V m ; K Virtual queue length of grid m and rice factor

Average Communication Rate Analysis
The effectiveness of the VQDRL algorithm under this model will be proven by analyzing the average communication rate with the entire grid and the average communication rate with different grids.All simulations were conducted on a desktop equipped with an Intel Core i7-10700 2.9 GHz CPU(Intel, Santa Clara, CA, USA), with each iteration of 1 × 10 5 time slots taking approximately 12 min in our model.
We divided the coverage area of the nearshore BS communication into grids of size 10 × 10.The parameters of each grid and model value can be represented as follows in Tables 2 and 3. To observe how the communication rate of the entire maritime environment changes with the communication time slots, we used the average communication rate to show the variation in Figure 3. Here, we employed random beam allocation and round robin beam allocation schemes as comparative simulation results.The scheme that randomly allocates a beam to the current communication grid is called the random beam allocation scheme.Meanwhile, the scheme that allocates a beam to the current grid in ascending order by beam index, and restarts the cycle when the maximum beam index is reached, is called the round robin beam allocation scheme.

7LPHVORW
$YHUDJHFRPPXQLFDWLRQUDWHESV+] 94'5/VFKHPH 5RXQGURELQVFKHPH 5DQGRPVFKHPH It is shown that, as the time slots increase, the difference in average communication rates between the three beam allocation schemes becomes larger.In particular, the round robin beam allocation scheme and random beam allocation scheme do not show a significant difference in the average communication rate for entire maritime grids.Furthermore, with the increase of time slots, the average communication rate for the entire maritime environment under the two schemes essentially remains consistent.Compared to the other two beam allocation schemes, the proposed VQDRL resources allocation scheme significantly improves the average communication rate for the entire maritime environment, and it gradually converges with the increase of the time slot.
To further analyze the average communication rate of the maritime environment, we sampled some grids to observe their average communication rate.In Figure 4, we take two grids, Grid 55 and Grid 95 , as our observation points to observe the average communication rate.The left subplot shows the variation of the average communication rate over time for Grid 55 , while the right subplot shows the variation of the average communication rate over time for Grid 95 .It is shown that, as the time slots increase, the average communication rate of the proposed VQDRL algorithm gradually increases until it stabilizes and converges.Meanwhile, the average communication rates of the two beam schemes in the comparative simulation gradually converge and eventually become consistent as the number of time slots increases.The performance of the random and round robin beam allocation schemes is worse than the performance of the proposed VQDRL algorithm.This trend aligns with the observations in Figure 3.

Awaiting Time per Transmission and Average Virtual Queue Length
In this subsection, we discuss the waiting time per transmission and the average virtual queue length of different grids.Let us set an upper bound on the length of the virtual queue, denoted as V max , for all grids.When the virtual queue length of any grid exceeds this value, we consider the grid to be in a communication waiting state.Here, we set V max = 1, and Figure 5 shows the average waiting time per transmission of different grids over 5 × 10 4 time slots.Meanwhile, we sampled some grids to obtain the average virtual queue length and the confidence interval for the average queue length in Figure 6.The average virtual queue length denotes the growth trend of virtual queues for different grids.It can be calculated by

*ULGLG $ZDLWLQJWLPHSHUWUDQVPLVVLRQ
. The average virtual queue lengths and their respective confidence intervals under multiple simulations for Grid 55 , Grid 53 , and Grid 95 are displayed in red, blue, and green, respectively.The purple line represents the upper bound of the virtual queue, and the grids whose virtual queue length exceeds the line at any time slot are considered to be in a waiting state until they are successfully communicated, reducing the virtual queue length to below that line.We observed that the average virtual queue length under these three different grids increases with the growth of time slots and eventually converges to around 0.6, which is often associated with the ability to communicate in each time slot.

Hyperparameters Analysis
In this subsection, we adjust buffer size, learning rate, and batch size to observe the variation of the average communication rate.
We adjust the buffer size in Figure 7.At the beginning of the simulation, the average communication rate with a buffer size of 256 is significantly higher than that of the other two batch sizes.As the number of time slots increases, the average communication rate of the buffer size 512 gradually increases and converges near the average communication rate of the buffer size 256.The average communication rate of the largest buffer size remained much lower than the aforementioned two buffer sizes throughout the simulation.This is because when the buffer size is too large, selecting data randomly from replay memory may include some outdated data, which affects the effectiveness of the simulation.In the simulation, we can adjust different buffer sizes to improve the average communication rate in the maritime environment.
As shown in Figure 8, when we set the learning rate to 1 × 10 −3 and 1 × 10 −5 , we can see that the convergence curves of the average communication rate are relatively less for both grids.However, when the learning rate is set to 1 × 10 −4 , the average communication rate shows the best performance.This is because a learning rate that is too large or too small can cause the beam selection to get stuck in local optima, resulting in poorer performance.We can set a moderate learning rate value to obtain the best performance.In Figure 9, we adjust the batch size.We can see that when the batch size is 64, the average communication rate performs better compared to the case with a batch size of 32 and 256, as the number of communications increased.This is because, when the batch size is 32, the training data is not sufficient to adequately represent the data features stored in the replay memory, leading to poor performance.And when the batch size is 256, a larger batch size will rely on training data that are too "old" and degrade the convergence performance.Furthermore, a large batch size will consume more time when training neural networks.Therefore, an appropriate batch size plays a crucial role in training neural networks.

Conclusions
In this paper, a VQDRL resources allocation scheme was proposed and investigated in the maritime environment, and we discussed the average communication rate performance.Firstly, a maritime communication model with grid-based partitioning was employed, and we utilized VQDQN for allocating beam resources.Secondly, the average communication rate of all grids and the average communication rates of different grids were simulated and the simulation results demonstrate the effectiveness of our beam allocation scheme.Additionally, we discussed the performance of average waiting per transmission, average virtual queue length, and confidence intervals of different grids.Finally, we adjusted hyperparameters to obtain a better performance on the average communication rate.The simulation results demonstrated that we can construct suitable beam allocation schemes to maximize the communication rate in the maritime environment using our algorithm.Furthermore, further work can be conducted by adopting the objective function of our proposed scheme to analyze various network performance metrics, including energy efficiency and spectrum utilization.Moreover, improving the generalization performance of our proposed scheme requires further investigation.

Figure 3 .
Figure 3.The average communication rate variation of different beam allocation schemes with entire grids.

Figure 4 .
Figure 4.The average communication rate of different beam allocation schemes with different grids.(a) Grid 55 .(b) Grid 95 .

Figure 5 .
Figure 5.The average waiting time slot of different grids.

Figure 6 .
Figure 6.The average virtual queue length over communication time slots.

Figure 7 .Figure 8 .
Figure 7.The average communication rate with different buffer sizes.

Figure 9 .
Figure 9.The average communication rate with different batch sizes.

Table 1 .
Summary of Key Notations.

Table 3 .
Simulation parameters and values.