Hyperparameter Optimization for the LSTM Method of AUV Model Identification Based on Q-Learning

Abstract: An accurate mathematical model is the basis for controlling and estimating the state of an autonomous underwater vehicle (AUV) system, so improving model accuracy is a fundamental problem in the field of automatic control. However, AUV systems are complex, uncertain, and highly nonlinear, and an accurate model is not easy to obtain through traditional modeling methods. In this study, we fit an accurate dynamic AUV model using a long short-term memory (LSTM) neural network. Because hyperparameter values have a significant impact on LSTM performance, it is important to select a good combination of hyperparameters. We use an improved Q-learning reinforcement learning algorithm to select this combination so as to improve the identification accuracy on the validation dataset. To improve the efficiency of action exploration, the improved algorithm chooses the best initial state according to the Q table in each round of learning, which effectively avoids ineffective exploration by the reinforcement learning agent among poorly performing hyperparameter combinations. Finally, experiments on simulated and actual trial data demonstrate that the proposed model identification method can effectively predict kinematic motion data and, more importantly, that the modified Q-learning approach can optimize the hyperparameters of the LSTM network.


Introduction
The ocean occupies most of the earth's surface and is an important area for commercial activities, scientific research, and resource extraction [1,2]. Over the past half-century, oceanographic research has shown that the oceans and seafloor are critical to understanding the planet, and exploring the marine environment provides valuable knowledge for many areas of science and engineering. Autonomous underwater vehicles (AUVs) have been widely used in marine engineering due to their unique advantages. Equipped with various sensors, they have broad applications in scientific, military, and commercial missions, such as deep-sea exploration, cable/pipe tracking, and feature tracking [3].
AUVs are complex, nonlinear, coupled systems, and it is challenging to model them accurately [4,5]. Therefore, studying the nonlinear mechanisms of AUV systems and finding mathematical methods that can express them accurately have become a focus of AUV research, with essential educational and economic value [6-9]. Neural network models have recently been widely used in AUV system model identification. In the identification networks, recursive structures are adopted to acquire dynamic information and improve the communication between neurons [10]. The experimental results show that the method can fit the AUV system. The weight values of the network were tuned using a hybrid algorithm [...] different task requirements.

The bow is generally equipped with an underwater acoustic communication device; the navigation cabin includes an attitude and heading reference system (AHRS), a global positioning system (GPS), a Doppler velocity log (DVL), etc.; the electronic energy cabin includes batteries, industrial computer systems, etc.; and the propulsion system cabin mainly includes the steering gear and thruster motors. The experiments carried out in this paper are based on the "Sailfish" 210 AUV platform, which has a maximum speed of 5 kn (2.5 m/s). Its parameters are shown in Table 1.

The general motion of the AUV is described with two coordinate systems, the body-fixed reference frame (G − xyz) and the earth-fixed reference frame (E − ξηζ) [28]. The translational and rotational motions of the AUV are described in six degrees of freedom (6DOF) as follows:

$$\eta_1 = [x, y, z]^T, \quad \eta_2 = [\varphi, \theta, \psi]^T, \quad \upsilon = [u, v, w, p, q, r]^T, \quad \tau_1 = [X, Y, Z]^T, \quad \tau_2 = [K, M, N]^T$$

where $\eta_1$ and $\eta_2$ refer to the position and orientation of the AUV with respect to the earth-fixed reference frame, $\upsilon$ denotes the translational and rotational speeds with respect to the body-fixed reference frame, and $\tau_1$ and $\tau_2$ refer to the external forces and moments with respect to the body-fixed reference frame. A diagram of the AUV coordinate system is shown in Figure 1.
$x_G, y_G, z_G$: the position of the center of gravity of the AUV. $I_x, I_y, I_z$: the moments of inertia of the AUV. $u, v, w$: velocities along the x-axis, y-axis, and z-axis of the AUV. $p, q, r$: roll, pitch, and yaw angular velocities. $\dot{u}, \dot{v}, \dot{w}, \dot{p}, \dot{q}, \dot{r}$: linear and angular accelerations. $\sum X, \sum Y, \sum Z, \sum K, \sum M, \sum N$: external forces and moments.

The Dynamical Principles of AUV Model Identification
The dynamic model mathematically describes the essential law of the interaction between the AUV and the environment, and it reflects the state transition of the AUV under the action of forces. In this paper, AUV model identification is based on this dynamic model. The external forces (and moments) exerted on the AUV mainly include gravity and buoyancy, hydrodynamic force, rudder force, and thrust force [29]. The force analysis and the model identification principle are detailed below.

The Static Force
The static force of the AUV is generated by gravity P and buoyancy B. The coordinates of the center of buoyancy and the center of gravity in the body-fixed frame are (0, 0, 0) and (0, 0, $z_e$), respectively. The static force in the earth-fixed frame is (0, 0, P − B), which is converted to the body-fixed (motion) coordinate system by Equation (7), where p is the underwater full displacement of the AUV and h represents the depth of the AUV.
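As an illustration of the frame conversion described above, the following sketch rotates the earth-fixed vector (0, 0, P − B) into the body-fixed frame. It assumes the common ZYX (yaw-pitch-roll) Euler convention with the z-axis pointing down; the function name and the numeric values are hypothetical, and the moment terms of Equation (7) are not reproduced.

```python
import math

def static_force_body(P, B, phi, theta):
    """Rotate the earth-fixed static force (0, 0, P - B) into the body frame.

    Assumes the ZYX Euler convention (roll phi, pitch theta, angles in rad).
    Yaw does not appear because the vector is purely vertical.
    """
    net = P - B
    X = -net * math.sin(theta)
    Y = net * math.cos(theta) * math.sin(phi)
    Z = net * math.cos(theta) * math.cos(phi)
    return X, Y, Z

# Hypothetical values: a level vehicle, then a rolled and pitched vehicle.
level = static_force_body(P=1000.0, B=990.0, phi=0.0, theta=0.0)
tilted = static_force_body(P=1000.0, B=990.0, phi=0.3, theta=-0.2)
tilted_norm = math.sqrt(sum(f * f for f in tilted))
```

For a level vehicle the whole net force stays on the body z-axis; for any attitude the magnitude of the rotated vector remains |P − B|, which is a quick sanity check on the rotation.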

Hydrodynamic Force
The hydrodynamic forces of the AUV are usually divided into inertial hydrodynamic forces and viscous hydrodynamic forces, and the interaction between the two is ignored. In infinitely deep, wide, and still water, the hydrodynamic forces of the AUV depend only on its motion and are functions of the motion parameters $u, v, w, p, q, r, \dot{u}, \dot{v}, \dot{w}, \dot{p}, \dot{q}, \dot{r}$. Following the idea of Taylor expansion, the hydrodynamic forces ($X_H, Y_H, Z_H$) and moments ($K_H, M_H, N_H$) are expanded in these parameters to obtain the expressions of the hydrodynamic forces.

Thrust
The thrust generated by the propeller is calculated as follows:

$$T = (1 - t)\,\rho\, n^2 D^4 K_T$$

where n is the propeller rotational speed, D is the propeller diameter, t is the thrust derating factor, ρ is the water density, and $K_T$ is the dimensionless thrust coefficient. $K_T$ is a function of the advance ratio $J = \frac{u(1-w)}{nD}$, which can be approximated as:

$$K_T = k_0 + k_1 J + k_2 J^2$$

where $k_0, k_1, k_2$ are constant coefficients. From the advance ratio formula, $n = \frac{u(1-w)}{DJ}$; substituting the relevant variables into Equation (9) gives the functional relationship between thrust and speed, where L is the body length, $a_T = \mu k_2$
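The thrust relationship above can be sketched numerically as follows. The sketch assumes the standard open-water form T = (1 − t)ρn²D⁴K_T with a quadratic approximation of K_T in the advance ratio J; all coefficient and geometry values are hypothetical and are not taken from the Sailfish AUV.

```python
def advance_ratio(u, w, n, D):
    """Advance ratio J = u (1 - w) / (n D)."""
    return u * (1.0 - w) / (n * D)

def thrust(u, n, D, rho, t, w, k0, k1, k2):
    """Propeller thrust with a quadratic fit of K_T in J (assumed form)."""
    J = advance_ratio(u, w, n, D)
    K_T = k0 + k1 * J + k2 * J ** 2
    return (1.0 - t) * rho * n ** 2 * D ** 4 * K_T

# Hypothetical parameters, for illustration only.
params = dict(D=0.2, rho=1000.0, t=0.1, w=0.2, k0=0.4, k1=-0.2, k2=-0.1)
T_bollard = thrust(u=0.0, n=10.0, **params)   # zero forward speed, J = 0
T_cruise = thrust(u=2.0, n=10.0, **params)    # J = 0.8
```

Expanding the product shows why thrust becomes a quadratic function of u and n: the k_0 term scales with n², the k_1 term with un, and the k_2 term with u².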

Rudder Force
This paper considers an underactuated underwater robot whose rudder force comes from a pair of horizontal and vertical rudders mounted on its tail. When the rudder moves at speed V with angle of attack α, it is subjected to two force components: a lift force $F_L$ perpendicular to the direction of the water flow and a drag force $F_D$ along the direction of the water flow, calculated as follows:

$$F_L = \tfrac{1}{2}\rho\, C_L A_R V^2, \qquad F_D = \tfrac{1}{2}\rho\, C_D A_R V^2$$

where ρ is the water density, $C_L$ is the lift coefficient, $C_D$ is the drag coefficient, and $A_R$ is the cross-sectional area of the rudder.
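A minimal numerical sketch of the rudder lift and drag follows, assuming the standard dynamic-pressure form (force = ½ρ · coefficient · A_R · V²). The coefficient values are hypothetical; in practice C_L and C_D depend on the angle of attack α.

```python
def rudder_forces(V, C_L, C_D, A_R, rho=1000.0):
    """Return (lift, drag) on a rudder moving at speed V through water."""
    dynamic_pressure = 0.5 * rho * V ** 2
    lift = dynamic_pressure * C_L * A_R
    drag = dynamic_pressure * C_D * A_R
    return lift, drag

# Hypothetical coefficients for one fixed angle of attack.
lift, drag = rudder_forces(V=2.0, C_L=0.5, C_D=0.1, A_R=0.02)
```

Both components grow with the square of the speed, so doubling V quadruples the rudder force for the same coefficients.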

The Principle of Identification
T is used to represent the static force, thrust force, rudder force, and disturbance force. Using the superposition principle, we obtain the AUV force expression, Equation (13). Substituting Equation (13) into Equation (6) and simplifying yields the equation of motion. It can be seen from the above analysis that the acceleration of the AUV results from the combined action of the non-inertial hydrodynamic force and the forces other than the hydrodynamic force, denoted as $\dot{X} = f(u, v, w, p, q, r, n, \delta_r, \delta_s)$. More specifically, the acceleration results from the non-inertial hydrodynamic force $F_{vis} = f_H(u, v, w, p, q, r)$, the thrust force $F_T = f_T(u, n)$, and the rudder force.

Based on the above analysis of the AUV model, the AUV states are denoted as $X_t = [x, y, z, \varphi, \theta, \psi, u, v, w, p, q, r]$, and the state changes can be expressed as $\Delta X_t = f_1(\varphi, \theta, \psi, u, v, w, p, q, r, n, \delta_r, \delta_s)$. The historical data of the AUV imply the causal relationship of its dynamic model and have the characteristics of a hidden Markov model, so they can be used to build a data-driven AUV model.

AUV Model Identification Method
AUV system identification is essentially a mathematical modeling method. Its primary purpose is to build a mathematical model of the AUV that can be used for controller design, system prediction, system simulation, and similar tasks. As shown above, the AUV system is complex, uncertain, and highly nonlinear, and it is not easy to obtain an accurate dynamic model. We address this problem with a data-driven model identification method based on the LSTM neural network. Further, the Q-learning method is used to optimize the hyperparameters and thereby improve the learning efficiency of the LSTM neural network.

Fundamentals of System Identification
Identification based on a neural network means that the neural network is used directly to learn the mapping between input and output. The learning criterion is to minimize the error between the network's output and the system's actual output. The goal of learning is therefore to minimize an objective function of this error [30], where $y_n(t)$ is the output of the neural network at time t and $y(t)$ is the actual output of the system at time t. With sufficient capacity, neural networks can approximate a broad class of functions to arbitrary precision, so in principle the desired output can be obtained as long as there are enough training data.
The environment is full of time-series information in which earlier and later data are related to a certain extent. For example, the position and attitude data of an AUV are sequences that change with time. The LSTM [31,32] is well suited to processing this kind of data. Its basic structure is shown in Figure 2. The network structure introduces a cell state that carries the information from the previous moment. When new information arrives, a series of operations chooses between the old and new information. Together with the "memory-forgetting" mechanism, this allows long time-series data to be processed. The structure mainly includes input, forget, and output gates. The inputs are $h_{t-1}$ and $x_t$, the output is $h_t$, and the cell states are $c_{t-1}$ and $c_t$.
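The forget gate controls the time dependence and effects of previous inputs and determines which states are remembered or forgotten. In the standard LSTM formulation, with weight matrices W, biases b, and sigmoid function σ, the output of the forget gate is:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

The input gate, also called the selective memory stage, decides how much of the current input is taken into account. The calculations for each part are as follows:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

The output gate determines the final output information. The computational procedure is summarized as follows:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(c_t)$$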
The above is the basic principle of the LSTM neural network, which uses gated states to selectively retain the input information that must be remembered over long sequences while forgetting unimportant information.
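The gate mechanism described above can be sketched as a single LSTM time step in plain Python. This is the standard textbook cell, not necessarily the exact network used in this paper; the parameter names (W_f, b_f, and so on) and the toy values are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: returns the new hidden state h_t and cell state c_t."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]    # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]    # input gate
    g = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]  # candidate state
    o = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]    # output gate
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t

# Toy check: 2 hidden units, 1 input, all-zero parameters (every gate = 0.5).
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
p = {k: zeros for k in ("W_f", "W_i", "W_c", "W_o")}
p.update({k: [0.0, 0.0] for k in ("b_f", "b_i", "b_c", "b_o")})
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], p)
```

With all parameters at zero each gate outputs 0.5 and the candidate state is 0, so the cell state is simply halved, which makes the forgetting behaviour easy to verify by hand.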

Identification Principle and Process
After the actual test of the AUV, we obtain the dataset required for model identification. After the invalid data are removed, an input-output model is established from the navigation control commands of the AUV and its posture information. According to Equation (14), we obtain an expression in which $x_{10} = p$, $x_{11} = q$, $x_{12} = r$, and $F = F(\varphi, \theta, \psi, u, v, w, p, q, r, n, \delta_r, \delta_s)$. The subscripts of the matrices $E_n$ and $F_n$ represent the row and column of the matrix, respectively. The learned AUV dynamic model is then expressed in terms of the state and the control input $u_t$, where $u_t$ comprises the thruster command $n_t$ and the rudder angle commands ($\delta_{r,t}$, $\delta_{s,t}$).
The structure of the LSTM-based AUV model is shown in Figure 3. The input of the input layer is $x_t^{inp} = [x_t, u_t]$, and the output of the neural network is $x_t^{out} = [\Delta y_t]$. The attitude and speed information at the next moment is obtained from the control commands, attitude, and speed information of 20 consecutive time steps. The learned model is optimized according to the chosen loss function. More details of the network architecture are described below.
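The input-output pairing described above can be sketched as a sliding-window dataset builder. The function name, the toy dimensions, and the convention that deltas[k] is the state change following step k are assumptions for illustration; the paper does not specify its preprocessing code.

```python
def make_windows(states, controls, deltas, window=20):
    """Pair each window of [x_t, u_t] rows with the state change that follows it."""
    inputs, targets = [], []
    for end in range(window, len(states) + 1):
        # Concatenate state and control at each step of the window.
        sequence = [states[k] + controls[k] for k in range(end - window, end)]
        inputs.append(sequence)
        targets.append(deltas[end - 1])  # change after the last step in the window
    return inputs, targets

# Toy data: 25 steps, 2 state variables, 1 control, 1 target value per step.
states = [[float(k), float(2 * k)] for k in range(25)]
controls = [[float(k)] for k in range(25)]
deltas = [[float(k)] for k in range(25)]
inputs, targets = make_windows(states, controls, deltas, window=20)
```

With 25 steps and a window of 20, six training pairs are produced, each holding 20 rows of concatenated state and control values.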
To avoid inconsistencies due to the different scales of different features, we normalize the data as follows:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

where $x, x' \in \mathbb{R}$, $x_{min} = \min(x)$, and $x_{max} = \max(x)$. Model evaluation criteria are used to evaluate the accuracy of the identified model. In this study, the mean square error between the output of the neural network and the actual output is used as the evaluation index:

$$MSE = \frac{1}{N} \sum_{n=1}^{N} \left( d(n) - y(n) \right)^2$$

where d(n) is the output of the neural network, y(n) is the system's actual output, and N is the number of samples evaluated at one time. The multi-step error is adopted to test the effectiveness of the learned dynamic model, as shown in Equation (25).
where $O_{4-12}$ represents the desired input data (φ, θ, ψ, u, v, w, p, q, r), taken from columns 4 to 12 of the output data O obtained in the previous step. Additionally, the training dataset $D_H$ consists of input-output data pairs I: ($O_{4-12}$, u) → O: (Δy). To sum up, the learned multi-step model can update the AUV state (x, y, z, φ, θ, ψ, u, v, w, p, q, r) cyclically using only the action commands u.
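The normalization, error metrics, and cyclic multi-step update described above can be sketched as follows. The helper names are hypothetical, and the stand-in model is a toy function rather than a trained LSTM; the sketch only shows how predicted state changes are fed back as the next input.

```python
import math

def min_max_normalize(xs):
    """Scale a feature column to [0, 1]; a constant column maps to zeros."""
    x_min, x_max = min(xs), max(xs)
    span = x_max - x_min
    return [0.0 if span == 0 else (x - x_min) / span for x in xs]

def error_metrics(d, y):
    """Return MAE, MSE, and RMSE between network outputs d and true outputs y."""
    errors = [dn - yn for dn, yn in zip(d, y)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}

def rollout(model, x0, controls):
    """Multi-step prediction: add each predicted change to the current state."""
    x, trajectory = list(x0), [list(x0)]
    for u in controls:
        delta = model(x, u)  # predicted state change for this step
        x = [xi + di for xi, di in zip(x, delta)]
        trajectory.append(list(x))
    return trajectory

# Toy usage: the stand-in model moves every state variable by the control value.
scaled = min_max_normalize([2.0, 4.0, 6.0])
metrics = error_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
trajectory = rollout(lambda x, u: [u for _ in x], [0.0, 1.0], [0.5, 0.5, 0.5])
```

Because each step starts from the previous prediction rather than from a measurement, errors accumulate along the rollout, which is why a multi-step error is a stricter test than a one-step error.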

MDP Modeling of LSTM Hyperparameter Optimization Problems
The performance of the LSTM algorithm is highly dependent on hyperparameters. Moreover, different tasks often require different hyperparameter configurations. To achieve high-precision identification of the AUV system, we adopt reinforcement learning to optimize the hyperparameter configuration of the LSTM network.
Reinforcement learning is a category of machine learning methods in which the optimal or suboptimal strategy is determined by interacting with a dynamic environment [33]. A Markov decision process (MDP) includes five elements, M = ⟨S, A, R, T, γ⟩, where:
S: the set of possible states.
A: the set of actions generated by the policy.
R: the reward model.
T: the dynamics model, that is, the probability of reaching the next state given the current state and action.
γ: the discount factor (between 0 and 1).
Figure 4 shows the basic structure of reinforcement learning. At time t, the agent receives state $s_t$ and produces action $a_t$, then transitions to the next state $s_{t+1}$ and obtains reward $r_t$. In this way, the agent learns how to map states to actions. The process does not stop until the final condition is reached. To improve the optimization efficiency, the hyperparameter optimization of the LSTM neural network is regarded as a reinforcement learning problem with a discrete state space and a discrete action space. The above elements are designed for this problem as follows.
Action space: The hyperparameters to be optimized and the candidate values of each hyperparameter are determined according to the LSTM network model structure. In this study, we concentrate on the numbers of neurons $N_{n1}$ and $N_{n2}$ in the hidden layers, the batch size $N_b$, and the time step $N_{ts}$, while other settings are determined empirically. The hyperparameters to be optimized and their candidate values are shown in Table 2.
Table 2. Hyperparameters to be optimized and their candidate values (recoverable entry: (5, 10, 15, 20)).
State space: In this paper, the current hyperparameter configuration a of the network is taken as the state s at time t. The state space S is therefore the same as the action space A, that is, ($N_{n1}$, $N_{n2}$, $N_b$, $N_{ts}$).
Reward: The previous analysis shows that identification accuracy improves as the output error decreases. We therefore define the immediate reward from the root mean square error on the test set with Equation (26).
where $N_t$ is the number of samples, and $d_t$ and $y_t$ represent the output of the network and the real output on the test set, respectively.
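The discrete search space and the error-based reward can be sketched as below. The candidate values are hypothetical: only the set (5, 10, 15, 20) survives in the text, and assigning it to N_ts is an assumption. Using the negative RMSE as the reward is also an assumption, made so that a maximizing agent prefers configurations with lower error.

```python
import itertools
import math

# Hypothetical candidate values; not the contents of Table 2.
CANDIDATES = {
    "N_n1": (16, 32, 64),      # neurons in the first hidden layer
    "N_n2": (16, 32, 64),      # neurons in the second hidden layer
    "N_b": (32, 64),           # batch size
    "N_ts": (5, 10, 15, 20),   # time steps
}

# Every state (and every action) is one full hyperparameter configuration.
CONFIGS = list(itertools.product(*CANDIDATES.values()))

def rmse(d, y):
    """Root mean square error between network outputs d and true outputs y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d, y)) / len(y))

def reward(d, y):
    """Assumed reward: negative test-set RMSE (higher is better)."""
    return -rmse(d, y)

example_reward = reward([1.0, 2.0], [1.0, 4.0])
```

With these hypothetical candidates the space holds 3 × 3 × 2 × 4 = 72 configurations, small enough for a tabular method.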

Hyperparameter Optimization Method
The hyperparameter optimization problem of the LSTM neural network is defined as a reinforcement learning problem with discrete state and action spaces. Reinforcement learning methods that approximate the value function or policy with neural networks, such as DQN, are also suitable for problems with discrete action spaces. However, such algorithms need a large amount of learning to ensure convergence and are unsuitable when computing resources are very limited. Thus, the Q-learning algorithm is selected in this paper to solve the problem.
Q-learning is a temporal difference algorithm for solving reinforcement learning problems. The optimal action policy π*: $s_t \to a_t$ can be obtained by maximizing the action-value function $Q(s_t, a_t)$, which reflects the long-term impact of an action. The Q function is updated according to Equation (27):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where 0 < α < 1 is the learning rate and 0 < γ < 1 is the discount factor. The structure of the LSTM hyperparameter optimization method based on Q-learning is shown in Figure 5. The hyperparameter optimization process is as follows: first, the Q-value table is initialized with zeros, and the initial hyperparameter configuration (that is, the initial state $s_0$) is randomly selected in the action space. Then, the action $a_t$ is chosen according to the ε-greedy selection rule. Next, the action $a_t$ is performed, and the system acquires a new state $s_{t+1}$ and reward $r_{t+1}$. Finally, the Q-value table is updated according to Equation (27).

To improve the optimization efficiency of the hyperparameters, we improve the above method. Except in the first round of learning, where the initial state is selected randomly, the agent selects the current best hyperparameter configuration as the initial state of each round. This effectively avoids ineffective exploration by the reinforcement learning agent among poorly performing hyperparameter combinations.
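The procedure above can be sketched as tabular Q-learning in which every round after the first restarts from the best configuration found so far. This is a minimal sketch under stated assumptions: the action is taken to be "move to configuration a", the best state is tracked by the highest reward observed, and `evaluate` is a cheap toy function standing in for training and testing an LSTM. None of these details is confirmed by the paper.

```python
import random

def q_learning_search(states, evaluate, episodes=20, steps=10,
                      alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Search `states` for a high-reward configuration with tabular Q-learning."""
    rng = random.Random(seed)
    n = len(states)
    Q = [[0.0] * n for _ in range(n)]  # Q[s][a]; action a moves to state a
    best_idx, best_reward = None, float("-inf")
    for _ in range(episodes):
        # First round: random initial state. Later rounds: best state so far.
        s = rng.randrange(n) if best_idx is None else best_idx
        for _ in range(steps):
            if rng.random() < epsilon:          # explore
                a = rng.randrange(n)
            else:                               # exploit
                a = max(range(n), key=lambda j: Q[s][j])
            r = evaluate(states[a])
            s_next = a
            # Temporal difference update of the Q table.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            if r > best_reward:
                best_idx, best_reward = a, r
            s = s_next
    return states[best_idx], best_reward, Q

# Toy problem: hypothetical (neurons, time steps) grid with a known optimum.
grid = [(n1, ts) for n1 in (16, 32, 64) for ts in (5, 10, 15, 20)]
toy_reward = lambda cfg: -abs(cfg[0] - 32) / 32 - abs(cfg[1] - 10) / 10
best_cfg, best_reward, Q = q_learning_search(grid, toy_reward)
```

In a real run `evaluate` would be expensive, so caching the reward of configurations that have already been trained would be a natural addition.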

Identification Algorithm
After introducing the model identification method based on LSTM and the hyperparameter optimization method based on improved Q-learning, we can obtain the complete process and framework of the method in this paper. The algorithm flow of hyperparameter optimization for the LSTM method of AUV model identification based on Q-learning is shown in Table 3.

Table 3. AUV Model Identification Algorithm.
Given a training dataset, including input and output datasets
for episode in 1 to count do
    Network initialization: initialize neural network hyperparameters
    for t in 1 to T do
        Choose optimized hyperparameters
        Calculate network output
        Calculate the error between the true output and the output of the network
        Update neural network
        Update Q table
    end for
end for

Figure 6 shows a detailed block diagram of system model identification. Offline models are obtained by offline training on real historical data. During actual sailing, the system regularly checks the accuracy of the learned dynamics model. If the error exceeds an empirically predetermined threshold σ and therefore does not meet the requirements, the model is retrained on new real-time data.
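The periodic accuracy check described above can be sketched as a simple threshold test. Using the mean square error as the monitored quantity is an assumption, since the text only states that the error is compared with σ; the function name and sample values are hypothetical.

```python
def needs_retraining(predicted, measured, sigma):
    """Return True when the model error on recent data exceeds the threshold sigma."""
    mse = sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured)
    return mse > sigma

# Hypothetical recent predictions versus sensor measurements.
small_drift = needs_retraining([1.0, 2.0], [1.0, 2.1], sigma=0.1)
large_drift = needs_retraining([1.0, 2.0], [2.0, 4.0], sigma=0.1)
```

Only the second case triggers retraining, since its error (2.5) is well above the threshold while the first (0.005) is below it.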

Results
In order to verify the effectiveness of the proposed method, numerous experiments have been performed. The experiments were divided into two parts based on simulation data (see Section 4.1) and real data (see Section 4.2). Simulations were run on the model described in Section 2. The real data were acquired by the Sailfish AUV.

Results on Simulation Data
First, the validity of AUV system identification based on the LSTM neural network is verified. A set of hyperparameters is randomly chosen for the LSTM neural network, and a system identification experiment is carried out on the AUV simulation system described above. In total, 4990 input-output data pairs are used for model identification, of which 4000 are used as the training set to train the LSTM neural network. The other 990 are used as the validation set to test the identification performance of the model. The number of training iterations is set to 100.
The loss curve during training is shown in Figure 7. The loss converges during training, and its value decreases with the number of iterations, which indicates that the LSTM neural network can identify the AUV system. After 100 training iterations, the validation dataset is fed into the network model for testing. The sum of squared output errors is 8.490711, and the mean and variance of the absolute error are 0.1761263 and 0.0145471, respectively. The output of the LSTM network model can therefore fit the actual output of the system after 100 training iterations, although the error between the network output and the real output is still large. The fitting curves between the network output and the real output are shown later in the comparison of the different methods.
The large fitting error of the above LSTM neural network model arises because its hyperparameter settings are not suitable. The identification accuracy is strongly affected by the network hyperparameter settings, and inappropriate settings may even lead to non-convergence. To achieve high-precision system identification, a reinforcement learning algorithm is used to optimize the selection of the hyperparameters of the above neural network.
Data were divided into training and test sets during each test, yielding a training set of 4000 and a test set of 990. The proposed method was performed for 100 episodes. To demonstrate the optimization performance of the method, we compare the results of five different stages (proc1-0th episode, proc2-25th episode, proc3-50th episode, proc4-75th episode, and proc5-100th episode).
As shown in Figure 8, all five optimized LSTM neural network models make the loss curve converge quickly. The convergence speed also increases with the number of episodes, which shows that the method has a clear optimization effect.
The statistical results of the output errors of the five groups of LSTM neural network models on the validation set are recorded in Table 4. To make the trend of the MSE more apparent, it is enlarged 1000 times and displayed in Figure 9. As can be seen from the graph, the MAE, MSE, and RMSE of the prediction error on the validation set decrease as the number of episodes increases. These results also demonstrate the performance of the method from a statistical point of view.

To further illustrate the advantages of this method, we compare the prediction results on the validation set after 0 episodes and after 100 episodes of optimization with the results predicted by the commonly used dead reckoning (DR) algorithm. The fitting curves of the network output and the real output of the AUV's position, linear velocity, angle, and angular velocity are shown in Figures 10-13, respectively. It can be seen from the figures that the optimized LSTM model brings the output of the 12 variables of the identification model into better agreement with the actual system output. To reflect the effectiveness of the proposed method more clearly, the MAE, MSE, and RMSE of the prediction error are shown in Table 5. Compared with the unoptimized approach and DR, our method shows superior performance, reducing MAE by 28.42% and 29.13%, MSE by 38.66% and 38.96%, and RMSE by 21.70% and 25.81%, respectively. The deviation between the output of our method and the actual system is thus smaller than that of the other two methods.

Results on Real Data
The AUV dataset is required for training and validating the LSTM neural network model identification method. Therefore, the experiments of data collection should be carried out first. The experiments were carried out on the Sailfish AUV, as shown in Figure 14.
The actual data include sensor noise from the acquisition process, so an accurate model is more difficult to obtain than in simulation. Therefore, the method was run for 200 episodes. The experimental dataset is divided into a training set of 4000 and a test set of 990. The loss curve of the LSTM neural network during training, shown in Figure 15, indicates that the method learns effectively.

The fitting curves between the network output and the real output are shown in Figures 16-19. The calculated MSE of the output error is 550.830181, and the MAE and RMSE values are 6.250734 and 23.469771, respectively. We can see that the LSTM neural network obtained by optimization can fit the system's actual output and achieve high-precision identification of the AUV. To illustrate the superiority of the method, it is compared with the commonly used dead reckoning (DR) method. As shown in Table 6, the proposed method achieved 64.90% lower MAE, 64.20% lower MSE, and 37.76% lower RMSE than the DR method.

Owing to the small prediction bias for the AUV motion state, the proposed method again shows high predictive power. Therefore, the proposed identification method is of great significance to the actual navigation control of AUVs.

Conclusions
To address the identification problem of the AUV system, this paper adopts a neural network hyperparameter optimization method based on Q-learning. The method has been experimentally verified, and the conclusions can be summarized as follows:
1. The LSTM framework has the characteristics of natural Markovization, which can model time series data with high precision. It is found that the historical data of the AUV imply the causal relationship of its dynamic model and have the characteristics of a hidden Markov model. The experimental results also show that the adopted method can predict the AUV model well.
2. Optimally selecting hyperparameters can significantly improve the efficiency of LSTMs in specific tasks. It is concluded that the improved Q-learning method can make the LSTM neural network realize the high-precision identification of the AUV system.
3. The offline training in the system model identification framework can reduce online learning time and ensure the security of the initial online use. The online learning model can also ensure its validity.
4. The proposed method has high model identification accuracy and has certain application prospects. We can apply this method to fault diagnosis of AUV, design of the model-based controller, and other aspects.
However, the hyperparameter optimization method only considers identification accuracy, and its convergence speed can potentially be improved. Moreover, the performance of the LSTM and the improved Q-learning method used in this paper still has certain limitations. We will improve both the identification method and the optimization method in future work.