Shear Wave Velocity Estimation Based on Deep-Q Network

Abstract: Geoacoustic inversion is important for seabed geotechnical applications. It can be formulated as a problem that seeks an optimal solution in a high-dimensional parameter space. The conventional inversion approach exploits optimization methods with a pre-defined search strategy whose hyperparameters need to be fine-tuned for a specific scenario. A framework based on the deep-Q network is proposed in this paper and the environment and agent configurations of the framework are specially defined for geoacoustic inversion. Unlike a conventional optimization method with a pre-defined search strategy, the proposed framework determines a flexible strategy by trial and error. The proposed framework is evaluated by two case studies for estimating the shear wave velocity profile. Its performance is compared with three global optimization methods commonly used in underwater geoacoustic inversion. The results demonstrate that the proposed framework performs the inversion more efficiently and accurately.


Introduction
Shear wave velocity estimation is an important geoacoustic inversion task for seabed geotechnical applications since shear wave velocity can provide a good indicator of sediment rigidity and characterization [1,2]. The seabed shear wave velocity profile can be estimated from the dispersion curve of the seismoacoustic interface waves, which is a convenient and low-cost approach compared to the direct approach (e.g., coring). Here, the interface waves refer to Scholte waves since in most underwater and seismic experiments sources are deployed in the water column and only Scholte waves can be generated [2].
There are two approaches for geoacoustic inversion [3]: the optimization-based approach and the machine learning (ML)-based approach. The optimization-based approach exploits an optimization method to determine the set of geoacoustic parameters that best fits the measured data. According to previous reviews [4,5], some optimization methods have been demonstrated to perform well for geoacoustic inversion, such as the genetic algorithm (GA) [6], differential evolution (DE) [7], and adaptive simplex simulated annealing (ASSA) [8]. On the other hand, with the development of ML, studies of ML-based geoacoustic inversion have appeared. Most of these studies are based on supervised learning, which aims to train a deep neural network for inversion on a vast dataset [9–12]. This type of approach normally consists of the following steps: (1) creating a simulation dataset based on a physical forward model; (2) training a deep neural network on the simulation dataset; (3) exploiting the trained neural network for the real-world inversion.
These two approaches can both provide acceptable performance for geoacoustic inversion. However, they also have drawbacks. Since the prevalent optimization methods are not specifically designed for geoacoustic inversion, applying them in this field can be difficult; for example, the hyperparameters required by the algorithm are hard to choose, which may increase the time cost. The ML-based approach has the drawback that the trained neural network cannot interact with the physical forward model, so the simulation dataset has to be recreated when the ocean environment changes significantly.
To avoid these drawbacks while keeping the interactive ability of the optimization-based approach and the learnability of the ML-based approach, deep reinforcement learning (DRL) is a potential option. Unlike supervised learning, DRL learns by trial and error: it iteratively updates the model by interacting with the environment to achieve good data fitting [13]. Its potential for geoacoustic inversion can be interpreted intuitively. The physical forward model, which plays the role of the environment, creates a replica from the set of geoacoustic parameters input by the DRL model. The DRL method then updates the model by iteratively modifying the replica to obtain the best fit to the measured data.
DRL has been widely used as an intelligent controller for different purposes, including robotics [14], electronic sports [15], and automatic control [16–23]. Specifically, Gu et al. demonstrated the effectiveness of DRL for controlling physical robots [14]. Joo et al. proposed a green signal time allocation system based on a deep Q-network (DQN) for reducing the standard deviation of each lane at an intersection [16]. Zhou et al. modeled penetration testing as a Markov decision process and exploited DQN for autonomous penetration testing [17]. Park et al. exploited a DRL-based DQN agent for a visual object-tracking task in a virtual environment and demonstrated that the proposed agent outperforms some conventional methods on two public databases [18]. Gao et al. proposed a DRL-based method to solve the relay selection problem in the decode-and-forward relay-aided free-space optical communication system [19]. Guan et al. designed a DRL-based spectrum allocation algorithm for the internet of vehicles discriminating services and showed that the designed method allocates spectrum resources quickly and efficiently in a highly dynamic environment [20]. Zhao et al. utilized DQN for controlling the autonomous walking of an underground load-haul-dump machine and demonstrated the effectiveness of the proposed DQN-based method through experimental verification [21]. Qin et al. proposed a hierarchical DQN-based path-planning method for controlling the long-term data collection of unmanned aerial vehicles in dynamic scenarios [22]. Asaf et al. exploited DRL to set optimal contention windows under different network conditions for wireless LAN performance enhancement [23].
Even though DRL has been shown to control an agent's behavior effectively in a specific environment, applying DRL still requires a specific configuration of the environment, the action space of the agent, and the reward. For instance, Wang et al. proposed a stochastic inversion of magnetotelluric data based on DRL [24], in which the environment state is defined as the layer information and the resistivity, and the action space includes three linear operations (addition, subtraction, and keeping no variation). However, the problem of magnetotelluric inversion is quite different from geoacoustic inversion. Moreover, the parameter space of the latter is high-dimensional, which means that naive linear operations (e.g., addition or subtraction) are inefficient for the agent to explore the space and determine the optimal solution. To the best of our knowledge, DRL has not been used for geoacoustic inversion. Therefore, our motivation is to investigate the potential of DRL and define a useful configuration of the environment and agent for the field.
In this paper, we propose a geoacoustic inversion framework based on a popular DRL method, the DQN, for estimating shear wave velocity from the dispersion data of interface waves. In the framework, a carefully designed configuration for the environment and agent is also proposed. A comprehensive performance analysis is presented to compare the proposed framework with three popular optimization methods (i.e., GA, DE, and ASSA) widely used for geoacoustic inversion.
The remainder of this paper is organized as follows. Section 2 states the considered problem. The theories of DRL and DQN are introduced in Section 3. Section 4 describes the proposed framework for geoacoustic inversion. A comprehensive performance analysis is presented in Section 5. Finally, the conclusions are given in Section 6.

Problem Formulation
The forward problem can be expressed as
$$d = F(m), \tag{1}$$
where $F$ and $d$ refer to the physical forward model and the observed data, respectively, and $m$ is a set of geoacoustic parameters describing one ocean environment and seabed condition. The inversion problem aims at inferring the set of geoacoustic model parameters that generates the observed data and can be expressed as
$$m = F^{-1}(d), \tag{2}$$
where $F^{-1}$ refers to the inversion operation. The ocean environment and seabed can be parameterized as an N-layered structure with four geoacoustic parameters per layer (layer thickness $h$, density $\rho$, compression wave velocity $Vp$, and shear wave velocity $Vs$):
$$m = [h_n, \rho_n, Vp_n, Vs_n]_{n=1}^{N}.$$
A general workflow of inversion is illustrated in Figure 1. The terminologies are introduced as follows:
• The environment consists of a physical forward model for calculating the replica, the observed data, and a misfit function for measuring the mismatch between the observed data and the replica. It receives a set of selected parameters from the agent and provides feedback to the agent.
• An agent is an operator that samples from the parameter space following its search strategy and interacts with the environment. During each iteration of inversion, the agent logs the feedback from the environment, the current best solution, and the related information.
• A parameter space is a multi-dimensional space defined by the search bounds.
The inversion is an iterative process. It starts with an initialization that defines a prior geoacoustic model and the original search bounds based on prior knowledge. During each iteration, the agent samples from the parameter space. The environment receives the selected parameters, creates the corresponding replica, and provides the misfit as feedback for the agent. The iteration stops once the termination criteria are met.
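The workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `forward_model` is a toy linear stand-in for the physical forward model (it is not DISPER80), the agent uses plain uniform random sampling rather than any of the search strategies discussed in this paper, and the names `random_search_inversion`, `k`, `max_iter`, and `e_threshold` are hypothetical.

```python
import math
import random


def forward_model(m):
    # Toy stand-in for the physical forward model F: parameters -> replica.
    return [m[0] + m[1] * x for x in (0.0, 1.0, 2.0, 3.0)]


def rmse(observed, replica):
    # Misfit function: root mean squared error between observed data and replica.
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(observed, replica)) / len(observed))


def random_search_inversion(observed, bounds, k=20, max_iter=200, e_threshold=1e-2, seed=0):
    rng = random.Random(seed)
    best_m, best_misfit = None, float("inf")
    for _ in range(max_iter):
        # Agent: sample k parameter sets from the parameter space.
        candidates = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(k)]
        for m in candidates:
            # Environment: build the replica and return the misfit as feedback.
            misfit = rmse(observed, forward_model(m))
            if misfit < best_misfit:
                best_m, best_misfit = m, misfit
        # Termination: stop early once the misfit threshold is reached.
        if best_misfit <= e_threshold:
            break
    return best_m, best_misfit


m_true = [1.0, 2.0]
d_obs = forward_model(m_true)
m_est, misfit = random_search_inversion(d_obs, bounds=[(0.0, 5.0), (0.0, 5.0)])
```

The only part that changes between inversion methods in this picture is how the candidates are drawn in each iteration, which is the role of the search strategy.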
As shown in Figure 1, the existing optimization methods (GA, DE, and ASSA) serve as the search strategy that controls how the agent explores the parameter space and determines the best solution at the end. Specifically, GA [6] and DE [7] are two heuristic search algorithms inspired by the evolution of natural species. ASSA [8] is a hybrid optimization method that combines the downhill simplex and the simulated annealing methods. More details about GA, DE, and ASSA can be found in [6–8], respectively.

Theories of DRL and DQN
DRL is used to solve a type of task in which an agent iteratively interacts with the environment to maximize future rewards. This task can be formulated as a finite Markov decision process [25] and be solved by the DQN algorithm [13]. DQN is derived from Q-learning and can learn an optimal strategy by estimating the Q-value, which expresses the quality of executing an action $a$ given a certain environment state $s$. The Q-value can be iteratively updated by the following formula and converges to the optimum:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s+1, a') - Q(s, a) \right], \tag{3}$$
where $Q(s, a)$ expresses the Q-value at the current environment state, $\alpha$ is the learning rate, $r$ is the current reward, $\gamma$ is the discount factor, and $\max_{a'} Q(s+1, a')$ represents the maximum Q-value in the next environment state $s+1$.
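As a worked example of the Q-value update, the snippet below applies one step of the standard tabular Q-learning rule to a two-state, two-action table. The table contents and the values of the learning rate and discount factor are arbitrary choices for illustration, not settings from this paper.

```python
def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # One step of Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s_next, a') - Q(s, a)).
    td_target = r + gamma * max(q_table[s_next])
    q_table[s][a] += alpha * (td_target - q_table[s][a])
    return q_table[s][a]


# Rows are states, columns are actions.
q = [[0.0, 0.0], [0.0, 1.0]]
# Target is 1.0 + 0.9 * 1.0 = 1.9, so Q(0, 1) moves from 0 to about 0.19.
new_value = q_update(q, s=0, a=1, r=1.0, s_next=1)
```

With a learning rate of 0.1, the stored value moves one tenth of the way from its old value toward the target in each update.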
Conventional Q-learning needs to create a Q-table for saving and updating the Q-value at each environment state, and building such a table can become intractable when the number of environment states is huge. To mitigate this problem, the DQN algorithm utilizes a neural network instead of a Q-table for estimating the Q-value. The DQN algorithm introduces a replay memory $D = \{ex_1, \ldots, ex_t\}$ to save the agent's experience at different iterations, where $ex_t = [s_t, a_t, r_t, s_{t+1}]$ is the experience at the $t$th iteration. During the training stage, the DQN randomly selects a mini-batch from the replay memory for minimizing a loss function $L$:
$$L = \left( r + \gamma \max_{a'} Q(s+1, a') - Q(s, a) \right)^2, \tag{4}$$
where the meanings of symbols are the same as in Equation (3). Given a specific configuration of the environment and agent, the training process of DQN is expressed in Algorithm 1 [13].
Algorithm 1 Training procedure of DQN
1: Initializing the parameters of DQN and the replay memory $D$.
2: for Epoch from 1 to $M$ do
3: repeat
4: Collecting the initial environment state $s_1$.
5: With a preset probability, selecting a random action $a_t$; otherwise, selecting $a_t = \arg\max_a Q(s_t, a)$.
6: Executing action $a_t$ and receiving feedback from the environment. The feedback includes the reward $r_t$ and the new environment state $s_{t+1}$.
7: Saving the experience $ex_t = [s_t, a_t, r_t, s_{t+1}]$ in the replay memory $D$.
8: Randomly sampling a mini-batch of experience $\{[s_i, a_i, r_i, s_{i+1}]\}_{i=1}^{N_{mini}}$ from $D$, where $N_{mini}$ is the size of the mini-batch.
9: [...]
10: Minimizing the loss function $L$ and updating the parameters of DQN.
11: until the termination criteria are met.
12: end for
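The bookkeeping parts of the training procedure (the replay memory, the random-or-greedy action selection, and the regression targets built from a mini-batch) can be sketched without any deep learning library. The sketch below is a simplified illustration and not the authors' implementation: the fixed memory capacity, the class and function names, and the callable `q_fn` standing in for the neural network are all assumptions.

```python
import random
from collections import deque


class ReplayMemory:
    def __init__(self, capacity, seed=0):
        # Oldest experiences are dropped once the (assumed) capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, n_mini):
        # Uniformly sample a mini-batch without replacement.
        return self.rng.sample(list(self.buffer), min(n_mini, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon take a random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_targets(batch, q_fn, gamma):
    # Regression targets r + gamma * max_a' Q(s_next, a') for each experience.
    return [r + gamma * max(q_fn(s_next)) for (_, _, r, s_next) in batch]


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(s=t, a=t % 2, r=1.0, s_next=t + 1)

batch = memory.sample(n_mini=2)
targets = td_targets(batch, q_fn=lambda s: [0.0, 1.0], gamma=0.5)
greedy = epsilon_greedy([0.2, 0.7], epsilon=0.0, rng=random.Random(0))
```

In a full implementation, the squared difference between these targets and the network outputs for the stored state-action pairs would be minimized with a gradient-based optimizer.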

Geoacoustic Inversion Framework Based on DQN
In this section, the DQN-based framework for geoacoustic inversion (hereafter, the DQN framework) is presented from the DRL perspective.

Environment Configuration
As shown in Figure 1, the environment receives the selected parameters and provides feedback for the agent to update the search results. During each iteration, the agent inputs $k$ sets of selected parameters and the environment provides feedback to the agent.
The configuration of the environment for the DQN framework is listed as follows:
• The physical forward model: a theoretical program for calculating the replica.
• Observed data: the measured data or the data derived from the measured data.
• Misfit function: the root mean squared error (RMSE), which measures the difference between the observed data and the replica.
• Environment state: a special item for the DQN framework, which indicates the progress of the inversion. The environment state is formulated as
$$S_i = [\min(E_i), \mathrm{mean}(E_i), \mathrm{std}(E_i), \Delta\min(E_i), \Delta\mathrm{mean}(E_i), \Delta\mathrm{std}(E_i)], \tag{5}$$
where $i$ refers to the $i$th iteration, and $\min(\cdot)$, $\mathrm{mean}(\cdot)$, and $\mathrm{std}(\cdot)$ are operators for calculating the minimum, mean, and standard deviation, respectively. $E = [E_1, \ldots, E_k]/E_{norm}$ refers to the normalized misfit values corresponding to the $k$ sets of parameters, where $E_{norm}$ is the minimum misfit value in the initialization stage and acts as the normalization factor. $\Delta$ is an operator for calculating the difference from the last iteration, e.g.,
$$\Delta\min(E_i) = \min(E_i) - \min(E_{i-1}). \tag{6}$$
• Termination criteria: whenever one of the conditions in Equation (7) is satisfied, the iteration stops. The conditions involve $i_{max}$, the maximum number of iterations; $e_{threshold}$, a preset threshold of misfit; and $\epsilon_{convergence}$, a preset threshold for convergence.
• Reward: a special item for the DQN framework, which is a signal for guiding the agent to learn a search strategy. The reward rules are defined so that the agent obtains a fast and accurate search strategy.
• Feedback: the feedback includes the misfit values $E_i$, the environment state $S_i$, and the reward.
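The environment state can be computed directly from the misfit values of two consecutive iterations. The sketch below assumes that the six components are the minimum, mean, and standard deviation of the normalized misfits followed by their differences from the previous iteration, in that order, and that the population standard deviation is used; the component order, the choice of standard deviation, and the function names are assumptions made for illustration.

```python
import statistics


def misfit_summary(misfits, e_norm):
    # Normalize by E_norm, then summarize with min, mean, and standard deviation.
    e = [x / e_norm for x in misfits]
    return [min(e), statistics.mean(e), statistics.pstdev(e)]


def environment_state(misfits, prev_misfits, e_norm):
    current = misfit_summary(misfits, e_norm)
    previous = misfit_summary(prev_misfits, e_norm)
    # Delta terms: change of each summary statistic since the last iteration.
    deltas = [c - p for c, p in zip(current, previous)]
    return current + deltas


state = environment_state(
    misfits=[2.0, 4.0, 6.0],
    prev_misfits=[4.0, 6.0, 8.0],
    e_norm=2.0,
)
```

In this example the minimum and mean both drop by 1 while the spread is unchanged, so the last three components are -1, -1, and 0.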

Agent and Action Space
The action space consists of all the potential actions that may be selected by the agent during each iteration. For instance, the action space of GA consists of reproduction, crossover, and mutation [6]. The configuration of the agent for the DQN framework is listed as follows:
• Agent state: during each iteration, the agent updates the agent state based on the feedback from the environment. The agent state consists of $B$, $m$, $m_{mean}$, and $m_{std}$, where $B$ refers to the search bounds of the parameters, $m$ is the set of parameters with the lowest misfit value among the $k$ selected sets, and $m_{mean}$ and $m_{std}$ are the mean and standard deviation of the parameter sets whose misfit values are among the lowest 30% of the $k$ selected sets.
• Action space: the agent has two actions for sampling from the parameter space. Each action consists of a sampling operation and an update rule.
- Action 0 samples with the uniform distribution from the search bounds $B$ and iteratively searches for the solution by compressing $B$. More specifically, during each iteration, the agent conducts two steps:
1. At the beginning of the $i$th iteration, the search bounds are compressed to obtain the updated search bounds $B_i$.
2. After updating the search bounds, $k$ sets of parameters are sampled with a uniform distribution from the updated search bounds $B_i$. The selected parameters are fed into the environment and the corresponding feedback is received by the agent. The agent state is updated accordingly.
- Action 1 samples with the Gaussian distribution and iteratively searches for the solution by updating $m_{mean}$ and $m_{std}$ of the selected $k$ sets. More specifically, during each iteration, the agent conducts the following steps:
1. $k$ sets of parameters are sampled with a Gaussian distribution defined by $m_{mean}$ and $m_{std}$.
2. The selected parameters are fed into the environment and the corresponding feedback is received by the agent. The agent state, except for the search bounds $B$, is updated accordingly.
3. In the received feedback, if $\mathrm{abs}(\cdot) \le \epsilon_{convergence}$, the expansion is activated for jumping out of the local minimum, and steps 1 and 2 are then repeated.
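The two sampling actions can be illustrated with a small sketch. The exact compression and expansion rules are not reproduced here, so the code makes loud assumptions: Action 0 shrinks each search interval by a hypothetical factor `shrink` around the current best parameters, Action 1 clips Gaussian samples to the search bounds, and the expansion step is omitted entirely. All function names are hypothetical.

```python
import random
import statistics


def top_fraction_stats(samples, misfits, fraction=0.3):
    # Mean and std per parameter over the sets with the lowest 30% misfits.
    n_keep = max(2, round(len(samples) * fraction))
    ranked = sorted(zip(misfits, samples), key=lambda pair: pair[0])[:n_keep]
    columns = list(zip(*(m for _, m in ranked)))
    m_mean = [statistics.mean(c) for c in columns]
    m_std = [statistics.pstdev(c) for c in columns]
    return m_mean, m_std


def action0_sample(bounds, best_m, k, rng, shrink=0.8):
    # ASSUMED compression rule: shrink each interval around the best parameters.
    new_bounds = []
    for (lo, hi), m in zip(bounds, best_m):
        half = 0.5 * shrink * (hi - lo)
        new_bounds.append((max(lo, m - half), min(hi, m + half)))
    # Uniform sampling from the compressed bounds.
    samples = [[rng.uniform(lo, hi) for lo, hi in new_bounds] for _ in range(k)]
    return new_bounds, samples


def action1_sample(m_mean, m_std, bounds, k, rng):
    # Gaussian sampling around m_mean, clipped to the bounds (clipping is assumed).
    return [
        [min(hi, max(lo, rng.gauss(mu, sd))) for mu, sd, (lo, hi) in zip(m_mean, m_std, bounds)]
        for _ in range(k)
    ]


rng = random.Random(0)
bounds = [(0.0, 10.0), (100.0, 500.0)]
new_bounds, uniform_samples = action0_sample(bounds, best_m=[2.0, 300.0], k=10, rng=rng)
misfits = [abs(m[0] - 2.0) + abs(m[1] - 300.0) / 100.0 for m in uniform_samples]
m_mean, m_std = top_fraction_stats(uniform_samples, misfits)
gaussian_samples = action1_sample(m_mean, m_std, new_bounds, k=10, rng=rng)
```

Action 0 narrows the region being explored, while Action 1 concentrates samples around the current best candidates, which matches the coarse and fine roles of the two actions.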

DQN-Based Search Strategy
The search strategy acts as a guide that leads the agent to iteratively select an action from the action space according to the feedback from the environment. For instance, the search strategy of GA consists of a series of pre-defined rules that lead the agent of GA to find the optimal solution.
Unlike the existing optimization methods, which have a pre-defined search strategy, the DQN framework learns the strategy for controlling the agent defined in Section 4.2 by iteratively interacting with the environment defined in Section 4.1 to find a set of parameters corresponding to the lowest misfit value as quickly as possible.
The DQN-based search strategy is shown in Figure 2, where the architecture of the neural network is expressed in Table 1. The neural network has three dense layers, each followed by the ReLU activation function with the exception of dense layer 3. The input for dense layer 1 is the current environment state $S_i$, a vector consisting of six components. The output of dense layer 3 is a two-component vector expressing the Q-values of the two actions, and the agent executes the action with the larger Q-value.
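To make the data flow through the network concrete, the following framework-free sketch evaluates a three-layer fully connected network that maps a six-component state to two Q-values and then picks the greedy action. The hidden width `HIDDEN`, the random weight initialization, and the example state are hypothetical; the actual layer sizes are those given in Table 1, and a trained network would be built with a deep learning library such as PyTorch.

```python
import random


def make_dense(n_in, n_out, rng):
    # Randomly initialized weights and zero biases, for illustration only.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu):
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_network(state, layers):
    # Dense layers 1 and 2 use ReLU; dense layer 3 outputs raw Q-values.
    h = dense(state, layers[0], relu=True)
    h = dense(h, layers[1], relu=True)
    return dense(h, layers[2], relu=False)


def greedy_action(q_values):
    # The agent executes the action with the larger Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
HIDDEN = 16  # hypothetical width, not taken from Table 1
layers = [make_dense(6, HIDDEN, rng), make_dense(HIDDEN, HIDDEN, rng), make_dense(HIDDEN, 2, rng)]
q_values = q_network([1.0, 2.0, 0.8, -1.0, -1.0, 0.0], layers)
action = greedy_action(q_values)
```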

Inversion Workflow
The inversion workflow based on the DQN framework is expressed in Algorithm 2:
Algorithm 2 Inversion workflow based on the DQN framework.
[...]
4: Obtaining the current environment state $S_i$.
5: Passing $S_i$ into the network of the DQN framework and collecting the suggested action.
6: Executing the suggested action and passing the $k$ sets of parameters into the environment.
7: until the termination criteria shown in Equation (7) are satisfied.
Output: Inversion result $m_{end}$.
In the initialized environment state $S_1$, $E_{norm}$, $E_{mean}$, and $E_{std}$ are the minimum, mean, and standard deviation of the misfit values (whose dimension depends on the inverse problem) among the $k$ selected sets. Specifically, $E_{norm}$ serves as the normalization factor, so each item of the environment state is expressed as a percentage.
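The loop of obtaining the state, asking the policy for an action, executing it, and checking the termination criteria can be sketched as follows. This is a toy one-parameter example under stated assumptions: `policy` is a hand-written threshold rule standing in for the trained network, the state is reduced to two numbers, only two termination conditions are modeled, and the names `run_inversion`, `i_max`, and `e_threshold` are hypothetical.

```python
import random


def run_inversion(policy, actions, evaluate, samples, i_max=50, e_threshold=1e-3):
    misfits = [evaluate(m) for m in samples]
    best_misfit, best_m = min(zip(misfits, samples), key=lambda p: p[0])
    history = []
    for _ in range(i_max):
        # Obtain a (simplified) environment state and ask the policy for an action.
        state = (min(misfits), sum(misfits) / len(misfits))
        action = policy(state)
        history.append(action)
        # Execute the action and pass the new samples to the environment.
        samples = actions[action](best_m, len(samples))
        misfits = [evaluate(m) for m in samples]
        cand_misfit, cand_m = min(zip(misfits, samples), key=lambda p: p[0])
        if cand_misfit < best_misfit:
            best_misfit, best_m = cand_misfit, cand_m
        # Termination: misfit threshold reached (i_max is enforced by the loop).
        if best_misfit <= e_threshold:
            break
    return best_m, best_misfit, history


rng = random.Random(0)


def evaluate(m):
    # Toy misfit with its minimum at m = 3.
    return abs(m - 3.0)


def coarse_action(best, k):
    return [rng.uniform(0.0, 10.0) for _ in range(k)]


def fine_action(best, k):
    return [rng.gauss(best, 0.1) for _ in range(k)]


def policy(state):
    # Stand-in rule: explore coarsely while the best misfit is still large.
    return 0 if state[0] > 0.5 else 1


initial = [rng.uniform(0.0, 10.0) for _ in range(10)]
initial_best = min(evaluate(m) for m in initial)
m_end, misfit_end, history = run_inversion(policy, {0: coarse_action, 1: fine_action}, evaluate, initial)
```

In the DQN framework, the threshold rule would be replaced by the greedy action of the trained Q-network, so the switch between coarse and fine exploration is learned rather than hard-coded.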

Implementation
Based on the environment and agent defined in Sections 4.1 and 4.2, the DQN framework is trained following the procedure in Section 3.
Hyperparameters of the implementation are shown in Table 2.

Numerical Experiments
In this paper, the DQN framework is applied to estimate the shear wave velocity based on the dispersion data of interface waves. The inversion performances of the proposed DQN framework and three alternative methods (GA, DE, and ASSA) are examined in two numerical experiments. Two geoacoustic models based on real scenarios in [2,26] are defined to make the simulation more realistic. To increase the reliability of the evaluation, the inversion results discussed in this section are averaged over 100 independent inversions. The forward model DISPER80 [27], which is based on the Thomson-Haskell matrix method [28,29], is used for calculating the simulated dispersion curve from the given geoacoustic model.

Experiment Setup
The numerical experiments were conducted on a server with an Intel Core i7-9700K CPU @ 3.60 GHz, 8 cores, 64 GB of memory, and a 1 TB hard drive. The GA and DE inversions were implemented based on the GitHub repository scikit-opt. The ASSA inversion was implemented based on the algorithm proposed in [8]. The DQN framework was implemented based on PyTorch [30]. Preset parameters of the candidate methods are shown in Table 3, where the item All refers to the parameters applicable to all the candidate methods. The metrics for evaluation are the misfit value, the running time per independent inversion, and the relative error (RE) formulated in Equation (11):
$$RE = \frac{|m_{inversion} - m_{true}|}{|m_{true}|} \times 100\%, \tag{11}$$
where $m_{inversion}$ is the estimated value of one geoacoustic parameter and $m_{true}$ is the corresponding ground truth.
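As a small worked example of the relative error metric, the function below computes the absolute deviation of an estimate from the ground truth relative to the ground truth. Reporting the result as a percentage is an assumption, and the numerical values are made up for illustration rather than taken from the tables.

```python
def relative_error(m_inversion, m_true):
    # Absolute deviation relative to the ground truth, in percent (assumed unit).
    return abs(m_inversion - m_true) / abs(m_true) * 100.0


# Hypothetical shear wave velocity estimate of 210 m/s against a true 200 m/s.
re_vs = relative_error(210.0, 200.0)

# Relative errors of several hypothetical independent inversions.
re_runs = [relative_error(est, 200.0) for est in (190.0, 200.0, 220.0)]
```

Here the single estimate is off by 10 m/s out of 200 m/s, which corresponds to a relative error of about 5%.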

Case 1
Case 1 defines a six-layer geoacoustic model referenced from the inversion results in the Grane field [26], as shown in Table 4. The phase velocity dispersion curves for the first five modes of the Scholte waves are calculated in the frequency range from 0 to 5 Hz as the ground truth. The density and compression wave velocities are considered as known since the dispersion properties of the Scholte wave are not sensitive to them [1]. The original search bounds for Case 1 are shown in Table 5. The estimated dispersion curves with the ground truth are shown in Figure 3, where the black dots are the ground truth and the blue, green, yellow, and red lines express the dispersion curves estimated by GA, ASSA, DE, and the DQN framework, respectively.
The performance analysis for Case 1 is shown in Table 6, in which bold font marks the lowest relative error, misfit value, or running time in each row. The following observations are based on Figures 3 and 4 and Table 6:
• At the low frequencies of the fundamental mode, the estimation errors of ASSA and DE are larger than those of GA and the DQN framework.
• For all candidate methods, the relative errors of the shear wave velocity in all layers are lower than those of the thickness. Both the shear wave velocity and layer thickness are sensitive to the dispersion properties of Scholte waves. However, in some cases, one can be more sensitive than the other.
• The low-to-high ranking of the misfit values is the DQN framework, ASSA, DE, and GA. It is the same as the low-to-high ranking of the relative error of each geoacoustic parameter, with a few exceptions ($h_1$, $h_2$, $h_4$, and $Vs_2$). This illustrates that the misfit value has a partly positive correlation with the overall relative errors, which is significant since the ground truth of a field survey is mostly unknown and the misfit value is the only metric.
• The DQN framework attains the lowest misfit and the shortest running time among all methods. Furthermore, the DQN framework has the lowest relative errors for all estimated geoacoustic parameters except $h_2$ and $Vs_2$.
Two geoacoustic parameters are picked for further statistical analysis: $Vs_3$, for which the DQN framework performs the best inversion, and $h_2$, for which it does not. The statistical analyses of $Vs_3$ and $h_2$ are shown in Figure 5 and Figure 6, respectively. In the figures, the red dashed curve illustrates the distribution of the estimated parameter over 100 independent inversions. The gray block, the purple line, and the black line correspond to the histogram, the averaged value over 100 independent inversions, and the ground truth, respectively. Intuitively, the closeness between the purple and black lines indicates the inversion performance (the closer, the better). In addition, the width of the distribution of the estimated parameter corresponds to the uncertainty of the inversion results: a narrower distribution means a lower uncertainty. The DQN framework has the narrowest distribution and therefore the lowest uncertainty of the inversion result compared to the other methods.

Case 2
Case 2 considers another geoacoustic model from the inversion results in the North Sea [2]. The geoacoustic model consists of a sediment layer with a linear velocity gradient over a continuous half-space (namely, LC introduced in [1]). The model parameters listed in Table 7 are used to generate phase velocity dispersion curves in the frequency range of 3 to 18 Hz. Five modes are selected as the ground truth. $Vs_t$ and $Vs_b$ refer to the shear wave velocities at the top and bottom of the sediment layer, respectively. [...] [10, 100], and [100, 500], respectively.
The estimated dispersion curves and shear wave velocity profiles are shown in Figure 7 and Figure 8, respectively. Their legends are the same as in Figure 3 and Figure 4, respectively. The performance analysis for Case 2 is shown in Table 8, in which the bold fonts refer to the lowest relative error, misfit value, or running time over each row.
From Figures 7 and 8 and Table 8, the following points can be found:
• The low-to-high ranking of the misfit values is the DQN framework, DE, ASSA, and GA. The ranking is consistent with the low-to-high ranking of the overall relative error. This trend was also observed in Table 6.
• The DQN framework attains the lowest misfit, the shortest running time, and the lowest overall relative error compared to the other candidate methods.

Discussion
Since the ground truth information is not available in many real scenarios, Table 9 compares the performance in terms of misfit and running time and summarizes the analysis in Sections 5.2 and 5.3. Table 9 demonstrates that the proposed framework performs a faster and lower-misfit inversion (highlighted in bold font) compared to the other methods. Furthermore, Figures 5 and 6 illustrate that the proposed framework provides inversion results with relatively lower uncertainties than the other methods.
As mentioned in Section 4.2, Action 0 explores the parameter space more roughly since it compresses the search bounds in a relatively fast way. On the other hand, Action 1 is more suitable for finely exploring one local area in the parameter space. To understand the learned search strategy of the DQN framework, Figure 9 exhibits how the actions are executed by the DQN framework in an independent inversion for Case 1, where the blue star, the orange dot, and the black curve refer to Action 0, Action 1, and the relative misfit as a function of iteration numbers, respectively. As shown in Figure 9, the agent executes Action 0 at the early stage of iterations and mainly executes Action 1 after that. This pattern can be interpreted as the DQN framework executing Action 0 at the early iterations to locate the rough area of the solution in the parameter space and executing a finer exploration (i.e., Action 1) after that to determine the final solution.
Note that we do not need to set a hard threshold for the agent to change from Action 0 to Action 1 because each independent inversion is initialized randomly. Furthermore, the relative misfit can exceed 100% in the first several iterations since the normalization factor $E_{norm}$ is defined by a random initialization.

Conclusions
In this paper, a DQN-based framework for geoacoustic inversion is proposed. The framework can be regarded as an optimization-based inversion method with a learnable search strategy, which retains the advantages of both the optimization-based and ML-based approaches. Its performance is assessed in two numerical cases for estimating the shear wave velocity profile based on Scholte wave dispersion curves and compared with that of three popular optimization methods. Compared to the fastest conventional method, the running time of the proposed framework is reduced by 37.7% (Case 1) and 68.9% (Case 2). Compared to the best conventional method, the misfit of the proposed framework is reduced by 51.6% (Case 1) and 9.4% (Case 2). The results demonstrate the potential of DRL for geoacoustic inversion and the superior performance of the proposed framework. More specifically, the proposed framework can provide a faster inversion with lower misfit, lower relative error, and lower uncertainty. Note that the application scope of the proposed framework is the whole geoacoustic inversion field: the framework can easily be applied to different inversion tasks by using an appropriate forward model.
Future research will focus on the representation of the misfit space (i.e., the environment state) and the design of new actions in the action space for the agent to explore the parameter space.
Implementation code availability: The code will be made available for the peer reviewers and the public upon request after the manuscript is published.