Eliminate Time Dispersion of Seismic Waveﬁeld Simulation with Semi-Supervised Deep Learning

: Finite-difference methods are the most widely used methods for seismic waveﬁeld simulation. However, numerical dispersion is the main issue hindering accurate simulation. In the case where the ﬁnite-difference scheme is known, the time dispersion can be predicted mathematically and, thus, can be eliminated. However, when only pre-compiled software is available for waveﬁeld simulation, which is common in practical applications, the software-used algorithm becomes a black box (unknown). Therefore, it is challenging to obtain the mathematical expression of the time dispersion, resulting in difﬁculty in eliminating the time dispersion. To solve this problem, we propose to use deep learning methods to eliminate time dispersion. We design a semi-supervised framework based on convolutional and recurrent neural networks for eliminating time dispersion caused by seismic wave modeling. The framework of our proposed neural network includes two main modules: Inverse Model and Forward Model, both of which have learnable parameters. The Inverse Model is used for eliminating time dispersion while the Forward Model is used for regularizing the training. Particularly, this framework includes two steps: Firstly, using the compiled modeling software to generate two data sets with large and small time steps. Secondly, we train these two modules for transformation between large time-step data (with time dispersion) and small time-step data (without time dispersion) by labeled and unlabeled data sets. In this work, the labeled data set is a paired data set with large time-step data and their corresponding small time-step data; the unlabeled data set is the large time-step data that need time-dispersion elimination. We use the unlabeled data set to guide the network. In this learning framework, re-training is required whenever the modeling algorithms, time interval, or frequency band is changed. Hence, we propose a transfer learning training method to extend from the trained model to another model, which reduces the computational cost caused by re-training. This minor drawback is offset overwhelmingly by the modeling efﬁciency gain with large time steps in large-scale production. Tests on two models conﬁrm the effectiveness of the proposed method.


Introduction
Seismic wavefield simulation/modeling is essential for high-precision imaging of the Earth's subsurface [1]. The conventional seismic wavefield modeling is based on the numerical solution of the wave equation. The solution of the wave equation involves the calculation of the time and space derivatives with numerical methods. Researchers have developed wave-equation-based imaging methods such as reverse-time migration [2,3] and full-waveform inversion [4][5][6]. The finite-difference (FD) scheme is the most basic and often used method to solve the wave equation [7][8][9]. However, the FD approximations of time and space derivatives lead to numerical dispersion when the sample interval is large, even if it satisfies the stable condition [10,11]. To accurately simulate the wavefield, it is possible to minimize the dispersion errors in space and time.
Researchers have proposed methods to improve finite difference spatial accuracy, including methods to lengthen the spatial difference operator and methods to optimize the spatial FD coefficients [12][13][14]. However, using long difference operators to approximate spatial partial derivatives requires more computation, and due to the saturation effect [15], the improvement of the simulation accuracy becomes small as the operator length increases. The FD coefficients of the conventional FD methods are based on the Taylor series expansion, which can only ensure the accuracy of the simulation in the low wave-number part. The optimized FD methods approximate the dispersion relation to obtaining the FD coefficients using optimized methods, such as least squares (LS) minimization [16][17][18] and simulated annealing methods [12], which can significantly improve simulation accuracy in the medium-to-high wave-number part. The pseudo-spectral (PS) methods [19,20] are used to obtain spatial partial derivatives using forward and inverse Fourier transforms. The PS methods can theoretically be accurate up to the Nyquist wave number but require more computational time than FD methods.
The time derivatives in wave equations are commonly approximated using the secondorder FD format, but the low-order systems can result in significant time dispersion errors. Researchers have proposed the Lax-Wendroff methods [21,22] to improve time accuracy by replacing high-order time derivatives with spatial ones. To increase the stable timestep upper limit, optimal Lax-Wendroff methods were developed [13,23,24]. In addition, the new FD template method improves temporal accuracy by using time-space domain methods with new stencils. Liu and Sen (2013) [25] presented a centered-grid FD method with a rhombus-shaped stencil, which can achieve arbitrary even-order accuracy in the time and space domains but is computationally demanding. To balance accuracy and efficiency, Wang (2016) [18] proposed an FD method based on the combination of cross and rhombus stencils. Tang and Huang (2014) [26] developed the staggered-grid FD method of fourth-or sixth-order in time. Ren (2017) [27] presented the temporal high-order SFD schemes, and Ren (2022) [28] applied the staggered-grid FD method for source wavefield reconstruction.
Alternatively, the method of the post-propagation filter is an efficient approach to reducing time dispersion. Stork (2013) [29] firstly proposed the time dispersion correction filter, which demonstrated that time dispersion is independent of propagation path, medium, and spatial dispersion errors and successfully filtered away time dispersion errors from seismograms. Wang and Xu (2015) [30] introduced a time dispersion correction method based on the analytically anticipated time dispersion error, which improves the effectiveness of the time dispersion filter. The time dispersion filter applies the forward time-dispersion transform (FTDT) for predicting time dispersion error and the inverse time-dispersion transform (ITDT) for removal. However, the FTDT is an adjoint rather than an inverse of the ITDT, so the ITDT with their proposed filters cannot fully eliminate the time dispersion added by the FTDT. Li (2016) [31] suggested a time-varying filter to minimize time dispersion in shot recordings. Koene (2018) [32] derived the precise inverse of ITDT, which is employed as a new FTDT, where the ITDI can recover (with proper band restrictions) the input signal, and the ITDT can successfully minimize temporal dispersion of the modeling wavefields with FD, PS, and spectral element methods, enhancing temporal accuracy more effectively.
These methods are used to propagate seismic data through synthetic models, and the forward modeling methods are capable of determining the dispersion relationship quantitatively. However, because highly integrated software is typically used in practical production, software algorithms are a black box to users (i.e., it is impossible to obtain the specific forward modeling method from the compiled software) which makes it impossible to quantify the dispersion relationship of the forward modeling results, and therefore difficult to accurately eliminate the time dispersion of the modeling results. Recently, machine learning and deep learning algorithms have been utilized with some success in seismic impedance inversion [33,34], seismic modeling [35], and seismic data interpretation [36]. In this paper, we propose to use machine learning methods to overcome this problem.
The framework of our proposed neural network includes two main modules: Inverse Model and Forward Model and these two main modules for transformation between large time-step data (with time dispersion) and small time-steps data (without time dispersion). We propose a semi-supervised machine learning strategy to eliminate time dispersion. Specifically, we use deep learning to learn the mapping patterns in temporally dispersed samples. The Inverse Model is used for eliminating time dispersion while the Forward Model is used for regularizing the training. Although the deep-learning network is a "blackbox", which is difficult to be quantitatively interpreted, in this study, we can generate any amount of data that the network needs to train the network. Therefore, the network is able to perform as designed. Thus, the proposed deep-learning method is considered feasible.
The rest of this paper is organized as follows. Section 2 shows the details of our framework, which is based on semi-supervised learning to train our proposed model and then shows a new training method by transfer learning based on the pre-trained model. Section 3 provides and analyzes the experimental findings of our proposed network on the Marmousi and SEAM models. Section 4 presents discussion. Section 5 summarizes the entire paper.

Theory
In this work, we use a modeling program with the pseudo-spectral (PS) method as the black box, i.e., the compiled software, to generate the data set with large and small time steps. Since the PS method is used, the simulated seismic data are free of spatial dispersion but have temporal dispersion. A few training data are first generated using the modeling program with large and small time steps, which mimic the time-dispersed data and the time-dispersion-free data, respectively. The data with small time steps are used as the label of the data with large time steps.
In this paper, we propose a semi-supervised framework for eliminating time dispersion based on Convolutional Neural Networks (CNNs) and Gate Recurrent Units (GRUs). We can utilize semi-supervised networks to train not only labeled data but also unlabeled data. The unlabeled data act as constraints to better train the generative network and generate more accurate outcomes.
CNNs are used to extract feature information, where dilation convolution is used to expand the perceptual field and where the data are inputted point by point. In this work, the size of the data is large, and multiple dilation convolutions with different dilation rates are superimposed; thus, the different perceptual fields can lead to multi-scale contextual information, which also helps to reduce computational effort. Recurrent Neural Networks (RNNs) are used to extract temporal information since time dispersion data are a type of temporal data. An RNN network typically has four inputs and one output. The inputs include three gate functions-Input Gate, Forget Gate, and Output Gate-that can be used to train the network's weights using the context-specific knowledge of the data to learn the mapping relationship between the inputs and the outputs. The GRU network is an adaptation of the RNN network that, to simplify computation, merges the typical Input Gate and Forget Gate into a single Update Gate and blends cell states with hidden states.

Network Structures
We choose two main modules for transformation between large time-step data and small time-step data: Inverse Model and Forward Model, both of which have learnable parameters; the Inverse Model is used for eliminating time dispersion, and the Forward Model is used for regularizing the training. The workflow we proposed is shown in Figure 1, which includes labeled and unlabeled data sets. The labeled data set is a paired data set with large time-step data and their corresponding small time-step data, and the unlabeled data set is large time-step data that contain time dispersion that needs to be eliminated. Here, x is the labeled single trace of time-dispersed data, its corresponding target time-dispersion-free data is labeled as y, and x_u is the unlabeled single trace of time-dispersed data. x is mapped to y pred by the Inverse Model, and then y pred with y calculates property loss 1; unlabeled data (in the dashed black box) x_u is mapped to y_u pred by the Inverse Model, then y_u pred is mapped to x_u pred by the Forward Model, and x_u pred with x_u calculate the unlabeled loss. y is mapped to x pred by the Forward Model, and then x pred with x calculate property loss 2. Specifically, property loss 1 adds property loss 2 as property loss; both property loss and unlabeled loss are used for updating the weights of the Inverse Model and Forward Model.
The network takes input data (both labeled and unlabeled large time-step data) and feeds them to the proposed Inverse Model, which transformations the data from the source domain with time dispersion to the target domain without time dispersion, and the loss function between the label of the input and the output of the Inverse Model is minimized. Next, the generated without-time-dispersion data of the unlabeled data set is then fed into the proposed Forward Model, which converts it back to time-dispersed data, and the loss function between the original input and the output of the Forward Model is minimized. Furthermore, the network takes the other input data (labeled small time-step data) and feeds them to the proposed Forward Model, which transformations the data from the source domain without time dispersion to the target domain with time dispersion, and the loss function between the label of the input and the output of the Forward Model is minimized.
The architecture of the Inverse Model and Forward Model are similar to that used by Motaz Alfarraj [37], where the difference is the adjustment of local parameters and the removal of the upsampling submodule. Figure 2 shows the Inverse Model in the proposed workflow. The Inverse Model consists of three main submodules. These submodules are denoted as Sequence Modeling, Local Pattern Analysis, and Regression, and each have a different role in the overall model. Figure 3 shows the Forward Model architecture, which is similar to that of the Inverse Model in the proposed workflow. We noted that time-dispersion-free data need more complex mapping than time-dispersed data, so the differences are 2 times the GRU output data size and two fully-connected-layer modules in Forward Model. The fully-connected-layer modules are mainly used for compensation of resolution mismatch between input data and output data.  Sequence Modeling submodule comprises multiple Gated Recurrent Units (GRUs) [38]. The input traces of GRU are modeled as sequential data trace by trace, and temporal features are computed based on temporal variations of these data, so GRU can continually update the memory of the past and future moments and adjust its weights in the training process. Consider that more hidden layer neurons will have significantly better effects on capturing temporal features; therefore, one GRU module is set in the Sequence Modeling submodule, where we set the input data size to the number of hidden layer neurons. Then, one fully connected layer is used as a transition layer, which compensates for the dimensional mismatch between the outputs and the inputs.
Local Pattern Analysis submodule comprises three parallel 1D CNN modules, each parallel CNN module having the same convolutional kernel size and different dilation coefficients. Three outputs of these parallel CNN modules are stitched together into one output, which is then fed into another 1D CNN module with three layers and decreasing convolutional kernels. Dilation is defined as the intervals between convolution core points in the convolution layers [39]. The superposition of multiple dilated convolutions with different dilation rates expands the receptive field, and the different receptive fields lead to multi-scale temporal feature information. There is also a group normalization layer and an activation function layer between two adjacent CNN layers. The CNNs are used mainly to capture high-frequency trends in the data. The output of deep GRUs is regarded mainly as low-frequency trends in the data. To capture the frequency content of the full frequency band, the two outputs of the CNNs and GRUs are combined and fed into Regression.
The final submodule is Regression, which is used for mapping the extracted features from the other two submodules to the target data. Regression comprises a 1-layer GRU followed by a linear layer. In addition, the input layer of GRU and output layer in this submodule all use the transpose function. Hence, the output size is set to the same as the input size, i.e., one trace in and also one trace out.
In the end, we propose a transfer learning training method to extend from the trained model to another model, which helps the network adapt to new seismic data better. We consider that GRU is designed to capture the sequential relationship of input data when extracting features from the training set, and GRU continually adjusts the weights of the Local Pattern Analysis submodule. Thus, for the new seismic data, we fix the initialization of the Local Pattern Analysis submodule by using the trained model weights file and retraining the parts with the GRU. This training method can use a smaller number of training data rather than starting from scratch.

Loss Function
In this work, the input size and output size are both set to N × 1. We denote the labeled single trace data of time-dispersed data as x, the unlabeled data as x_u, which is also the data for time-dispersion elimination. The corresponding time-dispersion-free data of x is denoted as y. The Inverse Model and Forward Model are denoted as f wI (·) with weights W I and f wF (·) with weights W F , respectively.
As shown in Figure 1, property loss 1 adds property loss 2 as property loss, which is calculated between predicted and target on traces and maps temporally dispersed data to time-dispersion-free data. Unlabeled loss is calculated between predicted and input unlabeled data. This kind of loss is adopted to constrain the prediction consistency on the dispersive time-series traces. In this part, we have modified the mean absolute error (L 1 ) by calculating property loss and unlabeled loss separately to better fit our proposed workflow training loss function. Formally, the loss function takes the following form: The parameters of both the Inverse Model and the Forward Model are adjusted by combining both losses as in Equation (3). In this work, we choose α = 0.2 and β = 1, which are proposed by Motaz Alfarraj [37].

Training Procedure
We generate the data set using a compiled wavefield modeling program, which uses the pseudo-spectral (PS) method, for simulating wavefields with large and small time steps. Firstly, the proposed strategy is tested on a partial modified Marmousi model. The corresponding velocity model is shown in Figure 4. The source wavelet is a Ricker wavelet with a dominant frequency of 15 Hz. The spatial location of the source is unequally spaced. A split-spread geometry is applied. The receiver array has a receiver spacing of 20 m. The minimum and maximum offsets are 0 m and 14,750 m, respectively. We choose the time intervals of 0.2 ms and 1.2 ms, assuming that the solution obtained with the 0.2 ms interval is time-dispersion-free data. We use approximately 90% of the shot recording data for the training set and the remainder for the test data set. The validation data are randomly picked from the test data set. Furthermore, the proposed strategy is tested on the SEAM model. The corresponding velocity model is shown in Figure 10. In the SEAM model test, we choose the time intervals of 0.4 ms and 2 ms, assuming that the solution obtained on the 0.4 ms intervals is time-dispersion-free data. Then, we apply transfer learning with the trained network from the Marmousi model and try the SEAM model with the training samples consisting of 40% of shot recording data for the training set. The training procedure for the proposed workflow is shown in Algorithm 1. The initial learning rate is 1 × 10 −4 , x ∈ X represents the labeled time-dispersed data, x_u ∈ X_U represents the unlabeled time-dispersed data, and y ∈ Y represents time-dispersion-free data.  X → Y pred , Y → X pred 6: Calculate the Property Loss in Equation (1)  7: Randomly sample the unlabeled data x_u 8: X_U → Y_U pred , Y_U pred → X_U pred 9: Calculate the Unlabeled Loss in Equation (2)  10: Calculate the Loss in Equation (3) using property loss, unlabeled loss, α and β 11: end for 12: Update W I and W F in order to minimize Loss 13: end for

Results
Our main goal is to eliminate time dispersion, and we apply our method to two data sets with Pytorch on a Nvidia RTX3090 GPU. Images below are mainly the results predicted on the test data sets, which are shown for evaluating the performance of the proposed workflow quantitatively.

The Data Test with the Marmousi Model
In this part, the data set is generated by using the modeling program with the velocity model of a partial modified Marmousi (Figure 4).
To create the data set, we choose the time intervals of 0.2 ms and 1.2 ms, assuming that the solution obtained with the 0.2 ms time intervals is approximated as time-dispersion-free data. The data set includes 32 shots, each of which contains 550 receivers with a receiver spacing of 20 m. Each trace has 2001 time sampling points by re-sampling at the time interval of 3.6 ms. We resample the data because the frequency of exploration seismic data including this example is usually less than 100 Hz. In addition, the resampling into a coarse interval can also save the computational and memory cost. In the training process, we use 30 shots, which are fed into the network sequentially trace by trace, and 2 shots for testing. Considering a sufficient amount of data and a large number of traces per shot, we set the batch size to 50. The Inverse Model and Forward Model are together trained for 500 epochs. Next, we follow the training procedure. Figure 5 shows that the loss function curve leveled off. As shown in Figure 1 the workflow we proposed, the Inverse Model is trained to eliminate time dispersion of large time-step (i.e., 1.2 ms) unlabeled data set, and the Forward Model is used for regularizing the training by semi-supervised learning mapping small time-step data (i.e., 0.2 ms) to large time-step data (i.e., 1.2 ms). Figures 6 and 7 show the results at the time from 3.81 s to 4.23 s of Trace No. 500 by the Inverse Model and the Forward Model, respectively. Results for the shot profiles are shown in Figure 8, and the enlarged views of the red box areas in the results are shown in Figure 9. It is evident that the dispersion artifacts in large time-step data, as seen in Figure 9a, are suppressed by the proposed Inverse Model as seen in Figure 9b, and the results are similar to the corresponding small time-step data in Figure 9d. Figure 9e shows the output of Forward Model, which has visible dispersion artifacts, are inputted by the small time-steps data as seen in Figure 9d, and the results in Figure 9e are similar to large time-steps data in Figure 9a. Figure      We choose five different metrics to measure the performance of the proposed method: mean absolute error (L 1 Loss), Pearson's correlation coefficient (PCC), coefficient of determination (r 2 ), peak signal-to-noise ratio (PSNR/dB), and structural similarity index (SSIM) between target and the prediction. PCC is a metric that measures the linear relationship between the estimated and target traces, and r 2 is a goodness-of-fit metric that takes the mean squared error between the two traces; both metrics are often used to assess the overall fit of two traces. PSNR is used to compare the quality of output image and the original image, and SSIM measures the similarity between the two images. The quantitative outcomes are computed using the training and testing traces. In this, we have modified the L 1 Loss (L 1 ) to better fit our proposed workflow. Specifically, the L 1 Loss is the sum of the squares of all corresponding location differences: where N = 2001 is the number of one trace data points, L 1 represents the original L 1 Loss, and the modified L 1 Loss is set as L. The result is shown in Table 1. The results in Table 1 demonstrate that the suggested workflow performs almost as well on validation data as it does on training data, demonstrating its generalizability beyond training data.

The Data Test with the SEAM Model
In this part, the data set is generated by using the compiled program with the velocity model of SEAM (Figure 10). To create the data set, we choose the time step of 0.4 ms and 2 ms, assuming that the solution obtained on the 0.4 ms time step is approximated as time-dispersion-free data. The data set contains 12 shots, each of which contains a receiver spacing of 20 m between the receiver array, and each shot contains 325 traces, each trace with 2001 time sampling points by re-sampling to the time interval of 4 ms. In this training, we choose 5 shots for training and 7 shots for testing. Considering the small size of the SEAM data set, we adjust the batch size to 5, and the Inverse Model and Forward Model are together trained for 250 epochs. Next, we follow the training procedure. Figure 11 shows that the loss function curve leveled off. Here, we adopt a new training approach-transfer learning-to evaluate the performance of the method. Considering that the Marmousi model has sufficient training data and performs well on the test data sets, we apply transfer learning with the trained Marmousi model, and we fix the initialization of the Local pattern analysis submodule by Marmousi model weights file, and then continue training the SEAM data set. The remainder of the steps are the same as the previous training process.
As shown in Figure 1 the workflow we proposed, the Inverse Model is trained to eliminate time dispersion of large time-step (i.e., 2 ms) unlabeled data set, and the Forward Model is used for regularizing the training by semi-supervised learning mapping small time-step data (i.e., 0.4 ms) to large time-step data (i.e., 2 ms). Figures 12 and 13 show the result at the time from 2.6 s to 3.8 s of Trace No. 50 by the Inverse Model and the Forward Model, respectively. Results for the shot profiles are shown in Figure 14, and the enlarged views of the results are shown in Figure 15. It is evident that the dispersion artifacts in large time-step data, as seen in Figure 15a, are mitigated by the proposed Inverse Model as seen in Figure 15b, and the results are similar to small time-step data shown in Figure 15d. Figure 15e shows the output using Forward Model, which has visible dispersion artifacts, are inputted by the small time-step data as seen in Figure 15d, and the results in Figure 15e are similar to large time-step data in Figure 13a. Figure 13c shows the absolute difference between Figure 13b and d while Figure 13f displays the difference between Figure 13a,e.    Similar to the Marmousi model, we choose five different metrics to measure the performance of the proposed methods: Pearson's correlation coefficient (PCC), coefficient of determination (r 2 ), modified L1 Loss (L), peak signal-to-noise ratio (PSNR/dB), and structural similarity index (SSIM) between target and the mean prediction. The results are shown in Table 2.

Discussion
In this work, the given network is designed to map the data simulated by a compiled program using two designated time intervals. Hence, re-training is required whenever the modeling algorithm, time interval, or frequency band is changed. This is because, as in all deep learning methods, training a neural network on one data set and then applying it to another data set requires that both data sets have the same distribution; otherwise, the test data will be incorrect due to overfitting. Considering the computational effort associated with re-training, on the new seismic model, we propose transfer learning to overcome this problem. When we test the proposed strategy on the SEAM model using the transfer learning training approach, we regard Marmousi as a pre-training seismic model. However, the Marmousi model should have enough training data and perform well on the test data sets, otherwise, the SEAM model data set may not work well with small amounts of data.
Additionally, as a comparison, we built a standard GRU network to eliminate time dispersion. This GRU network is from Pytorch machine learning library, and the size of input and output are the same as our proposed network. The data from the SEAM model are used for test. The results are shown in Figures 16-18. As can be seen, the GRU deep-learning model can also eliminate the time dispersion but less effective than the proposed semi-supervised deep-learning model. This experiment demonstrates that many deep-learning models can achieve time-dispersion elimination but the performance may be different.

Conclusions
Seismic wavefield simulation is a key component of seismic data imaging and inversion. Finite-difference (FD) methods are the most widely used technique, but it suffers from time and spatial dispersion. In the case where only a pre-compiled software is available, the FD scheme/algorithm is unknown, and it becomes a black box to user. Therefore, it is hard to predict the time dispersion mathematically. Herein, we propose using deep learning methods to overcome this problem.
In this study, we use CNNs and GRUs with semi-supervised learning to eliminate time dispersion of seismic records. We estimate the proposed Inverse Model and Forward Model by training the network with a few shots generated by a compiled modeling program. The training data sets include labeled and unlabeled data sets. The labeled data set is a paired data set with large time-step data (with time dispersion) and their corresponding small time-steps data (without time dispersion). The unlabeled data set is the large time-step data that need time-dispersion elimination. We then apply this network to the unlabeled data sets. In addition, re-training is required whenever the modeling algorithm, time interval, or frequency band is changed, so we propose a transfer learning training method to extend from the trained model to another model, which reduces the computational load caused by re-training. Numerical tests on different models indicate that the proposed method is capable of sufficiently eliminating the time dispersion. The networks and examples presented in the paper can be accessed via https://github.com/wavetomo/Timedispersion (accessed on 10 October 2020).