Parallel Dislocation Model Implementation for Earthquake Source Parameter Estimation on Multi-Threaded GPU

: Graphics processing units (GPUs) have been in the spotlight in various ﬁelds because they can process a massive amount of computation at a relatively low price. This research proposes a performance acceleration framework applied to Monte Carlo method-based earthquake source parameter estimation using multi-threaded compute uniﬁed device architecture (CUDA) GPU. The Monte Carlo method takes an exhaustive computational burden because iterative nonlinear optimization is performed more than 1000 times. To alleviate this problem, we parallelize the rectangular dislocation model, i.e., the Okada model, since the model consists of independent point-wise computations and takes up most of the time in the nonlinear optimization. Adjusting the degree of common subexpression elimination, thread block size, and constant caching, we obtained the best CUDA optimization conﬁguration that achieves 134.94 × , 14.00 × , and 2.99 × speedups over sequential CPU, 16-threads CPU, and baseline CUDA GPU implementation from the 1000 × 1000 mesh size, respectively. Then, we evaluated the performance and correctness of four different line search algorithms for the limited memory Broyden–Fletcher–Goldfarb–Shanno with boundaries (L-BFGS-B) optimization in the real earthquake dataset. The results demonstrated Armijo line search to be the most efﬁcient one among the algorithms. The visualization results with the best-ﬁt parameters ﬁnally derived by the proposed framework conﬁrm that our framework also approximates the earthquake source parameters with an excellent agreement with the geodetic data, i.e., at most 0.5 cm root-mean-square-error (RMSE) of residual displacement.


Introduction
The Monte Carlo method has widely been used for geophysical source parameter estimation in remote sensing [1][2][3]. The method repeats nonlinear optimization from multiple random starting points to derive the best-fit source parameters that minimize an objective function indicating the misfit between geodetic measurement and the dislocation model. Therefore, it enables avoiding local minima and evaluating the uncertainties of parameters [4].
Interferometric synthetic radar (InSAR) has been one of the most common methods for surface displacement measurement. It compares the phase offset of two or more complex-valued SAR images obtained from different times and locations [5]. InSAR can derive precise geophysical information with centimetric or millimetric accuracy. For this reason, InSAR data have been widely employed in remote sensing, including earthquakes, volcanic activities, landslides [6], thaw-derived slope failure [7], or glacial ice movement [8] investigations.
The dislocation model mathematically calculates the surface deformation of earthquakes or volcanic activities. The rectangular dislocation model, also known as the Okada model, is a popular model for earthquake source modeling, which assumes finite rectangular source and isotropic half-space [9,10]. Other models include the prolates spheroid model [11] and the spherical source model [12] for magma source modeling.
Most nonlinear optimization algorithms iteratively search local minima using gradient or Hessian approximation of the objective function. However, since the dislocation model computes three-dimensional (3D) surface displacement of points in a two-dimensional (2D) mesh, computation of the objective function leads to many floating-point operations proportional to the mesh size. Therefore, gradient or Hessian approximation of the objective function becomes too expensive. Furthermore, the nonlinear optimization is iterated more than 1000 times for the Monte Carlo method, and thus the Monte Carlo method takes an extreme computational burden.
In the recent decade, graphics processing units (GPUs) have been gaining attention as a powerful solution to overcome the performance limitations of conventional multi-core CPUs, especially for applications demanding a massive amount of computations. Furthermore, the release of NVIDIA's computed unified device architecture (CUDA) [13], a new general-purpose parallel programming interface, also facilitated the use of GPU platforms for general-purpose computing applications. Since the CUDA architecture can reorganize the computation around data when the data can be processed independently, we can execute the independent computation in parallel on a vast number of threads to improve the overall performance significantly. Therefore, the Okada model, the dislocation model discussed in this paper, fits well for the GPU implementation because the 3D displacement computation of each point needs a massive number of floating-point operations and has no dependencies between these operations. This paper proposes a performance acceleration framework based on CUDA GPU for the earthquake source estimation with the Okada dislocation model. For this purpose, (1) we analyze the effects of various combinations from options of CUDA optimization techniques and, based on the analysis, perform the CUDA kernel optimization for the parallel implementation of the Okada model; (2) we investigate the performance and correctness of line search algorithms used for the nonlinear optimization to derive the earthquake source parameters in an efficient manner; (3) finally, we verify that the earthquake source parameters derived by our proposed approach also fit the geodetic data correctly by visualizing the residual data between the geodetic data and our modeled data.
Considering the target application characteristics, three techniques, common subexpression elimination, thread block size adjustment, and constant caching, are employed for the CUDA kernel optimization. Then, the combinations generated from the options of these optimization techniques are evaluated in terms of efficiency and occupancy. As a result, the configuration with the shortest average computation time is selected as our best CUDA optimization configuration. As for the comparison of line search algorithms, we employ the root-mean-square-error (RMSE), computation time, 95% confidence interval, and parameter distribution on the real earthquake dataset as the evaluation metrics. Finally, the source parameters with the best consistency between geodetic and modeled displacement are determined and validated.
The paper is organized as follows; Section 2 introduces the related work and background of this study. Then, we define the problem addressed and describe our proposed approach in Section 3. Next, Section 4 presents the experimental results and discussions of our approach. Finally, we conclude the paper in Section 5.

Earthquake Source Parameter Estimation
In remote sensing, the Monte Carlo method has widely been used for earthquake or volcanic source parameter estimation. To reduce the computation time, many researchers used techniques that randomly sample only a portion of the entire geodetic measurement, such as quadtree sampling [14], or defined bound constraints based on previous seismic studies conducted in the same study area. De Novellis et al. [15] estimated the source pa-rameters and its uncertainties for the 2015 Wolf volcano using the Okada and Yang models and the Levenberg-Marquardt method. Studies by Funning et al. [16] and Qu et al. [17] used the downhill simplex optimization and the Okada model to estimate the source parameters of the 2003 Bam earthquake and the 2009 Yao'an earthquake, respectively. Dicelis et al. [18] estimated the source parameters of the 2008 Quetame earthquake using the nonlinear interior-point optimization and the Okada model.
Recently, studies estimating the source parameters using a statistical approach or machine learning have also been proposed. Bagnardi and Hooper [19] and Dutta et al. [20] used the Bayesian approach based on the Markov Chain Monte Carlo. Šílený [21] and Picozzi et al. [22] applied a genetic algorithm and artificial neural network for source parameter estimation. Lee and Kim [23] proposed a parameter search space reduction method based on the principal component analysis to accelerate nonlinear optimization.

Nonlinear Optimization
In general, nonlinear optimization algorithms are classified into trust-region methods or line search methods. Both of them approximate an objective function as a quadratic model, but the methods proceed in different ways. Trust-region methods define a region around the current step that the quadratic model is adequate and find the approximate minimizer in the region. However, line search methods first find the descent direction and focus on searching appropriate step length [24]. Since most source parameters have their own upper and lower bound constraints, we should employ an optimization algorithm handling constrained problems.
The Levenberg-Marquardt algorithm is a popular trust-region method [25]. The algorithm can be thought of as a combination of the steepest descent and the Gauss-Newton method. Using the damping parameter µ, it operates like the steepest descent when the current point is far from the correct solution. On the other hand, when the current point is close to the correct solution, it behaves like the Gauss-Newton method [24]. We can generally use the algorithm to solve unconstrained nonlinear optimization problems, but the algorithm for the constrained problem was also developed in [26]. Other trust-region methods include trust-region reflective [27] and Dogleg [28] methods.
In line search methods, we typically use the steepest descent, Newton's method, or the quasi-Newton methods to derive the descent direction. In the k-th step, let f k and d k be the objective function and the descent direction, respectively. In the steepest descent, the computation of the descent direction is very simple, calculated as d k = −∇ f k . However, it slowly converges to the local solution. Newton's method finds the descent direction using the inverse of Hessian matrix as d k = −∇ 2 f −1 k ∇ f k . It converges to the local solution faster than the steepest descent, but computing the inverse of ∇ 2 f k is too expensive. As an alternative to Newton's method, the quasi-Newton method approximates the inverse of the Hessian matrix using only the gradient of the objective function. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is a representative quasi-Newton method, but the BFGS method also suffers from high memory usage. To resolve this problem, the limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method [29] was developed. The L-BFGS-B method, which is a variant of L-BFGS, was also proposed to handle simple bound constraints [30].
In this paper, we used the L-BFGS-B algorithm to handle bound constraints with a small memory footprint. Another advantage of using L-BFGS-B is that the user can adjust the computational cost and memory requirement by tuning the parameter m, which denotes the number of stored previous points and gradients.
In line search methods, computing the step length α is a key factor for the accuracy and performance of the algorithm. For the k-th iteration, let x k , d k , and α k be the current point, descent direction, and step length, respectively. The ideal step length would be the global minimizer of the function φ(·) defined by Equation (1). However, since it is too expensive to derive the global minimizer, there is a trade-off between the quality of a step length and the computation time [24].
Typical line search algorithms perform a sequential search until predefined conditions of φ(·) are satisfied. A popular condition is the sufficient decrease condition, described in Equation (2) for some constant 0 < c 1 < 1.
This condition stipulates that α k should give a sufficient decrease in the objective function. Since the sufficient decrease condition is not enough to rule out unacceptably short step lengths, we normally can use the curvature condition, defined by Equation (3) for 0 < c 1 < c 2 < 1. Both the sufficient decrease condition and curvature condition are collectively known as the Wolfe conditions or the weak-Wolfe conditions [31]. In addition, we can modify the curvature condition to find a broad neighborhood of a local minimizer or stationary point of φ(·), as described in Equation (4). The conditions satisfying both Equations (2) and (4) are called the strong-Wolfe conditions.
There are various algorithms for line search. Armijo line search (Armijo) [32], bisection method for weak-Wolfe conditions (BWW), and Móre-Thuente line search (MT) [33] find a step length that satisfies the sufficient decrease, weak-Wolfe, and strong-Wolfe conditions, respectively. All three methods use iterative steps to find the appropriate step length. BWW and MT need to recalculate both the objective function f and its gradient g for every iteration. However, Armijo requires only a new calculation of the objective function f .

GPU and CUDA
Unlike CPUs optimized for sequential performance by using sophisticated control logic and large cache memories, GPUs consist of a massive number of threads with small cache memories to achieve high throughput [34]. CUDA is a general-purpose parallel computing platform and programming model for the NVIDIA GPUs. CUDA is designed to develop scalable parallel applications while maintaining a low learning curve for developers [35]. The architecture also provides flexibility in the assignment of local resources to enhance performance, but this flexibility induces developers to perform hand-crafted optimization [36]. Figure 1 shows the general compilation and execution flow of CUDA C/C++ program [34]. As illustrated in Figure 1a, a single source program is first divided into the host, i.e., CPU, and device, i.e., GPU, code. The host code is standard C/C++ code, and the device code is written in data-parallel functions called kernels, specified by CUDA keywords. A general C/C++ compiler compiles the host code, and the device code is first converted into parallel thread execution (PTX) intermediate code that is the CUDA's instruction set architecture and then compiled to binary [37]. Figure 1b shows the execution and memory transfer of the CUDA program at runtime. The program starts with the host, and when the host launches the kernel, it is executed by thousands of threads on a device.
Thread scheduling is essential for enhancing parallel performance. CUDA runtime system organizes threads in a two-level hierarchy. Threads are grouped into thread blocks, and thread blocks form a grid. When a kernel function is launched, each thread block is assigned to a streaming multiprocessor (SM) during its execution, and each block is divided into warps of 32 threads. The warp is a thread scheduling unit in SMs and executes in single instruction, multiple data (SIMD) fashion. SMs perform zero-overhead thread scheduling that can interleave warps or select ready warps with no additional cost or time. It can also hide the latency of global memory access or long-latency instructions [38].  Memory bandwidth optimization is also significant to boost the execution efficiency of CUDA kernels because the data to be processed by GPU should be transferred from the host memory to the device's global memory, as shown in Figure 1b. Since the global memory is implemented with dynamic random access memory (DRAM), applications can saturate the memory bandwidth easily. CUDA provides several programmable on-chip memories to reduce the demand for the global memory bandwidth of the applications. Table 1 summarizes the types of CUDA memory. Shared memory is on-chip memory with faster access speed than global and local memory, and variables in shared memory space are shared among threads in the same thread block. Therefore, we can eliminate the global memory access bottleneck by using shared memory as a software-managed cache. For read-only data referenced by many threads simultaneously, constant memory can help reduce the memory access latency via automatic caching. GPUs are being actively used in remote sensing. Representatively, several efforts for the incorporation of GPU to hyperspectral image processing have been directed. Since the computational cost of hyperspectral image processing is expensive due to the high dimensionality of the image, GPUs can be a powerful solution for the processing, such as unmixing or dimensionality reduction [39,40]. In seismic studies, there was an implementation of double-difference seismic tomography with GPU to improve performance [41]. Venetis et al. [42] showed that GPU could significantly reduce the execution time of the grid search algorithm used in earthquake source parameter estimation [42].

Our Proposed Approach
This paper aims to accelerate earthquake source parameter estimation based on the Okada dislocation model by improving the most time-consuming parts in two folds, as highlighted in Figure 2.
First, we optimized the parallel implementation of the Okada model to enhance the model performance. We chose the CUDA optimization techniques regarding the computational characteristics of the Okada model and set 32 combinations from different options of the techniques. Each configuration was evaluated in terms of efficiency and occupancy to derive the effects of the optimization techniques on the computation time, and we selected a configuration with the shortest average computation time as the best CUDA optimization configuration. Second, in the earthquake source parameter estimation, we assessed the performance and correctness of line search algorithms, Armijo, MT, and BWW, for the L-BFGS-B optimization. We employed the misfit, the computation time, and the distribution of estimated source parameters for the performance assessment. In addition, the parallel implementation of the L-BFGS-B (cuLBFGSB) [43] was also evaluated. Finally, we derived the best-fit source parameters and visualized the modeled result to verify our approach.

Dislocation Model and Dataset
For the forward modeling of the earthquake, we chose the Okada model that assumes a finite rectangular source and isotropic homogeneous half-space. Table 2 describes the physical source parameters that define a rectangular source, and Figure 3 shows the fault geometry of the Okada model, respectively. We assumed a Poisson's ratio ν of 0.   Since the source parameter estimation procedure needs a 3D surface displacement dataset, InSAR techniques for retrieving 3D displacement have been introduced [44]. Many studies used the differential InSAR technique that combines line-of-sight (LOS) SAR images from ascending and descending paths [15][16][17]. However, since this method suffers from low accuracy of north component displacement, stacking InSAR and multiple aperture interferometry (MAI) method [45] was proposed to solve this problem.
In this paper, we use the 2017 Pohang earthquake dataset as the three-dimensional geodetic displacement d for line search algorithm comparison. For accurate measurement, the stacking InSAR and MAI method [45] was used. Four SAR images were acquired from CSK and ALOS2. The data size, the pixel size, and the total observation area are 628 × 518, 30 m 2 , and 18.84 × 15.54 km, respectively. In the rest of this paper, we use G, x, and d to denote the dislocation model, the input source parameters, and the measured surface displacement, respectively.

GPU Kernel Optimization
Since the Okada model has no dependence between point calculations, it is straightforward to implement the model in a CUDA kernel function. We first implemented a sequential C++ Okada model based on the MATLAB open-source [46] and then wrote the CUDA kernel function computing the Okada model. Figure 4 shows the difference between CPU and CUDA implementation. While CPU code sequentially calculates the displacement of the 2D mesh using a nested loop, in parallel implementation, a host code launches the kernel with a specific keyword "__global__", and thousands of threads simultaneously execute the kernel function.
In the parallel implementation of the Okada model, each thread performs a massive number of arithmetic operations, but there are no shared variables in thread block scope and no thread synchronizations. Therefore, maximizing instruction efficiency can be the prime optimization goal of the Okada model. However, since efficient instructions generally need additional resources, there is a trade-off between the instruction efficiency and the number of active warps that determine thread-level parallelism. We set our optimization goal to minimize the computation time by striking a balance between instruction efficiency and thread-level parallelism. For this purpose, we used two performance metrics: efficiency [47] for instruction efficiency and occupancy [48] for thread-level parallelism. We use the NVIDIA Nsight compute profiling tool to calculate the performance metrics.

CPU Implementation
CUDA Implementation  Efficiency indicates the instruction efficiency of the configuration. It is calculated as the reciprocal of the total number of PTX instructions of the kernel [47], described as Equation (5). Thus, high efficiency means that the kernel can perform the same task with fewer instructions. Efficiency = 1 The number of instructions (5) Occupancy represents the warp scheduling performance, calculated as Equation (6). Maximizing the number of active warps in SM facilitates hiding the latency caused by the global memory access, branch divergence, or instruction stall. Since a GPU has limited resources such as shared memory, registers, or thread blocks per SMs, inefficient resource usage may decrease the occupancy.

Occupancy =
Active warps per SM Maximum warps per SM (6) This paper attempt to obtain the best CUDA optimization configuration by employing the following three techniques.
Thread block size.
There are other CUDA optimization techniques, such as tilling or loop unrolling. However, we did not employ the tiling technique using shared memory because all operations are performed independently in each thread, and there are no variables referenced at the thread block scope. Furthermore, since our implementation has no inner loop, we do not use the loop unrolling technique.
CSE removes repeated calculations such as arithmetic or load operation by replacing repeated calculations with the precomputed value from the register, as shown in Figure 5 [38]. Since the register reference cost is less than the arithmetic operation cost, CSE can increase the instruction efficiency. However, CSE may degrade warp scheduling performance since CSE tends to use additional registers. We set four options according to the degree of CSE: low (L), medium-low (ML), medium-high (MH), and high (H). As the degree of CSE increases, more repeated expressions are removed by using more registers. Figure 6 shows a part of the source code of the L and H options. In the source code, the L option compute repeated expressions with fewer variable usage. However, the H option precomputes the repeated expressions just once, but it needs 15 more registers than the L option. Table 3 shows the total number of register variables substituting repeated expressions for each degree of CSE.
All possible combinations from the options of the three optimization techniques are 4 × 2 × 4 = 32. We conducted experiments to evaluate the performance of our optimization approach. First, we compared 32 combinations using efficiency and occupancy to estimate the effects of options on the instruction efficiency and thread-level parallelism. Then, based on the computation time, we evaluated the significance of the instruction efficiency and thread-level parallelism on the parallel implementation of the Okada model and select the fastest one as the best CUDA optimization configuration.

Line Search Algorithm for the L-BFGS-B
After selecting the best CUDA optimization configuration described in Section 3.2, we integrate our CUDA implementation of the Okada model into the Monte Carlo method estimating the earthquake source parameters with the L-BFGS-B optimization. Then, we analyze the computational cost of subroutines in the L-BFGS-B to find the most timeconsuming part and alleviate the bottleneck.
The L-BFGS-B proceeds in the sequence of generalized Cauchy point computation (algorithm CP), subspace minimization, line search, objective function and gradient up-date, and update of a new limited-memory Hessian approximation, as described in Algorithm 1 [30]. We define the objective function f as the misfit of geodetic measurement d and model displacement G(x) and use the central finite difference equation for gradient approximation, as formulated by Equations (7) and (8) Update point: Compute objective function f k+1 = 1 2 G(x k+1 − d) 2 2 and gradient g k+1 9: Update limited memory BFGS matrix H k+1 10: As summarized in Table 4, we analyzed the computational cost of L-BFGS-B subroutines [30,50], where m is the number of BFGS corrections stored, n is the number of variables to be optimized, t is the number of free variables, n int is the number of segment explorations of the algorithm CP, liter is the number of iterations of the line search algorithm, w is the mesh width of the Okada model, and h is the mesh height of the model, respectively. We optimized nine parameters, all input parameters of the Okada model except Open, in this paper and usually used small values of m (e.g., 3 ≤ m ≤ 20) [50]. However, the mesh size of the Okada model varies from hundreds to thousands of squares depending on the scale of geodetic measurement, and thus the subroutine that includes the most dislocation model calculations takes the highest computational burden. Therefore, the line search algorithm can be a bottleneck of the L-BFGS-B because the Okada model is repeatedly calculated. Table 4. Computational cost of the L-BFGS-B subroutines.

Subroutine Computational Cost
Algorithm CP (2m + 2)n + O(m 2 ) · n int Subspace minimization (direct primal method) To alleviate this bottleneck, we compared four different candidates of the line search algorithm: Armijo, MT, BWW, and cuLBFGSB. We performed the five thousand iterations of the Monte Carlo method for each candidate in the real earthquake data mentioned in Section 3.1 and evaluated the performance and correctness of the results. As for performance metrics, we used the average computation time of Monte Carlo iterations, RMSE, and parameter distribution for estimation speed, misfit, and uncertainties of parameters, respectively. Armijo, MT, and BWW were implemented with CPU-based L-BFGS-B, and cuLBFGSB was fully implemented with CUDA GPU and used simplified algorithm CP and line search algorithm since they have strong sequential dependence [43]. We used the open-source of Fei et al. [43] for MT and cuLBFGSB and implemented Armijo and BWW based on Burke [51]'s research [51].

Experimental Result and Discussion
This section investigates the performance effects of different combinations of the CUDA optimization techniques on the parallel implementation of the Okada model. Then, we explore the runtime behavior of the line search algorithms for the L-BFGS-B to estimate the best-fit source parameters. Finally, the correctness of our source parameter estimation result is verified by the RMSE and the residual displacement. Table 5 shows the specification of the experiment environment. We used the NVIDIA GeForce RTX 2080 SUPER GPU with 48 SMs, 64 CUDA cores per SM, and 8GB GDDR6 memory.

Kernel Optimization and Evaluation
For the first-order analysis of the optimization techniques, we assessed efficiency and occupancy of 32 different combinations presented in Section 3.2. For this purpose, we wrote CUDA kernel functions that calculate the objective function 1 2 G(x) − d 2 2 for the 500 × 500 mesh size. The results acquired from this experiment are shown in Figure 7. In Figure 7, both metrics were scaled with a minimum value of 0 and a maximum value of 1. Figure 7a shows that CSE is the most dominant factor for efficiency. From the experimental results, we observe that efficiency tends to be proportional to the degree of CSE. We also notice that the thread block size and constant caching options have little effect on efficiency since the thread block size and constant caching affect only computing resource allocation and utilization of the GPU. By contrast, a higher degree of CSE decreases occupancy in most cases, as depicted in Figure 7b. In the "without CC" option, there is little difference in occupancy between the L and ML options and the MH and H options. This is because the difference in the number of precomputed subexpressions between the L and ML options and the MH and H options is relatively small, as shown in Table 3. However, in the "with CC" option, the H option shows higher occupancy than the MH option.
To analyze this cause, we compared the reduced number of reduced register usage by constant caching for each CSE option. Constant caching decreases register usage per thread in all CSE options. For example, in the L, ML, and H options, the number of register usage per thread reduced by constant caching becomes 12, 8, and 6, respectively. However, in the MH option, only two registers per thread are saved by constant caching, and thus constant caching has only a trivial effect on occupancy when the MH option is used. Since the maximum number of resident threads per SM is limited, the number of thread blocks in an SM decreases when a thread block size increases. For this reason, the larger the thread block size, the greater the loss of active warps per reduced thread block by resource limitation in SM. Therefore, we can conclude that a small thread block size leads to high occupancy and better thread-level parallelism.
Then, we compared the computation time of each configuration to select the best CUDA optimization configuration with the shortest average computation time. To this end, we measured the computation time of 1000 test runs for the 500 × 500 mesh size. As shown in Figure 8, the most remarkable enhancement appears when the degree of CSE changes from the ML to the MH, and the H option is the best configuration of CSE regardless of the other options of the two optimization techniques. As for the thread block size option, the computation time increases as the thread block size is enlarged. Furthermore, thus, the thread block size of 64 shows the best performance, up to a 5.3% reduction in the computation time than other thread block size options. Constant caching also slightly reduces the computation time by up to 3.7%. Finally, we selected [CSE, constant caching, thread block size]=[H, "with CC", 64] as the best CUDA optimization configuration for our application, called CUDA_optimized. We also set a baseline configuration for the performance comparison, called CUDA_baseline, with the configuration [L, "without CC", 64]. The CUDA_optimized shows a 2.38 ms computation time, which is 2.93 times faster than the CUDA_baseline. This result indicates that the instruction efficiency has more effect on computation time than thread-level parallelism because a configuration with higher efficiency shows shorter computation time regardless of occupancy.
After the CUDA optimization, we also compared the computation time of our CUDA_optimized implementation with sequential CPU, multi-threaded CPU using OpenMP, and the CUDA_baseline for various mesh sizes ranging from 100 × 100 to 1000 × 1000. We generated 2D meshes by calculating grid coordinates from the reference point and the pixel size mentioned in Section 3.1. Figure 9 shows that the CUDA_optimized has the shortest computation time, followed by the CUDA_baseline, 16/8/4/2 threads CPU, and sequential CPU implementations. We also observe that utilizing more threads improves the performance in multi-threaded CPU implementations. However, the performance gap between 8 and 16 threads is relatively small due to the limitation of CPU cores. Furthermore, in our experiments, using 32 and 64 threads showed almost the same performance as the 16 threads implementation. Table 6 summarizes the speedup ratios achieved by the CUDA_optimized over CPUbased and the CUDA_baseline implementations. According to Table 6, the CUDA_optimized greatly improves the computation time of target objective functions compared with other implementations. For instance, it achieves up to 134.94×, 14.01×, and 2.99× speedups over sequential CPU, 16 threads CPU, and the CUDA_baseline for the 1000 × 1000 mesh size. Significantly, the performance gap between the CUDA_baseline and the CUDA_optimized is noteworthy. From the result, we can state that it is crucial to use the application-specific, hand-crafted optimization of CUDA kernel code to improve the performance, not merely write parallel code. We also compared the results with previous work by Venetis et al. [42]. Although they used a different source parameter estimation algorithm from our study, they also used the Okada model and evaluated the GPU acceleration effect with the OpenMP implementation. The GPU implementation by Venetis et al. [42] achieves about 32× and 145× speedups over OpenMP-8 threads and sequential CPU implementations, respectively. To summarize the comparison result of performance evaluation, speedups over sequential CPU are similar. However, the speedup achieved over OpenMP-8 threads in [42] is about twice compared with our result in Table 6. We consider that this is because our study used the Intel ® Core TM i9-10900K @ 3.70GHz processor with 10 CPU cores and 20 threads that is roughly two times faster than the platform used in [42], i.e., Intel ® Core TM i7-3770K @ 3.50GHz with 4 CPU cores and 8 threads. On the other hand, for the mesh sizes from 100 × 100 to 400 × 400, the speedup ratio increases considerably, but in the mesh sizes from 500 × 500 to 1000 × 1000, the speedup ratio no longer increases noticeably. As shown in Figure 10, we can also observe that the Giga floating-point operations per second (GFLOPS) of the CUDA_optimized code for different mesh sizes changes in agreement with the results in Table 6. Therefore, it is considered that the GFLOPS of the target application represents the throughput achievable on the GPU, which limits the degree of performance improvement. In summary, the CSE technique that enhances the instruction efficiency is the most effective optimization technique for parallel implementation of the Okada model. Our CUDA_optimized implementation achieves more than an order of magnitude speedup compared to traditional multi-threaded CPU-based implementations. Moreover, our optimization approach shows significant performance improvement, almost three times more speedup than before CUDA optimization.

Line Search Algorithm Comparison and Correctness Verification
We applied our parallel implementation to the earthquake source parameter estimation procedure using the L-BFGS-B. In doing so, we evaluated four different solutions introduced in Section 3.3, which are Armijo, MT, BWW, and cuLBFGSB, to derive the best-fit source parameters. Five thousand iterations of the Monte Carlo method were performed for each candidate. We defined the lower and upper bound of source parameters based on the study of Lee [52], as summarized in Table 7. Starting points of the Monte Carlo iterations were randomly sampled from a uniform distribution in the boundaries, and termination conditions of the L-BFGS-B were set to maxIter = 1000, f tol = 10 −3 , m = 8, and gtol = 10 −3 .  We used the RMSE, computation time, and 95% confidence interval obtained from the Monte Carlo iterations as the indicators to evaluate and compare the performance and correctness of candidates. Table 8 lists the statistics of the RMSE and the computation time measured for the candidate implementations. As shown in Table 8, Armijo and cuLBFGSB outperform MT and BWW in the mean and the standard deviation values of RMSE. Although Armijo, MT, and cuLBFGSB show a similar mean computation time, Armijo has the lowest mean and standard deviation values of the computation time. In addition, the summary of the best-fit parameters minimizing the RMSE and their 95% confidence interval is given in Table 9. The best-fit parameters derived from candidates are almost identical, but Armijo has a much smaller confidence interval than the other candidates, with a 42∼84% interval size reduction. This result remarks that the best-fit parameters determined by Armijo are the most reasonable.  To estimate the uncertainties of parameters in detail, we illustrated the histogram of each parameter in Figure 11. All parameters from Armijo feature a unimodal histogram with a small standard deviation. However, most source parameters estimated from MT show multiple peaks, and the Dip derived from BWW peaks around both the lower and upper bound. Furthermore, cuLBFGSB shows that most results of the Length and the Width converge on the lower bound, and the Slip does not converge on the unique solution either with a high standard deviation. It seems that the highly biased results of cuLBFGSB were caused by its simplified line search algorithm. This result confirms again that Armijo has the best agreement with the earthquake source parameter estimation. Therefore, as the final step, we verified the correctness of our source parameter estimation framework with the best-fit parameters derived by Armijo. Figure 12 compares the geodetic displacement and the modeled displacement acquired from the final source parameters. Each row of the figure represents the surface displacement for the E-W direction, the N-S direction, and the depth direction. The third column also shows the residual displacement between the geodetic and the modeled displacements. The residual displacement demonstrates that our proposed framework approximates the geodetic displacement quite precisely, where the RMSE for each direction is 0.49, 0.31, and 0.33 cm for the E-W, the N-S, and the depth direction, respectively.

Conclusions
This paper presented a performance acceleration framework for earthquake source parameter estimation based on the Okada dislocation model using CUDA GPU. To this end, we first carefully set the optimization technique candidates considering the characteristics of the target application and analyzed the effects of various combinations from the suggested optimization options in terms of efficiency and occupancy. Then, based on the analysis, we performed the CUDA kernel optimization of the Okada model and evaluated its performance improvement compared to multi-threaded CPU-based and baseline GPU implementations. We also explored the performance and correctness of different line search algorithms for the L-BFGS-B optimization, which is one of the most time-consuming parts of our target problem. Finally, using the assessment result, we selected the Armijo line search as the most efficient one.
The performance evaluation results for CUDA kernel optimization showed that the CUDA_optimized implementation achieved up to 2.99× and 14.00× speedups over the CUDA_baseline and 16 threads CPU implementations, respectively. We also observed that Armijo shows a 42∼84% reduction in confidence interval size over other candidates, and all parameters derived from Armijo feature a unimodal histogram with a small standard deviation. The best-fit parameters from Armijo had the best consistency between geodetic and modeled displacement, at most 0.50cm RMSE for each direction. Consequently, the results demonstrated that our proposed approach successfully accelerates earthquake source parameter estimation procedure with correctness verification of the best-fit parameters.  Data Availability Statement: The surface deformation maps were provided by Jung and Baek. They have generated the maps from the ALOS PALSAR-2 interferometric data that were provided through the JAXA's ALOS-2 research program (RA4, PI No. 1412).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

InSAR
Interferometric synthetic aperture radar 3D Three-dimensional 2D Two-dimensional GPU Graphics processing unit CUDA Compute unified device architecture