A Novel Method of Seismic Signal Detection Using Waveform Features

: The detection of seismic signals is vital in seismic data processing and analysis. Many algorithms have been proposed to resolve this issue, such as the ratio of short-term and long-term power averages (STA/LTA), F detector, Generalize F, and etc. However, the detection performance will be affected by the noise signals severely. In this paper, we propose a novel seismic signal detection method based on the historical waveform features to improve the seismic signals detection performance and reduce the affection from the noise signals. We use the historical events location information in a speciﬁc area and waveform features information to build the joint probability model. For the new signal from this area, we can determine whether it is the seismic signal according to the value of the joint probability. The waveform features used to construct the model include the average spectral energy on a speciﬁc frequency band, the energy of the component obtained by decomposing the signal through empirical mode decomposition (EMD), and the peak and the ratio of STA/LTA trace. We use the Gaussian process (GP) to build each feature model and ﬁnally get a multi-features joint probability model. The historical events location information is used as the kernel of the GP, and the historical waveform features are used to train the hyperparameters of GP. The beamforming data of the seismic array KSRS of International Monitoring System are used to train and test the model. The testing results show the effectiveness of the proposed method.


Introduction
Seismic arrays have been widely used in the area of seismic monitoring and nuclear test monitoring. Seismic arrays are composed of multiple sensors located in a certain range, and the signal-to-noise ratio of far-field signals can be effectively improved by means of delay and sum beamforming. The detection capability of seismic signals can be improved [1,2]. Compared with three-component stations, seismic arrays can provide more accurate estimates of signal azimuth and slowness through techniques such as FK analysis [3,4]. Scholars have proposed various data processing and analysis methods for seismic arrays, which are powerful tools to improve detection capabilities and reduce manual workload. Impulsive noise signals usually come from the instrument, human-made, and nature. Noise signals are often mixed in seismic signals, which causes false detection of events and inaccurate parameter calculations. How to effectively detect seismic signals and accurately identify seismic phases from noise signals is one of the challenges we have to face. on original signals. By analyzing the signal, we choose some helpful features in the seismic signal detection, such as the peak and ratio of the STA/LTA trace, the average value of energy in a specific frequency band, and the energy value after empirical mode decomposition (EMD), which are independent of each other and introduced in detail later. Thirdly, the single-feature model is constructed by location and waveform information. Each feature obeys the Gaussian distribution generated from GP. Lastly, we calculate the posterior probability of the joint features to predict the possibility that new signal is a seismic signal. During the process, the optimal hyperparameters of GP are obtained by maximizing the marginal likelihood function L(θ) [32].
In summary, the innovations and contributions of this work have several points: • Focusing on a specific region, waveform features that are suitable for seismic signal detection are chosen and extracted, such as the envelop features of STA/LTA trace and frequency-domain energy features of the signal.

•
The Gaussian process with event location information in a specific region as a kernel function is used to build the signal features model. The seismic detection is achieved using joint features probability model.
In the following sections, first of all, a brief introduction of GP is introduced. Next, waveform features used in this paper are described in detail, and a joint multi-features probability model is constructed. Then the implementation of seismic signal detection method proposed in this paper is elaborated, including data preparation, feature extraction, model training, and etc. Finally, we analyze and discuss the results.

Gaussian Process
In general, the Gaussian process (GP) regression is a powerful tool, which generates distribution over functions applied to non-parametric regression [36]. In Gaussian process regression, we assume the output y of a function f at input x can be written as In Equation (1), the observation signal y is composed of the signal term f (x) and the noise term , with ∼(0, σ 2 n ) being independent and identically distributed. The distribution of function f (x) comes from a Gaussian process, which is defined by a mean and a covariance function.
where the mean function m(x) reflects the expected function value at input x. The covariance function k(x, x ) shows the correlation between input x and test point x . The function k is commonly called the kernel of the Gaussian process [37]. Given an observation data set D = (X, y) of n elements, where (X, y) is a pair of input-output data, the output f * is predicted by the conditional distribution p( f |D) at the test input x * . According to the definition, the joint multivariate distribution of previous observations y and function values f * can be expressed as y f * ∼ N K(X, X) + σ 2 n I K(X, X * ) where K(X, X), K(X, X * ), K(X * , X) and K(X * , X * ) are n × n covariance matrices. These are similarly evaluated from n training points and n test points. When there is only one test point, we can calculate the covariance vector between a test point and n training points in same way, indicated as K(X, X), K(X, x * ), K(x * , X) and K(x * , x * ). We can derive the conditional distribution function p( f |D), which is the crucial predictive equation for Gaussian process regression, where the mean function isf * = K(X * , X)[K(X, X) + σ 2 n I] −1 y and the covariance function is For the single test point, we can rewrite the mean function in the following way, where X is the observed data set, x i is the element in X, w i is the weight computed by the Equation (10). So we can make a new prediction of new point x * by weighted sum. The output y * is associated with the data set X. The kernel shows the similarity between the prediction point x * and input x i of the observation set X.
The covariance function is also called the kernel. The choice of an appropriate kernel is based on assumptions such as smoothness and likely patterns to be expected in the data [36]. The marten kernel is chosen for the calculation of the Gaussian process, which is usually used in geography.
in the Equation (11), p is an integer, and p = v + 1 2 . With the p = 1, the above formula can be reduced to where r is the distance between the two points on the surface of the earth and related to the geographic location of longitude and latitude. This paper uses the Geographical spherical distance to calculate the earth's surface distance. The correlation of two points depends on the distance between x and x . Therefore, the similarity between the near points is higher than the far points.

Optimizing Hyper-Parameters
The hyperparameters in the Equation (11) are the length-scale l, the signal variance σ 2 f , and the noise variance σ 2 n , which are unknown and need to be inferred from the observations. Obtaining the hyper-parameters is maximizing the marginal (log) likelihood through a optimization tool of gradient ascent. Using the training set D = {(X i , y i )|i = 1, 2, ..., n} and the hyper-parameters vector θ (θ = (l, σ 2 f , σ 2 n )), the log marginal likelihood is with K y = K(X, X) + σ 2 n I. The equation is composed of three terms, denoting a fit penalizing measure. The term − 1 2 y T K −1 y y shows degree of data-fit. The term − 1 2 log|K y | is related to the kernel and observation inputs, displaying the complexity penalty. The term − n 2 log2π is a normalization constant. The partial derivative of θ plays a vital role in seeking the optional value of gradient ascent, which is obtained by the Equation (14),

Waveform Features
Based on the statistics and observations about a series of seismic events, we analyze waveform features in the time domain and energy features in the frequency domain. We can identify some useful features to distinguish seismic signals and noise. These features include the amplitude α and the ratio ρ of the STA/LTA trace of the seismic waveform, signal energy mean γ in the same length time window near the arrival of the signal and the power λ of the intrinsic mode function 1 (IMF1) after the waveform decomposed by EMD.
In Figure 1, we can see the seismic signal amplitude is higher than noise. Statistically, the mean value of the Gaussian distribution of seismic signals is higher than the mean of noise signals, as shown in Table 1. So the probability of noise is usually smaller compared to the seismic signal obtained by the predictive distribution from the Gaussian process. Additionally, as shown in Figure 1, it can be seen that the noise has many visible fluctuations. The ratio value is the largest peak divided by the average of other spikes. The ratio of the signal is higher than the noise. In Figure 2, we can notice the difference between the spectrogram of the seismic signal and noise. The waveform energy average of the seismic signal is higher than the noise for 1-2 Hz, so the band energy average can be utilized as another feature to distinguish signals and noise. In addition, EMD can decompose the non-stationary non-linear signal into a series of various frequency sub-components, namely the IMF, reflecting many inherent information of the signal. In Figure 3, we can note that the signal is decomposed into a group of IMF components, arranged in descending order of frequency. After statistics, it was found that the main component containing most seismic signal energy was the IMF1 part. Then, we used the energy of IMF1 to distinguish seismic signals and noise.

Establish a Signal Model
After extracting the characteristics of historic seismic events, the next question is how to use these features to build a model. For each signal, the corresponding feature elements are available from the time domain and the frequency domain. Using the Gaussian process, we can obtain every feature distribution. We assume that all waveform features are independent. Then the joint probability model is as follows, In the Equation (16), p(s i |E) is the occurring probability value of a seismic signal in the specific area. We use p(•|E) to denote the probability value for every feature. We adopt log(•) to indicate logarithmic normalization of data.
In the following, take the amplitude feature as an example to introduce the modeling process, and similarly for other features. For a specific phase of a particular array, by observing a large number of historical event waveforms, the observation can be expressed as the sum of an unknown function about the event origin f (k) j (e) and a noise term in Equation (17). Through the GP, the model of the amplitude characteristics is established as Equation (18), According to the Equation (18), the occurrence probability of the amplitude about a new observation signal s i can be predicted.
in the Equation (19),f * is mean and Σ f * is covariance of the test point computed separately by Equations (7) and (8).

Data Preparation
We choose a specific area of Japan and adjacent seas where earthquakes occur more frequently and can provide a large number of historical events for study, guaranteeing the data used for the research are sufficient. Then, three years (from May 2011 to May 2014) of data from the KSRS station about 266 events are selected as the data set, and it is randomly divided into a training set and a test set. Furthermore, the same amount of impulsive noise data are chosen as part of the test set.
In this section, we select a specific region for studying how to detect the teleseismic P phase accurately. As depicted in Figure 4, the location and size of the region is longitude (29 • N∼44 • N) and latitude (131 • E∼146 • E). The reason for picking a specific region is the covariance function in the Gaussian process is related to the correlation of the signals, and the mean function is the linear combination of kernel functions about input points. If the selected region is too wide, the correlation of the signals will be weak. In supervised learning, the concept of similarity between data points is crucial. This is an underlying assumption that points close to the input x are likely to have similar target values, so training points close to the test points can provide information for prediction, see Equations (4)- (14).
To further explain the relationship between distance and correlation, we select three locations among the corner area of (33 • N∼38 • N) and (135 • E∼140 • E), which are (136 • E, 34 • N), (135.4 • E, 37 • N), and (139.6 • E, 36 • N). As shown in Figure 5, except for the blue line, the other three lines indicate three locations where the earthquake occur possibly. The vertical axis represents the locations of historical events. The horizontal axis shows the location correlation between the new predictive signal and observed seismic events. The larger the corresponding value on the vertical axis, the greater the possibility of a new seismic signal occur in this area. The blue line represents a location where an event has never occurred at (137 • E, 35 • N). As can be seen from the image, the covariance of the blue line is zero in all locations; that is, the occurrence of a new seismic signal in the point cannot be predicted.

Extracting Features
There are two main types of features, one is from the energy characteristics of seismic signals, and the other comes from the STA/LTA waveform features. We use different methods to extract various features. Firstly, we find features from the STA/LTA trace. In Figure 1, the STA/LTA amplitude of the signal is higher than the noise signal. The maximum amplitude of STA/LTA can be obtained. Then, the ratio between seismic signals and noise is also discriminate. Since the STA/LTA detector is sensitive to background noise, the spikes in the STA/LTA of noise change more drastically and richly, compared to seismic waves. The ratio of ρ is defined by   . It can be seen from the graph that covariance is related to whether an earthquake has occurred and its location.
where α is the max amplitude, n is the number of spikes other than the maximum α, β is the average value of other peaks, and N is the quantity of STA/LTA short time windows. The spectrum energy characteristics obtained by Fast Fourier transform shows the property in the frequency domain. In Figure 2, the seismic energy mean is higher than noise in 1-2 Hz, which can distinguish seismic signals from noise. In time-frequency analysis, the length of the processing signal is identical. Due to the range of energy values being large, we need to perform log normalization for data.
EMD is satisfactory for the study of non-linear and non-stationary signals, and it is based on the characteristics of the signal itself. The decomposed signal components are arranged from high to low frequency, determined by the principle of decomposition. Figure 3 is the EMD picture. Each element is arranged in descending order of frequency. Statistically speaking, most of the signal energy is spread in the first few components, whether it is seismic signal or noise. For the seismic signal, the most significant portion of the signal is located in IMF1, but the largest part of the noise is found in IMF2 or IMF3.
The energy formula for calculating the IMF component is, Furthermore, we can obtain the energy proportion of different IMF components, Most of the useful information about the original signal is concentrated in IMF1-IMF5. Therefore, the total energy of the signal is approximate to the former five components sum, using this to estimate the energy proportion of every IMF element, as shown in Equation (23). The statistical results of the above features are shown in Figures 6-9. In [38], researchers proposed a non-parametric test based on the Kolmogorov-Smirnov (KS) test, referred to as the KS Predictive Accuracy (KSPA) test, which can evaluate whether there exists a significant difference between two forecast models. In this paper, we use the common KS test to test whether the data set of features conforms to the theoretical distribution. We perform statistics on the data and observe the distribution characteristics of the data through the preliminary judgment of the histogram, and obtain the mean and variance of the data, and then use the KS test to determine whether each feature data set meets the obtained normal distribution. At this time, when the obtained p-value is higher than 0.05, it is considered to conform to the normal distribution. The test results for each feature are shown in Table 2. We can see that the features in this article follow the Gaussian distribution. In this paper, we use GP to construct the distribution of each feature. After processing of each feature, we can create a joint probability model containing multiple characteristics to distinguish seismic signals from noise.

Training
In order to get the optimal hyperparameter θ = (σ 2 f , σ 2 n , l), we find the maximum value of the marginal likelihood function by the gradient ascent using training data set D = (X, y). The gradient ascent is similar to the gradient descent excepting the subtraction replaced by the addition during the gradient update process. There are various methods of gradient descent, such as random gradient descent, mini-batch gradient descent, momentum method gradient descent, and so on. We use the Adam gradient descent method in the momentum method. The advantage of this method is that the learning rate is automatically adjusted during the training process, and parameter updating is relatively stable. It is easy to implement and computationally efficient. Moreover, the hyperparameters gained in this way are usually reliable.
The training process is shown in Figure 10. In the beginning, the hyperparameters are randomly initialized. Each time the gradient rises, the partial derivative matrix needs to be calculated according to Equation (15), and the three hyperparameters are simultaneously updated according to Equation (14). In the training process, a threshold is necessary to terminate the procedure. When the loss function is less than the set threshold, the marginal likelihood value is considered to be the maximum, and the three hyperparameters are output while the training is terminated. Otherwise, the training is continued until the marginal likelihood value is maximized.When the kernel hyperparameters σ 2 f , σ 2 n , l are optimized by training, the posterior distribution at the test point x * is The matrix X of shape n × n describes the location inputs corresponding to history events E. Then, we calculated the posterior mean µ y and covariance Σ y . By the posterior distribution, we can get the probability of the test point x * .

Modeling
As shown in Table 3, we obtain the hyperparameters and the marginal likelihood value of each feature, using gradient ascent. During the training process, we can obtain appropriate hyperparameters with a strategy that assigns the initial hyperparameters more than once. Among the hyperparameters, l represents the length scale. When l is small, the swing of distribution will increase, and the performance will be reduced. Therefore, it is necessary to control the corresponding termination threshold of the training process. The magnitude of the marginal likelihood is related to the number of training sets and the size of the feature scale. Generally, the amount of training data is more substantial, while the log-marginal likelihood is higher. When the amount of data increases, the complexity of the model increments. However, the complexity of the model decreases as the length scale increases. The training results from Table 3 also illustrate the above analysis.
From the parameters in Table 3, the mean and covariance of each feature distribution can be achieved, as shown in Table 4. Table 1 lists the statistical distributions of several features. Every row shows the statistical feature distribution of the signal training set, signal test set, and noise test set. We can see from Table 1 that it is consistent between the distribution of seismic train set and seismic test set, while there are vast differences in the distribution of signals and noise. Figures 6-9 shows two curves, one is the statistical distribution, and the other denotes Gaussian distribution. The shape of the statistical distribution depends only on the number and size of the data training set. However, the Gaussian process uses the prior information of the location and waveform features to construct the distribution. The seismic signals for testing are closer to events in the training set, and the feature distributions of test signals are consistent with the training distributions, so the probability of being judged as a real seismic signal is high.

Results
In the test process, it is crucial to pay attention to select the threshold value. The probability of the noise misjudged as a seismic signal is 5%, known as the false alarm rate. Usually, the threshold of STA/LTA can be obtained directly according to the false alarm rate. Differently, the threshold of the posterior distribution consists of two parts, one is the probability threshold from the fake alarm rate, as marked by the red dot in Figure 6, and the other is the mean of the amplitude feature Gaussian distribution. Due to the symmetry of Gaussian distribution, the threshold acquired by only the false alarm rate will exclude some signals higher than the average value but lower than the probability threshold, as located to the right of the black point in Figure 6. Therefore, the advantage of the double threshold is that the new signal is judged as noise only when it is below both the probability threshold and the mean threshold. Otherwise, it is considered as an earthquake signal, and similarly for other features threshold choices. For the whole multi-features joint probability model, the threshold also consists of two parts. The mean threshold is still the mean of the amplitude distribution.
After choosing the suitable threshold, the test results of different methods are obtained, shown in Table 5. From top to bottom is the test result of STA/LTA, STA/LTA with changing characteristic function (C_STA/LTA), single waveform feature (amplitude α, ratio ρ, average energy value of specific frequency band log(γ), IMF1 energy value log(λ)) and multiple features. Different characteristic functions lead to various detection in the STA/LTA method. In this paper, the absolute value function as the most commonly used characteristic function is chosen, which is CF(i) = |Y(i)|. Moreover, we select another one defined as CF(i) = y(i) 2 + K(Y(i) − Y(i − 1)) 2 , where K is a weighting factor that is related to sample rate and station noise characteristics. The test results demonstrate selecting the appropriate feature function will improve the detection effect relatively. In this article, seismic events have specific occurring location and waveform information. For amplitude feature α, we can get the distribution from the Gaussian process using historical seismic signal messages. Then we obtain the mean and variance of distribution about other features in the same ways. In other words, we obtained a posterior distribution of new signals based on prior information, which is shown in Figure 6. Furthermore, using the distribution can predict the probability of a new signal occurring. The detection rate of most single features does not have too much improvement compared to STA/LTA, which can be enhanced by combining multiple features.
We combine multiple features to form a seismic signal detector, as shown in Equation (16). The false alarm rate of 5% provides the probability threshold, and the mean of amplitude distribution obtained by GP is the other part of the threshold. In Table 5, we show the test results of various methods. We can see the STA/LTA detector generates correctly 123 signal detections and 190 noise detections, while the algorithm in this paper successfully achieves 136 signal detection and 190 noise detections. Comparing to the traditional STA/LTA and C_STA/LTA, the detection rate is increased by 3.7% and 2.3% respectively. In this paper, the results of the two methods only provide 0 or 1, the square errors between the predicted value and the true value are also 0 or 1. Then we cannot get the error statistical distribution and its cumulative distribution function (cdf) for the waveform features and STA/LTA. Therefore, the KSPA cannot be used to evaluate whether there is a significant difference between the two methods. We use the recall and precision indicators to further evaluate the performance of algorithms, as shown in Table 6. From the table, the recall and precision of our algorithm are 0.925 and 0.931, respectively. These indicators exceed the other two algorithms, and especially the recall is improved by 8.9% and 5.4% compared to the other two methods respectively. The F score of our algorithm is also the biggest one, which means the performance of our detector is better. In summary, we conclude that the multi-features joint probability model method can improve the detection performance compared to the STA/LTA. To further prove the effectiveness of this method, we selected one-year raw data of 2019 and used our method to detect these seismic signals. The process includes reading seismic array data, then performing beamforming processing, and applying our detection method to beam channel. After manual identification and confirmation, there are a total of 265 seismic signals associated with the seismic events that occurred in the specific area during this period. Our method proposed in this article detects 253 signals, and the signal detection rate is 95.5%, while STA/LTA detects 243 signals, and the detection rate is 91.7%, as shown in Table 7. We also found that the seismic signals missed by STA/LTA have lower SNR. Therefore, the application proved the effectiveness of our methods.

Discussion and Conclusions
Compared with the STA/LTA, the method proposed in this paper uses the historical seismic events information in a specific area to achieve the seismic signal detection. This method has a higher detection rate and lower false alarm rate, especially for the weak seismic signals. It can advance detection performance without performing special processing such as denoising. Meantime, compared with other template matching methods, this method proposes to use event location information and the waveform feature information as prior. It uses the joint features from STA/LTA trace and time-frequency analysis of waveform instead of the raw waveform data directly, so it has better generalization performance. However, on the other hand, the location of the events in the selected region is crucial to the kernel of GP, so the new seismic signal should be happened in the same area with the historical events, ensuring the distance of the events is not too far. If the new signals come from an event in this area, the detection performance will be greatly improved.
This paper collects earthquake events in the area around Japan within three years as the primary research object. The location information and waveform features of the seismic events are used as prior information. The Gaussian Process is applied to obtain posterior distributions for testing new signals. The test results show that the proposed method is excellent for detecting and distinguish seismic signals and noise. In this article, with the same 5% false alarm rate, the signal detection rate of traditional STA/LTA is 83.6%. However, the waveform feature model composed of the posterior distribution of each feature obtained through the Gaussian process can improve the correct detection rate to 92.5%. This method can detect some weak seismic signals, thereby improving the detection rate and reducing the false detection rate.
The significant contribution of this paper is to use historical seismic events information to detect new seismic signals. By collecting seismic events in geographical regions where earthquakes frequently occurred as a prior can identify whether the new signal is a seismic signal or not. This algorithm can be used not only to detect weak seismic signals but also to detect aftershock event signals.
Author Contributions: J.L. made main contributions to this research, containing providing the ideas of modeling and collecting data. M.H. was responsible for simulation. G.C. and W.W. provided suggestions on the feasibility and innovation of the solution. X.W. and J.W. provided supervision and support for the project. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by Application of data mining technology in analysis and processing of seismic waveform data.