An Earthquake Forecast Model Based on Multi-Station PCA Algorithm

Abstract: With the continuous development of human society, earthquakes pose an increasing danger to human production and life. Researchers continue to try to predict earthquakes, but the results are still not significant. With the development of data science, sensing, and communication technologies, there are increasing efforts to use machine learning methods to predict earthquakes. Our work proposes a method that applies big data analysis and machine learning algorithms to earthquake prediction. All data are accumulated by the Acoustic and Electromagnetic Testing All in One System (AETA). We propose the multi-station Principal Component Analysis (PCA) algorithm and extract features based on this method. Finally, we propose a weekly-scale earthquake prediction model, which achieves 60% accuracy using LightGBM (LGB).


Introduction
An earthquake is a kind of natural disaster represented by the activation of faults as a consequence of the regional stress field resulting from plate movement. When the fault zones between plates slide, huge energy is released and causes an earthquake. China is located at the junction of the Indian Ocean Plate, the Pacific Plate, and the Eurasian Plate. The junction of these plates forms a seismic belt, the Pacific Rim seismic belt, which is an area with many earthquakes. For these reasons, many earthquakes in China, such as the Xingtai earthquake in 1966, the Tonghai earthquake in 1970, the Tangshan earthquake in 1976, and the Wenchuan earthquake in 2008, caused huge losses, and earthquakes have become a major threat to human lives and property.
This situation raises the demand for earthquake prediction methods that can guide people to avoid earthquake disasters, and the efforts toward this goal have never stopped. Earthquake prediction research can be divided into short-term prediction (a few days), medium-term prediction (under one year), and long-term prediction (years). Short-term prediction is the most difficult and the most valuable for disaster avoidance.
To help explore this problem, the Peking University Key Laboratory of Integrated Microsystems (IMS) presents a multi-component earthquake prediction system named AETA [1], a monitoring system that integrates precursor-signal collection, data transmission, storage and processing, analysis, and real-time display on a website [2]. Its advantages include high sensitivity, low cost, real-time operation, and large-area monitoring. We have deployed hundreds of monitoring stations in China and accumulated 50 TB of data, growing by about 20 GB per day, and these data have shown excellent results in earthquake analysis. Both electromagnetic and acoustic signals are monitored by AETA. In this work, the PCA method is used to extract features from the original data, and a machine learning method, LightGBM (LGB), is used to predict earthquakes.

Related Work
Current earthquake prediction methods include statistical methods and precursory anomaly analysis methods. Statistical methods focus on historical earthquakes and the structure of potential sources, using statistical and geological theories to analyze the relationship between historical earthquakes and potential future earthquakes.
These works [3][4][5] use the Gutenberg-Richter law to analyze the behavior of earthquakes over a large area and a long term. The law relates the number N of earthquakes whose magnitude is at least M in a specific area and time window to the magnitude: log₁₀ N = a − bM, where a and b are constants. This method can only assess earthquake risk over the long term, so it cannot guide humans to avoid disaster.
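For illustration, the constants a and b can be estimated from an earthquake catalog by a least-squares fit of the cumulative frequency-magnitude counts. The following minimal Python sketch (ours, with synthetic data, not from the cited works) shows the procedure:

```python
# Fit log10(N) = a - b*M from a catalog of magnitudes (least squares).
import numpy as np

def gutenberg_richter_fit(magnitudes, bin_width=0.1):
    """Return (a, b) such that log10(N) = a - b*M for cumulative counts N."""
    ms = np.arange(magnitudes.min(), magnitudes.max(), bin_width)
    counts = np.array([(magnitudes >= m).sum() for m in ms])  # N(>= M)
    mask = counts > 0
    slope, intercept = np.polyfit(ms[mask], np.log10(counts[mask]), 1)
    return intercept, -slope  # a, b

# Synthetic catalog drawn from an exponential law, so b should be near 1.
rng = np.random.default_rng(0)
catalog = 3.0 + rng.exponential(scale=1 / np.log(10), size=5000)
a, b = gutenberg_richter_fit(catalog)
print(f"a = {a:.2f}, b = {b:.2f}")
```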
The precursory anomaly analysis method is based on the observation that some anomalies appear before an earthquake. Cicerone surveyed the published scientific literature to identify each kind of earthquake precursor and to explore the proposed physical models explaining each one [6]. Rikitake estimated, from existing precursor data, the probabilities that an anomalous signal of various geophysical elements is related to a forthcoming earthquake [7], finding that precursors can be detected further from the epicenter as the mainshock magnitude becomes larger [8]. Yun-Tai finds that there seems to be a close relationship between gravity variations and the occurrences of earthquakes [9]. Brodsky finds, from the 1 April 2014 Chile event, that observable precursors may exist before large earthquakes [10]. These works showed that there are usually precursor signals before an earthquake, such as land deformation revealed by geodetic survey and tide-gauge observation, ground tilt observed by a water-tube tiltmeter, anomalous seismic activity, and geomagnetic field changes, and they proved that this method can indicate the risk of some earthquakes within an acceptable time range, rather than the long-term analysis offered by the Gutenberg-Richter law or other statistical laws. However, earthquake precursory anomalies are complex and uncertain, which makes them difficult to analyze. As a common method, PCA shows high performance in feature extraction, and we use it to extract features from our data. LightGBM is a powerful machine learning model based on decision trees, which is stable and fast with high accuracy and predictive ability [11].
The PCA algorithm has been widely used in the geomagnetic research field. Han Peng applied Principal Component Analysis to the daily variation of geomagnetism after harmonic approximation at three stations and found that the proportion of the second principal component fluctuated several times in the month before the 2000 Izu Islands earthquake. Li Jiankai and Tang Ji applied Principal Component Analysis to the horizontal magnetic field component data of electromagnetic observation at Jinggu, Lijiang, and Datong, and they found that the proportion of the second principal component, which could be related to changes in the local underground electrical structure and to electromagnetic disturbance, increased significantly about two weeks before the Jinggu earthquake in 2015 [12]. Guo performed statistical studies on the electromagnetic data observed at the AETA stations using a modified PCA method and concluded that 80% of AETA stations show a significant relationship between electromagnetic anomalies and local earthquakes [13].
Zou and Guo used the sliding Principal Component Analysis (PCA) method to study the daily periodic variation of the ionospheric electron concentration and observed a response to the 2011 Honshu M9.0 earthquake in Japan [14]. Lv used this method to analyze the electromagnetic signals of some AETA stations and found clear anomalies before and after the Jiuzhaigou earthquake in 2017 [15]. In our work, we use the PCA method to extract the main components of the low-frequency electromagnetic features.

AETA System Structure
Previous research works have found a variety of precursor anomalies; we focus on the electromagnetic radiation anomaly in the earthquake preparation and occurrence periods, which we call earthquake electromagnetism. Yao's work [16] shows that electromagnetic radiation anomalies, including changes of the geoelectric field, the electromagnetic field, electromagnetic disturbance signals, and the charge concentration in the ionosphere, often appear before an earthquake. These electromagnetic signals may come from the piezomagnetic effect, piezoelectric effect, induced magnetic effect, dynamic electromagnetic effect, thermal magnetic effect, and other subsurface processes. Jian's work [17] and Feng's work [18] observed electromagnetic anomalies in the ultra-low frequency band before the Wenchuan earthquake, the Minxian-Zhangxian earthquake, and the Izu Islands earthquake; seismic electromagnetic anomalies are mainly concentrated in the ultra-low frequency band below 30 Hz.
Although ultra-low frequency electromagnetic anomalies have shown a significant relation with earthquakes, no work has monitored them over the long term and over a large area. Since earthquake precursor anomalies are generally long-term processes, there is large space for research in this area. To explore the possibility of detecting ultra-low frequency electromagnetic anomalies in the long term, we designed the AETA system and deployed about four hundred monitoring stations around Sichuan and Yunnan, covering an area of about 800,000 km². Figure 1 shows the whole AETA system. One AETA station consists of probes and a station device, which are responsible for data processing and data uploading. The probe collects acoustic and electromagnetic signals and sends them to the station device for the next step. The station device performs processing that includes down-sampling and wave filtering, then sends the data to the cloud server. Anyone can access the data through the webpage and API. (To access the public data, visit aeta.cn or https://dtvs.aeta.net/stations, or send an email to yongshanshan@pku.edu.cn. The database was last accessed on 6 December 2021.) Figure 2 shows the distribution of AETA stations. The stations follow the distribution of fault zones in this area to catch as many anomalies as possible.
The electromagnetic disturbance probe has a frequency response range of 0.1 Hz∼10 kHz with an intensity range of 0.1 nT∼1000 nT. Its sensitivity is 20 mV/nT over the 0.1 Hz∼10 kHz range, the data resolution is 18 bits, and the voltage resolution is 19.073 µV. The high-frequency sampling rate of the probe is 30 kHz.
The energy of the magnetic anomalies concentrates in the ultra-low frequency band [19][20][21]. Accordingly, we use a Butterworth filter to down-sample the 30 kHz raw data to reduce the cost of data transmission and storage. The signal energy is concentrated below 250 Hz, and according to the Nyquist sampling theorem [22], a sampling rate fs can perfectly reconstruct a signal whose band limit is fs/2, so the raw data are down-sampled to 500 Hz; this preserves the information in the raw data below 250 Hz, and the difference between the full-frequency signal and the low-frequency signal is not obvious. The seismic electromagnetic phenomenon is related to rock fracture caused by the accumulation of underground stress and to the electromagnetic radiation generated during the development of microcracks. It is most related to the ultra-low frequency signal below 30 Hz, and this band suffers less interference from other electromagnetic sources, so this frequency range is the most valuable. Figure 3 shows the data preprocessing workflow of AETA. In the cloud server, the raw data are reduced to one value per minute according to Equation (1); the constants in Equation (1) come from the device metadata of the Analog-to-Digital Converter (ADC). As Figure 4 shows, each point represents one minute: the raw 500 Hz time series is reduced to a series with a frequency of 1/60 Hz. These data are used as the input of machine learning, and we call them the minute data.
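A minimal sketch of this down-sampling chain is shown below: an anti-alias Butterworth low-pass at 250 Hz followed by 60:1 decimation from 30 kHz to 500 Hz. The filter order and the zero-phase filtering are our assumptions, not AETA device specifications:

```python
# Anti-alias low-pass at 250 Hz, then decimate 30 kHz -> 500 Hz (60:1).
import numpy as np
from scipy import signal

FS_RAW, FS_OUT, CUTOFF = 30_000, 500, 250  # Hz

def downsample(raw: np.ndarray) -> np.ndarray:
    # 4th-order Butterworth low-pass in sos form (numerically stable).
    sos = signal.butter(4, CUTOFF, btype="low", fs=FS_RAW, output="sos")
    filtered = signal.sosfiltfilt(sos, raw)  # zero-phase filtering
    return filtered[:: FS_RAW // FS_OUT]     # keep every 60th sample

raw = np.random.randn(FS_RAW * 2)            # two seconds of fake signal
print(downsample(raw).shape)                 # (1000,) = 2 s at 500 Hz
```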

Data Preprocessing
The sliding window method uses a constant-length window sliding over the input series with a constant step to cut out a series of sub-series for further processing. We apply it to the minute data with a window length of 60 and a step of 60, averaging each hour into one number. In this way, one day yields 24 values, which are arranged chronologically into a vector of length 24, so the data become a series of vectors, each representing one day. The sliding window method is then used to organize these vectors. Since the sun's activity influences the local geomagnetic field, a window of 27 days (the solar activity period) is used to reduce this influence, and the final input is a series of matrices A24×27, in which 27 corresponds to the solar activity period in days and 24 to the hours in one day.
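The preprocessing described above can be sketched as follows (a minimal Python version; the variable names are ours):

```python
# Hourly averaging of minute data, then sliding 27-day windows of daily
# length-24 vectors, producing one 24x27 matrix per time step.
import numpy as np

def minute_to_daily_vectors(minute_data: np.ndarray) -> np.ndarray:
    """minute_data: one value per minute, covering whole days only."""
    hourly = minute_data.reshape(-1, 60).mean(axis=1)  # window 60, step 60
    return hourly.reshape(-1, 24)                      # (n_days, 24)

def daily_to_matrices(days: np.ndarray, window: int = 27):
    """Yield one 24x27 matrix per step; columns are consecutive days."""
    for end in range(window, len(days) + 1):
        yield days[end - window:end].T                 # (24, 27)

minute_data = np.random.randn(40 * 24 * 60)            # 40 fake days
days = minute_to_daily_vectors(minute_data)
print(days.shape, next(daily_to_matrices(days)).shape) # (40, 24) (24, 27)
```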

Introduction
The data are complex and full of noise and redundant information, which is why a feature dimension reduction process is important. In our work, the effective information related to an earthquake is not obvious, making the noise reduction process even more important. Dimension reduction can reduce the noise in the data. On the other hand, the complexity of common algorithms grows as K log K, K², K³, or even aᴷ as the dimension K increases. Therefore, reducing the dimension of the input data can greatly reduce the computational cost.
PCA is an algorithm commonly used for dimension reduction and feature extraction. It removes redundant information and exposes the main components of the original input data.

PCA Algorithm of Single Station
Based on the assumption that local underground variation may cause subtle changes in the local geomagnetic field, analyzing the local AETA geomagnetic signal may capture earthquake precursors.
In the PCA algorithm, we first compute the covariance matrix Σ of the matrix A24×27. As Equation (2) shows, for an n-dimensional random variable X, the covariance matrix is a symmetric n × n matrix whose diagonal elements are the variances of the random variable in each dimension. For A24×27, the covariance matrix is Σ27×27; its element Σij is given by Equation (3), and the whole matrix Σ by Equation (4).
The procedure of the PCA algorithm is as follows:
1. Calculate the covariance matrix Σ27×27 of the data matrix A24×27;
2. Calculate the eigenvalues of Σ27×27, sorted from large to small, and their corresponding eigenvectors;
3. Choose the K eigenvectors with the largest eigenvalues and use them as the row vectors of a new matrix P. K should be chosen so that the cumulative contribution rate of the eigenvalues exceeds 85%; the contribution is calculated as shown in Equation (5);
4. Use the eigenvector matrix P to reconstruct the original data matrix A24×27: first calculate the projection values of A24×27 on these eigenvectors, and then inverse-transform them to the original coordinate system to obtain the reconstruction matrix A'24×27;
5. Take the last day's data of the original matrix A24×27 (the last of the 27 days) as vector L of length 24, and the corresponding data of the reconstructed matrix A'24×27 as vector L'. Take the difference L − L', and set all its elements to zero except the one with the largest absolute value. This generates a vector of length 24 named the "feature vector", which is the output of this algorithm; in addition, the K largest eigenvalues are also output.
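A minimal Python sketch of Steps 1∼5 on one sample matrix follows; the matrix orientation (days as columns) and the centering convention are our reading, since the text does not fix them explicitly:

```python
# Single-station PCA feature extraction for one 24x27 sample matrix A.
import numpy as np

def single_station_pca(A: np.ndarray, energy: float = 0.85):
    """A: (24, 27), rows = hours, columns = days. Returns the length-24
    feature vector, the K largest eigenvalues, and their proportions."""
    mean = A.mean(axis=0)                    # per-day (column) means
    C = np.cov(A, rowvar=False)              # Step 1: 27x27 covariance
    vals, vecs = np.linalg.eigh(C)           # Step 2: eigendecomposition
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # Step 3: smallest K whose cumulative contribution rate reaches 85%.
    K = int(np.searchsorted(np.cumsum(vals) / vals.sum(), energy)) + 1
    P = vecs[:, :K]                          # top-K eigenvectors, 27xK
    A_rec = (A - mean) @ P @ P.T + mean      # Step 4: reconstruction
    diff = A[:, -1] - A_rec[:, -1]           # Step 5: last day, length 24
    feature = np.zeros_like(diff)
    i = int(np.argmax(np.abs(diff)))
    feature[i] = diff[i]                     # keep only the largest |entry|
    return feature, vals[:K], vals[:K] / vals.sum()
```

The last return value, the eigenvalue proportions, corresponds to the proportion features discussed below.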
This algorithm is suitable for our data, which have large information redundancy. The data matrix of one sample is a 24 × 27 matrix, where 24 corresponds to the 24 h of a day and 27 to the 27 days of one sample, matching the sun's activity period. The PCA algorithm extracts the eigenvectors that account for more than 85% of the energy and reconstructs the data from them. The reconstructed matrix contains the main signals of the data matrix; after subtracting it from the original matrix, some subtle changes of the periodic signals can be captured.
Based on the matrix A24×27 of each time step, we obtain several features related to the last of its 27 days. The features include the vector generated in Step 5, which is called the "feature vector", and the principal component values, whose number is decided by K; here there are three: feature1, the first principal component; feature2, the second principal component; and feature3, the third principal component. These features will be used in the prediction model to represent each station itself.
Another feature is the proportion of each principal component's eigenvalue in the sum of all eigenvalues. We find that before an earthquake, the proportion of the first principal component often decreases and the proportions of the second and third principal components increase, so these proportions can serve as features [23][24][25]. The first principal component may be related to the background signals caused by the local environment and geological structure, so it represents the general form of the data. On the other hand, the second and third principal components may be related to pre-seismic anomalies caused by seismic activity.

PCA Algorithm of Multiple Stations
Earthquake precursors are regional processes, so the anomalies of one station before an earthquake may be related to those of its neighbors, and the data of a single station cannot capture this potential relevance between different stations. In this work, we use multi-station PCA to resolve this problem. Earthquakes are caused by the movement of the earth's plates, and our work is based on the hypothesis that the rupture of a fault zone is regional, so an earthquake precursor may be captured and recorded by multiple stations at the same time.
We hypothesize that an area with low earthquake probability has mild underground activity, so the data of the stations in this area should be similar. An earthquake is a regional event, so before one, the data of the stations in the area should differ more from each other.
The multi-station PCA method, analogous to that of Section 4.2, is as follows. First, use a sliding window of length 27 to slide over the time-series data processed in Section 3.2; for station A this yields RowA (1 × 27), for station B RowB (1 × 27), and so on. Stack the vectors to organize the time-series data into a matrix; for example, two 1 × 27 vectors become a 2 × 27 matrix. This step outputs a matrix Mn×27, where n is the number of stations used.
Second, use the method of Section 4.2 to calculate the covariance matrix of Mn×m and obtain the eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn and their corresponding eigenvectors ξ1, ξ2, . . . , ξn. In this work, m is 27 days. The eigenvalues λ1, λ2, . . . , λn, from large to small, are called the first principal component, the second principal component, and so on.
We believe that the first principal component represents the similarity between stations, and the second principal component represents the difference between them. Based on this hypothesis, an abnormal rise in the proportion of the second principal component indicates that the data of the two stations differ unusually, which may be related to earthquakes.
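The two-station version can be sketched as follows: stack the two stations' length-27 windows, eigendecompose the 2 × 2 covariance matrix, and report the proportion of each principal component. A rising second-component proportion is the anomaly indicator just described:

```python
# Two-station PCA: eigenvalue proportions of the 2x2 covariance matrix.
import numpy as np

def two_station_pca(row_a: np.ndarray, row_b: np.ndarray):
    """row_a, row_b: length-27 daily series of the two stations."""
    M = np.vstack([row_a, row_b])            # 2 x 27, rows = stations
    C = np.cov(M)                            # 2 x 2 covariance over 27 days
    vals = np.sort(np.linalg.eigvalsh(C))[::-1]
    return vals, vals / vals.sum()           # eigenvalues and proportions

rng = np.random.default_rng(1)
base = rng.standard_normal(27)               # shared "background" signal
_, props = two_station_pca(base + 0.1 * rng.standard_normal(27),
                           base + 0.1 * rng.standard_normal(27))
print(props)  # similar stations: the first component dominates (near 1.0)
```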
A measure named Es is used to quantify the impact of earthquakes on a certain region, which is widely used in the research of Hayakawa and other seismologists [26]. Equation (6) shows the calculation of this measure, in which M is the magnitude of the earthquake and R is the distance from the earthquake to the station; Es is the influence of a single earthquake, and summing Es over all earthquakes of one day gives the whole influence of that day. Previous methods, such as that of Li Jiankai [12], set the number of stations used in one matrix to 3.
The two-station method is used to extract the potential relationship between multiple stations, which differs from the single-station method and expands the number of available points. The first principal component and the second principal component are features of a pair of stations, located at the center of the two stations. Figure 6 shows an example of the first principal component feature. The area of this example is 98°E to 107°E and 22°N to 34°N, the time span is from 12 September 2020 to 9 October 2020, and we use the last day, 9 October 2020, to represent the window. Each virtual station in this image is the geometric center of a pair of stations, and its color corresponds to the first principal component. The geometric center of all points above 0.6 is regarded as the potential earthquake center; this is our planned method for locating the epicenter, although it is not used in this work.
In summary, the two-station method works on pairs of stations: we use a sliding window of 27 days to cut the data, stack the two vectors, compute the covariance matrix, and use the PCA method to extract the features. Its output is two eigenvalues, named the first principal component and the second principal component, and their eigenvectors. The eigenvalues are used as features in our model. The corresponding position of the features is the center of the two stations, which is called the virtual station. For an area that contains K stations, the number of virtual stations is K(K − 1).
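A small sketch of enumerating the virtual stations from real station coordinates (the names and coordinates below are hypothetical) is shown here; note that unordered pairs give K(K − 1)/2 distinct centers, while the count above treats pairs as ordered:

```python
# Virtual stations: geometric centers of all pairs of real stations.
from itertools import combinations

stations = {"S1": (103.2, 30.1), "S2": (104.0, 31.5), "S3": (102.5, 29.8)}

virtual = {
    (a, b): ((pa[0] + pb[0]) / 2, (pa[1] + pb[1]) / 2)
    for (a, pa), (b, pb) in combinations(stations.items(), 2)
}
print(virtual)  # 3 stations -> 3 pair centers, each carrying PCA features
```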

Figure 6. Multi-PCA map. The red star is the earthquake, and the black star is the predicted epicenter. Green points are virtual stations, located at the centers of station pairs.

Model Structure
This work uses LightGBM (LGB) to predict the existence of earthquakes.
LGB is an implementation of the Gradient Boosting Decision Tree (GBDT) [11,27,28], a semi-automatic machine learning model with good performance.
LGB is often used as the final prediction layer that summarizes features and generates a prediction. It first generates multiple sub-classifiers and then combines them. Each sub-classifier has multiple layers, each layer has multiple nodes representing a possible property, and the whole classifier looks like a tree; Figure 7 shows this structure. The model selects split features by Gini coefficients at each node, trains a series of base classifiers based on the idea of gradient boosting, and considers all base classifiers together when making predictions.
The model focuses on the existence of earthquakes in an area, and its input is a series of features extracted from the virtual stations in this area. A typical area usually contains about 20∼40 real stations and therefore 400∼1600 virtual stations. We define a threshold V: the virtual stations whose first principal component is under the threshold are regarded as abnormal, and their first principal components and corresponding second principal components are used to calculate the input features. The outputs of the single-station PCA method and the two-station PCA method have a granularity of individual days, so one week yields 7 days of outputs from each method. The two-station outputs are aggregated per day over the abnormal values (the first principal component values under the threshold V), and all these values are concatenated with the single-station method outputs into a vector of length 14 + 14 + 7 + 21 = 56.
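The exact per-day aggregation behind the 14 + 14 + 7 + 21 layout is not spelled out above, so the sketch below shows one plausible reading rather than the paper's implementation: per-day means (7 × 2 = 14) and standard deviations (7 × 2 = 14) of the abnormal virtual stations' first and second principal components, the per-day count of abnormal virtual stations (7), and the three single-station proportions per day (7 × 3 = 21). All aggregation choices here, including the default threshold, are assumptions:

```python
# Hypothetical assembly of the 56-dimensional weekly feature vector.
import numpy as np

def weekly_feature_vector(first_pc, second_pc, single_props, V=0.6):
    """first_pc, second_pc: (7, n_virtual) daily PC values per virtual
    station; single_props: (7, 3) single-station proportions per day."""
    means, stds, counts = [], [], []
    for d in range(7):
        abnormal = first_pc[d] < V                     # assumed criterion
        sel1 = first_pc[d][abnormal] if abnormal.any() else np.zeros(1)
        sel2 = second_pc[d][abnormal] if abnormal.any() else np.zeros(1)
        means += [sel1.mean(), sel2.mean()]            # 7 x 2 = 14 values
        stds += [sel1.std(), sel2.std()]               # 7 x 2 = 14 values
        counts.append(float(abnormal.sum()))           # 7 values
    x = np.concatenate([means, stds, counts, single_props.ravel()])
    assert x.shape == (56,)                            # 14 + 14 + 7 + 21
    return x
```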
An area usually has ∼20-40 stations, and each sample, a vector of length 56, corresponds to 7 days of data. In addition to the input vector of length 56, the model needs a corresponding label for training. The label of each 7-day window is determined by the earthquakes in its area: we choose the largest earthquake magnitude within the 7 days after the last day of the window as the label, and the relationship between label and magnitude is as follows:
Ms5.0-Ms5.5 is marked as 5.
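As a sketch of the labeling step: the windowing follows the text (the label of a 7-day window is the largest magnitude within the following 7 days), but since only the Ms5.0∼Ms5.5 → 5 bin is stated explicitly above, the other bin edges below are illustrative assumptions:

```python
# Label = magnitude bin of the largest quake in the next 7 days (0 = none).
import numpy as np

BINS = [0.0, 3.5, 4.0, 4.5, 5.0, 5.5]  # only 5.0-5.5 -> 5 is stated; the
                                       # other edges are our assumptions

def label_for_window(quake_days, quake_mags, last_day):
    """quake_days/quake_mags: catalog arrays; last_day: window end (day)."""
    days, mags = np.asarray(quake_days), np.asarray(quake_mags)
    mask = (days > last_day) & (days <= last_day + 7)
    if not mask.any():
        return 0                                   # no earthquake next week
    return int(np.searchsorted(BINS, mags[mask].max(), side="right"))

print(label_for_window([10, 12], [5.2, 3.1], last_day=9))  # -> 5
```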

Experiment
In this work, the used area is a rectangle spanning 98°E∼106.2°E and 28.1°N∼34°N. The time range of the training data set is from 1 July 2019 to 1 March 2021, and the time range of the testing data set is from 1 March 2021 to 31 December 2021. The distribution of the data is shown in Table 1. The corresponding earthquakes of the testing data set, 63 in total, are listed in Table 2, and the corresponding stations are listed in Table 3. The sliding window time step of the training data is 1 day, and that of the testing data is 4 days. With magnitude Ms3.5 as the threshold between positive and negative, the train set has 393 positive and 216 negative samples, and the test set has 37 positive and 32 negative samples. The distributions of the train set and the test set differ, especially for earthquakes with magnitudes beyond Ms4.5, such as the Ms6.4 earthquake on 21 May 2021 and its aftershocks shown in Table 2. This difference is a big challenge for the generalization performance of the model and shows the complexity of earthquake prediction.
It is easy to see from Table 1 that the data are unbalanced, so we expand the rare labels 3, 4, and 5 in the training data by duplicating them several times: label 3 is copied once, label 4 three times, and label 5 five times.
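A small sketch of this duplication-based oversampling (one extra copy of label 3, three of label 4, five of label 5):

```python
# Oversample rare labels by appending extra copies to the training data.
import numpy as np

def oversample(X: np.ndarray, y: np.ndarray):
    extra_copies = {3: 1, 4: 3, 5: 5}          # label -> number of copies
    Xs, ys = [X], [y]
    for label, times in extra_copies.items():
        idx = np.where(y == label)[0]
        for _ in range(times):
            Xs.append(X[idx])
            ys.append(y[idx])
    return np.concatenate(Xs), np.concatenate(ys)
```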
The LightGBM algorithm has many hyperparameters that need to be set manually, and the best values must be chosen by experiment. In this work, the hyperparameters we tuned are the learning rate (Lr), num-leaves, and max-depth. Lr controls the speed of updating the internal parameters of the model, while num-leaves and max-depth control the complexity of the model. Generally, a bigger Lr means faster but rougher updating, and bigger num-leaves and max-depth mean a more complex model. We use the grid search method, which generates all possible hyperparameter combinations and tests them one by one. Lr is chosen from [0.0001, 0.001, 0.01, 0.1, 0.5], num-leaves from [3, 5, 8, 12, 16, 20], and max-depth from [3, 5, 8, 12, 16, 20, 25]. In the experiment, we also use a random train set to run a control-group experiment. To ensure the authenticity of the data, the random data share the same source as the true train set: the data themselves are not shuffled, but the relationship between data and labels is shuffled. This random train set is trained in the same environment and with the same hyperparameters as the true train set and is evaluated on the same test set. The final hyperparameters we used are as follows: Lr is 10⁻⁵, max-depth is 12, and num-leaves is 20.
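A sketch of this grid search using the lightgbm scikit-learn interface is shown below; the dummy data and the selection on the test set are illustrative assumptions (in practice a held-out validation split would be the safer choice):

```python
# Exhaustive grid search over Lr, num_leaves, and max_depth for LightGBM.
from itertools import product
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)                      # dummy stand-in data
X_train, y_train = rng.standard_normal((609, 56)), rng.integers(0, 6, 609)
X_test, y_test = rng.standard_normal((69, 56)), rng.integers(0, 6, 69)

grid = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.5],
    "num_leaves": [3, 5, 8, 12, 16, 20],
    "max_depth": [3, 5, 8, 12, 16, 20, 25],
}

best_params, best_acc = None, -1.0
for lr, leaves, depth in product(*grid.values()):
    model = LGBMClassifier(learning_rate=lr, num_leaves=leaves,
                           max_depth=depth)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_params, best_acc = (lr, leaves, depth), acc
print(best_params, best_acc)
```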

Results Analysis
We use heat maps to visualize the results, as Figure 8 shows. Each number in the heat map counts one class, with the true magnitude on the x-axis and the predicted magnitude on the y-axis. For example, at position (0, 1) of the true-set results the number is 7, which means that among the 69 samples of the test set there are seven events whose true label is 0 (no earthquake) and for which the model gives label 1 (it believes an earthquake under Ms3.5 should happen). On the diagonal, the predicted and true magnitudes are equal, which means a fully correct prediction; the further a prediction lies from the diagonal, the greater the deviation between the predicted and true values. In the random-results map, 11 samples are on the diagonal, while 16 samples are on the diagonal in the true map. The random-set model predicts 19 negative samples, whose magnitudes are around Ms3.5, as positive and six negative samples as negative, while the true-set model predicts 12 negative samples as positive and 13 as negative. In addition, the random model has no ability to give positive predictions, because the data-label relationship is broken by the shuffle, and it can only give labels 0 and 1, which are the most numerous. However, the model trained on true data (called the true model) can give some meaningful results.
Accurately distinguishing the magnitudes of two nearby labels is hard, because the training samples are insufficient and unbalanced, earthquakes are complex, and the theory is not well developed. To give a more intuitive prediction of whether there is an earthquake or not, we choose a threshold to binarize the output of the model. As in Section 5.2, we use Ms3.5 as the threshold between positive and negative. Table 4 shows the precision, recall, F1-score, and accuracy of the positive-negative binary results of the two models. Precision is the ratio of correct positive predictions among all predicted positive samples, recall is the ratio of correctly predicted positives among all true positive samples, and the F1-score is their harmonic mean. Accuracy is the ratio of all correct predictions. There is no big gap between the two models' accuracy, but the F1-score of the random model is 0 because it can only give negative predictions, as Figure 8 shows. However, the true model can predict some true earthquakes, even if with a gap in magnitude, which means that the model truly learns something from the data. The true model has 198 sub-trees, while the random model has 30,000 sub-trees, each more complex than those of the true model. We think this is because the data and labels in the random train set are shuffled, so the model overfits them, yielding poor generalization ability.
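The binary metrics of Table 4 can be reproduced from the predicted and true labels as sketched below, assuming that labels of 2 and above correspond to earthquakes of at least Ms3.5 (our bin assumption from the labeling sketch):

```python
# Binarize multi-class labels at the Ms3.5 threshold and compute metrics.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

def binary_metrics(y_true, y_pred, positive_from=2):
    t = [int(v >= positive_from) for v in y_true]
    p = [int(v >= positive_from) for v in y_pred]
    return {
        "precision": precision_score(t, p, zero_division=0),
        "recall": recall_score(t, p, zero_division=0),
        "f1": f1_score(t, p, zero_division=0),   # harmonic mean of P and R
        "accuracy": accuracy_score(t, p),
    }

print(binary_metrics([0, 2, 4, 1], [1, 2, 1, 0]))
```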

Further Work
A better model of earthquake prediction could be built from the following aspects:
1. Using more effective features besides the PCA features, since the PCA features can only describe a part of the earthquake precursors;
2. Using a more complex model, which may extract more of the relationship between the features and the events.

Conclusions
In our work, we propose an earthquake prediction model based on the data accumulated by the AETA system and achieve certain results. We show that there is a correlation between the AETA data and the earthquakes around the AETA stations.
There are still some improvements that could be made. As a result of the uneven distribution of the AETA stations, the epicenter location still has a bias problem, which leads to lower prediction accuracy. On the other hand, the features we extracted from the AETA data are insufficient; we believe that future work should focus on other features and combine them to build a better model.