A New Multi-Scale Sliding Window LSTM Framework (MSSW-LSTM): A Case Study for GNSS Time-Series Prediction

Abstract: GNSS time-series prediction plays an important role in the monitoring of crustal plate movement and dam or bridge deformation, and in the maintenance of global or regional coordinate frames. Deep learning is a state-of-the-art approach for extracting high-level abstract features from big data without any prior knowledge. Moreover, long short-term memory (LSTM) networks are a form of recurrent neural network with significant potential for processing time series. In this study, a novel prediction framework was proposed by combining a multi-scale sliding window (MSSW) with LSTM. Specifically, MSSW was applied for data preprocessing to extract feature relationships at different scales and to mine the deep characteristics of the dataset. Then, multiple LSTM neural networks were used for prediction, and the final result was obtained by weighting their outputs. To verify the performance of MSSW-LSTM, 1000 daily solutions of the XJSS station in the Up component were selected for prediction experiments. Compared with the traditional LSTM method, the results of three groups of controlled experiments showed that the RMSE was reduced by 2.1%, 23.7%, and 20.1%, and the MAE was reduced by 1.6%, 21.1%, and 22.2%, respectively. These results show that the MSSW-LSTM algorithm achieves higher prediction accuracy and smaller error, and can be applied to GNSS time-series prediction.


Introduction
The long-term accumulated Global Navigation Satellite System (GNSS) coordinate time series provide valuable data for geodesy and geodynamic research [1][2][3]. These data not only reflect long-term trends, but also capture nonlinear changes caused by geophysical effects. GNSS coordinate time series play an important role in the monitoring of crustal plate movements [4,5], dam or bridge deformation monitoring [6][7][8][9][10], and the maintenance of global or regional coordinate frames [11,12]. The coordinates at subsequent time points can be predicted by analyzing the GNSS coordinate time series, thus providing an important basis for judging the motion trend. Therefore, the prediction of GNSS coordinate time series is highly valuable.
It is well known that GNSS coordinate time series reflect both the deterministic law of motion and uncertain information, which may be caused by imperfect processing models, geophysical effects, and other factors that are difficult to model [13]. Two kinds of time-series analysis methods exist: physical modeling and numerical modeling. In these traditional methods, models of coordinate time series are constructed according to geophysical theory, the linear term, the periodic term, and gap information [14,15]. Usually, the feature information and modeling parameters must be established manually, and the exclusion of relevant elements leads to systematic deviation and limitations in the results.
[...] used in this area, and has achieved good results, because it can take into account information at different scales. The multiscale sliding window is a feature extraction method for image processing in the field of computer vision [35,36] that is able to consider feature information at different scales. In this study, we applied the idea of the multiscale sliding window, originally conceived for two-dimensional data, to one-dimensional time-series data, thus providing a new idea for the use of LSTM.
In this study, we proposed a multiscale sliding window LSTM (MSSW-LSTM) approach for GNSS time-series prediction. The new method uses several different sliding windows for data preprocessing that can capture data information at different scales. Then, the preprocessed outputs are used as inputs into the corresponding LSTM, and each LSTM can be adjusted according to the data. The structure of this article is as follows: Section 2 details the methodology for the MSSW-LSTM. Then, the data and processing strategy are introduced in Section 3. Section 4 analyses the experimental results, and a discussion and conclusions are given in Section 5.

LSTM
The traditional neural network model does not retain information from previous time steps, and only considers information at the current time. In contrast, the RNN has a memory function, which passes information from the current moment to the subsequent moment. However, when an RNN must learn long-term dependencies, its gradients tend to vanish or explode during training. By comparison, LSTM mitigates the vanishing-gradient problem by introducing gated memory cells.
As shown in Figure 1a, a typical LSTM cell has three gates, i.e., the input gate, forget gate, and output gate. The cell state and the output hidden state are also core components of the LSTM cell. The single-layer and multi-layer LSTM models are shown in Figure 1b and Figure 2, respectively.
The forget gate is defined as:
$$f_t = \sigma(W_{fh} h_{t-1} + W_{fx} x_t + b_f)$$
where $\sigma$ is the logistic sigmoid function, $W_{fh}$ and $W_{fx}$ are the weight matrices for transformation of information from cell to gate vectors, $h_{t-1}$ is the hidden state (output) of the previous time step, $x_t$ is the input of the current time, $b_f$ is the offset (bias) value of the forget gate, and $f_t$ is the forget gate at time $t$. The forget gate combines $h_{t-1}$ with the current input $x_t$ to selectively forget content.
The input gate is defined as:
$$i_t = \sigma(W_{ih} h_{t-1} + W_{ix} x_t + b_i)$$
$$\tilde{C}_t = \tanh(W_{ch} h_{t-1} + W_{cx} x_t + b_c)$$
where $\sigma$ and $\tanh$ are activation functions, $W_{ih}$, $W_{ix}$, $W_{ch}$, and $W_{cx}$ are weight matrices, $b_i$ and $b_c$ are offset values of the input gate, $i_t$ is the input gate at time $t$, and $\tilde{C}_t$ is the candidate information to be memorized at time $t$. The input gate combines $h_{t-1}$ with the current input $x_t$ to selectively remember content.
The cell state update is defined as:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
where $f_t$ is the forget gate, $C_{t-1}$ represents the information of the previous moment on the main line, $i_t$ is the input gate, $\tilde{C}_t$ denotes the information that should be memorized at time $t$, $C_t$ is the cell state of the main line, and $\odot$ denotes element-wise multiplication. The main line cells selectively remember and forget the current input information.
Finally, the output gate and the hidden state are obtained by:
$$O_t = \sigma(W_{oh} h_{t-1} + W_{ox} x_t + b_o)$$
$$h_t = O_t \odot \tanh(C_t)$$
where $\sigma$ and $\tanh$ are activation functions, $W_{oh}$ and $W_{ox}$ are weight matrices, $b_o$ is the offset value of the output gate, $O_t$ is the output gate, $C_t$ is the cell state of the main line, and $h_t$ is the output at time $t$.
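The gate definitions in this section can be traced step by step in code. The following is a minimal sketch in plain Python, assuming list-based vectors and a parameter dictionary whose keys (`W_fh`, `b_f`, and so on) simply mirror the symbols in the text; the helper names `affine`, `lstm_step`, and `zero_params` are illustrative and are not taken from the original study. It is not numerically hardened and is not a substitute for a deep learning library.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, the sigma used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_h, W_x, b, h_prev, x_t):
    """Compute W_h h_{t-1} + W_x x_t + b, one value per hidden unit."""
    return [
        bias
        + sum(w * h for w, h in zip(row_h, h_prev))
        + sum(w * v for w, v in zip(row_x, x_t))
        for row_h, row_x, bias in zip(W_h, W_x, b)
    ]

def lstm_step(p, h_prev, c_prev, x_t):
    """Advance one LSTM cell by a single time step; returns (h_t, C_t)."""
    # Forget gate f_t and input gate i_t
    f_t = [sigmoid(z) for z in affine(p["W_fh"], p["W_fx"], p["b_f"], h_prev, x_t)]
    i_t = [sigmoid(z) for z in affine(p["W_ih"], p["W_ix"], p["b_i"], h_prev, x_t)]
    # Candidate information to be memorized at time t
    c_cand = [math.tanh(z) for z in affine(p["W_ch"], p["W_cx"], p["b_c"], h_prev, x_t)]
    # Cell state update: forget part of the old state, add part of the candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_cand)]
    # Output gate O_t and hidden state h_t
    o_t = [sigmoid(z) for z in affine(p["W_oh"], p["W_ox"], p["b_o"], h_prev, x_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def zero_params(hidden, inputs):
    """All-zero parameters (hypothetical helper, useful only for checking shapes)."""
    params = {}
    for gate in ("f", "i", "c", "o"):
        params[f"W_{gate}h"] = [[0.0] * hidden for _ in range(hidden)]
        params[f"W_{gate}x"] = [[0.0] * inputs for _ in range(hidden)]
        params[f"b_{gate}"] = [0.0] * hidden
    return params

# With all-zero parameters every gate equals 0.5 and the candidate equals 0,
# so the cell state is simply halved at each step.
h, c = lstm_step(zero_params(hidden=2, inputs=1), [0.0, 0.0], [2.0, -2.0], [0.7])
```

With zero parameters the step reduces to $C_t = 0.5\,C_{t-1}$ and $h_t = 0.5\tanh(C_t)$, which makes the roles of the forget and output gates easy to verify by hand.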

Multi-Scale Sliding Window LSTM
The sliding window is widely used in computer vision, usually on two-dimensional images, in fields such as object detection and semantic segmentation. In this study, the concept of the sliding window was applied to data preprocessing. Because GNSS coordinate time series are one-dimensional, the sliding window was reduced to one dimension to construct the data sets. Traditional data preprocessing uses a single-scale sliding window to establish the initial data, as shown in Figure 3, in which length_x and length_y each take a single value. Current LSTM research on time series uses a single-scale sliding window, or other transformations of the data. However, a single-scale window captures information at only one fixed scale at each time, so this method of constructing a dataset is limited, and the construction of the dataset may determine the accuracy of the model training. In this study, we proposed a multiscale sliding window that inputs information at different scales into the corresponding networks, forms a unified dimension, and integrates the existing research into a unified processing framework.
The GNSS coordinate time series are obtained and arranged in a unified dimension according to the time sequence:
$$X = \{x_1, x_2, \ldots, x_m\}$$
where m is the length of X.
The interval of the GNSS time series should adopt a uniform dimension, such as seconds, minutes, hours, days, weeks, months, or years. The multiscale sliding window is constructed as follows. Assume that the length of the front portion in the ith sliding window is length_xi, and the length of the back portion is length_yi. The window is moved by one unit at a time to construct the data sequentially, and the number of window positions v must satisfy v ≤ m − (length_xi + length_yi) + 1. At each position, the front portion supplies the input values, and the back portion, which immediately follows it, supplies the corresponding labels. In the multiscale mode, k ≥ 2, where k is the total number of scales and i = 1, 2, . . . , k. The values length_x1, length_x2, . . . , length_xk are all different, because it would be meaningless to construct duplicate data sets. However, length_y1, length_y2, . . . , length_yk are equal, which is convenient for the final weighting calculation.
The constructed data set is shown in Equation (9) and Figure 4.
Remote Sens. 2021, 13, 3328
Figure 4 is a schematic diagram of k sliding windows of different scales. It can be seen that the sizes of the red sliding windows differ between scales, whereas the sizes of the blue sliding windows are the same.
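The window construction described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `sliding_window_samples` and `multiscale_datasets` are assumptions, the series is a stand-in list of integers rather than real GNSS data, and the window is taken to advance by exactly one unit per sample, as stated in the text.

```python
def sliding_window_samples(series, length_x, length_y):
    """Build (inputs, labels) for one scale by moving the window one unit at a time.

    The number of samples is v = m - (length_x + length_y) + 1, where m = len(series).
    """
    m = len(series)
    v = m - (length_x + length_y) + 1
    if v < 1:
        raise ValueError("series is too short for this window")
    # Front portion: length_x consecutive values used as the input
    inputs = [series[j:j + length_x] for j in range(v)]
    # Back portion: the length_y values that immediately follow, used as the label
    labels = [series[j + length_x:j + length_x + length_y] for j in range(v)]
    return inputs, labels

def multiscale_datasets(series, lengths_x, length_y):
    """Build one dataset per scale; input lengths differ, label length is shared."""
    if len(set(lengths_x)) != len(lengths_x):
        raise ValueError("duplicate input lengths would create duplicate datasets")
    return [sliding_window_samples(series, lx, length_y) for lx in lengths_x]

# Stand-in series of 1000 values with three scales and a label length of 1
series = list(range(1000))
datasets = multiscale_datasets(series, lengths_x=[10, 15, 20], length_y=1)
sample_counts = [len(inputs) for inputs, _ in datasets]
```

For a series of 1000 values, the three scales yield 990, 985, and 980 samples, and the last label of every scale is the final value of the series, which is what allows the subnetwork outputs to be aligned later.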
Thus, an MSSW-LSTM algorithm for GNSS time-series prediction was proposed. The overall processing flow of MSSW-LSTM is shown in Figure 5. First, the GNSS station coordinate time series is obtained, and different datasets are constructed using the multiscale window. Following the construction of the datasets, the corresponding LSTM subnetworks are established for each data set according to the actual situation of the data set.
Each LSTM subnetwork has its own weight matrix after training, adjustment, and optimization. The trained parameters are saved, and the model of each subnetwork is used for prediction. The prediction results of subnetwork (1), $r_1$, subnetwork (2), $r_2$, . . . , and subnetwork (k), $r_k$, are then obtained.
The final prediction value R is the weighted value of each subnetwork prediction result, and the calculation formula is shown in Equation (10):
$$R = w_1 r_1 + w_2 r_2 + \cdots + w_k r_k \tag{10}$$
where $w_1, w_2, \ldots, w_{k-1}$ and $w_k$ are the weights of the prediction results from each subnetwork. The sum of all weight values should be 1, as shown in Equation (11):
$$w_1 + w_2 + \cdots + w_k = 1 \tag{11}$$
In general, if there is little difference between the subnetworks, the weight value of each subnetwork should be the same, as shown in Equation (12):
$$w_1 = w_2 = \cdots = w_k = \frac{1}{k} \tag{12}$$
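The weighting step in Equations (10) to (12) can be illustrated with a short Python sketch. The function name `combine_predictions` and the sample numbers are assumptions made for illustration; the sketch only assumes that every subnetwork predicts the same set of epochs, so that the predictions can be combined element by element.

```python
def combine_predictions(predictions, weights=None, tol=1e-9):
    """Weighted combination R = w_1 r_1 + ... + w_k r_k of k prediction series.

    predictions: list of k equally long lists, one per subnetwork.
    weights: optional list of k weights summing to 1; defaults to 1/k each.
    """
    k = len(predictions)
    if k == 0:
        raise ValueError("at least one subnetwork prediction is required")
    if weights is None:
        weights = [1.0 / k] * k  # equal weights, Equation (12)
    if len(weights) != k:
        raise ValueError("one weight per subnetwork is required")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")  # Equation (11)
    n = len(predictions[0])
    if any(len(r) != n for r in predictions):
        raise ValueError("all subnetworks must predict the same epochs")
    # Equation (10), applied to each predicted epoch
    return [sum(w * r[t] for w, r in zip(weights, predictions)) for t in range(n)]

# Three hypothetical subnetwork outputs for two epochs
r1, r2, r3 = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
equal = combine_predictions([r1, r2, r3])
custom = combine_predictions([r1, r2, r3], weights=[0.5, 0.25, 0.25])
```

Validating that the weights sum to 1 before combining keeps the result on the same scale as the individual predictions.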
It should be noted that the MSSW-LSTM method has significant flexibility. For example, the LSTM networks may be the same or different, and may consist of a single layer or multiple layers. This flexibility is beneficial for researchers, who are able to select the most appropriate network model according to their own dataset characteristics and utilize the advantages of the network model.

Evaluation Criteria
To quantitatively evaluate the prediction accuracy of our proposed model, indexes are used to calculate the difference between the real value and the predicted value. Here, the root mean square error (RMSE) and the mean absolute error (MAE) are used to evaluate the prediction accuracy [37], and the corresponding formulas are as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$
where N is the number of data points, $y_i$ are the true values, and $\hat{y}_i$ are the predicted values.
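Both evaluation indexes are straightforward to compute. The sketch below uses the standard definitions of RMSE and MAE in plain Python; the function names and the toy values are illustrative and do not come from the experiments.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: a single error of 2 among three points
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
toy_rmse = rmse(y_true, y_pred)  # sqrt(4 / 3)
toy_mae = mae(y_true, y_pred)    # 2 / 3
```

Because RMSE squares each error before averaging, a single large error raises it more than it raises MAE, which is why the two indexes are usually reported together.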

MSSW-LSTM Process Strategy
The data processing strategy proposed in this paper is shown in Figure 6a, and the main steps of the MSSW-LSTM algorithm are described in Table 1.
Figure 6b shows the specific method of using the MSSW-LSTM algorithm to process the XJSS station. Three sliding windows with different scales were used to preprocess the GNSS time series. Accordingly, three sub-sequence sets were obtained, and three different LSTM networks were then established. For specific processing, please refer to Section 3.2.

MSSW-LSTM Processing for the XJSS Station
To more accurately verify the effectiveness of the MSSW-LSTM algorithm, we directly selected a real dataset with a long time span, rather than a simulated one. Through screening, a daily coordinate time series of the XJSS station in the Up component, representing a total of 1000 epochs with high data integrity, was finally selected as the experimental data. The data collection period was from 12 April 2011 to 6 February 2014, and data were obtained from the China Earthquake Networks Center. The overall data processing flow is shown in Figure 6b.
Then, we preprocessed the data and constructed a multiscale sliding window to form new sub-sequences. Here, we first provide two definitions. The fixed sliding window length is the length of the training data entered at each time, and the predicted length is the length of the data label, which represents the true value. In total, three sliding windows were constructed, as follows:
1. The fixed sliding window length was 10, and the predicted length was 1;
2. The fixed sliding window length was 15, and the predicted length was 1;
3. The fixed sliding window length was 20, and the predicted length was 1.
The first sub-sequence had a fixed sliding window of 10 and a data label length of 1, resulting in the construction of 990 available datapoints:
$$\{x_j, x_{j+1}, \ldots, x_{j+9}\} \rightarrow x_{j+10}, \quad j = 1, 2, \ldots, 990$$
The second sub-sequence had a fixed sliding window of 15 and a data label length of 1, resulting in the construction of 985 available datapoints, namely:
$$\{x_j, x_{j+1}, \ldots, x_{j+14}\} \rightarrow x_{j+15}, \quad j = 1, 2, \ldots, 985$$
Similarly, the third sub-sequence had a fixed sliding window of 20 and a data label length of 1, resulting in the construction of 980 available datapoints, namely:
$$\{x_j, x_{j+1}, \ldots, x_{j+19}\} \rightarrow x_{j+20}, \quad j = 1, 2, \ldots, 980$$
During training, data normalization cannot be ignored. To ensure the stability of the data, they were preprocessed and normalized, and the attributes were scaled to between 0 and 1.
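Scaling the attributes to between 0 and 1 is commonly done with min-max normalization. The text does not state which normalization was used, nor whether the scaling bounds were taken from the training portion only, so the following Python sketch is an assumption on both points; the function names are illustrative.

```python
def min_max_fit(values):
    """Return the (minimum, maximum) used as scaling bounds."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max scaled")
    return lo, hi

def min_max_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]

def min_max_invert(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]

# Toy coordinate values (not real GNSS data)
raw = [2.0, 4.0, 6.0]
lo, hi = min_max_fit(raw)
scaled = min_max_scale(raw, lo, hi)
restored = min_max_invert(scaled, lo, hi)
```

Keeping the bounds makes it possible to convert network outputs back to the original units before computing error metrics such as RMSE and MAE.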
Following preparation of the datasets, they were divided into a training set and a validation set. Usually, the training set comprises 70% of the data and the validation set the remaining 30%. In our case, there were 990, 985, and 980 samples in the first, second, and third sub-sequence datasets, respectively. To ensure that the prediction results of the three networks can be weighted together on the validation set, the validation sets must be the same size; thus, each validation set contained 294 samples (980 × 0.3 = 294). The specific number of datapoints is shown in Table 2.
Following the compilation of the dataset, the LSTM networks were constructed. In this experiment, a total of three sub-sequences were constructed, so three LSTM networks were required to be established correspondingly. The parameters of each network should be set reasonably according to the actual situation. Through preliminary experiments, we found that the single-layer LSTM network was sufficient to train and simulate the data. Therefore, to save operating costs and calculation space, a smaller model should be used in practice. The parameters of the three LSTM networks constructed in this study are shown in Table 3.
Table 3. Hyperparameter set.
The numbers of hidden cells in the three subnetworks were 10, 15, and 20, respectively. It should be noted that it was coincidental that the number of hidden cells was consistent with the size of the training window. The learning rate and number of epochs of the three networks were the same, i.e., 0.01 and 5000, respectively, and the Adam optimizer was chosen as the stochastic optimization algorithm.
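The requirement that all three networks share the same number of validation samples can be made concrete with a small sketch. It assumes, as an interpretation rather than a stated fact, that the validation size is 30% of the smallest dataset and that the split is chronological, with the last samples of each dataset held out; the function names are illustrative.

```python
def aligned_split_sizes(sample_counts, validation_fraction=0.3):
    """Return (n_train, n_val) per dataset, with one shared validation size.

    The validation size is taken from the smallest dataset so that every
    subnetwork predicts the same number of samples.
    """
    n_val = round(min(sample_counts) * validation_fraction)
    return [(n - n_val, n_val) for n in sample_counts]

def chronological_split(samples, n_val):
    """Hold out the last n_val samples for validation (assumed ordering in time)."""
    return samples[:-n_val], samples[-n_val:]

# Sample counts of the three sub-sequence datasets
sizes = aligned_split_sizes([990, 985, 980])

# Splitting a stand-in list of 990 samples
train, val = chronological_split(list(range(990)), sizes[0][1])
```

For the three datasets this gives 696, 691, and 686 training samples, each paired with 294 validation samples. Because every window ends at the same final epoch, holding out the last 294 samples of each dataset means the three networks predict the same target epochs.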

Experiment Results of Three Networks
In this study, data were collected from the XJSS station. The specific data preparation and distribution are shown in Table 2. The main hyperparameters of the three networks are shown in Table 3.
The training and prediction results of these three neural networks with different settings using different scale windows are shown in Figures 7-9, respectively.
In the first experiment, a sliding window of 10-1 and a single-layer LSTM with 10 hidden cells were used as the network framework. The loss curve is shown in Figure 7a. We can observe that after 5000 training epochs, the loss dropped to around 0.006. The training results and prediction results are shown in Figure 7b. The blue curve represents the original values, comprising 990 groups of data in total; the red curve denotes the training results, containing 696 groups of data; and the magenta curve shows the remaining 294 groups of data used for prediction.
In the second experiment, a sliding window of 15-1 and a single-layer LSTM with 15 hidden cells were used as the network framework. The loss curve is shown in Figure 8a. We can observe that after 5000 training epochs, the loss dropped to around 0.002. The training results and prediction results are shown in Figure 8b. The blue curve represents the original values, comprising a total of 985 groups of data; the red curve denotes the training results, containing 691 groups of data; and the magenta curve shows the remaining 294 groups of data used for prediction.
In the third experiment, a sliding window of 20-1 and a single-layer LSTM with 20 hidden cells were used as the network framework. The loss curve is shown in Figure 9a. After 5000 training epochs, the loss dropped to below 0.001. The training results and prediction results are shown in Figure 9b. The blue curve represents the original values, comprising a total of 980 groups of data; the red curve denotes the training results, containing 686 groups of data; and the magenta curve shows the remaining 294 groups of data used for prediction.
During the training of a neural network, a lower loss value is not always better. Although a smaller loss value indicates that the network fits the training data more closely, it can also reduce the generalization ability. In practical applications, the hyperparameters can be adjusted according to the data and the neural network structure; for example, a loss value of around 0.005 can maintain a good fit and generalization ability. In addition, the use of 5000 training epochs in this experiment was based on empirical values obtained after multiple training runs.

Experiment Summary
By weighted summation of the three groups of prediction results, 294 corresponding MSSW-LSTM prediction results were obtained, as shown in Figure 10. Figure 10 shows the comparison between the forecast results of MSSW-LSTM (blue) and the actual time series (red). It can be seen that the MSSW-LSTM forecast results are consistent with the true values.
To quantitatively evaluate the prediction accuracy of our proposed model, indexes (RMSE and MAE) were used to calculate the difference between the real and predicted values. In the three LSTM experiments, the network model was obtained by training nearly 700 datapoints, and the remaining 294 datapoints were predicted. The statistical results are shown in Table 4: the RMSE values of LSTM(1), LSTM(2), and LSTM(3) are 3.2292, 4.1424, and 3.9810, respectively, and the corresponding MAE values are 2.4252, 3.0239, and 3.0679. We can observe that a more complex network model does not necessarily yield a lower RMSE or MAE. For example, although LSTM(2) is more complex than LSTM(1), its RMSE is greater. Both the RMSE and MAE of MSSW-LSTM are the best among the four models. Specifically, relative to LSTM(1), LSTM(2), and LSTM(3), the RMSE is reduced by 2.1%, 23.7%, and 20.1%, and the MAE is reduced by 1.6%, 21.1%, and 22.2%, respectively.
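Percentage reductions of this kind are usually computed relative to the baseline value. The text does not spell out the formula, so the sketch below assumes the common definition (baseline minus improved, divided by baseline); the function name is illustrative. It is applied only to the two single-network RMSE values quoted above, to quantify the observation that LSTM(1) outperforms the more complex LSTM(2).

```python
def percent_reduction(baseline, improved):
    """Relative reduction of an error metric, in percent of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - improved) / baseline * 100.0

# RMSE of LSTM(2) and LSTM(1) as quoted in the text
rmse_lstm2 = 4.1424
rmse_lstm1 = 3.2292
gap = percent_reduction(rmse_lstm2, rmse_lstm1)  # roughly 22%
```

Under this definition, the RMSE of LSTM(1) is about 22% lower than that of LSTM(2), even though LSTM(2) has more hidden cells and a longer input window.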
In theory, through a single network, it is difficult to achieve optimal results even after multiple training runs. However, the MSSW-LSTM algorithm can combine the advantages of multiple networks and the data characteristics at different scales. The principle of this advantage is that the framework adopts the idea of measurement adjustment.

Discussion and Conclusions
In this study, a new forecasting framework, named MSSW-LSTM, comprising a multiscale sliding window (MSSW) and LSTM, was proposed for predicting GNSS time series. In the data preprocessing stage, the multiscale sliding window is used to form different training subsets, which can effectively extract the feature relationship under different scales, and facilitates mining the deep features of the data. The LSTM network can then effectively avoid the problem of gradient disappearance in the process of parameter solving. The MSSW-LSTM can use multiple LSTM networks to make simultaneous predictions, and obtains final results by weighting.
To verify the effectiveness of the MSSW-LSTM algorithm, 1000 daily solutions of the XJSS station in the Up component were selected for prediction experiments. The results of three groups of controlled experiments showed that the RMSE was reduced by 2.1%, 23.7%, and 20.1%, and MAE was decreased by 1.6%, 21.1%, and 22.2%, respectively. The experimental results showed that the proposed framework has a higher prediction accuracy and a smaller error.
It should be noted that the MSSW-LSTM method has significant flexibility. Researchers can easily construct appropriate subsets formed by multiscale windows according to different data characteristics. In addition, the LSTM networks may be the same or different, and may comprise a single layer or multiple layers. This feature helps researchers to select the most appropriate network model according to their own dataset characteristics, and to make use of the advantages of that network model. MSSW-LSTM is a general framework for prediction that can be extended to other fields, such as traffic flow prediction, weather forecasting, and air quality forecasting.