1. Introduction
Mining useful information from time-series data is an active research field that has attracted increasing attention in recent decades due to the rich application scenarios of time-series event prediction (TSEP), including mine hazard detection [1], abnormal event detection in nuclear power plants [2], medical diagnosis [3], river level forecasting [4], temperature forecasting [5], stock price forecasting [6], and so on. By predicting time-series events, it is possible to anticipate the future development trend of a time series, which in practical applications helps people analyze and make decisions. Traditional time-series forecasting methods use historical data to predict future values, which makes them unsuitable for event forecasting: an event in a time series usually spans a certain time range and manifests as a trend or regularity across a run of continuous values, which is essentially different from a single value. In contrast to numerical forecasting, event forecasting in time-series data has not been studied in depth. Due to the high dimensionality and noise of time-series data, mining useful information from sequence data remains a major challenge.
Traditional TSEP methods use historical data to predict future time-series values and then classify the predicted subsequence to obtain event predictions. Such a framework mainly involves four tasks [7]: preprocessing, model derivation, time-series prediction, and event detection. The main time-series prediction methods are traditional forecasting models such as ARIMA, SARIMA, and GARCH; the main event detection methods include decision trees (DT), artificial neural networks (ANN), and support vector machines (SVM). However, since time-series events usually span a certain time range, these traditional forecasting models, which predict a single value and then classify it, are susceptible to small disturbances in the training data and are highly unstable [8]. This type of method ignores the characteristics of each subsequence itself, so its predictive ability is limited.
In recent years, a new class of event prediction methods, represented by the graph model Time2Graph [9], has received widespread attention. Building on Time2Graph, the EvoNet model [10] more successfully captures the time-varying relationships within time series. The specific steps are: represent and identify representative patterns of subsequences, capture the transition relationships between the patterns, use the pattern features as the input of a classifier, and take the classification result as the prediction of future events. The key to this kind of prediction is how to extract the characteristics of time subsequences; the above two methods extract them by establishing state graphs. Alternatively, a time series can be interpreted as a new state sequence whose sequential dependencies are then modeled. The literature provides such a method, X-HMM (XGBoost-HMM) [11], which takes the hidden state as the input of the classifier. Compared with the traditional framework, this kind of method omits the time-series forecasting step, thus avoiding the bias that forecasting introduces; however, it classifies each subsequence set separately, ignoring the long-range dependencies between subsequence sets.
Inspired by sequence labeling methods, in which deep models and probabilistic graphical models capture the dependencies between subsequence sets well [12], transferring the idea of sequence labeling to event prediction can greatly improve forecasting accuracy. As a classic sequence model, LSTM has a solid theoretical foundation [13]. Using an LSTM model for event labeling prediction, and then using a conditional random field (CRF) to smooth the prediction results, is expected to further improve the accuracy of event prediction and help solve various application problems in this field.
There are shape similarities and change similarities between different subsequences, so pattern recognition can be used to extract representative patterns from a sequence and then express each subsequence as a combination of patterns. Studies have shown [10] that clustering is superior to other pattern recognition methods, such as SAX-VSM [14] and Fast Shapelets [15], when extracting sequence patterns for relationship modeling. The cluster method is easy to implement and has other advantages, such as high classification accuracy [16], so this paper mainly uses clustering for pattern recognition to obtain the transition relationships between patterns. However, expressing a subsequence as a combination of patterns introduces problems such as excessive data dimensionality and redundant information. The XGBoost algorithm [17] has the advantages of parallelization, high scalability, and speed, so this article uses it for feature selection and obtains soft classification probabilities for the subsequence sets to assist LSTM-CRF in event prediction. Thus, a novel prediction model based on Cluster-XGBoost-LSTM-CRF (CX-LC) is constructed. Compared with previous models, this model can capture the dependencies between subsequence sets well and takes the interaction of different events into account.
In order to verify the effectiveness of CX-LC, this article conducted experiments on five real data sets. The prediction results show that CX-LC is superior to several other models in event prediction. The main contributions of this research are:
This paper transforms the problem of event prediction into a problem of sequence labeling and captures the dependencies between subsequence sets;
The CX-LC model developed in this paper can effectively extract the features in the original data set and smooth and optimize the prediction results;
This paper conducted experiments on five data sets to demonstrate that the CX-LC model makes more accurate predictions.
The main content of this paper is structured as follows:
Section 2 introduces the relevant definitions of time series event prediction, as well as the model derivation and theoretical research of CX-LC;
Section 3 presents the experimental part of this paper;
Section 4 presents the conclusion and prospect of this paper.
3. Experiment
This article applied the CX-LC model to the prediction of time-series events and aimed to explore and answer the following two questions:
Q1: Compared with other advanced classification event prediction methods, how does CX-LC perform on the same prediction task?
Q2: Do the pattern recognition, feature selection, and smoothing refinement parts of the CX-LC framework really improve the prediction results?
3.1. Introduction to Data Sets and Prediction Tasks
This article used five real-world data sets for experimental exploration. All five data sets are public data from Kaggle:
- 1. DJIA 30 Stock Time Series;
- 2. Web Traffic Time Series Forecasting;
- 3. Air Quality Data in India (2015–2020);
- 4. Daily Climate time series data.
Except for the WebTraffic data set, the rest are small-sample data sets.
Table 1 shows the relevant information of these data sets.
DJIA30: This data set is a stock time-series. It contains the stock price data of 30 DJIA companies over the past 13 years (518 weeks in total), with five trading days each week; each trading day records four observations: the opening price, trading volume, and the highest and lowest prices reached that day. If the slope of one of the four observations of a stock within a week is greater than 1, the price is judged to fluctuate abnormally that week. The prediction task is to predict whether there will be abnormal price fluctuations in the next week, based on the observed values of the past year (50 weeks).
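The slope rule can be sketched as follows (an illustrative reading of the stated definition; the threshold of 1 comes from the text, while the helper name and toy data are hypothetical):

```python
import numpy as np

def weekly_abnormal(week_obs, threshold=1.0):
    """week_obs: (4, 5) array, four observations over five trading days.
    Flags the week if any observation's least-squares slope exceeds the threshold."""
    days = np.arange(5)
    slopes = [np.polyfit(days, obs, 1)[0] for obs in week_obs]
    return any(s > threshold for s in slopes)

flat = np.tile(np.array([10.0, 10.1, 10.0, 9.9, 10.0]), (4, 1))
steep = flat.copy()
steep[2] = [10, 12, 14, 16, 18]   # slope 2 on one observation
print(weekly_abnormal(flat), weekly_abnormal(steep))  # False True
```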
WebTraffic(Web): This data set is a time-series of network traffic. It contains the number of views of 50,000 Wikipedia articles over the past 2 years (26 months in total), with 30 days per month and one observation recorded per day. When the slope of an article's observation curve within a month is greater than 1.0, the article's reading volume is judged to have increased rapidly that month. The prediction task is to predict whether there will be a rapid increase in reading volume in the next month, based on the observations of the past year (12 months).
PM2.5: This data set is a time-series of air quality. It contains the PM2.5 concentrations recorded by 38 air monitoring stations over the past 3 years (130 weeks in total), with 7 days per week and one observation recorded per day. If the mean of a station's observations in a week is too large (more than two standard deviations above the overall mean), the PM2.5 concentration is judged to be abnormal that week. The prediction task is to predict whether there will be an abnormal change in concentration in the next week, based on the observed values of the past five months (20 weeks).
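One plausible reading of this labeling rule, assuming "two variances" means two standard deviations above the overall mean (an interpretation for illustration, not the authors' stated code):

```python
import numpy as np

def abnormal_weeks(daily, sigma_mult=2.0):
    """daily: 1-D array of daily PM2.5 readings, length a multiple of 7.
    Flags a week whose mean exceeds the overall mean by sigma_mult std devs."""
    weeks = daily.reshape(-1, 7)
    mu, sd = daily.mean(), daily.std()
    return weeks.mean(axis=1) > mu + sigma_mult * sd

# Toy series: 10 weeks of readings, with one heavily polluted week injected.
rng = np.random.default_rng(0)
series = rng.normal(50, 5, 7 * 10)
series[21:28] += 60          # week index 3 spikes well above baseline
print(abnormal_weeks(series))
```

Only the injected week clears the two-standard-deviation threshold; the ordinary weeks stay well below it.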
CO: This data set is a time-series of air quality. It contains the CO concentrations recorded in 24 Italian cities over the past 4 months (2664 h in total), with 24 h per day and one observation recorded per hour. If the difference between the maximum and minimum values observed in a city within a day is too large, the city's CO concentration is judged to be abnormal that day. The prediction task is to predict whether there will be an abnormal change in concentration on the next day, based on the observed values of the past three weeks (20 days).
Temperature(Temp): This data set is a temperature time-series. It contains the temperatures of 100 representative cities around the world over the past 4 years (222 weeks in total), with 7 days per week and one observation recorded per day. If the difference between the maximum and minimum values observed in a city within a week is too large, the city's temperature is judged to be abnormal that week. The prediction task is to predict whether there will be an abnormal temperature change in the next week, based on the observed values of the past year (50 weeks).
3.2. Baseline Method
This article compares the proposed CX-LC model with eight other models:
Classification event prediction methods. For the prediction of time-series events, most current research treats each subsequence set as an independent individual, extracts the information in the subsequence set as the input of a classifier, and uses the event prediction result as the output of the classifier. This article uses the most advanced frameworks of this kind as baseline models. X-HMM [11] is a sequential model: XGBoost captures the relationships between different observation features, and then an HMM finds the most likely hidden state sequence for a given observation sequence, from which event predictions are made. Time2Graph [9] adopts shapelets to extract states; it aggregates the graphs at different times into a static graph and conducts DeepWalk to learn the graph's representations, which then serve as features for event prediction. The Evolutionary State Graph Network (EvoNet) [10] models both node-level (state-to-state) and graph-level (segment-to-segment) propagation, captures node-graph (state-to-segment) interactions over time, and uses these for event prediction. Since the original EvoNet model combines and shuffles the subsequences of different samples before training the classifier, in order to highlight the comparison, this article also added a variant of EvoNet that trains on different samples separately. These four methods are all classification algorithms that do not consider the temporal relationships between subsequence sets.
Labeling event prediction methods. The CX-LC model proposed in this paper first uses a clustering algorithm for pattern recognition, uses XGBoost for feature selection, then uses an LSTM for event prediction, and finally uses a CRF model to improve the continuity of the prediction results. To explore whether the first, second, and last steps improve prediction accuracy, this article tested, on the five data sets, the XGBoost-LSTM-CRF (X-LC) model with the pattern recognition step removed, the cluster-LSTM-CRF (C-LC) model without the feature selection step, and the cluster-XGBoost-LSTM (CX-L) model without smoothing. As expected, the complete CX-LC model achieved better forecast results.
3.3. Implementation Details
This article conducted experiments on five data sets and applied a time-series missing-value filling algorithm based on matrix decomposition to interpolate missing values [30]. Each data set was divided into a training set and a test set at a ratio of 8:2; the training set was used to train the model and the test set for model evaluation. During optimization, each model went through 100 iterations. The activation function of the neural network was selected by traversal, using the tanh activation function in the fully connected layer of the LSTM. Similarly, other hyperparameters were determined by empirical judgment and traversal-based tuning.
3.4. Performance Comparison and Discussion
3.4.1. Performance Comparison
In this section, the performance of CX-LC and the other models is compared. Because the ratio of positive and negative samples is unbalanced, the Recall ratio, Precision ratio, and F1-score were used together as evaluation indicators. All reported values are the average results of five repeated experiments run in the same environment.
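For reference, the three indicators for the positive (event) class can be computed as follows (a standard sketch, not tied to the paper's evaluation code):

```python
def prf(y_true, y_pred):
    """Precision, recall, and F1 for the positive (event) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Imbalanced example: 2 events among 10 steps, one detected plus one false alarm.
y_true = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
print(prf(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

Reporting all three together matters here because, with so few positives, a model can score high precision while missing most events, which accuracy alone would hide.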
Table 2 shows the prediction results.
By comparing the results of the labeling and classification prediction methods, Q1 can be answered. The results in Table 2 show that, except on the WebTraffic and Djia30 data sets, the labeling prediction results are basically better than the classification prediction results on the other three data sets; converting the event prediction problem into a sequence labeling problem can greatly improve the prediction effect. Valuable information is still transferred between subsequence sets, and the labeling prediction method predicts the events of the subsequence sets in chronological order, so the prediction at each moment is related to the prediction at the previous moment. Performing event prediction over the subsequence sets in order can thus capture the important correlations in the temporal propagation process.
Comparing the classification prediction methods, EvoNet and Time2Graph have a better classification effect than the sequential model X-HMM. Time2Graph models the aggregated static graph, and the EvoNet model performs node-graph interaction during time-graph propagation, making it more suitable for the temporal modeling of an evolutionary state graph. This indicates that, in future research, we can try to build a graph model to capture the characteristic information of the subsequences and then carry out label prediction.
A vertical comparison of the classification results shows that the Precision ratio of each model is higher than its Recall ratio, except on the Djia30 data set, which indicates that the models have a low misjudgment rate for positive events but cannot detect positive events comprehensively.
By comparing the results of the four labeling prediction methods, Q2 can be answered. The results in Table 2 show that, on all data sets, the prediction performance of the CX-LC model is better than that of the other three models.
Comparing CX-LC and X-LC, the prediction performance of CX-LC is much higher than that of the X-LC model, which means that, when making labeling predictions, the performance of the model is easily affected by the input features. Using the original data as input ignores the influence of each subsequence itself on the event, whereas the pattern is the most direct reflection of the occurrence of an event. Expressing the subsequence as a weighted combination of patterns can effectively extract the important information in the subsequence.
Comparing CX-LC and C-LC, it can be seen that the prediction performance of CX-LC is higher than that of the C-LC model: feature selection does improve the prediction accuracy by avoiding the influence of redundant information on the prediction model.
Comparing CX-LC and CX-L, it can be seen that the prediction performance of CX-LC is slightly higher than that of the CX-L model, indicating that considering the transition relationship between events can effectively correct the prediction results of LSTM.
In summary, CX-LC achieves nearly the best performance across the different data sets, which suggests that CX-LC is broadly applicable.
3.4.2. Discussion
It is not difficult to find that CX-LC is sometimes better than EvoNet, sometimes worse, and sometimes very similar. To further examine the predictive performance of the two models and analyze their classification ability on different samples, this article discusses the misclassified samples of the two models on the PM2.5 and Djia30 data.
For the PM2.5 prediction task, among the negative samples that CX-LC predicted as positive, the top 20 samples with the highest frequency of occurrence are shown in Figure 6. According to this figure, each historical window used as a prediction sample contains 140 days (20 weeks) of PM2.5 data with significant fluctuations, and a high weekly mean appears at every stage of the data. When CX-LC encounters such historical data, it easily predicts negative samples as positive. According to the definition in Section 3.1, a positive sample means a high level of PM2.5. Under long-term high PM2.5 readings, the relevant departments may intervene to control air quality, rapidly reducing the PM2.5 value in a short time. In such cases the CX-LC model often cannot adapt well, and how to better predict such sudden changes is a focus of future research. Similarly, for the PM2.5 data set, among the positive samples that CX-LC predicted as negative, the top 20 samples with the highest frequency of occurrence are shown in Figure 7. Observing these 20 samples, it can be found that the 140-day historical data either fluctuate significantly only in the middle stage (e.g., the 19th, 26th, 57th, and 101st weeks) or are generally flat, with periodic fluctuations and no obvious trend (e.g., the 171st, 180th, 182nd, and 632nd weeks). CX-LC predicts poorly on samples with stable fluctuations and low mean values. Compared with predicting negative cases as positive, CX-LC is less likely to predict positive cases as negative, because an increase in the PM2.5 value usually takes a certain amount of time; it is rare for the air to deteriorate suddenly, as happens, for example, when vehicle travel and exhaust emissions increase during holidays.
As shown in Table 2, as a classification model, EvoNet also has good classification performance on the PM2.5 data set. For comparison with CX-LC, Figure 8 and Figure 9 show EvoNet's two types of classification errors.
Observing Figure 6 and Figure 8, it can be seen that both models have more difficulty predicting sudden decreases in PM2.5 values. However, compared with EvoNet, CX-LC predicts samples with a slow upward trend better (e.g., weeks 17, 84, 96, 127, and 325). Observing Figure 7 and Figure 9, it can be seen that both models have certain shortcomings in predicting sudden increases in PM2.5 values. However, compared with EvoNet, CX-LC predicts samples with smoother fluctuations better (e.g., weeks 59, 69, 336, and 526). Because CX-LC considers the longer-range dependencies in the data, when predicting each sample it considers not only the sample itself but also all previous data. For samples with no obvious or only slow trends, CX-LC can make correct predictions based on the earlier history.
For small-scale data sets, CX-LC can often achieve better prediction results, but on large-scale data sets EvoNet performs better. To analyze CX-LC's shortcomings, this article uses the Djia30 data set to discuss sample features. As shown in Table 2, when predicting the Djia30 data set, EvoNet has a higher Precision ratio than CX-LC; based on this result, the classification performance of the two models on positive cases is discussed. As shown in Figure 10, there are 40 typical positive samples in the Djia30 data set that CX-LC misclassified as negative but EvoNet correctly classified as positive. As the figure shows, the weekly historical data of these samples fluctuate significantly. Since the historical length is 50, each sample contains sufficiently rich weekly slopes, while the EvoNet model focuses on extracting feature transformations within a sample and trains its classifier on an unordered set of subsequences. Therefore, when the sample size is large and the prediction history is long, EvoNet can achieve better results than CX-LC.
4. Conclusions and Future Work
Based on the idea of sequence labeling, this paper proposed a novel time-series event prediction framework, cluster-XGBoost-LSTM-CRF (CX-LC). The model captures the pattern information of subsequences through pattern recognition, and the state of the pattern is the key to determining the occurrence of an event. The labeling prediction of the LSTM captures the interactions between subsequence sets and the long-term dependencies between events. To verify the effectiveness of CX-LC, we conducted extensive experiments on five real-world data sets; the results show that our model outperforms other popular benchmark methods. On this basis, the time-series event prediction problem was transformed into a labeling problem, which completes the prediction task well. Across the five data sets, the CX-LC model predicts the change trends of temperature and air quality well, indicating that the model is better suited to relatively stable data sets with less human intervention; it can therefore provide a relevant basis for atmospheric environmental impact assessment, planning, management, and decision-making. However, for more complex and larger data sets, the prediction effect of such models is poorer. How to capture more complex time-series patterns and how to handle unbalanced samples still need to be studied.
The CX-LC model is only a preliminary exploration of applying the idea of sequence labeling to the problem of event prediction. The model still has shortcomings, and there are directions that can continue to be explored:
- 1. When performing feature extraction on subsequence sets, we can consider the pattern changes between subsequences, construct a pattern evolution graph, and enrich the feature information;
- 2. The LSTM model is a relatively simple and conventional sequence labeling algorithm, and only the sequential structure is considered in its computation. If the pattern evolution graph is successfully constructed, we can try a more comprehensive network structure, such as a GNN or GCN;
- 3. Event prediction here did not consider the periodicity and seasonality of the original data, although such characteristics can be captured when fitting the CX-LC model. If these effects were reflected directly in the structure of the model, its forecasting accuracy might be improved;
- 4. The results in Section 3.4 show that CX-LC performs better than the other models on small-sample data sets, but its prediction on large data sets still needs to be improved.
There is still much work to be carried out on improving the sequence-labeling-based framework to make it more suitable for event prediction. In the future, we will also look for better event prediction methods for different types of time-series data sets.