Article

SVSeq2Seq: An Efficient Computational Method for State Vectors in Sequence-to-Sequence Architecture Forecasting

1 School of Information Science and Engineering, Shandong University, Qingdao 266237, China
2 State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(2), 265; https://doi.org/10.3390/math12020265
Submission received: 9 December 2023 / Revised: 5 January 2024 / Accepted: 10 January 2024 / Published: 13 January 2024

Abstract
This study proposes an efficient method for computing State Vectors in Sequence-to-Sequence (SVSeq2Seq) architecture to improve the performance of sequence data forecasting; the method associates each element with all other elements instead of relying only on nearby elements. First, the dependency between two elements is adaptively captured by calculating the relative importance between hidden layers. Second, tensor train decomposition is used to address the curse of dimensionality. Third, we select seven representative baseline models for data prediction and compare them with our proposed model on six real-world datasets. The results show that the Mean Square Error (MSE) and Mean Absolute Error (MAE) of our SVSeq2Seq model exhibit significant advantages over the seven baseline models on three of the datasets, i.e., weather, electricity, and PEMS, with MSE/MAE values as low as 0.259/0.260, 0.186/0.285, and 0.113/0.222, respectively. Furthermore, the ablation study demonstrates that the SVSeq2Seq model possesses distinct advantages in sequential forecasting tasks: replacing SVSeq2Seq with LPRcode and NMTcode increased the MSE by factors of 18.05 and 10.11 and the MAE by factors of 16.54 and 9.8, respectively. In comparative experiments with support vector machines (SVM) and random forest (RF), the SVSeq2Seq model improves performance under the MSE metric by factors of 56.88 on the weather dataset and 73.78 on the electricity dataset. These experimental results demonstrate both the rationality and the versatility of the SVSeq2Seq model for data forecasting.

1. Introduction

Sequential data forecasting is a fundamental inquiry in science: to what extent can we anticipate the future from historical data? Forecasting future data states from past data states is an intriguing and arduous task in numerous fields, such as energy and smart grid management, sensor network monitoring, and disease propagation analysis [1].
At present, several forecasting techniques have been developed for sequential data forecasting, such as Recurrent Neural Network (RNN)-based methods [2,3], Transformer-based methods [4,5], Linear-based methods [6,7], and TCN-based methods [8,9], which have achieved considerable success across a wide range of fields. Transformer-based methods such as Informer [10], Autoformer [4], and Crossformer [5] excel at managing long-range and complex dependencies. Nevertheless, they often require substantial computational resources and memory, which increases the risk of overfitting large Transformer models when data are limited. Linear-based methods such as DLinear [6] and TiDE [7] exhibit high computational efficiency and strong interpretability; however, their reliance on linear transformations makes them better suited to simple sequence tasks, and their feature extraction ability is weak. TCN-based methods such as SCINet [8] and TimesNet [9] have clear advantages in computational efficiency and parameter count, but they tend to prioritize local features and may fall short in identifying long-term dependencies.
Notably, in comparison to the above methods, RNN-based methods such as Long Short-Term Memory (LSTM) [2] and Gated Recurrent Unit (GRU) [3] inherently model sequential data with remarkable parameter efficiency and exhibit superior capabilities in handling extended sequences, which has attracted increasing attention in the field of sequential data forecasting. For example, in 2014, Cho et al. [3] proposed the RNN encoder–decoder model, which uses the hidden states as state vectors: one RNN encodes a sequence of symbols into a fixed-length vector representation, and another RNN decodes that representation into another sequence of symbols. In 2022, Li et al. [11] introduced the Multi-Dimensional Spatial-Temporal Recurrent Neural Network (MST-RNN), an approach that leveraged both the temporal duration and the semantic tag dimensions of Points of Interest (POIs) in each layer of the neural network framework. In 2023, Jadhav et al. [12] introduced two methods for analyzing dynamic relationships within real-world sequential data: the Internal Time-Varying sequence model (ITV model) and the External Time-Varying sequence model (ETV model). Their models were distinguished by an automated basis expansion module that dynamically adapted internal or external parameters at each time step, thereby minimizing computational complexity. In addition, RNN-based methods often use a Sequence-to-Sequence structure to handle dialogue, translation, and prediction tasks [2]. In 2014, Sutskever et al. [13] at Google proposed the Sequence-to-Sequence structure as a common structure for RNN-based methods; it can be understood as consisting of three parts: the encoder, the decoder, and the state vector connecting them. The encoder comprehends the input sequence to produce the state vector, and the decoder uses the state vector as an input to generate an output that satisfies the given task requirements [2,3].
The state vector is a key component of RNNs built on the Sequence-to-Sequence structure: it is the only channel connecting the encoder output to the decoder input and therefore has a significant impact on the final output [14], so improving its computation is crucial for enhancing complex forecasting performance. At present, many elaborate state vector designs have been proposed for RNN-based models to address the wide range of sequential forecasting applications. For example, Serban et al. [15] and Sordoni et al. [16] both used a context RNN for efficient sequence data prediction. Notably, these methods still struggle to capture global patterns and exacerbate the negative impact of noisy data on long-term dependencies, resulting in inadequate long-term forecasting. To improve on these shortcomings, Serban et al. proposed the Latent Variable Hierarchical Recurrent Encoder–Decoder (VHRED) model [17], which incorporated a latent variable into the decoder and was trained by maximizing the log-likelihood. In addition, Weston et al. [18] introduced a class of memory networks that combined successful learning strategies with a memory component. Further, Fernando et al. [19] proposed a tree memory network to jointly model long-term relationships, where the output of the input module served as the state vector. These methods lack personalized state vectors for the different hidden layers of the decoder; consequently, they often struggle to achieve superior performance in complex prediction tasks. To further improve the performance of data forecasting, Sennrich et al. [14] calculated the state vector by computing weight vectors and summing the hidden states of the encoder. Although this method considered all the input vectors, it did not consider their positional information. In 2023, Cao et al. [20] introduced an end-to-end encoder–decoder structure that uses a meta-path augmented residual information matrix to preserve the structure evolution mechanism and semantics in heterogeneous information networks (HINs), feeding it to the encoder to obtain a low-dimensional embedding representation of the nodes.
Consequently, improving the state vector calculation method should enable RNN-based methods to achieve commendable predictive performance in sequence forecasting. Google researchers first introduced the Transformer, an architecture that relies solely on the attention mechanism, which has garnered significant achievements in areas such as natural language processing [21] and computer vision [22]. Many elaborate Transformer variants have since been proposed, such as Autoformer [4], non-stationary Transformers [23], and the PatchTST model [24], to address ubiquitous serial forecasting applications. These attention mechanism-based methods fully leverage the positional information of all vectors. The attention mechanism has not only shown good performance in sequence data prediction but has also been used to automatically synthesize high-quality images from textual descriptions; for example, the AttnGAN method [25] uses an attentional model to generate sub-regions of an image based on the description. Inspired by the immense success of the attention mechanism across these fields, we expect that using it to compute the state vector in the Sequence-to-Sequence structure will effectively improve prediction performance.
In this paper, we propose an innovative and efficient method for computing State Vectors in Sequence-to-Sequence (SVSeq2Seq) architecture. In this approach, the attention mechanism is carefully adapted as the computational method for the state vectors of the Sequence-to-Sequence structure, enhancing its sequence modeling capabilities. The proposed SVSeq2Seq effectively exploits the relationship between the hidden layers of the decoder and those of the encoder, thereby demonstrating superior capabilities in predicting sequential data. SVSeq2Seq comprises state vector weighting and tensor train networks. Our contributions can be summarized as follows:
(i) We use the relative significance between hidden layers to dynamically capture the interdependencies among elements, compute weight vectors for the encoder's hidden layers based on these dependencies, and then provide adaptive state vector weights based on the computed weight vectors.
(ii) We apply tensor train decomposition to fit the expansion tensor, which successfully counteracts the problem of high dimensionality.
(iii) We confirmed the outstanding performance of SVSeq2Seq on six real-world datasets, showing significant advantages over the other seven baseline models on three of them. We also performed ablation experiments on three of these datasets, confirming the rationality and generalizability of SVSeq2Seq. Finally, we compared the prediction performance before and after removing all state vector components, which demonstrates the importance of the state vector in the Sequence-to-Sequence structure.

2. Model Descriptions and Preliminaries

In this study, we introduce an efficient computational method for state vectors in Sequence-to-Sequence structure forecasting, i.e., SVSeq2Seq, which incorporates state vector weights and tensor train networks. As shown in Figure 1, SVSeq2Seq records the hidden layers of the encoder (red line) and computes their correlation with each hidden layer of the decoder (green line), thus providing a personalized state vector for each hidden layer of the decoder.

2.1. Encoder–Decoder Architecture of the RNN

First, the fundamental framework of the encoder–decoder architecture of the RNN is outlined, and based on this, an efficient strategy for state vector computation is developed.
In the encoder–decoder architecture, the encoder processes the input vector $x = (x_1, \ldots, x_T)$ and transforms it into a state vector $c$. The most common approach is to use an RNN, as shown in Equations (1) and (2):
$h_t = f(x_t, h_{t-1})$ (1)
and
$c = q(\{h_1, \ldots, h_T\})$, (2)
where $h_t \in \mathbb{R}^n$ is the hidden state at time $t$, and $c$ is a vector generated from the sequence of hidden states. The function $f$ can be a non-linear map, for example a naïve RNN, an LSTM, or a GRU, and $q(\{h_1, \ldots, h_T\}) = h_T$.
The decoder derives the output $y_t$ using the state vector $c$. In translation tasks, it is common for the decoder to define a probability over the translation $y$ by decomposing the joint probability into ordered conditionals, as shown in Equation (3):
$p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c)$, (3)
where $y = (y_1, \ldots, y_{T_y})$. Using an RNN, each conditional probability is formulated as shown in Equation (4):
$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, h_t, c)$, (4)
where $g$ is a non-linear function that outputs the probability of $y_t$ and $h_t$ is the hidden state of the RNN.
Within the Sequence-to-Sequence framework, the decoder can generate multiple outputs $y_t$ by interpreting the state vector $c$.
As described above, the encoder produces the state vector $c$, which the decoder then uses to produce the desired output $y$. Given the significant influence of the state vector $c$ on the output within the encoder–decoder framework, we propose an efficient approach to compute the state vector in the Sequence-to-Sequence structure.
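To make Equations (1)–(4) concrete, the following minimal sketch implements the recurrence with PyTorch GRUs. The layer sizes, the use of the final encoder state as the state vector $c$, and the way the decoder conditions on $c$ only through its initial hidden state are illustrative assumptions rather than the exact configuration used in this paper.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)
    def forward(self, x):                    # x: (batch, T, input_dim)
        h_all, h_T = self.rnn(x)             # h_all = {h_1, ..., h_T}; h_T plays the role of c = q({h_1, ..., h_T})
        return h_all, h_T

class Decoder(nn.Module):
    def __init__(self, output_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(output_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, output_dim)
    def forward(self, y_prev, h_prev):       # one decoder step; the state vector enters through h_prev
        out, h_new = self.rnn(y_prev.unsqueeze(1), h_prev)
        return self.proj(out.squeeze(1)), h_new

enc, dec = Encoder(input_dim=8, hidden_dim=32), Decoder(output_dim=8, hidden_dim=32)
x = torch.randn(4, 24, 8)                    # a batch of 4 input sequences of length T = 24
_, c = enc(x)
y_prev, h = x[:, -1, :], c                   # start decoding from the last observed value
preds = []
for _ in range(12):                          # forecast a horizon of 12 steps
    y_prev, h = dec(y_prev, h)
    preds.append(y_prev)
y_hat = torch.stack(preds, dim=1)            # (batch, 12, output_dim)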

2.2. State Vector Computation

In our efficient state vector computation approach, we leverage the correlation between the decoder's hidden layer $h'$ and the encoder's hidden layers $h$ to compute the weight $\alpha$ of each $h$. This allows us to generate a unique state vector $c$ for each $h'$. Figure 2 illustrates the entire process of computing the state vector $c$.
The state vector $c$ depends on the sequence of hidden states $\{h_1, h_2, \ldots, h_T\}$ onto which the encoder maps the input $(x_1, \ldots, x_T)$. Each $h_j$ contains information about the whole input, with a strong focus on the parts surrounding the $j$-th input. The state vector $c_i$ is computed as a weighted sum of $\{h_1, h_2, \ldots, h_T\}$, as shown in Equation (5):
$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$. (5)
We assume that the decoder input is $x'_1, x'_2, \ldots, x'_t$. Since $x'_1$ and $h'_1$ are known, we can calculate $y_1$. This allows us to set $x'_2 = y_1$, indicating that the output of one time step becomes the input of the next. One might wonder about the origin of $x'_1$. Typically, the special character "<BOS>" (beginning of sentence) is used to denote the beginning of the input $x'_1$. The terminating condition mirrors this approach: the forecasting process terminates when the output is the special character "<EOS>" (end of sentence).
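The decoding loop just described can be summarized in a few lines of Python. Here decoder_step is a hypothetical callable standing in for one application of $g(y_{t-1}, h'_t, c)$, and the token ids for <BOS> and <EOS> are assumed values.

BOS, EOS = 0, 1                              # assumed ids for the special tokens

def greedy_decode(decoder_step, c, max_len=50):
    # decoder_step(prev_token, state) -> (next_token, new_state)
    token, state, outputs = BOS, c, []
    for _ in range(max_len):
        token, state = decoder_step(token, state)
        if token == EOS:                     # termination condition described above
            break
        outputs.append(token)
    return outputs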
Let us use $h'_1$ as a case study to illustrate the computation of the weights $\alpha_{1j}$. To start, we establish two weight matrices, denoted $W^q$ and $W^k$, as shown in Equation (6):
$q_1 = W^q \cdot h_1^{extend}$, $k_1 = W^k \cdot h_1^{extend}$, $h_1^{extend} = \{h'_1, h_1, \ldots, h_T\}$. (6)
Let us express this in vector form in Equation (7):
$Q = W^q \cdot H_i^{extend}$, $K = W^k \cdot H_i^{extend}$, $H_i^{extend} = \{h'_i, h_1, \ldots, h_T\}$. (7)
We use $Q$ and $K$ to compute the correlation between $h'_i$ and the elements of $\{h_1, \ldots, h_T\}$ within $H_i^{extend}$, denoted $a_{ij}$ and computed as in Equation (8):
$a_{ij} = (q_i)^{\top} k_j$. (8)
Let us express this in vector form in Equation (9):
$A = K^{\top} \cdot Q$. (9)
The softmax operation is applied to the matrix $A$, ensuring that its elements fall within the range [0, 1]; the result is denoted $A'$ in Equation (10):
$A' = \mathrm{softmax}(A)$. (10)
The state vector $c$ is computed by substituting the elements $\alpha_{ij}$ of the matrix $A'$ into Equation (5).
To illustrate the computation process of the state vectors $c$, the pseudocode is provided in detail in Algorithm 1.
Algorithm 1: State vectors $c$
Input: The encoder input vector $(x_1, \ldots, x_T)$ and $H_i^{extend}$
Output: State vectors $c$
1 for $x_i$ in $(x_1, \ldots, x_T)$ do
2  Calculate $q_i$ and $k_j$ using Equation (7);
3  Calculate $a_{ij}$ using Equation (8);
4  Express $a_{ij}$ in vector form using Equation (9);
5  Normalize the elements of $A$ to [0, 1] using Equation (10);
6  Substitute the elements $\alpha_{ij}$ of $A'$ into Equation (5);
7 end
8 Return $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$
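The following NumPy sketch walks through Equations (5)–(10) for a single decoder step: the decoder state is prepended to the encoder states to form $H_i^{extend}$, queries and keys are produced by the two weight matrices, the scores against the decoder query are normalized with a softmax, and the state vector is the weighted sum of the encoder states. The dimensions, the random initialization, and the choice to keep only the encoder positions after the softmax are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d, T = 16, 8, 24                          # hidden size, query/key size, input length
h_enc = rng.standard_normal((T, n))          # encoder hidden states h_1, ..., h_T
h_dec = rng.standard_normal(n)               # decoder hidden state h'_i
W_q = rng.standard_normal((d, n))            # weight matrices W^q and W^k
W_k = rng.standard_normal((d, n))

H_ext = np.vstack([h_dec[None, :], h_enc])   # H_i^extend, shape (T+1, n)
Q = H_ext @ W_q.T                            # queries, Equation (7)
K = H_ext @ W_k.T                            # keys, Equation (7)

scores = K @ Q[0]                            # a_ij = q_i^T k_j against the decoder query, Equation (8)
scores = scores[1:]                          # keep only the T encoder positions
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                         # softmax normalization, Equation (10)

c_i = alpha @ h_enc                          # weighted sum of encoder states, Equation (5)
print(c_i.shape)                             # (16,)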

2.3. Tensor Train Networks

As the number of hidden layers in the encoder and decoder increases, the size of $H^{extend}$ also expands rapidly. This expansion complicates the computation of Equations (7) and (9) and consequently requires a larger number of parameters for training. To overcome this difficulty, we use tensor networks to approximate $H^{extend}$. Such networks encode a structural decomposition of tensors into low-dimensional components and have been shown to provide the most general approximation to smooth tensors (Figure 3) [26].
A tensor train model decomposes a $P$-dimensional tensor $H^{extend}$ into a network of sparsely connected low-dimensional tensors $\{A^p \in \mathbb{R}^{r_{p-1} \times n_p \times r_p}\}$, as shown in Equation (11):
$H_{i_1 \cdots i_P} = \sum_{\alpha_1, \ldots, \alpha_{P-1}} A^1_{\alpha_0 i_1 \alpha_1} A^2_{\alpha_1 i_2 \alpha_2} \cdots A^P_{\alpha_{P-1} i_P \alpha_P}$ (11)
In Equation (11), $\alpha_0 = \alpha_P = 1$. When $r_0 = r_P = 1$, the $r_p$ are called the ranks of the tensor train. With tensor trains, we can reduce the number of parameters from $(q + k)^{T+1}$ to $(q + k) R^2 (T + 1)$, where $R = \max_p r_p$ is an upper bound on the tensor train rank. Thus, a major advantage of tensor trains is that they do not suffer from the curse of dimensionality, in sharp contrast to many classical tensor decomposition models, such as the Tucker decomposition.
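As an illustration of how a dense tensor can be compressed into the cores $A^p$ of Equation (11), the following sketch implements a basic TT-SVD factorization and reconstructs the tensor from its cores. The toy tensor, the rank cap r_max, and the reliance on plain SVD are illustrative choices, not the settings used in this paper.

import numpy as np

def tt_svd(tensor, r_max):
    # Decompose `tensor` into tensor-train cores with ranks bounded by r_max.
    dims = tensor.shape
    cores, r_prev = [], 1
    C = tensor.reshape(1, -1)
    for p in range(len(dims) - 1):
        C = C.reshape(r_prev * dims[p], -1)
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(r_max, len(S))
        cores.append(U[:, :r].reshape(r_prev, dims[p], r))   # core A^p, shape (r_{p-1}, n_p, r_p)
        C = np.diag(S[:r]) @ Vt[:r]
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))             # final core A^P
    return cores

def tt_reconstruct(cores):
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([out.ndim - 1], [0]))
    return out.reshape([c.shape[1] for c in cores])

H = np.random.default_rng(1).standard_normal((4, 5, 6, 7))
cores = tt_svd(H, r_max=3)
print([c.shape for c in cores])                              # bounded core sizes instead of one dense tensor
print(np.linalg.norm(H - tt_reconstruct(cores)) / np.linalg.norm(H))   # relative approximation error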

3. Experiment Results

3.1. Experimental Setup

To verify the effectiveness of our state vector computation method, we constructed a Sequence-to-Sequence structure using an LSTM. For all experiments, we used an initial sequence of length $t_0$ as input and varied the prediction horizon $T$. We trained all models using stochastic gradient descent, with the regression loss $L(y, \hat{y}) = \sum_{t=1}^{T} \|\hat{y}_t - y_t\|_2^2$ applied to sequences of length $T$. Here, $y_t = x_{t+1}$ represents the true values, while $\hat{y}_t$ represents the values predicted by our model.
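The training objective above can be sketched in a few lines of PyTorch. The placeholder model, the window and horizon sizes, and the batch averaging of the loss are assumptions made for illustration; only the squared-error objective and the use of plain SGD come from the text.

import torch
import torch.nn as nn

def sequence_loss(y_hat, y):
    # L(y, y_hat) = sum_t ||y_hat_t - y_t||_2^2, averaged over the batch
    return ((y_hat - y) ** 2).sum(dim=(-2, -1)).mean()

class NaiveForecaster(nn.Module):
    # hypothetical placeholder: maps a flattened input window to a T-step forecast
    def __init__(self, t0, horizon, dim):
        super().__init__()
        self.horizon, self.dim = horizon, dim
        self.linear = nn.Linear(t0 * dim, horizon * dim)
    def forward(self, x):                     # x: (batch, t0, dim)
        return self.linear(x.flatten(1)).view(-1, self.horizon, self.dim)

t0, horizon, dim = 96, 24, 7
model = NaiveForecaster(t0, horizon, dim)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)     # plain SGD, as in the text
x = torch.randn(32, t0, dim)                  # synthetic batch of input windows
y = torch.randn(32, horizon, dim)             # synthetic targets y_t = x_{t+1}, ...
for _ in range(5):                            # a few SGD steps
    optimizer.zero_grad()
    loss = sequence_loss(model(x), y)
    loss.backward()
    optimizer.step()
print(loss.item())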
Two conventional machine learning-based methods, namely Support Vector Machines (SVM) and Random Forest (RF), were selected as baselines against which to compare the performance of our proposed model. We also selected seven popular prediction models as benchmarks from three categories: Transformer-based methods such as Informer [10], FEDformer [27], and Crossformer [5]; Linear-based methods such as DLinear [6] and TiDE [7]; and TCN-based methods such as SCINet [8] and TimesNet [9].

3.2. Dataset Preparation

Six real-world datasets were selected for our experiments: the Electricity Transformer Temperature (ETT), weather, electricity, and traffic datasets used by Autoformer [4]; the solar-energy dataset proposed in LSTNet [28]; and the Performance Measurement System (PEMS) dataset evaluated in SCINet [8].

3.2.1. ETT Dataset

The ETT dataset encompasses temperature recordings of electrical transformers over a period of time, together with corresponding information on electricity usage. Spanning several months to several years, the ETT dataset is recorded at regular intervals, such as every 15 min, every half hour, or hourly. The applicability of the ETT dataset is extensive, especially in areas where accurate short-term or long-term predictions of electrical demand are needed. This includes analyzing and forecasting supply and demand dynamics in the electricity market, predicting residential and industrial electricity consumption patterns, and conducting short-term and long-term load forecasting for power systems.

3.2.2. Weather Dataset

The weather dataset comprehensively records meteorological indicators at 10 min intervals throughout the year 2020, encompassing 21 variables including air temperature and humidity. The dataset’s scope of application is exceptionally broad, ranging from analyzing and forecasting climate change trends to tracking environmental shifts, assessing pollution and ecosystem health, and utilizing historical and real-time data for predicting both short-term and long-term weather conditions.

3.2.3. Electricity Dataset

The electricity dataset encompasses hourly electricity consumption data for 321 customers spanning from the year 2012 to 2014. The utility of this data is remarkably broad, encompassing applications such as analyzing energy consumption patterns to drive efficiency improvements, forecasting short-term or long-term loads in power systems, predicting electricity prices in the energy market based on supply and demand conditions, and monitoring anomalous behaviors within the electrical grid.

3.2.4. Traffic Dataset

The traffic dataset is a compilation of data collected by the California Department of Transportation, characterizing road occupancy rates measured by various sensors located on highways in the San Francisco Bay Area. This dataset is applicable to a wide range of uses, such as forecasting traffic flow at specific times and locations, identifying bottleneck segments and periods of congestion, and supporting infrastructure planning and development through the analysis of traffic patterns.

3.2.5. Solar-Energy Dataset

The solar-energy dataset chronicles solar energy production in the year 2006, with samples collected every 10 min from 137 photovoltaic power stations in Alabama. This dataset has a similarly extensive range of applications, such as forecasting the energy output of solar power stations or photovoltaic systems, investigating the relationship between solar radiation and climate change, and assessing the impact of different geographical locations and seasons on the performance of photovoltaic systems.

3.2.6. PEMS Dataset

PEMS collects and stores real-time traffic data from across the entire highway system in California, including metrics such as traffic volumes, vehicle speeds, and lane occupancy rates. This information is primarily obtained through inductive loop detectors and other sensors deployed on the highways. The system aggregates data both temporally (e.g., every 5 min) and spatially (e.g., per detection point or region), thereby supporting traffic management, planning, and research initiatives. Spanning a wide network of California’s highways, the dataset provides several years of historical data, proving invaluable for transportation research, urban planning, traffic engineering, and environmental impact assessments.

3.3. MAE and MSE Calculation

For the purposes of this experiment within the context of sequence prediction tasks, we have selected two prevalent and extensively utilized evaluation metrics, i.e., MSE and MAE. The MSE represents the average of the squares of the differences between the predicted and actual values. The computational method for MSE is delineated in Equation (12):
$MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y_{test}^{(i)} - \hat{y}_{test}^{(i)} \right)^2$ (12)
where $m$ denotes the number of samples, $y_{test}^{(i)}$ represents the actual values, and $\hat{y}_{test}^{(i)}$ the values predicted by the model. The squared term in MSE accentuates the impact of larger errors, rendering it particularly sensitive to outliers. This attribute can be advantageous in certain scenarios, such as when substantial prediction errors entail more severe consequences than minor ones. As a differentiable function, MSE possesses favorable mathematical properties during optimization processes, such as gradient descent, which are crucial for model training.
The MAE is the average of the absolute differences between the predicted values and the actual values. The MAE directly quantifies the average deviation between the predicted values and the actual values and is straightforward to interpret and comprehend. Due to the absence of a squared term, MAE is less sensitive to outliers compared to the MSE, offering a more robust error estimation in the presence of anomalous data points. The method for calculating MAE is presented in Equation (13):
$MAE = \frac{1}{m} \sum_{i=1}^{m} \left| y_{test}^{(i)} - \hat{y}_{test}^{(i)} \right|$ (13)
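Both metrics translate directly into code; the following NumPy functions are a straightforward transcription of Equations (12) and (13), with small made-up vectors used only to show the call.

import numpy as np

def mse(y_true, y_pred):
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))   # approximately 0.02
print(mae([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))   # approximately 0.133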

3.4. Comparison with Machine Learning-Based Methods

Traditional machine learning methods still hold advantages in sequence data forecasting. Therefore, we conducted extensive experiments to evaluate the forecasting performance of our proposed SVSeq2Seq model against two conventional machine learning-based baselines, namely Support Vector Machines and Random Forest. As shown in Table 1 and Figure 4, the proposed SVSeq2Seq shows excellent prediction performance on the three datasets, in terms of both MSE and MAE, compared with SVM and RF. In particular, under the MSE metric, the SVSeq2Seq model improves performance by factors of 56.88 on the weather dataset and 73.78 on the electricity dataset.

3.5. Comparison with Several Up-to-Date Methods

Further, we also compared our proposed model with several up-to-date methods, including Informer, FEDformer, Crossformer, DLinear, TiDE, SCINet, and TimesNet. The prediction results are shown in Table 2 and Table 3, with the best results in bold. Lower MSE and MAE values indicate more accurate predictions. Our model outperforms other state-of-the-art baseline models on three out of the six datasets while achieving comparable results on the remaining three datasets. These results demonstrate the excellent performance of our model in sequence prediction scenarios.
As shown in Table 2, when evaluated using MSE as the metric, our proposed SVSeq2Seq model outperforms all other baseline models on the weather, electricity, and PEMS datasets. On the weather dataset, our proposed model performs comparably to TimesNet but outperforms Informer by up to 59.15%. On the electricity dataset, the proposed SVSeq2Seq model performs up to 40.19% better than Informer. Similarly, the SVSeq2Seq model delivers up to a 65.34% improvement over TiDE on the PEMS dataset.
When evaluated with MAE as the metric, the performance of the SVSeq2Seq model is essentially the same as when evaluated with MSE (Table 3). SVSeq2Seq shows an improvement of up to 52.55% over Informer in the weather dataset, up to 28.21% over Informer in the electricity dataset, and up to 47.01% over TiDE in the PEMS dataset.
Figure 5 illustrates the comparison of forecasting results between the SVSeq2Seq model and the other seven models on the ETT, weather, electricity, traffic, solar energy, and PEMS datasets. The results in Figure 5 show that the MSE and MAE of the SVSeq2Seq model exhibit significant advantages over the other seven baseline models in predicting the three datasets, i.e., weather, electricity, and PEMS. The MSE values are as low as 0.259, 0.186, and 0.113 in the weather, electricity, and PEMS datasets, respectively. Correspondingly, the MAE values were as low as 0.260, 0.285, and 0.222, respectively. In addition, the MAE and MSE values of the SVSeq2Seq model also perform well in the three other datasets, such as ETT, traffic, and solar energy. Therefore, our proposed SVSeq2Seq model displays exceptional forecasting ability in the six datasets.

3.6. Ablation Experiment

To verify the rationality and generality of our proposed state vector computation method, we performed detailed ablation experiments, including the replacement of different encoders, decoders, and state vector computation methods. LPRcode is a state vector computation method used in the reported study [3], where the output of the encoder is used as a direct input for the decoder. On the other hand, the reported NMTcode computes the state vector by summing the hidden states of the encoder [14]. The experimental results are shown in Table 4 and Table 5.
The universality of the SVSeq2Seq model was confirmed in ablation experiments using MSE as the evaluation metric. To achieve this goal, LSTM and RNN were employed as encoders and decoders in sequential order. As shown in Table 4, the MSE of the ETT, weather, and electricity datasets exhibited minimal variations of 0.305, 0.37, and 0.304, respectively, demonstrating the robust universality of the SVSeq2Seq model. To provide additional evidence of the reliability of the SVSeq2Seq model, we also conducted a series of experiments using the LPRcode and NMTcode models, which resulted in a significant decrease in forecasting accuracy. Specifically, when LSTM was used as both the encoder and decoder on the ETT dataset, the MSE increased by 18.05 and 15.69 times, respectively. Similarly, when RNN was employed as both the encoder and decoder, the MSE increased by 10.32 times and 10.11 times, respectively. The significant increase in MSE after replacing SVSeq2Seq highlights the soundness of the proposed SVSeq2Seq model in the Sequence-to-Sequence framework.
The results of the ablation experiments with MAE as the evaluation metric were similar to those obtained with MSE; the chosen evaluation metric did not appear to have a significant impact on the results. The LSTM and RNN were again implemented sequentially as the encoder and decoder. Minor fluctuations in the MAE were observed on the ETT, weather, and electricity datasets, with values of 0.409, 0.535, and 0.319, respectively (Table 5). This confirms the superior adaptability of the SVSeq2Seq model. After replacing SVSeq2Seq with LPRcode and NMTcode, with LSTM as both the encoder and decoder on the ETT dataset, the MAE increased by 16.54 and 16.48 times, respectively. Similarly, when RNN was used as the encoder and decoder, the MAE increased by 9.8 and 9.5 times, respectively. These results affirm the effectiveness of the proposed SVSeq2Seq model.
Figure 6 shows the comparison of forecasting results for the ablation experiments in three real-world datasets, including ETT, weather, and electricity. As shown in Figure 6, regardless of whether LSTM or RNN is used as the encoder and decoder for the SVSeq2Seq model, the forecasting results are significantly superior to those obtained using LPRcode and NMTcode models. The above results indicate that the proposed SVSeq2Seq model has outstanding universality and rationality compared to the LPRcode and NMTcode models.
Furthermore, to validate the efficacy of the state vector, we eliminated all state vector components, i.e., SVSeq2Seq, LPRcode, and NMTcode. Compared with the proposed model using SVSeq2Seq as the state vector (Table 4 and Table 5), the MSE and MAE values of the model without state vector components increased substantially for both the LSTM and RNN models (Table 6). These results demonstrate the significant contribution of the state vector, whether computed by SVSeq2Seq, LPRcode, or NMTcode, within the Sequence-to-Sequence architecture.

4. Discussion

This study proposes an efficient method for computing state vectors in Sequence-to-Sequence architecture that can serve as a foundation for further research and development in sequence data forecasting. Even though we can mitigate the issue of gradient explosion and enhance the predictive performance of SVSeq2Seq by optimizing the computation of the state vector c and employing tensor train, we still encountered numerous limitations and challenges during the early stages of modeling and development of SVSeq2Seq. Initially, computational and storage resources posed constraints on the model construction, hence we proposed the utilization of tensor train to alleviate this issue. Secondly, in terms of dataset selection, a significant amount of time was devoted to addressing the quality, quantity, and representativeness of the data. Lastly, the complexity of the model, hyperparameter tuning, interpretability, and the generalization ability of the SVSeq2Seq also presented considerable challenges throughout its development. After overcoming the above challenges, the SVSeq2Seq model has excellent performance in sequence data forecasting on multiple datasets.
Analyzing the above experimental results reveals that SVSeq2Seq significantly outperforms traditional machine learning methods (e.g., SVM and RF) in the prediction of sequential data across the ETT, weather, and electricity datasets. There are two main reasons for the suboptimal performance of machine learning algorithms in sequence prediction tasks. First, conventional SVMs and RFs do not inherently capture sequential (temporal) ordering information, as they assume the input features to be independent and identically distributed. This implies that without proper feature engineering to incorporate temporal information (such as using sliding windows, time lags, etc.), these models may fail to yield accurate predictions. Second, many sequence datasets exhibit non-linearities and non-stationarities. While RFs can capture a degree of non-linearity, they may not be sufficiently flexible for sequences with complex temporal dynamics. SVMs can address non-linearity by employing non-linear kernel functions, yet their predictive capacity might be compromised if the sequence data exhibit significant non-stationarity.
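To illustrate the kind of feature engineering mentioned above, the following sketch builds sliding-window (lag) features for a univariate series and fits a random forest regressor for one-step-ahead prediction. The synthetic series, the window length, and the use of scikit-learn are assumptions made purely for illustration.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_lag_features(series, lags):
    # each row of X holds the `lags` values preceding the corresponding target in y
    X = np.stack([series[i:len(series) - lags + i] for i in range(lags)], axis=1)
    y = series[lags:]
    return X, y

series = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.default_rng(2).standard_normal(500)
X, y = make_lag_features(series, lags=24)
rf = RandomForestRegressor(n_estimators=100).fit(X[:-100], y[:-100])
print(rf.score(X[-100:], y[-100:]))           # one-step-ahead R^2 on the held-out tail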
In addition, our proposed SVSeq2Seq model still shows satisfactory performance compared with several up-to-date methods such as Informer, DLinear, and TimesNet on six real-world datasets. This can be attributed to the advantage of the tensor train network. A significant benefit of the tensor train is that it is not affected by dimension enlargement, so in practice we can use a tensor train to approximate $f$. In SVSeq2Seq, we use the tensor train to reduce the parameter dimension from $(q + k)^{T+1}$ to $(q + k) R^2 (T + 1)$. This effectively reduces the amount of computation, which would otherwise increase sharply as $H^{extend}$ grows. Let $f \in \mathcal{H}_\mu^k$ be a Sobolev function defined on $I = I_1 \times I_2 \times \cdots \times I_d$, where each $I_i$ is a set of vectors, as given in Equation (14):
$\mathcal{H}_\mu^k = \left\{ f \in L_\mu(I) : \sum_{i \le k} \left\| D^{(i)} f \right\|^2 < +\infty \right\}$ (14)
where $D^{(i)} f$ is the $i$-th weak derivative of $f$ and $\mu \ge 0$. Any Sobolev function can be decomposed as in Equation (15):
$f(\cdot) = \sum_{i=0}^{\infty} \lambda(i)\, \gamma(\cdot\,; i)\, \phi(i; \cdot)$ (15)
where $\{\lambda\}$ are the eigenvalues and $\gamma(\cdot)$ and $\phi(\cdot)$ are the eigenfunctions. Therefore, we can write $f$ as in Equation (16):
$f(x) = \sum_{\alpha_0, \ldots, \alpha_d = 1}^{\infty} A^1(\alpha_0, x_1, \alpha_1) \cdots A^d(\alpha_{d-1}, x_d, \alpha_d)$ (16)
where $\{A^d(\alpha_{d-1}, s_d, \alpha_d) = \lambda_{d-1}(\alpha_{d-1})\, \phi(\alpha_{d-1}; s_d)\}$ are the basis functions on each input dimension. Then, $f(x)$ is truncated into a low-dimensional subspace ($r < \infty$) in Equation (17) as follows:
$f(x) = \sum_{\alpha_0, \ldots, \alpha_d \le r} A^1(\alpha_0, x_1, \alpha_1) \cdots A^d(\alpha_{d-1}, x_d, \alpha_d)$ (17)

5. Conclusions

In this paper, we propose a novel and efficient state vector calculation method for Sequence-to-Sequence architecture forecasting, in which the proposed SVSeq2Seq model uses the correlation between the hidden layers of the decoder and the encoder to compute the weight vector for the hidden layers of the encoder. It is worth noting that as the number of hidden layers in the encoder increases, the computational complexity of this method grows exponentially. We therefore introduce tensor train decomposition to approximate the extended tensor, which effectively mitigates the curse of dimensionality. The experimental results demonstrate that the proposed SVSeq2Seq outperforms the reported baseline models in most scenarios. When using MSE as the evaluation metric, the SVSeq2Seq model achieved the best results with scores of 0.259, 0.186, and 0.113 on the weather, electricity, and PEMS datasets, respectively. With MAE as the evaluation metric, SVSeq2Seq obtained optimal results with scores of 0.260, 0.285, and 0.222 on the same datasets, respectively, demonstrating the effectiveness of SVSeq2Seq in sequence prediction tasks. In the ablation experiments, regardless of whether MSE or MAE was used as the evaluation criterion, the SVSeq2Seq group achieved the best performance on the ETT, weather, and electricity datasets, indicating that our model possesses excellent generalizability. Overall, SVSeq2Seq exhibits superior performance in sequence data prediction tasks, particularly in predicting long-term nonlinear sequence data. In the future, we plan to further enhance the flow of data within the encoder and decoder and to optimize the computation of memory information weights within the RNN to improve the sequence prediction capabilities of SVSeq2Seq. Regarding machine learning approaches, we will explore the use of ensemble learning techniques in conjunction with hyperparameter tuning to enhance the performance of machine learning methods in sequence prediction tasks. Additionally, we aim to further investigate the application of the state vector in time series analysis tasks using Transformer-based methods.

Author Contributions

Conceptualization, G.S.; Methodology, G.S. and W.W.; Validation, G.S. and X.Q.; Formal analysis, G.S.; Writing—original draft, G.S.; Writing—review and editing, X.Q. and Q.Z.; Supervision, Y.L.; Funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the present article.

Acknowledgments

The authors would like to extend their gratitude and acknowledgments to all the participants for their time spent on this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Barbosa, A.; Bittencourt, I.; Siqueira, S.W.; Dermeval, D.; Cruz, N.J.T. A context-independent ontological linked data alignment approach to instance matching. Int. J. Semant. Web Inf. 2022, 18, 1–29.
2. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
3. Cho, K.; Merrienboer, B.V.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014.
4. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Proceedings of the 35th Conference on Neural Information Processing Systems, Online, 6–14 December 2021; pp. 22419–22430.
5. Zhang, Y.; Yan, J. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
6. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023.
7. Das, A.; Kong, W.; Leach, A.; Mathur, S.; Sen, R.; Yu, R. Long-term forecasting with TiDE: Time-series dense encoder. Trans. Mach. Learn. Res. 2023.
8. Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; Xu, Q. SCINet: Time series modeling and forecasting with sample convolution and interaction. In Proceedings of the 36th Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 5816–5828.
9. Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. TimesNet: Temporal 2D-variation modeling for general time series analysis. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
10. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Held Virtually, 2–9 February 2021.
11. Li, C.; Li, D.; Zhang, Z.; Chu, D. MST-RNN: A multi-dimension spatiotemporal recurrent neural networks for recommending the next point of interest. Mathematics 2022, 10, 1838.
12. Jadhav, S.; Zhao, J.; Fan, Y.; Li, J.; Lin, H.; Yan, C.; Chen, M. Time-varying sequence model. Mathematics 2023, 11, 336.
13. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NeurIPS 2014), Montreal, QC, Canada, 8–13 December 2014.
14. Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016.
15. Serban, I.V.; Sordoni, A.; Bengio, Y.; Courville, A.; Pineau, J. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
16. Sordoni, A.; Bengio, Y.; Vahabi, H.; Lioma, C.; Simonsen, J.G.; Nie, J. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, Melbourne, Australia, 19–23 October 2015; pp. 553–562.
17. Serban, I.V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A.; Bengio, Y. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
18. Weston, J.; Chopra, S.; Bordes, A. Memory networks. In Proceedings of the International Conference on Learning Representations, 2015.
19. Fernando, T.; Denman, S.; McFadyen, A.; Sridharan, S.; Fookes, C. Tree memory networks for modelling long-term temporal dependencies. Neurocomputing 2018, 304, 64–81.
20. Cao, J.; Li, J.; Jiang, J. Link prediction for temporal heterogeneous networks based on the information lifecycle. Mathematics 2023, 11, 3541.
21. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33, pp. 1877–1901.
22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929.
23. Liu, Y.; Wu, H.; Wang, J.; Long, M. Non-stationary transformers: Rethinking the stationarity in time series forecasting. arXiv 2022, arXiv:2205.14415.
24. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. arXiv 2022, arXiv:2211.14730.
25. Singh, S.K.; Chopra, M.; Sharma, A.; Gill, S.S. A comparative study of generative adversarial networks for text-to-image synthesis. Int. J. Softw. Sci. Comp. 2022, 14, 1–12.
26. Orús, R. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Ann. Phys. 2014, 349, 117–158.
27. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning, 2022.
28. Lai, G.; Chang, W.C.; Yang, Y.; Liu, H. Modeling long- and short-term temporal patterns with deep neural networks. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 95–104.
Figure 1. The operational status of SVSeq2Seq within the Sequence-to-Sequence structure.
Figure 2. The computation procedure of the state vector $c$.
Figure 3. Schematic diagram of the working of tensor train networks.
Figure 4. Comparison of the forecasting results of the SVSeq2Seq model with SVM and RF at the MSE (A) and MAE (B) level on three real-world datasets.
Figure 5. Comparison of the forecasting results of the SVSeq2Seq model with other models on six real-world datasets. (A) ETT dataset; (B) weather dataset; (C) electricity dataset; (D) traffic dataset; (E) solar energy dataset; (F) PEMS dataset. The black and red numbers indicate the MSE and MAE values, respectively.
Figure 6. Comparison of forecasting results for ablation experiments in three real-world datasets. (A) ETT dataset; (B) weather dataset; (C) electricity dataset. The black and red numbers indicate the MAE and MSE values, respectively.
Table 1. The MSE and MAE values of the three models' forecasting results.
                MSE 1                              MAE 2
Models          SVSeq2Seq   SVM       RF           SVSeq2Seq   SVM       RF
ETT             0.528       12.971    14.562       0.524       13.518    12.057
Weather         0.259       14.732    11.604       0.260       12.351    11.943
Electricity     0.186       11.264    13.724       0.285       10.905    12.718
1 Bold indicated the MSE optimal value; 2 Bold indicated the MAE optimal value.
Table 2. The MSE values of the eight models' forecasting results.
Models         SVSeq2Seq   TimesNet   SCINet   TiDE    DLinear   Crossformer   FEDformer   Informer
ETT            0.528       0.414      0.954    0.611   0.559     0.942         0.437       4.431
Weather        0.259       0.259      0.292    0.271   0.265     0.259         0.309       0.634
Electricity    0.186       0.192      0.268    0.251   0.212     0.244         0.214       0.311
Traffic        0.737       0.620      0.804    0.760   0.625     0.550         0.610       0.764
Solar-Energy   0.280       0.301      0.282    0.347   0.330     0.641         0.291       0.235
PEMS           0.113       0.147      0.114    0.326   0.278     0.169         0.213       0.171
Bold indicated the MSE optimal value in different datasets.
Table 3. The MAE values of the eight models' forecasting results.
Models         SVSeq2Seq   TimesNet   SCINet   TiDE    DLinear   Crossformer   FEDformer   Informer
ETT            0.524       0.427      0.723    0.550   0.515     0.684         0.449       1.729
Weather        0.260       0.287      0.363    0.320   0.317     0.315         0.360       0.548
Electricity    0.285       0.295      0.365    0.334   0.300     0.334         0.327       0.397
Traffic        0.502       0.336      0.509    0.473   0.383     0.304         0.376       0.416
Solar-Energy   0.301       0.319      0.375    0.417   0.401     0.639         0.381       0.280
PEMS           0.222       0.248      0.224    0.419   0.375     0.281         0.327       0.274
Bold indicated the MAE optimal value in different datasets.
Table 4. The MSE of the models' forecasting results in ablation experiments.
Models         LSTM + SVSeq2Seq   RNN + SVSeq2Seq   LSTM + LPRcode   LSTM + NMTcode   RNN + LPRcode   RNN + NMTcode
ETT            0.528              0.833             9.533            8.286            8.600           8.425
Weather        0.259              0.629             8.476            8.332            8.459           9.762
Electricity    0.186              0.580             9.072            7.649            8.714           8.619
Bold indicated the MSE optimal value in different datasets.
Table 5. The MAE of the models' forecasting results in ablation experiments.
Models         LSTM + SVSeq2Seq   RNN + SVSeq2Seq   LSTM + LPRcode   LSTM + NMTcode   RNN + LPRcode   RNN + NMTcode
ETT            0.524              0.933             8.825            8.637            9.146           8.902
Weather        0.260              0.795             9.601            7.212            9.702           8.316
Electricity    0.285              0.604             9.479            8.966            9.756           9.637
Bold indicated the MAE optimal value in different datasets.
Table 6. The MSE and MAE of the models' forecasting results.
               MSE 1                MAE 2
Models         LSTM      RNN        LSTM      RNN
ETT            9.603     10.492     9.548     10.734
Weather        8.972     9.776      9.670     9.809
Electricity    9.747     10.946     9.249     10.385
1 Bold indicated the MSE optimal value; 2 Bold indicated the MAE optimal value.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, G.; Qi, X.; Zhao, Q.; Wang, W.; Li, Y. SVSeq2Seq: An Efficient Computational Method for State Vectors in Sequence-to-Sequence Architecture Forecasting. Mathematics 2024, 12, 265. https://doi.org/10.3390/math12020265

AMA Style

Sun G, Qi X, Zhao Q, Wang W, Li Y. SVSeq2Seq: An Efficient Computational Method for State Vectors in Sequence-to-Sequence Architecture Forecasting. Mathematics. 2024; 12(2):265. https://doi.org/10.3390/math12020265

Chicago/Turabian Style

Sun, Guoqiang, Xiaoyan Qi, Qiang Zhao, Wei Wang, and Yujun Li. 2024. "SVSeq2Seq: An Efficient Computational Method for State Vectors in Sequence-to-Sequence Architecture Forecasting" Mathematics 12, no. 2: 265. https://doi.org/10.3390/math12020265

