Deep Learning for Time Series Forecasting: Advances and Open Problems

A time series is a sequence of time-ordered data, generally used to describe how a phenomenon evolves over time. Time series forecasting, i.e., estimating future values of a time series, enables the implementation of decision-making strategies. Deep learning, the currently leading field of machine learning, applied to time series forecasting can cope with complex and high-dimensional time series that other machine learning techniques usually cannot handle. The aim of this work is to review state-of-the-art deep learning architectures for time series forecasting, underline recent advances and open problems, and also pay attention to benchmark data sets. Moreover, the work presents a clear distinction between deep learning architectures that are suitable for short-term and long-term forecasting. With respect to the existing literature, the major advantage of the work consists in describing the most recent architectures for time series forecasting, such as Graph Neural Networks, Deep Gaussian Processes, Generative Adversarial Networks, and Diffusion Models.


Introduction
A time series is a sequence of data enumerated in time order. Time series are used to study how certain measures, e.g., air pollution data [1], ozone concentration [2], plant growth [3], sunspots [4], Dow Jones and Nasdaq indices [5], and electricity consumption [6], evolve over time. Time series forecasting estimates the values that a time series takes in the future, allowing the implementation of decision-making strategies, e.g., abandonment of fossil fuels to reduce the surface temperature of the Earth. Specifically, time series forecasting is very relevant for the energy domain (e.g., electricity load demand [7,8], solar and wind power estimation [9,10]), meteorology (e.g., prediction of wind speed [11], temperature [12,13], humidity [12], precipitation [13,14]), air pollution monitoring (e.g., prediction of PM2.5, PM10, NO2, O3, SO2, and CO2 concentrations [12,15,16]), the finance domain (e.g., stock market index and shares prediction [17,18], stock price [19,20], exchange rate [21,22]), health (e.g., prediction of infectious disease diffusion [23], diabetes mellitus [24], blood glucose concentration [25], and cancer growth [26]), traffic (e.g., traffic speed and flow prediction [27][28][29][30]), and industrial production (e.g., petroleum production [31], remaining life prediction [23,32,33], industrial processes [34], fuel cell durability [35], engine faults [36]). Deep learning algorithms are currently the leading methods in machine learning due to their successful application to many computer science domains (e.g., computer vision, natural language processing, speech recognition). In recent years, there has been a growth of interest in the application of deep learning to time series forecasting, due to deep learning's capability to grasp the nonlinear relations among input data (i.e., time series samples). To the best of our knowledge, there are several reviews on deep learning for time series forecasting (e.g., [37][38][39][40][41][42][43]), but
they have the following major shortcomings: they lack a description of the Transformer and its variants; they do not provide a clear distinction between models for short-term and long-term forecasting; they have no sections on short-term and long-term forecasting benchmarks; and they do not cover the most recent deep learning strategies for short-term forecasting (e.g., Graph Neural Networks, Deep Gaussian Processes, Generative Adversarial Networks, and Diffusion Models). The aim of this work is to provide a comprehensive survey of recent deep learning approaches for time series forecasting, underlining the advances and the open problems of each reviewed deep learning architecture. Specifically, for the sake of length, the survey focuses on works about machine learning applied to time series forecasting that are not older than 2016. The paper is organised in the following sections: in Section 2, the foundations of deterministic time series are introduced; Section 3 describes deep learning architectures for short-term forecasting, i.e., Convolutional Neural Networks, Temporal Convolutional Networks, Recurrent Neural Networks (RNNs), Graph Neural Networks, Deep Gaussian Processes, Generative Adversarial Networks, and Diffusion Models; Section 4 discusses long-term forecasting architectures, i.e., the Transformer architecture and its time series-based variants; Section 5 reviews other heterogeneous architectures for both short-term and long-term forecasting; benchmarking for short-term and long-term time series forecasting is presented in Section 6; in Section 7, some conclusions are drawn and possible future developments are discussed; finally, the appendix reports the main mathematical notation used in the work and a description of the main diffusion model foundations.

Deterministic Time Series
A time series is called a univariate time series if all its samples are scalar; otherwise, if all samples are vectors, it is called a multivariate time series. A time series is defined as stationary when the dynamical process that generated it does not change over time; otherwise, it is non-stationary. A deterministic stationary time series x_t, with t = 1, . . ., L, can be described by an autoregressive model as follows:

$$x_{t+p} = f(x_{t-1}, \ldots, x_{t-q}) + \epsilon_{t+p}, \quad \forall p \in [0, P] \qquad (1)$$

where f(·) and q are called the skeleton and the model order of the time series, i.e., how many past samples are required to model the time series adequately, respectively, and ε_{t+p} represents an indeterminable part originated either from unmodeled dynamics of the process or from real noise. P defines the prediction length, i.e., how many future samples should be predicted. Figure 1 gives a graphical representation of an autoregressive model.

Figure 1. An example of an autoregressive model for forecasting based on a deterministic stationary time series. In the figure, the model order is q = 2 and the prediction length is P = 0 (i.e., it is a one-step ahead forecasting problem).
If P = 0, the autoregressive model defines the so-called one-step ahead forecasting; otherwise, a prediction length P > 0 specifies a multi-step ahead forecasting problem. Moreover, multi-step ahead forecasting can be further divided into short-term and long-term forecasting. In the literature, the typical threshold value of the prediction length P between short-term and long-term forecasting ranges between 2 and 48 [44]. Finally, for the sake of clarity, in this work, one-step ahead forecasting is included in short-term forecasting.
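As a concrete illustration of the autoregressive formulation above, the following sketch builds sliding windows of q past samples and fits a linear stand-in for the skeleton f(·) by least squares. The linear model, the toy sine series, and all variable names are illustrative assumptions, not taken from the survey:

```python
import numpy as np

def make_windows(series, q):
    """Build (X, y) pairs: q past samples -> next sample (one-step ahead)."""
    X = np.array([series[i:i + q] for i in range(len(series) - q)])
    y = series[q:]
    return X, y

# Toy stationary series: a noisy sine wave.
rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(0.1 * t) + 0.05 * rng.standard_normal(200)

q = 4  # model order: how many past samples the skeleton f sees
X, y = make_windows(series, q)

# Approximate the skeleton f with ordinary least squares (a linear stand-in).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w
print(X.shape)  # (196, 4)
```

With a nonlinear skeleton, the least-squares fit would simply be replaced by any of the deep models reviewed in Section 3, trained on the same windowed pairs.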

Deep Learning Models for Short-Term Forecasting
In short-term forecasting, the skeleton of a time series can be approximated by the deep learning models described below. The section is organised as follows. Firstly, Convolutional Neural Networks (Section 3.1) and Temporal Convolutional Networks (Section 3.1.2) are introduced. Furthermore, Recurrent Neural Networks (Section 3.2) are described, including Elman RNNs (Section 3.2.1), Echo State Networks (Section 3.2.3), Long Short-Term Memory (Section 3.2.4), and Gated Recurrent Units (Section 3.2.5). Successively, hybrids and variants (Section 3.3) of the aforementioned architectures are briefly reviewed. Moreover, Graph Neural Networks (Section 3.4), Deep Gaussian Processes (Section 3.5), and generative models (Section 3.6), i.e., Generative Adversarial Networks (Section 3.6.1) and Diffusion Models (Section 3.6.3), are discussed. Finally, for each reviewed model, drawbacks and limitations are discussed in proper sections.

Convolutional Neural Networks
Convolutional Neural Networks (CNNs) [45], as shown in Figure 2, have a deep architecture generally formed by convolution, pooling, and fully connected layers. CNNs have three main peculiarities: local connectivity, shared weights, and translation equivariance. Local connectivity means that each CNN neuron is connected only to its exclusive input region, the so-called receptive field. Moreover, the neurons of a given layer share the same weight matrix. Translation equivariance is the ability of CNNs to detect certain patterns regardless of the position they have in the input image. 1D convolution (see Figure 3) of an input sequence X = [x_1, . . ., x_L] with a given kernel w of size q is defined as:

$$y_t = \sum_{i=1}^{q} w_i \, x_{t+i-1}, \quad t = 1, \ldots, L - q + 1 \qquad (2)$$

It is worthwhile to remark that, in the autoregressive approach, the kernel size q corresponds to the model order, generally fixed using model selection techniques (e.g., cross-validation) [46]. Moreover, a CNN can stack different convolutional layers in order to transform the input data (i.e., the past time series values) into a higher-level representation more suitable for the forecasting task. CNN time series forecasting applications are described in Table 1.
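A minimal sketch of the 1D convolution above, written in the cross-correlation convention common in CNN practice (the helper name and the moving-average kernel are illustrative assumptions):

```python
import numpy as np

def conv1d(x, w):
    """Valid-mode 1D convolution (cross-correlation, as in CNN practice):
    y_t = sum_i w_i * x_{t+i-1}, producing L - q + 1 outputs."""
    q = len(w)
    return np.array([np.dot(w, x[t:t + q]) for t in range(len(x) - q + 1)])

x = np.array([1., 2., 3., 4., 5.])
w = np.array([0.5, 0.5])   # q = 2: a moving-average kernel
print(conv1d(x, w))        # [1.5 2.5 3.5 4.5]
```

In a trained CNN, the kernel w would be learned by gradient descent rather than fixed by hand, and several such kernels would be stacked per layer.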

Shortcomings of Convolutional Neural Networks
The major drawback of CNNs for time series forecasting is the use of Euclidean kernels [51], since a Euclidean kernel considers only a contiguous and commonly short time series subsequence at a time. However, in time series forecasting tasks it may be useful to extract representative features by analysing non-contiguous time series samples farther in the past. This limitation is overcome by Temporal Convolutional Networks (see Section 3.1.2), with the use of causal and dilated convolutions, and by Graph Neural Networks (see Section 3.4), with a properly designed adjacency matrix.

Temporal Convolutional Networks
Temporal Convolutional Networks (TCNs) were proposed for action segmentation and detection by Lea et al. [52]. In a nutshell, a TCN is composed of cascaded 1D convolutional layers that allow mapping arbitrarily long inputs onto output sequences of the same length. In contrast to traditional CNNs, TCNs perform causal and dilated convolutions. Unlike 1D convolution (see Equation (2)), in causal 1D convolution (see Figure 4) the output element at time t is yielded by the convolution between the kernel and past input elements only, namely [x_{t−1}, x_{t−2}, . . ., x_{t−q}], where q is the kernel size that corresponds to the model order in an autoregressive approach (see Section 3.1). In detail, causal 1D convolution is defined as follows:

$$y_t = \sum_{i=1}^{q} w_i \, x_{t-i}$$

A dilated 1D convolution (see Figure 5) differs from a 1D convolution due to the introduction of a dilation factor d. The dilation, applied to convolution, is equivalent to considering a fixed step between two adjacent kernel elements. In particular, dilated causal 1D convolution can be defined as:

$$y_t = \sum_{i=1}^{q} w_i \, x_{t - d \cdot i}$$

When d = 1, a dilated causal 1D convolution reduces to a causal 1D convolution. Dilated convolutions allow the networks to have a large receptive field, i.e., to capture information from the far past, with a number g of convolutional layers that grows only logarithmically with the input length: the dilation factor of the i-th convolutional layer grows exponentially as d_i = b^i, where b is the logarithmic base. TCN time series forecasting applications are reported in Table 2.
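The causal and dilated convolutions above can be sketched in a few lines; the zero-padding before the start of the sequence and the function name are illustrative assumptions:

```python
import numpy as np

def dilated_causal_conv1d(x, w, d=1):
    """Causal dilated 1D convolution: y_t depends only on past samples
    x_{t-d}, x_{t-2d}, ..., x_{t-qd} (zero-padded before the series start)."""
    q = len(w)
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for i in range(1, q + 1):
            j = t - d * i
            if j >= 0:
                y[t] += w[i - 1] * x[j]
    return y

x = np.arange(1., 7.)                                   # [1 2 3 4 5 6]
y1 = dilated_causal_conv1d(x, np.array([1., 1.]), d=1)  # y_t = x_{t-1}+x_{t-2}
y2 = dilated_causal_conv1d(x, np.array([1., 1.]), d=2)  # y_t = x_{t-2}+x_{t-4}
print(y1)   # [0. 1. 3. 5. 7. 9.]
print(y2)   # [0. 0. 1. 2. 4. 6.]
```

With d = 2 each output skips every other past sample, which is how stacked layers with d_i = b^i reach far into the past with few layers.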

Recurrent Neural Networks
Recurrent Neural Networks (RNNs) [45] aim to explore the relations between the current time series samples and the past ones. An RNN processes one time series sample at a time to approximate the skeleton f(·) and determine the model order q implicitly. While in Equation (1) the model order q is assumed to be constant, RNNs exhibit a dynamic temporal behaviour, as they do not require a fixed model order q and they auto-determine how far back to look in the past. An RNN is composed of recurrent layers, which process one input sample at a time. The earliest RNN applications for time series forecasting [56,57] have been replaced by more specific and sophisticated recurrent architectures, that is, Elman Recurrent Neural Networks [58], Echo State Networks [59], Long Short-Term Memory Networks [60], and Gated Recurrent Units [61].

Elman Recurrent Neural Networks
Elman [58], Williams–Zipser [62], and Jordan [63] RNNs are the earliest Recurrent Neural Networks properly designed to handle temporal patterns in time series. In particular, the Elman RNN (ERNN) uses a recurrent layer, which yields an output h_t that depends on the current sample x_t and the previous output h_{t−1} through a function g(·) and a generic set of time-shared parameters ω:

$$h_t = g(x_t, h_{t-1}; \omega)$$

where h_{t−1} is produced by the same recurrent layer, i.e.,

$$h_{t-1} = g(x_{t-1}, h_{t-2}; \omega)$$

and so on. The basic recurrent layer, often called a vanilla cell, works like a fully connected layer with a fixed number of units, jointly applied to the current input x_t and the last output h_{t−1}:

$$h_t = g(V x_t + W h_{t-1} + b) \qquad (8)$$

In this case, the set of parameters of a recurrent layer is ω = {V, W, b}, where V is the input-recurrent weight matrix, W is the recurrent-recurrent weight matrix, and b is the bias vector. In Equation (8), g(·) is a nonlinear activation function, usually a hyperbolic tangent. ERNN time series forecasting applications are summarised in Table 3.
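The vanilla cell of Equation (8) can be sketched directly; the sizes, random weights, and unrolling length below are illustrative assumptions:

```python
import numpy as np

def elman_step(x_t, h_prev, V, W, b):
    """One vanilla (Elman) recurrent step: h_t = tanh(V x_t + W h_{t-1} + b)."""
    return np.tanh(V @ x_t + W @ h_prev + b)

rng = np.random.default_rng(1)
d, n = 3, 5   # input size and number of hidden units
V = 0.1 * rng.standard_normal((n, d))   # input-recurrent weights
W = 0.1 * rng.standard_normal((n, n))   # recurrent-recurrent weights
b = np.zeros(n)                          # bias vector

h = np.zeros(n)
for x_t in rng.standard_normal((10, d)):   # unroll over 10 time steps
    h = elman_step(x_t, h, V, W, b)
print(h.shape)   # (5,)
```

Note how the same parameters ω = {V, W, b} are shared across all time steps; training would backpropagate through this unrolled loop (BPTT).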

Shortcomings of Recurrent Neural Networks
Recurrent neural networks based on the vanilla cell suffer from unstable training, which prevents the network from grasping long-term dependencies. Recurrent networks, like most neural networks, are trained by gradient descent [67], and backpropagation [67] (backpropagation is denoted Backpropagation Through Time (BPTT) when applied to recurrent neural networks) is used to compute the gradient of the loss function w.r.t. the network's weights. When backpropagation is applied to deep networks, the problems of vanishing or exploding gradients [45] may arise. As the error gradient is back-propagated, some of its components might either get very small, giving a negligible contribution to the corresponding weight update, or very large, leading, in this way, to unstable training. Over the years, several approaches have been proposed to cope with unstable gradients.
Among the most successful approaches are reservoir computing methods, e.g., Echo State Networks [59] (see Section 3.2.3), and gated cells, e.g., Long Short-Term Memory (LSTM) cells [60] and Gated Recurrent Units (GRU) [61]. A gated cell controls how much input information flows through the layer and prevents derivatives from vanishing or getting too large.

Echo State Networks
Echo State Networks (ESNs) were proposed by H. Jaeger [59] in 2001 as a variant of ERNNs. ESNs are really effective in dealing with chaotic multivariate time series [68]. In addition, these networks mitigate the unstable gradient problem and are more computationally efficient due to the use of fixed, random weight matrices for the recurrent layers. Based on the vanilla cell of ERNNs (see Equation (8)), ESNs avoid backpropagation on the recurrent layer by setting V and W as fixed (i.e., non-trainable) random matrices. Furthermore, a given sparsity level is imposed on the matrix W. Although random matrices are an advantage of ESNs in mitigating the unstable gradient problem, they, at the same time, represent a major ESN shortcoming, since they make the application of common interpretability approaches, e.g., gradient-based approaches [69,70], particularly difficult. ESN time series forecasting applications are described in Table 4.
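A minimal ESN sketch: the reservoir weights are fixed and random, and only a linear readout is trained (here by ridge regression). The spectral-radius rescaling, sparsity level, and toy sine task are common heuristics assumed for illustration, not prescriptions from the survey:

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 50, 300                          # reservoir size, series length
series = np.sin(0.2 * np.arange(L + 1))

# Fixed (non-trainable) random weights; W is kept sparse and rescaled so
# its spectral radius is below 1 (a common echo-state heuristic).
V = 0.5 * rng.standard_normal((n, 1))
W = rng.standard_normal((n, n)) * (rng.random((n, n)) < 0.1)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

# Collect reservoir states; no backpropagation through the recurrence.
h, states = np.zeros(n), []
for t in range(L):
    h = np.tanh(V[:, 0] * series[t] + W @ h)
    states.append(h)
H = np.array(states)

# Only this linear readout is trained (ridge regression, one-step ahead).
reg = 1e-6
w_out = np.linalg.solve(H.T @ H + reg * np.eye(n), H.T @ series[1:L + 1])
mse = np.mean((H @ w_out - series[1:L + 1]) ** 2)
print(mse)
```

Because training reduces to a single linear solve, ESNs sidestep BPTT entirely, which is the source of their computational efficiency noted above.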

Long Short-Term Memory
Long Short-Term Memory (LSTM) Networks [60] were originally proposed for natural language modelling. The LSTM cell (see Figure 6) uses three gating mechanisms: input, forget, and output gates. Firstly, the input of an LSTM layer, which is composed of the current input x_t and the output h_{t−1} from the last time step, is combined into the current input vector i_t as follows:

$$i_t = \gamma(W_i x_t + U_i h_{t-1} + b_i)$$

where γ(·) can be any sigmoidal function (e.g., the logistic or hyperbolic tangent ones) and {W_i, U_i, b_i} is a set of parameters. Furthermore, the three gates are computed as:

$$g_t = \sigma(W_g x_t + U_g h_{t-1} + b_g), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$

where σ(·) is the logistic function, g_t, f_t, o_t are the input, forget, and output gates, respectively, and {W_g, U_g, b_g}, {W_f, U_f, b_f}, {W_o, U_o, b_o} are the corresponding sets of parameters. Furthermore, the LSTM's inner state c_t is updated by a linear combination of i_t and the previous inner state c_{t−1}:

$$c_t = f_t \odot c_{t-1} + g_t \odot i_t$$

where ⊙ is the element-wise product. Finally, the output h_t of an LSTM layer is defined as:

$$h_t = o_t \odot \gamma(c_t)$$

LSTM time series forecasting applications are described in Table 5.
Table 5. LSTM applications on time series forecasting.
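A single LSTM step, following the notation of this section (i_t is the candidate input; g_t, f_t, o_t are the gates). The sizes, random parameters, and unrolling length are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step in this section's notation: i_t is the candidate input,
    g_t/f_t/o_t are the input/forget/output gates."""
    i_t = np.tanh(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])
    g_t = sigmoid(P["Wg"] @ x_t + P["Ug"] @ h_prev + P["bg"])
    f_t = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])
    o_t = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])
    c_t = f_t * c_prev + g_t * i_t        # inner state update
    h_t = o_t * np.tanh(c_t)              # layer output
    return h_t, c_t

rng = np.random.default_rng(3)
d, n = 3, 4
P = {}
for k in "igfo":
    P["W" + k] = 0.1 * rng.standard_normal((n, d))
    P["U" + k] = 0.1 * rng.standard_normal((n, n))
    P["b" + k] = np.zeros(n)

h, c = np.zeros(n), np.zeros(n)
for x_t in rng.standard_normal((8, d)):
    h, c = lstm_step(x_t, h, c, P)
print(h.shape)   # (4,)
```

The additive update of c_t (a gated sum rather than a repeated matrix product) is what keeps gradients from vanishing as quickly as in the vanilla cell.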

Gated Recurrent Units
RNNs based on Gated Recurrent Units (GRUs) [61] were introduced for Statistical Machine Translation. A GRU layer, as shown in Figure 7, uses two gating mechanisms: an update and a reset gate. Both the reset and the update gate depend on x_t and h_{t−1}. Analogously to the LSTM, the reset gate r_t and the update gate u_t are computed as:

$$r_t = \sigma(W_r x_t + V_r h_{t-1} + b_r), \quad u_t = \sigma(W_u x_t + V_u h_{t-1} + b_u)$$

where σ(·) is the logistic function and the rest of the parameters have the same meaning as in the LSTM (see Section 3.2.4). Furthermore, an intermediate output h̃_t is given by:

$$\tilde{h}_t = \gamma(W x_t + V (r_t \odot h_{t-1}) + b)$$

where {W, V, b} is an additional set of parameters and ⊙ is the element-wise product. Finally, the output h_t of the GRU layer is given by the sum of h̃_t and h_{t−1}, weighted element-wise by the update gate:

$$h_t = u_t \odot h_{t-1} + (e - u_t) \odot \tilde{h}_t$$

where e is a column vector whose elements are all equal to 1. GRU time series forecasting applications are described in Table 6.

Figure 7. Architecture of a GRU cell. The column vector e is composed of elements that are all equal to 1.
Table 6. Applications on time series forecasting using GRU-based Recurrent Neural Networks.
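A single GRU step can be sketched as below; the convention that the update gate weights the previous output (as in the original formulation by Cho et al.), together with the sizes and random parameters, is assumed for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, P):
    """One GRU step: reset gate r_t, update gate u_t, intermediate output
    h_tilde, then an element-wise convex combination of h_prev and h_tilde."""
    r_t = sigmoid(P["Wr"] @ x_t + P["Vr"] @ h_prev + P["br"])
    u_t = sigmoid(P["Wu"] @ x_t + P["Vu"] @ h_prev + P["bu"])
    h_tilde = np.tanh(P["W"] @ x_t + P["V"] @ (r_t * h_prev) + P["b"])
    return u_t * h_prev + (1.0 - u_t) * h_tilde   # e - u_t as 1 - u_t

rng = np.random.default_rng(4)
d, n = 3, 4
P = {}
for k in ("r", "u", ""):
    P["W" + k] = 0.1 * rng.standard_normal((n, d))
    P["V" + k] = 0.1 * rng.standard_normal((n, n))
    P["b" + k] = np.zeros(n)

h = np.zeros(n)
for x_t in rng.standard_normal((8, d)):
    h = gru_step(x_t, h, P)
print(h.shape)   # (4,)
```

Compared with the LSTM sketch above, the GRU merges the inner state and the output into a single vector h_t and uses one gate fewer, hence fewer parameters.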

Shortcomings of LSTMs and GRUs
It has to be remarked that, even if training is stable, it can be hard for recurrent networks to learn dependencies between distant sequence samples. For instance, a recurrent network that generates an output sequence starting from an input sequence is shown in Figure 8. Supposing that the output element at position t = 60 has a strong dependency on the input at position t = 20, information about the input sample x_20 is useful to predict the output sample y_60. The output sample y_t is predicted starting from h_t, a lossy summary of the past inputs yielded by the recurrent layer; hence, the only way for the output layer to know about x_20 is through h_60. The recurrent layer first captures information about x_20 through h_20, which has to pass through many steps, aggregating information about many other input elements, before reaching h_60. There is no guarantee that, after so many recurrent steps, adequate information about x_20 is preserved in h_60. In fact, h_60 may just contain information about the most recent samples and no information about x_20 at all. The short-term memory of recurrent networks is one of their major drawbacks and one of the main reasons why attention mechanisms and Transformers were originally introduced in deep learning (see Section 4.1).

Hybrids and Variants of Deep Neural Networks
In recent years, specific deep neural networks have been proposed as hybrids or variants of the aforementioned architectures. Hybrid models combine multiple statistical or machine learning methods to improve the robustness and accuracy of forecasting. Works that propose variants of deep neural architectures, properly adapted for time series forecasting tasks, pursue the same goal. Hybrids and variants of deep neural networks share the same limitations as the models they are based on. The most successful approaches are summarised in Table 7.

Graph Neural Networks
A recent promising research direction in multivariate time series forecasting is the application of Graph Neural Networks (GNNs) [124,125]. The original domain of GNNs is the handling of spatial dependencies among entities in a graph-modelled problem. In detail, GNNs can be used to generate representations of such entities that depend on the structure of the graph and on any additional information. A graph G is formally defined as a tuple G = [V, E], where V denotes the set of nodes and E is the set of edges, i.e., the connections between the nodes of the graph, represented, in this case, with an adjacency matrix. The definition of this matrix is based on a metric function that can be either fixed a priori or learned during the training process [125]. The basic idea of a GNN can be summarised by the use of three main operators: aggregate, combine, and readout. The k-th GNN layer performs the aggregate and combine operators. The former consists of agglomerating, for each graph node v ∈ V, information from its neighbourhood N(v), as defined by the adjacency matrix:

$$h^k_{N(v)} = \text{AGGREGATE}^k\big(\{ h^{k-1}_u : u \in N(v) \}\big)$$

where h^{k−1}_u is the feature vector of the u-th neighbouring node of v, yielded by the previous GNN layer k − 1, and h^k_{N(v)} is the aggregated information. The latter combines the aggregated information h^k_{N(v)} with the feature vector h^{k−1}_v of the current node v:

$$h^k_v = \text{COMBINE}^k\big(h^{k-1}_v, h^k_{N(v)}\big)$$

When k = 1, the aggregate operator is not defined, whereas the combine operator reduces to:

$$h^1_v = \text{COMBINE}^1(x_v)$$

where x_v represents the input feature vector associated to the v-th node. Furthermore, the readout operator is applied on the output of the last GNN layer K in order to obtain a final vector h_G representing the whole graph G = [V, E]:

$$h_G = \text{READOUT}\big(\{ h^K_v : v \in V \}\big)$$

In the case of multivariate time series forecasting, GNNs have been successfully applied as feature selection mechanisms. It is important to remark that GNNs could also be applied to spatio-temporal time series forecasting, which is not the object of this survey. GNN time series forecasting applications are described in Table 8.
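The three operators can be sketched with a generic mean-aggregation layer; using neighbour aggregation from the first layer onwards, the adjacency matrix of a toy path graph, and the linear-plus-ReLU combine are all simplifying assumptions for illustration:

```python
import numpy as np

def gnn_layer(H, A, W_self, W_neigh):
    """One GNN layer: AGGREGATE = mean over neighbours (rows of A),
    COMBINE = linear map of self + aggregated features, then ReLU."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    H_neigh = (A @ H) / deg                                  # aggregate
    return np.maximum(0, H @ W_self + H_neigh @ W_neigh)     # combine

def readout(H):
    """READOUT: mean-pool node features into one graph-level vector h_G."""
    return H.mean(axis=0)

# Toy graph: 4 nodes on a path, each with a 2-dimensional input feature x_v.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.]])

rng = np.random.default_rng(5)
W1 = rng.standard_normal((2, 3))
W2 = rng.standard_normal((2, 3))
H1 = gnn_layer(X, A, W1, W2)
h_G = readout(H1)
print(h_G.shape)   # (3,)
```

For multivariate forecasting, each node would typically represent one variable of the series, with the adjacency matrix encoding learned or predefined dependencies between variables.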

Deep Gaussian Processes
Let D = (x_1, x_2, . . ., x_n) be a data set and Y = (y_1, y_2, . . ., y_n) the target output. A Gaussian Process [131] is a Bayesian model composed of a collection of random variables, any finite number of which have a joint Gaussian distribution; it is fully defined by a mean function m(x_i) and a covariance function k(x_i, x_j), which is usually a Mercer kernel [131,132]. The analytical solution of a Gaussian Process entails computing the inverse of the covariance matrix of the observations, which has a computational complexity of O(n^3). A common approach for coping with this computational drawback is the use of Sparse Gaussian Processes [133]. This method consists of considering a reduced set of m (m ≪ n) training samples, the so-called inducing variables, reducing, in this way, the complexity to O(nm^2). A sequence of Gaussian Processes defines a Bayesian model called a Deep Gaussian Process (DGP) [134]. As shown in Figure 9, in DGPs the output of the Gaussian Process located at the previous layer is fed as an input to the Gaussian Process located at the next layer. Unlike the rest of the deep learning techniques, Deep Gaussian Processes not only estimate the value of future time series samples but also provide a confidence interval of the predicted time series value, representing, in this way, the uncertainty of the model. DGP time series forecasting applications are described in detail in Table 9.
Table 9. DGP applications on time series forecasting.

Generative Models
Among the latest trends in deep learning research are the so-called generative models, specifically Generative Adversarial Networks (GANs) and Diffusion Models (DMs). Both families are popular for their groundbreaking capabilities in generating synthetic images. These successes encouraged researchers to apply GANs and DMs to sequential data as well, including time series. As generative models, both methodologies have been used for time series generation tasks. However, they can be adapted to other time series-related tasks as well, specifically time series forecasting. Sections 3.6.1 and 3.6.2 provide an overview of GANs and their usage for time series forecasting, respectively; Sections 3.6.3 and 3.6.4 discuss diffusion models and their applications to time series forecasting, respectively.

Generative Adversarial Networks
A GAN [140] is composed of two separate artificial neural networks: a generator network G that generates synthetic data with the same distribution as the input ones, and a discriminator network D that classifies input data as either real or synthetic. G and D are trained with an adversarial training approach. G takes random noise as input and has to transform the noise into a synthetic data sample following the same distribution as the real data. D receives both real and generated samples and estimates the probability that any given data sample comes from the real data rather than from the generator. The two networks are trained jointly with a minimax two-player game [140]: the discriminator is trained to maximise the correct classification ratio for both real and generated samples, whereas the generator has the goal of tricking the discriminator into misclassifying generated samples, by minimising the correct classification ratio. This training procedure is expressed by the objective function:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_x(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where x is a real data point, sampled from the real data distribution p_x(x); z is a noise vector, sampled from an a priori fixed distribution p_z(z); D(x) is the probability estimated by the discriminator; and G(z) is the sample produced by the generator, starting from the noise z. GANs can be implemented by any neural architecture for the generator and the discriminator. For instance, G and D can be implemented by MLPs [67], as originally proposed in [140], by CNNs (see Section 3.1), with some architectural constraints to stabilise the training procedure [141], or by LSTM (see Section 3.2.4) networks [142].

Generative Adversarial Networks in Time Series Forecasting
In the literature, there are two main approaches for using GANs in time series forecasting: as data augmentation or as an end-to-end forecasting model. In the former case, GANs generate synthetic time series in a given domain (e.g., financial or health-related time series) in order to augment the original small data set. The augmented data set, with both real and generated time series, is then used to train a traditional forecasting model, e.g., a model based on an LSTM network. In the latter case, the forecasting model is a GAN itself, and it generates future samples starting from previous ones and, eventually, other exogenous inputs (in time series forecasting, a variable is said to be exogenous if it is not a time series sample, but it can still affect the time series samples; for instance, temperature may be an exogenous variable in rainfall time series forecasting) [143]. For example, in [144] a GAN is firstly used to augment a building energy consumption data set and then an ensemble of five traditional predictive models is trained on such an augmented data set. In particular, to augment the data set, Conditional GANs (CGANs) [145] and Information Maximising GANs (InfoGANs) [146] are compared with each other. Similarly, in [147] COVID-19 epidemic data is generated by a custom GAN with an LSTM generator and an MLP discriminator. Furthermore, a Transformer is used as a forecasting model trained on the GAN-augmented data set. Moreover, some GAN-based models have been specifically developed for time series generation, e.g., QuantGAN [148], for financial time series with long-term time dependencies, SynSigGAN [149], for continuous biomedical signals, Recurrent Conditional GANs (RCGANs) [150], for medical data, TimeGAN [151], a framework for domain-agnostic time series generation, Conditional Sig-Wasserstein GAN (Sig-WCGAN) [152], and TTS-GAN [153], entirely based on Transformers. Some of the aforementioned GANs, e.g., RCGAN, TimeGAN, and Sig-WCGAN, are conditional GANs [145],
i.e., time series are not generated from random noise only but are also conditioned on the real time series and/or related information, e.g., exogenous inputs, for improving the quality of the generated time series. The use of conditional GANs is popular for end-to-end forecasting, where the generated time series window, typically in the short-term future, is often conditioned on previous samples and on other exogenous inputs (see [154][155][156][157][158]). Table 10 collects some works on GAN applications for time series forecasting.
Table 10. GAN applications on time series forecasting.

Diffusion Models
A new family of generative architectures is that of diffusion models. They have been showing cutting-edge performance in generating samples that reflect the observed data in different domains, e.g., image generation, video generation, and text synthesis. Three key formulations are used to develop diffusion-based approaches for short-term time series applications: denoising diffusion probabilistic models (DDPMs) [173,174], score-based generative models (SGMs) [175], and stochastic differential equations (SDEs) [176]. Diffusion models aim to approximate a generative process G that generates new samples drawn from an underlying distribution q(x), given some observed data x from the same distribution. To approximate G, in the forward process, the majority of diffusion models progressively inject Gaussian noise into the observed data. Furthermore, a reverse process is applied, by a learnable transition kernel, to reconstruct the original data. Most diffusion models assume that, after a finite number of noise injection steps, the distribution q(x) of the observed data becomes a Gaussian distribution. Therefore, the goal of diffusion models is to find the probabilistic process P that approximates the distribution q(x) of the original data starting from the Gaussian distribution. In this manner, any sample of the Gaussian distribution can be transformed by P into a new sample of the distribution q(x) of the observed data x. The principle of diffusion models is to progressively perturb the observed data x with random noise by a forward diffusion process F, before recovering the original data using a backward reverse diffusion process B.
A deep neural network is used in the B process to approximate the amount of noise that must be removed in the denoising steps to recover the original data. For the sake of readability, the theoretical foundations of diffusion models and their main architectures are omitted in this section and moved to Appendix B, whereas diffusion models for short-term time series forecasting are described in the following subsection.
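The forward process F described above admits a closed form in the DDPM formulation; the following sketch noises a toy time series with a linear schedule (the number of steps, the schedule endpoints, and the toy signal are illustrative assumptions):

```python
import numpy as np

# Linear noise schedule over K forward steps (a common DDPM-style choice).
K = 100
betas = np.linspace(1e-4, 0.1, K)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def forward_diffuse(x0, k, rng):
    """Closed-form forward step: x_k = sqrt(abar_k) x0 + sqrt(1-abar_k) eps,
    so q(x_k | x_0) is Gaussian for any step k."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[k]) * x0 + np.sqrt(1.0 - alpha_bar[k]) * eps

rng = np.random.default_rng(7)
x0 = np.sin(0.3 * np.arange(64))        # a toy time series "sample"
x_mid = forward_diffuse(x0, 10, rng)    # mildly corrupted
x_end = forward_diffuse(x0, K - 1, rng) # nearly pure Gaussian noise
print(alpha_bar[0], alpha_bar[-1])
```

The learned reverse process B would run the opposite direction, with a neural network predicting the noise to subtract at each step.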

Diffusion Models in Short-Term Time Series Forecasting
In recent years, several diffusion-based approaches for time series forecasting have been proposed. They are based on the three predominant diffusion model methods described in Appendix B. The first prominent diffusion model architecture for time series forecasting is TimeGrad [177], which is a DDPM variant. The forward process of TimeGrad injects noise into the data at each predictive sample, and then denoises gradually through the backward process, conditioned on the previous time series samples. For encoding the previous time series samples, TimeGrad uses an RNN architecture, e.g., LSTM (see Section 3.2.4) or GRU (see Section 3.2.5). The objective function of TimeGrad is the negative log-likelihood

$$\mathcal{L} = \sum_{t=t_0}^{T} -\log p_\theta(x_t \mid h_{t-1})$$

where [t_0, T] is the prediction interval and h_{t−1} is the hidden state encoding the previous samples. This objective can be reformulated by considering its variational lower bound, as in DDPMs. The parameters θ are estimated during training, minimising the negative log-likelihood objective function with stochastic sampling. Furthermore, future time series samples are generated with a step-by-step procedure. The observation for the next sample at time T + 1 is predicted in a similar way as in DDPMs (see Appendix B). Similarly, the ScoreGrad model [178], based on the same target distribution as TimeGrad, defines a continuous diffusion process using SDEs (see Appendix B). ScoreGrad consists of two modules: the former is a feature extraction module, almost identical to that of TimeGrad (e.g., an RNN) or an attention-based network, e.g., a Transformer (see Section 4.1), for computing the hidden state h_t of the previous time samples; the latter is a conditional SDE-based score-matching module. The objective function of ScoreGrad sums, over the prediction interval, per-step score-matching losses L_t(θ), written in the general SDE formulation for the sake of simplicity. Recently, time series research has paid attention to avoiding model overfitting phenomena in the forecasting of short time series. D3VAE [179] tries to address the problem of short time series by applying a coupled diffusion process for time series data augmentation, and then it employs a bidirectional variational autoencoder (BVAE) as a score model. Moreover, D3VAE takes into account the decoupling of latent variables, by reducing their total correlation, to improve prediction interpretability and stability. Furthermore, the objective function of D3VAE includes the mean square error (MSE) between the forecast and the current samples in the prediction window, which highlights the requirement of supervision. Unlike TimeGrad, D3VAE injects noise separately into the previous samples (context) x^0_{1:t_0−1} and into the prediction window x^0_{t_0:T} by the coupled diffusion process, in which each noisy version is obtained from the clean data and a standard Gaussian noise ε. Short time series forecasting benefits from the simultaneous augmentation of the context and of the prediction window provided by the diffusion process. The B process is made up of two steps. The former forecasts x^k_{t_0:T} with a BVAE model, considering the context x^k_{1:t_0−1}. The latter denoises the output x̂^k_{t_0:T} of the BVAE with a denoising score-matching module driven by an energy function E(x̂^k_{t_0:T}; e). The objective function of D3VAE is composed of four losses: the MSE term and three further terms weighted by the regularisation parameters λ_1, λ_2, λ_3, which control, respectively, the divergence between the target distribution and the distribution of the prediction window, the denoising score-matching objective, and the total correlation among the latent variables. Diffusion models for time series forecasting are summarised in Table 11.

Deep Learning Models for Long-Term Forecasting
In long-term forecasting, the skeleton of a time series can be approximated by using the Transformer architecture. Firstly, the original Transformer architecture (Section 4.1) is described and attention mechanisms are presented (Sections 4.1.1 and 4.1.2). Furthermore, the main limitations of Transformers are discussed (Section 4.1.3), and variants of the Transformer, properly designed to cope with long-term time series forecasting tasks, e.g., Informer, Autoformer, FEDformer and Crossformer, are presented (Section 4.1.4).

Transformers
The Transformer [181] is a deep learning architecture borrowed from Natural Language Processing. It can be described as "a model architecture eschewing recurrence and relying entirely on attention mechanisms to draw global dependencies between input and output" [181]. The Transformer architecture was proposed to overcome the two main drawbacks of recurrent models (see Sections 3.2.2 and 3.2.6) in sequence modelling tasks:

1. The output state h_t of a recurrent layer at time t depends on the state h_{t-1}, produced at the previous time step. This inherent sequential nature prohibits intra-sequence parallelism in recurrent networks.

2. Recurrent networks cannot generally learn relationships between distant samples of a sequence, since information must first pass through all data samples in between (see Figure 8).
The standard Transformer follows a general encoder-decoder architecture for sequence-to-sequence transduction, as shown in Figure 10.
In time series forecasting, the Transformer's input is a time-ordered sequence of past samples $\mathcal{X} = [x_1, x_2, \ldots, x_L]$, where L is the sequence length and $x_i \in \mathbb{R}^d$ is the i-th sample of a d-dimensional multivariate time series. Due to the use of attention mechanisms, Transformers make no assumption about any intrinsic temporal or spatial ordering of input elements; namely, inputs are seen as a set of samples rather than an ordered sequence of samples. If there is an input ordering that is relevant for the modelling task, e.g., time series forecasting, the ordering should be encoded in the input embedding. In Transformers, this is commonly achieved by summing a positional embedding $E_{\mathrm{pos}}$ to the main sample embedding $F(\mathcal{X})$ [181]:

$$Z = F(\mathcal{X}) + E_{\mathrm{pos}},$$

where the matrix $F(\mathcal{X}) \in \mathbb{R}^L \otimes \mathbb{R}^D$ represents a projection of the input sequence onto a higher D-dimensional space (D > d). (Differently from what appears in some machine learning papers, the more precise tensor product notation is used throughout the work for representing matrices.) In time series forecasting, a 1D convolutional layer with D learned kernels is commonly used, as described in Section 3.1, in order to extract a D-dimensional representation for each sample in $\mathcal{X}$ [181][182][183][184]. $E_{\mathrm{pos}}$ can either be a learned embedding or a fixed embedding. A naive yet effective solution consists of using a sinusoidal position encoding [181]. However, in time series forecasting, other positional embeddings can be used as well, e.g., temporal-based embeddings [182][183][184]. The encoder and the decoder can have two different, separate embeddings, or they can share the same embedding if input and output samples belong to the same set. In time series forecasting, the encoder input is the complete sequence of past samples $\mathcal{X}$, while the decoder input is commonly composed of the most recent part of $\mathcal{X}$ (e.g., the second half of $\mathcal{X}$, i.e., $[x_{L/2}, x_{L/2+1}, \ldots, x_L]$) and a zero-vector whose length is equal to the prediction length P, see Equation (1). The encoder and decoder are composed of $N_e$ and $N_d$ stacked layers, respectively (see Figure 10). The output of a layer is the input for the next layer. Each encoder layer has two sublayers: a self-attention layer, which relates each input sample with the rest of the samples, and a shallow feed-forward dense layer, shared along the sequence axis, which works as a nonlinear projection layer. To foster gradient propagation and training, each sublayer's input is added to its own output with a residual connection [185], and layer normalization [186] is used to normalise the samples of the resulting sequence into a normal distribution with a learned mean and standard deviation. Each decoder layer follows the same overall structure as a generic encoder layer, but it has one additional sublayer. The first sublayer implements a particular kind of self-attention mechanism, the so-called causal (or masked) self-attention [181]. It works similarly to the encoder layer's self-attention, as each input sample is related to the others in the decoder's input sequence, but it also uses masking to prevent future samples from being considered when the current sample is processed. Furthermore, the output of the causal self-attention sublayer is related to the encoder's hidden representation (that is, the output of the final encoder layer) by a cross-attention layer. As in encoder layers, a Multi-Layer Perceptron [67] with one hidden layer is used for projecting nonlinearly the output of cross-attention. Moreover, each sublayer is wrapped by a residual connection followed by layer normalization. Finally, the output of the last decoder layer is fed to a final prediction layer.
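The fixed sinusoidal positional embedding of [181] can be computed directly. The function below is a minimal NumPy sketch (assuming an even embedding dimension D); the random matrix standing in for the convolutional projection F(X) is purely illustrative:

```python
import numpy as np

def sinusoidal_positional_embedding(L, D):
    """Fixed sinusoidal positional encoding from the original Transformer:
    E_pos[p, 2i]   = sin(p / 10000^(2i / D))
    E_pos[p, 2i+1] = cos(p / 10000^(2i / D))   (D assumed even)."""
    positions = np.arange(L)[:, None]               # (L, 1)
    i = np.arange(0, D, 2)[None, :]                 # even dimension indices
    angles = positions / np.power(10000.0, i / D)   # (L, D/2)
    E = np.zeros((L, D))
    E[:, 0::2] = np.sin(angles)
    E[:, 1::2] = np.cos(angles)
    return E

# The embedded input is the sample projection plus the positional term.
L, D = 96, 64
X_proj = np.random.default_rng(0).standard_normal((L, D))  # stands in for F(X)
Z = X_proj + sinusoidal_positional_embedding(L, D)          # encoder input
```

Because each position receives a distinct pattern of phases, the otherwise order-agnostic attention layers can recover the temporal ordering of the samples.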

Figure 10. Transformer architecture. On the left side, the encoder processes an input sequence, producing a hidden representation. On the right side, the decoder uses the encoder's output to generate the output sequence. The decoder works in an autoregressive way, consuming past generated samples as additional inputs to generate the next output sample.

Attention Mechanisms
The most important computational blocks of a Transformer are attention mechanisms, which allow the model to focus its attention on specific parts of the input, depending on the information being processed. Among the various definitions of attention, Transformers adopt the so-called scaled dot-product attention, which is very similar to multiplicative attention [187]. Attention mechanisms operate on the following elements: a set of queries $Q \in \mathbb{R}^M \otimes \mathbb{R}^{D_k}$ that represents the information being processed by the model, and sets of keys $K \in \mathbb{R}^N \otimes \mathbb{R}^{D_k}$ and values $V \in \mathbb{R}^N \otimes \mathbb{R}^{D_v}$, where $D_k$ and $D_v$ denote the dimensions of the spaces onto which queries, keys and values are projected. Moreover, N denotes the cardinality of both keys and values, while M is the cardinality of the input queries. The output Y for all queries is computed as follows:

$$Y = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{D_k}}\right) V. \tag{33}$$

The attention output $Y \in \mathbb{R}^M \otimes \mathbb{R}^{D_v}$ is a matrix whose i-th row contains the output vector for the i-th query. Note that the softmax in Equation (33) is applied row-wise to its input matrix. Where do these queries, keys and values come from? First of all, keys and values are often the same vectors, i.e., a value vector coincides with its key. Furthermore, as described in Section 4.1, the Transformer performs attention in two ways, self-attention and cross-attention. In self-attention, queries, keys and values come from the same vectors; in cross-attention, queries come from the previous decoder sublayer, while key and value vectors are given by the encoder's hidden representation.
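Equation (33), together with the causal masking used in the decoder's first sublayer, can be sketched directly in NumPy (a minimal illustration with toy data, not a production implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Y = softmax(Q K^T / sqrt(D_k)) V, softmax applied row-wise (Eq. (33)).
    Q: (M, D_k), K: (N, D_k), V: (N, D_v); optional boolean mask (M, N)
    with False marking key positions a query must not attend to."""
    D_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D_k)                  # (M, N)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # block masked positions
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
# Self-attention: queries, keys and values come from the same sequence.
X = rng.standard_normal((5, 8))                      # 5 samples, D_k = D_v = 8
Y, W = scaled_dot_product_attention(X, X, X)

# Causal (masked) self-attention, as in the decoder's first sublayer:
causal = np.tril(np.ones((5, 5), dtype=bool))        # no attention to the future
Y_c, W_c = scaled_dot_product_attention(X, X, X, mask=causal)
```

In the masked variant, the attention-weight matrix `W_c` is lower triangular, so sample i is computed only from samples 1, ..., i.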

Multi-Head Attention
Multi-Head Attention (MHA) is a more advanced version of the aforementioned scaled dot-product attention. As discussed in [181], the scaled dot-product attention permits a network to attend over a sequence. However, there are often multiple different aspects a sequence element wants to attend to, and a single weighted average is not an adequate option for this. This motivates the extension of the scaled dot-product attention to MHA, which allows the model to jointly attend to information from diverse representation subspaces, as shown in Figure 11. In MHA, keys, values and queries are linearly projected H separate times, by three learned projection matrices, onto spaces of dimensions $D_k$, $D_v$ and $D_k$, respectively. Furthermore, a scaled dot-product attention is applied to each of these projections, and the results are concatenated together and re-projected onto the previous layer space. Each projection-attention pair defines a so-called attention head $h_i$. For the sake of simplicity, keys, values and queries are assumed to have the same dimension D. Each attention head $h_i$ has three learned matrices, $W_i^K \in \mathbb{R}^D \otimes \mathbb{R}^{D_k}$, $W_i^V \in \mathbb{R}^D \otimes \mathbb{R}^{D_v}$ and $W_i^Q \in \mathbb{R}^D \otimes \mathbb{R}^{D_k}$, used to project keys, values and queries, respectively. Each attention head applies a scaled dot-product attention (see Equation (33)) to the projected keys, values and queries (see Section 4.1.1):

$$h_i = \mathrm{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right).$$

Finally, the attention output Y is given by:

$$Y = \mathrm{Concat}(h_1, \ldots, h_H)\, W^O,$$

where the outputs $h_i$ from all attention heads are concatenated into a single $\mathbb{R}^M \otimes \mathbb{R}^{H D_v}$ matrix and then re-projected linearly onto the original D-dimensional space via an additional projection matrix $W^O \in \mathbb{R}^{H D_v} \otimes \mathbb{R}^D$.
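The project-attend-concatenate-reproject pipeline above can be sketched as follows (a minimal NumPy illustration with randomly initialised, untrained projection matrices and toy sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

def mha(X_q, X_kv, Wq, Wk, Wv, Wo):
    """Multi-head attention: project per head, attend, concatenate, re-project.
    Wq, Wk, Wv: (H, D, D_h) per-head projections; Wo: (H * D_h, D)."""
    H = Wq.shape[0]
    heads = []
    for i in range(H):
        Q, K, V = X_q @ Wq[i], X_kv @ Wk[i], X_kv @ Wv[i]
        scores = Q @ K.T / np.sqrt(K.shape[-1])       # scaled dot-product
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
        heads.append(w @ V)                           # (M, D_h) head output
    return np.concatenate(heads, axis=-1) @ Wo        # back to (M, D)

M, D, H, D_h = 6, 16, 4, 4                            # here H * D_h = D
X = rng.standard_normal((M, D))
Wq, Wk, Wv = (rng.standard_normal((H, D, D_h)) * 0.1 for _ in range(3))
Wo = rng.standard_normal((H * D_h, D)) * 0.1
Y = mha(X, X, Wq, Wk, Wv, Wo)                         # self-attention, H heads
```

Each head attends to the sequence in its own learned subspace, and the final projection `Wo` mixes the H head outputs back into a single D-dimensional representation per sample.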

Shortcomings of Transformers
There are three main shortcomings of Transformers. Firstly, Transformers are locally agnostic, that is, the scaled dot-product of the attention mechanism (see Equation (33)) is insensitive to the local context, which can make the model prone to anomalies in time series forecasting [188]. Furthermore, Transformers suffer from a memory bottleneck, i.e., their space complexity is O(L²) in the sequence length L [188]. Similarly, Transformers have the same O(L²) time complexity, limiting their application to long-term forecasting. These shortcomings are faced by some of the Transformer variants described in the following section.
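A back-of-envelope illustration of the quadratic memory bottleneck: the attention score matrix of Equation (33) has shape (L, L) per layer and head, so its footprint grows quadratically with the sequence length (the lengths below are merely typical look-back windows, not values from any cited paper):

```python
import numpy as np

def score_matrix_bytes(L, dtype=np.float32):
    """Memory needed for one (L, L) attention score matrix."""
    return L * L * np.dtype(dtype).itemsize

mem_96 = score_matrix_bytes(96)     # short look-back window
mem_720 = score_matrix_bytes(720)   # long look-back window
ratio = mem_720 / mem_96            # (720 / 96)^2 = 56.25x per layer and head
```

Multiplied by the number of layers, heads and batch size, this quadratic growth is what motivates the sparse and low-rank attention variants discussed next.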

Transformer Variants for Time Series Forecasting
In recent years, many variants of the naive Transformer [181], specific to time series forecasting, have been proposed. The key innovations that these variants suggest concern the embedding layer, attention mechanisms and even the encoder-decoder structure. Most of the literature has focused on the design of alternative attention mechanisms that are more suitable for time series forecasting tasks. One of the first such works is the LogTrans [188], which handles two limitations of the traditional Transformer: local agnosticism and the memory bottleneck (see Section 4.1.3). The former limitation is tackled using causal convolutions (see Section 3.1.2) to generate keys and queries in the self-attention module. For the latter, a log-sparse mask is considered in order to reduce the computational complexity (see Section 4.1.3) of multi-head attention. Inspired by the idea of LogTrans, another variant, the Informer [182], defines a new sparsity measure to characterise a subset of the most informative queries before applying attention. In addition, this strategy also allows for a reduction of the computational complexity of attention mechanisms. Unlike LogTrans and Informer, the Autoformer [183] replaces the standard scaled dot-product attention with an autocorrelation mechanism. Additionally, a decomposition module is employed to break down the time series into trend and seasonal components, implicitly assuming that they exist and are additive. The autocorrelation mechanism measures the time-delay similarity between input signals and aggregates the top-k similar sub-series to produce the output. The FEDformer [184], based on the work of Linformer [189], applies attention to a low-rank approximation of the input, based on Restricted Isometry Property (RIP) matrix theory [190]. First, it represents the input signal in a frequency domain (either Fourier or Wavelet). Furthermore, it achieves a linear complexity by applying simplified attention mechanisms on a randomly selected subset of frequencies with a fixed size m. Recently, research efforts have moved from attention mechanisms to input representation, specifically concerning how to relate the dimensions of a multivariate time series and how to project the input sequence into a latent representation. PatchTST [191] assumes channel independence, i.e., independence among the dimensions of the input multivariate time series, processing each dimension as a univariate time series. PatchTST segments each input sequence into shorter, local sub-sequences that are fed as input samples to a naive Transformer encoder [181]. All time series dimensions are implicitly related via the sharing of the encoder weights. A similar consideration is adopted by the Crossformer [192], which segments each dimension of the input time series into non-overlapping shorter sub-sequences. Unlike PatchTST, however, the Crossformer explicitly defines the relations among all dimensions using a Two-Stage Attention (TSA) mechanism. Furthermore, Crossformer follows a hierarchical encoder-decoder architecture, in which multiple layers of TSA are used to capture relations at multiple time scales. Another relevant work is the Pyraformer [193], which proposes a Pyramidal Attention Module (PAM) to capture long-term dependencies while achieving a complexity that is linear in the sequence length. Essentially, PAM applies the classic scaled dot-product attention in a sparse fashion, according to a pyramidal graph, built using a cascade of strided convolutions, that defines a multi-scale representation of the input sequence. In PAM, each node of the graph is a query and it can attend only to those nodes (keys) that are its direct neighbours in the graph. In this way, Pyraformer is able to capture both short-term and long-term dependencies while still achieving a linear complexity. Similarly to Pyraformer, the Scaleformer [194] addresses the importance of multi-scale dependencies in time series forecasting. The approach is orthogonal to many time series Transformers and, as such, it has been empirically evaluated with some of the aforementioned models, such as the Autoformer [183] and the FEDformer [184].
Given an input past sequence and the corresponding target sequence, the main idea is to apply one of the above-mentioned Transformer models multiple times at multiple time scales. At a given scale $s_i$, the input to the encoder is the original look-back window downsampled by a factor $s_i$ using average pooling; the input to the decoder is given by the model forecast at the previous scale $s_{i-1}$, upsampled by a fixed factor s through linear interpolation. To mitigate error propagation and distribution shifts that are due to repeated upsampling operations, the encoder's and decoder's inputs are first normalised using Cross-Scale Normalization. Finally, a loss function, based on an adaptive loss, is applied at each time scale between the model forecast and the target sequence, which is also downsampled by a factor $s_i$ via average pooling. The Triformer [195] proposes a particular architecture that integrates attention mechanisms and recurrent units to ensure high efficiency and accuracy. The former is achieved by a patch attention mechanism with linear complexity; the latter is obtained by using variable-specific parameters. The patch attention mechanism splits the input sequence into P patches of length S and assigns a learnable pseudo-timestamp $T_p$ to each patch. When patch attention is applied, each pseudo-timestamp $T_p$ is considered as a query Q for its patch only. Moreover, variable-specific parameters are introduced by factorising the projection matrices (i.e., $W^K$ and $W^V$) into three matrices: a left variable-agnostic matrix $L \in \mathbb{R}^D \otimes \mathbb{R}^a$, a middle variable-specific matrix $B \in \mathbb{R}^a \otimes \mathbb{R}^a$ and a right variable-agnostic matrix $R \in \mathbb{R}^a \otimes \mathbb{R}^D$, where $a \ll D$.
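A back-of-envelope comparison shows why such a factorisation with $a \ll D$ pays off when every variable of a multivariate series gets its own parameters (the sizes below are illustrative, not taken from the Triformer paper):

```python
# Parameter count of per-variable full projections vs. a factorised
# W = L B R with shared variable-agnostic L, R and a small per-variable B.
D, a, n = 256, 8, 21                     # illustrative sizes, a << D

full_per_variable = n * D * D            # one D x D matrix per variable
factorised = 2 * (D * a) + n * (a * a)   # shared L and R, one B per variable
# factorised is orders of magnitude smaller than full_per_variable
```

The variable-specific information is thus confined to the tiny $a \times a$ middle matrices, while the expensive $D$-dimensional projections are shared across all variables.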
Finally, to cope with the limited temporal receptive field that is due to the patch mechanism, recurrent units are used to aggregate and control the information for all pseudo-timestamps of each layer before the final prediction. All the above-mentioned variants of the Transformer share the over-stationarization problem, which consists in the inability to generate distinguishable attention scores when trained on stationarized series [196]. The Non-stationary Transformer [196] proposes a generic framework to overcome the problem of over-stationarization. This framework is composed of two interdependent modules: Series Stationarization and De-stationary Attention. The former attenuates the non-stationarity of the considered time series using two sequential operations: a Normalization module, which computes the mean and the variance of each input time series in order to transform it into a stationary one, and a De-normalization module, which transforms the model outputs back into a non-stationary time series, using the mean and variance computed by the previous module. The latter is a novel attention mechanism, which can approximate the attention scores that would be obtained without stationarization and discover the particular temporal dependencies of the original non-stationary data. Transformer variants for time series forecasting are described in detail in Table 12. Further details on each Transformer variant can be found in the original paper that presents the architecture.
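The Normalization/De-normalization pair of Series Stationarization can be sketched in a few lines (a minimal NumPy illustration; the model in between is replaced by a placeholder, and the statistics are simple per-series mean and standard deviation):

```python
import numpy as np

def stationarize(x, eps=1e-5):
    """Normalization module: remove each series' mean and scale so the
    model sees an (approximately) stationary input. x: (time, variables)."""
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True) + eps
    return (x - mu) / sigma, mu, sigma

def destationarize(y, mu, sigma):
    """De-normalization module: restore the removed statistics to the model
    output, bringing the forecast back to the original scale."""
    return y * sigma + mu

rng = np.random.default_rng(0)
x = rng.standard_normal((96, 3)) * 5.0 + 10.0   # input far from zero mean/unit scale
x_norm, mu, sigma = stationarize(x)
# ... a forecasting model would consume x_norm and emit a normalised forecast ...
forecast_norm = x_norm[-24:]                    # placeholder for a model output
forecast = destationarize(forecast_norm, mu, sigma)
```

The De-stationary Attention module then compensates, inside the attention scores, for exactly the statistics that this normalisation removed.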

Other Relevant Deep Learning Models
This section is reserved for those works that propose interesting architectures for short-term and long-term time series forecasting which do not fit the previously defined categories. Even though these models might share the same building blocks as well-known architectures (e.g., CNN, TCN, RNN, Transformer), due to their peculiarity and heterogeneity it has been decided to collect them in a dedicated section. In [197], a Continuous Recurrent Unit (CRU), based on stochastic differential equations (SDEs) and Kalman filters, that can handle irregularly sampled time series is proposed. The FiLM (Frequency-improved Legendre Memory) model [198] generates a representation of past time series samples by Legendre projections. It uses a noise reduction module based on Fourier analysis to preserve only relevant information from previous time samples, reducing the effect that noisy signals can have on time series forecasting. The work in [199] shows how a time series forecasting model fully based on MLPs can compete with state-of-the-art Transformer models (e.g., PatchTST [191]), reducing, in this way, the forecasting computational cost. In detail, it proposes an adaptation (TSMixer) of MLP-Mixer, originally proposed for the vision domain, to time series forecasting. A convolution-based architecture, MICN (Multi-scale Isometric Convolution Network) [200], can discover local patterns and global correlations in time series by using a multi-branch structure, relying on downsampled 1D convolutions to extract local features and on isometric convolutions, a particular case of causal convolution (see Section 3.1.2), to discover global correlations. Table 14 summarises the aforementioned models.
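The causal convolution that MICN's isometric convolutions specialise (see Section 3.1.2 and Figure 4) can be sketched directly: left-padding with q − 1 zeros makes output[t] depend only on inputs up to time t while preserving the sequence length. This is a minimal NumPy illustration with an arbitrary kernel, written in the correlation-style convention common in deep learning:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1D convolution: pad with q - 1 zeros on the left so that
    output[t] depends only on x[t - q + 1 .. t] (never on the future),
    and the output keeps the input length (cf. Figure 4)."""
    q = len(kernel)
    padded = np.concatenate([np.zeros(q - 1), x])
    return np.array([padded[t:t + q] @ kernel for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([0.25, 0.25, 0.5])      # kernel of size q = 3
y = causal_conv1d(x, k)              # same length as x; y[0] sees only x[0]
```

For example, the first output is 0.5 · x[0] (the two padded positions contribute nothing), while later outputs are weighted sums of the preceding window of q samples.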

Benchmarks for Time Series Forecasting
Recently, a group of time series data sets have emerged as benchmarks for assessing the performance of machine learning models in time series forecasting tasks. This section describes the most relevant benchmarks for both short-term and long-term forecasting.

Benchmarks for Short-Term Forecasting
Among the several different data sets used for short-term forecasting, the most popular ones are described in Table 15. It is worth mentioning the M4 data set [44], proposed for the homonymous M4 competition as a common benchmark for evaluating the performance of short-term forecasting models. The M4 data set contains 100,000 time series subdivided according to their data frequency into six groups: M4-Yearly, M4-Quarterly, M4-Monthly, M4-Weekly, M4-Daily and M4-Hourly. Furthermore, time series are also categorised into six domains, namely, Demographic, Finance, Industry, Macro, Micro and Other. Some insights on how time series are distributed into these categories are given in Figure 12.
Table 15. Short-term forecasting data sets. The column Dim refers to the dimensionality of time series.

Benchmarks for Long-Term Forecasting
Nowadays, a group of specific data sets has become the de facto benchmark [183] to assess the long-term forecasting accuracy of all the Transformer variants presented in Section 4.1.4. In detail, this benchmark is composed of nine multivariate time series data sets concerning the following domains: electricity, transportation, weather, exchange rate and illness (see Table 16). The time resolution varies from 10 min up to 7 days.

Conclusions
The paper has reviewed deep learning architectures for time series forecasting, underlining their advances. Nevertheless, four major problems remain open. The first resides in the inability of most deep learning methods, with the exception of Deep Gaussian Processes, to estimate a confidence interval for the time series prediction. In principle, all deep learning architectures quoted in this survey could be properly modified using Bayesian training strategies [206] in order to provide the uncertainty of the model prediction, but, to the best of our knowledge, this has not been performed yet. The second problem resides in the development of more and more complex deep learning architectures, which makes them subject to overfitting. Therefore, the development of deep learning architectures for time series forecasting that are robust with respect to overfitting is becoming more and more relevant. The third problem consists in the need for adequately long time series. In particular, some deep learning architectures, e.g., Transformers, require the estimation of millions of parameters, implying the necessity of adequately long time series for estimating them. This problem seems to be partially addressed by data augmentation, but the proposed solutions are not fully adequate yet. Finally, the last problem emerges in most of the reviewed deep learning models: they assume the dynamical stationarity of time series, i.e., that the dynamical system generating the time series is stationary over time. When this assumption is violated, a concept drift phenomenon [207] is observed in the time series, consequently leading to a dramatic decrease in the prediction accuracy of deep learning models for time series forecasting.
$$q(\mathbf{x}^k \mid \mathbf{x}^{k-1}) = \mathcal{N}\big(\mathbf{x}^k;\ \sqrt{1 - a_k}\, \mathbf{x}^{k-1},\ a_k \mathbf{I}\big), \qquad k = 1, \ldots, K,$$

where K is the finite number of noise levels of the forward process, $a_k \in (0, 1)$ for $k = 1, 2, \ldots, K$ are hyperparameters indicating the variance of the noise level at each step, $\mathcal{N}(\cdot, \cdot)$ is the Gaussian distribution, and I is the identity matrix. The Gaussian transition kernel $q(\cdot \mid \cdot)$ has a fundamental property that allows obtaining $\mathbf{x}^k$ directly from the original sample $\mathbf{x}^0$:

$$q(\mathbf{x}^k \mid \mathbf{x}^0) = \mathcal{N}\big(\mathbf{x}^k;\ \sqrt{\bar{a}_k}\, \mathbf{x}^0,\ (1 - \bar{a}_k) \mathbf{I}\big), \qquad \bar{a}_k = \prod_{i=1}^{k} (1 - a_i).$$

Figure 2. An example of Convolutional Neural Network applied to time series forecasting. The red, the blue and the green boxes represent the CNN's convolutional layers.

Figure 4. Causal 1D convolution with a kernel of size q = 3. Zero padding elements are added at the beginning of the input sequence in order to have an output sequence with the same length as the input sequence.

Figure 6. Long Short-Term Memory cell architecture.

Figure 8. The red path in the recurrent model denotes the flow that information about an input sample ($x_{20}$) must follow before reaching an output layer ($y_{60}$) that might require it.

Figure 9. Deep Gaussian Processes. In the figure, $\mathcal{X}$ and $\mathcal{Y}$ represent the data set and the desired output, respectively, $f^{(i)}$ is the latent function to be estimated, and $m_i$, $k_i$ represent the mean and covariance function of the i-th layer.

Figure 11. Multi-Head Attention with H heads. Comparison of a single scaled dot-product attention (a) and multi-head attention with H attention heads (b).

Figure 12. Composition of the M4 data set. (a,b) show, respectively, the distribution of M4-Monthly and M4-Yearly time series into the six M4 categories. (c) shows the number of time series in each of the M4 data sets.

Table 1. Example of 1D convolution using a kernel of size k = 3. The scalar product is denoted by •. The yellow boxes denote the learned kernel. Recent applications on time series forecasting using Convolutional Neural Networks.

Table 2. Time series forecasting applications using Temporal Convolutional Networks.

Table 3. Elman RNN applications for time series forecasting.

Table 4. ESN applications on time series forecasting.

Table 7. Applications in time series forecasting using variants and hybrids of common deep neural networks. The symbol '+' denotes a combination of multiple models or methodologies. The acronyms GARCH and ANN stand for Generalized AutoRegressive Conditional Heteroskedasticity [105] and Artificial Neural Networks, respectively.

Table 8. GNN applications on time series forecasting.

Table 11. Recent diffusion models for time series forecasting.

Table 12. Recent variants of the Transformer architecture for time series forecasting.

Table 14. Recent applications on time series forecasting using other deep learning architectures.

Table 16. Long-term forecasting benchmark data sets. The data set size (•, •, •) refers to the cardinality of the training, validation and test set, respectively. The columns dim, pred len and time res refer to the dimensionality of time series, the number of predicted future samples and the time resolution, respectively.