Entanglement-Structured LSTM Boosts Chaotic Time Series Forecasting

Traditional machine-learning methods are inefficient at capturing chaos in nonlinear dynamical systems, especially when the time difference ∆t between consecutive steps is so large that the extracted time series looks apparently random. Here, we introduce a new long short-term memory (LSTM)-based recurrent architecture by tensorizing the cell-state-to-state propagation therein, maintaining the long-term memory feature of LSTM while simultaneously enhancing the learning of short-term nonlinear complexity. We stress that the global minima of training can be most efficiently reached by our tensor structure, in which all nonlinear terms, up to some polynomial order, are treated explicitly and weighted equally. The efficiency and generality of our architecture are systematically investigated and tested through theoretical analysis and experimental examination. In our design, we have explicitly used two different many-body entanglement structures, matrix product states (MPS) and the multiscale entanglement renormalization ansatz (MERA), as physics-inspired tensor decomposition techniques. We find that MERA generally performs better than MPS, and hence conjecture that the learnability of chaos is determined not only by the number of free parameters but also by the tensor complexity, understood as how the entanglement entropy scales under different matricizations of the tensor.


Introduction
Time series forecasting [1], despite its undoubtedly tremendous potential in both theoretical issues (e.g., mechanical analysis, ergodicity) and real-world applications [2] (e.g., traffic, weather, and clinical records analysis), has long been known as an intricate field. From classical work in statistics, such as the auto-regressive moving average (ARMA) family [3] and basic hidden Markov models (HMM) [4,5], to contemporary machine-learning (ML) methods [6][7][8][9], such as gradient-boosted trees (GBT) and neural networks (NN), the essential complexity of time series has been more and more frequently recognized. In particular, forecasting models have extended their applicable range from linear, Markovian cases to nonlinear, non-Markovian, and even more general situations [10]. Among all known methods, recurrent NN architectures [11], including plain recurrent neural networks (RNN) [12] and long short-term memory (LSTM) [13], are the most capable of capturing this complexity, as they reflect the fundamental recurrent behavior of time series data. LSTM has proved useful in speech recognition and video analysis tasks [14], in which maintaining long-term memory is essential. In relation to this objective, novel architectures such as higher-order RNN/LSTM (HO-RNN/LSTM) [15] have been introduced to capture long-term non-Markovianity explicitly, further improving performance and enabling more accurate theoretical analysis.
Still, another domain of complexity, chaos, has been far less understood [16,17]. Even though numerous theory- and data-driven studies on forecasting chaotic time series by means of recurrent NN have been conducted [18][19][20][21][22][23][24][25], there is still a lack of consensus on which features play the most important roles in the forecasting methods. The notorious signature of chaos,

$$|\delta x_t| \approx e^{\lambda t} |\delta x_0|, \quad (1)$$

where λ denotes the spectrum of Lyapunov exponents, suggests that the difficulty of forecasting chaotic time series is two-fold: first, any small error propagates exponentially, and thus multi-step-ahead predictions will be exponentially worse than one-step-ahead ones; second, and more subtly, when the actual time difference ∆t between consecutive steps increases, the minimum redundancy of model capacity needed for smoothly descending to the global minima (or sufficiently good local ones) during NN training also increases exponentially. Most studies address only the first difficulty, by improving the prediction accuracy achievable at the global minima. Yet the latter is in fact more crucial, especially when ∆t is so large that the time series looks apparently random and a trivial local minimum would most likely be reached instead. Recently, tensorization has been introduced into recurrent NN architectures [26,27]. A tensorized version of HO-RNN/LSTM, namely HOT-RNN/LSTM [28], has claimed an advantage in learning long-term nonlinearity in Lorenz systems of small ∆t. On the one hand, we believe that the global minima of chaos (where the dominance of linear dependence is absent) can be most efficiently reached through tensorization approaches, in which all nonlinear terms, up to some polynomial order, are treated explicitly and weighted equally. On the other hand, for simple chaotic dynamical systems, nonlinear complexity is encoded only in the short term, not the long term, which HO/HOT models will not capture efficiently when ∆t is large. Hence, a new tensorization-based recurrent NN architecture is desired, both to foster our understanding of chaos in time series and to meet practical needs, e.g., the modeling of laminar flame fronts and chemical oscillations [29][30][31][32].
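To make the first difficulty concrete, the following minimal sketch (ours, not from the paper's experiments; Python with NumPy is assumed throughout) iterates the fully chaotic logistic map, whose Lyapunov exponent is λ = ln 2, and shows a 10⁻¹² initial perturbation growing roughly as 2^t until it saturates:

```python
import numpy as np

# Illustration only: exponential error growth in the fully chaotic
# logistic map x_{n+1} = 4 x_n (1 - x_n), whose Lyapunov exponent is ln 2.
def logistic_orbit(x0, n_steps, r=4.0):
    xs = np.empty(n_steps)
    x = x0
    for i in range(n_steps):
        x = r * x * (1.0 - x)
        xs[i] = x
    return xs

a = logistic_orbit(0.3, 40)
b = logistic_orbit(0.3 + 1e-12, 40)      # tiny initial perturbation
err = np.abs(a - b)
for t in (5, 10, 20, 30):
    # |delta x_t| ~ e^{lambda t} |delta x_0| = 2^t * 1e-12 until saturation
    print(t, err[t - 1], 2.0**t * 1e-12)
```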
In this paper, we introduce a new LSTM-based architecture by tensorizing the cell-state-to-state propagation therein, retaining the long-term memory features of LSTM while simultaneously enhancing the learning of short-term nonlinear complexity. Compared with traditional LSTM architectures, including stacked LSTM [33], and the other aforementioned statistics/ML-based forecasting methods, our model is shown to be a general approach that outperforms them in capturing chaos in almost every typical chaotic continuous-time dynamical system and discrete-time map under controlled, comparable NN training conditions, as justified by both our theoretical analysis and experimental results. Our model is also tested on real-world time series datasets, where the improvements range up to 6.3%.
During the tensorization, we have explicitly embedded many-body quantum state structures, a way of reducing the exponentially large degrees of freedom of a tensor (i.e., tensor decomposition) popularly studied in condensed matter physics and not unseen in NN design [34]. A many-body entangled state living in a tensor-product Hilbert space is hardly separable. The same inseparability also appears in nonlinear multivariate functions when the crossing terms between different variables become too complex. This similarity motivated us to adopt a special measure of tensor complexity, namely the entanglement entropy (EE) [35], which is commonly used in quantum physics and quantum information [34]. For one-dimensional many-body states, two thoroughly studied but different structures exist: the multiscale entanglement renormalization ansatz (MERA) [36] and matrix product states (MPS) [37], whose EE scales with the subsystem size or remains bounded, respectively [35]. In most pertinent studies, MPS has proven efficient enough to be applicable to a variety of tasks [38][39][40][41]. However, our experiments show that, in our entanglement-structured design of the new tensorized LSTM architecture, LSTM-MERA generally performs even better than LSTM-MPS without increasing the number of parameters. This finding leads to another interesting result: we conjecture that not only should tensorization be introduced, but the tensor's EE also has to scale with the system size; hence, MERA is more efficient than MPS in learning chaos.

Formalism of LSTM Architecture
The formalism starts from an operator-theoretical perspective by defining two general types of real operators, W and σ, through which most NN architectures can be represented. W : X → G is simply a linear operator, whereas σ : G → G is a nonlinear operator such that σ(G) = (σ • G) ∈ G for G ∈ G, where • stands for the entry-wise operator product. All double-struck symbols (X, G, ···) used in this context are general real vector spaces considered to be of covariant type, as W can be interpreted as a linear-map-induced 2-contravariant bilinear form. Next, a state propagation function (i.e., a gate) g(x, y, ···; W) = σ(W(x ⊕ y ⊕ ···)) is introduced, where x ⊕ y ⊕ ··· stands for the tensor direct sum of real vectors x, y, ···. Following this formalism, an LSTM architecture can be expressed as

$$c_t = g(1, x_t, s_{t-1}; W_f) \bullet c_{t-1} + g(1, x_t, s_{t-1}; W_i) \bullet g(1, x_t, s_{t-1}; W_m),$$
$$s_t = g(1, x_t, s_{t-1}; W_o) \bullet \tanh c_t, \quad (2)$$

where the four gates controlled by W_i, W_m, W_f, and W_o are the input, memory, forget, and output gates, respectively. The state s_t and the cell state c_t are h-dimensional covectors, whereas the input x_t is a d-dimensional covector (Figure 1). In the tensorized architecture (Figure 2), the h-dimensional vector tanh c_t is first expanded into a P × L-dimensional matrix, where L and P are dubbed the physical length and physical degrees of freedom (DOF), respectively. Then, the matrix is tensorized into an L-rank tensor of dimension P^L and passed forward. The effectiveness of this architecture is investigated in Section 4.3.
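A minimal NumPy sketch of Equation (2) in this gate formalism (our illustration; the dimensions are arbitrary, and the constant input 1, which absorbs the bias, is an assumption consistent with the notation g(1, x_t, s_{t−1}; ·) used below):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gate(W, *vecs, act=sigmoid):
    """g(x, y, ...; W) = act(W (x + y + ...)), with + the direct sum."""
    return act(W @ np.concatenate(vecs))

def lstm_step(x_t, s_prev, c_prev, Wi, Wm, Wf, Wo):
    """One step of Equation (2); the constant input 1 absorbs the bias."""
    v = (np.ones(1), x_t, s_prev)
    c_t = gate(Wf, *v) * c_prev + gate(Wi, *v) * gate(Wm, *v, act=np.tanh)
    s_t = gate(Wo, *v) * np.tanh(c_t)
    return s_t, c_t

d, h = 3, 8                                   # illustrative dimensions
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(h, 1 + d + h)) for _ in range(4)]
s, c = np.zeros(h), np.zeros(h)
s, c = lstm_step(rng.normal(size=d), s, c, *Ws)
```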

Tensorized State Propagation
Our tensorized LSTM architecture (Figure 1) is based exactly on Equation (2), from which the only change is

$$s_t = g(1, x_t, s_{t-1}; W_o) \bullet g(T(\sigma(c_t)); W_T), \quad (3)$$

where g(T(σ(c_t)); W_T) is coined a tensorized state propagation function, for which W_T : T → G acts on the feature tensor

$$T(\sigma(c_t)) = \bigotimes_{l=1}^{L} \left( 1 \oplus q_{t,l} \right), \quad q_{t,l} = W_l\, \sigma(c_t), \quad (4)$$

as a covariant tensor. Each W_l in Equation (4) maps from σ(c_t) to a new covector q_{t,l} ∈ Q. Here, Q is named a local q-space, since, by way of analogy, it can be considered as encoding the local degrees of freedom in quantum mechanics. Q can be extended to the complex number field if necessary. Mathematically, Equation (4) offers the possibility of directly constructing orthogonal polynomials up to order L from σ(c_t) to build up nonlinear complexity. In fact, when L goes to infinity,

$$T = \lim_{L \to \infty} (1 \oplus Q)^{\otimes L} = 1 \oplus Q \oplus (Q \otimes Q) \oplus \cdots$$

becomes a tensor algebra (up to a multiplicative coefficient), and T(σ(c_t)) admits any nonlinear smooth function of σ(c_t). A full tensor W_T can be represented by introducing multiple auxiliary and learnable tensors (e.g., disentanglers and isometries, as used in MERA, and inverse isometries, as used in MPS) of different virtual dimensions {D_I, D_II, ···} labeled by different levels (Figure 2). The first-level virtual dimension is D_I ≡ P, the physical DOF, by definition. The other virtual dimensions {D_II, ···} are free hyperparameters to be chosen, larger values of which better represent the full tensor. The numbers of applicable levels in Figure 2a,b are constant (one and two, respectively), yet the number of applicable levels in Figure 2c is log₂ L, depending on the physical length L.
Following Equation (4), T(tanh c_t) is simply the tensor product of all the column vectors 1 ⊕ q_{t,l} on the right-hand side of Equation (5). From the realization of Equation (3), W_T ∈ M(h, P^L) [Figure 2a]; however, a problem of exponential explosion (a.k.a. the "curse of dimensionality") arises. Treating W_T maximally by training all hP^L learnable parameters is very computationally expensive, especially since L cannot be small, because it governs the nonlinear complexity. To overcome this "curse of dimensionality", tensor decomposition techniques have to be exploited [39] for the purpose of finding a much smaller subset T ⊂ M(h, P^L) to which all possible W_T belong, without sacrificing any expressive power.
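The following sketch (ours, with illustrative dimensions) builds T(tanh c_t) of Equation (4) explicitly as an outer product and makes the exponential cost h·P^L visible:

```python
import numpy as np

def tensorized_features(c_t, W_locals):
    """T(tanh c_t) as in Eq. (4): the outer product of the column vectors
    (1 + q_{t,l}), q_{t,l} = W_l tanh(c_t).  Returns an order-L tensor of
    shape (P, ..., P), where P = dim(Q) + 1.
    """
    cols = [np.concatenate(([1.0], W @ np.tanh(c_t))) for W in W_locals]
    T = cols[0]
    for q in cols[1:]:
        T = np.tensordot(T, q, axes=0)    # outer (tensor) product
    return T

h, P, L = 8, 4, 6                         # illustrative dimensions
rng = np.random.default_rng(1)
W_locals = [rng.normal(size=(P - 1, h)) for _ in range(L)]
T = tensorized_features(rng.normal(size=h), W_locals)
print(T.shape, T.size)                    # (4, 4, 4, 4, 4, 4), 4**6 = 4096
# A full W_T would already need h * P**L = 32768 parameters at L = 6.
```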

Many-Body Entanglement Structures
Below, we introduce the two many-body quantum state structures (MPS and MERA) as efficient low-order tensor decomposition techniques for representing W_T.

MPS
As one of the most commonly used tensor decomposition techniques, MPS is also widely known as tensor-train decomposition [42] and takes the following form in our model [Figure 2b]:

$$[W_T]_{\mu_1 \mu_2 \cdots \mu_L} = w_0 \sum_{\alpha_1, \ldots, \alpha_{L-1}} [w_1^{\dagger}]_{\mu_1}^{\alpha_1} [w_2^{\dagger}]_{\alpha_1 \mu_2}^{\alpha_2} \cdots [w_L^{\dagger}]_{\alpha_{L-1} \mu_L},$$

where w₁†, w₂†, ···, w_L† are learnable 3-tensors (the symbol † denoting that they are inverse isometries [35]) contracted over virtual indices α of dimension D_II (the same for all α), and w₀ is no more than a linear transformation that collects the boundary terms and maintains symmetry. The above notations are used for consistency with quantum theory [35] and with the following MERA representation.
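A minimal sketch of the tensor-train contraction (ours; the shapes and boundary conventions are illustrative, not the authors' code), which applies W_T to the feature vectors of Equation (4) without ever forming the full P^L tensor:

```python
import numpy as np

def mps_apply(cores, w0, qs):
    """Contract a tensor-train W_T with feature vectors.

    cores : list of L arrays of shape (D_left, P, D_right), boundary bonds 1
    w0    : (h, 1) matrix collecting the boundary terms
    qs    : list of L P-dimensional feature vectors (1 + q_{t,l})
    """
    m = np.einsum('apb,p->ab', cores[0], qs[0])
    for core, q in zip(cores[1:], qs[1:]):
        m = m @ np.einsum('apb,p->ab', core, q)
    return w0 @ m.reshape(-1)             # h-dimensional output

P, L, D, h = 4, 6, 5, 8                   # illustrative dimensions
rng = np.random.default_rng(2)
cores = ([rng.normal(size=(1, P, D))] +
         [rng.normal(size=(D, P, D)) for _ in range(L - 2)] +
         [rng.normal(size=(D, P, 1))])
w0 = rng.normal(size=(h, 1))
qs = [rng.normal(size=P) for _ in range(L)]
print(mps_apply(cores, w0, qs).shape)     # (8,); ~ P L D**2 parameters
```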

MERA
The best way to explain MERA is with graphical tools, e.g., tensor networks [35]. MERA differs from MPS in its hierarchical tree structure: within each level {I, II, ···}, the structure contains a layer of 4-tensor disentanglers of dimensions {D_I⁴, D_II⁴, ···}, followed by a layer of 3-tensor isometries of dimensions {D_I² × D_II, D_II² × D_III, ···}; details can be found in [36]. MERA is similar to the Tucker decomposition [43] but fundamentally different because of the existence of disentanglers, which smear out the inhomogeneity of different tensor entries [36]. Figure 2c shows the reorganized version of MERA used in our model, in which the storage of independent tensors is maximally compressed before they are multiplied with each other by tensor products, allowing more GPU acceleration during NN training.
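A minimal sketch of one MERA level (ours; the random tensors stand in for trained disentanglers and isometries, and no unitarity constraints are enforced):

```python
import numpy as np

def mera_layer(state, u, w):
    """One MERA level applied to an L-site state of shape (D,) * L:
    disentanglers u across neighboring pairs, then isometries w that
    coarse-grain two sites into one [cf. Figure 2c].

    u : shape (D, D, D, D); in a true MERA, u is unitary
    w : shape (D_next, D, D); in a true MERA, w is an isometry
    """
    L = state.ndim
    for i in range(1, L - 1, 2):          # disentanglers across block edges
        state = np.tensordot(u, state, axes=([2, 3], [i, i + 1]))
        state = np.moveaxis(state, [0, 1], [i, i + 1])
    for b in range(L // 2):               # isometries: two sites -> one
        state = np.tensordot(w, state, axes=([1, 2], [b, b + 1]))
        state = np.moveaxis(state, 0, b)
    return state

D, D2, L = 2, 3, 8                        # illustrative dimensions
rng = np.random.default_rng(3)
u = rng.normal(size=(D, D, D, D))         # random stand-ins, not unitary
w = rng.normal(size=(D2, D, D))
psi = rng.normal(size=(D,) * L)
print(mera_layer(psi, u, w).shape)        # (3, 3, 3, 3); stacking log2(L)
                                          # such levels gives the full MERA
```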

Scaling Behavior of EE
We now take advantage of an important measure of tensor complexity: the entanglement entropy (EE). Given an arbitrary tensor $W_{\mu_1 \cdots \mu_L}$ of dimension P^L and a cut l such that 1 ≤ l < L, the EE is defined in terms of the α-Rényi entropy [35],

$$S_{\alpha}(l) = \frac{1}{1 - \alpha} \log \sum_i \left[ \sigma_i(\mathcal{W}(l))^2 \right]^{\alpha}, \quad (6)$$

assuming α ≥ 1 and normalized singular values. The Shannon entropy is recovered as α → 1. Here, $\sigma_i(\mathcal{W}(l))$ in Equation (6) is the i-th singular value of the matrix $\mathcal{W}(l) = W_{(\mu_1 \times \cdots \times \mu_l),(\mu_{l+1} \times \cdots \times \mu_L)}$, matricized from $W_{\mu_1 \cdots \mu_L}$. How S(l) scales with l determines how much redundancy exists in $W_{\mu_1 \cdots \mu_L}$, which in turn reveals, at most, how efficient a tensor decomposition technique can be. For one-dimensional gapped low-energy quantum states (ground states), the EE saturates even as l increases, i.e., $S_\alpha(l) = \Theta(1)$. Thus, their low-entanglement characteristics can be efficiently represented via MPS, whose EE does not scale with l either and is bounded by $S_\alpha(l) \leq S_1(l) \leq 2 \log D_{II}$ [35]. By contrast, a non-trivial scaling behavior $S_\alpha(l) = \Theta(\log l)$ corresponds to gapless low-energy states and can only be efficiently represented via MERA, whose EE scales logarithmically with l as well [36]. The bounds of both MPS and MERA have been proven to be tight [35,36].
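Equation (6) can be evaluated directly by matricizing a tensor and taking its singular values; a short sketch (ours, assuming the normalized-singular-value convention p_i = σ_i² / Σ_j σ_j²):

```python
import numpy as np

def renyi_ee(W, l, alpha=1.0):
    """alpha-Renyi EE of tensor W across the cut after site l [Eq. (6)]:
    matricize W into shape (prod of first l dims, rest), take singular
    values, and normalize p_i = s_i**2 / sum(s**2) (assumed convention).
    """
    M = W.reshape(int(np.prod(W.shape[:l])), -1)
    s = np.linalg.svd(M, compute_uv=False)
    p = s**2 / np.sum(s**2)
    p = p[p > 1e-15]
    if np.isclose(alpha, 1.0):            # Shannon limit, alpha -> 1
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p**alpha)) / (1.0 - alpha))

rng = np.random.default_rng(4)
W = rng.normal(size=(2,) * 10)            # a random order-10 tensor, P = 2
for l in (1, 2, 3, 4, 5):
    print(l, round(renyi_ee(W, l), 3))    # EE grows roughly with l
```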
The different EE scaling behaviors of MERA and MPS hence provide an apparent geometric advantage for MERA, i.e., its quasi-two-dimensional structure [Figure 2c]: enlarging it increases not only the width but also the depth of the NN, as the number of applicable levels scales logarithmically with L, offering even more power for model generalization on top of the already-inherited LSTM architecture [11]. This advantage is further confirmed by Equation (9) and then in Section 4.1, in which tensorized LSTMs with the two different representations, LSTM-MPS and LSTM-MERA, are tested.

Expressive Power
First, we prove the following theorem, which links the variations of c_t and s_t:

Theorem 1. Given an LSTM architecture [Equation (2)] whose input is a chaotic dynamical system x_t, characterized by a matrix λ whose spectrum is the Lyapunov exponent(s), such that any variation δx_t propagates exponentially [Equation (1)], then, up to the first order (i.e., δx_{t−1}),

$$|\delta s_t| \gtrsim C\, e^{\lambda \Delta t}\, |\delta c_t|, \quad (7)$$

where $C \propto 1/\|W\|_{\infty}^2$ and $\|\cdot\|_{p=\infty}$ is the operator norm.

Proof. Expanding the gates of Equation (2) to first order in δx_{t−1} yields Equation (7), where, w.l.o.g., all linear maps are assumed to be of about the same magnitude ∼ ‖W‖∞. Note that |c_{t−1}| is also bounded because |g(1, x_{t−1}, s_{t−1}; W_f)| ≤ 1, which means that c_t is stationary. □

Equation (7) suggests that the state propagation from c_t to s_t carries the chaotic behavior. In fact, to preserve the long-term memorization in LSTM, c_t has to depend on c_{t−1} with a linear behavior and thus cannot carry chaos itself. This is further verified experimentally in Section 4.3. Nevertheless, note that Equation (1) is a necessary condition of chaos, not a sufficient one. We now look into the expressive power of our tensorized state propagation function, $W_T T(\sigma(c_t))$. One of the advantages of tensorizing the state propagation function in the form of Equation (4) is the well-behaved polynomial space constructed by the tensor product, by virtue of which the approximation error of $W_T T$ with respect to any (k-Sobolev) function f can always be bounded, as proven by the following theorem:

Theorem 2. For any function f in the k-Sobolev space over the domain of σ(c_t),

$$\min_{W_T} \left\| W_T T(\sigma(c_t)) - f(\sigma(c_t)) \right\| \leq C\, N^{-k}\, \| f \|_k, \quad (8)$$

where N is the maximum polynomial order admitted by T(σ(c_t)), $\|\cdot\|_k$ is the Sobolev norm, and C is a finite constant.
Proof. By the Hölder-continuous spectral convergence theorem [44], Equation (8) holds whenever the approximating space P_N is spanned by polynomials of a degree of at most N. Next, note that in the realization of T(σ(c_t)), each W_l is independent [Equation (4)], and thus $P_N = \mathrm{span}(T(\sigma(c_t)))$ is possible, where N is determined by L, P, and h. When P − 1 ≥ h, the maximum polynomial order L is guaranteed; when P − 1 < h, dim Q < dim G, and hence T(σ(c_t)) can only fully cover a polynomial order of up to ⌊L(P − 1)/h⌋. Finally, Equation (8) is proven based on the fact that $W_T T$ maximally admits $P_N f$ as long as $hP^L \geq \sum_{i=0}^{N} h^i = (h^{N+1} - 1)/(h - 1)$, the latter of which is the size of the maximum orthogonal polynomial basis admitted by P_N. □
Equation (8) can be used to estimate how L scales with the chaoticity of the dynamical system. In particular, Equation (7) suggests that $\partial^{(1)} f \sim e^{\lambda \Delta t}$, where ∆t is the actual time difference between consecutive steps. Therefore, to preserve the error bound [Equation (8)], one at least expects that $L^{-1} \sim e^{-\lambda \Delta t}$, i.e., L has to increase exponentially with respect to λ∆t. To achieve this, tensorization is undoubtedly the most efficient approach, especially when ∆t is large.
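A back-of-envelope reading of this scaling (ours, with assumed constants; max_poly_order follows the P − 1 ≥ h case distinction in the proof of Theorem 2):

```python
import numpy as np

def max_poly_order(L, P, h):
    """Maximum polynomial order admitted by T(sigma(c_t)), per Theorem 2."""
    return L if P - 1 >= h else (L * (P - 1)) // h

def required_L(lyapunov, dt, L0=2):
    """L ~ e^{lambda * dt} needed to preserve Eq. (8); L0 is an assumed
    baseline constant."""
    return int(np.ceil(L0 * np.exp(lyapunov * dt)))

print(max_poly_order(L=6, P=4, h=8))                     # 2, since P-1 < h
print([required_L(np.log(2), dt) for dt in (1, 2, 4)])   # [4, 8, 32]
```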

Worst-Case Bound by EE
The above analysis offers an intuitive bound on the expressive power of $W_T T$. Unfortunately, Equation (8) is valid only when all $hP^L$ degrees of freedom of W_T are independent. A low-order tensor decomposition may therefore impact the expressive power.
Below, we compare the two entanglement structures, MPS and MERA, whose EE scaling behaviors differ fundamentally. We proceed via the following theorem, which relates the tensor approximation error to the entanglement scaling:

Theorem 3. Given a tensor $[W_T]_{\mu_1 \cdots \mu_L}$ and its tensor decomposition $\widetilde{W}_T$, the worst-case p-norm (p ≥ 1) approximation error is bounded from below by Equation (9), which grows with the gap between the EE scaling of $W_T$ and that of $\widetilde{W}_T$ across the cut l, where $S_{\alpha \equiv p}(\mathcal{W}(l))$ is the α-Rényi entropy [Equation (6)].
Proof. Equation (9) is easily proven by noting the Minkowski inequality. □

The worst-case bound [Equation (9)] is optimized whenever $S_p(\widetilde{\mathcal{W}}_T(l))$ scales in the same way as $S_p(\mathcal{W}_T(l))$ does. Assuming $S_p(\mathcal{W}_T(l)) = C + C' \log l$, an MPS-type $\widetilde{W}_T$ cannot efficiently approximate $W_T$ unless $\log D_{II}$ increases with $\log l$ too, from which the total number of free parameters, $\sim P L D_{II}^2$ [Figure 2b], becomes unbounded. By contrast, a MERA-type $\widetilde{W}_T$ matches the scaling, for which the total number of free parameters, $\sim (D^4 + D^3) L$ (where $D \equiv D_{II} = D_{III} = \cdots = \exp C'$), is efficient enough for any worst-case l.
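The contrast in parameter budgets can be made concrete (our sketch; the constants and the $l^{C'}$ growth of D_II are illustrative readings of the bounds above):

```python
import math

def mps_params(P, L, D):
    return P * L * D**2            # ~ P L D_II^2      [Figure 2b]

def mera_params(L, D):
    return (D**4 + D**3) * L       # ~ (D^4 + D^3) L   [Figure 2c]

P, L, Cp = 4, 64, 1.0              # assume S_p(l) = C + Cp * log(l)
D_mera = round(math.exp(Cp))       # MERA bond dimension stays fixed
for l in (4, 16, 64):
    D_mps = round(l**Cp)           # MPS needs log(D_II) ~ Cp * log(l)
    print(l, mps_params(P, L, D_mps), mera_params(L, D_mera))
# MPS cost grows without bound with the cut l; MERA cost does not.
```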
It is unknown how much, quantitatively, the failure to approximate W_T may impact the expressive power given in Equation (8). Nevertheless, the vanishing of the worst-case bound in Equation (9) is a necessary condition for Equation (8) to be valid.

Results
We investigated the accuracy of LSTM-MERA and its generalization ability on different chaotic time series datasets by evaluating the root mean squared error (RMSE) of its one-step-ahead predictions against the target values. The benchmark for comparison was chosen to be a vanilla LSTM whose hidden dimension h was arbitrarily chosen in advance. LSTM-MERA (and the other architectures, where present) was built upon the benchmark.
Each time series dataset for training/testing consisted of a set of N_X time series, {X_i | i = 1, 2, ···, N_X}. Each time series X_i = {x_t^i | t ∈ T_i} is of fixed length |T_i| = input steps + 1, so that all but the last step of X_i were inputs, whereas the last step was the one-step-ahead target to be predicted. The dataset {X_i} was divided into two subsets, one for testing and one for training, the latter of which was further randomly split into a plain training set and a validation set at a ratio of 80% : 20%. Complete details are given in Appendix C.
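A sketch of this windowing and splitting (ours; it assumes the test subset has already been held out and only performs the 80%/20% train/validation split):

```python
import numpy as np

def make_dataset(series_list, seed=0):
    """Each series has input_steps + 1 points: all but the last are
    inputs, the last is the one-step-ahead target; the training portion
    is randomly split 80%/20% into train/validation.
    """
    X = np.stack([s[:-1] for s in series_list])       # inputs
    y = np.stack([s[-1] for s in series_list])        # targets
    idx = np.random.default_rng(seed).permutation(len(X))
    n_val = len(X) // 5
    val, train = idx[:n_val], idx[n_val:]
    return (X[train], y[train]), (X[val], y[val])

# toy example with input steps = 8
series = [np.sin(0.3 * np.arange(t, t + 9)) for t in range(100)]
(train_X, train_y), (val_X, val_y) = make_dataset(series)
print(train_X.shape, val_X.shape)                     # (80, 8) (20, 8)
```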
All models were trained using Mathematica 12.0 on its NN infrastructure, Apache MXNet, with an ADAM optimizer with β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁵. The learning rate (10⁻²) and batch size (64) were chosen a priori. The NN parameters producing the lowest validation loss during the entire training process were accepted.
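A PyTorch equivalent of the stated setup (our assumed mapping, not the authors' code, which runs on Mathematica/MXNet) would look like:

```python
import torch

# Assumed PyTorch translation of the training configuration above.
model = torch.nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-2,              # learning rate 10^-2
                             betas=(0.9, 0.999),   # beta_1, beta_2
                             eps=1e-5)             # epsilon = 10^-5
loss_fn = torch.nn.MSELoss()   # RMSE is reported; MSE as train loss is
                               # our assumption
# Train with batches of size 64 and keep the checkpoint with the lowest
# validation loss, as described above.
```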

Comparison of LSTM-Based Architectures
When evaluating the advantages of LSTM-MERA, a controlled comparison is essential to confirm that the architecture of LSTM-MERA is inherently better than the other architectures, and not just because of an increase in the number of free, learnable parameters (even though more parameters do not necessarily mean more learning power). Here, we studied different architectures (Figure 3) that were all built upon the LSTM benchmark and shared nearly the same number of parameters (param. #). A "wider" LSTM was built simply by increasing h. A "deeper" LSTM was built by stacking two LSTM units as one unit. In particular, LSTM-MPS and LSTM-MERA were built and compared. In general, the non-tensorized LSTM models performed worse than the tensorized LSTM models. After the number of free parameters increased from 332 (benchmark) to 668 ± 28, both the "wider" and "deeper" LSTMs showed signs of overfitting. The "deeper" LSTM yielded a lower RMSE than the "wider" LSTM, consistent with the common understanding that a deep NN is more suitable for generalization than a wide NN [45]. Both LSTM-MPS and LSTM-MERA yielded better RMSE values and showed no sign of overfitting. However, LSTM-MERA was more powerful, showing an improvement of ∼25% over LSTM-MPS in RMSE [Figure 3a].

Logistic Map
Figure 3b describes a specific forecasting task on the simplest one-dimensional discrete-time map, the logistic map: predicting the target given only a three-step-behind input. The different LSTM models yielded very different results when learning this complex task. After the number of free parameters increased from 35 (benchmark) to 1142 ± 89, all LSTM models yielded a lower RMSE than the benchmark. Only LSTM-MERA was able to reach a much lower RMSE (presumably a global minimum), with a remarkable improvement of ∼94% over LSTM-MPS [Figure 3b]. We infer that the local minima reached by the other LSTM models might correspond to the infinitely many unstable quasi-periodic cycles in the chaotic phases. In fact, as shown in Figure 3b, Prediction 3, the benchmark fit the target better than LSTM-MERA for this specific example of a quasi-period-2 cycle. However, LSTM-MERA learned the full chaotic behavior and thus performed much better on general examples.
The learning process for the logistic map task was indeed very random, and different realizations yielded very different results. In many realizations, the non-tensorized LSTM models did not learn any patterns at all. By contrast, the tensorized LSTM models were more stable in learning.

Comparison with Statistical/ML Models
We compared LSTM-MERA with more general models, including traditional statistical and ML models as well as RNN-based architectures (Figure 4). Specifically, we looked into HOT-RNN/LSTM, which has also been claimed to be able to learn chaotic dynamics (e.g., the Lorenz system) through tensorization [28]. Furthermore, for each model, we fed its one-step-ahead predictions back so as to make predictions for the second step, and kept feeding back, and so on. In theory, the prediction error at the t-th step should increase exponentially with t for chaotic dynamics [Equation (1)].
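The feedback (closed-loop) forecasting procedure is simple to state in code (our sketch; predict_one_step is a placeholder for any trained one-step model):

```python
import numpy as np

def iterated_forecast(predict_one_step, history, n_steps):
    """Multi-step forecasting by feeding one-step-ahead predictions back;
    for chaotic dynamics the error should grow roughly as e^{lambda t}
    [Eq. (1)].

    predict_one_step : callable mapping an input window to the next value
    history          : 1-D array, the most recent observed window
    """
    window = list(history)
    preds = []
    for _ in range(n_steps):
        nxt = predict_one_step(np.asarray(window))
        preds.append(nxt)
        window = window[1:] + [nxt]       # slide the window, feed back
    return np.array(preds)

# toy stand-in model: the exact logistic-map update from the last point
preds = iterated_forecast(lambda w: 4 * w[-1] * (1 - w[-1]),
                          history=np.array([0.2, 0.55, 0.99]), n_steps=4)
print(preds)
```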

Gauss Iterated Map
We tested the one-step-ahead learning task on the Gauss iterated map with plain HO-RNN/LSTM [15] and its tensorized version, HOT-RNN/LSTM [28]. The explicit "history" length was chosen to be equal to our physical length L. The tensor-train ranks were all chosen to be equal to D_II, as when we built the MPS structure in LSTM-MPS. Figure 4 shows that neither HO-RNN nor HO-LSTM performed better than the benchmark, suggesting that introducing explicit non-Markovian dependence (Appendix B.1) is not helpful for capturing chaotic dynamics in which the existing nonlinear complexity is never long-term. HOT-LSTM was better than the benchmark because of its MPS structure, suggesting that tensorization, on the other hand, is indeed helpful for forecasting chaos. LSTM-MERA was still the best, with an improvement of ∼88% over the benchmark. Interestingly, the benchmark itself, a vanilla LSTM, was already much better than the plain RNN architectures (HO-/HOT-RNN).
The learning task was next tested on fully connected deep NN architectures of depth ≤ 8 (equal to the input steps). At each depth, three units were connected in series: a linear layer, a scaled exponential linear unit, and a dropout layer. The hyperparameters were determined by means of an optimal search, and the best model, i.e., that with the lowest validation loss, consisted of 17,950 free parameters. The task was also tested on GBT with a maximum depth of 8, as well as on the ARMA family (ARMA, ARIMA, FARIMA, and SARIMA), among which the best statistical model, selected by Kalman filtering, was ARMA(3,4).
With enough parameters, the deep NN became the second best (Figure 4). All RMSE values increased when making longer-step-ahead predictions, and for the four-step-ahead task, the deep NN and LSTM-MERA were the only models that did not overfit and still performed better than the best statistical model, ARMA, which made no learning progress and produced only trivial predictions.

Comparison with LSTM-MERA Alternatives
Here, we tested the ability of LSTM-MERA to learn short-term nonlinear complexity by changing its NN topology (Figure 5). We expected that, to achieve the best performance, our tensorization (dashed rectangle in Figure 1) should indeed act on the state propagation path c_t → s_t, and not on s_{t−1} → s_t or c_{t−1} → c_t.

Thomas' Cyclically Symmetric System
We investigated different LSTM-MERA alternatives on Thomas' cyclically symmetric system (Figure 5) in order to see whether the short-term complexity could still be efficiently learned. The embedded layers, in addition to being located at Site A (the proper NN topology of LSTM-MERA), were alternatively located at Site B, C, or D for comparison. The benchmark was a vanilla LSTM with no embedded layers.
As expected, the lowest RMSE was produced by the proper LSTM-MERA and not by its alternatives (Figure 5). The improvement of the proper LSTM-MERA over the benchmark was ∼60%. Interestingly, two alternatives (Site B, Site C) performed barely better than the benchmark, even with more free, learnable parameters. In fact, when the state propagation path c_{t−1} → c_t is tensorized (Site B), the long-term gradient propagation along the cell states is interfered with, and the performance of LSTM deteriorates; when the path s_{t−1} → s_t is tensorized (Site C), the improvement is the same as for a plain RNN and is thus also limited. Hence, the proper LSTM-MERA NN topology is critical for improving the performance of learning short-term complexity.

Generalization and Parameter Dependence of LSTM-MERA
The inherent advantage of LSTM-MERA and its ability to learn chaos have been shown above. We now investigate its parameter dependence, as well as its generalization ability (Figure 6). Each of the following models (benchmark versus LSTM-MERA) was sufficiently trained through the same number of epochs so that it could reach its lowest stable RMSE. In-between checkpoints were chosen during training, at which the models were tested a posteriori on the test data to confirm that an RMSE minimum had eventually been reached.

Rössler System
In theory, a chaotic time series of larger ∆t should be harder to learn [Equation (1)]. This is confirmed in Figure 6a, in which a larger ∆t corresponds to a larger RMSE for both models. The greatest improvement of LSTM-MERA over the benchmark was ∼76%, observed at ∆t = 5. The improvement lessened as ∆t increased further, possibly because the time series became too random to preserve any learnable pattern, even for LSTM-MERA. The improvement was also smaller when ∆t was small, as the time series was then smooth enough that the first-order (linear) time-dependence predominated, which a vanilla LSTM could also learn.

Hénon Map
Since the time-dependence of the Hénon map is second-order [Figure 6b], there was no explicit and exact dependence between the input and the target in the time series dataset. Different numbers of input steps were chosen for comparison. When input steps = 1, there was not enough information to be learned other than a linear dependence between the input and the target, and thus the benchmark and LSTM-MERA performed the same [Figure 6b]. When input steps > 1, however, the time-dependence could be learned implicitly and "bidirectionally", given a long enough history. LSTM-MERA consistently exhibited an average improvement of 45.3%, the fluctuations of which were mostly due to the learning instability not of LSTM-MERA but of the benchmark.

Duffing Oscillator System
As Figure 6c shows, a larger physical length L clearly yielded better RMSE values, and the improvement with L was significant. This result is not unexpected, since L determines the depth of the MERA structure, and a larger depth corresponds to a better generalization ability.

Chirikov Standard Map
As Figure 6d shows, when choosing different values of the physical DOF P, the greatest improvement of LSTM-MERA over the benchmark was ∼56%, observed at P = 8. In general, there was no strong dependence on P.

Real-World Data: Weather Forecasting
The advantage of LSTM-MERA was also tested on real-world weather forecasting tasks [Figure 6e,f]. Unlike for the synthetic time series, here we removed the first-layer translational symmetry [Equation (A1)] previously imposed on LSTM-MERA, so that the presumed non-stationarity of real-world time series could be better addressed. To perform practical multi-step forecasting, we kept the one-step-ahead prediction architecture of LSTM yet regrouped the original time series by choosing different prediction window lengths (Appendix C.3).
The improvement of LSTM-MERA over the benchmark was less significant here. The average improvement was ∼3.0%, and the greatest improvement was ∼6.3%, obtained when the prediction window length was small, reflecting that LSTM-MERA is better at capturing short-term nonlinear complexity than long-term non-Markovianity. Note that, for the second dataset [Figure 6f], we deliberately used a very small number (=128) of training data to test the models' resistance to overfitting. Interestingly, LSTM-MERA did not generally perform worse than the vanilla LSTM, even with more parameters, probably owing to the deep architecture of LSTM-MERA.

Discussion and Conclusions
The limitations of our model mostly come from the fact that it is only better than traditional LSTM at capturing short-term nonlinearity, not long-term non-Markovianity, and thus its improvement on long-term tasks such as sequence prediction would be limited. That being said, the advantages of tensorizing the state propagation in LSTM are evident: (1) Tensorization is the most suitable method for forecasting nonlinear chaos, since the nonlinear terms are treated explicitly and weighted equally by polynomials.
(2) Theoretical analysis is tractable, since an orthogonal polynomial basis on the k-Sobolev space is always available. (3) Tensor decomposition techniques (in particular, from quantum physics) are applicable, which in turn can identify chaos from a different perspective, i.e., tensor complexity (tensor ranks, entropies, etc.). Our tensorized LSTM model not only offers a general and efficient approach for capturing chaos, as demonstrated by both theoretical analysis and experimental results and showing great potential in unraveling real-world time series, but also raises a fundamental question of how tensor complexity is related to the learnability of chaos. Our conjecture, that a tensor complexity of $S_\alpha(l) = \Theta(\log l)$ in terms of the α-Rényi entropy [Equation (6)] generally performs better than $S_\alpha(l) = \Theta(1)$ in chaotic time series forecasting, will be further investigated and formalized in the near future.

Appendix A. Translational Symmetry

For LSTM-MERA on synthetic time series, we added a partial translational symmetry constraint [Equation (A1)] on, and only on, Level I for the purpose of controlling the number of free, learnable parameters in our model.

Appendix C.1. Discrete-Time Maps

Each time series dataset for a discrete-time map was constructed from a training array and a testing array generated from the map; a moving time window extracted a sub-array of length (input steps + 1) at each step, and each extracted sub-array was a time series. All time series (from both the training array and the testing array) made up the entire time series dataset and served for training and testing, respectively. The initial conditions for training and testing were made different on purpose in order to test the generalization ability of the models, yet they were chosen to belong to the same chaotic regime, so that the generality of their subsequent chaotic dynamics was always guaranteed by ergodicity. We investigated four different dynamical systems (reference implementations are sketched below):

Logistic map: $x_{n+1} = r x_n (1 - x_n)$;
Gauss iterated map: $x_{n+1} = \exp(-\alpha x_n^2) + \beta$;
Hénon map: $x_{n+1} = 1 - a x_n^2 + b x_{n-1}$;
Chirikov standard map: $p_{n+1} = (p_n + K \sin \theta_n) \bmod 2\pi$, $\theta_{n+1} = (\theta_n + p_{n+1}) \bmod 2\pi$.

Details of the above systems are listed in Table A1.
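Reference implementations of the four maps (ours; the parameter values are common chaotic choices and not necessarily those of Table A1):

```python
import numpy as np

def logistic(x, r=4.0):
    return r * x * (1.0 - x)

def gauss_iterated(x, alpha=6.2, beta=-0.55):
    return np.exp(-alpha * x**2) + beta

def henon(x, x_prev, a=1.4, b=0.3):
    return 1.0 - a * x**2 + b * x_prev

def chirikov(p, theta, K=1.0):
    p_next = (p + K * np.sin(theta)) % (2 * np.pi)
    theta_next = (theta + p_next) % (2 * np.pi)
    return p_next, theta_next

# e.g., generate a logistic-map array for the moving-window extraction
xs = [0.3]
for _ in range(1000):
    xs.append(logistic(xs[-1]))
```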
Appendix C.2. Continuous-Time Dynamical Systems

Each time series dataset for a continuous-time dynamical system was constructed differently than in Appendix C.1: only one array was produced, by discretizing the dynamical system with time step ∆t given the initial conditions; the array was then standardized; a time window again moved from the beginning to the end and extracted a sub-array of length (input steps + 1) at each step; each extracted sub-array was a time series. All time series made up the entire time series dataset, which was then randomly divided into two subsets, one for testing and one for training. Four different dynamics were investigated, including:

Thomas' cyclically symmetric system: $\frac{dx}{dt} = \sin y - bx$, $\frac{dy}{dt} = \sin z - by$, $\frac{dz}{dt} = \sin x - bz$;
Rössler system: $\frac{dx}{dt} = -y - z$, $\frac{dy}{dt} = x + ay$, $\frac{dz}{dt} = b + z(x - c)$;
Duffing oscillator system: $\frac{dx}{dt} = y$, $\frac{dy}{dt} = -\delta y - \alpha x - \beta x^3 + \gamma \cos(\omega t)$ (the standard driven form).

Details of the above systems are listed in Table A2.

Appendix C.3. Real-World Weather Data

The data were retrieved using Mathematica's WeatherData function, https://reference.wolfram.com/language/note/WeatherDataSourceInformation.html (Figure A1) (accessed on 1 September 2021), and detailed information about the data is provided in Table A3. Missing data points in the raw time series were reconstructed via linear interpolation. The raw time series was then regrouped by choosing different prediction window lengths; for example, a prediction window length of 4 means that every four consecutive steps in the time series are regrouped together as a one-step four-dimensional vector. The dataset was then constructed from the regrouped time series in the same way as in Appendix C.2, using a moving window on it after standardization.