Robust Multi-Dimensional Time Series Forecasting

Large-scale and high-dimensional time series data are widely generated in modern applications such as intelligent transportation and environmental monitoring. However, such data often contain noise, outliers, and missing values due to interference during measurement or transmission. Directly forecasting such data (i.e., anomalous data) can be extremely challenging. The traditional way to deal with anomalies is either to discard the segments containing anomalous entries or to replace them via imputation; both may lose important knowledge from the original data. In this paper, we propose a multidimensional time series forecasting framework that better handles anomalous values: the robust temporal nonnegative matrix factorization forecasting model (RTNMFFM) for multi-dimensional time series. RTNMFFM integrates an autoregressive regularizer into nonnegative matrix factorization (NMF) and adopts the L2,1 norm as the NMF loss, which improves robustness and alleviates overfitting compared to standard methods. In addition, to improve forecasting accuracy on severely missing data, we propose a periodic smoothing penalty that keeps sparse time slices as close as possible to time slices with high confidence. Finally, we train the model using an alternating gradient descent algorithm. Extensive experiments demonstrate that RTNMFFM provides better robustness and better prediction accuracy.


Introduction
With the advancement of Internet of Things (IoT) technology and the reduced cost of sensor deployment, numerous IoT applications are producing massive amounts of time series data, such as intelligent building energy monitoring [1] and real-time rail monitoring applications [2]. Users can obtain information about the monitored object or region in real time from multiple sensors, enhancing the effectiveness and security of the relevant decisions. Large size and multidimensionality are two important characteristics of the time series data used in these applications. For example, a smart building might include hundreds of sensors that track energy consumption [1]. Typically, such time series are sampled at high frequencies over lengthy periods of time, which can be leveraged to forecast energy consumption patterns, detect anomalies, and optimize building management decisions. Another feature is that most of these data are subject to non-negativity constraints. For example, vehicle speeds and customer electricity consumption are usually stored as nonnegative matrices. Beyond IoT data, multidimensional time series are similarly generated in fields such as e-commerce [3], web traffic [4], and biomedicine [5].
The development of modern sensor technology and large-scale data storage has brought new challenges to multidimensional time series forecasting. First, although the data may be generated by different sensors or objects, they may be driven by common trends [6]. Traditional statistical forecasting models such as ARIMA [7] struggle to build correlations across dimensions. Second, sensor data are often plagued by data loss: power exhaustion of sensor nodes, monitored objects temporarily disappearing, and network congestion are all important causes. In practice, we typically observe two missing-data patterns: pointwise missing, where data are lost at random, and continuous missing, where data are lost over a contiguous period of time. Further, noise and outliers are prevalent during data collection and transmission due to factors such as electromagnetic interference or other external, immeasurable disturbances [8]. Severe outliers can change the distribution of the original data, so preprocessing for outliers and missing values is usually required before the prediction task. Yet the quality of the preprocessed data directly affects the performance of the prediction model, and high-quality interpolation algorithms incur additional computational cost. Figure 1 illustrates the two missing-data patterns and outliers, where missing entries are marked with hollow dots and outlier entries with yellow dots.
To address these challenges, we focus on recent approaches based on matrix factorization (MF) for multidimensional time series forecasting. Matrix factorization finds a lower-dimensional representation of the original matrix that captures latent features or patterns in the data. To enable matrix factorization models to be applied in time series forecasting [9], Yu et al. [3] proposed temporal regularized matrix factorization (TRMF), which accounts for positive and negative correlations by placing a new AR regularizer on the temporal factor matrix. Ahn et al. [10] used a Gaussian kernel as a weighting function to model temporal factors with temporal correlation. Overall, these matrix factorization methods handle missing values automatically and have demonstrated exemplary performance on multidimensional time series tasks. However, they fail to consider that noise and missing data are also key factors affecting forecasting performance.
This paper proposes a robust temporal nonnegative matrix factorization forecasting model (RTNMFFM) for multidimensional time series based on the L2,1 norm [11]. It is a robust version of temporal nonnegative matrix factorization (TNMF) focused on multidimensional time series forecasting. The contributions of our model are as follows:

•
We propose RTNMFFM, a multi-dimensional time series matrix factorization method that can efficiently handle noise and missing values. RTNMFFM utilizes the L2,1 norm as the loss function for non-negative matrix factorization to boost the model's capacity to handle anomalous data, and integrates an autoregressive (AR) regularizer [3] to capture the temporal correlation of the factor matrix. The model can automatically estimate missing values and make predictions.

•
We propose a periodic smoothing penalty that uses high-confidence time slices to improve the stability of predictions when data are severely missing.

•
We propose an alternating optimization strategy for RTNMFFM, using the Adam optimizer [12] to accelerate training. Experiments show that RTNMFFM provides state-of-the-art errors on noisy and missing data.
The rest of the paper is organized as follows: Section 2 briefly reviews related work on multidimensional time series forecasting. Section 3 describes our proposed RTNMFFM model in detail. Section 4 provides results from experiments on several real-world datasets. Conclusions and future work follow in Section 5.

Related Work
In this section, we first introduce time series forecasting models that deal with noise and missing data, and then concentrate on time series forecasting models based on matrix factorization.

Time Series Forecasting Model That Can Handle Anomalous Data
To deal with noisy time series data, Singh et al. [13] discussed injecting Gaussian noise into time series data, applied Fourier analysis to filter the noise, and proposed a pattern modeling and recognition system (PMRS) to forecast noisy data. Laurinec et al. [14] proposed a density-based unsupervised ensemble learning method to improve forecasting accuracy for highly fluctuating and noisy time series. Liu et al. [15] proposed an integrated three-phase model called the adaptive noise reduction stacked autoencoder (ANR-SAE-VALSTM); the model uses ANR to filter out noise and Long Short-Term Memory (LSTM) neural networks to forecast the data. Rasul et al. [16] proposed TimeGrad, a multidimensional probabilistic forecasting method based on a diffusion probability model; its disadvantage is the high cost of model training.
For handling incomplete data, imputing the data before forecasting is one class of methods. Sridevi et al. [17] proposed using ARLSimper, an autoregression-based estimator of missing values, to repair the missing values and then forecast future data. In this way, the imputation algorithm's effectiveness and performance indirectly influence the forecasting algorithm's accuracy. On the other hand, some studies build forecasting models directly from incomplete data. Che et al. [18] proposed a decay mechanism integrated with a gated recurrent unit deep learning model (GRU-D). Bokde et al. [19] proposed a preprocessing algorithm that simultaneously forecasts and backcasts missing values for imputation by improving the Pattern Sequence Forecasting (PSF) algorithm.
However, as mentioned earlier, methods that first preprocess anomalous time series data to remove missing values and outliers and then apply a time series prediction model are prone to accumulating the errors of the interpolation algorithm.

Forecasting Models Based on Matrix Factorization
Matrix factorization is a low-rank factorization model. Because of its ability to find latent factors in data, it is frequently used in clustering [20] and recommendation systems [21]. MF models such as non-negative matrix factorization (NMF) [22] and more general forms of tensor factorization (CP, Tucker) have been widely used for complex time-stamped events [9]. On time series data, matrix factorization is widely used for dimensionality reduction [23] and data imputation [24].
The default MF can only capture global low-rank features. For time series data, we want the decomposed matrices to maintain the temporal correlation of the original data. Early graph Laplacian regularization fails to establish negative correlations between time points [25]. To create a non-negative matrix factorization model with temporal smoothness, Chen et al. [26] construct the difference terms using Toeplitz matrices. Rao et al. [27] use a graph-based approach to introduce Laplacian regularization to deal with temporal correlation. These regularization methods maintain temporal patterns, improve temporal smoothness, and show better performance on interpolation tasks. However, such models cannot perform forecasting tasks. Yu et al. [3] developed a new regularization framework, and the proposed temporal regularized MF model (TRMF) can be applied to multidimensional time series data with missing values. TRMF can handle missing values while achieving forecasts through an autoregressive regularizer. Takeuchi et al. [28] obtained better forecasting performance by modeling spatiotemporal tensor data with a spatial autoregressive regularizer. Sen et al. [29] proposed a global model combining MF and temporal convolutional network regularization, which performs better on unnormalized data. Chen et al. [30] proposed a fully Bayesian temporal matrix factorization (BTMF) framework that establishes temporal correlation through vector autoregression (VAR) [31] models. Yet BTMF's use of Gibbs sampling to increase robustness carries a very high time cost, so it is unsuitable for large datasets. Yang et al. [32] proposed an LSTM temporal regularizer for matrix factorization, together with a Group Laplacian (GL) spatial regularizer that exploits the spatial correlation between sensors. More detailed comparisons between TRMF and other competitors can be found in Section 4.1.2.
However, the above frameworks mostly consider how to establish temporal correlation, ignoring the effects of noise, outliers, and missing values. Too many parameters in a model can also cause overfitting. To the best of our knowledge, the L2,1 norm has better robustness properties [33,34], and non-negative matrix factorization can reduce overfitting.

Problem Description and Notation
In this paper, we assume that multidimensional time series data have some cross-dimensional correlation (such as spatial correlation or common trends) and are subject to non-negativity constraints. In general, we organize the data collected by M sensors at N time stamps as a matrix Y ∈ R_+^{M×N}. We use boldface uppercase letters to denote matrices (e.g., Y), boldface lowercase letters to denote column vectors (e.g., y), and unbolded lowercase letters to denote scalars (e.g., a). The remaining symbols are listed in Table 1 (e.g., x_t denotes the t-th column of the matrix X, Ω the set of observed entries, and λ_AR, λ_w the regularization parameters).

Constrained Non-Negative Matrix Factorization Algorithm
In this section, we will introduce the theory of the constrained non-negative matrix factorization algorithm, which is the basis of our proposed model.
Given a multidimensional non-negative time series data matrix Y ∈ R_+^{M×N}, NMF decomposes Y into an approximation by two rank-K non-negative matrices, Y ≈ UX with K ≪ min(M, N), where X ∈ R_+^{K×N} is the latent temporal factor matrix and U ∈ R_+^{M×K} is the latent correlation factor matrix. Under the non-negativity constraint alone, basic NMF does not obtain a unique solution [35]. So, in order for NMF to incorporate prior knowledge and adequately reflect the problem's relevant features, regularization terms can be added on U and/or X. The constrained non-negative matrix factorization is then solved by minimizing the objective function

min_{U≥0, X≥0} ‖Y − UX‖²_F + λ_U J_1(U) + λ_X J_2(X),    (1)

where λ_U and λ_X are regularization parameters used to balance the goodness of fit against the constraints, and J_1(U) and J_2(X) are penalty terms used to enforce application-specific requirements. For example, J_1(U) = ‖U‖²_F and J_2(X) = ‖X‖²_F can be added to constrain the overall parameter size [35], where ‖·‖_F is the Frobenius norm. The objective function (1) uses the squared Euclidean distance to measure the difference between the original data matrix and the approximation UX.
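As a concrete illustration, objective (1) with Frobenius penalties can be minimized by alternating projected gradient steps. The following NumPy sketch is only a minimal illustration; the rank, learning rate, and iteration count are hypothetical choices, not the paper's settings:

```python
import numpy as np

def nmf_projected_gradient(Y, K, lam_U=0.1, lam_X=0.1, lr=1e-3, iters=2000, seed=0):
    """Minimize ||Y - UX||_F^2 + lam_U*||U||_F^2 + lam_X*||X||_F^2
    subject to U >= 0, X >= 0, by alternating projected gradient steps."""
    rng = np.random.default_rng(seed)
    M, N = Y.shape
    U = rng.random((M, K))
    X = rng.random((K, N))
    for _ in range(iters):
        R = U @ X - Y                                          # residual
        U = np.maximum(U - lr * (2 * R @ X.T + 2 * lam_U * U), 0.0)
        R = U @ X - Y
        X = np.maximum(X - lr * (2 * U.T @ R + 2 * lam_X * X), 0.0)
    return U, X

# Usage: factorize a small synthetic nonnegative matrix.
Y = np.random.default_rng(1).random((8, 20))
U, X = nmf_projected_gradient(Y, K=3)
rel_err = np.linalg.norm(Y - U @ X, "fro") / np.linalg.norm(Y, "fro")
```

The `np.maximum(·, 0)` step is the projection enforcing the non-negativity constraints after each gradient update.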

Proposed Method
In this section, we present RTNMFFM, a multidimensional time series forecasting model that effectively handles noise, outliers, and missing values. First, we introduce how RTNMFFM handles anomalous data and establishes temporal dependencies (Section 3.3.1). Then, Section 3.3.2 describes how RTNMFFM alleviates the decrease in prediction accuracy when data are severely missing. Section 3.3.3 describes in detail how the objective function is optimized. Finally, Section 3.3.4 describes how the model makes forecasts.

The objective function (1) measures the difference between the original matrix Y and the approximation UX in the form of a squared error. Because anomalous data produce large squared errors, the objective function is easily dominated by such data. RTNMFFM instead applies the L2,1 norm to quantify the error between the original time series data and the approximation matrix. The L2,1 norm of a matrix Y [34,36] is defined as

‖Y‖_{2,1} = Σ_{t=1}^{N} ‖y_t‖_2 = Σ_{t=1}^{N} ( Σ_{m=1}^{M} y²_{mt} )^{1/2},    (2)

i.e., the sum of the Euclidean norms of the columns of Y. Note that it satisfies the triangle inequality as well as the three norm conditions, so the L2,1 norm is a legitimate norm [34]. RTNMFFM uses it as the measure of error:

min_{U≥0, X≥0} ‖Y − UX‖_{2,1} + λ_U ‖U‖²_F + λ_X ‖X‖²_F,    (3)

where the first term is the robust formulation of the error function [37], ‖·‖_F is the Frobenius norm used to prevent overfitting [38] or guarantee strong convexity [3], and λ_U and λ_X are the regularization parameters. The robust formulation of the error function is equivalent to

Σ_{t=1}^{N} ‖y_t − Ux_t‖_2,    (4)

so Equation (4) treats the L2,1 norm as the measure of the reconstruction error. When there are outliers in the data, the data exhibit a fat-tailed distribution, and the Laplace distribution is one of the standard statistical tools for analyzing fat-tailed data; it has low sensitivity to outliers and more robust features. Assume that the observed data vector y_t is contaminated by noise ε_t obeying a Laplace distribution with zero mean:

y_t = θ_t + ε_t,    (5)
where θ_t is an unobservable ground-truth value that can be considered a point in a K-dimensional subspace (K < M), i.e.,

θ_t = Ux_t,    (6)

where x_t is the projection of y_t onto the subspace spanned by the columns of U. Assuming that ε_t obeys a Laplace distribution with zero mean and scale parameter b, we have y_t ∼ La(θ_t, b); and since, in general, each vector y_t in Y is independent, the probability distribution of Y conditional on Θ = (θ_1, ..., θ_N) factorizes as

p(Y | Θ) = Π_{t=1}^{N} p(y_t | θ_t).    (7)

The L2,1 metric loss function has a rotational invariance property, while the pure L1 loss function does not have this desirable property. The literature [39] emphasizes the importance of rotational invariance in the context of learning algorithms. The L2,1 norm possesses rotational invariance because its inner layer first takes the L2 norm of each column vector, and rotational invariance is a fundamental property of Euclidean spaces equipped with the L2 norm [36]. Since the subspace is not uniquely determined before mapping the data to a lower-dimensional space, it is common to model the data with distributions that satisfy rotational invariance [36]. By the rotational invariance of the L2,1 norm, the literature [36] generalizes the Laplace distribution to the rotationally invariant Laplace distribution

p(y_t | θ_t) ∝ exp( −‖y_t − θ_t‖_2 / b ).    (8)
According to the strategy of maximizing the log-likelihood of the data, we obtain

max_{Θ} Σ_{t=1}^{N} log p(y_t | θ_t) = max_{Θ} −(1/b) Σ_{t=1}^{N} ‖y_t − θ_t‖_2 + const.    (9)

Maximizing the data likelihood is equivalent to minimizing the summation in Equation (9), so by replacing θ_t with Ux_t we obtain

min_{U≥0, X≥0} Σ_{t=1}^{N} ‖y_t − Ux_t‖_2 = min_{U≥0, X≥0} ‖Y − UX‖_{2,1}.    (10)

In Equation (10), under the assumption that the fitting error obeys a rotationally invariant Laplace distribution, the maximum likelihood problem is transformed into a non-negative matrix factorization problem with the L2,1 norm as the loss function by imposing the constraints U ≥ 0, X ≥ 0 [37]. Typically, the L0 norm is ideal for eliminating outliers: it counts the non-zero elements of a vector, so it can realize a sparsity constraint. However, solving an L0-norm problem is NP-hard and difficult to optimize; a common approach is to solve an approximation of the L0 norm [11]. Both the L2,1 norm and the L1 norm share the L0 norm's property of producing sparse solutions, so we believe the L2,1 norm can achieve the same robustness goal. Unlike the L0 norm, the L2,1 metric is convex and can be easily optimized [11,34].
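The column-wise definition in Equation (2), and its insensitivity to an outlier column relative to the squared Frobenius loss, can be checked numerically. A small sketch (the toy error matrices below are hypothetical):

```python
import numpy as np

def l21_norm(Y):
    """Column-wise L2,1 norm: the sum over columns t of the Euclidean norm ||y_t||_2."""
    return float(np.sum(np.linalg.norm(Y, axis=0)))

# One outlier column inflates the squared Frobenius loss far more than the L2,1 loss.
E = np.zeros((4, 5))
E[:, 0] = 1.0                  # a small, uniform error in one column
E_out = E.copy()
E_out[:, 1] = 10.0             # one outlier column
fro_growth = np.linalg.norm(E_out, "fro") ** 2 / np.linalg.norm(E, "fro") ** 2
l21_growth = l21_norm(E_out) / l21_norm(E)   # grows much more slowly
```

Because the outlier contributes linearly (through ‖e_t‖_2) rather than quadratically, the L2,1 objective is far less dominated by it, which is the robustness property exploited above.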
From a computational point of view, the error for each data point in the robust formulation is ‖y_t − Ux_t‖_2, which prevents large errors due to outliers from dominating the objective function as squared errors would [34,37]. We use two-dimensional toy data to show the robustness of the L2,1 metric loss function, generating 10 two-dimensional data points, two of which are outliers. We fit the data points using the L2,1 metric loss function and standard non-negative matrix factorization, respectively, projecting all data onto a one-dimensional subspace. The residuals corresponding to each data point are shown in Figure 2: the non-negative matrix factorization with the L2,1 metric loss has a much smaller error than standard NMF and is minimally affected by the two outliers. This robust approach keeps large errors caused by outliers from dominating the objective function by compressing the larger residual values. However, the model defined by the loss function in Equation (4) is not yet able to predict future data; next, we describe in detail how RTNMFFM constructs temporal correlations and forecasts future values. In general, the latent state at a particular timestamp may be related to the latent states at one or more preceding timestamps. The relationship between the columns of the latent temporal factor matrix can be expressed as

x_t ≈ Σ_{l∈L} w_l ⊙ x_{t−l},    (11)

where L is a lag set storing the temporal relationships between the columns of X, w_l is a vector of autoregressive coefficients for x_{t−l}, and all w_l together form the coefficient matrix W.
In the temporal structure shown in Figure 3, the state at each timestamp is related to the states at the first and third preceding timestamps. In this example, L = {1, 3} and W = [w_1, w_3], and any column x_t of the matrix X is a combination of its first previous column x_{t−1} and its third previous column x_{t−3}:

x_t ≈ w_1 ⊙ x_{t−1} + w_3 ⊙ x_{t−3}.    (12)

Based on this temporal relationship between the columns of the latent temporal factor matrix X, we introduce an AR temporal regularizer [3], J_AR(X | W, L, λ_w), on top of the non-negative matrix factorization. The model thus learns, in an autoregressive manner, the temporal coefficient matrix W, the latent correlation factor matrix U, and the latent temporal factor matrix X best adapted to the system, without manually setting the coefficients W.
The matrix factorization model that incorporates a temporal regularizer establishes a temporal relationship between the columns of X and, at the same time, enables forecasting of high-dimensional time series data. The AR regularizer is

J_AR(X | W, L, λ_w) = Σ_{t=l_d+1}^{N} ‖x_t − Σ_{l∈L} w_l ⊙ x_{t−l}‖²_2 + λ_w ‖W‖²_F,    (13)

where, in the first term, L = {l_1, l_2, ..., l_d} is a time-lag set indicating the temporal correlation topology; the temporal structure indicated by the lag set is reflected in the latent temporal factor matrix. l_d is the maximum value of the time-lag set (l_d = max_{l∈L} l), ⊙ is the element-wise product, and w_l is a K × 1 coefficient vector, a learnable parameter representation of the autoregressive coefficients. The last term regularizes W = (w_1, w_2, ..., w_{l_d}) to prevent overfitting. The time-lag set in the AR regularizer can be discontinuous so as to flexibly embed seasonality [3].
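The AR regularizer can be sketched directly from its definition. This helper is a minimal illustration, assuming W stores one K-dimensional coefficient column per lag; when X follows the autoregressive recursion exactly, the data term of the penalty vanishes:

```python
import numpy as np

def ar_regularizer(X, W, lags, lam_w=0.1):
    """AR temporal regularizer: sum over t > max(lags) of
    ||x_t - sum_l w_l (elementwise *) x_{t-l}||^2, plus lam_w * ||W||_F^2.
    W holds one K-dimensional coefficient column per lag in `lags`."""
    K, N = X.shape
    ld = max(lags)
    penalty = 0.0
    for t in range(ld, N):                     # 0-indexed: t >= l_d
        pred = sum(W[:, i] * X[:, t - l] for i, l in enumerate(lags))
        penalty += float(np.sum((X[:, t] - pred) ** 2))
    return penalty + lam_w * float(np.sum(W ** 2))

# If X follows the AR recursion exactly, the data term of the penalty vanishes.
K, N, lags = 2, 12, [1, 3]
W = np.array([[0.6, 0.2], [0.3, 0.5]])         # shape (K, len(lags))
rng = np.random.default_rng(0)
X = np.zeros((K, N))
X[:, :max(lags)] = rng.random((K, max(lags)))  # seed columns
for t in range(max(lags), N):
    X[:, t] = sum(W[:, i] * X[:, t - l] for i, l in enumerate(lags))
data_term = ar_regularizer(X, W, lags, lam_w=0.0)
```

Minimizing this penalty therefore pulls each latent column toward the element-wise AR combination of its lagged columns.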
Introducing the AR regularizer (13) into the loss function (3) in place of the Frobenius penalty on X, we obtain

min_{U≥0, X≥0, W} ‖Y − UX‖_{2,1} + λ_U ‖U‖²_F + λ_AR J_AR(X | W, L, λ_w).    (14)

The illustration of RTNMFFM is shown in Figure 4: RTNMFFM decomposes the data matrix Y on the left into the dimensional feature matrix U and the temporal feature matrix X on the lower right, while the upper right depicts how RTNMFFM mines the temporal dependence between x_t and its historical data through the autoregressive regularizer. It is important to note that the model proposed in this paper can be viewed as a special kind of dynamic factor model (DFM). Compared to TRMF, our proposed method is more robust and uses NMF rather than standard MF. The non-negativity constraint of NMF on the latent temporal factors favors sparse encodings (more zero values); in learning the autoregressive coefficients, fewer elements are involved in training, thus reducing overfitting. Although more complex temporal regularizers have been studied and applied [29,32], RTNMFFM still chooses to capture the linear relationship of the AR model, because NMF can only extract the linear structure of the data and our study focuses on robustness.
The differences between our proposed method and the DFM formulation in Equation (15) are as follows. First, our model assumes that the noise ε_t obeys a Laplace distribution, which is more appropriate for describing data in the presence of outliers. Second, most existing studies [40–42] use principal component analysis (PCA) to obtain the sequence correlation, whereas our method uses NMF. PCA represents the data as a linear combination of all eigenbasis vectors, so it can extract only the linear structure of the data. In contrast, NMF represents data using combinations of varying numbers of differently weighted basis vectors, so it can extract the multilinear structure of the data and has some capability for handling nonlinear data.

Periodic Smoothing Penalty for Severe Missing Values
The percentage of missing data can be as high as 50% or more in real sensor collections [10]. Severe missingness can result in inaccurate forecasts. A similar phenomenon was reported by Fernandes et al. [43], who refer to it as the "misalignment problem in matrix factorization with missing values". The problem is illustrated in Figure 5: even though the nonnegative matrix factorization accurately captures the evolution of the time series within the missing gaps, the estimated range of values is far from the actual range. An intuitive observation is that the variation between consecutive timestamps is normally smooth, whereas this smoothness disappears when data are severely missing, and the unsmoothness carries over into the latent temporal factor matrix obtained by matrix factorization. When observations are missing, the latent temporal factor matrix derived from the temporal regularization matrix factorization algorithm loses its smoothness. Numerically, this manifests as lower values of the latent temporal factor vectors for time slices with missing data relative to time slices without missing data, and as a misalignment of the vectors. The loss of smoothness in the latent temporal factor matrix leads to overfitting, prevents the autoregressive parameters from correctly representing the temporal relationships, and ultimately degrades prediction accuracy. Based on this motivation, we need a way to improve generalization, and our main goal is a strategy for automatically smoothing the latent temporal factor vectors where the misalignment problem occurs.
The matrix factorization-based temporal regularization framework can be considered a specific instance of a dynamic factor model (DFM) [44] that searches for linear links between high-dimensional observations and hidden states [30]. Imposing a smoothing penalty on a pair of latent variables makes the two similar, so we aim to reduce the effect of missing observations by adding a smoothing penalty to the objective function that makes the latent states of time slices with many missing values smoother.
Assuming a seasonal period of T, seasonality makes latent temporal factor vectors that are in the same phase similar. To account for this, we rewrite the objective function as

min_{U≥0, X≥0, W} ‖Y − UX‖_{2,1} + λ_U ‖U‖²_F + λ_AR f_w(X),    (16)

where f_w(·) denotes the AR temporal regularizer, which can be formulated as

f_w(X) = Σ_{t=l_d+1}^{N} ‖x_t − x′_t‖²_2 + λ_w ‖W‖²_F,  with x′_t = Σ_{l∈L} w_l ⊙ x_{t−l},    (17)

i.e., x′_t takes the historical temporal feature vectors x_{t−1}, x_{t−2}, ..., x_{t−l_d} as input and produces a one-step forecast. The temporal regularizer can be viewed as smoothing x_t toward the autoregressive prediction x′_t; this smoothing makes x_t similar to x′_t and, due to seasonality, x_t will also be similar to x_{t−T}. Based on this observation, we consider applying a smoothing penalty between the latent temporal factor vectors that have lost smoothness due to severe data missingness and in-phase historical latent temporal factor vectors, so that these factor vectors regain their smoothness.
Since missing values result in latent temporal factor vectors with low numerical values, we force smoothing between these low-valued vectors and in-phase high-valued vectors through a regularization penalty, exploiting the seasonality present in the time series. For example, for any x_t, the latent temporal factor vectors in the same phase are x_{t−T}, x_{t−2T}, and so on. The rationale is that, as mentioned before, in-phase latent temporal factor vectors should be similar due to seasonality. The goal is to minimize the misalignment illustrated in Figure 5.
Based on this motivation, we propose to reduce the misalignment induced by missing values by applying a smoothing penalty to latent temporal factor vectors with lower numerical values. The smoothing penalty takes the form

λ_T ‖x_t − x_ht‖²_2,    (18)

where λ_T is the regularization parameter of the periodic smoothing penalty, x_t is a low-valued latent temporal factor vector that needs to be smoothed, and x_ht is a smoother, higher-valued latent temporal factor vector with high confidence. This smoothing approach raises two questions: (1) how do we select the vectors that need to be smoothed, and (2) with which vectors x_ht should the selected low-valued vectors be smoothed?
We propose to decide whether to apply the periodic smoothing penalty based on the energy P_t of the time slice x_t in the temporal factor matrix, defined as

P_t = ‖x_t‖²_2 = Σ_{k=1}^{K} x²_{kt}.    (19)

We define P_mean-T as the average energy of all latent temporal factor vectors up to moment t that are in the same phase as x_t:

P_mean-T = (1/n) Σ_{k=1}^{n} P_{t−kT},    (20)

where n is the number of in-phase historical slices considered. To prevent overfitting, the periodic smoothing penalty is not applied to all time slices; it is used only when the energy P_t of the current latent temporal factor is smaller than P_mean-T. This is consistent with the motivation above: we only need to smooth latent temporal factor vectors with lower values, whose energy P_t will be smaller. For example, suppose the seasonality is T = 3 and the time-lag set is L = {1, 3}, and we need to compute P_mean-T for x_10. We first calculate the energies P_7 and P_4 of x_7 and x_4, then take their average P_mean-T = (P_7 + P_4)/2. Finally, if P_10 is smaller than P_mean-T, a periodic smoothing penalty on x_10 is required; if P_10 is larger than P_mean-T, we force the regularization parameter λ_T = 0, so that x_10 does not participate in the periodic smoothing penalty. This test prevents higher-energy time slices from participating in the periodic smoothing penalty and prevents normal latent temporal factor vectors from losing features they should retain by overfitting to previous time vectors. We exploit the property that seasonality leads to similar values of the latent temporal factor vectors: if the current vector x_t is too low because of missing data, its energy will be lower than the average energy of the in-phase historical vectors, and we pick out the vectors that need to be smoothed by this method.
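The energy-based selection rule can be sketched as follows. This is a simplified illustration: it assumes the energy of a slice is its squared Euclidean norm and, following the worked example above, averages over a fixed number of in-phase predecessors (two here); the exact look-back window is an assumption of the sketch:

```python
import numpy as np

def slice_energy(x):
    """Energy of a latent temporal factor vector (assumed: squared L2 norm)."""
    return float(np.sum(x ** 2))

def needs_smoothing(X, t, T, n_back=2):
    """Decide whether to apply the periodic smoothing penalty to column t of X.
    Returns (apply, x_ht): apply is True when the energy of x_t falls below the
    mean energy of its in-phase predecessors x_{t-T}, x_{t-2T}, ...; x_ht is the
    highest-energy in-phase predecessor, used as the high-confidence slice."""
    idx = [t - k * T for k in range(1, n_back + 1) if t - k * T >= 0]
    if not idx:
        return False, None
    energies = [slice_energy(X[:, i]) for i in idx]
    p_mean = sum(energies) / len(energies)
    x_ht = X[:, idx[int(np.argmax(energies))]]
    return slice_energy(X[:, t]) < p_mean, x_ht

# Worked example (0-indexed): with T = 3, column t = 9 looks back at columns 6 and 3.
X = np.ones((2, 12))
X[:, 3] = 3.0                  # high-energy in-phase predecessor
X[:, 6] = 2.0
X[:, 9] = 0.1                  # low-energy slice, e.g. due to missing data
apply_penalty, x_ht = needs_smoothing(X, t=9, T=3)
```

Here `x_ht` also answers the second question: the highest-energy in-phase predecessor serves as the high-confidence vector toward which the low-energy slice is smoothed.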
For the second question, we define the high-confidence latent temporal vector x_ht of x_t as the historical latent temporal vector that is in the same phase as x_t and has the highest energy. The reason is that low-valued latent temporal factor vectors should be smoothed with higher-valued latent temporal vectors, which have relatively higher energies; moreover, due to seasonality, in-phase latent temporal factors should be similar. For example, if x_10 requires a periodic smoothing penalty, we find the latent temporal factor vector with maximum energy among x_7 and x_4 and use it as the high-confidence latent temporal vector x_ht for x_10.
We give an intuitive illustration of the latent temporal factor matrix produced by temporal regularization matrix factorization with and without our periodic smoothing penalty. We ran temporal regularization matrix factorization on the Guangzhou urban traffic speed dataset, of size 214 × 1380, treating all data from t = 360 to t = 1200 as missing, and set the rank of the low-rank factorization to K = 40. In Figure 6, the blue line is the result with the periodic smoothing penalty added and the red line without it. The latent temporal factors are visibly smoother with the periodic smoothing penalty, which is very apparent from t = 360 to t = 1200. This shows that our proposed penalty makes the latent temporal factor matrix smoother, which is more conducive to learning the parameters of the temporal regularization term and to the final prediction. The loss function with the periodic smoothing penalty added is

min_{U≥0, X≥0, W} ‖Y − UX‖_{2,1} + λ_U ‖U‖²_F + λ_AR J_AR(X | W, L, λ_w) + λ_T Σ_{t∈S} ‖x_t − x_ht‖²_2,    (21)

where S is the set of time slices selected by the energy criterion. The objective function in Equation (21) focuses on severely missing data: for lower-energy time slices, RTNMFFM actively exploits high-confidence slices, while for higher-energy time slices no smoothing is performed, to prevent overfitting.
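Putting the pieces together, the full objective in Equation (21) can be evaluated as below. This is a sketch under simplifying assumptions: the AR term is weighted by a single λ_AR, and the smoothed slices are supplied as precomputed (t, x_ht) pairs rather than selected inside the function:

```python
import numpy as np

def rtnmffm_objective(Y, U, X, W, lags, smooth_pairs,
                      lam_U=0.1, lam_AR=0.1, lam_w=0.1, lam_T=0.1):
    """Sketch of the loss in Equation (21): L2,1 reconstruction error
    + Frobenius penalty on U + AR temporal regularizer on X
    + periodic smoothing penalty over preselected (t, x_ht) pairs."""
    rec = float(np.sum(np.linalg.norm(Y - U @ X, axis=0)))    # ||Y - UX||_{2,1}
    reg_U = lam_U * float(np.sum(U ** 2))
    ar = 0.0
    for t in range(max(lags), X.shape[1]):
        pred = sum(W[:, i] * X[:, t - l] for i, l in enumerate(lags))
        ar += float(np.sum((X[:, t] - pred) ** 2))
    reg_AR = lam_AR * (ar + lam_w * float(np.sum(W ** 2)))
    smooth = lam_T * sum(float(np.sum((X[:, t] - x_ht) ** 2))
                         for t, x_ht in smooth_pairs)
    return rec + reg_U + reg_AR + smooth

# Sanity check: an exact factorization with all penalties switched off scores zero.
U = np.ones((2, 2)); X = np.ones((2, 4)); Y = U @ X
W = np.zeros((2, 2))
val = rtnmffm_objective(Y, U, X, W, [1, 3], [], lam_U=0, lam_AR=0, lam_w=0, lam_T=0)
```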

Optimization
To minimize the objective function in Equation (21), we propose a gradient-descent-based algorithm. The optimization of the L_2,1 norm can be found in [11,37]. RTNMFFM alternates the optimization of the factor matrices: it updates one factor matrix at a time while fixing all the others. We use the Adam optimizer [12], which has been successful in deep learning, to accelerate training. Suppose we update the relevant parameters as follows at iteration p.
Updates for the latent correlation factor matrix U. First, take the partial derivative with respect to U, where ⊤ denotes the transpose of a matrix or vector and D is a diagonal matrix whose diagonal elements are given by Equation (23). Then update the matrix U^(p+1), where α is the learning rate and P_+ denotes the projection operator that projects every element onto the non-negative semi-axis; this takes the form of Equation (25).
Updates for the latent temporal factor matrix X. The gradient with respect to X is split into two parts. First, calculate the gradient of the first term in Equation (21), where the matrix D has the same form as in Equation (23). Second, for the vectors x_t, t = 1, 2, ..., N, calculate the gradient of the AR regularization term and the periodic smoothing penalty term column by column for t > l_d. Combine all column vectors from Equation (27) into a matrix, add Equation (26) to obtain ∇_X F(U^(p+1), X^(p), W^(p)), and update X in matrix form.

Updates for the AR regularizer parameters W. For the vectors w_i, i = 1, 2, ..., l_d, combine all column vectors into the matrix ∇_W F(U^(p+1), X^(p+1), W^(p)) and update W in matrix form.

Overall training. Algorithm 1 describes the entire training process of RTNMFFM. First, each factor matrix is initialized under the non-negativity constraint. In each iteration, we update the factor matrices sequentially, using the Adam optimizer [12] to accelerate training. The model with the lowest error on the validation set is selected as the final model.
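The alternating scheme can be illustrated with a simplified sketch. It uses the column-wise L_2,1 gradient (UX − Y)DX^⊤ with D_tt = 1/‖y_t − Ux_t‖_2, plain gradient steps instead of Adam, and omits the AR and smoothing penalty terms, so it is only an approximation of the paper's Algorithm 1, not the full method:

```python
import numpy as np

def l21_grad_U(Y, U, X, eps=1e-8):
    """Gradient of ||Y - UX||_{2,1} with respect to U.
    D is diagonal with D_tt = 1 / ||y_t - U x_t||_2."""
    R = U @ X - Y                                  # residual, shape (n, T)
    d = 1.0 / (np.linalg.norm(R, axis=0) + eps)    # diagonal of D
    return (R * d) @ X.T                           # (U X - Y) D X^T

def train(Y, K, iters=200, alpha=1e-2, seed=0):
    """Alternating projected gradient descent: update one factor at a
    time, then project onto the nonnegative orthant (the operator P_+).
    AR and smoothing penalties are omitted for brevity; the paper also
    uses Adam rather than plain gradient steps."""
    rng = np.random.default_rng(seed)
    n, T = Y.shape
    U = rng.random((n, K))
    X = rng.random((K, T))
    for _ in range(iters):
        U = np.maximum(U - alpha * l21_grad_U(Y, U, X), 0.0)   # P_+
        # gradient of the same term w.r.t. X is U^T (U X - Y) D
        R = U @ X - Y
        d = 1.0 / (np.linalg.norm(R, axis=0) + 1e-8)
        X = np.maximum(X - alpha * (U.T @ (R * d)), 0.0)
        # (the W update for the AR regularizer would go here)
    return U, X
```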

Forecasting
For RTNMFFM forecasting, given the matrices U and X and the temporal regularizer parameters W, the latent temporal factor vector at the future moment t + 1 is forecasted in a single step by the autoregressive model, x̂_{t+1} = Σ_{l∈L} w_l ∘ x_{t+1−l}, where L is the time lag set. The model is based on one-step forecasting and performs multi-step forecasting by recursively reusing the predicted values as inputs, estimating the observed time series as ŷ_{t+1} = U x̂_{t+1}. Figure 7 illustrates how the model makes multi-step forecasts.
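A minimal sketch of this rolling scheme, assuming the elementwise AR parameterization x̂_{t+1} = Σ_l w_l ∘ x_{t+1−l} over a lag set (function and variable names are ours):

```python
import numpy as np

def rolling_forecast(U, X, W, lags, steps):
    """Recursive multi-step forecasting: one-step AR forecasts of the
    latent temporal vectors are fed back as inputs, then mapped to the
    observation space by y_hat = U x_hat.  W[i] is the elementwise
    coefficient vector for lag lags[i] (an assumed parameterization)."""
    Xh = X.copy()
    preds = []
    for _ in range(steps):
        x_new = np.zeros(Xh.shape[0])
        for w, l in zip(W, lags):
            x_new += w * Xh[:, -l]        # elementwise AR over the lag set
        Xh = np.column_stack([Xh, x_new]) # reuse the prediction as input
        preds.append(U @ x_new)           # back to observation space
    return np.column_stack(preds)         # shape (n, steps)
```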

Experiments
To demonstrate the performance of the proposed model, in this section we experiment on five datasets. We set up multiple noise and missing-data configurations to fully validate the model. All experiments are performed on a machine equipped with an Intel i7-10700 CPU. The software environment is Python 3.8 with NumPy 1.22.3.

Experimental Settings
We evaluate forecasting accuracy with the normalized deviation (ND) and the root mean squared error (RMSE), ND = Σ_{(i,t)∈Ω} |Y_it − Ŷ_it| / Σ_{(i,t)∈Ω} |Y_it| and RMSE = ( (1/|Ω|) Σ_{(i,t)∈Ω} (Y_it − Ŷ_it)² )^{1/2}, where Ω indicates the test set, Y_it is the observed entry with index (i, t), and Ŷ_it is the corresponding forecast.
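Assuming the standard definitions of ND and RMSE used with TRMF-style models, the two metrics can be computed as:

```python
import numpy as np

def nd_rmse(Y, Y_hat, mask):
    """ND and RMSE over the test entries Omega, given as a boolean mask
    over the observation matrix (True = entry belongs to the test set)."""
    diff = Y[mask] - Y_hat[mask]
    nd = np.abs(diff).sum() / np.abs(Y[mask]).sum()   # normalized deviation
    rmse = np.sqrt(np.mean(diff ** 2))                # root mean squared error
    return nd, rmse
```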

Competitors
We compare RTNMFFM with the following competitors.
• TRMF [3]: A temporally regularized matrix factorization method (https://github.com/xinychen/transdim, accessed on 18 January 2024). We rewrite it as a gradient-based version. It forecasts with an AR framework.
• ARNMF [4]: A non-negativity-constrained version of TRMF. We use this baseline to verify whether NMF is the main reason for the better performance achieved by RTNMFFM. It forecasts with an AR framework.
• SVD-AR(1) [3]: The rank-K approximation Y = USV^⊤ is first obtained by singular value decomposition (SVD). After setting F = US and X = V^⊤, a K-dimensional AR(1) is learned on X for forecasting.
• BTMF [30]: A Bayesian Temporal Matrix Factorization (BTMF) framework that incorporates a VAR layer into Bayesian probabilistic matrix factorization algorithms (https://github.com/xinychen/transdim, accessed on 18 January 2024). This framework forecasts by taking the observation at the corresponding position in the previous period as the predicted value; it can be regarded as a seasonal naive forecast applied after the data have been imputed with high accuracy.

Hyper-Parameter
The hyper-parameters used are listed in Table 2. The last few timestamps of each dataset are used as the test and validation sets, whose lengths are also shown in Table 2. For matrix factorization models using AR as a regularizer, all regularization parameters and learning rates are selected on the validation set, using the same rank K, time lag set, and forecasting window size. For all algorithms, we use a grid search to select the regularization parameters and learning rates, and the hyper-parameter set for which the model obtains the minimum RMSE over all sub-datasets is taken as the final setting. All gradient-based algorithms are trained with Adam [12] acceleration. For BTMF and LSTM-ReMF, we use the hyper-parameters suggested in the original papers. For all models, the data are min-max normalized before training, and the model with the lowest ND on the validation set is chosen as the final model.
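The grid search over regularization parameters and learning rates can be sketched generically; `train_fn` and `eval_fn` below are hypothetical callables standing in for model training and validation-set evaluation, not part of the paper's code:

```python
import itertools

def grid_search(train_fn, eval_fn, grid):
    """Exhaustive grid search: train one model per hyper-parameter
    combination and keep the one with the lowest validation error."""
    best, best_err = None, float("inf")
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        err = eval_fn(train_fn(params))
        if err < best_err:
            best, best_err = params, err
    return best, best_err
```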

Forecasting Accuracy
In Table 3, we compare the forecasts of RTNMFFM with those of its competitors on the raw data. All results are reported as "ND/RMSE". RTNMFFM(0) denotes RTNMFFM without the periodic smoothing penalty. RTNMFFM obtains the best results in almost all cases; we attribute its weaker performance on some datasets to the distribution of the residuals. If the residuals obey a Gaussian distribution, the maximum likelihood problem can be converted into a standard NMF problem, which may explain why models based on standard NMF take the lead on Data (H) and Data (E). Even so, the proposed method still gives sub-optimal results on both datasets. A comparison of model run times can be found in Appendix A. Moreover, to examine the effect of noise on RTNMFFM, we add Gaussian and Laplace noise to all the data in Section 4.2.1, and in Section 4.2.2 we test the forecasting performance of the proposed model under multiple missing patterns.

Forecasting Performance in Noisy Data
To verify the validity of the model under different background noises, we add Gaussian white noise (mean µ = 0, variance σ² = 1) and Laplace noise (location parameter µ = 0, scale parameter b = 1) to each dataset to generate signal-to-noise ratios of 5, 2, 0, −2, and −5. The signal-to-noise ratio is defined as SNR = 10 log₁₀(P_s / P_n), where P_s is the power of the signal and P_n is the power of the noise. The observed data are discrete, so for one time series y_i of the observation matrix Y, the signal power is P_s = (1/N) Σ_{t=1}^{N} y_i(t)². To add noise at a given signal-to-noise ratio, we first compute the average power of the original signal, then compute the noise power by Equation (32), and finally generate noise that obeys the Laplace distribution or the standard normal distribution and scale it by √P_n. The added noise is present only in the training set and does not affect the validation and test sets. To preserve the distribution of the added noise, we relax the non-negativity of the data and apply min-max normalization before training. Note that here RTNMFFM does not use the periodic smoothing penalty (i.e., λ_T = 0).
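The recipe above can be sketched as follows. Note that the √P_n scaling is exact for the standard normal, whose power is 1, and an approximation for Laplace(0, 1), whose variance is 2, which matches the text's procedure:

```python
import numpy as np

def add_noise(Y, snr_db, kind="gaussian", seed=0):
    """Add noise at a target SNR following the recipe in the text:
    compute the signal power P_s, derive P_n = P_s / 10^(SNR/10), draw
    unit-parameter noise, and scale it by sqrt(P_n)."""
    rng = np.random.default_rng(seed)
    P_s = np.mean(Y ** 2)                 # discrete signal power
    P_n = P_s / (10 ** (snr_db / 10))     # noise power from the SNR
    if kind == "gaussian":
        noise = rng.standard_normal(Y.shape)
    else:
        noise = rng.laplace(0.0, 1.0, size=Y.shape)
    return Y + np.sqrt(P_n) * noise
```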
Forecasting results and analysis. The forecasting performance of the models after adding Gaussian and Laplace noise to the five datasets is shown in Figures 8 and 9. All errors are calculated after undoing the scaling. Here the RTNMFFM model does not impose the periodic smoothing penalty.
RTNMFFM shows competitive forecasting results among these matrix-factorization-based models. When the source of real noise is particularly complex, Gaussian noise can be considered a reasonable approximation of it. In Figure 8, RTNMFFM has a clear advantage in forecasting performance under Gaussian noise. For datasets (G) and (H), the gains of RTNMFFM over the other competitors are more pronounced. A possible reason is that these two datasets are shorter, so RTNMFFM is better able to learn temporal correlation from the limited information under noise. Comparing the results of RTNMFFM and ARNMF shows that the L_2,1 norm plays the dominant role in handling noise. To validate RTNMFFM more fully, its performance under Laplace noise is shown in Figure 9. As can be seen, RTNMFFM can still learn temporal dependencies from noisy data, mitigating the effect of noise and producing better forecasts.

Forecasting Performance in Missing Data
In this section, we validate the effectiveness of the proposed model under various data-loss scenarios. The first is random point-wise missing (RM): we randomly remove certain observations and model the missing entries with the value zero. The second is continuous missing (CM), where data are missing for a sustained period of time, corresponding to a sensor that fails with some probability on a given day. The missing-data setup follows the literature [32], in which continuous missingness spans multiple intervals and every sensor exhibits missing data. We test the forecasting accuracy of each model at 60%, 50%, 40%, and 20% missing rates. The missing data are introduced only in the training set and do not affect the validation and test sets. We also apply the proposed periodic smoothing penalty to TRMF to test its scalability (denoted TRMF(1)).
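A simplified sketch of the two missing-data generators described above; the block length and exact placement of the CM intervals are our assumptions, since the text only specifies that each sensor loses data over sustained intervals:

```python
import numpy as np

def random_missing(shape, rate, seed=0):
    """Point-wise random missing (RM): drop each entry independently
    with probability `rate` (True = missing)."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) < rate

def continuous_missing(shape, rate, block, seed=0):
    """Continuous missing (CM): each sensor loses whole blocks of
    consecutive timestamps (e.g., a day of failure) until roughly the
    requested fraction of its timeline is missing."""
    rng = np.random.default_rng(seed)
    n, T = shape
    mask = np.zeros(shape, dtype=bool)
    target = int(rate * T)
    for i in range(n):
        while mask[i].sum() < target:
            start = rng.integers(0, T - block)
            mask[i, start:start + block] = True
    return mask
```

In training, masked entries would then be set to zero, matching how random missingness is modeled in the text.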
Forecasting results and analysis. Tables 4 and 5 report the forecasting performance of RTNMFFM and the baseline models on datasets (G), (H), (P), (C), and (E). For random missing values, RTNMFFM leads in prediction accuracy across the board, and it performs well for both categories of missing data when more than 50% of the data are missing. For continuous missing values, RTNMFFM exhibits superior predictive accuracy on all datasets except (H). Comparing RTNMFFM with BTMF, BTMF is similarly robust on dataset (H); its strong performance stems from the Bayesian probabilistic matrix factorization algorithm.
Comparing RTNMFFM with its version without the periodic smoothing penalty (RTNMFFM(0)), the forecasting results show that the proposed penalty improves predictive ability to some extent when data are severely missing.
The forecasting results also show that RTNMFFM(0) can obtain good results in some cases. A matrix factorization model that measures the factorization error with the L_2,1 norm is also known as indirect sparse matrix factorization with the L_2,1 norm [33]; it indirectly optimizes an upper bound of the F-norm objective, which explains its ability to extract more effective information from sparse data and obtain more accurate predictions. To compare the variability of the predictions, we also applied the Diebold-Mariano test [45] to RTNMFFM and the best alternative model; the conclusions are given in Appendix B.
Comparing TRMF and TRMF(1), TRMF(1) has better forecasting accuracy, which demonstrates the scalability and effectiveness of the periodic smoothing penalty proposed in this paper. Consistent with the earlier conclusions, the penalty is more effective when the proportion of missing data is high. The improvements of RTNMFFM are more pronounced on datasets (P), (C), and (E): datasets (C) and (E) contain more sampling cycles, so RTNMFFM can select high-confidence time slices from more candidates when applying the periodic smoothing penalty, while dataset (P) has far more time series entries than timestamps. These characteristics may explain why RTNMFFM is more competitive. Even on the poorly performing dataset (G), RTNMFFM remains optimal when more than 50% of the values are missing.
However, at 20% missing data, RTNMFFM performs slightly worse, most likely because of overfitting. We therefore suggest using the periodic smoothing penalty only when many values are missing, or applying it only in the initial phase of model learning. Next, we show forecast visualizations of RTNMFFM on missing data.

Conclusions and Future Work
This paper proposes a robust forecasting model for multidimensional time series data. We integrate the L_2,1 norm and the AR regularizer into the non-negative matrix factorization algorithm; this combination better establishes the temporal correlation of multidimensional time series when the data contain noise and missing values. In addition, the proposed periodic smoothing penalty improves the accuracy of RTNMFFM on incomplete data. RTNMFFM provides a powerful tool for multidimensional time series prediction tasks, and we have evaluated the model on several real-world time series matrices, where it shows superior performance compared to the baseline models. Several directions remain for future research. First, an important reason why time series data are difficult to predict is data drift: the data distribution may change as time evolves, and an online learning framework is needed to correct the prediction results. Second, our model captures trend and seasonality well, but it does not easily capture cyclicality. Finally, our model can only establish linear correlations; modeling nonlinear temporal correlations with deep-learning-based matrix factorization and diffusion models is a focus of future research.

Figure 1.
Figure 1. Illustration of two data missing patterns and outliers.

Figure 2.
Figure 2. Plot of the residual |Y − UX| using standard NMF and the proposed method. Data points #2 and #6 are outliers.

Figure 5.
Figure 5. Example of the matrix factorization misalignment with respect to the observed time series.

Figure 6.
Figure 6. Results with and without the periodic smoothing penalty.

Figure 7.
Figure 7. Illustration of the one-step rolling forecasting scheme.

Figure 8. Test RMSE of RTNMFFM and competitors for varying SNR with Gaussian noise.
Figure 9. Test RMSE of RTNMFFM and competitors for varying SNR with Laplace noise.
Figure 10 shows the prediction visualization of RTNMFFM on the Hangzhou metro passenger flow dataset with 60% of the data randomly missing; the red line is the predicted value, and the blue line is the true value. As shown in the figure, the red and blue lines match each other for both drastic short-term changes and long-term fluctuations, demonstrating our method's effectiveness under severe missingness.

Table 1.
Table of symbols.

u_j^⊤: the jth row of the matrix U. x_t: the latent temporal factor vector at time t.
We change the data to reflect hourly speed by aggregating blocks of six columns, organizing the raw dataset into a 214 × 1380 time series matrix with 1.3 percent missing values.
• Dataset (H): Hangzhou metro passenger flow (https://tianchi.aliyun.com/competition/entrance/231708/information, accessed on 18 January 2024). With a 10-minute resolution, this dataset collects inbound passenger flows from 80 subway stations in Hangzhou, China. We ignore the timeframe 0:00 a.m.-6:00 a.m. and structure the dataset into an 80 × 2700 matrix with no missing data.
• Dataset (P): Pacific surface temperature (http://iridl.ldeo.columbia.edu/SOURCES/.CAC/, accessed on 18 January 2024). This dataset includes monthly sea surface temperature measurements of the Pacific Ocean over 396 consecutive months from January 1970 to December 2002. The matrix size is 2520 × 396 with no missing data.
• Dataset (C): California freeway occupancy rates (https://archive.ics.uci.edu/ml/datasets/PEMS-SF, accessed on 18 January 2024). These data come from the California Department of Transportation and record lane occupancy rates on San Francisco Bay Area freeways. We combine the 10-minute samples into hourly data and confine our study to the last 2736 timestamps, organizing the data into a 963 × 2736 matrix with no missing data.
• Dataset (E): Electricity load data (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, accessed on 18 January 2024). Electricity consumption data of 370 clients with a 15-minute sampling frequency. We use the last 10272 timestamps, organizing the data into a 370 × 10272 matrix with no missing data.
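The column aggregation used to build the Guangzhou and PEMS matrices can be sketched as follows; averaging within each block is an assumption, since the text only says that blocks of columns are aggregated:

```python
import numpy as np

def aggregate_columns(Y, block=6):
    """Aggregate consecutive blocks of `block` columns (e.g., six
    10-minute samples into one hourly value) by averaging.  Any
    incomplete trailing block is dropped."""
    n, T = Y.shape
    T_trim = (T // block) * block             # drop the incomplete tail
    return Y[:, :T_trim].reshape(n, T_trim // block, block).mean(axis=2)
```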

Table 2.
Default hyper-parameter settings; V-length/T-length indicates the number of validation set/test set timestamps.

Table 4.
Performance comparison (ND/RMSE) for forecasting tasks with random missing.

Table 5.
Performance comparison (ND/RMSE) for forecasting tasks with continuous missing.
The best results are underlined.

Table A2.
The DM test results for RTNMFFM and the best alternative algorithm in 60% RM.

Table A3.
The DM test results for RTNMFFM and the best alternative algorithm in 60% CM.