Multivariate Time Series Imputation: An Approach Based on Dictionary Learning

The problem addressed by dictionary learning (DL) is the representation of data as a sparse linear combination of columns of a matrix called dictionary. Both the dictionary and the sparse representations are learned from the data. We show how DL can be employed in the imputation of multivariate time series. We use a structured dictionary, which is comprised of one block for each time series and a common block for all the time series. The size of each block and the sparsity level of the representation are selected by using information theoretic criteria. The objective function used in learning is designed to minimize either the sum of the squared errors or the sum of the magnitudes of the errors. We propose dimensionality reduction techniques for the case of high-dimensional time series. For demonstrating how the new algorithms can be used in practical applications, we conduct a large set of experiments on five real-life data sets. The missing data (MD) are simulated according to various scenarios where both the percentage of MD and the length of the sequences of MD are considered. This allows us to identify the situations in which the novel DL-based methods are superior to the existing methods.


Background
It is well-known from the classical literature on time series that a multivariate time series data set is obtained by measuring K > 1 variables at time points 1, . . . , T. The observations are stored in a matrix with T rows and K columns. For ease of writing, we use the notation Z for this matrix. In our notation, the bold letters are used for vectors and matrices. Due to recent technological advances, both T and K are very large for the data sets that are collected nowadays. The massive amount of data poses difficulties for both the storage and the processing. Another challenge comes from the fact that some of the entries of the big matrix Z are missing. The data are incomplete for various reasons: malfunction of the sensors, problems with the transmission of the measurements between devices, or the fact that it is practically impossible to collect all the data (for example, this happens often in astronomy).
The conventional approach is to estimate the missing data and then to use the resulting complete data set in statistical inference. The estimation methods span a wide range from the simple ones that perform imputation for each time series individually by considering the mean or the median, or employ the last value carried forward or the next value carried backward, to the more advanced ones that involve the evaluation of the (Gaussian) likelihood, see for example [1].
Here we do not discuss the imputation methods that can be easily found in time series textbooks; instead, we briefly present the newer methods that have been compared in [2]. For instance, we consider the method DynaMMO from [3], which is closer to the traditional methods in the sense that it uses a technique akin to the Kalman filter (see again [1]) to estimate the missing values. The other methods that we outline below are based on the decomposition of the matrix Z; hence, it is not surprising that they have a certain relationship with the singular value decomposition (SVD). In fact, SVDImp from [4] explicitly uses the SVD factorization UΣV⊤ of the matrix Z, after replacing the missing values with the mean computed for the corresponding row. Note that the symbol (·)⊤ denotes transposition. The most significant κ rows of V⊤ are selected, where the value of κ is chosen empirically. Then linear regression is used to express each row of Z as a linear combination of the most significant κ rows of V⊤. In this way, new estimates are obtained for the missing data, and the procedure above is applied again to the matrix Z that contains these imputed values. The iterations are continued until the change in the magnitudes of the imputed values between two consecutive iterations is smaller than a threshold selected by the user.
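The SVDImp-style iteration described above can be sketched in NumPy as follows. This is a minimal illustration of the idea only, not the implementation from [4]; the function name and all variable names are ours.

```python
import numpy as np

def svdimp(Z, mask, kappa=2, tol=1e-6, max_iter=200):
    """Sketch of an SVDImp-style iteration (hypothetical names).

    Z    : (T, K) matrix with zeros at the missing entries
    mask : (T, K) boolean, True where data are observed
    kappa: number of dominant right singular vectors kept
    """
    X = Z.astype(float).copy()
    # initialize each missing entry with the mean of its row's observed data
    row_mean = (Z * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
    rows, cols = np.where(~mask)
    X[rows, cols] = row_mean[rows]
    for _ in range(max_iter):
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        V = Vt[:kappa]                                 # top-kappa rows of V^T
        # regress each row of X on the selected right singular vectors
        coef, *_ = np.linalg.lstsq(V.T, X.T, rcond=None)
        X_hat = (V.T @ coef).T
        new_vals = X_hat[~mask]
        if np.max(np.abs(new_vals - X[~mask])) < tol:
            X[~mask] = new_vals
            break
        X[~mask] = new_vals                            # re-impute and repeat
    return X
```

On a low-rank matrix with few missing entries, the loop converges quickly to a consistent completion.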
In [5], the imputation problem is formulated as a matrix completion problem. The obtained estimate Ẑ has the same entries as Z at the locations for which the measurements are available. The matrix Ẑ is found by solving a penalized least-squares problem whose expression contains a coefficient λ that balances the two terms involved: (i) half of the sum of the squares of the approximation errors, or equivalently (1/2)‖Z − Ẑ‖²_F (the notation ‖·‖_F stands for the Frobenius norm), and (ii) the sum of the singular values of Ẑ, or equivalently the nuclear norm of Ẑ. The solution is given by a soft-thresholded SVD of Z, where the soft threshold is λ. This suggests the name Soft-Impute of the method, for which we use the acronym SoftImp. In SoftImp, the solutions are obtained in a computationally efficient manner for all the values of λ on a predefined grid. Another algorithm that solves the same penalized least-squares problem by computing the soft-thresholded SVD at each iteration is the Singular Value Thresholding (SVT) from [6]. A particular attribute of SVT is that it automatically finds the optimal rank of the estimated matrix.
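A single SoftImp run for one value of λ can be sketched as below; we alternate between filling in the missing entries with the current estimate and soft-thresholding the singular values (function and variable names are ours, a didactic sketch rather than the implementation from [5]).

```python
import numpy as np

def soft_impute(Z, mask, lam, max_iter=200, tol=1e-7):
    """Sketch of SoftImp for a single penalty value `lam`:
    iterate Z_hat <- SVT_lam(mask*Z + (1-mask)*Z_hat), where SVT_lam
    soft-thresholds the singular values by lam."""
    Z_hat = np.where(mask, Z, 0.0)
    for _ in range(max_iter):
        filled = np.where(mask, Z, Z_hat)       # keep observed entries fixed
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s_thr = np.maximum(s - lam, 0.0)        # soft threshold
        Z_new = (U * s_thr) @ Vt
        if np.linalg.norm(Z_new - Z_hat) < tol * max(1.0, np.linalg.norm(Z_hat)):
            return Z_new
        Z_hat = Z_new
    return Z_hat
```

For small λ relative to the leading singular value, the recovered entries are close to those of the underlying low-rank matrix, with a shrinkage bias controlled by λ.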
An approximation of SVD, which is called centroid decomposition (CD), is employed in [7,8] for representing the matrix Z as Z = LR⊤, where L ∈ R^{T×K} and R ∈ R^{K×K}. Either interpolation or extrapolation is applied for estimating the missing entries of Z, and then a vector s ∈ {−1, 1}^T is found such that the Euclidean norm ‖Z⊤s‖₂ is maximized. The vector c = Z⊤s is obtained and is further used to compute the first column of R, R_{:1} = c/‖c‖₂, and the first column of L, L_{:1} = Z R_{:1}. For a better understanding of why the method is named CD, we mention that c is the first centroid vector. The "new" matrix Z is taken to be Z − L_{:1} R_{:1}⊤, and the algorithm continues until all K columns of L and R are obtained.
Only the first κ columns are used to obtain the approximation of Z given by Σ_{i=1}^{κ} L_{:i} R_{:i}⊤, and this approximation yields the estimates for the missing values. It is interesting that the selection of κ is performed by an entropy-based criterion. When CD is employed in time series recovery, we call it CDRec (as in [2]).
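The CD recursion can be sketched as follows. For illustration only, the maximizing sign vector is found by brute force, which is feasible for small T; the algorithms in [7,8] use much faster search schemes, and all names here are ours.

```python
import numpy as np
from itertools import product

def centroid_decomposition(Z):
    """Sketch of CD: Z = L R^T built column by column from centroid vectors.
    Brute-force sign search; only usable for small T (didactic sketch)."""
    T, K = Z.shape
    Zc = Z.astype(float).copy()
    L = np.zeros((T, K))
    R = np.zeros((K, K))
    for col in range(K):
        best_val, best_s = -1.0, None
        for signs in product((-1.0, 1.0), repeat=T):   # maximize ||Zc^T s||_2
            s = np.array(signs)
            val = np.linalg.norm(Zc.T @ s)
            if val > best_val:
                best_val, best_s = val, s
        c = Zc.T @ best_s                              # centroid vector
        if np.linalg.norm(c) < 1e-12:                  # residual exhausted
            break
        R[:, col] = c / np.linalg.norm(c)
        L[:, col] = Zc @ R[:, col]
        Zc = Zc - np.outer(L[:, col], R[:, col])       # deflate
    return L, R
```

Because consecutive centroid vectors are orthogonal, after K steps the product L R⊤ reproduces Z exactly (up to numerical error).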
Another imputation method is dubbed Grassmannian Rank-One Update Subspace Estimation (GROUSE); see, for example, [9,10]. The name comes from the set of all subspaces of R^T of dimension κ, which is called the Grassmannian. It is evident that an element of the Grassmannian can be represented by any matrix U ∈ R^{T×κ} with the property that U⊤U = I, where I denotes the identity matrix of appropriate dimension. GROUSE finds the matrix U that minimizes the objective function

Σ_{j=1}^{K} ‖Δ_j (Z_{:j} − U U⊤ Z_{:j})‖²₂,

where Δ_j ∈ R^{T×T} is a diagonal matrix which has ones on the main diagonal at the locations corresponding to the data that are available for the j-th column of Z and zeros otherwise. The presence of Δ_j in the formula above shows that the algorithm can work directly with the columns of Z that have missing data. In fact, GROUSE optimizes the cost function by considering one column of Z at a time, and at each such step, the matrix U is updated by adding a rank-one matrix to the matrix U obtained at the previous step. Once the "final" U is found at the last step, the incomplete columns are projected onto the low-rank subspace that was identified, in order to complete the matrix. In robust principal component analysis (RPCA), the data matrix (which is supposed to be complete) is represented as a low-rank matrix plus a sparse matrix [11]. Because the recovery of the low-rank matrix is computationally intensive, an efficient algorithm called Robust Orthogonal Subspace Learning (ROSL) was proposed in [12]. The algorithm was altered in [2] to be applied to an incomplete data matrix for estimating the missing values. We use the acronym ROSL for the version of the algorithm from [2].
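The masked objective above is easy to evaluate for a candidate basis; the sketch below does exactly that (the rank-one updates of U that GROUSE actually performs are omitted, and the function name is ours).

```python
import numpy as np

def grouse_objective(Z, mask, U):
    """Evaluate sum_j || Delta_j (Z_:j - U U^T Z_:j) ||_2^2 for a given
    orthonormal basis U; Delta_j keeps only the observed rows of column j."""
    total = 0.0
    for j in range(Z.shape[1]):
        # project column j on span(U), then keep only observed residual rows
        r = mask[:, j] * (Z[:, j] - U @ (U.T @ Z[:, j]))
        total += float(r @ r)
    return total
```

When the columns of Z lie in the span of U, the objective vanishes regardless of which entries are masked.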
The imputation method from [13] relies on the nonnegative matrix factorization (NMF) technique and, to be suitable for electricity consumption, uses temporal aggregates. The optimization problem solved by the algorithm proposed in [13] takes into consideration the correlation between time series. As in [2], we call this method temporal NMF (TeNMF). The matrix factorization is also used in [14], but in contrast to TeNMF, the entries of the two factor matrices are not constrained to be nonnegative. An important feature of the method is that the regularization term of the objective function takes explicitly into consideration the temporal structures, and this is why the method is termed Temporal Regularized Matrix Factorization (TRMF).
Another matrix factorization that can be instrumental in time series imputation is the one generated by dictionary learning (DL).

Organization of the Paper and the Main Contributions
In this article, we extend the DL-based solution for time series imputation which we proposed in our earlier work [15]. Our previous results from [15] consist of two imputation methods: DLU (for univariate time series) and DLM (for multivariate time series). However, because of the computational complexity, the original version of DLM can be utilized only when the number of the time series involved is very small. In contrast, the variants of DLM that we introduce in this study can be applied to data sets that contain tens of time series.
DLM is presented in Section 2.2 after a brief discussion of the DL optimization problem in Section 2.1. An important characteristic of DLM is that it solves the optimization problem by minimizing the Frobenius norm of the errors. In many practical situations, the imputation should be performed by minimizing the ℓ₁-norm of the errors and not the sum of the squared errors. In Section 3, we demonstrate how the optimization problem can be solved when the Frobenius norm is replaced with the sum of the magnitudes of the errors. The method that involves the ℓ₁-norm is dubbed DLM₁. Another characteristic of DLM (which is also inherited by DLM₁) is the use of a structured dictionary. In Section 4, we present the expressions of the IT criteria that are employed to select the size of the structured dictionary, the size of each of its blocks, as well as the sparsity level of the representation. Section 5 is focused on the techniques that we propose for dimensionality reduction; they allow us to apply DLM and DLM₁ to data sets that comprise tens of time series with thousands of measurements. For demonstrating how the new algorithms can be used in practice, we conduct a large set of experiments with real-life data. The experimental settings are presented in Section 6, and the empirical results are discussed in Section 7. Section 8 concludes the paper.
Hence, the main contributions of this work are the following:
• A flexible approach that allows the user to choose the norm of the errors (Frobenius norm or ℓ₁-norm) minimized in the optimization problem.
• An automatic method for selecting the sparsity level as well as the size of each block of the dictionary.
• The exemplification of two techniques for dimensionality reduction that enable DLM to impute values in multivariate time series for which K is large.
• An extensive empirical study which compares DLM with nine other imputation methods on data sets with various characteristics. On many of these data sets, DLM has the best performance among the considered methods when the missing data are simulated by sampling without replacement.

Preliminaries
The DL problem is formulated as follows. Given N signals of length m that are grouped into the matrix Y ∈ R^{m×N}, we approximate Y by the product DX, where D ∈ R^{m×n} is the dictionary, and its columns are usually named atoms. The Euclidean norm of each atom equals one. The matrix X ∈ R^{n×N} is sparse in the sense that each of its columns contains at most s non-zero entries, the parameter s being named the sparsity level. We emphasize that both D and X are learned from the signals by solving the following optimization problem [16]:

min_{D,X} ‖Y − DX‖²_F subject to ‖X_{:ℓ}‖₀ ≤ s, ℓ ∈ {1, . . . , N}, ‖D_{:j}‖₂ = 1, j ∈ {1, . . . , n},    (1)

where the ℓ-th column of X is denoted X_{:ℓ}, and the j-th column of D is denoted D_{:j}. The symbol ‖·‖₀ represents the number of non-zero entries of the vector in the argument. The algorithm that solves the optimization problem in (1) is initialized with a dictionary D, which is generally randomly generated. The user selects the number of iterations, and the following steps are executed at each iteration: (i) the current dictionary D is used to find the matrix X, which provides a representation for the signals in Y; this goal is achieved by employing Orthogonal Matching Pursuit (OMP); (ii) the dictionary D is updated by using the current sparse representation X; this is performed with the Approximate K-Singular Value Decomposition (AK-SVD) algorithm.
The two steps of the main algorithm are presented in [17].
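To make the two-step loop concrete, here is a compact NumPy sketch of OMP followed by an AK-SVD-style atom update. It is a didactic sketch under our own naming, not the implementation from [17].

```python
import numpy as np

def omp(D, y, s):
    """Orthogonal Matching Pursuit: greedy s-sparse representation of y."""
    residual, support = y.astype(float).copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(s):
        j = int(np.argmax(np.abs(D.T @ residual)))     # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def dictionary_learning(Y, n, s, iters=20, seed=0):
    """Sketch of the two-step DL loop: sparse coding + AK-SVD-style update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n))
    D /= np.linalg.norm(D, axis=0)                     # unit-norm atoms
    X = np.zeros((n, Y.shape[1]))
    for _ in range(iters):
        X = np.column_stack([omp(D, Y[:, i], s) for i in range(Y.shape[1])])
        for j in range(n):                             # atom-by-atom update
            I = np.flatnonzero(X[j, :])
            if I.size == 0:
                continue
            # error without atom j, restricted to the signals that use it
            E = Y[:, I] - D @ X[:, I] + np.outer(D[:, j], X[j, I])
            d = E @ X[j, I]
            nrm = np.linalg.norm(d)
            if nrm > 1e-12:
                D[:, j] = d / nrm
            X[j, I] = D[:, j] @ E                      # refresh coefficients
    return D, X
```

The atom update alternates one least-squares step for the atom and one for its coefficients, which is the approximation AK-SVD makes instead of computing an exact rank-one SVD.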
There are other ways of posing the DL problem. For example, one may add a sparsity-enhancing term to the objective and thus impose sparsity globally instead of on each representation; one then obtains a matrix X that has around sN nonzeros without the explicit constraint that each of its columns has s nonzeros. A representative of this approach is [18]. Convolutional DL [19] does not split the time series into signals of size m, but works with a single long signal that is approximated as a linear combination of atoms of length m that may be placed at any position; the same atom can be used repeatedly. These approaches may provide more flexibility, but they require more fine-tuning of the parameters and an adaptation of the dictionary size criteria. We prefer AK-SVD because it is one of the fastest DL algorithms proposed in the literature. Another important feature of the algorithm is its conceptual simplicity, which allows it to be easily modified for solving the particular formulations of DL that appear in the context of imputation.

Optimization Problem for Incomplete Data: Formulation, Solution and Applications
When some of the entries of the matrix Y are missing, the optimization problem in (1) becomes [20]:

min_{D,X} ‖M ⊙ (Y − DX)‖²_F subject to ‖X_{:ℓ}‖₀ ≤ s, ℓ ∈ {1, . . . , N}, ‖D_{:j}‖₂ = 1, j ∈ {1, . . . , n},    (2)

where M is a mask matrix with the same size as Y. Its entries are equal to zero at the positions in Y that correspond to the missing data. All other entries of M are equal to one. The operator ⊙ is the element-wise product. The role of M is to guarantee that only the available data are used in learning. Note that the missing data are replaced with zeros in Y. For the optimization problem in (2), a specialized version of the AK-SVD algorithm is applied [16] [Section 5.9]: the representation matrix X is found using OMP by ignoring the missing samples and working only with the present ones, and the atom update formulas are simple adaptations of the AK-SVD rules to the incomplete data case.
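The sparse coding step with a mask can be sketched as follows: the rows of y and D corresponding to missing samples are simply dropped before running OMP (names are ours; a minimal sketch of the idea, not the implementation from [16]).

```python
import numpy as np

def omp_masked(D, y, m, s):
    """OMP for a signal with missing entries: rows where the mask m is 0
    are dropped, and the representation is fitted on the observed rows."""
    obs = np.flatnonzero(m)
    D_o, y_o = D[obs, :], y[obs]
    support, x = [], np.zeros(D.shape[1])
    residual = y_o.astype(float).copy()
    for _ in range(s):
        j = int(np.argmax(np.abs(D_o.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D_o[:, support], y_o, rcond=None)
        residual = y_o - D_o[:, support] @ coef
    x[support] = coef
    return x            # D @ x then provides estimates for the missing rows
```

Evaluating D @ x with the full dictionary afterwards extends the fitted representation to the rows that were masked out, which is exactly how the imputed values are produced.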
An important application of (2) consists of filling the gaps of an incomplete image and is called image inpainting. In the case of this application, the matrix Y is generated as follows (see, for example, [16] [Section 2.3.1]). A patch of pixels of size √m × √m is selected from a random location in the image. Then its columns are stacked to generate the signal y, which becomes a column of Y. The procedure continues until all N signals are produced. Obviously, it is not allowed to select the same patch twice, but it is highly recommended to select patches that overlap. Regarding the sizes of the patches, the value √m = 8 is often used. Values such as √m = 12 and √m = 16 have also been used, but they are less common because of the increased computational burden. Once the dictionary is learned, the product DX yields an estimate Ŷ of Y. Any missing pixel is obtained by averaging its values from all entries of Ŷ where it appears. More details about image inpainting can be found in [21][22][23][24]. The use of inpainting was extended from images to audio signals in [20,25].
Our main goal is to show how DL can be employed for estimating the values of the missing data in multivariate time series. As we have already pointed out, the use of DL in the imputation of time series was discussed in [15]. The approach adopted in the multivariate case should take into consideration the dynamic interrelationships between the K > 1 variables whose measurements collected at time points 1, . . . , T are stored in the matrix Z. Suppose that some of the entries of Z are missing. As the positions of the missing data are not necessarily the same for all the time series, we use the symbol Ψ_k to denote the set of indexes of the measurements that are available for the k-th time series, where 1 ≤ k ≤ K. Obviously, the set of indexes of the missing data for the k-th time series is Ψ̄_k = {1, . . . , T} \ Ψ_k. Let z be one of the columns of the data matrix Z in which the missing data indexed by Ψ̄ are replaced with zeros. We define a matrix Y as follows:

Y = [ z_{1:m}  z_{1+h:m+h}  · · ·  z_{1+qh:m+qh} ].
For an arbitrary vector v, v_{a:b} denotes the entries of the vector whose indexes belong to the set {a, a + 1, . . . , b}, where a < b. The number of rows of the matrix Y is m, and its choice depends on the sampling period. The parameter h is called the signal shift and controls the overlapping between the columns of Y, and the value of q is given by ⌊(T − m)/h⌋. It follows that the number of columns of the matrix Y is N = q + 1. Herein we take h = 1, which leads to q = T − m and N = T − m + 1.
If t ∈ Ψ̄, then z_t is a missing value, and it is represented by a zero-entry in Y. As there is an overlap between the columns of Y, the missing value z_t leads to several zero-entries in Y. We collect the values at these positions from Ŷ = DX and compute an estimate for z_t by averaging them.
For example, suppose 4 ∈ Ψ̄, which means that z₄ is missing. With h = 1 (and assuming m ≥ 4 and N ≥ 4, where N = T − m + 1), the value z₄ occupies the positions (4, 1), (3, 2), (2, 3) and (1, 4) in Y, and each of these entries of Y equals zero. The estimate of the missing value z₄ is computed by averaging the entries of Ŷ = DX whose positions are the same as the positions of z₄ in Y:

ẑ₄ = (Ŷ_{4,1} + Ŷ_{3,2} + Ŷ_{2,3} + Ŷ_{1,4}) / 4.

This imputation method was introduced in [15]. As it can be easily seen from the description above, the method is suitable for univariate time series, and for this reason it is named DLU (see again Section 1.2). In the multivariate case, the data matrix is

Y = [ Y₁ Y₂ · · · Y_K ],    (3)

where Y_i ∈ R^{m×(T−m+1)} is made of the data measured for the i-th time series, for i ∈ {1, . . . , K}.
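The construction of Y from a single series and the averaging step of DLU can be sketched as follows (hypothetical names; t is 1-based to match the notation of the text).

```python
import numpy as np

def series_to_patches(z, m, h=1):
    """Build Y whose columns are the overlapping windows z_{1+ih : m+ih}."""
    T = len(z)
    q = (T - m) // h
    return np.column_stack([z[i * h : i * h + m] for i in range(q + 1)])

def impute_from_patches(Y_hat, t, m, T, h=1):
    """Average the entries of Y_hat that correspond to position t (1-based)
    of the original series, as in the DLU averaging step."""
    vals = []
    q = (T - m) // h
    for col in range(q + 1):
        row = (t - 1) - col * h           # 0-based row index within column
        if 0 <= row < m:
            vals.append(Y_hat[row, col])
    return float(np.mean(vals))
```

When Ŷ = Y (no approximation error), the average reproduces the original sample exactly, which is a convenient sanity check.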
In [15], it was pointed out that in the DLM algorithm designed for the multivariate case, a structured dictionary should be used:

D = [ D₁ D₂ · · · D_K D_{K+1} ],    (4)

where D₁, . . . , D_K ∈ R^{m×n_d} and D_{K+1} ∈ R^{m×n_{K+1}}. The dictionary D_i is dedicated to the representation of the i-th time series, while D_{K+1} is common to all time series. It follows that the number of atoms used in the representation of each time series is n_u = n_d + n_{K+1}. The values of n_u and n_{K+1}, as well as the sparsity level s, are selected by using information theoretic (IT) criteria [26]. The main advantage is that the procedure for choosing the triple (n_u, n_{K+1}, s) does not rely on prior knowledge. We take the sizes of the dictionaries D₁, . . . , D_K to be equal in order to simplify the decision process, but also to provide similar representation power for all time series (or groups of time series, as we will see later). We mention that there are many DL algorithms that choose the size of the dictionary. Most of them are based on heuristics, such as growing a small dictionary [27] or removing atoms from a large one [28], with the general purpose of achieving parsimony. Other approaches are more principled, using Bayesian learning [29] or an Indian Buffet Process [30]. We have used IT criteria in [26], where we also presented a more detailed view of the topic, including bibliographic references. IT criteria offer a sound evaluation of the trade-off between dictionary size and representation error.
Next we show how the new algorithm DLM₁ can be devised.
We modify the algorithm proposed in [31] such that it is suitable for the missing data case. The algorithm is an adaptation of the AK-SVD [17] idea and consists of iterations containing the usual two steps, sparse representation and dictionary update, as described in Section 2.1. We will next discuss these steps in detail.

ℓ₁-Norm OMP with Missing Data
We present an ℓ₁-norm version of the greedy approach whose most prominent representative is OMP [32]. It is enough to consider a single signal y ∈ R^m, for which we have to minimize ‖m ⊙ (y − Dx)‖₁, where D ∈ R^{m×n} is the given dictionary, x ∈ R^n must have at most s nonzero elements, and m ∈ R^m is the mask whose entries are zeros and ones.
We denote ỹ ∈ R^µ (µ ≤ m) the vector that results from y by keeping only the elements that correspond to nonzero values in m. Similarly, D̃ ∈ R^{µ×n} is the matrix obtained from D by keeping only the rows corresponding to nonzero values in m. We are thus left with the problem min_x ‖ỹ − D̃x‖₁, which is a usual ℓ₁-norm sparse representation problem for which an algorithm was described in [31].
For the sake of completeness, we revisit the main operations here. The algorithm has s steps. Denoting x the representation at the beginning of the current step and r = ỹ − D̃x the current residual, the next selected atom d_j is the one for which

min_{j ∈ {1, . . . , n}} min_ξ ‖r − ξ d_j‖₁    (6)

is attained. Thus, we follow the idea of finding the atom with the best projection on the residual. The problem min_ξ ‖r − ξ d‖₁ (we lighten the notation for the sake of simplicity) can be easily solved. It is not only convex, but its solution can be found by inspection [33]. Denote c_i = r_i / d_i, for i ∈ {1, . . . , µ}. Denote c̃ the vector containing the elements of c sorted increasingly and π(·) the permutation for which c̃_i = c_{π(i)}. Denote d̃_i = d_{π(i)}. The desired minimum is attained at ξ = c_{π(k)}, where the index k is the largest for which

Σ_{i=1}^{k−1} |d̃_i| < (1/2) Σ_{i=1}^{µ} |d̃_i|;

that is, ξ is a weighted median of the values c_i with weights |d_i|. So, finding the solution essentially requires only a sort operation. Moreover, in solving (6), some atoms can be ignored if their scalar product (usual orthogonal projection) with the residual is small.
Once the current atom has been found, it is added to the support, and the optimal ℓ₁-norm representation with that support is computed. This is a convex problem and can be solved by several nonlinear optimization algorithms (we have used a few coordinate descent iterations). Moreover, a good initialization is available in the representation from the previous step. (In OMP, these operations correspond to finding a least-squares solution.)
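The scalar ℓ₁ line search used when selecting atoms (and again in the updates below) can be sketched via the weighted-median rule; the function name is ours, a minimal sketch of the inspection-based solution.

```python
import numpy as np

def l1_line_search(r, d):
    """Solve min_xi ||r - xi*d||_1. The optimum is a weighted median of
    c_i = r_i/d_i with weights |d_i|; entries with d_i = 0 only add a
    constant to the objective and are ignored."""
    nz = np.abs(d) > 1e-12
    c = r[nz] / d[nz]
    w = np.abs(d[nz])
    order = np.argsort(c)                       # the single required sort
    c_sorted, w_sorted = c[order], w[order]
    cum = np.cumsum(w_sorted)
    k = int(np.searchsorted(cum, 0.5 * cum[-1]))  # first index past half mass
    return float(c_sorted[k])
```

With unit weights this reduces to the ordinary median of the ratios, as expected.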

Dictionary Update with Missing Data in the ℓ₁-Norm
The update stage of AK-SVD optimizes the atoms one by one, also updating the coefficients of the corresponding representations. Denote d_j the current atom and I_j the set of signals where this atom contributes to the representation. Denote E = Y − DX the current residual and

F = E_{:,I_j} + d_j X_{j,I_j}

the error without the contribution of d_j, keeping only the columns where d_j appears in the representation.
With lighter notation, namely d for the current atom, x for the vector of its nonzero representation coefficients, F for the error without the contribution of d (restricted to the columns in I_j) and M for the mask (even though the signals where d is not used are removed), the atom update problem becomes

min_{d,x} ‖M ⊙ (F − d x⊤)‖₁.    (7)

Denoting M ⊂ N² the set of indexes of the available data, the problem can be written as

min_{d,x} Σ_{(i,ℓ)∈M} |f_{iℓ} − d_i x_ℓ|.    (8)

For fixed x, the minimization can be performed on each d_i separately and has the form min_{d_i} ‖r − d_i x‖₁, where r and x are vectors that can be easily built. We thus end up with a problem similar to the one described after (6).
Keeping the updated atom fixed, we can now optimize the associated representation coefficients by solving (7), in the form (8), with d fixed. The problem is separable in each x_ℓ and can be solved as above.
We note that the same basic ℓ₁ line-search problem appears in the ℓ₁-norm OMP and in the atom and representation updates. The approach can be extended to other ℓ_p-norms, transforming the ℓ_p-norm AK-SVD from [31] to the missing data case, similarly to the transformations described in this section.

Information Theoretic Criteria
We have already mentioned in Section 2.2 that we employ IT criteria for selecting the triple (n_u, n_{K+1}, s). More precisely, the criteria that we use are derived from the well-known Bayesian information criterion (BIC) [34]. For evaluating the complexity of the model, we need to calculate the number of parameters (NoP). We have that NoP = sN + (m − 1)n. The first term is given by the number of non-zero entries of the representation matrix X, which is estimated from the available data by solving the optimization problem (2). The second term is equal to the number of entries of the estimated dictionary D; for each column of D, we count m − 1 entries (and not m entries) because each column is constrained to have Euclidean norm equal to one. Furthermore, we define the matrix of residuals U = M ⊙ (Y − DX). Note that the residuals located at the positions corresponding to the missing data are forced to be zero. With the understanding that η is the number of entries of M that are equal to one, the expression of the first IT criterion that we employ is:

BIC(Y; n_u, n_{K+1}, s) = η ln( ‖U‖²_F / η ) + NoP ln η,    (9)

where ln(·) denotes the natural logarithm. This criterion was proposed in [15], where its "extended" variant was also used (see [26,35]):

EBIC(Y; n_u, n_{K+1}, s) = BIC(Y; n_u, n_{K+1}, s) + N ln C(n, s),    (10)

where C(n, s) denotes the binomial coefficient "n choose s".
However, because of the way in which they have been derived, the formulas in (9) and (10) can only be used when D and X are outputs of the DLM algorithm. When the estimation is performed by applying the DLM₁ algorithm from Section 3, we employ the following formula for BIC:

BIC₁(Y; n_u, n_{K+1}, s) = 2η ln( ‖U‖₁ / η ) + NoP ln η,    (11)

where ‖U‖₁ denotes the sum of the magnitudes of the entries of U. The expression above is based on the criterion obtained in [31] for signals in additive Laplacian noise and is altered to be suitable for the missing data case. Its "extended" variant is easily constructed by adding the term N ln C(n, s) to the expression above, where C(n, s) is the binomial coefficient (see again [31]). For clarifying the notation, we mention that when we write DLM+BIC, it means that the criterion in (9) is applied for model selection. At the same time, DLM₁+BIC involves the criterion from (11). A similar convention is used for the "extended" criteria.
Whenever DLM is applied to multivariate time series with missing values, 10 random initializations of the dictionary are considered. For each initialization, 50 iterations of the two-step algorithm that involves OMP and AK-SVD for incomplete data are executed for each possible combination of n_u, n_{K+1} and s. Then the triple (n_u, n_{K+1}, s) that minimizes BIC/EBIC is selected. The procedure is the same when DLM₁ is used instead of DLM.
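A BIC value of this kind is cheap to evaluate once the masked residuals are available. The sketch below illustrates the computation for the squared-error case; the precise form used here, η ln(‖U‖²_F/η) + NoP · ln η, is our assumption for illustration, and all names are ours.

```python
import numpy as np

def bic_gaussian(Y, M, D, X, s):
    """Sketch of a BIC computation for the squared-error (DLM-style) case.

    Y, M : data and 0/1 mask, both (m, N)
    D, X : estimated dictionary (m, n) and sparse representations (n, N)
    s    : sparsity level
    """
    m, n = D.shape
    N = Y.shape[1]
    U = M * (Y - D @ X)                 # masked residuals (zero where missing)
    eta = int(M.sum())                  # number of available samples
    nop = s * N + (m - 1) * n           # number of parameters (NoP)
    return eta * np.log((U ** 2).sum() / eta) + nop * np.log(eta)
```

In a model-selection loop, this value would be computed for each candidate triple (n_u, n_{K+1}, s) and the minimizer retained.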

Dimensionality Reduction via Clustering
When K is large, we group the time series to reduce the dimensionality. For exemplification, we refer to the particular case in which K is an even number, and the time series are clustered into two groups such that each group contains K/2 columns of Z. The set of the indexes of the time series that belong to the first group is Φ = {φ₁, . . . , φ_{K/2}}, whereas the set of the indexes corresponding to the second group is Φ̄ = {φ̄₁, . . . , φ̄_{K/2}}. It is evident that Φ ∪ Φ̄ = {1, . . . , K} and Φ ∩ Φ̄ = ∅. Furthermore, we re-arrange the columns of the matrix Z ∈ R^{T×K} into the matrix Z^(c) as follows: the columns indexed by Φ are stacked vertically into the first column of Z^(c), and the columns indexed by Φ̄ are stacked vertically into its second column. The newly obtained matrix Z^(c) is regarded as a multivariate time series that contains K^(c) = 2 time series observed at time points 1, . . . , T^(c), where T^(c) = (K/2) × T. Hence, the data matrix Y used by the DL algorithm (see (3)) has the expression Y = [Y₁ Y₂], where Y_i is constructed from the entries of the i-th column of Z^(c) as in Section 2.2. According to the convention from (4), the structure of the dictionary D is given by D = [D₁ D₂ D₃]. It follows that the block D₁ is for the time series in group Φ, the block D₂ is for the time series in group Φ̄, and the block D₃ is common for all the time series. The estimation of the missing values is performed as described in the previous sections.
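The re-arrangement of Z into Z^(c) for the two-group case amounts to column stacking, as the following sketch shows (function and variable names are ours).

```python
import numpy as np

def regroup(Z, Phi, Phi_bar):
    """Stack the columns of each group into one long column, producing
    Z^(c) with K^(c) = 2 columns of length (K/2)*T."""
    col1 = np.concatenate([Z[:, k] for k in Phi])      # group Phi
    col2 = np.concatenate([Z[:, k] for k in Phi_bar])  # group Phi-bar
    return np.column_stack([col1, col2])
```

Each long column is then windowed as in Section 2.2 to build Y₁ and Y₂.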

Time Series Grouping
When deciding what time series to assign to the group Φ, we should take into consideration that all these time series are represented by using atoms from the same blocks: D 1 and D 3 . Hence, it is desirable (i) to minimize the overlap of the sequences of missing data for the columns of Z that belong to Φ, and (ii) to maximize the linear dependence between any two time series in Φ. Because it is non-trivial to combine the two requirements, we first focused on the condition (i). After some preliminary experiments, we came to the conclusion that the approach does not lead to good results. Then we investigated more carefully the condition (ii). The result of this investigation is the heuristic for cluster selection that we present below and which is based on the evaluation of the absolute value of the Pearson correlation between the pairs of columns of the matrix Z.
For ease of exposition, we introduce the following notation: if W ∈ R^{p×p} is symmetric, we write σ(W) = Σ_{1≤i<j≤p} |w_{ij}|, where |w_{ij}| is the magnitude of the entry of W located at the intersection of the i-th row and the j-th column. Let C ∈ R^{K×K} be the matrix of the pairwise correlations between the columns of Z. For a subset Φ^(g) ⊂ {1, . . . , K} whose cardinality equals K/2, we take C^(g) to be the block of C that corresponds to the rows and the columns indexed by Φ^(g). Then we select Φ as follows:

Φ = argmin_{Φ^(g)} | 2 σ(C^(g)) − σ(C) |.

We note that the cardinality of Φ̄ is also K/2. The formula implies that we find Φ with the property that the sum of the absolute pairwise correlations of the time series in cluster Φ is as close as possible to the sum of the absolute pairwise correlations of the time series in cluster Φ̄ plus the sum of the absolute correlations of the pairs that contain a time series from Φ and a time series from Φ̄. Remark that, in the particular case when the correlations for all the pairs that contain a time series from Φ and a time series from Φ̄ are zero, the sum of the absolute pairwise correlations of the time series in cluster Φ is approximately equal to the sum of the absolute pairwise correlations of the time series in cluster Φ̄. This approach has two limitations: (i) it can be applied only when the number of groups is two, and (ii) the computational burden is too high when K is large. An alternative solution is a greedy algorithm which can be employed when the number of groups, K^(c), is greater than two. For simplicity, we assume that K is an integer multiple of K^(c). The algorithm constructs the groups as follows. Initially, the two time series that have the largest absolute correlation are included in the first group Φ₁. In other words, we take (φ₁, φ₂) = argmax_{1≤i<j≤K} |c_{ij}|, and then Φ₁ = {φ₁, φ₂}.
At the next step, the time series that increases the sum of the absolute correlations within the group the most is included in Φ₁:

φ₃ = argmax_{j ∉ Φ₁} Σ_{i ∈ Φ₁} |c_{ij}|.

In general, after it was decided that Φ₁ = {φ₁, . . . , φ_r}, where 2 ≤ r < K/K^(c), the (r + 1)-th time series is selected as follows:

φ_{r+1} = argmax_{j ∉ Φ₁} Σ_{i ∈ Φ₁} |c_{ij}|.

Once the first cluster is built, the second one is initialized with the two time series from {1, . . . , K} \ Φ₁ that have the largest absolute correlation, and the steps described above are applied for obtaining the second cluster. The procedure continues until K^(c) − 1 groups are produced. The last group results automatically.
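The greedy grouping procedure can be sketched as follows (all names are ours; a minimal sketch of the steps just described).

```python
import numpy as np

def greedy_groups(C, n_groups):
    """Greedy clustering on the correlation matrix C (K x K): each group is
    seeded with the most correlated remaining pair, then grown by adding the
    series with the largest sum of absolute correlations to the current
    members; the last group collects the leftovers."""
    K = C.shape[0]
    size = K // n_groups                  # assumes K is a multiple of n_groups
    A = np.abs(C).astype(float).copy()
    np.fill_diagonal(A, 0.0)
    remaining = set(range(K))
    groups = []
    for _ in range(n_groups - 1):
        idx = sorted(remaining)
        sub = A[np.ix_(idx, idx)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)   # seed pair
        group = [idx[i], idx[j]]
        while len(group) < size:
            cand = [k for k in idx if k not in group]
            gains = [A[k, group].sum() for k in cand]        # growth rule
            group.append(cand[int(np.argmax(gains))])
        groups.append(sorted(group))
        remaining -= set(group)
    groups.append(sorted(remaining))      # last group results automatically
    return groups
```

On a correlation matrix with a clear block structure, the procedure recovers the blocks.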

Simulation of the Missing Data
When we conduct experiments on a real-life multivariate time series Z ∈ R^{T×K}, we randomly select the positions of the missing data. The selection is performed such that the number of missing data, M_miss, is the same for each of the K time series. Hence, all the sets Ψ₁, . . . , Ψ_K have the same cardinality. It follows that the percentage of missing data is ρ = (M_miss / T) × 100% for each time series. In our experiments, we consider ρ = 5%, ρ = 10%, ρ = 15% and ρ = 20%. The indexes of the missing data for a particular time series are independent of the positions of the missing data in the other time series from the same data set. They are selected either by sampling without replacement M_miss integers from the set {1, . . . , T} or by using the Polya urn model (with finite memory), which was introduced in [36]. The Polya urn model is well-known, and it has been employed in various applications, such as the modeling of communication channels [36][37][38][39], image segmentation [40] and the modeling of epidemics on networks [41]. In [15], we proposed the use of the Polya urn model for simulating the missing data in time series.
In the Polya model, the urn initially contains R red balls and S black balls (R < S). At each time moment t ≥ 1, a ball is drawn from the urn and, after each draw, (1 + ∆) balls of the same color as the drawn ball are returned to the urn. We take ∆ > 0. More details about the selection of ∆ are provided below after the presentation of the most important properties of the model. Since we want the model to have finite memory, the experiment is performed as described above only for 1 ≤ t ≤ M, where the parameter M is a positive integer (see the discussion in [36]). At each time moment t > M, a ball is drawn from the urn and, after each draw, two operations are executed: (i) (1 + ∆) balls of the same color as the drawn ball are returned to the urn and (ii) ∆ balls of the same color as the ball picked at time t − M are removed from the urn.
A sequence of random variables {Ξ t } 1≤t≤T is defined as follows: Ξ t = 1 if the ball drawn at time t is red and Ξ t = 0 if the ball drawn at time t is black. It was proven in [36,39] that the sequence {Ξ t } is a Markov process of order M. For t > M, let S t denote the state (Ξ t−M , . . . , Ξ t−1 ). The Polya urn model has the remarkable property that the probability of having Ξ t = 1 after the state S t was observed depends on the number of ones in (Ξ t−M , . . . , Ξ t−1 ), but not on their locations. We mention that M = 5 in our settings.
The indexes of the missing data correspond to the positions of ones in the sequence {Ξ t } 1≤t≤T . It is known that P(Ξ t = 1) = R/(R + S), where the symbol P(·) denotes probability.
In our simulations, R and S are chosen such that P(Ξ t = 1) = M miss /T. According to [39], the conditional probability of drawing a red ball at time t > M is P(Ξ t = 1 | S t ) = (R + ∆N t )/(R + S + ∆M), where N t is the number of ones in the state S t . Hence, the probability of a missing value increases with the number of missing values among the last M positions. This property allows us to simulate bursts of missing data by taking δ = 1, where δ = ∆/(R + S). Obviously, this is different from the situation when sampling without replacement is applied, in which case it is more likely to have isolated missing data. At the same time, the simulation of the missing data by using the Polya urn model is different from the approach in [2], where blocks of missing data are considered.
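Assuming the standard finite-memory Polya contagion form, in which the probability of drawing a red ball depends only on the number of red draws among the last M, the sampler can be implemented directly from the conditional probabilities, without tracking the urn contents explicitly. This is a minimal sketch; the function name and parameterization are ours.

```python
import random

def polya_missing_mask(T, R, S, delta, M=5, seed=0):
    """Positions of missing data from a finite-memory Polya urn:
    1 = missing (red ball drawn), 0 = observed (black ball drawn)."""
    rng = random.Random(seed)
    draws = []
    for t in range(T):
        window = draws[-M:]                 # at most the last M draws matter
        n_ones = sum(window)
        # conditional probability of red; during the first M draws the urn
        # only grows, afterwards the memory is fixed at M
        p_red = (R + delta * n_ones) / (R + S + delta * len(window))
        draws.append(1 if rng.random() < p_red else 0)
    return draws
```

To match the settings above, R and S would be chosen so that R/(R + S) = M miss /T, and bursts are obtained by taking delta = R + S (i.e., δ = 1); with delta = 0, the sampler degenerates to independent Bernoulli draws.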

Data Pre-Processing
After the missing values are simulated, each time series is decomposed into trend, seasonal component and remainder. Then the DLM imputation method is applied to the T × K matrix of the remainder components. For each time series, both the trend and the seasonal components are added to the estimates produced by DLM in order to obtain the estimates for the missing data.
The decomposition uses the implementation from the R package imputeTS [42,43], which is available at https://github.com/SteffenMoritz/imputeTS/blob/master/R/na_seadec.R (accessed on 28 February 2022). The implementation returns a specific output when it cannot detect a seasonal pattern in the analyzed time series. From the package imputeTS, we use only the decomposition technique and not the imputation methods, because all its imputation methods are designed for univariate time series; thus, they are sub-optimal in the multivariate case.
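The pre-processing step can be illustrated with a simplified additive decomposition. This is not the imputeTS implementation: it is a minimal NumPy stand-in (function name ours) that estimates the trend with a centered moving average and the seasonal component with per-position means, ignoring the NaNs that mark simulated missing data.

```python
import numpy as np

def decompose_additive(z, period):
    """Split a series into trend + seasonal + remainder (additive model).
    Simplified stand-in for the decomposition used by imputeTS."""
    z = np.asarray(z, dtype=float)
    T = len(z)
    # centered moving average as the trend estimate (undefined at the edges)
    trend = np.full(T, np.nan)
    half = period // 2
    for t in range(half, T - half):
        trend[t] = np.nanmean(z[t - half:t + half + 1])
    detrended = z - trend
    # seasonal component: mean of the detrended values per seasonal position
    seasonal = np.array([np.nanmean(detrended[p::period]) for p in range(period)])
    seasonal -= np.nanmean(seasonal)            # center the seasonal pattern
    seasonal_full = np.tile(seasonal, T // period + 1)[:T]
    remainder = z - trend - seasonal_full
    return trend, seasonal_full, remainder
```

By construction, trend + seasonal + remainder reproduces the original series wherever the trend is defined, which is the property the imputation pipeline relies on when it adds the trend and seasonal components back to the DLM estimates.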

Performance Evaluation
Let z be a column of the data matrix Z ∈ R T×K . We collect in the vector z Ψ the entries of the time series z that are indexed by the elements of the set Ψ, and we denote by ẑ Ψ the vector of estimates produced by DLM for the missing values of z. We calculate the following normalized error:

‖z Ψ − ẑ Ψ ‖ 2 /‖z Ψ ‖ 2 . (14)

The normalized errors are computed similarly for the estimates yielded by the imputation methods from [2]. To rank the methods, we calculate scores as follows. For each time series z, the imputation method that achieves the minimum normalized error receives two points, the method that leads to the second smallest normalized error receives one point, and all the other methods receive zero points. The points accumulated by each method over all the time series in Z are divided by 2K to ensure that the scores take values in the interval [0, 1]. When the imputation is performed by using DLM 1 , the expression in (14) is replaced with

‖z Ψ − ẑ Ψ ‖ 1 /‖z Ψ ‖ 1 , (15)

and the scores are calculated as explained above.
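The normalized errors and the scoring rule described above can be written compactly as follows (function names are ours):

```python
import numpy as np

def normalized_error(z_missing, z_hat, p=2):
    """||z_Psi - z_hat_Psi||_p / ||z_Psi||_p on the missing positions
    (p = 2 for DLM, p = 1 for DLM_1)."""
    z_missing = np.asarray(z_missing, float)
    z_hat = np.asarray(z_hat, float)
    return np.linalg.norm(z_missing - z_hat, p) / np.linalg.norm(z_missing, p)

def scores(errors):
    """errors[k, m]: normalized error of method m on time series k.
    The best method in each row receives two points, the runner-up one point;
    totals are divided by 2K so that all scores lie in [0, 1]."""
    errors = np.asarray(errors, float)
    K = errors.shape[0]
    points = np.zeros(errors.shape[1])
    for row in errors:
        order = np.argsort(row)         # ascending: smallest error first
        points[order[0]] += 2
        points[order[1]] += 1
    return points / (2 * K)
```

A method that is the best on every time series thus attains the maximum score of one, which is why scores close to one in the figures below indicate near-uniform dominance.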
In the empirical comparison of the methods, we have used the code available at https: //github.com/eXascaleInfolab/bench-vldb20.git (accessed on 3 October 2021) for the imputation methods that have been assessed in [2]. Short descriptions of these methods have been given in Section 1.1.
In the next section, we present the results obtained by DLM on five data sets that have been also used in [2]. For the sake of conciseness, we report the scores for DLM 1 only for three data sets. The experimental results can be reproduced by using the Matlab code available at https://www.stat.auckland.ac.nz/%7Ecgiu216/PUBLICATIONS.htm (accessed on 17 June 2022).

Climate Time Series (K = 10, T = 5000)
The data set comprises monthly climate measurements that have been recorded at various locations in North America from 1990 to 2002. We do not transform the time series with the method from Section 6.2 because it does not improve the quality of the imputation. As K is relatively small, we cluster the time series into K (c) = 2 groups that are found by using (12): Φ = {5, 6, 7, 8, 10} and Φ̄ = {1, 2, 3, 4, 9}. It is interesting that the rule in (12) yields the same grouping for all percentages of missing data, both for sampling without replacement and for the Polya model.
Since the data are sampled monthly, it is natural to take the signal length m = 12. We have n u ∈ {5 × 2m, 5 × 3m, 5 × 4m}, and for each value of n u , we take n 3 ∈ {5 × m, 5 × 2m, . . . , n u − 5 × m}. Observe that n 3 = n K (c) +1 , and it denotes the size of the block of the structured dictionary which contains atoms that are common for all time series. The sparsity level s is selected from the set {2, 3, 4}. It follows that the total number of triples (n u , n 3 , s) that we consider is 18.
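As a quick check, the count of 18 candidate triples can be reproduced by enumerating them (the helper name is ours):

```python
def candidate_triples(m, multipliers, s_values):
    """Enumerate the candidates (n_u, n_3, s), with n_u = 5*q*m for q in
    multipliers and n_3 ranging over {5m, 10m, ..., n_u - 5m}."""
    triples = []
    for q in multipliers:
        n_u = 5 * q * m
        for n_3 in range(5 * m, n_u - 5 * m + 1, 5 * m):
            for s in s_values:
                triples.append((n_u, n_3, s))
    return triples

climate = candidate_triples(12, [2, 3, 4], [2, 3, 4])   # 18 candidate triples
```

The same enumeration with m = 24 and s restricted to {3} yields the six candidates used later for the MeteoSwiss data.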
We compute the normalized errors (see Tables A1-A8 in Appendix A.1.1), which lead to the scores shown in Figure 1. From the plot in the left panel of the figure, it is evident that both DLM+BIC and DLM+EBIC work better than the other methods when the missing data are simulated by sampling without replacement. The method DLM+BIC is slightly superior to DLM+EBIC for all missing percentages, except for ρ = 10%. In the right panel of the figure, where the Polya urn model is employed for simulating the missing data, we observe that the method DynaMMo is ranked the best for all missing percentages, except for ρ = 20%, where DLM+BIC works better than DynaMMo. In Figure 2, we can see that DLM 1 +BIC and DLM 1 +EBIC are also very good when sampling without replacement is employed, but their performance diminishes in the case of the Polya model (for more details, see Tables A9-A16 in Appendix A.1.2).

MeteoSwiss Time Series (K = 10)

The measurements represent weather data collected in Swiss cities from 1980 to 2018. After simulating the missing data, we remove both the trend and the seasonal components from each time series (see Section 6.2). The transformed time series are clustered into K (c) = 2 groups as follows: Φ = {1, 2, 3, 4, 5} and Φ̄ = {6, 7, 8, 9, 10} (see (12)).
As the time interval between successive observations of these time series is 10 min, the most suitable value for m would be 6 × 24 = 144 (which corresponds to 24 h), but this makes the computational complexity too high. For keeping the computational burden at a reasonable level, we conduct experiments for three different values of m: m = 6 × 4 = 24, m = 6 × 6 = 36 and m = 6 × 8 = 48. Note that each value of m corresponds to a time interval (in hours) that is a divisor of 24. For each value of m, n u is selected from the set {5 × 2m, 5 × 3m, 5 × 4m} and for each value of n u , we have that n 3 ∈ {5 × m, 5 × 2m, . . . , n u − 5 × m}. The sparsity level s is chosen from the set {3, 4, 6}. We use these settings for the case when the missing data are generated by sampling without replacement and ρ = 5%. The results can be found in Table A17. It can be easily noticed that both DLM+BIC and DLM+EBIC yield very good results for all values of m.
Taking into consideration these results, we further conduct the experiments for all cases of missing data simulations by setting m = 24 and s = 3. Hence, we have six candidates for (n u , n 3 , s). The grouping for which Φ = {1, 2, 3, 4, 5} is employed. The results are reported in Tables A18-A25, in Appendix A.2.1, and in Figure 3. Both DLM+BIC and DLM+EBIC have outstanding performance for sampling without replacement, and they are very good when the Polya model is used. In the latter case, DLM+BIC is less successful when ρ = 20%. According to the scores shown in Figure 4, which are based on Tables A26-A33 (see Appendix A.2.2), the ranking of the imputation methods does not change significantly when the algorithm DLM is replaced with DLM 1 .

BAFU Time Series (K = 10)

These are water discharge time series recorded by BAFU (Bundesamt für Umwelt, the Swiss Federal Office for the Environment) on Swiss rivers from 2010 to 2015. We do not remove the seasonality from the simulated incomplete time series because the method from [42] does not detect seasonal components for 3 out of the 10 time series. Hence, the imputation is performed on the original time series. The grouping Φ = {1, 3, 6, 7, 9} and Φ̄ = {2, 4, 5, 8, 10} is used to cluster the time series into K (c) = 2 groups when the missing data are simulated by sampling without replacement (for all missing percentages) or by the Polya model with ρ = 15%. For the other three cases of missing data simulations, the grouping Φ = {2, 3, 6, 7, 9} and Φ̄ = {1, 4, 5, 8, 10} is applied.
Relying on some of the empirical observations made for the MeteoSwiss time series and taking into consideration the fact that the sampling period for the BAFU time series equals 30 min, we set m = 12 (which corresponds to a time interval of 6 h). The possible values for n u and n 3 are calculated by using the same formulas as in the experiments with the climate and MeteoSwiss time series. In those experiments, we noticed that small values are preferred for the sparsity level; hence, we take s = 3. This implies that the number of candidates for (n u , n 3 , s) is the same as in the case of the MeteoSwiss time series.
The normalized errors for this data set are given in Appendix A.3 (see Tables A34-A41). In Figure 5, we show the scores computed for the various imputation methods. From the left panel of the figure, it is clear that both DLM+BIC and DLM+EBIC are the best for ρ = 5%, 10%, 20%, where their scores are approximately one. The imputation method DLM+BIC does not work as well as DLM+EBIC when the percentage of the missing data is ρ = 15%. In the right panel of the figure, we observe that both DLM+BIC and DLM+EBIC have modest performance when the missing data are simulated by the Polya model. In this case, DynaMMo and ROSL are the winners.

Temperature Time Series (K = 50)

The data set contains average daily temperatures recorded at various sites in China from 1960 to 2012. After removing the trend and the seasonal components from all 50 time series with simulated missing data, we cluster them by applying the greedy algorithm introduced in Section 5.2. The resulting groups of time series are listed in Table A42 (see Appendix A.4). Note that the total number of groups is K (c) = 5, and there are 10 time series in each group. It is interesting that the way in which the time series are assigned to the clusters depends very much on the method employed for simulating the missing data and on the percentage of missing data.
Mainly based on the lessons learned from the experiments with the data sets analyzed in the previous sections, we have decided to take the signal length m = 12. We opt for relatively small values of n u ; three candidate values are considered. For the smallest one, n u = 10 × (1/2)m = 5m, we have that n 6 = n u , which means that all the atoms are common for all the time series, and atoms that are specific to a certain group do not exist. For the other two values of n u , we allow n 6 to be selected from the set {n u − 5m, n u }. The sparsity level s is selected from the set {2, 3}. Simple calculations lead to the conclusion that the total number of triples (n u , n 6 , s) equals 10. The normalized errors from Tables A43-A50 (see Appendix A.4) lead to the scores displayed in Figure 6. For the case of sampling without replacement, DLM+BIC, DLM+EBIC and ROSL have similar performance, and they are followed by CDRec. It is interesting that, although SVT has a very modest rank when the percentage of the missing data is small, it becomes superior to all other methods when ρ = 20%. As we have already observed for the BAFU time series, DLM+BIC and DLM+EBIC are not well ranked when the data are simulated by the Polya model. The difference compared with the results for the BAFU time series is that ROSL and DynaMMo are no longer clear winners; this time, ROSL and CDRec are better than the other competitors. Similar to the case when the missing data are simulated by sampling without replacement, SVT outperforms all the other imputation methods when ρ = 20%.

Air Quality Time Series (K = 10)

The data set comprises hourly sampled air quality measurements that have been recorded at monitoring stations in China from 2014 to 2015. For the case when the missing data are simulated by sampling without replacement (with ρ = 5%, 10%, 15%) and the Polya model (with ρ = 5%, 10%, 20%), the groups produced by the criterion in (12) are Φ = {2, 3, 5, 9, 10} and Φ̄ = {1, 4, 6, 7, 8}.
For the other two cases of missing data simulations, the clustering is: Φ = {1, 2, 4, 5, 9} and Φ̄ = {3, 6, 7, 8, 10}. We do not remove the trend and seasonal components since, for each missing data simulation, the seasonal patterns are not detected in more than half of the time series.
We have applied DLM and DLM 1 algorithms on this data set. In both cases, we take the signal length m = 24 because we analyze hourly data. The parameter n u is selected from the set {5 × 2m, 5 × 3m, 5 × 4m}. For each value of n u , we have that n 3 ∈ {5 × m, 5 × 2m, . . . , n u − 5 × m}. The sparsity level s is selected from the set {2, 3, 4}.
In Figure 7, we show the scores obtained by the various imputation methods when the DLM algorithm is applied and the normalized errors are computed with the formula from (14); see also Tables A51-A58 in Appendix A.5.1. It is evident that both DLM+BIC and DLM+EBIC perform better when sampling without replacement is employed to simulate the missing data than when the Polya model is used. SVT is clearly the best method for both cases of missing data simulations and for all percentages ρ. The ranking of the imputation methods is the same in Figure 8, where the scores have been calculated by using (15); see Tables A59-A66 in Appendix A.5.2. The important difference between the two figures is that DLM 1 and the corresponding IT criteria are employed in Figure 8.

Final Remarks
In this work, we have exemplified how two techniques for dimensionality reduction can be employed for extending the use of the DLM algorithm to data sets that contain tens of time series. It remains for future research to investigate how the previous results on time series clustering can be utilized for grouping the time series and finding the number of groups (see, for example, [44] for the definition of a metric that evaluates the dissimilarity between two time series). We have also derived the novel DLM 1 algorithm and the corresponding BIC and EBIC criteria. We have conducted a large set of experiments for comparing DLM and DLM 1 with nine algorithms that represent the state-of-the-art in multivariate time series imputation.
Our empirical study confirms a fact already observed in the previous literature: There is no imputation method which always yields the best results. Although not always the best, our method clearly outperforms the other studied methods for some missing data models. So, it appears to be a useful addition to the existing methods. DLM tends to work better when the missing data are isolated (see the results for simulation by sampling without replacement) than in the case when the sequences of missing data are long (see the results for simulation by using the Polya model).
Both BIC and EBIC are effective in selecting the best structure of the dictionary and the sparsity. It is interesting that small values are chosen for the sparsity s, and this supports the idea that sparse models are appropriate for the multivariate time series. The values selected for n u are also relatively small. Recall that n u equals the number of atoms used in the representation of a specific time series (or a group of time series) plus the number of atoms that are common for all the time series in the data set.
Our imputation method can also be applied when the percentage of the missing data is not the same for all the time series in the data set. Based on the experimental results, it is easy to see that the percentage of missing data does not have an important influence on how DLM is ranked in comparison with the other methods; thus, we expect the same to be true when the number of missing data varies across the time series.

Acknowledgments: X.Z. would like to thank Mourad Khayati from the Department of Computer Science of the University of Fribourg, Switzerland for numerous hints on how to use the data and the code from https://github.com/eXascaleInfolab/bench-vldb20.git (accessed on 3 October 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The tables below contain the normalized errors (see (14) and (15)) for the imputation methods assessed in this work. Note that the errors are computed for each time series in the data sets used in the empirical study. For each row in the tables, the best result is written in red and the second best result is written in bold font. For the methods DLM and DLM 1 , we also show the values of the triple (n u , n K (c) +1 , s) selected by BIC (see (9) and (11)) and EBIC (10). In the caption of each table, we provide information about the missing data: how they have been simulated and the value of ρ (see (13)). The only exception is Table A42 in Appendix A.4, which does not contain normalized errors, but shows the clustering of the time series.