Maximum Entropy Expectation-Maximization Algorithm for Fitting Latent-Variable Graphical Models to Multivariate Time Series

This work is focused on latent-variable graphical models for multivariate time series. We show how an algorithm which was originally used for finding zeros in the inverse of the covariance matrix can be generalized such that to identify the sparsity pattern of the inverse of spectral density matrix. When applied to a given time series, the algorithm produces a set of candidate models. Various information theoretic (IT) criteria are employed for deciding the winner. A novel IT criterion, which is tailored to our model selection problem, is introduced. Some options for reducing the computational burden are proposed and tested via numerical examples. We conduct an empirical study in which the algorithm is compared with the state-of-the-art. The results are good, and the major advantage is that the subjective choices made by the user are less important than in the case of other methods.


Introduction
Graphical models are instrumental in the analysis of multivariate data. Originally, these models have been employed for independently sampled data, but their use has been extended to multivariate, stationary time series [1,2], which triggered their popularity in statistics, machine learning, signal processing and neuroinformatics.
For better understanding the significance of graphical models, let x be a random vector having a Gaussian distribution with zero-mean and positive definite covariance matrix Γ. A graph G = (V, E) can be assigned to x in order to visualize the conditional independence between its components. The symbol V denotes the vertices of G, while E is the set of its edges. There are no loops from a vertex to itself, nor multiple edges between two vertices. Hence, E is a subset of {(a, b) ∈ V × V : a = b}. Each vertex of the graph is assigned to an entry of x. We conventionally draw an edge between two vertices a and b if the random variables x a and x b are not conditionally independent, given all other components of x. The description above follows the main definitions from [3] and assumes that the graph G is undirected. Proposition 1 from the same reference provides a set of equivalent conditions for conditional independence. The most interesting one claims that x a and x b are conditionally independent if and only if the entry (a, b) of Γ −1 is zero. This shows that the missing edges of G correspond to zero-entries in the inverse of the covariance matrix, which is called the concentration matrix.
There is an impressive amount of literature on graphical models. In this work, we focus on a generalization of this problem to time series. The main difference between the static case and the dynamic case is that the former relies on the sparsity pattern of the concentration matrix, whereas the latter is looking for zeros in the inverse of the spectral density matrix. One of the main difficulties stems from the fact that the methods developed in the static case cannot be applied straightforwardly to time series.
Parametric as well as non-parametric methods have been already proposed in the previous literature dedicated to graphical models for time series. Some of the recently introduced estimation methods are based on convex optimization. We briefly discuss below the most important algorithms which belong to this class.

•
Reference [4] extends the static case by allowing the presence of latent variables. The key point of their approach is to express the manifest concentration matrix as the sum of a sparse matrix and a low-rank matrix. Additionally, they provide conditions for the decomposition to be unique, in order to guarantee the identifiability. The two matrices are estimated by minimizing a penalized likelihood function, where the penalty involves both the 1 -norm and the nuclear norm. Interestingly enough, the authors of the discussion paper [5] pointed out that an alternative solution, which relies on the Expectation-Maximization algorithm, can be easily obtained.

•
In the dynamic case, reference [6] has an important contribution which consists in showing that the graphical models for multivariate autoregressive processes can be estimated by solving a convex optimization problem which follows from the application of the Maximum Entropy principle. This paved the way for the development of efficient algorithms dedicated to topology selection in graphical models of autoregressive processes [7,8] and autoregressive moving average processes [9]. • A happy marriage between the approach from [4] and the use of Maximum Entropy led to the solution proposed in [10] for the identification of graphical models of autoregressive processes with latent variables. Similar to [4], the estimation is done by minimizing a cost function whose penalty term is given by a linear combination of the 1 -norm and the nuclear norm. The two coefficients of this linear combination are chosen by the user and they have a strong influence on the estimated model. The method introduced in [10] performs the estimation for various pairs of coefficients which yield a set of candidate models; the winner is decided by using a score function.
According to the best of our knowledge, there is no other work that extends the estimation method from [5] to the case of latent-variable autoregressive models. The main contribution of this paper is to propose an algorithm of this type, which combines the strengths of Expectation-Maximization and convex optimization. The key point for achieving this goal is to apply the Maximum Entropy principle.
The rest of the paper is organized as follows. In the next section, we introduce the notation and present the method from [10]. Section 3 outlines the newly proposed algorithm. The outcome of the algorithm is a set of models from which we choose the best one by employing information theoretic (IT) criteria. Section 4 is focused on the description of these criteria: We discuss the selection rules from the previous literature and propose a novel criterion. The experimental results are reported in Section 5. Section 6 concludes the paper.

Preliminaries and Previous Work
Let x 1 , . . . , x T be a κ-dimensional (κ > 1) time series generated by a stationary and stable VAR process of order p. We assume that the spacing of observation times is constant and x t = [x 1t , . . . , x κt ] . The symbol (·) denotes transposition. The difference equation of the process is where A 1 , . . . , A p are matrix coefficients of size κ × κ and t is a sequence of independently and identically distributed random κ-vectors. We assume that the vectors { t } T t=1 are drawn from a κ-variate Gaussian distribution with zero mean vector and covariance matrix Σ 0. Additionally, the vectors {x t } 0 t=1−p are assumed to be constant.
The conditional independence relations between the variables in x t are provided by the inverse of the spectral density matrix (ISDM) of the VAR-process {x t }. The ISDM has the expression where j = √ −1 and (·) H is the operator for conjugate transpose. We define A 0 = −I, where I stands for the identity matrix of appropriate size, and The sparse structure of the ISDM contains conditional dependence relations between the variables of x t , i.e., two variables x a and x b are independent, conditional on the other variables, if and only if [1,11] In the graph corresponding to the ISDM Φ −1 (ω), the nodes stand for the variables of the model, and the edges stand for conditional dependence, i.e., there is no edge between conditionally independent variables.
In a latent-variable graphical model it is assumed that κ = K + r, where K variables are accessible to observation (they are called manifest variables) and r variables are latent, i.e., not accessible to observation, but playing a significant role in the conditional independence pattern of the overall model. The existence of latent variables in a model can be described in terms of the ISDM by the block decomposition where Φ m (ω) and Φ (ω) are the manifest and latent components of the spectral density matrix, respectively. Using the Schur complement, the ISDM of the manifest component has the form [10] (Equation (21)): When building latent variable graphical models, we assume that r K, i.e., few latent variables are sufficient to characterize the conditional dependence structure of the model. The previous formula can therefore be written where S(ω) is sparse, and Λ(ω) has (constant) low-rank almost everywhere in (−π, π]. Furthermore, we can write [12] (Equation (4)): where ∆ (ω) = I, e jω I, . . . , e jωp I is a shift matrix, and X and L are K (p + 1) × K (p + 1) positive semidefinite matrices. We split all such matrices in K × K blocks, e.g., The block trace operator for such a matrix is D(·), defined by For negative indices, the relation D −i (L) = D i (L) holds. Note that (8) can be rewritten as The first p + 1 sample covariances of the VAR process are [13]: However, only the upper left K × K blocks corresponding to the manifest variables can be computed from data; they are denotedR i . WithR = [R 0 . . .R p ], we build the block Toeplitz matrix It was proposed in [10] to estimate the matrices X and L by solving the optimization problem where tr(·) is the trace operator, log(·) denotes the natural logarithm and det(·) stands for the determinant. Minimizing tr(L) induces low rank in L and λ, γ > 0 are trade-off constants. The function f (·) is a group sparsity promoter whose expression is given by Note that D i (X + L)(a, b) is the i-th degree coefficient of the polynomial that occupies the (a, b) position in the matrix polynomial Φ −1 m (ω). Sparsity is encouraged by minimizing the 1 -norm of the vector formed by the coefficients that are maximum for each position (a, b).

New Algorithm
The obvious advantage of the optimization problem (13) is its convexity, which allows the safe computation of the solution. However, a possible drawback is the presence of two parameters, λ and γ, whose values should be chosen. A way to eliminate one of the parameters is to assume that the number r of latent variables is known. At least for parsimony reasons, it is natural to suppose that r is very small. Since a latent variable influences all manifest variables in the ISDM (5), there cannot be too many independent latent variables. Therefore, giving r a fixed small value is likely to be not restrictive.
In this section, we describe an estimation method which is clearly different from the one in [10]. More precisely, we generalize the Expectation-Maximization algorithm from [5], developed there for independent and identically distributed random variables, to a VAR process. For this purpose, we work with the full model (4) that includes the ISDM part pertaining to the r latent variables. Without loss of generality, we assume that Υ (ω) equals the identity matrix I; the effect of the latent variables on the manifest ones in (5) can be modeled by Υ m alone. Combining with (2), the model is where the matrices Q i have to be found. The main difficulty of this approach is the unavailability of the latent part of the matrices (11). Were such matrices available, we could work with SDM Φ(ω) estimators (confined to order p) of the formΦ where C i denotes the i-th covariance lag for the VAR process {x t } (see also (1) and (11)). We split the matrix coefficients from (15) and (16) according to the size of manifest and latent variables, e.g., To overcome the difficulty, the Expectation-Maximization algorithm alternatively keeps fixed either the model parameters Q i or the matrices C i , estimating or optimizing the remaining unknowns. The expectation step of Expectation-Maximization assumes that the ISDM Φ −1 (ω) from (15) is completely known. Standard matrix identities [5] can be easily extended to matrix trigonometric polynomials for writing down the formula Identifying (16) with (18) gives expressions for estimating the matrices C i , depending on the matrices Q i from (15). The upper left corner of (18) needs no special computation, since the natural estimator is where the sample covariancesR i are directly computable from the time series. It results that The other blocks from (17) result from convolution expressions associated with the polynomial multiplications from (18). The lower left block of the coefficients is Note that the trigonometric polynomial Υ m (ω)Φ m (ω) has degree 2p, since its factors have degree p. With (20) available, we can compute where δ i = 1 if i = 0 and δ i = 0 otherwise. Although the degree of the polynomial from the lower right block of (18) is 3p, we need to truncate it to degree p, since this is the degree of the ISDM Φ −1 (ω) from (15). This is the reason for computing only the coefficients i = −p, p in (21). The same truncation is applied on (20); note that there we cannot compute only the coefficients that are finally needed, since all of them are required in (21).
In the maximization step of Expectation-Maximization, the covariance matrices C i are assumed to be known and are fixed; the ISDM can be estimated by solving an optimization problem that will be detailed below. The overall solution we propose is outlined in Algorithm 1, explained in what follows.

Initialization:
EvaluateR i for i = 0, p; (see (11) and the discussion below it) (22)); Maximum Entropy Expectation-Maximization (penalized setting): Maximum Entropy Expectation-Maximization (constrained setting): to computeΦ −1 λ (ω); Find the matrix coefficients of the VAR-model by spectral factorization ofΦ −1 λ (ω) and compute ITC(Data; SP λ ). The initialization stage provides a first estimate for the ISDM, from which the Expectation-Maximization alternations can begin. An estimate for the left upper corner of Φ −1 (ω) is obtained by solving the classical Maximum Entropy problem for a VAR(p)-model, using the sample covariances of the manifest variables. We present below the matrix formulation of this problem, which allows an easy implementation in CVX (Matlab-based modeling system for convex optimization) [14]. The mathematical derivation of the matrix formulation from the information theoretic formulation can be found in [6,9].
First Maximum Entropy Problem [ME I (R)]: The block Toeplitz operator T is defined in (12). The size of the positive semidefinite matrix variable X is K(p + 1) × K(p + 1). For all i = 0, p, the estimateQ In order to compute an initial value for Υ m (ω), we resort to the eigenvalue decomposition (EIG) ofQ When the covariances C i are fixed in the maximization step of the Expectation-Maximization algorithm, the coefficients of the matrix polynomial that is the ISDM (15) are estimated from the solution of the following optimization problem: Since now we work with the full model, the size of X is (K + r)(p + 1) × (K + r)(p + 1). The function f (·) is the sparsity promoter defined in (14) and depends only on the entries of the block corresponding to the manifest variables. The equality constraints in (23) guarantee that the latent variables have variance one and they are independent, given the manifest variables, corresponding to the lower right block of (15).
obtained after these iterations are further employed to computeΦ −1 λ (ω) by using (15). If λ is large enough, thenΦ −1 λ (ω) is expected to have a certain sparsity pattern, SP λ . Since the objective of (23) does not ensure exact sparsification and also because of the numerical calculations, the entries ofΦ −1 λ (ω) that belong to SP λ are small, but not exactly zero. In order to turn them to zero, we apply a method similar to the one from [6] (Section 4.1.3). We firstly compute the maximum of partial spectral coherence (PSC), for all a = b with 1 ≤ a, b ≤ K. Then SP λ comprises all the pairs (a, b) for which the maximum PSC is not larger than a threshold Th. The discussion on the selection of parameters N it and Th is deferred to Section 5.
The regularized estimate of ISDM is further improved by solving a problem similar to (23), but with the additional constraint that the sparsity pattern of ISDM is SP λ , more precisely: This step of the algorithm has a strong theoretical justification which stems from the fact that Φ −1 (ω) is the Maximum Entropy solution for a covariance extension problem (see [10] (Remark 2.1)). The number of iterations, N it , is the same as in the case of the first loop.
The spectral factorization of the positive matrix trigonometric polynomialΦ −1 λ (ω) is computed by solving a semidefinite programming problem. The implementation is the same as in [8], except that in our case the model contains latent variables. Therefore, the matrix coefficients produced by spectral factorization are altered to keep only those entries that correspond to manifest variables. The resulting VAR model is fitted to the data and then various IT criteria are evaluated. The accuracy of the selected model depends on the criterion that is employed as well as on the strategy used for generating the λ-values that yield the competing models. In the next section, we list the model selection rules that we apply; the problem of generating the λ-values is treated in Section 5.
As already mentioned, the estimation problem is solved for several values of λ: From the description above we know that, for each value of the parameter λ, Υ m (ω) gets the same initialization, which is based on (22). It is likely that this initialization is poor. A better approach is an ADAPTIVE algorithm which takes into consideration the fact that the difference λ i − λ i−1 is small for all i = 2, L. This algorithm initializes Υ m (ω) as explained above only when λ = λ 1 . When λ = λ i for i = 2, L, the initial value of Υ m (ω) is taken to be the estimate of this quantity that was previously obtained by solving the optimization problem in (23) for λ = λ i−1 .
The effect of the ADAPTIVE procedure will be investigated empirically in Section 5. The newly proposed estimation method outlined in Algorithm 1 is dubbed AlgoEM. The Matlab code for AlgoEM can be downloaded from https://www.stat.auckland.ac.nz/~cgiu216/ PUBLICATIONS.htm.

IT Criteria
It is well-known that the IT criteria can be expressed as the sum of a goodness-of-fit (GOF) term and a penalty. They are derived on various grounds, but their expressions cannot be obtained easily for the problem we investigate. Due to this reason, we resort to the methodology applied previously to VAR without latent variables, where the criteria originally proposed for model order selection have been modified such that to be employed for finding the best sparsity pattern. As the GOF term is obtained straightforwardly by fitting the model to the data, the difficult part is the alteration of the penalty term. Based on the observation that all the penalty terms of the criteria for model order selection involve the number of parameters of the model, reference [6] proposed to replace it with the effective number of parameters: Note that N 0 is the number of zeros in the lower triangular part of SP λ . The expression above can be obtained straightforwardly by counting the number of non-zero entries in the K × K upper-left block of the matrices Q (N it ) i p i=0 produced by Algorithm 1; counting takes into consideration all the existing symmetries.
The formula in (26) was used in [6] in order to modify three celebrated criteria: Schwarz Bayesian Criterion-SBC [15], Akaike Information Criterion-AIC [16] and its corrected version-AIC c [13] (pp. 432) and [17]. We name SBC the criterion from [15] because this is the term used in time series literature; the same selection rule is called Bayesian Information Criterion (BIC) in other works. Based on the empirical evidence from [6], SBC is ranked best when the sample size is large, whereas AIC c works better for small sample sizes. This makes us to employ these two criteria in our experiments. Their expressions are [6]: whereΣ is the error covariance matrix. For simplicity, we write SP instead of SP λ . Even if our notation does not emphasize on this fact, it is clear that both N ef andΣ depend on λ.
Another criterion employed in our tests is a variant of the Final Prediction Error-FPE [18]. The formula we use was obtained in [8] by relying on the asymptotic equivalence between log FPE and AIC. For ease of comparison with other criteria, we do not give the expression of FPE, but that of log FPE: where η = N ef /K. We also apply the Renormalized Maximum Likelihood (RNML) criterion. Its derivation is related to the problem of encoding losslessly measurements assumed to be a sequence of outcomes from an unknown distribution. It has been proven in [19] that there is a unique distribution which, if it is used to encode these measurements, then it leads to the minimum code length in the worst case scenario. This particular distribution was further utilized in [20,21] in order to introduce the RNML criterion for model selection. The major problem comes from the fact that, for most of the family of models, its closed-form expression is hard to be obtained. The expression of RNML that can be employed for choosing the order of VAR-models was firstly derived in [22]. The properties of this criterion have been investigated in [8], where the criterion was also altered such that to be applied in the selection of the sparsity pattern for ISDM of VAR-models. With our notation, the RNML criterion can be written as follows: where Γ K (·) is the multivariate Gamma function and Γ(·) is the Gamma function. The matrixR 0 has the same significance as in (12). It is remarkable that the penalty term of RNML depends on the actual measurements, and not only on the triple (T, K, N ef ).
A common feature of all the criteria presented so far is that they do not take into consideration how big is the family of the competing candidates. This might be problematic because the total number of possible sparsity patterns is as large as 2 K , where K = K(K − 1)/2. The solutions proposed in the previous literature for circumventing this difficulty are called extended IT criteria. We show below how these criteria can be applied for selecting the sparsity pattern. In this context, we also propose a novel variant of RNML.

Extended IT Criteria
First we write down the expression of the extended SBC proposed in [23]. In the statistical literature, this criterion is named EBIC (Extended Bayesian Information Criterion): where γ ∈ [0, 1]. It is evident that EBIC is equivalent to SBC when γ = 0. More interestingly, reference [24] uses arguments from information theory for justifying the use of a penalty term which counts the number of models that have the same number of parameters. For our problem, this is equivalent to choosing γ = 1 in (29). Because this is also the value of γ that we use in this work for evaluating EBIC, we explain briefly the significance of the supplementary penalty term. The key point is to consider a scenario in which Data should be transmitted losslessly from an encoder to a decoder by employing the model given by SP. According to [24], the first step is to transmit the value of N 0 .
The assumption that all possible values of N 0 are equally probable leads to the conclusion that the code length for N 0 is − log 1/(K + 1) = log(K + 1). As this quantity is the same for all models, it can be neglected. Then the decoder should be informed about the actual locations of the zeros in the sparsity pattern SP. Since the list of all sparsity patterns for a given N 0 is known by both the encoder and the decoder, all that remains is to send to the decoder the index of SP in this list. Under the hypothesis that all the sparsity patterns in the list are equally probable, the code length for the index is log K N 0 . In (29), this quantity is multiplied by two because of the scaling factor used in the definition of SBC(Data; SP). Another formulation of EBIC was introduced in [25] for finding the graphical structure of a Gaussian model (static case), in the situation when the number of variables and the number of observations grow simultaneously. After modifying the criterion from [25] by replacing the number of edges of the graph with N ef , we get the following formula: where γ has the same significance as in (29), and again we take γ = 1. Remark that the term 4γN ef log K grows when N 0 decreases; the term in (29), 2γ log K N 0 = 2γ log K K − N 0 , does not have the same property. Relying on the asymptotic equivalence between RNML and SBC [8], we alter RNML by adding half of the extra penalty from (30); the scaling factor is needed because the ratio between the GOF term in (28) and the GOF term in (27) tends to 1/2 when T → ∞. The new criterion, which is dubbed RNML FD , has the following expression: For all the selection rules listed above, the best model is the one which minimizes the value of the criterion. For evaluating the performance of IT criteria, we conduct an empirical study. The main results of this study are reported in the next section.

Artificial Data
In our simulations, the order of the VAR model in (1) is taken to be one, the number of manifest variables is K = 15, and there is one single latent variable (r = 1). Let KS denote the number of non-zero entries in the lower triangular part of the sparsity pattern of size K × K. We consider three different values for KS: 2, 3 and K. Remark that the larger is KS, less sparse is the ISDM. The locations of the non-zero entries for each value of KS are graphically represented in Figure 1.   (2) have only ones on their main diagonals. The entries of the K × K upper-left block of the matrix Q i , which should be non-zero according to Figure 1, are equal to 0.5/(i + 1). Additionally, the entries on the last row and on the last column of Q i , except the one on the main diagonal, are equal to 0.3/(i + 1). Integer multiples of the κ × κ identity matrix are added to Q 0 until the resulting ISDM is positive definite. Furthermore, the spectral factorization is applied in order to obtain the matrix coefficients of the VAR-model from the ISDM (see [8] (Section 4.2) for more details). Hence, for each value of KS, one single VAR-model is produced and this is used for generating N tr = 10 κ-variate time series of length T = 50,000. To this end, we utilize Matlab R2014b functions from the package available at the address http://climate-dynamics. org/software/#arfit. After discarding from each time series the component corresponding to the latent variable, the simulated data are used for evaluating the performance of AlgoEM.

Settings for AlgoEM
The order of VAR, as well as the number of latent variables, is assumed to be known. The parameter λ takes values on a regular grid defined on the interval [10 −3 , 10 −1 ], for which the grid step is 10 −3 . It follows that the total number of values for λ is L = 100. The threshold Th, which is used in conjunction with (24) in order to get the estimated sparsity pattern, equals 10 −3 .
We are interested to evaluate the impact of the adaptive initialization procedure that was introduced in Section 3. This is why we run AlgoEM with and without this procedure, for all the time series we have generated. For each time series and for each value of λ on the grid, the estimated SP λ is compared to the true sparsity pattern. The comparison reduces to computing the distance between the two sparsity patterns, which is given by the number of positions below the main diagonal where the patterns differ. For the case KS = 15, statistics related to this distance are presented in Figure 2.
Remark that values of λ close to zero lead to estimated patterns which are not sparse. As expected, this happens disregarding if the adaptive procedure is applied or not. The use of the procedure has the positive effect that, for a large range of λ-values, the estimated patterns are close to the true one. The same is true for both KS = 2 and KS = 3, which makes us apply the adaptive procedure in all the experiments outlined below.
The results reported in Figure 2 are obtained by taking the number of iterations to be N it = 4. As the computational burden of the algorithm depends strongly on N it , we investigate the effect of reducing the number of iterations to N it = 3 and N it = 2, respectively. In each case, the evaluation of performance is done by an oracle having complete knowledge about the true sparsity pattern. From the set of sparsity patterns produced when applying AlgoEM to a particular time series, the oracle selects the one which is closest to the true sparsity pattern. The closeness is measured by the distance defined above. The average distances computed from N tr = 10 trials are plotted in Figure 3. We can see in the figure that, in the case when KS = 2 and AlgoEM performs only two iterations, the true sparsity pattern is always in the set of the candidates produced by the algorithm and this makes the average distance to be zero. In general, all the results shown in the figure are good as the average distance is smaller than one in all cases. Since the increase of N it does not guarantee the improvement in performance, we take N it = 2 for reducing the complexity of the algorithm.  Figure 1c. With the convention that dist denotes the distance between the estimated sparsity pattern and the true one, we plot mean(dist) ±1 standard deviation(dist) versus the parameter λ. The statistics are computed from N tr = 10 trials, for both the adaptive and the non-adaptive case.  Figure 3. Impact of N it on the performance of AlgoEM: Evaluation is done by replacing in AlgoEM the IT criterion with an oracle having full knowledge about the true sparsity pattern. For each KS and for each N it , we run N tr = 10 trials for calculating the average distance between the true sparsity pattern and the sparsity pattern selected by oracle.
It is clear from the description of Algorithm 1 that N it is the same for the two major loops of AlgoEM. We name the first loop MEEM(Pen) and the second one MEEM(Con). Obviously, MEEM is the acronym for Maximum Entropy Expectation-Maximization, Pen stands for penalized settings, and Con means constrained settings. Because we want to quantify the influence of MEEM(Con) on the accuracy of the estimation, we show in Figure 4 the average distances to the true pattern, when the estimates are produced by MEEM(Pen) and by MEEM(Con), respectively. This time we do not report results obtained only when an oracle is used for selecting the sparsity pattern, but also for the case when the selection is done with the IT criteria defined in Section 4. We can observe that AIC c and FPE perform very poorly, whereas both EBIC FD and RNML FD are very good. The fact that RNML FD is superior to RNML demonstrates the importance of the extra-term in (31). Remark also that EBIC FD is more accurate than SBC, while EBIC has the same level of performance as SBC. The most important conclusion is that removing MEEM(Con) from AlgoEM does not deteriorate the final outcome.
The next step is to compare AlgoEM with the algorithm that solves the optimization problem in (13). We use the implementation from [12], which is publicly available at the address https: //drive.google.com/file/d/0BykD2O6uX6KjSGlkQTRYWFVBZGM/view, and we call it AlgoSL. The name comes from the fact that the method in (13) is sparse plus low-rank.

Comparison of AlgoEM and AlgoSL
In [12], the sparsity pattern for a given pair of parameters (λ, γ) is found by solving (13). Reference [10] uses further this sparsity pattern as an initialization for a constrained optimization problem (see also the discussion below Equation (25)). In both cases, a set of sparsity patterns is generated by choosing various values for the parameters λ and γ. For selecting the best one, they consider an alternative to IT criteria, which is dubbed score function (SF). The key point is that, when using SF, it is not needed to compute explicitly the matrix coefficients of the VAR-model. More details are provided below.
LetΦ m be an estimate of Φ m in (4), which is constrained to have a certain sparsity pattern, SP. Using an idea from [9], reference [10] suggests to employ Data for computing the correlogramΦ c m (with Bartlett window) [26], and then to evaluate the relative entropy rate: where This formula can be regarded as a generalization of the I-divergence of two positive definite matrices, which was originally introduced in connection with graphical models (static case) [3]. The expression in (32) was also employed by [27] in the inference of graphical models for time series. Some of the properties of the relative entropy rate are discussed in [28]. It is worth noting that this index also belongs to the Tau divergence family [29] and to the Beta divergence family [30]. It has been proven in these references that the use of (32) in the formulation of the Maximum Entropy problem leads to the simplest solution, in the sense of the minimum McMillan degree.
The most important is that, in [10], D(Φ c m ||Φ m ) is utilized for quantifying the adherence of the model to the data. The complexity of the model is determined by N e , the total number of edges in the graph (including the edges that connect the latent variables to the manifest variables). For instance, if the model has a single latent variable, then N e = K(K + 1)/2 − N 0 . Note that N 0 has the same significance as in (26). It follows that N ef = N e + p(K 2 − 2N 0 ), which leads to the conclusion that N ef > N e , and the difference between the two quantities increases when the order of the model raises. If p = 1 and the number of latent variables is at least three (r ≥ 3), then N ef < N e when N 0 is large.
The score functions given in [10] are: where D(·||·) is defined in (32). The formula of SF 1 is logged for ease of reading.
We apply AlgoSL to all the time series we have simulated (see again Section 5.1). In our experiments, we consider the pairs (λ, γ) for which λ ∈ {0.1, 0.2, . . . , 0.6} and γ ∈ {0.01, 0.02, . . . , 0.5}. For the selection of the sparsity pattern, we do not use only the score functions in (33)-(35), but we also employ the IT criteria from Section 4. The results are shown in Figure 5. The best criterion is RNML FD , which is able to find the true sparsity pattern in all the experiments; the second best is EBIC FD . The comparison of the plots in Figure 5 to those in Figure 4 leads to the conclusion that AlgoSL works better than AlgoEM when RNML FD is employed for selecting the model. For understanding these results, we should take into consideration two important aspects: (i) Oracle gives perfect results for all KS in the case of AlgoSL, but not in the case of AlgoEM; to some extent this is due to the fact that 300 (λ, γ)-pairs allow to produce a better set of candidates for AlgoSL than the one generated by 100 λ-values for AlgoEM; (ii) The score functions have been used in [10] for time series of hundreds of samples, whereas the size of the time series we simulated is much larger (T = 50,000).
In order to clarify the second aspect, we conduct an experiment with simulated time series for which T = 500. The simulation procedure is the same as in Section 5.1 and all the settings are the same, except that the non-zero entries of the matrices {Q i } p i=0 (which are not located on the main diagonal) have values that are fifty times larger. For AlgoEM, the uniform grid for λ takes values on the interval [10 −3 , 10 −1 ]; the grid step is 10 −3 (see also [31]). Based on some empirical evidence, we use the parameters λ ∈ {0.02, 0.03, . . . , 0.3} and λγ ∈ {0.01, 0.015, . . . , 0.2} for AlgoSL. All IT criteria and all the score functions are used for both AlgoEM and AlgoSL, and the estimation results are reported in Figure 6. In general, they are worse than the results for large sample size (T = 50,000), and the difference is more evident when KS = 15. This shows that, for small T, it might be difficult to recover the true structure of ISDM when it is not sparse enough. It is encouraging that, even for KS = 15, oracle used in conjunction with AlgoSL finds the true pattern in all trials. To gain more insight, we give in Figure 7 a graphical representation of all the distances which are computed by oracle for a time series whose ISDM has KS = 15. It follows from the way in which we have chosen the experimental parameters that the total number of such distances calculated for AlgoSL is more than eleven times larger than the number of distances for AlgoEM. We should keep in mind that we need to compute the distance from the pattern estimated by each candidate in the list to the true sparsity pattern. After a laborious process of selecting the values of λ and λγ, we ended-up with a two-dimensional grid which yields a good number of estimated patterns that are identical to the true pattern (see the white area inside the square shown in Figure 7b). The behavior of AlgoEM is different: For a large range of λ-values, the K × K upper-left block of the estimated pattern is equal to the identity matrix, which makes the distance to the true pattern to be equal to KS. This happens when the number of latent variables used in AlgoEM equals the true number (r = 1) as well as when r = 2 is utilized in estimation. However, it is evident in Figure 7a that the estimation results are better when r = 1. If the number of latent variables is not known a priori, one possibility might be to use an ITC for selecting it. Note that the definition of N ef in (26) should be modified. The score functions given in (33)-(35) do not need to be altered in order to be employed in selection of the number of latent variables.

Real-World Data
In addition to the experiments with simulated data, we test the capabilities of AlgoEM on multivariate time series of daily stock markets indices at closing time, which can be downloaded from the following address: http://au.mathworks.com/matlabcentral/fileexchange/48611-internationaldaily-stock-return-data-for-system-identification. This dataset was produced by pre-procesing the original data from http://finance.yahoo.com, which have been measured from 4 January 2012 to 31 December 2013. We refer to [10] for details regarding the pre-processing. The time series is K-variate with K = 22, and the sample size is T = 518. . The official names of the price indices can be found in [10].
Similar to [10], we take the order of VAR-model to be p = 1, and we assume that there is a single latent variable (r = 1). For AlgoEM, we have N it = 4 and λ takes values on a uniform grid on the interval [2 × 10 −3 , 2 × 10 −1 ], for which the grid step is 2 × 10 −3 . When AIC c is used for choosing the model, the results are disappointing because the selected sparsity pattern contains very few zeros. The same type of outcome is also obtained when applying SF 2 or SF 3 . It is interesting that SF 1 as well as six different IT criteria (SBC, EBIC, EBIC FD , FPE, RNML, RNML FD ) select exactly the same sparsity pattern. This one is compared in Figure 8 to the sparsity pattern given in [10], for the same data set. Observe that the conditional independence graph yielded by AlgoEM has only four edges which connect manifest variables: Three of them connect vertices corresponding to Asian markets, (HK,CH), (HK,JA), (JA,KO), and the fourth one connects two European markets, (GR,AT). This is different from the conditional graph in [10], where all the edges between manifest variables connect only the European markets (with the exception of Greece). An in-depth analysis of the two graphs is beyond the interest of this paper.  Figure 8. International stock markets data: Comparison between the sparsity pattern for manifest variables from [10] and the pattern produced by AlgoEM when either SF 1 or one of the following IT criteria is used: SBC, EBIC, EBIC FD , FPE, RNML, RNML FD .

Final Remarks
The main motivation for this study is the solution proposed in [5] for the static case, as an alternative to the method from [4]. We have shown how the estimation method from [5] can be generalized for the dynamic case. The resulting algorithm is dubbed AlgoEM. We have conducted an empirical study in which we have investigated the capabilities of AlgoEM and we have also compared it with the generalization of the method from [4], which was proposed in [10]. It is important to emphasize that the two methods for latent-variable autoregressive models, which we have compared, have some common features: apply the Maximum Entropy principle, use convex optimization, and generate a set of candidate models from which the best one is selected by a certain rule. In the case of AlgoEM, the set of the candidates depends strongly on the parameter λ, which is chosen by the user. For the method in [10], the user should choose two parameters, λ and γ. Based on our experience, the selection of the two parameters is much more difficult than the selection of the single parameter for AlgoEM. Another important aspect is how to pick-up the winner from the competing models. In [10], this was restricted to the use of score functions. We have demonstrated empirically that the IT criteria might be an option to consider. Especially when the sample size is large, it is recommended to employ the criterion RNML FD which we have introduced in this work.