Bayesian Nonparametric Modeling of Categorical Data for Information Fusion and Causal Inference

This paper presents a nonparametric regression model of categorical time series in the setting of conditional tensor factorization and Bayes network. The underlying algorithms are developed to provide a flexible and parsimonious representation for fusion of correlated information from heterogeneous sources, which can be used to improve the performance of prediction tasks and infer the causal relationship between key variables. The proposed method is first illustrated by numerical simulation and then validated with two real-world datasets: (1) experimental data, collected from a swirl-stabilized lean-premixed laboratory-scale combustor, for detection of thermoacoustic instabilities and (2) publicly available economics data for causal inference-making.


Introduction
Modeling and decision-making in complex dynamical systems (e.g., distributed physical processes [1], macro-economy [2] and human brain [3]) often rely on time series collected from heterogeneous sources. Fusion of the information extracted from an ensemble of time series is a critical ingredient for better prediction and causal inference.
In many dynamical systems, the characteristic time of the physical process under consideration is small (e.g., around 2 ms in a typical combustion process) relative to the time-scale of respective decision-making (e.g., tenths of a second for active combustion control). Therefore, fast and accurate prediction of the system states and estimation of the associated parameters is essential for online monitoring and active control of the dynamical system; for example, real-time prediction of future states can significantly improve active control of thermoacoustic instabilities [4]. One way to achieve this is to make predictions based on different but correlated information sources. Although several methods have been proposed for prediction based on fusion of heterogeneous time series (e.g., [5][6][7]), they lack a coherent probabilistic interpretation and may not be able to accommodate more general interactions between current measurements and the measurement history. Furthermore, these methods may not be sequentially implementable and hence they may not be very useful for real-time applications.
Identification of causal relationships is essential for understanding the consequences of transitions from empirical findings to actions, and it thus forms a significant part of knowledge discovery. Various analytical techniques (e.g., [8][9][10]) have been proposed for causal inference-making; among them, the concept of causality introduced by Granger [11], hereafter called Granger causality, is apparently one of the most widely used in time series analysis [12]. Granger causality does not rely on the specification of a scientific model and is thus particularly applicable to the investigation of empirical cause-effect relationships. It is noted, however, that standard Granger causality tests are designed for continuous-valued data and rely on frequentist hypothesis testing.
The goal of this paper is to develop a flexible and parsimonious model of categorical time series in a Bayesian nonparametric setting for fusion of correlated information from heterogeneous sources (e.g., sensors of possibly different modalities), which can be used for sequential classification and causal inference. From this perspective, major contributions of the paper are delineated as follows: 1. By introducing latent variables and sparsity inducing priors, a flexible and parsimonious model is developed for fusion of correlated information from heterogeneous sources (e.g., sensors of possibly different modalities), which can be used to improve the performance of sequential classification tasks. 2. By testing the dimension of latent variables in the setting of Bayes factor analysis [13], Granger causality [11] is extended to categorical time series. 3. Validation of the above concept with experimental data, generated from a swirl-stabilized lean-premixed laboratory-scale combustor [14], for real-time detection of thermoacoustic instabilities. 4. Testing of the underlying algorithm with public economics data to infer the causal relationship between two categorical time series.
The paper is organized into eight sections including the current one. Section 2 introduces the concept of Granger causality and develops the model. Section 3 discusses the algorithm for posterior computation using Gibbs sampling, along with hypothesis testing using Bayes factor analysis. Section 4 presents the sequential classification algorithm based on the proposed model. The underlying algorithms are tested with simulation data in Section 5, while Section 6 validates the proposed method with experimental data, collected from a swirl-stabilized lean-premixed laboratory-scale combustor, for early detection of thermoacoustic instabilities. Section 7 validates the proposed concept on publicly available economics data. Section 8 concludes the paper and provides a few recommendations for future research. The nomenclature and list of acronyms are provided at the end, before the list of references.

Model Development
This section first introduces the concept of Granger causality and the corresponding regression model; next, the algebraic and statistical specifications of the underlying model are elaborated. For a pair of time series θ and y, four cases of Granger causality are possible: 1. θ Granger-causes y but not vice versa; 2. y Granger-causes θ but not vice versa; 3. θ and y Granger-cause each other; 4. θ does not Granger-cause y and vice versa.

Remark 2.
If the explanatory power of θ_{t−1}, . . . , θ_{t−D_θ} in the regression is significant, then the null hypothesis (that θ does not Granger-cause y) is rejected and the alternative hypothesis (that θ Granger-causes y) is accepted. Hypothesis tests on the significance of time lags are elaborated later in Equation (15) (see Section 3.2).

Remark 3.
If y and θ are correlated in the sense of Granger causality, the information contained in one source can be used to predict the future values in another source. Accordingly, information fusion of different sources enables fast and accurate prediction because of Granger-causality. It is noted that if the information contained in two sources is statistically independent, then information fusion cannot enhance prediction accuracy.

Conditional Tensor Factorization
This subsection addresses fusion of different sources of information by making use of the concept of a conditional probability tensor, which was first reported in [15]; a formal definition follows.

Definition 2.
(Conditional probability tensor) Let C_0 denote the number of categories of the (one-dimensional) variable y_t, and let C_j denote the number of categories of z_{j,t} for j = 1, . . . , q, where q is the number of predictors. The quantity p(y_t | z_t) is treated as a (q + 1)th-order tensor in the C_0 × C_1 × · · · × C_q dimensional space, hereafter called the conditional probability tensor.
Let C_y and C_θ respectively denote the numbers of categories of the variables y and θ. It follows from Definition 2 that C_1 = · · · = C_{D_y} = C_y and C_{D_y+1} = · · · = C_q = C_θ. Then, each one of these conditional probability tensors has a higher-order singular value decomposition (HOSVD) of the following form [15]:

p(y_t \mid z_t) = \sum_{s_1=1}^{k_1} \cdots \sum_{s_q=1}^{k_q} \lambda_{s_1 \ldots s_q}(y_t) \prod_{j=1}^{q} \omega^{(j)}_{s_j}(z_{j,t})   (2)

where 1 ≤ k_j ≤ C_j for j = 1, . . . , q, and each of the parameters λ_{s_1...s_q}(y_t) and ω^{(j)}_{s_j}(z_{j,t}) is non-negative while the following constraints are satisfied:

\sum_{y_t=1}^{C_0} \lambda_{s_1 \ldots s_q}(y_t) = 1 for each (s_1, . . . , s_q)   (3)

\sum_{s_j=1}^{k_j} \omega^{(j)}_{s_j}(z_{j,t}) = 1 for each j and each z_{j,t}   (4)

Remark 4.
Since there exists a factorization as in Equation (2) for each one of the conditional probability tensors, the two constraints in Equations (3) and (4) are not restrictive. Furthermore, it is ensured that \sum_{y_t=1}^{C_0} p(y_t \mid z_t) = 1.
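As a concrete illustration of the factorization in Equation (2) and the constraints of Equations (3) and (4), the sketch below builds a small conditional probability tensor from a random core tensor λ and mode weights ω, and verifies that the result is a valid conditional distribution. All dimensions and values are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: q = 2 predictors, C0 = 2 response categories,
# C1 = C2 = 3 predictor categories, latent dimensions k1 = k2 = 2.
q, C0 = 2, 2
C = [3, 3]
k = [2, 2]

# Core tensor lambda: one probability vector over C0 per latent index tuple,
# which satisfies the constraint of Eq. (3) by construction.
lam = rng.dirichlet(np.ones(C0), size=(k[0], k[1]))            # shape (2, 2, C0)

# Mode weights omega^(j): for each predictor level, a probability vector over
# the k_j latent classes, which satisfies the constraint of Eq. (4).
omega = [rng.dirichlet(np.ones(k[j]), size=C[j]) for j in range(q)]  # (C_j, k_j)

def cond_prob(y, z):
    """p(y | z) via the factorization of Eq. (2): sum over latent indices."""
    total = 0.0
    for s1 in range(k[0]):
        for s2 in range(k[1]):
            total += lam[s1, s2, y] * omega[0][z[0], s1] * omega[1][z[1], s2]
    return total

# The factorized quantities are valid conditional probabilities: they sum
# to one over the response categories for every predictor configuration.
for z in [(0, 0), (1, 2), (2, 1)]:
    assert abs(sum(cond_prob(y, z) for y in range(C0)) - 1.0) < 1e-9
print("all conditional distributions sum to 1")
```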

Bayesian Nonparametric Modeling
In order to build a statistically interpretable model, two techniques are used to convert the tensor factorization in Equation (2) into a Bayes network: (1) introducing latent class-allocation variables, and (2) assigning sparsity-inducing priors. To this end, T pairs of the variable and its predictors are collected and arranged as the dataset {y_t, z_t}_{t=1}^{T}.
The conditional probability p(y_t | z_t), factorized as in Equation (2), is then reorganized in the following form:

p(y_t \mid z_t) = \sum_{x_{1,t}=1}^{k_1} \cdots \sum_{x_{q,t}=1}^{k_q} \lambda_{x_{1,t} \ldots x_{q,t}}(y_t) \prod_{j=1}^{q} \omega^{(j)}_{x_{j,t}}(z_{j,t})   (5)

where x_t ≡ (x_{1,t}, . . . , x_{q,t}) denotes the latent class-allocation variables. For index j = 1, . . . , q and index t = 1, . . . , T, it then follows that

x_{j,t} \mid z_{j,t} \sim \text{Mult}\big(\omega^{(j)}_1(z_{j,t}), \ldots, \omega^{(j)}_{k_j}(z_{j,t})\big)   (6)

y_t \mid x_t \sim \text{Mult}\big(\lambda_{x_{1,t} \ldots x_{q,t}}(1), \ldots, \lambda_{x_{1,t} \ldots x_{q,t}}(C_0)\big)   (7)

where Mult(•) is the multinomial distribution [16] and ω^{(j)} ≡ {ω^{(j)}_{s_j}(c)} is the mixture probability matrix, each column ω^{(j)}(c) of which is a probability vector itself (i.e., it sums to 1). Moreover, λ ≡ {λ_{s_1,...,s_q}} is a conditional probability tensor, where λ_{s_1,...,s_q} ≡ {λ_{s_1,...,s_q}(c)}_{c=1}^{C_0} is a probability vector for each string (s_1, . . . , s_q). The hierarchical reformulation of HOSVD above illustrates the following features of the model in Equation (5):
• Soft clustering of each one of the predictors z_j ≡ {z_{j,t}}_{t=1}^{T} is implemented following Equation (6). This allows for inheritance of statistical strength across different categories.
• The distribution of the variable y_t is determined by a probability tensor λ of reduced order, following Equation (7).
• In order to capture the interactions among different predictors, the class-allocation variables x_j ≡ {x_{j,t}}_{t=1}^{T} work in an implicit and parsimonious way by allowing the latent populations indexed by (s_1, . . . , s_q) to be shared across various state combinations of the predictors.
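The hierarchical reformulation in Equations (6) and (7) can be sketched by drawing the latent class-allocation variable first and then the response; marginalizing over the latent class empirically recovers the mixture of Equation (5). A minimal sketch with a single predictor and illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions (illustrative): one predictor with C1 = 3 levels,
# k1 = 2 latent classes, C0 = 2 response categories.
C0, C1, k1 = 2, 3, 2
omega = rng.dirichlet(np.ones(k1), size=C1)   # rows: p(x | z), shape (C1, k1)
lam   = rng.dirichlet(np.ones(C0), size=k1)   # rows: p(y | x), shape (k1, C0)

def sample_y(z, n):
    """Draw y via the latent class-allocation variable x (Eqs. (6)-(7))."""
    x = rng.choice(k1, size=n, p=omega[z])     # x_t | z_t ~ Mult(omega(z_t))
    return np.array([rng.choice(C0, p=lam[xi]) for xi in x])

z = 1
ys = sample_y(z, 20000)
# Marginalizing the latent class recovers Eq. (5): p(y|z) = sum_x omega * lam.
exact = omega[z] @ lam                         # shape (C0,)
emp = np.bincount(ys, minlength=C0) / len(ys)
assert np.allclose(emp, exact, atol=0.02)
print("empirical:", np.round(emp, 3), "exact:", np.round(exact, 3))
```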

Remark 5.
Here it is critical to distinguish between two different concepts: (1) the number of clusters k*_j actually occupied by the latent class variables x_j, and (2) the dimension k_j of the probability vectors ω^{(j)}(c) in the mixture probability matrix. The former represents the number of groups supported by the data and is no larger than the latter. It should be noted that k*_j determines whether the predictor z_j should be included in the model, because p(y_t | z_t) does not change with z_{j,t} if z_j has just a single latent cluster. Thus, the significance of a particular predictor can be tested based on k*_j, as elaborated later in Section 3.2.
In many real-world applications, the tensor λ often has more components than needed, since the product \prod_{j=1}^{q} k_j can be large even for modest values of q and C_j. To deal with this problem, the tensor λ is clustered across the different combinations of (s_1, . . . , s_q) nonparametrically by imposing a Pitman-Yor process prior [17]. Using the stick-breaking representation of the Pitman-Yor process [18], it follows that

λ_{s_1,...,s_q} = λ*_{h_{s_1 \ldots s_q}},   h_{s_1 \ldots s_q} \sim \text{Mult}(π_1, π_2, . . .),

π_l = V_l \prod_{m<l} (1 − V_m),   V_l \sim \text{Beta}(1 − b, a + lb),   l = 1, 2, . . . ,

λ*_l \sim \text{Dir}(α, . . . , α),   ω^{(j)}(c) \sim \text{Dir}(β_j, . . . , β_j),

where Dir(•) and Beta(•) represent the Dirichlet and Beta distributions [16], respectively, and λ*_l ≡ (λ*_l(1), . . . , λ*_l(C_0)). Moreover, 0 ≤ b < 1 and a > −b. Finally, a sparsity-inducing exponential prior is placed on each latent dimension k_j:

p(k_j = k) ∝ \exp(−µ_j k),   k ∈ {1, . . . , C_j}.   (14)

Remark 6.
As the parameter µ_j grows larger, the exponential prior in Equation (14) assigns increasing probability to smaller values of k_j; the prior becomes uniform on {1, . . . , C_j} when µ_j is zero. A common prior belief is that time lags further back in the past have a vanishing impact on the distribution of the current response variable. To impose this belief, larger values of µ_j can be assigned to more distant time lags.
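A minimal sketch of the priors above, assuming the standard truncated stick-breaking construction of the Pitman-Yor process and the exponential prior on k_j of Equation (14); the function names, truncation level, and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking_py(a, b, L, rng):
    """Truncated stick-breaking weights of a Pitman-Yor(a, b) process:
    V_l ~ Beta(1 - b, a + l*b), pi_l = V_l * prod_{m<l} (1 - V_m).
    The last stick absorbs the leftover mass so the truncation sums to 1."""
    assert 0 <= b < 1 and a > -b
    V = np.array([rng.beta(1 - b, a + (l + 1) * b) for l in range(L)])
    V[-1] = 1.0
    return V * np.concatenate(([1.0], np.cumprod(1 - V[:-1])))

def kj_prior(C_j, mu_j):
    """Exponential prior of Eq. (14): p(k_j = k) proportional to exp(-mu_j k)
    on k in {1, ..., C_j}; mu_j = 0 recovers a uniform prior."""
    w = np.exp(-mu_j * np.arange(1, C_j + 1))
    return w / w.sum()

pi = stick_breaking_py(a=1.0, b=0.0, L=25, rng=rng)  # b = 0: Dirichlet process
assert abs(pi.sum() - 1.0) < 1e-9

assert np.allclose(kj_prior(3, 0.0), np.ones(3) / 3)   # uniform when mu_j = 0
p = kj_prior(3, 1.0)
assert p[0] > p[1] > p[2]                              # favors smaller k_j
print("pi[:5] =", np.round(pi[:5], 3), " p(k_j) =", np.round(p, 3))
```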
By combining Equations (6)-(14), a Bayes network representation of the model is obtained; Figure 1 illustrates its structure.

Estimation and Inference
This section presents the details of an algorithm for computing posteriors as well as Bayesian hypothesis testing by using Bayes factors.

Posterior Computation
Although the posterior distribution has no closed analytical form, inference on the corresponding Bayes network can still be performed by Gibbs sampling. Because the dimension of ω^{(j)} varies with k_j, constructing a stationary Markov chain by plain Gibbs sampling is difficult. To infer a model with variable dimensions, a common analytical tool is reversible-jump Markov chain Monte Carlo (MCMC) [19], which performs trans-dimensional exploration of the model space.
Product partition modeling [20,21] alleviates the difficulties of trans-dimensional modeling by constructing a stationary Markov chain on the clustering space. In the proposed method, ω^{(j)} is integrated out so that k_j can be sampled directly from p(k_j | x_j, z_j), which yields a partially collapsed Gibbs sampler [22] that alternates between two spaces: (1) the space with all the variables and (2) the space with all the variables except ω^{(j)}. To compute the posterior probabilities of the Pitman-Yor process, the infinite-dimensional tensors π and λ are truncated after their Lth components, as performed in [18]; an appropriate L needs to be chosen to achieve the desired accuracy. Other than this, the posterior sampling is rather straightforward. The detailed process is presented in Algorithm 1, where x ≡ {x_t}_{t=1}^{T} and ξ denotes the collection of all model parameters.

To successfully run Algorithm 1, certain hyperparameters need to be chosen. The determination of µ_j and L has already been discussed along with its implications, so the focus here is on the remaining hyperparameters. Among them, a and b determine the clustering behavior of the Pitman-Yor process; they are set to 1 and 0 in this case, rendering it a Dirichlet process, which is sufficient for the applications discussed in this paper. The parameters α and β_j are hyperparameters of the Dirichlet distributions and serve the role of pseudo-counts; their values reflect the user's prior belief and are often chosen to be small in the absence of additional information that would justify larger values. In the following sections, they are chosen as α = 1 and β_j = 1/C_j across the different applications.

Bayes Factor and Hypothesis Testing
This subsection discusses hypothesis testing techniques for the significance of the predictors in the regression of Equation (1). Such tests can be used to make causal inferences, to provide a better understanding of the model, and to better allocate computational resources for the sequential classification task by including only the important predictors and discarding the unimportant ones. As previously noted, a particular predictor z_j is considered important if and only if the number of clusters k*_j formed by its corresponding latent class-allocation variables x_j is greater than 1.
Let Λ ⊂ {1, . . . , q} be the set of predictors under consideration. To perform the Bayesian hypothesis testing, one only needs to compute the Bayes factor [23] in favor of H_1: k*_j > 1 for some j ∈ Λ against H_0: k*_j = 1 for all j ∈ Λ, given by

BF_{10} = \frac{p(H_1 \mid y, z)}{p(H_0 \mid y, z)} \cdot \frac{p(H_0)}{p(H_1)}   (15)

where y ≡ {y_t}_{t=1}^{T} and z ≡ {z_t}_{t=1}^{T}; the posterior probabilities p(H_0 | y, z) and p(H_1 | y, z) are numerically computed as the fractions of posterior samples in which the k*_j's conform to H_0 and H_1, respectively; and the prior probabilities p(H_0) and p(H_1) = 1 − p(H_0) are evaluated under the prior distribution of the k*_j's. Specifically, to test whether θ Granger-causes y, it is only necessary to choose Λ as the set of indices corresponding to the time lags of θ in the predictor vector z_t.
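In practice, the posterior probabilities in Equation (15) are estimated by counting the MCMC samples that conform to each hypothesis. The sketch below applies this recipe to synthetic draws of the cluster counts k*_j (illustrative stand-ins, not posterior samples from the actual model):

```python
import numpy as np

def bayes_factor(kstar_samples, test_idx, prior_h0):
    """BF_10 favoring H1 (some k*_j > 1, j in test_idx) over H0 (all equal 1).

    kstar_samples: (n_samples, q) array of sampled cluster counts k*_j.
    Posterior probabilities are the fractions of MCMC samples conforming
    to each hypothesis, as in Eq. (15).
    """
    cols = kstar_samples[:, test_idx]
    h0_frac = np.mean(np.all(cols == 1, axis=1))    # p(H0 | y, z)
    h1_frac = 1.0 - h0_frac                         # p(H1 | y, z)
    if h0_frac == 0.0:
        return np.inf                               # overwhelming evidence for H1
    return (h1_frac / h0_frac) * (prior_h0 / (1.0 - prior_h0))

# Synthetic draws (illustrative): predictor 0 almost always occupies more
# than one cluster, predictor 1 almost never does.
rng = np.random.default_rng(3)
samples = np.column_stack([
    rng.choice([1, 2], size=5000, p=[0.02, 0.98]),
    rng.choice([1, 2], size=5000, p=[0.98, 0.02]),
])
bf_pred0 = bayes_factor(samples, [0], prior_h0=0.5)
bf_pred1 = bayes_factor(samples, [1], prior_h0=0.5)
assert bf_pred0 > 20 > bf_pred1    # strong evidence only for predictor 0
print(f"BF_10 predictor 0: {bf_pred0:.1f}, predictor 1: {bf_pred1:.3f}")
```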

Sequential Classification
In Section 3, a Gibbs sampling algorithm has been developed to infer the posterior distribution of the model parameters given the observed data. This section proposes a classification algorithm for dynamical systems based on the posterior predictive distribution, which is derived by marginalizing the likelihood of unobserved data over the posterior distribution of the model parameters. The algorithm consists of two phases: (1) an off-line training phase and (2) an online testing phase. Suppose there are M different classes of dynamical systems of interest, C_i, i = 1, 2, . . . , M, and that a training set ^{(i)}D_{T_i} is collected for each class. The requirement on these datasets is that the data are categorical (e.g., quantized categories from continuous data) and that, for each class, they have identical numbers of categories for the predictors and the response variable.
During the training phase, the training set ^{(i)}D_{T_i} is used to compute the posterior samples for each class C_i, as previously described in Algorithm 1. Then, during the testing phase, the test set D_T is classified: among the M classes, one is identified as the class to which D_T most likely belongs. To do so, the following conditional probability is computed:

p(D_T \mid {}^{(i)}D_{T_i}) = \prod_{t} p(y_t \mid z_t; {}^{(i)}D_{T_i})   (16)

where each factor is the posterior predictive probability

p(y_t \mid z_t; {}^{(i)}D_{T_i}) = \int p(y_t \mid z_t, ξ)\, p(ξ \mid {}^{(i)}D_{T_i})\, dξ   (17)

which is approximated by averaging over the posterior samples of the model parameters ξ. Following the calculation of the conditional probabilities p(D_T | ^{(i)}D_{T_i}), the posterior probability of the test data D_T belonging to class C_i, denoted as p(C_i | D_T), is calculated as

p(C_i \mid D_T) = \frac{p(D_T \mid {}^{(i)}D_{T_i})\, p(C_i)}{\sum_{m=1}^{M} p(D_T \mid {}^{(m)}D_{T_m})\, p(C_m)}

where p(C_i) is the prior probability of class C_i. The classification result is then generated by selecting the class that maximizes p(C_i | D_T). The prior probability p(C_i) reflects the user's subjective beliefs and can also be designed to optimize some objective criterion. The detection algorithm is called "sequential" because the conditional probability p(D_T | ^{(i)}D_{T_i}) is evaluated one observation at a time, as shown in Equation (16). In real-world applications, the values of p(y_t | z_t; ^{(i)}D_{T_i}) in Equation (17) are often precomputed and stored for the various values of (y_t, z_t) in order to achieve faster computation.
For the binary classification case, the likelihood ratio test [24] can be constructed as

\frac{p(D_T \mid {}^{(1)}D_{T_1})}{p(D_T \mid {}^{(0)}D_{T_0})} \;\gtrless\; Θ

where Θ is a threshold. To choose the threshold Θ, one may rely on the receiver operating characteristic (ROC). ROC curves are obtained by varying Θ to trade off the probability of (successful) detection p_D = Prob(decide 1 | 1 is true) against the probability of false alarm p_F = Prob(decide 1 | 0 is true). Using these ROC curves, an optimal combination of p_D and test-data length for a given p_F can be selected, which then determines the threshold Θ.
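A minimal sketch of the sequential likelihood-ratio decision, assuming the per-symbol posterior predictive probabilities of each class have been precomputed as suggested above; the probability values and threshold are illustrative.

```python
import math

# Toy posterior-predictive symbol probabilities for two classes
# (illustrative values only, not taken from the paper's data).
p_stable   = {0: 0.8, 1: 0.2}
p_unstable = {0: 0.3, 1: 0.7}

def sequential_llr(symbols, log_threshold=0.0):
    """Accumulate the log-likelihood ratio one observation at a time,
    deciding 'unstable' whenever the running ratio exceeds the threshold."""
    llr, decisions = 0.0, []
    for s in symbols:
        llr += math.log(p_unstable[s]) - math.log(p_stable[s])
        decisions.append("unstable" if llr > log_threshold else "stable")
    return llr, decisions

llr, decisions = sequential_llr([1, 1, 0, 1, 1])
assert llr > 0 and decisions[-1] == "unstable"
print(f"final log-LR = {llr:.2f}, decision = {decisions[-1]}")
```

Working in the log domain avoids numerical underflow as the product in Equation (16) accumulates over long test sequences.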

Numerical Example
This section presents a numerical example which utilizes the proposed method to infer causal relationships between two categorical time series. In this example, the data generation model is known and thus can be compared with the results from the proposed algorithm for evaluation of performance. The data generation details are given below.
In this particular numerical example, there are two binary symbol sequences y_t and θ_t. The sequence y_t is generated from a known Markov model p(y_t | y_{t−1}, y_{t−3}, y_{t−4}), in which only the time lags y_{t−1}, y_{t−3}, y_{t−4} are important predictors. The sequence θ_t is generated from another Markov model p(θ_t | θ_{t−1}, θ_{t−2}, y_{t−1}, y_{t−3}), where θ_{t−1}, θ_{t−2}, y_{t−1}, and y_{t−3} are the key predictors. In other words, the variable y Granger-causes the variable θ but not the other way around, because y depends only on its own past. Table 1 lists the transition probabilities for y_t, where it is seen that the predictors are y_{t−1}, y_{t−3}, y_{t−4} only; Table 2 lists the transition probabilities for θ_t, where the predictors are y_{t−1}, y_{t−3}, θ_{t−1}, and θ_{t−2} only. Two sequences {y_t}_{t=1}^{1005} and {θ_t}_{t=1}^{1005} are collected simultaneously. Based on the prior belief that y_{t−D} and θ_{t−D} are no longer important for predicting y_t and θ_t when D is greater than 5, the predictors for both y_t and θ_t are set as z_t ≡ (y_{t−1}, . . . , y_{t−5}, θ_{t−1}, . . . , θ_{t−5}). From these datasets, 1000 training samples are chosen for testing the proposed algorithm.
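The data-generation mechanism above can be sketched as follows, with an illustrative stand-in for Table 1 (the actual transition probability values of the paper are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative stand-in for Table 1: p(y_t = 1 | y_{t-1}, y_{t-3}, y_{t-4}).
p_y1 = {(a, b, c): 0.2 + 0.5 * ((a + b + c) % 2)
        for a in (0, 1) for b in (0, 1) for c in (0, 1)}

def simulate_y(T, rng):
    """Generate a binary sequence whose law depends on lags 1, 3, 4 only."""
    y = list(rng.integers(0, 2, size=4))          # arbitrary initial symbols
    for t in range(4, T):
        p1 = p_y1[(y[t - 1], y[t - 3], y[t - 4])]
        y.append(int(rng.random() < p1))
    return np.array(y)

y = simulate_y(1005, rng)
assert y.shape == (1005,) and set(np.unique(y)) <= {0, 1}
print("generated", len(y), "symbols; mean =", round(float(y.mean()), 3))
```

The θ sequence would be generated analogously, with its transition table indexed by (θ_{t−1}, θ_{t−2}, y_{t−1}, y_{t−3}).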
To calculate the posteriors of p(y_t | z_t) using Algorithm 1, µ_j is set to 1 for all j = 1, . . . , 10, since there is no other prior knowledge. Initially, 200,000 samples are discarded as burn-in. The next 50,000 samples are downsampled by keeping every 5th sample to reduce autocorrelation. Figure 2 summarizes the results: Figure 2a displays the log-likelihood over 10,000 iterations of the model, and Figure 2b illustrates that the proposed method correctly identifies all the important predictors. For this example, the key predictors are lags 1, 3, and 4, and the identified predictors (y_{t−1}, y_{t−3}, and y_{t−4}) agree with the ground truth. Figure 2c shows the relative frequency of the number of important predictors. Furthermore, the proposed method also creates a parsimonious representation of the model, as seen in Figure 2d,e; as previously discussed in Section 2.2, the tensor λ_{s_1...s_q}(y_t) has more components than needed, but it can be clustered nonparametrically to reduce the number of combinations. Figure 2f shows the Bayes factors, computed as described in Section 3.2, for all of the predictors. The Bayes factor BF_10 in Equation (15) can be regarded as the evidence against H_0: with the commonly used threshold of 20, predictors whose BF_10 exceeds this value have strong supporting evidence, and BF_10 > 150 indicates even stronger evidence against the hypothesis H_0 [13]. It should be noted that when the inclusion proportions of certain lags in Figure 2b are equal to 1, their corresponding Bayes factors in Figure 2f tend to infinity (as for predictors 1, 3, and 4 in this example). Similarly, Figure 3 shows the results obtained with the same dataset, but for estimating p(θ_t | z_t) instead of p(y_t | z_t).
Besides the ability to correctly identify the structure of the model, the proposed method can also estimate the transition probabilities. Figure 4 illustrates two arbitrarily selected cases from Tables 1 and 2. Setting y_{t−1} = 0, y_{t−3} = 1, and y_{t−4} = 0, Table 1 gives a transition probability of (y_t = 1) equal to 0.70. Similarly, setting y_{t−1} = 1, y_{t−3} = 0, θ_{t−1} = 1, and θ_{t−2} = 0, Table 2 gives a transition probability of (θ_t = 1) equal to 0.50. In Figure 4, the transition probabilities estimated by the proposed method are displayed along with their running means as well as their 5% and 95% percentiles. Both subplots of Figure 4 show that the running mean of the estimated transition probability is close to the true transition probability specified in the data-generation tables. Even with a limited amount of data, the proposed method can not only estimate the transition probabilities but also provide uncertainty bounds in terms of the respective quantiles. The causal relationship between y and θ is identified by Bayes factor analysis (see Section 3.2). The results are summarized in Table 3 and show that y Granger-causes θ but not the other way around, which is in line with the ground truth. Table 3. Hypothesis testing of Granger causality in the numerical example.

Null Hypothesis | Bayes Factor BF_10
θ does not Granger-cause y | 0.43
y does not Granger-cause θ | Infinity

Validation with Experimental Data: The Combustor Apparatus
This section validates the nonparametric regression model with experimental data generated from a swirl-stabilized lean-premixed laboratory-scale combustor apparatus [14].

Background and Description of the Experimental Procedure
This subsection presents a brief background of thermoacoustic instabilities in the combustor apparatus along with the experimental details for data collection. Thermoacoustic instabilities occur from highly nonlinear coupled phenomena that evolve from mutual interactions among thermofluid dynamics, unsteady heat release, and acoustics of the combustor chamber. The resulting self-sustained high-amplitude pressure oscillations often impose severe negative impacts on the performance and operational life of gas turbine engines [25][26][27].
Technical literature abounds with studies on combustion instabilities and their early detection by time series analysis, especially by using Markov chains [28,29]. However, current methods are largely limited to individual investigations of pressure or chemiluminescence measurements, and have apparently not taken the machine-learning-theoretic approach to information fusion into consideration; consequently, fast detection of thermoacoustic instabilities may not be achieved to the full extent based on the individual information of different sources only. Moreover, parameter estimation is difficult in current methods, even for moderately high-order Markov chains, due to the paucity of data, let alone a more sophisticated information fusion model. As for the detection procedure, empirical thresholds are often used in existing literature, without taking advantage of methods in statistical detection theory (such as sequential testing techniques); therefore, those applications are very limited in real-time detection cases. Figure 5 presents a schematic diagram of the combustor apparatus [14] that consists of an inlet section, an injector, a combustion chamber, and an exhaust section. The combustor chamber consists of an optically-accessible quartz section followed by a variable-length steel section. Experiments have been conducted at 62 different operating conditions by varying the equivalence ratio and percentage of pilot fuel, as listed in Table 4. Under each operating condition, 8 s of pressure and chemiluminescence measurements have been collected at the sampling rate of 8192 Hz, where stable and/or unstable modes are recorded along with each time series data. To alleviate the problem of (possible) oversampling, the pressure and chemiluminescence measurements from combustors are first downsampled, which is obtained from first minimum of the average mutual information [30]. 
Then, the continuously varying time series data for both stable and unstable modes are quantized using maximum entropy partitioning [31,32] with a ternary alphabet Σ = {1, 2, 3}. The quantized pressure measurements are denoted as y t and the chemiluminescence measurements are denoted as θ t at time instant t.
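Maximum entropy partitioning places the cell boundaries at empirical quantiles, so that each symbol of the alphabet is (approximately) equally likely. A minimal sketch on synthetic stand-in data (the actual pressure measurements are not reproduced here):

```python
import numpy as np

def max_entropy_partition(x, n_symbols=3):
    """Quantize a continuous series so each symbol is (near-)equally likely.

    Maximum entropy partitioning places cell boundaries at empirical
    quantiles, which maximizes the entropy of the symbol distribution.
    """
    qs = np.quantile(x, np.linspace(0, 1, n_symbols + 1)[1:-1])
    return np.digitize(x, qs) + 1        # symbols in {1, ..., n_symbols}

rng = np.random.default_rng(5)
pressure = rng.normal(size=9000)         # stand-in for downsampled pressure data
symbols = max_entropy_partition(pressure, n_symbols=3)
counts = np.bincount(symbols)[1:]
assert symbols.min() == 1 and symbols.max() == 3
assert counts.max() - counts.min() <= 2  # near-uniform symbol occupancy
print("symbol counts:", counts)
```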

Training Phase
This subsection describes the details of training the nonparametric regression model, wherein 500 samples have been used after downsampling the quantized pressure time series data under stable and unstable conditions. The maximum memory D of each of y_t and θ_t in this dataset is observed to be generally limited to 5 for both the stable and unstable cases. Hence, the predictors of y_t or θ_t are set to z_t ≡ (y_{t−1}, y_{t−2}, . . . , y_{t−5}, θ_{t−1}, θ_{t−2}, . . . , θ_{t−5}), and the corresponding regression model is hereafter referred to as the "full-order model". Since y_t and θ_t each have three categories, it follows that C_y = C_θ = 3.
To compute the posteriors as in Algorithm 1, values are assigned to µ_j for j = 1, . . . , 10. After 200,000 samples are discarded during the burn-in period, the remaining 50,000 samples are downsampled by keeping every 5th sample to reduce autocorrelation. The Gibbs sampling results for the pressure data, p(y_t | y_{t−1}, . . . , y_{t−5}, θ_{t−1}, . . . , θ_{t−5}), are shown in Figure 6a,b for a stable mode and in Figure 6c,d for an unstable mode. Similarly, the Gibbs sampling results for the chemiluminescence data, p(θ_t | y_{t−1}, . . . , y_{t−5}, θ_{t−1}, . . . , θ_{t−5}), are shown in Figure 7a,b for a stable mode and in Figure 7c,d for an unstable mode. Figures 6a,c and 7a,c show the log-likelihood across iterations for the pressure and chemiluminescence data under stable and unstable conditions, respectively, while Figures 6b,d and 7b,d illustrate the corresponding Bayes factors of the predictors. Based on the Bayes factor analysis, the important predictors for the stable and unstable pressure data are identified (Figure 6b,d). Using the identical set of hyperparameters and number of iterations, Gibbs sampling has also been performed on the same dataset with the pressure data y_t only; this is referred to as the "reduced-order model" in the following text, and in this case the predictors are set as z_t ≡ (y_{t−1}, y_{t−2}, . . . , y_{t−5}). The stable and unstable cases are shown in Figure 8a-d, respectively. The important predictors for y_t under this reduced-order model are y_{t−2}, y_{t−4}, and y_{t−5} for the stable mode, and y_{t−1}, y_{t−2}, y_{t−4}, and y_{t−5} for the unstable mode. Similarly, for the stable chemiluminescence data, the important predictors are identified as y_{t−1}, y_{t−3}, y_{t−4}, y_{t−5}, θ_{t−1}, θ_{t−2}, θ_{t−3}, θ_{t−4}, and θ_{t−5}, while those for the unstable chemiluminescence data are y_{t−1}, y_{t−2}, θ_{t−1}, θ_{t−2}, θ_{t−3}, θ_{t−4}, and θ_{t−5}.

Granger Causality
To identify the Granger causal relationship between pressure and chemiluminescence data, Bayes factor analysis has been performed for both stable and unstable cases as described in Section 3.2. The results are summarized in Table 5, which show that pressure and chemiluminescence measurements Granger-cause each other under both stable and unstable conditions; this implies that fusion of these two measurements can enhance the accuracy of prediction. This kind of mutual interaction between pressure and chemiluminescence measurements could be caused by a third unknown physical quantity, the exploration of which is a topic of future research.

Sequential Classification
For evaluation of the sequential classification performance in thermoacoustic instability identification, 100 instances of 50-sample datasets, which are not included in the training set, have been selected (also from the downsampled quantized pressure measurements for both stable and unstable modes). Figure 9 exhibits the profiles of the posterior probability of each class as a function of the length of the observed data, where the top panel (Figure 9a) uses the full-order model and the bottom panel (Figure 9b) uses the reduced-order model for the same test data sequence. While the test sequences are correctly classified by both models, the reduced-order model is slower than the full-order model, which incorporates more information. Figure 9. Posterior probabilities using different models. Figure 10 shows the receiver operating characteristic (ROC) curves of the proposed detection algorithm for different lengths of test data. These ROC curves are plotted for both the full-order and reduced-order models and show that, on the same test dataset, the full-order model achieves better detection performance in terms of the area under the ROC curve. In other words, the full-order model may achieve the same performance as the reduced-order model in a shorter time, which is desirable for active control of thermoacoustic instabilities in real time. It is also observed that the ROC curves improve considerably (i.e., move toward the top-left corner) as the length of the test data is increased from 5 to 9. This is expected because the information content increases monotonically with the length of the test data, and hence better results are obtained.
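The ROC construction described above can be sketched by sweeping the threshold over likelihood-ratio scores of the two classes; the score distributions below are illustrative stand-ins for the outputs of the detection algorithm, not the combustor results.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative log-likelihood-ratio scores for 100 stable (class 0) and
# 100 unstable (class 1) test sequences; larger scores favor class 1.
scores_h0 = rng.normal(0.0, 1.0, size=100)
scores_h1 = rng.normal(2.0, 1.0, size=100)

def roc_points(scores_h0, scores_h1):
    """Sweep the threshold over all observed scores to trace the ROC:
    p_F = Prob(decide 1 | 0 true), p_D = Prob(decide 1 | 1 true)."""
    thresholds = np.sort(np.concatenate([scores_h0, scores_h1]))[::-1]
    p_f = np.array([(scores_h0 > t).mean() for t in thresholds])
    p_d = np.array([(scores_h1 > t).mean() for t in thresholds])
    return p_f, p_d

p_f, p_d = roc_points(scores_h0, scores_h1)
# Area under the ROC curve by the trapezoidal rule.
auc = np.sum(np.diff(p_f) * (p_d[1:] + p_d[:-1]) / 2)
assert 0.5 < auc <= 1.0          # well above chance for separated classes
print(f"AUC = {auc:.3f}")
```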

Validation with Economics Data
This section validates the nonparametric regression model with publicly available real-world economics data. Specifically, monthly data of the U.S. consumer price index (CPI) and the U.S. Dollar London Interbank Offered Rate (LIBOR) with one-month maturity, from January 1986 to December 2016, are used. It is noted that: (i) the U.S. CPI is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services; and (ii) the U.S. Dollar LIBOR is a benchmark for short-term interest rates around the world, which is not a monetary measure associated with any particular country and does not reflect any institutional mandate, in contrast to, e.g., the interest rates set by the Federal Reserve. Economic theory [33] indicates that low interest rates can cause high inflation, and empirical research [34] has investigated the causal relationship between inflation and nominal or real interest rates for the same country or region.
To avoid spurious regression [35], the raw U.S. CPI and U.S. Dollar LIBOR data are preprocessed to achieve stationarity. The raw CPI data are used to calculate the monthly percentage increase, which is then converted into a categorical variable, denoted as y_t, by discretizing to quintiles (i.e., 5-quantiles); the rationale for discretization of (noise-contaminated) continuously varying data is to improve the signal-to-noise ratio [36]. Similarly, the raw LIBOR data are used to calculate the monthly difference, which is then converted into a categorical variable, denoted as θ_t, by discretizing to quintiles. The entire dataset is used for training the proposed algorithm.
To estimate the regression model in Equation (1), based on the assertion that y_{t−D_y} and θ_{t−D_θ} are not important for predicting y_t and θ_t if both D_y and D_θ are greater than 6 (i.e., six months for both CPI and LIBOR), the predictor vector for both y_t and θ_t is set as z_t ≡ (y_{t−1}, y_{t−2}, . . . , y_{t−6}, θ_{t−1}, θ_{t−2}, . . . , θ_{t−6}). To compute the posterior probabilities using Algorithm 1, µ_j is assigned to be j/2 for j = 1, . . . , 6 and (j − 6)/6 for j = 7, . . . , 12. After the initial 100,000 samples are discarded during the burn-in period, the remaining 50,000 samples are downsampled by keeping every 5th sample to reduce autocorrelation. Figures 11 and 12 respectively summarize the results for y_t and θ_t; these figures have characteristics similar to their counterparts in the numerical example of Section 5. The results show that, for y_t (CPI), the important lags are y_{t−1}, y_{t−2}, y_{t−3}, and θ_{t−1}; similarly, for θ_t (LIBOR), the important lags are θ_{t−1}, θ_{t−2}, and θ_{t−3}. These results indicate that LIBOR Granger-causes CPI, but not vice versa. This conclusion is summarized by the Bayes factor analysis in Table 6.

Summary, Conclusions, and Future Work
The proposed Bayesian nonparametric method provides a flexible model for information fusion of heterogeneous, correlated time series data. The proposed method has been validated on a real-world application by using the experimental data collected from a laboratory-scale swirl-stabilized combustor apparatus, as well as on the publicly available economics data. It is demonstrated that the proposed method is capable of enhancing the accuracy for real-time detection of thermoacoustic instabilities and correctly identifying the Granger causal relationship between key economic variables.
There are many promising directions in which the proposed model can be further explored, such as: 1. Variational inference algorithm development for the proposed model [37]. 2. Extension of the present analysis to hidden Markov models (HMM) [38] and information transfer [39]. 3. Exploration of an unknown physical quantity that may cause the appearance of mutual interactions between pressure and chemiluminescence measurements. 4. Investigation of the empirical performance of the proposed approach utilizing extensive simulation studies.

Acknowledgments:
The authors are grateful to Domenic Santavicca and Jihang Li from Pennsylvania State University for kindly providing the experimental data that were used to validate the theoretical results.

Conflicts of Interest:
The authors declare no conflict of interest.

Nomenclature of Pertinent Parameters
a    Hyperparameter of the prior on the probability vector π
b    Hyperparameter of the prior on the probability vector π
C_j  Number of categories of the jth predictor
C_i  ith class of dynamical systems