Ridge-Type Shrinkage Estimation of Seemingly Unrelated Regressions and Analytics of Economic and Financial Data from "Fragile Five" Countries

In this paper, we suggest improved estimation strategies based on preliminary test and shrinkage principles in a seemingly unrelated regression model whose explanatory variables are affected by multicollinearity. To that end, we split the regression coefficient vector of each equation into two parts: one contains the coefficients of the main effects, and the other contains the nuisance effects, which may be close to zero. Two competing models per equation of the system are therefore obtained: one includes all the regression coefficients (full model); the other (sub-model) includes only the coefficients of the main effects, based on the auxiliary information. The preliminary test estimation improves the estimation procedure when there is evidence that the vector of nuisance parameters does not contribute usefully to the model. The shrinkage estimation method shrinks the full model estimator in the direction of the sub-model estimator. We conduct a Monte Carlo simulation study in order to examine the relative performance of the suggested estimation strategies. More importantly, we apply our methodology based on the preliminary test and shrinkage estimations to analyse economic data by investigating the relationship between foreign direct investment and several economic variables in the "Fragile Five" countries between 1983 and 2018.


Introduction
A seemingly unrelated regression (SUR) system, originally proposed by Zellner (1962), comprises multiple individual regression equations that are correlated with each other. Zellner's idea was to improve estimation efficiency by combining several equations into a single system. In contrast to SUR estimation, ordinary least squares (OLS) estimation loses its efficiency and does not produce best linear unbiased estimates (BLUE) when the error terms of the equations in the system are correlated. This method has a wide range of applications to economic and financial data and in other similar areas (Shukur 2002; Srivastava and Giles 1987; Zellner 1962). For example, Dincer and Wang (2011) investigated the effects of ethnic diversity on economic growth, and Williams (2013) studied the effects of financial crises on banks. Since the SUR approach considers multiple related equations simultaneously, a generalized least squares (GLS) estimator is used to take the correlation of errors across these equations into account. Barari and Kundu (2019) reexamined the role of the Federal Reserve in triggering the recent housing crisis with a vector autoregression (VAR) model, which is a special case of the SUR model with lagged variables and deterministic terms as common regressors. One might also consider the correlations of explanatory variables in SUR models. Alkhamisi and Shukur (2008) and Zeebari et al. (2012, 2018) considered a modified version of the ridge estimation proposed by Hoerl and Kennard (1970) for these models. Alkhamisi (2010) proposed two SUR-type estimators by combining SUR ridge regression and restricted least squares methods. These studies demonstrated that ridge SUR estimation is superior to classical estimation methods in the presence of multicollinearity. Srivastava and Wan (2002) considered the Stein-rule estimators of James and Stein (1961) in SUR models with two equations.
In our study, we consider preliminary test and shrinkage estimation, on which more information can be found in Ahmed (2014), in ridge-type SUR models when the explanatory variables are affected by multicollinearity. In a previous paper, we combined penalized estimations in an optimal way to define shrinkage estimation (Ahmed and Yüzbaşı 2016). Gao et al. (2017) suggested the use of the weighted ridge regression model for post-selection shrinkage estimation. Yüzbaşı et al. (2020) gave detailed information about generalized ridge regression for a number of shrinkage estimation methods. Srivastava and Wan (2002) and Arashi and Roozbeh (2015) considered Stein-rule estimation for SUR models. Erdugan and Akdeniz (2016) proposed a restricted feasible SUR estimate of the regression coefficients.
The organization of this paper is as follows: In Section 2, we briefly review the SUR model and some estimation techniques, including the ridge type. In Section 3, we introduce our new estimation methodology. A Monte Carlo simulation is conducted in Section 4, and our economic data are analysed in Section 5. Finally, some concluding remarks are given in Section 6.

Methodology
Consider the i-th equation of a system of M seemingly unrelated regression equations with T observations per equation:

Y_i = X_i β_i + ε_i,  i = 1, 2, . . . , M,  (1)

where Y_i is a T × 1 vector of observations, X_i is a T × p_i full column rank matrix of observations on p_i regressors, and β_i is a p_i × 1 vector of unknown parameters. Equation (1) can be rewritten in stacked form as

Y = Xβ + ε,  (2)

where Y = (Y_1', Y_2', . . . , Y_M')' is the vector of responses and ε = (ε_1', ε_2', . . . , ε_M')' is the vector of disturbances, each of dimension TM × 1; X = diag(X_1, X_2, . . . , X_M) is of dimension TM × p; and β = (β_1', β_2', . . . , β_M')' is of dimension p × 1, with p = ∑_{i=1}^{M} p_i. The disturbance vector ε satisfies

E[ε] = 0 and E[εε'] = Σ ⊗ I = Ω,

where Σ = [σ_ij], i, j = 1, 2, . . . , M, is an M × M positive definite symmetric matrix, ⊗ stands for the Kronecker product, and I is an identity matrix of order T × T. Following Greene (2019), we assume strict exogeneity of X_i, E[ε | X_1, X_2, . . . , X_M] = 0, and homoscedasticity, E[ε_i ε_i' | X_1, X_2, . . . , X_M] = σ_ii I.
Therefore, disturbances are assumed to be uncorrelated across observations, that is, E[ε_it ε_js | X_1, . . . , X_M] = σ_ij if t = s and 0 otherwise, and correlated across equations, that is, E[ε_i ε_j' | X_1, . . . , X_M] = σ_ij I. The OLS and GLS estimators of Model (2) are thus given by

β̂_OLS = (X'X)^(-1) X'Y and β̂_GLS = (X'Ω^(-1)X)^(-1) X'Ω^(-1)Y.

β̂_OLS simply consists of the OLS estimators computed separately for each equation and ignores the correlations between equations, as can be seen in Kuan (2004). Hence, the GLS estimator should be used when correlations exist among equations. However, the true covariance matrix Σ is generally unknown. The solution to this problem is feasible generalized least squares (FGLS) estimation, which uses an estimate Σ̂ of Σ in the GLS formula. In many cases, the residual covariance matrix is calculated by

σ̂_ij = ε̂_i' ε̂_j / T,

where ε̂_i = Y_i − X_i β̂_i represents the residuals from the i-th equation and β̂_i may be the OLS or ridge regression (RR) estimate (X_i'X_i + λI)^(-1) X_i'Y_i with tuning parameter λ ≥ 0. Note that we use the RR solution to estimate Σ in our numerical studies because we assume that two or more explanatory variables in each equation are linearly related. Therefore, with Ω̂ = Σ̂ ⊗ I, the FGLS estimator of the SUR system is

β̂_FGLS = (X'Ω̂^(-1)X)^(-1) X'Ω̂^(-1)Y.

Following Srivastava and Giles (1987) and Zeebari et al. (2012), we first transform Equation (2) using Y* = Ω̂^(-1/2)Y, X* = Ω̂^(-1/2)X, and ε* = Ω̂^(-1/2)ε, in order to retain the information contained in the correlation matrix of cross-equation errors. Hence, Model (2) turns into

Y* = X*β + ε*.  (3)

The spectral decomposition of the symmetric matrix X*'X* is X*'X* = PΛP' with P'P = I. Model (3) can then be written as

Y* = Zα + ε*,  (4)

with Z = X*P, α = P'β, and Z'Z = P'X*'X*P = Λ, so that Λ is a diagonal matrix of eigenvalues and P is a matrix whose columns are the eigenvectors of X*'X*.
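The two-stage FGLS procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `sur_fgls` and the single scalar ridge parameter `lam` used in the first-stage residual fit are our own choices.

```python
import numpy as np
from scipy.linalg import block_diag

def sur_fgls(X_list, Y_list, lam=0.0):
    """FGLS for a SUR system: estimate Sigma from first-stage residuals,
    then run GLS on the stacked system (a sketch)."""
    M, T = len(X_list), X_list[0].shape[0]
    # First stage: equation-by-equation ridge (or OLS when lam = 0) fits.
    resid = np.empty((T, M))
    for i, (Xi, Yi) in enumerate(zip(X_list, Y_list)):
        bi = np.linalg.solve(Xi.T @ Xi + lam * np.eye(Xi.shape[1]), Xi.T @ Yi)
        resid[:, i] = Yi - Xi @ bi
    Sigma_hat = resid.T @ resid / T              # sigma_ij = e_i'e_j / T
    # Second stage: GLS with Omega_hat = Sigma_hat kron I_T.
    X = block_diag(*X_list)                      # TM x p block-diagonal design
    Y = np.concatenate(Y_list)
    Omega_inv = np.kron(np.linalg.inv(Sigma_hat), np.eye(T))
    XtO = X.T @ Omega_inv
    return np.linalg.solve(XtO @ X, XtO @ Y)
```

With `lam = 0`, the first stage reduces to equation-by-equation OLS; the paper's numerical studies instead use ridge residuals because of within-equation collinearity.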
The OLS estimator of Model (4) is

α̂ = (Z'Z)^(-1) Z'Y* = Λ^(-1) Z'Y*.  (5)

The least squares estimate of β in Model (2) can then be obtained by the inverse linear transformation β̂ = Pα̂. Furthermore, following Alkhamisi and Shukur (2008), the full model ridge SUR regression parameter estimate is

α̂_RR = (Z'Z + K)^(-1) Z'Y*,  (6)

where K = diag(k_11, . . . , k_1p_1, . . . , k_M1, . . . , k_Mp_M) with k_ij > 0 for i = 1, 2, . . . , M and j = 1, 2, . . . , p_i. Now let us assume that uncertain non-sample prior information (UNPI) on the parameter vector β is available, either from previous studies, expert knowledge, or the researcher's experience. This information may be useful for improving the quality of the estimators when the sample data are of low quality or unreliable (Ahmed 2014). It is assumed that the UNPI on the vector of parameters is expressed through the restriction

Rβ = r  (7)

for Model (2), where R = diag(R_1, R_2, . . . , R_M), each R_i, i = 1, . . . , M, is a known m_i × p_i matrix of rank m_i < p_i, and r is a known (∑_{i=1}^{M} m_i) × 1 vector. In order to use restriction (7) in Equation (2), we transform it as

Hα = r,  (8)

where H = RP and α = P'β, as defined above. Hence, the restricted ridge SUR regression estimate is obtained by minimizing (Y* − Zα)'(Y* − Zα) + α'Kα subject to (8), which yields

α̂_RE = α̂_RR − Z_K^(-1) H'(H Z_K^(-1) H')^(-1) (H α̂_RR − r),

where Z_K = Z'Z + K.
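The whitening-and-rotation steps leading to the ridge estimate (6) can be sketched as follows, assuming a known Σ and, for simplicity, a single scalar ridge constant `k` in place of the diagonal matrix K; these simplifications are ours.

```python
import numpy as np
from scipy.linalg import block_diag, sqrtm

def ridge_sur(X_list, Y_list, Sigma, k):
    """Full-model ridge SUR estimate via the whitened, rotated model
    Y* = Z alpha + eps* (a sketch; scalar k replaces the matrix K)."""
    T = X_list[0].shape[0]
    X, Y = block_diag(*X_list), np.concatenate(Y_list)
    # Whitening: Omega^{-1/2} = Sigma^{-1/2} kron I_T.
    W = np.kron(np.linalg.inv(sqrtm(Sigma)).real, np.eye(T))
    Xs, Ys = W @ X, W @ Y
    # Spectral decomposition X*'X* = P Lambda P', then Z = X* P.
    Lam, P = np.linalg.eigh(Xs.T @ Xs)
    Z = Xs @ P
    alpha_rr = np.linalg.solve(np.diag(Lam) + k * np.eye(Lam.size), Z.T @ Ys)
    return P @ alpha_rr        # back-transform: beta_hat = P alpha_hat
```

Setting `k = 0` recovers the GLS estimate, which gives a quick sanity check of the transformation.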

The bias and variance of α̂_RR follow from (6) as

Bias(α̂_RR) = −Z_K^(-1) K α and Var(α̂_RR) = Z_K^(-1) Λ Z_K^(-1).

Thus, the risk of α̂_RR is directly obtained by definition.

Preliminary Test and Shrinkage Estimation
Researchers have determined that the restricted estimator (RE) generally performs better than the full model estimator (FME), with smaller sampling variance, when the UNPI is correct. However, the RE may be a noteworthy competitor of the FME even when the restrictions are, in fact, not valid; we refer to Groß (2003) and Kaçıranlar et al. (2011). Importantly, the consequences of incorporating UNPI into the estimation process depend on the usefulness of that information. The preliminary test estimator (PTE) uses the UNPI as well as the sample information, choosing between the RE and the FME through a pretest. We consider the SUR-PTE of α as follows:

α̂_PT = α̂_RR − (α̂_RR − α̂_RE) I(F_n ≤ F_{m, M·T−p}(α)),

where F_{m, M·T−p}(α) is the upper α-level critical value of the central F-distribution, I(A) stands for the indicator function of the set A, and F_n is the F statistic for testing the null hypothesis (8), given by

F_n = [(Hα̂ − r)'(H(Z'Z)^(-1)H')^(-1)(Hα̂ − r)/m] / [ε̂*'ε̂*/(M·T − p)],

where m is the number of restrictions and p is the total number of estimated coefficients. Under the null hypothesis (8), F_n is F-distributed with m and M·T − p degrees of freedom (Henningsen et al. 2007).
The PTE selects strictly between the FME and the RE and depends strongly on the level of significance. We now define the Stein-type estimator (SE) of α. This estimator is a smooth version of the PTE, given by

α̂_S = α̂_RE + (1 − c F_n^(-1))(α̂_RR − α̂_RE),

where c is the optimum shrinkage constant. It is possible for the SE to have the opposite sign of the FME for small values of F_n. To alleviate this problem, we consider the positive-rule Stein-type estimator (PSE), defined by

α̂_PS = α̂_RE + (1 − c F_n^(-1))^+ (α̂_RR − α̂_RE),

where x^+ = max(0, x).
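The three combination rules can be written as a small helper. The explicit shrinkage constant c = (m − 2)·(M·T − p) / (m·(M·T − p + 2)) used below is the standard Stein-rule choice and is our assumption here (it requires m > 2); the function name is illustrative.

```python
import numpy as np
from scipy.stats import f as f_dist

def combine(alpha_full, alpha_sub, F_n, m, df2, level=0.05):
    """PTE, SE, and PSE built from the full-model (ridge) and restricted
    estimates, the pretest statistic F_n, m restrictions, and df2 = M*T - p."""
    crit = f_dist.ppf(1.0 - level, m, df2)
    diff = alpha_full - alpha_sub
    pte = alpha_full - diff * (F_n <= crit)           # keep RE if pretest accepts
    c = (m - 2) * df2 / (m * (df2 + 2))               # assumed shrinkage constant
    se = alpha_sub + (1.0 - c / F_n) * diff           # smooth Stein-type version
    pse = alpha_sub + max(0.0, 1.0 - c / F_n) * diff  # positive-rule version
    return pte, se, pse
```

For very small F_n, the factor 1 − c/F_n is negative, which is exactly the sign-reversal problem of the SE; the PSE truncates it at zero.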

Simulation
In this section, the performance of the preliminary test and shrinkage SUR ridge estimators of β is investigated via Monte Carlo simulations. We generate the response from the model Y_i = X_i β_i + ε_i, i = 1, . . . , M. The explanatory variables are generated from a multivariate normal distribution MVN_{p_i}(0, Σ_x), and the random errors are generated from MVN_M(0, Σ_ε). We summarize the simulation details as follows:
1. The parameter ρ_x regulates the strength of collinearity among the explanatory variables of each equation. In this study, we consider ρ_x = 0.5, 0.9. Further, the response is centred, and the predictors are standardized for each equation.
2. The variance-covariance matrix of errors governing the interdependency among equations is defined by diag(Σ_ε) = 1 and off-diag(Σ_ε) = ρ_ε = 0.5, 0.9, and the errors are generated from MVN_M(0, Σ_ε) with M = 2, 3 for each replication.
3. The SUR regression model is assumed to be sparse. Hence, the vector of coefficients can be partitioned as β = (β_1', β_2')', where β_1 is the coefficient vector of the main effects, while β_2 is the vector of nuisance effects, which do not contribute to the model significantly.
We set β_1 = (1, −3, 2)' and β_2 = 0. For the suggested estimators, we consider the restriction β_2 = 0 and test it. We also investigate the behaviour of the estimators when the restriction is not true. To this end, we add a value ∆ to one component of β_2 so that it violates the null hypothesis.
Here, we use ∆ values between zero and two and set α = 0.05. We also consider lengths of the nuisance parameter β_2 of two and four, respectively. Therefore, the restriction matrices are R_i = (0_{m_i × (p_i − m_i)}, I_{m_i}) with r_i = 0, where i = 2, 3 is the number of equations. Hence, R is diag(R_1, R_2) or diag(R_1, R_2, R_3), and r is (r_1', r_2')' or (r_1', r_2', r_3')', respectively.
4. The performance of an estimator is evaluated by using the relative mean squared error (RMSE) criterion. The RMSE of an estimator α̂* with respect to α̂_RR is defined as

RMSE(α̂*) = MSE(α̂_RR) / MSE(α̂*),

where α̂* is one of the listed estimators. If the RMSE of an estimator is larger than one, the estimator is superior to α̂_RR. Table 1 provides notations and a symbol key for the benefit of the reader.
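The RMSE criterion can be computed directly from stacked Monte Carlo draws; the function name and the array layout (one replication per row) are illustrative choices of ours.

```python
import numpy as np

def relative_mse(benchmark_draws, competitor_draws, truth):
    """RMSE of a competitor with respect to the benchmark alpha_RR:
    values above one mean the competitor has smaller simulated MSE."""
    mse_bench = np.mean(np.sum((benchmark_draws - truth) ** 2, axis=1))
    mse_comp = np.mean(np.sum((competitor_draws - truth) ** 2, axis=1))
    return mse_bench / mse_comp
```

An estimator whose draws sit closer to the true coefficient vector than the benchmark's will score above one under this criterion.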
We plot the simulation results in Figures 1 and 2. Simulation results for some other parameter configurations were also obtained, but are not included here for the sake of brevity. According to these results:
1. When ∆ = 0, which means that the null hypothesis is true and the restrictions are consistent, the RE always performs competitively compared to the other estimators. The PTE mostly outperforms the SE and PSE when p_i = 5, while it loses its efficiency relative to the PSE when p_i = 7. The SE may perform worse than the FME due to its sign problem, as indicated in Section 3.
2. When ∆ > 0, which means that the null hypothesis is violated and the restrictions are invalid, the RE loses its efficiency, and its RMSE goes to zero, meaning that it becomes inconsistent. The RMSE of the PTE decreases and remains below one for some values of ∆, but approaches one for larger values of ∆. The performance of the PSE decreases, but its efficiency remains above that of the FME for intermediate values of ∆, while it behaves like the FME for larger values of ∆. It can be concluded that the PSE is a robust estimator even when the restriction is not true.
3. We examined both medium and high correlation between disturbance terms. The results showed that the performance of the suggested estimators was consistent with the theory; see Ahmed (2014).
4. We examined both medium and high correlation between regressors across different equations.
The results showed that the performance of the suggested estimators was consistent with the theory; see Yüzbaşı et al. (2017).

Application
In this section, we apply the proposed estimation strategies to a financial dataset to examine the relative performance of the listed estimators. To illustrate and compare them, we study the effect of several economic and financial variables on the performance of the "Fragile Five" countries. Table 2 provides information about the prediction variables, and the raw data are available from the World Bank 1 . We suggest a model in which foreign direct investment for each country is regressed on these variables, where i denotes countries (i = TUR, ZAF, BRA, IND, IDN) and t is time (t = 1, 2, . . . , T). Following Salman (2011), the errors of each equation are assumed to be normally distributed with mean zero, homoscedastic, and serially uncorrelated. Furthermore, there is contemporaneous correlation between corresponding errors in different equations. We test these assumptions along with the assumptions in Section 2, first checking the following for each equation. Nonautocorrelation of errors: There are a number of viable tests in the literature for autocorrelation. For example, the Ljung-Box test is widely used in time series analysis, and a similar assessment may be obtained via the Breusch-Godfrey and Durbin-Watson tests. We apply the Ljung-Box test (Ljung and Box 1978). The null hypothesis of the Ljung-Box test, H_0, is that the errors are random and independent; a significant p-value rejects H_0, indicating that the series is autocorrelated. The results reported in Table 3 suggest a rejection of H_0 for the equations of both TUR and IND at any conventional significance level. Thus, the estimation results would clearly be unsatisfactory for these two equations. To tackle this problem, we applied a first-differences transformation to the variables. After the transformation, the test statistics and p-values for the TUR and IND equations were χ2(1) = 1.379, p = 0.240 and χ2(1) = 0.067, p = 0.794, respectively.
Hence, each equation satisfied the assumption of nonautocorrelation; we confirmed this result using the Durbin-Watson test. Homoscedasticity of errors: We applied the Breusch-Pagan test (Breusch and Pagan 1979). The results in Table 4 failed to reject the null hypothesis for each equation, so the assumption of homoscedasticity was met in each equation. Normality of errors: Various normality tests are available, such as the Shapiro-Wilk, Anderson-Darling, Cramér-von Mises, Kolmogorov-Smirnov, and Jarque-Bera tests. In this study, we performed the Jarque-Bera goodness-of-fit test (Jarque and Bera 1980), whose null hypothesis is that the data are normally distributed. The results reported in Table 5 suggested a rejection of H_0 only for ZAF. We also performed the Kolmogorov-Smirnov test for ZAF, and its results indicated that the errors were normally distributed. Thus, each equation satisfied the assumption of normality.
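For reference, the Ljung-Box statistic applied above has the closed form Q = T(T + 2) ∑_{k=1}^{h} ρ̂_k² / (T − k), referred to a chi-square distribution with h degrees of freedom. A minimal implementation (ours, not the routine used in the paper) is:

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(resid, lags=1):
    """Ljung-Box Q test for serial correlation in residuals (a sketch).
    Returns the Q statistic and its chi-square(lags) p-value."""
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    n, denom = e.size, np.sum(e ** 2)
    q = 0.0
    for k in range(1, lags + 1):
        rho_k = np.sum(e[k:] * e[:-k]) / denom   # lag-k sample autocorrelation
        q += rho_k ** 2 / (n - k)
    q *= n * (n + 2)
    return q, chi2.sf(q, lags)
```

A small p-value rejects the null of independent errors, as in the TUR and IND equations before differencing.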
Cross-sectional dependence: To test whether the estimated correlation between the sections is statistically significant, we applied the Breusch and Pagan (1980) Lagrange multiplier (LM) statistic and the Pesaran (2004) cross-section dependence (CD) test. The null hypothesis of these tests is that there is no cross-section dependence. Both tests in Table 6 suggested a rejection of the null hypothesis, indicating that the residuals from the equations were significantly correlated with each other. Consequently, the SUR model is the preferred technique, since it assumes contemporaneous correlation across equations; the joint estimation of all parameters is then more efficient than OLS on each equation (Kleiber and Zeileis 2008). Model specification: The test of Ramsey (1969) is a general specification test for the linear regression model. It tests the exogeneity of the independent variables, that is, the null hypothesis E[ε_i | X_i] = 0. Rejecting the null hypothesis indicates that there is a correlation between the error term and the regressors or that nonlinearities exist in the functional form of the regression. The results reported in Table 7 suggested a rejection of H_0 only for IDN. Multicollinearity: We calculated the variance inflation factor (VIF) values among the predictors. A VIF value measures how many times larger Var(β̂_j) is for multicollinear data than for orthogonal data. Multicollinearity is usually not a problem when the VIFs are not significantly larger than one (Mansfield and Helms 1982). In the literature, VIF values that exceed 10 are often regarded as indicating multicollinearity, but in weaker models, values above 2.5 may already be a cause for concern. Another measure of multicollinearity is the condition number (CN) of X_i'X_i, which is the square root of the ratio of the largest characteristic root of X_i'X_i to the smallest. Belsley et al. (2005) suggested that a CN greater than fifteen poses a concern, a CN in excess of 20 is indicative of a problem, and a CN close to 30 represents a severe problem. Table 8 displays the results from a series of multicollinearity diagnostics. In general, EXPORTS, IMPORTS, and BALANCE were found to be problematic with regard to VIF values, while the other variables may be mildly concerning. Moreover, the CN results suggested very serious multicollinearity concerns for the equations of ZAF, BRA, and IDN. In light of these results, it was clear that multicollinearity was present in the equations. According to Greene (2019), SUR estimation is more efficient when there is less correlation among the covariates; therefore, ridge-type SUR estimation is a good solution to this problem. Structural change: To investigate the stability of the coefficients in each equation, we used the CUSUM (cumulative sum) test of Brown et al. (1975), which checks for structural changes. The null hypothesis is coefficient constancy, while the alternative suggests structural change in the model over time. The results in Table 9 suggested the stability of the coefficients over time. The selected variables per equation are reported in Table 10, and the sub-models were constituted by using these variables. In light of the selected variables in Table 10, we construct the matrices of restrictions as follows:

thus, the reduced models are given by Models (15)-(19). Next, we combined Model (14) and Models (15)-(19) using the shrinkage and preliminary test strategies outlined in Section 3. Before performing the analysis, the response was centred and the predictors were standardized for each equation, so that the intercept term was omitted. We then split the data using the time series cross-validation technique of Hyndman and Athanasopoulos (2018) into a series of training sets and a series of test sets. Each test set consisted of a single observation, for models producing one-step-ahead forecasts, and the observations in each training set occurred prior to the observation of the corresponding test set. Hence, it was ensured that no future observations could be used in constructing the forecasts. We used the function createTimeSlices from the caret package in the R project. The listed models were applied to the data, and predictions were made on the resulting training and test sets. The process was repeated 15 times, and for each subset's prediction, the mean squared error (MSE) and the mean absolute error (MAE) were calculated. The means of the 15 MSEs and MAEs were then used to evaluate the performance of each method. We also report the relative performances (RMAE and RMSE) with respect to the full model estimator for easier comparison; if the relative value of an estimator is larger than one, it is superior to the full model estimator.
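The rolling-origin splitting used here can be reproduced in a few lines. This sketch is our own growing-window variant with a one-step horizon; caret's createTimeSlices also offers a fixed-window option.

```python
def time_slices(n_obs, initial_window, horizon=1):
    """Growing-window rolling-origin splits: each test set holds the
    `horizon` observations immediately after its training window."""
    slices = []
    for shift in range(n_obs - initial_window - horizon + 1):
        train = list(range(initial_window + shift))
        test = list(range(initial_window + shift,
                          initial_window + shift + horizon))
        slices.append((train, test))
    return slices
```

Choosing the initial window so that 15 one-step test sets remain reproduces the 15 repetitions described above, with every training observation preceding its test observation.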
In Table 11, we report the MSE and MAE values along with their standard errors, the latter to show the stability of the algorithm. Based on this table, as expected, the RE had the smallest error measures, since the insignificant variables were selected close to correctly. After the RE, the PSE performed best, followed by the SE and the PTE. Moreover, the performance of the OLS was the worst, due to the problem of multicollinearity. In order to test whether two competing models had the same forecasting accuracy, we used the two-sided Diebold-Mariano (DM) test (Diebold and Mariano 1995) with a forecasting horizon of one year and both squared-error and absolute-error loss functions. A significant p-value in this test rejects the null hypothesis that the models have the same forecasting accuracy. The results based on the absolute-error loss in Table 12 suggested that the FME had different prediction accuracy from all methods except the RE. Additionally, the forecasting accuracy of the OLS differed from that of the listed estimators. On the other hand, the results of the DM test based on the squared-error loss suggested that the observed differences between the RE and the shrinkage estimators were significant. The numbers in parentheses are the corresponding p-values; LS denotes the loss function used; * p < 0.1, ** p < 0.05, *** p < 0.01.
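The DM comparison can be sketched as follows for horizon-one forecasts, where the long-run variance of the loss differential reduces to its sample variance (no autocovariance correction is needed at h = 1); the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def diebold_mariano(e1, e2, loss="squared"):
    """Two-sided DM test of equal forecast accuracy at horizon 1 (a sketch).
    e1, e2 are the forecast-error series of the two competing models."""
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    d = e1 ** 2 - e2 ** 2 if loss == "squared" else np.abs(e1) - np.abs(e2)
    n = d.size
    dm = d.mean() / np.sqrt(d.var(ddof=1) / n)   # standardized mean loss differential
    return dm, 2.0 * norm.sf(abs(dm))            # two-sided normal p-value
```

A positive, significant statistic indicates that the second model's forecasts incur smaller loss than the first model's.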
Finally, the estimates of the coefficients for all countries are given in Table 13; the numbers in parentheses are the corresponding standard errors.

Conclusions
In this paper, we proposed shrinkage and preliminary test estimation methods for a system of regression models in which the disturbances are dependent and correlations exist among the regressors of each equation. To build the model, we first multiplied both sides of Model (1) by the inverse square root of the variance-covariance matrix of the disturbances and transformed the result using a spectral decomposition. We defined the full model estimator following Alkhamisi and Shukur (2008) and the restricted estimator by assuming UNPI on the vector of parameters. Finally, we combined them in an optimal way by applying the shrinkage and preliminary test strategies. To illustrate and compare the relative performance of these methods, we conducted a Monte Carlo simulation. The simulation results demonstrated that the RE outperformed all other estimators when there was sufficient evidence that the vector of nuisance parameters was a zero vector, that is, ∆ = 0. However, the RE lost its efficiency as ∆ increased, with unbounded risk for large ∆. The PSE dominated the FME for small values of ∆, while the SE and PSE outperformed the FME over the entire parameter space. Moreover, the PSE was better than the SE because it controls the over-shrinking problem of the SE. We also investigated the performance of the suggested estimators via a real-world example using financial data for the "Fragile Five" countries. The results of our data analysis were consistent with the simulation results.
For further research, one could use other penalized techniques for the SUR model, such as the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), the least absolute shrinkage and selection operator (LASSO) of Tibshirani (1996), and the adaptive LASSO of Zou (2006), as well as our preliminary test and shrinkage estimations.