Improved Average Estimation in Seemingly Unrelated Regressions

: In this paper, we propose an efﬁcient weighted average estimator in Seemingly Unrelated Regressions. This average estimator shrinks a generalized least squares (GLS) estimator towards a restricted GLS estimator, where the restrictions represent possible parameter homogeneity speciﬁcations. The shrinkage weight is inversely proportional to a weighted quadratic loss function. The approximate bias and second moment matrix of the average estimator using the large-sample approximations are provided. We give the conditions under which the average estimator dominates the GLS estimator on the basis of their mean squared errors. We illustrate our estimator by applying it to a cost system for United States (U.S.) Commercial banks, over the period from 2000 to 2018. Our results indicate that on average most of the banks have been operating under increasing returns to scale. We ﬁnd that over the recent years, scale economies are a plausible reason for the growth in average size of banks and the tendency toward increasing scale is likely to continue


Introduction
Seemingly unrelated regressions (SUR) was introduced by (Zellner 1962) and is one of the econometric developments that has been widely used in applied work. The relative ease of estimation, applying a large class of modeling and testing problems, and the availability of data representing a sample of cross section units observed over several time periods are related to the popularity of this model. (Zellner 1962) proposed a generalized least squares (GLS) estimator for estimating the coefficients of a set of SUR and established that it yields, at least asymptotically, to more efficient estimators than those obtained by single-equation least squares. See the surveys by (Srivastava and Dwivedi 1979;Fiebig 2001) and the book by (Srivastava and Giles 1987) for a concise coverage of the literature in this area.
Shrinkage estimations in SUR was first introduced by (Zellner and Vandaele 1975), which extends the results of (James and Stein 1961) and (Sclove 1968) to multivariate regression equations and presents a technique of constructing an estimator whose risk is smaller than the risk of the GLS estimator. However, the resulting estimator depends on some unknown matrices and is not practical.  investigates the properties of the estimator when consistent estimators are substituted for these unknown matrices. (Maddala 1991) reviewed the shrinkage estimators and showed that these estimators appear to perform better than both pooled and single-equation least squares estimators (see also Maddala and Hu 1996;Maddala et al. 1997;Choi and Li 2000). Maddala et al. (2001) show the superior properties of shrinkage estimators among single-equation estimators and various averaging estimators in a heterogeneous panel data model under error homoscedasticity framework. In univariate equation models, recently, (Hansen 2016) introduces shrinkage for general parametric models by evaluate the accuracy of the approximations. Results from our empirical example are presented in Section 6. Finally, Section 7 contains some concluding remarks, and proofs are given in Appendix A.

The Model and Notation
Consider the following m seemingly unrelated linear regressions where y i = (y i1 , y i2 , . . . , y iT ) is a T × 1 vector of observations on the dependent variable y it , with T being the number of observations, X i is a T × k matrix of observations on the k vector of regressors including the intercept (that is, x it,1 = 1) 1 , β i is a k × 1 vector of unknown coefficients and u i = (u i1 , u i2 , . . . , u iT ) is a T × 1 vector of disturbances, for i = 1, 2, . . . , m. It is convenient to stack the m equations above in the following form: or compactly as y We assume, Assumption 1. The mT × 1 vector of disturbances, u, has a zero conditional mean E(u|X 1 , X 2 , . . . , X m ) = 0.
Assumption 2. The disturbances are uncorrelated across observations but correlated across equations, where I T is the T × T identity matrix, and we assume Ω is positive definite.
1 Note that we do not assume that X i 's are the same, nor do we assume they are different across equations. In other words, our model supports complete heterogeneity, partial heterogeneity, and complete homogeneity of regressors.
Assumption 3. The disturbances are normally distributed with mean zero and variance-covariance matrix Ω.
We define some notations below, which will be used in the following sections. So let where If we partition Q in the sub-matrices of T × T as below then we define Also, we define where Ψ ii is the ith diagonal T × T sub-matrix of Ψ which is partitioned as below

Estimators
Our goal is to estimate the vector of slope parameters, β, in Equation (3). We consider three estimators of the slope parameters. The first estimator is the (Zellner 1962) GLS estimator (the unrestricted GLS estimator), which is the standard estimator in SUR. The second estimator is a restricted GLS estimator that ignores the slope parameters heterogeneity and estimates a pooled model. The third estimator, called the average estimator, is a weighted average of the restricted and the unrestricted GLS estimators where the weight is proportional to a weighted quadratic loss function.

Unrestricted Estimator
The typical estimator of the slope parameters in SUR is a feasible GLS estimator defined as where Ω is an estimator of Ω, which can be calculated as where Σ is an estimator of Σ, such that its (i, j)th element, s ij , estimates σ ij , using a single-equation estimator of β i , defined asβ i = (X i X i ) −1 X i y i , for i = 1, 2, . . . , m. Hence, s ij is equal to where

Restricted Estimator
The restricted estimator is defined under the parameter homogeneity assumption across equations, which can be written as whereβ is a weighted average of the slope parameters, β i 's, defined as in which J = (I k , I k , . . . , I k ) is a k × mk matrix, where I k denotes the k × k identity matrix. Equivalently, the parameter homogeneity assumption can be formulated as a restriction matrix as where Hence, we can derive the restricted estimator from the following minimization The solution to the above minimization can be formulated as a feasible restricted GLS estimator in below where 2 R = I mk − J(J X Ω −1 X J) −1 J X Ω −1 X is an estimate of R.

Average Estimator
We define the average estimator as below where D is a weighted quadratic loss function defined as with W an arbitrary symmetric positive definite weight matrix with elements of order O(T), and τ is a positive characterizing parameter. We will defer describing the optimal choice for this parameter in the next section. 3 The idea behind the average estimator defined above is that when the difference between the restricted and the unrestricted GLS estimators is small (D is small), the average estimator gives a higher weight to the restricted GLS estimator, as it is the most efficient estimator. However, when the difference between the restricted and the unrestricted GLS estimators is substantial, the bias of the restricted GLS estimator, resulting from ignoring the parameter heterogeneity, could be more than its variance efficiency gain, so the average estimator assigns a higher weight to the unrestricted GLS estimator.

Large-Sample Approximate Bias and MSE
We employ the large-sample approximations method developed by (Nagar 1959), to analyze the bias, mean squared error matrix (MSEM) and risk of the average estimator.
and the MSEM of the average estimator up to order O(T −2 ) is and for the symmetric positive definite weight matrix W of order O(T), the risk of the average estimator up to order O(T −1 ) is where φ = β R W Rβ = O(T), φ p = β R W 1/2 PW 1/2 Rβ = O(T), λ max (.) denotes the maximum eigenvalue, and P = W 1/2 R(X Ω −1 X) −1 R W 1/2 .

Proof. Appendix A.
We note that, see (Srivastava 1970) for a proof, hence The weight can be replaced by a positive part weight, that is, when τ/D < 0, we assign a zero weight to the restricted estimator. It is easy to verify that the MSE of the positive part is smaller. However, it will not change the approximations, so for simplicity we do not impose it at this stage. Nevertheless, the Monte Carlo and the application results are reported using the positive part weights.
where H = X (Σ −1 ⊗ Φ)X − X ΠX. From Theorem 1, it follows that the average estimator dominates the unrestricted GLS estimator in terms of having a smaller risk, when the second term on the right-hand side of Equation (20) is negative, which will hold when given tr(P) > 2φ p /φ. As the upper bound of the condition above depends on the slope parameters, one could replace it with an infimum value. Let d be defined as which lies in the range d ∈ [0 , (m − 1)k], as P is a non-zero positive semi-definite matrix. Therefore, when d > 2, an infimum value for the upper bound is 2 tr(P) − 2 λ max (P) . 4 Therefore, given d > 2, an equivalent condition for the condition in (23) can be written as In other words, when d > 2, and τ satisfies the condition in Equation (25), the risk of the average estimator is less than the risk of the unrestricted GLS estimator up to the order of interest. In addition, as the choice of the characteristic parameter is user-specified, its optimal value, τ opt , that minimizes the upper bound of the risk of the average estimator (the last term in Equation (20) Since the optimal τ depends on the unknown value of Ω, one could substitute it with its estimated value, and use an estimate of τ opt , as below where P = W 1/2 R(X Ω −1 X) −1 R W 1/2 , is an estimate of P. Corollary 1. Under Assumptions 1-3, when d > 2, then up to order O(T −1 ) we have where β A is the average estimator with τ opt .

Proof. Appendix A.
Two arbitrary choices of W are TI mk , and X Ω −1 X, where the former one in the risk gives the mean squared error and the latter one, provides the (in-sample) mean squared forecast error (MSFE).

Corollary 2. Under Assumptions
The optimal value of τ that minimizes the MSFE of the average estimator, provided (m − 1)k > 2, is and the associated optimal MSFE of the average estimator up to order O(T −1 ) is where Proof. Appendix A.

Monte Carlo Simulation
The results below are the simulation results of the model of Section 2, where x it,1 = 1 and the remaining regressors are independently generated from the standard normal distribution. The sample size varies from T = 100, m = 3, 6, and k = 3, 5, leading to four combinations of m, and k. u 1t is generated as I IDN(0, 1), while u it = c u 1t + v it , for i = 2, . . . , m, where v it ∼ I IDN(0, 1) and c = 0.5. We consider two DGPs for generating β i 's, the first one is under a complete heterogeneity in coefficients where we assume that DGP1: . . , 1) , and the second DGP is under a weak heterogeneity where we assume that where [m/2] denotes the largest integer value that is smaller than m/2, and δ takes values on a 10-point grid on [0, 1]. The results of 1000 monte carlo simulations are given in Figures 1-4, where the vertical axes measure the relative mean squared error (RMSE) of the unrestricted GLS estimator, the restricted GLS estimator and the average estimator to the unrestricted GLS estimator. Hence, the RMSE of the unrestricted GLS estimator is equal to one. The horizontal axes measure the degree of parameter heterogeneity, δ, which is set between zero and one with 0.1 grid value.   The Monte Carlo results support our theoretical findings of the previous section. The figures show that the RMSE of the average estimator for the whole parameter heterogeneity is below that of the unrestricted estimator. This shows the superiority of the average estimator relative to the unrestricted GLS estimator.
The RMSE of the average estimator in DGP1 of a complete heterogeneous SUR, is smaller than that of the restricted GLS estimator except for very small values of parameter heterogeneity (δ). This is expected because as δ takes higher values, the bias of the restricted GLS estimator increases, which then results in higher MSE. In DGP2 where the SUR is characterized by some degrees of homogeneity, the RMSE of the restricted GLS estimator remains smaller than that of the unrestricted GLS estimator for larger values of δ relative to DGP1. In this case, the unrestricted GLS estimator can be inferior to the restricted GLS estimator even with the presence of weak degrees of heterogeneity. This is because although the unrestricted GLS estimator is unbiased, it is inefficient, especially under small sample sizes, and a high number of regressors. In contrast, the restricted GLS estimator properly makes the use of cross equation variations and thus provides a more accurate results.
In general, we find that the average estimator performs robustly well in SUR with various degrees of heterogeneity. When there is a strong heterogeneity, the average estimator prevails. When there is a relatively weak heterogeneity, the average estimator tends to gain more from the efficiency of the restricted GLS estimator by assigning a high weight to this estimator and thus still remains one of the best choices.

Application: Returns to Scale in US Banking Industry
In this section we apply the average estimator studied in the previous sections to regressions of the cost system for U.S. commercial banks. We are interested in estimating the returns to scale (RTS) for these banks over the past recent years.
Over the past few years, the number of U.S. commercial banks fell by almost 70%, where in 1984 the total number of U.S. commercial banks was 14,391 and dropped to 4773 in 2018. Over the same period of time, the average asset value of U.S. banks (adjusted for inflation), which is also a measure of bank size, increased by about ten times, from 140 million dollars in 1984 to 1400 million dollars in 2018 (See Figure 5). To support this bank size expansion, bank executives and analysts claim that due to the changes in regulation (such as the permission of interstate branching and combination of banks) and because of technological and financial innovations (such as communication technologies, the securitization and sale of bank loans) over the past few years, the cost of production for larger banks has reduced and encouraged banks to grow larger and/or merge. On the other hand, critics contend that this decrease in the number of operating banks, and having banks with large assets not only impact the market competition, but also result in agency problems and disproportionate benefits of government policies in favor of large banks. In particular, the financial crisis of 2007 focused attention on large financial institutions considered as "too-big-to-fail". These together have brought attention of policy makers for regulatory limits on bank size. However, any policy intervention needs to consider the potential efficiency benefits of operating at a large scale. Therefore, estimation of scale economies and RTS is essential for analyzing the costs and benefits of any policy intervention to control the size of banks.
The estimation of scale economies and RTS for U.S. banking industry has stimulated a substantial body of studies. Older empirical studies that used data from the 1980s and 1990s did not find scale economies in banking industry except for very small banks. But recent research that used data from the 2000s, and more modern methods for estimating the banking models, has found considerably more evidence of scale economies in banking. These studies include (Hughes et al. 1996(Hughes et al. , 2000(Hughes et al. , 2001Berger and Mester 1997;Mester 1998, 2013;Wheelock and Wilson 2001;Feng and Serletis 2009). Most of the studies in the literature partition banks based on their asset size in groups and estimate each group independently from the other groups. However, not only there is no reason to believe these categories are set based on banks underlying technology, at the same time, it is hard to believe that these groups are not affected by some unknown factors that could have resulted in correlations between groups. Therefore, there are two issues that need to be carefully considered by researchers. First, estimating cost efficiency using all observations as one group has the advantage of smaller variance but at the same time, it means ignoring the potential heterogeneity bias due to difference in technology. Second, partitioning banks in different groups and using single-equation estimators are inefficient compared to pooled estimators which ignore heterogeneity. Hence, there is a trade-off between bias and variance efficiency between these two estimators. As the average estimator introduced in the previous sections results in the optimal balance between bias and variance efficiency, we recommend using this estimator in the estimation of the returns to scale for banking industry to obtain robust and efficient estimators.

The Model
We follow the so called "intermediation approach" framework of (Sealey and Lindley 1977), which is broadly employed in the literature. According to this approach, a bank's balance sheet is assumed to capture the essential structure of a bank's core business. Inputs are considered to be liabilities (core deposits and purchased funds), physical capital and labor. The inputs result in the bank's productions which are assets (other than the physical, includes loans and trading securities).
With regard to variable specification, we define five inputs and five outputs that are the ones used in the literature. We define the following output quantities: consumer loans (y 1 ), real estate loans (y 2 ), loans to business and other institutions (y 3 ), federal funds sold and securities purchased under agreements to resell (y 4 ), and other assets (y 5 ). The input variables are: labor quantities (x 1 ), premises and fixed assets (x 2 ), purchased funds (x 3 ), interest-bearing transaction accounts (x 4 ), and non-transaction accounts (x 5 ). For each input x j , its price w j is obtained by dividing its total expenses by the corresponding input quantities.
For modeling the cost of banking industry, we consider a translog cost function and normalize it, so that the homogeneity (in input prices) property is automatically satisfied. We allow for individual (fixed) effects by adding intercepts in each regression, to control for specific group characteristics, heterogeneity in skills and so on. Hence, the cost equation for each group i = 1, 2, . . . , m, is considered as ln(C it /w 5,it ) = β 0,i + 4 ∑ n=1 β n,i ln(w n,it /w 5,it ) + 5 ∑ l=1 γ l,i ln(y l,it ) + 1 2 5 ∑ n=1 5 ∑ l=1 γ nl,i ln(y n,it ) ln(y l,it ) where T i is the number of observations in group i (the number of banks operating within group i), and C it is the total cost of bank t, in group i, defined as C it = w 1,it x 1,it + w 2,it x 2,it + w 3,it x 3,it + w 4,it x 4,it + w 5,it x 5,it , t = 1, 2, . . . , T i , and i = 1, . . . , m.
The cost function is symmetric which requires the imposition of the following restrictions on the parameters as below RTS is defined as the inverse of the sum of cost elasticities. If we define the output elasticity of the model for output j of bank t in group i, as Ecy j,it = ∂ ln(C it ) ∂ ln(y j,it ) and the sum of cost elasticities as ∂ ln(y j,it ) , then RTS of bank t in group i is defined as RTS it = 1/Ecy it . Also, the RTS for group i is a vector of T i × 1 defined as which is used for calculating mean, quartiles, and deciles of RTS for group i with T i banks, see Section 6.3. A bank with RTS > 1, has increasing returns to scale, that is for one percent increase in all outputs, cost is increased by less than one percent, and the bank is operating below its efficient scale size (RTS = 1) when RTS < 1.

The Data
The data we use is obtained from the Reports of Income and Condition (Call Reports) 5 , over the period from 2000 to 2018. We omit observations where negative values for assets, equity, outputs, and prices are reported. The summary of the data for years 2000 and 2018 is in Table 1 Following (Feng and Serletis 2009) and others in the literature, we classify the banks into three groups which is mainly based on the standard asset size categories that are used by the Federal Financial Institutions Examination Council (FFIEC). Banks with over $500 million in total assets are classified as Large banks, banks with assets between $100 million and $500 million are classified as Medium banks, and banks with under $100 million in assets are classified as Small banks. In order to have a consistent partitions over time, the asset size caps in each year are justified upward by the growth in the CPI. Table 2 presents the number and share of banks in each group with their corresponding asset ranges for years 2000 and 2018.

Estimation
We estimate model of Equation (30) using the average estimation method developed in the previous sections for each year separately. Basically, our SUR at each year consists of three cost equations representing Large, Medium, and Small bank groups, and the observations for each regression are the operating banks data under each bank group. Since the sample size for each group is different, we face a SUR with unequal number of observations and to estimate the variance-covariance matrix (Ω), we consider the following procedures recommended in the literature (See Schmidt 1977;Baltagi et al. 1989): 1. Ignore the extra observations in estimating Ω; 2. Use the extra observations to estimate variances. This procedure has the disadvantage of producing estimates of Ω that are not positive definite; 3. Use the extra observations to estimate variances, and modifying the estimates of covariances using the method of (Srivastava and Zaatar 1973); 4. Use all observations in estimation, following the method of (Hocking and Smith 1968). 5 The data from 2000-2010 is downloaded from the Federal Reserve Bank of Chicago website, and the rest of the data from 2011-2018 is downloaded from the FFIEC Central Data Repository's Public Data Distribution website. 6 The summary of the data for other years are not reported to save the space, but it is available upon request.  It is known in the literature that the results of the above procedures are much the same. Likewise, we find that the procedures above, generate similar results, so we only report the results of method 3.
After estimating Equation (30), we obtain the sum of cost elasticities for each bank by where Ecy it is the sum of cost elasticity of bank t in group i, and the parameters are replaced with their estimated values. Then, the RTS is calculated following Equation (33). We also obtain the RTS using the unrestricted GLS estimator, and the restricted GLS estimator.
The results of years 2000, 2009 and 2018 7 are reported in Table 3, which presents the extreme deciles, quartiles and means of our estimated RTS using the three estimation methods. The patterns of mean, extreme deciles and quartiles for all years are plotted in Figure 6. The results base on the restricted estimator over the most recent years suggest increasing RTS at each decile. However, these results are not economically reasonable. On the other hand, we find evidence of decreasing RTS for almost 50% of Small and Medium banks using the unrestricted estimator over the most recent years. These contradicting results show the importance of the average estimator which is used to respond to this model uncertainty. Comparing the RTS of banks using the average estimator over the sample period shows that, on average majority of banks have increasing returns to scale. In most recent years, the results exhibit more signs of cost efficiency for Large and Small banks, such that all of Large banks have increasing RTS and only less than 25% of Small banks have exhausted their cost efficiency. However, we find that more than 50% of Medium banks are operating under decreasing returns to scale near the end of the sample. The results are consistent with some recent studies (e.g., References Feng and Serletis 7 We only report the results for these three years to save the space. However, the results for other years are available upon request. 2009; Hughes and Mester 2013;Wheelock and Wilson 2001;Henderson et al. 2015;Mailkov et al. 2015) although we are not aware of any study from 2011 to 2018.

Conclusions
In this paper, we introduce an averaging estimator for a SUR model. The introduced estimator is a weighed average of an unrestricted GLS estimator which is the (Zellner 1962) estimator and a restricted GLS estimator. The weight is inversely related to a quadratic loss function which measures the weighted distance between the unrestricted and the restricted GLS estimators. The bias, MSE matrix, and risk of the average estimator using the large-sample approximations of (Nagar 1959) are derived. The superiority conditions of the average estimator in terms of the weighted mean squared error is given for any user-specific symmetric positive definite weight matrix, and is not limited to the case where the weight is the inverse of the variance-covariance matrix of the unrestricted GLS estimator.
We also provide some Monte Carlo results which support our theoretical claims. Finally, as our estimator is motivated by economic theory, we use U.S. Commercial banking data, and estimate a cost system for the banking industry to show how our estimator can be used in the applied work. We also estimate the cost system using single-equation least squares and a pooled estimator, and compare them with our proposed average estimator. We found more reliable estimation results with the cost system using our average estimator than the other estimators. We found that on average majority of banks have been operating under increasing returns to scale over the sample period from 2000 to 2018.
Author Contributions: A.M., A.U. contributed equally to the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
which gives the results in Equation (A.1). Now, by using Equation (A.1), we have By using the above results, we have Proof. Theorem 1: Using the results of Lemma A1, in Equation (10), we have where A −1/2 and A −1 are defined below, and the suffixes show the order of magnitude in probability, Using Equation (A.5) in Equation (17), we have where φ = β R W Rβ = O(T), and the last equality above holds by using the standard geometric expansion. Also, the use has been made of Equations (A.1)-(A.4). The terms with order O p (T −2 ) and smaller are dropped, because they will not enter in the calculation of the bias and MSE of the average estimator up to the orders of interest. Employing Equations (A.4) and (A.6) in Equation (16), we obtain The bias of the average estimator using the above equation up to order O(T −1 ) is where the use has been made of because, both A −1/2 and A −1 are odd functions of the error term which has a normal distribution. The MSE matrix up to order O(T −2 ) is where Ξ 1 -Ξ 4 are defined as below Ξ 4 = τ φ (RA −1/2 + R −1/2 β)( β − β) . Now, we obtain the expectations of Ξ 1 -Ξ 4 . E(Ξ 1 ) = τ 2 φ 2 Rββ R , (A.11) where the last equality holds by noting that the second term on the right-hand side of the first equality in the above equation is an odd function of normal distributions, and utilizing the following two equations below E(A −1/2 A −1/2 ) = (X Ω −1 X) −1 X Ω −1 E(uu )Ω −1 X(X Ω −1 X) −1 = (X Ω −1 X) −1 , (A.14) R(X Ω −1 X) −1 R = I mk − J(J X Ω −1 X J) −1 J X Ω −1 X (X Ω −1 X) −1 I mk − J(J X Ω −1 X J) −1 J X Ω −1 X = (X Ω −1 X) −1 − J(J X Ω −1 X J) −1 J = (X Ω −1 X) −1 R = R(X Ω −1 X) −1 .