Evaluation of Bayesian Multimodel Estimation in Surface Incident Shortwave Radiation Simulation over High Latitude Areas

: Surface incident shortwave radiation ( SSR ) is crucial for understanding the Earth’s climate change issues. Simulations from general circulation models (GCMs) are one of the most practical ways to produce long-term global SSR products. Although previous studies have comprehensively assessed the performance of the GCMs in simulating SSR globally or regionally, studies assessing the performance of these models over high-latitude areas are sparse. This study evaluated and intercompared the SSR simulations of 48 GCMs participating in the ﬁfth phase of the Coupled Model Intercomparison Project (CMIP5) using quality-controlled SSR surface measurements at 44 radiation sites from three observation networks (GC-NET, BSRN, and GEBA) and the SSR retrievals from the Clouds and the Earth’s Radiant Energy System, Energy Balanced and Filled (CERES EBAF) data set over high-latitude areas from 2000 to 2005. Furthermore, this study evaluated the performance of the SSR estimations of two multimodel ensemble methods, i.e., the simple model averaging (SMA) and the Bayesian model averaging (BMA) methods. The seasonal performance of the SSR estimations of individual GCMs, the SMA method, and the BMA method were also intercompared. The evaluation results indicated that there were large deﬁciencies in the performance of the individual GCMs in simulating SSR , and these GCM SSR simulations did not show a tendency to overestimate the SSR over high-latitude areas. Moreover, the ensemble SSR estimations generated by the SMA and BMA methods were superior to all individual GCM SSR simulations over high-latitude areas, and the estimations of the BMA method were the best compared to individual GCM simulations and the SMA method-based estimations. Compared to the CERES EBAF SSR retrievals, the uncertainties of the SSR estimations of the GCMs, the SMA method, and the BMA method are relatively large during summer.


Introduction
Surface incident shortwave radiation (SSR) is not only an important parameter in many atmospheric, oceanic, and land process models [1], but also a critical ingredient in energy exchanges between the Earth's surface and the atmosphere [2]. It also plays an important role in the hydrological and carbon cycles [3][4][5] and ultimately influences the Earth's climate. Therefore, knowing the spatial distribution and temporal evolution of SSR is essential for improving our understanding of the Earth's climate and climate change.
This study evaluated the SSR simulations from 48 CMIP5 models using the quality-controlled surface observations between 2000 and 2005 from 21 Greenland Climate Network (GC-NET) sites, 5 BSRN sites, and 18 GEBA sites in the high-latitude areas (>60 • ) and compared these SSR simulations with the SSR satellite retrievals from CERES EBAF, which have the most accurate SSR retrievals among satellite-based products [6]. We evaluated the SSR simulations and intercompared the performance of both the SMA and BMA methods for estimating the SSR in high-latitude areas. The seasonal performance of the outputs from individual GCMs, the SMA method, and the BMA method were also intercompared.
This paper is organized as follows. The SSR data used in this study are described in Section 2. The methodology for the evaluation is given in Section 3. The results are analyzed in Section 4, and the study is concluded and discussed in Section 5.

Data
Three kinds of SSR data are used in this study: CMIP5 simulations, ground measurements, and satellite retrievals. The details of these data sets are described in the following subsections.

CMIP5 GCMs
The CMIP5 in the IPCC's Fifth Assessment Report (IPCC AR5) provides simulations from a number of GCMs developed and maintained by different institutions around the world [17]. These data have been organized by the Program for Climate Model Diagnosis and Intercomparison (PCMDI) for the IPCC AR5. Compared to the third phase of the Coupled Model Intercomparison Project (CMIP3) in the fourth IPCC assessment report (IPCC AR4), the GCMs in IPCC AR5 have many improved model types, with more interactive components, including aerosols, dynamic vegetation, atmospheric physics and carbon and hydrological cycles [24]. Most dynamic, physical, and chemical algorithms were also improved in the IPCC AR5 models [10,11].
At the time of this study, 48 GCM simulations of SSR were available. We selected the "r1i1p1" ensemble and the "historical" experiment, which was aimed at accurately reconstructing the climate evolution of the 20th century by considering all major natural and anthropogenic forcing factors, such as changes in atmospheric greenhouse gases, aerosol loadings (tropospheric and stratospheric volcanic), solar output, and land use [11]. Most historical and r1i1p1 simulations in CMIP5 were run for the period from 1850 to 2005.
The spatial resolution of the 48 CMIP5 GCMs used in this study varies from 0.56 • × 0.56 • to 3.75 • × 3.75 • . Detailed information on the 48 CMIP5 GCMs, host institutions, countries, and spatial resolutions, is summarized in Table 1.

Ground Measurements
Ground observations of SSR over high-latitude areas used to evaluate SSR estimates stem from three networks: BSRN (five sites) [27], GC-NET (21 sites) [28], and GEBA (18 sites) [29]. Figure 1 shows the geographical distributions of the observation sites used in this study.

Ground Measurements
Ground observations of SSR over high-latitude areas used to evaluate SSR estimates stem from three networks: BSRN (five sites) [27], GC-NET (21 sites) [28], and GEBA (18 sites) [29]. Figure 1 shows the geographical distributions of the observation sites used in this study. The BSRN was operational in the early 1990s and was established by the World Climate Research Programme (WCRP) to provide radiation measurements of high accuracy and high temporal resolution at a limited number of sites in various climate zones [30]. The SSR measurement instruments at BSRN sites are calibrated every six months and regularly maintain a SSR uncertainty of less than 5% [27]. To improve data continuity, the SSR was measured by two parallel observation systems [27]. At present, the BSRN project has archived more than 60 sites covering a wide latitude range from 89.98 • S to 82.49 • N and a wide longitude range from 156.61 • W to 169.69 • E [2].
The GC-NET consists of more than 20 automatic weather stations (AWSs) mainly distributed in the accumulation areas of ice sheets. GC-NET AWSs are equipped with various factory-calibrated instruments to measure surface energy and mass balance components [31]. By 2013, all the AWS have gone through a rigorous data quality control process to reduce interference from the typical problems experienced by unattended weather stations, such as station tilt, low cosine responses at large solar zenith angles, riming on sensor domes, and sensor overheating [32]. The GC-NET provides hourly radiation observations with instruments located between 0.1 and 5 m above the surface, depending on the local accumulation rates and tower heights.
The GEBA is a database maintained at ETH Zurich (Switzerland) for the central storage of worldwide measured energy fluxes at the Earth's surface [33,34]. The current version (2017) contains 2500 worldwide locations with an average of approximately 500,000 monthly entries of various surface energy balance components [33]. Thus, the SSR is the most widely measured quantity in the GEBA. The GEBA has undergone substantial changes in terms of available data, data access, and internet appearance and has been widely used to evaluate the SSR estimates from satellite observations, reanalysis data, and GCMs [6].
Since the BSRN and the GC-NET only provide instantaneous values of the SSR, critical quality control procedures were applied to estimate the monthly SSR observations from the instantaneous values of the surface measurements from the GC-NET and the BSRN. First, the daily mean SSR were obtained by integrating the instantaneous values in a day which has at least 80 percent valid observed values. The missing data were estimated via the simple linear interpolation method. Then, the monthly values were calculated by averaging the available daily values within the month. If daily mean SSR data were missing for more than 10 days in one month, the monthly SSR data were excluded from the evaluation.
The CERES EBAF SSR products begin in 2000, and most CMIP5 GCMs SSR simulations end in 2005. Therefore, the study period is the overlap: 2000-2005. A subset of the 21 GC-NET sites, five BSRN sites, and 18 GEBA sites at high-latitude areas, which provides at least 10 months of records within the period of 2000-2005, was used in this study. Detailed information of the sites used in this study, including their latitudes, longitudes, and elevations, is summarized in Table 2.

CERES EBAF SSR Retrievals
The distribution of the radiation ground sites is widespread as shown in Figure 1, but gaps remain particularly over the Antarctic continent and the high-latitude oceans. Thus, this study also evaluated the SSR estimations from the MME methods and individual GCMs by comparing them with the satellite SSR retrievals, which can provide globally gridded values of the SSR. The monthly mean satellite derived SSR retrievals with a spatial resolution of 1 • × 1 • from the CERES EBAF product (Ed 4.0), Remote Sens. 2019, 11, 1776 7 of 24 which has been reported to be more accurate than other gridded SSR products [2,9,35], are used in this study.
The CERES EBAF retrieves shortwave fluxes based on TOA radiance observations from the passive Terra, Aqua, and Suomi-National Polar-Orbiting Partnership satellites [36]. Moreover, the surface irradiances are adjusted using radiative kernels in the retrieval algorithm of the CERES EBAF. The surface irradiance adjustment process is composed of bias correction and a Lagrange multiplier [37]. In the bias correction process, the bias in the temperature and the specific humidity between 200 and 500 h Pa is corrected based on observations incorporated from the atmospheric infrared sounder (AIRS), which is on board the MODIS-Aqua satellite. The bias in the cloud fraction (CF) is corrected based on the observations from the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations (CALIPSO) and CloudSat. In the Lagrange multiplier process, errors in the surface, cloud, and atmospheric properties are corrected. Over the high-latitude areas, the cloud properties are derived from the Terra and Aqua reflectance data using one of two retrieval algorithms depending on the existence of snow and ice [36,38]. To reduce the error in the surface irradiances and to increase the consistency in the TOA irradiances, surface irradiances are constrained using the CERES-derived TOA irradiance [39].

Bayesian Model Averaging (BMA) Method
The BMA method provides a way to combine different models to a multimodel and is a promising method for calibrating ensembles in forecasts [40]. Standard statistical analysis (e.g., regression analysis), proceeds conditionally on one assumed statistical model, which may be selected from among several possible competing models. Other models can also provide different results but with different uncertainties. BMA [41,42] overcomes the problem by conditioning, not on a single/best model, but on the entire ensemble of statistical models considered. To better take advantage of the GCM simulations and obtain more reliable results, we used the BMA method to combine 48 GCM SSR simulations.
In the BMA method, the output is a PDF, which is a weighted average of the conditional PDFs, weighted by their posterior model probabilities. We denote the quantity to be forecasted by y. The law of total probability gives the combined forecast PDF of the quantity y by: where p y X i , y T is the forecast PDF based on model X i alone, estimated from the training data y T , and K is the number of models using to be combined. p X i y T denotes the posterior probability of model X i is the best given the training data, reflecting how well model X i matches the training data. According to the Bayesian theory, this term is computed by: The initial value of p(X i ) can be assigned based on the previous knowledge of the model performance. In this study, uniform prior is assigned p(X i ) = 1 K . Thus, Equation (2) can be further simplified and rewritten as: Because BMA method assumes that the model forecasts are unbiased, the bias-correction methods should be applied in advance. The linear regression method is applied for each GCM SSR simulations: where f i denotes the bias-corrected forecast for model X i and y i is the forecast of the variable from model X i . Unique coefficients a i and b i for each model X i are determined through least squares approximation, with the observations in the training period as the dependent variable and the forecasts as the explanatory variables.
Considering the application of BMA method to bias-corrected forecast f i , Equation (1) can be rewritten as: where w i = p X i |y T is the BMA weight for model X i computed from the training data and reflects the relative performance of model X i on the training period. The BMA weights are nonnegative and add up to 1. The conditional probabilities p i y f i , y T can be interpreted as the conditional PDF of y conditional on f i and training data y T . These conditional PDFs are assumed to be normally distributed for computational convenience as: where the coefficients a i and b i are calculated from the bias-correction procedure described above. This means that the BMA predictive distribution becomes a weighted sum of normal distributions, with equal variance and centered at the bias-corrected forecast. A deterministic forecast can be obtained using the conditional expectation of y given the forecasts: The BMA weights and the variance are estimated by the maximum likelihood [43] from the training dataset. For given parameters to be estimated, the likelihood function is defined as the probability of the training data and is viewed as a function of the parameters. The BMA weights and variance are used to maximize this function, that is, the parameter values for which the observations were most likely to have been observed. It is convenient to maximize the log-likelihood function rather than the likelihood function itself. Following the recommendation of Raftery et al. [44], we use the expectation maximization (EM) algorithm [45] to maximize the log-likelihood function. In brief, the EM algorithm is iterative and alternates between the E (or expectation) step and the M (or maximization) step. More detailed description of the BMA method is provided by Raftery et al. [44].
In this study, the SSR ground observations over six years (2000-2005) from 44 stations spread across high-latitude areas were randomly selected as the training (22 stations) and testing (22 stations) data for the BMA analysis.

Normalized RMSE
To evaluate the performance of the SSR estimations from individual GCMs and the MME methods, the bias between simulations and reference data, the corresponding RMSE and correlation coefficient (R) were calculated. These three statistics can quantify the similarity between the simulations and the reference data.
To compare the RMSE of each GCM SSR simulation more intuitively, the average RMSEs were normalized following [46]. The normalization results are called nRMSEs, which are defined as follows: where RMSE m , the "typical" GCM error, is defined as the median of RMSE values. We use the median rather than the average value here to prevent GCMs with unusually large errors (outliers) from unduly affecting the results. The nRMSE is a measure of how well a GCM (with respect to particular surface observations) compares with the typical GCM error. If nRMSE > 0, then the GCM is inferior to the typical GCM (the GCM whose RMSE is determined as RMSE m ); the greater the nRMSE value is, the worse the GCM. Conversely, if nRMSE < 0, then the GCM is better than the typical GCM; the smaller the nRMSE value (greater than −1), the better the GCM.

Nash-Sutcliffe Efficiency
The performance of the SSR estimations from the GCMs, the SMA method, and the BMA method was also evaluated by the Nash-Sutcliffe efficiency (NSE), which reflects how close the plot of the model simulation versus observed data is to the 1:1 line. The NSE is calculated as follows: where O i and M i are the observed and estimated SSR, respectively. n is the size of data. O is the average value of the observed SSR. The NSE can range from −∞ to 1, with higher values indicating better agreement.

Evaluating CMIP5 GCMs SSR Simulations with Ground Measurements
The simulated SSR values on the original grid scale were directly compared with the ground SSR observations within the grid cells. Before evaluation, all the GCM SSR simulations were re-projected into a 1 • × 1 • spatial resolution using bilinear interpolation since they have different spatial resolutions. This study chose 44 sites from three networks with more than ten months of available SSR data between 2000 and 2005. Four statistical parameters were used to evaluate the individual GCM: R, RMSE, bias, and nRMSE. All the calculations were performed based on the monthly averages of the six year (2000-2005) time series.
The calculation results of the bias, RMSE and R for the 48 GCM SSR simulations were labeled on the scatter plots (Figure 2), and the bias, RMSE, and R histograms for the 48 GCM SSR simulations are shown in Figure 3. The biases in the GCM SSR simulations compared with the ground observations for all sites vary from −30 W m −2 to 15 W m −2 and only slightly more than half (only 26 GCMs) of the GCMs overestimate the SSR. Overall, the GCMs do not show an obvious tendency to overestimate the SSR over high-latitude areas. The absolute bias values of 28 GCM simulations are within 5 W m −2 . The biases of 11 GCMs SSR simulations range from 5 to 15 W m −2 . The biases of 9 GCM SSR simulations are less than −5 W m −2 . The MIROC4h SSR simulations have the largest positive bias of 13.23 W m −2 , followed by the GISS-E2-R-CC and GISS-E2-R SSR simulations, which have approximately the same bias values. Among the GCMs that underestimate the SSR when compared to the ground observations, the CMCC-CESM SSR simulations have the largest negative bias of approximately −28.98 W m −2 , the largest RMSE and the smallest R. The underestimation of SSR over high-latitude areas may be due to an overestimation of the total cloud cover in this region [9,47].     To further assess the influence of the radiation ground site selections and the quality of the measurements on the evaluation of the SSR simulations over high-latitude areas, this study repeated the above analysis with the sites from GEBA, BSRN, and GC-NET, respectively. The RMSEs at the BSRN sites for the individual GCM SSR simulations range from 15 to 40 W m −2 . These RMSEs are typically smaller than those from the GEBA and the GC-Net sites. However, the absolute bias values averaged over the five BSRN sites for most GCM SSR simulations are not smaller than the absolute average bias values from the 18 GEBA and 21 GC-NET sites. The SSR is underestimated by the 48 GCMs for the BSRN sites by 1.23 W m −2 , which is close to the average bias for the GC-NET sites (−0.97 W m −2 ), while the average bias for the GEBA sites is 1.62 W m −2 . The R at the BSRN sites for the individual GCM SSR simulations are typically larger than those at the GEBA and the GC-Net sites. Meanwhile, the R averaged over the 18 GEBA sites for most GCMs SSR simulations are typically smaller than the average R for the five BSRN and 21 GC-NET sites.
To compare the RMSEs of each GCM SSR simulation more intuitively, the RMSE m and the nRMSEs were also calculated. In this analysis, the RMSE m was approximately 27.88 W m −2 . As shown in Figure 4, the GCMs SSR simulations with negative nRMSEs are all greater than −0.2, which illustrates that their RMSE are only slightly smaller than the RMSE m . The RMSE of the GCM SSR simulations with positive nRMSEs are obviously larger than RMSE m . Among the GCM SSR simulations with negative nRMSEs, the GFDL-CM3 SSR simulations have the smallest nRMSE of −0.15, followed by the ACCESS1.  To further assess the influence of the radiation ground site selections and the quality of the measurements on the evaluation of the SSR simulations over high-latitude areas, this study repeated the above analysis with the sites from GEBA, BSRN, and GC-NET, respectively. The RMSEs at the BSRN sites for the individual GCM SSR simulations range from 15 to 40 W m -2 . These RMSEs are typically smaller than those from the GEBA and the GC-Net sites. However, the absolute bias values averaged over the five BSRN sites for most GCM SSR simulations are not smaller than the absolute average bias values from the 18 GEBA and 21 GC-NET sites. The SSR is underestimated by the 48 GCMs for the BSRN sites by 1.23 W m -2 , which is close to the average bias for the GC-NET sites (−0.97 W m -2 ), while the average bias for the GEBA sites is 1.62 W m -2 . The R at the BSRN sites for the individual GCM SSR simulations are typically larger than those at the GEBA and the GC-Net sites. Meanwhile, the R averaged over the 18 GEBA sites for most GCMs SSR simulations are typically smaller than the average R for the five BSRN and 21 GC-NET sites.
To compare the RMSEs of each GCM SSR simulation more intuitively, the RMSEm and the nRMSEs were also calculated. In this analysis, the RMSEm was approximately 27.88 W m -2 . As shown in Figure 4

Evaluating the MME Method Results with the Ground Measurements and the CERES EBAF Retrievals
This study used the SMA and BMA methods to generate higher accuracy SSR estimations by combining 48 GCMs SSR simulations. Figure 5 shows the weights of the individual GCMs calculated by the BMA method. The weights vary across GCMs. The greatest contributor to the SSR ensemble is the GFDL-ESM2G which has the largest weight (0.0272), more than 30% larger than the priori weight (0.0208), followed by GFDL-ESM2M (0.0259), CESM1-CAM5.1.FV2 (0.0254), and HadCM3

Evaluating the MME Method Results with the Ground Measurements and the CERES EBAF Retrievals
This study used the SMA and BMA methods to generate higher accuracy SSR estimations by combining 48 GCMs SSR simulations. Figure 5 shows the weights of the individual GCMs calculated by the BMA method. The weights vary across GCMs. The greatest contributor to the SSR ensemble is the GFDL-ESM2G which has the largest weight (0.0272), more than 30% larger than the priori weight (0.0208), followed by GFDL-ESM2M (0.0259), CESM1-CAM5.1.FV2 (0.0254), and HadCM3 (0.0250). The weight of FGOALS-g2 is only 0.0123, which is about 40% lower than the priori weight, because of its relatively poor performance.
Remote Sens. 2019, 11, x FOR PEER REVIEW 12 of 24 (0.0250). The weight of FGOALS-g2 is only 0.0123, which is about 40% lower than the priori weight, because of its relatively poor performance. Based on Figure 6, obvious and expected improvements in R occur for the results of the BMA method; specifically, the bias is 0.10 W m -2 and the RMSE is 16.79 W m -2 , which are lower than those of the SMA method. Although Figure 6 shows that the estimations from the BMA and SMA methods have clear advantages over the CERES EBAF retrievals in reducing the biases, the estimations of the two methods neither reduce the RMSE nor increase the R compared with the CERES EBAF retrievals. The SSR estimations from the SMA and BMA methods were also compared with the CERES EBAF SSR retrievals over high-latitude areas as shown in Figure 7. Bilinear interpolations are applied, to make the spatial resolution of these gridded GCM SSR simulations with different spatial resolutions consistent with that of the CERES EBAF SSR retrievals with a 1° × 1° spatial resolution. Based on Figure 6, obvious and expected improvements in R occur for the results of the BMA method; specifically, the bias is 0.10 W m −2 and the RMSE is 16.79 W m −2 , which are lower than those of the SMA method. Although Figure 6 shows that the estimations from the BMA and SMA methods have clear advantages over the CERES EBAF retrievals in reducing the biases, the estimations of the two methods neither reduce the RMSE nor increase the R compared with the CERES EBAF retrievals.
Remote Sens. 2019, 11, x FOR PEER REVIEW 12 of 24 (0.0250). The weight of FGOALS-g2 is only 0.0123, which is about 40% lower than the priori weight, because of its relatively poor performance. Based on Figure 6, obvious and expected improvements in R occur for the results of the BMA method; specifically, the bias is 0.10 W m -2 and the RMSE is 16.79 W m -2 , which are lower than those of the SMA method. Although Figure 6 shows that the estimations from the BMA and SMA methods have clear advantages over the CERES EBAF retrievals in reducing the biases, the estimations of the two methods neither reduce the RMSE nor increase the R compared with the CERES EBAF retrievals. The SSR estimations from the SMA and BMA methods were also compared with the CERES EBAF SSR retrievals over high-latitude areas as shown in Figure 7. Bilinear interpolations are applied, to make the spatial resolution of these gridded GCM SSR simulations with different spatial resolutions consistent with that of the CERES EBAF SSR retrievals with a 1° × 1° spatial resolution. The SSR estimations from the SMA and BMA methods were also compared with the CERES EBAF SSR retrievals over high-latitude areas as shown in Figure 7. Bilinear interpolations are applied, to make the spatial resolution of these gridded GCM SSR simulations with different spatial resolutions consistent with that of the CERES EBAF SSR retrievals with a 1 • × 1 • spatial resolution.   to 2005, respectively. There are obvious geographical patterns in the differences over high-latitude areas, which may be attributed to the uncertainties in the representation of the cloud microphysical processes in the GCMs. From Figure 7a, the BMA estimations commonly show positive biases over inland Antarctica, the seas around Antarctica, and the seas of the Arctic Ocean (the Greenland Sea, the Labrador Sea, and the Barents Sea). However, negative biases are found in many other regions, particularly in the high-latitude areas of Eurasia and North America, the central part of the Arctic Ocean, much of Greenland, and the coast of Antarctica. The largest negative bias is found in Iceland, up to −36.67 W m −2 , and the largest positive bias is found in eastern seas near Antarctica, up to 26.70 W m −2 .
To directly compare the geographical distribution of the SMA results and the BMA results, we calculate the different SSR values between the SMA results and the BMA results (Figure 7c). The geographical differences between the BMA results and the SMA results are much more distinct than the statistical calculation results (Figure 6). From Figure 7c, relative to the SMA results, the BMA results yield lower SSR over almost all the high-latitude areas, except Greenland and coastal areas of Antarctica. Grids with large SSR differences are concentrated in the coastal regions of Antarctica and Greenland, Europe, and Iceland and the seas around them. This geographical dissimilarity is mainly attributed to the differences weights assigned by the BMA method to different GCM SSR simulations, and the BMA results are more accurate than the SMA results. Figures 8 and 9 show the seasonal geographical distributions of the differences between the BMA results and the CERES EBAF retrievals and between the SMA results and the CERES EBAF retrievals over high-latitude areas, respectively. It is obvious that the differences are typically larger over high-latitude areas during local spring (MAM in the Northern Hemisphere and SON in the Southern Hemisphere) when the sea ice begins to melt in summer (JJA in the Northern Hemisphere and DJF in the Southern Hemisphere). During local fall (SON in the Northern Hemisphere and MAM in the Southern Hemisphere) and winter (DJF in the Northern Hemisphere and JJA in the Southern Hemisphere), the situation is the opposite. Since the SMA and BMA difference results have similar seasonal geographical distributions, this study only analyses the results of the BMA method ( Figure 8). areas, which may be attributed to the uncertainties in the representation of the cloud microphysical processes in the GCMs. From Figure 7a, the BMA estimations commonly show positive biases over inland Antarctica, the seas around Antarctica, and the seas of the Arctic Ocean (the Greenland Sea, the Labrador Sea, and the Barents Sea). However, negative biases are found in many other regions, particularly in the high-latitude areas of Eurasia and North America, the central part of the Arctic Ocean, much of Greenland, and the coast of Antarctica. The largest negative bias is found in Iceland, up to −36.67 W m -2 , and the largest positive bias is found in eastern seas near Antarctica, up to 26.70 W m -2 .
To directly compare the geographical distribution of the SMA results and the BMA results, we calculate the different SSR values between the SMA results and the BMA results (Figure 7c). The geographical differences between the BMA results and the SMA results are much more distinct than the statistical calculation results (Figure 6). From Figure 7c, relative to the SMA results, the BMA results yield lower SSR over almost all the high-latitude areas, except Greenland and coastal areas of Antarctica. Grids with large SSR differences are concentrated in the coastal regions of Antarctica and Greenland, Europe, and Iceland and the seas around them. This geographical dissimilarity is mainly attributed to the differences weights assigned by the BMA method to different GCM SSR simulations, and the BMA results are more accurate than the SMA results. Figures 8 and 9 show the seasonal geographical distributions of the differences between the BMA results and the CERES EBAF retrievals and between the SMA results and the CERES EBAF retrievals over high-latitude areas, respectively. It is obvious that the differences are typically larger over high-latitude areas during local spring (MAM in the Northern Hemisphere and SON in the Southern Hemisphere) when the sea ice begins to melt in summer (JJA in the Northern Hemisphere and DJF in the Southern Hemisphere). During local fall (SON in the Northern Hemisphere and MAM in the Southern Hemisphere) and winter (DJF in the Northern Hemisphere and JJA in the Southern Hemisphere), the situation is the opposite. Since the SMA and BMA difference results have similar seasonal geographical distributions, this study only analyses the results of the BMA method ( Figure  8).   During spring in the Northern Hemisphere, the BMA method tends to overestimate the SSR over the Norwegian Sea, the Danish Strait, and the Davis Strait, and underestimate the SSR over the land and the central Arctic Ocean compared with the CERES EBAF. In the Southern Hemisphere, the SSR is overestimated over the Weddell Sea and the Ross Sea, and over the coastal areas of the Antarctic continent. During summer, the underestimation of the SSR is the most prominent over the central Arctic region, with local positive biases presented around the ocean near Greenland and the Canadian Arctic Archipelago and the ocean and land near the Bering Strait. The ocean near Greenland and the land of Eurasia and North America shows a widespread overestimation. Figure 8c,g shows the differences in the SSR between the BMA results and the CERES EBAF during fall. In the Northern Hemisphere, the central Arctic Ocean, Europe, Iceland, Alaska, and coast of Greenland are the areas with the most obvious underestimations. In the Southern Hemisphere, the south Atlantic and the Bellingshausen Sea show a large underestimation, while the rest of the ocean typically shows a large overestimation. The coast of Antarctica shows a large underestimation, while the central and eastern areas of Antarctica show a large overestimation. As shown in Figure 8d, the underestimation of the SSR during winter occurs primarily over the coasts. The BMA results over coasts are approximately 5 W m -2 and 20 W m -2 lower than the CERES EBAF retrievals in the Northern Hemisphere and the Southern Hemisphere, respectively. The overestimation of the SSR over ocean is approximately 5 W m -2 and 20 W m -2 larger than the CERES EBAF retrievals in the Northern Hemisphere and the Southern Hemisphere, respectively.
To further investigate the precision differences of the SSR estimates in the bright (spring and summer) and dark (fall and winter) seasons, we also compared the seasonal geographical distribution of the normalized differences between the MME results and the CERES EBAF retrievals from 2000 to 2005. The normalized values were obtained by the SSR estimates divided by the mean SSR for the respective season. It was found that the normalized differences between the MME results and CERES EBAF retrievals were smaller than those for the unnormalized differences in both the Northern and the Southern Hemispheres in each season, but such differences between the bright and dark seasons were still obvious. This large seasonal discrepancy indicated that the GCM SSR simulations exhibited different precision in the bright and dark seasons. During spring in the Northern Hemisphere, the BMA method tends to overestimate the SSR over the Norwegian Sea, the Danish Strait, and the Davis Strait, and underestimate the SSR over the land and the central Arctic Ocean compared with the CERES EBAF. In the Southern Hemisphere, the SSR is overestimated over the Weddell Sea and the Ross Sea, and over the coastal areas of the Antarctic continent. During summer, the underestimation of the SSR is the most prominent over the central Arctic region, with local positive biases presented around the ocean near Greenland and the Canadian Arctic Archipelago and the ocean and land near the Bering Strait. The ocean near Greenland and the land of Eurasia and North America shows a widespread overestimation. Figure 8c,g shows the differences in the SSR between the BMA results and the CERES EBAF during fall. In the Northern Hemisphere, the central Arctic Ocean, Europe, Iceland, Alaska, and coast of Greenland are the areas with the most obvious underestimations. In the Southern Hemisphere, the south Atlantic and the Bellingshausen Sea show a large underestimation, while the rest of the ocean typically shows a large overestimation. The coast of Antarctica shows a large underestimation, while the central and eastern areas of Antarctica show a large overestimation. As shown in Figure 8d, the underestimation of the SSR during winter occurs primarily over the coasts. The BMA results over coasts are approximately 5 W m −2 and 20 W m −2 lower than the CERES EBAF retrievals in the Northern Hemisphere and the Southern Hemisphere, respectively. The overestimation of the SSR over ocean is approximately 5 W m −2 and 20 W m −2 larger than the CERES EBAF retrievals in the Northern Hemisphere and the Southern Hemisphere, respectively.
To further investigate the precision differences of the SSR estimates in the bright (spring and summer) and dark (fall and winter) seasons, we also compared the seasonal geographical distribution of the normalized differences between the MME results and the CERES EBAF retrievals from 2000 to 2005. The normalized values were obtained by the SSR estimates divided by the mean SSR for the respective season. It was found that the normalized differences between the MME results and CERES EBAF retrievals were smaller than those for the unnormalized differences in both the Northern and the Southern Hemispheres in each season, but such differences between the bright and dark seasons were still obvious. This large seasonal discrepancy indicated that the GCM SSR simulations exhibited different precision in the bright and dark seasons.

Comparing and Evaluating the GCM Simulations and the MME Results with the CERES EBAF Retrievals
It has been reported that some of the CMIP5 GCMs have large biases when simulating cloud water and ice paths compared to satellite retrievals [36], and clouds can have a significant influence on the SSR simulation. To understand the differences between the GCMs and the CERES EBAF in estimating the SSR over high-latitude areas, this study calculated the bias, RMSE, R, and NSE values of each GCM simulations compared with the CERES EBAF retrievals, as shown in Figures 10 and 11. In addition, this study also used Taylor diagrams (Figures 12 and 13) to show the gaps among the SSR outputs from the individual GCMs, the SMA method and the BMA method and the CERES EBAF.

Comparing and Evaluating the GCM Simulations and the MME Results with the CERES EBAF Retrievals
It has been reported that some of the CMIP5 GCMs have large biases when simulating cloud water and ice paths compared to satellite retrievals [36], and clouds can have a significant influence on the SSR simulation. To understand the differences between the GCMs and the CERES EBAF in estimating the SSR over high-latitude areas, this study calculated the bias, RMSE, R, and NSE values of each GCM simulations compared with the CERES EBAF retrievals, as shown in Figures 10 and 11. In addition, this study also used Taylor diagrams (Figures 12 and 133) to show the gaps among the SSR outputs from the individual GCMs, the SMA method and the BMA method and the CERES EBAF.     In the Northern Hemisphere, as shown in Figure 10a, most of the GCM SSR simulations exhibit relatively greater RMSE values in boreal summer than those in other seasons. A total of 39 out of 48 GCM SSR simulations show an underestimation tendency compared with ground measurements in summer (JJA). The possible reason may be that the GCMs have difficulties in capturing the insulating effect of clouds, since cloud properties represented by GCM SSR estimation models are one of the most important factors in regulating the estimated SSR. The SSR simulations from BMA method have the smallest RMSE values compared to any SSR estimates from individual GCM and the SMA method. Figure 10b shows that the absolute bias values of GCM SSR simulations are relatively smaller in fall (SON) and winter (DJF), but greater in summer (JJA). Figure 10c illustrates the R values of the SSR estimates. It is obvious that the R values for each GCM and the MME method are relatively small in summer (JJA). As shown in Figure 10d In the Northern Hemisphere, as shown in Figure 10a, most of the GCM SSR simulations exhibit relatively greater RMSE values in boreal summer than those in other seasons. A total of 39 out of 48 GCM SSR simulations show an underestimation tendency compared with ground measurements in summer (JJA). The possible reason may be that the GCMs have difficulties in capturing the insulating effect of clouds, since cloud properties represented by GCM SSR estimation models are one of the most important factors in regulating the estimated SSR. The SSR simulations from BMA method have the smallest RMSE values compared to any SSR estimates from individual GCM and the SMA method. Figure 10b shows that the absolute bias values of GCM SSR simulations are relatively smaller in fall (SON) and winter (DJF), but greater in summer (JJA). Figure 10c illustrates the R values of the SSR estimates. It is obvious that the R values for each GCM and the MME method are relatively small in summer (JJA). As shown in Figure 10d In the Southern Hemisphere, as shown in Figure 11a, the SSR evaluation results are similar to those in the Northern Hemisphere in the local spring, fall, and winter seasons. The biases shown in Figure 11b indicate that the underestimation tendency in the Northern Hemisphere does not exist in Southern Hemisphere. As shown in Figure 11c, the R values of the SSR estimates are greater than those in Northern Hemisphere, especially in summer. In both Northern and Southern Hemispheres, the BMA method SSR simulations have greater NSE values than any individual GCM and the SMA method SSR simulations.
In both the Hemispheres, the Taylor diagrams (Figures 12 and 13) show that the MME methods generally have better performance in simulating the SSR than the individual GCMs, except in local winter. According to these figures, it is obvious that the distribution of the spots is relatively scattered in local summer. It indicates that the differences in uncertainty of GCM SSR simulations are greater in local summer than in other seasons for both Northern and Southern Hemispheres. Additionally, the performance of the CERES EBAF SSR retrievals is not always the best among the SSR estimations from individual GCMs, SMA method, and BMA method in four seasons (Figures 12 and 13). Therefore, we conclude that the BMA method exhibited relatively better performance in simulating SSR compared to individual GCMs and the SMA method.

Discussion
In this study, we evaluated the SSR simulations and intercompared the performance of both the SMA and BMA methods for estimating the SSR in high-latitude areas using quality-controlled surface observations at 44 sites from three SSR observation networks (BSRN, GC-NET, and GEBA) and the CERES EBAF SSR retrievals from 2000 to 2005.
The evaluation results indicated that the precision of each GCM SSR simulations were quite different and GCMs with higher spatial resolutions did not guarantee better performance in simulating SSR. Relative to ground observations, the bias, RMSE, and R values of the SSR simulations ranged from −30 W m −2 to 15 W m −2 , 20 W m −2 to 50 W m −2 and 0.94 to 0.99, respectively. Previous studies showed that the GCM SSR simulations exhibit a tendency toward excessive SSR at the Earth's surface [11]. However, the validation results indicated that the GCM SSR simulations did not show an obvious tendency to overestimate the SSR over high-latitude areas in this study. Only slightly more than half of the GCMs (26 out of the 48 GCMs) overestimated the SSR over high-latitude areas when compared with the ground observations, although some discrepancies between the current GCM SSR simulations and ground measurements still existed over high-latitude areas. Cloud and aerosol properties represented by the GCMs are two important factors in regulating the estimated SSR [9]. The GCMs have been reported to overestimate aerosol optical thickness [48] and total cloud cover over high-latitude areas [9,36,47]. Thus, the excessive total cloud cover and aerosol optical thickness possibly contribute to the offset of the overestimation of the SSR in many of the GCMs. More data including the parameters related to cloud and aerosol are needed to conduct further investigation over high-latitude areas.
Besides cloud and aerosols, measurement errors, such as instrument replacement and drift and spatial representativeness of ground measurements were also potential error sources of SSR evaluations. For example, the monthly representation errors at the surface sites with respect to their 1 • surroundings are on average 3.7% (4 W m −2 ) [49]. The GCMs have different spatial resolution that varies from 0.56 • × 0.56 • to 3.75 • × 3.75 • which might cause certain biases for the SSR evaluation.
The function of the BMA method for the improvements of the SSR estimations was estimated based on the change in statistical calculations and Taylor diagrams. The performance of the BMA method was superior to that of any single GCM in simulating the SSR over high-latitude areas, especially during summer. Our results also showed that the use of different weights obtained by the BMA method for each GCM based on its performance can be a good alternative to the SMA method.
The BMA simulations were also compared with the CERES EBAF retrievals, and there were noteworthy geographical patterns in the differences between them over high-latitude areas.
Generally, the MME methods where the GCM weights are determined by the GCMs prior performance performs better than the SMA method, e.g., [24,[50][51][52][53][54]. Some studies have also obtained similar estimates generated by different MME methods (e.g., [24,55]), which partially reflects the alleged "equifinality" in which different combinations of GCM weights produce identical fit to the ground measurements. The performance of BMA method may be influenced by the training sample sizes. Although the advantage of BMA method was not apparent in this study, the BMA method provides a new option to generate more accurate SSR simulations through the entire ensemble of the model first considered.