Evaluation of Historical CMIP5 GCM Simulation Results Based on Detected Atmospheric Teleconnections

Atmospheric teleconnections are characteristic to the climate system and exert major impacts on the global and regional climate. Accurate representation of teleconnections by general circulation models (GCMs) is indispensable given their fundamental role in the large scale circulation patterns. In this study a statistical method is introduced to evaluate historical GCM outputs of the fifth phase of the Coupled Model Intercomparison Project (CMIP5) with respect to teleconnection patterns. The introduced method is based on the calculation of correlations between gridded time series of the 500 hPa geopotential height fields in the Northern Hemisphere. GCMs are quantified by a simple diversity index. Additionally, potential action centers of the teleconnection patterns are identified on which the local polynomial regression model is fitted. Diversity fields and regression curves obtained from the GCMs are compared against the NCEP/NCAR Reanalysis 1 and the ERA-20C reanalysis datasets. The introduced method is objective, reproducible, and reduces the number of arbitrary decisions during the analysis. We conclude that major teleconnection patterns are positioned in the GCMs and in the reanalysis datasets similarly, however, spatial differences in their intensities can be severe in some cases that could hamper the applicability of the GCM results for some regions. Based on the evaluation method, best-performing GCMs can be clearly distinguished. Evaluation of the GCMs based on the introduced method might help the modeling community to choose GCMs that are the most applicable for impact studies and for regional downscaling exercises.


Introduction
Atmospheric teleconnections are persistent, large-scale phenomena that link the variability of meteorological parameters across large geographical distances. Teleconnections are studied by the identification of distant geographical areas with co-varying meteorological state variables and are visualized by teleconnection patterns. The most intense regions of the teleconnections are called action centers [1][2][3]. Patterns of the large-scale atmospheric circulation in the Northern Hemisphere (hereinafter referred to as NH) were identified by Wallace and Gutzler [1] by examining teleconnectivity maps. These maps were obtained from the calculation of empirical correlations of gridded geopotential height datasets at the isobaric surface of the 500 hPa field (Z500). Horel [2] applied rotated principal NorESM1-M [46,47] Norwegian Climate Centre, Norway 2.5 • × 1.9 • 1 Experiment r1i1p1 was used except in the case of CCSM4 (where only r6i1p1 is available). In the nomenclature of Intergovernmental Panel on Climate Change (IPCC) the abbreviation r stands for the realization, i for the initialization and p for the physics. The letters r, i, and p are followed by numbers that represent different simulations.

Identification of the Teleconnections
Teleconnections are defined as geographical areas with associated geopotential height time series that are negatively correlated with each other [1,3]. Identification of teleconnections is performed by the analysis of correlations between the time series of the Z500 field using the teleconnection method [1,48].
We applied our analysis on the basis of empirical Pearson correlation coefficients (hereinafter referred to as correlations), which are computed as follows. First, each of the Z500 time series was correlated with all the other Z500 time series. As a result, 37 × 144 correlation values were obtained concerning each grid cell from which the minimum value was selected and was referred as the strongest negative correlation [1,48]. Those were concentrated in several centers (i.e., local minima) as it is demonstrated in Figure 1 with blue-shaded dots. Then, pairs of grid cells were selected with identical strongest negative correlations indicating a one-to-one correspondence (i.e., mutually unambiguous correspondence between the selected grid cells). Each pair of grid cells connected two local minima. If the associated correlation value was statistically significant the pair can be the potential action center (hereinafter abbreviated as PotAC) of a single teleconnection. PotACs are also exemplified in Figure 1 with red circles, which are connected with red lines. In our analysis teleconnections were represented with PotACs, therefore teleconnections were considered as dipoles. Consequently, each grid cell of the PotAC represents a single pole of the teleconnection hereinafter referred to as pole.
To decide whether correlations were statistically significant or not, a permutation test was carried out. According to this method, time series of the Z500 field associated with each grid point in a PotAC were replaced with randomly generated data series. Subsequently, correlations with all the other time series were computed again. The limit of statistically significant correlations was determined by examining the distribution of the randomly generated correlations. This test was repeated 1000 times in the case of each pole. Two sets of random data series were constructed. In the case of the first set, we assumed that the original Z500 time series were the realizations of normally distributed random variables, therefore datasets were generated, which follow normal distribution. The parameters of the distribution were estimated by using the method of moments. In the case of the second set, the significance test was conducted on datasets, which were the random permutation of the original Atmosphere 2020, 11, 723 5 of 22 time series. The results suggest that strongest negative correlations below −0.3 could be considered to be statistically significant at a significance level of 0.1%. visible in Figure 1. For the sake of brevity, the term strongest negative correlation hereinafter implies statistically significant correlations. Global correlation maps including PotACs for 30-year-long periods were exemplified by the NCEP-NCAR R1 in Figure S1 while correlation maps concerning only the NH for both reanalyses are shown in Figure S2 of the Supplementary Material. The similarity of Figure 1 and Figure S2b suggest that the applied method was less dependent on the selected area.
An investigation was carried out in the case of each reanalysis and GCM for all periods listed above in Section 2.1. It means the analysis of 21 × 6 cases for the 30-year-long periods and the examination of 21 × 10 cases for the 10-year-long periods.  Note that our analysis focused only on the NH because majority of the grid cells with statistically significant strongest negative correlations can be found here. This phenomenon is clearly visible in Figure 1. For the sake of brevity, the term strongest negative correlation hereinafter implies statistically significant correlations. Global correlation maps including PotACs for 30-year-long periods were exemplified by the NCEP-NCAR R1 in Figure S1 while correlation maps concerning only the NH for both reanalyses are shown in Figure S2 of the Supplementary Material. The similarity of Figure 1 and Figure S2b suggest that the applied method was less dependent on the selected area.
An investigation was carried out in the case of each reanalysis and GCM for all periods listed above in Section 2.1. It means the analysis of 21 × 6 cases for the 30-year-long periods and the examination of 21 × 10 cases for the 10-year-long periods.

Evaluation of the GCMs
In this study we evaluated the GCM performance by comparing historical model outputs with the results obtained from the two reanalysis datasets. Evaluation of the GCMs consists of the following two steps: (1) the comparison of the distribution of the strongest negative correlations using the temporal stability method, and (2) the analysis of spatiotemporal distribution of the PotACs.

GCM Evaluation: Stability Patterns
The first step of the GCM evaluation is the comparison of the distribution of the strongest negative correlations obtained from each GCM with those obtained from the reanalyses. An information entropy based method was applied, therefore a simple diversity index was constructed to quantify uncertainty. Details about the measurement of information entropy can be found in Shannon [49], Simpson [50], and Rényi [51]. In the case of each reanalysis and GCM with respect to each grid cell, the number of periods was determined when the strongest negative correlations were statistically significant. We restricted the analysis to the examination of strongest negative correlations below the 25th percentile of their distribution, which means that values below the 25th percentile were summarized in the case of 30-year-long and 10-year-long periods, respectively. This procedure enabled us to distinguish regions in the NH where the presence of teleconnections was considered to be stable in time. Furthermore, it facilitated the identification of grid cells containing substantial amount of information within the analyzed field. From this point of view, grid cells were the most common in Atmosphere 2020, 11, 723 6 of 22 which correlations below the 25th percentile never appeared while the most important grid cells were those where correlations below 25th percentile appear six (ten) times in the case of the six 30-year-long (ten 10-year-long) periods. The above defined diversity index was visualized in the form of maps and were referred to as stability patterns.
We assumed that GCMs did not perform equally well in every region relative to the reanalysis datasets. In order to take into account these GCM related differences and to avoid the arbitrary selection of the study region, Monte Carlo simulations were carried out based on the stability patterns. At first, the same 1% of the grid cells were selected randomly from each GCM and reanalysis dataset. Then, root-mean-square error values (RMSE) were computed between the selected GCM and the reanalyses to quantify their similarities/dissimilarities. To get significant results, this method was repeated 500,000 times. As a result, 500,000 × 19 RMSE values were calculated to evaluate each GCM with respect to each reanalysis dataset, both in the cases of 30-year-long and 10-year-long periods. Consequences are drawn from the distribution of those RMSE values. If the performance of a GCM (quantified by the RMSE) depends on the selection of the study region, the range between the minimum and maximum of its RMSE values will be larger.

GCM Evaluation: Loess
The second step of the GCM evaluation is the analysis of the distribution of the PotACs with the application of the local polynomial regression technique (locally estimated scatterplot smoothing, abbreviated as loess; [52]). Some GCMs might reproduce the spatial dimensions of the teleconnections accurately while some might fail to locate the PotACs accurately in comparison with the reanalyses. Since PotAC, which are pairs of grid cells, are typically aligned north-to-south in the reanalyses, the southerly and northerly located poles could be considered as parts of two hypothetical belts that surround the NH. Though the poles were not considered to be the realizations of continuous random variables, due to their large number in the case of the 10-year-long periods, regression curves fitted on them could be compared to each other. In this approach these regression curves were used to evaluate GCMs.
To fit regression curves, PotACs were used from all 10-year-long periods where the predictand (predictor) was the realization of geographic latitudes (longitudes). In accordance with Tukey's method [53] based on the interquartile range (IQR), data were considered as outliers and were omitted when they fall outside the range defined by the IQR of the latitudes ±1.5·IQR. After that, the values of the predictand, which belong to the same longitude were averaged. The span parameter was estimated by using the bias-corrected version of the Akaike information criterion [54].
Regression curves fitted on the PotACs obtained from the historical outputs of the GCMs were compared to the regression curves obtained from the reanalyses by calculating RMSE over a common area. To cover the largest possible area in the NH, regression curves were fitted on a map with an Atlantic view (from 177.5 • W eastward to 175 • E) and on another map with a Pacific view (from 0 • eastward to 2.5 • W). With this method a total of four regression models were fitted for each GCM and reanalysis dataset: two on the southerly located PotAC poles and two on the northerly located PotAC poles.
Based on the objective metrics described above, GCMs were ranked in terms of performance relative to the reanalysis datasets. Climate Data Operators [55] were used to pre-process data including to apply bilinear interpolation on the gridded datasets. Computation and visualization were carried out in the R statistical computing software version 3.6.1 [56]. Functions of the ncdf4 R package [57] were applied to handle netCDF files while the following R packages were used to plot the data (polar stereographic maps are created by the functions matrix.poly and val2col downloaded from the websites: http://menugget.blogspot.com/2012/04/create-polygons-from-matrix.html and http: //menugget.blogspot.com/2011/09/converting-values-to-color-levels.html. Last accessed on 04.07.2020.): fields [58], maps [59], maptools [60], mapproj [61], and RColorBrewer [62]. The loess method was applied by the functions of the fANCOVA package [63].

Teleconnections in the Reanalyses and the GCMs
According to the reanalyses, teleconnections can be identified in the NH in each examined time period over the North Pacific Ocean, over the North Atlantic Ocean, and in the region of the West Siberian Plain and the Gobi Desert (referred to as Asia). Teleconnections can be detected in almost all periods in the region of the Mediterranean Sea and the Red Sea (referred to as the Mediterranean region; Figure 1 and Figures S1 and S2 in the Supplementary Material). The most intense correlations were associated with grid cells over the North Pacific Ocean while the weakest correlations were detected in the Mediterranean region. This leads to less-pronounced local minima in the Mediterranean region if time-averaged strongest negative correlation fields were plotted ( Figure S3 of the Supplementary Material). The two reanalysis datasets provided teleconnections with similar magnitudes, which was the consequence of the similarly distributed correlations exemplified by histograms in Figure S4 of the Supplementary Material. However some differences in the positions of the teleconnections could be observed especially in the Euro-Atlantic region (Supplementary Material Figure S2).
In the case of the GCMs, the strongest negative correlations (i.e., local minima) could be detected in similar areas as in the reanalyses (Supplementary Material Figure S3; all PotACs obtained from the GCMs are shown in Figure S5 in the Supplementary Material). However the magnitudes of those correlations were different in the case of some GCMs. With the exception of the HadGEM2-AO, the strongest negative correlation fields of the GCMs and the reanalyses were associated with median values between −0.3 and −0.35 regarding the 30-year-long periods while those were between −0.3 and −0.4 concerning the 10-year-long periods. The main difference was observable in the shape of the distribution of the strongest negative correlations. According to Supplementary Material Figure S4, the strongest negative correlations had left-skewed distributions in the reanalyses, which was less pronounced in some GCMs. As a consequence, GCMs underestimate the intensity of teleconnections relative to the reanalyses. The HadGEM2-AO was exceptional because it produced remarkably stronger correlations. The associated medians were below −0.4 for the 30-year-long periods and −0.6 for the 10-year-long periods. Figure 2 shows the spatial distribution of the number of 30-year-long (10-year-long) periods in which the strongest negative correlation was below the 25th percentile of the correlations in a given grid cell for selected GCMs and for the two reanalyses (note that the application of the 25th percentile means variable threshold for the reanalyses and also for the GCMs; see Section 2.2.2). Selected GCMs that are presented in Figure 2 were representative to the whole set of GCMs. (similar maps for all GCMs are shown in Figure S6 of the Supplementary Material) Stability of the teleconnection was increasing with an increasing number of time periods. A small, non-zero number in a specific grid cell indicates temporal changes in the intensity of the teleconnection in that cell, while zero means that teleconnection did not occur in any period (according to the method used in this study).

Comparison of the Distribution of Stability Patterns
According to the stability patterns of the reanalyses presented in Figure 2, the most stable teleconnection was located in the region of the North Pacific Ocean. The second most stable could be detected in the North Atlantic Ocean. However, due to the eastward shift of the most intense regions of the teleconnection (Figures S1 and S2 in the Supplementary Material), the phenomenon shows more stability in the north-eastern part of the Atlantic Ocean. A stable but smaller teleconnection can be found in Asia. The least stable teleconnection was located in the Mediterranean region with correlations often not reaching −0.4 in this area.  Difference maps were created based on the stability maps to get further insight into the similarity/dissimilarity of the GCMs and the reanalyses, and also of the two reanalyses. Difference maps for selected GCMs are presented in Figure 3 while all maps can be found in Figure S7 of the Supplementary Material. In the followings we presented cases where differences between the two datasets were remarkably large. Concerning the 30-year-long periods (Figure 3a), over the south-eastern part of the North Pacific Ocean the teleconnection was more stable in the NCEP-NCAR R1 relative to the ERA-20C dataset. On the contrary, in the ERA-20C, teleconnections were more stable over a larger area in Asia and in the Mediterranean region compared to the NCEP-NCAR R1. In the case of the 10-year-long periods (Figure 3b) similar differences could be detected over the North Pacific Ocean as in case of the 30-year-long periods. However, over the eastern part of the North Atlantic Ocean, in the ERA-20C the number of periods with the strongest negative correlation below the 25th percentile was larger in comparison with the NCEP-NCAR R1 while the opposite was true for the western part of that region. This indicates a more easterly located teleconnection in the ERA-20C relative to the NCEP-NCAR R1.
In case of the GCMs similar consequences could be drawn from the examination of the 30-year-long and the 10-year-long periods. The GCMs typically reproduced the teleconnections in similar geographical areas as the reanalyses, but there were significant differences concerning their intensities and the position of the most intense regions ( Figure 3). The GCMs captured the most stable regions of the teleconnection over the North Pacific Ocean relatively well, though there was a tendency to locate it north-eastward relative to the reanalyses. Major discrepancies were noticed over the North Atlantic Ocean. Several GCMs, e.g., the ACCESS1-0 and the HadGEM2-CC, reproduced the most stable part of the teleconnection over the western part of the North Atlantic Ocean while in some GCMs, e.g., the MPI-ESM-LR, the MPI-ESM-MR and the MPI-ESM-P, the teleconnection is captured with smaller intensity as in the reanalyses. In Asia more than half of the GCMs underestimate the stability of the teleconnection. The representation of the teleconnection in the Mediterranean region is considered to be poor. The GCMs tended to produce weaker correlations than the reanalyses with the exception of the MIROC5, the NorESM1-M, and the HadGEM2-AO. The main areas with the discrepancies in the cases of the MIROC5 and the NorESM1-M were the North Pacific region and the subtropical areas. As it was mentioned, correlation field associated with the HadGEM2-AO was remarkably intense compared to any other reanalyses/GCMs (Supplementary Material Figure S3). However, red-shaded areas in Figure 3 imply an underestimation of teleconnections relative to the reanalyses. The reason behind this finding is the following. The HadGEM2-AO did not capture the structure of teleconnections, which were detected in the reanalyses and in other GCMs, but teleconnections covered almost the entire NH. As a consequence, the correlations were not as intense in those regions where teleconnections were detected in the reanalyses.

Comparison of Stability Patterns Based on Monte Carlo Simulations
In order to quantitatively analyze the performance of the models, the stability pattern of each GCM was compared with the stability patterns of the two reanalyses. As a reference the two reanalyses were also compared to each other (quantiles of the calculated RMSE values can be found in Table S1 of the Supplementary Material). We distinguished four cases depending on the chosen reanalysis and the length of the time periods analyzed. Results of the Monte Carlo simulations are summarized in the form of boxplots in Figure 4.  median of 1 (Figure 4; red dots). The RMSE median obtained from all GCMs was 2 and 2.5 based on the examination of 30-year-long and 10-year-long periods, respectively (red lines in Figure 4). Consequently, the analyzed GCMs differed from the reanalyses with 2-2.5 time periods on average, which was higher than the RMSE median associated with the reanalyses. In all four cases, the RMSE median of the ACCESS1-0, the CMCC-CM, the CMCC-CMS, the GFDL-ESM2M, the HadGEM2-CC, the MPI-ESM-LR, and the MPI-ESM-P were below the RMSE median obtained from all GCMs. For each reanalysis/GCM pairs the difference between the RMSE median obtained from the Monte Carlo simulations and the RMSE value calculated for the entire NH was negligibly small (<0.05). Consequently, the median values could not be used to quantify area-sensitivity of the GCMs. For that purpose, the range between the maximum and minimum values of the distributions of the RMSE values-hereinafter referred to as the RMSE range-were applied. RMSE ranges associated with the reanalyses were remarkably smaller than those associated with the GCMs. There was no simulation among the half-million in which the RMSE exceeded two while the RMSE range for the GCMs was three on average in all four cases presented in Figure 4. With a few exceptions, the RMSE ranges increased with increasing RMSE median. Note that a relatively small range did not If the median values of the RMSE values-henceforth referred to as the RMSE median-are examined then the largest similarities are observable between the two reanalyses with an RMSE median of 1 (Figure 4; red dots). The RMSE median obtained from all GCMs was 2 and 2.5 based on the examination of 30-year-long and 10-year-long periods, respectively (red lines in Figure 4). Consequently, the analyzed GCMs differed from the reanalyses with 2-2.5 time periods on average, which was higher than the RMSE median associated with the reanalyses. In all four cases, the RMSE median of the ACCESS1-0, the CMCC-CM, the CMCC-CMS, the GFDL-ESM2M, the HadGEM2-CC, the MPI-ESM-LR, and the MPI-ESM-P were below the RMSE median obtained from all GCMs.
For each reanalysis/GCM pairs the difference between the RMSE median obtained from the Monte Carlo simulations and the RMSE value calculated for the entire NH was negligibly small (<0.05). Consequently, the median values could not be used to quantify area-sensitivity of the GCMs. For that purpose, the range between the maximum and minimum values of the distributions of the RMSE values-hereinafter referred to as the RMSE range-were applied. RMSE ranges associated with the reanalyses were remarkably smaller than those associated with the GCMs. There was no simulation among the half-million in which the RMSE exceeded two while the RMSE range for the GCMs was three on average in all four cases presented in Figure 4. With a few exceptions, the RMSE ranges increased with increasing RMSE median. Note that a relatively small range did not necessarily imply good overall GCM performance in all areas, which can be exemplified by the HadGEM2-AO (Figure 4). In the worst cases the RMSE range exceeded 3.5 while the largest RMSE values even reached or exceeded the RMSE value of 4 (e.g., in the case of the MRI-ESM1; Figure 4).
The variability of the RMSE values (therefore the RMSE range) was larger when 10-year-long periods were examined. Therefore, we avoided the comparison of RMSE values against each other obtained from the analysis of time periods with different lengths. To get further information regarding the sensitivity of the GCMs with respect to the selected area/reanalysis, the GCMs were plotted on the two-dimensional space defined by the RMSE medians and the RMSE ranges. The results are presented in Figure 5 for the four cases presented in Figure 4. In general, the distribution of the GCMs shows a large similarity in all four cases. However, some discrepancies can be noticed. The CMCC-CMS performs exceptionally well in all four cases with relatively small RMSE medians and RMSE ranges. Two other GCMs performed similarly well as the CMCC-CMS, namely the MPI-ESM-LR on shorter time scales and the HadGEM2-CC when the ERA-20C was used as a reference dataset. Largest discrepancies relative to the reanalyses were associated with the HadGEM2-AO and the MRI-ESM1, which were considered as offscale GCMs in Figure 5.
necessarily imply good overall GCM performance in all areas, which can be exemplified by the HadGEM2-AO (Figure 4). In the worst cases the RMSE range exceeded 3.5 while the largest RMSE values even reached or exceeded the RMSE value of 4 (e.g., in the case of the MRI-ESM1; Figure 4).
The variability of the RMSE values (therefore the RMSE range) was larger when 10-year-long periods were examined. Therefore, we avoided the comparison of RMSE values against each other obtained from the analysis of time periods with different lengths. To get further information regarding the sensitivity of the GCMs with respect to the selected area/reanalysis, the GCMs were plotted on the two-dimensional space defined by the RMSE medians and the RMSE ranges. The results are presented in Figure 5 for the four cases presented in Figure 4. In general, the distribution of the GCMs shows a large similarity in all four cases. However, some discrepancies can be noticed. The CMCC-CMS performs exceptionally well in all four cases with relatively small RMSE medians and RMSE ranges. Two other GCMs performed similarly well as the CMCC-CMS, namely the MPI-ESM-LR on shorter time scales and the HadGEM2-CC when the ERA-20C was used as a reference dataset. Largest discrepancies relative to the reanalyses were associated with the HadGEM2-AO and the MRI-ESM1, which were considered as offscale GCMs in Figure 5.

Comparison of Loess-Based Regression Applied on the PotACs
Loess regression curves fitted on the PotACs in the case of the reanalyses are shown in Figure 6 while regression curves of selected GCMs and their comparison against the reanalyses can be found in Figure 7. Maps for all GCMs are presented in Figure S8 of the Supplementary Material. Parameters of the loess models (span parameter, explained variance (EV), and residual sum of squares (RSS)) are presented in Table S2 of the Supplementary Material. We would like to stress here that the PotACs were not supposed to be present in the fitted curve in all locations (i.e., their Figure 5. Characterization of GCMs in terms of their calculated RMSE medians and RMSE ranges in the case of (a,b) the 30-year-long periods and (c,d) the 10-year-long periods. The reference dataset is the ERA-20C in (a,c) and the NCEP-NCAR R1 in (b,d). GCMs with associated RMSE median below the RMSE median obtained from all GCMs in all four cases are denoted with triangles. The HadGEM2-AO and the MRI-ESM1 are considered as offscale GCMs, therefore those are omitted from (a,b) and from (c), respectively.

Comparison of Loess-Based Regression Applied on the PotACs
Loess regression curves fitted on the PotACs in the case of the reanalyses are shown in Figure 6 while regression curves of selected GCMs and their comparison against the reanalyses can be found in Figure 7. Maps for all GCMs are presented in Figure S8 of the Supplementary Material. Parameters of the loess models (span parameter, explained variance (EV), and residual sum of squares (RSS)) are presented in Table S2 of the Supplementary Material. We would like to stress here that the PotACs were not supposed to be present in the fitted curve in all locations (i.e., their geographical areas were localized rather than continuous in space). We used the fitting technique to quantify the similarity/dissimilarity between the observation based and the modeled circulation patterns.
In the case of the reanalyses the PotAC poles surround the planet in two hypothetical belts (hereinafter referred to as southern and northern belts), which are more pronounced above oceanic areas compared to continental areas due to the larger number of PotACs over the oceans as it can be seen in Figure 6. Each of the loess regression curves represents one of those belts. In general, the distance between the southerly and the northerly located regression curves was smaller in continental regions than above the oceans. The reanalyses were rather similar with RMSE values between 2.9 • and 4.4 • (RMSE values are presented in Table S3 of the Supplementary Material). The agreement between the two reanalyses was larger in the Euro-Atlantic region than in the region of the North Pacific Ocean and over North America. The above-mentioned areas were regions where teleconnections were detected, consequently the application of maps with an Atlantic and Pacific view that are centered on those regions was highly beneficial. The RMSE value between the loess regression curves centered on the North Pacific Ocean reached 4.4 • in the southern belt. Smaller but remarkable differences could be found in the Middle East region. Regarding the northern belt smaller discrepancies could be detected with RMSE values between 2.9 • and 3.1 • . The largest difference was found over Europe.
In the case of the GCMs, generally higher EVs can be computed for the loess models fitted in the southern belt than in the northern belt. Relatively low EV values were associated with the distribution of the PotAC poles, which were more scattered in some GCMs than in the reanalyses. A good example is the MPI-ESM-P with respect to the southern belt or the HadGEM2-AO for which the NH was mostly covered by PotACs ( Figure S5 in the Supplementary Material). Due to this anomalous behavior those GCMs were unable to reproduce the shape of the loess regression curves obtained from the reanalyses.
RMSE values associated with each GCM and medians for all GCMs are presented in Figure 8. We chose the best-performing GCMs from those that had EVs at least around 50% for all four loess models (i.e., models fitted on the Atlantic and Pacific view in the southern and northern belts). The CMCC-CMS and the MPI-ESM-MR were those GCMs that met the above-mentioned criterion and were among the best-performing 50% of the GCMs in all cases based on their RMSE values. According to their associated RMSE values, the MPI-ESM-P and the CMCC-CM could also be considered as best-performing GCMs, though their associated EVs were around or below 30% at least in one case, therefore the quality of those loess models were considered as poor. It is worth mentioning that the CNRM-CM5 was unique among the examined GCMs because its RMSE values obtained from the two belts were relatively close to each other in most cases. This implies good performance in both belts.
Atmosphere 2020, 11, x FOR PEER REVIEW 14 of 24 geographical areas were localized rather than continuous in space). We used the fitting technique to quantify the similarity/dissimilarity between the observation based and the modeled circulation patterns.
In the case of the reanalyses the PotAC poles surround the planet in two hypothetical belts (hereinafter referred to as southern and northern belts), which are more pronounced above oceanic areas compared to continental areas due to the larger number of PotACs over the oceans as it can be seen in Figure 6. Each of the loess regression curves represents one of those belts. In general, the distance between the southerly and the northerly located regression curves was smaller in continental regions than above the oceans. The reanalyses were rather similar with RMSE values between 2.9° and 4.4° (RMSE values are presented in Table S3 of the Supplementary Material). The agreement between the two reanalyses was larger in the Euro-Atlantic region than in the region of the North Pacific Ocean and over North America. The above-mentioned areas were regions where teleconnections were detected, consequently the application of maps with an Atlantic and Pacific view that are centered on those regions was highly beneficial. The RMSE value between the loess regression curves centered on the North Pacific Ocean reached 4.4° in the southern belt. Smaller but remarkable differences could be found in the Middle East region. Regarding the northern belt smaller discrepancies could be detected with RMSE values between 2.9° and 3.1°. The largest difference was found over Europe. In the case of the GCMs, generally higher EVs can be computed for the loess models fitted in the southern belt than in the northern belt. Relatively low EV values were associated with the distribution of the PotAC poles, which were more scattered in some GCMs than in the reanalyses. A good example is the MPI-ESM-P with respect to the southern belt or the HadGEM2-AO for which the NH was mostly covered by PotACs ( Figure S5 in the Supplementary Material). Due to this anomalous behavior those GCMs were unable to reproduce the shape of the loess regression curves obtained from the reanalyses.  In the case of the first column see Figure 6 for the explanation of the symbols and colors.

Figure 7.
Loess regression curves fitted on the PotACs of selected GCMs (first column), and comparison of the curves with the loess regression curves obtained from the reanalyses (second column). In the case of the first column see Figure 6 for the explanation of the symbols and colors.
Atmosphere 2020, 11, x FOR PEER REVIEW 17 of 24 As the reanalyses show significant differences in the above-mentioned regions in the NH, the similarity between some GCMs and the given reanalysis was higher than the similarity between the two reanalyses, which was never observable during the analysis of the stability patterns in Section 3.2.  view (b,d). The ERA-20C is used as a reference dataset in (a,b), while the NCEP-NCAR R1 is used in (c,d). Blue and red dots represent RMSE values represented by the northerly and the southerly poles, respectively, while grey squares indicate the average RMSEs. Red, blue, and grey straight lines denote the median GCM-performance (expressed by RMSE). GCMs with an EV at least around 50% are denoted with red, blue, and grey asterisks with respect to the southerly poles, the northerly poles and both poles of the PotACs, subsequently. The GCMs are ranked in decreasing order according to average RMSE values.

Synthesis
Based on the comparison of the results of the stability patterns and the loess regression curves, the CMCC-CMS was the only GCM that performed well with respect to both aspects (i.e., the stability patterns and loess regression curves) for both the 30-year and 10-year periods. The MPI-ESM-LR, the MPI-ESM-MR, and the MPI-ESM-P performed relatively well for the 10-year-long periods, however not for both aspects. The MPI-ESM-LR and the MPI-ESM-P performed better in the case of the analysis of the stability patterns while the MPI-ESM-MR was among the best-performing GCMs when loess regression curves were examined. The GFDL-ESM2M and the HadGEM2-CC were only among the best-performing GCMs when 30-year-long periods were analyzed, consequently those were not expected to perform well in the analysis of loess curves, which was only performed on 10-year-long periods. The MRI-CGCM3, the MRI-ESM1, and the NorESM1-M were among the least precise GCMs concerning both aspects. Results obtained from the HadGEM2-AO shall be treated with caution due to the anomalously low correlations produced by that model.  view (b,d). The ERA-20C is used as a reference dataset in (a,b), while the NCEP-NCAR R1 is used in (c,d). Blue and red dots represent RMSE values represented by the northerly and the southerly poles, respectively, while grey squares indicate the average RMSEs. Red, blue, and grey straight lines denote the median GCM-performance (expressed by RMSE). GCMs with an EV at least around 50% are denoted with red, blue, and grey asterisks with respect to the southerly poles, the northerly poles and both poles of the PotACs, subsequently. The GCMs are ranked in decreasing order according to average RMSE values.
Regarding the CMCC-CM and the MPI-ESM-MR similar RMSE values can be obtained from the loess models fitted on maps with Atlantic and Pacific views (the largest difference was smaller than 0.6). Concerning the southern belt, major discrepancies can be observed not only in the North Pacific region but also in the Euro-Atlantic region. Over North America their regression curves were neither similar to the ERA-20C nor to the NCEP-NCAR R1. The associated RMSE values regarding the CMCC-CMS were between 6.3 • and 7.1 • while RMSE values were between 5.1 • and 6.7 • in the case of the MPI-ESM-MR. With respect to the northern belts a high degree of similarity can be noticed in the North Atlantic region. The largest discrepancies were observable over the North Pacific region. It is worth mentioning that over Europe the loess regression curve obtained from the MPI-ESM-MR was closer to the regression curve of the NCEP-NCAR R1. RMSE values associated with the CMCC-CMS were between 3.9 • and 4.4 • while the RMSE values obtained from the MPI-ESM-MR were between 3 • and 3.8 • .
As the reanalyses show significant differences in the above-mentioned regions in the NH, the similarity between some GCMs and the given reanalysis was higher than the similarity between the two reanalyses, which was never observable during the analysis of the stability patterns in Section 3.2.

Synthesis
Based on the comparison of the results of the stability patterns and the loess regression curves, the CMCC-CMS was the only GCM that performed well with respect to both aspects (i.e., the stability patterns and loess regression curves) for both the 30-year and 10-year periods. The MPI-ESM-LR, the MPI-ESM-MR, and the MPI-ESM-P performed relatively well for the 10-year-long periods, however not for both aspects. The MPI-ESM-LR and the MPI-ESM-P performed better in the case of the analysis of the stability patterns while the MPI-ESM-MR was among the best-performing GCMs when loess regression curves were examined. The GFDL-ESM2M and the HadGEM2-CC were only among the best-performing GCMs when 30-year-long periods were analyzed, consequently those were not expected to perform well in the analysis of loess curves, which was only performed on 10-year-long periods. The MRI-CGCM3, the MRI-ESM1, and the NorESM1-M were among the least precise GCMs concerning both aspects. Results obtained from the HadGEM2-AO shall be treated with caution due to the anomalously low correlations produced by that model.

Discussion
In our study circulation patterns of GCMs were compared to those of two selected reanalyses, which means that reanalyses were considered as the appropriate realizations of the atmospheric processes. However, as it is suggested by Stryhal and Huth [30], the selection of the reanalysis as a reference dataset may have important effects on the GCM validation. Stryhal and Huth [30] pointed out on the basis of the statistical comparison of reanalyses of winter circulation patterns in different Euro-Atlantic sectors that the agreement between the ERA-20C and the NCEP-NCAR R1 was relatively good with the exception of the Icelandic and the Eastern Mediterranean region. Recently, studies claim that reanalyses have limited value as they represent only one realization of the many hypothetical atmospheric circulations. They propose to use an ensemble of climate model simulations to understand the behavior of the climate system, thus the teleconnection systems [64,65]. Since state-of-the-art reanalyses are considered to be the best possible reconstructions of the real atmospheric circulations, they are indispensable to quantify existing circulation patterns, with some degree of uncertainty. In our study the application of two different reanalyses enabled the quantification of uncertainties related to the detection of teleconnection systems.
The analysis of the correlation fields obtained from the Z500 is used as an objective tool for detecting teleconnections. Principal component analysis (PCA) is widely used in the literature to analyze atmospheric circulations and to evaluate GCMs. There are studies that apply PCA [9,10] while other studies use its rotational version (RPCA) [11,12]. There is no consensus about which version of PCA should be used [32,66]. Here we claimed that the correlation field analysis is a suitable alternative of the RPCA because of the simpler interpretation of the results that is more consistent with the observations. By computing correlations, PotACs are considered to be parts of teleconnections, and their uncertainty is quantified by associated significance levels. In PCA patterns, opposite-signed grid cells do not necessarily imply codependency [67], therefore the detection of teleconnection. Furthermore, the number of principal components (PCs) that are rotated should be carefully chosen to avoid both under and over-rotation. The former may lead to loss of information on relevant signals while the latter may lead to the over-regionalization of the signals [68].
The method of the teleconnection analysis affects the appropriate detection of the signal, thus the result of the GCM evaluation. The application of the teleconnection identification method proposed by Wallace and Gutzler [1] was used in this study. This technique enables the detection of teleconnections over the North Pacific Ocean and the North Atlantic Ocean as well as in regions of the Mediterranean Sea and the inner part of Asia simultaneously, which may indicate the presence of oscillations such as the NPO and the NAO. In our method the NH was not divided into different areas to detect atmospheric patterns such as the NAO or teleconnection in the region of the Mediterranean Sea. In the case of NAO, PCA can be performed over an Atlantic sector [5] but the boundaries of the target area are not precisely defined. Concerning the Mediterranean region, a predefined domain between 30 • N-60 • N and 30 • W-40 • E is recommended to be analyzed [69]. Clearly, subjective selection of the target domain affects the results of PCA [66]. The presented method does not suffer from subjective decisions on the domain selection, which is an advantage compared to the PCA method.
By the application of the teleconnection method, based on the time series associated with the PotACs, teleconnection indices can be constructed similarly to the mobile index introduced by Portis et al. [70]. With those teleconnection indices shifts in the position of the most intense regions of the teleconnections can be examined and their relationships with atmospheric variables in distant areas can be identified. Monitoring shifts in the circulation patterns due to anthropogenic forcing is an important task, i.e., expansion of the Hadley cell is studied in Lu et al. [71] using GCM outputs. The NAO moved easterly in the last few decades of the 20th century [72], though this tendency cannot be observed in the years after the millennium due to less intense westerly winds [73]. The eastward shift can also be observed over the North Pacific Ocean where the Aleutian low relocated eastward after 1970 [74]. Correlation maps obtained from the reanalyses and presented in Figure S1 of the Supplementary Material are in accordance with those detected eastward shifts over the oceanic areas. Furthermore, on correlation maps closer to the present (Figure 1) it is clearly visible that the above-mentioned eastward shift over the North Atlantic Ocean cannot be observed anymore. According to our comparisons, only a few GCMs (e.g., the CMCC-CMS) captured the eastward shift of the most intense regions of the NAO ( Figure S5 of the Supplementary Material). In the region of the North Pacific Ocean the larger part of the GCMs relocated the most intense regions of the teleconnections more easterly than the reanalyses. Exceptions were, for example, the CMCC-CMS in the case of the 30-year-long time periods and MPI-ESM-MR in the case of the 10-year-long time periods.
Our analysis was based on the examination of Pearson correlation fields, therefore linear components of the teleconnections were analyzed. However, nonlinear components may also have an important role to govern atmospheric processes, thus atmospheric circulation and their shifts [75]. In the last decade studies applying nonlinear approach became popular, for example using the method of self-organizing maps [76]. In the present approach the linear method was used for simplicity, but it is possible to extend it to take into account both linear and nonlinear effects.
We applied an information theory-based statistical method on the correlation fields of different time periods to evaluate the quality of CMIP5 GCMs. Usage of the diversity index and the loess-method enabled the evaluation of GCMs with respect to the intensity of various teleconnections without fading their signals. The application of information entropy to evaluate atmospheric models and to examine atmospheric teleconnections is gaining popularity recently [77,78]. The usage of the loess-method as a GCM evaluation tool with respect to atmospheric teleconnections is a novel method to the authors' knowledge. As it was demonstrated, comparing GCM regression curves against the reanalysis regression curves by calculating RMSE values are not sufficient to select the best-performing GCMs. The quality of the regression (i.e., EV) should also be taken into consideration.
The results of our analysis are in line with other studies. According to Stryhal and Huth [11] the HadGEM2-CC is also among those GCMs that simulate the winter atmospheric circulation well in the Euro-Atlantic region on a longer time scale (i.e., . In our analysis the HadGEM2-CC performed well on longer time scales. According to our findings that the CMCC-CMS could be chosen as the best-performing model might be connected to the fact that the CMCC-CMS is a GCM in which the atmosphere is vertically well resolved [79]. Atmospheric processes in the upper-troposphere and the lower-stratosphere can modulate teleconnections. Feldstein [28] pointed out that the signal of NAO intensifies with increasing height. A recent study [80] found that the CMCC-CMS is able to reproduce climate change in the North Pacific region in winter relative to a reanalysis dataset. However, it relocates the geopotential height anomalies over the North Pacific Ocean southerly. To evaluate the circulation pattern of GCMs, computing model metrics (e.g., RMSE) using the fields of significant correlations and establishing a GCM-ranking for each period and arbitrary-selected areas in the NH would provide little practical information because of the followings. (1) The GCMs must perform equally well both within the regions where teleconnections are detected and outside of them as well. If we restrict the analysis to some predefined areas where teleconnections can be detected, then GCMs (e.g., the HadGEM2-AO and the NorESM1) that perform less reliably outside those regions might be (incorrectly) considered to be of similar quality than GCMs that perform well in extensive areas (e.g., the CMCC-CMS and the HadGEM2-CC). We proposed that the Monte Carlo method (i.e., random sampling of grid cells in the NH) that we applied on the stability patterns is an appropriate, completely objective technique to compare the GCMs as the method considers each grid cell. In other words, the proposed Monte Carlo method avoids subjective selection of the target area. (2) There are GCMs that are characterized by improved/deteriorated performance closer to the end of the historical period (i.e., 2005) while the performance of other GCMs is not changing temporally. Consequently, if a GCM performs well in the last examined historical time period, there is no guarantee that it remains among the best performing GCMs if the representative concentration pathway (RCP) scenarios are examined (i.e., after 2006). The results of the study might support GCM selection for downscaling studies and for the quantification of the climate change signal in specific geographical regions.

Conclusions
In this paper an information theory-based statistical method was used to evaluate 19 GCMs from the CMIP5 project with respect to atmospheric teleconnections. To analyze temporal stability of the teleconnections, a simple diversity index was calculated based on the fields of strongest negative correlations obtained from the Z500 in the Northern Hemisphere's winter. Comparison of stability patterns based on the Monte Carlo method enabled us to take into account area-dependency of the GCMs and to avoid getting highly fluctuating GCM rankings regarding the selected time period/area.
The vast majority of the GCMs reproduced the signal of teleconnections in a similar position as the reference datasets (the ERA-20C and the NCEP-NCAR R1 reanalyses). However, there are major differences regarding the most intense regions of the teleconnections. For example, the GCMs tend to underestimate the intensity of the negative correlations (thus teleconnections themselves) in the eastern regions of the North Atlantic Ocean. To assess those differences, grid cells in the strongest negative correlations fields that are in mutually unambiguous correspondence with each other, i.e., potential action centers, were analyzed. The examined 10-year-long time periods provide enough data to enable us fitting local polynomial regression models on them and compare them with the reanalyses. Based on the analyses of stability patterns and loess regression models, the GCMs that perform relatively best can be selected. Among them the CMCC-CMS performed exceptionally well on both time scales. The best-performing GCMs can reproduce the teleconnections in a similar position and with similar intensity as the reanalyses. Information about the specific GCMs might support model development in the future.
It was found that teleconnections in the Euro-Atlantic region are highly variable in time and space. Due to the role of those teleconnections in the European weather events, future studies should focus on the analysis of the teleconnections and the expected changes, e.g., in the Carpathian Basin. GCM selection for such an analysis will be straightforward and easy based on the presented results.
The R code and the calculated datasets are available from the first author upon request.  Table S2: Span parameter, explained variance and residual sum of squares of the models based on local polynomial regression (loess). Table S3: Comparison of the GCM loess regression curves against the reanalysis loess regression curves by calculating RMSE. Figure S1: Global correlation maps obtained from the Z500 field of the NCEP-NCAR R1 for 30-year-long periods. Figure S2: Hemispherical correlation maps of the Z500 fields obtained from the ERA-20C and the NCEP-NCAR R1 in cases of 30-year-long periods. Figure S3: Averages of the strongest negative correlations in each grid cell obtained from the Z500 in case of each reanalysis/GCM based on the 30-year-long periods and the 10-year-long periods. Figure S4: Distribution of the strongest negative correlations obtained from the Z500 in case of each reanalysis/GCM based on the 30-year-long periods and the 10-year-long periods. Figure S5: Potential action centers (PotACs) obtained from the analysis of correlation maps in case of all 30-year-long periods and 10-year-long periods. Figure S6: Stability pattern of each reanalysis/GCM. Figure S7: Difference map of each reanalysis/GCM relative to the ERA-20C and the NCEP-NCAR R1. Figure S8: Loess regression curves fitted on the southerly and northerly located PotACs of each GCM on maps with Atlantic view and on maps with Pacific view. Comparison of the GCM loess regression curves against the reanalysis regression curves.