Means and Extremes: Evaluation of a CMIP6 Multi-Model Ensemble in Reproducing Historical Climate Characteristics across Alberta, Canada

: This study evaluates General Circulation Models (GCMs) participating in the Coupled Model Intercomparison Project Phase 6 (CMIP6) for their ability in simulating historical means and extremes of daily precipitation (P), and daily maximum (Tmax), and minimum temperature (Tmin). Models are evaluated against hybrid observations at 2255 sub-basins across Alberta, Canada using established statistical metrics for the 1983–2014 period. Three extreme indices including consecutive wet days (CWD), summer days (SD), and warm nights (WN) are deﬁned based on the peak over the threshold approach and characterized by duration and frequency. The tail behaviour of extremes is evaluated using the Generalized Pareto Distribution. Regional evaluations are also conducted for four climate sub-regions across the study area. For both mean annual precipitation and mean annual daily temperature, most GCMs more accurately reproduce the observations in northern Alberta and follow a gradient toward the south having the poorest representation in the western mountainous area. Model simulations show statistically better performance in reproducing mean annual daily Tmax than Tmin, and in reproducing annual mean duration compared to the frequency of extreme indices across the province. The Kernel density curves of duration and frequency as simulated by GCMs show closer agreement to that of observations in the case of CWD. However, it is slightly (completely) overestimated (underestimated) by GCMs for warm nights (summer days). The tail behaviour of extremes indicates that GCMs may not incorporate some local processes such as the convective parameterization scheme in the simulation of daily precipitation. Model performances in each of the four sub-regions are quite similar to their performances at the provincial scale. Bias-corrected and downscaled GCM simulations using a hybrid approach show that the downscaled GCM simulations better represent the means and extremes of P characteristics compared to Tmax and Tmin. There is no clear indication of an improved tail behaviour of GPD based on downscaled simulations.


Introduction
Simulations from the state-of-the-art General Circulation Models (GCMs) are becoming available for analysis and being included in the 6th assessment report (AR6) of the Intergovernmental Panel on Climate Change (IPCC) through the Coupled Model Intercomparison Project Phase 6 (CMIP6) [1]. One of the major differences between CMIP5 [2] and CMIP6 is the set of future scenarios used to project climate evolution. The CMIP6 tail implies a higher probability of more frequent extreme events [29,30]. Therefore, correct representation of the tail characteristics, as described by the shape parameter of the extreme value distribution [31], avoids serious under-or overestimation of occurrence probabilities of extreme events and ensures a proper assessment of future climate risks and uncertainties [32]. Hence, analyzing the tail behaviour of historical climate extreme events using CMIP6 data can provide important information and allows thorough investigation of CMIP6 model capabilities in projection of climate change and their impacts.
A wide range of research has been conducted using the extreme indices of temperature and precipitation developed by the Expert Team on Climate Change Detection and Indices (ETCCDI). Karl et al. [33] and Frich et al. [34] provided a comprehensive overview of ETCCDI developed temperature and precipitation extreme indices. Alexander et al. [34] and Donat et al. [35] also supported this effort by developing a gridded dataset of climate indices (e.g., HadEX and HadEX2) based on a reasonably dense global coverage. Most of these climate indices describe relatively moderate climate extremes with reoccurrence times of at least once a year [36]. However, the analysis of extreme climate events involves several difficulties compared to the analysis of mean events, including valid definitions and formulation of extremes [37,38]. This study focuses on climate extremes based on daily temperature and precipitation, such as the duration and frequency of extreme wet and hot days.
Given that CMIP6 climate projections have only recently been released, only limited research has been conducted to evaluate their simulations in reproducing means and extreme climate conditions [39][40][41][42][43][44][45][46][47]. These studies mainly compared the CMIP6 simulations with CMIP5 and found that CMIP6 medians are generally much closer to the observations [40]. They also found a general improvement in the simulation of climate extremes of CMIP6 models, and there is some indication that CMIP6 has reduced some of the warm biases [40,42]. They attributed the difference between the CMIP5 and CMIP6 simulated precipitation and temperature-based results to the sophistication of CMIP6 models that included representation of aerosol effects, new cloud fraction scheme, an improved snowalbedo scheme, model resolution, and parameterization improvements [42,44]. While these studies provide important information on the performance of CMIP6 models at global scale, they lack a detailed investigation at a local scale, where most decisions are made and actual impact assessments are performed.
The purpose of this paper is to evaluateCMIP6 models in reproducing local historical climate variables, such as means and extremes, over a large geographic region with diverse geospatial, topographic, and climatic characteristics. The specific objectives are to (1) evaluate how precisely the simulations of CMIP6 models reproduce the historical mean daily precipitation and temperature, (2) evaluate how precisely the simulations of CMIP6 models reproduce the tail behaviour of their probability distributions, and (3) investigate whether a correlation exists between the bias in tails and geospatial regions. We also applied a hybrid bias correction and statistical downscaling technique to check whether the performance of GCMs could be improved for the aforementioned objectives. To achieve these objectives, we focused our analyses in the province of Alberta, Canada, which is characterized by heterogeneous soil, land use, topographic, and climatic characteristics and has also been exposed to historical extreme precipitation events and droughts. Our study lays the foundation for future research based on CMIP6 data and enhance our understanding of climate change impact assessments that explore the probability of climate-related threats. Note that the presence of annual cycle or season-specific behaviour in the time series are not considered for the evaluation purposes and are beyond the scope of the study.

Study Area
Alberta extends for~1200 km from north to south (49 • N-60 • N) and~660 km (110 • W-120 • W) across the greatest width with an area spanning 661,000 km 2 . Elevation ranges from 152 m in the northeast to 3747 m in the Rocky Mountains along the south-  Figure 1. The province has more than one-third of its total area as agricultural land in the south, with a landscape varying from glacial mountain lakes, rolling foothills, and grassland in the south to vast boreal forests in the north [48]. It has a highly variable climate with mean annual precipitation ranging from~280 mm in the south to~1000 mm at the higher elevations of the Rocky Mountains with the provincial average of~500 mm [25]. Mean winter temperatures usually range from −25.1 • C to −9.6 • C, while mean summer temperatures vary between 8.7 • C and 18.5 • C, with the mean annual temperature ranging from 3.6 • C to 4.4 • C [48].
border Figure 1. The province has more than one-third of its total area as agricultural land in the south, with a landscape varying from glacial mountain lakes, rolling foothills, and grassland in the south to vast boreal forests in the north [48]. It has a highly variable climate with mean annual precipitation ranging from ~280 mm in the south to ~1000 mm at the higher elevations of the Rocky Mountains with the provincial average of ~500 mm [25]. Mean winter temperatures usually range from −25.1 °C to −9.6 °C, while mean summer temperatures vary between 8.7 °C and 18.5 °C, with the mean annual temperature ranging from 3.6 °C to 4.4 °C [48].
In this study, we evaluated GCMs' performances across four contiguous homogeneous regions (R1, R2, R3, and R4) of Alberta. These selected regions were defined in [49] and were developed using soft fuzzy clustering algorithms [50]. The regions were validated through a statistical homogeneity test from Hosking Wallis [51] using daily precipitation and multi-hour precipitation extremes data from Environment and Climate Change Canada [49]. Regions outside of the Alberta border include BC (British Columbia, west side) and SK (Saskatchewan, east side) Figure 1.

Data
We acquired daily precipitation (P), daily minimum (Tmin), and daily maximum temperature (Tmax) outputs from five GCMs (i.e., BCC-CSM2-MR, CNRM-CM6-1, EC- In this study, we evaluated GCMs' performances across four contiguous homogeneous regions (R1, R2, R3, and R4) of Alberta. These selected regions were defined in [49] and were developed using soft fuzzy clustering algorithms [50]. The regions were validated through a statistical homogeneity test from Hosking Wallis [51] using daily precipitation and multihour precipitation extremes data from Environment and Climate Change Canada [49]. Regions outside of the Alberta border include BC (British Columbia, west side) and SK (Saskatchewan, east side) Figure 1.
A reliable observed climate dataset with high spatio-temporal coverage is key to successfully evaluate any newly released climate datasets [52]. In this study, a unique 'hybrid' observational dataset was used as the comparison benchmark. It was produced by Faramarzi et al. [53] who employed five climate data sources, including station-based (https://climate.weather.gc.ca/; accessed on 28 June 2014) and gridded products (National Centers for Environmental Prediction's Climate Forecast System Reanalysis-CFSR, Climate Research Unit Time Series-CRU TS2, CRU TS3.21, and Natural Resources Canada-NRCan) to reproduce observed streamflow records of 130 hydrometric gauging stations across Alberta, Canada. However, in their study, none of the individual datasets showed the best performance as input to determine streamflow. Therefore, following a 'combination approach', they came up with a 'hybrid dataset' that was used to force a hydrological model from1983-2014 that resulted in a climate data set of 2255 sub-basins covering entire Alberta ( Figure 1). The hybrid climate dataset provides daily precipitation, minimum and maximum temperature. The spatial variation of these climate variables was tested and found to be similar to the Alberta Environment and Parks reported climate data [54]. These data have been successfully incorporated in many applications, including flood frequency analysis [16], crop yield simulations [55][56][57], surface and subsurface water interactions [58], and storage changes in wetlands [59]. In this study we employed this hybrid climate dataset (hereafter referred to as observations dataset) as an observed climate dataset for the GCM evaluations.
Due to the different grid projections of the CMIP6 GCMs (Table 1, Figure 1) and to be consistent with hybrid observation datasets at 2255 sub-basins, a reference grid with a horizontal resolution of~6km following Werner et al. [60] was used. This resulted in a reference grid of 242 longitudes × 181 latitudes covering the entire province (48 • N to 60 • N and 123 • W to 107 • W). All model simulated data were interpolated to this grid using thinplate spline interpolation algorithm [61] before performing any analysis. Thin-plate spline is not scale-invariant because three covariates (latitude, longitude and a climate normal) appear in a nonlinear way in the interpolation and it has been successfully applied to other studies in the northwestern North America [62] and Canada [63]. Taking the centroid of each sub-basin as the station location of the hybrid climate dataset, we applied the nearest neighbor algorithm in the ArcGIS to locate the closest grid point of GCMs to the centroid of sub-basin for the evaluation purposes.

Methods
A concise overview of the methods adopted for this study is illustrated in Figure 2, while a detailed description of various components is provided below. A concise overview of the methods adopted for this study is illustrated in Figure 2, while a detailed description of various components is provided below.

Figure 2.
A schematic diagram of the methodology adopted for this study using both downscaled and non-downscaled/raw data.

Evaluation of Mean Climate Characteristics
Mean characteristics (i.e., mean annual precipitation and temperature) of daily climate derived from GCM simulations are compared to corresponding observed characteristics to assess performance errors. The metrics chosen to evaluate Tmax and Tmin were as follows [67].
Slope of the Regression Line times the Coefficient of Determination (bR 2 ) where is the coefficient of determination between the observed and simulated values, and is the slope of the regression line [53,68]. The range of (0-1) describes how much of the observed dispersion is explained by the model prediction and a model can have good value (close to 1) but still over or under prediction. Therefore, the gradient b provides additional information and is weighting the values to reflect the model results in a more comprehensive way [68].
Nash-Sutcliffe Efficiency ( ) Figure 2. A schematic diagram of the methodology adopted for this study using both downscaled and non-downscaled/ raw data.

Evaluation of Mean Climate Characteristics
Mean characteristics (i.e., mean annual precipitation and temperature) of daily climate derived from GCM simulations are compared to corresponding observed characteristics to assess performance errors. The metrics chosen to evaluate Tmax and Tmin were as follows [67].
Slope of the Regression Line times the Coefficient of Determination (bR 2 ) where R 2 is the coefficient of determination between the observed and simulated values, and b is the slope of the regression line [53,68]. The range of R 2 (0-1) describes how much of the observed dispersion is explained by the model prediction and a model can have good R 2 value (close to 1) but still over or under prediction. Therefore, the gradient b provides additional information and is weighting the R 2 values to reflect the model results in a more comprehensive way [68]. Nash-Sutcliffe Efficiency (NSE) where O i and S i are the ith observed and simulated variable, respectively. O is the mean of observed data for the variable being estimated and N is the total number of observations. NSE is a normalized statistic that determines the relative magnitude of the residual variance ('noise') compared to the observed variance ('information') [69,70]. NSE values range from −∞ to 1, with NSE = 1 being a match between observed and simulated variable. Any NSE value greater or equal to zero indicates that the simulated value estimated the constituent of concern better than the mean observed value. Similar quantitative statistical metrics have been used in the literature to evaluate GCMs performance in simulating hydrometeorological variables [71][72][73].
The two sample Kolmogorov-Smirnov (KS) test [74] was incorporated to assess how well the distribution of GCMs simulated daily precipitation matched the observations. The KS test is a nonparametric test of the equality of continuous one-dimensional probability distributions and the KS test statistic (D) quantifies the distance between empirical distribution functions of two samples. The closer this D is to zero, the more likely the two samples are drawn from the same distribution. In this study, the null hypothesis, two samples are drawn from identical population, is examined with the 99% level of confidence. The performance metrics for temperature and KS test for precipitation were applied on each of the selected grids in the study area.

Evaluation of Extreme Characteristics
Precipitation and temperature extremes are defined as independent daily events exceeding an extreme threshold. We use the peak over threshold (POT) approach to sample independent events from each individual time series such that it provides a comprehensive description of extreme events by retaining both the magnitude and the timing of each event [31,75]. Independent events are then used to fit a Generalized Pareto Distribution (GPD) to analyze the probability of extreme events. However, selecting the optimum threshold that allows retention of the largest sample of excesses above the threshold and assures their independence are two of the major challenges against employing the POT approach [76,77]. At least 1.65 peaks per year are recommended to be selected by the POT method [78].
Defining the threshold value above which peaks are sampled from a data series is not trivial. The success of fitting a probability distribution on the peaks is primarily dependent on the compromise between the variance and the bias-too large a threshold will include very few values to model the tail of the distribution correctly (large variance due to only very extreme observations), whereas too low a threshold will result in selecting non-extreme values and a high bias in the analysis [77]. Here, we employed the Likelihood Ratio Test (LRT) to select the optimum threshold as it has shown to outperform other methods [79]. We employed an algorithm developed in R [80], a language and environment for statistical computing, to automate the LRT threshold selection method following the procedure outlined by Beirlant et al. [81] and Zoglat et al. [79]. In the LRT approach, let u i and u k (u i < u k ) be two threshold candidates and X ui (X uk ) the vector of observations exceeding u i (u k ). When u i an acceptable threshold, u k is also acceptable. The test accepts the lowest one when two thresholds are admissible. It is worth noting that while graphical methods are good candidates for the choice of threshold, they become unmanageable when the number of time series is large, and they depend on visual investigations that are highly subjective [16]. In this study, the initial u i was chosen as the 90th percentile (P90) for the dataset under consideration. Furthermore, to ensure that the extracted peaks are independent, a careful declustering was carried out. A few statistical tools have been developed to filter clustered realizations to yield datasets of independent events, including the most widely-adopted method, 'declustering' [31]. The extremal index is chosen to ensure the independence of exceedances [82,83]. Ideally, an extremal index of 1 indicates perfectly independent data while 0 indicates clustered data. The R package 'POT' was used for declustering and estimating the extremal index [84].
Three extreme indices, namely consecutive wet days (CWD), summer days (SD), and warm nights (WN), were defined based on the extracted peaks over the threshold Table 2. For each index, the duration and frequency were calculated separately. Duration is defined as the continuous sequence of days where the value of the climate variable (precipitation or temperature) exceeded the selected threshold before declustering. At each grid point, the durations of all extreme events over the entire period are averaged to represent one value. Frequency is determined by counting the total number of independent events over the entire period for each grid point. Table 2. Definitions of the extreme indices.

Sl Number
Index Definition

1
Consecutive wet days (CWD) Days with daily precipitation greater than the threshold * 2 Summer days (SD) Days with daily maximum temperature greater than the threshold 3 Warm nights (WN) Days with daily minimum temperature greater than the threshold * Initially the threshold is chosen as P90.
The behaviour of precipitation and temperature extremes in these data products is described by modeling the tails of their fitted distribution. The evaluation of the heaviness of the tails is done by comparing the tail index (i.e., the shape parameter) of the fitted Generalized Pareto Distribution (GPD) distribution, which governs the magnitude and frequency of extreme events. A light-tailed distribution generates less frequent and milder extremes compared to a heavy-tailed distribution. The tail index was therefore calculated and compared between observations and GCM simulations in order to evaluate the performance of CMIP6 models in capturing the precipitation and temperature extremes. To identify the tail behaviour of extremes, the Generalized Pareto Distribution (GPD) was adopted to model the independent and identically distributed (IID) excesses over an optimum threshold u. The cumulative distribution function (CDF) for the GPD is [85]: where x, is the extreme climate variables (in mm for precipitation and • C for temperature), u is the location parameter, σ is the scale parameter, and ξ is the shape parameter. The shape parameter (tail index) determines the qualitative behaviour of the GPD, where a ξ = 0 refers to the exponential distribution, for ξ > 0 the corresponding distribution has a heavy upper tail that behaves like a power function with exponent −1/ξ and for ξ = 1 the distribution is uniform. Having determined the optimum threshold, the parameters of the GPD were estimated using the Maximum Likelihood Estimator (MLE) [86].
Climate Imprint (CI) works together with Quantile Delta Mapping (QDM). It starts by calculating daily climate anomalies for the GCM dataset at the observation dataset's period. These daily GCM anomalies are interpolated to the observation dataset's grid and constitute the 'CI'. The high-resolution gridded observations are then grouped into months and a climatology is calculated for each month. The observed climatology is then added to the GCM-based CI. The QDM was performed on a point-by-point basis in the finer grid system. It uses the observations and the CI result as input and performs a quantile perturbation/quantile mapping bias correction. CA works independently by spatially aggregating the high-resolution gridded observation dataset up to the GCM scale and then proceeds to bias correct the GCM based on those observations. Then, it searches for the top 30 closest time steps in the gridded observations and makes an "analogue" [92]. This method can also generate estimates of extreme events during downscaling.
The ClimDown package uses an overall algorithm called BCCAQ to combine the results from CI-QDM and CA [60,91]. It reorders data for each fine-scale grid point within a month effectively breaking the overly smooth representation of sub-grid-scale spatial variability inherited from CI-QDM, thereby resulting in a more accurate representation of event-scale spatial gradients. This also prevents the downscaled outputs from drifting too far from the climate model's long-term trend.

Results and Discussion
To ensure that the P, Tmax, and Tmin extremes are IID, the excesses have been declustered. We monitored the extremal index for each sub-basins and confirmed the assumption of independence for both the precipitation and temperature peaks time series. Table 3 shows the total number of excesses for the study area and the number of peaks confirmed the assumption that excesses should be a minimum of 1.65 multiplied by the number of years.  Figure 3 shows the spatial distribution of relative difference (percentage change from observations) of mean annual precipitation for five GCMs. The BCC-CSM2-MR, EC-Earth3, and EC-Earth3-veg underestimate the mean annual precipitation in central to northern Alberta and overestimate in southern Alberta. The MRI-ESM2 also overestimates in the southern Alberta while showing a varied pattern in central to northern Alberta. The CNRM-CM6-1 greatly underestimates precipitation for the entire study area except in the Canadian Rockies. The behaviour of all GCMs is fairly similar to simulated precipitation in the mountainous region (south-western), where all models highly overestimate. The spatial pattern and magnitude of observed precipitation shows that this region receives high amounts of precipitation compared to other regions of the province Figure 3. However, the orographic effect and limited number of observation stations in the complex terrain of the Canadian Rockies may play an important role to such big difference. Wong et al. [93] compared several gridded precipitation products over Canada and found that precipitation amounts are overestimated in the Canadian Rockies. Li et al. [94] and Kuo et al. [95] found similar precipitation biases over the Canadian Rockies as simulated by the Weather Research and Forecasting model (WRF). The KS statistic (D) in Figure 4 supports our findings in Figure 3. Minimum D values are found in northern Alberta followed by central and southern Alberta for all models except the CNRM-CM6-1, which shows large D values across the province. Overall, GCMs have comparatively larger D in the Canadian Rockies. Based on multiple statistical metrics, Cheng et al. [71] also found poor performance of CMIP5 GCMs over the Canadian Rockies portion of the Athabasca River Basin in Alberta. The MRI-ESM2.0 has higher D values in southern Alberta where the magnitudes of overestimation are also higher Figure  3. The KS test for mean annual precipitation shows no statistically significant results for BCC-CSM2-MR, CNRM-CM6-1 and MRI-ESM2.0 for all 2255 sub-basins. The EC-Earth3 and EC-Earth3-veg produce significant results only for two and seven sub-basins, respectively. For evaluating precipitation time series, the number of significant results is in line with expectation, if we assume that precipitation values are spatially and temporally independent [96].  Figure 3. The KS test for mean annual precipitation shows no statistically significant results for BCC-CSM2-MR, CNRM-CM6-1 and MRI-ESM2.0 for all 2255 sub-basins. The EC-Earth3 and EC-Earth3-veg produce significant results only for two and seven sub-basins, respectively. For evaluating precipitation time series, the number of significant results is in line with expectation, if we assume that precipitation values are spatially and temporally independent [96].

Evaluation of Maximum and Minimum Temperature Mean Characteristics
The observed average annual Tmax and Tmin and their differences (simulation minus observation) corresponding to five GCMs are presented in Figure 3 for the period 1983-2014. The spatial pattern of biases is somewhat similar for all GCMs. The change is larger ± 5 • C in the south and south-western domain for all GCMs except the MRI-ESM2.0, which also shows overestimation (warm bias) in the central to the northern part of the province. In the BC part, all GCMs underestimate (cold bias) Tmax and Tmin. In the mountainous regions, every GCM overestimate both variables. Overall, most of the models underestimate temperature in southern Alberta and overestimate in the north. Similar temperature biases were reported over the complex topographic regions in other studies [95,97]. Sillmann et al. [37] also reported such limitations in high terrain regions while evaluating climate extremes indices in a CMIP5 multi-model ensemble. This bias could be partly attributed to the poor representation of complex topographic features and the land surface interactions in the parameterization scheme of the GCM [95]. The primary horizontal resolutions between models could introduce large differences in elevation of mountainous terrains, which can make the reproduction of precipitation and temperature difficult [94]. The overestimation of temperature in the mountainous regions might also be attributed to the absence of altitude effects in the interpolation algorithm (Section 2.2) as the temperature varies with elevation [98].

Evaluation of Maximum and Minimum Temperature Mean Characteristics
The observed average annual Tmax and Tmin and their differences (simulation minus observation) corresponding to five GCMs are presented in Figure 3 for the period 1983-2014. The spatial pattern of biases is somewhat similar for all GCMs. The change is larger ± 5 °C in the south and south-western domain for all GCMs except the MRI-ESM2.0, which also shows overestimation (warm bias) in the central to the northern part of the province. In the BC part, all GCMs underestimate (cold bias) Tmax and Tmin. In the mountainous regions, every GCM overestimate both variables. Overall, most of the models underestimate temperature in southern Alberta and overestimate in the north. Similar temperature biases were reported over the complex topographic regions in other studies [95,97]. Sillmann et al. [37] also reported such limitations in high terrain regions while evaluating climate extremes indices in a CMIP5 multi-model ensemble. This bias could be partly attributed to the poor representation of complex topographic features and the land surface interactions in the parameterization scheme of the GCM [95]. The primary horizontal resolutions between models could introduce large differences in elevation of mountainous terrains, which can make the reproduction of precipitation and temperature difficult [94]. The overestimation of temperature in the mountainous regions might also be attributed to the absence of altitude effects in the interpolation algorithm (Section 2.2) as the temperature varies with elevation [98].
The spatial evaluation metrics of mean annual daily temperature for the five GCMs are presented in Figure 4. The GCMs usually obtain higher skill scores for Tmax compared to Tmin. Similar results were found in a study by Cheng et al. [71], where they evaluated the performance of six CMIP5 GCMs over the Athabasca River Basin in Alberta. Figure 4 reveals a mostly north-south gradient for Tmax and Tmin. Results based on bR 2 and NSE show clear spatial patterns of improved simulation in the northern area of the province followed by middle, southern and south-western parts. The low bR 2 and negative NSE values in the south-western region (mountainous) of the province reflect the poor model performance. The bR 2 ranges from 0.19 to 0.70 and 0.09 to 0.69 for Tmax and Tmin, respectively. Likewise, the NSE metric tends to show a similar pattern as bR 2 with values ranging from −0.52 to 0.70 and −0.91 to 0.67 for Tmax and Tmin, respectively. Overall, the CNRM-CM6-1, EC-Earth3, and EC-Earth3-veg models perform better in simulating temperature compared to MRI-ESM2.0 and BCC-CSM2-MR. Wyser et al. [39] discussed the impact of forcing datasets (e.g., greenhouse gas concentrations, insolation, stratospheric ozone con- The spatial evaluation metrics of mean annual daily temperature for the five GCMs are presented in Figure 4. The GCMs usually obtain higher skill scores for Tmax compared to Tmin. Similar results were found in a study by Cheng et al. [71], where they evaluated the performance of six CMIP5 GCMs over the Athabasca River Basin in Alberta. Figure 4 reveals a mostly north-south gradient for Tmax and Tmin. Results based on bR 2 and NSE show clear spatial patterns of improved simulation in the northern area of the province followed by middle, southern and south-western parts. The low bR 2 and negative NSE values in the south-western region (mountainous) of the province reflect the poor model performance. The bR 2 ranges from 0.19 to 0.70 and 0.09 to 0.69 for Tmax and Tmin, respectively. Likewise, the NSE metric tends to show a similar pattern as bR 2 with values ranging from −0.52 to 0.70 and −0.91 to 0.67 for Tmax and Tmin, respectively. Overall, the CNRM-CM6-1, EC-Earth3, and EC-Earth3-veg models perform better in simulating temperature compared to MRI-ESM2.0 and BCC-CSM2-MR. Wyser et al. [39] discussed the impact of forcing datasets (e.g., greenhouse gas concentrations, insolation, stratospheric ozone concentrations, optical properties of stratospheric aerosols, and landuse changes) for climate model simulations and pointed GHG concentrations as a major contributor to the warming. The variation in results from GCMs is probably strongly model-dependent due to variations in considering these forcing datasets.

Evaluation of Precipitation Extreme Characteristics
The GCM simulations are also evaluated in terms of reproducing the observed duration and frequency of CWD. The duration based on observed data Figure 5 ranges from one to two days and the spatial distribution is uniform over the study area, with a few exceptions in the southern areas where the durations are longer. GCMs reproduce the duration of CWD very well. All GCMs slightly overestimate the duration with a maximum one day except for the BCC-CSM2-MR, which underestimate the duration in most sub-basins ( Figure 5).
Density curves of duration and frequency of SD and WN are shown in Figure 6. The GCM simulated distributions of SD are highly deviated from the observed distributions. All GCMs overestimate the duration and underestimate their frequency. For both characteristics, individual models' distributions are substantially different from each other. However, duration and frequency distribution curves of WN are similar to the observed distribution. GCMs can partially reproduce the density curves although they overestimate duration and underestimate the frequency of WN.  The mean annual frequencies of CWD compared with observed data are maximum in the northern area followed by the central and southern areas of the province Figure 5. The results show that the frequency estimated from the different GCMs is in good agreement with the observations, except that the BCC-CSM2-MR model overestimates the frequency in most of the sub-basins. The overestimation could be attributed to two aspects of insufficient resolution in the model including a) the resolution is too coarse to represent the landmass at all and b) the model does not resolve the topographically driven high precipitation regimes [99]. For most GCMs, the maximum underestimation of occurrence frequency is found in the south-western mountainous areas, and this underestimation could also be related to the improper representation of uneven topography in the climate model as explained previously.
The temporal distribution of the duration and frequency of CWD are plotted in Figure 6 through kernel density plots that display the distribution of these two characteristics over a continuous interval or time period for the entire province. The peaks of the plot indicate where values are concentrated over the interval. Results are similar to the spatial plots where all models partially coincide at the base of the kernel density curves for the duration and differ slightly at the peaks except for the BCC-CSM2-MR, which agrees at the base but underestimates the peak Figure 6. Likewise, the frequency Figure 6 shows that the graphical patterns of individual models are opposite compared to the pattern for the duration.

Tail Behaviour of Precipitation and Temperature Extremes
The tail behaviour of P, Tmax, and Tmin extremes is shown in Figure 7. The observed tail index of P extremes is mostly positive across the province, indicating a heavy-tailed distribution. Results reveal high values of tail index in the southern part of Alberta and BC portion of the study area. The north-eastern part of the province showed thin tails. All GCMs reveal similar spatial patterns of difference, whether over-or underestimation of the tail index for the precipitation extremes with values ranging around ± 0.25, as illustrated in Figure 6. GCMs highly overestimate the tail index. Such over-or underestimation of tail behaviour is also found for CMIP5-GCMs over the Euro-Mediterranean region [100]. The tail behaviour is known to be sensitive to the convective parameterization (CP) within GCMs, which aims to represent the effects of convection on the grid-scale but does not capture the dynamics of individual storms [101]. However, the CP is a major source of errors in climate simulations [94] and therefore, the bias in the tail index estimation is inherent to the climate model performances.
The observed tail index of both Tmax and Tmin extremes are negative for most of the sub-basins, indicating a short-tailed distribution Figure 7. The spatial distribution of tail index is somewhat different for both temperature extremes. Positive values of Tmax extremes are located in the south and north-eastern part of the study area. In addition to these locations, the tail index of Tmin extremes is positive in the foothills of the Rockies. Otherwise, a relatively thin tail (negative) is observed in the entire province. A thin-tailed probability distribution corresponds to the upper tail declining to zero exponentially or faster [102]. GCM simulations are quite biased in reproducing the tail behaviour of extremes. For Tmax extremes, models overestimate in the mountainous area and northernmost Alberta. In terms of bias estimation, the spatial distribution of tail index of Tmax extremes is analogous to the CNRM-CM6-1 and MR-ESM2. For the Tmin extremes, a comparable spatial pattern existed for the delta tail index for all GCMs. Similar to the results of Tmax extremes, the magnitude of bias (overestimation) based on Tmin extremes is higher for most of the sub-basins.
The temporal distribution of density curves of the tail index for P extremes Figure 8 shows that all GCMs slightly overestimate the index with comparable height except the

Evaluation of Maximum and Minimum Temperature Extremes Characteristics
The maximum duration (three to four days) of SD and WN is observed in the northernmost and south-western mountainous parts of the province Figure 5. However, the duration of SD progressively decrease from the mountainous area to the grassland region of southern Alberta. The duration of WN is as high as three days in the south-eastern region of Alberta. In rest of the province, the duration ranges from two to three days for both SD and WN. The provincial average duration of SD and WN are 2.7 and 2.5 days, respectively. In Figure 5, the difference of duration indicates that GCMs have a relatively low skill (overestimation) in reproducing the mean duration of SD. The results show that individual model performance varies substantially. The CNRM-CM6-1 model shows very poor performance while the MR-ESM2.0 is comparatively better compared to any other models. Only the MR-ESM2.0 model shows an underestimation of the mean duration of SD in the mountainous region. However, GCMs show better skill in reproducing the mean duration of WN across the province. The BCC-CSM2-MR, CNRM-CM6-1, and MRI-ESM2.0 underestimate mean duration in the Rockies. Overall, the variation in the estimation of the mean duration of WN is within ± 1 day.
We find an inverse relationship between duration and frequency of SD and WN. A sub-basin with a high duration of SD and WN has a low frequency of SD and WN and vice-versa. The spatial pattern of the frequency distribution of SD and WN are also similar to that of mean duration. Results in Figure 5 reveal the minimum frequency of SD and WN in the northern and mountainous regions of the province. Overall, the average number of SD and WN is 13 and 14 across the province, respectively. Based on the overall results, the underestimation of delta mean frequency ranges from 3 to 4 and 0.5 to 1.6, respectively for SD and WN. Nie et al. [44] pointed out the new cloud-fraction scheme updated in the CMIP6 version that might help to improve the simulation of temperature extremes by giving surface radiant fluxes in the low-and mid-latitudes. In their study, they found that the BCC-CSM2-MR model can simulate the warm and cold temperature extremes reasonably well.
Density curves of duration and frequency of SD and WN are shown in Figure 6. The GCM simulated distributions of SD are highly deviated from the observed distributions. All GCMs overestimate the duration and underestimate their frequency. For both characteristics, individual models' distributions are substantially different from each other. However, duration and frequency distribution curves of WN are similar to the observed distribution. GCMs can partially reproduce the density curves although they overestimate duration and underestimate the frequency of WN.

Tail Behaviour of Precipitation and Temperature Extremes
The tail behaviour of P, Tmax, and Tmin extremes is shown in Figure 7. The observed tail index of P extremes is mostly positive across the province, indicating a heavy-tailed distribution. Results reveal high values of tail index in the southern part of Alberta and BC portion of the study area. The north-eastern part of the province showed thin tails. All GCMs reveal similar spatial patterns of difference, whether over-or underestimation of the tail index for the precipitation extremes with values ranging around ± 0.25, as illustrated in Figure 6. GCMs highly overestimate the tail index. Such over-or underestimation of tail behaviour is also found for CMIP5-GCMs over the Euro-Mediterranean region [100]. The tail behaviour is known to be sensitive to the convective parameterization (CP) within GCMs, which aims to represent the effects of convection on the grid-scale but does not capture the dynamics of individual storms [101]. However, the CP is a major source of errors in climate simulations [94] and therefore, the bias in the tail index estimation is inherent to the climate model performances. CNRM-CM6-1, which has similar base as the observations with a short peak. For Tmax extremes, GCMs reproduce the tail index very well at the base; however, they overestimate the peak, which is higher than the observed tail index (Figure 8). GCMs show biases (overestimation) both at the base and peak for the Tmin extremes ( Figure 8) and their temporal distributions are similar. The three-parameter distributions (e.g., GPD) are less parsimonious than the two-parameter distributions and have large parametric uncertainty in estimating the tail index. However, there are ways (e.g., regionalization) to improve the accuracy of this parameter estimation by maintaining spatial homogeneity [49]. Regionalization increases the robustness in estimation by increasing the sample size by substituting space for time [51], which can be followed in future studies.

Regional Variation of GCM Performances
In this section, we discuss GCMs' performances in reproducing the extremes of P, Tmax, and Tmin in the four identified regions within Alberta Figure 1. The observed tail index of both Tmax and Tmin extremes are negative for most of the sub-basins, indicating a short-tailed distribution Figure 7. The spatial distribution of tail index is somewhat different for both temperature extremes. Positive values of Tmax extremes are located in the south and north-eastern part of the study area. In addition to these locations, the tail index of Tmin extremes is positive in the foothills of the Rockies. Otherwise, a relatively thin tail (negative) is observed in the entire province. A thin-tailed probability distribution corresponds to the upper tail declining to zero exponentially or faster [102]. GCM simulations are quite biased in reproducing the tail behaviour of extremes. For Tmax extremes, models overestimate in the mountainous area and northernmost Alberta. In terms of bias estimation, the spatial distribution of tail index of Tmax extremes is analogous to the CNRM-CM6-1 and MR-ESM2. For the Tmin extremes, a comparable spatial pattern existed for the delta tail index for all GCMs. Similar to the results of Tmax extremes, the magnitude of bias (overestimation) based on Tmin extremes is higher for most of the sub-basins.
The temporal distribution of density curves of the tail index for P extremes Figure 8 shows that all GCMs slightly overestimate the index with comparable height except the CNRM-CM6-1, which has similar base as the observations with a short peak. For Tmax extremes, GCMs reproduce the tail index very well at the base; however, they overestimate the peak, which is higher than the observed tail index (Figure 8). GCMs show biases (overestimation) both at the base and peak for the Tmin extremes ( Figure 8) and their temporal distributions are similar. The three-parameter distributions (e.g., GPD) are less parsimonious than the two-parameter distributions and have large parametric uncertainty in estimating the tail index. However, there are ways (e.g., regionalization) to improve the accuracy of this parameter estimation by maintaining spatial homogeneity [49]. Regionalization increases the robustness in estimation by increasing the sample size by substituting space for time [51], which can be followed in future studies.
Regionalization increases the robustness in estimation by increasing the sample size by substituting space for time [51], which can be followed in future studies.

Regional Variation of GCM Performances
In this section, we discuss GCMs' performances in reproducing the extremes of P, Tmax, and Tmin in the four identified regions within Alberta Figure 1.

Extreme Characteristics
The regional distributions of density curves (Figure 9) show fluctuation in densities of mean duration and frequency of CWD, SD, and WN. The shape of the distribution varies across regions. The western mountainous region (R1) has flat shaped density curves with reduced peaks for both duration and frequency. R2, which consists of grassland with

Regional Variation of GCM Performances
In this section, we discuss GCMs' performances in reproducing the extremes of P, Tmax, and Tmin in the four identified regions within Alberta Figure 1.

Extreme Characteristics
The regional distributions of density curves (Figure 9) show fluctuation in densities of mean duration and frequency of CWD, SD, and WN. The shape of the distribution varies across regions. The western mountainous region (R1) has flat shaped density curves with reduced peaks for both duration and frequency. R2, which consists of grassland with low annual precipitation and high temperature [103], has the highest peak for both duration and frequency. Different components of the biosphere control the regional climate feedback processes that may govern the variability of GCM simulations across regions. Consequently, we observe variation in results across regions as simulated by a single GCM. The inter-model variability in reproducing duration of CWD is less than that of frequency. For SD, all GCM simulations are very different from the observed distribution of duration and frequency of SD. For WN, GCMs slightly overestimate the duration and underestimate the frequency in all regions.

Tail Behaviour of Extremes
The tail behaviour of P extremes in Figure 10 shows that the majority of GCMs produce the tails index quite well in all regions, except the BCC-CSM2-MR which has a slight overestimation. In R4, the observed tail behaviour of P extremes is somewhat complex with multiple peaks, and the behaviour is not reproduced by any GCM. However, the performance of GCMs is better for the Tmax and Tmin extremes where simulated peaks of density curves are higher than that of observed density curve.
The tail behaviour of P extremes in Figure 10 shows that the majority of GCMs produce the tails index quite well in all regions, except the BCC-CSM2-MR which has a slight overestimation. In R4, the observed tail behaviour of P extremes is somewhat complex with multiple peaks, and the behaviour is not reproduced by any GCM. However, the performance of GCMs is better for the Tmax and Tmin extremes where simulated peaks of density curves are higher than that of observed density curve.

Evaluation of Mean and Extreme Characteristics of Downscaled Simulations
The KS statistic (D) is calculated using the downscaled (DS) and non-downsca (NDS) data against the observations ( Figure 11). The reduced magnitude of D clearly dicates the overall improved precipitation prediction skills of GCM simulations across province. The D value is decreased from 0.02-0. 50  The DS GCM simulations are further assessed using three extremes indices (CW SD, and WN). We extracted two characteristics (i.e., duration and frequency) of each ind using DS simulations and estimated the difference between DS and observations for ea sub-basin. The delta duration and frequency of CWD ( Figure 12) show the close repres tation of DS simulations to observations, indicating better skills of DS GCMs. Likewi the difference of duration and frequency based on SD and WN between the DS GCM si ulations and observations is reduced across the province. Similar to the mean character tics, DS GCMs show better skill in reproducing extreme precipitation characteristics co pared with the temperature extremes. Figure 13 shows the spatial distribution of the delta tail index of P, Tmax, and Tm extremes for DS GCM simulations. There is no clear indication that DS GCMs perfo

Evaluation of Mean and Extreme Characteristics of Downscaled Simulations
The KS statistic (D) is calculated using the downscaled (DS) and non-downscaled (NDS) data against the observations ( Figure 11). The reduced magnitude of D clearly indicates the overall improved precipitation prediction skills of GCM simulations across the province. The D value is decreased from 0.02-0.50 (NDS) to 0.01-0.11 (DS). NDS GCMs obviously show poor performance ( Figure 3) compared with DS GCMs in simulating precipitation. However, in contrast to the precipitation, bias correction and downscaling the temperature ensemble time series led to a poorer prediction skill for the GCMs (Figure 11). This observation indicates that the GCMs (NDS) that participated in CMIP6 are reasonably good at simulating daily maximum and minimum temperatures. Based on bR 2  The DS GCM simulations are further assessed using three extremes indices (CWD, SD, and WN). We extracted two characteristics (i.e., duration and frequency) of each index using DS simulations and estimated the difference between DS and observations for each sub-basin. The delta duration and frequency of CWD ( Figure 12) show the close representation of DS simulations to observations, indicating better skills of DS GCMs. Likewise, the difference of duration and frequency based on SD and WN between the DS GCM simulations and observations is reduced across the province. Similar to the mean characteristics, DS GCMs show better skill in reproducing extreme precipitation characteristics compared with the temperature extremes.
shown in the Figures 11-13. However, results show that the bias correction and downscaling approach reasonably reproduced mean and extremes (especially for precipitation). These results are not surprising as Li et al. [94] and Kuo et al. [95] used dynamical downscaling approach to downscale historical climate over western Canada and found similar results, although dynamical downscaling showed better ability over statistical downscaling [99]. Figure 11. The verification metrics between GCM simulations (downscaled: DS and nondownscaled: NDS) and observed data for the period of 1983-2014. The x-axis represents the 2255 sub-basins in Alberta and y-axis represents the corresponding evaluation metric for daily P, Tmax, and Tmin. The negative values of NSE were ignored and the x-axis was set at 0. Figure 11. The verification metrics between GCM simulations (downscaled: DS and non-downscaled: NDS) and observed data for the period of 1983-2014. The x-axis represents the 2255 sub-basins in Alberta and y-axis represents the corresponding evaluation metric for daily P, Tmax, and Tmin. The negative values of NSE were ignored and the x-axis was set at 0. Figure 13 shows the spatial distribution of the delta tail index of P, Tmax, and Tmin extremes for DS GCM simulations. There is no clear indication that DS GCMs perform better than the NDS GCMs ( Figure 6). The spatial pattern of the delta tail index shows that the magnitude of differences decrease in most of the sub-basins. There are some opposite signals also found for all extremes, and more specifically, overestimation by DS GCM and underestimation by NDS GCM. For example, the DS CNRM-CM6-1 overestimates ( Figure 13) while this GCM underestimates the tail index under NDS condition (Figure 7).
Note that the bias correction and downscaling generally involves processes that bring the climate model simulations close to the observations; however, the bias-corrected and downscaled simulations always carry the climate change signal from the host model [99]. Therefore, results based on DS simulations do not substantially increase the confidence as shown in the Figures 11-13. However, results show that the bias correction and downscaling approach reasonably reproduced mean and extremes (especially for precipitation). These results are not surprising as Li et al. [94] and Kuo et al. [95] used dynamical downscaling approach to downscale historical climate over western Canada and found similar results, although dynamical downscaling showed better ability over statistical downscaling [99].

Summary and Conclusions
This study analyzed the performance of five CMIP6 GCMs in their ability to simulate the observed spatio-temporal behaviour of climate means and extremes across 2255 subbasins in Alberta, Canada. Daily simulations of P, Tmax, and Tmin were evaluated against

Summary and Conclusions
This study analyzed the performance of five CMIP6 GCMs in their ability to simulate the observed spatio-temporal behaviour of climate means and extremes across 2255 subbasins in Alberta, Canada. Daily simulations of P, Tmax, and Tmin were evaluated against a hybrid observational data set that best represents the climatic and hydroclimatic conditions of the province. Despite similar spatial resolutions, model performances varied in their ability to simulate the means and extremes of precipitation and temperatures. From the various analyses presented and discussed in this study, we summarize and conclude the following main outcomes.

1.
The average bias in mean annual precipitation is reasonably low for all sub-basins, except for the CNRM-CM6-1 GCM. The EC-Earth3 and EC-Earth3-veg simulate the annual mean P quite well followed by the MRI-ESM2.0 and BCC-CSM2-MR. However, the performance of CNRM-CM6-1 is very poor with substantial underestimation. For temperature, the MRI-ESM2.0 shows the worst performance. The EC-Earth3 and EC-Earth3-veg show better skill followed by the BCC-CSM2-MR and CNRM-CM6-1.
Overall, models show better performance in simulating Tmax than Tmin. For both precipitation and temperature, models reproduced the observations better in the north and follow a gradient toward the south with poorest performance in the mountainous area.

2.
Minimum positive performance errors (overestimation) are found for the mean annual duration of CWD followed by WN and SD. The BCC-CSM2-MR performed poorly with respect to the duration of CWD, as did the CNRM-CM6-1 regarding the duration of both SD and WN (compared to other GCMs for the entire domain of study). The temporal distributions of duration by model simulations are reasonably superimposed to that of observations in the case of CWD; however, they are slightly and completely overestimated by GCMs for the duration of WN and SD, respectively. In general, there is an inverse relationship between the duration and frequency of occurrence of extreme indices. GCMs consistently underestimated the frequency whereas they overestimated the duration. Nevertheless, the performance of the individual models to simulate frequency is rather similar to that of duration. For all extreme indices, a pattern of over-or underestimating the duration/frequency observed in the southwestern side of the province where the Canadian Rockies are located. Therefore, it would be interesting to investigate the bias-topography relationship during subsequent verification studies across mountainous regions of North America. 3.
The observed tail index (shape parameter of the Generalized Pareto Distribution) indicated a heavy tail for P extremes and light tail for Tmax and Tmin extremes. The tail index reasonably follows the spatial distribution of observations. However, a little difference in the tail of distribution significantly affects the long return periods indicating the importance of good tail representation. In this aspect, GCMs still may not incorporate the convective parameterization scheme at the existing grid spacing. The individual model performance is quite similar for all extremes having the poorest performance (highest magnitude of errors) by the BCC-CSM2-MR for P, MRI-ESM2.0 for Tmax, and both BCC-CSM2-MR and MRI-ESM2.0 for Tmin extremes. 4.
The downscaled GCMs showed better skill in simulating mean annual precipitation compared to the non-downscaled GCMs. The performance of DS GCM simulations was not satisfactory for Tmax and Tmin. The DS technique improved Tmax simulations by the BCC-CSM2-MR and MRI-ESM2.0. Only the MRI-ESM2.0 showed better performances in Tmin after downscaling. However, GCMs showed good skills when reproducing the characteristics (duration and frequency of occurrence) of CWD, SD, and WN based on DS simulations (as compared to NDS simulations). Overall, the bias correction and downscaling approach worked well for reproducing extreme characteristics, and more specifically, improved CWD's characteristics over those associated with SD and WN. After downscaling, there is no clear indication of having an improved tail index of GPD based on precipitation and temperature extremes.
The downscaled simulations do not significantly increase our confidence to simulate climate variables, specifically the Tmax and Tmin time series.
It is obvious that results from each GCM would be different in terms of spatial and temporal scale. The physical reasons behind any over-or underestimation by individual GCMs would require more in-depth and perhaps process-based analyses. We considered the best available hybrid observation dataset to evaluate the performance of five CMIP6 GCMs to simulate the means and extremes of several climatic variables. However, their performances can also be evaluated through using different reanalysis and global and regional gridded climate products to draw more robust conclusions. Poor model performances in the mountainous region may arise due to the lack of high-quality observed data in that region. Several factors affect the GCMs' performance including the choice and quality of observed hybrid dataset and performance metrics. Application of quantitative statistical metrics (bR 2 and NSE) may compromise the presence of uncorrelated day-to-day variations, which account for a large portion of total variability. In such cases, validation statistics like distribution of time series and autocorrelation functions, that do not depend on direct correlation between time series, can be applied simultaneously to further check the GCM's performances in reproducing observed behaviour of climate means and extremes. Despite these limitations and based on our statistical analysis of precipitation and temperature data, our study provides useful information about GCMs performances in simulating means and extremes of daily P, Tmax, and Tmin across Alberta. Such information clearly lays the foundation for future climate impact analysis using CMIP6 GCMs.