Impact of Spatial Aggregation Level of Climate Indicators on a National-Level Selection for Representative Climate Change Scenarios

Abstract: For sustainable management of water resources, adaptive decisions should be determined considering future climate change. Since decision makers have difficulty in formulating a decision when they must consider a large number of climate change scenarios, selecting a subset of Global Circulation Model (GCM) outputs for climate change impact studies is required. In this study, the Katsavounidis-Kuo-Zhang (KKZ) algorithm was used to select representative climate change scenarios, and a comprehensive analysis was carried out through a national-level case study of South Korea. The KKZ algorithm was applied to select a subset of GCMs for each subbasin in South Korea. To evaluate the impact of the spatial aggregation level of climate data sets on preserving the inter-model variability of hydrologic variables, three different scales (national level, river region level, subbasin level) were tested. It was found that only five GCMs selected by the KKZ algorithm can explain almost the whole inter-model variability driven by all 27 GCMs under Representative Concentration Pathways (RCP) 4.5 and 8.5. Furthermore, a single set of representative GCMs selected at the national level was able to explain the inter-model variability in almost all subbasins. For the low flow variable, however, the use of a finer scale of climate data sets is recommended.


Introduction
For sustainable management of water resources, future climate change impacts should be assessed for adaptive decision making. For climate change impact studies, projections driven by a group of Global Circulation Models (GCMs) are generally used to capture a plausible range of changes in future climate conditions. Although it is considered desirable to employ as many GCMs as possible to quantify the inter-model variability (i.e., the variability across all the GCM outputs), this task can be complicated by large computational costs [1]. Further, decision makers have difficulty in formulating a decision when they must consider a large number of climate change scenarios [1]. In this regard, a variety of researchers have developed techniques for selecting a subset of GCMs for climate change impact studies [2][3][4][5][6][7][8]. In this paper, the term "GCMs" is used synonymously with "climate change scenarios".
According to the agreement of many experts, a subset of GCMs should achieve the following: (1) favor models that accurately reproduce patterns of historical climate (e.g., mean monthly values and annual cycles) [9,10], thus enhancing the plausibility of results for future projections in the target regions, and (2) incorporate a potential range of future climate conditions in terms of the key

Scenario Selection: KKZ Algorithm
The KKZ algorithm is adopted to select the representative climate change scenarios in this study. Unlike k-means clustering, which best characterizes high-density regions of multivariate space, the KKZ algorithm recursively selects the models that best span the spread of an ensemble [1,12]. Given the number of GCMs, N, and the number of climate variables, P, the KKZ algorithm is applied as follows:
(1) For the first GCM selection, the model that lies closest to the ensemble centroid is selected, i.e., the GCM with the lowest sum of squared errors (SSE) to the centroid across all the climate variables, as given in Equation (1):
$$\mathrm{SSE}_i = \sum_{p=1}^{P} \left( y_{ip} - \bar{y}_p \right)^2 \qquad (1)$$
where $y_{ip}$ represents the value of the p-th climate variable for the i-th GCM, and $\bar{y}_p$ is the centroid value of the p-th climate variable across all GCMs.
(2) For the second GCM selection, the GCM that lies farthest from the first GCM is selected. The Euclidean (P-space) distance is applied to calculate the distance, d(i,j), between two GCMs (the i-th and j-th GCMs):
$$d(i,j) = \sqrt{\sum_{p=1}^{P} \left( y_{ip} - y_{jp} \right)^2}$$
(3) For each of the following selections (from the 3rd to the last): (i) the distances from each remaining GCM to all previously selected GCMs are calculated; (ii) for each remaining GCM, only the lowest of the distances calculated in step 3(i) is retained; (iii) the GCM with the maximum of the distances retained in step 3(ii) is selected as the next GCM.
(4) Step 3 is repeated until all GCMs have been placed in order.
Figure 1 illustrates an example of the step-by-step procedure for a simple bivariate case.
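The selection steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the study: the function name `kkz_order` is hypothetical, and the sketch assumes that the climate indices in `y` are already expressed on comparable scales (any standardization would have to be applied beforehand).

```python
import numpy as np

def kkz_order(y):
    """Return GCM indices in KKZ selection order.

    y: array of shape (N, P), one row per GCM and one column per
       climate index (assumed to be on comparable scales).
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    # Step 1: the GCM closest to the ensemble centroid (lowest SSE).
    sse = ((y - y.mean(axis=0)) ** 2).sum(axis=1)
    order = [int(np.argmin(sse))]
    # Pairwise Euclidean distances d(i, j) in P-space.
    dist = np.sqrt(((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=2))
    # min_dist[i]: distance from GCM i to its nearest selected GCM.
    min_dist = dist[order[0]].copy()
    while len(order) < n:
        min_dist[order] = -np.inf  # a selected GCM is never picked again
        # Steps 2 and 3: pick the GCM whose nearest selected GCM is farthest.
        nxt = int(np.argmax(min_dist))
        order.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return order
```

Taking the first k entries of the returned list gives the k representative GCMs, which is why a single run of the algorithm yields a complete ranking.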

Climate Indices
Seo et al. [1] demonstrated that a group of key climate indices must be determined prior to the selection of a group of GCMs. Hence, it is not necessary to consider many climate indicators for the selection of representative climate scenarios. Based on the dependency of each hydrologic variable on climate indicators, three different pairs of climate indices were determined for the three hydrologic variables of interest: annual mean flow, three-day peak flow, and seven-day low flow, which represent the mean, high, and low flow regimes, respectively. Three-day peak flow and seven-day low flow have been widely used as high-flow and low-flow variables in the US [16][17][18][19]. Table 1 presents the selected climate indices, which were adopted from a previous study [1]. Changes in these climate indices for a given future period are then used as the y values in the KKZ algorithm.
Figure 1 (caption, partially recovered). Step (4): select the 4th GCM, which has the longest distance among the group of shortest distances to the 1st to 3rd GCMs; (c) Step (4): select the 5th GCM, which has the longest distance among the group of shortest distances to the 1st to 4th GCMs. Here, the gray arrows represent GCMs disregarded by the algorithm.

Explained Variability
To evaluate the capability of capturing inter-model variability, "explained variability" is defined as the proportion of the variability captured by the selected GCMs relative to the entire variability driven by all the GCMs. For instance, if the explained variability of six models selected from all 27 models is 1 (i.e., 100%), we do not need to use any models other than these six (6/27, 22% of all the models) to explain all the inter-model variability obtained from all 27 models. In this study, explained variability across changes in climate indices is calculated to assess the ability of the KKZ algorithm to capture inter-model variability. Furthermore, explained variability in changes in each hydrologic variable (three categories in Table 1) is also estimated to evaluate the transferability of the selected GCMs to uncertainty in hydrologic variables.
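The text does not spell out how "variability" is measured, so the following sketch rests on an explicit assumption: for a single variable, variability is taken as the range (maximum minus minimum) of the projected changes. The function names are hypothetical. The sketch only shows how the ratio, and its growth with the number of selected GCMs, could be computed.

```python
import numpy as np

def explained_variability(values, selected):
    """Share of the full-ensemble range covered by the selected GCMs.

    values:   1-D array, change in one variable for every GCM.
    selected: indices of the selected GCMs.
    Assumption: variability is measured as the range (max minus min).
    """
    values = np.asarray(values, dtype=float)
    full_range = values.max() - values.min()
    if full_range == 0.0:
        return 1.0  # no spread to explain
    subset = values[list(selected)]
    return float((subset.max() - subset.min()) / full_range)

def explained_curve(values, order):
    """Explained variability for the top 1, 2, ..., N GCMs of a ranking."""
    return [explained_variability(values, order[:k])
            for k in range(1, len(order) + 1)]
```

With a ranking such as the one produced by the KKZ algorithm, `explained_curve` gives the kind of curve in which explained variability is plotted against the number of selected GCMs.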

Rainfall-Runoff Model: Tank Model
A modified conceptual rainfall-runoff model, the Tank model [20] with a soil moisture structure, was used as the hydrologic model for runoff simulation in this study. Having four tanks along with a soil moisture structure, the model simulates the net stream discharge as the sum of the discharges from the side outlets of the tanks. Mean areal precipitation, temperature, and potential evapotranspiration series are input to the model to simulate streamflow at the outlet. In this study, the Penman-Monteith equation was used to estimate potential evapotranspiration [21]; climatological records of solar radiation, humidity, and wind speed were used along with temperature data. To account for snow accumulation and melting, the modified Tank model developed by McCabe and Markstrom [22] was used. Readers are referred to Sugawara [20] and McCabe and Markstrom [22] for details of the terminology and equations for all the processes. The parameters of the model are estimated using the shuffled complex evolution algorithm, one of the population-evolution-based global optimization methods [23]. Figure 2 illustrates the schematic diagram of the Tank model.
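As a rough illustration of how discharge arises as the sum of side-outlet flows from a series of tanks, the following is a deliberately simplified sketch and not the model used in this study. It assumes one side outlet per tank, omits the soil moisture structure and the snow module, and uses hypothetical parameter names (`side_coef`, `side_height`, `bottom_coef`).

```python
def tank_step(storage, rain, pet, side_coef, side_height, bottom_coef):
    """Advance a simplified series of tanks by one time step.

    storage:     water depth in each tank, top tank first.
    rain, pet:   precipitation and potential evapotranspiration depths.
    side_coef:   side-outlet coefficient of each tank.
    side_height: side-outlet height of each tank.
    bottom_coef: bottom-outlet (infiltration) coefficient of each tank.
    Assumption: side_coef[k] + bottom_coef[k] <= 1 for every tank.
    Returns the updated storages and the total discharge.
    """
    storage = list(storage)
    # Rain and evapotranspiration act on the top tank only (simplification).
    storage[0] = max(storage[0] + rain - pet, 0.0)
    discharge = 0.0
    for k in range(len(storage)):
        side = side_coef[k] * max(storage[k] - side_height[k], 0.0)
        bottom = bottom_coef[k] * storage[k]
        storage[k] -= side + bottom
        discharge += side              # side outlets feed the stream
        if k + 1 < len(storage):
            storage[k + 1] += bottom   # bottom outlet feeds the tank below
    return storage, discharge
```

Calling `tank_step` once per day with the daily forcing series would give a daily discharge series; in this sketch, any bottom outflow from the last tank is simply lost from the system.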

Study Area: South Korea
For a national-level application of the proposed methodology, South Korea was selected as the test case. South Korea covers an area of approximately 100,000 km² and has a population of almost 52 million. The climatic conditions of South Korea are dominated by the Asian monsoon; therefore, approximately two-thirds of annual precipitation and runoff occur during the summer season, which spans from July to September. Monsoons are the major driver of the timing, magnitude, and distribution of wet season rainfall, rainfall inter-annual variability, and rainfall extremes in South Korea [5]. As shown in Figure 3, South Korea includes 5 major river regions and 113 subbasins (excluding Jeju Province). To assess the impacts of spatial scale, representative GCMs were selected separately for South Korea, for each river region, and for each subbasin. In other words, a single set of GCMs is selected for South Korea, five different sets of GCMs are selected for the river regions (one per region), and 113 different sets of GCMs are selected for the subbasins (one per subbasin). Mean areal climate data sets for South Korea, the river regions, and the subbasins are applied, respectively.

Observed Meteorological Data Sets
Observed meteorological data sets from 1976 to 2015 were obtained. Daily series of mean areal precipitation and temperature for all the subbasins were calculated using Thiessen polygons [24] from 60 of the Korea Meteorological Administration's Automated Surface Observing System (ASOS) gauges (https://data.kma.go.kr/data/grnd/selectAsosList.do?pgmNo=34).
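Once the Thiessen polygons have been constructed, the mean areal series for a subbasin is an area-weighted average of the gauge series. The sketch below assumes that the polygon areas falling inside the subbasin are already known; the function and argument names are hypothetical.

```python
import numpy as np

def mean_areal_series(gauge_series, polygon_area):
    """Area-weighted mean of gauge series for one subbasin.

    gauge_series: array of shape (T, G), T time steps for G gauges.
    polygon_area: length-G areas of each gauge's Thiessen polygon
                  inside the subbasin (assumed to be precomputed).
    """
    gauge_series = np.asarray(gauge_series, dtype=float)
    weights = np.asarray(polygon_area, dtype=float)
    weights = weights / weights.sum()  # normalize to Thiessen weights
    return gauge_series @ weights
```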

GCM Data Sets
Historical simulation  and future projections, 2030s (2016-2045) and 2060s (2046-2075), of daily precipitation and temperature series were obtained from 27 GCMs of the Coupled Model

Table 2 lists the 27 GCMs applied in this study. The coarse-resolution GCM outputs were first interpolated to the spacing of the ASOS gauges (≈45 km) using an inverse distance weighting scheme. Mean areal values for the subbasins were subsequently calculated with Thiessen polygons in the same way as for the observed meteorological data sets. Systematic biases in these GCM forcing data sets were corrected with the Quantile Delta Mapping (QDM) approach [44]. Readers should refer to Eum and Cannon [45] for the specific steps of the QDM algorithm implemented for the precipitation and temperature series in this study. With the QDM algorithm, the bias-corrected values preserve the relative changes in all quantiles projected by the GCMs.
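The inverse distance weighting step can be illustrated as follows. This is a generic sketch, not the study's code: it assumes planar coordinates and a distance exponent of 2, neither of which is stated in the text, and the function name is hypothetical.

```python
import numpy as np

def idw_interpolate(grid_xy, grid_values, target_xy, power=2.0):
    """Inverse distance weighting from GCM grid points to target points.

    grid_xy:     (M, 2) coordinates of GCM grid points (planar, assumed).
    grid_values: (M,) values at the grid points for one time step.
    target_xy:   (K, 2) coordinates of the target locations.
    power:       distance exponent (2 is an assumption, not from the text).
    """
    grid_xy = np.asarray(grid_xy, dtype=float)
    grid_values = np.asarray(grid_values, dtype=float)
    target_xy = np.asarray(target_xy, dtype=float)
    # Distance from every target point to every grid point, shape (K, M).
    dist = np.linalg.norm(target_xy[:, None, :] - grid_xy[None, :, :], axis=2)
    out = np.empty(len(target_xy))
    for k, row in enumerate(dist):
        exact = row == 0.0
        if exact.any():
            # Target coincides with a grid point: use its value directly.
            out[k] = grid_values[exact].mean()
        else:
            w = row ** -power
            out[k] = (w * grid_values).sum() / w.sum()
    return out
```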

Modelling Framework
Climate data sets (27 GCMs) are first pre-processed to obtain mean areal precipitation and temperature series for the three different spatial scales (national, river region, and subbasin). The rank of the GCMs is then determined by the KKZ algorithm using the pair of climate indices for each hydrologic variable. Next, the explained variability of the climate indicators corresponding to the selected GCMs is calculated. The impacts of the selected GCMs on hydrologic variables are then evaluated for the three different spatial scales in the form of the explained variability of changes in hydrologic variables. Since the KKZ algorithm ranks the GCMs by selection priority, the corresponding explained variability is estimated as the number of GCMs is increased. Figure 4 shows the modelling framework of this study, illustrating how the climate indices, KKZ algorithm, GCMs, Tank model, and hydrologic variables are associated and sequenced in the modelling process.

Tank Model Parameter Estimation
Parameters of the Tank model were estimated using the observed data sets from 1976 to 2000, and the model performance was validated by comparing simulated streamflow with observed streamflow from 2001 to 2015. Daily observed streamflow (dam inflow) series from 1966 to 2016 at 35 dam sites were collected from the K-water Institute. These observed dam inflow series were used to calibrate the parameters of the Tank model, and the streamflow series simulated with the calibrated parameters were validated against the observed series. Nash-Sutcliffe efficiency (NSE) values for the 35 watersheds ranged from 0.68 to 0.91, and percent bias values ranged from −1.57 to 8.93.
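For reference, the two goodness-of-fit measures reported above can be computed as in the sketch below. The sign convention for percent bias varies between studies; this sketch assumes that positive values indicate overestimation by the model, which may differ from the convention used in this study.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the
    skill of always predicting the observed mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - ((obs - sim) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()

def percent_bias(obs, sim):
    """Percent bias of simulated against observed totals.
    Assumed convention: positive means the model overestimates."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 100.0 * (sim - obs).sum() / obs.sum()
```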
To simulate long-term streamflow for ungauged watersheds (in other words, to estimate parameters for watersheds in which no observed streamflow data exist), multiple regression equations were derived to estimate regional parameters of the Tank model [46]. Six watershed characteristic factors were used as predictor variables in the regression equations. The correlation coefficients of the regional regression equations ranged from 0.64 to 0.78 [46]. As a result of the regionalization, parameter values were obtained for all 113 subbasins.

Selection of Representative Scenarios
GCMs were placed in different orders by the KKZ algorithm for the three hydrologic variables introduced in Table 1. Further, the GCM ranks also differed across the three spatial scales of climate data sets, as shown in Figure 4. Tables 3 and 4 present the 1st to 5th ranked GCMs for the three hydrologic variables and two spatial scales under RCP 4.5 and 8.5, respectively. Two periods of future projections, the 2030s and 2060s, are presented. The GCM ranks for the subbasin scale are excluded because of the difficulty of enumerating all the subbasins.
In Table 3, bold symbols in the five river regions represent GCMs that are also included in the rank for the national level (South Korea). Underlined symbols in the 2060s represent GCMs that are also included in the 2030s. In general, several GCMs overlapped across different spatial scales and locations. In the 2030s period in particular, three or four out of five GCMs overlapped. Thus, it can be anticipated that uncertainties in climate indices at a finer spatial level (i.e., river region level) can be explained to some extent by the GCMs selected for a coarser spatial level (i.e., national level). This premise is further discussed in Section 4.4.
Table 4 shows the 1st to 5th ranked GCMs under RCP 8.5, presented in the same fashion as Table 3, except that red-colored symbols represent GCMs that are also included under RCP 4.5. Interestingly, only a few GCMs (one or two out of the five) overlapped between RCP 4.5 and 8.5; there was no general agreement on the selected GCMs between the two scenarios. Given that RCP 8.5 represents the business-as-usual scenario whereas RCP 4.5 represents a possible mitigation scenario, the selection of representative climate scenarios should be implemented carefully under various hypotheses on future adaptation strategies.

Explained Variability on Climate Indices
Although Tables 3 and 4 show only the GCMs selected up to the 5th rank, the KKZ algorithm orders all the GCMs according to their priority. Figures 5 and 6 present the explained variability of changes in climate indices for the 2030s and 2060s periods under the RCP 4.5 and RCP 8.5 scenarios, respectively.
As shown in Figures 5 and 6, 80% of explained variability (y-axis) was quickly reached with only three or four GCMs (x-axis). Further, close to 100% of the explained variability was reached with five GCMs in most cases. Thus, the KKZ algorithm successfully identified the top five GCMs (among 27) that are able to explain the majority of the inter-model variability from all the GCMs.

Explained Variability on Hydrologic Variables
Since the majority (more than 80%) of the inter-model variability from all the GCMs was successfully explained under both RCP 4.5 and 8.5, only the results of explained variability on hydrologic variables under RCP 4.5 are presented in this section; the results under RCP 8.5 were very similar and are omitted. Figure 7 presents the explained variabilities of changes in three hydrologic variables (mean flow, high flow, and low flow) in the 2030s period under RCP 4.5. The explained variabilities were calculated for all 113 subbasins using the top five GCMs ranked by the KKZ algorithm. The explained variabilities in the left, middle, and right panels were estimated with the GCMs ranked using the national, river region, and subbasin level climate data sets, respectively. In other words, the explained variabilities of all the subbasins were estimated with the same set of top five GCMs ranked by the national level selection (left), while the explained variability of each subbasin was estimated with its own set of top five GCMs ranked by the subbasin level selection (right). The explained variabilities of all the subbasins belonging to the same river region were estimated with the same set of top five GCMs ranked by the river region level selection (middle).
In general, the explained variabilities of changes in mean flow (Figure 7a) were well preserved by the top five GCMs regardless of the spatial scale of the climate data sets. A single set of five GCMs is capable of capturing almost the whole inter-model variability driven by the 27 GCMs. In the case of high flow (Figure 7b), although inter-model variabilities were not sufficiently explained by the national level selection in a few subbasins (left), most of the inter-model variabilities were successfully explained by the subbasin level selection (right). A similar pattern was found for low flow (Figure 7c), although the explained variabilities were lower than for the other two variables (Figure 7a,b). Thus, when a finer scale of climate data sets was used for the KKZ algorithm, the transferability of the selected GCMs to uncertainty in hydrological impacts increased. Nonetheless, note that a single set of GCMs ranked by the national level selection also has a remarkable ability to explain the inter-model variability of changes in hydrologic variables in each subbasin.
Figure 8 shows the explained variabilities of changes in three hydrologic variables (mean flow, high flow, and low flow) in the 2060s period under RCP 4.5 for the 113 subbasins. Similar to the 2030s period, the explained variabilities of changes in mean flow and high flow (Figure 8a,b) were well preserved by the top five GCMs regardless of the spatial scale of the climate data sets, except in the Han river region. Nonetheless, the explained variabilities of changes in mean flow and high flow in the Han river region were also well preserved under the river region and subbasin level selections. Similarly, for changes in low flow, although the explained variabilities were relatively low in the Seomjin and Yeongsan river regions under the national level selection, they were well preserved under the river region and subbasin level selections.
Thus, the transferability of the selected GCMs to uncertainty in local-scale hydrological impact studies can be improved by using a finer scale of climate data sets for the KKZ algorithm. In terms of the spatial scale of climate data sets for the selection of GCMs, a single set of top five GCMs (i.e., the five GCMs selected at the national level for South Korea) can explain almost the whole range of inter-model variabilities of changes in mean and high flow, except in a few subbasins, although it may not be applicable to changes in low flow.
Based on these findings, the final suggestions for the representative climate change scenarios for South Korea were determined as shown in Tables 5 and 6 for the 2030s and 2060s future periods, respectively, under RCP 4.5. Note that we chose RCP 4.5 as a realistic future scenario since it represents a possible mitigation measure, unlike RCP 8.5, which represents the worst case for climate change adaptation. For climate change impact studies on the mean flow and high flow variables, a single set of GCMs was assigned to all 113 subbasins for each future period. For the low flow variable, on the other hand, five different sets of GCMs were assigned to the subbasins belonging to the five river regions, respectively. Because of the selection process of the KKZ algorithm, the 1st GCM represents the centroid across all the ensemble members, so it can represent the median of the changes in the hydrologic variable. From the second selection onward, since the GCM that lies farthest from the previous selection(s) is selected, the 2nd to 5th GCMs represent values close to either the maximum or the minimum changes in the hydrologic variable. Thus, by considering these top five GCMs selected by the KKZ algorithm, almost the whole inter-model variability of changes in hydrologic variables can be explained.

Conclusions
Appropriate spatial scales for the selection of representative climate scenarios were analyzed in this study through a case study of South Korea. Overall, for a small country such as South Korea (approximately 0.1 million km²), a single set of representative GCMs is able to capture the inter-model variabilities of hydrologic variables for local-level impact studies. However, for the low flow variable, a finer scale of GCM selection is required, with different sets for each local-level case study. Furthermore, the representative GCMs need to be selected separately for different future periods since uncertainties in GCMs grow with lead time, which is consistent with the previous study by Hawkins and Sutton [47].
We found that the inter-model variabilities of changes in hydrologic variables are well preserved by a single set of GCMs (i.e., the top five GCMs out of the total of 27 in this study). This can certainly reduce the computational burden of water resources planning and management studies in which potential uncertainties in climate change should be addressed [48]. However, for local-scale hydrologic impact studies (i.e., subbasin level in this study), decision makers may have difficulty in determining which set of GCMs they should use. If only a single set of GCMs is provided, they do not have to worry about which scenarios to use. In contrast, if different sets of GCMs are provided for each subbasin, this would hinder the easy implementation of GCMs in local-scale studies.
Nonetheless, given that a single set of GCMs ranked by the national level selection has a remarkable ability to capture the potential inter-model variabilities of changes in the mean and high flow variables, a single set of representative climate scenarios can provide sufficient information on uncertainties in future changes in the mean- and high-flow regimes. Only for low flow variables would a set of GCMs selected with a finer spatial scale of climate data sets, specified for the local basins of interest, be required for a better understanding of uncertainties in future changes in the low flow regime. Furthermore, we found that different sets of GCMs were selected for the two future periods (the 2030s and 2060s in this study). This implies that using the set of GCMs selected for the near-term future would not be reliable for the long-term future. Since the relative importance of uncertainties in GCMs increases with time [47], it seems reasonable that different sets of GCMs should be selected separately for the near-term and long-term future periods. In addition, different sets of GCMs were selected under the two RCP scenarios (4.5 and 8.5). This implies that different sets of representative GCMs can be selected depending on which assumption about the future adaptation strategy is considered.
We should note that the area of the national-level application tested in this study, South Korea, is relatively small. When a national-level application is implemented for much larger countries, such as the United States (USA), China, and Australia, the spatial heterogeneity in climate data sets would be diminished by the smoothing effect of averaging climate data over multiple locations. In these cases, a sub-national level selection (e.g., states in the USA) can be recommended for the selection of representative climate change scenarios.