Evaluating Multiple WRF Configurations and Forcing over the Northern Patagonian Icecap (NPI) and Baker River Basin

Marcelo Somos-Valenzuela 1,2 and Francisco Manquehual-Cheuque 2,*

1 Butamallin Research Center for Global Change, Universidad de La Frontera, Av. Francisco Salazar 01145, Temuco 4780000, Chile; marcelo.somos@ufrontera.cl
2 Department of Forest Sciences, Faculty of Agriculture and Forest Sciences, Universidad de La Frontera, Av. Francisco Salazar 01145, Temuco 4780000, Chile
* Correspondence: f.manquehual01@ufromail.cl


Introduction
In Patagonia, as in many places in South America, access to well-distributed climate observations is a major challenge when working with physical hydrological models [1]. The few stations that are generally available are interpolated to cover areas with no data. As a result, the data are averaged and smoothed over details, and ultimately unable to capture spatial variability. Additionally, point observations are located in places of easier access (airports, roads, bridges, etc.), which have different characteristics than areas where the topography may have a larger impact. On the other hand, several datasets provide reanalysis data at coarse resolutions worldwide (e.g., GFS, the NCEP-NCAR atmospheric reanalysis, and ERA-Interim, to name a few). However, they cannot be used directly in watershed hydrology, since their resolution is not adequate: they cannot resolve convective precipitation, which occurs at smaller scales (around 4 km or less) [2,3]. To obtain a higher resolution, we need to downscale the reanalysis data to a resolution that satisfies our needs. There are two main techniques to do this: statistical downscaling and dynamic downscaling. Statistical downscaling is popular among hydrologists because it is computationally inexpensive; thus, it is possible to downscale multiple datasets on a single personal machine. One of the most popular methods is Bias Correction and Statistical Downscaling (BCSD) [4,5], but several other methods with similar principles are available. In the Northern Patagonia Ice Cap (NPI), statistical downscaling was used by [6]. However, these methodologies rely on good and abundant observation points to establish relationships, which is not the case for most parts of the world, nor for Chilean Patagonia. Dynamic downscaling, on the other hand, is a more involved process.
It refers to the production of high-resolution data using a numerical weather prediction (NWP) model, which is computationally expensive. Still, it allows a physically based approximation of the climate at a smaller scale [5,7]. During the last decade, several projects have arisen to dynamically downscale climate projections, such as NARCCAP [8,9] in the United States and, more recently, CORDEX, which covers South America. In addition, different studies in the area dynamically downscale reanalysis data to study specific events or processes at higher resolution. Examples of dynamic downscaling from coarse global to high-resolution (<15 km) regional climate using NWP in South America include [10], which downscaled Global Forecast System (GFS) analysis data for most of South America, and Villarroel et al. [11], which was also used in [6] (4 km). Yáñez-Morroni et al. [12] used NWP to forecast precipitation over complex topography in the "Quebrada San Ramón" near Santiago in central Chile. Although there are some examples of dynamic downscaling in Chilean Patagonia, no numerical experiment has been implemented that compares different parameterizations with observations before producing a long time series. It is critical to learn how sensitive the model is to the selection of physics and to choose appropriate parameters that allow generating an ensemble of downscaled datasets [13]. In this work, we are interested in generating climate forcing for hydro-glaciological models. Therefore, we produced an ensemble of results using 24 different combinations of parameters within WRF for three years and compared them with the weather stations available in the region, in order to select the best set of parameters within our WRF experiment for further long-term studies.

Study Area
The region of interest corresponds to the Baker River basin and the Northern Patagonia Ice Cap (NPI) (Figure 1), located in Western (Chilean) Patagonia. In this region, the austral Andes perturb the barometric waves, resulting in one of the most dramatic precipitation gradients on Earth [14]. In western Patagonia, precipitation decreases abruptly to the east. The western side has a hyper-humid climate with large synoptic-scale precipitation that is magnified by the orographic effect. Frontal systems moving from the west are forced upward immediately after they cross over land, releasing most of their water content, so little moisture is left for the east side of Patagonia [15]. Precipitation is uniformly distributed throughout the year [16], with an increase during the winter months (see the Aysen and Balmaceda stations, Figure 2 in [17]). This enables the formation of dense forests, large rivers, and glaciers, such as the NPI [14]. Zonal winds, on the other hand, influence seasonal air temperatures, with a difference in mean temperature between summer and winter of 5 °C at the ice caps to 15 °C on the east side of west Patagonia [14]. The Baker River basin has an area of approximately 27,000 km²; this basin produces the largest streamflow in Chile (>1000 m³·s⁻¹), with significant contributions from glacier melting on the east side of the NPI. The Baker River Basin is one of the most diverse systems and data-sparse regions in the world, with precipitation ranging from 220 to 1707 mm per year and elevation changes of more than 3000 m over less than 60 km [1]. Storage in glaciers and lakes regulates streamflow in most of the rivers of the Baker Basin [1,18,19].

Meteorological Observations
We were able to access data for 36 weather stations in the study area (Table 1). The parameters available are relative humidity (nine stations), precipitation (27 stations), pressure (six stations), wind speed (eight stations), daily mean air temperature (22 stations), and daily maximum and minimum air temperature (23 stations). Of all the stations, we used 30. In Table 1, we identify the stations we used with "Yes" and those we did not use with "No". Some stations measure the same meteorological variables at the same location but are maintained by different institutions, so we kept only one of them (e.g., precipitation at Chile Chico; see Table 1). Additionally, there are cases in which none of the configurations could obtain a result (e.g., precipitation at Glaciar San Rafael); following [11], we ignored these stations when comparing configuration performance. For that, we kept the stations with a Nash-Sutcliffe Efficiency (NSE) score above 0. More details are provided in the discussion section.

Table 1. Meteorological stations available in the study area from public institutions. A distinction was made between used (Yes) and unused (No) stations, and variables not measured (-) by the station. The variables are precipitation (PP), relative humidity (RH), surface pressure (SP), mean air temperature (T2), maximum temperature (Tmax), minimum temperature (Tmin), and wind speed (WS).

Climate Reanalysis Downscaling
Global datasets available for this region have resolutions that are not suitable for high-resolution studies and cannot capture processes that occur at small scales [19]. We use the Weather Research and Forecasting (WRF) V4 model [23] to dynamically downscale reanalysis data for the Baker River Basin and the NPI. WRF is an NWP and atmospheric simulation system designed for both research and operational applications [23]. WRF is used in operational forecasts in at least ten countries and in hundreds of institutions for research purposes. In our study, we use three different boundary conditions. The first is the ERA-Interim atmospheric reanalysis, a dataset that spans 1979 to 2018 with a resolution of approximately 80 km and a six-hour temporal resolution [24]. The second, ERA5, is the latest reanalysis produced by the European Centre for Medium-Range Weather Forecasts (ECMWF); it is an hourly product with a 0.25 × 0.25 degree resolution [25,26]. The third is the FNL0832 Operational Model Global Tropospheric Analyses (ds083.2), available every six hours from 1999-07-30 to the present [27]. The calibration of an NWP model is impractical due to the thousands of parameters that would need to be adjusted. Therefore, attention is put on the physics options that the namelist provides. We follow an approach similar to [28][29][30].
For the model, we used two nested grids at 22.5 and 4.5 km (5:1 ratio) with 34 vertical levels (Figure 1). For the physics, we combined the options from [11,13]. Ruiz et al. [13] determined that, for the convective parameterization, the BMJ scheme (cu_physics = 2 in the namelist) produces better results, so we used it for the coarser domain at 22.5 km.
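In WRF, this grid setup is specified in the `&domains` section of `namelist.input`. The fragment below is only an illustrative sketch of the two-domain, 5:1 nesting described here; any entry not stated in the text (and everything omitted, such as grid dimensions) is a placeholder, not a value taken from our runs:

```fortran
&domains
 max_dom           = 2,
 dx                = 22500, 4500,   ! 22.5 km parent, 4.5 km nest
 dy                = 22500, 4500,
 parent_grid_ratio = 1,     5,      ! 5:1 nesting ratio
 e_vert            = 34,    34,     ! 34 vertical levels in both domains
/
```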
We used two microphysics options: the simpler WSM3 implementation [31] and the Thompson implementation [32]. For radiation, we used two combinations of physics to represent longwave and shortwave radiation. The first pair is the spectral-band longwave and shortwave schemes used in the NCAR Community Atmosphere Model (CAM 3.0) for climate simulations [33]. The second pair is the Rapid Radiative Transfer Model (RRTM) longwave scheme [34] and the Dudhia [35] shortwave scheme, which has a simple downward integration of solar flux, accounting for tunable clear-air scattering, water vapor absorption, and cloud albedo and absorption [23].
We used three Planetary Boundary Layer (PBL) schemes: the Yonsei University (YSU) scheme, the state-of-the-art PBL for Medium-Range Forecast models; the Mellor-Yamada-Janjic (MYJ) PBL, a local implementation of the Mellor-Yamada 2.5 scheme [23]; and the Quasi-Normal Scale Elimination (QNSE) scheme with EDMF [36]. For the Land Surface Model (LSM), we used the five-layer thermal diffusion scheme (soil temperature only) and the Noah-MP (multi-physics) Land Surface Model [37,38]. We generated 24 combinations of these options using the main namelist options described below. Table 2 summarizes the namelist options we used, and Table 3 shows the 24 combinations of the options from Table 2.
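The 24 members follow directly from the Cartesian product of the options above: 2 microphysics × 2 radiation pairs × 3 PBL schemes × 2 LSMs. A minimal sketch of this enumeration (the numbering of the resulting configurations is illustrative and need not match Table 3):

```python
from itertools import product

# Physics options of the numerical experiment (scheme names only;
# the corresponding WRF namelist codes are omitted here).
microphysics = ["WSM3", "Thompson"]
radiation = [("CAM", "CAM"), ("RRTM", "Dudhia")]  # (longwave, shortwave)
pbl = ["YSU", "MYJ", "QNSE"]
lsm = ["5-Layer Thermal", "NoahMP"]

# Cartesian product: 2 x 2 x 3 x 2 = 24 configurations.
configurations = [
    {"mp": mp, "ra_lw": lw, "ra_sw": sw, "pbl": p, "sf": l}
    for mp, (lw, sw), p, l in product(microphysics, radiation, pbl, lsm)
]

print(len(configurations))  # 24
```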

Evaluation
In the evaluation process, we compared the model results with the meteorological observations (Table 1). There are multiple indicators to evaluate the quality of the results, depending on the application [40]. In this work, we use: the Nash-Sutcliffe Efficiency (NSE) (Equation (2)); correlation (Equation (3)); the percentage bias (PBIAS) (Equation (4)), which measures the average tendency of the simulated data; and the ratio of the root mean square error to the standard deviation of the measured data (RSR) (Equation (5)), which standardizes the Root Mean Square Error (RMSE) using the observations' standard deviation (STDEV_obs).
Additionally, we used a more recent indicator, the Kling-Gupta efficiency (KGE) [41] (Equation (6)), which seeks to reduce the shortcomings of NSE by jointly minimizing its three components (correlation, bias, and variability) [41]. KGE is often used for calibration:

KGE = 1 − √[(r − 1)² + (µs/µo − 1)² + (σs/σo − 1)²],

where r is the correlation coefficient, µs/µo are the mean values of the simulation/observation, and σs/σo are the standard deviations of the simulation/observation.
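Since these statistics are standard, they can be written compactly in code. The sketch below uses NumPy; the function names are ours, not from any particular package, and note that the sign convention of PBIAS varies in the literature:

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(sim, obs):
    """Percentage bias: 0 is perfect; the sign gives over/underestimation."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 100.0 * np.sum(sim - obs) / np.sum(obs)

def rsr(sim, obs):
    """RMSE standardized by the observations' standard deviation."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return np.sqrt(np.mean((obs - sim) ** 2)) / obs.std()

def kge(sim, obs):
    """Kling-Gupta efficiency: combines correlation, bias, and variability."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]      # correlation term
    beta = sim.mean() / obs.mean()       # bias ratio
    alpha = sim.std() / obs.std()        # variability ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (alpha - 1) ** 2)
```

A perfect simulation gives NSE = KGE = 1 and PBIAS = RSR = 0, which makes these functions easy to sanity-check.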

Selection of the Best Configuration
We identified the best configuration through a scoring procedure [42]. For each station, we select a variable (e.g., precipitation) and a statistic (e.g., NSE) and sort the performance of the configurations from best to worst (Figure 2, NSE). A value of 1 is assigned to the best-performing configuration, and one unit is added for each subsequent configuration. This process is repeated for each statistic at all of the stations (e.g., Figure 2, PBIAS). Subsequently, the scores obtained for each configuration are added (Figure 2, NSE+PBIAS). The configuration with the lowest value is the one that performs best for the variable. We then normalized the scores so that variables with different numbers of observation points can be compared. In Figure 2, we provide an example with three configurations and two stations for mean temperature; there, the best configuration is Conf-22 and the worst is Conf-2.

Table 4 shows the ranking of the performance of the 24 WRF configurations used in this work. The best five settings for precipitation were 9, 10, 13, 11, and 1. For relative humidity, the best were 2, 13, 14, 9, and 1. For mean temperature, the best performance was obtained with 14, 2, 22, 10, and 19. For maximum temperature, 14, 20, 22, 2, and 10 performed better. For minimum temperature, the best settings were 2, 10, 6, 7, and 22. Finally, when considering all of the variables, the best parameterizations were 10, 14, 2, 9, and 22. Figure 3 shows the spatial distribution of the percentage bias for precipitation, relative humidity, and surface pressure, and of the bias for mean, minimum, and maximum temperature for the best five configurations according to Table 4. Of all the variables evaluated in this work, precipitation showed the most bias across configurations and space, whereas temperature was the variable with the best results. Wind speed is not included in Figure 3, since its performance is not good (WGS84 coordinate system).
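The scoring procedure can be sketched as follows: for every (station, statistic) pair, configurations are ranked from 1 (best) upward, the ranks are summed per configuration, and the sums are normalized by the number of pairs so that variables with different numbers of stations remain comparable. The toy data below mimic the Figure 2 example with invented values (for error-type statistics such as PBIAS, one would rank by absolute value):

```python
def rank_score(perf, higher_is_better=True):
    """perf maps config name -> {(station, statistic): value}.
    Returns normalized rank scores (lower = better overall)."""
    configs = sorted(perf)
    keys = sorted({k for v in perf.values() for k in v})
    totals = {c: 0 for c in configs}
    for key in keys:
        # Sort configurations from best to worst for this station/statistic,
        # then assign rank 1 to the best and add one unit per step down.
        ordered = sorted(configs, key=lambda c: perf[c][key],
                         reverse=higher_is_better)
        for rank, c in enumerate(ordered, start=1):
            totals[c] += rank
    # Normalize by the number of (station, statistic) pairs.
    return {c: totals[c] / len(keys) for c in configs}

# Toy example: two stations, one statistic (NSE), three configurations.
perf = {
    "Conf-2":  {("st1", "NSE"): 0.20, ("st2", "NSE"): 0.10},
    "Conf-14": {("st1", "NSE"): 0.80, ("st2", "NSE"): 0.70},
    "Conf-22": {("st1", "NSE"): 0.85, ("st2", "NSE"): 0.90},
}
scores = rank_score(perf)
print(min(scores, key=scores.get))  # Conf-22
```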

Precipitation
Most of the configurations, on average, do a decent or good job, with average correlations above 0.66 and RSRs lower than 0.88. However, their performance is not homogeneous over the entire domain. The correlations for Conf-5, 7, and 8, for example, range from 0 to 0.85. Conf-9 and 10, which use the same parameterization except for the land surface model, have the best performance on average, with a mean correlation of 0.75 and minima of 0.53 and 0.43, respectively. Additionally, Conf-9 and 10 are the only configurations with an NSE above 0.05 at all of the stations, with an average of 0.43 (Figures 3A and 4).

Temperature
Temperature, in its three versions, mean (Figure 5), maximum (Figure A1), and minimum (Figure A2), has better results than the other variables. All of the models have a correlation above 0.87 at all the stations, with an average maximum bias of 2.7 degrees Celsius. As a consequence, NSE and KGE are, in general, above 0.5 and RSR lower than 1. All of the estimates are excellent, except at HNG San Rafael on the west side of the NPI, where they are not great but acceptable, which may be a consequence of the extreme conditions on the ice cap.

Relative Humidity and Surface Pressure
In general, the results for relative humidity have problems with RSR that affect the NSE scores (Figure 6). All of the configurations have a mean RSR above 1, which indicates a large bias in the data. As a result, none of the configurations have an average NSE above 0; therefore, judged by NSE, all of the results for relative humidity are inadequate. The same is found for surface pressure (Figure A3), where the model, for all configurations, does a very poor job in terms of RSR and, consequently, NSE. On the other hand, the percentage bias results are good (lower than 15%), and the correlations are acceptable (above 0.5 for most of the stations), which yields acceptable KGEs, above 0.42, for 15 out of 24 configurations. The errors may therefore be related more to the poor performance of the surface pressure calculation than to the mixing ratio, given that the relative humidity calculation depends on the surface pressure (Equation (1)). In summary, the performance of the models for relative humidity and surface pressure, even for the configurations with better scores in Table 4, is not good, although this is not conclusive; it depends on which statistic is used to evaluate the results.
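Equation (1) is not reproduced here, but the sensitivity of diagnosed RH to surface pressure can be illustrated with a standard formulation (vapor pressure from the mixing ratio and pressure, saturation vapor pressure from a Tetens-type expression). This is a sketch of the general dependence under those textbook assumptions, not the paper's Equation (1) verbatim:

```python
import math

def relative_humidity(w, p_hpa, t_c):
    """Diagnose RH (%) from mixing ratio w (kg/kg), pressure (hPa), and
    temperature (deg C). Standard formulation, not Equation (1) verbatim."""
    # Vapor pressure from mixing ratio and total pressure (epsilon = 0.622).
    e = w * p_hpa / (0.622 + w)
    # Saturation vapor pressure over water (Tetens-type), in hPa.
    es = 6.112 * math.exp(17.67 * t_c / (t_c + 243.5))
    return 100.0 * e / es

# An error in simulated surface pressure propagates directly into RH:
rh_true = relative_humidity(0.008, 1000.0, 15.0)
rh_biased = relative_humidity(0.008, 950.0, 15.0)  # 5% low pressure bias
```

Because the vapor pressure is linear in the total pressure, a 5% pressure bias maps one-to-one into a 5% relative error in the diagnosed RH, which is why we point to pressure as the primary error source.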

Wind Speed
Of all the variables, wind speed has the worst results. At all of the stations, it performs poorly in daily correlation and bias; therefore, all the other statistics are weak. In Figure 7, we show the performance of all the configurations in a box plot for ten stations, at most of which the bias is very high. The reason the wind speed results are not good is probably a poor representation of the topography, which can affect the wind speed at scales of meters, particularly in mountain areas. The model has a resolution of 4.5 km, which makes it impossible to capture local features. A secondary source of error may be that the stations measuring wind speed are not always at 10 m, the WMO standard height [43]. In Figure 7, we also include lines for the three best configurations (lowest bias) for wind speed, which have acceptable results [44] at the stations around the Baker River (stations 3, 4, 8, 9, and 10).

Multiple Forcing
Configuration 9 gives the best results for precipitation (Table 4). We ran configuration 9 with two other datasets (ERA5 and FNL0832) for 2017 to compare the performance of the different forcings. Additionally, we changed the number of vertical levels from 34 to 40 in one of the runs with ERA5. Then, considering that ERA5 has better resolution, we added another simulation in which we maintained the finer domain at 4.5 km and set the outer domain to a resolution of 13.5 km, so that the ratio between the nested domains is 3:1. For precipitation, the model that used ERA-Interim performed better than those using ERA5 and FNL0832: its average NSE is higher than 0.2, while all the other forcings have an average NSE lower than 0, and its percentage bias is lower than 40%, while the other forcings have percentage biases above 76% (Figure 8). On the other hand, for average temperature, the model that used the FNL0832 forcing outperformed both the ERA-Interim and ERA5 reanalyses, with an average bias of 1.5 °C and an NSE of 0.55. However, configuration 10 run with ERA-Interim, which is the best configuration according to Table 4, provides better results, with an average absolute bias of 1.18 °C and an NSE of 0.63 (Figure 9).

Observation Network Quality
There are almost 40 official weather stations in the study area. Given the latitude and the logistical difficulties, the region seems well instrumented. However, we found several limitations and deficiencies in the observation network. The first limitation we faced is related to discontinuities. The datasets are generally incomplete, with several windows of missing data, which makes it harder to rely on all of the stations. The discontinuities may arise because it is difficult and expensive to maintain instruments in Patagonia; access is difficult, and the government's presence is infrequent. This situation has improved in the last decade, particularly after 2017, when several stations started to operate, and additional meteorological variables, such as solar radiation (not used here) and relative humidity, are now being observed.
There is duplication in the observations: different institutions observe the same variables in almost the same place. In Table 1, we can see several places where DGA, INIA, and DCM measure the same meteorological variables. For example, in Chile Chico, three stations measure precipitation and minimum and maximum temperature. The same situation is found in Cochrane. This shows a lack of coordination among public institutions. If there were better communication, the duplicated weather stations could be moved to other areas to increase the number of points with observations.
The quality of the observations is not guaranteed. Dussaillant J. et al. [1] mentioned that, in this region, data quality is a limitation. This assertion is corroborated in Figure 10, where we compare observations at the same location from different institutions and see that the values do not always match. We also compared observations from the same institution published in two different places (DGA, GDGA). DGA provides the data aggregated daily, while GDGA provides the data hourly. However, when we aggregated the values from GDGA, they did not match the data from DGA (Figure 10, top left panel). In addition, data collected by different institutions at the same location do not match up (Figure 10).

Model Selection
Microphysics selection affects the non-convective precipitation at grid scales [45]. The complexity of the microphysics increases as the resolution reaches convection-permitting levels, where schemes with a graupel size representation are essential to precipitation production [46]. The PBL plays a critical role in the transport of energy (including momentum, heat, and moisture) into the upper layers of the atmosphere and acts as a feedback mechanism in wind circulation [47]. PBL parameterization schemes are essential for better simulations of wind components and turbulence in the lower part of the atmosphere [48]. LSMs usually compute heat and moisture fluxes over land and sea-ice points and provide them to the PBL schemes considered [49], which may affect the performance of WRF [50]. The radiation scheme seeks to estimate the total radiative flux [51].
The best five combinations of physics for precipitation have in common that four out of five used RRTM for longwave radiation, Dudhia for shortwave radiation, YSU for the PBL, and NoahMP for the land surface model. The top three used Thompson for the microphysics, and none of them used QNSE for the PBL. For mean temperature, four of the five best configurations used YSU for the PBL, and all five used the five-layer thermal scheme for the LSM. For the other options, WSM3 performed better than Thompson for the microphysics, and for longwave and shortwave radiation, the CAM schemes ranked better. For the maximum and minimum temperature, the models' performance is similar to that for mean temperature, except that only three out of five configurations used YSU for the PBL and two used QNSE, and four of the five best configurations used the five-layer thermal scheme for the LSM. Finally, for RH, all of the best five configurations used YSU for the PBL, four out of five used WSM3 for the microphysics, and three out of five used RRTM for longwave and Dudhia for shortwave radiation. In Figure 11, we provide examples of different configurations at particular stations with large differences in performance.
Finding these combinations is useful because, for example, Villarroel et al. [11] used Conf-5 from Table 4 in the same region. That option ranked 20 out of 24; the only difference from the best option in this work is the PBL scheme: our best configuration uses YSU, whereas Villarroel et al. [11] used QNSE.

Multiple Forcing and General Remarks
In our experiment, ERA-Interim provided better results than ERA5 and FNL0832 for precipitation, by a wide margin. As can be seen in the Taylor diagram (Figure 12) for temperature and precipitation, ERA5 has, on average, root mean square results similar to ERA-Interim's, and a better mean correlation. However, for precipitation, the mean standard deviation for ERA5 is very high compared to ERA-Interim, which explains why it ended up with a lower score in our selection procedure.
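A Taylor diagram places each model according to three related quantities: its standard deviation (here normalized by the observed one), its correlation with the observations, and the centered RMS difference; the three are linked by a law-of-cosines identity. A sketch of how they are computed, using synthetic data rather than our station records:

```python
import numpy as np

def taylor_stats(sim, obs):
    """Return (normalized std. dev., correlation, normalized centered RMS
    difference) for one simulated series against observations."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    sigma = sim.std() / obs.std()
    # Centered RMS difference: bias removed, then normalized by obs. std.
    crms = np.sqrt(np.mean(((sim - sim.mean()) - (obs - obs.mean())) ** 2))
    return sigma, r, crms / obs.std()

rng = np.random.default_rng(0)
obs = rng.normal(size=365)               # one year of synthetic daily data
sim = obs + 0.3 * rng.normal(size=365)   # a "model" with added noise
sigma, r, crms = taylor_stats(sim, obs)

# Law of cosines linking the three Taylor-diagram quantities:
# crms^2 = 1 + sigma^2 - 2 * sigma * r
```

This identity is why a single point on the diagram conveys all three statistics at once: a model can trade correlation against variability and still land at the same centered RMS distance from the observations.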
FNL0832 gave better results for temperature; however, the performance of all the models was good. Therefore, working with ERA-Interim from 1979 to 2018 seems to be the best option in this region. We did not experiment further with changes in resolution, which may have improved the results with ERA5, which has better spatial and temporal resolution.
In general, the model did well for temperature (mean, minimum, and maximum) at most of the stations. Precipitation was good, but just a few configurations stood out (Conf-9 and Conf-10). Surface pressure was neither good nor bad; the results were acceptable if we used KGE, but not if we used NSE. There is a systematically high RSR at all the stations and for all configurations. We think there may be a couple of reasons for this. The first is the resolution, 4.5 km, which smooths the details of the terrain; the second is that the model surface level may not be well represented, which needs to be evaluated in more detail. The errors for relative humidity are similar to the errors in surface pressure. Large errors in RH were also found by [49] in India. However, Cohen et al. [48] obtained good results for specific humidity using YSU and MYJ for the PBL in the southeastern U.S. Therefore, we think that the pressure is the primary source of error, since we used it to calculate RH (Equation (1)). Finally, none of the models could capture wind speed; in general, the bias was high. This could be because the model provides wind velocity at 10 m above the ground, while the stations are generally no higher than 2-4 m, or because the terrain and station locations are not well represented. Therefore, this is something that needs to be explored in more detail.

Figure 11. Top: examples of monthly mean temperature time series comparing observed values (obs) with the best-ranked configurations (14, 2, and 22) and the worst (15, 4, and 23). Bottom: examples of monthly precipitation time series (mm day⁻¹) comparing observed values (obs) with the three best-ranked configurations (9, 10, and 13) and the worst (8, 19, and 20).

In summary, from more to less certain: we found that for the PBL, the best option is YSU, which also gives good results in central South America [13,47], Korea [48], and Antarctica [52].
For the LSM, the best option is the five-layer thermal scheme. WRF is sensitive to the selection of the LSM [49,50]. In Ethiopia, Teklay et al. [50] found that the five-layer thermal model had a low cold bias for maximum and minimum temperature, but that Noah had better results for precipitation, which is similar to our findings. RRTM performs well for longwave radiation; it also performs well in the Antarctic winter [52]. Dudhia does better than CAM for shortwave radiation. Zempila et al. [53] found that Dudhia provides better results than more sophisticated schemes in a study in Greece, particularly for clear skies. For the microphysics, Thompson performs better, although Jeworrek et al. [46] indicate that the microphysics does not have a significant impact at the spatial resolution of this study. For the LSM, we did not explore the multiple options that the NoahMP model provides and used the default settings, so there may be NoahMP configurations that do better; we leave that open for future research. For the PBL, QNSE is, in general, the scheme with the lowest performance. Summarizing, the best set of physics in our numerical experiment is YSU for the PBL, the five-layer thermal scheme for the LSM, RRTM for longwave radiation, Dudhia for shortwave radiation, and Thompson for the microphysics.