Extending Limited In Situ Mountain Weather Observations to the Baseline Climate: A True Verification Case Study

Abstract: The availability of in situ atmospheric observations decreases with elevation and topographic complexity. Data sets based on numerical atmospheric modeling, such as reanalysis data sets, represent an alternative source of information, but they often suffer from inaccuracies, e.g., due to insufficient spatial resolution. sDoG (statistical Downscaling for Glacierized mountain environments) is a reanalysis data postprocessing tool designed to extend short-term weather station data from high mountain sites to the baseline climate. In this study, sDoG is applied to ERA-Interim predictors to produce a retrospective forecast of daily air temperature at the Vernagtbach climate monitoring site (2640 MSL) in the Central European Alps. First, sDoG is trained and cross-validated using observations from 2002 to 2012 (cross-validation period). Then, the sDoG retrospective forecast and its cross-validation-based uncertainty estimates are evaluated for the period 1979–2001 (hereafter referred to as the true evaluation period). We demonstrate the ability of sDoG to model air temperature in the true evaluation period at different temporal scales: day-to-day variations, year-to-year and season-to-season variations, and the 23-year mean seasonal cycle. sDoG adds significant value over a selection of reference data sets available for the site at different spatial resolutions, including state-of-the-art global and regional reanalysis data sets, output from a regional climate model, and an observation-based gridded product. However, we identify limitations of sDoG in modeling summer air temperature variations, particularly evident in the first part of the true evaluation period. This is most probably related to changes in the microclimate around the Vernagtbach climate monitoring site that violate the stationarity assumption underlying sDoG.
When comparing the performance of the considered reference data sets, we cannot demonstrate added value of the higher resolution data sets over the data sets with lower spatial resolution. For example, the global reanalyses ERA5 (31 km resolution) and ERA-Interim (80 km resolution) both clearly outperform the higher resolution data sets ERA5-Land (9 km resolution), UERRA HARMONIE (11 km resolution), and UERRA MESCAN-SURFEX (5.5 km resolution). Performance differences between ERA5 and ERA-Interim, by contrast, are comparably small. Our study highlights the importance of station-scale uncertainty assessments of atmospheric numerical model output and downscaling products for high mountain areas, both for data users and model developers.


Introduction
Availability and quality of in situ meteorological observations dramatically decrease with elevation and topographic complexity (e.g., [1][2][3]). Maintenance of weather stations at high altitudes in complex topography is hampered by many practical obstacles; most sites are not accessible by car. Statistical downscaling approaches, in turn, are often criticized because the statistical procedures are not based on process understanding [21]. The most commonly applied technique for model selection and uncertainty estimation in statistical forecasting is cross-validation [22]. In contrast to split-sample validation, cross-validation allows each observation to be used both in the model training and in the evaluation process, and is thus particularly useful in the case of observation scarcity [23,24]. However, cross-validation can be misleading, e.g., when applied to validate bias correction of free-running climate model simulations [25]. As an alternative, the authors of [25] highlighted the importance of validating noncalibrated temporal and spatial aspects of the modeled time series. For example, the authors of [26] showed that statistical corrections applied to a considered time scale (e.g., daily) may be detrimental at other time scales (e.g., monthly or annual). Statistical downscaling, particularly bias correction methods, has ultimately received considerable criticism in cases where downscaling results were communicated without reliable uncertainty estimates (e.g., [27][28][29]). The authors of [27] argued that statistical postprocessing often masks rather than reduces uncertainty in climate science. In fact, the practitioner's dilemma has been identified as no longer the lack of downscaled data, but the choice of an appropriate data set and the assessment of its credibility [30]. One of the major uncertainties in statistical downscaling relates to the stationarity assumption, which is difficult to verify (e.g., [29,31]).
sDoG (statistical Downscaling for Glacierized mountain environments) is a statistical downscaling tool designed to extend limited observation time series from high mountain weather stations to complete multidecade time series in the past [23,24]. sDoG relies on statistically adjusting reanalysis data to local-scale conditions, with a strategy to circumvent the pitfalls of fitting temporally short and highly autocorrelated records. It is one-dimensional in the physical and variable space, and can be applied to various atmospheric quantities at a daily time scale (e.g., air temperature, precipitation, wind speed, relative humidity). The authors of [24] trained and cross-validated sDoG against daily air temperature measured at the Vernagtbach climate monitoring site (Central European Alps, 2640 MSL) using measurements for the period from 2002 to 2012. sDoG uncertainty estimates are based on cross-validation within this period (2002 to 2012) and at the time scale of the model training (daily). The availability of daily air temperature measurements at the Vernagtbach climate monitoring site back to 1979 allows us to perform a true evaluation, in contrast to cross-validation. In this study, we use the term true evaluation for assessing the performance of sDoG over the 23-year period from 1979 to 2001 (hereafter referred to as the true evaluation period). sDoG is compared to the measurements for different temporal aspects, and the sDoG performance is benchmarked against various state-of-the-art reference data sets at very distinct spatial resolutions that are available for the site and extend over the true evaluation period. In Section 2.1, the Vernagtbach climate monitoring site is introduced. The sDoG tool is presented in Section 2.2. The evaluation strategy of the present study is outlined in Section 2.3. Results for different temporal aspects and results relative to different reference data sets are shown in Sections 3.1 and 3.2.
A verification of the cross-validation-based uncertainty estimates is shown in Section 3.3. Finally, we discuss and summarize the analyses of this study in Section 4.

The Vernagtbach Climate Monitoring Site (VERNAGT)
The Vernagtbach climate monitoring site, hereafter referred to as VERNAGT, is located at 2640 MSL in the Vernagtbach glacier basin in the Austrian European Alps (see Figure 1). The European Alps are a 1200 km long, approximately 200 km wide and up to 4800 MSL high mountain range, characterized by strong spatial gradients of weather and climate [32]. VERNAGT is situated in an inner alpine dry valley close to the main alpine crest, which includes peaks above 3000 MSL. The mean annual precipitation at VERNAGT is about 1500 mm [33]. VERNAGT is surrounded by rocky terrain, with a distance of about 1500 m to the glacier terminus in 2012 [34]. When VERNAGT was installed in fall 1973 as part of a long-term glacier monitoring programme, the glacier terminus was at a distance of approximately 1000 m from the station [35]. Since then, VERNAGT has undergone several revisions that were necessary to adapt to changes in the discharge conditions of the Vernagtbach, in measurement techniques and in the available funding [4,36]. VERNAGT data considered in this study cover the period 1979 to 2012 and were downloaded from PANGAEA (Data Publisher for Earth and Environmental Science, https://pangaea.de/). For the analysis in this study, VERNAGT observations downloaded as five-minute centered averages were converted to daily means (only for days with complete records) and to annual and seasonal means (with a maximum of five days of missing data allowed for each year or season). Data gaps mostly affect the winter and spring time series, when the measurement devices were buried under snow. Thus, within the twenty-three-year-long true evaluation period, there are eleven annual mean values, twenty-two autumn mean values, eleven winter mean values, fourteen spring mean values, and twenty-three summer mean values available.
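The aggregation rules described above (daily means only for complete days; annual and seasonal means only with at most five missing days) can be sketched as follows. This is an illustrative reconstruction, not code from the sDoG repository, and the function names are our own:

```python
import numpy as np
import pandas as pd

def daily_means(t5min: pd.Series) -> pd.Series:
    """Daily means from five-minute averages, kept only for days with a
    complete record (288 five-minute values per day)."""
    grouped = t5min.groupby(t5min.index.floor("D"))
    daily = grouped.mean()
    complete = grouped.count() == 288
    return daily.where(complete)  # incomplete days become NaN

def seasonal_mean(daily: pd.Series, max_missing_days: int = 5) -> float:
    """Seasonal (or annual) mean, valid only if at most
    `max_missing_days` daily values are missing."""
    if daily.isna().sum() > max_missing_days:
        return float("nan")
    return float(daily.mean())
```

The same `seasonal_mean` helper covers both the annual and the seasonal aggregation, since the five-day tolerance applies to each year or season alike.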

sDoG: Statistically Postprocessing Reanalysis Data to the Station Scale (One-Dimensional)
This study applies the statistical downscaling method sDoG (statistical Downscaling for Glacierized mountain environments) [23,24]. sDoG is a statistical postprocessor of reanalysis data originally developed for glacierized mountain environments. The statistical procedures underlying sDoG, however, are not limited to glacierized mountain environments and can, in principle, be applied to any site of interest. sDoG adjusts reanalysis data to in situ meteorological observations to more accurately represent station-scale atmospheric conditions. The overarching goal of sDoG is to extend short weather station records, typically a few years long, to multidecade records in the past, similar in concept to the studies by [37,38]. sDoG is designed to consider the pitfalls of fitting limited, often patchy weather station data. The minimum possible length of an observational time series as input for sDoG has been identified as approximately three years [23]. Next to the length of the observational time series, observation quality plays a key role in the development of skillful models with sDoG. Measurement errors deleteriously impact the model training, but the sDoG algorithm is designed to detect these problems: low performance quantified in the double cross-validation procedure employed by sDoG can be either an indication of low predictive power of the coarse-scale predictors or a symptom of measurement errors [23,24].
In contrast to most downscaling methods, sDoG is applicable only to reanalysis data, and not to free-running climate models, as predictors, because it uses information about the time sequencing in the observations for fitting the statistical relationships [39,40]. sDoG, in contrast to the vast majority of downscaling studies, thus profits from the advantages of numerical weather prediction postprocessing techniques, i.e., generally shorter time series can be used for model training, and cross-validation can be applied for assessing the model accuracy. Downscaling of free-running climate models, in contrast, is limited to correcting long-term distributional aspects, and it is not recommended to use cross-validation for assessing the performance of these models [21]. Currently, and as presented in this study, the sDoG code is one-dimensional, that is, applicable to only one site and one atmospheric quantity at a time; thus, sDoG does not consider intersite and intervariable correlations. sDoG is written in MATLAB and is available as a Bitbucket repository (https://bitbucket.org/MarlisH/sdog/src/master/).
A crucial element of sDoG is the predictor selection, i.e., the selection of the information from the reanalysis data set that is important for the quantity of interest [24]. Note that in its current version, sDoG applies ERA-Interim data as predictors, with the adaptation of sDoG to ERA5 predictors being under way. For the dimensionality reduction in the predictor space, sDoG combines least-squares regression with the Least Absolute Shrinkage and Selection Operator, LASSO [41]. Note that next to least-squares regression, generalized linear models and symmetry-producing variable transformations are available options in sDoG (e.g., for precipitation). The sDoG model for VERNAGT air temperature in this study is based on a systematic analysis of different predictor options performed by the authors of [24]. More precisely, the authors of [24] compared the efficiency of using predictor information either in terms of horizontal fields of a single atmospheric quantity (G), a single atmospheric quantity at different vertical levels (L), different atmospheric quantities at one level and one grid point (V), or combinations thereof: horizontal fields of different atmospheric quantities (GV), different atmospheric quantities on different vertical levels (VL), or horizontal fields of a single quantity at different levels (GL). This analysis was repeated for different sites (including VERNAGT) and atmospheric quantities, individually for each day of the year. The results of [24] showed a high dependence of the model skill on the applied predictor option and the importance of using different predictors for different days of the year. The analysis also revealed cases in which larger predictor data sets yielded lower model skill. In other words, considering more information in the modeling procedure did not necessarily improve the results, particularly in the case of limited observation quality [24]. For more information on the predictor selection algorithm, see [24].
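To illustrate how LASSO combines least-squares fitting with predictor selection in one step, the following sketch fits a cross-validated LASSO model to a synthetic predictor matrix. It is a minimal stand-in for the sDoG predictor selection (which works per day of year on ERA-Interim fields), not the actual sDoG code, and all variable names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic stand-in for a reanalysis predictor matrix (n_samples x n_gridpoints)
# and a station air-temperature series; in sDoG these would be ERA-Interim
# predictor fields and VERNAGT observations for a window around one day of year.
X = rng.normal(size=(200, 50))
y = X[:, 3] - 0.5 * X[:, 17] + 0.1 * rng.normal(size=200)

# LASSO shrinks most regression coefficients to exactly zero, so the fit
# performs the predictor (grid point) selection and the regression in one step;
# the shrinkage strength is chosen by cross-validation.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of retained predictors
```

In this toy setup, only the two informative columns (plus possibly a few spurious ones) survive the shrinkage, which mirrors the dimensionality reduction described above.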
The sDoG core includes a double cross-validation procedure with an inner loop for model selection and an outer loop for uncertainty estimation. The cross-validation considers serial correlation by applying a buffer (determined by the autocorrelation function) between training and evaluation observations (moving block cross-validation). Like the predictor selection, the double cross-validation procedure is performed for each day of the year separately. The development of different functional relationships for different days of the year is more sophisticated than combining seasonal standardization with a single model for the entire year, because the latter method does not account for seasonality in the model error [24]. sDoG calculates one and two standard error estimates for each day of the year by assuming a Gaussian distribution of the cross-validation-based test error. More precisely, the one standard error (1SE) is defined as the standard deviation of the cross-validation-based test error, and the two standard error (2SE) is defined as the 95th percentile of the cross-validation-based test error. Significance testing of the skill of the developed relationships is based on the moving block bootstrap; for details, see [24].
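A minimal sketch of moving block cross-validation with a serial-correlation buffer, as described above, might look as follows. This is an illustration under our own assumptions (contiguous test blocks, fixed buffer width), not the sDoG implementation:

```python
import numpy as np

def buffered_cv_folds(n: int, n_folds: int, buffer: int):
    """Moving block cross-validation: contiguous test blocks, with `buffer`
    samples dropped on each side of the test block so that serially
    correlated neighbors of the test data are excluded from training."""
    idx = np.arange(n)
    for test in np.array_split(idx, n_folds):
        lo, hi = test[0] - buffer, test[-1] + buffer
        train = idx[(idx < lo) | (idx > hi)]  # everything outside the buffered block
        yield train, test
```

In sDoG, the buffer width would be determined from the autocorrelation function of the observations rather than fixed a priori.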
The authors of [24] applied sDoG to daily precipitation, air temperature, relative humidity, wind speed and solar radiation at three sites, all located in complex terrain in close proximity to mountain glaciers: besides VERNAGT, the Mount Brewster measuring site in the Southern Alps of New Zealand and the Artesonraju measuring site in the tropical South American Andes. Of all sites and assessed variables, the most successful model was obtained for VERNAGT air temperature, with the lowest uncertainty and the highest skill scores, exceeding 0.9 for the entire year [24]. For other variables and sites, e.g., for precipitation at VERNAGT or for air temperature at the Artesonraju measuring site, sDoG shows larger cross-validation-based uncertainty estimates, presumably related to shortcomings of the observations [24]. In this study, sDoG was used for the first time to extend an observational time series beyond the model training/cross-validation period. Furthermore, the sDoG performance was evaluated at various time scales beyond the daily time scale of the model training. We focus on VERNAGT air temperature, the best-case scenario, in a true verification setting, to address potential problems of sDoG not identified by the cross-validation procedure.

Evaluation Strategy
Decade-long measurement series such as those available from VERNAGT are exceptional for remote mountain sites. The true verification performed in this study consisted of the following steps. First, sDoG was trained using data from 2002 to 2012 only. Then, sDoG produced a retrospective forecast of daily air temperature at VERNAGT for the period from 1979 to 2001. Finally, the sDoG retrospective forecast was evaluated based on VERNAGT observations from 1979 to 2001. In the remainder of this paper, the term "obs-training" refers to VERNAGT observations from 2002 to 2012, and "obs-trueval" refers to the VERNAGT observations over the true evaluation period 1979 to 2001 (Table 1). The evaluation focused on various time scales (day-to-day, seasonal, annual, seasonal cycle), and thus explicitly distinguished between calibrated and noncalibrated aspects. Furthermore, the added value of sDoG over alternative data sets (listed below) was quantified. The comparison of sDoG to reference data sets not only shed more light on the potential of sDoG, but also added information for users with respect to each individual reference data set at VERNAGT. Along with the retrospective forecast, sDoG delivers cross-validation-based uncertainty estimates in terms of confidence intervals. The true evaluation performed in this study offers the possibility to test the validity of cross-validation by checking whether the cross-validation-based uncertainty estimates (based on obs-training) hold for the true evaluation period.
Added value of the sDoG retrospective forecast over each of the reference data sets is calculated as percentage improvement (or reduction of error, RE). Following [42], RE is calculated here as

RE = SS · 100%, (1)

SS = 1 − MSE/MSE_r, (2)

where SS is the mean squared error (MSE)-based skill score [43], MSE is the mean of the squared errors ε(t)² of the model to be evaluated (here, sDoG), and MSE_r is the mean of the squared errors ε_r(t)² of a given reference model. In this study, errors are calculated as differences of sDoG and the reference data sets (Table 1) to the VERNAGT observations from 1979 to 2001 (obs-trueval) at all considered time scales. Note that the range of RE is (−∞, 100%]. An RE of sDoG close to zero thus implies that the performance of sDoG is similar to the performance of the reference data set. An RE of sDoG close to 100% means that the ratio MSE/MSE_r tends to zero and thus that sDoG clearly outperforms the reference data set. Negative values of RE cannot be interpreted in terms of percentage reduction of error, but they imply that the errors of sDoG are larger than the errors of the reference data set. RE thus quantifies if and how much sDoG adds value over each of the reference data sets considered here.
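The reduction of error translates directly into code. The following helper is an illustrative sketch (the function name is ours): it computes the MSE of the model and reference error series and returns the skill score scaled to percent:

```python
import numpy as np

def reduction_of_error(err_model, err_ref) -> float:
    """RE = SS * 100% with SS = 1 - MSE/MSE_r; range (-inf, 100].
    `err_model` and `err_ref` are error time series (model minus observations
    and reference minus observations, respectively)."""
    mse = np.mean(np.square(err_model))
    mse_ref = np.mean(np.square(err_ref))
    return 100.0 * (1.0 - mse / mse_ref)
```

A perfect model yields RE = 100%, identical errors yield RE = 0, and larger model errors yield negative values, matching the interpretation given above.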
Application of Equations (1) and (2) to different temporal scales is performed here as follows. ε(t) is calculated as the difference between obs-trueval and sDoG, and ε_r(t) as the difference between obs-trueval and each reference data set, with the time series aggregated to each of the investigated time scales. The investigated time scales are (1) the overall daily time scale including all types of variability (daily, seasonal, and year-to-year), (2) day-to-day variability (corresponding to the overall daily time scale with the seasonal cycle and the year-to-year variations removed), (3) the 23-year mean seasonal cycle (thus, 365 values) and (4) year-to-year and season-to-season variations (with absolute values removed). This way, values of RE can be assigned to the different modes of variability considered in their calculation.
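The separation of a daily series into a mean seasonal cycle, year-to-year anomalies, and residual day-to-day variability can be sketched as follows. This is a simple illustrative decomposition; the exact aggregation details used in the study may differ:

```python
import numpy as np
import pandas as pd

def decompose(daily: pd.Series):
    """Split a daily series into (i) the mean seasonal cycle (one value per
    day of year), (ii) annual anomalies relative to that cycle, and (iii)
    residual day-to-day variability, so that the three parts sum back to the
    original series and errors can be evaluated per time scale."""
    doy = daily.index.dayofyear
    seasonal = daily.groupby(doy).transform("mean")
    annual = (daily - seasonal).groupby(daily.index.year).transform("mean")
    day_to_day = daily - seasonal - annual
    return seasonal, annual, day_to_day
```

Evaluating sDoG or a reference data set at, e.g., the day-to-day scale then amounts to applying Equations (1) and (2) to the `day_to_day` components of the modeled and observed series.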
The significance of RE is tested based on the moving block bootstrap [24]. The moving block bootstrap procedure accounts for differences in the effective sample size between the different time scales investigated in this study [44]. In practice, it is more difficult to prove significance of RE values if the underlying error time series have few values and/or are affected by serial correlation, because this reduces the effective sample size [44]. For example, for the annual, winter, spring, autumn and summer time series, only 11, 11, 14, 22 and 23 values of obs-trueval are available for the calculation of RE, respectively. Note also that differences in RE between different reference data sets are not tested here for significance, but can be interpreted in terms of "no added value" of one reference data set with a larger RE value over another with a smaller RE value. Smaller (larger) RE values of sDoG over a given reference data set at a given time scale imply smaller (larger) errors of that reference data set.
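A common variant of the moving block bootstrap for testing whether RE is significantly positive is sketched below. It resamples blocks of the paired squared-error differences, preserving serial correlation within each block; the details of the procedure used in [24,44] (e.g., the choice of block length) may differ:

```python
import numpy as np

def block_bootstrap_re_pvalue(err_model, err_ref, block_len, n_boot=2000, seed=0):
    """One-sided moving-block-bootstrap test of H0: RE <= 0.
    Positive d(t) = err_ref(t)^2 - err_model(t)^2 indicates added value."""
    d = np.square(np.asarray(err_ref)) - np.square(np.asarray(err_model))
    n = d.size
    rng = np.random.default_rng(seed)
    starts = np.arange(n - block_len + 1)       # admissible block start indices
    n_blocks = int(np.ceil(n / block_len))
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        picks = rng.choice(starts, size=n_blocks)
        sample = np.concatenate([d[s:s + block_len] for s in picks])[:n]
        boot_means[b] = sample.mean()
    # p-value: fraction of bootstrap mean differences that are <= 0
    return float(np.mean(boot_means <= 0))
```

Because whole blocks are resampled, serially correlated series produce a wider bootstrap distribution than independent series of the same length, which reflects the reduced effective sample size discussed above.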
The true verification setting in this study allows us to verify the cross-validation-based uncertainty estimates 1SE and 2SE. 1SE and 2SE are the 68% (1SE) and 95% (2SE) confidence intervals of the sDoG retrospective forecast, estimated as the standard deviation (1SE) and the 95th percentile (2SE) of the test error based on cross-validation in the training period. The evaluation of 1SE and 2SE is performed by counting the portion of obs-trueval that effectively falls within sDoG ± 1SE and sDoG ± 2SE. This analysis is shown individually for each year, and on average over the true evaluation period. If the resulting portion falls below 68% in the case of sDoG ± 1SE or 95% in the case of sDoG ± 2SE, this indicates that 1SE and 2SE as estimated here by cross-validation underestimate the true uncertainties found in the evaluation period.

Table 1 details the reference data sets used to benchmark the performance of sDoG in this study. Firstly, different types of reanalysis products are considered, including ERA-Interim (the source of the sDoG predictors), available globally at approximately 80 km horizontal resolution from 1979 to August 2019 [45]; ERA5, the newest reanalysis data set by the ECMWF, available globally on a 30 km grid from 1979 to present [12]; ERA5-Land, a replay of the land component of ERA5 at 9 km horizontal resolution, extending back to 1981; the regional reanalysis UERRA HARMONIE on an 11 km grid, available for the European domain from 1961 to present [46]; and the UERRA MESCAN-SURFEX data set, a land surface analysis of HARMONIE at 5.5 km horizontal resolution [14]. Furthermore, a regional climate model simulation of the past observed climate is considered, namely ALARO-0 within the CORDEX initiative, driven by ERA-Interim at the initial and lateral boundaries and available at 12.5 km horizontal resolution (0.11°) from 1979 to 2010 [47].
Finally, two reference data sets based only on observations are included, namely SPARTACUS, a 1 km gridded air temperature data set based on quality-controlled observations for Austria, available from 1961 to present [8], and the air temperature time series of a station located only four kilometers from VERNAGT: the Vent station (indicated in Figure 1). Note that the Vent station was not involved in the generation of SPARTACUS.

The Reference Data Sets
For data available on pressure levels, like ERA-Interim and ERA5, there are two options for extracting air temperature at a site of interest: (1) 2 m air temperature or (2) air temperature from the pressure level corresponding to the site's altitude. In a preliminary analysis for this study, we tested both 2 m air temperature and 750 hPa air temperature from ERA-Interim for VERNAGT (pressure at VERNAGT varies around 740 hPa), and found 750 hPa air temperature to outperform 2 m air temperature in all considered aspects (not shown). In the remainder of this study, we therefore show results only for 750 hPa air temperature for both ERA-Interim and ERA5. For all other reference data sets (not available on pressure levels), 2 m air temperature is used (see also Table 2). Note that ERA-Interim on pressure levels outperforming ERA-Interim surface data was also pointed out for VERNAGT and for two other high mountain sites in New Zealand and Peru [24]. For all gridded reference data sets except SPARTACUS, the four closest grid points are bilinearly interpolated to the study site's coordinates. For SPARTACUS, the closest grid point is considered.
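Bilinear interpolation of the four surrounding grid points to the site coordinates can be sketched as follows. This is a textbook formulation for a regular grid with ascending coordinates, not the specific interpolation code used in this study; boundary clamping is omitted for brevity:

```python
import numpy as np

def bilinear(lon, lat, lons, lats, field):
    """Bilinearly interpolate a 2-D field (indexed [lat, lon] on a regular
    grid with ascending coordinates) to a point strictly inside the grid."""
    i = np.searchsorted(lats, lat, side="right") - 1  # lower-left cell corner
    j = np.searchsorted(lons, lon, side="right") - 1
    wy = (lat - lats[i]) / (lats[i + 1] - lats[i])    # fractional position in cell
    wx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    return ((1 - wy) * (1 - wx) * field[i, j] + (1 - wy) * wx * field[i, j + 1]
            + wy * (1 - wx) * field[i + 1, j] + wy * wx * field[i + 1, j + 1])
```

The four weights sum to one, so the interpolated value is a convex combination of the four closest grid points, as described above.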
Due to the high topographic complexity around VERNAGT, none of the reference data sets corresponds to the altitude of the site exactly (see Table 1). Even for SPARTACUS on a 1 km grid, a height difference of about 300 m to VERNAGT remains. The 750 hPa pressure level considered in the case of ERA-Interim and ERA5 does not correspond exactly to the altitude of VERNAGT either. Thus, all reference data sets suffer from an altitudinal bias in some form due to the atmospheric lapse rate (Table 2, bias values in brackets). It is beyond the scope of this study to seek an observation-independent altitude adjustment for all reference data sets. However, to isolate the performance of the reference data sets for each temporal aspect addressed here from the altitude bias that would otherwise obliterate the results, all reference data sets (RD) are standardized (bc) to the mean of obs-training, as follows:

RD_bc(t) = RD(t) − mean(RD, 2002–2012) + mean(obs-training),

with mean(RD, 2002–2012) being the temporal mean of a reference data set over the period 2002 to 2012, and mean(obs-training) the temporal mean of obs-training. This way, all reference data sets "have seen" VERNAGT observations, but are still independent of obs-trueval, like sDoG. In this study, no additional correction for a misrepresentation of the seasonal cycle in the reference data sets is applied. However, in the calculation of RE values for the year-to-year and season-to-season variability, absolute values are removed, and thus the general and seasonal offsets in the reference models are eliminated automatically. Seasonal offsets are then evaluated explicitly with the evaluation of the 23-year mean seasonal cycle. This way, we clearly distinguish between the performance in representing year-to-year variability and the performance in simulating the seasonal cycle correctly.

Table 2. Overview of the added value of sDoG over all considered reference data sets and at all considered time scales in the true evaluation period.
The first column shows the biases of all reference data sets and of sDoG (last row) relative to obs-trueval (true evaluation observations from 1979 to 2001). The values in brackets are the biases of the reference data sets without bias correction, thus including the altitude bias. The remaining columns show the added value in terms of reduction of error RE (Equation (1)) of sDoG over all assessed reference data sets (top to bottom) at different time scales (left to right). Values of RE not found to be significantly positive (at a 5% significance level) are shown in brackets. Negative scores are given for completeness, but are not equivalent to the RE of a reference data set over sDoG, because the range of RE is (−∞, 100%] (see [42]).

[Table 2 layout: rows list the reference data sets by short name (see Table 1) and sDoG; columns give the mean bias (°C) and the reduction of error RE of sDoG (%) at the overall daily, day-to-day, seasonal cycle, and year-to-year time scales.]

Figure 2 shows an arbitrarily selected, six-month-long snapshot of daily mean VERNAGT air temperature at the beginning of the true evaluation period. Shown are the measurements together with the sDoG retrospective forecast. Figure 2 shows that sDoG is in close agreement with the observations. The differences between sDoG and obs-trueval are small compared to the overall variability of the time series. For the displayed time window, there is a temperature range of more than 25 °C. This is not a matter of course, because sDoG is trained on daily values, which show much larger variations than the annual values (see [26]). In the true evaluation period, errors between sDoG and obs-trueval are smallest for the autumn time series, with the mean absolute error amounting to 0.2 °C. The largest errors are found for the summer time series, with a mean absolute error of 0.38 °C. For the summer time series in particular, the errors between sDoG and obs-trueval are larger in the earlier part of the true evaluation period than in the later part, and they are systematically positive; sDoG simulates summers that are too warm. In the evaluation of the overall daily time series (e.g., in Figure 2), this bias is masked, because variability at the daily time scale is much larger.

Figure 5 compares the mean annual cycle of air temperature at VERNAGT over the true evaluation period as measured (obs-trueval) with the sDoG retrospective forecast. The mean annual cycle in obs-trueval ranges from −8 °C to +7 °C. The annual cycle averaged over the training period (obs-training) is also shown. sDoG corresponds closely to obs-trueval, with smaller errors in winter than in summer, where sDoG systematically overestimates obs-trueval with errors of up to 0.6 °C.
Differences between obs-trueval and obs-training, by contrast, exceed 3 °C for some days of the year; they are positive from April through November and negative in winter and early spring. The differences between sDoG and obs-trueval being much smaller than the differences between sDoG and obs-training indicates that sDoG is well trained without overfitting and that the predictors are meaningful. Also, while obs-training is lower than obs-trueval throughout the winter months December to February, sDoG slightly overestimates December temperatures, is almost identical to obs-trueval in January, and slightly underestimates obs-trueval in February. Thus, although the overall variability in the time series used for training is much larger than the annual cycle (compare, e.g., Figures 2 and 5), sDoG is able to capture changes in the mean seasonal cycle from obs-training to obs-trueval. The overall bias between sDoG and obs-trueval is 0.21 °C (see Table 2), and between obs-trueval and obs-training 0.62 °C. sDoG is thus able to correct obs-training towards obs-trueval, but the remaining bias is still of an order of magnitude relevant for trend detection (e.g., [49]).

Added Value of sDoG Over the Reference Data Sets
In this section, the performance of sDoG in the true evaluation period is compared to the performance of the reference data sets listed in Table 1. Table 2 shows values of RE of sDoG over all considered reference data sets at all considered time scales. Furthermore, Figure 6 shows a snapshot of daily air temperatures at VERNAGT in the earlier true evaluation period together with sDoG, ERA5, SPARTACUS, ERA5-Land and MESCAN-SURFEX. The selection of reference data sets reflects the range of RE values, from low (ERA5 and SPARTACUS) and intermediate (ERA5-Land) to high values of RE (MESCAN-SURFEX; see also Table 2). Figures 7 and 8 show autumn and summer time series with the same selection of reference data sets. Figures 9 and 10 show differences of the mean seasonal cycles of all reference data sets, averaged over the true evaluation period, to obs-trueval. How does the performance of sDoG compare to the performances of the available alternative data sets, and how do the considered data sets perform amongst each other? RE values of sDoG are significantly positive over all reference data sets for the overall daily values (i.e., daily values including the seasonal cycle and year-to-year variations) and the day-to-day variations (i.e., daily values with the seasonal cycle and year-to-year variations removed). RE ranges from 24% for the best performing reference data set (ERA5) to 90% for the reference data set with the largest errors (MESCAN-SURFEX). Regarding year-to-year variations, values of RE of sDoG are positive over all reference data sets for the annual, autumn, winter and spring time series, ranging from 0 (SPARTACUS spring time series) to 98% (MESCAN-SURFEX annual and winter time series), but are not significantly positive for all reference data sets. Failing to prove significance for the annual, autumn, winter and spring time series is also related to the fact that fewer values are available for the calculation of RE than for the other time scales.
Concerning the summer time series, values of RE of sDoG are positive over only three out of the seven considered reference data sets, and significantly positive in only one case (ALARO). Concerning the representation of the seasonal cycle, values of RE of sDoG are significantly positive over all reference data sets except SPARTACUS and ERA5. Failing to prove significance of RE of sDoG over ERA5, even though the value of RE amounts to 41%, relates to the high serial correlation of the seasonal cycle time series, which reduces the effective sample size in the significance estimation (see [42]). When comparing the performance amongst the reference data sets, ERA5 is the best reference data set regarding the daily time scale (with and without the seasonal cycle and year-to-year variations), and SPARTACUS is the best reference data set regarding the seasonal cycle and year-to-year variations (see Table 2 and Figures 6-9). Note also that while ERA5 and ERA-Interim show problems in modeling summer air temperatures similar to those of sDoG, SPARTACUS captures the observed variations in the first part of the true evaluation period more closely. Regarding day-to-day variability, by contrast, ERA5 and ERA-Interim show slightly higher performance than SPARTACUS (e.g., Figure 6). ERA-Interim outperforms ERA5 concerning year-to-year variations and shows a performance very close to ERA5 concerning day-to-day variability, but improvement of ERA5 over ERA-Interim is evident in the representation of the seasonal cycle (see also Figure 10). SPARTACUS clearly outperforms Vent at all considered temporal aspects. Overall, the best performing reference data sets are SPARTACUS, ERA5 and ERA-Interim, and the reference data sets with the worst performance are ERA5-Land, HARMONIE, ALARO and MESCAN-SURFEX. Note also that the differences of RE values within the best performing reference data sets are small compared to the differences of RE values between the best and the worst performing reference data sets.
In other words, the performances of SPARTACUS, ERA5 and ERA-Interim are broadly similar, while there is a larger gap to the performances of ERA5-Land, HARMONIE, ALARO and MESCAN-SURFEX. In fact, RE values of sDoG over ERA5-Land, HARMONIE, ALARO and MESCAN-SURFEX range up to 94% for the overall daily values, up to 91% for the isolated day-to-day variations, up to 98% for the seasonal cycle, and up to 98% for the year-to-year variations. Only for the winter time series does ERA5-Land outperform ERA5 and ERA-Interim. Within the group of worst performing reference data sets, ERA5-Land and HARMONIE outperform ALARO and MESCAN-SURFEX. MESCAN-SURFEX, the only product that includes two numerical-model-based downscaling steps, shows the lowest performance.
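The time-scale separation used throughout this comparison (overall daily values, isolated day-to-day variations, mean seasonal cycle, year-to-year variations) can be sketched as below. This is a minimal illustration using raw day-of-year means; the paper's actual procedure may use a smoothed seasonal cycle and season-wise stratification:

```python
import numpy as np

def decompose(daily, day_of_year, year):
    """Split a daily series into a mean seasonal cycle, year-to-year
    (annual-mean anomaly) variations, and day-to-day residuals."""
    daily = np.asarray(daily, dtype=float)
    doy = np.asarray(day_of_year)
    yr = np.asarray(year)

    # Mean seasonal cycle: average of all values sharing a day of year.
    doy_levels = np.unique(doy)
    cycle = np.array([daily[doy == d].mean() for d in doy_levels])
    seasonal_full = cycle[np.searchsorted(doy_levels, doy)]

    # Year-to-year variations: annual means of the seasonal anomalies.
    anomalies = daily - seasonal_full
    yr_levels = np.unique(yr)
    annual = np.array([anomalies[yr == y].mean() for y in yr_levels])
    annual_full = annual[np.searchsorted(yr_levels, yr)]

    # Day-to-day variations: what remains after removing both.
    day_to_day = anomalies - annual_full
    return seasonal_full, annual_full, day_to_day
```

The "overall daily values" correspond to the raw series, while each returned component isolates one of the other time aspects before RE is computed.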
To sum up, the two global reanalysis products ERA5 and ERA-Interim outperform the higher-resolution, numerical-model-based downscaling products ERA5-Land, HARMONIE, MESCAN-SURFEX and ALARO. For the numerical-model-based downscaling products considered here, these results do not support the assumption of added value over their coarse-scale drivers (e.g., [19]). However, sDoG applied to the coarsest-scale reference data set considered here (ERA-Interim) clearly outperforms ERA-Interim and all higher-resolution reference data sets for all time aspects except summer air temperature variability.

Verification of the Cross-Validation-Based Uncertainty Estimates of sDoG
Figures 11 and 12 verify the cross-validation-based uncertainty estimates 1SE and 2SE, the 68% and 95% confidence intervals of sDoG, respectively. More precisely, for each year in the true evaluation period, they show the percentages of values of obs-trueval exceeding sDoG ± 1SE and sDoG ± 2SE (illustrated, e.g., as dark and light grey shaded areas in Figures 2 and 6). The analysis, performed at the daily time scale, is shown for all data (Figure 11) and for data stratified by season (Figure 12), to investigate how the shortcomings of the cross-validation-based estimates of 1SE and 2SE relate to the individual seasons. Following the definition of 1SE and 2SE as 68% and 95% confidence intervals of sDoG, the white bars should on average not exceed 32% of the data (blue, upper solid line), and the grey bars should on average not exceed 5% of the data (black, lower solid line). The average of the white (grey) bars is indicated by the blue (black) dashed line, separately for the true evaluation period and the training period. If the dashed lines lie above the solid lines in the true evaluation period, the cross-validation-based uncertainty estimates 1SE and 2SE underestimate the true uncertainties of sDoG.

Figure 12 shows that the sDoG uncertainties are more realistic for the autumn and winter seasons than for spring and summer. In fact, for autumn and winter the cross-validation-based 1SE almost matches the true 1SE, including, averaged over the entire true evaluation period, 66.7% and 66.8% of the data, respectively. For spring and summer, by contrast, the 1SE estimate includes only 59% and 54% of the true evaluation data. Also evident in Figures 11 and 12 is that the cross-validation-based 1SE estimate is more realistic than the cross-validation-based 2SE estimate: for autumn and winter, the 2SE estimate includes 85% and 86% of the true evaluation data, while for spring and summer it includes only 81% and 75%. Furthermore, the sDoG 1SE and 2SE estimates are exceeded more often in the first part of the true evaluation period (1979 to 1990, more distant from the training period) than in the later part (1991 to 2001). This is evident in the annual data (Figure 11), and particularly for the spring and summer seasons (Figure 12).

In the training period, the 1SE and 2SE estimates overestimate the true errors (see Figures 11 and 12: the dashed black and blue lines lie below the corresponding solid lines post 2001). This is because 1SE and 2SE are calculated from the cross-validation-based test errors. The difference between the dashed and solid lines in the training period shows the merit of the applied cross-validation procedure: it shifts the error estimates away from the (too optimistic) training-period errors towards more realistic values. Limitations of the cross-validation procedure are particularly evident for the spring and summer seasons (see the larger gap between the dashed and solid lines in the true evaluation period), and for the 95% confidence interval 2SE (the gap between the black dashed and solid lines is larger than that between the blue dashed and solid lines).
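The exceedance statistics underlying Figures 11 and 12 amount to counting observations that fall outside sDoG ± 1SE and sDoG ± 2SE; a minimal sketch, assuming `se` is given per day (or as a constant):

```python
import numpy as np

def exceedance_percentages(obs, pred, se):
    """Percentage of observations outside pred +/- 1*se and pred +/- 2*se
    (cf. the white and grey bars in Figures 11 and 12)."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    se = np.asarray(se, dtype=float)
    err = np.abs(obs - pred)
    out_1se = 100.0 * np.mean(err > se)
    out_2se = 100.0 * np.mean(err > 2.0 * se)
    return out_1se, out_2se

# For Gaussian forecast errors with a correct SE estimate, roughly 32%
# of values should exceed 1SE and roughly 5% should exceed 2SE:
rng = np.random.default_rng(42)
obs = rng.standard_normal(10000)
out_1se, out_2se = exceedance_percentages(obs, np.zeros(10000), np.ones(10000))
print(out_1se, out_2se)
```

Computing these percentages per calendar year (or per season) and comparing them to the nominal 32% and 5% levels reproduces the logic of the bar charts described above.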

Discussion and Conclusions
sDoG is a one-dimensional downscaling model designed to extend short-term weather station data from high mountain sites to the baseline climate. This study evaluates sDoG and its cross-validation-based uncertainty estimates for daily air temperature at the Vernagtbach climate monitoring site (2640 MSL in the European Alps). sDoG is trained and cross-validated using data from 2002 to 2012, while the evaluation considers data from 1979 to 2001. The results show that sDoG adds significant value in the true evaluation period over various reference data sets available for the study area at very different spatial resolutions, including global and regional reanalysis data sets (ERA-Interim, ERA5, ERA5-Land, UERRA HARMONIE, UERRA MESCAN-SURFEX), regional climate model output (a historical simulation of the model ALARO-0 by the CORDEX initiative), and a 1 km-resolution gridded observation-based product available for the territory of Austria (SPARTACUS).
Added value of sDoG is demonstrated over all reference data sets at all considered time scales (day-to-day, seasonal cycle, and year-to-year). Problems of sDoG, however, emerge in the modeling of summer air temperatures, for which the added value of sDoG is positive over only three out of eight reference data sets, and significant in only one case. This comparably poor performance of sDoG is most likely related to a nonstationarity of the microclimate of the Vernagtbach climate monitoring site that violates the stationarity assumption underlying sDoG. More precisely, during the evaluation period the terminus of the Vernagtferner (glacier), formerly close to the site, retreated several hundred meters, leaving behind rocky moraine terrain. Overall, the glacierized area in the region around the Vernagtbach climate monitoring site diminished (e.g., [34,35]). This nonstationarity of the microclimate is more important for spring and summer than for autumn and winter, and affects sDoG at an order of magnitude relevant for trend analysis (e.g., [49]). In the case of the Vernagtbach climate monitoring site, the changes of the surrounding microclimate are well documented (e.g., [34-36]). This information is not yet considered in the sDoG modeling procedure, which, apart from the training observations, uses only reanalysis data as predictors. An avenue for further developing sDoG, not investigated in this study, could be to include metadata available for the entire forecasting period (e.g., the distance of the station to the glacier terminus) as predictors in the downscaling procedure.
Evaluation of the cross-validation-based standard errors of sDoG (daily values) shows that the one-standard-error estimate 1SE (68% confidence interval) is modeled more accurately than the two-standard-error estimate 2SE (95% confidence interval). Discrepancies between the cross-validation-based 1SE and the true 1SE mostly affect daily values in summer and spring and in the first part of the true evaluation period (i.e., prior to 1990). For the autumn and winter time series, by contrast, the cross-validation-based 1SE is very close to the true 1SE throughout the true evaluation period. Discrepancies are, however, larger for 2SE: the cross-validation-based 2SE underestimates the true 2SE throughout the true evaluation period and for all seasons. The underlying cause of this underestimation might be that the cross-validation sample (twelve years) is too short for an accurate determination of the 95% confidence interval of sDoG. This limits the applicability of sDoG for the analysis of extreme values. Further investigation is needed to determine the minimum amount of data required to accurately determine 2SE based on cross-validation when the stationarity assumption is satisfied.
Next to the validation of sDoG, this study also offers a detailed, station-scale evaluation and comparison of all considered reference data sets. The evaluation of all reference data sets is performed after applying a very basic downscaling step that removes the average altitude bias. Without this bias correction, the altitude bias would dominate the evaluation of all considered reference data sets, as it remains important even for the highest-resolution data sets. In practice, the simple bias correction proposed in this study has the same prerequisite as sDoG: an observational time series of a few years needs to be available for the site. For sites without observations, removing the altitude bias is intricate because vertical temperature profiles are known to vary locally and seasonally (e.g., [2]). Removing the altitude bias using pressure level information is an option for the data sets available on pressure levels (e.g., reanalysis data), but this step assumes that the lapse rate extracted from the coarse-scale data set is accurate.
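The basic bias-removal step described above can be sketched as a constant-offset correction estimated over the observational overlap period; the paper's actual implementation may additionally resolve seasonal variations of the bias:

```python
import numpy as np

def remove_mean_bias(series, series_overlap, obs_overlap):
    """Shift `series` by the mean bias estimated over an overlap period.

    `series_overlap` and `obs_overlap` are the co-located data set and
    station values during the few years with observations; the same
    constant offset is then applied to the full series.
    """
    bias = np.mean(np.asarray(series_overlap, dtype=float)
                   - np.asarray(obs_overlap, dtype=float))
    return np.asarray(series, dtype=float) - bias

# Toy example: a grid point 5 K too warm relative to the station.
corrected = remove_mean_bias([10.0, 11.0], [5.0, 6.0], [0.0, 1.0])
print(corrected)  # -> [5. 6.]
```

Note that a constant offset corrects only the mean altitude bias; errors in variability, seasonality, and day-to-day sequencing are left untouched, which is exactly what the subsequent evaluation compares across data sets.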
The best performing reference data sets in this study are SPARTACUS, ERA5, and ERA-Interim. Concerning day-to-day and year-to-year variations, the performances of ERA5 and ERA-Interim are very similar, but ERA5 shows an improvement over ERA-Interim in the representation of the 23-year mean seasonal cycle. The gridded observational product SPARTACUS clearly outperforms the weather station Vent (at only four kilometers distance from the Vernagtbach climate monitoring site) for all considered time aspects. This hints at the sophistication of the air temperature interpolation algorithm underlying SPARTACUS and at the added value of the gridded, quality-controlled product even at the station scale (see [2,8]). ERA5-Land is a surface analysis of ERA5 at 9 km grid resolution (versus 31 km for ERA5). Our study, however, cannot demonstrate added value of ERA5-Land over ERA5 for air temperature at the Vernagtbach climate monitoring site: ERA5-Land shows larger mean squared errors than ERA5 (and also ERA-Interim) at all time scales except the winter time series. Similarly, the regional reanalysis HARMONIE at 11 km resolution does not show added value over its driving data set ERA-Interim at 80 km grid resolution. Even more underwhelming is the performance of MESCAN-SURFEX, a 5.5 km grid surface analysis applied to HARMONIE: MESCAN-SURFEX shows the largest mean squared errors of all reference data sets for all considered time aspects. This performance drawback of MESCAN-SURFEX cannot be related to insufficient spatial resolution or dislocation. In fact, the performance differences between ERA-Interim and ERA5 (80 km versus 31 km grid) are small compared to the performance differences of both ERA-Interim and ERA5 to MESCAN-SURFEX (80 and 31 km versus 5.5 km grid). We also find that the CORDEX ALARO simulation shows weaker performance than its driving data set ERA-Interim for all considered time aspects.
This must be contrasted with the lack of added value of MESCAN-SURFEX, ERA5-Land, and HARMONIE, because ALARO is run as a freely evolving climate simulation, constrained by ERA-Interim only at the lateral boundaries throughout the simulation period. Thus, added value is not to be expected for time-sequencing aspects [47]. In the representation of the 23-year mean seasonal cycle, however, ALARO conceptually should add value over ERA-Interim.
Even though our study is based on only one site, the agreement among the best performing reference data sets (SPARTACUS, ERA5 and ERA-Interim) despite their widely varying spatial resolutions, and their superior performance over several higher-resolution data sets, are noteworthy. A similar analysis for more sites and different atmospheric quantities could shed more general light on the lack of added value of the investigated RCM-based data sets found here for air temperature at the Vernagtbach climate monitoring site. In fact, a general conclusion on the added value of numerical-model-based downscaling or regional climate modeling does not exist, since it is known that regional climate models can also amplify errors present in a global climate simulation (e.g., [50]).
While this study does not evaluate downscaling models other than sDoG, note that sDoG differs from most downscaling approaches in that it is not applicable to freely running climate models as predictors. This is because sDoG uses information about the time sequencing in the observations for fitting the statistical relationships, which enables the application of sDoG to relatively short time series. Furthermore, sDoG differs from most downscaling models in that its goal is to extend and/or complement weather station data for past periods [37,38]. More precisely, sDoG aims at a reconstruction of baseline climates rather than at knowledge about future climate change. The exploration of statistical downscaling methods like sDoG to extend or complement interrupted station records has also been suggested for the generation of long-term gridded observation-based products like SPARTACUS (see [2]). Provided a successful evaluation, sDoG output can be used for the training of downscaling models that focus on future climate change (e.g., as "pseudo-observations"). This also includes the present-day evaluation and further development of kilometer-scale regional climate simulations (e.g., [16]). In this way, sDoG could contribute to increasing knowledge about future climate change for data-scarce high mountain regions. While our study sheds light on the benefits and shortcomings of sDoG and other state-of-the-art atmospheric data sets, it confirms once more the continued need for a reliable and dense in situ observational network in high mountain regions (e.g., [51]).