Evaluation of Remote Sensing and Reanalysis Snow Depth Datasets over the Northern Hemisphere during 1980–2016

: Snow cover is a key parameter of the climate system and its signiﬁcant seasonal and annual variability have signiﬁcant impacts on the surface energy balance and global water circulation. However, current snow depth datasets show large inconsistencies and uncertainties, which limit their applications in climate change projections and hydrological processes simulations. In this study, a comprehensive assessment of ﬁve hemispheric snow depth datasets was carried out against ground observations from 43,391 stations. The ﬁve snow depth datasets included three remote sensing datasets, i.e., Advanced Microwave Scanning Radiometer for the Earth Observing System (AMSR-E), Advanced Microwave Scanning Radiometer-2 (AMSR2), Global Snow Monitoring for Climate Research (GlobSnow), and two reanalysis datasets, i.e., ERA-Interim and the Modern-Era Retrospective Analysis for Research and Applications, version 2 (MERRA-2). Assessment results imply that the spatial distribution of GlobSnow and ERA-Interim exhibit overall better agreements with ground observations than other datasets. GlobSnow and ERA-Interim exhibit less uncertainty during the snow accumulation and ablation periods, respectively. In plain and forested regions, GlobSnow, ERA-Interim and MERRA-2 show better performances, while in mountain and forested mountain areas, GlobSnow exhibits the best performance. AMSR-E and AMSR2 agree better with ground observations in shallow snow condition (0–10 cm), while MERRA-2 shows more satisfying performance when snow depth exceeds 50 cm. These systematic and integrated understanding of the ﬁve representative snow depth datasets provides information on data selection and data reﬁnement, as well as data fusion, which is our next work of interest. their forcing data are in overall good agreement. GlobSnow presents closer relationships with ERA-Interim and MERRA-2, rather than AMSR-E and AMSR2, and particularly ﬁne correlations are found between GlobSnow and ERA-Interim over North America. Lower correlations with other datasets are observed between AMSR-E / AMSR2 and other datasets, and the lowest correlation occurs between AMSR-E and ERA-Interim.


Introduction
Snow cover is the most widely distributed and most dynamic component of the cryosphere, and its significant seasonal and annual variability have notable influences on the global water circulation and surface energy balance [1][2][3]. Snow melt supplies water for approximately one sixth of the world's population, and is a valuable water resource in arid and semi-arid regions [4][5][6]. Snow cover is not only a sensitive indicator that responds to the climate system, but is also an important feedback variable that

Data and Methods
In this Section, we introduce five hemispheric snow depth datasets being evaluated, four reference data and two ancillary data, as well as the evaluation scheme. Specifically, brief information and basic algorithms of the five snow depth datasets, including AMSR-E, AMSR2, GlobSnow, ERA-Interim, and MERRA-2, are described in Section 2.1. Ground reference data, including meteorological station data from China and Russia, snow survey data from Russia and the Global Historical Climatology Network (GHCN)-Daily dataset are described in Section 2.2. A digital elevation model (DEM) data and a land cover data which were imported to extract mountain and forested areas are introduced in Section 2.3. The evaluation scheme is described in Section 2.4.  [51]. Global SWE and snow depth datasets of AMSR-E and AMSR2 are provided by the National Aeronautics and Space Administration (NASA), and the Japanese Aerospace Exploration Agency (JAXA), respectively. In this study, we used the AMSR-E and AMSR2 snow depth datasets of from JAXA.
AMSR-E and AMSR2 daily snow depth datasets from JAXA use the retrieval algorithm from Kelly [52]. Before the snow depth retrieval, the "snow-free" area was identified and masked out by a hemispheric snow cover dataset [53] and MODQ11, a land cover dataset from Moderate Resolution Imaging Spectroradiometer (MODIS) [54]. Then, if the snow was considered shallow snow by the brightness temperature threshold detection, snow depth was set constant to 5 cm. For moderate or deep snow, an improved Chang algorithm that takes the forest cover fraction into consideration was implemented, and different coefficients were applied [55]. The most significant improvements of AMSR2 against AMSR-E are the smaller footprint and the optimization of the calibration system [56]. Although there is no temporal overlap between the two datasets, they are both cross calibrated against the Special Sensor Microwave Imager/Sounder (SSMI/S) to balance systematic errors.
AMSR-E and AMSR2 daily snow depth data were extracted for the periods 2003-2011 and 2013-2016, respectively (https://suzaku.eorc.jaxa.jp/). Note that in our study, a snow year is defined from September 1st of the previous year to August 31st of the current year. A spatial resolution of 0.25 • × 0.25 • was adopted for both AMSR-E and AMSR2 datasets, although AMSR2 also offers a finer spatial resolution of 0.10 • × 0.10 • . In addition, as daily detections of AMSR-E and AMSR2 do not fully cover the extent of the Northern Hemisphere, stripe gaps southward of 55 • N can be found in the daily images. Specifications of AMSR-E and AMSR2 (as well as other datasets introduced below) are displayed in Table 1. Note that "ρ" denotes snow density, "SD*" stands for the average snow depth in snow covered area of a pixel, while "fsc" represents the fraction of snow cover in a pixel.

GlobSnow Daily SWE Dataset
Global Snow Monitoring for Climate Research (GlobSnow) is a hemispheric SWE dataset from European Space Agency (ESA), and is produced from the data assimilation of microwave radiometer data and meteorological station data [57]. Specifically, microwave brightness temperature data from the Scanning Multichannel Microwave Radiometer (SMMR) for the period 1980-1987, the Satellite Program's Special Sensor Microwave Imager (SSM/I) and the Special Sensor Microwave Imager/Sounder (SSMI/S) since 1988 were applied. Ground snow depth observations of GlobSnow were derived from European Centre for Medium-Range Weather Forecasts (ECMWF). The model used was the Helsinki University of Technology (HUT) snow microwave emission model, which is a radiative transfer-based semi-empirical model [58].
Two major steps were included in production of GlobSnow. First, brightness temperature was simulated at the location of every station by the HUT model, with input of observed snow depth, optimal effective grain size, bulk density and forest canopy parameters. Then, the model was fit to satellite observed brightness temperature by optimizing the effective snow grain size [58]. After interpolating the observed snow depth and the optimized effective snow grain size over the Northern Hemisphere, gridded brightness temperatures were simulated by running the HUT model again, and finally, the SWE was adjusted after comparing satellite observations and model simulations [57]. In addition, a time-series melt-detection algorithm was applied to determine the onset of snowmelt season [59], and once the snow was defined as wet snow, SWE was set to 0.001 mm. Although the brightness temperature data from multiple sensors were employed in GlobSnow, cross-validation was not undertaken as the assimilation process is believed to be capable of decreasing part of the inconsistencies among different sensors. It should also be noted that in the data processing of GlobSnow, snow depth was calculated first, and then converted to SWE, thus, the universal snow density (0.24 g/cm 3 ) [60] would not add uncertainty to the snow depth dataset.
In the standard GlobSnow datasets, mountain areas were masked out, whereas the dataset used in this study is based on version 2.0, but mountain area are retained (http://www.globsnow.info/swe/). Nevertheless, snow depth from mountain areas was not assimilated as ground observations in mountain areas were removed in the data assimilation process. GlobSnow offers daily SWE data for the area of 35 • N-85 • N during the period 1980-2013 at the spatial resolution of 0.25 • × 0.25 • . SWE was transformed into snow depth by the universal snow density of 0.24 g/cm 3 . Serious data gaps are present during June to September, representing 63.63%, 99.05%, 100%, and 68.53% in June, July, August, and September, respectively. The data was provided every other day, with occasional data gaps of longer duration for the period 1980-1988, when SMMR offered brightness temperature at that frequency.

ERA-Interim Daily Snow Depth Dataset
ERA-Interim is the 4th generation reanalysis datasets from European Centre for Medium-Range Weather Forecasts (ECMWF, [61]). A large variety of surface parameters, as well as upper-air parameters and atmospheric fluxes are provided at the global scale since January of 1979 [62].
The snow scheme of ERA-Interim is the hydrology tiled ECMWF schemes for surface exchange over land (TESSEL). A single layer energy and mass balance model was applied to describe the snow evolution process at a six-hour time step [63]. The forcing data was from the atmospheric reanalysis of ECMWF, and the precipitation data was corrected with the Global Precipitation Climatology Project's (GPCP) dataset that merges satellite and rain gauge data [64]. An important change of ERA-Interim is the revise of the Operational Analysis (OA) in 2004 [65]. Before 2004, the snow depth pattern was mainly driven by a gridded hemispheric snow depth climatology that was constructed by synoptic stations, literature searches, and climatological records. In the Revised Operational Analysis (ROA) since 2004, the National Environmental Satellite Data and Information Service (NESDIS) Interactive Multisensor Snow and Ice Mapping System (IMS) snow extent, a satellite derived dataset from National Oceanic and Atmospheric Administration (NOAA), was imported in addition to the ground-based observations [63].
In this study, SWE and snow density of ERA-Interim was extracted for the period 1980-2016 (https://apps.ecmwf.int/datasets/data/interim-full-daily/), and average snow depth of a pixel was calculated as the ratio between SWE and snow density. The original resolution of ERA-Interim is 0.75 • × 0.75 • every six hours. Because multiple interpolated spatial and temporal resolutions are offered, the variables we extracted had a spatial resolution of 0.25 • × 0.25 • and a frequency of six hours. The Modern-Era Retrospective Analysis for Research and Applications, version 2 (MERRA-2) is the latest global atmospheric reanalysis produced by NASA's Global Modeling and Assimilation Office (GMAO). MERRA-2 offers a variety of atmospheric and surface key variables at the global scale since January, 1980 [66].
The land surface model of MERRA-2 is Catchment, in which the basic computation unit is the hydrological basin instead of uniform grids. National Aeronautics and Space Administration's Seasonal to interannual Prediction Project (NSIPP), a three-layer catchment-based snow model, was used to diagnose snow properties [67,68]. The model-generated precipitation was derived from hourly hydrological datasets, and was corrected with precipitation gauge observations before being used as the forcing data [69]. After the correction, greatest improvements of precipitation were in the tropics, while no adjustments were applied poleward of 62.58 • N, due to the sparse distribution and poor quality of the gauge observations.
Average snow depth of the snow-covered area in a pixel and snow cover fraction were extracted for the period 1980-2016 (https://disc.gsfc.nasa.gov/datasets/). Mean snow depth of the pixel was calculated as the product between them. The original spatial resolution of MERRA-2 is 0.5 • × 0.625 • , and the data was resampled to 0.25 • × 0.25 • by nearest interpolation. Several temporal resolutions are offered and the average daily values were chosen.

Ground Observations
Four sources of ground observations, including meteorological station data from China and Russia, snow survey data from Russia and the Global Historical Climatology Network (GHCN)-Daily dataset, were collected and considered as the "ground reference" in the assessment. Basic information, data processing and the spatial distribution of the selected ground observations for spatial and temporal evaluation were introduced in this section.

Basic Information of the Ground Observations
Daily snow depth from China's meteorological stations during 1980-2013 was obtained from national meteorological information center (http://data.cma.cn/). The assembled dataset includes 945 meteorological stations, offering information on daily snow depth, station elevation, etc. Daily snow depth was measured manually with a ruler at 8:00 a.m. Data was calibrated and quality checked before importing to the national meteorological data platform, thus, no outlier or suspicious data was found in this dataset. Brief information about the dataset (as well as other ground observations introduced below) are listed in Table 2. Daily snow depth from Russian meteorological stations during 1980-2016 was collected through the Russian meteorological center (http://aisori.meteo.ru/ClimateR). Snow depth, elevation of the station and the quality control field were obtained from the dataset. Doubtful records, which were flagged in the quality control field were removed. After data filtration, a total of 620 stations were included in the dataset.
Snow survey data from Russia for the period 1980-2016 was also obtained via the Russian meteorological center (http://aisori.meteo.ru/ClimateR), including elevation of the measurement, the deepest, average, and shallowest snow depth, snow density, snow age etc. This information was Remote Sens. 2020, 12, 3253 7 of 31 recorded in the dataset at every five to ten days interval from September to May. The measurement was carried out at every 1 km or 2 km in open areas, and every 500 m in forested regions [70]. Data was filtered based on the criteria that the average snow depth should be within the range of the deepest and shallowest snow depth. After quality check, observations from 514 stations remained in the dataset.
The Global Historical Climatology Network (GHCN)-Daily dataset is a database that addresses the critical need for historical daily temperature, precipitation, and snow records over global land areas (https://www.ncdc.noaa.gov). Nearly 30 affiliations, including meteorological stations from several countries, SNOwpack TELemtry (SNOTEL), snow survey data, etc., provided records from over 75,000 stations in 179 countries and territories, but approximately two thirds of the data are for precipitation measurements only [71]. In our study, snow depth, elevation and quality control field were collected for the period 1980-2016 over the Northern Hemisphere. Data that failed the quality check (e.g., internal consistency check, climatological outlier check, spatial or temporal check, etc.) were removed. Finally, there were 41,312 stations offering snow depth information in the dataset.

Data Processing of Ground Observations
A total of 43,391 stations were left after quality check, located in 16,446 0.25 • × 0.25 • pixels. For pixels with multiple observations, mean values were adopted. To obtain more reasonable evaluation results, ground observations for spatial and temporal evaluation were further selected, and the filtering criteria and the distribution of ground observations for spatial and temporal evaluation are described in this section.
Data filtering of stations for spatial evaluation focused on examining the validity of stations in mountain areas (Figure 1), as complex topography often leads to low representativity of these stations [6,72,73]. The filtering checked both the number of stations in a pixel and altitude representativity of ground observations. For pixels with three or more stations, all stations were retained as more data would generally increase the representativity. For pixels with one or two stations, if the elevation of the station was out of the range of the mean elevation plus and minus three times the standard deviation of the pixel, the station was removed because it was probably not representative enough for the average snow depth of the pixel. After the filtering, 233 stations which were located in 230 pixels were removed, and 43,148 stations were left for spatial evaluation.
The left stations were further selected for temporal evaluation based on the following criteria (see flowchart in Figure 1): (1) the station should be located between 35 • N-85 • N, to be consistent with the spatial coverage of GlobSnow; (2) missing data of the station should be less than 40% during the study period, which is in consideration of the seasonal variation in uncertainty; (3) annual mean snow cover days (snow depth deeper than 5 cm was defined as a snow cover day) of the station should be more than 60 days, with the purpose of obtaining more valid information and highlight the differences. As a result, 1984 stations in 1793 pixels were selected for temporal evaluation.
The spatial distribution of the stations for spatial and temporal evaluations is presented in Figure 2. China's stations are densely distributed in the eastern part, and relatively sparse in the west. Russian stations and survey data cover most regions of the former Soviet Union. GHCN stations are most densely distributed in the U.S. and can be found in most part of the Northern Hemisphere. With regard to stations for temporal evaluation, 1237 of them are located in North America, and most of them are distributed around the Rocky Mountains, the national boundary between Canada and the U.S., as well as in most of Canada. In the meanwhile, 747 stations are located in Eurasia, and are densely distributed in Scandinavia, and the rest are almost evenly distributed in mid-north Eurasia.  The spatial distribution of different sources of ground observations for spatial distribution and stations for temporal evaluation. Data for spatial evaluation, namely the ground observations from China, Russian meteorological stations, Russian snow survey and GHCN, were represented by orange, purple, green, and blue dots, respectively. Data selected for temporal evaluation are represented by black crosses.

Ancillary Data
To examine the performance under different land cover types (i.e., plain, forest, mountain, and forest-mountain), a land cover data and a DEM data were obtained to extract mountain and forested areas. Brief information and the data processing of the two data are introduced in this Section.

GlobCover 2009 Global Land Cover Map
Forested and non-forested regions were identified using GlobCover 2009, which is a global land cover data that produced by ESA and the Université Catholique de Louvain (http://due.esrin.esa.int/page_globcover.php, [74]). The land cover map was derived from an

Ancillary Data
To examine the performance under different land cover types (i.e., plain, forest, mountain, and forest-mountain), a land cover data and a DEM data were obtained to extract mountain and forested areas. Brief information and the data processing of the two data are introduced in this Section.

GlobCover 2009 Global Land Cover Map
Forested and non-forested regions were identified using GlobCover 2009, which is a global land cover data that produced by ESA and the Université Catholique de Louvain (http://due.esrin.esa.int/page_globcover.php, [74]). The land cover map was derived from an The spatial distribution of different sources of ground observations for spatial distribution and stations for temporal evaluation. Data for spatial evaluation, namely the ground observations from China, Russian meteorological stations, Russian snow survey and GHCN, were represented by orange, purple, green, and blue dots, respectively. Data selected for temporal evaluation are represented by black crosses.

Ancillary Data
To examine the performance under different land cover types (i.e., plain, forest, mountain, and forest-mountain), a land cover data and a DEM data were obtained to extract mountain and forested areas. Brief information and the data processing of the two data are introduced in this Section.

GlobCover 2009 Global Land Cover Map
Forested and non-forested regions were identified using GlobCover 2009, which is a global land cover data that produced by ESA and the Université Catholique de Louvain (http://due.esrin.esa.int/ Remote Sens. 2020, 12, 3253 9 of 31 page_globcover.php, [74]). The land cover map was derived from an automatic and regionally-tuned classification of a time series of global Medium Resolution Imaging Spectrometer (MERIS) fine resolution mosaics for the year 2009, and 22 types of global land cover classes were defined according to the United Nations Land Cover Classification System.
The spatial resolution of GlobCover is approximately 300 m × 300 m. First, the 22 land cover types were reclassified as forest and non-forest at the original resolution; after resampling them to the spatial resolution of 0.25 • × 0.25 • , if more than half of the pixels at the 0.25 • × 0.25 • pixel extent was defined as forest, the pixel was classified as forest, otherwise it was defined as non-forest.

Global Multi-Resolution Terrain Elevation Data 2010
Mountains were separated from plain areas using Global Multi-resolution Terrain Elevation Data 2010 (GMTED2010, https://topotools.cr.usgs.gov/gmted_viewer/, [75]). It is an update data of GTOPO30, and both are produced by the United States Geological Survey (USGS).
GMTED2010 offers multiple spatial resolution at 30, 15, and 7.5 arc seconds, which is approximately one kilometer, 450 m, and 225 m, respectively [75]. Mountains were defined partly referring to GlobSnow [57] to maintain consistency. Therefore, both mean and standard deviation of the elevation within a pixel was obtained at the spatial resolution of 30 arc seconds, and a threshold of 150 m in standard deviation was set to determine the mountain areas. After resampling to the spatial resolution of 0.25 • × 0.25 • , a pixel was classified as mountain if these represented more than half of the pixels, otherwise it was defined as plain.

Reclassified Land Cover Types of the Northern Hemisphere
The land cover of the Northern Hemisphere was divided into forest and non-forest by GlobCover 2009, plain and mountain by GMTED2010. Hence, by data union, land areas were reclassified into plain, forest, mountain and forest-mountain ( Figure 3), each occupying 60.69%, 30.31%, 6.98%, and 2.02% of the total land area, respectively.
Remote Sens. 2020, 12, x FOR PEER REVIEW 9 of 33 automatic and regionally-tuned classification of a time series of global Medium Resolution Imaging Spectrometer (MERIS) fine resolution mosaics for the year 2009, and 22 types of global land cover classes were defined according to the United Nations Land Cover Classification System. The spatial resolution of GlobCover is approximately 300 m × 300 m. First, the 22 land cover types were reclassified as forest and non-forest at the original resolution; after resampling them to the spatial resolution of 0.25° × 0.25°, if more than half of the pixels at the 0.25° × 0.25° pixel extent was defined as forest, the pixel was classified as forest, otherwise it was defined as non-forest.

Global Multi-Resolution Terrain Elevation Data 2010
Mountains were separated from plain areas using Global Multi-resolution Terrain Elevation Data 2010 (GMTED2010, https://topotools.cr.usgs.gov/gmted_viewer/, [75]). It is an update data of GTOPO30, and both are produced by the United States Geological Survey (USGS).
GMTED2010 offers multiple spatial resolution at 30, 15, and 7.5 arc seconds, which is approximately one kilometer, 450 m, and 225 m, respectively [75]. Mountains were defined partly referring to GlobSnow [57] to maintain consistency. Therefore, both mean and standard deviation of the elevation within a pixel was obtained at the spatial resolution of 30 arc seconds, and a threshold of 150 m in standard deviation was set to determine the mountain areas. After resampling to the spatial resolution of 0.25° × 0.25°, a pixel was classified as mountain if these represented more than half of the pixels, otherwise it was defined as plain.

Reclassified Land Cover Types of the Northern Hemisphere
The land cover of the Northern Hemisphere was divided into forest and non-forest by GlobCover 2009, plain and mountain by GMTED2010. Hence, by data union, land areas were reclassified into plain, forest, mountain and forest-mountain ( Figure 3), each occupying 60.69%, 30.31%, 6.98%, and 2.02% of the total land area, respectively.
Forest is densely distributed in mid-high latitudes of Eurasia, east coast of North America and the southern part of Canada. The largest two mountain regions are the Rocky Mountains and the Himalaya, while the two smaller ones are the Scandinavian Mountains and the Alps. Forestmountain area are mainly located in the Rocky Mountains and the southeastern part of the Himalaya.

Evaluation Scheme
Average uncertainties of the five datasets were verified spatially at the pixel scale, temporally at annual and monthly average scales, under different land cover types and at different snow depth ranges. Evaluation metrics mainly included BIAS, root mean square error (RMSE), relative error (RE), and correlation coefficient (only calculated in spatial uncertainty analysis). Greenland was excluded in all evaluations due to the severe data gaps in all five datasets. Additionally, June, July, August, and September were not included in annual average calculations in consideration of the severe data gaps in the GlobSnow dataset. For ground observations with snow depth less than 1 cm, the RE value was set to invalid to avoid spurious large values. In addition, the difference between AMSR-E and Forest is densely distributed in mid-high latitudes of Eurasia, east coast of North America and the southern part of Canada. The largest two mountain regions are the Rocky Mountains and the Himalaya, while the two smaller ones are the Scandinavian Mountains and the Alps. Forest-mountain area are mainly located in the Rocky Mountains and the southeastern part of the Himalaya.

Evaluation Scheme
Average uncertainties of the five datasets were verified spatially at the pixel scale, temporally at annual and monthly average scales, under different land cover types and at different snow depth ranges. Evaluation metrics mainly included BIAS, root mean square error (RMSE), relative error (RE), and correlation coefficient (only calculated in spatial uncertainty analysis). Greenland was excluded in all evaluations due to the severe data gaps in all five datasets. Additionally, June, July, August, and September were not included in annual average calculations in consideration of the severe data gaps in the GlobSnow dataset. For ground observations with snow depth less than 1 cm, the RE value was set to invalid to avoid spurious large values. In addition, the difference between AMSR-E and AMSR2 was not compared, because only three years of data from AMSR2 were used and there was no overlap with AMSR-E, thus, their differences are likely to be influenced by interannual fluctuations in snow depth.

Results
The spatial and temporal characteristics of the five snow depth datasets were first intercompared, and then evaluated against ground observations through multiple metrics (BIAS, RMSE, RE, R), and the results are described in this section, respectively.

Intercomparison of the Datasets
The five datasets were intercompared by investigating their spatial and temporal consistencies, as well as pair-wise correlations.

Spatial Distribution of Average Snow Depth
Average snow depth of the five datasets from October to May during the study period over the Northern Hemisphere is shown in Figure 4. Consistencies among the five datasets are that snow cover is mainly distributed in mid-high latitudes over the Northern Hemisphere, and deep snow is generally distributed in Alaska, Northern Canada, and Northern Eurasia. However, apparent differences exist in the regional spatial patterns and snow depth.    In North America, the spatial patterns are similar, with snow cover densely distributed in Alaska, the Rocky Mountains and Northern Canada. However, the average snow depth of AMSR-E and AMSR2 is approximately 40 cm in the dense snow-covered area, and it is generally thinner than that of other datasets, especially in the Rocky Mountains and in the northeastern part of Canada. The average snow depth of GlobSnow and ERA-Interim is about 50 cm, which represents a moderate level among the five datasets. MERRA-2 shows extremely thick snow in the northern part of the Rocky Mountains, with the largest snow depths of more than 4 m.
In Eurasia, both the spatial pattern and snow depth values are inconsistent. The deepest snow depth of AMSR-E and AMSR2 appears in the Central Siberian Plateau and East Siberian Mountain, while the snow is shallower in the West Siberian Plain and the East European Plain; a similar spatial pattern is also found in the SMM/I estimates [76]. For the remaining datasets, deeper snow depth appears from the West Siberian Plain to Scandinavia, which is in the middle and western part of North Eurasia. In addition, MERRA-2 exhibits the deepest snow depth in the Arctic among the five datasets.

Temporal Variability of Average Snow Depth
Annual average snow depth during October to May during 1980-2016 among the five datasets over North America, Eurasia and the Northern Hemisphere is illustrated in Figure 5. Regional averages between 35 • N-85 • N land areas were calculated to maintain consistency with the spatial coverage of GlobSnow.
The largest inconsistency in average snow depth occurs in North America, where the average snow depth of MERRA-2 is approximately three times that of AMSR-E and AMSR2. Specifically, average snow depth of AMSR-E and AMSR2 is about 10 cm during the study period, GlobSnow and ERA-Interim show similar average snow depth between 15 and 20 cm, while the average snow depth of MERRA-2 is approximately 30 cm.
Better consistency was found among the five datasets over Eurasia but the relative snow depths between North America and Eurasia were different. Average snow depth of AMSR-E and AMSR2 over Eurasia is approximately 10-15 cm, which is a little deeper than that over North America. GlobSnow and ERA-Interim show similar average snow depths of about 15-20 cm over both continents. Average snow depth of MERRA-2 is approximately 25 cm, which is smaller than that over Northern America.
At the hemispheric scale, differences in average snow depth among the datasets are still apparent. AMSR-E and AMSR2 show the smallest average snow depth around 10-15 cm. GlobSnow and ERA-Interim exhibit a moderate snow depth of about 15-20 cm. MERRA-2 shows the largest average snow depth of about 25 cm. Other studies also suggest a larger average snow depth from MERRA-2 when compared with ERA-Interim at the global scale [77] and GlobSnow over the Northern Hemisphere [78].  The average monthly variability of the five datasets is displayed in Figure 6. Both the peak snow depth and monthly variation patterns are different over North America, Eurasia, and the Northern Hemisphere. AMSR-E and AMSR2 show the smallest snow depth, GlobSnow and ERA-Interim moderate depths, while MERRA-2 exhibits the largest peak snow depth. Additionally, the peak snow depth of AMSR-E and AMSR2 appears in February, implying a small snow accumulation process from January to February, and a strong melt since March. GlobSnow and ERA-Interim exhibit almost equal snow depths in February and March, indicating balanced accumulation and ablation processes. The peak snow depth of MERRA-2 appears in March, which suggests a continuously accumulation from January to March. Remote Sens. 2020, 12, x FOR PEER REVIEW 14 of 33

Pair-Wise Correlations among Datasets
The pair-wise correlations from October to May during 1980-2016 between 35 °N-85 °N among the five datasets were calculated over North America, Eurasia and the Northern Hemisphere (displayed in Figure 7). Datasets generally present better correlations over Eurasia than that over North America.
In both North America and Eurasia, the best correlation occurs between ERA-Interim and MERRA-2, probably as both of them are reanalysis, and their forcing data are in overall good agreement. GlobSnow presents closer relationships with ERA-Interim and MERRA-2, rather than AMSR-E and AMSR2, and particularly fine correlations are found between GlobSnow and ERA-

Pair-Wise Correlations among Datasets
The pair-wise correlations from October to May during 1980-2016 between 35 • N-85 • N among the five datasets were calculated over North America, Eurasia and the Northern Hemisphere (displayed in Figure 7). Datasets generally present better correlations over Eurasia than that over North America.
In both North America and Eurasia, the best correlation occurs between ERA-Interim and MERRA-2, probably as both of them are reanalysis, and their forcing data are in overall good agreement. GlobSnow presents closer relationships with ERA-Interim and MERRA-2, rather than AMSR-E and AMSR2, and particularly fine correlations are found between GlobSnow and ERA-Interim over North America. Lower correlations with other datasets are observed between AMSR-E/AMSR2 and other datasets, and the lowest correlation occurs between AMSR-E and ERA-Interim. Interim over North America. Lower correlations with other datasets are observed between AMSR-E/AMSR2 and other datasets, and the lowest correlation occurs between AMSR-E and ERA-Interim.

Evaluation of the Datasets against Ground Observations
In this Section, performances of the five datasets were evaluated against ground observations. First, we present the spatial distribution of the reference data. Then, we describe the spatial and temporal uncertainties (including BIAS, RMSE, RE and R) of each dataset. Finally, we demonstrate the uncertainties under different land cover types and snow depth ranges.

Average Snow Depth of Ground Observations
Average snow depth during October toMay from ground observations over the Northern Hemisphere is shown in Figure 8. An overall good correspondence is shown between snow depth and latitude. At lower latitudes, e.g., most of the U.S. and China, the average snow depth is less than 5 cm. Stations around the border between Canada and the U.S., as well as in the southern part of Europe, Kazakhstan and southern part of Russia show average snow depths around 5-20 cm. In Northern Canada, most parts of the Rocky Mountains and the mid-northern parts of Eurasia, the average snow depth is approximately 20-50 cm. Average snow depths greater than 50 cm are rare and most of them are located in mountain areas, e.g., in the Rocky Mountains and Scandinavian Mountains.

Evaluation of the Datasets against Ground Observations
In this Section, performances of the five datasets were evaluated against ground observations. First, we present the spatial distribution of the reference data. Then, we describe the spatial and temporal uncertainties (including BIAS, RMSE, RE and R) of each dataset. Finally, we demonstrate the uncertainties under different land cover types and snow depth ranges.

Average Snow Depth of Ground Observations
Average snow depth during October toMay from ground observations over the Northern Hemisphere is shown in Figure 8. An overall good correspondence is shown between snow depth and latitude. At lower latitudes, e.g., most of the U.S. and China, the average snow depth is less than 5 cm. Stations around the border between Canada and the U.S., as well as in the southern part of Europe, Kazakhstan and southern part of Russia show average snow depths around 5-20 cm. In Northern Canada, most parts of the Rocky Mountains and the mid-northern parts of Eurasia, the average snow depth is approximately 20-50 cm. Average snow depths greater than 50 cm are rare and most of them are located in mountain areas, e.g., in the Rocky Mountains and Scandinavian Mountains. E/AMSR2 and other datasets, and the lowest correlation occurs between AMSR-E and ERA-Interim.

Evaluation of the Datasets against Ground Observations
In this Section, performances of the five datasets were evaluated against ground observations. First, we present the spatial distribution of the reference data. Then, we describe the spatial and temporal uncertainties (including BIAS, RMSE, RE and R) of each dataset. Finally, we demonstrate the uncertainties under different land cover types and snow depth ranges.

Average Snow Depth of Ground Observations
Average snow depth during October toMay from ground observations over the Northern Hemisphere is shown in Figure 8. An overall good correspondence is shown between snow depth and latitude. At lower latitudes, e.g., most of the U.S. and China, the average snow depth is less than 5 cm. Stations around the border between Canada and the U.S., as well as in the southern part of Europe, Kazakhstan and southern part of Russia show average snow depths around 5-20 cm. In Northern Canada, most parts of the Rocky Mountains and the mid-northern parts of Eurasia, the average snow depth is approximately 20-50 cm. Average snow depths greater than 50 cm are rare and most of them are located in mountain areas, e.g., in the Rocky Mountains and Scandinavian Mountains.

Spatial Uncertainties of the Five Datasets
The results of spatial uncertainties, including BIAS, RMSE, RE, and R of the five datasets were described in this Section. RMSE from October to May during 1980-2016 among the five datasets was displayed in Figure 9, while result of BIAS, RE, and R was displayed in Figures A1-A3

Temporal Uncertainties of the Five Datasets
To detect the temporal uncertainties of the five datasets, annual and monthly BIAS, RMSE, and RE were investigated over North America, Eurasia, and the Northern Hemisphere, respectively, and results are described in this Section. The average annual and monthly snow depths from ground observations selected for temporal evaluation are also attached for reference.
Annual average snow depth from the selected observations, and the corresponding BIAS, RMSE and RE among the five datasets during October to May between 35 °N-85 °N over North America, Eurasia and the Northern Hemisphere are shown in Figure 10. Large interannual fluctuations are seen in observed snow depth over North America, and particularly in years when snow depth was In general, small BIAS and RMSE appear in conditions where snow is shallow, e.g., most parts of the U.S. and China. Correspondingly, in Alaska, the Rocky Mountains, North Canada, as well as Northern Eurasia where snow depth is greater, the BIAS, RMSE, and the discrepancies of the uncertainties are larger. R, however, generally gets finer as latitude gets northward. AMSR-E and AMSR2 exhibit overall weaker correlations than other datasets.
The general uncertainty patterns are similar among the datasets over North America. Almost all datasets show large BIAS (underestimation larger than 20 cm), RMSE (greater than 40 cm) and RE (larger than 175%) around the Rocky Mountains. MERRA-2, however, shows significant overestimation (larger than 20 cm) in the northern part of the Rocky Mountains, and relatively smaller RE (about 75-125%) in the southern part. In most part of Canada, AMSR-E and AMSR2 show larger uncertainties. In most parts of the U.S., ERA-Interim and MERRA-2 exhibit the finest RE and R.
In Eurasia, AMSR-E and AMSR2 exhibit overestimations (5-20 cm) in the East Siberian Plateau and obvious underestimations (5-20 cm on average, some stations larger than 20 cm) from the Middle Siberia Plain to Scandinavia. The RMSE of some stations in this area can be larger than 20 cm, which implies a poor performance of AMSR-E and AMSR2 in North Eurasia. GlobSnow, ERA-Interim and MERRA-2, which show different spatial patterns of snow depth with AMSR-E and AMS2 in Siberia, perform quite well over North Eurasia, except for MERRA-2 in the East European Plain, where a general overestimation (5-20 cm on average) was found. MERRA-2 also shows the overall lowest RE and finest R in China.

Temporal Uncertainties of the Five Datasets
To detect the temporal uncertainties of the five datasets, annual and monthly BIAS, RMSE, and RE were investigated over North America, Eurasia, and the Northern Hemisphere, respectively, and results are described in this Section. The average annual and monthly snow depths from ground observations selected for temporal evaluation are also attached for reference.
Annual average snow depth from the selected observations, and the corresponding BIAS, RMSE and RE among the five datasets during October to May between 35 • N-85 • N over North America, Eurasia and the Northern Hemisphere are shown in Figure 10. Large interannual fluctuations are seen in observed snow depth over North America, and particularly in years when snow depth was larger, e.g., in 1981, 2008, 2011, and 2014, when the snow depth of some stations in the Rocky Mountains was always deep. In other words, the particularly large snow depths, found in a few stations in the Rocky Mountains are likely to be attributed to the large average snow depth of certain years over North America. Meanwhile, the interannual fluctuation of snow depth in Eurasia is more stable. General underestimations were found in North America among the five datasets. AMSR-E and AMSR2 show the largest negative bias, followed by GlobSnow and ERA-Interim, while MERRA-2 is closest to the reference data. In Eurasia, MERRA-2 generally overestimates against ground observations, while GlobSnow and ERA-Interim are relatively closer to the reference data. At the hemispheric scale, a general overestimation is detected by MERRA-2, while AMSR-E and AMSR2 exhibit the largest underestimation, followed by GlobSnow and ERA-Interim.
The interannual fluctuations of RMSE among the five datasets are similar to those of the snow depth from ground observations in North America, and the RMSE is larger in years when the observed snow depth is also large, e.g., 2008 and 2011. ERA-Interim and GlobSnow exhibit smaller RMSE than other datasets before 2004, while the RMSE of ERA-Interim increased since 2004. In Eurasia, the RMSE of ERA-Interim is even smaller than that of GlobSnow before 2004, while it abruptly increases from 2004. This abrupt increase of RMSE in Eurasia is different from other datasets, and partly caused by its revision of the operational analysis, which changes the spatial distribution of the snow depth. In both continents, the RMSE of AMSR-E and AMSR2 is slightly larger than in other datasets.
AMSR-E, AMSR2 and GlobSnow show smaller RE than ERA-Interim and MERRA-2 in both continents, and MERRA-2 generally shows the largest. ERA-Interim exhibits apparently larger RE during 2004-2014, while other datasets show generally a stable trend.  and (c) the Northern Hemisphere, which are placed in each column. The first row represents average snow depth from ground observations, while the second, third and the fourth row stands for BIAS, RMSE and RE results of the five datasets, respectively. The dashed line in dark red from the first row represents snow depth from the selected stations for temporal evaluation, dashed lines from other rows, which are colored in blue, purple, yellow, red, and green, denote snow depth from AMSR-E, AMSR2, GlobSnow, ERA-Interim, and MERRA-2, respectively. General underestimations were found in North America among the five datasets. AMSR-E and AMSR2 show the largest negative bias, followed by GlobSnow and ERA-Interim, while MERRA-2 is closest to the reference data. In Eurasia, MERRA-2 generally overestimates against ground observations, while GlobSnow and ERA-Interim are relatively closer to the reference data. At the hemispheric scale, a general overestimation is detected by MERRA-2, while AMSR-E and AMSR2 exhibit the largest underestimation, followed by GlobSnow and ERA-Interim.
The interannual fluctuations of RMSE among the five datasets are similar to those of the snow depth from ground observations in North America, and the RMSE is larger in years when the observed snow depth is also large, e.g., 2008 and 2011. ERA-Interim and GlobSnow exhibit smaller RMSE than other datasets before 2004, while the RMSE of ERA-Interim increased since 2004. In Eurasia, the RMSE of ERA-Interim is even smaller than that of GlobSnow before 2004, while it abruptly increases from 2004. This abrupt increase of RMSE in Eurasia is different from other datasets, and partly caused by its revision of the operational analysis, which changes the spatial distribution of the snow depth. In both continents, the RMSE of AMSR-E and AMSR2 is slightly larger than in other datasets.
AMSR-E, AMSR2 and GlobSnow show smaller RE than ERA-Interim and MERRA-2 in both continents, and MERRA-2 generally shows the largest. ERA-Interim exhibits apparently larger RE during 2004-2014, while other datasets show generally a stable trend.
Average monthly snow depth from ground observations, and uncertainties (including BIAS, RMSE and RE) of the five datasets over North America, Eurasia, and the Northern Hemisphere are displayed in Figure 11. The peak snow depth from observations appears in February over North America, and in March over Eurasia. This inconsistency possibly originates from the different latitudes where the stations are located (Figure 2): compared with Eurasia, in North America more stations are located at lower latitudes, which may result in the earlier start of snow ablation.
Average monthly snow depth from ground observations, and uncertainties (including BIAS, RMSE and RE) of the five datasets over North America, Eurasia, and the Northern Hemisphere are displayed in Figure 11. The peak snow depth from observations appears in February over North America, and in March over Eurasia. This inconsistency possibly originates from the different latitudes where the stations are located (Figure 2): compared with Eurasia, in North America more stations are located at lower latitudes, which may result in the earlier start of snow ablation.
In general, AMSR-E and AMSR2 exhibit apparently larger underestimations in both continents, followed by GlobSnow and ERA-Interim. MERRA-2 shows a general overestimation, except in January, February, and March over North America.
From October to December, which is the snow accumulation period, ERA-Interim exhibits the smallest RMSE. In January, very similar RMSE was observed between ERA-Interim and GlobSnow. From February to April, GlobSnow shows the smallest RMSE. During May to August, the discrepancies among datasets are small. AMSR-E and AMSR2 exhibit overall largest RMSE during December to April, MERRA-2 shows a slightly larger RMSE than GlobSnow and ERA-Interim in almost all months.
The RE of MERRA-2 is apparently larger than for other datasets, particularly in September, October, April, May, and June, which correspond to the beginning of the snow accumulation and the end of the snow ablation period.  In general, AMSR-E and AMSR2 exhibit apparently larger underestimations in both continents, followed by GlobSnow and ERA-Interim. MERRA-2 shows a general overestimation, except in January, February, and March over North America.
From October to December, which is the snow accumulation period, ERA-Interim exhibits the smallest RMSE. In January, very similar RMSE was observed between ERA-Interim and GlobSnow. From February to April, GlobSnow shows the smallest RMSE. During May to August, the discrepancies among datasets are small. AMSR-E and AMSR2 exhibit overall largest RMSE during December to April, MERRA-2 shows a slightly larger RMSE than GlobSnow and ERA-Interim in almost all months.
The RE of MERRA-2 is apparently larger than for other datasets, particularly in September, October, April, May, and June, which correspond to the beginning of the snow accumulation and the end of the snow ablation period.

Uncertainties under Different Land Cover Types
Uncertainties (including BIAS, RMSE, and RE) of the five datasets under different land cover types (plain, forest, mountain and forest-mountain) during October to May between 35 • N and 85 • N over the Northern Hemisphere were calculated and illustrated in Figure 12. Results show that the five datasets perform best in plains and worse in mountain and forested areas.
According to the results from all the selected 1984 stations from all land cover types, AMSR-E and AMSR2 show the largest underestimation, largest RMSE, and smallest RE among all datasets. ERA-Interim exhibits a slightly smaller underestimation than in GlobSnow, and both datasets show smaller RMSE and RE than others. MERRA-2 exhibits the smallest underestimation, second smallest RMSE and largest RE.
In plain areas, AMSR-E and AMSR2 exhibit the largest BIAS and RMSE. GlobSnow, ERA-Interim and MERRA-2 show similar RMSE. ERA-Interim and MERRA-2 show the smallest and largest RE, respectively.

Uncertainties at Different Snow Depth Ranges
Average BIAS, RMSE, and RE at different snow depth ranges were calculated and results are displayed in Figure 13. Snow depth was divided into eight groups according to the average snow depth from ground observations, i.e., 0-10 cm, 10-20 cm, 20-30 cm, 30-40 cm, 40-50 cm, 50-100 cm, 100-200 cm, and deeper than 200 cm.
When observational snow depth is thinner than 10 cm, AMSR-E and AMSR2 exhibit a very slight underestimation, whereas other datasets show slight overestimations. The RMSE are similar among the five datasets, while the smallest and largest RE is found in AMSR-E/AMSR2 and MERRA-2, respectively.
When snow depth ranges from 10-30 cm, overestimations are found in MERRA-2, when snow depth ranges from 10-20 cm, ERA-Interim also exhibits a slight overestimation, while underestimations are observed in other datasets. GlobSnow, ERA-Interim and MERRA-2 have In forested areas, the smallest BIAS was observed from ERA-Interim, smallest RMSE was found from MERRA-2, and all datasets show almost equal amounts of RE.
In mountain areas, MERRA-2 is the only dataset that overestimates against station data. Additionally, the RMSE and RE of MERRA-2 are the largest. GlobSnow exhibits the smallest RMSE. AMSR-E and AMSR2 show the largest underestimation, second largest RMSE, and smallest RE.
In forest-mountain areas, much larger underestimations and RMSE were found in AMSR-E and AMSR2; the largest RE was found in MERRA-2, while the smallest RMSE was found in GlobSnow.

Uncertainties at Different Snow Depth Ranges
Average BIAS, RMSE, and RE at different snow depth ranges were calculated and results are displayed in Figure 13. Snow depth was divided into eight groups according to the average snow depth from ground observations, i.e., 0-10 cm, 10-20 cm, 20-30 cm, 30-40 cm, 40-50 cm, 50-100 cm, 100-200 cm, and deeper than 200 cm.
When observational snow depth is thinner than 10 cm, AMSR-E and AMSR2 exhibit a very slight underestimation, whereas other datasets show slight overestimations. The RMSE are similar among the five datasets, while the smallest and largest RE is found in AMSR-E/AMSR2 and MERRA-2, respectively.
When snow depth ranges from 10-30 cm, overestimations are found in MERRA-2, when snow depth ranges from 10-20 cm, ERA-Interim also exhibits a slight overestimation, while underestimations are observed in other datasets. GlobSnow, ERA-Interim and MERRA-2 have similar RMSE, which is smaller than that of AMSR-E and AMSR2. GlobSnow and ERA-Interim exhibit the smallest RE when snow depth ranges from 20-30 cm.
When snow depth ranges from 30-50 cm, AMSR-E and AMSR2 exhibit the largest underestimation, RMSE and RE. GlobSnow, ERA-Interim and MERRA-2 show similar RMSE, while the RE of MERRA-2 is slightly larger than GlobSnow and ERA-Interim. GlobSnow and ERA-Interim show similar smallest RE, followed by MERRA-2, AMSR-E, and AMSR2.
When snow depth is deeper than 50 cm, the five datasets perform similarly. ERA-Interim exhibits a slightly better performance than GlobSnow. MERRA-2 have the smallest BIAS, RMSE, and RE, AMSR-E, and AMSR2 have the largest. In particular, when snow depth exceeds 200 cm, the RE of AMSR-E and AMSR2 is very close to 100%, implying a failure in estimating snow depth at this range.
When snow depth is deeper than 50 cm, the five datasets perform similarly. ERA-Interim exhibits a slightly better performance than GlobSnow. MERRA-2 have the smallest BIAS, RMSE, and RE, AMSR-E, and AMSR2 have the largest. In particular, when snow depth exceeds 200 cm, the RE of AMSR-E and AMSR2 is very close to 100%, implying a failure in estimating snow depth at this range.

The Representativeness of Ground Observations
The representativeness of ground observations is an inevitable topic in dataset assessment studies, as remarkable differences occur when comparing ground observations with gridded snow depth datasets [79][80][81]. These inconsistencies are particularly obvious when measurements are sparsely distributed, or the underlying surface is complex or inhomogeneous, such as in mountain areas [20,82,83]. Nevertheless, accounting for the fact that ground observations are far from adequate, the direct upscaling is still the most commonly used approach in data assessment studies at present [81,84].
In our study, while point-wise observational data was directly regarded as the reference "truth" at the 0.25 • × 0.25 • pixel scale, efforts were made to increase the representativity and reliability of ground observations at the pixel scale.
First, by importing multiple sources of ground observations, both the total amount of observations and the number of observations in a pixel were increased. Specifically, in addition to ground observations from meteorological stations, snow survey data and SNOTEL data, etc., were also collected as ground reference. These multiple sources of ground observations largely expanded the total number of ground observations. In addition, because snow surveys are usually carried out over relatively short distance, these data were likely to increase the number of stations in the 0.25 • × 0.25 • pixel range.
The representativity and reliability of observations in mountain areas were further validated in mountain areas (detail in Figure 1). Two aspects were considered in the validation process, namely the number of stations required for a reliable reference at the pixel scale and the filtering criteria if the number of stations did not reach the threshold. As more observations in a pixel typically suggest larger reliability, the number of observations needed for a reliable reference was determined first. Chang [85] mentioned that ten measurements are needed to obtain an error of less than 5 cm in a 1 • × 1 • grid cell. According to an experiment conducted in mountain areas of California, neither individual snow course measurements nor the snow pillow could represent well the regional mean on a scale of 1-16 km 2 [86]. In consideration of these studies, we determined that a reliable reference in mountain areas needs at least three observations. Then, for pixels with one or two station, the representativeness of the stations was further checked. The representativeness of ground observations was validated by its elevation representativity, as different elevation leads to different solar radiation, temperature, precipitation, etc., consequently different snow depth distribution. As a result, 233 stations in mountain areas were removed due to their low representativity (described in Section 2.2.2). Compared to the evaluation result with all stations, the general BIAS, RMSE and RE decreased for 3.19%, 2.66%, 1.47% in mountain areas, and 4.07%, 2.36%, and 0.19% in forest-mountain areas, respectively. It should be also noted that apart from elevation, solar radiation, wind deposition, latitude, slope, aspect, etc., are also impacting factors of snow depth variation, thus, careful investigation on these factors are expected to consider in future work.

Advantages and Limitations of the Five Datasets
AMSR-E and AMSR2 are standalone passive microwave remote sensing datasets, thus, their retrieved snow depth reflects the properties of microwave signals. Based on the linear relationship between brightness temperature gradient and snow depth, AMSR-E and AMSR2 perform quite well in shallow snow, e.g., at snow depths less than 20 cm ( Figure 13). The fine performance of AMSR-E was also reported when SWE is less than 30 mm [32], and when annual peak SWE values are less than 80 mm [36]. However, if the study area is located at high latitudes where snow depth is generally thick, the fine performance of AMSR-E in shallow snow condition is likely to be neglected [48]. Additionally, the spatial resolution of AMSR-E and AMSR2 (0.25 • for AMSR-E, 0.25 • and 0.1 • for AMSR2) is two or three times finer than that of the reanalyses, thus they can pick up more variability and offer valuable information on snow depth at both regional and hemispheric scales ( Figure 4). However, apparent limitations also exist in AMSR-E and AMSR2. First, microwave signals saturate at a certain threshold (usually at 100-200 mm, depending on local conditions), after which the slope of the snow depth and the brightness temperature gradient reverse [77,87]. Consequently, AMSR-E and AMSR2 perform poorly in deep snowpacks, e.g., Alaska and the East European Plain ( Figure A1, Figure 13). In addition, the spatial pattern of AMSR-E and AMSR2 that show underestimation in the East Siberian Plain and East European Plain, and overestimation in the East Siberian Mountain ( Figure A1) may also be an intrinsic weakness of microwave remote sensing datasets, as it was also found in other remote sensing datasets, e.g., SMM/I [76]. The overestimation in Eastern Siberia is possibly due to the low winter-season precipitation with very cold surface temperature, both of which contribute to large faceted grains and causes exaggerated scattering relative to the overestimation of snow depth and SWE [88][89][90].
GlobSnow performed better overall against AMSR-E and AMSR2, largely owing to the import of station data and the assimilation process. Additionally, the time-series melt-detection algorithm possibly attributed to the good performance during the ablation period ( Figure 11). In addition, despite the relatively low representativeness of ground observations in mountain areas, GlobSnow exhibits the finest performance in mountain areas ( Figure 13). This is the first systematic assessment of GlobSnow over mountainous areas, because, in almost all intercomparison and evaluation studies, the standard version of GlobSnow was obtained, the mountain of which was masked [34,48,91]. A significant disadvantage of GlobSnow is that its spatial coverage is limited to 35 • N-85 • N, and serious data gaps occur during June to September, which makes daily snow depth monitoring over the whole Northern Hemisphere impossible.
ERA-Interim reveals generally satisfactory performance in the assessment. Apart from the model design and the assimilation scheme, snow depth estimates from reanalysis critically depend on the accuracy of precipitation forcing [92]. The precipitation data of ERA-Interim, which was corrected by GPCP, has been proved to be quite reliable [47,93,94]. This greatly contributes to the fine performance of ERA-Interim snow depth estimates. However, the revision of the operational analysis, which started in 2004, significantly changed the spatial patterns of snow depth (see Figure 1 of [65]), likely results in the abrupt increase in uncertainties of ERA-Interim in the following few years ( Figure 10). Therefore, the inconsistencies between these two periods must be taken into consideration in ERA-Interim studies. This finding is, however, difficult to be realized if the evaluation was conducted in short time series, or variation of annual uncertainty was not examined [46,95].
MERRA-2 performs quite well at lower latitudes, e.g., most part of the U.S. and China, while an obvious overestimation was detected in polar regions and the northern part of the Rocky Mountains ( Figure A1). This uncertainty pattern is quite similar to that of its precipitation forcing, which was only corrected by rain gauges southward of 62.58 • N. Consequently, the precipitation of MERRA-2 performs quite well at mid-low latitudes, while largely overestimates at higher latitudes [69,96]. Additionally, MERRA-2 correlates quite well with ground observations, even in the East European Plain, where the RMSE is large ( Figure A3, Figure 9). The fluctuations in annual uncertainties are quite small, which implies that MERRA-2 captures the temporal variability of snow depth quite well. Therefore, MERRA-2 is probably a good data source for long-term temporal analysis, since a stable temporal uncertainty is the foundation of reliable long-term variation analysis and trend detection.
Finally, despite different sources, these datasets are faced with some general weaknesses, particularly in mountains and forests. In mountain areas, the complex terrain and forest cover affect the scattering signals, thus, adding noise to the snow depth retrieval [97]. In the meanwhile, the forcing data (e.g., precipitation data) of the reanalyses is less accurate in mountain areas, resulting in poor snow depth simulations [98]. Different forest types show different transmissivities and radiation properties, which cause inaccuracies in snow depth retrieval [40]. For reanalyses, the inaccuracy of model parameterization and the description of the snow process in forested areas leads to inaccurate estimations [39]. These general limitations arouse the necessity for fusing multiple sources of snow depth datasets, instead of focusing on selecting a single "best" one, to provide better estimates [2,99].

Conclusions
Aiming at improving the understanding of inconsistences and uncertainties among multiple snow depth datasets and to prepare for data fusion, this study carried out an intercomparison and integrated evaluation on five hemispheric/global snow depth datasets. Spatial and temporal uncertainties, performance under different land cover types, and snow depth ranges were intercompared and examined by 43,391 stations during 1980-2016, and the major conclusions are: (1) Different spatial performances were observed among the five datasets, particularly in the North Eurasia and the Rocky Mountains. AMSR-E and AMSR2, which show different spatial patterns from other datasets in Siberia and the East European Plain, exhibit slight overestimation in the East Siberian Plateau and significant underestimation in the East Eurasian Plain and the Rocky Mountains. The spatial patterns of GlobSnow and ERA-Interim are similar and agree quite well with observations in most areas. MERRA-2 overestimates against ground observations in northwestern parts of Eurasia and the northern part of the Rocky Mountains. (2) Temporal discrepancies are significant that both annual average snow depth and peak snow depth of MERRA-2 are three times deeper than those of AMSR-E and AMSR2 in North America. The evaluation implies that AMSR-E and AMSR2 exhibit the largest annual and monthly uncertainties, followed by MERRA-2, and ERA-Interim and GlobSnow the finest. At seasonal scales, ERA-Interim and GlobSnow perform best during the snow accumulation and melting periods, respectively. (3) The five datasets generally perform best in plain areas, and worst in forest-mountain areas.
Specifically, GlobSnow, ERA-Interim and MERRA-2 perform best in plain and forested areas, while GlobSnow shows the smallest RMSE and RE in mountain and forest-mountain areas. (4) When snow depth is thinner than 10 cm, AMSR-E and AMSR2 correspond better with ground observations than with other datasets. Similar performances were observed in all five datasets when snow depth ranges between 10 and 20 cm. When snow depth is in the range of 30-50 cm, GlobSnow and ERA-Interim exhibit fewer uncertainties. For snow depth thicker than 50 cm, MERRA-2 performs relatively better.
Based on this evaluation, we suggest that future work should focus on the refinement of algorithms and the improvement of the spatial resolution of different data sources to decrease the uncertainty of individual datasets. An effective approach to decrease the general uncertainty of the snow depth datasets is the fusion of different sources to combine their strengths, and this will be our next focus.  Appendix A Figure A1. The spatial distribution of average BIAS of the five datasets from October to May during 1980-2016 over the Northern Hemisphere. Results were grouped into five groups, underestimation larger than 20 cm, between 5 and 20 cm, BIAS less than ±5 cm, overestimation between 5 and 20 cm, and larger than 20 cm, and each group was colored in red, orange, green, blue, and purple, respectively. Figure A1. The spatial distribution of average BIAS of the five datasets from October to May during 1980-2016 over the Northern Hemisphere. Results were grouped into five groups, underestimation larger than 20 cm, between 5 and 20 cm, BIAS less than ±5 cm, overestimation between 5 and 20 cm, and larger than 20 cm, and each group was colored in red, orange, green, blue, and purple, respectively. Remote Sens. 2020, 12, x FOR PEER REVIEW 27 of 33 Figure A2. Average RE of the five datasets from October to May during 1980-2016 over the Northern Hemisphere. RE was classified into four groups, i.e., smaller than 75%, between 75% and 125%, between 125% and 175%, and larger than 175%, and each class was colored in light green, dark green, pink, and blue, respectively (note that in case of misleading results, stations were masked out from the images if the number of valid results was less than one sixth of the total study period). Figure A2. Average RE of the five datasets from October to May during 1980-2016 over the Northern Hemisphere. RE was classified into four groups, i.e., smaller than 75%, between 75% and 125%, between 125% and 175%, and larger than 175%, and each class was colored in light green, dark green, pink, and blue, respectively (note that in case of misleading results, stations were masked out from the images if the number of valid results was less than one sixth of the total study period).
Remote Sens. 2020, 12, x FOR PEER REVIEW 28 of 33 Figure A3. Average correlations between the five datasets with ground observations from October to May during 1980-2016 over the Northern Hemisphere. The correlations were grouped with an interval of 0.25, i.e., less than 0.25, between 0.25 and0.50, between 0.50 and0.75, and larger than 0.75, and each class is colored in blue, green, yellow, and orange, respectively. Figure A3. Average correlations between the five datasets with ground observations from October to May during 1980-2016 over the Northern Hemisphere. The correlations were grouped with an interval of 0.25, i.e., less than 0.25, between 0.25 and0.50, between 0.50 and0.75, and larger than 0.75, and each class is colored in blue, green, yellow, and orange, respectively.