Fusing Precipitable Water Vapor Data in CHINA at Different Timescales Using an Artiﬁcial Neural Network

: Global climate change has noticeable inﬂuences Finally, we generate a new product of PWV in China with a temporal resolution of 1 day, a spatial resolution better than 31 km, and an accuracy better than 2.7 mm, which will serve as a high-quality product for investigating the water vapor redistribution under a changing climate. Conceptualization, methodology, analysis, J.S.; investigation, resources, writing—original J.S.; visualization,


Introduction
Water vapor plays an important role in climate change and water cycling at various scales. In the past decades, global climate change has exerted noticeable influences on the water vapor distribution in China, which is evidenced by the fact that some regions in western and southern China are becoming wetter while some regions in eastern China are becoming drier [1,2]. Besides being an indicator of climate change, water vapor also affects the propagation of radio signals by causing path bending and time delay, which avoid introducing biases before fusing and can fully consider the spatial variation of the systematic biases between different PWV data.
In this study, we use the Generalized Regression Neural Network (GRNN) method proposed by Zhang et al. [24] to fuse the PWV data from GNSS, MODIS, and ERA5 to generate a unified high-quality PWV dataset in the land area of China. Different from Zhang et al. [24] that fused PWV data in North America at an annual scale, this study fuses the PWV data at annual, quarterly, and monthly scales in China in an attempt to find out the best timescale for PWV fusion in China. Through this study, we aim to generate a unified PWV product for China with improved accuracy and spatial resolutions and thus providing a better PWV product. The new PWV product not only can be useful for the application of earth observation such as performing the tropospheric refraction correction of InSAR observation, but it will also allow more detailed monitoring of water vapor over China. Besides, this PWV product can also be a data source of numerical weather forecasting and help researchers better study the circulation of water vapor over China.

Research Area
Our research area is in the land area of China which is shown in Figure 1 with GNSS stations and radiosonde stations. The topography of the research area is high (>4000 m) in the west and decreases sharply to the east. The climate in the research area is complex and diverse because of the tremendous differences in latitude, longitude, and altitude. These conditions lead to complex variations in PWV in the research area. The GNSS PWVs used in this study are collected by 11 stations from the International GNSS Service (IGS) network, 1084 stations from China Meteorological Administration (CMA) GNSS network, and 257 Crustal Movement Observation Network of China (CMONOC) GNSS stations. Six GNSS stations coexist in both the CMONOC and IGS network. There are 89 radiosonde stations used to assess the accuracy of ERA5 temperature, ERA5 pressure, and GNSS PWV. The radiosonde data are derived from the Integrated Global Radiosonde Archive Version 2 dataset and includes pressure, temperature, relative humidity, and other parameters with a temporal resolution of twice or four times daily (ftp://ftp.ncdc.noaa.gov/pub/data/igra/ (accessed on 25 April 2020)).

2.2.
Multi-Source PWV 2.2.1. GNSS PWV As shown in Figure 1a, the GNSS PWVs have three sources, namely, IGS, CMA, and CMONOC. The CMA GNSS PWV is provided directly by the Meteorological Observation Center of CMA which has a time resolution of 1 h. The strategy used to process CMA GNSS data can be found in Liang et al. [25]. The CMONOC and IGS GNSS observations are processed by a Precise Point Positioning (PPP) software developed by Yao Yibing's group at Wuhan University and the principle can be seen in Yao et al. [26]. The IGS Real-Time Service IGS03 track and clock products are used for data processing. Kouba and Heroux [27] method is used to correct antenna phase center migration and variation, phase windings, earth tides, earth rotation, ocean tides, and relativistic effects. The product of Center for Orbit Determination in Europe [28] corrects discrepancies in the differential code. The ionospheric delay is corrected to the first order by using the ionospheric two-frequency phase observation combination. The higher-order error of the ionosphere is ignored. Then, we obtain the GNSS Zenith Total Delay (ZTD) with a time resolution of 30 min. We compare it with the IGS-released GNSS ZTD, and the bias, the standard deviation (STD) and the RMS are −1.0 mm, 8.9 mm, and 9.0 mm, respectively.
The GNSS ZTD includes zenith hydrostatic delay (ZHD) and zenith wet delay (ZWD). The ZHD could be obtained by the Saastamoinen model [29] which needs surface pressure. The transformational relation between ZWD and PWV can be found in Xiong et al. [11] which needs temperature. The needed temperature and pressure are interpolated from ERA5 data and its accuracy is assessed by the Radiosonde data. The linear interpolation in the vertical direction and bilinear interpolation in the horizontal direction are applied to interpolate ERA5 data for a Radiosonde (GNSS) station location to obtain the temperature and pressure. The accuracy of temperature in terms of bias, STD, and RMS are 0.26 K, 2.25 K, and 2.27 K, respectively. It becomes 0.57 hPa, 1.68 hPa, and 1.77 hPa for pressure.
Finally, we obtain the IGS and CMONOC GNSS PWV. We call the PWV processed by PPP software as PPP PWV and PWV from CMA as CMA PWV. The Radiosonde data are also used to assess PPP PWV and CMA PWV. The method to obtain Radiosonde PWV at GNSS station location can be found in Appendix A. The accuracy of PPP PWV in terms of bias, STD, and RMS is 0.9 mm, 1.7 mm, and 1.9 mm, respectively. It becomes −0.3 mm, 1.6 mm, and 1.6 mm for CMA PWV. Both CMA PWV and PPP PWV have an accuracy within 2 mm, which indicates that the GNSS PWV can be used as the reference in the future to calibrate the MODIS PWV and ERA5 PWV for fusion.

ERA5 PWV
The ERA5 data are the latest generation of global climate reanalysis data released by ECMWF [30]. Compared with the previous generation ERA-Interim, the temporal resolution is improved from 6 to 1 h and the spatial resolution improved from 0.75 • to 0.25 • . The height information of the ERA5 grids is obtained from the Shuttle Radar Topography Mission Digital Elevation Model (STRM DEM) Version 4.1 data with a spatial resolution of 250 m [31,32]. The height system in STRM DEM is orthometric height. Therefore, the height information for the ERA5 PWV is based on the orthometric height.

MODIS PWV
MOD05_L2 provides two types of MODIS PWV products: Near Infrared PWV with a spatial resolution of 1 km and Infrared PWV with a resolution of 5 km. We only use the Near Infrared PWV, since the accuracy of Near Infrared PWV is better than Infrared PWV [33,34]. Considering the spatial resolution of ERA5 PWV and GNSS PWV, we downscale the spatial resolution of Near Infrared PWV from 1 km to 5 km, which can effectively reduce the computational burden. In addition, the data quality of the MODIS PWV products is greatly affected by the weather and surface conditions. So only the Near Infrared PWV on land under cloudless conditions is used. To avoid the effect of anomalous data, we exclude the data in the top 0.5% and negative values. We also use the SRTM DEM to provide height information for the MODIS PWV.
In this study, the GNSS, MODIS, and ERA5 PWV are fused to generate a unified PWV product. Since the GNSS PWV is based on the ellipsoidal height while the MODIS and ERA5 PWV are based on the orthometric height, we need to unify the height system. We use the method in Li et al. [35] to make height corrections to the GNSS PWV so that all the PWV data are based on the orthometric height.

GRNN Method for PWV Fusion
We use the GRNN method to fuse GNSS, MODIS, and ERA5 PWV. In addition to the annual fusion as in Zhang et al. [24], we also implement the fusions at the quarterly and monthly timescales to explore more subtle variations in the PWVs.

Generalized Regression Neural Network
The GRNN was proposed by Specht in 1991 [36] which is based on non-parametric regression and uses sample data as a posterior condition. GRNN has a four-layer structure with an input layer, a pattern layer, a summation layer, and an output layer. The spread parameter in the pattern layer is the only hyperparameter in GRNN which needs to be set at first.
The advantage of the GRNN model is that no model parameters need to be trained, so the convergence speed is fast. In addition, the GRNN is based on radial basis function and has good nonlinear approximation performance. The disadvantage of the GRNN is its high computational complexity and space complexity. Figure 2 shows the model structure of the GRNN model for PWV fusion. The input layer has five neurons corresponding to the longitude, latitude, height, time, and MODIS (ERA5) PWV. There is only one neuron in the output layer which is the GNSS PWV or the modified MODIS (ERA5) PWV. The number of neurons in the pattern layer is equal to the number of observations. Considering that ERA5 PWV, MODIS PWV, and GNSS PWV have systematic differences and different error characteristics, we take GNSS PWV as the reference values and use them to improve the MODIS and ERA5 PWV, respectively.

PWV Data Matching
To form input-output samples, we need to interpolate the dense MODIS (ERA5) PWV to the location of GNSS PWV. For temporal interpolation, we interpolate the GNSS PWV and ERA5 PWV to the epoch of MODIS PWV using a simple linear interpolation. For the spatial interpolation, we use a spherical cap harmonic model to fit the MODIS (ERA5) PWV, which is then used to compute PWV at the GNSS stations. For the ERA5 PWV, the pole of the spherical cap is (35. Since the data quality of the samples affects the model performance, simple quality control is performed. We first calculate the bias and STD of the differences between the interpolated MODIS (ERA5) PWVs and GNSS PWVs and then use bias ±3 STD as the threshold to reject any pair whose corresponding difference is larger than the threshold.  Table 1. It is seen that MODIS PWV tends to be larger than GNSS PWV while ERA5 PWV is smaller than GNSS PWV in terms of bias. The accuracy of MODIS PWV in China region is poorer than that of ERA5 PWV in terms of RMS.

Constructing the GRNN Models
After obtaining the MODIS-GNSS PWV pairs and ERA5-GNSS PWV pairs in the research area in 2018 and 2019, we use the GRNN method to fuse them at annual, quarterly, and monthly scales. Because the spread parameter in the pattern layer has a great influence on the model results, we use an enumeration algorithm to determine the optimal spread parameter, and the results can be found in Appendix B. In Appendix B, we set a series of optional spread parameters to test the GRNN model and present the model performance against spread parameter values.
In machine learning, cross-validation is often used to test the accuracy of algorithms. K-fold cross-validation has been widely used in machine learning [37][38][39]. Using K-fold cross-validation, every sample can be used for training or testing. In this study, the 10-fold cross-validation is used to test the model. The ten-fold cross-validation randomly divides the dataset into ten equal parts, and selects one of them as the test data and the other nine as the training data without duplication. After repeating ten times, the average statistical results can be used to evaluate the accuracy of the model.

Model Performance at Annual Timescale
We use the samples in 2018 and 2019, respectively, to train and test the MODIS-GNSS PWV model and the ERA5-GNSS PWV model. We train the models using data in a whole year, and obtain a GRNN model in 2018 and another one in 2019. Table 2 presents the average testing results when training the model based on annual data in 2018 and 2019 with the optimal spread parameter given in Appendix B. Compared to the RMS values of the original datasets given in Table 1, the accuracy in terms of RMS is improved by 1.9 mm for MODIS PWV and 1.0 mm for ERA5 PWV. The bias between the modified MODIS (ERA5) PWV and the GNSS PWV is close to 0, indicating that the systemic difference between them has been eliminated after training. The scatter plots of the modified MODIS (ERA5) PWV versus GNSS PWV in the annual models are shown in Figure 3. Both the modified MODIS and ERA5 PWV have a better correlation with the GNSS PWV.

Model Performance at Quarterly Timescale
We also train and test the GRNN models based on quarterly MODIS (ERA5)-GNSS PWV samples. This means we use the samples in every individual quarter to construct a corresponding quarterly GRNN model. In this manuscript, spring is defined from March to May, summer from June to August, autumn from September to November, and winter from December to February.  Table 3 presents the overall performance of the quarterly models with the optimal spread parameter given in Appendix B. The systematic difference between MODIS (ERA5) PWV and GNSS PWV is also eliminated after training since the bias is close to 0. The accuracy in terms of RMS of the quarterly model is improved by 2.5 mm for MODIS PWV and 1.1 mm for ERA5 PWV, which are better than the improvement achieved by the annual models.  Figure 4 shows the average bias, STD, RMS, and R of the quarterly models in different seasons. The original accuracy in terms of RMS for MODIS PWV in the spring and autumn is 5-6 mm, and it is even worse with RMS increasing to about 7.5 mm in the summer. The quarterly MODIS-GNSS PWV models effectively improve the accuracy of the modified MODIS PWV, as demonstrated in Figure 4c. The accuracy improvement in terms of RMS for MODIS PWV in four seasons is 1.8 mm (spring), 3.3 mm (summer), 3.0 mm (autumn), and 2.3 mm (winter), respectively. The accuracy improvement in terms of RMS in the spring is slightly less than that of the annual model (1.9 mm), but the RMS in the Spring quarterly model is 3.6 mm which is smaller than that of the annual model (3.8 mm). These results indicate that the seasonal differences are accounted for by the quarterly models. As shown in Figure 4d, the R has improved in all four seasons, especially in the winter. For the ERA5-GNSS quarterly models, the accuracy improvement in terms of RMS in four seasons is 0.9 mm (spring), 1.5 mm (summer), 1.3 mm (autumn), and 0.7 mm (winter), respectively. The accuracy improvement in terms of RMS in the summer and autumn is better than that of the annual model (1.0 mm). Although the accuracy improvement in terms of RMS in spring and winter is less than the annual model, the RMSs of the quarterly models in spring and winter are 1.8 mm and 1.3 mm, respectively, which are smaller than that of the corresponding annual model (2.0 mm). Those results indicate that considering the seasonal differences can improve the model performance, which is conducive to the generation of high-quality PWV products in China. The systematic difference between the modified ERA5 PWV and GNSS PWV is close to 0, which indicates that the ERA5-GNSS quarterly model can also eliminate the systematic difference. In addition, as shown in Figure 4h the correlation coefficient R between the modified ERA5 PWV and the GNSS PWV increases to 0.99 by the quarterly models.

Model Performance at Monthly Timescale
From the above experiments, it is seen that the performance of the quarterly models is better than that of the annual models. It is easy to extend the models from quarterly to monthly. Table 4 presents the overall performance of the monthly models with the optimal spread parameter given in Appendix B. The overall accuracy improvement in terms of RMS is 3.1 mm for MODIS PWV and 1.3 mm for ERA5 PWV, which is better than that by the quarterly models (2.5 mm for MODIS PWV and 1.1 mm for ERA5 PWV). The bias between MODIS (ERA5) PWV and GNSS PWV is reduced to 0 in the monthly models, which indicates that the systematic difference is eliminated.  Figure 5 shows the average bias, STD, RMS, and R of the monthly models in different months. For original MODIS PWV, the bias, STD and RMS are increased from winter to summer and then decreased from summer to winter. The original negative bias between GNSS PWV and MODIS PWV in all months indicates that MODIS PWV tends to be larger than GNSS PWV. The accuracy of MODIS PWV in terms of RMS of the monthly models in spring is improved by 2.0 mm for March, 2.4 mm for April, and 2.4 mm for May, respectively, which is better than that of quarterly models (1.8 mm). The results indicate that the MODIS-GNSS monthly models in the spring are better than the corresponding quarterly model. A similar conclusion can be made in other months. The R between MODIS PWV and GNSS PWV is obviously improved.
For ERA5 PWV, the original bias between ERA5 PWV and GNSS PWV shows that ERA5 PWV is larger than GNSS PWV in summer and autumn while smaller than GNSS PWV in spring and winter in China land region. The accuracy improvement in terms of RMS of the monthly models in spring is 0.9 mm for March, 1.1 mm for April, and 1.2 mm for May, respectively, which is not less than the corresponding quarterly model (0.9 mm). Similar conclusion can be made for other months. Those results show that the ERA5-GNSS monthly models could achieve better performance than the corresponding quarterly model. As shown in Figure 5h, the correlation coefficient R between ERA5 PWV and GNSS PWV is also increased.

Generating PWV Products in the Research Area
From the above results, the monthly models achieved the best performance among the models of three timescales. Therefore, we use the monthly models to generate PWV products in the research area. The spatial resolution of the products is 1 km × 1 km in areas where MODIS PWV is present, and 0.25 • × 0.25 • in other areas. The temporal resolution is 1 day. Since the accuracy measured by RMS is 5-6 mm for the MODIS PWV and 3.0 mm for the ERA5 PWV in the research area while the accuracy of the fused PWV is better than 2.6 mm, apparently, the fusion provides a better PWV product for the research area.

Accuracy Analysis
To investigate the temporal variations of the model accuracy, we compute the daily bias, STD and RMS of the original MODIS (ERA5) PWV, and the modified MODIS (ERA5) PWV from the monthly models. The results are shown in Figures 6-8. Figure 6 shows that the original MODIS PWV has overall negative biases with obvious temporal variations. The absolute values of the biases of the original MODIS PWV increased from January to July and then decreased from August to December. Most of the biases of the original ERA5 PWV are negative from December to April while positive in the rest months. There are also obvious temporal variations in the biases of the original ERA5 PWV. The biases of the modified MODIS PWV and the ERA5 PWV are close to 0 without noteworthy temporal fluctuations, which indicates the time-varying systematic biases between the MODIS (ERA5) PWV and the GNSS PWV are eliminated by the new methods.    Figure 7 shows the daily STDs of the original MODIS (ERA5) PWV and the modified MODIS (ERA5) PWV. It demonstrates that the STD of the original MODIS PWV has obvious temporal variations with an increase from January to July and a decrease from August to December. The modified MODIS PWV has apparently reduced STD whose temporal variations are also greatly weakened. The temporal pattern of STD of the ERA5 PWV is similar to that of the MODIS PWV. To investigate the spatial variations of the model accuracy, we compute the bias, STD, and RMS of the original MODIS (ERA5) PWV and the modified MODIS (ERA5) PWV at individual GNSS stations. Figure 9 shows the biases of the original MODIS (ERA5) PWV and the modified MODIS (ERA5) PWV at the GNSS stations. It can be seen that the original MODIS PWV has negative biases at most stations. The biases are greater in the south and smaller in the north. These results suggest that the biases in the original MODIS PWV are uneven in the research area. Comparing Figure 9a,b, the biases between MODIS PWV and GNSS PWV are significantly decreased by the GRNN models, and their distribution becomes more even. Figure 9c shows that most of the biases of the original ERA5 PWV are positive in the south and northeast, and negative in the rest research area. Comparing Figure 9c,d, the biases of the modified ERA5 PWV are significantly reduced and distributed more evenly over the area. Figure 10 shows the STDs of the original MODIS (ERA5) PWV and the modified MODIS (ERA5) PWV at individual stations. The STD of the original MODIS PWV is greater in the southeast while smaller in the northwest. Comparing Figure 10a,b, the STD of modified MODIS PWV is significantly reduced and the distribution is more even. A similar conclusion can be made for the modified ERA5 PWV. Figure 11 shows the RMSs of the original MODIS (ERA5) PWV and the modified MODIS (ERA5) PWV at individual stations. Comparing Figure 11a,b, the RMS of MODIS PWV is reduced after modeling, especially for the stations in the southeast. We can find that the RMS of ERA5 PWV is also reduced after modeling by comparing Figure 11c,d.

Conclusions
Many scholars have used different methods to produce PWV products with high spatial resolution and improved accuracy in different regions. Zhang et al. [15] used a spherical cap harmonic model and Helmert variance component estimation method in North America. Zhao et al. [18] used a hybrid PWV fusion model in China. Alshawaf et al. [21] used a fixed-rank kriging method in Europe. All those methods could achieve better performance, but there are still some shortcomings. Some of the methods ignore the spatial or temporal variations in bias and only give a global bias for all observations. Some methods are based on the interpolation approach, which inevitably imposes some biases or inaccurate information due to the imperfect assumptions for the interpolation. In this study, we generate a PWV product in China area by fusing GNSS PWV, MODIS PWV, and ERA5 PWV in 2018 and 2019 through a generalized regression neural network. Models at annual, quarterly, and monthly timescales are constructed and performances are assessed to find out the best model for the research area. The results demonstrate that the monthly models achieve the best performance among the three timescales. Thus, we produce the fused PWV data by training and testing the GRNN models at a monthly timescale. Therefore, we have obtained a unified PWV product with a temporal resolution of 1 day and a spatial resolution better than 31 km. The bias, STD, and RMS of the modified MODIS PWV are 0.0 mm, 2.6 mm, and 2.6 mm. The percentage improvement is as high as 50% in terms of RMS. It becomes 0.0 mm, 1.7 mm, 1.7 mm for the modified ERA5 PWV and the percentage improvement is 40%. In this product, systematic differences between different PWV sources are almost diminished. Compared to the original MODIS (ERA5) PWV, the accuracy of the modified PWV is more even in space and time. With much improved quality, this PWV product can not only benefit the studies regarding water vapor circulation over China, but also serve in numerical weather forecasting.

Appendix A. Radiosonde PWV Calculation Method
For a radiosonde station to be qualified as a reference station, it is required to be within 60 km in distance and 500 m in height of a GNSS station. The empirical formula in Equation (A1) is used to reduce the PWV value from a radiosonde station at height h 2 to that at a GNSS station at height h 1 .
where PWV h 1 and PWV h 2 are the PWV values corresponding to the heights of h 1 and h 2 in m, respectively.
The transformational relation between ZWD and PWV can be found in Xiong et al. [11]. ZWD is related to water vapor pressure, temperature, and height difference between the two layers described in Formula (A2).
where P w is the water vapor pressure at each grid point in Pascal and can be obtained by Formula (A3), k 1 = 2.21 × 10 −7 K/Pa, k 2 = 3.73 × 10 −3 K 2 /Pa, h is the height difference between the two layers in m, T is the temperature in K.
where RH is the relative humidity. P s is the saturated water vapor pressure which is related to the temperature and can be calculated by the Wexler formula [40,41].

Appendix B. The Model Performance against Spread Parameter Values
The spread parameter in a GRNN model has a great influence on the accuracy of the model, and it should be tuned to make the model perform well. In this study, a series of optional spread parameter values between 0.01 and 0.2 at a step of 0.01 is first set, and each value is tested. The testing results of different models, in terms of the Bias, STD, RMS and R, are shown below.
An appropriate spread parameter should not only make the modified accuracy as high as possible, but also make the fitting accuracy and modified accuracy comparable. The optimal spread parameter values are summarized in Table A1, and we use them to establish the models in this manuscript.