Comparisons of Performance Using Data Assimilation and Data Fusion Approaches in Acquiring Precipitable Water Vapor: A Case Study of a Western United States of America Area

: There are two main types of methods available to obtain precipitable water vapor ( PWV ) with high accuracy. One is to assimilate observations into a numerical weather prediction (NWP) model, for example, the Weather Research and Forecasting (WRF) model, to improve the accuracy of meteorological parameters, and then obtain the PWV with improved accuracy. The other is the direct fusion of multi-source PWV products. Regarding the two approaches, we conduct a comparison experiment on the West Coast of the United States of America with the data from May 2018, in which the WRF data assimilation (DA) system is used to assimilate the Global Navigation Satellite System (GNSS) PWV , while the method by Zhang et al. to fuse the GNSS PWV , ERA5 PWV and MODIS (moderate-resolution imaging spectroradiometer) PWV . As a result, four groups of PWV products are generated: the assimilated GNSS PWV , the unassimilated GNSS PWV , PWV from the fusion of the GNSS PWV and ECWMF (European Centre for Medium-Range Weather Forecasts) ERA5 (ECWMF Reanalysis 5) PWV , and PWV from the fusion of the GNSS PWV , ERA5 PWV and MODIS PWV . Experiments show that the data assimilation based on the WRF model (WRFDA) and adopted fusion method can generate PWV products with similar accuracy (1.47 mm vs. 1.52 mm). Assimilating the GNSS PWV into the WRF model slightly improves the accuracy of the inverted PWV by 0.18 mm. The fusion of the MODIS PWV , GNSS PWV and ERA5 PWV results in a higher accuracy than the fusion of GNSS PWV and ERA5 PWV by a margin of 0.35 mm. In addition, the inland canyon topography appears to have an inﬂuence on the inversion accuracy of both the methods.


Introduction
Water vapor makes up less than 4% of the atmosphere's mass, but it plays an important role in atmospheric processes at all scales. When passing through the neutral atmosphere, electromagnetic waves will be affected by the water vapor, resulting in velocity delay and path bending. Precipitable water vapor (PWV) is the most commonly used term to express the amount of atmospheric water vapor. It is defined as the total atmospheric water vapor contained in a vertical column of unit cross-sectional area extending between any two specified levels and commonly expressed in terms of the height to which that water substance would stand if completely condensed and collected in a vessel of the same unit cross section [1,2]. Obtaining high-precision precipitable water vapor is conducive to not only the description of the atmospheric processes, but also the high-precision inversion and interpretation of Earth observations. There are many different sources of PWV products, such as numerical weather results are compared. The assessment of these two methods by comparing the differences in the PWV products could help data assimilation researchers to better understand new fusion methods and possibly obtain more accurate assimilation data sources; on the other hand, it could help data fusion researchers to better understand the data assimilation approach and use data of various sources better in the fusion.
In the following, the research area and the multi-source PWV data are presented first. The WRFDA approach and the data fusion method by Zhang et al. [2] are briefly discussed in Section 3. Section 4 presents the results, followed by the conclusion in Section 5.

Research Area
The research area is shown in Figure 1, with latitude from 33 • N to 38 • N and longitude from 118 • W to 123 • W, covering southern California and part of Nevada. The area contains Central Valley stretching through the middle of California from just north of Sacramento to Los Angeles. The Sierra Nevada in eastern California contains a large number of peaks over 3962 m of altitude, including Mount Whitney (4421 m). There are more than 20 GNSS stations, two radiosonde stations and abundant MODIS data in this range, which is well suited for experimenting the numerical DA and data fusion methods. In addition, the topography of the study area is very undulating, which brings some challenges to the experiment.

Multi-Source Data
The data used in this paper include reanalysis data from the numerical weather prediction model (e.g., ECWMF, NCEP), the GNSS PWV and the MODIS PWV. In the data assimilation experiment, because the default background field error file provided by WRFDA was generated based on NCEP data, we choose NCEP FNL (final) Operational Global Analysis Data as the background data with a temporal resolution of 6 h and a spatial resolution of 1 • × 1 • . The reanalysis data is produced by the Global Data Assimilation System (GDAS). In the data fusion experiment, we use the PWV data provided by ECWMF ERA5, which has a temporal resolution of 1 h and a spatial resolution of about 31 km.
The GNSS PWV data used in the experiment were derived from SuomiNet Network, with a time resolution of 30 min. The specific GNSS PWV processing and acquisition method can be referred to Zhang et al. [2].
The MODIS PWV includes NIR (near infrared) PWV and IR (infrared) PWV, with resolutions of 1 km and 5 km, respectively. The accuracy of the NIR PWV is higher than that of the IR PWV. The MODIS PWV is vulnerable to cloud and rain, and its quality is relatively poor against GNSS PWV. Therefore, it is necessary to assess its quality. First, the quality control information provided with the MODIS PWV is used to screen out the PWV data under cloudy and sunny weather. Then, negative PWV and abnormally large PWV are deleted. Detailed data processing methods are presented in Zhang et al. [2]. Altitude information of the MODIS PWV is obtained by interpolating STRMDEM (Shuttle Radar Topography Mission Digital Elevation Model) Version 4.1 data which has a resolution of 250 m [17,18].
In this paper, data from two radiosonde stations is also used. The data of the station near VAN6 is in the IGRA2 (Integrated Global Radiosonde Archive) radiosonde data files, while that of the station near P224 is provided by the University of Wyoming.

WRF Data Assimilation
In the experiments of this paper, WRF Version 3.7 and WRFDA 3-D variational model are adopted to assimilate the GNSS PWV provided by SuomiNet data [19]. Considering that the accuracy of the GNSS PWV is generally at 1-2 mm level, those with accuracy lower than 2 mm are excluded. The background data is from the NCEP FNL (final) Operational Global Analysis whose time resolution is 6 h and spatial resolution is 1 • × 1 • . The pressure of the top layer is set to 50 hPa. There are 10 layers in the planetary boundary layer (PBL). In the vertical direction, 37 isobaric layers are formed, and all the altitudes in the experiments are converted into the normal height system. The flowchart of the WRFDA experiment is shown is Figure 2. The background field would be processed by the WPS (WRF preprocess system) and real.exe. The GNSS PWV is written in the Little_R format. The default background data error file be.dat of WRFDA is used. The MODIS PWV is not assimilated into the WRF model due to its poor accuracy. When the GNSS PWV file, background data and background error information file are prepared, we run WRFDA. The results are then processed by NCAR (National Centre for Atmospheric Research) Command Language (NCL) [20], and the PWV after the assimilation is obtained by using the following equation: where Π is a dimensionless conversion factor, and it is calculated by Equation (2): where water  is the density of water,  [21]. The ZWD in Equation (1) is calculated by the method presented in Vedel and Huang [22].
where p is the pressure in Pa, q the specific humidity in g/g, T the temperature in K, It is noted that the assimilation results are unrelated to the physical parameter setting of the WRF [23]. However, it would still be useful to examine whether the grid resolution used in the assimilation experiment has any significant effect on the results. As such, we set four grid resolutions of 5000 m, The background field would be processed by the WPS (WRF preprocess system) and real.exe. The GNSS PWV is written in the Little_R format. The default background data error file be.dat of WRFDA is used. The MODIS PWV is not assimilated into the WRF model due to its poor accuracy. When the GNSS PWV file, background data and background error information file are prepared, we run WRFDA. The results are then processed by NCAR (National Centre for Atmospheric Research) Command Language (NCL) [20], and the PWV after the assimilation is obtained by using the following equation: where Π is a dimensionless conversion factor, and it is calculated by Equation (2): where ρ water is the density of water, R w = 461 (J × kg −1 × K −1 ) the water vapor ratio constant, k = (3.776 ± 0.014) × 10 5 K 2 × hPa, k = 16.48 K × hPa −1 and T m is the weighted average temperature computed with the method by Yao et al. [21]. The ZWD in Equation (1) is calculated by the method presented in Vedel and Huang [22].
where p is the pressure in Pa, q the specific humidity in g/g, T the temperature in K, k 1 = 2.21 × 10 -7 K/Pa, k 2 = 3.73 × 10 −3 K 2 /Pa, h the layer height in m. It is noted that the assimilation results are unrelated to the physical parameter setting of the WRF [23]. However, it would still be useful to examine whether the grid resolution used in the assimilation experiment has any significant effect on the results. As such, we set four grid resolutions of 5000 m, 7500 m, 10,000 m and 12,500 m, respectively, along the east-west and south-north directions with the center at (35.5 • N, 120.5 • W), and the grid numbers are thus 112 × 112, 74 × 74, 56 × 56 and 46 × 46, respectively. Radiosonde data is treated as true value. Since the radiosonde is launched only at 0:00 UTC and 12:00 UTC daily, the experiments can only be made at those times (by the way, the radiosonde data at 12:00 UTC on 6 May and 0:00 UTC on 7 May is missing). With this setup, we can obtain four PWV result sets with respect to the four grid resolutions at the radiosonde locations.
In this article, bias means the average of the data, STD means the standard deviation, and RMS means the root mean square. Bias, STD and RMS are used as the accuracy measures which are computed by Equations (4)-(6), respectively.
wherex i is the reference value, x i denotes the value estimated by the model and N is the number of observations. The differences between the inverted and reference PWV values for the four grid resolutions, given in Table 1, appear only at the second decimal place, which indicates that the grid resolution has little influence on the accuracy of the inverted PWV. Secondly, we also checked the temperature (T), pressure (P) and relative humidity (RH) of the experimental results by using the radiosonde data, and the differences are listed in Table 2. The differences between the maximum and minimum RMS values of the four groups were 0.01 K, 0.02 hPa and 0.1%, respectively, for T, P and RH. It can be seen that the resolution of the assimilation grid has little influence on inverted temperature, pressure and relative humidity. To sum up, within the experimental area, the resolution of the assimilation grid has little influence on the accuracy of the inverted PWV, temperature, pressure and relative humidity from the assimilation results. Since the grid resolution of 5000 m results in the minimum RMS for the inverted PWV, in the subsequent experiments, we set the grid resolution to 5000 m in the assimilation.

Data Fusion
We use the method by Zhang et al. [2] in this paper to perform data fusion experiments since the combined PWV obtained by this method can not only reflect the features of the individual PWV maps but also suppress regional deviations. The core idea is to use a spherical cap harmonic (SCH) model to describe the PWV field on the sphere and to use the Helmert variance component estimation (HVCE) to determine the weights for data of different sources.
Before fusion, data from different sources need to be corrected for systematic differences. We use the GNSS PWV data to calibrate the ERA5 PWV and MODIS PWV data since the accuracy of the GNSS PWV is as high as 1-2 mm [13,[24][25][26]. The way to determine the systematic bias is as follows.
First, an SCH model is used to fit the MODIS (or ERA5) PWV, and then the fitted model is used to calculate the PWV at the GNSS stations. We take the average of the differences between GNSS PWV and MODIS (or ERA5) PWV as the systematic bias.
The SCH model is of the following form: where a = 6,378,137 m is the radius of the Earth, r, θ, λ are the geocentric distance, co-latitude and longitude of a point on the Earth's surface in the spherical coordinate system. The PWV data will be expanded onto the sphere of radius equal to r. P m n k (m) is the associative Legendre function of the first kind of boundary value problem. n k (m) and m are the order and degree of the SCH model, respectively. k is an integer used to label the order of n k (m). According to boundary conditions proposed by Haines [27], m needs to be a positive integer, n k (m) is real and k ≥ m ≥ 0. N is the highest order of the SCH model expansion. g m k and h m k are the model coefficients representing the amplitude of the harmonic which are to be determined. When m is equal to 0, sin(mλ) is equal to 0, it is impossible to determine the value of h 0 k . Thus, there are (N + 1) 2 coefficients in total to be determined. As data fusion involves data from different sources with different accuracy, they should be weighted appropriately. The a priori information on the data accuracy is usually not well known, the posterior method would be needed to determine the weight relation of various data [28]. A common one is the variance component estimation technique, which not only can estimate the variances of observations with different accuracy, but also provides a regularization method [29]. Here the Helmert variance component estimation method is used to determine the weights of PWV from different sources. The data fusion flowchart is shown in Figure 3. As shown, we first use the spherical cap harmonic function to fit ERA5 (MODIS) PWV, and then the function is used to obtain the interpolated PWV at GNSS stations. The systematic bias is now determined as the average of the differences between GNSS PWV and interpolated ERA5 (MODIS) PWV, and it is used to correct the ERA5 (MODIS) PWV. Next, we determine the weights for PWV of different sources by HVCE. Finally, the PWV data of difference sources is fused with the SCH model, in which the model coefficients are estimated. It is noted that the coordinate transformation, the calculations of n k (m) and the Legendre polynomials are important in the use of the SCH model to obtain fused PWV.
As data fusion involves data from different sources with different accuracy, they should be weighted appropriately. The a priori information on the data accuracy is usually not well known, the posterior method would be needed to determine the weight relation of various data [28]. A common one is the variance component estimation technique, which not only can estimate the variances of observations with different accuracy, but also provides a regularization method [29]. Here the Helmert variance component estimation method is used to determine the weights of PWV from different sources.
The data fusion flowchart is shown in Figure 3. As shown, we first use the spherical cap harmonic function to fit ERA5 (MODIS) PWV, and then the function is used to obtain the interpolated PWV at GNSS stations. The systematic bias is now determined as the average of the differences between GNSS PWV and interpolated ERA5 (MODIS) PWV, and it is used to correct the ERA5 (MODIS) PWV. Next, we determine the weights for PWV of different sources by HVCE. Finally, the PWV data of difference sources is fused with the SCH model, in which the model coefficients are estimated. It is noted that the coordinate transformation, the calculations of n m and the Legendre polynomials are important in the use of the SCH model to obtain fused PWV.  The treatment of the uncertainty associated with different data sources in Zhang et al. (2019)'s method includes two aspects. One is regarding the average difference between GNSS PWV and ERA5 (MODIS) PWV as systematic bias; another is to use HVCE to determine weights for PWV data of different sources.
It is now possible to perform the data fusion experiments. In this paper, the ERA5 PWV and GNSS PWV are fused by a SCH model of order 7, and the order is 8 for the fusion of the ERA5 PWV, GNSS PWV and MODIS PWV.

Results
Due to the limited resolution of NCEP FNL (final) Operational Global Analysis data, its combination with the GNSS PWV in the experimental area would be rank deficit if using a high-order model, and inaccurate if using a low-order model. Therefore, the GNSS PWV and ECWMF REA5 data are fused. In addition, in the original text of Zhang et al. [2], MODIS PWV is also fused with the GNSS PWV and ERA5 PWV. As a result, we set out to perform four experiments. The first one is to assimilate the GNSS PWV into the WRF model, and the result is labeled as DA_PWV. Compared with DA_PWV, the PWV obtained from the WPS and real.exe is labeled as None_PWV which has no assimilation involved. Fuse_PWV is the result from the fusion of the GNSS PWV and ERA5 PWV, while M_Fuse_PWV is that of the GNSS PWV, ERA5 PWV and MODIS PWV.
Because there is no overlap time period between the fused PWV and radiosonde PWV data, some of the high accuracy GNSS PWV is used as the reference to examine the fusion performance. We choose the GNSS PWV data at stations P729, P056 and LUTZ, which are marked with a red box in Figure 1, as reference, and they are excluded in the assimilation and fusion experiments. The assimilation experiment is carried out for the same time period as of the fusion experiment throughout May 2018.
For the None_PWV and DA_PWV results, the bilinear interpolation and altitude correction are applied to obtain the PWV at the three reference stations. The temperature, pressure and humidity information are also obtained from the results of the respective experiments. The M_Fuse_PWV and Fuse_PWV at the three stations are computed from their fusion models, respectively. The comparison between the experiment results and the reference data is shown in Figure 4, and the bias, STD and RMS values are given in Table 3.    The three reference stations are widely separated in the study area, with LUTZ in the upper left corner, P729 in the lower right corner and P056 in the middle right edge of an intermountain valley. The fused or assimilation-inverted results at the three stations are all close to their reference, which shows the reliability of the results.
For the P729 station, the difference between the None_PWV group and DA_PWV group in the period from days of year (DOY) 132 to 150 is large, while the differences during other assimilation time periods are small. The M_Fuse_PWV is close to the Fuse_PWV on the whole, but the differences in the period between DOY 130 and 132 are obvious. For this station, the Fuse_PWV result is slightly smaller than the reference, and the M_Fuse_PWV result is larger than the reference.
The For the P056 station, the Fuse_PWV result is also smaller than the reference, while the M_Fuse_PWV one is larger than the reference, and it is closer to the reference than the Fuse_PWV group. The None_PWV group is very close to the DA_PWV group. Since the absolute value of bias of None_PWV is smaller than Fuse_PWV, the closeness of the None_PWV to the reference is better than that of the Fuse_PWV group.
For the P729 station, the minimum absolute bias occurs in the M_Fuse_PWV group, and the maximum and minimum RMS values are in the M_Fuse_PWV group and Fuse_PWV group, respectively. The maximum and minimum values of RMS at LUTZ station are in DA_PWV and M_Fuse_PWV, respectively. For the P056 station, the maximum RMS is in the None_PWV group, and the minimum in the M_Fuse_PWV group.
Because of the existence of the bias, the RMS is a more important measure of the result accuracy, with the minimum RMS indicating the best, and the maximum the worst. From the overall RMS value given in the last column, the M_Fuse_PWV result is the best among the four groups. With respect to the individual stations, both P056 and LUTZ have their minimum RMS values with M_Fuse_PWV, but P729 s minimum RMS is with Fuse_PWV, which is significantly smaller than the station's maximum RMS unexpectedly from M_Fuse_PWV. This shows that, although there are only three reference stations, the performance in terms of the RMS is not uniform with the data processing methods.
In general, we may conclude that the background field accuracy can be improved when the GNSS PWV is assimilated into the NWP model. The absolute values of bias, STD and RMS in the DA_PWV group are less than those in the None_PWV group. For the two PWV fusion products, the accuracy of the M_Fuse_PWV is higher than that of the Fuse_PWV, suggesting that the fusion of data from more sources is beneficial to the accuracy of resulted PWV. The overall RMS values of the Fuse_PWV group and M_Fuse_PWV group are all smaller than that of the None_PWV group, which indicates that, to some extent, appropriate data fusion can achieve better PWV field than that of the WRF model without any assimilation.
The overall RMS of the DA_PWV is close to that of the Fuse_PWV but has a large difference with that of the M_Fuse_PWV. In particular, for the two stations P729 and LUTZ located in the coastal area, the two-source Fuse_PWV results are slightly better than those of the assimilation group (DA_PWV), but for the station P056 in the inland valley, the DA_PWV results are better than the Fuse_PWV results. There are two possible explanations. First, before the fusion, the GNSS PWVs have been used to correct the ERA5 PWV for the system difference, while in the assimilation, the system difference between GNSS PWVs and the background field is difficult to correct. Second, the ECWMF ERA5 product has a better spatial resolution (0.25 • × 0.25 • ) than the NCEP FNL (1 • × 1 • ). However, the same correction is applied to all stations to eliminate the system difference between ERA5 PWV and GNSS PWV before the fusion, which may result in a relatively large residual deviation for stations in areas of large topographical fluctuations, and this in turn would affect the fusion accuracy for these stations.

Conclusions
This paper presents a comparison study about the accuracy of the assimilated and fused PWV products for a west coast area of the United States of America. The former is obtained from the assimilation of the GNSS PWV into the WRF model, and the latter by the fusion of the GNSS, ERA5 and MODIS PWV data using the method of Zhang et al. [2]. For the study area, high-precision PWV products are obtained from both approaches. The results show that the high precision could be achieved by either adjusting meteorological parameters or directly fusing multi-source data, with accuracy of 1.47 mm by the former and 1.52 mm by the latter. It is seen that the assimilation has reduced the RMS of PWV products from 1.65 to 1.47, a significant improvement in the PWV accuracy, although the temperature, pressure and relative humidity have little change. It also shows that, the fusion of data from three sources including the MODIS PWV clearly outperforms the fusion without the MODIS PWV with the RMS values of 1.17 and 1.52, respectively, suggesting more data sources are helpful to achieve better accuracy.
The other finding from the accuracy comparison at the three reference stations is that, the accuracy at the P056 station located in the inland valley is worse than those at the P729 and LUTZ stations along the coast, indicating that complex topography may affect the performance of the applied data assimilation and data fusion methods.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.