1. Introduction
Soil moisture is an essential parameter of the global surface water cycle and a physical surface quantity that has long attracted research interest. Monitoring soil moisture on a large scale is important for agriculture, hydrology, and the geographic environment [1]. It also plays a vital role in the climate system and in extreme events such as droughts, floods, and inundation. Extreme weather is relatively short-lived, whereas soil moisture has a long memory compared with the atmosphere, which makes it especially useful at seasonal time scales [2]. For this reason, it is of great practical importance to achieve soil moisture inversion at a large scale, with high accuracy, low cost, and high spatial and temporal resolution.
The advent of remote sensing technology has made it possible to estimate and monitor surface parameters over large areas. Remote sensing satellites can measure soil moisture with uniform accuracy over large spatial scales at constant revisit intervals. Commonly used techniques include optical and microwave remote sensing. Among those capable of measuring soil moisture, microwave remote sensing is the most promising: it senses the dielectric properties of moist soil and offers shorter revisit times [3,4]. Soil Moisture and Ocean Salinity (SMOS), launched by the European Space Agency, and Soil Moisture Active Passive (SMAP), launched by NASA in 2015, are the predominant soil moisture missions today [5,6]. Both carry L-band instruments and monitor soil moisture in the top 5 cm globally. SMOS provides soil moisture retrievals with a spatial resolution of about 40 km and a revisit time of 2–3 days, with an accuracy requirement of 0.04 cm³ cm⁻³ [7]. Meanwhile, the SMAP mission provided a higher-resolution product with 3 km spatial resolution and a 2–3 day revisit time with the help of a rotating antenna, until the hardware failed in mid-2015 [8,9].
However, microwave wavelengths are hundreds to millions of times longer than those of visible and infrared light, so microwave soil moisture products have a low spatial resolution and cannot represent localized soil moisture variations in detail. For this reason, a large number of downscaling studies based on the SMOS and SMAP missions have been carried out, and improving the spatial and temporal resolution of soil moisture products through algorithms has become an ongoing research hotspot. Based on the SMAP mission, Piles et al. combined a relatively noisy 3 km radar backscatter coefficient with a more accurate 36 km radiometer observation, generating an optimal 10 km soil moisture product that performs better than the radiometer alone [10]. On this basis, Piles et al. combined the accuracy of SMOS observations with the high spatial resolution of visible/infrared satellite data, effectively capturing soil moisture variability at spatial scales of 10 and 1 km without a significant increase in root mean square error [11]. With the SMAP mission, Narendra et al. combined coarse-scale radiometry with radar observations that resolve fine-scale spatial heterogeneity to produce a high-resolution optimal soil moisture estimate at 9 km, further improving the spatial resolution and accuracy of soil moisture [12]. Knipper et al. combined the SMOS and SMAP missions with information from a high spatial resolution imaging spectrometer to obtain higher-resolution (1 km) soil moisture estimates [13]. However, microwave reflections from the soil surface are affected not only by the state of soil moisture but also by environmental factors such as surface roughness, vegetation, and interactions with the atmosphere. Soil moisture inversion using microwave remote sensing is therefore susceptible to errors from factors unrelated to soil moisture, which may lead to inaccurate results [14,15]. Better methods are thus needed to obtain soil moisture products with high spatial and temporal resolution.
The advent of Global Navigation Satellite Systems (GNSS) has provided a new paradigm for monitoring long time series of soil moisture. Like the microwave missions above, GNSS operates in the L-band. The difference is that this technique exploits the interference, at the ground receiver, between the direct signal emitted by a GNSS satellite and the signal reflected by the ground. The interference pattern carries changes caused by differences in the ground surface and can therefore be used to monitor physical parameters of the Earth’s surface [16,17,18,19]. In addition, GNSS Reflectometry (GNSS-R) and GNSS Interferometric Reflectometry (GNSS-IR) have gradually been developed on this basis [20,21]. Because of advantages such as all-weather, all-day operation and high spatial and temporal resolution, they have been widely used to study soil moisture, sea surface wind fields, sea tides, snow depth, and vegetation change [22,23,24,25,26,27]. For soil moisture research, the techniques can be further divided into ground-based GNSS-IR, airborne GNSS-IR, and satellite-based GNSS-R, depending on the location of the GNSS receiver. It has been shown that the effective sensing area of ground-based GNSS-IR reaches at least 120 m² and can exceed 1000 m² when multiple satellite tracks are combined. It effectively monitors soil moisture at depths of 0 to 5 cm, achieving high-accuracy inversion from bare soil to vegetation cover [28,29]. However, the relatively sparse distribution of GNSS stations means that ground-based GNSS-IR alone cannot achieve spatially continuous soil moisture monitoring. In terms of airborne GNSS-IR, Sánchez et al. measured ground soil moisture with an airborne GNSS-IR technique and analyzed it jointly with high-spatial-resolution maps of reflectance, surface temperature, and digital surface models; the experiments showed that topography has an important influence on GNSS-IR signals [
30]. Castellvi et al. combined hyperspectral imagery with airborne GNSS-IR to invert soil moisture and compared the result with soil moisture estimates from the Airborne radiometer at L-band (ARIEL) to obtain a high-resolution soil moisture product [31]. However, the airborne GNSS-IR technique is subject to flight limitations and covers a relatively small range. For this reason, satellite-based GNSS-R technology, which places GNSS receivers on a small satellite constellation for soil moisture monitoring, was developed [32]. It has become a hot topic of current research because of its high spatial resolution and short revisit time. Kim et al. developed a relative signal-to-noise ratio (rSNR) for deriving terrestrial soil moisture from satellite-based GNSS-R; combining the rSNR with soil moisture values from SMAP gives daily soil moisture estimates [33]. Clarizia et al. also used the reflectivity provided by satellite-based GNSS-R, combined with auxiliary vegetation and roughness information from the SMAP mission, to give daily soil moisture estimates on a 36 km × 36 km grid [34]. In addition, a few authors have replaced traditional algorithms with machine learning methods and combined them with a small amount of auxiliary data to improve the spatial and temporal resolution of soil moisture estimation. Fernández et al. proposed an algorithm that trains a neural network on measured data to invert soil moisture from SMOS observations, and experiments showed that the neural network is an effective nonlinear regression tool [35]. Eroglu et al. proposed an artificial neural network-based method to retrieve daily soil moisture; soil moisture data from ground-based GNSS-IR and other auxiliary data, including the normalized difference vegetation index (NDVI), vegetation water content (VWC), terrain elevation, terrain slope, and the h-parameter (surface roughness), were input into the model, yielding daily soil moisture estimates on a 9 km × 9 km grid [36]. Yuan et al. used point-surface fusion with generalized regression neural networks, combining SMAP and multiple in situ soil moisture observations to build a soil moisture estimation model that ultimately improved the accuracy of the 9 km SMAP product [37]. A follow-up study used the triple collocation technique to overcome the scale mismatch problem caused by a small spatial extent and used neural networks to combine brightness temperature data from SMAP with other auxiliary data to build soil moisture estimation models [38]. Cui et al. used a generalized regression neural network to combine soil moisture data from the Fengyun-3B satellite with surface temperature, NDVI, albedo, a digital elevation model, longitude, and latitude; they improved the spatial and temporal resolution of the Fengyun-3B product from 0.25° and 2–3 days to 0.05° and one day [39]. However, most of these studies started from the spatiotemporal resolution of existing products and improved either the spatial or the temporal resolution of the dataset; they relied heavily on the spatiotemporal resolution of existing satellite-based GNSS-R products without comprehensive improvement. In addition, the influence of surface environmental elements was not fully considered when selecting auxiliary data: rainfall, altitude, and other important factors were not used as inputs.
In this paper, we propose a multi-data fusion method based on machine learning that combines ground-based GNSS-IR soil moisture data with surface environmental data. A multi-data fusion soil moisture model is constructed to obtain a daily, spatially continuous soil moisture product at 500 m resolution. The surface environmental data include latitude and longitude, NDVI, rainfall, air temperature, land cover type, and four topographic factors (elevation, slope, slope direction, and shading). Since these surface environmental data and soil moisture are related in a complex non-linear manner, it is difficult to fuse multiple data types and map soil moisture using traditional linear statistical regression algorithms. Compared with traditional algorithms, machine learning techniques excel at complex non-linear problems. In particular, the back-propagation (BP) neural network optimized by the genetic algorithm (GA), referred to as the GA-BP model, is highly stable and fits well. Therefore, we input the processed data into the trained GA-BP neural network model and obtained daily 500 m soil moisture maps for the 15 days from 15 February 2014 to 1 March 2014 for the western coast of the United States.
The rest of the paper is organized as follows.
Section 2 describes the ground-based GNSS-IR data, the National Aeronautics and Space Administration and U.S. Department of Agriculture (NASA-USDA) product used for validation, and the network structure of the GA-BP model.
Section 3 describes the study area and the pre-processing of soil moisture data and related geoenvironmental elements from GNSS-IR stations.
Section 4 compares and analyzes the accuracy of the soil moisture products generated by the GA-BP model and discusses the feasibility of the method.
Section 5 gives the conclusion of the paper and provides an outlook for future research.
2. Materials and Methods
2.1. PBO Project
Construction of the National Science Foundation’s (NSF) Plate Boundary Observatory (PBO) began in 2004 and was completed in 2008 [40]. The PBO is a core network of 1100 continuously operating GNSS stations. The network also contains 1 Hz and 5 Hz high sampling rate stations capable of observing millimeter-level changes in GNSS station locations over days to years. The underlying principle is that GNSS satellites transmit L-band microwave signals (~1.2 and ~1.5 GHz). The ground receiver antenna receives both direct and reflected signals, and the reflected signals vary with soil moisture, snow depth, and vegetation conditions. The changes in surface reflections are recorded in the signal-to-noise ratio (SNR) data, which are then processed to quantify soil moisture, snow depth, and vegetation growth. The network is the only one operating on the principle of GNSS-IR technology. Soil moisture data from the PBO project can be downloaded from the International Soil Moisture Network (ISMN) website and are also available on the PBO data portal. The basic parameters of all stations of the PBO network in the study area are shown in Table 1.
2.2. NASA-USDA Soil Moisture Data
The National Aeronautics and Space Administration Goddard Space Flight Center (NASA GSFC) provides the NASA-USDA global soil moisture data. The product is generated by assimilating SMOS Level 2 soil moisture observations into a modified two-layer Palmer model using a 1-D Ensemble Kalman Filter (EnKF) data assimilation method [41]. Because SMOS itself has a low resolution, the NASA-USDA global soil moisture data derived from it also has a low spatial resolution of only 0.25° × 0.25°. This dataset includes both surface and subsurface soil moisture, but GNSS-IR soil moisture can only reflect the variation of soil moisture within 1–6 cm of the soil surface. Therefore, the surface soil moisture of the NASA-USDA global soil moisture data is selected as the initial comparison data for the point-surface fusion results. This paper obtained the NASA-USDA global soil moisture data through Google Earth Engine (https://developers.google.com/earth-engine/datasets/catalog/NASA_USDA_HSL_soil_moisture#description, accessed on 6 March 2021).
2.3. GNSS-IR Technology for Inversion of Soil Moisture
SNR observations, an index describing the quality of the signal received by GNSS antennas, are the core computational data of GNSS-IR technology. SNR is mainly influenced by a combination of factors such as receiver antenna gain, satellite elevation angle, and multipath effects [42].
The GNSS receiver receives both direct and reflected signals from GNSS satellites. The continuous movement of the satellites makes the direct and reflected signals change constantly, so the characteristic parameters of the interference waveform change over time, and ordinary geodetic receivers record these changes in the form of SNR [43]. Therefore, soil moisture can be estimated from SNR through the change in the characteristic parameters of the interference effect. The ground multipath error model is shown in Figure 1.
From Figure 1, the reflected signal travels an extra path of $2h\sin\beta$ relative to the direct signal, so the phase difference between the direct and reflected signals of the GNSS satellite can be deduced as:

$$\alpha = \frac{4\pi h}{\lambda}\sin\beta \tag{1}$$

where $h$ denotes the vertical height of the GNSS receiving antenna above the ground, $\beta$ denotes the angle between the GNSS signal and the ground surface, and $\lambda$ is the L-band carrier wavelength. Further study reveals that the SNR can be expressed in terms of the direct and reflected signals as follows:

$$SNR^2 = A_d^2 + A_m^2 + 2A_dA_m\cos\alpha \tag{2}$$

In Equation (2), $\alpha$ denotes the phase difference between the direct and reflected signals of GNSS satellites, and $A_d$ and $A_m$ denote the amplitudes of the direct and reflected signals, respectively. Chew et al. [44] found that after removing the direct signal amplitude $A_d$, only the reflected signal amplitude $A_m$ is retained in the SNR observations. The remaining reflected component $SNR_m$ follows a sine or cosine relationship with $\sin\beta$, which can be expressed as:

$$SNR_m = A_m\cos\left(\frac{4\pi h}{\lambda}\sin\beta + \varphi\right) \tag{3}$$

In Equation (3), $\varphi$ denotes the relative delayed phase. Once $SNR_m$ is known, we can use Lomb-Scargle spectral analysis to find the frequency $f = 2h/\lambda$ and then solve for the amplitude $A_m$ and the relative phase $\varphi$ using a least squares fit. Chew et al. [45] found a strong correlation between $\varphi$ and soil moisture values, which makes it the best parameter for inversion of surface soil moisture.

Based on this, Chew et al. [46] smoothed the phase $\varphi$ using a moving average filter to remove the expected phase variation due to vegetation, producing a phase $\Delta\varphi$ that reflects only the variation in soil moisture, and added the residual water content of the soil ($SM_{res}$). Soil moisture $SM$ is then related to the phase $\Delta\varphi$ as follows:

$$SM = S\cdot\Delta\varphi + SM_{res} \tag{4}$$

In Equation (4), $S$ is the expected slope (between soil moisture and phase), $\Delta\varphi$ indicates the phase change due to soil moisture, and $SM_{res}$ is available through public data [47].
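The retrieval chain of this subsection can be illustrated with a minimal sketch. It assumes the reflected SNR component has already been separated, that the antenna height is known, and that the series is noise-free and synthetic. The function names, the wavelength of 0.244 m, the antenna height of 2 m, and the slope and residual values are illustrative placeholders, not values from the PBO processing. The fit rewrites A·cos(ωx + φ) as a·cos(ωx) + b·sin(ωx) and solves the resulting 2 × 2 normal equations.

```python
import math

def fit_amplitude_phase(sin_elev, snr_m, height_m, wavelength_m):
    """Least-squares fit of snr_m = A*cos(w*sin_elev + phi), with w = 4*pi*h/lambda known."""
    w = 4.0 * math.pi * height_m / wavelength_m
    # A*cos(w*x + phi) = a*cos(w*x) + b*sin(w*x), where a = A*cos(phi), b = -A*sin(phi)
    scc = scs = sss = scy = ssy = 0.0
    for x, y in zip(sin_elev, snr_m):
        c, s = math.cos(w * x), math.sin(w * x)
        scc += c * c
        scs += c * s
        sss += s * s
        scy += c * y
        ssy += s * y
    det = scc * sss - scs * scs          # determinant of the 2x2 normal matrix
    a = (scy * sss - ssy * scs) / det
    b = (ssy * scc - scy * scs) / det
    return math.hypot(a, b), math.atan2(-b, a)   # amplitude A, phase phi (radians)

def phase_to_soil_moisture(delta_phase_deg, slope, sm_residual):
    """Linear relation of the form SM = S * delta_phi + SM_res."""
    return slope * delta_phase_deg + sm_residual

# Synthetic demonstration (all numbers below are assumptions for illustration).
WAVELENGTH_M = 0.244   # assumed L-band carrier wavelength
HEIGHT_M = 2.0         # assumed antenna height above the ground
elev_deg = [5.0 + 25.0 * i / 199 for i in range(200)]
sin_elev = [math.sin(math.radians(e)) for e in elev_deg]
w_true = 4.0 * math.pi * HEIGHT_M / WAVELENGTH_M
snr_m = [10.0 * math.cos(w_true * x + 0.5) for x in sin_elev]

amplitude, phase = fit_amplitude_phase(sin_elev, snr_m, HEIGHT_M, WAVELENGTH_M)
soil_moisture = phase_to_soil_moisture(delta_phase_deg=10.0, slope=0.015, sm_residual=0.05)
```

With real data, the vegetation-related smoothing described above would be applied to the fitted phase before the linear conversion; that step is omitted here.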
2.4. Data Pre-Processing
The objective of the method is to obtain soil moisture products with a high spatial and temporal resolution by fusing ground-based GNSS-IR data with surface environmental parameters extracted from optical remote sensing. The specific process of the multi-data fusion model in this study is described below, and the flow chart of the method is shown in
Figure 2.
(1) Data processing. Download the soil moisture retrieved by GNSS-IR technology from the International Soil Moisture Network (ISMN). Use Google Earth Engine (GEE) to obtain image data of the surface environmental elements (latitude, longitude, NDVI, temperature, rainfall, land cover type, slope, aspect, elevation, and shadow) for the experimental area (1 January 2014–1 March 2014). Using GEE’s image pyramid strategy, specify output images with a spatial resolution of 500 m and a temporal resolution of 1 day.
(2) Build the data set. Using each GNSS station’s latitude and longitude, extract the image values at that station. The ten surface environmental elements form the input data set of the GA-BP neural network model, and the GNSS-IR soil moisture is the training target that forms the output data set. The input layer of the GA-BP neural network therefore has 10 neurons, while the output layer has only one.
(3) Model building. Import the modeling input data set and the modeling output data set into MATLAB, and divide all the data into training, validation, and test sets in proportions of 70%, 15%, and 15%. Use the training and validation sets to train the GA-BP neural network model, and use the test set to assess the accuracy of the trained model. Save the GA-BP neural network model whose accuracy reaches the threshold (the qualified trained network).
(4) Accuracy verification. First, test the reliability of the neural network model that reaches the threshold by tenfold cross-validation. Second, form a verification input data set from the 10 surface environmental elements at the GNSS stations not involved in the modeling. Input this data set into the trained model to output the GA-BP inversion soil moisture data set, and compare it with the GNSS-IR soil moisture at those stations. If the accuracy meets the requirements, the trained neural network model is considered reliable and effective.
(5) Soil moisture map production. Each 500 m grid cell in the experimental area corresponds to one latitude and longitude coordinate. Extract the 10 surface environmental elements at all of these coordinates to form a mapping input data set. Input it into the GA-BP neural network model that reaches the threshold and take the model output as the GA-BP inversion soil moisture data set for mapping. Import this data set into ArcGIS and use the ArcGIS “Point to Raster” function to convert it into raster images (the soil moisture map).
(6) Soil moisture map verification. Compare the soil moisture map (500 × 500 m) with the NASA-USDA product (0.25° × 0.25°), NDVI (500 × 500 m), and rainfall (500 × 500 m). Because the four data sets differ in units and resolution, we only assess whether the generated soil moisture map is reasonable by comparing the changes in map patches and values over the same areas. On this basis, we extracted 392 sites at the latitude and longitude of each NASA grid center and analyzed the correlation between the NASA-USDA soil moisture and the GA-BP soil moisture at these sites to further evaluate the performance of the GA-BP inversion.
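The 70%/15%/15% division in step (3) can be sketched as follows. This is only an illustration in Python (the study itself uses MATLAB). The function name `split_dataset`, the fixed random seed, and the synthetic samples are assumptions. Each sample is a pair of ten input features and one soil moisture target, mirroring the 10-input, 1-output layout of step (2).

```python
import random

def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle the samples and split them into training, validation, and test sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (about 15%)
    return train, val, test

# Synthetic placeholder samples: (ten surface environmental features, soil moisture).
samples = [([float(i)] * 10, 0.01 * i) for i in range(100)]
train_set, val_set, test_set = split_dataset(samples)
```

The test set is held back entirely during training, so its error gives an independent check on the saved model.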
2.5. GA-BP Neural Network
2.5.1. BP Neural Network
A backpropagation (BP) neural network is a multilayer feedforward network model whose training consists of two processes: forward propagation of information and backward propagation of error [48]. It is a widely used neural network with strong adaptability and learning ability, and it handles nonlinear problems well. In essence, it learns with a stochastic gradient descent algorithm. Passing data from the input layer through to the output layer is called forward propagation; propagating the output error back through the layers to update the weights and biases at each iteration is called backward propagation. A BP neural network consists mainly of an input layer, a hidden layer, and an output layer. Its network structure is shown in Figure 3. In our study, the input signals are latitude, longitude, NDVI, rainfall, air temperature, land cover type, and four topographic factors (elevation, slope, slope direction, and shading); the output parameter is GNSS-IR soil moisture. The BP neural network model is implemented with the neural network toolbox of MATLAB. The number of hidden-layer neurons m is bounded above by 2n + 1, where n is the number of neurons in the input layer, so we tested hidden layers with 7 to 21 neurons. Tenfold cross-validation showed that the best model performance is obtained with 19 hidden-layer neurons. We also set the number of training steps to 1000, the training goal (target error) to 0.001, and the learning rate to 0.0001.
However, the number of hidden-layer neurons of the BP neural network must be selected manually, and the initial weights and thresholds are generated randomly, which causes the network to converge slowly and to fall easily into a local minimum [49]. For this reason, this paper uses the GA algorithm to optimize the BP neural network.
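The forward and backward propagation described in this subsection can be sketched for a single hidden layer. This is a simplified pure-Python illustration, not the MATLAB toolbox implementation used in the study. The tanh activation, the full-batch update, the synthetic data, and the demonstration learning rate and epoch count are all assumptions. Only the layer sizes (10 inputs, 19 hidden neurons, 1 output) follow the text.

```python
import math
import random

def init_network(n_in, n_hidden, rng):
    """Random initial weights and biases, as in a plain (non-GA) BP network."""
    def u():
        return rng.uniform(-0.5, 0.5)
    return {"w1": [[u() for _ in range(n_in)] for _ in range(n_hidden)],
            "b1": [u() for _ in range(n_hidden)],
            "w2": [u() for _ in range(n_hidden)],
            "b2": u()}

def forward(net, x):
    """Forward propagation: input layer -> tanh hidden layer -> linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return hidden, output

def mse(net, data):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in data) / len(data)

def train_epoch(net, data, lr):
    """Backward propagation: one full-batch gradient descent step on the MSE."""
    n_hidden, n_in = len(net["w2"]), len(net["w1"][0])
    g_w1 = [[0.0] * n_in for _ in range(n_hidden)]
    g_b1 = [0.0] * n_hidden
    g_w2 = [0.0] * n_hidden
    g_b2 = 0.0
    for x, y in data:
        hidden, out = forward(net, x)
        d_out = 2.0 * (out - y) / len(data)            # dLoss/dOutput
        g_b2 += d_out
        for j in range(n_hidden):
            g_w2[j] += d_out * hidden[j]
            d_hid = d_out * net["w2"][j] * (1.0 - hidden[j] ** 2)  # tanh derivative
            g_b1[j] += d_hid
            for i in range(n_in):
                g_w1[j][i] += d_hid * x[i]
    net["b2"] -= lr * g_b2
    for j in range(n_hidden):
        net["w2"][j] -= lr * g_w2[j]
        net["b1"][j] -= lr * g_b1[j]
        for i in range(n_in):
            net["w1"][j][i] -= lr * g_w1[j][i]

# Synthetic demonstration: 10 inputs, 19 hidden neurons, 1 output.
rng = random.Random(0)
data = []
for _ in range(30):
    x = [rng.uniform(-1.0, 1.0) for _ in range(10)]
    data.append((x, math.sin(x[0]) + 0.5 * x[1] * x[2]))   # arbitrary nonlinear target
net = init_network(n_in=10, n_hidden=19, rng=rng)
loss_before = mse(net, data)
for _ in range(100):
    train_epoch(net, data, lr=0.02)
loss_after = mse(net, data)
```

Because the starting weights are random, different seeds can end in different minima, which is the sensitivity that motivates the GA initialization.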
2.5.2. The Genetic Algorithm
The GA algorithm is a global search algorithm inspired by the laws of inheritance in nature [50]. In essence, it selects good individuals from a population and applies crossover and mutation operations to obtain new individuals of good quality. The genetic algorithm has the following advantages: (1) It can quickly search the whole solution space and has excellent global search capability. (2) It is suitable for distributed computing, and its natural parallelism speeds up convergence. (3) It is simple, general, and widely applicable. The GA algorithm is used here to optimize the weights of the BP neural network. Adaptive adjustment of the crossover and mutation probabilities enables individuals to update the network weights continuously, thus improving the convergence speed and accuracy of the BP neural network. The global search property of the GA algorithm is combined with the powerful nonlinear learning ability of the BP neural network to improve the training ability of the model. The BP neural network optimized by the GA algorithm is called the Genetic Algorithm Back Propagation (GA-BP) neural network. The calculation process of the whole GA-BP neural network is shown in Figure 4.
The GA-BP neural network works in two stages. First, the GA algorithm searches globally for a set of optimal solutions. Second, these solutions are used as the initial weights of the BP neural network. In this study, the initial population size of the GA algorithm is set to 50, the number of genetic generations to 100, the crossover probability to 0.3, and the mutation probability to 0.09.
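The first stage can be sketched as a small real-coded genetic algorithm. This is a hedged illustration rather than the authors' implementation. The tournament selection, single-point crossover, Gaussian mutation, elitism, and fixed (non-adaptive) probabilities are assumptions. The toy quadratic loss stands in for the training error of a BP network whose weights and biases would be flattened into one vector. Only the population size (50), number of generations (100), crossover probability (0.3), and mutation probability (0.09) are taken from the text.

```python
import random

def genetic_search(loss_fn, n_genes, pop_size=50, generations=100,
                   p_crossover=0.3, p_mutation=0.09, seed=0):
    """Minimize loss_fn over real-valued vectors; returns the best vector and loss history."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = min(population, key=loss_fn)
    history = [loss_fn(best)]
    for _ in range(generations):
        # Selection: binary tournament, the fitter of two random individuals survives.
        parents = [min(rng.sample(population, 2), key=loss_fn)
                   for _ in range(pop_size)]
        children = []
        for k in range(0, pop_size, 2):
            a, b = parents[k][:], parents[(k + 1) % pop_size][:]
            if rng.random() < p_crossover:           # single-point crossover
                cut = rng.randrange(1, n_genes)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children.extend([a, b])
        children = children[:pop_size]
        for child in children:                       # per-gene Gaussian mutation
            for g in range(n_genes):
                if rng.random() < p_mutation:
                    child[g] += rng.gauss(0.0, 0.1)
        children[0] = best[:]                        # elitism: keep the best so far
        population = children
        best = min(population, key=loss_fn)
        history.append(loss_fn(best))
    return best, history

# Toy stand-in for the BP training error (hypothetical target weights).
TARGET = [0.3, -0.2, 0.5, 0.1]

def toy_loss(weights):
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET))

best_weights, history = genetic_search(toy_loss, n_genes=len(TARGET))
```

In the second stage, the returned vector would be unpacked into the initial weights and biases of the BP network and refined by gradient descent.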
2.6. Validation Method and Evaluation Metrics
This study applied a 10-fold cross-validation technique to test the model for overfitting and to assess its predictive ability [51]. The entire data set was randomly divided into ten equal-sized portions, and training and validation were run in ten iterations. In each iteration, one portion is used as the validation sample and the remaining nine are used as the training sample. In the next iteration, a different portion is held out for validation and the other nine are used for training. After ten iterations, every portion has served as the validation sample once, which gives prediction results for the whole soil moisture data set. The model cross-validation results are obtained by averaging the ten results. These averaged results provide a good check of whether the model is overfitted: when the model is poor, the cross-validation results are also poor. The model with the largest correlation coefficient is selected as the best-fit model for subsequent predictions.
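The fold construction can be sketched as below. The generator name `k_fold_indices`, the fixed seed, and the sample count of 95 are illustrative assumptions. When the sample count is not divisible by ten, the folds differ in size by at most one.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized portions
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(95))
```

Each index appears in exactly one validation fold, so concatenating the ten validation predictions covers the whole data set once.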
To verify the validity of each model, we quantitatively evaluated the training and test sets using the Pearson correlation coefficient R, root mean square error (RMSE), unbiased root mean square error (ubRMSE), and mean bias (bias). R describes the degree of linear correlation between the model estimates and the measured values and ranges from −1 to +1, where +1 indicates a perfect positive linear correlation, 0 indicates no linear correlation, and −1 indicates a perfect negative linear correlation. RMSE and bias measure the deviation between the inverted soil moisture values and the measured values. An RMSE of 0 indicates no deviation. A bias of 0 indicates an unbiased estimate, a bias greater than 0 indicates an overestimate, and a bias less than 0 indicates an underestimate. In general, the smaller the magnitude of both, the better. ubRMSE represents the random error: it removes the mean bias that remains when the measured value is treated as the true value, and the smaller its value, the better the model performance.
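The four metrics can be computed as in the following sketch, which uses their standard definitions. The function name and the sample values are illustrative and are not taken from the study. Under these definitions, ubRMSE² = RMSE² − bias², so ubRMSE is the part of the error that remains after the mean bias is removed.

```python
import math

def evaluation_metrics(predicted, observed):
    """Return Pearson R, RMSE, ubRMSE, and mean bias for paired samples."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    dev_p = [p - mean_p for p in predicted]     # anomalies of the estimates
    dev_o = [o - mean_o for o in observed]      # anomalies of the measurements
    bias = mean_p - mean_o
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
    ubrmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(dev_p, dev_o)) / n)
    r = (sum(a * b for a, b in zip(dev_p, dev_o))
         / math.sqrt(sum(a * a for a in dev_p) * sum(b * b for b in dev_o)))
    return {"R": r, "RMSE": rmse, "ubRMSE": ubrmse, "bias": bias}

# Illustrative values in cm^3 cm^-3 (not data from the study).
observed = [0.10, 0.20, 0.30, 0.33]
predicted = [0.12, 0.22, 0.35, 0.31]
metrics = evaluation_metrics(predicted, observed)

# A constant offset changes bias and RMSE but leaves R and ubRMSE unaffected.
shifted = evaluation_metrics([o + 0.1 for o in observed], observed)
```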