Modeling and Theoretical Analysis of GNSS-R Soil Moisture Retrieval Based on the Random Forest and Support Vector Machine Learning Approach

Abstract: Global Navigation Satellite System-Reflectometry (GNSS-R) is a microwave remote sensing technique that can retrieve the Earth's surface parameters using GNSS signals reflected from the surface. These reflected signals convey the surface features and can therefore be used to detect certain physical properties of the reflecting surface, such as soil moisture content (SMC). Up to now, a series of electromagnetic models (e.g., bistatic radar and Fresnel equations) have been employed and solved for SMC retrieval. However, due to the uncertainty of the physical characteristics of the sites and the complexity and nonlinearity of the inversion process, it is still challenging to retrieve soil moisture accurately. Popular machine learning (ML) methods are flexible and able to handle nonlinear problems. They can uncover and model the complex interactions between input and output and ultimately make good predictions. In this paper, two typical ML methods, specifically random forest (RF) and support vector machine (SVM), are employed for SMC retrieval from GNSS-R data of self-designed experiments (in situ and airborne). A comprehensive simulated dataset involving different types of soil is first constructed to represent the complex interactions between the variables (reflectivity, elevation angle, dielectric constant, and SMC), as required for training ML regression models. Correspondingly, the main task of soil moisture retrieval (regression) is addressed. Specifically, the post-processed data (reflectivity and elevation angle) from sensor acquisitions are used to make predictions with the two adopted ML methods, and the results are compared with the commonly used GNSS-R retrieval method (electromagnetic models). The results show that RF outperforms the SVM method and is more suitable for handling the inversion problem.
Moreover, the RF regression model built from the comprehensive dataset demonstrates satisfactory accuracy and strong universality, especially when the soil type is not uniform or unknown. Furthermore, the typical task of detecting water/soil (classification) is discussed. The ML algorithms demonstrate high potential and efficiency in SMC retrieval from GNSS-R data.
... higher correlation coefficient, lower root square error, are observed in in situ ... has been mentioned in the SVM method, the measured data points and the classification results are shown in Figure 19. Four periods of flight over lakes were distinguished with a spatial resolution of around 20 m. The prediction accuracy for PRN4 and PRN32 is 99.75% in both cases, as illustrated in Figure 19. In this case, the two reflection routes (PRN4 and PRN32) show different classification accuracy compared with the 99.5% and 99.75% obtained by applying SVM. The RF shows a similar performance to the SVM algorithm.


Introduction
Soil moisture content (SMC) is an important determinant of the surface energy balance and plays an important role in the global water cycle. Existing ground-based experiments and satellite missions dedicated to SMC estimation commonly employ heavy and bulky passive or active sensors, with limited data [41]. A bagging ensemble algorithm, random forest (RF), has been widely used in remote sensing applications to obtain the land cover type [48], boreal forest attributes [49], precipitation [50], vegetation water content [51], and metal concentration [52], since it is good at capturing nonlinear and complex relationships between inputs and predictors with good estimation results [50,51]. SVM and RF, two typical machine learning methods, have great potential for interpreting remote sensing data in land and sea applications, because they are faster and require fewer training samples while exhibiting better prediction performance than other learning methods [46–48,51]. Although SVMs and RF have been used in past studies for soil moisture estimation, neither of them has been adopted for modeling and comparison with the GNSS-R SMC retrieval models.
Therefore, this study aims to investigate the feasibility of GNSS-R estimation (regression and classification) by using two typical ML algorithms with self-designed experiments (in situ and airborne) and establishes an optimization method for SMC retrieval. A simulated dataset involving different types of soil is constructed for training ML regression models. The performance of the two adopted ML methods and of the GNSS-R retrieval method for SMC estimation is evaluated and compared. Additionally, the classification of water and soils is discussed, and the predicted properties of the surfaces are presented by the classification function. This paper is organized as follows: Section 2 presents the theoretical background of the GNSS-R SMC retrieval and ML algorithms. Section 3 describes the methodology for training and modeling the GNSS-R inversion process. The experimental setup and the employed datasets are detailed in Section 4. Section 5 shows the regression results obtained by the ML and GNSS-R models with self-designed experimental data, together with some discussion. Finally, conclusions are given in Section 6.

Soil Moisture Retrieval Process from Bistatic GNSS-R
The GNSS-R system can be regarded as a bistatic radar system, as shown in Figure 1, in which the satellite is the transmitter and the receiver can be placed near the ground (in situ measurement) or on an aircraft for airborne experiments. GNSS-R aims to obtain the characteristics of the reflecting surface by analyzing the reflected signals or their difference from the direct signal. GNSS-R utilizes L-band microwave signals, which are largely immune to atmospheric attenuation and normally have a good penetration through vegetation [52]. As seen in Figure 1, the RHCP antenna receives the direct signal, and the LHCP antenna receives the reflected signal. The SNR peak power of the RHCP antenna is:

$$\mathrm{SNR}^{direct}_{peak} = \frac{P_t G_t}{4\pi R_3^2}\,\frac{G_r \lambda^2 G_D}{4\pi P_N} \tag{1}$$

where $P_t$ represents the satellite transmit power, $G_t$ stands for the satellite gain, and $G_r$ and $P_N$ are the antenna gain and noise power for the RHCP and the LHCP link, respectively. $G_D$ is the processing gain due to the de-spread of the GPS C/A code, $R_3$ denotes the distance between the satellite and the receiver, and $\lambda$ is the wavelength of the L1 band signal.
Remote Sens. 2020, 12, 3679
In this study, the reflected signal received by the antenna is considered to be dominated by the coherent reflections [16]. Thus, the reflected signal power of the LHCP antenna is:

$$\mathrm{SNR}^{reflect}_{peak} = \frac{P_t G_t}{4\pi (R_1 + R_2)^2}\,\frac{G_r \lambda^2 G_D}{4\pi P_N}\,\Gamma \tag{2}$$

In (2), $R_1$ is the distance between the satellite and the reflection point, and $R_2$ represents the distance between the reflection point and the receiver. The ratio of $\mathrm{SNR}^{reflect}_{peak}$ to $\mathrm{SNR}^{direct}_{peak}$ can be written as:

$$\frac{\mathrm{SNR}^{reflect}_{peak}}{\mathrm{SNR}^{direct}_{peak}} = \frac{R_3^2}{(R_1 + R_2)^2}\,\Gamma\,C \tag{3}$$

where $C$ is a calibration parameter summarizing the uncertainties of $G_r$ and $P_N$. $\Gamma$ is the power reflectivity that depends on the surface roughness [53,54]:

$$\Gamma = \left|\rho(\gamma)\right|^2 \chi(z) \tag{4}$$

where $\rho(\gamma)$ represents the Fresnel reflection coefficient of the reflecting surface, and $\gamma$ denotes the elevation angle of the satellite. $\chi(z)$ is the probability density function for the surface height $z$.
Under the assumption of a flat surface, $\chi(z) = 1$. The reflection coefficient $\rho(\gamma)$ is given by a linear combination of the vertically and horizontally polarized components; therefore [55]:

$$\rho(\gamma) = \frac{1}{2}\left(\rho_{VV} - \rho_{HH}\right) \tag{5}$$

where $\rho_{VV}$ is the vertical polarization reflection coefficient and $\rho_{HH}$ is the horizontal polarization reflection coefficient. More specifically [5]:

$$\rho_{VV} = \frac{\varepsilon \sin\gamma - \sqrt{\varepsilon - \cos^2\gamma}}{\varepsilon \sin\gamma + \sqrt{\varepsilon - \cos^2\gamma}} \tag{6}$$

$$\rho_{HH} = \frac{\sin\gamma - \sqrt{\varepsilon - \cos^2\gamma}}{\sin\gamma + \sqrt{\varepsilon - \cos^2\gamma}} \tag{7}$$

where $\varepsilon$ is the complex permittivity of the reflecting surface. In the case of dry or almost dry terrain, the imaginary part of the permittivity can be neglected [56,57]. When the LH reflected signal and the RH direct signal are known, the real part of the permittivity can be obtained from the combination of (3)-(7) with nearby water calibration [16]. Since the relationship between the dielectric constant of soil and soil moisture is given by the soil dielectric models [53,58], the SMC can be retrieved from the dielectric constant.
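As a rough illustration of how the flat-surface model of Equations (4)-(7) can be evaluated and inverted numerically, the following Python sketch computes the cross-polarized reflectivity for a real permittivity and recovers the permittivity from a given reflectivity by bisection. The function names, the bracket [1, 100], and the assumption that the reflectivity increases monotonically with the permittivity over that bracket are choices made for this example only; this is not the authors' implementation, and the water calibration step is not included.

```python
import math

def reflectivity(eps, elev_deg):
    """Flat-surface power reflectivity (chi = 1), following Eqs. (4)-(7).

    Assumes a real permittivity eps >= 1 and an elevation angle in degrees.
    """
    if eps == 1.0:
        return 0.0  # no dielectric contrast, hence no reflection
    s = math.sin(math.radians(elev_deg))
    q = math.sqrt(eps - math.cos(math.radians(elev_deg)) ** 2)
    rho_vv = (eps * s - q) / (eps * s + q)   # Eq. (6)
    rho_hh = (s - q) / (s + q)               # Eq. (7)
    rho = 0.5 * (rho_vv - rho_hh)            # Eq. (5)
    return rho ** 2                          # Eq. (4) with chi = 1

def invert_permittivity(refl, elev_deg, lo=1.0, hi=100.0, tol=1e-9):
    """Find eps with reflectivity(eps, elev_deg) == refl by bisection.

    Assumes reflectivity grows monotonically with eps on [lo, hi].
    """
    if not reflectivity(lo, elev_deg) <= refl <= reflectivity(hi, elev_deg):
        raise ValueError("reflectivity outside the bracketed range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reflectivity(mid, elev_deg) < refl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip: forward model at eps = 15, elevation 60 degrees, then inversion.
eps_hat = invert_permittivity(reflectivity(15.0, 60.0), 60.0)
```

At normal incidence the expression reduces to ((√ε − 1)/(√ε + 1))², which gives a quick sanity check of the forward model.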

Support Vector Machines
The support vector machine (SVM) was established by Vapnik [59] on the basis of statistical learning theory. It is a typical machine learning algorithm, which was originally used for classification. Assume the data sample set is denoted as $T = \{(x_i, y_i),\ i = 1, 2, \ldots, l\}$, where $x_i \in \mathbb{R}^n$ is the input vector, whose components are features or attributes; $y_i \in \{\pm 1\}$ is the output value corresponding to $x_i$; and $l$ is the number of samples. SVM aims to find a classification hyperplane that maximizes the margin between different classes. The hyperplane is constructed as follows [59]:

$$\omega \cdot x + b = 0$$

where $\omega$ is a weighting vector, $x$ is an input vector, and $b$ is the bias. When the two margin boundaries $\omega \cdot x + b = 1$ and $\omega \cdot x + b = -1$ perfectly separate the positive and negative samples, the distance between them is $2/\|\omega\|$, which is to be maximized [59]. The optimization function can be expressed as follows [59]:

$$\min_{\omega, b}\ \frac{1}{2}\|\omega\|^2 \quad \text{s.t.} \quad y_i\,(\omega \cdot x_i + b) \ge 1,\quad i = 1, 2, \ldots, l$$

SVM is quite efficient and requires fewer samples [60]. In particular, SVM uses a kernel function that takes data as input and transforms it into the desired form [59]. These functions can be of different types, for example, linear, nonlinear, polynomial, or radial basis function (RBF). Here, we adopted the RBF kernel function, since it has good generalization ability and has demonstrated excellent performance [59]. Moreover, SVM can also be applied to regression problems while maintaining all the main features that characterize the algorithm (maximal margin); this variant is known as support vector regression (SVR). Similar to SVM, SVR can also estimate the nonlinear relationship between input vectors and corresponding predictors [61]. The core of the SVR is the iterative process of the sequential minimal optimization (SMO) algorithm [62].
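To make the role of the RBF kernel more concrete, the sketch below evaluates a kernel SVM decision function of the usual form f(x) = Σ αᵢ yᵢ K(xᵢ, x) + b for a trained model. The two support vectors, their coefficients, the bias, and the kernel width are hypothetical values invented for illustration (they are not results from the paper), and the training step itself (e.g., SMO) is not shown.

```python
import math

def rbf_kernel(x, z, width=10.0):
    """RBF kernel K(x, z) = exp(-width * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-width * sq_dist)

def decision_function(x, support_vectors, dual_coefs, bias, width=10.0):
    """f(x) = sum_i (alpha_i * y_i) * K(x_i, x) + b."""
    return bias + sum(
        coef * rbf_kernel(sv, x, width)
        for sv, coef in zip(support_vectors, dual_coefs)
    )

def classify(x, support_vectors, dual_coefs, bias, width=10.0):
    """Return +1 or -1 according to the sign of the decision function."""
    f = decision_function(x, support_vectors, dual_coefs, bias, width)
    return 1 if f >= 0 else -1

# Hypothetical trained model on (reflectivity, elevation angle) inputs:
# one low-reflectivity "soil" support vector (coefficient -1) and one
# high-reflectivity "water" support vector (coefficient +1).
support_vectors = [(0.1, 45.0), (0.6, 45.0)]
dual_coefs = [-1.0, 1.0]   # alpha_i * y_i, assumed values
bias = 0.0                 # assumed value

label_low = classify((0.12, 45.0), support_vectors, dual_coefs, bias)
label_high = classify((0.58, 45.0), support_vectors, dual_coefs, bias)
```

In practice the two inputs have very different numeric ranges, so some form of feature scaling would normally be applied before using an RBF kernel; that step is omitted here.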

Random Forest
Random forest (RF) is an ensemble machine learning method proposed by Breiman [63], which uses bagging (bootstrap aggregation) and random split selection to construct multiple decision trees and obtains the final classification result by voting. Random forests can also be used for regression. An RF can analyze complex interactions and even highly correlated variables. It has a fast learning speed and is quite resistant to noisy data and data with missing values [46–48,51].
The random forest is an ensemble classifier consisting of a set of tree-structured classifiers $\{h(X, \vartheta_k),\ k = 1, 2, 3, \ldots, K\}$, simplified as $h_i(x)$, where the $\{\vartheta_k\}$ are independent and identically distributed random vectors, and $K$ is the number of decision trees in the random forest. For a given independent variable $X$, the optimal classification result is determined by the majority vote of the decision trees [63].
Building a random forest requires three steps: generating a training set (bootstrap sampling) for each decision tree, constructing each decision tree, and repeating the above two steps to generate a random forest. In order to construct $k$ trees, we need to generate $k$ random vectors $\vartheta_1, \vartheta_2, \vartheta_3, \ldots, \vartheta_k$. These random vectors $\vartheta_i$ are independent of each other and identically distributed. The random vector $\vartheta_i$ is used to construct the decision tree $h(x, \vartheta_i)$, which is simplified as $h_i(x)$. When constructing a tree, the feature used at each split is selected from a random subset of the features [63].
The prediction of the model is the average of the regression results of the $k$ decision trees [63]:

$$\hat{y}(x) = \frac{1}{k}\sum_{i=1}^{k} h_i(x)$$

When using bootstrap sampling, the unselected data are called out-of-bag (OOB) data. These OOB data can be used to estimate the generalization error, classification strength, and correlation coefficient (CC) of the ensemble of decision trees; for each decision tree, the OOB data can be used to obtain an error estimate. The OOB error estimates of all decision trees in a random forest are averaged to evaluate the generalization error of the random forest model. More details about the implementation of RF can be found in, e.g., [63].
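The bootstrap, averaging, and OOB steps described above can be sketched in a few lines of Python. To stay short, the example uses one-split regression trees (stumps) on a single input and omits the random feature selection of a full random forest, so it is plain bagging rather than Breiman's complete algorithm. The toy data, the number of trees, and all function names are assumptions made for illustration.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs by minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue  # a split must leave samples on both sides
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all inputs identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_bagged(xs, ys, k=25, seed=0):
    """Train k stumps on bootstrap samples; return them with the mean OOB error."""
    rng = random.Random(seed)
    n = len(xs)
    trees, oob_errors = [], []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        tree = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        oob = set(range(n)) - set(idx)                  # unselected (OOB) samples
        if oob:
            mse = sum((tree(xs[i]) - ys[i]) ** 2 for i in oob) / len(oob)
            oob_errors.append(mse)
        trees.append(tree)
    oob_error = sum(oob_errors) / len(oob_errors) if oob_errors else float("nan")
    return trees, oob_error

def predict(trees, x):
    """Ensemble prediction: average of the individual tree outputs."""
    return sum(tree(x) for tree in trees) / len(trees)

# Toy step-shaped target: 0.1 below x = 1.0 and 0.4 from x = 1.0 upward.
xs = [i / 20 for i in range(40)]
ys = [0.1 if x < 1.0 else 0.4 for x in xs]
trees, oob_error = fit_bagged(xs, ys)
```

Here the OOB error is computed per tree and then averaged, mirroring the description in the text; other implementations aggregate OOB predictions per sample instead.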

RF and SVMs Models for GNSS-R Soil Moisture Retrieval
In general, as demonstrated in Figures 1 and 2, the GNSS-R signals coming from direct and reflected links are received, and the collected raw data were post-processed respectively to obtain the correlation power and relevant navigation messages. Therefore, the soil reflectivity can be obtained by calculating the SNR of the received data collected from the reflected and direct signals. After that, as we have introduced in Section 2.1, the soil reflectivity is used to obtain the dielectric constants through the bistatic radar equations. Since the dielectric constants are strongly related to SMC, the relationship between soil dielectric constants and soil moisture is given by the soil dielectric models [53,58].
In fact, it has to be noted that the commonly used semi-empirical soil dielectric models [53,58] need the texture information (e.g., clay, sand, and silt proportions) of the soil. As shown in Figure 3, the SMC increases generally with dielectric constants.
However, different soil types (identified with n) show an evident impact on SMC retrieval, which increases the difficulty and uncertainty in SMC retrieval when the texture of the soil is unknown or nonuniform. Moreover, operating field measurements for acquiring the soil texture in all test sites are practically impossible; therefore, most GNSS-R SMC measurements are conducted without knowing the information of the test site. On the other hand, the inversion process is quite complex and unable to be solved analytically. Thus, it is difficult to establish an accurate GNSS-R soil moisture model analytically due to the complex interaction of these parameters.
Hence, facing the above-mentioned challenges, the GNSS-R SMC retrieval is here considered as a nonlinear regression problem and modeled by ML techniques (RF and SVMs), as shown in Figure 2. The input vector is (Γ, γ), and the SMC is the output to be predicted by the ML methods. It is worth mentioning that during the GNSS-R experiment, the instability of the receiving equipment or other unexpected situations may cause missing data.
ML methods are effective, flexible, and can maintain high accuracy prediction, even when a portion of data is lost [51], which is quite valuable for GNSS-R soil moisture retrieval. In this study, two ML algorithms of RF and SVM were applied for training the regression model and testing the performance of the proposed GNSS-R ML retrieval method.

Simulated GNSS-R Dataset for Training Regression Models
As noted previously, regression is a typical task solved by ML methods. As such, in this study, we use SVR and RF models to perform the SMC retrieval (regression) with data collected during self-designed in situ and airborne experiments. In principle, such learning techniques are based on building a regression model between the known SM values from a reference dataset (such as Soil Moisture Active Passive, SMAP, or ground-truth SMC networks) and the experiment observations, and then exploiting this model to perform future SMC estimations. However, as mentioned earlier, constructing an ML model may rely heavily on prior knowledge of SM or on extensive ancillary data. For particular regions with a self-designed experiment (airborne or in situ measurement), it is extremely difficult to obtain sufficient reference ground-truth data to provide enough samples for training good ML models. Therefore, in this paper, a comprehensive simulation dataset involving five types of soil was first built for training the ML models. Next, selected real GNSS-R data from airborne and in situ measurements were processed and further tested to validate the prediction performance of the ML models.
The comprehensive simulation dataset was built and used for training and regression tests. This dataset covers five types of soils that correspond to the dielectric models shown in Figure 3. The input vector consists of Γ (reflectivity) and γ (elevation angle). The output is SMC. The simulated dataset is built from the following input ranges for the different soil types (n):

1. Γ, Reflectivity (from 0 to 0.8)
2. γ, Elevation angle (from 35 degrees to 85 degrees)
The designed range [55] of the input data for training aimed at covering the range of our acquired measured data. With the simulated input vectors and Equations (4)-(7) of GNSS-R, the SMC for the different soil types can be calculated from the dielectric constant by using semi-empirical soil models [58], as illustrated in Section 2.1. In particular, since the ML methods can build and reveal the nonlinear relationship between the input and output vectors, a regression model covering different soil types is trained and used, which can increase the prediction accuracy when the soil type is unknown or uncertain. The overall simulated dataset, covering five typical soil types, is composed of 2000 points (Γ, γ, SMC), as shown in Figure 4.
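The construction of (Γ, γ, SMC) training points can be sketched as follows. This is only an illustration under stated assumptions: the texture-dependent semi-empirical models [53,58] used in the paper are replaced here by the texture-independent Topp et al. (1980) polynomial as a stand-in, the permittivity is taken as real, the grid sizes are arbitrary, and grid points whose reflectivity would require a permittivity above an assumed cap of 30 are skipped. The resulting dataset therefore does not reproduce the 2000 points or the five soil types of the paper.

```python
import math

def reflectivity(eps, elev_deg):
    """Flat-surface power reflectivity for a real permittivity, Eqs. (4)-(7)."""
    if eps == 1.0:
        return 0.0
    s = math.sin(math.radians(elev_deg))
    q = math.sqrt(eps - math.cos(math.radians(elev_deg)) ** 2)
    rho_vv = (eps * s - q) / (eps * s + q)
    rho_hh = (s - q) / (s + q)
    return (0.5 * (rho_vv - rho_hh)) ** 2

def invert_permittivity(refl, elev_deg, eps_max, tol=1e-9):
    """Bisection on [1, eps_max], assuming reflectivity increases with eps."""
    lo, hi = 1.0, eps_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reflectivity(mid, elev_deg) < refl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def topp_smc(eps):
    """Topp et al. (1980) polynomial (stand-in soil model), clipped at zero."""
    smc = -5.3e-2 + 2.92e-2 * eps - 5.5e-4 * eps ** 2 + 4.3e-6 * eps ** 3
    return max(0.0, smc)

def build_regression_dataset(n_refl=40, n_elev=11, eps_max=30.0):
    """Grid over reflectivity (0, 0.8] and elevation 35-85 degrees."""
    rows = []
    for i in range(1, n_refl + 1):
        refl = 0.8 * i / n_refl
        for j in range(n_elev):
            elev = 35.0 + 50.0 * j / (n_elev - 1)
            if refl > reflectivity(eps_max, elev):
                continue  # would need a permittivity above the assumed cap
            eps = invert_permittivity(refl, elev, eps_max)
            rows.append((refl, elev, topp_smc(eps)))
    return rows

dataset = build_regression_dataset()
```

Each row could then be fed to a regression model with (Γ, γ) as inputs and SMC as the target.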

Simulated GNSS-R Dataset for Training Classification Models
We further investigate the performance of solving both the classification and regression problems for the airborne data. Hence, the experimental airborne GNSS-R data used for the soil moisture content predictions are also tested for the classification task, and the satellites PRN4 and PRN32 are again considered. Similar to the procedure for our proposed SMC regression scheme, a simulation dataset is devised for training and building the RF and SVM prediction models, since the simulated data can provide sufficient samples and show a more accurate relationship between the input and output. For the classification task, we considered the dielectric constant and elevation angle as the input of the simulation, and the reflectivity (Γ) is the output, generated with the bistatic Equations (4)-(6) of GNSS-R under the assumption of a flat surface. Generally, the dielectric constant of soils does not exceed 25, so the simulated dataset was constructed by varying two input variables in the following ranges:
1. ε, Dielectric constant: soils (from 1 to 25, with a step size of 1), water (78)
2. γ, Elevation angle (from 0 degrees to 90 degrees, with a step size of 3)
With the GNSS-R bistatic equations described in Section 2.1, the reflectivity (Γ) was obtained, and the 900 simulated training samples were labeled with −1 (soils) and +1 (water), as presented in Figure 5. The label of soil/water is assigned based on the corresponding value of the dielectric constant; specifically, that for water is 78, and for soil, it varies from 1 to 25. The simulated dataset is composed of (Γ, γ, labels).
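A minimal sketch of this labeling scheme is given below, assuming a real permittivity and the flat-surface reflectivity of Section 2.1. The function names are illustrative. Note that the plain grid stated above (25 soil values plus water, 31 elevation angles) yields 26 × 31 = 806 combinations rather than the 900 samples reported, so the exact sampling used by the authors presumably differs from this sketch.

```python
import math

def reflectivity(eps, elev_deg):
    """Flat-surface power reflectivity for a real permittivity (Section 2.1)."""
    if eps == 1.0:
        return 0.0  # no dielectric contrast; also avoids 0/0 at 0 degrees
    s = math.sin(math.radians(elev_deg))
    q = math.sqrt(eps - math.cos(math.radians(elev_deg)) ** 2)
    rho_vv = (eps * s - q) / (eps * s + q)
    rho_hh = (s - q) / (s + q)
    return (0.5 * (rho_vv - rho_hh)) ** 2

def build_classification_dataset():
    """Return (reflectivity, elevation, label) with -1 for soil, +1 for water."""
    eps_labels = [(float(e), -1) for e in range(1, 26)] + [(78.0, 1)]
    elevations = range(0, 91, 3)
    data = []
    for eps, label in eps_labels:
        for elev in elevations:
            data.append((reflectivity(eps, elev), float(elev), label))
    return data

data = build_classification_dataset()
n_water = sum(1 for _, _, label in data if label == 1)
```

At a given elevation angle the water samples have a higher reflectivity than any soil sample, which is what makes the two classes separable in the (Γ, γ) plane.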

Airborne Experimental Data
To validate this work, we firstly consider data obtained from a low-altitude airborne experiment that was carried out by a P92 Digisky airplane over the Avigliana lake (45.099° N, 7.369° E) in Italy on the 11th of December 2014. The flight route and corresponding reflection points for different PRN satellites (PRN4 and PRN32) are shown in Figure 6, including an image of the experimental area from Google Earth.

This flight experiment was mainly dedicated to investigating soil moisture retrieval over a large area. The type of terrain ranged from open water to terrain with small bushes to built-up areas [64]. It includes two lakes: the size of the northern lake (bigger) is approximately 1 km × 1.3 km, and the southern lake (smaller) is 700 m × 1.1 km. The area was selected for several reasons. First of all, in this area, the presence of two lakes can provide the reflections and the known dielectric constant for calibration. Second, the terrain slope variation can be neglected, and the terrain can be considered smooth [65]. Basically, the reflected signal power is composed of two parts: coherent and non-coherent power. The phase distribution of the coherent part is constant, while in the non-coherent part, the phase is random and uniformly distributed over an interval of 2π [66]. If the surface can be considered smooth, the non-coherent component assumes very low values that can be ignored, and the total power received by the antenna can be approximated with the coherent part only [16,65,67].
Data are collected with a receiver working in a bistatic mode, as shown in Figure 1. The up-looking antenna is a traditional hemispherical GNSS L1 patch antenna mounted on top of the aircraft fuselage, and the down-looking antenna is a GNSS L1 antenna with LHCP polarization mounted on the bottom of the aircraft fuselage [65]. The antenna was enclosed in a 2-inch square radome (53 mm × 53 mm) and equipped with a low-noise amplifier (LNA) providing 33 dB gain. The GNSS-R receiver [68] is fixed on a small aircraft, as shown in Figure 7 [65].
The prototype used for the acquisition of the received power can measure both the direct and reflected GPS signals through two synchronized channels: one for the direct signal and the other for the reflected signal (see Figure 8). The two antennas are each connected to a front end. Each front end is connected to the ODROID-X2 microprocessor board in the prototype, and the two data streams are stored in the onboard memory for post-processing [64,65].
As shown in Figure 8, the received raw data are stored in an ODROID-X2 eMMC memory of the receiver prototype in order to be post-processed by an open-loop approach to obtain DDMs and the corresponding delay waveforms.
Since a large amount of memory (GB/min) is required for storing the raw data, i.e., 1 s ≈ 1.6 GB data, the duration of the data collection is limited in the embedded Multi Media Card (eMMC) memory (64 GB) and external storage devices. To free more space for data storage, some of the data can be processed on board. Raw data are processed with software SOPRANO [69] and stored as much as hardware capability allowed. Especially, since the reflected GNSS signal is very weak, a combination of coherent and non-coherent integration algorithm was adopted in order to distinguish between the reflection peak and the noise [65]. The coherent integration (also known as signal correlation process) time we used is 1 ms depending on the length of GPS C/A code (1 ms). Several summations or averaging (called non-coherent integration) revealed the real signal shape and eliminated the fading noise effects. Comparing the delay waveforms (DW) performances, including average noise power and standard deviation of the noise, a final 500 ms non-coherent integration time is chosen to meet both the needs of system resolution and reliability to detect real signals [65,68].
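The combination of coherent and non-coherent integration described above can be illustrated with a minimal sketch. It assumes that each 1 ms coherent integration yields one complex correlator output per delay bin and that non-coherent integration averages the squared magnitudes of 500 such outputs. The signal amplitude, noise level, and function names are hypothetical and are not taken from the SOPRANO processing chain.

```python
import cmath
import random


def noncoherent_integration(coherent_outputs):
    """Average the squared magnitudes of successive coherent correlator outputs."""
    if not coherent_outputs:
        raise ValueError("need at least one coherent output")
    return sum(abs(z) ** 2 for z in coherent_outputs) / len(coherent_outputs)


def simulate_coherent_outputs(amplitude, noise_sigma, count, rng):
    """Hypothetical 1 ms correlator outputs: a constant-amplitude signal with
    random phase (fading) plus complex Gaussian noise."""
    outputs = []
    for _ in range(count):
        phase = rng.uniform(0.0, 2.0 * cmath.pi)
        signal = cmath.rect(amplitude, phase)
        noise = complex(rng.gauss(0.0, noise_sigma), rng.gauss(0.0, noise_sigma))
        outputs.append(signal + noise)
    return outputs


rng = random.Random(0)
N_LOOKS = 500  # 500 ms of 1 ms coherent outputs

# One delay bin containing the reflection peak and one containing noise only.
peak_bin = simulate_coherent_outputs(1.0, 1.0, N_LOOKS, rng)
noise_bin = simulate_coherent_outputs(0.0, 1.0, N_LOOKS, rng)

peak_power = noncoherent_integration(peak_bin)
noise_power = noncoherent_integration(noise_bin)
print(f"peak bin power:  {peak_power:.2f}")
print(f"noise bin power: {noise_power:.2f}")
```

Because the phase changes from one millisecond to the next, summing the complex outputs directly would not build up the peak, whereas averaging powers lets it emerge above the noise floor.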

In Situ Experimental Data
In this subsection, data obtained from several in situ measurements are introduced. The in situ data were collected from a series of ground-based experiments in two bare and smooth sites with different SMC conditions (wet/dry) and terrain compositions. As shown in Figure 9, the first site is located in Grugliasco, Torino (45°03′58.5″ N, 7°35′33.8″ E), in the Dipartimento Inter-ateneo di Scienze Progetto e Politiche del Territorio (DIST) of Polito. The second site is located in Agliano (44°47′29.1″ N, 8°15′19.8″ E), which is an area of smooth hills mainly devoted to wine production. The in situ experiment campaign is summarized in Table 1.
The GNSS-R system used for the in situ measurements was also operated in a bistatic GNSS-R configuration, as shown in Figure 1. It consists of two commercial front ends connected to two antennas and PCs for data acquisition [64]. The raw data processing and calibration procedure were the same as for the airborne experiment. Therefore, the reflectivity and the corresponding elevation angle can also be obtained and used to test the proposed ML methods. Moreover, the reference ground-truth SMC was measured and recorded based on the time-domain reflectometry (TDR) technique [70]. A three-rod sensor, the Tektronix Metallic Cable Tester 1502 manufactured by Tektronix Inc., Beaverton, OR, USA, was used in the measurements.
The measurements in dry conditions were done after a long drought, and the wet condition was reached after several rainfalls. The GNSS-R system and the ground-truth rod sensor were both used to make measurements before and after rain in bare and smooth fields (Grugliasco/Agliano), as introduced before. The major axis of the first Fresnel zone for satellites in our geometrical condition (high elevation angle and a tripod height of 1.5 m) is around 1 m. It was estimated to determine the coverage of the GNSS-R data, so that the results can be compared with other kinds of measurements. In this measurement, this information indicates where to place the instrument probe in order to evaluate the SMC precisely. In both places (Grugliasco and Agliano), the portable sensor setup was moved around in parallel to cover each estimated first Fresnel zone, so as to obtain the ground-truth SMC corresponding to the GNSS-R measurements.
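The size of the first Fresnel zone quoted above can be checked with a short sketch. It uses a commonly used planar-surface approximation for the first Fresnel zone ellipse of a ground-based GNSS reflection, which is an assumption here and not necessarily the formula the authors applied. The 80° elevation angle is an illustrative stand-in for "high elevation angle", and the function name is hypothetical.

```python
import math

GPS_L1_WAVELENGTH_M = 0.1903  # speed of light / 1575.42 MHz


def first_fresnel_zone(height_m, elevation_deg, wavelength_m=GPS_L1_WAVELENGTH_M):
    """Return (semi_major, semi_minor, center_distance) in metres for the first
    Fresnel zone on a flat horizontal surface (planar approximation)."""
    e = math.radians(elevation_deg)
    sin_e = math.sin(e)
    d = wavelength_m / 2.0  # half-wavelength excess path of the first zone
    semi_minor = math.sqrt(2.0 * d * height_m / sin_e + (d / sin_e) ** 2)
    semi_major = semi_minor / sin_e
    center_distance = (height_m + d / sin_e) / math.tan(e)
    return semi_major, semi_minor, center_distance


# Tripod height of 1.5 m as in the text; 80 degrees is an assumed elevation.
a, b, center = first_fresnel_zone(1.5, 80.0)
print(f"major axis: {2 * a:.2f} m, minor axis: {2 * b:.2f} m, center at {center:.2f} m")
```

Under these assumptions the major axis comes out slightly above 1 m, in line with the "around 1 m" stated in the text, and the zone grows as the elevation angle decreases.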

In Situ Experiments
As we introduced before, the collected ground-based GNSS-R data are processed to obtain the calibrated reflectivity and the elevation angles. Each SNR time series (5 min) is averaged for obtaining the reflectivity. In each site, we obtained twelve groups of GNSS-R measurement data and the corresponding SMC measured by the portable rod sensor. It has to be noted that the measurements are intentionally selected before and after rain in bare and relatively smooth fields (the roughness of Agliano is slightly higher than Grugliasco). Moreover, the data with elevation angles that are smaller than 35 degrees were excluded for good signal reception. The obtained calibrated reflectivity (Γ) and the corresponding elevation angle are shown in Figure 10.
It is shown that in each site, the reflectivity obtained after rain (wet condition) is higher than before rain (dry condition), which corresponds to the theoretical knowledge that the GNSS-R reflectivity increases with SMC [1][2][3]. The standard deviations (SD) of reflectivity from each site are also shown in Figure 10. It indicates that the SD of reflectivity in Grugliasco is lower than the values obtained in Agliano, which is consistent with the fact that the roughness of Agliano is slightly higher than Grugliasco.
The ground-based GNSS-R data are used as the testing set to demonstrate and validate the model built by the ML algorithms in the preceding section. With the reflectivity and elevation angles as input, Figure 11 shows the predictions obtained from the RF and SVR models, together with the GNSS-R SMC derived for one of the soil types (e.g., n = 1) of the semi-empirical dielectric model [58] and the measured reference ground-truth SMC.
Figure 11. The soil moisture content (SMC) results are obtained from ML and GNSS-R models, compared with ground-truth measurements.
In Figure 11, the overall good estimations can be seen in these four campaigns. The SMC derived from the GNSS-R model (e.g., n = 1) is close to the reference ground-truth SMC. Meanwhile, the prediction results of RF and SVR are also all close to the GNSS-R model and reference ground-truth SMC, which show the good prediction ability of SMC by using ML models.
The results of the SMC predictions in each campaign are summarized in Table 2, as well as the SMC obtained by using GNSS-R models under different soil types. In particular, the root mean square error (RMSE) obtained is higher in Agliano than in Grugliasco. This can be explained by the fact that the GNSS-R models did not take into account the roughness effects; therefore, the higher roughness in Agliano leads to a higher RMSE in SMC estimation. Moreover, of the two ML models, the SMC obtained from the RF model is much closer to that of the ground truth and the GNSS-R model. RF has a better prediction performance than SVR in GNSS-R SMC estimation, which will also be validated by the airborne experiment in the next subsection.
Comparing the SMC obtained from the regression models, namely RF, SVR, and GNSS-R with different soil types (n), the RF model exhibits the best performance: it is the most stable and accurate in all four campaigns. The GNSS-R models show good results only for certain campaigns or soil types (n). It is worth noting that the GNSS-R model relies on knowledge of soil type, while the RF model does not. Hence, when there is no available information about soil type, simply choosing one particular type of soil in the GNSS-R model to predict SMC is not a good choice. Thus, the RF regression model is particularly valuable when the soil type is unknown or nonuniform, which makes the case for a flexible, efficient RF model with strong data mining ability even clearer. To compare further the behavior of the regression models, Figure 12 shows a scatter plot of the overall predicted SMC against the ground-truth SMC. The results of the GNSS-R model shown in Figure 12 are the average values (see Table 2) obtained from the five soil types of the GNSS-R models. Figure 12 shows the consistency between the predicted data (provided by the RF and SVR ML models and the GNSS-R model, respectively) and the ground-truth data.
Remote Sens. 2020, 12, x FOR PEER REVIEW 13 of 23
The performance metrics of the SMC predictions obtained with the ML models and the GNSS-R models are summarized in Table 3. Of the two ML models, RF performs better than SVR. A correlation coefficient (CC) of r = 0.92 and an RMSE of 0.02 m³/m³ are obtained for RF. For the SVR algorithm, the correlation coefficient is r = 0.82 and the RMSE is 0.04 m³/m³. The SMC obtained from the average of the GNSS-R models shows a correlation coefficient of r = 0.80 and an RMSE of 0.03 m³/m³. The RF prediction performed best and is even slightly better than the average of the GNSS-R models. The reason could be the sufficient training samples and the strong data mining ability of ML, which also shows the high potential of ML predictions in SMC estimation.
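For reference, the two performance metrics reported above (correlation coefficient and RMSE) can be computed as in the following minimal sketch. The SMC values in the example are hypothetical placeholders chosen only to exercise the functions; they are not the values of Table 2 or Table 3.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between predicted and reference values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(predicted, reference):
    """Pearson correlation coefficient between predicted and reference values."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least two")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov / math.sqrt(var_p * var_r)


# Hypothetical SMC values in m^3/m^3 (not measured data).
ground_truth = [0.10, 0.12, 0.28, 0.31]
model_output = [0.11, 0.13, 0.26, 0.33]
print(f"r = {pearson_r(model_output, ground_truth):.2f}, "
      f"RMSE = {rmse(model_output, ground_truth):.3f} m^3/m^3")
```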

SMC Regression Predictions
In this subsection, the prediction performance of the ML and GNSS-R models is also tested and validated with airborne experiments. The measured airborne GNSS-R data were received along some significant routes (PRN4 and PRN32), for which the elevation angles (see Table 4) were high enough for good signal reception. The specular points corresponding to these satellites fell on the lakes' surfaces, which enables us to calibrate the system. Both direct and reflected signals were processed to obtain the signal-to-noise ratio, and a calibration process was performed through the over-water condition to determine the calibration constant c in (3). After obtaining the calibrated reflectivity as shown in Figure 13, the SMC was retrieved using the bistatic GNSS-R method, as described in Section 2.1, by combining (3)-(7) with a soil dielectric model [58]. Additionally, both the RF and SVM methods were applied for the comparison of soil moisture retrieval.
After training and testing the proposed SVR- and RF-based regression models with the simulation data, predictions were made by inputting the measured GNSS-R data. The airborne experiment results retrieved from GNSS-R are represented by the average of the model under different soil types, since the average values gave preferable results in the previous subsection. Here, the simulated data were randomly split into two subsets, a training set and a testing set, so that the model is assessed on data it has not seen during training. The training set is a set of samples (1200) used for learning to create a model. The testing set is a set of examples (800) used only to assess the performance of the trained model. The performance on the test set of simulated data is shown in Figure 14a, and the prediction using the tested SVR model for measured data of routes PRN32 and PRN4 is shown in Figure 14b.
Figure 14a shows that the SVR regression model obtains results similar to the target. The regression performance is also observed in Figure 14b, with the measured airborne data as input. The SMC predicted by the SVR model is highly correlated with the results predicted by the GNSS-R model. In the first and second periods of flying over the lake, the results are better than in the others.
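The random split into 1200 training samples and 800 testing samples can be sketched as follows. The integer sample identifiers stand in for the simulated (reflectivity, elevation angle, SMC) records, and the function name and seed are hypothetical; the text does not state which tool performed the split.

```python
import random


def split_train_test(samples, n_train, seed=0):
    """Shuffle a copy of the samples and cut it into disjoint train/test subsets."""
    if not 0 < n_train < len(samples):
        raise ValueError("n_train must be between 1 and len(samples) - 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for 2000 simulated records (1200 for training + 800 for testing).
simulated_samples = list(range(2000))
train_set, test_set = split_train_test(simulated_samples, n_train=1200)
print(len(train_set), len(test_set))
```

Keeping the two subsets disjoint is what makes the test score an estimate of performance on unseen data; fixing the seed only makes the sketch reproducible.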
The density plot showing the comparison between SM predicted by ML and GNSS-R models for measured data is presented in Figure 15. From Figure 15, good consistency between SM predicted by the SVR model and SM retrieved by GNSS-R can be seen, especially for the densest data. Specifically, a correlation coefficient (CC) of r = 0.98 and an RMSE of 0.08 cm³/cm³ are obtained for PRN32 and PRN4. A similar performance achieved for both PRNs indicates the generalizability of the proposed method.
As mentioned before, the RF prediction model was also built after the training and testing steps with the simulation data. Then, the GNSS-R acquisition data from the flight were used to perform the SMC predictions. Figure 16 shows the testing performance (Figure 16a) and the prediction (Figure 16b) of the RF regression for routes PRN32 and PRN4.
Figure 16a shows that the built RF model has an enhanced regression ability compared with the SVR model shown in Figure 14a. The good regression performance can also be seen in the prediction for the airborne measured data in Figure 16b. The prediction results are nearly the same as the target predicted by the GNSS-R model.
The density plot comparing the SMC predicted by the RF and GNSS-R models is shown in Figure 17. From Figure 17, good consistency between the SM predicted by the RF model and the SM estimated by GNSS-R can be seen for the whole dataset. The performance is better than the result obtained from SVR (Figure 15). A correlation coefficient of r = 0.99 and an RMSE of 0.02 cm³/cm³ are obtained for PRN32 and PRN4. The prediction accuracy of RF clearly outperforms that of SVR, with good generalizability.
The performance metrics of the SMC predictions obtained with RF and SVR for the measured PRN4 and PRN32 data are summarized in Table 5. We conclude that the prediction performance of RF is better than that of the SVR algorithm. This is evidenced by its higher correlation coefficient and lower root mean square error, which were also observed in the previous in situ measurements.

Open Water Classification
The objective of SVM is to find a hyperplane that has the maximum margin to separate the two classes of data points. Many possible hyperplanes could be chosen. With the simulated training set, we built the SVM learning model. In Figure 5, we show the adopted optimal hyperplane (RBF kernel function) that distinctly classifies the data points to achieve the water/soil classification. Then, the processed measured data (Γ, γ) were taken to do the classification. As shown in Figure 18, the results obtained from the data of two satellites (PRN4 and PRN32) are classified into water and soil. In this figure, some data points with high reflectivity (orange and pink points) stand for the presence of lakes in the measurement. Based on the obtained results, the spatial resolution is found to be about 20 m. In an ideal case, the reflectivity of water should be 0.63. Due to some random factors, e.g., waves on the water surface, floating plants, and microorganisms, the reflectivity is not constant. The measured reflectivity of the water surface ranged between 0.53 and 0.76.
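The ideal water reflectivity of about 0.63 can be reproduced approximately with the standard Fresnel reflection coefficients for a smooth surface, combined into the cross-polarized (RHCP incident, LHCP reflected) coefficient. The sketch below is an illustration under stated assumptions: the relative permittivity of 80 for water is an assumed round value, the imaginary part is neglected, and the function name is hypothetical. It is not presented as the exact form of equations (3)-(7).

```python
import cmath
import math


def cross_pol_reflectivity(permittivity, elevation_deg):
    """Power reflectivity for RHCP incidence and LHCP reception on a smooth
    surface, from the horizontal and vertical Fresnel coefficients."""
    sin_e = math.sin(math.radians(elevation_deg))
    cos2_e = 1.0 - sin_e ** 2
    root = cmath.sqrt(permittivity - cos2_e)
    r_hh = (sin_e - root) / (sin_e + root)
    r_vv = (permittivity * sin_e - root) / (permittivity * sin_e + root)
    r_lr = (r_vv - r_hh) / 2.0  # cross-polarized circular coefficient
    return abs(r_lr) ** 2


WATER_PERMITTIVITY = 80.0  # assumed real-valued permittivity of water at L-band
gamma_water = cross_pol_reflectivity(WATER_PERMITTIVITY, 80.0)
gamma_dry_soil = cross_pol_reflectivity(4.0, 80.0)  # assumed dry-soil value
print(f"water: {gamma_water:.2f}, dry soil: {gamma_dry_soil:.2f}")
```

With these assumed inputs the water value lands close to the ideal figure quoted in the text, and a low-permittivity soil gives a much smaller reflectivity, which is the contrast the water/soil classification exploits.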
Considering the characteristics of the SVM method, it is anticipated that the trained SVM model would find a hyperplane between the maximum of the soil samples and the minimum of the water samples. When the elevation angle is around 80°, the maximum soil reflectivity is 0.44, and the minimum reflectivity of the water samples is 0.63 in the training samples, as shown in Figure 5 (here, the reflectivity is an average, and it varies slightly with the satellite elevation angle). In Figure 18, the optimal hyperplane (red line) in the prediction results shows that a reflectivity higher than 0.54 is judged as water; otherwise, it is considered to be soil. This is consistent with the trained SVM model and the theoretical background of [1][2][3].
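The reported decision value of 0.54 can be related to the training extremes with a one-dimensional illustration. For two separable classes on a line, a maximum-margin separator lies halfway between the closest samples of the two classes. The sketch below applies this to the soil maximum (0.44) and water minimum (0.63) near 80° elevation. It is a simplification: the model in the text uses an RBF kernel on two inputs (Γ, γ), and the function names are hypothetical.

```python
SOIL_MAX_REFLECTIVITY = 0.44   # from the training samples near 80 degrees
WATER_MIN_REFLECTIVITY = 0.63  # from the training samples near 80 degrees


def midpoint_threshold(lower_class_max, upper_class_min):
    """Maximum-margin boundary for two separable classes on a line."""
    if lower_class_max >= upper_class_min:
        raise ValueError("classes are not separable on this axis")
    return (lower_class_max + upper_class_min) / 2.0


def classify_surface(reflectivity, threshold):
    """Label a reflectivity value as water or soil using a fixed threshold."""
    return "water" if reflectivity > threshold else "soil"


threshold = midpoint_threshold(SOIL_MAX_REFLECTIVITY, WATER_MIN_REFLECTIVITY)
print(f"threshold = {threshold:.3f}")
print(classify_surface(0.76, threshold), classify_surface(0.30, threshold))
```

The midpoint of 0.535 agrees with the boundary of about 0.54 read from Figure 18, which supports the interpretation given in the text.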
In this case, it can be observed that the majority of data points can be clearly distinguished as water or soil. Notably, the transitions between soil and water, including the soil between the two lakes, are also distinct, except for three outliers (green circles). The prediction accuracies for PRN4 and PRN32 are 99.5% and 99.75%, respectively, as shown in Figure 18. The support vector machine algorithm can thus determine the water/soil regions in the figure. Furthermore, the performance of the prediction also depends on the set of training samples. This means that, in the training step, the range of the dielectric constant and elevation angles of the training samples should be estimated and selected to be as close as possible to that of the area of interest, so that the training samples behave similarly to the testing samples and the trained model achieves a better prediction performance.
The RF algorithm was applied to the simulated dataset to make a comparison with the SVM method. The processed airborne experimental dataset is also used for testing the performance of the classification task. As for the SVM method, the measured data points and the classification results are shown in Figure 19. Four periods of flight over lakes were distinguished with a spatial resolution of around 20 m. The prediction accuracy is 99.75% for both PRN4 and PRN32, as illustrated in Figure 19. In this case, the two reflection routes (PRN4 and PRN32) reach the same classification accuracy, whereas SVM yielded 99.5% and 99.75%, respectively. The RF thus shows a performance similar to that of the SVM algorithm.

Figure 19. The classification (water/soil) result of RF for the route of PRN32 and PRN4.

Discussions
The major focus of current GNSS-R soil moisture research is to build ML models with ongoing knowledge of SMC. However, a comparison between ML models and GNSS-R SMC retrieval using physically based models is rarely presented. The motivation and aim of this paper are to build SMC prediction models using ML, replacing traditional GNSS-R forward modeling methods, to predict soil moisture from GNSS-R observations, especially in the most common case where the distribution of soil texture is nonuniform or unknown. The study demonstrated that the RF is stable and performs well in all fields with different soil textures. Notably, the ML model does not rely on soil type, while the GNSS-R model does. This distinct advantage is useful and significant. The proposed RF model can be used as an alternative to GNSS-R SMC retrieval and could be applied in various fields and applications in an easy and practical way.
The in situ and airborne GNSS-R experiments are investigated in detail. The technical approaches and the observational data of such experiments are rarely presented in the state of the art, which is regrettable, since field experiments are very significant and can be a good tool for discovering and studying the inherent GNSS-R problems. Moreover, many researchers are considering assembling their own equipment and will be interested in conducting GNSS-R experiments, especially on airborne platforms. In this study, we would like to generalize the findings from the ground to the airborne platform. Despite the lack of reference ground-truth data for the airborne experiments, the input vectors are collected from the real surface and are used in the testing stage in order to demonstrate and test the applicability of the established ML models and of traditional GNSS-R retrieval.
In principle, such ML techniques are based on building a regression model between the known SM values from a reference dataset (such as SMAP, or ground-truth SM networks) and the experiment observations, then exploiting this model to perform future SMC estimations. As many samples as possible are needed to achieve the accuracy and stability of a model. In this study, as mentioned earlier, obtaining a batch of reference ground-truth data at every single observational point is almost impossible. Therefore, we built the machine learning models from a simulated dataset to satisfy the requirement for training the ML models. In future work, we will apply the proposed ML methods to a larger area with sufficient ground-based reference SMC (e.g., from the International Soil Moisture Network or other networks) to generalize the findings.
Another possible future work could be investigating the proposed ML and GNSS-R models with representative soil. The acquisition of knowledge about the site is complex. With such knowledge, the GNSS-R model may achieve good results, since it contains the details of the parameters that better represent the physical components of the site. In that case, however, apart from the accuracy of the GNSS-R model, the question arises of whether building ML models is still worthwhile, since the soil composition is already known. Moreover, as we have mentioned, this situation is uncommon in practice: the soil texture is unknown in most GNSS-R experiments, which is where ML has demonstrated its efficiency and simplicity.
The distinct advantage of machine learning algorithms is that they can extract the underlying rules from the dataset itself. The SM retrieval process is highly complex and nonlinear. Here, the ML and traditional SM retrievals are compared. The ML models captured the nonlinear dependencies between the GNSS-R observables (e.g., reflectivity) and the output SMC values directly, without intermediate variables. Their highly efficient modeling ability and strong data mining capability make them perform well in SM retrieval, especially since the soil texture in GNSS-R experiments is commonly unavailable or nonuniform. The results obtained from this study show the significant advantages of ML methods. The RF model does not rely on knowledge of soil type, while the GNSS-R model does. Hence, the RF model could be a very stable and efficient solution for different fields, even with GNSS-R data of different scales. Moreover, from the perspective of ML algorithms, different ML algorithms are good at handling different data relationships. This paper also shows that the RF has a better prediction ability than SVM in solving the SMC estimation problem, which is also one of the significant achievements of our paper.

Conclusions
In this study, two ML methods, i.e., SVM and RF, are adopted for GNSS-R SMC retrieval. Regression results obtained from airborne and in situ data are presented and compared with the traditional GNSS-R retrieval method. Furthermore, the results obtained with the ML models from the in situ experiments at two sites are validated against the reference ground-truth SMC sensor at each site. Overall, good predictions are obtained, and the performance metrics of the applied SVM and RF models are analyzed for the different experiments. In particular, RF shows the best prediction performance compared with the SVM regression model and the GNSS-R model under different soil types, which demonstrates its strong data mining ability and efficiency, especially when the soil type is unknown or nonuniform. It is worth noting that the GNSS-R model relies on knowledge of soil type, while the RF model does not. The good performance of RF, with a higher correlation coefficient and a smaller root mean square error, is noticeable in both the airborne and in situ experiments. In addition, GNSS-R observations are well suited for open water classification. It is feasible to judge the nature of the reflecting surface, such as water or soil, from the two input variables (reflectivity and elevation angle), which indicates the high potential of ML models.
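For reference, the two performance metrics mentioned above, the correlation coefficient and the root mean square error, can be computed as in the following sketch. The function names are hypothetical, and the numbers are toy values chosen only to exercise the functions; they are not results from this study.

```python
import math

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)

def pearson_r(reference, predicted):
    """Pearson correlation coefficient between reference and predicted values."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_r * var_p)

# Toy values for illustration only (not measurements from the experiments).
reference = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.19, 0.33, 0.38]
print(rmse(reference, predicted), pearson_r(reference, predicted))
```

A model with a higher correlation coefficient and a lower root mean square error against the reference SMC is considered the better performer.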
The study shows the prospects of using ML to represent a complex process that is difficult to model with analytical approaches. ML methods can help reveal the complex interactions and also make good predictions, especially since the soil type is unknown or nonuniform in most cases. Therefore, given the complexities and challenges of GNSS-R SMC retrieval, ML regression techniques can be a practical alternative to a purely explicit solution of the physical model. This study shows their feasibility: they can reduce the impact of unpredictable influences and help improve the accuracy of soil moisture retrieval. New experiments will be deployed, and the proposed ML techniques will be further validated. Beyond the flat-surface case, validation with SM experiments in a scattering-dominated scene is meaningful and will be carried out in the future. The ML methods can be used as an alternative to the complex and data-intensive retrieval process and could be applicable in various situations.