Retrieval of Soil Moisture from FengYun-3D Microwave Radiation Imager Operational and Recalibrated Data Using Random Forest Regression

: Three Microwave Radiation Imagers (MWRI) were carried onboard the FengYun-3B/C/D satellites and have collected more than 10 years of data since 2010. To create a robust climate quality of data, MWRI level one data were reprocessed with new calibration. This study evaluates the performance of retrieving global soil moisture from recalibrated MWRI data (RCD) and quantiﬁes the difference of retrieved soil moisture between operational calibration data (OCD) and RCD. Soil Moisture Operational Products System (SMOPS) products from NOAA on four days of different seasons were collocated with MWRI brightness temperatures, and then the collocated data were used for training an algorithm through machine learning. The retrieved soil moisture products using OCD and RCD were evaluated against the independent SMOPS products, in situ networks and SMAP soil moisture product. It is shown that the algorithm from the random forest is suitable for FY-3D recalibrated MWRI data, with a coefﬁcient of determination (R 2 ) of 0.7223, a mean bias of − 0.0062 and an unbiased root mean square difference (ubRMSD) of 0.0476 m 3 m − 3 compared with SMOPS products over the period from 12 July 2018 to 31 December 2019. The difference of retrieved soil moisture using OCD and RCD is spatially heterogeneous. Both temporal and spatial coverage and accuracy of the existing FY-3D operational soil moisture products are signiﬁcantly improved.


Introduction
Soil moisture (SM) plays an important role in the global water and energy cycle between the atmosphere and Earth's surface [1]. Knowledge of the spatiotemporal distribution of land SM can help us better understand surface runoff, terrestrial evapotranspiration, rainfall infiltration, soil biochemical and nutrient cycling processes [2,3]. Furthermore, accurate monitoring of SM at various spatiotemporal scales can benefit a variety of applications, such as drought monitoring [4][5][6], rice mapping [7,8], flood warning and forecasting [9][10][11] and irrigation management [12,13]. Despite its extensive applications, obtaining SM with high temporal and spatial coverage and high precision is challenging.
With the rapid development of remote sensing technology, SM can be retrieved by a variety of algorithms from the measurements made at visible to near infrared, thermal infrared to microwave wavelengths [14][15][16][17]. However, microwave remote sensing, especially from passive technology, is still the most effective method for obtaining SM information at regional or global scales due to the relationship between the observed brightness temperature (TB).(TB) and soil dielectric constant [18][19][20]. According to the band setting, the currently available satellite microwave radiometers are generally divided into two categories. One contains a single L-band (1-2 GHz), such as Soil Moisture and Ocean Salinity (SMOS) and Soil Moisture Active Passive (SMAP). The other contains multiple microwave bands (C-, X-, Ku-, K-, Ka-and W-band) in dual polarization states, such as Advanced Microwave Scanning Radiometer-Earth Observing System (AMSR-E), Advanced Microwave Scanning Radiometer2 (AMSR2) and Wind Satellite (WindSat). Because of its better penetration of vegetation and higher sensitivity to SM, the L-band is considered the best band for retrieving SM [21]. In contrast, the high-frequency channels have low sensitivity to SM even under a small fraction of vegetation coverage, leading to high SM retrieval errors. At present, many algorithms have been proposed for L-band SM retrieval, including the L-band microwave emission of biosphere model [22,23], the single channel algorithm at horizontal or vertical polarization [24,25], the dual channel algorithm [26,27], and the land parameter retrieval model [28]. Based on these algorithms, several global SM products and datasets have been published, such as SMOS, SMAP, AMSR-E, AMSR2 Advanced Scatterometer (ASCAT) and the European Space Agency Climate Change Initiative (ESA CCI). These data have brought great benefits to agricultural and hydrological monitoring, weather forecasting, and climate change research [29][30][31]. However, SM retrieval based on a radiative transfer model using L-band requires a lot of auxiliary information, such as the vegetation optical thickness, vegetation single scattering albedo, soil effective temperature, and soil surface roughness, which limits the global application of these methods [32].
Although the inversion method based on physical models has been developed to estimate SM from the multi-channel radiometers with C-band and/or X-band, greater uncertainties can arise due to their low sensitivity to SM and effects of vegetation and surface roughness [33]. The Chinese FengYun-3D (FY-3D) satellite, launched in November 2017, is equipped with a dual-polarization and multifrequency microwave radiometer (MWRI) with frequencies ranging from 10.65 (X-band) to 89 (W-band) GHz. Because FY-3D MWRI lacks the L-and C-bands, more interference factors need to be considered in retrieving SM when the tau-omega radiative transfer model is used for inversion [34,35]. The lowest frequency at X-band available from FY3-D MWRI provides the most SM information; other higher frequencies contain more vegetation signal. Thus, a statistical model is being developed for SM from MWRI.
Machine-learning techniques were used for SM retrieval at different scales based on passive microwave observations [36][37][38][39][40][41][42][43][44]. The method first establishes relationships between the input data (TB and/or ancillary data) and target SM using a variety of algorithms. The reference SM comes from physical model simulations, ground measurements, satellite products, and European Centre for Medium-Range Weather Forecasts model predictions. Al-Yaari et al. [36] constructed a physically-based regression equation using SMOS level three SM and horizontal and vertical TB observations to derive global SM with SMAP level three TBs with the correlation coefficient (R) between 0.51 and 0.84. Santi et al. [37] estimated SM with both satellite-observed and simulated C-band TB based on an artificial neural network algorithm with the statistical parameters of determination of coefficient (R 2 ) of 0.82, root mean square error (RMSD) of 0.035 m −3 m −3 and bias of 0.09 m −3 m −3 . However, most of these studies are only based on a single band or two bands such as Land C-bands or a combination of C-band and X-band. The potential of using the observed TBs in all channels of spaceborne microwave radiometer to estimate SM has not been fully explored.
To deeply mine the historical data resources of multiple series of domestic satellites such as meteorology, land and ocean and build a long-term climate observation plan from the FengYun satellite series, the level 0 observation data of various sensors have been recalibrated with the support of the National Key Research and Development Plan program. The MWRI has been launched on board FY-3 B/C/D satellites, and it is now very conducive to the establishment of a long-term SM dataset. Due to the uncertainty in characterizing receiver nonlinearity, antenna gains and emission during its in-orbit operation, the current FY-3D MWRI operational calibration data (OCD) need to be recalibrated to ensure the stability of long-term data. Recalibrated data (RCD) have been improved mainly in four aspects: thermal mirror emissivity, antenna back lobe, heat source efficiency and receiver nonlinearity [45]. Therefore, it is necessary to evaluate the accuracy difference of SM retrieval based on these two data sources before constructing a long-term global SM dataset.
The objective of this study is to retrieve global SM with the FY-3D recalibrated MWRI data based on a machine-learning algorithm and to evaluate the accuracy difference of SM inversion using FY-3D MWRI OCD and RCD. The paper is structured as follows: In Section 2, the details of our method and the preprocessing of data are introduced. The retrieved soil moistures are analyzed and validated in Section 3. The final conclusion is given in Section 4. GHz. FY-3D operates in a sun-synchronous polar orbit with the ascending node at 14:00 p.m. and descending node at 2:00 a.m. local solar time. Since the FY-3D MWRI level two SM product (FY-3D L2) was available starting from 12 July 2018, the MWRI level one OCD and RCD provided by the National Satellite Meteorological Center at CMA from 12 July 2018, to 31 December 2019, were used for SM retrieval. The fields of level one data used in this study include TBs of 10 channels, latitude, longitude, land cover type obtained from the International Geosphere-Biosphere Programme classification scheme, surface elevation (digital elevation model, DEM) and quality control information.

FY-3D MWRI Data
Other auxiliary variables include microwave polarization difference index (MPDI) and soil porosity. The MPDI is defined as the ratio of the difference between vertically and horizontally polarized TBs to their sum [46,47]. As an indicator of vegetation, MPDI was selected because it is more sensitive to the total water content of the vegetation per unit area and independent of the surface temperature [48]. Soil porosity, as one of the most important physical attributes, refers to the fraction of the total soil volume that is taken up by the pore space [49], and its size determines how well liquids, gases and heat can be stored and transmitted within the soil matrix. A greater porosity often indicates greater storage and transmission ability. Soil porosity obtained from ESA CCI ancillary data which are derived by applying soil water characteristic equations of Saxton and Rawls (2006) [50] to texture data of the Harmonized World Soil Database v1. was used in this study.
Data cleaning needs to be performed for MWRI level one data before entering the random forest model. First, the fractional coverage of land within MWRI field of view (FOV) were defined by land-sea mask data. The observed TBs affected by radio frequency interference (RFI) were then removed by using the spectral difference method [51]. The FOVs with abnormal TB and surface types of forest, closed shrublands, water body, urban and built up, snow and ice and permanent wetlands were also removed. Finally, the properties from the MWRI level one were resampled to 0.25-degree grids for matching the dependent variable using the nearest neighbor search algorithm. To evaluate the performance of the random forest algorithm for SM estimation, the FY-3D MWRI level two operational SM product was used for comparison.

NOAA Multi-Sensor Soil Moisture Products
NOAA Soil Moisture Operational Products System (SMOPS) products were used as the training target dataset in the machine-learning model due to its high accuracy and temporal resolution as well as high spatial coverage [52]. SMOPS provides a seamless SM map over global land from six satellites, including Global Precipitation Measurement (GPM), SMAP, Global Change Observation Mission 1st-Water (GCOM-W1), SMOS, Meteorological Operational A (Metop-A), and Metop-B. The global SM maps were generated in daily intervals with the most recent 24 h of SM from multiple retrieval algorithms and mapped with a cylindrical projection on a 0.25 × 0.25 degree grid. For each grid point of the map, the output includes SM values as a percentage (vol/vol) of the surface (top 1-5 cm) soil layer with associated quality information and metadata. The daily product is available in netCDF file format.

Validation Data
Three kinds of SM data were used for verification, including the independent SMOPS SM products, SM observed by China soil moisture automatic station and SMAP SM product. SM from both automatic stations and SMAP were resampled to 0.25-degree grids using the same method as above before validation. The daily mean in situ SM observations at 10-cm depth from 1924 stations were used to validate the estimated SM over the same time period of MWRI level one. SM values outside the range of 0-0.5 m −3 ·m −3 were removed. To enhance the representativeness of station observation SM as much as possible, gridding was only carried out for the grids with the number of SM observation values greater than three. The SMAP SM product used for validation was SMAP Enhanced L3 Radiometer Global Daily 9 km EASE-Grid Soil Moisture V004. The time period of all validation data covers from 21 July 2018 to 31 December 2019.

Random Forest Model Construction
The random forest (RF) algorithm was used to model the relationship between TBs, auxiliary variables and target SM due to its popularity and flexibility [53]. When establishing each tree, the RF algorithm randomly and bootstrapping-samples the rows and columns of the training datasets, which makes it not as easy to overfit as the decision tree algorithm. We used the "RandomForest" package developed with R language to construct the RF model. In RF regression, three main hyperparameters need to be optimized during training: ntree (the number of trees), node size (the minimal size of the leaf node) and mtry (the number of features sampled). According to our optimization result, the values of these three parameters were set as 500, 1, and 1/3 of the total number of the variables, respectively.
To reduce overfitting, it is very important to select a representative training dataset to estimate SM. According to the analysis of [54], more random days cannot significantly improve the estimation accuracy of SM. Therefore, MWRI observations, SMOPS and auxiliary data from four random days (31 January, 30 April, 31 July, and 31 October 2018) were used as training datasets to represent seasonal changes in SM as much as possible in this study. The detailed flowchart for the RF model construction and validation is shown in Figure 1.

Evaluation Indicators
To quantitatively evaluate the SM estimated based on OCD and RCD, several statistical metrics were adopted, including the correlation coefficient (R), mean difference (Bias), coefficient of determination (R 2 ) and unbiased root mean square difference (ubRMSD). The formulas for these indicators are as follows: where n is the number of samples and R i and E i are the ith reference and estimated SM, respectively. R and E are the average of the reference and estimated SM, respectively.

Characteristics of Training Dataset from OCD and RCD
The training dataset was obtained through collocating daily SMOPS and MWRI level one data and soil porosity. Due to the scan pattern offset, the total number of training samples was different between OCD and RCD. Therefore, we resampled the data in space to match the sample size of the two datasets. Finally, 601,974 samples were obtained as a training dataset. The distribution of independent variables in the training datasets is shown in Figure 1. The difference in TBs changes with frequencies. The TBs of all the channels have abnormal values, especially in the low value area. MPDI at all frequencies have the characteristics of unimodal distribution. MPDI at 10 and 18 GHz from OCD are larger than that from RCD, while the difference in MPDI at 23 GHz is small. Abnormal values from MPDI also occur, especially in the high value area. To draw with soil porosity in the same axes, the value of DEM divided by 10,000 is displayed. Due to the preprocess of gridding, the distribution of DEM between OCD and RCD has negligible differences. The statistical distribution of soil porosity between OCD and RCD is almost the same (Figure 2). The correlation coefficients between independent variables in OCD and RCD are shown in Figure 3. TBs between different frequencies have a high positive correlation coefficient with the highest R of 0.98 between 23V and 36V and the lowest R of 0.55 between 10H and 89V for RCD. The correlation coefficient decreases with the increase of frequency interval. The correlation coefficients between vertical polarization TBs are greater than that between horizontal polarization TBs. TBs of different frequencies with horizontal polarization have a stronger negative correlation with MPDI than that with vertical polarization. MPDI between different frequencies have high correlation coefficients. Both DEM and soil porosity have low correlation coefficients with other variables. The relationships of independent variables in OCD are similar with that in RCD (Figure 3).
To further analyze the differences of independent variables in the OCD and RCD training datasets, the average bias was calculated by the variable values of the training datasets of RCD minus that of OCD, and the results are shown in Figure 4. Except for TBs at 18V and 23H, the TBs at other channels have a positive bias, and the maximum positive bias occurs in the channel of 10H with a mean bias of 4.0221 k. The mean biases of MPDI at 10 GHz and 18 GHz are negative, while the mean bias of MPDI at 23 GHz is positive.

Soil Moisture Retrieved from OCD and RCD
Because of the advantages of fast computation and unbiased estimation, the out of bag error is commonly used to test performance of a RF model [55,56]. The out of bag error of the RF model trained by 500 trees was 0.0398 m 3 ·m −3 . To reflect the seasonal variations in SM, the global distribution map of retrieved SM of the four days in 2019 are displayed based on OCD and RCD (Figures 5 and 6). Compared with the existing operational SM product (FY3D L2), the SM obtained by the random forest algorithm has higher spatial and temporal coverage and therefore can enhance the application of SM in the fields of atmosphere, hydrology and environment. However, there are some missing retrievals for SM maps over the northern hemisphere, such as Russia and Canada, particularly in January. There are also some missing retrievals in the tropical region, particularly in the Amazon Forest and Southeast Asia, in all months. This is because the SM on these types of land cover is naturally difficult to retrieve based on microwave remote sensing technology including snow, ice, forest and wetland [57,58].  Over the northern hemisphere, SM increases from winter to summer and begins to decrease in autumn, especially for the North American and Eurasian continents. For the equatorial region, a warm and humid climate brings an increasing trend of SM from the winter to summer seasons. For the southern hemisphere, the seasonal variation in SM is the opposite, particularly for the southern part of the African continent and Australia. On the Australian continent, higher SM can be seen during January, especially over the north coast, when compared with the July map. The southern part of the African continent has similar trends as the Australian continent. SM in desert areas, such as the Sahel, is low throughout the year.
From the SM inversion results (Figures 5 and 6), it can be seen that the temporal and spatial distribution between SM retrieved by OCD and RCD is very similar. To quantitatively show the difference between SM retrieved by OCD and RCD, the SM daily change rate was calculated by subtracting the SM retrieved using OCD from the SM retrieved using RCD and then dividing it by the SM retrieved using OCD. The daily change rate of SM in most areas on the four days was between −20% and 20%; in a few areas, it was between −20% and −60% and 20 to 60% (Figure 7). In particular, on 31 January and 31 July, the variation rate of SM in a few grids was between 60% and 80%. Therefore, the relative change rate of SM is different in time and space. For most regions, the change rate is within 20%.

Verification Based on SMOPS Data
The statistical indicators computed using SMOPS products are shown in Figure 8. The accuracy of SM retrieved based on RCD is better than that of the FY3D level two SM product with a mean bias mostly within 0.02, R 2 greater than 0.6 and ubRMSD lower than 0.06. The inversion accuracy of SM between OCD and RCD is similar. The statistical metrics are listed in Table 1

Verification Based on Automatic Station Data
The statistical indicators computed using in situ SM are shown in Figure 9. The number of matched soil observations in each grid point greater than 30 was used to calculate the statistical metrics. The metrics of the grid with a p value of the correlation coefficient less than 0.01 are displayed. As we can see from the distribution of the R, most of the grids show a positive correlation, and a few grids show a negative correlation. The values and spatial distribution of statistical indicators are basically similar using OCD and RCD. The statistical results of the metrics are listed in Table 2. OCD results in a slightly higher average R value (0.3025) than RCD (0.2952). In contrast, the average of mean bias and ubRMSD from OCD (0.0334, 0.0505) is slightly lower than those from RCD (0.0340, 0.0507). The verification of satellite-based SM using in situ SM is a necessary condition for ensuring the quality of information before being used for various studies. However, due to the differences in penetration depth between ground-based and satellite observations and their spatial representation, it is difficult to obtain more realistic validation results [59,60]. To minimize the representativeness errors, all the stations falling within a particular satellite field of view were (after quality control) averaged by calculating the arithmetic mean [61]. Finally, the mean values of SM observed at three to five automatic stations were used to validate SM retrieved by OCD and RCD.  The statistical indicators computed using SMAP SM products are shown in Figure 10. Overall, the change trends of the statistical indicators of SM retrieved using OCD and RCD are basically the same. Due to the lack of SMAP SM products from 20 June to 23 July of 2019, each statistical index is discontinuous. The statistical results of the metrics are listed in Table 3. There is little difference in the statistical results of indicators. SM retrieved based on the random forest algorithm has an average R 2 of 0.5421, average mean bias of 0.0514 and average ubRMSD of 0.0653.  The machine-learning technique allows the retrieval of SM from both active and passive satellite systems with an accuracy of approximately 0.05 m 3 ·m −3 of SM or better [37,[62][63][64][65]. Our result is in agreement with previous studies. In addition to using polarized TB in different channels, these studies also use auxiliary data, such as the normalized difference vegetation index (NDVI) and land surface temperature (LST). The difference between this study and previous studies is that except for soil porosity, most of the auxiliary data such as DEM and land cover type are included in the FY-3D MWRI level one data. Thus, the implementation of the random forest algorithm using FY-3D MWRI data is more convenient.

Conclusions
The performance of global SM retrieval using recalibrated FY-3D MWRI data was evaluated using a multivariable random forest method. The SM retrieved with RCD has a high consistency and accuracy with SMOPS products (R 2 = 0.7223, mean bias = −0.0062 and ubRMSD = 0.0476 m 3 m −3 ), the in situ data (R = 0.2952, mean bias = 0.034 and ubRMSD = 0.0507 m 3 m −3 ) and SMAP products (R 2 = 0.5421, mean bias = 0.0514 and ubRMSD = 0.0653 m 3 m −3 ). The seasonal distribution of SM at the global scale is reasonable. The accuracy difference of SM retrieved using OCD and RCD was also evaluated using SMOPS products, in situ networks from automatic soil moisture observation stations in China and SMAP SM products. The recalibration of TB data from the FY-3D MWRI had an impact on global SM retrieval with the characteristics of spatial heterogeneity; however, the change rate on most areas is within 20%. The multivariable retrieval method could be applied to all the MWRIs onboard FY-3 satellites since 2010 due to their similar characteristics, which could support the development of a long-term SM product. The newly launched FY-3E satellite carries a microwave scatterometer (WindRad) with a 17:30/05:30 ascending/descending orbit that is operated by the Chinese National Satellite Meteorological Center (NSMC). The WindRad measures the backscattering at C-band (5.3 GHz) and Ku-band (13.265 GHz) with HH and VV polarization for both frequencies. The antenna scans conically at a viewing angle of 36-45 degrees for C-band and 37-43 degrees for Kuband. Random forest can also be used for retrieving SM based on WindRad Observations. This will provide the SM product at dawn and evening. It is helpful to study the diurnal variation in SM when combined with the SM product derived from MWRI data from FY-3B to 3D.