Reconstructing Satellite-Based Monthly Precipitation over Northeast China Using Machine Learning Algorithms

Attaining accurate precipitation data is critical to understanding land surface processes and global climate change. The development of satellite sensors and remote sensing technology has resulted in multi-source precipitation datasets that provide reliable estimates of precipitation over ungauged areas. However, gaps exist over high-latitude areas due to the limited spatial extent of several satellite-based precipitation products. In this study, we propose an approach for the reconstruction of the Tropical Rainfall Measuring Mission (TRMM) 3B43 monthly precipitation data over Northeast China based on the interaction between precipitation and surface environment. Two machine learning algorithms, support vector machine (SVM) and random forests (RF), are implemented to detect possible relationships between precipitation and normalized difference vegetation index (NDVI), land surface temperature (LST), and digital elevation model (DEM). The relationships between precipitation and geographical location variations based on longitude and latitude are also considered in the reconstruction model. The reconstruction of monthly precipitation in the study area is conducted at two spatial resolutions (25 km and 1 km). The validation is performed using in situ observations from eight meteorological stations within the study area. The results show that the RF algorithm is robust and not sensitive to the choice of parameters, while the training accuracy of the SVM algorithm fluctuates considerably depending on the parameter settings and month. The precipitation data reconstructed with RF show strong correlation with in situ observations at each station and are more accurate than those obtained using the SVM algorithm. In general, the accuracy of the estimated precipitation at 1 km resolution is slightly lower than that of data at 25 km resolution. The estimation errors are positively related to the average precipitation.


Introduction
Precipitation is a significant factor affecting surface drought and wetness conditions, ecosystem health, and regional environmental change [1,2]. Precipitation is one of the basic variables observed at meteorological stations, and ground stations have long been the main means of observing it. Today, with improvements in observation technology, ground precipitation stations are automated and form the foundation of the precipitation observation system [3,4]. However, an observation site only reflects precipitation at a limited number of discrete points. An individual site can only represent the precipitation within a certain radius around its location, especially in complex terrain, where precipitation is influenced by local environmental factors. Acquiring precipitation observations over mountainous and underdeveloped regions therefore remains difficult because the rain gauge network is sparse.
The development of satellite technologies revolutionized the observation and acquisition of precipitation information, and enriched the data sources for precipitation observations [5][6][7][8][9][10][11][12][13][14]. Remote sensing has become a main tool for the estimation of precipitation, and several satellite-based precipitation datasets have been developed [15], such as the Global Precipitation Climatology Project (GPCP) [12], Global Satellite Mapping of Precipitation Project (GSMaP) [16], Climate Prediction Center (CPC) Morphing method (CMORPH) [17], Meteosat Visible and Infrared Imager (MVIRI) [18], and the Tropical Rainfall Measuring Mission (TRMM) [19,20]. Among those remote sensing precipitation datasets, the TRMM satellite (equipped with the first active microwave sensor dedicated to detecting precipitation) and the TRMM precipitation datasets have provided reliable data on water cycles for ecological models and environmental and climate change studies [21].
However, acquiring precipitation datasets over regions without satellite coverage is still challenging. For example, the spatial coverage of existing global satellite-based TRMM precipitation datasets is 50° S to 50° N. The question therefore arises whether the application of satellite precipitation datasets in global climate modeling may be limited by their spatial extent. Hydrologic modeling applications might also be restricted over high-latitude areas, especially in areas with sparse in situ networks for precipitation measurements. Various methods, including multiple linear regression, machine learning, time series analysis, and interpolation techniques, have been used to fill gaps in climatic variables such as streamflow, total water storage changes, air temperature, and soil moisture [22][23][24]. However, they have not been widely used for the reconstruction of satellite-based precipitation. Due to the temporal and spatial complexity of precipitation itself and its complex relationship with other influencing factors, simple fitting algorithms or image restoration methods based on the image alone are difficult to use, and the resulting products are unreliable. Therefore, auxiliary information might be introduced into the reconstruction process.
Selecting appropriate land surface characteristics that are strongly related to precipitation is the primary issue. Those datasets must be easily accessible and have wide coverage. Previous studies acknowledged the response of vegetation to precipitation [25][26][27][28]; vegetation can also influence the development of moist convection both locally and on scales of ten to thousands of kilometers [2,29]. Multiple studies reported the co-variability between surface temperature and precipitation [30,31]. When the ground is wet, more energy likely contributes to evaporation. If the ground is wet due to rainfall, the associated clouds block the sun, reducing the energy and temperature. Furthermore, high rates of evaporation could occur directly from bare soil after periods of rain, further suppressing the sensible heat and surface temperature [32,33]. Therefore, remote sensing vegetation indices (e.g., the Normalized Difference Vegetation Index, NDVI) and land surface temperature (LST) are often used to monitor the dry and wet state of the surface [25,27,28,34,35]. Moreover, topography can also have a great impact on the regional atmospheric circulation and the spatial pattern of precipitation [36,37]. In theory, an increase in elevation could increase the relative humidity of air masses by expanding and cooling the rising air masses, resulting in precipitation [38]. Globally covered NDVI, LST, and digital elevation model (DEM) products with fine spatial resolution provide reliable remote sensing data. These data products are all released online and are easily accessible.
The purpose of this study is to develop an approach for the estimation of precipitation over regions that are not covered by satellite-based precipitation datasets. Based on the relationship between precipitation and NDVI, LST, and DEM, we constructed estimation models using machine learning algorithms, and conducted a case study over regions in Northeast China that are not covered by TRMM 3B43 precipitation data.

Study Area
China is located in the eastern part of Asia at the western Pacific coast, between 20°13′ N and 53°34′ N, 73° E and 135°05′ E, covering a total area of 9.6 million km². China's climate is mainly dominated by dry seasons and wet monsoons [39,40], which lead to pronounced precipitation and temperature differences between winter and summer [41]. The study area, with a total area of 175,546 km², is located in the northern part of Northeast China (Figure 1). The region is in the high latitudes and belongs to the cold temperate zone. Precipitation data from eight meteorological ground stations in the study area are used for validation (Figure 1b). Based on in situ observations, the average annual precipitation in the study area is 448.5 mm. The wettest month is July (monthly average precipitation is 123.7 mm), while the driest month is February (monthly average precipitation is 4.2 mm).
Remote Sens. 2017, 9, 781

Data Resources
The remote sensing precipitation data were obtained from the Tropical Rainfall Measuring Mission (TRMM), which is a research satellite launched in 1997 for monitoring precipitation over the tropical and sub-tropical regions [19]. The original spatial resolution of precipitation data obtained from TRMM is 0.25° × 0.25°, and the spatial coverage of the products is 50° N to 50° S. The datasets used in this study are monthly precipitation data from version 7 of the TRMM 3B43 product (TRMM 3B43 V7) for 2003, 2006, and 2009 (http://pmm.nasa.gov/TRMM/trmm-instruments). The original data were projected to the Albers Conical Equal Area projection and resampled to a spatial resolution of 25 km.
The monthly NDVI and land surface temperature (LST) datasets were obtained from Moderate Resolution Imaging Spectroradiometer (MODIS) products acquired by Terra (https://lpdaac.usgs.gov/). These products, originally in sinusoidal projection, were re-projected to the Albers Conical Equal Area projection. The spatial resolution of the data was maintained at 1 km.

The DEM data were obtained from the NASA Shuttle Radar Topographic Mission (SRTM) (http://srtm.csi.cgiar.org/) [42]. DEM data with 30 m and 90 m spatial resolutions are available. The data with 90 m spatial resolution were downloaded considering the spatial scales of this study. The data were then resampled to 1 km by averaging the values of all pixels within each 1 km pixel.

Precipitation Reconstruction Algorithm
The basic idea of the reconstruction method in this study is to build estimation models using samples extracted from available TRMM 3B43 pixels; the models are established based on the relationship between precipitation and land surface characteristics. In this study, we considered NDVI, daytime LST (LST_day), nighttime LST (LST_night), day-night LST difference (LST_DN), and DEM as land surface characteristics. The land surface characteristics datasets over the study area were then used as input for the established model to estimate the precipitation over the uncovered region. This method is based on the assumption that the machine learning algorithm can simulate the relationship between precipitation and land surface characteristics with high accuracy. This simulation model can then be used to estimate the precipitation over regions that are not covered by the precipitation datasets.
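The idea of training on covered pixels and predicting on uncovered ones can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the predictor values are random, the "precipitation" is an invented function of NDVI and elevation, and the 50° N cut-off simply mimics the TRMM coverage limit. It is not the configuration used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_pixels = 600

# Synthetic stand-ins for the predictor layers (one row per pixel).
ndvi = rng.uniform(0.0, 0.9, n_pixels)
lst_day = rng.uniform(270.0, 310.0, n_pixels)             # K
lst_night = lst_day - rng.uniform(5.0, 20.0, n_pixels)    # K
lst_dn = lst_day - lst_night                              # day-night difference
dem = rng.uniform(0.0, 2000.0, n_pixels)                  # m
lat = rng.uniform(20.0, 54.0, n_pixels)
lon = rng.uniform(73.0, 135.0, n_pixels)
X = np.column_stack([ndvi, lst_day, lst_night, lst_dn, dem, lat, lon])

# Invented "precipitation" (mm); the functional form is for the demo only.
precip = 150.0 * ndvi + 0.02 * dem + rng.normal(0.0, 5.0, n_pixels)

# Pixels at or south of 50 degrees N play the role of satellite-covered pixels.
covered = lat <= 50.0

# Fit on covered pixels only.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[covered], precip[covered])

# Estimate precipitation where no satellite value is available.
reconstructed = model.predict(X[~covered])
print(reconstructed.shape)
```

Because the uncovered pixels lie outside the latitude range seen in training, a sketch like this also makes the extrapolation assumption behind the method explicit.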
Relationships between precipitation and surface conditions may vary widely from one environment to another and from one region to another. Therefore, to establish a robust reconstruction model, a sufficiently large set of training samples was required. In addition, to account for the spatial heterogeneity of precipitation, we included geo-locations (latitude and longitude) as independent variables. In this study, we used the TRMM data covering the China land area as samples of the dependent variable; land surface temperature, NDVI, DEM, and geo-locations (latitude and longitude) were input as independent variables. The process of the approach can be described as follows:
(1) In regions covered with snow, water bodies, and desert, the NDVI values usually remain below 0.0. To eliminate the influence of snow and water bodies, the threshold NDVI < 0.0 was used to identify and remove snow and water body pixels from the original monthly NDVI images.
(2) LST_DN was calculated by subtracting LST_night from LST_day; NDVI_1km, DEM_1km, LST_day-1km, LST_night-1km, and LST_DN-1km were resampled to 25 km resolution using an average method. The geographical coordinates of the center of each 25 × 25 km grid cell were extracted.
(3) The relationship between the resampled independent variables and the TRMM 3B43 V7 precipitation data was established using machine learning algorithms. In this study, we tested two machine learning algorithms for simulating the monthly precipitation: support vector machine (SVM) and random forests (RF).
(4) The reconstruction of monthly precipitation in the study area was conducted at two scales. First, the resampled independent variables (NDVI, DEM, LST_day, LST_night, and LST_DN) with 25 km spatial resolution were input into the established model, yielding reconstructed results at 25 km resolution. Second, precipitation at 1 km spatial resolution was simulated by applying the established model to the variables at their original 1 km spatial resolution.
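Steps (1) and (2) above can be sketched with NumPy. This is a minimal illustration under stated assumptions: the function names `block_mean` and `prepare_predictors` are hypothetical, the rasters are assumed to be co-registered 2-D arrays on a 1 km grid, and the 25 km cells are assumed to align with blocks of 25 × 25 pixels. The actual resampling tool used in the study is not specified.

```python
import numpy as np

def block_mean(layer, factor):
    """Average non-overlapping factor x factor blocks, ignoring NaN pixels."""
    rows, cols = layer.shape
    rows -= rows % factor  # drop incomplete blocks at the edges
    cols -= cols % factor
    blocks = layer[:rows, :cols].reshape(rows // factor, factor,
                                         cols // factor, factor)
    # A block made up entirely of NaN yields NaN (NumPy emits a warning).
    return np.nanmean(blocks, axis=(1, 3))

def prepare_predictors(ndvi, lst_day, lst_night, dem, factor=25):
    """Mask NDVI < 0, derive the day-night LST difference, aggregate all layers."""
    ndvi = np.where(ndvi < 0.0, np.nan, ndvi)   # step (1): drop snow/water pixels
    lst_dn = lst_day - lst_night                # step (2): day-night difference
    layers = {"ndvi": ndvi, "lst_day": lst_day, "lst_night": lst_night,
              "lst_dn": lst_dn, "dem": dem}
    return {name: block_mean(layer, factor) for name, layer in layers.items()}

# Tiny synthetic 50 x 50 "1 km" rasters, giving a 2 x 2 grid of 25 km cells.
ndvi = np.full((50, 50), 0.5)
ndvi[0, 0] = -0.2                               # one water/snow pixel
lst_day = np.full((50, 50), 290.0)
lst_night = np.full((50, 50), 280.0)
dem = np.full((50, 50), 100.0)

coarse = prepare_predictors(ndvi, lst_day, lst_night, dem)
print(coarse["ndvi"].shape)
```

Using a NaN-aware mean keeps a single masked pixel from invalidating the whole 25 km cell, which is one reasonable reading of "an average method".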

Support Vector Machine (SVM)
The SVM is a widely used machine learning algorithm for classification and regression problems, and has been successfully applied in different fields such as soil moisture estimation [43], impervious surface estimation [44], and biophysical parameter estimation from remote sensing data [45]. The original SVM algorithm was developed by Vladimir Vapnik and his co-workers in the early 1990s for classification problems, and was then extended to the case of regression [46,47]. The basic idea of the SVM algorithm is derived from optimization theory: the input variables are mapped into an m-dimensional feature space, in which a hyper-plane with maximal margin is constructed. The maximal margin is derived by solving the constrained dual problem:

$$\max_{\alpha,\alpha^{*}} \; -\frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L}(\alpha_i-\alpha_i^{*})(\alpha_j-\alpha_j^{*})\,k(x_i,x_j) \;-\; \varepsilon\sum_{i=1}^{L}(\alpha_i+\alpha_i^{*}) \;+\; \sum_{i=1}^{L} y_i(\alpha_i-\alpha_i^{*})$$

subject to

$$\sum_{i=1}^{L}(\alpha_i-\alpha_i^{*}) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^{*} \le C, \qquad i = 1, \ldots, L,$$

where $x_i$ are the independent variables, $y_i$ are the dependent variables, $C$ is the capacity (cost) parameter, and $L$ is the sample size. The approximating function is given by

$$f(x) = \sum_{i=1}^{N}(\alpha_i-\alpha_i^{*})\,k(x,x_i) + b,$$

where $\alpha$ and $\alpha^{*}$ are Lagrange multipliers, $b$ is the "bias" term, and $k(x, x_i)$ is the kernel function that measures non-linear dependence between the two input variables $x$ and $x_i$. The $x_i$ are the "support vectors", and $N$ (usually $N \ll L$) is the number of selected data points or support vectors, corresponding to values of the independent variable that are at least $\varepsilon$ away from actual observations. In the dual formulation, the training patterns enter only through dot products of two vectors, which can be evaluated for vectors of any dimension; this is regarded as the advantage of the dual formulation and is what allows SVM to handle non-linear function approximation.
Selecting a suitable kernel function and kernel parameters is an important step in SVM modeling. In this study, we selected the Radial Basis Function (RBF) as the kernel. The RBF kernel function is given by

$$k(x, x_i) = \exp\left(-\gamma \lVert x - x_i \rVert^{2}\right),$$

where $\gamma$, specified by the keyword gamma, must be greater than 0. Thus, when training an SVM with the RBF kernel, two parameters must be considered: C and gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. The gamma parameter defines how much influence a single training example has: the larger gamma is, the closer other examples must be to be affected. Interested readers are referred to Kalra and Ahmad [48] for an illustration of the working mechanism and an example of the SVM technique.
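To make the roles of the RBF kernel, the Lagrange multipliers, and the bias term concrete, the sketch below fits scikit-learn's `SVR` on synthetic data and then rebuilds its predictions by hand as a kernel-weighted sum over the support vectors. The data, the helper `rbf_kernel`, and the values of C, gamma, and epsilon are assumptions chosen for illustration, not the settings used in the study.

```python
import numpy as np
from sklearn.svm import SVR

def rbf_kernel(a, b, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows in a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=200)

gamma, C = 2.0 ** 1, 2.0 ** 3   # illustrative values on a power-of-two grid
model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=0.1).fit(X, y)

# dual_coef_ holds the signed multiplier differences for the support vectors,
# and intercept_ is the bias term b.
K = rbf_kernel(model.support_vectors_, X[:5], gamma)
manual = (model.dual_coef_ @ K + model.intercept_).ravel()

print(np.allclose(manual, model.predict(X[:5])))
print(len(model.support_), "support vectors out of", len(X), "samples")
```

Only the support vectors contribute to the sum, which is why the fitted model can be much smaller than the training set.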

Random Forests (RF)
Random Forests (RF), a non-parametric ensemble learning algorithm for regression and classification, has been increasingly applied and has been reported to yield high accuracy and be robust to outliers [49]. The RF, which was proposed by Breiman [50], is a combination of tree predictors such that each tree depends on the values of a randomly chosen subset of input variable vectors, sampled independently and with the same distribution for all trees in the forest [50]. The tree predictor is based on the classification and regression trees (CART) algorithm [51]. The basic idea of CART is to construct a tree-like graph or model of decisions and their possible consequences. It generates relatively homogeneous subgroups by recursively partitioning the training dataset so as to maximize the variance between groups of independent and dependent variables. In each of the terminal nodes of the tree, a simple and accurate model is built to explain the relationship between the independent and dependent variables. The RF regression algorithm can be briefly described as follows:
(1) ntree (number of trees) sample sets are randomly drawn from the original training sample set with replacement. Each sample set is a bootstrap sample, and the elements that are not included in the bootstrap sample are termed "out-of-bag data" (OOB) for that bootstrap sample.
(2) For each bootstrap sample, an un-pruned regression tree is grown with the modification that, at each node, a random subset of the variables is selected and the best split is chosen from among those variables.
(3) Predictions for new samples can be made by averaging the predictions from all the individual regression trees:

$$\hat{f}(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x),$$

where $N$ is the number of trees and $f_i(x)$ is the prediction from each individual regression tree.
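The averaging in step (3) can be checked directly against scikit-learn's `RandomForestRegressor`, whose prediction is the mean of its individual trees. The data and the parameter values below (`n_estimators=50`, `max_features=2`) are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = 10.0 * X[:, 0] + 5.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(
    n_estimators=50,    # number of bootstrap samples / trees (ntree)
    max_features=2,     # size of the random variable subset tried at each node
    oob_score=True,     # score the forest on the out-of-bag data
    random_state=0,
).fit(X, y)

# Step (3): average the predictions of the individual regression trees.
tree_predictions = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual = tree_predictions.mean(axis=0)

print(np.allclose(manual, forest.predict(X[:5])))
print("out-of-bag R^2:", forest.oob_score_)
```

The out-of-bag score gives an accuracy estimate without a separate hold-out set, because each tree is evaluated only on samples left out of its own bootstrap sample.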

Performance of Regression Algorithms
The selected algorithms are openly accessible, easy to use, and clearly documented elsewhere. Both are implemented in scikit-learn, a Python package integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems [52]. The performance of the RF- and SVM-based models largely depends on their parameterization, so the choice of optimal parameters is important. In practice, we conducted experiments covering a broad range of parameter combinations for each algorithm [50][51][52][53][54][55][56][57] (Table 1). A grid search algorithm was implemented to find the optimal parameters for each algorithm. The grid search exhaustively considers all parameter combinations with a cross-validation scheme. We used a k-fold strategy, which divides all the samples into k groups of equal size, called folds. The prediction function is learned using k − 1 folds, and the fold left out is used for testing. In this study, we used the default value k = 3.
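A grid search with 3-fold cross-validation can be sketched with scikit-learn's `GridSearchCV`. The parameter grids below are small illustrative assumptions, not the ranges listed in Table 1, and the data are synthetic. The number of folds is passed explicitly because the library default has differed between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(150, 3))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=150)

# Hypothetical grids; every combination is evaluated exhaustively.
svm_grid = {"C": [2.0 ** k for k in (-2, 0, 2, 4)],
            "gamma": [2.0 ** k for k in (-6, -2, 2, 6)]}
rf_grid = {"n_estimators": [20, 60, 100]}

# k = 3 folds: train on two folds, test on the one left out, rotate.
cv = KFold(n_splits=3, shuffle=True, random_state=0)

svm_search = GridSearchCV(SVR(kernel="rbf"), svm_grid, cv=cv, scoring="r2")
svm_search.fit(X, y)

rf_search = GridSearchCV(RandomForestRegressor(random_state=0), rf_grid,
                         cv=cv, scoring="r2")
rf_search.fit(X, y)

print("SVM best parameters:", svm_search.best_params_)
print("RF best parameters:", rf_search.best_params_)
```

The spread of `mean_test_score` in `cv_results_` across the grid is one way to quantify how sensitive each algorithm is to its parameters.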
The SVM and RF models were trained using the same datasets at 25 km spatial resolution. All available pixels covering the China land area were used as samples, amounting to 14,893 pixels for each month. Figure 2 shows the training accuracy of the two algorithms under different parameter conditions. The accuracy of the SVM algorithm is greatly influenced by the gamma parameter, whereas the cost parameter C has less effect. When gamma varies over the range [2^-6, 2^6], R² gradually increases and then decreases, reaching its maximum at gamma = 2^4 and gamma = 2^5. The accuracy of the SVM algorithm thus fluctuates considerably with the parameter settings, which indicates that SVM is sensitive to the choice of parameters. For the RF algorithm, when the number of trees (n_estimators) varies within [20, 200], the average fitting accuracy continues to rise and gradually stabilizes once n_estimators exceeds 100. Although the fitting accuracy of RF increases with the number of trees, R² remains high for all parameter values tested, with an average R² greater than 0.99. This indicates that the RF algorithm is robust and not sensitive to the choice of parameters. Table 2 shows the average R², mean absolute error (MAE), and root mean squared error (RMSE) of the two algorithms for each month. The training accuracy of SVM varies between months, while the R² of RF is greater than 0.99 in every month.
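For reference, the three accuracy measures can be computed as in the sketch below. It assumes that R² denotes the coefficient of determination (the source does not state whether this or the squared correlation coefficient was used), and the function name `regression_scores` and the four sample values are hypothetical.

```python
import numpy as np

def regression_scores(observed, estimated):
    """Return (R^2, MAE, RMSE) for paired observed and estimated values."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    error = estimated - observed
    mae = np.mean(np.abs(error))                       # mean absolute error
    rmse = np.sqrt(np.mean(error ** 2))                # root mean squared error
    ss_res = np.sum(error ** 2)                        # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2) # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, rmse

# Hypothetical monthly values in mm, not taken from the study.
r2, mae, rmse = regression_scores([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.5])
print(r2, mae, rmse)
```

MAE and RMSE carry the units of precipitation (mm), whereas R² is dimensionless, so the three measures are best read together.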

Reconstruction Results
The estimated results at 25 km resolution for June and October of 2009 are presented in Figure 3. The results estimated at 1 km resolution for January, June, and October are presented in Figure 4. Based on visual comparison, the results at 1 km resolution provide more detailed spatial information than the results at 25 km resolution. The TRMM 3B43 precipitation estimates obtained using SVM and RF show quite different spatial distribution characteristics.


Validation and Error Analysis
The results with coarse (25 km) and fine (1 km) spatial resolution were validated with observations from eight meteorological stations in the study area. Figure 5 shows the scatter plots of the results with coarse resolution and the observations for each algorithm; Figure 6 presents the scatter plots of the results with fine resolution and the observations for each algorithm.

The results obtained using the RF algorithm at 25 km and 1 km resolution are more accurate than those of the SVM. The estimated precipitation based on RF strongly correlates with the in situ observations; the R² of RF is larger than 0.8. However, the accuracy of the precipitation estimated at 1 km resolution is slightly lower than that at 25 km resolution. We examined the temporal behavior of the in situ measurements and the precipitation reconstructed at eight stations during the entire period (the results at 1 km spatial resolution are displayed in Figure 7). Except for individual months, the reconstruction results are in good agreement with the site observations. The precipitation estimated at the eight sites accurately reflects annual and inter-annual variations of the precipitation; the results obtained from the RF algorithm are closer to the observations.

Tables 3 and 4 show the R², RMSE, and Bias between observations and estimated monthly precipitation at 25 km and 1 km resolution at the eight stations, respectively. Accurate precipitation estimates were obtained at each station. According to Table 3, with respect to the results at 25 km resolution, the R² of the RF algorithm is higher than that of the SVM algorithm at each station, ranging from 0.72 to 0.93. The R² values of Eergunayouqi, Tulihe, Daxinganling, and Heihe are larger than 0.9. The RMSE of the RF algorithm at each site is lower than that of the SVM algorithm. However, the Bias of the RF algorithm is larger than that of the SVM algorithm at five stations (Mohe, Tahe, Huma, Tulihe, and Daxinganling).
In general, the RF algorithm shows a higher accuracy than the SVM at 1 km resolution for each individual station. However, the SVM performed better than RF at the Huma Station. Except for the Mohe Station, the estimated precipitation is in good agreement with the observations, with R² higher than 0.8. The best agreement was observed at the Heihe Station (R² = 0.91), followed by Tulihe (R² = 0.90). Overall, the RF algorithm tends to underestimate the monthly precipitation, with a negative Bias; the Bias reached −0.24 and −0.18 at the Huma and Eergunanyouqi stations, respectively.
To investigate the relationship between the estimation errors and precipitation observations, we calculated the average MAE of the reconstructed results for each station and the average precipitation observations of the stations (Figure 8). In general, the estimation errors positively correlate with the average precipitation; the MAEs increase with increasing average precipitation. The MAE of the SVM model increases at a rate of 2.6 mm/10 mm (R² = 0.43); the MAE of the RF model increases at a rate of 2.2 mm/10 mm (R² = 0.35). These results indicate that the errors increase as the total monthly precipitation increases and that the rate of increase of RF is lower than that of the SVM.
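A rate such as "mm of MAE per 10 mm of precipitation" can be obtained from an ordinary least-squares line through the per-station values. The sketch below assumes this is how the rates were derived, which the source does not state explicitly; the function name `error_trend` and the station values are made up for illustration and are not taken from Figure 8.

```python
import numpy as np

def error_trend(mean_precip_mm, mae_mm):
    """Fit MAE = slope * precipitation + intercept.

    Returns the MAE increase per 10 mm of precipitation and the R^2 of the fit.
    """
    x = np.asarray(mean_precip_mm, dtype=float)
    y = np.asarray(mae_mm, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = slope * x + intercept
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 10.0 * slope, 1.0 - ss_res / ss_tot

# Made-up per-station averages (mm); illustrative only.
mean_precip = [30.0, 35.0, 40.0, 45.0, 50.0]
mae = [7.0, 9.5, 9.0, 11.5, 12.0]

rate, r2 = error_trend(mean_precip, mae)
print(f"MAE increases by {rate:.2f} mm per 10 mm of precipitation (R^2 = {r2:.2f})")
```

With only eight stations, a fit of this kind is sensitive to individual points, which is consistent with the moderate R² values reported for both models.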

Scale Sensitivity of the Model
In this study, we reconstructed precipitation data for Northeast China. The reconstruction models were established at the 25 km scale. Figure 9 shows the simulation accuracy (R² and RMSE) of the two machine learning algorithms based on four different scales. The R² and RMSE of the different scales are quite similar. This indicates that the machine learning algorithm of the reconstruction model is not affected by scale changes. In addition, the RF-based model has a higher accuracy at each scale and in each month; however, Figure 9d shows that the accuracy of the RF-based model decreases from 25 to 100 km. Therefore, it is reasonable to establish the reconstruction model at the 25 km scale. If the model is established at a larger scale (50 km to 100 km), the original TRMM data need to be scaled up, which will cause the loss of spatial information and introduce uncertainty into the reconstruction model.

Influence Factors of the Reconstructing Model
The factors influencing precipitation are complex and diverse. The factors affecting the accuracy of the precipitation reconstruction model also vary, including the environmental factors considered in the reconstruction model and the accuracy of the multi-source remote sensing information.

The TRMM precipitation products have been the focus of research and applications due to their high data quality and accuracy. In theory, the accuracy of the reconstruction results largely depends on the accuracy of the original satellite precipitation data. Although the TRMM 3B43 V7 precipitation products show high consistency with in situ observations, the accuracy of the TRMM data might vary between seasons and from one region to another. Multiple studies have shown that satellite precipitation datasets have a limited ability to estimate trace and solid precipitation. Therefore, the accuracy of the reconstructed precipitation might be lower during winter.
The response of vegetation to precipitation has been extensively studied. Water is an important factor affecting the growth of vegetation. Therefore, NDVI, as the best indicator of vegetation growth, has been widely used in precipitation monitoring. Figure 10 shows the in situ observed precipitation and NDVI values at eight stations during the entire period.

Conclusions
In this study, a reconstruction algorithm is proposed for monthly TRMM 3B43 precipitation based on machine learning algorithms. The reconstruction is performed over Northeast China at two spatial resolutions (25 km and 1 km). The reconstructed precipitation is validated with in situ observations from eight meteorological stations in the study area.
Based on the training results, the RF algorithm produces a higher training accuracy than the SVM. Moreover, the accuracy of the SVM is greatly affected by the selection of parameters and varies between months. In contrast, the RF produces a consistently high accuracy. This indicates that the RF algorithm is more robust than the SVM. The validation results show that the monthly precipitation reconstructed with RF is more accurate than that obtained with the SVM. The RF estimates show high correlations with the in situ observations at each station, and the estimated precipitation at the eight stations accurately reflects annual and interannual variations. In general, the RF algorithm outperforms the SVM in the reconstruction model.
The relationship between the estimation errors and the precipitation observations was analyzed by comparing the average MAE with the average precipitation observations at each station. The results show a positive relationship between the absolute error and the average precipitation. The absolute errors increase as the monthly total precipitation increases, although the rate of increase is lower for RF than for SVM.
The scale effect is important for remote sensing models. We also analyzed the scale sensitivity of the reconstruction model by comparing the accuracy of the models established at different scales (25 km, 50 km, 75 km, and 100 km). The results show that the training accuracies are quite similar at different scales, indicating that the reconstruction model is largely unaffected by scale changes.
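The comparison of average MAE with average precipitation at each station can be sketched as a simple least-squares fit. This is only an illustration of the idea: the station names, the toy precipitation series, and the helper functions below are hypothetical and do not reproduce the values of the study.

```python
def mean_absolute_error(observed, estimated):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(o - e) for o, e in zip(observed, estimated)) / len(observed)


def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical stations: (observed, estimated) monthly totals in mm.
# All numbers are invented for illustration only.
stations = {
    "station_A": ([20.0, 40.0, 60.0], [22.0, 37.0, 64.0]),
    "station_B": ([40.0, 80.0, 120.0], [45.0, 74.0, 127.0]),
    "station_C": ([60.0, 120.0, 180.0], [68.0, 111.0, 190.0]),
}

# One point per station: average observed precipitation vs. average MAE.
mean_precip = [sum(obs) / len(obs) for obs, _ in stations.values()]
station_mae = [mean_absolute_error(obs, est) for obs, est in stations.values()]

slope, intercept = linear_fit(mean_precip, station_mae)
print(f"MAE grows by about {slope:.3f} mm per mm of average precipitation")
```

A positive slope corresponds to the positive relationship described above; fitting one line per algorithm would allow the rates of increase of RF and SVM to be compared.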

Figure 1. (a) Tropical Rainfall Measuring Mission (TRMM) 3B43 V7 precipitation data over China in August 2009; (b) region in Northeast China not covered by TRMM.

0.25° × 0.25°, and the spatial coverage of the products is 50° N-50° S. The datasets used in this study are monthly precipitation data from version 7 of the TRMM 3B43 product (TRMM 3B43 V7) for 2003, 2006, and 2009 (http://pmm.nasa.gov/TRMM/trmm-instruments). The original data were projected to the Albers Conical Equal Area projection and resampled to a spatial resolution of 25 km.

Figure 2. Boxplots of the coefficient of determination of the training process using different parameters.

Figure 3. Reconstruction results at 25 km spatial resolution: (a) reconstructed precipitation for June 2009 using support vector machine (SVM); (b) reconstructed precipitation for June 2009 using random forests (RF); (c) reconstructed precipitation for September 2009 using SVM; (d) reconstructed precipitation for September 2009 using RF.

Figure 4. Reconstruction results at 1 km spatial resolution: (a) reconstructed precipitation for June 2009 using SVM; (b) reconstructed precipitation for June 2009 using RF; (c) reconstructed precipitation for September 2009 using SVM; (d) reconstructed precipitation for September 2009 using RF.

Figure 5. Scatter plots between observed monthly total precipitation and estimated results using (a) SVM and (b) RF at a spatial resolution of 25 km.

Figure 6. Scatter plots between observed monthly total precipitation and estimated results using (a) SVM and (b) RF at a spatial resolution of 1 km.

Figure 7. Comparison of in situ observations and monthly precipitation reconstructed using RF and SVM at eight stations.

Figure 8. Scatter plots between the average mean absolute error (MAE) of reconstructed precipitation for each station and the average precipitation observations of the stations: (a) SVM; (b) RF.

5.1. Scale Sensitivity of the Model
In this study, we rebuilt precipitation data for Northeast China. The reconstruction models were established at 25 km resolution. The scale effect is one of the most important issues in remote sensing research. The information and characteristics reflected at different scales might be completely different, and the same model might produce completely different results at different spatial scales. Based on Immerzeel et al. (2009) and Jia et al. (2011), models established from precipitation and NDVI/DEM might have different accuracies at different spatial scales. To explore the scale sensitivity of the reconstruction model, we established models at 25 km, 50 km, 75 km, and 100 km resolution based on the two machine learning algorithms. The simulation abilities of the models were analyzed at the different scales.
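Building models at coarser scales requires the 25 km data to be aggregated first. The text does not state how the upscaling was performed; the sketch below assumes a simple block mean over non-overlapping windows, which is one common choice. The function name and the toy grid are hypothetical.

```python
def upscale_block_mean(grid, factor):
    """Aggregate a 2-D grid by averaging non-overlapping factor x factor blocks.

    Cells set to None (no data) are skipped; a block without any valid cell
    yields None. Trailing rows/columns that do not fill a block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    n_rows = len(grid) // factor
    n_cols = len(grid[0]) // factor if grid else 0
    coarse = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            block = [
                grid[i * factor + di][j * factor + dj]
                for di in range(factor)
                for dj in range(factor)
            ]
            valid = [v for v in block if v is not None]
            row.append(sum(valid) / len(valid) if valid else None)
        coarse.append(row)
    return coarse


# Toy 4 x 4 grid at 25 km (mm/month; invented values, not from the study).
grid_25km = [
    [10.0, 12.0, 30.0, 34.0],
    [14.0, 16.0, 32.0, None],
    [50.0, 52.0, 70.0, 72.0],
    [54.0, 56.0, 74.0, 76.0],
]

grid_50km = upscale_block_mean(grid_25km, 2)    # 2 x 2 blocks -> 50 km cells
grid_100km = upscale_block_mean(grid_25km, 4)   # 4 x 4 block  -> 100 km cell
print(grid_50km)
```

A factor of 3 would give the 75 km scale in the same way. The averaging makes explicit why spatial detail is lost as the scale increases: each coarse cell replaces several distinct 25 km values with a single number.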

Figure 9. (a) R² achieved using SVM at different scales from January to December; (b) R² achieved using RF at different scales from January to December; (c) MAE achieved using SVM at different scales from January to December; (d) MAE achieved using RF at different scales from January to December.
Figure 11 presents the in situ observed precipitation and land surface temperature values at eight stations. According to the figures, there are positive relationships between precipitation and the NDVI and LST variables. Compared with the land surface temperature, the variation curve of NDVI agrees better with the change of precipitation. However, neither LST nor NDVI is consistent with the changes in precipitation in some individual months. There are limitations in using NDVI and LST as indicator variables. NDVI and LST cannot objectively reflect the real precipitation change because they are also affected by human and natural factors. For example, harvesting and irrigating farmland artificially changes the NDVI and surface temperature, and defoliation of vegetation might weaken the relationship between precipitation and NDVI and LST. The NDVI and LST changes caused by human intervention and natural factors are not controlled by precipitation. In addition, NDVI cannot effectively reflect the change of precipitation in sparsely vegetated areas (where NDVI is a constant smaller than or close to zero) or in areas of lush vegetation (where NDVI saturates). The NDVI and surface temperature data are transient data, reflecting the transient state of the surface environment; however, the impact of precipitation on the surface environment is continuous. Although those data have been composited by calculating maximum and average values, data gaps and quality issues caused by clouds and atmospheric conditions still exist, which affect the accuracy of the reconstruction results.
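The compositing step mentioned above, and the way data gaps can survive it, can be illustrated with a short sketch. It assumes that NDVI is composited with the maximum value and LST with the average, which is a common convention but is not spelled out in the text; the function name and the sample values are hypothetical.

```python
def composite(values, method):
    """Composite one pixel's observations within a month.

    values: list of floats, with None marking cloud-contaminated or
            missing observations.
    method: "max" (assumed here for NDVI) or "mean" (assumed here for LST).
    Returns None when no valid observation exists, i.e. the gap remains.
    """
    valid = [v for v in values if v is not None]
    if not valid:
        return None
    if method == "max":
        return max(valid)
    if method == "mean":
        return sum(valid) / len(valid)
    raise ValueError(f"unknown method: {method!r}")


# Invented observations for a single pixel (not from the study).
ndvi_obs = [0.42, None, 0.58, 0.51]     # unitless NDVI values
lst_obs = [290.0, 294.0, None, 292.0]   # land surface temperature in K

ndvi_month = composite(ndvi_obs, "max")    # highest valid NDVI
lst_month = composite(lst_obs, "mean")     # mean of the valid LST values
cloudy_month = composite([None, None], "max")  # no valid data: gap persists

print(ndvi_month, lst_month, cloudy_month)
```

Compositing reduces the effect of isolated cloudy observations, but a pixel with no valid observation in the whole period still produces a gap, which is the limitation noted in the paragraph above.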

Figure 10. Comparison of in situ observed precipitation and normalized difference vegetation index (NDVI) values at eight stations.

Figure 11. Comparison of in situ observed precipitation, daytime land surface temperature (LST), and nighttime LST values at eight stations.

Table 1. Parameter combinations for each algorithm.

Table 2. The averaged training accuracy for different months using the two algorithms.

Table 3. The correlation coefficient, root mean squared error (RMSE), and Bias between observed and estimated monthly precipitation at 25 km resolution at eight stations.

Table 4. The correlation coefficient, RMSE, and Bias between observed and estimated monthly precipitation at 1 km resolution at eight stations.
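
The statistics named in the captions of Tables 3 and 4 are not defined in this part of the text. A commonly used set of definitions is sketched below, assuming that $O_i$ and $E_i$ denote the observed and estimated monthly precipitation at a station for month $i$ of $n$, with means $\bar{O}$ and $\bar{E}$. These symbols are introduced here for illustration only. The Bias in particular is sometimes defined as a relative (ratio or percentage) quantity, so the additive form below is an assumption rather than the definition used in the study.

```latex
\begin{align}
r &= \frac{\sum_{i=1}^{n} (O_i - \bar{O})(E_i - \bar{E})}
          {\sqrt{\sum_{i=1}^{n} (O_i - \bar{O})^2}\,
           \sqrt{\sum_{i=1}^{n} (E_i - \bar{E})^2}} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (E_i - O_i)^2} \\
\mathrm{Bias} &= \frac{1}{n}\sum_{i=1}^{n} (E_i - O_i)
\end{align}
```

Under these definitions, a positive Bias indicates overestimation on average, while RMSE penalizes large individual errors more strongly than a mean absolute measure would.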