Estimating Sub-Pixel Soybean Fraction from Time-Series MODIS Data Using an Optimized Geographically Weighted Regression Model

: Soybean cultivation in China has signiﬁcantly decreased due to the rising import of genetically modiﬁed soybeans from other countries. Understanding soybean’s extent and change information is of great value for national agricultural policy implications and global food security. Some previous studies have explored the quantitative relationships between crop area and spectral variables derived from remote sensing data. However, both those linear or non-linear relationships were expressed by global regression models, which ignored the spatial non-stationarity of crop spectral signature and may limit the prediction accuracy. This study presented a geographically weighted regression model (GWR) to estimate fractional soybean at 250 m spatial resolution in Heilongjiang Province, one of the most important food production regions in China, using time-series MODIS data and high-quality calibration information derived from Landsat data. A forward stepwise optimization strategy was embedded with the GWR model to select the optimal subset of independent variables for soybeans. Normalized Difference Vegetation Index (NDVI) of Julian day 233 to 257 when soybeans are ﬁlling seed was found to be the most important temporal period for sub-pixel soybean area estimation. Our MODIS-based soybean area compared well with Landsat-based results at pixel-level. Also, there was a good agreement between the MODIS-based result and census data at county level, with the coefﬁcient of determination (R 2 ) of 0.80 and the root mean square error (RMSE) was 340.21 km 2 . Additionally, F -test results showed GWR model had better model goodness-of-ﬁt and higher prediction accuracy than the traditional ordinary least squares (OLS) model. These promising results suggest crop spectral variations both at temporal and spatial scales should be considered when exploring its relationship with pixel-level crop acreage. The optimized GWR model by combining an automated feature selection strategy has great potential for estimating sub-pixel crop area at regional scale based on remote sensing time-series data. for estimating the sub-pixel soybean fractions and the evaluation methods. The equation of GWR here is simpliﬁed and the detailed equation of GWR and associated parameters are described in Section 3.1.1.


Introduction
Soybean, one of the most important crop types and commodities for vegetable oil and protein in the global agricultural market, comprises 5% of global cropland and ranks 5th among major crop categories [1,2]. During the last decade, China imported more than 80% of its soybean consumption from sensing fields, its potential in modelling the spatial distribution of crop type is unclear. In addition, how to optimize the GWR model to deal with many potential independent variables has not yet been explored.
To address the knowledge gap, the objective of this study is to use an optimized GWR model by combining a forward stepwise feature selection strategy (1) to evaluate its potential in estimating sub-pixel soybean fraction based on MODIS time-series data; (2) to characterize the non-stationary spatial relationships between sub-pixel crop fraction and MODIS-NDVI time series; and (3) to determine the optimal temporal features for soybean area estimation. The Heilongjiang province, one of the most important soybean production regions in China, was chosen as the study area. This paper is organized as follows: Section 2 introduces the basic geographical information of the study area and describes the study dataset that was used to calibrate and evaluate the GWR model. Section 3 shows the details of the specific implication of the optimized GWR in this study and accuracy assessment for GWR results. Sections 4 and 5 show the results and discussion, respectively. Summary and conclusion are given in Section 6.

Study Area
Heilongjiang province, the most northeastern part of China, is located approximately between 43 • N and 53 • N latitude and 121 • W and 135 • W longitudes ( Figure 1). It has a continental monsoon climate between temperate and boreal zones, with an average annual temperature of −4 • C to 4 • C decreasing from southeast to northwest. Its annual average precipitation ranges from 300 mm to 700 mm, which mostly concentrated from June through August [24]. Nicknamed as the "Great northern granary", this subarctic province produced more than 60.71 million tons of grain in 2016 on 11.87 million ha of cultivated land [25]. Its main crops include soybean, rice, and corn, all of which are harvested once per year due to the limited hours of sunshine and accumulated heat. During the past decade, the soybean production of Heilongjiang accounted for more than one-quarter of the nation's annual soybean production. However, due to the rising import of genetically modified (GM) soybean from Brazil and Unite State, the non-GM soybean planting area in Heilongjiang decreased from 40,322 km 2 in 2005 to 24,761 km 2 in 2015, with corn replacing soybean as the dominant dry-land crop in this region [4,25]. A field survey in 2013 showed that the soybeans were mainly distributed in northwestern part of Heilongjiang Province, and soybean fields in regions near Songnen plain are almost continuous and large, while those in the northwest part close to Lesser Khingan Mountains are relatively small and fragment. Soybean in Heilongjiang is generally sowed in late-April-followed by emergence, three leaves, seven leaves, blooming, bearing pod, filling seed, senescence and harvested in early-October ( Figure 2). The distinct difference in growing period length and biochemical properties between soybean and other crops in this region will be expressed as the variance of spectral signature (e.g., NDVI) across the growing seasons, which is the crucial characteristic to identify soybeans.

Sample Data
The reference data used to calibrate the GWR model and assess the prediction results was from a classified soybean map of partial regions in Heilongjiang, which was generated by supervised classifications based on 45 cloud-free Landsat-8 OLI images in 2013 and partial expert visual interpretation. The accuracy of this mapped soybean map was 93.84% that was assessed by 341 field data, which was shown in Figure 1a [26]. The reference soybean map with a spatial resolution of 30 meter was re-projected to the MODIS standard Sinusoidal projection. Then, the sub-pixel soybean fraction for each 250-m grid was calculated by dividing the number of 30-m soybean pixels by the total number of 30-m pixels within the 250-m grid cell (An example was shown in Figure 1c). Finally, a total of 4000 samples points ( Figure 1b) were randomly selected from the resampled reference map to guarantee the representativeness of training data in geographical space and values distribution. The estimated sub-pixel soybean fractions of those 4000 points range from 0 to 1 and were used as

Sample Data
The reference data used to calibrate the GWR model and assess the prediction results was from a classified soybean map of partial regions in Heilongjiang, which was generated by supervised classifications based on 45 cloud-free Landsat-8 OLI images in 2013 and partial expert visual interpretation. The accuracy of this mapped soybean map was 93.84% that was assessed by 341 field data, which was shown in Figure 1a [26]. The reference soybean map with a spatial resolution of 30 meter was re-projected to the MODIS standard Sinusoidal projection. Then, the sub-pixel soybean fraction for each 250-m grid was calculated by dividing the number of 30-m soybean pixels by the total number of 30-m pixels within the 250-m grid cell (An example was shown in Figure 1c). Finally, a total of 4000 samples points ( Figure 1b) were randomly selected from the resampled reference map to guarantee the representativeness of training data in geographical space and values distribution. The estimated sub-pixel soybean fractions of those 4000 points range from 0 to 1 and were used as

Sample Data
The reference data used to calibrate the GWR model and assess the prediction results was from a classified soybean map of partial regions in Heilongjiang, which was generated by supervised classifications based on 45 cloud-free Landsat-8 OLI images in 2013 and partial expert visual interpretation. The accuracy of this mapped soybean map was 93.84% that was assessed by 341 field data, which was shown in Figure 1a [26]. The reference soybean map with a spatial resolution of 30 m was re-projected to the MODIS standard Sinusoidal projection. Then, the sub-pixel soybean fraction for each 250-m grid was calculated by dividing the number of 30-m soybean pixels by the total number of 30-m pixels within the 250-m grid cell (An example was shown in Figure 1c). Finally, a total of 4000 samples points ( Figure 1b) were randomly selected from the resampled reference map to guarantee the representativeness of training data in geographical space and values distribution. The estimated sub-pixel soybean fractions of those 4000 points range from 0 to 1 and were used as the values of dependent variables for GWR model. The histogram for the dependent variable was illustrated in Section 4.3. In addition to training points, a total of 2000 points with estimated soybean fractions were kept aside for validation.

Feature Set
In our study, time series of the 8-day composite MODIS/Terra Surface Reflectance Collection 6 products (MOD09Q1 V006) were used as the main data input. The MOD09 Q1 product, with a spatial resolution of~250 m (Band 1, red, 620-670 nm; Band 2, near infrared, 841-876 nm) and a temporal repeat of 8-days, has already been cloud screened, atmospherically corrected, and transformed to a standard Sinusoidal projection. Due to the trade-offs involving spatial details, temporal information and areal coverage, MOD09Q1 has been deemed particularly suitable to characterize the seasonal dynamic of land cover classes and monitor large-region land cover change [9,27]. Four MODIS tiles (i.e., h25v03, h26v03, h26v04, and h27v04) covering the entire Heilongjiang region during the key growing season, from Julian days 65 to 305 (31 time points) in 2013 were selected and downloaded from the USGS EROS Data Center (http://edc.usgs.gov/). These tiled MODIS data were mosaicked and clipped to the extent of the study area. Also, NDVI was calculated based on Band 1 and Band 2 for each time point and used as the candidate input variables of GWR (shown as Equation (1)).
where Band 1 is the MODIS surface reflectance of red (620-670 nm) and Band 2 is the MODIS surface reflectance of near infrared (841-876 nm). In addition to MODIS-NDVI time series, the cropland map of Heilongjiang Province extracted from GlobeLand30 [28], the world's first global land cover dataset at a 30 m resolution, was resampled to MODIS pixel grid and then was used to mask non-agriculture areas within the satellite data sets.

Census Data
In this study, soybean cultivation statistics of 2013 were acquired for county level from the China Agricultural Statistics Yearbook published by Chinese Ministry of Agriculture [25]. The soybean statistical data of 80 counties in Heilongjiang were used to compare the soybean planting area estimated from GWR-derived soybean fraction maps.

Methodology
This paper is focused on using an optimized GWR model to estimate the sub-pixel soybean fractions in Heilongjiang, province based on time-series MODIS data. 4000 training points were used to calibrate the GWR model, where soybean fractions of 250-m MODIS pixels estimated from 30-m soybean reference map was used as the dependent variable for the GWR model. A total of 31 time-series NDVI, from Julian days 65 to 305, were used as the candidate input features. A forward stepwise strategy was used to select the optimal features by minimizing the corrected Akaike Information Criterion (AICc). The optimal NDVI temporal features selected from the forward stepwise strategy were used as the final independent variables for the GWR model. There were three ways to evaluate the performances of GWR model in estimating the sub-pixel crop fractions: (1) comparisons with OLS-based result; (2) assessment by using 2000 validation points with soybean fractions estimated from Landsat-based results; and (3) comparisons with county-level census data. The flow diagram in Figure 3 outlines the main analysis steps, which are described in detail in the following sections.

The Basic Framework of GWR Model
The traditional ordinary least squares (OLS) regression, often called a global logistic regression model, assumes that relationships between independent variables and the dependent variable are spatially stationary and the regression coefficients are also constant across space. The OLS model structure can be expressed as follows: where i y is the dependent variable, to are independent variables, ̂ is the constant parameter specific to location , ̂ to ̂ are estimated regression coefficients and is the independent error term at location with distribution ( , ); GWR, a local regression approach, recognizes that the parameter estimates in a regression relationship can vary across the study area and thereby allows the parameter estimates to be a function of location. The above equation can be modified to the following form [19]: This regression equation supposes the regression parameters to be estimated at a location for which the spatial coordinates are provided by the variables and . GWR accounts for the spatial drift in relationships by localizing the regression through a moving window (kernel). In this model, the regression and its parameters for each observation at location i are quantified separately and independently from other points. The GWR model coefficients can be estimated by solving the matrix equation:

The Basic Framework of GWR Model
The traditional ordinary least squares (OLS) regression, often called a global logistic regression model, assumes that relationships between independent variables and the dependent variable are spatially stationary and the regression coefficients are also constant across space. The OLS model structure can be expressed as follows: whereŷ i is the dependent variable, X 1i to X ni are independent variables,β 0i is the constant parameter specific to location i,β 1i toβ ni are estimated regression coefficients and ε i is the independent error term at location i with distribution N(0, σ 2 ); GWR, a local regression approach, recognizes that the parameter estimates in a regression relationship can vary across the study area and thereby allows the parameter estimates to be a function of location. The above equation can be modified to the following form [19]: This regression equation supposes the regression parameters to be estimated at a location for which the spatial coordinates are provided by the variables u and v. GWR accounts for the spatial drift in relationships by localizing the regression through a moving window (kernel). In this model, the regression Remote Sens. 2018, 10, 491 7 of 21 and its parameters for each observation at location i are quantified separately and independently from other points. The GWR model coefficients can be estimated by solving the matrix equation: whereβ are regression parameters in location (u, v) and W(u, v) is a weighting matrix whose diagonal elements represent the geographical weightings of observations near point i: where w in is the weight assigned to the observation at location i. The weights are chosen based on the assumption that those observations near the point in space have a more significant influence on the result than observations further away [16,29]. Therefore, near locations get high weights and far ones get low weights. There are usually two types of kernel functions, fixed kernel and adaptive kernel used to obtain weights. In this study, we used the adaptive kernel function, which can ensure a certain number of nearest neighbors as local samples and better represents the degree of spatial heterogeneity than the fixed kernel. This adaptive kernel function was based on a bi-square distance decay function as follows [30]: where d ij is the distance from location j to location i, and s is the distance from the Nth nearest neighbor to i. N, is the number of nearest neighbor points, also referred to as a bandwidth, which is the radius or number of observations around each subject point and controls the distance decay in the weighting function [31]. The bandwidth in this study was chosen by minimizing the corrected Akaike Information Criterion (AICc). The corrected AICc can be estimated as [32]: where n is the local sample points size;σ is the estimated standard deviation of the error term, and tr(S) denotes the trace of the hat matrix S. The hat matrix is the projection matrix from the observed value y i to the fitted valuesŷ i . AICc accounts for model parsimony and is a trade-off between prediction accuracy and complexity. The AICc method has the advantage of taking into account the fact that the degrees of freedom may vary among models centered on different observations. All these analyses were implemented using the GWmodel package in open-source software R of version 3.3.3 (https://www.r-project.org/). In our study, the dependent variable is the sub-pixel soybean fractions and the independent variables are the optimal temporal NDVI features selected by a forward stepwise strategy described in Section 3.1.2.

A Forward Stepwise Strategy for Selecting the Optimal Independent Variables
For our analysis, annual time-series NDVI ranging from day 65 to day 305 (a total of 31 temporal features) were used as candidate independent variables for the basic GWR model. A forward stepwise optimization method [32,33] was used to select an optimal subset of independent variables from the 31 time-series NDVI for the GWR model. This method is based on the theory that an independent variable is considered to be important if the AICc is significantly reduced when it is added to the model [32]. The whole procedures can be described in the following three steps: Remote Sens. 2018, 10, 491 8 of 21

•
Step 1: Start by calibrating all possible bivariate GW regressions by in turn regressing a single explanatory variable against the dependent variable. Calculate AICc in each case (31 runs in this study). Select the variable that produces the lowest AICc.

•
Step 2: Sequentially introduce a variable from the remaining n − 1 features to construct new models with the permanently included independent variable in step 1. Calculate the change in AICc between step1 and step 2. Select variable yielding greatest reduction in AICc. Add this variable to the model.

•
Step 3: Repeat step 2 until no independent variables among the candidate variables can be added into the model, and the model at this point is the final model.
The function in GWmodel package to perform this procedure is model.selection.gwr, where AICc values were sorted using function model.sort.gwr and then were visually presented using model.view.gwr. Figure 4 shows the iteration process of the stepwise selection strategy in selecting the optimal features from the candidate thirty-one features, where the dependent variable (i.e., the soybean proportion within MODIS pixel) is located at the center of the vortex and different shape in different color represents different independent variable (i.e., 31 NDVI acquired at different dates).

•
Step 2: Sequentially introduce a variable from the remaining n − 1 features to construct new models with the permanently included independent variable in step 1. Calculate the change in AICc between step1 and step 2. Select variable yielding greatest reduction in AICc. Add this variable to the model.

•
Step 3: Repeat step 2 until no independent variables among the candidate variables can be added into the model, and the model at this point is the final model.
The function in GWmodel package to perform this procedure is model.selection.gwr, where AICc values were sorted using function model.sort.gwr and then were visually presented using model.view.gwr. Figure 4 shows the iteration process of the stepwise selection strategy in selecting the optimal features from the candidate thirty-one features, where the dependent variable (i.e., the soybean proportion within MODIS pixel) is located at the center of the vortex and different shape in different color represents different independent variable (i.e., 31 NDVI acquired at different dates). . The iteration process of the forward stepwise strategy for selecting the optimal NDVI features (independent variables) from the thirty-one candidate NDVI variables. The dependent variable is located at the center of the vortex. Different symbol in different shape represents different NDVI temporal variable. The iteration process of the forward stepwise selection strategy starts from the first vortex circle (inside) and ends at the last vortex (outside), where the symbol with the largest amount in each vortex circle was the optimal NDVI variable of this iteration.

F-Test Statistics
With the selected optimal subset of independent variables, we run the GWR and OLS models based on the 4000 training pixels and then used the calibrated models to predict soybean fractions for all MODIS pixels of the whole study area. Descriptive statistics, including the overall AICc, the residual sum of squares (RSS), R 2 and adjusted R 2 , were used to reflect the model fitting performance and the spatially inhomogeneous of the GWR. Then, an F-test statistical testing method proposed by Leung et al. [33], was utilized to test the improvements of GWR over OLS in fitting the model and to understand whether or not each variable in the GWR model vary significantly.
To test whether GWR is superior than OLS, we used the F1-test statistics in Leung et al. [31]. Specifically, if the null hypothesis-H0: there is no significant difference between OLS and GWR models for the given data-is true, the quantity RSSg/RSSo is close to one. Otherwise, it tends to be small. The F1-value can be calculated as [33]: where RSSg and RSSo are the residual sum of squares of GWR and OLS model, respectively. is the standard deviation of error and also equal to the mean of / where is constant variance. n − p − 1 represents the degrees of freedom in the denominator. The distribution of F1-value . The iteration process of the forward stepwise strategy for selecting the optimal NDVI features (independent variables) from the thirty-one candidate NDVI variables. The dependent variable is located at the center of the vortex. Different symbol in different shape represents different NDVI temporal variable. The iteration process of the forward stepwise selection strategy starts from the first vortex circle (inside) and ends at the last vortex (outside), where the symbol with the largest amount in each vortex circle was the optimal NDVI variable of this iteration.

F-Test Statistics
With the selected optimal subset of independent variables, we run the GWR and OLS models based on the 4000 training pixels and then used the calibrated models to predict soybean fractions for all MODIS pixels of the whole study area. Descriptive statistics, including the overall AICc, the residual sum of squares (RSS), R 2 and adjusted R 2 , were used to reflect the model fitting performance and the spatially inhomogeneous of the GWR. Then, an F-test statistical testing method proposed by Leung et al. [33], was utilized to test the improvements of GWR over OLS in fitting the model and to understand whether or not each variable in the GWR model vary significantly.
To test whether GWR is superior than OLS, we used the F 1 -test statistics in Leung et al. [31]. Specifically, if the null hypothesis-H 0 : there is no significant difference between OLS and GWR models for the given Remote Sens. 2018, 10, 491 9 of 21 data-is true, the quantity RSS g /RSS o is close to one. Otherwise, it tends to be small. The F 1 -value can be calculated as [33]: where RSS g and RSS o are the residual sum of squares of GWR and OLS model, respectively. σ 1 is the standard deviation of error and also equal to the mean of RSS g /σ 2 where σ 2 is constant variance. n − p − 1 represents the degrees of freedom in the denominator. The distribution of F 1 -value may be approximated by an F-distribution with σ 1 2 /σ 2 degrees of freedom in the numerator and n − p − 1 degrees of freedom in the denominator.
, where ff is the given significance level, the null hypothesis will be rejected and it is thus believed that the GWR model is significantly better than the OLS in describing the data. Otherwise, we will conclude that the GWR model cannot improve the model fitting significantly compared to the OLS model.
To test whether each variable varies significantly across the study region, we used the F 3 -test statistics in Leung et al. [31]. The null hypothesis is: H 0 : β 1k = β 1k = . . . = β nk (for a given k)" the F 3 -value can be calculated as: where V k 2 is a statistic that can reflect the spatial variation of the given set of parameters is an unbiased estimator of σ 2 . Its distribution can be approximated by an F-distribution with γ 1 2 /γ 2 degrees of freedom in the numerator and δ 1 2 /δ 2 degrees of freedom in More details about the F-test statistical testing method are provided by Leung et al. [33].

Accuracy Assessment
In this study, the performances of the GWR model in estimating the sub-pixel crop area were assessed in three ways: first was to compare the model fitting result of GWR and OLS (described in Section 3.1.3), second was the most traditional accuracy assessment using a separate validation dataset and third was comparing area estimates to available national statistics data. For the second way, the sub-pixel soybean fraction map derived from GWR was assessed using 2000 points derived from Landsat-based reference map that is described in Section 2.2. These 2000 validation points were separate from the 4000 training samples. For the third evaluation on area estimates, the sub-pixel soybean map was aggregated to the total acreage of soybean cultivation for each pixel by multiplying the prediction values (between 0 to 1) by the area of each pixel. Then, the derived soybean acreage was compared with census data at county level by overlaying administrative boundaries with the soybean cultivation area map. The results of these two assessments were quantified by computing several statistical measures, including the RMSE, NRMSE (normalized root mean square error) and R 2 . They are defined as: For the assessment of using validation points, a i was the estimated soybean fraction from GWR, a i was the soybean fraction of 250 m pixel that was aggregated from Landsat-based reference map, and k, the number of validation points, is 2000. y max is the maximum sub-pixel soybean fractions and y min is the minimum sub-pixel soybean fractions of those 2000 validation points. For the assessment of using census data, a i is the GWR-derived soybean acreage, and a i is the census data, and k, the number of counties in Heilongjiang, is 80. y max is the maximum area and y min is the minimum areas of those 80 counties.

Descriptive Statistics
The histograms of independent variables (NDVI values) as well as the dependent variable (sub-pixel soybean fraction) are illustrated in terms of frequency distribution in Figure 5. Five NDVI images that are acquired at different Julian days and refer to different phenological phases of soybeans are shown as examples for the total thirty-one candidate variables. Among the five variables, Julian day 121 when soybeans are still at emergence stage has the lowest frequency for NDVI values of larger than 0.4, with a mean value of 0.2485 and a standard deviation (Std.Dev) of 0.0446. As time goes on, the pixel number of NDVI larger than 0.4 increases and that smaller than 0.4 decreases dramatically. Both the frequency for values between 0.8-1.0 and the mean NDVI value approach the peaks at Julian day 217 when soybeans are bearing pods and contain higher vegetation density than other periods. From bearing pod stage to filling seed stage, partial green leaves begin to turn yellow, which makes NDVI gradually decrease with the mean value dropping from 0.7769 to 0.6740. Overall, the changes of NDVI frequency distribution across seasons well reflect the phenological dynamics of soybean cultivation.

Descriptive Statistics
The histograms of independent variables (NDVI values) as well as the dependent variable (sub-pixel soybean fraction) are illustrated in terms of frequency distribution in Figure 5. Five NDVI images that are acquired at different Julian days and refer to different phenological phases of soybeans are shown as examples for the total thirty-one candidate variables. Among the five variables, Julian day 121 when soybeans are still at emergence stage has the lowest frequency for NDVI values of larger than 0.4, with a mean value of 0.2485 and a standard deviation (Std.Dev) of 0.0446. As time goes on, the pixel number of NDVI larger than 0.4 increases and that smaller than 0.4 decreases dramatically. Both the frequency for values between 0.8-1.0 and the mean NDVI value approach the peaks at Julian day 217 when soybeans are bearing pods and contain higher vegetation density than other periods. From bearing pod stage to filling seed stage, partial green leaves begin to turn yellow, which makes NDVI gradually decrease with the mean value dropping from 0.7769 to 0.6740. Overall, the changes of NDVI frequency distribution across seasons well reflect the phenological dynamics of soybean cultivation.
For the dependent variable, the sub-pixel soybean fraction value ranges from 0 to 1, with a mean value of 0.0971 and a standard deviation of 0.2103. However, the frequency distribution of these 4000 training pixels is uneven because a clear majority of pixel values are less than 0.1. This high frequency for low values indicates the soybean field is generally fragment in Heilongjiang Province, which also implies the necessity of sub-pixel calculation of crop area when using MODIS images. In addition to analyzing the value distribution characteristics of independent and dependent variables, we also observed whether these values changes are related to spatial distances. The MODIS-NDVI time series spectral curves of three pure soybean pixels in different spatial positions (a-c in Figure 1a) and two mixed pixels (d, e in Figure 1a) were illustrated to characterize soybean temporal-spectral curves of different soybean fractions and to examine whether the distance of spatial locations impact the variations of crop time-series spectral curves. Their NDVI time-series curves ranging from day 65 to day 305 were shown in Figure 6. We observed that pixel "c", which is For the dependent variable, the sub-pixel soybean fraction value ranges from 0 to 1, with a mean value of 0.0971 and a standard deviation of 0.2103. However, the frequency distribution of these 4000 training pixels is uneven because a clear majority of pixel values are less than 0.1. This high frequency for low values indicates the soybean field is generally fragment in Heilongjiang Province, which also implies the necessity of sub-pixel calculation of crop area when using MODIS images.
In addition to analyzing the value distribution characteristics of independent and dependent variables, we also observed whether these values changes are related to spatial distances. The MODIS-NDVI time series spectral curves of three pure soybean pixels in different spatial positions (a-c in Figure 1a) and two mixed pixels (d, e in Figure 1a) were illustrated to characterize soybean temporal-spectral curves of different soybean fractions and to examine whether the distance of spatial locations impact the variations of crop time-series spectral curves. Their NDVI time-series curves ranging from day 65 to day 305 were shown in Figure 6. We observed that pixel "c", which is spatially close to pixel "b", presents much similar spectral curve characteristic with pixel "b". By contrast, the pixel "a" further away from pixel "b", shows a large discrepancy in spectral curve with the pixel "b", especially for the peak value of curve and associated dates. This implies the spatial instability of crop signature should be considered when the relationship between sub-pixel soybean area and time-series NDVI is explored.
Remote Sens. 2018, 13, x FOR PEER REVIEW 11 of 21 spatially close to pixel "b", presents much similar spectral curve characteristic with pixel "b". By contrast, the pixel "a" further away from pixel "b", shows a large discrepancy in spectral curve with the pixel "b", especially for the peak value of curve and associated dates. This implies the spatial instability of crop signature should be considered when the relationship between sub-pixel soybean area and time-series NDVI is explored. Figure 6. Time-series NDVI curves of pure soybean pixels (a-c) and mixed pixels (d,e) during the entire crop growing period in Heilongjiang. The specific spatial locations of those five pixels were described in Figure 1. These five pixels were presented here to show the time-series curve characteristics of different soybean fractions and to examine whether the distance of spatial locations is related to the variations of crop time-series spectral curves.

The Optimal Independent Variables for Soybean Cultivation
In this study, the best adaptive bandwidth (number of nearest neighbors) is 736, with the minimum AICc value of −3087.201. This best value means 736 nearest neighbors (points) are included as local samples to perform local regression. Figure 7 displays the corresponding AICc values of models composed by different features derived from the forward stepwise selection strategy, which corresponds to the vortexes in Figure 4. Obviously, AICc value continues to decrease as the number of independent variable increases to 31. The model (point B in Figure 7) containing all 31 NDVI variables is the best for calculating sub-pixel soybean fraction in terms of the minimum AICc value. In addition, it can be found the model at point A has the biggest change rate of AICc, indicating those associated variables play the most critical roles in reducing AICc values and thus are considered as the optimal features among the 31 NDVI variables for sub-pixel soybean area calculation. Specifically, the number of independent variables at point A is four, and there are NDVI of Julian day 233, 257, 241 and 249, respectively. During this period (day 233 to 257), soybean cultivation is at the stage of filling seeds when the yellowing leaves and changing canopy structure due to the growing seeds provides good separation with other classes. Figure 6. Time-series NDVI curves of pure soybean pixels (a-c) and mixed pixels (d,e) during the entire crop growing period in Heilongjiang. The specific spatial locations of those five pixels were described in Figure 1. These five pixels were presented here to show the time-series curve characteristics of different soybean fractions and to examine whether the distance of spatial locations is related to the variations of crop time-series spectral curves.

The Optimal Independent Variables for Soybean Cultivation
In this study, the best adaptive bandwidth (number of nearest neighbors) is 736, with the minimum AICc value of −3087.201. This best value means 736 nearest neighbors (points) are included as local samples to perform local regression. Figure 7 displays the corresponding AICc values of models composed by different features derived from the forward stepwise selection strategy, which corresponds to the vortexes in Figure 4. Obviously, AICc value continues to decrease as the number of independent variable increases to 31. The model (point B in Figure 7) containing all 31 NDVI variables is the best for calculating sub-pixel soybean fraction in terms of the minimum AICc value. In addition, it can be found the model at point A has the biggest change rate of AICc, indicating those associated variables play the most critical roles in reducing AICc values and thus are considered as the optimal features among the 31 NDVI variables for sub-pixel soybean area calculation. Specifically, the number of independent variables at point A is four, and there are NDVI of Julian day 233, 257, 241 and 249, respectively. During this period (day 233 to 257), soybean cultivation is at the stage of filling seeds when the yellowing leaves and changing canopy structure due to the growing seeds provides good separation with other classes. Remote Sens. 2018, 13, x FOR PEER REVIEW 12 of 21

GWR Results of Sub-Pixel Soybean Area Estimation
Model fitting results (AICc, RSS, R 2 , and adjusted R 2 ) of GWR and OLS are summarized as Table  1. The adjusted overall mean local R 2 for GWR model is 0.4377, meaning 43.77% of the variability in the sub-pixel soybean proportion can be explained in the modeling dataset. Compared to OLS model, significant improvements for all descriptive statistics indicates GWR is superior in exploring the relationships between the time-series NDVI values and sub-pixel crop area. In addition, the non-stationarity F-test results including F1-test and F3-test between GWR and OLS is shown in Table  2. The F1-value indicates the null hypotheses is rejected and there is a significant difference between GWR and OLS model. Additionally, it is observed 21 variables of the total 31 NDVI variables as well as intercept show highly significant non-stationarity (α = 0.001 level), suggesting the response of the sub-pixel soybean area to those variables is not constant across the study area. By contrast, only six NDVI variables (i.e., NDVI of Julian day 81, 89,137, 153, 201 and 209) are not statistically significant, indicating they keep relatively stable relationships with the dependent variable. Not surprisingly, the four optimal variables, i.e., NDVI of Julian day 233, 241, 249, and 257 (Section 4.2) are found statistically significant, which confirms the importance of images acquired at filling seed stage for mapping sub-pixel soybean area. This F-test results also highlight the advantages of GWR over OLS model in expressing the spatially variant relationships between crop area and remote sensing images. It should be noted here the coordinate system based on which the distance between different points are calculated may impact the bandwidth and consequently the regression accuracy of GWR. For our work, the geographic coordinate system (GCS/WGS84) was used to calculate great circle distance between different points as we have demonstrated it outperforms the Euclidean distance based on projected coordinate system (PCS/Sinusoidal). The performances of model fitting based on these two distance calculation ways are displayed in Table 3. It shows their optimal bandwidth is different, and WGS84 has lower AICc and RSS, and higher R 2 and adjusted R 2 . The higher accuracy derived by calculating great circle distance may be explained by the fact the spherical distance can

GWR Results of Sub-Pixel Soybean Area Estimation
Model fitting results (AICc, RSS, R 2 , and adjusted R 2 ) of GWR and OLS are summarized as Table 1. The adjusted overall mean local R 2 for GWR model is 0.4377, meaning 43.77% of the variability in the sub-pixel soybean proportion can be explained in the modeling dataset. Compared to OLS model, significant improvements for all descriptive statistics indicates GWR is superior in exploring the relationships between the time-series NDVI values and sub-pixel crop area. In addition, the non-stationarity F-test results including F 1 -test and F 3 -test between GWR and OLS is shown in Table 2. The F 1 -value indicates the null hypotheses is rejected and there is a significant difference between GWR and OLS model. Additionally, it is observed 21 variables of the total 31 NDVI variables as well as intercept show highly significant non-stationarity (α = 0.001 level), suggesting the response of the sub-pixel soybean area to those variables is not constant across the study area. By contrast, only six NDVI variables (i.e., NDVI of Julian day 81, 89,137, 153, 201 and 209) are not statistically significant, indicating they keep relatively stable relationships with the dependent variable. Not surprisingly, the four optimal variables, i.e., NDVI of Julian day 233, 241, 249, and 257 (Section 4.2) are found statistically significant, which confirms the importance of images acquired at filling seed stage for mapping sub-pixel soybean area. This F-test results also highlight the advantages of GWR over OLS model in expressing the spatially variant relationships between crop area and remote sensing images. It should be noted here the coordinate system based on which the distance between different points are calculated may impact the bandwidth and consequently the regression accuracy of GWR. For our work, the geographic coordinate system (GCS/WGS84) was used to calculate great circle distance between different points as we have demonstrated it outperforms the Euclidean distance based on projected coordinate system (PCS/Sinusoidal). The performances of model fitting based on these two distance calculation ways are displayed in Table 3. It shows their optimal bandwidth is different, and WGS84 has lower AICc and RSS, and higher R 2 and adjusted R 2 . The higher accuracy derived by calculating great circle distance may be explained by the fact the spherical distance can characterize the real geographical correlations between different points more accurately than the plane distance.  [33]. *** Significant at the α = 0.001 level; ** Significant at the α = 0.01 level; * Significant at the α = 0.05 level.  Figure 8 shows the continuous map of soybean cultivation derived from the optimized GWR model using MODIS-NDVI time series, where the value presents the proportion of each MODIS pixel occupied by soybean cultivation and can also reflect the field heterogeneity as its patch sizes are smaller than the spatial resolution of MODIS (~250 m). As expected, the mapped soybean cultivation mainly concentrated in the northwest part close to Lesser Khingan Mountains, which is one of the most important soybean production regions in China. In addition, Sanjiang Plain to the northeast and the south part of Heilongjiang are also observed have some small fractions of soybean cultivation. Those regions were reported their soybean planting areas had been dramatically decreased during the last five years due to a great deal of importation of GM soybean from other countries, such as Brazil and the United States [4].

Sub-Pixel Soybean Map and Accuracy Assessment
A total of 2000 random validation points selected from Landsat-based fractional soybean map were used to assess our MODIS-based soybean extent in a spatially explicit fashion. A direct differencing comparison based on 10% bins of increase in percentage of soybean cover are illustrated in Figure 9. Overall, the regression line composed of median values was close to 1:1 line as hoped, indicating the spatial distribution of soybean cultivation derived from time-series MODIS data generally agree with Landsat-based soybean maps. Nevertheless, relative to Landsat data, MODIS results underestimate the values of larger than 0.2 while overestimates those of less than 0.2. Additionally, the vertical bars that respond to standard deviations of soybean fractions derived from MODIS show soybean with the proportion between 0.4-0.6 has relatively large bias compared to others. These big biases tend to mostly happen for those pixels which are mixtures of soybean and corn. This is because field survey (Figure 1a) shows the soybean fields always adjoin to corn fields and some of them are even intercropped with each other in Heilongjiang. Furthermore, previous studies have reported there are high levels of spectral confusion between soybean and corn [6,34]. Hence, those pixels occupied by a large proportion of corn will significantly reduce the identifiability of their soybean areas and consequently increase the regression bias. model using MODIS-NDVI time series, where the value presents the proportion of each MODIS pixel occupied by soybean cultivation and can also reflect the field heterogeneity as its patch sizes are smaller than the spatial resolution of MODIS (~250 m). As expected, the mapped soybean cultivation mainly concentrated in the northwest part close to Lesser Khingan Mountains, which is one of the most important soybean production regions in China. In addition, Sanjiang Plain to the northeast and the south part of Heilongjiang are also observed have some small fractions of soybean cultivation. Those regions were reported their soybean planting areas had been dramatically decreased during the last five years due to a great deal of importation of GM soybean from other countries, such as Brazil and the United States [4].
A total of 2000 random validation points selected from Landsat-based fractional soybean map were used to assess our MODIS-based soybean extent in a spatially explicit fashion. A direct differencing comparison based on 10% bins of increase in percentage of soybean cover are illustrated in Figure 9. Overall, the regression line composed of median values was close to 1:1 line as hoped, indicating the spatial distribution of soybean cultivation derived from time-series MODIS data generally agree with Landsat-based soybean maps. Nevertheless, relative to Landsat data, MODIS results underestimate the values of larger than 0.2 while overestimates those of less than 0.2. Additionally, the vertical bars that respond to standard deviations of soybean fractions derived from MODIS show soybean with the proportion between 0.4-0.6 has relatively large bias compared to others. These big biases tend to mostly happen for those pixels which are mixtures of soybean and corn. This is because field survey (Figure 1a) shows the soybean fields always adjoin to corn fields and some of them are even intercropped with each other in Heilongjiang. Furthermore, previous studies have reported there are high levels of spectral confusion between soybean and corn [6,34]. Hence, those pixels occupied by a large proportion of corn will significantly reduce the identifiability of their soybean areas and consequently increase the regression bias.   A comparison on the soybean cultivation areas estimated from the optimized GWR and the agricultural census data at the county level is shown in Figure 10. The two estimates of soybean acreage were significantly correlated (α < 0.01) at county levels, with R 2 of 0.80 and RMSE of 340.21 km 2 and NRMSE of 0.1054 across 80 counties. A closer observation for this figure shows our GWR model based on MODIS data overestimated soybean areas of more than 750 km 2 , while almost underestimated those areas of less than 500 km 2 .
The map in Figure 11 displays the spatial distribution of differences (both in soybean area and percentage) between the MODIS-based and census-based soybean cultivation area estimates for each county in Heilongjiang Province. It is observed that the largest percentage differences between these two datasets are in those counties, which are in the center part of Heilongjiang and has a soybean area of less than 500 km 2 , such as Yichun, Hegan, Suiling, Hailun, Tangyuan and Baiquan. The high landscape heterogeneity and small field size are responsible for the relatively large confusion between soybean and non-soybean spectral signature in those regions. Overall, in the north part of Heilongjiang, our GWR model based on MODIS data overestimated soybean cultivation. In contrast, our GWR model slightly underestimated soybean acreage in northern districts, such as Shanzhi, Dongning, Mulin, Linkou and Hailin. The obvious regional similarity in the regression bias may be because the adjoining areas present similar spectral characteristics for soybean cultivation, and thereby they may have the similar or the same regression relationships between sub-pixel soybean fractions and time-series NDVI. It should be noted here some errors could be due to the uncertainty of the census data, since crop area information is acquired from sampling survey or communication with farmers which are not always accurate [13]. A comparison on the soybean cultivation areas estimated from the optimized GWR and the agricultural census data at the county level is shown in Figure 10. The two estimates of soybean acreage were significantly correlated (α < 0.01) at county levels, with R 2 of 0.80 and RMSE of 340.21 km 2 and NRMSE of 0.1054 across 80 counties. A closer observation for this figure shows our GWR model based on MODIS data overestimated soybean areas of more than 750 km 2 , while almost underestimated those areas of less than 500 km 2 .
The map in Figure 11 displays the spatial distribution of differences (both in soybean area and percentage) between the MODIS-based and census-based soybean cultivation area estimates for each county in Heilongjiang Province. It is observed that the largest percentage differences between these two datasets are in those counties, which are in the center part of Heilongjiang and has a soybean area of less than 500 km 2 , such as Yichun, Hegan, Suiling, Hailun, Tangyuan and Baiquan. The high landscape heterogeneity and small field size are responsible for the relatively large confusion between soybean and non-soybean spectral signature in those regions. Overall, in the north part of Heilongjiang, our GWR model based on MODIS data overestimated soybean cultivation. In contrast, our GWR model slightly underestimated soybean acreage in northern districts, such as Shanzhi, Dongning, Mulin, Linkou and Hailin. The obvious regional similarity in the regression bias may be because the adjoining areas present similar spectral characteristics for soybean cultivation, and thereby they may have the similar or the same regression relationships between sub-pixel soybean fractions and time-series NDVI. It should be noted here some errors could be due to the uncertainty of the census data, since crop area information is acquired from sampling survey or communication with farmers which are not always accurate [13].

Discussion
In this study, we used an optimized GWR model to explore the non-stationary spatial relationships between soybean area and time-series NDVI values on a sub-pixel scale and then predict soybean fraction for the whole Heilongjiang Province, China. Our results have high consistency both with Landsat-based soybean estimates and census data, indicating that satellite remote sensing technology when given high quality reference data can provide a simple but effective way to estimate reginal crop area compared to traditional sampling survey or communication with farmers. In addition, compared to the OLS model, the superior performance of the GWR model also implies the spatial variation of spectral signature across time and the spatial nonstationary within these NDVI variables should be considered when the relationships between crop area and time-series NDVI is explored. The regression results achieved for soybean cultivation also indicate the strengths of the MODIS time series data for crop mapping as its high temporal and spectral resolutions effectively preserve the dynamic phenological characteristics of crops. It is well known that one critical shortcoming in mapping land cover over large areas based on MODIS data is its coarse spatial resolution, which can be partially overcome by using this regression method that can calculate area fraction at the sub-pixel level. Since such data are available at global scale and with nearly twenty years of data, there is great potential for sub-pixel crop mapping at large region including changes to crop cultivation through time, especially when it combined with GWR model.
For the selection of optimal independent variables, the minimum AICc is achieved when all the 31 time-series NDVI are employed together, among which NDVI of Julian day 233, 241, 248 and 257 were found the most crucial variables in significantly reducing AICc. This period (233 to 257) responds to the filling seed stage of soybean cultivation in Heilongjiang, when the yellowing leaves and unique seed texture make the spectral signature of soybean significantly different from other classes. It should be noted here since phenological characteristics may vary from region to region, these selected spectro-temporal features for soybean cultivation in Heilongjiang may not necessarily be optimal for other study areas and other crop types. However, the forward stepwise regression method can effectively select the optimal combination of independent variables and provide valuable insights for understanding the variable importance, which is highly recommended for variables selection of GWR especially when there is a large number of candidate independent variables.
It is well known that spectral variations, including inter-class and intra-class spectral variation, are essential factors for the accuracy of spectral unmixing [15]. In this study, inter-class spectral variation is caused by the different land cover classes and their associated area proportions within MODIS pixels. Due to different agricultural intensive management (e.g., irrigation, fertilization, and sowing), local weather conditions and soil quality, the same crop type especially those spatially further away may present different spectral signature, thus resulting in intra-class spectral variations. The results of GWR model show the relationships between sub-pixel crop fraction and spectral signatures vary from location to location, with similar relationships for adjacent pixels and vice versa. This can be explained by the fact that the close soybean MOIDS pixels may share more similar climate environment, agricultural planting conditions and phenological characteristics than those far away from each other, which makes them have smaller intra-class spectral variations and thus having similar relationships with sub-pixel crop fractions. Therefore, the specific advantage of GWR model in estimating sub-pixel crop area is that GWR can largely reduce intra-class spectral variation by building local regression relationships between sub-pixel crop fraction and spectral signatures.
Despite the wide application of GWR models in remote sensing fields, its potential in modelling the spatial distribution of crop type has never been explored before. Our study is the first effort to explore the spatial variation in the relationships between sub-pixel crop area and remote sensing data and to predict the crop area at regional scale based on the GWR model. Our conclusions extend and strengthen the findings of Pan et al. [10] and Lobel et al. [11], who found the quantitative relationships between sub-pixel crop fractions and MODIS time-series data but ignored the spatial non-stationarity of such relationship. In addition, our presented GWR model improved the regular GWR model by using a forward stepwise strategy to select the optimal dependent variables from time-series remote sensing data, which shows great potential when having a large of candidate independent variables for GWR. Even though the GWR model was used in this study to estimate the sub-pixel soybean cultivation in a single province in China using MODIS time series data, it is easily generalizable to any other crop type (e.g., corn or rice) and any input data (e.g., VIIRS) since it is entirely automated. However, it is noteworthy that the landscape fragmentation has negative impacts on the accuracy of crop area estimation. The landscape heterogeneity impacts estimates of crop area by increasing variability in crop spectral reflectance at decreasing spatial scales [10]. Also, the fragmented cropland always increases the difficulty for sensors to detect the weak spectral signature within MODIS pixels and therefore will reduce the accuracy of GWR prediction. Since the landscape fragmentations may vary with locations, more regions with different agriculture landscapes and more remote sensing images with different spatial resolution needed to be tested in the future to improve the robustness of GWR model in estimating sub-pixel crop fraction.
However, some potential improvements could be considered before the presented GWR model is applied to larger areas and more crop types. First, a wide range of independent variables, such as different vegetation indices and phenological metrics (e.g., mean, standard deviation, minimum and maximum) in addition to NDVI could be tested to improve the regression accuracy. Previous studies have proved NDVI is not necessarily the optimal spectral feature for identifying specific crop type (e.g., rice) [6,35], and some studies also reported the inclusion of phenological metrics could significantly improve classification accuracy [1,36]. Second, an extension to GWR will be developed, which aims to realize the possibility of creating models where some variables are held constant across the study area while others can vary spatially. Such models, which are known as mixed or semi-parametric models [32,37], could be tested in the future for application to other study areas or other land cover classes. In addition, some models [38] that considered non-stationarity both in space and time will be also explored to improve the accuracy in regions where agricultural landscapes is fragment and cropping pattern is complex. Finally, it is well known that the representativeness in quality and quantity of training samples, as well as their spatial homogeneity are quite important for GWR [1,39]. However, one of the most challenging problems is how to collect and compile such a dataset for regions without existing high-quality reference data. Fortunately, more and more freely satellite data of finer spatial and temporal resolutions than MODIS becomes available, such as Sentinel1-A [40], Sentinel2-A [41], Landsat 8 [42], GaoFen-1/2 [43]. High-quality classification results from these datasets can be used as reference information to acquire reference samples for large-scale land cover mapping efforts on a sub-pixel level based on coarse spatial resolution images (e.g., MODIS and VIIRS).

Conclusions
In this study, we presented and evaluated a method for estimating the fractional soybean cultivation at 250 m spatial resolution using the MODIS-NDVI time series. We developed a geographically weighted regression model calibrated using sub-250 m percentage soybean information from Landsat, focusing on Heilongjiang region of China in 2013. A forward stepwise selection optimization strategy was specially used to select the optimal dependent variables from the time-series NDVI for the GWR model. All 31 time-series NDVI were identified as the optimal combination of independent variables with minimum AICc. NDVI of Julian day 233, 241, 249, and 257 were found the most important temporal variables for sub-pixel soybean area calculation when soybean is at the stage of filling seed. Our analyses showed that estimated percent soybean at 250 m compared well with Landsat-based results. There was also a good agreement between our MODIS-based results and census data, with R 2 of 0.80 and RMSE was 340.21 km 2 across 80 counties. The largest differences between these two datasets were observed in regions with high landscape heterogeneity and small field size. Additionally, the optimized GWR model had better performances with lower RSS and higher R 2 compared to OLS model, and their F-test results shown 23 variables of the total 31 time-series NDVI are statistically significant. These promising results indicate the spatial nonstationary within spectral-based variables should be considered when the relationship between crop sub-pixel fraction and remote sensing images is explored. Our presented GWR model shows great potential in mapping sub-pixel crop distribution over large region based on coarse spatial resolution imagery such as MODIS. However, if the presented GWR model is to be extended to other regions and land cover types, further development of the GWR method is possible, such as testing a variety of the input variables (e.g., vegetation indices and phenological metrics) and developing a mixed GWR model that allow some variables held constant while others vary spatially. In addition, future work could address the issues on how to use multi-source finer spatial resolution data to improve the selection of training and validation samples.