An Optimal Sample Data Usage Strategy to Minimize Overfitting and Underfitting Effects in Regression Tree Models Based on Remotely-Sensed Data

Regression tree models have been widely used for remote sensing-based ecosystem mapping. Improper use of the sample data (model training and testing data) may cause overfitting and underfitting effects in the model. The goal of this study is to develop an optimal sample data usage strategy for any dataset and identify an appropriate number of rules in the regression tree model that will improve its accuracy and robustness. Landsat 8 data and the Moderate Resolution Imaging Spectroradiometer (MODIS) scaled Normalized Difference Vegetation Index (NDVI) were used to develop regression tree models. A Python procedure was designed to generate random replications of model parameter options across a range of model development data sizes and rule number constraints. The mean absolute difference (MAD) between the predicted and actual NDVI (scaled NDVI, values from 0 to 200) and its variability across the different randomized replications were calculated to assess the accuracy and stability of the models. In our case study, a six-rule regression tree model developed from 80% of the sample data had the lowest MAD (MADtraining = 2.5 and MADtesting = 2.4), which was suggested as the optimal model. This study demonstrates how the training data and rule number selections impact model accuracy and provides important guidance for future remote-sensing-based ecosystem modeling.

Data-driven regression tree models based on satellite observations, climate and environmental variables, and ground truth data have been successfully used for regional and global land cover mapping and ecosystem modeling [6,20-26]. The data-driven regression tree approach divides the multi-dimensional data domain of environmental variables into many segments and derives one (or multiple) regression equation(s) for each segment. Results from these previous studies help to better understand regional land surface changes and trends, ecosystem services, and climate change impacts on land surface dynamics.
Cubist software [27], a common tool for predicting continuous ecosystem attributes, develops a generalized set of rules with associated multiple regression models (a series of piecewise regressions) which are constrained by the training data [28-31]. Generally, a simple sample data usage strategy (e.g., 90% of the sample data for model training and 10% of the sample data for testing the reliability of the model) is applied in Cubist regression tree model development [4]. However, improper use of the sample data (the percentage of the data used for model training and testing) can cause overfitting or underfitting effects [32] and may lead to a biased evaluation of the derived regression tree models [33,34]. Therefore, an approach that seeks to identify optimal model parameters for a dataset (e.g., training and testing sample size, number of rules) is often needed.
The main objective of this study is to develop a method that identifies an optimal sample data usage strategy and rule numbers that minimize over- and underfitting effects and maximize prediction accuracy on unseen data in regression tree mapping models. Regression tree models were developed using nine randomized replications at each training data size from 10% to 90% of the sample data, in 10% increments, using GNU General Public License (GPL) Cubist (version 2.07 GPL Edition) and Python scripts. Accuracy assessments (i.e., training and testing error magnitudes and variation across a wide range of training data sizes) for the derived regression tree models were quantified. Results from this study reveal trends of prediction error and provide an approach for optimizing the accuracy of a Cubist regression tree model [35], which will likely be useful for future remote-sensing-based ecosystem parameter mapping and modeling studies.

Case Study Area
A pilot study area located mainly in Northeastern Nevada (Figure 1) was selected for testing. Based on the 2011 National Land Cover Database (NLCD 2011), the main land cover types for the study area are shrub/scrub (75.4%) and forest (10.2%). Other land cover types include barren land (6.6%), herbaceous (4.9%), and hay/pasture (1.2%) [5]. The broad range of land cover types and vegetation biomass productivity in the study area helped to ensure the robustness of the testing results.

Background
The satellite-derived Normalized Difference Vegetation Index (NDVI) is the normalized reflectance difference between the near infrared (NIR) and visible red bands [36,37]. NDVI represents the vigor and photosynthetic capacity (or greenness) of the vegetation canopy [37,38]. Time series of Moderate Resolution Imaging Spectroradiometer (MODIS) NDVI data have been widely used for vegetation dynamics (e.g., land surface phenology) [39] and ecosystem monitoring (e.g., biomass productivity mapping) [40]. The advantages of MODIS data include (1) high temporal resolution (one- to two-day revisit time), which can capture rapid land changes during the growing season, and (2) a wide range of wavelengths that make the MODIS land surface products robust and reliable (e.g., atmospherically corrected for clouds, cloud shadows, and aerosols) [41]. However, the 250 m MODIS productivity map can only provide coarse-scale pattern information and cannot capture more detailed site-specific information for a region. To solve this problem, an approach was developed that integrates 250 m MODIS growing season NDVI (GSN, used as a proxy for vegetation biomass productivity) [40,42,43] and 30 m Landsat 8 Operational Land Imager data to downscale MODIS GSN to 30 m [29,44]. These downscaling studies illustrate the strong correlation between Landsat multiple-band data and MODIS NDVI (i.e., MODIS NDVI can be predicted from multi-band Landsat data).
In this case study, Landsat 8 spectral band data and MODIS NDVI were selected to develop approaches for optimizing Cubist model parameterization, build the data-driven regression tree models, and demonstrate the optimal data usage strategy that minimizes possible over- and underfitting effects in the regression tree models.
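The NDVI calculation and the 0-200 scaling used throughout this case study can be sketched as follows. This is a minimal illustration: the mapping scaled = (NDVI + 1) × 100 is assumed here as the conventional eMODIS byte scaling and is not explicitly stated in the text above.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red)

def scale_ndvi(v):
    """Map NDVI in [-1, 1] to the scaled 0-200 range used in this study.
    Assumed convention: scaled = (NDVI + 1) * 100."""
    return (np.asarray(v, dtype=float) + 1.0) * 100.0

# Example: NIR = 0.5, red = 0.1 -> NDVI = 0.4/0.6 ~ 0.667 -> scaled ~ 166.7
v = ndvi(0.5, 0.1)
s = scale_ndvi(v)
```

Under this convention, a scaled difference of 10 corresponds to a real NDVI difference of 0.1, which is why the scaled AD and MAD values reported in this study are divided by 100 to recover real values.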

Landsat 8 Observations and MODIS NDVI Data
A LANDFIRE tile (r04c03), which is a composite made up of multiple Landsat 8 scenes for a target date, was selected for this study. To minimize cloud and bad detection effects in the original Landsat 8 data, a cosine-similarity compositing approach was used [45], which combines pixels from multiple observations based on data quality and temporal proximity to a target date. In this case, Julian date 212, which yielded relatively few "no data and/or cloudy" pixels, was used as the target date, with observations from days 140-240 in 2013. The derived 30 m composite data were then upscaled to 250 m using the "spatial averaging" method [29]. Based on our previous study results [29,44], six Landsat 8 spectral bands (bands 1-6) at 250 m resolution were used as independent variables for developing the piecewise regression tree models to predict the 250 m MODIS NDVI (dependent variable). Furthermore, to ensure the high quality of the derived 250 m Landsat 8 data and avoid any additional cloud and atmospheric effects, the percentage of 30 m pixels with "0" values within each 250 m pixel was calculated. Only those 250 m pixels with 0% of "0" values (i.e., all the 30 m pixels within a 250 m pixel are non-zero-value pixels) were selected (Figure 2a, light blue color) to develop the regression tree models.
The seven-day maximum value composites of 250 m MODIS NDVI (scaled NDVI, values from 0 to 200) for the year 2013 were obtained from the USGS expedited MODIS (eMODIS) data archive [46]. Pixels with bad quality, negative values, clouds, snow cover, and low view angles were filtered out based on the MODIS quality assurance data [47] to ensure high-quality eMODIS NDVI data. The 2013 weekly NDVI data were then stacked and temporally smoothed using a weighted least-squares approach to reduce additional atmospheric noise [48]. Temporal smoothing helps to ensure reliable NDVI values for weekly or longer cloudy periods [49]. Finally, the 250 m MODIS NDVI for the date around Julian date 212 (a similar date as the Landsat scene date) was extracted and used as the dependent variable in the regression tree model.
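The "spatial averaging" upscaling step can be illustrated with a simple block-mean sketch. This is illustrative only: 250/30 is not an integer factor, so the actual method in [29] necessarily handles fractional pixel overlap or resampling, which is omitted here.

```python
import numpy as np

def block_average(arr, factor):
    """Upscale a 2-D array by averaging non-overlapping factor x factor blocks.
    Edge rows/columns that do not fill a complete block are trimmed.
    Illustrative stand-in for the study's 30 m -> 250 m spatial averaging,
    which does not reduce to an exact integer factor."""
    h, w = arr.shape
    h2, w2 = h // factor * factor, w // factor * factor  # trim partial blocks
    trimmed = arr[:h2, :w2]
    return trimmed.reshape(h2 // factor, factor, w2 // factor, factor).mean(axis=(1, 3))

# A 4x4 grid averaged in 2x2 blocks yields a 2x2 grid of block means.
coarse = block_average(np.arange(16, dtype=float).reshape(4, 4), 2)
```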

Spatial Evaluations of Large Model Prediction Errors Using NLCD 2011
The USGS 30 m NLCD 2011 map for the study area was obtained from the NLCD 2011 data archive [50] (Figure 1). This NLCD map was used to evaluate where, and for what land cover types, large prediction errors (i.e., over- and underfitting effects; see the next section for a detailed explanation) are likely to occur, and to investigate the potential causes of the large errors.

Method
As described earlier, the main goal and focus of this study is to develop an approach that identifies the optimal sample data usage strategy for regression tree models by minimizing over- and underfitting effects in the model. The mean absolute difference (MAD) between the predicted NDVI and the actual NDVI (scaled NDVI, values from 0 to 200) was calculated (Equation (1)) and used to evaluate model over- and underfitting tendencies. The standard deviation of the MAD (SD) from the nine different randomized tests using the same criteria for building the regression tree models was also calculated (Equation (2)) and used to assess regression tree model performance.

MAD = (1/N) Σᵢ |NDVIpredicted,i − NDVIactual,i|, i = 1, …, N  (1)

SD = sqrt( Σⱼ (MADⱼ − MAD̄)² / (n − 1) ), j = 1, …, n  (2)

where N is the total number of samples, MADⱼ is the MAD of the j-th randomized replication, MAD̄ is the mean MAD across replications, and n = 9 is the number of randomized replications.
In this study, "underfitting" is defined as having the highest errors (MAD) for both training and testing (i.e., validation), and "overfitting" means the training error is low while the testing error is high (i.e., MADtraining < MADtesting). "Good fit" (i.e., the optimal model) is defined as having the lowest errors for both validation and training together with a relatively low SD [32]. Table 1 summarizes the criteria used to assess regression tree model performance. A series of tests was performed that included (1) a small training data size with a large testing data size (large training and testing MADs were expected, primarily from overgeneralization); (2) a large training data size with a small testing data size (a high testing MAD with a relatively low training MAD was expected, primarily from overfitting); and (3) the "happy medium" (a robust model with minimal errors) with different maximum allowed numbers of rules used to build the Cubist regression tree models. Models with overfitting, underfitting, and good fit conditions were identified based on the criteria in Table 1.
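The MAD statistic and the Table 1 fit criteria can be sketched as follows. This is a minimal illustration: the numeric thresholds `gap` and `high` are assumptions chosen for demonstration, not values defined by the study.

```python
import numpy as np

def mad(predicted, actual):
    """Mean absolute difference (Equation (1)) between predicted and actual values."""
    return float(np.mean(np.abs(np.asarray(predicted, dtype=float)
                                - np.asarray(actual, dtype=float))))

def diagnose_fit(train_mads, test_mads, gap=0.3, high=2.9):
    """Classify replicated runs in the spirit of the Table 1 criteria.
    `gap` (testing MAD exceeding training MAD) and `high` (error level treated
    as 'highest') are illustrative thresholds, not values from the study."""
    train_mean = float(np.mean(train_mads))
    test_mean = float(np.mean(test_mads))
    if train_mean >= high and test_mean >= high:
        return "underfit"   # high errors on both training and testing
    if test_mean - train_mean > gap:
        return "overfit"    # low training error but high testing error
    return "good fit"       # low errors on both

# The study's three example models fall out as expected under these thresholds:
u = diagnose_fit([3.0], [3.0])  # 1 rule, 40% training
o = diagnose_fit([2.2], [2.8])  # 11 rules, 10% training
g = diagnose_fit([2.5], [2.4])  # 6 rules, 80% training
```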

Processing Procedures
Python is a programming language that allows a programmer to develop code (e.g., batch processing) and subsequently uses an interpreter to run the code [51]. A Python script was developed to automatically perform two Cubist modeling procedures (version 2.07 GPL Edition). The first procedure replicates a Cubist model nine times utilizing 10%-90% subsamples of a given dataset (i.e., training data), incrementing by 10%, resulting in a total of 81 model runs. Training data were randomly sampled from the entire dataset for each 10% subsample. The maximum number of rules (i.e., regression trees) that could be used was set to 100, and the maximum extrapolation (values can range between 0 and 100%; increasing numbers indicate the percent adjustment of the rules to the training set) was set to 10%. The MAD and SD for both testing and training were subsequently calculated and output as a summary file. The summary file also includes the minimum and maximum number of rules at each 10% increment. The second procedure held the random sampling percentage for training constant at a user-defined value (40%, 60%, and 80%), varied the number of rules between 1 and 100, and set the maximum extrapolation to 10%. The MAD was subsequently calculated for each of the rules. The following steps describe an application of this approach (a case study), demonstrating the detailed processing procedures of this method:
1. Classify the MODIS NDVI map into three classes (low, medium, and high NDVI) with each class having an equal number of pixels. Select ~10,000 randomly stratified samples from the study area based on the three NDVI classes (Figure 2b, each class had ~3330 random samples). The 250 m six-band Landsat 8 data (independent variables) and 250 m MODIS NDVI (dependent variable) were extracted for each of the selected samples [52].
2. Develop data-driven rule-based piecewise regression tree models using 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% of the sample data as the model training data in Cubist (nine different randomized tests were run and nine models were built for each percentage). Mean training and testing MAD and SD from the nine randomized tests for each of the training data percentages were calculated, and models with overfitting, underfitting, and good fit potentials (Table 1) were identified. All procedures were accomplished using the above Python script.
3. Use different rule numbers (i.e., from one up to the maximum observed rule number determined by the Cubist regression tree algorithm; in our case, from 1 to 12) with the selected optimal percentages of sample training data (40%, 60%, and 80% from step 2 in this case study) to build the Cubist regression tree models. Calculate the MAD for each model and identify the optimal rule numbers that mitigate over- and underfitting effects and minimize the prediction errors.
4. Based on the results from steps 2 and 3, additional testing was performed to determine the final model parameters (percent of training data size and number of rules) for overfitting and underfitting models in this case study. These tests included (1) using a 10% training data size (overfitting tendency) with different rule numbers (1-12 rules), and (2) using one rule (underfitting tendency) with different training data sizes (10%-90%) to build the Cubist regression tree models.
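The replication logic of step 2 can be sketched as follows. Cubist is a standalone command-line tool, so this sketch substitutes an ordinary least-squares fit on synthetic data (both are assumptions for illustration only) while preserving the nine-replication, 10%-90% random sampling design and the MAD/SD summary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: 6 predictor "bands", one response (scaled-NDVI-like).
X = rng.normal(size=(1000, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=1000) + 100.0

def fit_predict(X_tr, y_tr, X_te):
    """Stand-in for Cubist: a single least-squares regression with intercept.
    The study itself used Cubist's rule-based piecewise regressions."""
    A = np.column_stack([X_tr, np.ones(len(X_tr))])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([X_te, np.ones(len(X_te))]) @ coef

results = {}
for pct in range(10, 100, 10):      # training data sizes 10% .. 90%
    test_mads = []
    for rep in range(9):            # nine randomized replications per size
        idx = rng.permutation(len(X))
        n_tr = int(len(X) * pct / 100)
        tr, te = idx[:n_tr], idx[n_tr:]
        pred = fit_predict(X[tr], y[tr], X[te])
        test_mads.append(np.mean(np.abs(pred - y[te])))
    # Mean testing MAD and its SD across the nine replications (Equations (1)-(2))
    results[pct] = (float(np.mean(test_mads)), float(np.std(test_mads, ddof=1)))
```

Each entry of `results` mirrors one point on the study's Figure 3 curves: the mean testing MAD and its replication-to-replication SD for a given training data size.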
Additionally, the following procedures were performed in our case study application to quantify and understand the mapping impacts of the overfitting, underfitting, good fit, and all-sample prediction models:
A. Predicted the 250 m MODIS-Landsat NDVI using the overfitting, underfitting, good fit, and all samples (100% of the samples used for model development) regression tree models. Mapped the absolute difference (AD) between the predicted NDVI and the actual NDVI across the study area for each model. Since the scaled NDVI (values from 0 to 200) was used in this study, the AD and MAD values calculated in this study were also scaled (i.e., divide by 100 to obtain the real AD and MAD values).
B. Generated the extreme AD (i.e., scaled AD > 10, real AD > 0.1) maps for the "overfit", "underfit", "good fit", and "all samples" models. Compared the four extreme AD maps and investigated the potential causes (where and why) of the extreme AD based on the NLCD 2011 map.
C. Evaluated the MAD comparison graphics for the four models (overfit, underfit, good fit, and all samples) based on the three categories (i.e., MAD ≤ 10, MAD > 10, and all cases).
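Steps A and B above amount to differencing two rasters and thresholding the result. A minimal sketch (the array values are made up for illustration):

```python
import numpy as np

def extreme_ad_mask(predicted, actual, threshold=10):
    """Absolute-difference map (step A) and its extreme-error mask,
    scaled AD > 10 (step B)."""
    ad = np.abs(np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float))
    return ad, ad > threshold

pred = np.array([100.0, 120.0, 95.0])   # predicted scaled NDVI
act = np.array([102.0, 105.0, 96.0])    # actual scaled NDVI
ad, mask = extreme_ad_mask(pred, act)
# ad = [2., 15., 1.]; mask = [False, True, False]
```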

Identification of the Optimal Regression Tree Model Complexity and Associated Model Parameters
Results from the Cubist model replication tests (step 2) show that there are only minor differences in error observed across all of the training data sizes (Figure 3a, all mean values of MAD < 3), suggesting that Cubist was resistant to overfitting. The training MAD was nearly constant across the gradient of training data sizes (blue color in Figure 3a), while the testing MAD was higher at low training data sizes, indicating improved model robustness with larger training sizes. The Cubist model derived from the 80% sample training data size had the lowest MAD and SD (red circles in Figure 3a,b) for both training and testing, indicating the optimal sample data usage for the regression tree model (Table 1). On the other hand, when a 10% training data size was used to develop the Cubist regression tree model, the MAD for testing was much higher than the MAD for training, and the SDs for both training and testing were high (Figure 3), indicating an overfitting tendency in these models (Table 1). To further optimize the rule numbers, regression tree models were built using 80% of the sample data while varying the maximum number of rules from 1 to 12 (processing step 3). Results show that the MADs for both training and testing were stable when the rule number was greater than approximately six (Figure 4). Therefore, in this case study, we concluded that an accurate and parsimonious model occurred at the 80% training data size with six rules (MADtraining = 2.5 and MADtesting = 2.4). Furthermore, the validation error (testing error) is expected to be reliable and consistent (the testing SD was low for the 80% training data size model replications, Figure 3b). A description of this optimal regression tree model (six rules) can be found in the Supplementary Materials.
Based on the rule number tests in processing steps 3 and 4 (Figure 4), we found that the regression tree models with one rule resulted in large MADs for both training and testing (ranging from 2.9 to 3.1), which indicated an underfitting tendency in these models. In this case study, using 40% of the data for training with one rule in the regression tree model (MADtraining = 3.0 and MADtesting = 3.0) met our criteria for the underfitting condition (Table 1). In addition, a Cubist model trained on 10% of the sample data with 11 rules in the regression tree model (MADtraining = 2.2 and MADtesting = 2.8) was used as an example of the overfitting condition.

Model Accuracy Assessments for the Four Conditions
To provide a clear view of how the MADs varied among the "overfit", "underfit", "good fit", and "all samples" regression tree models, the MAD histograms for the three AD categories (all AD; low error with AD ≤ 10; and high error with AD > 10) are illustrated in Figure 5a. Both the "good fit" and "all samples" models have the lowest MADs when compared with the "overfit" and "underfit" model predictions. Since independent testing data are desirable to quantify model error rates, in this case study the "good fit" model was considered the "best" model, as it optimized both the model parameters (number of rules) and the sample data usage (suitable portions for training and testing) compared with the "all samples" model. The "underfit" model had the highest overall MAD (Figure 5a) and a slightly higher frequency of occurrence in the AD ≤ 10 class than the "overfit" model (Figure 5b). This suggests that the prediction errors from the "underfit" model generally occurred within the low error group. Within the large error (AD > 10) category, the "overfit" model had a slightly higher MAD than the "underfit" model (Figure 5a), but it also had the highest frequency of occurrence in comparison with the other models (Figure 5b), implying that the "overfit" model tends to produce larger magnitudes of errors (or large outliers) than the other models. The "overfit" model may develop rules based on noise signals within the training data or develop over-specific, non-robust rulesets, resulting in a tendency for more frequent extreme errors in model predictions.
The "all samples" and "good fit" models have the lowest overall AD (Figure 5a), the most frequent perfect predictions (i.e., AD = 0), and the least frequent large errors (AD > 10) compared with the other model predictions (Figure 5b). The "overfit" model had over-specified conditions and tended either to predict very well (i.e., AD = 0) or very poorly (a high frequency of AD > 10) across the study area. In general, the "overfit" model, despite its limited robustness, performed better than the "underfit" model (lower overall AD in Figure 5a). It appears that the model diagnostic approach proposed in this study can successfully identify "overfit", "underfit", and "good fit" models.

Evaluation of the Extreme AD Regions for the Four Models Using NLCD 2011
To understand and demonstrate the prediction errors in a mapping application, the locations and land cover types of the large prediction error pixels (AD > 10) from the four models ("underfit", "overfit", "good fit", and "all samples") were mapped for the study area (Figure 6). Three boxes representing different locations with large numbers of high AD pixels were selected for evaluation. The extreme AD maps from the four models, together with 30 m land cover type (NLCD 2011) maps for the three boxes, are illustrated in Figure 6, boxes 1-3. Results show that pixels with large prediction errors (particularly for the overfitting model) usually contain mixed land cover types (within a 250 m model training resolution) and are located along the edges of riparian zones, croplands, and forests (Figure 6, boxes 1-3). Mixed land cover types within 250 m pixels can confound satellite spectral signals and lead to large model prediction errors. These same mixed-pixel errors would also exist in the 30 m land cover product used in this study. The locations and spatial patterns of the large AD regions are similar for the four models; however, differences can still be found, especially for the "underfit" model. For example, there are more extremely high AD pixels within boxes 1 and 2 (Figure 6) from the "underfit" model than from the other models, implying that the "underfit" model is less reliable in these regions than the other models.

Discussion
Overall, there were only minor differences in accuracy across all the models. Results from this case study indicate that the Cubist software was generally quite resistant to overfitting and underfitting tendencies. The generalization of the regression tree into "rules" by Cubist helps mitigate overfitting effects [27]. The cost function employed by Cubist is the mean absolute error modified by the ratio of the number of cases to the number of parameters, which weighs the process toward having fewer parameters and tries to simplify the model as a whole. This function is used both in parameter elimination during the regression fitting and in pruning the regression trees. In our case study, the model accuracy was stable with rule numbers greater than five (Figure 4). However, different datasets may vary in the optimal number of rules and the training-to-testing data size ratio needed for the regression tree model. Our approach, which could be applied to other Cubist modeling applications, identifies a parsimonious number of rules and a training-to-testing data size ratio that minimize the prediction errors.
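The complexity-weighted error described above can be illustrated in the spirit of Quinlan's M5 model trees, where the average absolute residual is inflated by (n + v)/(n − v) for n training cases and v model parameters. The exact multiplier is an assumption carried over from the M5 literature; Cubist's internal cost function may differ in detail.

```python
def penalized_error(abs_residuals, n_params):
    """Complexity-adjusted error in the spirit of M5-style model trees:
    average |residual| inflated by (n + v)/(n - v), penalizing models with
    many parameters relative to the number of cases. Illustrative only;
    the exact form used inside Cubist may differ."""
    n = len(abs_residuals)
    if n <= n_params:
        return float("inf")  # too few cases to support this many parameters
    avg = sum(abs_residuals) / n
    return avg * (n + n_params) / (n - n_params)

# With identical residuals, a 5-parameter model scores worse than a
# 2-parameter one, steering the search toward simpler regressions.
e2 = penalized_error([1.0] * 20, 2)   # 1.0 * 22/18
e5 = penalized_error([1.0] * 20, 5)   # 1.0 * 25/15
```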
As described in Section 3.2, both the "good fit" and "all samples" models have the lowest MADs. The "good fit" model was selected as the "best" model in this case study to allow the quantification of model errors. However, when the reference data (sample data) are limited and maximizing model accuracy while mitigating overfitting tendencies is important (e.g., carbon flux mapping based on observations from several flux towers), the "best" model is suggested to be developed from all of the available sample points. Using leave-one-out cross-validation [26,53], random k-fold validation [54][55][56][57], or randomized test replication methods can provide reasonable error rate expectations on unseen data (particularly if testing accuracies are stable over multiple replications).
Although the locations and the spatial patterns of the large AD error regions are similar for the four models, differences can still be found, especially for the "underfit" and the "overfit" models. Figure 7 shows the extreme AD (AD > 10) maps overlaid on the NLCD 2011 land cover for the "overfit" (black), "underfit" (magenta), and "good fit" (light blue) models. Additionally, a small box in the study area was selected and enlarged for illustration (Figure 7). The extreme AD regions for the "underfit" model usually occurred along ecotones, while the "overfit" model had more outliers (large errors) in the middle of the rangelands, indicating that tight data space envelopes from the training data for rangelands are being exceeded in the rangeland population being mapped (Figure 7). The "good fit" model had fewer extreme AD pixels than the other two models (Figures 5b and 7).
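The per-pixel AD and the extreme-error mask behind these maps can be computed as in this sketch; the function name and the tiny synthetic arrays are our own illustrations, not the study's data.

```python
import numpy as np

def extreme_ad_mask(predicted, actual, threshold=10):
    """Per-pixel absolute difference (AD) between predicted and actual
    scaled NDVI, plus a boolean mask of extreme-error pixels (AD > threshold)."""
    ad = np.abs(predicted.astype(float) - actual.astype(float))
    return ad, ad > threshold

# Example on a tiny synthetic 2 x 2 "map" of scaled NDVI values:
pred = np.array([[100, 120], [90, 150]])
obs = np.array([[105, 108], [91, 150]])
ad, mask = extreme_ad_mask(pred, obs)
pct_extreme = 100.0 * mask.mean()  # percent of map area with AD > 10
```

The same mask, overlaid on a land cover map such as NLCD 2011, yields the kind of extreme-AD visualization shown in Figures 6 and 7.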

Conclusions
This study developed a Cubist model optimization approach that uses variation in training sizes and rule numbers to identify combinations that minimize over- and underfitting effects on unseen data and improve the accuracy and robustness of the Cubist model. Our case study demonstrated that using 80% of the data as the training data (saving 20% of the data as an independent test set) with six rules in the regression tree model had the lowest prediction errors (MADtraining = 2.5 and MADtesting = 2.4; the scaled NDVIs with values from 0–200 were used in this study) and was chosen as the "good fit" (or optimal) model. However, when the sample data size is limited and maximizing model accuracy is a high priority (e.g., carbon flux mapping based on observations from several flux towers), using all the available data to build the Cubist model and employing a cross-validation accuracy assessment method is recommended to maximize model robustness for unseen data.
Using 10% of the sample data as the model training data with 11 rules in the regression tree model had a relatively low MAD for training but a high MAD for testing (MADtraining = 2.2 and MADtesting = 2.8), and was considered an "overfit" model. When 40% of the sample data was used as training data with one rule in the regression tree model, the MADs were high for both training and testing (MADtraining = 3.0 and MADtesting = 3.0), and the model was considered "underfit" in this case study. Results from this study provide useful information on how the training data selection impacts the accuracy of the regression tree model and which data usage strategy can minimize the model overfitting and underfitting impacts. This regression tree model optimization approach will be useful for future remote-sensing-based ecosystem parameter (e.g., biomass, cover, and carbon flux) mapping.
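As an illustration only, this three-way classification of the models can be expressed as a simple heuristic. The thresholds below are our own assumptions, chosen merely to reproduce the case-study values reported above; they are not part of the study.

```python
def diagnose_fit(mad_train, mad_test, gap_tol=0.4, high_mad=2.8):
    """Illustrative heuristic: classify a model from its training/testing MADs.
    gap_tol and high_mad are assumed thresholds tuned to this case study."""
    if mad_test - mad_train > gap_tol:
        return "overfit"    # e.g., MADtraining = 2.2, MADtesting = 2.8
    if mad_train >= high_mad and mad_test >= high_mad:
        return "underfit"   # e.g., MADtraining = 3.0, MADtesting = 3.0
    return "good fit"       # e.g., MADtraining = 2.5, MADtesting = 2.4
```

In practice the thresholds would need to be set relative to the error scale of the quantity being mapped rather than fixed values.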

Figure 1. Land cover types as identified by the 2011 National Land Cover Database [5] and the location of the study area (red box in the USA map). The legend is for the entire USA; some land cover types are not present within the study area.

Figure 2. (a) Two-hundred fifty meter pixels with 0% of "0" values (light blue color) used to develop regression tree models and (b) randomly stratified samples (orange dots) overlaid on the 250 m MODIS NDVI (scaled NDVI, value from 0–200) map for the study area.

Figure 3. (a) Mean MAD and (b) SD of MAD for the training and testing (validation) results from step 2. Eighty percent of the sample data used for training is highlighted by the red oval.

Figure 4. Training and testing MAD based on the different number of rules for the 80% training data size. The red oval indicates the optimal number of rules and the associated training and testing MAD.

Figure 5. (a) MAD values for all error (all AD), small error (AD ≤ 10), and large error (AD > 10; the MAD value for this class was scaled to get a better view with the other classes) categories based on the different model accuracy conditions; (b) percent of the total population (map area) for perfect prediction (AD = 0), small error (AD ≤ 10; % of total value was scaled to get a better view with the other classes), and large error (AD > 10) categories based on the different model accuracy conditions.

Figure 6. Extreme AD (AD > 10 at 250 m resolution) map from the four models (underfit, overfit, good fit, and all samples) for the study area. Three small representative boxes (boxes 1–3) were selected and enlarged to show the MADs and the 30 m NLCD 2011.

Figure 7. Extreme AD maps overlaid on the NLCD 2011 map for the "underfit", "overfit", and "good fit" models. Box 1 in the study area was enlarged for illustration.

Table 1. Mean absolute difference (MAD) between the predicted and actual NDVI and standard deviation of the MAD (SD) for regression tree model performance.