A Large-Scale Inter-Comparison and Evaluation of Spatial Feature Engineering Strategies for Forest Aboveground Biomass Estimation Using Landsat Satellite Imagery
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper focuses on the saturation problem, which is of great importance for improving AGB estimation. By adding spatial GLCM features and adjusting algorithm parameters, the accuracy has been improved.
The following suggestions could be reconsidered.
1. Keywords: some of them could be deleted.
2. Line 142: "from the" has been repeated.
3. The workflow could be improved.
4. As the author mentioned, the estimates of random forest may be random, so the process is usually repeated to achieve more reliable results; this procedure seems unclear.
5. Most importantly, to assess the role of different combinations of features, or the hyperparameters of random forest, it would be better to use ANOVA to compare the contribution of every part, especially to judge significance; otherwise the results are subjective.
6. The abstract mentions that the dataset includes AGB beyond 1000 Mg/ha, but the results and discussion seldom investigate these high values.
7. The unit Mg ha-1 could be Mg/ha.
8. Table A1 could be simplified.
Author Response
We thank the reviewer for taking the time to read our manuscript and for offering suggestions on how to improve it.
In addition to our specific responses, we would like to highlight some broader changes:
- The entire paper was re-edited to improve the clarity of the writing.
- The feature engineering section of the methods was re-organized to match the order in which the results are presented.
- Most sections of the methodology were revised to improve clarity.
- The discussion has been expanded to more thoroughly discuss the results.
- Figure 3, the analysis overview, was completely re-done.
- Scale bars and improved legends were added to Figures 7-10.
Comments 1: Keywords, some of them could be deleted.
Response 1: We have removed and reorganized the keywords.
Comments 2: Line 142: "from the" has been repeated.
Response 2: This was corrected.
Comments 3: The workflow could be improved.
Response 3: We created a new workflow figure to improve the clarity of the analysis steps and to make it more aesthetically pleasing. However, I'm definitely more of a scientist than an artist.
Comments 4: As the author mentioned, the estimates of random forest may be random, so the process is usually repeated to achieve more reliable results; this procedure seems unclear.
Response 4: In our analysis, we developed 250 models using subsets of the training/development dataset. Each model was then evaluated on the testing set, yielding a distribution of error scores to represent each "group" in the different experiments. Random forest is indeed random in that it performs random feature selection and draws bootstrapped samples of the training dataset.
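To make the procedure concrete, a minimal sketch follows (synthetic data and reduced model sizes for brevity; the variable names and settings are illustrative, not the study's actual code):

```python
# Illustrative sketch of the repeated-modeling procedure described above.
# Only the structure matters: many models fit on random subsets of the
# development data, each scored on a fixed test set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_dev, y_dev = rng.normal(size=(600, 5)), rng.normal(size=600)
X_test, y_test = rng.normal(size=(200, 5)), rng.normal(size=200)

rmse_scores = []
for i in range(250):  # 250 models, each on a random subset of the dev set
    idx = rng.choice(len(X_dev), size=300, replace=False)
    rf = RandomForestRegressor(n_estimators=10, random_state=i)
    rf.fit(X_dev[idx], y_dev[idx])
    pred = rf.predict(X_test)
    rmse_scores.append(mean_squared_error(y_test, pred) ** 0.5)

# The 250 RMSE values form the error distribution for one "group"
print(f"mean RMSE: {np.mean(rmse_scores):.3f} +/- {np.std(rmse_scores):.3f}")
```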
Comments 5: Most importantly, to assess the role of different combinations of features, or the hyperparameters of random forest, it would be better to use ANOVA to compare the contribution of every part, especially to judge significance; otherwise the results are subjective.
Response 5: We opted to use t-tests with a Bonferroni correction because we were only interested in making specific comparisons (e.g., each model against the baseline), as opposed to comparisons across all groups. It is not clear why this would be more subjective than an ANOVA.
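The comparison scheme can be sketched as follows (synthetic error scores; the group names and effect sizes are invented for illustration only):

```python
# Illustrative sketch of the planned-comparison scheme described above:
# each feature group's error distribution is compared against the
# baseline with a t-test at a Bonferroni-adjusted significance level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
baseline = rng.normal(60.0, 2.0, size=250)            # baseline RMSE scores
groups = {"GLCM": rng.normal(57.0, 2.0, size=250),    # clearly better group
          "Buffer": rng.normal(59.9, 2.0, size=250)}  # roughly equal group

alpha = 0.05 / len(groups)  # Bonferroni correction for the planned comparisons
p_values = {}
for name, scores in groups.items():
    t_stat, p = stats.ttest_ind(scores, baseline)
    p_values[name] = p
    print(f"{name}: t={t_stat:.2f}, p={p:.3g}, significant={p < alpha}")
```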
Comments 6: The abstract mentions that the dataset includes AGB beyond 1000 Mg/ha, but the results and discussion seldom investigate these high values.
Response 6: This is an excellent point and reflects poor editing on my (the lead author's) part. Originally, the paper examined the problem of spectral saturation more explicitly before the focus shifted towards an inter-comparison of feature engineering.
We dropped this component because, frankly, the number of locations on Earth where this is a problem is quite limited; such biomass densities are essentially only found in the Pacific Northwest of the United States and Canada. We have revised the text to better reflect that feature engineering is used to address spectral saturation but have removed the focus on densities >1000 Mg/ha.
Comments 7: The unit Mg ha-1 could be Mg/ha.
Response 7: We believe these are equivalent but could revise if necessary.
Comments 8: Table A1 could be simplified.
Response 8: I split the table into separate tables for each of the key feature groups. I'm not the best at LaTeX formatting, and getting these tables to look "nice" across multiple pages was rather difficult. However, the new format makes it easier to find a particular feature group.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper utilizes 176 lidar-derived AGB maps covering 9.3 million ha of forests in the Pacific Northwest of the United States to construct an expansive reference dataset spanning numerous biophysical gradients and AGB densities exceeding 1000 Mg ha−1. It conducts a large-scale inter-comparison of multiple spatial feature engineering techniques, including GLCMs, edge detectors, morphological operations, spatial buffers, neighborhood vectorization, and neighborhood similarity features, to improve AGB accuracy. This paper is well structured.
Figure 4: please add a legend.
Figure 5: please add R² for the regression models.
In the three experiments, how were the samples set? Please add a description of the sampling.
Author Response
We would like to thank the reviewer for reading our manuscript and providing suggestions for improving it.
In addition to our specific responses, we would like to highlight some broader changes:
- The entire paper was re-edited to improve the clarity of the writing.
- The feature engineering section of the methods was re-organized to match the order in which the results are presented.
- Most sections of the methodology were revised to improve clarity.
- The discussion has been expanded to more thoroughly discuss the results.
- Figure 3, the analysis overview, was completely re-done.
- Scale bars and improved legends were added to Figures 7-10.
Comments 1: Figure 4, please add a legend.
Response 1: We have made this change.
Comments 2: Figure 5, please add R² for the regression models.
Response 2: We have made this change.
Comments 3: In the three experiments, how were the samples set? Please add a description of the sampling.
Response 3: The sampling logic is described in Section 2.3. We have refined the text to make our sampling strategy clearer.
Reviewer 3 Report
Comments and Suggestions for Authors
Accurate quantification of terrestrial carbon stored in forests is important for understanding the global carbon cycle and mitigating climate change. This study compared a series of spatial features, together with time-series statistical features from Landsat satellites, to estimate forest aboveground biomass in the Pacific Northwest of the United States. Though this study deals with an important subject, there are major issues that must be resolved before the manuscript can be considered further for publication.
Major:
1. Lines 55-56: the mere number of reference observations utilized does not warrant a claim of innovation.
2. Of the many features adopted and compared, have the authors considered feature redundancy and normalization?
3. Lines 167-168: please specify the different sensors or Landsat satellites used for the different periods, as there were overlaps. Have the authors considered the quite different spectral response functions of the different Landsat sensors? The more clear observations used in the CCDC algorithm, the better the fitting will be. It is therefore suggested to list the number of clear observations within the time range to demonstrate the validity of the temporal features extracted by CCDC.
4. Regarding Table 2: the calculation of features including GLCM, Buffer, etc. also involves setting neighborhood window sizes. Why did the authors test the influence of different neighborhood window sizes on NV only, while neglecting its influence on the other features?
5. The findings are not in accordance. In Tables 4 and 5, the addition of spatial features, temporal features, and the combination of the two improved the AGB model accuracy. However, in Figure 7, the difference between "baseline+temporal" and "baseline+spatial+temporal" is not very obvious. The situation is similar in Figure 8, where the performance of "baseline+temporal" even seems superior to "baseline+spatial+temporal". Some recent publications should be discussed, for example: "The roles of environmental conditions in the pollutant emission-induced gross primary production change: Co-contribution of meteorological fields and regulation of its background gradients"; "Estimation of aerosol and cloud radiative effects on terrestrial net primary productivity over northeast Qinghai-Tibet plateau"; "Predicting the supply-demand of ecosystem services in the Yangtze River Middle Reaches Urban Agglomeration".
Minor:
1. Line 36: please explain the meaning of "rich historical depth".
2. Lines 228-230: please provide an adequate explanation of why two different time frames were adopted for the CCDC calculation.
3. Line 372, "Here, we included all spatial features except for the NV group": as can be seen from Figure 4 and Table 3, the neighborhood vectorization (NV) features were not the least effective ones. Why were only the NV features excluded from the spatial features used in experiment 2? Before experiment 2, the performance of different combinations of spatial features could have been compared, and the optimal combination then combined with the temporal features.
4. Figure 4: please provide the full names of the abbreviations used in the figure caption.
5. Figure 5: please name the subfigures (a) to (d) and provide relevant descriptions. "A ordinary" -> "an ordinary".
Comments on the Quality of English Language: none
Author Response
We would like to thank the reviewer for reading our manuscript and providing suggestions for improving it.
In addition to our specific responses, we would like to highlight some broader changes:
- The entire paper was re-edited to improve the clarity of the writing.
- The feature engineering section of the methods was re-organized to match the order in which the results are presented.
- Most sections of the methodology were revised to improve clarity.
- The discussion has been expanded to more thoroughly discuss the results.
- Figure 3, the analysis overview, was completely re-done.
- Scale bars and improved legends were added to Figures 7-10.
Comments 1: Lines 55-56, the mere number of reference observations utilized does not warrant a claim of innovation.
Response 1: This is a fair point, and we have removed that line. Our general objection to the presentation of results in the prior literature is that they identify very nominal improvements in model performance over very limited geographic extents, and we have observed these results being cited without the appropriate nuance.
Comments 2: Of the many features adopted and compared, have the authors considered feature redundancy and normalization?
Response 2: We deliberately selected an algorithm, Random Forests, that is robust to the inclusion of multicollinear features, to avoid introducing additional analysis steps. For example, if we performed PCA, we would also need to assess and transform all of the variables in each group so that they were more normally distributed, using something like a Yeo-Johnson transformation.
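The alternative pipeline alluded to here might look roughly like this (a hypothetical sketch on synthetic data; the variance threshold and transformer settings are assumptions, not anything used in the study):

```python
# Hypothetical sketch of the avoided alternative: Yeo-Johnson
# normalization of skewed features, then PCA to remove redundancy.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(7)
skewed = rng.lognormal(size=(500, 1))                # a skewed feature
correlated = skewed + rng.normal(0, 0.1, (500, 1))   # a near-duplicate of it
X = np.hstack([skewed, correlated, rng.normal(size=(500, 3))])

pipe = make_pipeline(PowerTransformer(method="yeo-johnson"),
                     PCA(n_components=0.95))  # keep 95% of the variance
X_reduced = pipe.fit_transform(X)
print(X_reduced.shape)  # fewer columns once the redundancy is removed
```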
Comment 3: Lines 167-168, please specify the different Landsat sensors used for the different periods, as there were overlaps. Have the authors considered the quite different spectral response functions of the different Landsat sensors? The more clear observations used in the CCDC algorithm, the better the fitting will be. It is therefore suggested to list the number of clear observations within the time range to demonstrate the validity of the temporal features extracted by CCDC.
Response 3: We note that we used the Landsat Collection 2 dataset, which includes calibrations to ensure the data are suitable for time series analysis. In the case of the LandTrendr workflow, we aggregated all imagery within each year using medoid compositing and produced model-fitted images using LandTrendr. This approach is commonly used for time series mapping (e.g., [1,2]) and helps smooth the temporal variation caused by the different sensors present in the Landsat archive.
We utilized two time periods with CCDC due to the presence of snow in our study area. CCDC works by fitting harmonic models with an intercept, slope, and several sine and cosine terms; LASSO is used to select the most important terms. We observed in many mountainous areas (e.g., the Cascades and the Coast Range in Oregon and Washington) that the presence of snow caused poor fitting behavior, because the very large reflectances produced over snow distorted the algorithm's ability to capture the harmonic pattern associated with vegetation phenology. Given that the United States has very dense Landsat coverage, there was still sufficient data to parameterize the CCDC models even when periods outside of the growing season were excluded. Because this approach is less common (CCDC is often used in tropical areas where this is not a problem), we opted to compute a "conventional" set of CCDC features using all clear-sky observations and a modified set using only surface reflectance values obtained during the growing season.
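The harmonic-model idea can be illustrated with a simplified sketch (this is not the CCDC implementation; the synthetic time series, harmonic frequencies, and regularization strength are all assumptions):

```python
# Simplified illustration of the harmonic-regression idea behind CCDC:
# an intercept, a linear slope, and sine/cosine terms are fit to a
# reflectance time series, with LASSO selecting the important terms.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 5, size=200))  # observation times in years
# synthetic reflectance: trend plus one annual harmonic plus noise
y = 0.3 + 0.01 * t + 0.05 * np.sin(2 * np.pi * t) + rng.normal(0, 0.01, 200)

# design matrix: slope term plus harmonics at 1, 2, and 3 cycles per year
X = np.column_stack([t] + [f(2 * np.pi * k * t)
                           for k in (1, 2, 3)
                           for f in (np.sin, np.cos)])

model = Lasso(alpha=1e-3).fit(X, y)  # sparse selection of harmonic terms
print(model.intercept_, model.coef_)
```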
Comment 4: Regarding Table 2, the calculation of features including GLCM and Buffer etc. also involves setting neighborhood window sizes. Why did the authors test the influence of different neighborhood window sizes on NV only while neglecting its influence on the other features?
Response 4: Previous analyses have utilized GLCM and buffer metrics and examined how different kernel sizes impact performance [3,4]. However, to our knowledge, no remote sensing analysis has used neighborhood vectorization with Landsat data. Given that this technique was somewhat novel, we expended greater effort assessing it.
Comment 5: The findings are not in accordance. In Tables 4 and 5, the addition of spatial features, temporal features and the combination of the two improved the AGB model accuracy. However, in Figure 7, the comparison of “baseline+temporal” and “baseline+spatial+temporal” is not very obvious. The situation is similar with figure 8, while the performance of “baseline+temporal” even seems superior to “baseline+spatial+temporal”.
Response 5: The results in Tables 4 and 5 indicate the improvement in the model's performance with respect to the entire modeling dataset; specifically, these are the performance scores and improvements relative to the 7,500 test points. Figures 7, 8, 9, and 10 represent only 15 km² subsets, and, as you noted, model improvement was not consistent across all spatial subsets. We expanded our discussion of the spatial features in Section 4.2 (originally titled "GLCM features are the best performing spatial features") to discuss this phenomenon and the use of spatial features for AGB modeling.
Comment 6: Some recent publications should be discussed, for example, The roles of environmental conditions in the pollutant emission-induced gross primary production change: Co-contribution of meteorological fields and regulation of its background gradients; Estimation of aerosol and cloud radiative effects on terrestrial net primary productivity over northeast Qinghai-Tibet plateau; Predicting the supply-demand of ecosystem services in the Yangtze River Middle Reaches Urban Agglomeration.
Response 6: These papers relate to the modeling of vegetation productivity, and it is not clear how they relate to our research. One analysis, "Estimation of aerosol and cloud radiative effects on terrestrial net primary productivity over northeast Qinghai-Tibet plateau", uses MODIS and does not appear to use spatial features. It is not clear where or why we would cite these papers in our analysis.
Comment 7: Line 36, please explain the meaning of “rich historical depth”.
Response 7: We agree that the phrase is confusing and have removed it. The word "rich" was motivated by the fact that the Landsat archive is one of the few datasets that characterizes forest ecosystems at temporal and spatial scales sufficient for capturing fine-grain ecosystem processes.
Comment 8: Line 228-230, please provide adequate explanations on why two different time frames were adopted for CCDC calculation?
Response 8: We outlined our motivation for this in Response 3 and have added text to that section of the manuscript explaining this logic.
Comment 9: Line 372, “Here, we included all spatial features except for the NV group”: As can be seen from Figure 4 and Table 3, neighborhood vectorization (NV) features were not the least effective ones. Why were only NV features excluded from the spatial features used in experiment 2? Before experiment 2, the performance of different combinations of spatial features could be compared, then the optimal combination could be combined with the time features.
Response 9: This is a fair point. We originally excluded the NV features on the grounds that they introduce a large number of highly correlated variables, though we acknowledge that this argument could also apply to the other feature groups we did include. We expanded upon the motivation for using the NV features in the discussion and explained our rationale for excluding them in the methods.
Unfortunately, as the lead author has since graduated, we no longer have access to a computer that can re-run these models in a reasonable amount of time.
Comment 10: Figure 4, please provide the full names of abbreviations involved in figure caption.
Response 10: Another reviewer requested that we add a legend. We have added a legend that associates each color with the full feature group name.
Comment 11: Figure 5, please name the subfigures (a) to (d) and provide relevant descriptions. “A ordinary” -> “an ordinary”.
Response 11. We have made this modification.
References
[1] Hudak, A. T., Fekety, P. A., Kane, V. R., Kennedy, R. E., Filippelli, S. K., Falkowski, M. J., ... & Dong, J. (2020). A carbon monitoring system for mapping regional, annual aboveground biomass across the northwestern USA. Environmental Research Letters, 15(9), 095003.
[2] Kennedy, R. E., Yang, Z., Braaten, J., Copass, C., Antonova, N., Jordan, C., & Nelson, P. (2015). Attribution of disturbance change agent from Landsat time-series in support of habitat monitoring in the Puget Sound region, USA. Remote Sensing of Environment, 166, 271-285.
[3] Lu, D. (2005). Aboveground biomass estimation using Landsat TM data in the Brazilian Amazon. International Journal of Remote Sensing, 26(12), 2509-2525.
[4] Hopkins, L. M., Hallman, T. A., Kilbride, J., Robinson, W. D., & Hutchinson, R. A. (2022). A comparison of remotely sensed environmental predictors for avian distributions. Landscape Ecology, 37(4), 997-1016.
Reviewer 4 Report
Comments and Suggestions for Authors
This study compares AGB estimation capabilities across different feature engineering techniques, using lidar-based AGB maps as reference data and spatiotemporal Landsat data as predictors.
Although various spatial data and spatiotemporal Landsat data are integrated and utilized, the problem of autocorrelation in the data does not appear to be considered. Rather than utilizing various input data, a more efficient approach would be to extract heterogeneous data that accounts for the correlation among inputs and apply those as input data. Cost-effectiveness must be considered when estimating forest AGB.
The target area of this study is relatively large, so regional variability (climate, growth, species composition, etc.) should be considered. I wonder if such variability can be meaningfully detected by feature engineering techniques.
Considering the comments above, the discussion section may be revised.
Author Response
We would like to thank the reviewer for reading our manuscript and providing suggestions for improving it.
In addition to our specific comments, I would like to highlight some changes:
- The entire paper was re-edit to improve the clarity of the writing.
- The feature engineer section of the methods was re-organized to conform with the order the results are presented.
- Most sections of the methodology section were revised to improve clarity.
- The discussion has been expanded to more thoroughly discuss the results.
- Figure 3, the analysis overview was completely re-done.
- Scale bars and improved legends were added to figures 7-10.
Comments 1: In this study, although various spatial data and spatiotemporal Landsat data are integrated and utilized, the problem of autocorrelation in the data does not appear to be considered.
Response 1: As noted in Section 2.3, prior literature assessing the spatial autocorrelation of multispectral satellite predictors has found that the structure of forests in the Pacific Northwest varies at scales of less than 500 m. That threshold is informed by forests on the West Coast and thus might not apply to all forests in the region we evaluated. However, it is important to note that the effect of spatial autocorrelation in the predictor variables would be to inflate apparent predictive power, which only strengthens the argument we are making in the paper: the spatial domain only nominally improves predictive strength, especially relative to the temporal metrics.
Comments 2: Rather than utilizing various input data, a more efficient approach would be to extract heterogeneous data that accounts for the correlation among inputs and apply those as input data. Cost-effectiveness must be considered when estimating forest AGB.
Response 2: In this analysis, we utilized the Google Earth Engine platform, which is free for research scientists to use. Outside of the time required to export additional features, there is no additional cost associated with generating a large number of features beyond the extra storage. Additionally, we deliberately selected an algorithm, Random Forests, that is robust to the inclusion of multicollinear features, to avoid introducing additional analysis steps. For example, if we performed PCA, we would also need to assess and transform all of the variables in each group so that they were more normally distributed, using something like a Yeo-Johnson transformation.
Comment 3: The target area of this study is relatively large, so regional variability (climate, growth, species composition, etc.) should be considered. I wonder if such variability can be meaningfully detected by feature engineering techniques.
Response 3: We were specifically interested in assessing spatial features produced by techniques like GLCMs and in comparing their impact with that of temporal features derived from algorithms like CCDC and LandTrendr. We acknowledge that including climate variables would be important if our goal were to maximize predictive power.
Reviewer 5 Report
Comments and Suggestions for Authors
I have included my comments in the attached pdf (Review_Report_Remotesensing-3185841).
Comments for author File: Comments.pdf
Author Response
We would like to thank the reviewer for reading our manuscript and providing suggestions for improving it.
In addition to our specific comments, I would like to highlight some changes:
- The entire paper was re-edit to improve the clarity of the writing.
- The feature engineer section of the methods was re-organized to conform with the order the results are presented.
- Most sections of the methodology section were revised to improve clarity.
- The discussion has been expanded to more thoroughly discuss the results.
- Figure 3, the analysis overview was completely re-done.
- Scale bars and improved legends were added to figures 7-10.
Comments 1: The study is marred by several ad hoc processes or unjustified feature generation processes.
Response 1: We respectfully disagree. The primary goal of our paper was to assess whether the spatial domain could improve estimation of AGB; it is not necessary to test every possible method of spatial feature generation to explore this question. While other spatial feature approaches could always be used, we believe our suite tests both the dominant types used in the literature (for which we cite sources) and more broadly explores different parts of the feature generation space. We describe our logic in response to this reviewer's specific criticisms below.
Comment 2: First, despite the authors' mention of using LiDAR-derived biomass, it is not clear why the authors opted for this instead of forest inventory-based biomass. Were these LiDAR data captured/acquired in the same year across the PNW?
Response 2: Our decision to use biomass estimates from airborne lidar was based on two considerations. The first is sample size: each of the lidar-derived maps was built from many site-level, co-occurring plot-level estimates of biomass that sampled the range of conditions at each site. The second is that spatial maps were critical for our tests, for two reasons. First, maps at the local scale allowed us to develop test datasets that sampled the range of biomass values at each site. Were only plot data used, the high- and low-biomass plots would come from different regions, which does not represent the use case relevant to most biomass mapping exercises; our approach therefore represents a more challenging test. Second, and relatedly, if spatial features are in fact predictive, then they must distinguish high- and low-biomass areas within a site. Spatial features drawn at widely-spaced inventory plots could appear predictive simply because of broad site differences, not because the local spatial structure actually mattered. Again, we feel that testing against maps provides a more rigorous test of the impact of spatial features.
We note that the only comparable dataset to the one we developed is the United States Forest Service's Forest Inventory and Analysis (FIA) plot database. The actual plot coordinates are not publicly available due to federal regulation; the FIA will not provide them without a signed Memorandum of Understanding (MOU), a process that can take more than a year. The lead author previously obtained an MOU for the New England states, which took more than nine months despite that study area being considerably smaller. We contacted the FIA on several occasions, and they did not respond to our inquiries. The alternative would therefore be to stitch together a patchwork of non-uniform plot datasets, and it is not clear why that would be an improvement over the lidar maps, which have been used extensively.
Regarding the dates of the lidar acquisitions: no, these were collected in different years. However, each acquisition was paired with concurrent local-scale plot measurements, so the field and lidar data match.
Comment 3: The LiDAR-derived biomass already has an error associated with this model; how this error propagation was handled is not addressed.
Response 3:
Agreed! Lidar data do introduce predictive error and do, in fact, saturate slightly at high biomass. However, lidar maps from ALS data are recognized as the best approach for scaling from plot data to maps, and they are completely independent of the optical data we then tested. The key issue is that our goal was not to solve the saturation or prediction problem with lidar, but rather to test how spatial and temporal metrics could mitigate the saturation issue with optical data. The lidar maps are simply meant to be a common reference (and best practice) against which to test the different spatial and temporal metrics. Propagating their error would not substantively change the comparative results on which we focused.
Comment 4: Second, if biomass were modeled at a specific point, how would features extracted from time-series analyses, such as continuous change detection and classification (CCDC) and LandTrendr, help explain variance in aboveground biomass at that point? Unfortunately, these variables improve R-squared by only about 10% over the base model. Rather than using temporal data (e.g., LandTrendr output), stratification of low- to high-biomass regions based on the LiDAR data would have provided a simple and straightforward division of biomass variability.
Response 4:
Our analysis made an explicit comparison between the spatial and temporal features in experiment 2. The motivation is that temporal features are commonly utilized in remote sensing analyses, particularly where AGB densities are large and the landscape features stand-replacing disturbances (see the introduction of our paper for citations). It is not clear what the stratification comment refers to; we did use stratified random sampling across biomass gradients.
Comment 5: Third, the spatial autocorrelation assumption based on previous analysis in the region (<500 m) is not justified. When the authors had the dataset, why did they choose to assume autocorrelation from previous studies? Moreover, after creating correlated variables (features) from LiDAR, textural, and other variables, this assumption may not hold true, because spatial autocorrelation can exist both in the dependent variable space and in the explanatory variable (environmental) space. Although Random Forest (RF) is a non-parametric method, sample independence is one of the requirements of supervised machine learning (Belgiu & Dragut, 2016). The medoid layer on the temporal data set smooths the data, thereby increasing spatial autocorrelation. Unless a thorough assessment of the autocorrelation across variables is performed, the data set cannot be assumed independent; therefore, the parametric test (t-test) on AGB estimates is not justified. However, non-parametric counterparts would still provide significant results.
Response 5:
We agree that random forest requires independence of samples, yes!
However, we disagree with several other assertions in this comment.
First, the medoid approach is not a spatial smoother. The medoid method is a temporal normalization technique that avoids contamination by clouds. It is unclear why the medoid method would differ from any other technique used to aggregate imagery into a single mosaic (e.g., mean or median compositing), and we are unaware of any papers that support this assertion.
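For a single pixel, the medoid idea can be sketched as follows (an illustrative example of the general technique, not the study's Earth Engine code):

```python
# Hedged sketch of medoid compositing for one pixel: among a year's
# clear observations, pick the observation closest (sum of squared band
# differences) to the per-band median spectrum.
import numpy as np

def medoid(observations: np.ndarray) -> np.ndarray:
    """observations: (n_images, n_bands) spectra for one pixel."""
    median = np.median(observations, axis=0)
    dist = ((observations - median) ** 2).sum(axis=1)
    return observations[np.argmin(dist)]

# three synthetic 3-band observations; the last is cloud-contaminated,
# so the medoid selects one of the two clear observations instead
obs = np.array([[0.10, 0.20, 0.30],
                [0.11, 0.21, 0.31],
                [0.90, 0.80, 0.70]])
print(medoid(obs))
```

Unlike a per-band median, the medoid returns an actual observed spectrum, so band values stay physically consistent with each other.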
Second, these forested landscapes have been studied extensively, and we know of no study suggesting that meaningful spatial autocorrelation of any feature persists beyond lag distances even shorter than our chosen 500 m threshold. In fact, we chose 500 m to be conservative relative to the literature and cited three studies in the same region that use this value. Given that this work has already been done, we did not feel it was justified to recreate the same analysis.
Third, even if spatial autocorrelation in the predictor variables were somehow created by the additional spatial manipulations we performed, the impact would be to inflate apparent predictive power. Any additional steps taken to reduce these theoretical impacts would only strengthen the argument we are making in the paper: the spatial domain only nominally improves predictive strength, especially relative to the temporal metrics.
Comment 6: Figures 7-10 clearly show the spatial pattern in the residuals; however, it is unclear how to determine the range (distance at which spatial correlation is assumed nonexistent). The scale bar could have provided a better reference for spatial autocorrelation. The significant overestimation of AGB density also indicated that the spatial structure in the data set is not fully addressed (Figure 9).
Response 6:
It is a very good point that we provided no way to assess the spatial autocorrelation distance in these figures. We have added a distance scale to help the reader with this assessment. Thank you.
We are unclear about the second point in relation to Figure 9, although we think we agree with the premise and disagree about the conclusion. Indeed, there is spatial structure in the residuals here, but this simply underscores our conclusion about the relative strength of these spatial feature engineering metrics. The reason there is spatial structure in overestimating biomass is not a spatial problem – it is the classic issue where low values are overestimated and high values underestimated. We have expanded the discussion to discuss this in more detail.
The key point, again, is in the comparison among the methods against the common baseline. The utilization of the temporal domain improved the overall function of the model, and minimized the over-estimation problem noted above. Spatial metrics did nothing to improve that.
Comment 7: The authors’ choice of reduced major axis (RMA) regression indicates that both X and Y variables have errors, unlike ordinary least squares (OLS) regression, in which independent variables (Xs) are assumed to be measured without errors. This issue seriously undermines the treatment of biomass estimates from LiDAR as dependent variables. This regression probably would have provided summary information (representing all temporal, spatial, and baseline) on how much biomass should be expected given the observed biomass.
However, upon closer inspection, the resultant bias plot would show significant deviation from a random pattern. As far as I know, RMA is symmetrical, unlike OLS, and I am not sure whether swapping the observed AGB (X) and predicted AGB (Y) variables produces the same result. Moreover, what patterns of variance were observed that caused the authors to use RMA? RMA is designed to address a particular pattern of error variance.
Response 7: Our decision to use RMA regression lines was, as the reviewer indicated, because RMA considers error in both the independent and dependent variables. This is precisely our case, where both the response (derived from lidar) and the Landsat-derived predictions are modeled quantities. However, we agree there are merits to using OLS curves given that the Random Forest models do not assume error in the response. We have replaced the RMA curves in the figure.
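The symmetry the reviewer raises is real: the RMA slope is sign(r)·sd(y)/sd(x), so fitting Y on X and X on Y yields reciprocal slopes, whereas the OLS slope r·sd(y)/sd(x) does not. A small illustrative sketch (function names are ours):

```python
import numpy as np

def ols_slope(x, y):
    """Ordinary least squares slope of y on x: r * sd(y) / sd(x)."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

def rma_slope(x, y):
    """Reduced major axis slope of y on x: sign(r) * sd(y) / sd(x)."""
    r = np.corrcoef(x, y)[0, 1]
    return np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)
```

Because rma_slope(x, y) * rma_slope(y, x) equals 1 by construction, swapping the axes leaves the RMA line unchanged, while the product of the two OLS slopes equals r², which is below 1 for any noisy data.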
Comment 8: The key finding is that temporal features significantly accounted for additional variances compared to baseline variables, probably indicating land use legacy. However, I am not convinced how such information would improve a model; I would probably stratify areas based on the severity of disturbance or the age of forests. I am not against including such variables, but having to rely on historical information seriously hampers the practical application of such models.
Response 8:
The use of temporal features has been shown to improve biomass estimation in these forest types before [1-5]. Our goal here was the comparison with spatial features using a wide-ranging test dataset.
The reason that temporal features improve prediction with optical imagery is related to the land use legacy, yes, but more specifically to the successional state of the forest in relation to its optical properties. Optical data cannot distinguish between closed-canopy forests that are young and short and those that are older, taller, and carry greater biomass. However, the speed of the spectral response over the years does separate these types and adds predictive power. The CCDC spectral features additionally add intra-year temporal signals, which have also been shown to improve prediction in forest stands with high biomass and intra-canopy shadowing [6].
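As a concrete illustration of the intra-year signals CCDC exploits, the core of the approach is an ordinary least-squares fit of a harmonic model to each pixel's reflectance time series; a first-order sketch follows (a simplified stand-in for the published CCDC implementation, and the function name is ours):

```python
import numpy as np

def harmonic_features(t, reflectance):
    """Fit a first-order harmonic model to one pixel's time series:

        rho(t) = c0 + c1*t + c2*cos(2*pi*t) + c3*sin(2*pi*t)

    t: fractional years; reflectance: same-length 1-D array.
    Returns the four fitted coefficients (intercept, trend, cos, sin).
    """
    X = np.column_stack([
        np.ones_like(t),            # intercept
        t,                          # inter-annual trend
        np.cos(2 * np.pi * t),      # annual cosine term
        np.sin(2 * np.pi * t),      # annual sine term
    ])
    coef, *_ = np.linalg.lstsq(X, reflectance, rcond=None)
    return coef
```

The amplitude and phase implied by the cos/sin coefficients summarize the within-year phenological signal, which is what distinguishes stand conditions that look identical in any single composite.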
Stratification would indeed achieve a similar outcome, but at the expense of 1) requiring arbitrary stratum boundaries that likely should change by forest type and 2) disjunctions in modeled outputs. By simply including these as a continuous variable, these problems are avoided.
We disagree that relying on the temporal metrics hampers the practical application of the models. These temporal metrics are no more difficult to obtain than the spatial metrics: LandTrendr is implemented in Google Earth Engine [7], as is CCDC, for which an entire toolbox has been published to assist with the extraction of harmonic and disturbance features [8].
Comment 9: However, the ad hoc feature generation process and lack of proper treatment of spatial structures seriously undermine the results of this manuscript. I recommend that the editor-in-chief reject the manuscript. However, a revised version can be submitted as a new manuscript after addressing such issues.
Response 9: As we noted above, our decisions were not ad hoc, and the use of the metrics we proposed should adequately sample the methodological space necessary to answer our underlying question: does the inclusion of standard spatial metrics improve biomass saturation issues when modeling biomass from optical data? Of course, any other investigator could choose a different set of metrics, but the reviewer does not provide specific evidence for what we missed that would radically change our conclusions. The reviewer might be objecting to the novel features that were proposed, neighborhood similarity and spatial vectorization. These techniques emerged after a discussion of alternatives to better-known techniques (i.e., spatial buffers and GLCMs) as well as established techniques (edge detectors and morphological features) that have not been evaluated.
We also believe that we have addressed the issue of spatial structures, as noted in our comments above. Several of the issues brought up by the reviewer were, we believe, not relevant to our underlying goals. And even if issues of spatial autocorrelation were theoretically present, addressing them would only strengthen our underlying argument.
Again, we believe we have created a solid, consistent test of the relative power of spatial and temporal metrics using a large dataset of high quality data. Given the proliferation of black-box deep-learning methods that purport to leverage spatial context to magically improve biomass estimates, we believe our straightforward and transparent tests here show that these are unlikely to provide radical improvements in biomass estimation.
References
[1] Pflugmacher, D., Cohen, W. B., & Kennedy, R. E. (2012). Using Landsat-derived disturbance history (1972–2010) to predict current forest structure. Remote Sensing of Environment, 122, 146-165.
[2] Pflugmacher, D., Cohen, W. B., Kennedy, R. E., & Yang, Z. (2014). Using Landsat-derived disturbance and recovery history and lidar to map forest biomass dynamics. Remote Sensing of Environment, 151, 124-137.
[3] Kennedy, R. E., Ohmann, J., Gregory, M., Roberts, H., Yang, Z., Bell, D. M., ... & Seidl, R. (2018). An empirical, integrated forest biomass monitoring system. Environmental Research Letters, 13(2), 025004.
[4] Matasci, G., Hermosilla, T., Wulder, M. A., White, J. C., Coops, N. C., Hobart, G. W., ... & Bater, C. W. (2018). Three decades of forest structural dynamics over Canada's forested ecosystems using Landsat time-series and lidar plots. Remote Sensing of Environment, 216, 697-714.
[5] Nguyen, T. H., Jones, S., Soto-Berelov, M., Haywood, A., & Hislop, S. (2018). A comparison of imputation approaches for estimating forest biomass using Landsat time-series and inventory data. Remote Sensing, 10(11), 1825.
[6] Lefsky, M. A., Turner, D. P., Guzy, M., & Cohen, W. B. (2005). Combining lidar estimates of aboveground biomass and Landsat estimates of stand age for spatially extensive validation of modeled forest productivity. Remote Sensing of Environment, 95(4), 549-558.
[7] Kennedy, R. E., Yang, Z., Gorelick, N., Braaten, J., Cavalcante, L., Cohen, W. B., & Healey, S. (2018). Implementation of the LandTrendr algorithm on Google Earth Engine. Remote Sensing, 10(5), 691.
[8] Arévalo, P., Bullock, E. L., Woodcock, C. E., & Olofsson, P. (2020). A suite of tools for continuous land change monitoring in Google Earth Engine. Frontiers in Climate, 2, 576740.
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have generally revised this manuscript, and it can be accepted.