Remote Sensing Scale Issues Related to the Accuracy Assessment of Land Use/Land Cover Maps Produced Using Multi-Resolution Data: Comments on "The Improvement of Land Cover Classification by Thermal Remote Sensing". Remote Sens. 2015, 7(7), 8368–8390

Much remote sensing (RS) research focuses on fusing, i.e., combining, multi-resolution/multi-sensor imagery for land use/land cover (LULC) classification. In relation to this topic, Sun and Schulz [1] recently found that a combination of visible-to-near infrared (VNIR; 30 m spatial resolution) and thermal infrared (TIR; 100–120 m spatial resolution) Landsat data led to more accurate LULC classification. They also found that using multi-temporal TIR data alone for classification resulted in comparable (and in some cases higher) classification accuracies to the use of multi-temporal VNIR data, which contrasts with the findings of other recent research [2]. This discrepancy, and the generally very high LULC accuracies achieved by Sun and Schulz (up to 99.2% overall accuracy for a combined VNIR/TIR classification result), can likely be explained by their use of an accuracy assessment procedure which does not take into account the multi-resolution nature of the data. Sun and Schulz used 10-fold cross-validation for accuracy assessment, which is not necessarily inappropriate for RS accuracy assessment in general. However, here it is shown that the typical pixel-based cross-validation approach results in non-independent training and validation data sets when the lower spatial resolution TIR images are used for classification, which causes classification accuracy to be overestimated.

Fusion of multi-resolution and/or multi-sensor remote sensing (RS) imagery has been shown to result in higher classification accuracy in many past studies [2–15], so classification-oriented image fusion is an important research topic. Satellite data from the Landsat series are commonly used for land use/land cover (LULC) classification in RS, and Landsat 4/5/7/8 have image bands that vary in spatial resolution, as detailed in [16]. In a recent study using Landsat 4/5/8 data, Sun and Schulz [1] combined the lower spatial resolution thermal infrared (TIR) image bands (120 m for Landsat 4/5, 100 m for Landsat 8) with the higher spatial resolution visible-to-near infrared (VNIR) image bands (30 m) for LULC classification, and found that the combination led to higher overall classification accuracy. They also found that using multi-temporal TIR data alone for classification resulted in LULC classification accuracies comparable to (and in some cases higher than) those obtained using multi-temporal VNIR data, which contrasts with the results of another recent study [2] and is particularly surprising given the lower spatial resolutions of the TIR bands. In general, I agree with the authors' sentiment that more research into the combined use of VNIR-TIR data for classification is needed. However, given the details provided in the manuscript, the authors' very encouraging results seem to be due to the use of an improper accuracy assessment procedure rather than the utility of VNIR-TIR data fusion for LULC classification. The aim of this comment is not to dismiss the work of Sun and Schulz, which was actually quite interesting, but rather to highlight the importance of considering scale issues for accuracy assessment, particularly when multi-resolution imagery is used for classification.
Sun and Schulz used a 10-fold cross-validation procedure for accuracy assessment [17], which means that 10% of the training pixels are withheld for accuracy assessment in each fold. While the use of cross-validation is not uncommon in RS, a problem with it in their study comes from the fact that the TIR image bands are resampled from their original resolutions of 100 m (Landsat 8) or 120 m (Landsat 4/5) to 30 m to match the VNIR bands [16], meaning that each original TIR pixel is represented by roughly nine (3 × 3) resampled pixels in the case of Landsat 8, or 16 (4 × 4) resampled pixels in the case of Landsat 4/5. The implication of this resampling is that, in the cross-validation process, a single original TIR pixel is very likely to be represented in both the training and validation sets, as shown in the example in Figure 1. The training and validation sets should be spatially independent to ensure reliable estimation of LULC classification accuracy [18], as the inclusion of the same data in both sets will lead to overestimation of classification accuracy. This non-independence of the training and validation data sets would explain why the TIR bands performed as well as the VNIR bands for LULC classification, despite their lower spatial resolutions, and it would also explain why the classification accuracy achieved using the combined VNIR-TIR data was so high (>99% overall accuracy for the classifications that used multi-temporal imagery).
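The overlap described above can be illustrated with a small simulation. The following is a minimal sketch, not the procedure used by Sun and Schulz: it assumes that every resampled pixel of each original TIR pixel is a labelled sample, that the resampling factor is exactly 4 (120 m to 30 m), and that pixels are assigned to folds at random. The function name `leakage_fraction` and its parameters are hypothetical and introduced only for illustration.

```python
import random


def leakage_fraction(n_coarse=500, block=4, n_folds=10, seed=0):
    """Fraction of validation pixels whose parent TIR pixel also occurs in training.

    Assumption: each original (coarse) TIR pixel is fully covered by labelled
    samples and is represented by block x block resampled 30 m pixels.
    """
    rng = random.Random(seed)
    # Record, for every resampled pixel, the index of its parent coarse pixel.
    parents = [p for p in range(n_coarse) for _ in range(block * block)]

    # Pixel-based cross-validation: shuffle the pixels and deal them into folds.
    order = list(range(len(parents)))
    rng.shuffle(order)
    fold_of = [0] * len(parents)
    for rank, idx in enumerate(order):
        fold_of[idx] = rank % n_folds

    leaked = 0
    total = 0
    for fold in range(n_folds):
        # Parent coarse pixels that contribute at least one training pixel.
        train_parents = {parents[i] for i in range(len(parents)) if fold_of[i] != fold}
        val = [i for i in range(len(parents)) if fold_of[i] == fold]
        # A validation pixel "leaks" if its parent pixel is also used for training.
        leaked += sum(1 for i in val if parents[i] in train_parents)
        total += len(val)
    return leaked / total


if __name__ == "__main__":
    print(f"Landsat 4/5 case (4 x 4): {leakage_fraction(block=4):.4f}")
    print(f"Landsat 8 case (3 x 3):   {leakage_fraction(block=3):.4f}")
```

Under these assumptions, a validation pixel escapes the overlap only if every resampled pixel of its parent TIR pixel falls in the same fold, which is extremely unlikely, so the returned fraction should be very close to 1 in both cases.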
A correct way to perform cross-validation that takes into account the multi-resolution nature of the data would be to do it at the region-of-interest (ROI), i.e., polygon, level rather than at the individual pixel level. This would involve holding out all of the pixels within 10% of the ROIs in each cross-validation fold, so individual pixels would still be the base units for accuracy assessment. This procedure would ensure training/validation data independence as long as the distance between ROIs is significantly larger than the pixel size of the lowest spatial resolution image. The only caveat is that there need to be at least as many ROIs for each LULC class as there are cross-validation folds (e.g., at least 10 ROIs for 10-fold cross-validation), which may in some cases require gathering additional ground truth data. An ROI-based cross-validation approach would also reduce spatial autocorrelation between training and validation samples caused by their close proximity to one another, which compromises the assumption of training/validation data set independence even for the higher resolution images [18]. It should be noted that Sun and Schulz simply used the resampled lower resolution TIR pixels for classification, meaning that the TIR images were not "sharpened" using the higher resolution imagery (as in some other studies on classification-oriented image fusion [5,6,9,11–15]). However, the scale issues pointed out here also apply when the lower resolution imagery is "sharpened" using the higher resolution imagery prior to classification, as the spatial resolution of the lower resolution image is only artificially increased, and its pixel values are still derived in part from the original lower resolution image.
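The ROI-level hold-out described above can be sketched as follows. This is an illustrative outline under stated assumptions rather than a definitive implementation: the helper names `roi_folds` and `split_pixels`, the dictionary `roi_class` (ROI identifier to LULC class), and the list `pixel_roi` (ROI identifier of each training pixel) are all hypothetical. ROIs are assigned to folds class by class, so every fold holds out ROIs of every class, and the check on the number of ROIs per class mirrors the caveat noted above. The sketch does not verify that ROIs are far enough apart; that condition still has to hold for the folds to be independent.

```python
import random
from collections import defaultdict


def roi_folds(roi_class, n_folds=10, seed=0):
    """Assign each ROI to a fold, class by class.

    roi_class: dict mapping a (hypothetical) ROI identifier to its LULC class.
    Raises ValueError if any class has fewer ROIs than folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for roi_id, cls in roi_class.items():
        by_class[cls].append(roi_id)

    fold_of_roi = {}
    for cls, rois in by_class.items():
        if len(rois) < n_folds:
            raise ValueError(
                f"class {cls!r} has {len(rois)} ROIs, fewer than {n_folds} folds"
            )
        rois = sorted(rois)  # fixed starting order so the seed gives repeatable folds
        rng.shuffle(rois)
        for rank, roi_id in enumerate(rois):
            fold_of_roi[roi_id] = rank % n_folds
    return fold_of_roi


def split_pixels(pixel_roi, fold_of_roi, fold):
    """Return (train_idx, val_idx); all pixels of a held-out ROI go to validation."""
    train = [i for i, r in enumerate(pixel_roi) if fold_of_roi[r] != fold]
    val = [i for i, r in enumerate(pixel_roi) if fold_of_roi[r] == fold]
    return train, val


if __name__ == "__main__":
    # Toy data: 2 classes x 10 ROIs, 16 pixels per ROI (all values are made up).
    roi_class = {r: ("water" if r < 10 else "forest") for r in range(20)}
    pixel_roi = [r for r in range(20) for _ in range(16)]
    folds = roi_folds(roi_class, n_folds=10)
    train, val = split_pixels(pixel_roi, folds, fold=0)
    print(len(train), len(val))
```

Individual pixels remain the units on which accuracy is computed; only the fold assignment is moved from the pixel level to the ROI level, so no ROI contributes pixels to both the training and validation sets of the same fold.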
I hope that Sun and Schulz can respond to this comment by providing additional information on their cross-validation procedure (in the case that their accuracy assessment did not suffer from the problems pointed out here), or by submitting a correction to their manuscript using a proper accuracy assessment method, so that the RS community can have a better understanding of the utility of VNIR-TIR data fusion for LULC classification.

Figure 1. (a) One pixel from an original spatial resolution Landsat 4/5 thermal infrared (TIR) band; (b) original TIR pixel resampled to 4 × 4 pixels to match the pixels of the higher spatial resolution visible-to-near infrared (VNIR) bands; (c) resampled TIR training pixels (green cells) and validation pixels (red cells) in one fold of a cross-validation, assuming approximately 10% of pixels are held out for validation.