Open Access Article

Evaluation of Sampling and Cross-Validation Tuning Strategies for Regional-Scale Machine Learning Classification

Department of Geology and Geography, West Virginia University, Morgantown, WV 26506, USA
* Author to whom correspondence should be addressed.
Remote Sens. 2019, 11(2), 185; https://doi.org/10.3390/rs11020185
Received: 8 December 2018 / Revised: 13 January 2019 / Accepted: 16 January 2019 / Published: 18 January 2019
High spatial resolution (1–5 m) remotely sensed datasets are increasingly being used to map land cover over large geographic areas using supervised machine learning algorithms. Although many studies have compared machine learning classification methods, two related choices are rarely investigated, particularly on large, high spatial resolution datasets: the sample selection method used to acquire training and validation data, and the cross-validation technique used to tune classifier parameters. This work therefore examines four sample selection methods (simple random, proportional stratified random, disproportional stratified random, and deliberative sampling) and three cross-validation tuning approaches (k-fold, leave-one-out, and Monte Carlo). In addition, it investigates how accuracy is affected when sample selection is localized to a small geographic subset of the entire area, an approach that is sometimes used to reduce the cost of training data collection. These methods are investigated in the context of support vector machine (SVM) classification and geographic object-based image analysis (GEOBIA), using high spatial resolution National Agriculture Imagery Program (NAIP) orthoimagery and LIDAR-derived rasters covering a 2,609 km² regional-scale area in northeastern West Virginia, USA. Stratified statistical sampling methods generated the highest classification accuracy. A small number of training samples collected from only a subset of the study area provided a similar level of overall accuracy to a sample of equivalent size collected in a dispersed manner across the entire regional-scale dataset. There were minimal differences in accuracy among the cross-validation tuning methods. The processing time for Monte Carlo and leave-one-out cross-validation was high, especially with large training sets. For this reason, k-fold cross-validation appears to be a good choice.
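One way to see why processing time differs among the three tuning approaches is to count how many models must be fitted for each candidate parameter setting. The following is a minimal sketch using only the Python standard library, not the authors' implementation; the function names, the sample count of 1000, the 100 Monte Carlo repeats, and the 25% hold-out fraction are all assumptions chosen for illustration.

```python
import random

def kfold_splits(n, k):
    """Yield (train, test) index lists for k roughly equal folds.

    Folds are taken by striding and are not shuffled; a real workflow
    would normally shuffle the samples first.
    """
    idx = list(range(n))
    for fold in range(k):
        test = idx[fold::k]  # every k-th sample, offset by the fold number
        held = set(test)
        yield [i for i in idx if i not in held], test

def leave_one_out_splits(n):
    """Leave-one-out is k-fold with k equal to the number of samples."""
    return kfold_splits(n, n)

def monte_carlo_splits(n, repeats, test_fraction, rng):
    """Yield repeated random train/test partitions (test sets may overlap)."""
    n_test = max(1, int(n * test_fraction))
    for _ in range(repeats):
        test = rng.sample(range(n), n_test)
        held = set(test)
        yield [i for i in range(n) if i not in held], test

n_samples = 1000          # hypothetical training set size
rng = random.Random(0)    # fixed seed so the sketch is repeatable

# Number of model fits needed to score ONE candidate parameter setting.
fits = {
    "k-fold (k=10)": sum(1 for _ in kfold_splits(n_samples, 10)),
    "leave-one-out": sum(1 for _ in leave_one_out_splits(n_samples)),
    "Monte Carlo (100 repeats)": sum(
        1 for _ in monte_carlo_splits(n_samples, 100, 0.25, rng)
    ),
}
for name, count in fits.items():
    print(f"{name}: {count} model fits per parameter setting")
```

With k-fold the number of fits stays fixed at k, whereas leave-one-out grows with the size of the training set and Monte Carlo grows with however many repeats are requested, which is consistent with the higher processing times reported above for large training sets.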
Classifiers trained with samples collected deliberately (i.e., not randomly) were less accurate than classifiers trained with statistically selected samples. This may be due to the high positive spatial autocorrelation in the deliberative training set. Thus, if possible, training samples should be selected randomly, and deliberative samples should be avoided.
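The three statistical sampling designs named in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's procedure: the class names and counts are hypothetical, and disproportional stratified sampling is assumed here to mean an equal allocation per class, which the abstract does not specify. Deliberative sampling depends on analyst judgment and is therefore not simulated.

```python
import random
from collections import Counter, defaultdict

def simple_random(labels, n, rng):
    """Draw n sample indices uniformly, ignoring class membership."""
    return rng.sample(range(len(labels)), n)

def stratified_random(labels, n, rng, proportional=True):
    """Draw about n indices class by class.

    proportional=True : each class gets a share matching its frequency.
    proportional=False: each class gets an equal share (an assumed form
                        of disproportional allocation).
    """
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    picked = []
    for members in by_class.values():
        if proportional:
            k = round(n * len(members) / len(labels))
        else:
            k = n // len(by_class)
        picked.extend(rng.sample(members, min(k, len(members))))
    return picked

# Hypothetical, imbalanced population of image objects.
labels = ["forest"] * 800 + ["grass"] * 150 + ["water"] * 50
rng = random.Random(0)  # fixed seed so the sketch is repeatable

srs = simple_random(labels, 100, rng)
prop = stratified_random(labels, 100, rng, proportional=True)
equal = stratified_random(labels, 100, rng, proportional=False)

for name, sample in [("simple random", srs),
                     ("proportional stratified", prop),
                     ("equal-allocation stratified", equal)]:
    print(name, dict(Counter(labels[i] for i in sample)))
```

Proportional allocation reproduces the class frequencies exactly, equal allocation guarantees the rare classes a fixed number of samples, and simple random sampling leaves the per-class counts to chance.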
Keywords: training sample selection; cross-validation; high resolution imagery; NAIP; Lidar; regional-scale

MDPI and ACS Style

Ramezan, C.A.; Warner, T.A.; Maxwell, A.E. Evaluation of Sampling and Cross-Validation Tuning Strategies for Regional-Scale Machine Learning Classification. Remote Sens. 2019, 11, 185.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.
