Open Access Article

Improved Inference and Prediction for Imbalanced Binary Big Data Using Case-Control Sampling: A Case Study on Deforestation in the Amazon Region

1
School of Forest Resources and Conservation, University of Florida, Gainesville, FL 32611, USA
2
UF Health Shands, University of Florida, Gainesville, FL 32611, USA
3
Sociology and Criminology & Law, University of Florida, Gainesville, FL 32611, USA
*
Author to whom correspondence should be addressed.
Remote Sens. 2020, 12(8), 1268; https://doi.org/10.3390/rs12081268
Received: 10 March 2020 / Revised: 5 April 2020 / Accepted: 14 April 2020 / Published: 17 April 2020
(This article belongs to the Section Forest Remote Sensing)
It is computationally challenging to fit models to big data. For example, satellite imagery often contains billions to trillions of pixels, so a pixel-level analysis that uses all the data to identify drivers of land-use change and to create predictions is not feasible. A common strategy to reduce sample size is to draw a random sample, but this approach is not ideal when the outcome of interest is rare in the landscape because the sample then contains very few pixels with this outcome. Here we show that a case-control (CC) sampling approach, in which all (or a large fraction of) pixels with the outcome of interest and a subset of the pixels without this outcome are selected, can yield much better inference and prediction than random sampling (RS) if the estimated parameters and probabilities are adjusted with the equations that we provide. More specifically, we show that a CC approach can yield unbiased inference with much less uncertainty when CC data are analyzed with logistic regression models and their semiparametric variants (e.g., generalized additive models). We also show that a random forest model fitted to CC data can generate much better predictions than the same model fitted to RS data. We illustrate this improved performance of the CC approach, when used together with the proposed bias-correction adjustments, with extensive simulations and a case study on deforestation in the Amazon region.
Keywords: satellite imagery; deforestation; inference; prediction; Amazon; pixel sampling; case-control
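
The abstract refers to adjustment equations without reproducing them. As a minimal sketch of the general idea, the Python below draws a CC sample and applies the standard odds-based correction for outcome-dependent sampling in logistic models. This is an assumption about the form of the adjustment, not necessarily the equations given in the full text. The function names, the sampling fractions `frac_cases` and `frac_controls`, and the simulated 1% outcome rate are hypothetical and chosen only for illustration.

```python
import math
import random


def case_control_sample(labels, frac_cases, frac_controls, rng):
    """Return indices kept when cases (1) and controls (0) are sampled at different rates."""
    kept = []
    for i, y in enumerate(labels):
        rate = frac_cases if y == 1 else frac_controls
        if rng.random() < rate:
            kept.append(i)
    return kept


def correct_intercept(beta0_cc, frac_cases, frac_controls):
    """Shift a logistic intercept fitted to CC data back to the population scale.

    Assumed (standard) correction: CC sampling multiplies the odds by
    frac_cases / frac_controls, so the log of that ratio is subtracted.
    """
    return beta0_cc - math.log(frac_cases / frac_controls)


def correct_probability(p_cc, frac_cases, frac_controls):
    """Convert a probability predicted on the CC scale to the population scale."""
    numerator = p_cc * frac_controls
    return numerator / (numerator + (1.0 - p_cc) * frac_cases)


# Hypothetical landscape: a rare outcome (about 1% of pixels).
rng = random.Random(0)
labels = [1 if rng.random() < 0.01 else 0 for _ in range(100_000)]

# Keep every case and roughly 5% of the controls.
kept = case_control_sample(labels, frac_cases=1.0, frac_controls=0.05, rng=rng)

# A prediction of 0.5 on the CC scale maps to a much smaller population probability.
p_population = correct_probability(0.5, frac_cases=1.0, frac_controls=0.05)
```

Under this assumed correction, only the intercept of a logistic model changes; slope estimates from the CC sample are used as they are. Applying `correct_probability` to a fitted probability gives the same result as passing the corrected intercept through the logistic function.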

MDPI and ACS Style

Valle, D.; Hyde, J.; Marsik, M.; Perz, S. Improved Inference and Prediction for Imbalanced Binary Big Data Using Case-Control Sampling: A Case Study on Deforestation in the Amazon Region. Remote Sens. 2020, 12, 1268. https://doi.org/10.3390/rs12081268