Evaluating the Effect of Training Data Size and Composition on the Accuracy of Smallholder Irrigated Agriculture Mapping in Mozambique Using Remote Sensing and Machine Learning Algorithms
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Area and RS Data
2.2. Training and Validation Samples per Scenario
2.3. Algorithm and Cross-Validation Parameter Tuning
2.4. Classifications and Replications
2.5. Accuracy Assessment
3. Results
3.1. The Overall Accuracy of All Scenarios
3.2. Class Specific Accuracies per Scenario
3.2.1. Scenario 1: Same Ratio, Smaller Dataset
3.2.2. Scenario 2: Equal Numbers per Class
3.2.3. Scenario 3: Over- and Undersampling
3.2.4. Scenario 4: Mislabeling Irrigated, Rainfed, and Light Vegetation
3.3. Visual Inspection
3.3.1. Scenario 1: Same Ratio, Smaller Dataset
3.3.2. Scenario 2: Equal Numbers per Class
3.3.3. Scenario 3: Over- and Undersampling
3.3.4. Scenario 4: Mislabeling Irrigated, Rainfed, and Light Vegetation
4. Discussion
5. Conclusions
- Ensure that the training data represent the area being classified and include enough samples to achieve high accuracy; a random sampling design is the best way to do this. Although error-free data are desirable, the models tested (random forest, RF, and support vector machine, SVM) can tolerate some noise.
- Evaluate multiple algorithms when classifying data, as their relative performance depends on the specific characteristics of the data being classified.
- Interpret classification results carefully, as accuracy metrics alone may not correctly represent classification performance. Visual inspection and further interpretation are needed to fully understand the results and the potential limitations of the classification.
- Perform multiple simulations with different subsets of the data to check whether the training data yield robust results (i.e., minimal variation in accuracies between sets), which can indicate that sufficient data have been collected.
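The last recommendation can be illustrated with a minimal sketch. It assumes synthetic one-band samples and a toy nearest-centroid classifier standing in for RF or SVM. All function names, class means, and set sizes below are hypothetical and are not taken from the study. The idea is only to show how the spread of accuracies across replicate subsets can be summarized.

```python
import random
import statistics


def nearest_centroid_score(train, test):
    """Toy stand-in classifier: label each test value by the closest class mean."""
    totals = {}
    for value, label in train:
        s, n = totals.get(label, (0.0, 0))
        totals[label] = (s + value, n + 1)
    centroids = {label: s / n for label, (s, n) in totals.items()}
    hits = sum(
        1
        for value, label in test
        if min(centroids, key=lambda c: abs(value - centroids[c])) == label
    )
    return hits / len(test)


def replicate_accuracy(train_pool, test, fraction, n_reps, score_fn, seed=0):
    """Mean and standard deviation of accuracy over random training subsets."""
    rng = random.Random(seed)
    k = max(2, round(len(train_pool) * fraction))
    scores = [score_fn(rng.sample(train_pool, k), test) for _ in range(n_reps)]
    return statistics.mean(scores), statistics.stdev(scores)


def make_samples(n_per_class, rng):
    """Synthetic one-band samples for two hypothetical classes (assumed means)."""
    irrigated = [(rng.gauss(0.7, 0.1), "irrigated") for _ in range(n_per_class)]
    rainfed = [(rng.gauss(0.4, 0.1), "rainfed") for _ in range(n_per_class)]
    return irrigated + rainfed


data_rng = random.Random(42)
pool = make_samples(500, data_rng)
holdout = make_samples(200, data_rng)

results = {}
for fraction in (0.01, 0.1, 0.5):
    results[fraction] = replicate_accuracy(
        pool, holdout, fraction, 25, nearest_centroid_score
    )
    mean_acc, sd_acc = results[fraction]
    print(f"{fraction:>4.0%} of pool: mean accuracy {mean_acc:.3f}, sd {sd_acc:.3f}")
```

Passing the scoring function as an argument makes it straightforward to run the same replicates with a second algorithm and compare both the mean and the spread, in line with the recommendation to evaluate more than one classifier.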
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Foody, G.; Pal, M.; Rocchini, D.; Garzon-Lopez, C.; Bastin, L. The Sensitivity of Mapping Methods to Reference Data Quality: Training Supervised Image Classifications with Imperfect Reference Data. Int. J. Geo-Inf. 2016, 5, 199.
- Foody, G.M. Sample Size Determination for Image Classification Accuracy Assessment and Comparison. Int. J. Remote Sens. 2009, 30, 5273–5291.
- Foody, G.M.; Mathur, A.; Sanchez-Hernandez, C.; Boyd, D.S. Training Set Size Requirements for the Classification of a Specific Class. Remote Sens. Environ. 2006, 104, 1–14.
- Olofsson, P.; Foody, G.M.; Herold, M.; Stehman, S.V.; Woodcock, C.E.; Wulder, M.A. Good Practices for Estimating Area and Assessing Accuracy of Land Change. Remote Sens. Environ. 2014, 148, 42–57.
- Stehman, S.V.; Foody, G.M. Key Issues in Rigorous Accuracy Assessment of Land Cover Products. Remote Sens. Environ. 2019, 231, 111199.
- Collins, L.; McCarthy, G.; Mellor, A.; Newell, G.; Smith, L. Training Data Requirements for Fire Severity Mapping Using Landsat Imagery and Random Forest. Remote Sens. Environ. 2020, 245, 111839.
- Mellor, A.; Boukir, S.; Haywood, A.; Jones, S. Exploring Issues of Training Data Imbalance and Mislabelling on Random Forest Performance for Large Area Land Cover Classification Using the Ensemble Margin. ISPRS J. Photogramm. Remote Sens. 2015, 105, 155–168.
- Millard, K.; Richardson, M. On the Importance of Training Data Sample Selection in Random Forest Image Classification: A Case Study in Peatland Ecosystem Mapping. Remote Sens. 2015, 7, 8489–8515.
- Ebrahimy, H.; Mirbagheri, B.; Matkan, A.A.; Azadbakht, M. Effectiveness of the Integration of Data Balancing Techniques and Tree-Based Ensemble Machine Learning Algorithms for Spatially-Explicit Land Cover Accuracy Prediction. Remote Sens. Appl. Soc. Environ. 2022, 27, 100785.
- Douzas, G.; Bacao, F.; Fonseca, J.; Khudinyan, M. Imbalanced Learning in Land Cover Classification: Improving Minority Classes’ Prediction Accuracy Using the Geometric SMOTE Algorithm. Remote Sens. 2019, 11, 3040.
- Ramezan, C.A.; Warner, T.A.; Maxwell, A.E.; Price, B.S. Effects of Training Set Size on Supervised Machine-Learning Land-Cover Classification of Large-Area High-Resolution Remotely Sensed Data. Remote Sens. 2021, 13, 368.
- Beekman, W.; Veldwisch, G.J.; Bolding, A. Identifying the Potential for Irrigation Development in Mozambique: Capitalizing on the Drivers behind Farmer-Led Irrigation Expansion. Phys. Chem. Earth Parts A/B/C 2014, 76–78, 54–63.
- Veldwisch, G.J.; Venot, J.-P.; Woodhouse, P.; Komakech, H.C.; Brockington, D. Re-Introducing Politics in African Farmer-Led Irrigation Development: Introduction to a Special Issue. Water Altern. 2019, 12, 12.
- Venot, J.-P.; Bowers, S.; Brockington, D.; Komakech, H.; Ryan, C.; Veldwisch, G.J.; Woodhouse, P. Below the Radar: Data, Narratives and the Politics of Irrigation in Sub-Saharan Africa. Water Altern. 2021, 14, 27.
- Woodhouse, P.; Veldwisch, G.J.; Venot, J.-P.; Brockington, D.; Komakech, H.; Manjichi, Â. African Farmer-Led Irrigation Development: Re-Framing Agricultural Policy and Investment? J. Peasant Stud. 2017, 44, 213–233.
- de Bont, C. Modernisation and African Farmer-Led Irrigation Development: Ideology, Policies and Practices. Water Altern. 2019, 12, 23.
- Bégué, A.; Arvor, D.; Bellon, B.; Betbeder, J.; de Abelleyra, D.; PD Ferraz, R.; Lebourgeois, V.; Lelong, C.; Simões, M.; Verón, S.R. Remote Sensing and Cropping Practices: A Review. Remote Sens. 2018, 10, 99.
- Izzi, G.; Denison, J.; Veldwisch, G.J. The Farmer-Led Irrigation Development Guide: A What, Why and How-to for Intervention Design; World Bank: Washington, DC, USA, 2021.
- Elmes, A.; Alemohammad, H.; Avery, R.; Caylor, K.; Eastman, J.; Fishgold, L.; Friedl, M.; Jain, M.; Kohli, D.; Laso Bayas, J.; et al. Accounting for Training Data Error in Machine Learning Applied to Earth Observations. Remote Sens. 2020, 12, 1034.
- DEA. DEA GeoMAD. Available online: https://docs.digitalearthafrica.org/en/latest/data_specs/GeoMAD_specs.html#Triple-Median-Absolute-Deviations-(MADs) (accessed on 6 September 2022).
- Roberts, D.; Dunn, B.; Mueller, N. Open Data Cube Products Using High-Dimensional Statistics of Time Series. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; IEEE: Valencia, Spain, 2018; pp. 8647–8650.
- Wellington, M.J.; Renzullo, L.J. High-Dimensional Satellite Image Compositing and Statistics for Enhanced Irrigated Crop Mapping. Remote Sens. 2021, 13, 1300.
- Gitelson, A.A.; Viña, A.; Ciganda, V.; Rundquist, D.C.; Arkebauer, T.J. Remote Estimation of Canopy Chlorophyll Content in Crops. Geophys. Res. Lett. 2005, 32, L08403.
- Segarra, J.; Buchaillot, M.L.; Araus, J.L.; Kefauver, S.C. Remote Sensing for Precision Agriculture: Sentinel-2 Improved Features and Applications. Agronomy 2020, 10, 641.
- Abubakar, G.A.; Wang, K.; Shahtahamssebi, A.; Xue, X.; Belete, M.; Gudo, A.J.A.; Mohamed Shuka, K.A.; Gan, M. Mapping Maize Fields by Using Multi-Temporal Sentinel-1A and Sentinel-2A Images in Makarfi, Northern Nigeria, Africa. Sustainability 2020, 12, 2539.
- Gella, G.W.; Bijker, W.; Belgiu, M. Mapping Crop Types in Complex Farming Areas Using SAR Imagery with Dynamic Time Warping. ISPRS J. Photogramm. Remote Sens. 2021, 175, 171–183.
- Gao, Q.; Zribi, M.; Escorihuela, M.; Baghdadi, N.; Segui, P. Irrigation Mapping Using Sentinel-1 Time Series at Field Scale. Remote Sens. 2018, 10, 1495.
- Jennewein, J.S.; Lamb, B.T.; Hively, W.D.; Thieme, A.; Thapa, R.; Goldsmith, A.; Mirsky, S.B. Integration of Satellite-Based Optical and Synthetic Aperture Radar Imagery to Estimate Winter Cover Crop Performance in Cereal Grasses. Remote Sens. 2022, 14, 2077.
- Mandal, D.; Kumar, V.; Ratha, D.; Dey, S.; Bhattacharya, A.; Lopez-Sanchez, J.M.; McNairn, H.; Rao, Y.S. Dual Polarimetric Radar Vegetation Index for Crop Growth Monitoring Using Sentinel-1 SAR Data. Remote Sens. Environ. 2020, 247, 111954.
- Abdolrasol, M.G.M.; Hussain, S.M.S.; Ustun, T.S.; Sarker, M.R.; Hannan, M.A.; Mohamed, R.; Ali, J.A.; Mekhilef, S.; Milad, A. Artificial Neural Networks Based Optimization Techniques: A Review. Electronics 2021, 10, 2689.
- Maxwell, A.E.; Warner, T.A.; Fang, F. Implementation of Machine-Learning Classification in Remote Sensing: An Applied Review. Int. J. Remote Sens. 2018, 39, 2784–2817.
- Thanh Noi, P.; Kappas, M. Comparison of Random Forest, k-Nearest Neighbor, and Support Vector Machine Classifiers for Land Cover Classification Using Sentinel-2 Imagery. Sensors 2017, 18, 18.
- Kuhn, M. Building Predictive Models in R Using the Caret Package. J. Stat. Softw. 2008, 28, 1–26.
- Meyer, H.; Reudenbach, C.; Hengl, T.; Katurji, M.; Nauss, T. Improving Performance of Spatio-Temporal Machine Learning Models Using Forward Feature Selection and Target-Oriented Validation. Environ. Model. Softw. 2018, 101, 1–9.
- Phalke, A.R.; Özdoğan, M.; Thenkabail, P.S.; Erickson, T.; Gorelick, N.; Yadav, K.; Congalton, R.G. Mapping Croplands of Europe, Middle East, Russia, and Central Asia Using Landsat, Random Forest, and Google Earth Engine. ISPRS J. Photogramm. Remote Sens. 2020, 167, 104–122.
Class | Description |
---|---|
Cropland irrigated | Croplands under management mainly during the dry season. |
Cropland rainfed | Croplands under management mainly during the wet season. |
Dense vegetation | Natural vegetation comprising mainly trees and dense undergrowth. |
Light vegetation | Natural vegetation comprising mainly low shrubs, grasses, and some trees. |
Grassland | Natural vegetation of primarily grass. |
Wetland | Natural vegetation that is submerged part of the year (mainly during the rainy season and first part of the dry season). |
Water | Water bodies and rivers. |
Built-up area | Man-made surfaces and built-up areas, including bare areas such as sand (no vegetation). |
Group | Variable | Equation |
---|---|---|
Sentinel-2 | Blue | |
Sentinel-2 | Green | |
Sentinel-2 | Red | |
Sentinel-2 | Near Infrared (NIR) | |
Sentinel-2 | Red-edge 1 (RE1) | |
Sentinel-2 | Red-edge 2 (RE2) | |
Sentinel-2 | Shortwave Infrared 1 (SWIR1) | |
Sentinel-2 | Shortwave Infrared 2 (SWIR2) | |
Indices S2 | Normalized Difference Vegetation Index (NDVI) | (NIR − Red)/(NIR + Red) |
Indices S2 | Normalized Difference Water Index (NDWI) | (NIR − SWIR1)/(NIR + SWIR1) |
Indices S2 | Bare Soil Index (BSI) | ((Red + SWIR1) − (NIR + Blue))/((Red + SWIR1) + (NIR + Blue)) |
Indices S2 | Chlorophyll Index (CI) | (NIR/RE1) − 1 |
Temporal variation | 3 MADs S2 | See [21,22] for more details on equations |
Sentinel-1 | VV | |
Sentinel-1 | VH | |
Indices S1 | Radar Vegetation Index (RVI) | 4 × VH/(VV + VH) |
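The index equations in the table above translate directly into code. The following is a minimal sketch for a single pixel, assuming reflectance values for the Sentinel-2 bands and linear-scale (not dB) backscatter for VV and VH. The function names and the example pixel values are hypothetical, and zero denominators are not handled.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def ndwi(nir, swir1):
    """Normalized Difference Water Index (NIR/SWIR1 form used in the table)."""
    return (nir - swir1) / (nir + swir1)


def bsi(red, swir1, nir, blue):
    """Bare Soil Index."""
    return ((red + swir1) - (nir + blue)) / ((red + swir1) + (nir + blue))


def chlorophyll_index(nir, re1):
    """Chlorophyll Index based on red-edge band 1."""
    return nir / re1 - 1


def rvi(vv, vh):
    """Radar vegetation index from Sentinel-1 VV and VH (linear scale assumed)."""
    return 4 * vh / (vv + vh)


# Hypothetical pixel values, chosen only to exercise the formulas.
pixel = {
    "blue": 0.05, "red": 0.10, "re1": 0.25, "nir": 0.50,
    "swir1": 0.20, "vv": 0.30, "vh": 0.10,
}

indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "NDWI": ndwi(pixel["nir"], pixel["swir1"]),
    "BSI": bsi(pixel["red"], pixel["swir1"], pixel["nir"], pixel["blue"]),
    "CI": chlorophyll_index(pixel["nir"], pixel["re1"]),
    "RVI": rvi(pixel["vv"], pixel["vh"]),
}

for name, value in indices.items():
    print(f"{name}: {value:.4f}")
```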
Class | Catandica, Manica Province (# polygons) | Catandica, Manica Province (ha) | Manica, Manica Province (# polygons) | Manica, Manica Province (ha) | Chokwe, Gaza Province (# polygons) | Chokwe, Gaza Province (ha) | Xai-Xai, Gaza Province (# polygons) | Xai-Xai, Gaza Province (ha) |
---|---|---|---|---|---|---|---|---|
Built-up area | 10 | 3.4 | 10 | 5.6 | 10 | 11.5 | 10 | 18.1 |
Cropland irrigated | 45 | 16.4 | 58 | 10.2 | 68 | 166 | 157 | 38.3 |
Cropland rainfed | 34 | 10.9 | 32 | 7 | 48 | 40.4 | 19 | 5.8 |
Dense vegetation | 9 | 148 | 19 | 104 | 15 | 12.5 | 9 | 37.2 |
Grassland | - | - | - | - | - | - | 52 | 111 |
Light vegetation | 25 | 89.5 | 20 | 11.3 | 104 | 187 | 28 | 26 |
Water | - | - | 9 | 113 | 5 | 17.2 | 9 | 42.6 |
Wetland | - | - | - | - | 12 | 144 | 6 | 27 |
Total | 123 | 268.2 | 148 | 251.1 | 262 | 578.6 | 290 | 306 |
Class | Gaza, Set 8 (100%) | Manica, Set 8 (100%) |
---|---|---|
Built-up area | 2849 | 1064 |
Irrigated agriculture | 19,601 | 3260 |
Rainfed agriculture | 4798 | 2540 |
Dense vegetation | 6111 | 22,185 |
Grassland | 10,157 | - |
Light vegetation | 20,386 | 9782 |
Water | 5504 | 9720 |
Wetland | 16,582 | - |
Province | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 | Set 6 | Set 7 |
---|---|---|---|---|---|---|---|
Gaza | 50 | 508 | 966 | 1424 | 1882 | 2340 | 2798 |
Manica | 50 | 225 | 400 | 575 | 750 | 925 | 1100 |
Province | Class | Set 1 (1%) | Set 2 (5%) | Set 3 (10%) | Set 4 (20%) | Set 5 (50%) | Set 6 (80%) | Set 7 (90%) | Set 8 (95%) | Set 9 (99%) |
---|---|---|---|---|---|---|---|---|---|---|
Gaza | Irrigated agriculture | 202 | 1008 | 2015 | 4030 | 10,076 | 16,122 | 18,137 | 19,144 | 19,950 |
Gaza | Each of the other classes (7) | 2850 | 2735 | 2591 | 2303 | 1439 | 576 | 288 | 144 | 29 |
Gaza | Total | 20,152 | 20,153 | 20,152 | 20,151 | 20,149 | 20,154 | 20,153 | 20,152 | 20,153 |
Manica | Irrigated agriculture | 54 | 268 | 535 | 1071 | 2677 | 4283 | 4819 | 5086 | 5300 |
Manica | Each of the other classes (5) | 1060 | 1017 | 964 | 857 | 535 | 214 | 107 | 54 | 11 |
Manica | Total | 5354 | 5353 | 5355 | 5356 | 5352 | 5353 | 5354 | 5356 | 5355 |
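The totals in the table above are consistent with a simple rule: the irrigated class takes the stated percentage of a fixed base total, and the remainder is split equally over the other classes, with each count rounded to a whole number. The sketch below reproduces the table under that assumption. The rule, the base totals (20,152 for Gaza and 5354 for Manica, read from the Set 1 column), and the function name are inferred from the numbers and are not a documented procedure of the study.

```python
def oversampling_composition(base_total, irrigated_pct, n_other_classes):
    """Split a fixed sample budget between irrigated and the other classes.

    Returns (irrigated count, count per other class, resulting total).
    Rounding each count separately is why the totals drift by a few samples.
    """
    irrigated = round(base_total * irrigated_pct / 100)
    per_other_class = round((base_total - irrigated) / n_other_classes)
    total = irrigated + per_other_class * n_other_classes
    return irrigated, per_other_class, total


PERCENTAGES = (1, 5, 10, 20, 50, 80, 90, 95, 99)

# Assumed base totals and number of non-irrigated classes per province.
PROVINCES = (("Gaza", 20152, 7), ("Manica", 5354, 5))

compositions = {
    province: [oversampling_composition(base, pct, n_other) for pct in PERCENTAGES]
    for province, base, n_other in PROVINCES
}

for province, rows in compositions.items():
    for pct, (irrigated, per_other, total) in zip(PERCENTAGES, rows):
        print(f"{province} {pct:>2}%: irrigated={irrigated}, "
              f"per other class={per_other}, total={total}")
```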
Province | Set 1 (1%) | Set 2 (5%) | Set 3 (10%) | Set 4 (20%) | Set 5 (40%) |
---|---|---|---|---|---|
Gaza | 860 | 4299 | 8599 | 17,198 | 34,396 |
Manica | 486 | 2428 | 4855 | 9710 | 19,420 |
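The counts in the table above match the stated percentages of the full (Set 8) sample totals. For example, 860 is about 1% of the 85,988 Gaza samples. One way to inject this kind of label noise is sketched below. It assumes that mislabels are drawn at random among samples of the three confusable classes and that each selected sample is reassigned to one of the other two. The exact selection scheme is not given here, so the function, class strings, and toy label list are hypothetical.

```python
import random

# The three classes named in the Scenario 4 heading.
CONFUSABLE = ("irrigated", "rainfed", "light vegetation")


def mislabel(labels, fraction, seed=0):
    """Return a copy of labels with a share of the confusable samples relabeled.

    The number of flips is the given fraction of ALL samples, capped at the
    number of samples that belong to a confusable class.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    candidates = [i for i, label in enumerate(noisy) if label in CONFUSABLE]
    n_flip = min(round(len(noisy) * fraction), len(candidates))
    for i in rng.sample(candidates, n_flip):
        # Pick one of the two other confusable classes at random.
        noisy[i] = rng.choice([c for c in CONFUSABLE if c != noisy[i]])
    return noisy


# Toy label list: 100 confusable samples and 100 water samples.
labels = (
    ["irrigated"] * 50 + ["rainfed"] * 30
    + ["light vegetation"] * 20 + ["water"] * 100
)
noisy_labels = mislabel(labels, 0.10)
changed = sum(1 for old, new in zip(labels, noisy_labels) if old != new)
print(f"{changed} of {len(labels)} labels changed")
```

Classes outside the confusable group are left untouched, so the noise stays confined to the classes the scenario targets.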
Class | Gaza | Manica |
---|---|---|
Built-up area | 668 | 252 |
Irrigated agriculture | 4936 | 823 |
Rainfed agriculture | 1227 | 607 |
Dense vegetation | 1496 | 5577 |
Grassland | 2536 | - |
Light vegetation | 5132 | 2428 |
Water | 1339 | 2452 |
Wetland | 4165 | - |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Weitkamp, T.; Karimi, P. Evaluating the Effect of Training Data Size and Composition on the Accuracy of Smallholder Irrigated Agriculture Mapping in Mozambique Using Remote Sensing and Machine Learning Algorithms. Remote Sens. 2023, 15, 3017. https://doi.org/10.3390/rs15123017