Categorical Prediction of the Anthropization Index in the Lake Tota Basin, Colombia, Using XGBoost, Remote Sensing and Geomorphometry Data
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Area
2.2. Data Acquisition and Variable Preparation
2.2.1. Target Variable: Integrated Relative Anthropization Index (INRA)
2.2.2. Predictor Variables
- Spectral Data: Original spectral bands from the Sentinel-2 image.
- Proximity Variables: Euclidean distances to road infrastructure and populated centers were calculated to model the influence of accessibility on anthropization.
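
The proximity variables described above can be illustrated with a small sketch. It assumes the road or settlement layer has already been rasterized to a boolean grid; the function name `euclidean_distance_raster`, the 5 x 5 grid, and the 50 m cell size are illustrative assumptions, not the authors' workflow. The brute-force approach shown here is only practical for small grids; for full rasters a distance transform such as `scipy.ndimage.distance_transform_edt` would normally be preferred.

```python
import numpy as np


def euclidean_distance_raster(feature_mask: np.ndarray, cell_size_m: float) -> np.ndarray:
    """Distance in metres from every cell centre to the nearest feature cell.

    feature_mask: boolean grid, True where a road or populated centre is present.
    cell_size_m:  edge length of a square cell in metres (assumed value).
    """
    features = np.argwhere(feature_mask)  # (k, 2) row/col indices of feature cells
    if features.size == 0:
        raise ValueError("feature_mask contains no feature cells")
    rows, cols = np.indices(feature_mask.shape)
    cells = np.column_stack([rows.ravel(), cols.ravel()])  # (n, 2) all cell indices
    # Pairwise offsets between every cell and every feature cell, in cell units.
    diff = cells[:, None, :] - features[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return nearest.reshape(feature_mask.shape) * cell_size_m


# Hypothetical grid with a single road cell in the centre.
roads = np.zeros((5, 5), dtype=bool)
roads[2, 2] = True
dist = euclidean_distance_raster(roads, cell_size_m=50.0)
```

The resulting grid can then be stacked with the spectral bands as one more predictor layer.
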
2.3. Data Pre-Processing and Scale Optimization
2.4. XGBoost Model Development and Evaluation
2.4.1. Model Selection
2.4.2. Model Training and Hyperparameter Tuning
2.4.3. Performance Evaluation
2.5. Model Interpretability Using SHAP
2.6. Post-Prediction Analysis
- i. Area Comparison: Calculating the total area for each predicted INRA class and comparing it against the original classified raster to identify potential over- or underestimation.
- ii. Shape Metrics: Analyzing the area-to-perimeter ratio of the predicted polygons to assess landscape fragmentation and edge effects, which can indicate model performance in transition zones [32].
- iii. Spatial Consistency: Comparing the final prediction map with observed land cover data to verify the spatial fit and identify potential discrepancies related to human influence or positioning.
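
The area and shape metrics listed above can be sketched on a classified grid as follows. This is a minimal illustration under stated assumptions: the function name `class_area_perimeter`, the toy grid, and the 50 m cell size are hypothetical, and the perimeter is obtained by counting exposed cell edges, which will generally differ from the perimeter of vectorized polygons.

```python
import numpy as np


def class_area_perimeter(grid: np.ndarray, cls: int, cell_size_m: float):
    """Return (area in m2, perimeter in m, area/perimeter) for one class of a label grid."""
    # Pad with non-class cells so that patches touching the grid border are closed.
    mask = np.pad(grid == cls, 1, constant_values=False)
    area = float(mask.sum()) * cell_size_m ** 2
    # Count cell edges that separate a class cell from a non-class cell,
    # first between vertically adjacent cells, then horizontally adjacent ones.
    edges = 0
    for axis in (0, 1):
        edges += int(np.count_nonzero(mask != np.roll(mask, 1, axis=axis)))
    perimeter = edges * cell_size_m
    ratio = area / perimeter if perimeter else float("nan")
    return area, perimeter, ratio


# Hypothetical predicted grid: a 2 x 2 patch of class 4 surrounded by class 1.
predicted = np.array([
    [4, 4, 1, 1],
    [4, 4, 1, 1],
    [1, 1, 1, 1],
])
area, perimeter, ratio = class_area_perimeter(predicted, cls=4, cell_size_m=50.0)
```

A lower area-to-perimeter ratio for the same class in the predicted map than in the reference map suggests more fragmented patches or more irregular edges.
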
3. Results
3.1. Optimal Scale Selection
3.2. Performance of the Optimized 50 m Model
3.3. Predictor Importance and Feature Selection
3.4. Model Interpretability with SHAP Values
3.5. Predicted vs. Calculated INRA Map
4. Discussion
4.1. Interpretation of Anthropization Patterns in Lake Tota
4.2. Methodological Insights: Scale Optimization and Predictor Importance
4.3. Model Limitations and Future Directions
4.4. Methodological Insights
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- IDEAM. Estudio Nacional Del Agua 2014; Franco Torres, O., García Herrán, M., Vargas Martínez, O., Bernal Quiroga, F., Campillo, A.K., Eds.; IDEAM: Bogotá, Colombia, 2014.
- Plazas Figueroa, D.A.; Ortiz Villota, M.T. Diseño de Medidas de Manejo Ambiental Orientadas a la Disminución de Los Niveles de Eutrofización: Estudio de Caso en la Microcuenca Del Río Hatolaguna en El Humedal Lago de Tota (Municipios de Aquitania-Sogamoso, Boyacá). Bachelor’s Thesis, Universidad Libre, Bogotá, Colombia, 2016.
- Northcote, T.G. Eutrofización y Problemas de Polución. In El Lago Titicaca: Síntesis Del Conocimiento Limnológico Actual; Hisbol-ORSTOM: La Paz, Bolivia, 1991.
- Franco Vidal, L.; Delgado, J.; Andrade, G.I. Vulnerability Factors to Global Climate Change in the High Andean Colombian Wetlands. Cuad. Geogr. Rev. Colomb. Geogr. 2013, 22, 69–85.
- Salamanca Gómez, M.A. Multi-Timer Analysis on the Loss of the Water Mirror on Laguna La Herrera Wetland for Anthropic Effects Associated with Mining. Bachelor’s Thesis, Universidad Militar Nueva Granada, Bogotá, Colombia, 2018.
- Lewandowski, J.; Meinikmann, K.; Krause, S. Groundwater–Surface Water Interactions: Recent Advances and Interdisciplinary Challenges. Water 2020, 12, 296.
- Abdullah, A.Y.M.; Masrur, A.; Adnan, M.S.G.; Baky, M.A.A.; Hassan, Q.K.; Dewan, A. Spatio-Temporal Patterns of Land Use/Land Cover Change in the Heterogeneous Coastal Region of Bangladesh between 1990 and 2017. Remote Sens. 2019, 11, 790.
- Chemura, A.; Rwasoka, D.; Mutanga, O.; Dube, T.; Mushore, T. The Impact of Land-Use/Land Cover Changes on Water Balance of the Heterogeneous Buzi Sub-Catchment, Zimbabwe. Remote Sens. Appl. 2020, 18, 100292.
- Martínez-Dueñas, W.A. INRA—Relative Integrated Anthropization Index: A Conceptual-Technical Proposal and Its Application. Intropica Rev. Inst. Investig. Trop. 2010, 5, 37–46. Available online: https://dialnet.unirioja.es/servlet/articulo?codigo=3794116 (accessed on 5 October 2025).
- Radočaj, D.; Plaščak, I.; Jurišić, M. A Comparative Assessment of Regular and Spatial Cross-Validation in Subfield Machine Learning Prediction of Maize Yield from Sentinel-2 Phenology. Eng 2025, 6, 270.
- Valerio, F.; Basile, M.; Balestrieri, R.; Posillico, M.; Di Donato, S.; Altea, T.; Matteucci, G. The Reliability of a Composite Biodiversity Indicator in Predicting Bird Species Richness at Different Spatial Scales. Ecol. Indic. 2016, 71, 627–635.
- Wanumen Mesa, A.M. Dynamics of Land Cover and Perception of Water Resources in the Lake Tota Basin. Master’s Thesis, Universidad Distrital Francisco José de Caldas, Bogotá, Colombia, 2018. Available online: https://repository.udistrital.edu.co/items/4934289e-05c7-418a-865d-a186dff5065f (accessed on 5 October 2025).
- Plaza Ortega, V.; Valencia Rojas, M.P.; Figueroa Casas, A. Relative Integrated Anthropization Index (INRA) Application in a High Mountain Ecosystem. Luna Azul 2017, 44, 80–93.
- Ariza, A.; Roa Melgarejo, O.J.; Serrato, P.K.; León Rincón, H.A. Use of Spectral Indices Derived from Remote Sensors for Geomorphological Characterization in Island Areas of the Colombian Caribbean. Perspect. Geográfica 2018, 23, 105–122.
- Revelo Luna, D.A.; Mejía Manzano, J.; Montoya Bonilla, B.; Hoyos García, J. Analysis of the Vegetation Indices NDVI, GNDVI, and NDRE for the Characterization of Coffee Crops (Coffea Arabica). Ing. Desarro. 2021, 38, 298–312.
- Paz Pellat, F.; Romero Sánchez, M.E.; Palacios Vélez, E.; Bolaños González, M.; Valdez Lazalde, J.R.; Aldrete, A. Scopes and Limitations of Spectral Vegetation Indexes: Theoretical Framework. Terra Latinoam. 2014, 32, 177–194.
- Raduła, M.W.; Szymura, T.H.; Szymura, M. Topographic Wetness Index Explains Soil Moisture Better than Bioindication with Ellenberg’s Indicator Values. Ecol. Indic. 2018, 85, 172–179.
- Heinonen, T.; Kurttila, M.; Pukkala, T. Possibilities to Aggregate Raster Cells through Spatial Optimization in Forest Planning. Silva Fenn. 2007, 44, 89–103.
- Carmel, Y. Aggregation as a Means of Reducing Raster Data Uncertainty. In Proceedings of the 7th International Conference on GeoComputation, Southampton, UK, 8–10 September 2003.
- Newman, D.R.; Cockburn, J.M.H.; Drǎguţ, L.; Lindsay, J.B. Local Scale Optimization of Geomorphometric Land Surface Parameters Using Scale-Standardized Gaussian Scale-Space. Comput. Geosci. 2022, 165, 105144.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. arXiv 2016, arXiv:1603.02754v3.
- Fan, J.; Wang, X.; Wu, L.; Zhou, H.; Zhang, F.; Yu, X.; Lu, X.; Xiang, Y. Comparison of Support Vector Machine and Extreme Gradient Boosting for Predicting Daily Global Solar Radiation Using Temperature and Precipitation in Humid Subtropical Climates: A Case Study in China. Energy Convers. Manag. 2018, 164, 102–111.
- Dorado Guerra, D.Y. Integrated Modeling with Machine Learning to Assess Nutrient Pollution in Water Bodies Today and under the Effect of Climate Change. Application to the Júcar River Basin District. Master’s Thesis, Universitat Politècnica de València, Valencia, Spain, 2024.
- Ojeda Riaños, C.K.; Torres, C.A.; Zapata Calero, J.C.; Romero-Leiton, J.P.; Benavides, I.F. A Machine Learning Approach to Map the Potential Agroecological Complexity in an Indigenous Community of Colombia. J. Environ. Manag. 2024, 370, 122655.
- Dong, H.; He, D.; Wang, F. SMOTE-XGBoost Using Tree Parzen Estimator Optimization for Copper Flotation Method Classification. Powder Technol. 2020, 375, 174–181.
- Wang, S.; Liu, S.; Zhang, J.; Che, X.; Yuan, Y.; Wang, Z.; Kong, D. A New Method of Diesel Fuel Brands Identification: SMOTE Oversampling Combined with XGBoost Ensemble Learning. Fuel 2020, 282, 118848.
- Osorio Díaz, D.F. Classification of Mental Illnesses in Adults Using Machine Learning Techniques and Tree-Based Models in Colombian Mental Health. Master’s Thesis, Universidad de los Andes, Bogotá, Colombia, 2023.
- Santarelli, J. Machine Learning Approaches to Address Subscriber Churn on a Streaming Platform in the Context of Digital Transformation. Master’s Thesis, Universidad Torcuato Di Tella, Buenos Aires, Argentina, 2021.
- Hossin, M.; Sulaiman, M.N. A Review on Evaluation Metrics for Data Classification Evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1–11.
- Brennan, R.L.; Prediger, D.J. Coefficient Kappa: Some Uses, Misuses, and Alternatives. Educ. Psychol. Meas. 1981, 41, 687–699.
- Rozemberczki, B.; Watson, L.; Bayer, P.; Yang, H.-T.; Kiss, O.; Nilsson, S.; Sarkar, R. The Shapley Value in Machine Learning. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, IJCAI-ECAI 2022, Vienna, Austria, 23–29 July 2022.
- Fahrig, L. Ecological Responses to Habitat Fragmentation Per Se. Annu. Rev. Ecol. Evol. Syst. 2017, 48, 1–23.
- Hell, M.; Brandmeier, M. Identifying Plausible Labels from Noisy Training Data for a Land Use and Land Cover Classification Application in Amazônia Legal. Remote Sens. 2024, 16, 2080.
- Rojas Paez, D. Análisis Multitemporal Mediante Imágenes Landsat Del Cambio de La Cobertura Vegetal y Su Impacto En La Desecación Del Espejo de Agua En La Laguna de Tota Para El Periodo de 1991 al 2017. Master’s Thesis, Universidad Militar Nueva Granada, Bogotá, Colombia, 2018.
- Arias Sosa, L.A.; Cely Reyes, O.A.; López Dulcey, J.R.; Ramos Montaño, C.; Rodríguez Africano, P.E.; Salamanca Reyes, J.R. Un Breve Recorrido por el Lago de Tota 2020. Available online: https://repositorio.uptc.edu.co/server/api/core/bitstreams/740253ad-09ca-467d-8f2b-acfd854faed9/content (accessed on 5 October 2025).
- Forero Salamanca, J.C. Estudio de La Incidencia de Actividades Agropecuarias en Cuerpos Lénticos de Alta Montaña de La Cordillera Andina Colombiana. Master’s Thesis, Universidad Nacional Abierta y a Distancia, Bogotá, Colombia, 2021. Available online: https://repository.unad.edu.co/jspui/handle/10596/39046?locale=es (accessed on 5 October 2025).
- Pratt, B.; Chang, H. Effects of Land Cover, Topography, and Built Structure on Seasonal Water Quality at Multiple Spatial Scales. J. Hazard. Mater. 2012, 209–210, 48–58.
- Arenas-Castro, S.; Gonçalves, J.; Alves, P.; Alcaraz-Segura, D.; Honrado, J.P. Assessing the Multi-Scale Predictive Ability of Ecosystem Functional Attributes for Species Distribution Modelling. PLoS ONE 2018, 13, e0199292.
- Comber, A.; Harris, P. The Importance of Scale and the MAUP for Robust Ecosystem Service Evaluations and Landscape Decisions. Land 2022, 11, 399.
- Ghafarian, F.; Wieland, R.; Lüttschwager, D.; Nendel, C. Application of Extreme Gradient Boosting and Shapley Additive Explanations to Predict Temperature Regimes inside Forests from Standard Open-Field Meteorological Data. Environ. Model. Softw. 2022, 156, 105466.
- Monteiro, G.O.d.A.; Difante, G.d.S.; Montagner, D.B.; Euclides, V.P.B.; Castro, M.; Rodrigues, J.G.; Pereira, M.d.G.; Ítavo, L.C.V.; Campos, J.A.; da Costa, A.B.; et al. Interpreting Machine Learning Models with SHAP Values: Application to Crude Protein Prediction in Tamani Grass Pastures. Agronomy 2025, 15, 2780.
- Binet, R.; Bergsma, E.; Poulain, V. Accurate Sentinel-2 Inter-Band Time Delays. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, V-1–2022, 57–66.
- Hirayama, H.; Sharma, R.C.; Tomita, M.; Hara, K. Evaluating Multiple Classifier System for the Reduction of Salt-and-Pepper Noise in the Classification of Very-High-Resolution Satellite Images. Int. J. Remote Sens. 2019, 40, 2542–2557.
- Chen, Y.; Zhou, Y.; Ge, Y.; An, R.; Chen, Y. Enhancing Land Cover Mapping through Integration of Pixel-Based and Object-Based Classifications from Remotely Sensed Imagery. Remote Sens. 2018, 10, 77.
- Bo, F.; Xiao-Yang, Z.; Yi, L.; Xiang-Hai, W.; Yong-Gong, R. A Convolutional Neural Networks Denoising Approach for Salt and Pepper Noise. arXiv 2018, arXiv:1807.08176.
- Tziachris, P.; Nikou, M.; Aschonitis, V.; Kallioras, A.; Sachsamanoglou, K.; Fidelibus, M.D.; Tziritis, E. Spatial or Random Cross-Validation? The Effect of Resampling Methods in Predicting Groundwater Salinity with Machine Learning in Mediterranean Region. Water 2023, 15, 2278.






| Evaluation Category | Nomenclature | INRA | Reclassification Values |
|---|---|---|---|
| Water surfaces | 4.1.1. Swampy Areas | 0 | 0 |
| | 4.1.2. Peatlands | 0 | 0 |
| | 5.1.2. Natural lagoons, lakes, and swamps | 0 | 0 |
| Tall vegetation without human intervention or with a high degree of recovery | 3.1.3.2. Fragmented forest with secondary vegetation | 0 | 0 |
| | 3.1.4. Gallery and riparian forest | 0 | 0 |
| | 3.2.2.2.2 Open Mesophilic Shrubland | 0 | 0 |
| | 3.2.3.1. High Secondary Vegetation | 0 | 0 |
| | 3.2.3.2. Low Secondary Vegetation | 0 | 0 |
| Low vegetation without anthropogenic intervention or with a high degree of recovery | 3.2.1.1.1.1. Dense, non-wooded, terra firme grasslands | 0.25 | 1 |
| | 3.2.1.1.1.2. Dense wooded dry land grasslands | 0.25 | 1 |
| | 3.2.1.1.1.3. Dense Firm Ground Grasslands with Shrubs | 0.25 | 1 |
| | 3.2.1.2. Open Grassland | 0.25 | 1 |
| Grazing area | 2.3.1. Clean pastures | 0.5 | 3 |
| | 2.3.2. Wooded pastures | 0.5 | 3 |
| | 2.3.3. Weedy pastures | 0.5 | 3 |
| Infrastructure | 1.1.1. Continuous urban fabric | 1 | 5 |
| | 1.2.2.1. Road network and associated territories | 1 | 5 |
| | 1.1.2. Discontinuous urban fabric | 1 | 5 |
| Crop area | 2.1.4.1. Onion | 0.75 | 4 |
| | 2.4.1. Crop Mosaic | 0.75 | 4 |
| | 2.4.2. Grassland and crop mosaic | 0.75 | 4 |
| Forest plantations | 3.1.5. Forest plantation | 0.5 | 3 |
| Areas devoid of vegetation | 3.3.1.2. Sandy areas | 0.4 | 2 |
| | 3.3.2. Rocky outcrops | 0.4 | 2 |

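
The reclassification in the table above amounts to a lookup from the INRA value of a land cover unit to an ordinal class from 0 to 5. A minimal sketch follows; the mapping and class labels are taken from the tables in this article, while the names `INRA_TO_CLASS`, `CLASS_LABELS`, and `reclassify_inra` are illustrative assumptions.

```python
# INRA value -> ordinal class, as listed in the reclassification table.
INRA_TO_CLASS = {0.0: 0, 0.25: 1, 0.4: 2, 0.5: 3, 0.75: 4, 1.0: 5}

# Class labels as used in the calculated vs. predicted area table.
CLASS_LABELS = {
    0: "Natural",
    1: "Slightly Disturbed",
    2: "Moderately Altered",
    3: "Disturbed",
    4: "Highly Altered",
    5: "Completely Anthropized",
}


def reclassify_inra(value: float) -> int:
    """Map an INRA value to its class; reject values the table does not define."""
    try:
        # Rounding guards against small floating-point noise in the input.
        return INRA_TO_CLASS[round(value, 2)]
    except KeyError:
        raise ValueError(f"no class defined for INRA value {value!r}") from None


classes = [reclassify_inra(v) for v in (0, 0.25, 0.4, 0.5, 0.75, 1)]
labels = [CLASS_LABELS[c] for c in classes]
```

Raising an error for undefined values, rather than snapping to the nearest class, keeps unexpected land cover codes visible instead of silently assigning them a class.
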
| Resolution (m) | Accuracy | Kappa Index |
|---|---|---|
| 20 | 0.769 | 0.687 |
| 50 | 0.743 | 0.641 |
| 50 (SMOTE) | 0.75 | 0.653 |
| 100 | 0.709 | 0.593 |
| 150 | 0.68 | 0.554 |
| 200 | 0.67 | 0.54 |
| 250 | 0.66 | 0.531 |
| 300 | 0.662 | 0.531 |
| 350 | 0.647 | 0.516 |
| 400 | 0.597 | 0.451 |
| 450 | 0.642 | 0.509 |
| 500 | 0.64 | 0.505 |

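
For readers unfamiliar with the two scores in the table above, the sketch below shows how overall accuracy and a kappa coefficient can be derived from a confusion matrix. It assumes the Kappa Index is Cohen's kappa, which the table does not state explicitly; the function name and the 2 x 2 matrix are hypothetical and unrelated to the study data.

```python
import numpy as np


def accuracy_and_kappa(confusion) -> tuple:
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    cm = np.asarray(confusion, dtype=float)
    n = cm.sum()
    observed = np.trace(cm) / n  # proportion of agreeing cells
    # Agreement expected by chance, from the row and column marginals.
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return float(observed), float(kappa)


# Hypothetical two-class confusion matrix (rows: predicted, columns: reference).
toy = [[40, 10],
       [5, 45]]
acc, kappa = accuracy_and_kappa(toy)
```

Because kappa discounts chance agreement, it is always lower than accuracy when classes are unevenly distributed, which matches the pattern across all resolutions in the table.
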
| Modeling Resolution (m) | CPU | GPU | RAM (GB) | Operating System | Training Dataframe Rows | Execution Time (s) | Execution Time (days) |
|---|---|---|---|---|---|---|---|
| 20 | Intel Core i5-10300H | NVIDIA GeForce GTX 1650 with MAX-Q Design | 24 | Windows 11 Home | 423,507 | 513,185.72 | 5.94 |
| 50 | AMD Ryzen 5 3200 | NVIDIA GeForce 2060 OC (9 GB) | 16 | | 48,018 | 31,260.87 | 0.36 |
| 50 (SMOTE) | Intel Core i5-10300H | NVIDIA GeForce GTX 1650 with MAX-Q Design | 24 | | 48,471 | 67,307.66 | 0.78 |
| 100 | Intel Core i7-8565U | NVIDIA GeForce MX110 | 20 | Windows 11 Pro | 17,645 | 38,197.15 | 0.44 |
| 150 | | | | | 7094 | 28,229.80 | 0.33 |
| 200 | | | | | 4609 | 9927.31 | 0.11 |
| 250 | | | | | 2781 | 6222.48 | 0.07 |
| 300 | Intel Core i5 3210M | NVIDIA GeForce 610M | 6 | Windows 10 Pro | 2781 | 13,607.81 | 0.16 |
| 350 | | | | | 1520 | 5185.20 | 0.06 |
| 400 | | | | | 1250 | 4627.69 | 0.05 |
| 450 | | | | | 975 | 5443.29 | 0.06 |
| 500 | | | | | 836 | 8002.14 | 0.09 |

| Metrics | Class: 0 | Class: 1 | Class: 2 | Class: 3 | Class: 4 | Class: 5 |
|---|---|---|---|---|---|---|
| Sensitivity | 0.55559 | 0.8927 | 0.30881 | 0.49529 | 0.8882 | 0.827815 |
| Specificity | 0.97633 | 0.9371 | 0.97242 | 0.93074 | 0.8442 | 0.999645 |
| Pos Pred Value | 0.65710 | 0.8769 | 0.51853 | 0.58285 | 0.7426 | 0.880282 |
| Neg Pred Value | 0.96416 | 0.9457 | 0.93601 | 0.90421 | 0.9372 | 0.999457 |
| Prevalence | 0.07549 | 0.3341 | 0.08774 | 0.16344 | 0.3361 | 0.003145 |
| Detection Rate | 0.04194 | 0.2982 | 0.02709 | 0.08095 | 0.2986 | 0.002603 |
| Detection Prevalence | 0.06383 | 0.3401 | 0.05225 | 0.13889 | 0.4020 | 0.002957 |
| Balanced Accuracy | 0.76596 | 0.9149 | 0.64061 | 0.71301 | 0.8662 | 0.913730 |
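
The per-class metrics in the table above are one-vs-rest quantities: each class is treated in turn as the positive class and all others as negative. The sketch below shows how they relate to a confusion matrix. The function name, the matrix orientation, and the 3 x 3 matrix are assumptions for illustration only. Two identities it makes visible can be checked against the tabulated values for Class 0: balanced accuracy is the mean of sensitivity and specificity, (0.55559 + 0.97633) / 2 ≈ 0.76596, and detection rate is sensitivity times prevalence, 0.55559 × 0.07549 ≈ 0.04194.

```python
import numpy as np


def one_vs_rest_metrics(confusion, cls: int) -> dict:
    """Per-class metrics; confusion[i, j] counts cells predicted i with reference j."""
    cm = np.asarray(confusion, dtype=float)
    n = cm.sum()
    tp = cm[cls, cls]
    fp = cm[cls, :].sum() - tp  # predicted as cls, reference is another class
    fn = cm[:, cls].sum() - tp  # reference is cls, predicted as another class
    tn = n - tp - fp - fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / n,
        "detection_rate": tp / n,
        "detection_prevalence": (tp + fp) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
    }


# Hypothetical three-class confusion matrix (rows: predicted, columns: reference).
toy = [[50, 5, 5],
       [10, 80, 10],
       [0, 15, 25]]
m0 = one_vs_rest_metrics(toy, cls=0)
```
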

| Category | Calculated INRA: Area (m²) | Calculated INRA: Perimeter (m) | Calculated INRA: A/P Ratio | Predicted INRA: Area (m²) | Predicted INRA: Perimeter (m) | Predicted INRA: A/P Ratio | Percentage Difference in Area |
|---|---|---|---|---|---|---|---|
| Natural (0) | 13,098,416.97 | 289,778.07 | 45.20 | 12,770,866.56 | 328,175.16 | 38.91 | 2.50% |
| Slightly Disturbed (1) | 58,321,970.36 | 510,561.35 | 114.23 | 58,483,945.83 | 529,519.92 | 110.45 | −0.28% |
| Moderately Altered (2) | 15,164,504.17 | 453,685.66 | 33.43 | 13,962,286.18 | 475,644.00 | 29.35 | 7.93% |
| Disturbed (3) | 28,248,523.32 | 577,156.31 | 48.94 | 26,902,327.13 | 665,229.65 | 40.44 | 4.77% |
| Highly Altered (4) | 58,141,997.61 | 399,089.79 | 145.69 | 60,863,185.63 | 575,836.41 | 105.70 | −4.68% |
| Completely Anthropized (5) | 543,517.71 | 6719.49 | 80.89 | 536,318.80 | 7079.46 | 75.76 | 1.32% |

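
The percentage column in the table above is consistent with the difference between calculated and predicted area expressed relative to the calculated area, so a positive value means the model predicts less area than the reference. This convention is inferred from the tabulated numbers rather than stated in the table, and the function name below is an illustrative assumption. A short check using two rows of the table:

```python
def pct_area_difference(calculated_m2: float, predicted_m2: float) -> float:
    """Percentage difference in area, relative to the calculated (reference) area."""
    return (calculated_m2 - predicted_m2) / calculated_m2 * 100.0


# Values copied from the Natural (0) and Highly Altered (4) rows.
natural = pct_area_difference(13_098_416.97, 12_770_866.56)
highly_altered = pct_area_difference(58_141_997.61, 60_863_185.63)

# Area-to-perimeter ratio for the calculated Natural (0) class.
natural_ap_ratio = 13_098_416.97 / 289_778.07
```

Rounded to two decimals, these reproduce the tabulated 2.50%, −4.68%, and 45.20.
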
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Camargo-Pérez, A.M.; Mayorga-Guzmán, I.A.; Flórez-Yepes, G.Y.; Benavides-Martínez, I.F.; Garcés-Gómez, Y.A. Categorical Prediction of the Anthropization Index in the Lake Tota Basin, Colombia, Using XGBoost, Remote Sensing and Geomorphometry Data. Earth 2026, 7, 17. https://doi.org/10.3390/earth7010017



