1. Introduction
Land use and land cover (LULC) mapping is a vital application of remote sensing, supporting various domains such as environmental monitoring, climate change assessment, disaster management, and urban development [1,2,3,4]. Temporal and spatial information regarding LULC is also essential for land use planning and the establishment of effective development strategies [5]. In general, LULC mapping can be applied across diverse climatic and environmental zones, including tropical, arid, semi-arid, temperate, and continental environments [6,7,8,9,10].
Despite the importance of LULC mapping, ensuring accurate surface mapping remains challenging in arid and semi-arid environments, which have experienced continuous LULC changes driven by urbanization and industrialization. Additionally, natural processes (e.g., droughts and flash floods) and anthropogenic activities may contribute to changes in LULC [11,12]. Mapping quality is also influenced by the unique environmental conditions in these regions, including ecological fragility, environmental and land degradation, and climate variability [13]. Notably, field mapping and feature extraction are particularly complex in arid and semi-arid environments, owing to spectral similarities between urban surfaces and barren lands, fragmented and sparse vegetation cover, and seasonal variations. Together, these factors significantly hinder the creation of high-quality maps in arid areas.
LULC mapping can be conducted using field surveys and/or remotely sensed data [7]. When combined with historical and current satellite imagery, these data sources provide up-to-date information, support advanced mapping, and facilitate monitoring of changes on the Earth’s surface. LULC patterns derived from remote-sensing data highlight the importance of LULC accuracy and generalization in environmental contexts [14]. Various satellite image sets with different spatial resolutions have been employed to produce LULC maps across scales ranging from local to global. Coarse-resolution data include Moderate Resolution Imaging Spectroradiometer (MODIS, 250 m to 1 km) and Advanced Very High Resolution Radiometer (AVHRR, 1 km) images [15]. Moderate-resolution data include Landsat (30 m) [16] and Sentinel-2 (10 and 20 m) images [17]. High-resolution data include WorldView-3 (0.3 and 1.2 m) [18] and IKONOS (1 m) [19] images. Although higher spatial resolutions provide more detail, moderate-resolution data, such as the 10 m bands of Sentinel-2, are commonly used for city-scale mapping because they capture sufficient detail and are freely available, which facilitates sustainable urban development and environmental studies. These data can contribute to the generation of accurate, high-quality maps, which are critical inputs for environmental and climatic models.
Various machine learning (ML) algorithms have been adopted to improve LULC classification and handle complex patterns, including random forest (RF), support vector machine (SVM), classification and regression trees (CART), and K-nearest neighbors (KNN) [
20,
21,
22,
23]. These algorithms have been used in both supervised and unsupervised classification settings, across different spatial and temporal scales, to extract essential information from surface features. However, their performance depends on the characteristics of satellite imagery data, computational processing, and environmental and climatic factors. Therefore, the performance of ML models in LULC mapping must be comprehensively evaluated to better understand their behavior and improve estimation accuracy across different applications [
24].
Recent advances in remote sensing have led to increasing interest in deep-learning approaches for urban land cover classification, such as convolutional neural networks (CNNs) [25,26], recurrent neural networks and temporal models [27,28], and Siamese and metric-learning networks [29,30]. These frameworks have demonstrated strong performance in extracting spatial and temporal information for distinguishing spectral patterns in complex areas. In addition, transfer learning and domain adaptation techniques have been used to improve model generalization across different environmental conditions and geographical regions [31,32,33]. However, deep-learning approaches typically require larger datasets, greater computational resources, and extensive model tuning, which can limit their applicability to local-scale regions. Consequently, traditional ML algorithms remain widely used for LULC mapping, owing to their applicability, interpretability, computational efficiency, and ability to achieve comparable performance when different feature sets are implemented. Thus, this study focuses on assessing the generalization capability of four ML algorithms (RF, SVM, CART, and KNN) across multiple urban arid environments using different feature sets.
Advances in remote-sensing and geographic information system technologies have enabled the processing of large datasets with improved computational efficiency. For example, Google Earth Engine (GEE), a geospatial online programming platform, has been used across various domains, such as urban development [
4,
34,
35], forest monitoring [
36,
37,
38], identification of burned areas [
39,
40,
41], natural disaster assessment [
42,
43,
44], and land surface temperature evaluations [
2,
45,
46]. GEE offers free access to different types of data, including remotely sensed data [
42], as well as coding, processing, and instant data-visualization tools, rendering it superior to various other online sources [
47].
Against this backdrop, LULC dynamics across different environmental conditions, especially in arid and semi-arid contexts, must be examined using classification techniques and remotely sensed data [48]. Numerous studies have examined the potential of various ML models in LULC mapping across different regions, including Brazil, Indonesia, Australia, China, India, Ethiopia, Sweden, Canada, and the United States of America [6,7,9,17,49,50,51,52,53]. However, few studies have focused on the use of supervised classification techniques for arid and semi-arid environments. Notable examples include studies on the Dengkou Oasis, China [7]; the Urmia Lake basin, Iran [54]; and Botswana [8].
Although previous studies have evaluated ML performance and provided valuable insights, research on LULC generalization across spatial and temporal domains remains limited. Notably, generalization performance may degrade, owing to challenges in modeling uncertainty across different sources of remotely sensed data [
55,
56]. Relevant studies have mainly focused on crop and wetland mapping across diverse climates [
57,
58,
59,
60]. For example, Cai et al. [
39] attempted to enhance generalization accuracy by integrating segmentation and spectral data. Shafizadeh-Moghadam et al. [
37] and Shibuya et al. [
40] examined algorithm performance for temporal and spatial generalization, illustrating the strengths and limitations of different ML techniques. In the context of arid environments, Halmy and Gessler [
38], Weng et al. [
41], and Ali and Johnson [
42] explored generalization challenges and the corresponding influence of data and study design. This literature review identifies a critical knowledge gap, i.e., limited direct comparisons of ML techniques in terms of their generalization performance for LULC mapping, specifically using Sentinel-2 imagery across seasons in arid and semi-arid regions.
To address this gap, this study evaluates the generalization performance of LULC classification in arid and semi-arid cities using ML algorithms across five feature sets and two seasons (summer and winter). A novel approach, named exclude-one-city-out (EOCO), is introduced to assess the generalization strength and robustness of ML algorithms: the models are trained on data from three cities and tested on a fourth, unseen city using supervised classification. Although several ML- and deep-learning-based studies have used similar approaches, such as leave-one-region-out or cross-domain validation strategies [61,62,63], the proposed EOCO framework is novel in several respects. The models are trained on arid-region cities that vary in urban land cover, spectral response, topography, and environmental conditions, while an entire city is excluded for testing. This creates a strict spatial generalization scenario in which models must transfer knowledge from the training cities to a completely unseen city. Moreover, the integration of five feature sets (built from spectral features, spectral indices, texture, topographic variables, and their combinations) helps identify the feature set that most effectively improves LULC classification under the spectral confusion observed in arid environments.
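The EOCO protocol described above can be sketched as a plain enumeration of hold-out splits. The following is a minimal illustration, not the study's implementation: the function names `eoco_splits` and `experiment_grid` are hypothetical, and the feature-set labels are taken from the names used later in the Discussion.

```python
from itertools import product

CITIES = ["Riyadh", "Madinah", "Jeddah", "Dammam"]
MODELS = ["RF", "SVM", "CART", "KNN"]
FEATURE_SETS = [
    "Spectral",
    "Spectral_Indices",
    "Spectral_Texture",
    "Spectral_Topography",
    "All_Features",
]
SEASONS = ["summer", "winter"]


def eoco_splits(cities):
    """Yield (train_cities, test_city), holding out each city exactly once."""
    for held_out in cities:
        train = [city for city in cities if city != held_out]
        yield train, held_out


def experiment_grid():
    """Enumerate every (split, model, feature set, season) combination."""
    runs = []
    for (train, test), model, features, season in product(
        eoco_splits(CITIES), MODELS, FEATURE_SETS, SEASONS
    ):
        runs.append(
            {
                "train": train,
                "test": test,
                "model": model,
                "features": features,
                "season": season,
            }
        )
    return runs


splits = list(eoco_splits(CITIES))
runs = experiment_grid()

for train, test in splits:
    print(f"train on {train} -> test on {test}")
print(f"total runs: {len(runs)}")
```

Because the held-out city contributes no samples to training, each split measures transfer to an unseen city rather than within-city interpolation.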
Four densely populated cities in Saudi Arabia that have experienced rapid urban growth are selected: Riyadh, Madinah, Jeddah, and Dammam. Sentinel-2 surface reflectance data from the visible, near-infrared (NIR), and shortwave-infrared (SWIR) bands (Blue, Green, Red, NIR, SWIR1, and SWIR2) are used. Texture variables (based on the gray level co-occurrence matrix, GLCM), spectral indices (modified normalized difference water index, MNDWI; normalized difference built-up index, NDBI; normalized difference vegetation index, NDVI; and soil-adjusted vegetation index, SAVI), and topographic data (elevation and slope) are organized into five feature sets. The following research questions are addressed in this work: (1) How effectively do ML models generalize across spatial and seasonal contexts when trained on multiple cities and tested on an unseen city? (2) Which ML algorithm demonstrates consistently high performance across the five feature sets in summer and winter? (3) Which LULC classes are commonly misclassified across cities, models, feature sets, and seasons in arid and semi-arid environments? (4) How does environmental heterogeneity influence model performance across seasons and the four selected cities? The findings of this work are expected to provide valuable insights for the scientific community and policymakers focused on both urban and environmental monitoring.
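For reference, the four spectral indices named above can be computed per pixel from surface reflectance as in the sketch below. It uses the standard textbook definitions; the soil adjustment factor of 0.5 for SAVI is the commonly used default and is an assumption here, as are the function names and the example reflectance values.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def spectral_indices(green, red, nir, swir1, soil_factor=0.5):
    """Compute NDVI, NDBI, MNDWI, and SAVI from reflectance values in [0, 1]."""
    savi_denominator = nir + red + soil_factor
    return {
        "NDVI": normalized_difference(nir, red),      # vegetation vigor
        "NDBI": normalized_difference(swir1, nir),    # built-up surfaces
        "MNDWI": normalized_difference(green, swir1), # open water
        "SAVI": (nir - red) * (1 + soil_factor) / savi_denominator
        if savi_denominator != 0
        else 0.0,
    }


# Hypothetical reflectance values for an irrigated vegetation pixel.
vegetation_pixel = spectral_indices(green=0.08, red=0.05, nir=0.45, swir1=0.20)
print(vegetation_pixel)
```

For this example pixel, NDVI and SAVI are positive while NDBI and MNDWI are negative, which is the pattern expected for vegetated surfaces.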
The remainder of this paper is organized as follows. Section 2 describes the study area, EOCO approach, data collection, and preprocessing. Section 3 outlines training data and classification, generalization accuracy assessment across cities, and statistical analysis. Section 4 discusses the model performance metrics, spatial generalization, seasonal performance, and feature importance. The implications of the results are presented in Section 5, and Section 6 presents the concluding remarks.
5. Discussion
A novel EOCO approach was used to perform a multidimensional comparative analysis of the generalization and performance of ML algorithms under different feature sets and seasonal variation. The four study areas (Riyadh, Madinah, Jeddah, and Dammam) are characterized by arid and semi-arid environments, which makes it challenging to assess performance across diverse input feature sets. The EOCO approach involves model training on three cities and testing on an unseen city. The analysis was performed for each held-out city, four ML algorithms (RF, SVM, CART, and KNN), five feature sets (Spectral, Spectral_Indices, Spectral_Texture, Spectral_Topography, and All_Features), and two seasons (summer and winter). The results show that the RF algorithm consistently achieved the highest and most stable accuracy across all five feature sets, especially with the Spectral_Texture and All_Features sets. As an ensemble learner, RF is robust against spectral variations. The Spectral_Texture set achieved the best performance among feature sets, highlighting the importance of texture measures such as contrast and homogeneity across all four classes: urban areas, vegetation cover, barren lands, and water bodies. These results help clarify the contribution of each feature to the generalization process in arid environments.
The superior performance of the Spectral_Texture set can be attributed to the characteristics of urban arid environments, where similar spectral responses commonly occur among bare soil, exposed ground, and urban areas. Therefore, spectral bands alone may be insufficient to distinguish among urban land cover classes. In contrast, texture features provide ML models with spatial context, describing the arrangement, heterogeneity, and structure of the neighborhood around each pixel. Urban areas are typically characterized by a complex spatial structure, including buildings, roads, open spaces, and mixed land cover, which generates distinct texture information despite similar spectral responses. Thus, by combining texture information with spectral bands, the ML classifier can capture both spectral and spatial information of the urban landscape, leading to improved class separability and classification accuracy. The consistent performance of the Spectral_Texture feature set across the four arid cities suggests that texture information is robustly and transferably important in urban arid environments.
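The role of texture can be illustrated with a small GLCM computation. This is a simplified sketch rather than the study's processing chain: the quantization to four gray levels, the single horizontal offset, the 4 × 4 window, and the two toy patches are all assumptions chosen to keep the example readable.

```python
def glcm(patch, levels, dx=1, dy=0):
    """Build a symmetric, normalized gray level co-occurrence matrix."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(patch), len(patch[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = patch[r][c], patch[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count each pair in both directions
    total = sum(sum(row) for row in counts)
    return [[value / total for value in row] for row in counts]


def contrast(p):
    """Weight co-occurrences by squared gray-level difference."""
    return sum(p[i][j] * (i - j) ** 2 for i in range(len(p)) for j in range(len(p)))


def homogeneity(p):
    """Weight co-occurrences by closeness to the matrix diagonal."""
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(len(p)) for j in range(len(p)))


# Toy patches with the same mean brightness range but different structure.
smooth_patch = [[2, 2, 2, 2] for _ in range(4)]   # uniform, barren-like
striped_patch = [[0, 3, 0, 3] for _ in range(4)]  # alternating, built-up-like

smooth = glcm(smooth_patch, levels=4)
striped = glcm(striped_patch, levels=4)

print("smooth :", contrast(smooth), homogeneity(smooth))
print("striped:", contrast(striped), homogeneity(striped))
```

The uniform patch yields zero contrast and maximal homogeneity, whereas the alternating patch yields high contrast and low homogeneity, which is the kind of separation that spectral values alone do not provide.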
Despite promising results, misclassification in arid and semi-arid environments during summer and winter remains a critical challenge. Analysis of Riyadh using the RF model (the most consistent) across the five feature sets indicates that certain classes are misclassified (Table 8). The most frequent misclassifications occur between urban and barren, barren and urban, and water and urban classes. The confusion between urban and barren land is attributable to similarities in spectral responses. The confusion between water and urban classes may likewise be ascribed to spectral similarity: the darker surfaces of urban areas may resemble water in terms of reflectance, and shadows of built-up areas also appear dark, resulting in their classification as water. In addition, mixed pixels often contain multiple classes, such as water and urban areas, and atmospheric effects, such as dust, haze, and humidity, may distort results. In winter, misclassifications are noted between water and barren, water and urban, and urban and barren classes. The confusion between water and barren classes is particularly severe in winter. This is because wet soil or clay often exhibits reflectance similar to that of shallow water; salt flats and dry, dark beds display reflectance similar to water; and the low reflectance of dark barren areas, such as dark rock or sand, may be misidentified as water in some bands.
Table A2 in Appendix A presents the producer’s and user’s accuracies across feature sets and seasons for Riyadh using RF. The Spectral_Texture and All_Features sets provided higher accuracies than the other feature sets, and the accuracy values in summer are higher than those in winter.
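As a reminder of how these per-class metrics relate to overall accuracy, the sketch below derives them from a confusion matrix. The counts are hypothetical and are not taken from Table A2; the convention that rows are reference labels and columns are predictions is also an assumption of this example.

```python
CLASSES = ["urban", "vegetation", "barren", "water"]

# Hypothetical counts: rows are reference labels, columns are predictions.
CONFUSION = [
    [80, 2, 16, 2],
    [3, 45, 2, 0],
    [20, 1, 178, 1],
    [4, 0, 2, 14],
]


def accuracy_report(matrix, classes):
    """Return overall, producer's (per row), and user's (per column) accuracy."""
    n = len(classes)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    report = {"overall": correct / total, "producer": {}, "user": {}}
    for i, name in enumerate(classes):
        row_total = sum(matrix[i])                       # reference samples
        col_total = sum(matrix[r][i] for r in range(n))  # predicted samples
        report["producer"][name] = matrix[i][i] / row_total if row_total else 0.0
        report["user"][name] = matrix[i][i] / col_total if col_total else 0.0
    return report


report = accuracy_report(CONFUSION, CLASSES)
print(f"overall accuracy: {report['overall']:.3f}")
for name in CLASSES:
    print(f"{name:10s} PA={report['producer'][name]:.3f} UA={report['user'][name]:.3f}")
```

In this toy matrix the overall accuracy exceeds the producer's accuracy of the small water class, which shows why per-class metrics are reported alongside the overall figure when classes are imbalanced.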
Figure 7 shows the stability analysis (summer versus winter) across the four classes for Riyadh using RF. Results for the barren and vegetation classes were superior to those of the other classes during both summer and winter, while the urban class showed slightly better performance in winter than in summer. These patterns highlight the importance of carefully considering the water class, particularly in winter. The seasonal variations observed between summer and winter are unlikely to be driven by vegetation phenology, as the vegetation cover in arid and semi-arid lands is sparse and often maintained by irrigation. Instead, these variations may be attributable to differences in solar illumination geometry and solar zenith angle, which influence the surface reflectance and the extent of building shadows in urban areas. Additionally, seasonal atmospheric conditions, including dust and sand aerosols, which commonly occur in the Arabian Peninsula, can influence image quality and spectral response despite atmospheric correction. Variations in irrigation management and vegetation health within the urban areas may also contribute to the seasonal spectral differences. These factors may explain the seasonal variations identified by the Wilcoxon signed-rank test.
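For readers unfamiliar with the Wilcoxon signed-rank test, the following sketch shows how paired summer and winter scores could be compared. It is a small exact implementation for illustration only: the accuracy values are hypothetical, how the study paired its observations is not assumed, and in practice a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be used.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (statistic, exact two-sided p-value) for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    statistic = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** len(ranks)


# Hypothetical overall accuracies (%) for six paired summer/winter runs.
summer = [91, 88, 93, 85, 90, 87]
winter = [89, 87, 89, 80, 87, 81]
statistic, p_value = wilcoxon_signed_rank(summer, winter)
print(statistic, p_value)
```

The exhaustive enumeration of sign patterns is only practical for small samples, but it makes the null distribution explicit.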
The findings provide valuable insights into the role of ML algorithms across five feature sets and two seasons (summer and winter). Achieving high accuracy and high-quality mapping, especially in challenging environments such as arid regions, is crucial for supporting studies on sustainable urban growth and climatic monitoring, especially against the backdrop of rapid changes in the Earth’s surface and climate. Previous studies have investigated the generalization of LULC classes using different datasets, methods, and ML models. However, they have largely overlooked how well the ML models themselves generalize, which is essential for uncovering a model’s strengths and weaknesses when using spatial and temporal variables [55,57,60]. For example, Shafizadeh-Moghadam [60] assessed spatial and temporal generalization in urban growth modeling. The results confirmed that RF achieved high calibration accuracy, but its performance degraded during validation, whereas SVM showed the opposite trend. In contrast, this study demonstrates that RF generalizes best, with SVM displaying moderate performance (Table 4). Another study, conducted in the Cerrado and Amazon biomes in Brazil using MODIS satellite images, assessed temporal generalization using RF, a CNN (TempCNN), and a lightweight temporal attention encoder (L-TAE). RF achieved higher accuracy and consistently performed better across the agricultural land cover classes [57]. The excellent performance of RF is thus evident in both densely vegetated areas (Brazilian Cerrado and Amazon biomes) and arid environments (Saudi Arabia) characterized by desert-dominated landscapes with sparse vegetation. Huang et al. [55] investigated LULC classification in a subtropical karst environment using remote sensing. RF and SVM performed and generalized well under specific conditions, especially under limited sample size and data availability. The results indicate their robustness, aligning with the findings of the present study.
At the class level, despite advances in ML models for LULC classification, misclassification errors caused by spectral variations persist [84]. In this study, the most prevalent error was the misclassification of barren land as urban areas, especially in regions featuring rocky and sandy mountains. This issue was also reported by Aljaddani et al. [4], who encountered it while processing CCDC time series in a Landsat time-series analysis of arid and semi-arid environments. Among topographic effects, mountainous areas with shadows, such as those in Madinah, were misclassified as water. This is because shadowed regions in satellite images appear similar to water in terms of spectral signature. This issue was also reported by Huang et al. [55]. Moreover, the sparse vegetation in arid regions, with a low spectral signature, contributes to misclassification errors [54]. Additionally, small and limited LULC classes are often underrepresented. Classes such as vegetation and water are then difficult to distinguish, particularly in areas with spectral similarity, as also observed in the confusion between urban and barren classes. This represents a common challenge in multi-class classification with imbalanced LULC distributions. These challenges must be addressed in future research to achieve higher accuracy and optimize spatial and temporal generalization performance.
The spatial resolution of Sentinel-2 data for the visible, NIR, SWIR1, and SWIR2 bands (10 m) is well suited for training data collection. However, several limitations restrict the scope of this work. First, the sample size was limited to four cities (Riyadh, Madinah, Jeddah, and Dammam), which hindered comprehensive analysis of spatial, seasonal, and statistical generalization. Although no statistically significant difference was noted in this work, expanding the analysis to more diverse areas could enhance the assessment. Second, the imbalance between the four classes affected performance. Barren land was the dominant class, followed by urban areas, while the vegetation and water classes were limited, especially in inland cities such as Riyadh and Madinah. The training set is also imbalanced, especially for the water class, which is substantially underrepresented in Madinah compared with the other LULC classes. The limited availability of water samples may have constrained the ability of the EOCO-trained models to capture the full spectral variability of water bodies, resulting in increased confusion with neighboring LULC classes and reduced class-specific performance. Even though the overall accuracy remains high, this metric is dominated by the more prevalent LULC classes (Barren, Urban, and Vegetation) and may therefore mask the difficulty of accurately identifying the minority classes. Thus, the reported accuracy should be interpreted in light of this imbalance.
Third, the seasonal analysis was limited to one year (2025), which restricted the assessment of long-term seasonal generalization. Incorporating data from multiple years would help improve interpretation. Fourth, although point-based training samples provide good results, they may not sufficiently represent mixed pixels and class heterogeneity. Moreover, such samples are highly sensitive to geolocation errors and limit the diversity of pixels per object. In addition, the Spectral_Topography feature set incorporated both elevation and slope derived from the NASA SRTM DEM, which has an original spatial resolution of 30 m. To ensure consistency with Sentinel-2 imagery, the DEM was resampled to 10 m for feature extraction. Although this processing facilitated the integration of topographic and spectral features, it did not enhance the spatial detail of the DEM. Consequently, finer-scale topographic details may not have been adequately captured. The use of higher-resolution elevation datasets such as the JAXA ALOS World 3D (AW3D30) could help assess the contribution of topographic information to LULC classification. Future work should address these limitations by increasing the sample size to cover more arid and semi-arid cities worldwide, using deep-learning models, and expanding the temporal window to include multiyear data. These improvements would provide a more comprehensive understanding of generalization and model performance and also enable meaningful statistical analysis of spatial and seasonal conditions. Increasing the number of training samples and addressing class imbalance can help mitigate misclassification.
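The point that resampling a 30 m DEM to 10 m adds no spatial detail can be seen in a small sketch. Nearest-neighbor resampling and the toy elevation values are assumptions made for illustration; the resampling method used in the study is not specified here.

```python
def upsample_nearest(grid, factor):
    """Replicate every cell into a factor x factor block."""
    upsampled = []
    for row in grid:
        expanded = [value for value in row for _ in range(factor)]
        upsampled.extend(list(expanded) for _ in range(factor))
    return upsampled


# Hypothetical 2 x 2 elevation grid (meters) at 30 m spacing.
dem_30m = [
    [600, 612],
    [605, 630],
]

# A factor of 3 turns each 30 m cell into nine 10 m cells.
dem_10m = upsample_nearest(dem_30m, 3)

values_before = {value for row in dem_30m for value in row}
values_after = {value for row in dem_10m for value in row}

for row in dem_10m:
    print(row)
print("new elevation values introduced:", values_after - values_before)
```

The output grid has nine times as many cells but exactly the same set of elevation values. Interpolating resamplers such as bilinear would smooth the transitions between cells, but they likewise introduce no new measurements.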
Overall, this study provides meaningful, informative results, highlighting that the RF model using the Spectral_Texture feature set achieves high-quality mapping and serves as a first-tier macro-monitoring instrument that can facilitate urban land cover monitoring, natural resource management, environmental monitoring, and disaster management in arid regions. Improved accuracy and mapping quality will benefit decision-making in government and private sectors, contributing to the realization of urban Sustainable Development Goals.