1. Introduction
Accurate Land Use and Land Cover (LULC) information is essential for applications ranging from agricultural monitoring and forest management to urban planning and climate change adaptation [1,2,3,4,5,6,7,8,9,10,11,12,13]. Over the past decade, freely available medium-resolution satellite data, particularly from the Sentinel-2 constellation [14], have transformed the ability to generate such information at regional and global scales. At the same time, advances in machine learning—especially convolutional neural networks (CNNs) [15] and, more recently, foundation models [6,16]—have driven rapid improvements in classification accuracy.
Benchmark datasets such as EuroSAT [17] have played a central role in these developments. By providing large volumes of labeled Sentinel-2 image patches, they have enabled reproducible experiments and accelerated methodological innovation. Reported accuracies on EuroSAT often exceed 98%, creating an impression that LULC classification with Sentinel-2 is largely a solved problem. However, this perception contrasts sharply with the reality observed in operational systems, where accuracies typically fall in the 75–85% range [18,19,20]. This discrepancy between academic benchmarks and real-world performance raises important methodological and practical questions.
Several factors contribute to this gap. Methodological issues, including spatial autocorrelation in validation design [21], can lead to substantial overestimation of accuracy. Domain adaptation remains a major challenge: models trained in one geographic region or time period often fail to generalize when applied elsewhere [22,23]. The availability and diversity of labeled data further complicate the picture, with operational systems requiring vast amounts of annotated samples [24]. Even when multi-spectral bands are exploited [25], the gains are modest relative to the additional computational costs.
Recent developments in foundation models represent a significant breakthrough for geospatial artificial intelligence. The Prithvi model, developed through a NASA-IBM collaboration and trained on Harmonized Landsat Sentinel-2 (HLS) data, demonstrates superior performance across diverse geographic regions [26]. Similarly, the SkySense family of models [6,16] incorporates multi-modal learning and shows particular promise for operational deployment. The Major TOM dataset released by ESA Φ-lab [27] provides the largest ML-ready Sentinel-2 dataset to date, enabling more robust foundation model development.
Recognizing these challenges, this systematic review analyzes recent literature on Sentinel-2-based LULC classification, with emphasis on studies published between 2020 and 2025. Following PRISMA guidelines, we identify the methodological pitfalls that inflate benchmark performance, assess the factors limiting transferability to operational contexts, and evaluate current best practices for improving reliability. In doing so, the review aims to provide both a synthesis of the state of the art and practical recommendations for bridging the gap between research results and real-world applications.
2. Systematic Review Methodology
This systematic review examines recent advances and challenges in Land Use and Land Cover (LULC) classification using Sentinel-2 imagery, with particular emphasis on the performance gap between benchmark datasets and operational systems. Rather than attempting complete coverage, this review focuses on identifying key methodological issues, performance patterns, and best practices that have emerged in the recent literature, following rigorous systematic review methodology.
2.1. Literature Coverage
The review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to ensure transparency and reproducibility. We conducted systematic searches in three major academic databases: Web of Science Core Collection, Scopus and IEEE Xplore, covering the period from January 2020 to present. The search strategy was developed iteratively and combined terms related to: (1) Sentinel-2 satellite imagery; (2) land use and land cover classification; (3) machine learning and deep learning methodologies; (4) accuracy assessment and validation procedures; (5) operational deployment challenges.
The specific search terms included: “Sentinel-2” or “Sentinel 2”; “land use” or “land cover” or “LULC” or “land classification”; “classification” or “mapping” or “monitoring”; and “machine learning” or “deep learning” or “neural network” or “accuracy” or “validation”. Additional references were identified through citation chaining from key papers and review articles in the field.
The selection prioritized studies that: report quantitative results on LULC classification using Sentinel-2 imagery with clear validation methodologies; include performance metrics such as overall accuracy, F1-score, or kappa coefficient; apply machine learning or deep learning approaches in either benchmark or operational contexts; provide transparent reporting of both successes and limitations; and represent diverse geographic regions and application domains. A summary of the geographic coverage of the selected studies is provided in Appendix A (Table A3).
Foundational works published prior to 2020 are also cited to provide essential theoretical and methodological background, including the Sentinel-2 mission description, the EuroSAT benchmark dataset, and important contributions on accuracy assessment and domain adaptation.
2.2. Inclusion and Exclusion Criteria
Studies were included if they met the following criteria:
Peer-reviewed articles published in English between January 2020 and present;
Primary focus on LULC classification using Sentinel-2 imagery as the main or significant data source;
Clear reporting of quantitative performance metrics (overall accuracy, producer’s accuracy, user’s accuracy, F1-score, or kappa coefficient);
Detailed methodology including data preprocessing, model training, and validation procedures;
Application of machine learning, deep learning, or advanced statistical approaches for classification;
Studies addressing either benchmark performance, operational deployment, or comparison between different methodological approaches.
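Because the metrics named in these criteria recur throughout the review, a minimal sketch of how overall accuracy, producer's and user's accuracy, and Cohen's kappa are derived from an error (confusion) matrix may be helpful; the 2 × 2 matrix below is an invented toy example, not data from any reviewed study.

```python
import numpy as np

def accuracy_metrics(cm):
    """Standard LULC accuracy metrics from an error (confusion) matrix.

    cm[i, j] = number of reference-class-i samples mapped as class j.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    overall = np.trace(cm) / total                  # overall accuracy
    producers = np.diag(cm) / cm.sum(axis=1)        # producer's accuracy (recall)
    users = np.diag(cm) / cm.sum(axis=0)            # user's accuracy (precision)
    chance = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total ** 2
    kappa = (overall - chance) / (1 - chance)       # agreement beyond chance
    return overall, producers, users, kappa

# Toy 2-class example (forest vs non-forest); the counts are illustrative only.
oa, pa, ua, kappa = accuracy_metrics([[90, 10],
                                      [20, 80]])
print(round(oa, 2), round(kappa, 2))   # 0.85 0.7
```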
Exclusion criteria comprised:
Conference abstracts, book chapters, technical reports, and other grey literature without peer review;
Studies using only vegetation indices (NDVI, EVI, etc.) as proxies for LULC without actual classification procedures;
Change detection studies focused solely on temporal analysis without classification accuracy assessment;
Reviews, meta-analyses, and purely methodological papers without empirical classification results;
Studies using Sentinel-2 data for applications other than LULC classification (e.g., crop yield estimation, biodiversity assessment without land cover mapping).
2.3. Study Selection and Quality Assessment
The authors screened titles and abstracts using the predetermined inclusion and exclusion criteria.
Data extraction captured: study characteristics (geographic region, temporal coverage, spatial scale, and datasets used), methodological details (algorithms employed, preprocessing steps, feature selection, validation approach, and spatial resolution), performance metrics (accuracy measures, statistical significance testing, and confidence intervals), limitations reported by authors, and infrastructure requirements for operational deployment. Detailed information for all 89 studies is compiled in Appendix A.
Quality assessment evaluated: study design appropriateness, sampling methodology and representativeness, validation procedures and spatial considerations, statistical analysis rigor, reporting transparency and reproducibility, and acknowledgment of limitations and uncertainties. Studies were categorized as high, moderate, or low quality based on these predetermined criteria.
2.4. Synthesis Approach and Meta-Analysis
This review employs a thematic synthesis approach, organizing findings around key themes that emerged from the literature: (i) the benchmark-to-operations performance gap; (ii) methodological pitfalls inflating accuracy claims; (iii) domain adaptation challenges; (iv) training data requirements and sample efficiency; (v) multi-spectral versus RGB performance analysis; (vi) foundation models and emerging approaches; (vii) operational deployment considerations. Where possible, we highlight consensus findings across multiple studies and identify areas of ongoing debate or uncertainty in the field. Random-effects meta-analysis was conducted where sufficient homogeneous data were available, with heterogeneity assessed using the I² statistic and visualized through forest plots. The geographic and methodological distribution of studies is detailed in Appendix A.
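For readers less familiar with the heterogeneity assessment used here, the following sketch shows how Cochran's Q and the I² statistic are obtained from study-level effect estimates under inverse-variance weighting; the input accuracies and variances are invented for illustration.

```python
import numpy as np

def heterogeneity(effects, variances):
    """Cochran's Q and the I^2 heterogeneity statistic for study-level
    effect estimates (e.g., reported overall accuracies) with known
    within-study variances."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)    # inverse-variance weights
    pooled = (w * y).sum() / w.sum()                # fixed-effect pooled estimate
    q = (w * (y - pooled) ** 2).sum()               # Cochran's Q
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0   # I^2 in [0, 1]
    return q, i2

# Four hypothetical studies: accuracies with equal within-study variance.
q, i2 = heterogeneity([0.80, 0.85, 0.78, 0.90], [0.001] * 4)
print(round(i2, 2))   # 0.65 -> substantial heterogeneity
```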
3. Current State of Real-World LULC Classification Performance
3.1. Achievable Accuracy Benchmarks
Recent literature (2020–2025) consistently indicates that operational LULC classification systems based on Sentinel-2 imagery typically achieve overall accuracies between 75% and 85%, a level that remains well below the >98% often reported in benchmark settings [4,28,29]. Performance varies across application domains and geographic regions, reflecting differences in landscape complexity, data availability, and methodological design (Figure 1). A detailed overview of the geographic coverage of the studies included in this review is provided in Appendix A, Table A3.
Forests generally represent one of the strongest classes (Appendix A, Table A1), with accuracies commonly reaching 85–95% in dense forest areas but dropping to 65–80% in mixed or fragmented forest conditions [30]. Urban land cover tends to be more consistent, with built-up detection usually reported in the 88–96% range. Agricultural monitoring shows moderate performance: crop type classification typically falls between 83% and 91%, but results vary depending on crop diversity and field size [25,31].
Individual case studies highlight both the potential and the limitations of current approaches. For example, a recent study in the Congo Basin achieved 93.8% accuracy with a Random Forest classifier for forest mapping, supported by extensive field validation [32]. Similarly, multi-temporal ensemble approaches have reported exceptional results, with accuracies above 98% when using extended time series [33,34]. Such outcomes, however, remain the exception rather than the norm, typically requiring extensive local calibration and optimal conditions.
3.2. Application-Specific Performance Patterns
Wetlands remain the most challenging application domain across all studies reviewed. Boundary delineation typically achieves only 70–90% accuracy, while wetland type classification rarely exceeds 65–85% [35]. These limitations are linked to the inherent spectral similarity of wetland classes and seasonal variations in water levels, which reduce classification separability. Recent advances in multi-temporal analysis have shown promise, but improvements remain modest (5–10% accuracy gains) and come with increased computational complexity. These application-specific patterns are influenced by the geographic distribution of the studies reviewed, which is summarized in Appendix A, Table A3.
By contrast, temporal analysis has proven to be an effective strategy for dynamic land cover types. Multi-temporal use of Sentinel-2 imagery often delivers 10–15% improvements over single-date classifications in agricultural applications, although the magnitude of this gain depends strongly on crop type and phenological stage [36,37]. The optimal temporal sampling frequency varies by application, with weekly to bi-weekly observations generally providing the best balance between information content and data processing complexity.
Foundation models are beginning to show promise for addressing some of these application-specific challenges. The Prithvi model demonstrates more consistent performance across different land cover types, achieving 82–87% operational accuracy across diverse geographic regions [38]. However, the computational requirements and infrastructure dependencies of these approaches may limit their immediate operational deployment, particularly in resource-constrained environments.
4. The EuroSAT Benchmark Paradox
4.1. Exceptional Benchmark Performance
The EuroSAT dataset continues to demonstrate outstanding results in benchmark experiments. Recent evaluations using Vision Transformer models have reported accuracies close to 99%, confirming that this dataset provides conditions highly favorable to classification [39]. With 27,000 standardized image patches of 64 × 64 pixels, EuroSAT represents an idealized environment in which class boundaries are clear and consistent, and training and validation samples are drawn from the same underlying distribution.
State-of-the-art deep learning architectures—CNNs, ResNets, DenseNets, and Transformers—routinely achieve above 98% accuracy on EuroSAT [40,41]. The progression of reported performance across these architectures demonstrates the rapid methodological advances of the last five years (Figure 2). While these results showcase the capability of modern machine learning algorithms under controlled conditions, they also create expectations that are rarely matched in operational contexts, where performance typically remains in the 75–85% range.
Recent foundation model evaluations on EuroSAT have pushed benchmark performance even higher. SkySense achieves 99.2% accuracy on the EuroSAT test set, while specialized remote sensing transformers report similar performance levels [16]. However, these exceptional benchmark results do not translate proportionally to operational environments, highlighting the fundamental differences between controlled benchmark conditions and real-world deployment scenarios.
4.2. Cross-Domain Transferability and Foundation Model Advances
Transferring models from benchmark datasets to real-world applications exposes major domain adaptation challenges. Models trained on EuroSAT or regionally constrained data often lose 15–25% accuracy when applied to new climatic zones or continents, primarily due to spectral and phenological variability [22,23]. Temporal transfer similarly results in 10–20% degradation between seasons [36].
Recent foundation models such as Prithvi [38] and SkySense [16] demonstrate significant progress in cross-domain transferability. Trained on globally distributed Harmonized Landsat–Sentinel (HLS) datasets, they achieve operational accuracies of 82–87% with reduced regional bias. These models integrate multi-modal learning—combining Sentinel-1 SAR and Sentinel-2 optical data—to maintain robust performance under diverse environmental and atmospheric conditions.
Despite higher computational requirements, foundation models reduce dependence on region-specific training data and mitigate spatial overfitting, offering a scalable pathway toward globally consistent LULC classification.
5. Methodological Pitfalls Inflating Accuracy Claims
5.1. Spatial Autocorrelation in Cross-Validation
Spatial autocorrelation has been identified as one of the most critical methodological flaws in current LULC validation practice. As comprehensively demonstrated in [21], random cross-validation often produces inflated accuracies of up to 25–28% because spatially correlated pixels are included in both training and validation sets. This violates the fundamental assumption of independence in statistical inference and systematically leads to overly optimistic performance claims [42].
Figure 3 illustrates the magnitude of these differences, with spatially explicit validation consistently yielding lower and more realistic estimates of classification accuracy.
The magnitude of this effect varies with spatial resolution, landscape heterogeneity, and the specific machine learning algorithm employed. Studies using high-resolution data (≤10 m) in homogeneous landscapes show the largest inflation effects, with some cases reporting accuracy overestimation exceeding 30%. Deep learning models appear particularly susceptible to this bias due to their ability to memorize spatial patterns, leading to apparent performance improvements that do not generalize to independent geographic areas.
Spatially explicit validation consistently yields lower and more realistic estimates of classification accuracy. Recommended approaches include spatial block cross-validation with minimum separation distances of 1–5 km (depending on landscape characteristics), leave-one-location-out validation for multi-site studies, and temporal validation using data from different years or seasons. Despite growing awareness of this issue, our systematic review found that 67% of studies published in 2020–2025 still rely primarily on random validation schemes.
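As an illustration of the recommended practice, spatial block cross-validation can be implemented by assigning samples to grid cells and keeping each cell entirely on one side of every split. The sketch below uses scikit-learn's GroupKFold on synthetic data; the coordinates, features, labels, and 5 km block size are all assumptions for illustration, not values from any reviewed study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic samples: locations in metres, 6 "spectral" features, 4 classes.
rng = np.random.default_rng(0)
xy = rng.uniform(0, 50_000, size=(500, 2))
X = rng.normal(size=(500, 6))
y = rng.integers(0, 4, size=500)

# Assign each sample to a 5 km x 5 km spatial block; blocks, not pixels,
# are then kept together on one side of each train/validation split.
block = 5_000
groups = (xy[:, 0] // block).astype(int) * 1_000 + (xy[:, 1] // block).astype(int)

cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X, y, cv=cv, groups=groups)
print(scores.mean())
```

With genuinely random features as here, the spatially blocked estimate hovers near chance level, which is exactly the honest behavior random splits can mask.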
5.2. Training-Validation Data Contamination
Another recurring weakness in the literature is the reuse of the same dataset for both training and validation, a practice that results in circular evaluation and artificially high accuracies [43]. This contamination can occur through several pathways: direct reuse of preprocessed datasets without proper train-test splits, temporal leakage in multi-temporal studies where adjacent time periods are used for training and validation, and geographic overlap where supposedly independent validation sites are located within or adjacent to training areas.
The quality of reference data further compounds the problem. Studies that rely on field-based validation typically report accuracies 5–15% lower than those using remote or image-based validation [44]. Moreover, annotation quality can be inconsistent even within the same study: research comparing expert versus non-expert labeling of grassland pixels has shown recall agreements as low as 22%, underscoring the challenges of creating reliable ground truth datasets [45].
Recent advances in foundation models may help address some data contamination issues through their pre-training on large, diverse datasets that reduce reliance on locally specific training data. However, these models introduce new challenges related to understanding what geographic regions and land cover types were represented in their training data, potentially leading to subtle forms of data leakage that are difficult to detect and quantify.
5.3. Inadequate Sample Design and Statistical Inference
Sample design also plays a decisive role in the credibility and generalizability of reported results. Many studies rely on small, unbalanced, or convenience-based samples, which prevent meaningful statistical inference about the broader population [46]. Proper design-based inference requires probability sampling with known inclusion probabilities, proportional stratification across land cover classes, and sample sizes calculated using established statistical formulas that account for expected accuracy levels and desired confidence intervals [47].
Numerous studies employ simplified sample size estimation approaches that ignore spatial heterogeneity and class imbalance. Detailed statistical formulations for probability-based sampling and design-based inference are provided in Section 10.2 (‘Design-Based Statistical Inference’), where recommended sample size equations and stratification guidelines are summarized following [47,48].
The importance of adequate sampling is illustrated by studies showing how accuracy estimates stabilize with increasing sample size (Figure 4), emphasizing the critical need for power analysis and appropriate sample size calculation in LULC studies. Inadequate sample sizes not only reduce statistical power but also increase the likelihood of Type I errors, leading to false claims about model performance and the comparative effectiveness of different approaches.
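The most widely used of these sample size formulas is Cochran's, n = z²p(1−p)/d². A minimal sketch follows; the default values (85% anticipated accuracy, ±5% interval at 95% confidence) are illustrative, not prescriptions drawn from the reviewed studies.

```python
import math

def cochran_n(p_expected=0.85, half_width=0.05, z=1.96):
    """Minimum validation sample size for an accuracy estimate:
    n = z^2 * p * (1 - p) / d^2  (Cochran), where p is the anticipated
    overall accuracy and d the desired half-width of the confidence
    interval at the confidence level implied by z."""
    return math.ceil(z ** 2 * p_expected * (1 - p_expected) / half_width ** 2)

print(cochran_n())             # 196 samples for p=0.85, +/-5% at 95% confidence
print(cochran_n(0.5, 0.05))    # 385: worst case (p=0.5) needs the most samples
```

Note that the worst-case assumption p = 0.5 maximizes p(1−p) and therefore the required sample, which is why it is often used when no prior accuracy estimate exists.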
6. Multi-Spectral Versus RGB Performance Analysis
6.1. Quantitative Performance Comparison
Systematic comparisons between RGB-only and multi-spectral Sentinel-2 approaches reveal consistent but modest gains when using the full 13-band dataset. Meta-analysis of 42 studies reporting both RGB and multi-spectral results shows typical improvements in the range of 5–8% overall accuracy, though the contribution of individual bands varies by application domain [49,50]. Detailed spectral band performance by domain can be found in Appendix A (Table A2).
Red-edge bands (B5, B6, B7) have been shown to improve vegetation classification by 4–5%, particularly for crop type discrimination and forest health assessment [25]. Shortwave infrared bands (B11, B12) provide the greatest benefit for urban mapping and water detection, often boosting accuracy by 15–20% in these specific categories [49]. The narrow near-infrared band (B8A) contributes modestly to overall performance but can be crucial for distinguishing certain vegetation types and assessing vegetation stress.
Table 1 summarizes the contributions of spectral bands across major LULC applications.
Table 1. Spectral Band Performance Contributions by Application Domain.
| Application Domain | RGB Accuracy | Spectral Accuracy | Improvement | Critical Bands | Key Studies |
|---|---|---|---|---|---|
| Forest Classification | 80–85% | 85–90% | 5–7% | B5, B6, B7, B8 | [30,32,34,41] |
| Urban Mapping | 88–92% | 92–96% | 4–5% | B11, B12, B8 | [10,13,49] |
| Agricultural Monitoring | 75–80% | 83–91% | 8–11% | B5, B6, B7, B8A | [4,25,33] |
| Water Body Detection | 95–97% | 96–98% | 1–2% | B3, B8, B11 | [35,49,50] |
| Wetland Classification | 65–70% | 70–80% | 5–10% | B5, B8A, B11, B12 | [4,35,51] |
6.2. Computational Trade-Offs and Infrastructure Requirements
While performance gains from multi-spectral approaches are clear, they come at substantial computational cost that must be considered for operational deployment. Processing all 13 Sentinel-2 bands requires 3–4 times more computing time, 2–3 times more storage, and significantly higher RAM compared to RGB-only workflows [52]. These computational demands translate directly into infrastructure cost and processing time constraints that may be prohibitive for real-time applications or resource-limited environments.
By contrast, RGB-based pipelines can reduce data volume by more than 70%, speed up processing by 4–5 times, and cut storage requirements by 60–70% [14]. These efficiencies are particularly important for large-scale operational systems that must process continental or global datasets on regular schedules. The trade-offs become even more significant when considering cloud computing costs, where storage and processing charges scale directly with data volume and computational complexity.
Recent developments in efficient architectures and compression techniques offer some promise for reducing these computational burdens. Studies using optimized band selection strategies report 90–95% of full multi-spectral performance while reducing computational demands by nearly 50% [50]. Foundation models, while computationally intensive during training, may offer more efficient inference once deployed, potentially changing the cost–benefit calculation for operational systems.
6.3. Optimal Band Selection Strategies
Recent studies suggest that optimal performance can be achieved without using the full spectral range of Sentinel-2. For example, six-band combinations (B2, B3, B4, B8, B11, B12) capture 90–95% of the accuracy of full multi-spectral models while reducing computational demands by nearly 50% [50]. This finding has important implications for operational systems where computational efficiency is paramount.
Application-specific strategies further improve efficiency while maintaining performance. Agricultural monitoring consistently benefits from red and infrared bands (B4, B5, B8, B11), which capture the fundamental spectral characteristics of vegetation health and phenology. Urban mapping relies primarily on visible and SWIR bands (B2, B3, B4, B8, B12) to distinguish built materials from natural surfaces. Water body detection achieves excellent results with just three bands (B3, B8, B11), making it highly suitable for efficient operational implementation.
Such targeted approaches provide an optimal balance between accuracy and computational feasibility, making them particularly attractive for operational deployment. The key is matching band selection to specific application requirements rather than defaulting to the full 13-band dataset. This strategy can significantly reduce infrastructure costs while maintaining classification performance that meets operational requirements.
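In a processing pipeline, such band subsetting reduces to slicing the image stack before training. A minimal sketch follows, assuming a 12-band Level-2A stack in the order listed; the band ordering and subset names are assumptions for illustration.

```python
import numpy as np

# Assumed stacking order for a Sentinel-2 L2A product (B10 absent at L2A).
BANDS = ["B1", "B2", "B3", "B4", "B5", "B6", "B7",
         "B8", "B8A", "B9", "B11", "B12"]

# Application-specific subsets discussed above (names are illustrative).
SUBSETS = {
    "general":     ["B2", "B3", "B4", "B8", "B11", "B12"],
    "water":       ["B3", "B8", "B11"],
    "agriculture": ["B4", "B5", "B8", "B11"],
}

def select_bands(cube, names, band_axis=0, subset="general"):
    """Slice a (bands, H, W) array down to a named band subset."""
    idx = [names.index(b) for b in SUBSETS[subset]]
    return np.take(cube, idx, axis=band_axis)

cube = np.zeros((12, 64, 64))                       # placeholder image stack
print(select_bands(cube, BANDS).shape)              # (6, 64, 64)
print(select_bands(cube, BANDS, subset="water").shape)   # (3, 64, 64)
```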
7. Training Data Requirements and Sample Efficiency
7.1. Scale-Dependent Data Requirements
The quantity of training data required for LULC classification depends strongly on the modeling approach and the level of performance sought. Traditional machine learning methods such as Random Forest typically achieve stable results with 100 or more samples per class, while standard deep learning models require 1000–10,000+ samples per class to reach competitive accuracies [53,54]. These requirements scale with the complexity of the classification problem and the diversity of the landscape being mapped.
Operational-scale systems illustrate the true magnitude of these requirements. The ESRI Global Land Cover product was trained using more than 5 billion hand-labeled Sentinel-2 pixels collected from over 20,000 sites worldwide, reflecting the scale necessary for robust global classification [24]. Similarly, Google’s Dynamic World system leverages millions of labeled examples distributed across diverse geographic regions and time periods [19]. These massive datasets highlight the infrastructure and human resource requirements for truly operational LULC mapping systems.
Figure 5 depicts the relationship between training data size and classification performance across different architectures.
The analysis demonstrates the data efficiency of different approaches, with Vision Transformers achieving the highest peak performance (94.2%) but requiring substantial training data (>8000 samples), while ResNet architectures provide optimal balance between accuracy and data requirements. DenseNet121 shows competitive performance with moderate data needs, Basic CNNs plateau early at lower accuracy levels, and Random Forest demonstrates consistent but limited scalability. Error bars represent ±1 standard deviation across five independent training runs. Training data size ranges from 500 to 12,000 samples per class, with evaluation performed on a fixed test set of 2000 samples per land cover class (Forest, Urban, Agricultural, Water, Grassland, Bare Soil).
7.2. Sample Efficiency Variations Across Architectures
Not all models demand the same amount of training data to achieve comparable performance levels. Deeper architectures such as ResNet101 or DenseNet121 display higher sample efficiency, often achieving >90% accuracy with moderate dataset sizes (1000–2000 samples per class). By contrast, shallower networks like VGG16 or basic CNNs degrade significantly when fewer than 1000 samples per class are available, requiring larger datasets to reach similar performance levels [40,55].
Performance gains generally plateau after 2000–5000 well-distributed samples per class, suggesting diminishing returns beyond this threshold. Importantly, the geographic diversity of samples proves more valuable than absolute numbers: multiple studies have demonstrated that 500 globally distributed samples can outperform 2000 regionally clustered ones [56,57]. This finding emphasizes the importance of sampling strategy over simple sample size maximization.
Foundation models represent a significant departure from these traditional data requirements. Models like Prithvi and SkySense, pre-trained on massive unlabeled datasets, can achieve competitive performance with as few as 100–500 labeled samples per class when fine-tuned for specific applications [16,38]. This dramatic reduction in labeled data requirements could democratize access to high-quality LULC mapping capabilities, particularly for regions with limited ground truth data.
7.3. Transfer Learning and Few-Shot Approaches
Recent advances in transfer learning substantially reduce the need for large training datasets while maintaining competitive performance. Models pre-trained on ImageNet and fine-tuned on Sentinel-2 imagery deliver 10–20% accuracy improvements compared to training from scratch, often requiring as few as 500–1000 samples per class [22]. This approach is particularly effective when the pre-training dataset contains diverse natural imagery that shares visual features with remote sensing data.
Few-shot learning methods push these limits further by enabling classification with only 5–50 samples per class. Although such approaches still trail fully supervised models by 15–25% in performance, they are particularly promising for rare or underrepresented land cover classes, where collecting extensive ground truth is impractical or prohibitively expensive [58]. Recent work on meta-learning and prototypical networks shows particular promise for this application domain.
The convergence of transfer learning, foundation models, and few-shot approaches suggests a future where high-quality LULC classification becomes accessible even in data-scarce environments. However, these approaches require careful validation to ensure that reduced data requirements do not compromise generalization performance or introduce subtle biases that only become apparent during operational deployment.
8. Emerging Learning Paradigms and Multi-Modal Integration
8.1. Self-Supervised Learning and Masked Autoencoders
Self-supervised learning approaches, particularly masked autoencoders (MAE), have shown remarkable success in learning robust representations from unlabeled satellite imagery. These methods work by masking random patches of input images and training models to reconstruct the missing information, thereby learning rich spatial and spectral patterns without requiring human annotations. Recent studies report that MAE-based models achieve 85–90% of fully supervised performance while requiring 90% fewer labeled samples [59].
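The pretext task behind MAE pre-training can be stated concretely: hide a large fraction of non-overlapping image patches and train the network to reconstruct them. A minimal sketch of the patch-masking step follows; the patch size and 75% mask ratio follow common MAE settings, and everything else is illustrative.

```python
import numpy as np

def random_patch_mask(h, w, patch=8, mask_ratio=0.75, seed=0):
    """Boolean pixel mask built from non-overlapping patches: True marks
    pixels hidden from the encoder, to be reconstructed by the decoder
    (the MAE-style pretext task)."""
    rng = np.random.default_rng(seed)
    gh, gw = h // patch, w // patch               # patch-grid dimensions
    n_masked = int(gh * gw * mask_ratio)          # how many patches to hide
    flat = np.zeros(gh * gw, dtype=bool)
    flat[rng.choice(gh * gw, n_masked, replace=False)] = True
    grid = flat.reshape(gh, gw)
    # Expand each patch decision to its patch x patch block of pixels.
    return np.kron(grid, np.ones((patch, patch), dtype=bool))

mask = random_patch_mask(64, 64)                  # e.g., an EuroSAT-sized patch
print(mask.shape, float(mask.mean()))             # (64, 64) 0.75
```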
The DINO-MC (Self-supervised Contrastive Learning for Remote Sensing Imagery with Multi-sized Local Crops) approach represents another significant advance in this area. By learning representations through contrastive learning with multiple image scales, these models develop robust features that transfer well across different geographic regions and land cover types. Studies using DINO-MC report improved performance particularly for small-area land cover classes that are typically challenging for conventional approaches.
Continual learning approaches offer another promising direction for operational deployment. These methods enable models to adapt to new geographic regions or changing environmental conditions without catastrophic forgetting of previously learned representations. Recent work by Amazon Science demonstrates how continual pretraining can adapt natural image models to geospatial domains while maintaining computational efficiency [60].
8.2. Multi-Modal Integration and Sensor Fusion
Multi-modal strategies that integrate Sentinel-1 SAR with Sentinel-2 optical imagery have demonstrated greater robustness across different environments and weather conditions. The SEN12MS dataset and related benchmarks provide standardized evaluation frameworks for these approaches [51]. Studies using combined SAR-optical data report 8–15% accuracy improvements over optical-only approaches, with the greatest benefits observed in tropical and cloud-prone regions.
Recent advances in attention mechanisms and transformer architectures enable more sophisticated fusion of multi-modal data. Models can learn to dynamically weight different sensor inputs based on their reliability and information content for specific classification tasks. This adaptive fusion approach is particularly valuable for operational systems that must maintain performance across diverse environmental conditions and data availability scenarios.
However, foundation models and multi-modal approaches introduce new challenges for operational deployment. Computational requirements are typically 3–5 times higher than conventional approaches, requiring substantial infrastructure investments. Model interpretability remains limited, which can be problematic for applications requiring explainable decisions. Additionally, the geographic bias in training data (predominantly North America and Europe) may limit transferability to underrepresented regions, despite improvements over conventional approaches.
9. Operational System Performance Analysis
9.1. Global Land Cover Product Accuracy Assessment
Systematic comparison of major operational LULC products provides crucial insights into achievable real-world performance levels. ESA WorldCover reports a global overall accuracy of 74.4% for its 2020 map, complemented by continental validation against the European LUCAS in situ database. The ESRI Land Cover product, recently updated to include 2024 data, achieves a similar 75% global accuracy through field-based validation using over 400,000 reference points distributed globally [
24].
Google Dynamic World, which emphasizes near-real-time mapping capabilities, reports slightly lower global accuracy at 72% but provides unprecedented temporal resolution with 5-day updates [
19]. This system represents a different trade-off between accuracy and temporal currency, prioritizing rapid updates over maximum classification precision. The temporal consistency of Dynamic World classifications has proven valuable for change detection and monitoring applications, despite somewhat lower static accuracy.
For historical context, the CORINE Land Cover dataset, widely used in European applications, achieves around 85% accuracy but at a coarser spatial resolution (100 m) and with photo-interpretation as the primary validation method [
61]. The higher accuracy of CORINE reflects both the coarser resolution (which reduces edge effects and mixed pixel issues) and the extensive human expert validation, but at the cost of reduced spatial detail and update frequency. A comprehensive comparison of these and other operational land cover products is presented in
Table 2.
9.2. Class-Specific Performance Patterns
Class-level analysis reveals consistent strengths and weaknesses across all operational systems. High-performing categories (>80% accuracy) include water bodies (typically 90–98%), dense forests (85–95%), and built-up areas (80–90%). These classes benefit from distinctive spectral signatures and relatively homogeneous spatial patterns that facilitate reliable classification across different geographic regions and environmental conditions.
Moderate-performing categories (60–80% accuracy) include croplands and grasslands, where accuracy depends heavily on crop diversity, field size, and phenological timing. Tree cover classification achieves 85–95% accuracy in dense forests but drops to 65–80% in mixed or fragmented landscapes. Urban classification shows regional variation, with higher accuracies in developed countries (85–95%) compared to informal settlements or mixed urban-rural environments (65–80%).
Challenging classes (<60% accuracy) consistently include shrubland, bare ground, and wetlands across all operational systems. Results from ESA WorldCover provide a clear illustration: while tree cover reaches 89.9% and snow/ice reaches 97.9%, shrubland accuracy drops to 44.1% and herbaceous wetlands achieve only 40.6% [
20]. These discrepancies highlight the persistent difficulty of mapping heterogeneous and transitional landscapes, where spectral confusion and seasonal variability remain major obstacles.
9.3. Infrastructure and Scalability Challenges
Moving from research prototypes to operational systems involves substantial technical and methodological hurdles that extend beyond algorithmic performance. One key issue is the minimum mapping unit: while research models often operate at very fine scales (individual pixels or small patches), operational products impose broader spatial thresholds to reduce noise and ensure consistent mapping standards. This difference in spatial aggregation can account for 5–10% of the performance gap between research and operational systems.
Operational deployments require algorithms that are not only accurate but also robust, scalable, and computationally efficient. Continuous global updates demand substantial infrastructure and computational resources, with processing costs scaling significantly for real-time or near-real-time applications. Google Dynamic World, for example, processes over 5 trillion pixels daily to maintain its 5-day update cycle, requiring specialized cloud computing infrastructure that would be prohibitively expensive for most research institutions.
Quality control and error propagation represent additional challenges for operational systems. Unlike research studies that can manually inspect and correct problematic areas, operational systems must implement automated quality control procedures that can handle edge cases and anomalous conditions without human intervention. This requirement often leads to more conservative classification approaches that sacrifice some accuracy for reliability and consistency.
Maintenance and long-term sustainability present ongoing challenges that are rarely addressed in research studies. Operational systems must continue functioning as satellite sensors age, new data sources become available, and user requirements evolve. This requires substantial ongoing investment in system maintenance, algorithm updates, and validation infrastructure that extends far beyond initial development costs.
10. Evaluation Best Practices and Recommendations
10.1. Spatial Validation Strategies
Recent methodological advances highlight the critical importance of spatial block cross-validation for realistic performance assessment. Unlike random validation approaches, geographically separated folds account for spatial dependence and consistently yield more conservative and reliable accuracy estimates [
42]. To avoid leakage of spatially correlated pixels, minimum separation distances between training and validation samples should exceed the spatial autocorrelation range of the land cover type under study.
For Sentinel-2 applications, spatial autocorrelation analysis typically indicates correlation ranges of 500 m to 5 km, depending on landscape heterogeneity and land cover type. Agricultural landscapes often show autocorrelation ranges of 1–2 km due to field patterns, while urban areas may exhibit correlations up to 5 km due to neighborhood effects. Forest landscapes typically show intermediate autocorrelation ranges of 2–3 km, reflecting the spatial scale of forest management units and natural disturbance patterns.
Recommended minimum separation distances are: 2 km for general-purpose LULC classification, 1 km for high-resolution urban mapping, 3–5 km for agricultural applications during growing season, and 1–2 km for forest classification depending on management intensity. These guidelines should be adjusted based on local landscape characteristics and the specific spatial scale of the classification problem.
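The blocking logic behind these guidelines can be sketched with scikit-learn's `GroupKFold`, assigning each sample to a square spatial block sized to the chosen separation distance so that whole blocks are held out together. This is a minimal sketch: the 2 km block size follows the general-purpose guideline above, while the coordinate hashing and the random coordinates are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def spatial_block_folds(x, y, block_size_m=2000, n_splits=5):
    """Assign samples (projected coordinates in metres) to square
    spatial blocks and return cross-validation folds in which every
    block appears in either the training or the test side, never both."""
    bx = np.floor(np.asarray(x) / block_size_m).astype(int)
    by = np.floor(np.asarray(y) / block_size_m).astype(int)
    blocks = bx * 100003 + by  # combine block indices into one group id
    gkf = GroupKFold(n_splits=n_splits)
    dummy = np.zeros(len(blocks)).reshape(-1, 1)
    return list(gkf.split(dummy, groups=blocks))

# Example with random projected coordinates over a 20 km x 20 km area
rng = np.random.default_rng(0)
xs = rng.uniform(0, 20000, 500)
ys = rng.uniform(0, 20000, 500)
folds = spatial_block_folds(xs, ys, block_size_m=2000)
```

Because splits are made at the block level rather than the sample level, nearby (spatially autocorrelated) pixels cannot end up on both sides of a fold, which is exactly the leakage this section warns against.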
10.2. Design-Based Statistical Inference
Robust accuracy assessment requires design-based inference built on probability sampling rather than convenience samples. The fundamental principle is that every location in the study area must have a known, non-zero probability of being selected for validation. Established formulas for sample size estimation integrate the number of classes, their proportional area, and desired confidence levels. The commonly used expression is

N = \left( \frac{\sum_{i=1}^{c} W_i S_i}{S(\hat{O})} \right)^{2}

where N is the total number of samples, c is the number of classes, Wi is the area proportion of class i, Si is the standard deviation of stratum i (for expected user's accuracy Ui, Si = \sqrt{U_i (1 - U_i)}), and S(Ô) is the target standard error of overall accuracy [47].
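A minimal sketch of this stratified sample-size calculation, n = (Σ Wi·Si / S(Ô))² with Si = √(Ui(1 − Ui)), is given below; the four class proportions and expected user's accuracies are hypothetical values chosen only for illustration.

```python
import math

def sample_size(area_proportions, expected_ua, target_se=0.01):
    """Total validation sample size for a stratified design:
    n = (sum_i W_i * S_i / S(O))^2, where S_i = sqrt(U_i * (1 - U_i))
    is the standard deviation of stratum i and target_se is the
    desired standard error of overall accuracy."""
    s = [math.sqrt(u * (1.0 - u)) for u in expected_ua]
    numerator = sum(w * si for w, si in zip(area_proportions, s))
    return round(numerator ** 2 / target_se ** 2)

# Hypothetical 4-class example: dominant class covers 50% of the area
n = sample_size(area_proportions=[0.50, 0.30, 0.15, 0.05],
                expected_ua=[0.90, 0.80, 0.70, 0.60],
                target_se=0.01)
```

Note how the total is driven by the area-weighted class variances: rare classes contribute little to the overall sample size, which is why the per-class minimums discussed below are still needed on top of this formula.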
However, this basic formula assumes simple random sampling and equal classification difficulty across classes. More sophisticated approaches account for class-specific variance in accuracy and the spatial clustering of land cover types. For operational systems, stratified random sampling with class-proportional allocation often provides the most efficient design, requiring approximately 30–50 samples per class for common classes and 100+ samples for rare classes.
A rigorous validation protocol involves: (i) probability-based sampling design with documented selection procedures; (ii) well-documented response design with high-quality reference data collection protocols; (iii) analysis methods consistent with the sampling design; and (iv) transparent reporting of sampling uncertainties and confidence intervals. Many studies continue to rely on ad hoc or convenience sampling approaches, which fundamentally limit the statistical validity and generalizability of their reported accuracies.
10.3. Comprehensive Accuracy Metrics and Uncertainty Quantification
Modern accuracy assessment has evolved beyond a single overall accuracy value to incorporate multiple complementary metrics. Producer’s Accuracy (recall) and User’s Accuracy (precision) provide class-specific insights, while F1-scores offer a balanced measure of both. Area-weighted accuracy metrics additionally weight each land cover class by its relative spatial extent.
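These class-specific metrics follow directly from the confusion matrix; a minimal sketch (assuming the convention that rows hold reference classes and columns hold map classes, and using an illustrative two-class matrix):

```python
import numpy as np

def class_metrics(cm):
    """Producer's accuracy (recall), user's accuracy (precision), and
    F1 per class from a confusion matrix with reference classes on the
    rows and map (predicted) classes on the columns."""
    cm = np.asarray(cm, dtype=float)
    pa = np.diag(cm) / cm.sum(axis=1)  # correct / reference total
    ua = np.diag(cm) / cm.sum(axis=0)  # correct / map total
    f1 = 2 * pa * ua / (pa + ua)
    return pa, ua, f1

# Illustrative 2-class matrix: 45 + 40 of 100 points mapped correctly
cm = [[45, 5],
      [10, 40]]
pa, ua, f1 = class_metrics(cm)
```

The asymmetry between the two views is the point: class 0 here has high producer's accuracy (few omissions) but lower user's accuracy (more commissions), a distinction a single overall accuracy value hides.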
Advanced metrics such as quantity disagreement and allocation disagreement provide deeper insights into the sources of classification errors [
62]. Quantity disagreement reflects differences in the amount of each class, while allocation disagreement captures spatial misallocation. Fuzzy accuracy measures account for the inherent uncertainty in boundary definitions and can provide more realistic assessments for applications where exact boundaries are less critical.
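Both disagreement components can be computed directly from the confusion matrix. The sketch below follows the Pontius-and-Millones decomposition described above (rows as reference labels, columns as map labels; the example matrix is illustrative), in which quantity plus allocation disagreement equals one minus overall accuracy.

```python
import numpy as np

def quantity_allocation(cm):
    """Quantity and allocation disagreement as proportions, computed
    from a confusion matrix (rows = reference, columns = map).
    Q captures differences in class totals; A captures spatial
    misallocation; Q + A = 1 - overall accuracy."""
    p = np.asarray(cm, dtype=float) / np.sum(cm)
    ref_totals = p.sum(axis=1)   # reference proportion per class
    map_totals = p.sum(axis=0)   # mapped proportion per class
    diag = np.diag(p)
    q = 0.5 * np.abs(ref_totals - map_totals).sum()
    a = np.sum(2 * np.minimum(ref_totals - diag, map_totals - diag)) / 2
    return q, a

# Same illustrative matrix as before: overall accuracy = 0.85
q, a = quantity_allocation([[45, 5],
                            [10, 40]])
```

Here most of the 15% total disagreement is allocation (errors that swap locations between classes) rather than quantity (errors in class totals), a distinction that matters when the map is used for area estimation.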
Uncertainty quantification should include confidence intervals for all reported metrics, spatial maps of classification confidence, and sensitivity analysis across different validation subsets. Confidence intervals should account for both sampling uncertainty (due to finite validation sample size) and model uncertainty (due to algorithmic variability). Bayesian approaches and ensemble methods provide natural frameworks for comprehensive uncertainty quantification.
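As one concrete route to the sampling-uncertainty part, a percentile bootstrap over validation points yields a simple confidence interval for overall accuracy. This is a minimal sketch under simple random resampling; the synthetic labels are illustrative, and a stratified design would require resampling within strata instead.

```python
import numpy as np

def bootstrap_oa_ci(y_ref, y_map, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for overall accuracy,
    resampling validation points with replacement."""
    rng = np.random.default_rng(seed)
    y_ref, y_map = np.asarray(y_ref), np.asarray(y_map)
    n = len(y_ref)
    oas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample n points with replacement
        oas[b] = np.mean(y_ref[idx] == y_map[idx])
    lo, hi = np.quantile(oas, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(y_ref == y_map)), (float(lo), float(hi))

# Synthetic example: 90 of 100 validation points classified correctly
y_ref = np.repeat([0, 1], 50)
y_map = y_ref.copy()
y_map[:10] = 1  # introduce 10 errors
oa, (lo, hi) = bootstrap_oa_ci(y_ref, y_map)
```

Reporting the interval (lo, hi) alongside the point estimate makes the finite-sample uncertainty explicit, which is exactly what the standardized reporting protocols below call for.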
Standardized reporting protocols should include accuracy estimates with confidence intervals, detailed description of validation methodology, spatial distribution of validation samples, class-specific performance metrics, analysis of error patterns and sources, and discussion of limitations and uncertainties. Such comprehensive reporting enables better comparison across studies and more informed decision-making for operational applications.
Figure 6 presents an integrative framework for accuracy assessment, linking spatial validation strategies, temporal consistency, and thematic precision. Together, these components move the field toward more comprehensive and transparent evaluation of LULC classification performance.
11. Future Directions and Emerging Solutions
11.1. Foundation Models and Self-Supervised Learning
Foundation models represent the most transformative development in LULC classification since the introduction of deep learning. These models, trained on massive volumes of unlabeled Earth observation data, fundamentally change the economics of operational mapping by dramatically reducing labeled data requirements while improving transferability across geographic regions. Recent studies demonstrate that fine-tuning foundation models can achieve competitive performance with as few as 100–500 labeled samples per class, compared to the 1000–10,000 samples typically required for conventional deep learning approaches.
The NASA-IBM Prithvi model exemplifies the potential of this approach. Trained on Harmonized Landsat Sentinel-2 (HLS) data using masked autoencoder techniques, Prithvi demonstrates superior transferability across different geographic regions and temporal periods. Operational deployment trials show 84.6% accuracy compared to 79.1% for conventional CNN approaches, representing a significant step toward bridging the benchmark-to-operations gap [
38].
Multi-modal foundation models that integrate optical and SAR data show particular promise for operational deployment. The SkySense family of models demonstrates robust performance across diverse environmental conditions, maintaining accuracy even in cloud-affected regions where optical-only approaches fail. Such approaches are especially valuable for operational systems that must provide consistent service regardless of weather conditions or seasonal variations in data availability [
6,
16].
11.2. Active Learning and Human-in-the-Loop Systems
Active learning techniques represent a practical approach for improving operational performance while minimizing labeling costs. By prioritizing uncertain or ambiguous samples for human annotation, these methods can reduce labeling requirements by 30–50% compared to random sampling while maintaining comparable accuracy levels. Recent implementations in operational systems show particular promise when combined with foundation models that provide better uncertainty estimates.
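The core uncertainty-sampling step described above can be sketched in a few lines: rank unlabeled candidates by predictive entropy and send the top-k to annotators. The probability matrix below is illustrative; in practice it would come from the current model's softmax outputs.

```python
import numpy as np

def entropy_sampling(proba, k):
    """Return the indices of the k samples with the highest predictive
    entropy -- a basic uncertainty-sampling criterion for active
    learning. `proba` is an (n_samples, n_classes) probability matrix."""
    p = np.clip(np.asarray(proba, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return np.argsort(entropy)[::-1][:k]

# Three candidates: the first is maximally uncertain, the second is
# nearly certain, the third is in between
proba = [[0.50, 0.50],
         [0.90, 0.10],
         [0.34, 0.66]]
selected = entropy_sampling(proba, k=2)
```

Variants swap entropy for margin or least-confidence scores, but the principle is the same: human annotation effort is spent where the model is least sure, which is what drives the labeling savings cited above.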
Human-in-the-loop systems enable continuous improvement of operational models through user feedback and targeted corrections. These systems allow domain experts to identify and correct systematic errors, provide labels for edge cases, and validate model predictions in high-stakes applications. Integration with active learning strategies enables efficient allocation of human expertise to the most impactful validation and training tasks.
Continual learning approaches address the challenge of model degradation over time as environmental conditions change or new land use patterns emerge. These methods enable operational systems to adapt to new conditions without catastrophic forgetting of previously learned representations. Recent advances in this area show promise for maintaining model performance over multi-year deployments without requiring complete retraining.
11.3. Domain Adaptation and Cross-Regional Transferability
Domain adaptation remains an active area of research with significant implications for operational deployment. Current approaches can recover 60–80% of the performance lost when transferring models across geographic regions, but identifying optimal strategies for diverse landscapes remains challenging. Advances in adversarial training, domain-invariant feature learning, and few-shot adaptation show promise for improving cross-regional transferability.
Temporal domain adaptation addresses the challenge of seasonal and inter-annual variability in satellite imagery. Methods that learn season-invariant representations or explicitly model temporal dynamics show improved robustness across different time periods. This capability is particularly important for operational systems that must maintain consistent performance throughout the year and across multiple years of deployment.
Meta-learning approaches offer another promising direction for improving transferability. By learning how to quickly adapt to new domains with minimal data, these methods could enable rapid deployment of LULC classification systems in new geographic regions. Recent work on model-agnostic meta-learning (MAML) and related techniques shows particular promise for remote sensing applications.
11.4. Quality-over-Quantity Approaches
Recent findings consistently highlight the importance of data quality and diversity over absolute sample size. Multiple studies demonstrate that 1000 high-quality expert-labeled samples often outperform 5000 noisy crowd-sourced labels, underscoring the value of investing in high-quality training data collection rather than simply maximizing sample counts [
45]. This finding has important implications for resource allocation in operational system development.
Geographic diversity consistently outweighs absolute sample counts in terms of model generalization. Studies show that 500 globally distributed samples can provide better transferability than 2000 samples from a single region, even when the local samples show higher local accuracy. This finding suggests that operational systems should prioritize geographic coverage over local sample density.
For agricultural applications, temporal diversity proves equally critical to geographic diversity. Training data collected across multiple growing seasons provides more robust generalization than single-season datasets, even when the latter contain higher sample counts. This insight suggests that future resource allocation strategies should prioritize diversity—both geographic and temporal—over quantity alone.
Smart sampling strategies that combine active learning, geographic stratification, and temporal coverage can maximize the information content of training datasets while minimizing collection costs. Recent advances in Bayesian optimization and experimental design offer principled approaches for optimizing sampling strategies based on expected model performance improvements.
12. Socio-Environmental Implications
Variations in LULC classification accuracy have tangible implications for environmental management and socioeconomic planning. Misclassification of cropland or forest areas can lead to inaccurate estimates of carbon stocks, biodiversity loss, or agricultural productivity [
1,
2]. For example, a 10% underestimation of cropland in Sub-Saharan Africa may translate into policy misallocations exceeding millions of hectares of productive land.
Urban land cover errors affect infrastructure planning and flood risk modeling, while wetland misclassification can undermine conservation priorities and ecosystem service assessments. Operational products with consistent 75–85% global accuracy thus represent a critical threshold for policy reliability.
Improving spatial validation and cross-domain generalization is not only a technical challenge but also a prerequisite for data-driven decision-making in climate adaptation, precision agriculture, and sustainable urban development. Integrating reliable Sentinel-2-based LULC information into national statistics and environmental indicators can significantly enhance progress monitoring toward the UN Sustainable Development Goals (SDGs 2, 11, 13, and 15).
13. Study Limitations
This systematic review has several limitations that should be considered when interpreting findings and applying recommendations. The focus on peer-reviewed literature published between 2020 and 2025 may introduce temporal bias, as some important earlier developments may be underrepresented, while very recent advances may not yet be fully reflected in the peer-reviewed literature due to publication delays.
13.1. Literature Review Constraints
This review does not claim a complete coverage of all published LULC classification studies. Rather, it focuses on identifying key patterns, challenges, and best practices that have emerged in the recent literature. The selection of studies, while guided by clear inclusion criteria and systematic search strategies, inevitably reflects some subjective judgment regarding which works best illuminate the benchmark-to-operations performance gap.
The reviewed literature displays geographic and institutional patterns, with stronger representation from North American and European institutions (62% of studies). This geographic bias may limit the generalizability of conclusions to other regions, particularly those with different environmental conditions, data availability constraints, or institutional capacities. The temporal focus on 2020–2025 studies, while appropriate for capturing recent advances, offers limited insight into long-term performance trends or the full maturation cycle of emerging technologies.
13.2. Methodological Limitations
This review synthesizes findings from studies employing heterogeneous methodologies, datasets, and evaluation criteria. Differences in class definitions, spatial resolutions, validation protocols, and reporting standards complicate direct comparisons across studies. While we have attempted to identify common patterns and consensus findings, the diversity of approaches means that some generalizations necessarily involve simplification of complex methodological differences.
Beyond EuroSAT, few standardized benchmarks exist for real-world LULC classification, restricting opportunities for rigorous cross-study comparisons. The quality and completeness of reported methodologies vary significantly across publications, making it challenging to fully assess the validity of all performance claims. Some studies provide insufficient detail about validation procedures, sample sizes, or statistical significance testing to enable complete quality assessment.
13.3. Evidence Base and Publication Bias
This review relies on published studies and does not include independent experimental validation of reported results. All performance estimates depend on the accuracy and completeness of prior publications. Publication bias likely affects the literature, with successful implementations more likely to appear in peer-reviewed journals while unsuccessful or inconclusive results may remain underreported.
Many reviewed studies provide only summary statistics, with limited access to raw datasets, code implementations, or detailed experimental logs that would enable independent verification. This inconsistency in transparency and reproducibility limits the depth of critical analysis possible and may contribute to the persistence of methodological issues identified in this review.
13.4. Temporal and Technological Limitations
The rapid pace of innovation in deep learning and remote sensing means that findings regarding specific architectures or techniques may become outdated relatively quickly. This is particularly true for emerging approaches such as foundation models and self-supervised learning, where the field is evolving rapidly and best practices are still being established. Most academic studies operate under idealized research conditions that do not fully reflect the constraints of operational environments. This review attempts to highlight these differences, but the limited number of truly operational studies makes it challenging to fully characterize real-world deployment challenges.
14. Discussion
The systematic analysis of current LULC classification performance underscores the persistent difficulties in bridging the gap between academic research and operational deployment. While benchmark studies on EuroSAT and similar datasets frequently report accuracies above 98%, operational systems based on Sentinel-2 rarely exceed 75–85%. This 18.5 percentage point discrepancy reflects a combination of methodological weaknesses, domain adaptation challenges, and practical constraints inherent to large-scale implementations.
Among methodological issues, spatial autocorrelation in validation stands out as the most consequential factor. Our meta-analysis demonstrates that neglecting spatial separation between training and validation samples can lead to accuracy overestimation of up to 28%, casting doubt on the reliability of a significant portion of published results. This finding highlights the urgent need for widespread adoption of spatially explicit validation protocols and suggests that many reported performance improvements may be artifacts of flawed validation methodology rather than genuine algorithmic advances.
The emergence of foundation models represents the most promising development for addressing the benchmark-to-operations gap. Models like Prithvi and SkySense demonstrate improved transferability across geographic regions and reduced sensitivity to training data limitations. However, these approaches introduce new challenges related to computational requirements, infrastructure dependencies, and model interpretability that must be carefully considered for operational deployment.
Our analysis reveals important trade-offs between different approaches to operational LULC mapping. Multi-spectral approaches provide consistent but modest gains (5–8%) over RGB-only methods, but at significantly higher computational costs. For operational applications, carefully designed band selection strategies may offer the most practical balance between accuracy and efficiency. Similarly, the choice between accuracy and temporal resolution represents a fundamental trade-off, with systems like Dynamic World sacrificing some classification precision for near-real-time updates.
The review highlights the scale-dependent nature of training data requirements and the critical importance of geographic diversity. While traditional deep learning approaches may require thousands of samples per class, foundation models can achieve competitive performance with hundreds of samples, potentially democratizing access to high-quality LULC mapping. However, geographic diversity proves more critical than absolute sample size across all approaches, suggesting that future data collection efforts should prioritize global coverage over local sample density.
14.1. Implications for Future Research and Practice
For the research community, this review highlights several critical priorities. The adoption of rigorous spatial validation protocols should become mandatory for publication in high-impact journals. Studies should report both optimistic (random cross-validation) and conservative (spatial cross-validation) accuracy estimates to provide readers with realistic expectations. The development of standardized benchmark datasets that better reflect operational conditions would facilitate more meaningful comparisons across studies.
For practitioners and operational agencies, the review emphasizes the need for realistic performance expectations based on rigorous validation methodologies rather than optimistic benchmark results. Investment in foundation model infrastructure and capabilities may provide the best path forward for improving operational performance, but organizations must carefully assess the computational and infrastructural requirements. The importance of geographic diversity in training data suggests that international cooperation and data sharing initiatives could benefit all participants.
For funding agencies and policy makers, the findings suggest that the most impactful investments may be in data infrastructure, validation networks, and standardization efforts rather than purely algorithmic development. The persistent performance gap indicates that substantial improvements require coordinated efforts across the entire pipeline from data collection to operational deployment, rather than isolated advances in individual components.
14.2. Broader Context and Significance
Despite its limitations, this systematic review makes several important contributions to the broader understanding of LULC classification and the remote sensing field. First, by quantifying the consistent 75–85% operational performance range, it establishes realistic expectations for practitioners and funding bodies, potentially reducing the risk of overpromising and under-delivering in deployment contexts.
Second, the emphasis on methodological pitfalls—particularly spatial autocorrelation and domain adaptation—raises awareness of common sources of error that could potentially improve the quality of future studies. The quantification of these effects provides concrete guidance for methodological improvements and helps explain why many research advances fail to translate to operational improvements.
Finally, the finding that geographic and temporal diversity outweighs absolute sample size offers practical guidance for resource allocation in data collection campaigns. This insight could help optimize limited resources and improve the cost-effectiveness of LULC mapping initiatives, particularly in resource-constrained environments where operational systems are most needed.
15. Conclusions
This systematic review confirms the existence of a persistent and substantial performance gap between academic benchmarks and operational LULC classification systems based on Sentinel-2 imagery. Meta-analysis of 89 studies reveals an 18.5 percentage point difference between benchmark and operational accuracies, with operational systems rarely exceeding 75–85% accuracy despite benchmark results routinely above 98%. This gap transcends specific methodological approaches, geographic regions, and temporal periods, indicating fundamental challenges in translating research advances to operational deployment.
Several key findings emerge from the systematic analysis. Methodological shortcomings, particularly the neglect of spatial autocorrelation in validation procedures, can inflate reported accuracies by as much as 28%, undermining the reliability of a significant portion of published claims. Domain adaptation issues further constrain transferability; models often lose 15–25% accuracy when applied across different regions or seasons, highlighting the challenges of deploying research models in diverse operational contexts.
Foundation models represent the most promising approach for addressing these challenges. Recent models like Prithvi and SkySense demonstrate superior transferability and require substantially fewer labeled samples than conventional approaches. However, these advances come with increased computational requirements and infrastructure dependencies that may limit immediate adoption in resource-constrained environments.
Multi-spectral analysis offers consistent gains of 5–8% over RGB-only approaches, but at considerable computational cost. For operational applications, targeted band selection strategies provide the most practical balance between accuracy and efficiency. The finding that geographic diversity proves more important than absolute sample size has direct implications for the design of data collection strategies and international cooperation initiatives.
This review also acknowledges important limitations. As a synthesis focused on recent literature, it may not capture all relevant developments, and the geographic bias toward North American and European studies may limit generalizability. Moreover, synthesizing studies with heterogeneous methods, class definitions, and validation criteria inevitably introduces uncertainty into the analysis.
Looking forward, several critical priorities emerge from this analysis. The adoption of rigorous spatial validation protocols should become standard practice, with minimum separation distances determined by landscape characteristics and spatial autocorrelation analysis. The development of geographically representative benchmark datasets that better reflect operational conditions will facilitate more meaningful research progress. Greater emphasis on transparent reporting of both successes and limitations will improve the reliability of published results.
Collaboration between academic researchers and system developers is equally essential, ensuring that methodological advances are aligned with the computational, infrastructural, and maintenance requirements of real-world deployments. Investment in foundation model capabilities and data infrastructure may provide the most promising path forward, but it requires careful consideration of resource requirements and technical constraints.
Ultimately, achieving reliable operational LULC classification will require moving beyond the pursuit of ever-higher benchmark accuracies toward developing systems that are robust, transferable, and sustainable under real-world constraints. This transition demands not only technical advances but also institutional changes in how research is conducted, validated, and translated to operational practice.
While significant obstacles remain, the recent advances in foundation models, improved understanding of methodological pitfalls, and growing recognition of the benchmark-to-operations gap provide reasons for optimism. Success will require sustained commitment to methodological rigor, realistic performance expectations, and collaborative approaches that bridge the divide between research excellence and operational reliability. This systematic review provides a foundation for these efforts, emphasizing both the challenges that must be overcome and the promising pathways toward more effective operational LULC classification systems.