1. Introduction
Accurate and timely land cover land use (LCLU) classification is fundamental to monitoring environmental changes [1,2,3], managing natural resources [4,5], and informing sustainable policy [6,7,8]. Although land cover (LC) refers to the physical materials on the Earth’s surface and land use (LU) describes how humans utilize those surfaces, LC inherently provides the observable foundation from which many LU categories are inferred. In remote sensing (RS), the combined term LCLU is therefore commonly adopted to reflect both the physical surface characteristics and their functional interpretation. Following this convention, the term LCLU is used throughout this study, while recognizing that the classification task is primarily driven by LC information derived from Landsat spectral–temporal observations.
The Landsat archive, with its continuous, global-scale data collected since 1972, represents an unparalleled resource for large-area and long-term analysis of the Earth’s surface. The Landsat data’s rich spatial (30 m), spectral (multiple bands), and temporal (biweekly) dimensions provide a robust foundation for tracking complex land surface dynamics.
Harnessing the full potential of this extensive spatial and temporal archive for LCLU classification presents a significant computational challenge. Processing large-area, long-period time-series data requires models that can efficiently integrate spatial, spectral, and temporal information simultaneously. Traditional deep learning (DL) models, particularly transformers, have become popular for LCLU classification; however, they are typically computationally intensive [9,10]. The high parameter counts and operational costs of these “heavy” models can create a bottleneck for large-scale deployment, operational monitoring, and processing on resource-constrained devices, such as mobile phones and edge devices [11,12].
Lightweight deep neural networks (DNNs) are efficient DL models, architecturally optimized to reduce computational load, memory usage, and parameter count, making them suitable for deployment on edge devices with limited resources [11,13]. Lightweight models enable on-device processing, ensuring data privacy and ownership by eliminating the need to transmit sensitive information to the cloud. By keeping processing local, these architectures reduce communication overhead and network dependency, making them ideal for secure, real-time decentralized monitoring. Lightweight DNN models have been applied to LCLU classification because they offer a good balance between efficiency and accuracy. These lightweight models primarily focus on two tasks: scene classification and hyperspectral classification. Scene classification refers to assigning a single label to an entire scene or image, and recent studies have demonstrated that lightweight convolutional neural networks (CNNs) can achieve accuracy comparable to state-of-the-art heavy models on this task. These models are typically evaluated on standard public datasets, including AID [14], NWPU [15], and UC Merced [16], which consist of satellite and aerial RGB images with resolutions ranging from 0.2 m to 30 m. For instance, Tong et al. [17] developed a lightweight DenseNet model based on DenseNet121, which performed comparably to heavy CNNs across these datasets. Liu and Bai [18] further improved classification by incorporating a statistically independent Gaussian noise-based feature augmentation (IGNA) module into a lightweight CNN, outperforming MobileNetV2 on the same datasets. Additionally, knowledge distillation has been used to enhance lightweight models. Song et al. [19] employed an ensemble of EfficientNet-b0 and EfficientNet-b3 in a teacher–student knowledge-distillation framework, achieving superior performance on the AID and NWPU datasets compared to other lightweight and heavy CNNs.
Hyperspectral classification generally involves semantic segmentation, i.e., assigning a label to each pixel in an image. Lightweight models, both CNN-based and transformer-based, have been proposed to extract spatial and spectral features from hyperspectral datasets like Indian Pines [20], Houston 2013 [21], Pavia University [22], and Salinas [23]. For example, researchers have successfully utilized hierarchical residual strategies [24], 1D convolutional layers within transformers [25], and hybrid CNN–transformer designs [26] across these datasets. Furthermore, gradient-based Network Architecture Search (NAS) has been employed to automatically design efficient attention networks that outperform traditional state-of-the-art models [27].
While lightweight models have advanced for scene and hyperspectral classification, their application to high-spatial-resolution optical imagery in LCLU classification remains relatively underexplored. Recent studies have successfully adapted architectures like MobileNetV2 and lightweight UNet for specific tasks, including water body and cropland classification, often outperforming heavier 2D CNNs and Vision Transformers [28,29]. Similarly, lightweight models applied to Synthetic Aperture Radar (SAR) imagery have primarily been used for object detection [30,31]. Lightweight models like MobileNet and SegNet have also shown promise for specialized LCLU tasks such as green-tide detection [32].
Medium-spatial-resolution optical imagery, the focus of this work, has been sparsely explored by lightweight models in LCLU classification. Among the medium-spatial-resolution optical sensors, Sentinel-2 (10 m) and Landsat (30 m) are the most widely used because of their free availability and continuous global coverage, compared to other sensors such as ASTER (15 m). Here, lightweight models utilizing Sentinel-2 are presented first, followed by those employing Landsat data. Mazzia et al. [33] combined a Recurrent Neural Network (RNN) with a CNN using one-year Sentinel-2 temporal data for crop classification in a north–central part of Italy. Results showed that the lightweight model (96%) with ~31k parameters outperformed traditional machine learning methods such as Support Vector Machine (SVM) (80%) and Random Forest (RF) (78%). Garnot and Landrieu [34] designed a lightweight Temporal Self-Attention model with ~150k parameters for crop classification using one-year Sentinel-2 time series; the lightweight model (94.3%) achieved comparable accuracy to the traditional Temporal Attention Encoder (TAE) (94.2%) and outperformed TempCNN (93.3%). Corbane et al. [35] developed a simple CNN with ~1.4M parameters for built-up classification using Sentinel-2 images, although the model was not compared to others. Arrechea-Castillo et al. [36] used a simple CNN based on LeNet with two Sentinel-2 images for LCLU classification in a sub-basin in Colombia. The model achieved the highest accuracy (96.5%) compared to six traditional DL architectures (e.g., AlexNet (96.0%) and ResNet (96.3%)), and it also outperformed an efficient model, EfficientNet (94.9%). Papoutsis et al. [37] designed a lightweight CNN with ~30k parameters based on EfficientNet with Sentinel-2 images for multi-label image classification, which assigns at least one label to each image patch. The model achieved comparable accuracy (76.3%) to ResNet (76.4%) and EfficientNet (76.1%). Sawant and Ghosh [38] compared LinkNet with a MobileNetV2 backbone against UNet for LCLU classification using Sentinel-2 images in two Indian cities; they found that the lightweight LinkNet (61.3%) underperformed the UNet (71.3%). Wang et al. [39] combined 1D CNN and transformer architectures and achieved high accuracy (96.8%) using limited training samples (~1200 samples) for crop mapping with one-year pixel time series of Sentinel-2, although the model was not compared to others.
For lightweight models using Landsat data, Sencaki et al. [40] combined a 1D CNN and a bi-directional Gated Recurrent Unit (GRU) for land cover classification in an Indonesian city using Landsat 8 time series; the lightweight model with 20k parameters (92.2%) obtained better accuracy than ResNet (89.8%) and TempCNN (91.1%). Wan and Yong [41] proposed a lightweight CNN based on EfficientNetV2 and DeepLabV3+ for binary water classification using Landsat images. The lightweight model (95.3%) surpassed the heavier 2D CNN models (e.g., DeepLabV3+ (93.2%), DeepwatermapV2 (94.5%), and WatNet (93.0%)). Martono et al. [42] proposed a lightweight 1D DL algorithm with ~10k parameters by combining a 1D CNN and a bi-directional GRU for land cover classification using Landsat 8 time-series data. The model (95.8%) achieved comparable results to TempCNN (95.7%) and Long Short-Term Memory (LSTM) (95.3%).
Table 1 below summarizes the studies on lightweight deep neural networks using Sentinel-2 and Landsat data, including their application, classification type and input data characteristics (spatial patches and/or time series), and comparative methods. The summary shows that studies used either CNN-based lightweight models to handle spatial features for LCLU classification or 1D CNN/temporal models to extract temporal features, primarily for crop classification. Furthermore, most studies that compared lightweight models to their heavyweight counterparts covered a local area and short time periods.
From Table 1, it can be seen that no study has yet proposed or applied specific lightweight models that can simultaneously manage spatial, spectral, and temporal features for medium-resolution sensor time-series data, nor have several different lightweight models been compared for their performance in multiple-class LCLU classification using medium-resolution optical data. As Landsat provides freely available, continuous long-term, global-coverage data supporting diverse applications, it is important to propose, apply, and compare lightweight models that can efficiently leverage this rich data in multiple-class LCLU classification. This study presents the first large-scale comparison of lightweight models for spatial–spectral–temporal Landsat time-series LCLU classification. By evaluating seven popular computer vision models across five parameter scales (3k–50k) over the Conterminous United States (CONUS), this study provides practical guidance on how lightweight designs can outperform traditional CNNTransformer hybrids in accuracy and stability.
To address this gap, this article aims to answer the following research questions on transitioning existing lightweight algorithms to medium-resolution classification tasks:
How do different lightweight deep learning architectures perform in Landsat time-series LCLU classification?
Can lightweight deep learning models achieve classification performance comparable to or better than that of a traditional CNN + transformer hybrid classifier?
Is there a specific model complexity that offers the highest classification improvement for a lightweight vs. traditional classifier?
4. Results
To evaluate model stability and robustness, the optimal architecture and hyperparameter configuration was then retrained across the 25 training repetitions. These repetitions used the same fixed architecture and hyperparameter settings but different training data samples. Using a single repetition for model selection prevents information leakage, and retraining the selected configuration on the remaining repetitions with fixed test sets isolates the effects of training data variation, enabling a fair assessment of model generalization and stability. To further assess model robustness under limited-data scenarios, we performed an additional experiment using a small training dataset of 50k samples. Given the higher variance expected under limited-data conditions, models were evaluated over 50 training repetitions while maintaining the same model selection and testing protocol. In this section, we first present the benchmark model performance, followed by the performance of the lightweight models relative to the benchmark, a further investigation of the optimal lightweight model relative to the benchmark, and model performance with a small training dataset.
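The repetition protocol described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each repetition draws its training subset without replacement from a common pool of candidate samples while the test set stays fixed; the function name `draw_training_sets`, the seeds, and the toy sizes are illustrative and are not taken from the study.

```python
import numpy as np

def draw_training_sets(pool_size, n_train, n_reps, base_seed=0):
    """Return one array of training indices per repetition.

    Assumption: sampling is without replacement from a shared pool;
    the study's actual sampling scheme may differ.
    """
    sets = []
    for rep in range(n_reps):
        # A distinct, reproducible seed per repetition gives a different training draw.
        rng = np.random.default_rng(base_seed + rep)
        sets.append(rng.choice(pool_size, size=n_train, replace=False))
    return sets

# Toy sizes for illustration only (not the study's 500k or 50k samples).
train_sets = draw_training_sets(pool_size=1000, n_train=100, n_reps=25)
```

Because the architecture, hyperparameters, and test set are held constant, any spread in the resulting scores can be attributed to the training draw alone.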
4.1. Benchmark Performance
As mentioned, the CNNTransformer hybrid was the benchmark model used in this study for comparison with the lightweight models for Landsat time-series LCLU classification. Figure 2 presents the benchmark performance (mean F1 and per-class F1) across 25 training repetitions. The mean F1, which typically fell between 72% and 75% depending on model size, was stable, with only moderate variability across repetitions, indicating that the benchmark is robust to variation in training data. The mean F1 increased with model size from 3k to 25k, reflecting the expected benefits of wider attention layers and deeper temporal modeling capacity. However, the 25k and 50k models achieved nearly identical mean F1, indicating that model capacity reached a point of saturation, where additional parameters offered limited marginal gains. The tighter boxplots at 50k further supported this interpretation: while the mean performance did not improve beyond the 25k model, the narrower spread indicated that the 50k model showed lower variability across training repetitions, meaning more stable optimization and more consistent convergence. Larger models often have smoother loss landscapes and more redundant capacity, thus reducing run-to-run fluctuations even when accuracy plateaus.
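For readers who want to reproduce the two metrics reported here, the following is a minimal sketch of per-class F1 and mean F1 computed from a confusion matrix. It assumes that "mean F1" is the unweighted average of the per-class F1 scores, which this section does not state explicitly; the function names and the toy three-class matrix are illustrative.

```python
import numpy as np

def per_class_f1(conf):
    """Per-class F1 from a confusion matrix (rows = reference, columns = predicted)."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as the class but belonging elsewhere
    fn = conf.sum(axis=1) - tp   # belonging to the class but predicted elsewhere
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); a class with an empty denominator gets 0.
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

def mean_f1(conf):
    """Unweighted (macro) mean of the per-class F1 scores (assumed definition)."""
    return float(per_class_f1(conf).mean())

# Made-up confusion matrix for three classes.
toy = np.array([[8, 2, 0],
                [1, 7, 2],
                [0, 3, 7]])
f1 = per_class_f1(toy)
```

Under this definition every class contributes equally to the mean, so a weak class such as bare land lowers the mean F1 as much as a strong class raises it.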
Class performance revealed clear differences among the LCLU types. Some land cover types (e.g., water and agriculture) displayed a consistently high F1, while more dynamic classes (e.g., wetland or grass/shrub) exhibited greater variability.
The water and agriculture classes consistently achieved the highest F1 scores (85–90%), reflecting the water class’s strong spectral and temporal separability and low within-class heterogeneity and the agriculture class’s strong spatial and temporal separability. The developed and grass/shrub classes followed, with relatively stable performance (approximately 75–80%) across repetitions. Developed areas had spatial contrast (e.g., edges, boundaries) and stable reflectance with weak seasonality, while grass/shrub areas had strong but consistent seasonal patterns (i.e., large spectral amplitude with consistent annual phenological timing). Forest and wetland exhibited notably lower F1 values (roughly 65–70%); however, forest showed lower variability while wetland displayed substantially higher variability, indicating less consistent separability across training repetitions. Forest areas tended to have homogeneous canopy structures and stable phenological cycles (e.g., spring–summer–fall temporal changes) across years. In contrast, wetlands often exhibited mixed vegetation and water signals within Landsat pixels at 30 m resolution, and their temporal changes were usually inconsistent across years because of the effects of flooding and drought. Furthermore, the spectral similarity between wetlands and adjacent land cover types increases class ambiguity in optical imagery, potentially introducing greater label uncertainty compared to more spectrally homogeneous classes. Bare areas had the lowest F1 values (52–62%), reflecting their spectral ambiguity and weak temporal signal. Bare land surfaces resembled the developed, grass/shrub, and even agriculture classes depending on soil moisture, mineral content, or tillage. They often lacked consistent phenological behavior, limiting the extraction of temporal patterns. The bare land class also had large intra-class variability, as it can include deserts, mine excavations, riverbanks, and burn scars, each with very different spectral patterns.
Overall, the benchmark’s performance provided a reliable baseline for evaluating whether lightweight models can approximate or exceed the traditional CNNTransformer hybrid.
4.2. Performance of Lightweight Models Relative to the Benchmark
To evaluate performance consistency across different model architectures and parameter scales, a Two-Way ANOVA with blocking was conducted, as shown in Table 5. The analysis revealed significant main effects for both the model (F(7, 936) = 12.12, p < 0.001) and the parameter scale (F(4, 936) = 5.87, p < 0.001). The blocking factor (training repetition) was also highly significant (p < 0.001), confirming that performance varied naturally between training repetitions. Most notably, a highly significant interaction effect between the model and the parameter scale was observed (F(28, 936) = 134.49, p < 0.001). This interaction indicates that the performance gain from increasing the parameter size depends on the specific model. A more detailed analysis of the impact of model size, separated by model type, is therefore needed.
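The degrees of freedom quoted in the F statistics above can be checked with a short calculation. The sketch assumes a balanced design with one mean F1 value per combination of eight model types (the seven lightweight architectures plus the benchmark), five parameter scales, and 25 training repetitions used as blocks; the function name is hypothetical.

```python
def blocked_two_way_anova_df(n_models, n_scales, n_blocks):
    """Degrees of freedom for a balanced two-way ANOVA with a blocking factor."""
    n_obs = n_models * n_scales * n_blocks      # one observation per cell and block
    df_model = n_models - 1
    df_scale = n_scales - 1
    df_interaction = df_model * df_scale
    df_block = n_blocks - 1
    # Residual df is what remains after all modeled terms are removed from the total.
    df_residual = (n_obs - 1) - df_model - df_scale - df_interaction - df_block
    return {
        "model": df_model,
        "scale": df_scale,
        "interaction": df_interaction,
        "block": df_block,
        "residual": df_residual,
    }

# Assumed design: 8 models x 5 parameter scales x 25 repetitions.
df = blocked_two_way_anova_df(n_models=8, n_scales=5, n_blocks=25)
```

With these assumed counts the function returns 7, 4, and 28 numerator degrees of freedom and 936 residual degrees of freedom, which agrees with the reported F(7, 936), F(4, 936), and F(28, 936).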
To further evaluate whether lightweight models can perform comparably to or better than the benchmark, the mean F1 difference was calculated by subtracting the benchmark’s mean F1 score from the model’s mean F1 score for the same training repetition. Figure 3 summarizes each lightweight model’s mean F1 difference relative to the benchmark across model sizes (3k–50k). A positive value indicates performance above the benchmark, and a negative value indicates performance below it. For better visualization, EVT values at small model sizes (3k and 5k) were removed due to poor performance (lower than −20%) relative to the benchmark.
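The paired difference described above can be sketched as follows. The function name and the three example scores are made up for illustration; the only requirement taken from the text is that each model score is matched with the benchmark score from the same training repetition.

```python
import numpy as np

def f1_difference(model_f1, benchmark_f1):
    """Model minus benchmark mean F1, paired by training repetition."""
    model_f1 = np.asarray(model_f1, dtype=float)
    benchmark_f1 = np.asarray(benchmark_f1, dtype=float)
    if model_f1.shape != benchmark_f1.shape:
        raise ValueError("scores must be paired one-to-one by repetition")
    return model_f1 - benchmark_f1

# Made-up mean F1 scores (%) for three repetitions at a single model size.
model_scores = [74.0, 75.5, 73.0]
benchmark_scores = [73.0, 74.0, 74.5]
diff = f1_difference(model_scores, benchmark_scores)
wins = int((diff > 0).sum())   # repetitions in which the model beat the benchmark
```

Pairing by repetition removes the variation shared by both models within a repetition, so the differences reflect the architecture rather than the particular training draw.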
The CNNRNN models (MobileNetSRU and ConvNextKanSRU) achieved the strongest performance across all model sizes, consistently outperforming the benchmark. MobileNetSRU exceeded the benchmark at every model size, including the smallest size (3k parameters), where it outperformed the benchmark by a substantial margin, demonstrating exceptional parameter efficiency and stable temporal modeling [11,46] via SRU (more details in Section 4.3). ConvNextKanSRU followed closely but was typically slightly below MobileNetSRU at most model sizes; however, at larger model sizes, ConvNextKanSRU achieved comparable performance to MobileNetSRU. At 25k, its highest mean F1 surpassed that of MobileNetSRU, suggesting that the ConvNext + KAN spatial encoder benefited more from increased model size, allowing the architecture to realize its full spatial and temporal modeling potential. The variability of the two models decreased as model size increased, highlighting improved model stability and reduced sensitivity to training data variation. This pattern indicated that larger lightweight models gain stronger representational capacity, allowing them to learn more consistent spatial and temporal features across repetitions. As parameter budgets increase, both architectures become less prone to overfitting individual training draws and better capture class-invariant patterns, resulting in tighter performance distributions and more reliable generalization.
The lightweight CNNTransformer models (MobileNetTransformer and ConvNextKanTransformer) showed a model-size-dependent pattern. These transformer-based lightweight models underperformed the benchmark at a small model size (3k) but became competitive at larger model sizes. At 3k, both models fell below the benchmark (negative F1 difference), showing instability and underfitting. The two models scaled down less efficiently than the benchmark because they carry larger structural overhead in their MobileNet and ConvNextKan backbones (e.g., depthwise convolutions, squeeze-and-excite, normalization, and residual connections) and attention mechanisms (e.g., projection matrices), leaving insufficient effective capacity for learning meaningful spatial–temporal features at this extremely small size. As the model size increased (5k to 25k), the relative performance of MobileNetTransformer improved, with F1 difference centers above the benchmark. Between the 5k and 25k model sizes, the parameter counts became adequate to support both spatial and temporal modeling: MobileNet and ConvNextKan started to build meaningful spatial representations, and the linear attention became expressive enough to capture annual phenology.
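To make the difference between linear attention and quadratic self-attention concrete, the following is a minimal NumPy sketch. The exact linear attention used by the lightweight models is not specified in this section, so the sketch assumes a common kernel formulation with the feature map elu(x) + 1; all function names and toy sizes are illustrative.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Quadratic self-attention: builds an explicit (T, T) weight matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """elu(x) + 1, a strictly positive feature map (an assumed choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Kernelized attention: never forms the (T, T) matrix."""
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v                    # (d, d_v) summary of all time steps
    norm = qf @ kf.sum(axis=0)       # (T,) per-query normalizer
    return (qf @ kv) / norm[:, None]

rng = np.random.default_rng(0)
T, d = 12, 4                         # toy sequence length and feature width
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
out_soft = softmax_attention(q, k, v)
out_lin = linear_attention(q, k, v)
```

In the linear form, every query reads from the same fixed-size summary `kv`, whose size does not grow with the number of time steps, whereas the softmax form keeps a separate weight for every pair of time steps.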
Then, relative performance decreased as model size increased from 25k to 50k, indicating that lightweight transformers could not achieve comparable performance to traditional transformers at large model sizes. Both models fell below the benchmark again at 50k, suggesting that the linear attention mechanism, while efficient for small models, became less effective than the traditional quadratic self-attention as model size increased. At large model sizes, linear attention possibly could not exploit the increased capacity because it compresses temporal interactions too aggressively and cannot model fine-grained pairwise relationships between time steps as traditional transformers do by design [59]. Therefore, the extra parameters could not improve temporal modeling quality. In contrast, traditional self-attention uses full pairwise comparisons between time steps, enabling larger models to better utilize the additional capacity [9]. This decrease may also occur in lightweight CNNs, but this is not likely. For example, MobileNets were specifically designed to be highly parameter-efficient by using techniques such as depthwise separable convolutions to achieve comparable or higher accuracy with a much smaller parameter size than heavy CNN models [46,48]. If traditional CNNs are forced to have the same parameter size as MobileNets, their accuracy often drops because they cannot leverage a small parameter size as efficiently as MobileNets. Although no direct comparison was found in LCLU classification, Melyani et al. [64] found that MobileNetV2 achieved better accuracy (5% gain) than the traditional CNN DenseNet121 at the same parameter scale in eye disease image classification. The variability of MobileNetTransformer decreased as model size increased to 25k and then increased from 25k to 50k, so its stability also followed a model-size-dependent pattern. However, the non-monotonic stability observed in ConvNextKanTransformer (3k more stable than 5k) likely reflected optimization dynamics interacting with the KAN component, where certain intermediate capacities lead to a mismatch between attention head size and spatial feature dimensionality. Temporal transformer modeling requires larger representational capacity due to its reliance on attention mechanisms and weaker inductive bias [9], and transformers often require large datasets (e.g., millions of samples) to reach their full potential, because they are essentially trying to learn the entire sequential structure from the data, which requires vast amounts of evidence.
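The parameter savings of the depthwise separable convolutions mentioned above can be made concrete with a small count. The sketch ignores biases and normalization layers, and the channel and kernel sizes are illustrative rather than taken from any model in this study.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in     # one k x k filter per input channel
    pointwise = c_in * c_out     # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Illustrative layer: 32 input channels, 64 output channels, 3 x 3 kernel.
std = standard_conv_params(32, 64, 3)
sep = depthwise_separable_params(32, 64, 3)
ratio = sep / std                # equals 1 / c_out + 1 / k**2
```

For this layer the separable version needs 2336 weights instead of 18432, roughly an eightfold reduction, which is the kind of saving that lets MobileNet-style backbones fit within budgets of a few thousand parameters.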
The 3D CNN models (MoViNet and TVN) achieved comparable or better performance than the benchmark at all model sizes. At 3k, MoViNet had a mean F1 difference center below 0%, showing slightly weaker average performance than the benchmark; however, at other model sizes, MoViNet became competitive, with centers exceeding the benchmark. Moreover, its improvement relative to the benchmark first increased (5k to 10k) and then decreased (25k to 50k) as model size increased. The underperformance at 3k indicated the inefficiency of the 3D CNN compared to the CNNTransformer at extremely small parameter sizes. MoViNet used 3D kernels (e.g., 3 × 3 × 3 or 1 × 3 × 3) to capture spatial and temporal correlations simultaneously, and its design mechanisms (e.g., depthwise 3D convolutions, temporal buffering) only begin to function effectively once the channel width is high enough. In contrast, the CNNTransformer hybrid separated spatial and temporal modeling, and this separation allowed the CNNTransformer to remain functional even at 3k, whereas MoViNet became too shallow and narrow to extract meaningful 3D features. This was also indicated by the hybrid 2D + 1D CNN design of TVN.
As model size increased from 5k to 10k, MoViNet started to obtain enough capacity to capture joint spatial–temporal features and gained better relative performance compared to the benchmark. However, the relative improvements of MoViNet declined at 25k and 50k, suggesting that the 3D CNN model overfitted more easily at larger model sizes. MoViNet mixed spatial and temporal features too early, which is inherent to the design of 3D CNN architectures and may amplify noise and temporal irregularity as model size increases.
TVN consistently achieved mean F1 difference values near or above the benchmark across model sizes. Particularly at 3k, TVN’s relative performance was strong and stable, outperforming the benchmark and indicating that the 2D + 1D CNN architecture is more parameter-efficient than the 3D CNN architecture of MoViNet at small model sizes. However, the 3D CNN architecture of MoViNet became slightly superior to TVN at larger sizes (10k, 25k, and 50k), possibly because its joint spatial–temporal convolutions can exploit model capacity to learn richer and more holistic spatiotemporal representations than the separated 2D + 1D CNN design.
The video transformer (EVT) showed substantial underperformance compared to the benchmark, failing to reach the benchmark’s level of performance at any model size. Although performance improved with model size, EVT remained below the benchmark even at the largest model size of 50k. The high variance also indicated unstable convergence at small and medium model sizes. Video transformers typically require large model widths and deep hierarchies to stabilize attention across space–time tokens [65,66]. EVT likely struggled to maintain consistent attention patterns at small model sizes, resulting in low relative performance and high instability.
Overall, MobileNetSRU was the best of the seven lightweight architectures. The next section compares it with the benchmark based on per-class performance.
4.3. Further Investigation of Optimal Lightweight Model Relative to Benchmark
To compare MobileNetSRU, the best-performing model overall, with the benchmark in more detail and to find where accuracy gains are maximized, two model sizes (7.5k and 15k) were added. Figure 4 shows the mean F1 and per-class F1 of MobileNetSRU relative to the benchmark across model sizes (3k to 50k).
Overall, MobileNetSRU exhibited a size-dependent performance pattern, with stronger relative improvements at smaller model sizes and diminishing gains at larger sizes. This trend highlighted the specific efficiency advantages of MobileNetSRU’s lightweight design and the limitations of hybrid CNNRNN architectures as model size increases. The relative improvement of MobileNetSRU was highest at 7.5k, after which the performance gains declined gradually from 10k to 50k, with the 50k model showing only marginal relative improvement.
At the smallest model sizes (3k–5k), MobileNetSRU also outperformed the benchmark, but with larger variability. The improvement from 3k to 7.5k reflected the range in which MobileNetSRU obtained enough representational capacity to exploit its parameter-efficient design of compact convolutional blocks and recurrent temporal modeling. MobileNetV3Small effectively captured spatial information with inverted residual blocks, linear bottlenecks, and squeeze-and-excite mechanisms. Unlike transformers, the SRU avoided quadratic attention costs by providing efficient temporal modeling with simplified sequential recurrence. Furthermore, at smaller model sizes, the SRU benefited from the inductive bias inherent to sequential architectures (the model did not need to use its limited parameters to learn sequential relationships from scratch). When the model was small, these design advantages yielded better performance than the CNNTransformer benchmark. As the model size increased, however, the benchmark benefited more from the additional parameters, especially because transformer self-attention scales with model size (e.g., wider keys/queries stabilized attention, and deeper temporal receptive fields enabled modeling of fine-grained phenological differences). MobileNetSRU, in contrast, did not scale as effectively with model size because the SRU uses fixed projections (i.e., projecting the input data into gate and update components by one large linear transformation) and, owing to its sequential recurrence design, cannot expand temporal modeling complexity as transformers can.
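The "one large linear transformation" followed by a lightweight sequential recurrence can be sketched as below. This is a simplified SRU-style cell written under assumptions: gate details differ between published SRU variants, the variant used in the study is not specified in this section, and the function name, shapes, and random weights are illustrative. The highway connection requires the input and hidden widths to be equal.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_forward(x, W, b_f, b_r):
    """Simplified SRU-style layer. x: (T, d), W: (d, 3d), returns h: (T, d)."""
    T, d = x.shape
    # One fused projection for all time steps yields candidate, forget, and reset terms.
    u = x @ W
    x_tilde, f_in, r_in = u[:, :d], u[:, d:2 * d], u[:, 2 * d:]
    f = sigmoid(f_in + b_f)          # forget gate, depends on the input only
    r = sigmoid(r_in + b_r)          # reset (highway) gate, depends on the input only
    c = np.zeros(d)
    h = np.empty_like(x)
    for t in range(T):
        # The only sequential part is a cheap elementwise update of the state.
        c = f[t] * c + (1.0 - f[t]) * x_tilde[t]
        h[t] = r[t] * np.tanh(c) + (1.0 - r[t]) * x[t]   # highway connection to the input
    return h

rng = np.random.default_rng(0)
T, d = 6, 3                          # toy sequence length and width
x = rng.normal(size=(T, d))
W = rng.normal(scale=0.1, size=(d, 3 * d))
h = sru_forward(x, W, np.zeros(d), np.zeros(d))
```

The cost grows linearly with the number of time steps, and all learned weights sit in the single projection `W`, which is consistent with the observation that the recurrence itself offers little additional capacity to expand as the parameter budget grows.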
The following analysis examines the per-class performance of MobileNetSRU relative to the benchmark. The water class showed strong positive gains at 3k to 7.5k, gradually decreasing toward zero at larger sizes. Water has simple and stable spectral–temporal signatures. MobileNetSRU’s small, efficient filters captured these reliably, whereas the CNNTransformer was inefficient at small model sizes. As the model size increased, the transformer became better at modeling subtle temporal differences (e.g., seasonal changes), reducing MobileNetSRU’s advantage.
The accuracy gains for the developed class peaked at around 7.5k to 10k and narrowed slightly with larger sizes. Developed areas often exhibit mixed spectral characteristics but relatively stable temporal features. Hybrid CNNRNN architectures captured texture-like patterns efficiently at small sizes. At larger sizes, the transformer layers better captured structural heterogeneity, narrowing the performance gap.
The grass/shrub class showed consistent positive gains across 5k to 15k, peaking at 10k. This class is moderately separable but temporally variable. SRU’s efficient recurrence is effective for capturing moderate temporal variability but does not scale as strongly as transformer attention for long-range seasonal modeling. This explained the decline in relative performance after 15k. Notably, at the smallest model size (3k), MobileNetSRU underperformed against the CNNTransformer; this could be because both the MobileNet backbone and SRU became too compressed, while the transformer, although minimal, can still model some long-range temporal relationships through its attention mechanism, which is important for the grass/shrub class.
The forest class had positive gains at smaller sizes (3k to 10k), and relatively small gains at larger sizes, becoming near-zero at 25k to 50k. The forest class exhibited stable phenology with subtle but important temporal cues. CNNTransformer benefited from self-attention for modeling these long-term seasonal dynamics, causing MobileNetSRU to lose its advantage when the model size increased.
The bare land class showed the largest gains at small sizes (~20% improvement at 3k–10k) but this declined as model size increased (25k–50k). Bare areas had relatively simple spectral and temporal structures. CNNTransformer is inefficient under small model sizes, while MobileNetSRU captures these patterns efficiently. However, as model size increased, CNNTransformer scaled more effectively, enabling richer representation of the high intra-class variability of the bare class and reducing MobileNetSRU’s relative performance advantage.
The agriculture class exhibited modest gains (~2%), first increasing from 3k to 7.5k and then decreasing at larger sizes (15k–50k). MobileNetSRU effectively captured the mid-range phenological cycles of agriculture and performed comparably to CNNTransformer at very small sizes, reaching peak performance around 7.5k. Although this performance advantage decreased with model size, the temporal complexity of agriculture is not sufficient for the transformer to close the gap entirely, resulting in MobileNetSRU maintaining a slight performance edge even at 50k.
The wetland class showed moderate gains (5–10%) across most sizes, with gains increasing at smaller sizes (5k–7.5k) and then gradually declining as model size increased. Wetlands exhibit irregular and class-specific phenological patterns. MobileNetSRU benefited early from efficient temporal modeling, but as model size increased, the CNNTransformer better leveraged spatial–temporal heterogeneity, reducing MobileNetSRU’s relative advantage and yielding comparable performance at larger model sizes.
Collectively, the relative performance patterns indicate that MobileNetSRU was most competitive for small-to-mid-size models (5k–15k), where its architectural efficiency was maximized. The optimal point was 7.5k, after which performance declined relative to the CNNTransformer benchmark. The transformer architecture scaled more effectively with model size, providing stronger long-range temporal modeling for complex classes (forest, agriculture, and wetland). MobileNetSRU excelled at simple or moderately variable classes (water, bare land), especially with small model sizes (3k–10k). These patterns reinforce that different architectures perform best at different model sizes, and that the lightweight CNNRNN hybrid can outperform a transformer-based hybrid when the parameter budget must be small.
4.4. Model Performance with a Small Training Dataset
The results reported in Section 4.1, Section 4.2 and Section 4.3 used a large training dataset of approximately 500k samples. The large reference dataset was intended to isolate the effects of architectural differences from those of sampling limitations. Here, the analysis is expanded using a substantially smaller reference dataset of 50k samples to assess algorithmic performance under more realistic sampling sizes.
4.4.1. Benchmark Performance with 50k Training Samples
Figure 5 shows the CNNTransformer benchmark’s performance on 50k training samples across 50 repetitions and model sizes ranging from 3k to 10k parameters. Compared to the 500k training sample, the benchmark exhibited a clear reduction in mean F1 scores and increased variability, highlighting the stronger sensitivity of the transformer-based model to limited training data. The mean F1 distribution typically fell between 54% and 58%, a drop of approximately 20% to 25% compared to the larger dataset. The median F1 increased from 3k to 5k parameters, remaining approximately unchanged at 7.5k, and then decreased at 10k. This non-monotonic behavior suggests that, under limited training data, increasing transformer capacity can lead to overfitting rather than improved generalization.
Class performance largely mirrored that observed in the large 500k training sample experiment but with amplified disparities. The water and agriculture classes again achieved the highest F1 scores, reflecting their strong spectral and phenological separability even under data constraints. The developed, grass/shrub and forest classes showed moderate performance with increased variance, while the wetland and bare land classes experienced notable declines in both accuracy and stability. Wetland exhibited high variability, suggesting that limited training samples were insufficient to capture its heterogeneous and temporally inconsistent signatures. The bare class displayed very low F1 scores and high variability, reinforcing its dependence on larger training datasets to disentangle spectral ambiguity and intra-class variability.
Overall, the benchmark results on the 50k training sample dataset demonstrate that the CNNTransformer remains functional under reduced-data conditions but suffers from degraded performance and stability.
4.4.2. Testing of Lightweight Models Relative to the Benchmark with 50k Training Samples
Figure 6 summarizes the mean F1 difference of four lightweight models (MobileNetSRU, ConvNextKanSRU, MoviNet, and TVN) relative to the CNNTransformer benchmark under the 50k training sample size. MobileNetSRU emerged as the most consistently robust architecture for the small dataset size, maintaining a positive performance margin over the benchmark across the 3k to 10k parameter scales. At the 3k and 5k scales, it achieved a median F1 improvement of approximately 2% to 3%, which expanded to over 4% at the 7.5k and 10k scales. This sustained superiority suggests that the SRU’s recurrent gating mechanism captures temporal phenological patterns more effectively than the benchmark’s attention mechanism when training samples are sparse, likely because MobileNetSRU leverages sequential priors (temporal continuity and order dependence) and thereby maintains both parameter and data efficiency. ConvNextKanSRU initially underperformed against the benchmark at the 3k and 5k parameter scales, with a median F1 difference of approximately −4%. At 10k, however, its relative performance improved substantially, reaching a median of 4% above the benchmark. This performance recovery is driven in part by a degradation in benchmark performance at 10k under limited training data, while ConvNextKanSRU exhibits comparatively more stable behavior. This suggests that while the ConvNextKanSRU architecture may be disadvantaged under constrained parameter and data combinations, it becomes more resilient as model capacity increases.
The 3D CNN architecture of MoviNet exhibited a clear scaling effect under the limited-data regime of 50k training samples. At the 3k and 5k parameter scales, MoviNet slightly underperformed against the benchmark, with median F1 differences ranging from approximately −1% to −0.5%, likely reflecting the high degrees of freedom associated with joint 3D spatiotemporal convolutions, requiring sufficient capacity to be effectively utilized. At the 10k scale, MoviNet achieved a marked relative improvement, surpassing the benchmark with a median F1 difference of 6%. This gain is partially attributable to a degradation in benchmark performance at higher parameter counts under limited-data conditions, while MoviNet remains comparatively stable. These results suggest that under limited-data conditions, MoviNet exhibits robustness to increasing parameter budgets and is better able to leverage additional capacity to model holistic spatiotemporal features when benchmark architectures struggle to do so. Conversely, TVN exhibited a non-monotonic performance trend under the limited-data condition. The median F1 difference improved slightly from approximately −0.8% to −0.2% from 3k to 5k, followed by a decline to approximately −8% at the 7.5k scale. At the 10k scale, TVN showed a partial relative recovery; however, this improvement coincides with a degradation in benchmark performance. This pattern suggests that TVN is sensitive to parameter sizes under limited-data conditions, where increases in model size do not consistently translate into improved generalization. While additional parameters may alleviate underfitting at very low capacities, they may also exacerbate optimization instability or overfitting when training data are insufficient.
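The relative comparisons above rest on a per-repetition F1 difference between a lightweight model and the benchmark. The following is a minimal sketch of how such a summary could be computed, not the study's actual evaluation code. It assumes that repetitions are paired (same index, same model size) and that differences are expressed in percentage points; the function names and the toy scores are hypothetical.

```python
from statistics import mean, median

def f1_score(tp, fp, fn):
    """F1 for one class from true-positive, false-positive, false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_difference(model_runs, benchmark_runs):
    """Summarize paired per-repetition F1 differences in percentage points.

    Assumption: run i of the model is compared with run i of the benchmark.
    """
    if len(model_runs) != len(benchmark_runs):
        raise ValueError("model and benchmark need the same number of repetitions")
    diffs = [100.0 * (m - b) for m, b in zip(model_runs, benchmark_runs)]
    return {"median": median(diffs), "mean": mean(diffs)}

# Toy scores for illustration only (not results from the study).
toy_model = [0.60, 0.58, 0.61]
toy_benchmark = [0.56, 0.57, 0.55]
summary = f1_difference(toy_model, toy_benchmark)
print(summary)
```

A positive median indicates that the lightweight model beat the benchmark in more than half of the paired repetitions. Because the difference is relative, it can also rise when the benchmark degrades, as noted above for the 10k scale.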
5. Discussion and Limitations
Across the 25 training repetitions and five model sizes, the SRU-based lightweight hybrids (MobileNetSRU and ConvNextKanSRU) were the best performers overall, consistently beating the CNNTransformer benchmark at the small-to-medium parameter scales and retaining competitive performance at larger model sizes as shown in
Figure 3. Several interacting factors likely explain why SRU-based hybrids fit our data better than other models. First, the inductive bias of RNNs (i.e., the assumptions an algorithm uses to generalize from training data to unseen data), namely that past observations are relevant to current predictions, aligned with the annual phenology of the Landsat sequences. The Landsat time series in this study exhibit three key characteristics: (i) moderate sequence length (yearly sequences); (ii) strong but regular temporal dynamics driven by phenology; and (iii) class-dependent spectral stability (e.g., water vs. wetland). Under these conditions, SRU-based models benefit from the inherent sequential inductive bias that directly encodes temporal order and continuity, allowing efficient modeling of phenological trajectories with limited parameters [
46,
67]. In contrast, transformers with self-attention mechanisms, as well as the lightweight linear-attention variants, must learn temporal relations more flexibly. Unlike RNNs, they do not assume data is sequential by default; instead, they process all time steps simultaneously to identify global dependencies [
68]. This flexibility allows them to start with fewer assumptions, making them more general but also requiring more capacity (i.e., larger model sizes) and more data to discover and capture the sequential patterns that the SRU already has built in [
51]. Second, with its parameter-efficient design of a compact set of gates and parallelizable linear transforms, the SRU achieves good temporal modeling with few parameters, reducing overfitting on classes with limited or noisy signals and benefiting small models [
69]. Third, SRUs benefit from robustness to temporal irregularity and missing observations [
46]. We applied cloud filtering to the Landsat sequences, and these sequences also contain sensor transitions (L5, L7, L8/9). The SRU’s sequential gating and localized temporal dynamics tolerate uneven or sparse observations more naturally than attention mechanisms, which assume dense (i.e., frequent observations with few or no gaps over the sequence) and stable (i.e., fixed and regular time intervals between observations) temporal relationships [
70]. The observed advantage of the SRU-based architecture in this study is consistent with prior findings that recurrent models are well-suited for satellite image time-series classification. For instance, research has demonstrated that recurrent designs effectively capture seasonal dynamics in multi-temporal data [
71] and outperform traditional classifiers in complex tasks such as crop mapping [
72]. While transformer-based models can achieve superior performance when provided with sufficient capacity and massive datasets [
9,
73], they often require larger parameter scales to fully exploit their attention mechanisms [
74,
75].
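The contrast drawn above between attention, which does not assume sequential order by default, and recurrence, which encodes it structurally, can be illustrated with a toy example. The sketch below is a simplified illustration under stated assumptions, not the architectures used in the study: the attention has a single head, identity projections, and deliberately no positional encoding, and the recurrence uses a fixed forget gate. All function names and input values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(seq):
    """Self-attention without positional encoding, mean-pooled over time."""
    d = len(seq[0])
    outputs = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return [sum(o[j] for o in outputs) / len(seq) for j in range(d)]

def recurrent_pool(seq, forget=0.5):
    """Fixed-gate recurrence c_t = f * c_{t-1} + (1 - f) * x_t; returns the final state."""
    state = [0.0] * len(seq[0])
    for x in seq:
        state = [forget * s + (1.0 - forget) * xi for s, xi in zip(state, x)]
    return state

# Hypothetical three-step, two-band sequence and its time-reversed copy.
sequence = [[0.1, 0.9], [0.5, 0.4], [0.8, 0.2]]
reversed_sequence = sequence[::-1]

print(attention_pool(sequence), attention_pool(reversed_sequence))
print(recurrent_pool(sequence), recurrent_pool(reversed_sequence))
```

Reversing the time steps leaves the pooled attention output unchanged up to floating-point rounding, whereas the recurrent state changes. An attention model therefore has to learn temporal order from positional encodings and data, which is consistent with the higher capacity and data requirements discussed above.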
A notable pattern observed in
Figure 4 is that as model size increased, the CNNTransformer benchmark progressively closed the performance gap with the SRU-based models. This suggests that RNN-based temporal modules are less able to capitalize on increased model capacity, whereas transformer layers scale more effectively. This observation is consistent with prior findings in sequence modeling, where recurrent architectures tend to saturate in performance as depth increases, while attention-based models benefit more from scaling in both width and depth [
9,
76]. RNNs must process sequences sequentially, which limits parallelism, increases gradient decay over longer dependencies, and reduces their ability to benefit from deeper architectures. These limitations have been widely discussed in the literature on sequence learning, particularly in the context of long-term dependency modeling [
77,
78]. In contrast, transformer architectures are explicitly designed to scale: their parallel attention mechanism enables more effective use of additional parameters, supports the modeling of long-range temporal relationships, and maintains stable gradients as depth increases [
9]. These scaling advantages are widely recognized in other domains, most notably in NLP, where transformers have replaced RNNs as the foundation of modern LLMs [
60]. The lightweight video transformer (EVT) struggled to form stable and meaningful attention patterns at small model sizes, leading to low performance and high variability as shown in
Figure 3. This behavior is consistent with prior studies indicating that transformer-based models are sensitive to data availability and model size [
74]. However, as model size increased, the EVT showed clear improvements, suggesting that attention-based architectures may require greater capacity to effectively model the temporal structure of Landsat time series. With larger datasets or substantial pretraining, EVT may eventually match or surpass the CNNTransformer benchmark. Such scaling behavior has also been reported in both the computer vision and machine learning literature, where transformer performance improves significantly with increased training data and model complexity [
79]. A similar pattern is observed in the 3D CNN family: MoviNet outperformed the 2D + 1D TVN at larger model sizes. One possible reason is that MoviNet’s entirely joint spatiotemporal convolutions can better exploit additional parameters to learn richer and more holistic spatial–temporal features. This aligns with prior work showing that joint spatiotemporal feature learning can outperform factorized approaches when sufficient model capacity is available [
80,
81]. By analogy, a higher-capacity EVT, capable of jointly capturing spatial and temporal relationships within its attention mechanism, may ultimately outperform architectures that separate spatial encoding (CNN) and temporal modeling (transformer), provided that it has sufficient model size and training data. Besides jointly capturing spatial and temporal relationships, the video transformer architecture can perform dynamic token weighting across both space and time [
82], while the CNN + transformer architecture cannot perform this dynamic cross-dimensional weighting until after fixed spatial features are extracted by the CNN. This flexibility has been identified as a key advantage of attention mechanisms in modeling complex dependencies across dimensions [
9]. Moreover, while CNN spatial encoders excel at modeling local spatial patterns, attention-based mechanisms are superior at capturing global relationships [
9]. This distinction between local inductive bias (CNN) and global context modeling (attention) has been widely documented in both the computer vision and remote sensing literature [
74,
83]. This capacity for global context is crucial in large-scale RS LCLU classification, as local convolutional operations often struggle to integrate distant spatial information without extensive network depth [
84].
The transition from a 500k to a 50k training sample size highlights the critical role of inductive bias in maintaining model robustness when empirical evidence is sparse. While the CNNTransformer benchmark exhibited performance degradation, MobileNetSRU demonstrated resilience. This divergence reflects the sample inefficiency that is characteristic of transformer architectures, which typically rely on large training datasets to learn meaningful attention weights and stable temporal representations. In the absence of sufficient data, attention mechanisms struggle to reliably identify important time steps and long-range dependencies, particularly at small parameter scales where representational capacity is constrained [
9,
85,
86]. Conversely, the sequential inductive biases of the SRU provide a regularizing framework, enabling the model to maintain temporal representation despite limited observations [
87].
Another important mechanism underlying the superior performance of the SRU-based lightweight models is the separation between spatial encoding and temporal aggregation. In the convolutional and recurrent hybrid architectures, the convolutional backbone first compresses the 7 × 7 Landsat patches into compact spatial representations before temporal modeling is applied. This staged design reduces the dimensionality of the temporal learning problem and allows the recurrent module to focus primarily on phenological evolution rather than simultaneously resolving spatial texture and temporal relationships. In contrast, architectures such as 3D CNNs and video transformers attempt to jointly model spatial and temporal interactions from the beginning of the network [
12,
60], substantially increasing optimization complexity at small parameter scales. Under constrained parameter budgets, the SRU-based hybrids therefore allocate model capacity more efficiently by decomposing the learning process into two simpler subtasks of spatial feature extraction and temporal sequence modeling.
The SRU mechanism itself also provides several advantages for medium-spatial-resolution optical satellite image time series. Unlike traditional recurrent architectures such as LSTM or GRU, the SRU removes most recurrent dependencies from the heavy matrix multiplication operations, enabling parallel computation across temporal steps while still preserving sequential gating behavior [
46]. This design improves computational efficiency and stabilizes optimization, particularly when yearly Landsat sequences contain irregular temporal gaps caused by cloud masking, sensor transitions, or variable acquisition frequency. Because the hidden-state interaction in the SRU is relatively lightweight, the architecture can preserve temporal continuity without requiring the large parameter budgets typically needed for transformer-based architectures. This property is especially important in Landsat applications, where the temporal signal is often smooth and seasonally structured. Consequently, the SRU mechanism appears well aligned with the moderate temporal complexity and relatively stable annual phenology of Landsat observations.
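To make the separation between parallel matrix products and lightweight sequential gating concrete, the following is a minimal sketch of a single SRU layer in plain Python. It follows the commonly cited SRU formulation (forget gate, reset gate, elementwise state recurrence, and a highway connection) and assumes equal input and hidden dimensions; the exact variant and hyperparameters used in this study may differ, and the function and argument names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sru_layer(xs, W, Wf, Wr, vf, vr, bf, br):
    """Run one SRU layer over a sequence xs of d-dimensional vectors."""
    # Step 1: the matrix products depend only on the inputs, so they can be
    # computed for all time steps independently (in parallel on real hardware).
    projections = [(matvec(W, x), matvec(Wf, x), matvec(Wr, x)) for x in xs]

    # Step 2: the only sequential part is a cheap elementwise recurrence.
    d = len(xs[0])
    c = [0.0] * d
    hidden = []
    for x, (u, uf, ur) in zip(xs, projections):
        # Both gates read the previous state c_{t-1} through elementwise weights.
        f = [sigmoid(uf[i] + vf[i] * c[i] + bf[i]) for i in range(d)]
        r = [sigmoid(ur[i] + vr[i] * c[i] + br[i]) for i in range(d)]
        # State update: c_t = f * c_{t-1} + (1 - f) * (W x_t).
        c = [f[i] * c[i] + (1.0 - f[i]) * u[i] for i in range(d)]
        # Highway output: h_t = r * c_t + (1 - r) * x_t.
        hidden.append([r[i] * c[i] + (1.0 - r[i]) * x[i] for i in range(d)])
    return hidden
```

Under these assumptions, a layer of width d holds three d × d matrices and four d-dimensional vectors, i.e., 3d² + 4d parameters, and the time-dependent loop contains no matrix multiplication. This is the property that allows temporal continuity to be retained with a small parameter budget.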
The behavior of the transformer-based lightweight architectures further highlights the importance of matching architectural assumptions to the statistical structure of the data. The lightweight transformer hybrids underperformed at the smallest parameter scales, despite transformers being highly successful in other domains [
55,
62]. One explanation is that self-attention mechanisms rely on learning flexible pairwise temporal relationships directly from data [
9], but the limited parameter budgets at small scales may be insufficient to construct stable and discriminative attention patterns. Additionally, linear-attention approximations improve computational efficiency by compressing temporal interactions [
59], but this compression may discard the subtle phenological relationships needed for separating spectrally similar and temporally heterogeneous classes. As model size increased, the transformer-based models gradually improved, supporting the interpretation that attention mechanisms become increasingly effective once sufficient representational capacity is available.
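The compression performed by linear attention can be sketched as follows. This is an illustrative example of one common kernel-based formulation, with the feature map elu(x) + 1, and not necessarily the approximation used by the lightweight models in this study; the function names are hypothetical.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(Q, K, V):
    """Kernel-based linear attention over sequences of query, key, value vectors."""
    d, dv = len(K[0]), len(V[0])
    # All time steps are compressed into fixed-size summaries:
    # S (d x dv) accumulates phi(k) v^T, and z (d) accumulates phi(k).
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    # Each query reads only the summaries, never the individual time steps.
    outputs = []
    for q in Q:
        fq = phi(q)
        norm = sum(fq[i] * z[i] for i in range(d))
        outputs.append([sum(fq[i] * S[i][j] for i in range(d)) / norm for j in range(dv)])
    return outputs
```

The size of the summaries S and z does not grow with the number of time steps, which is the source of the efficiency gain. It also means that pairwise relationships between individual observations are only available through this fixed-size summary, consistent with the possibility raised above that subtle phenological relationships are lost.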
Despite the strong performance of the SRU-based lightweight architectures, several model-specific limitations should also be acknowledged. First, recurrent architectures inherently process temporal information sequentially, which may limit their ability to capture very-long-range dependencies compared to transformer-based architectures with global self-attention mechanisms. While the SRU improves computational parallelization relative to traditional recurrent models, its temporal modeling capacity may still saturate as model complexity increases, as observed in the diminishing relative gains at larger parameter scales. Second, the lightweight CNN backbone prioritizes parameter efficiency and local spatial feature extraction, which may reduce the ability to capture broader spatial context and complex landscape structures compared to heavier convolutional or fully attention-based architectures.
In addition, lightweight architectures involve an inherent trade-off between efficiency and representational flexibility. Mechanisms designed to reduce computational complexity, such as depthwise convolutions, compact gating structures, and linear-attention approximations, can improve parameter efficiency but may also restrict the richness of learned spatial–temporal representations. Consequently, lightweight models may become less effective when classification tasks require modeling highly irregular temporal dynamics, subtle class boundaries, or complex multi-scale spatial interactions. Although the present study demonstrates that lightweight architectures can achieve strong performance for yearly Landsat time-series classification, the extent to which these models generalize to denser temporal observations, higher spatial resolutions, or more heterogeneous ecological systems remains uncertain.
Our findings should be interpreted within the context of several limitations. First, the training samples used in this study were fixed to 7 × 7-pixel neighborhoods. Changing this size, either by expanding the patch to incorporate a wider spatial context or by reducing it to 3 × 3 or 5 × 5 to assess finer local feature extraction, could alter the spatial discrimination capabilities of the tested architectures. Second, the temporal inputs in this study were restricted to single-year Landsat sequences. While this captures intra-annual phenology, it does not evaluate the models’ ability to handle inter-annual variation or disturbance and recovery trajectories. Third, although the dataset spans diverse CONUS ecoregions, generalization to other global ecological zones (e.g., tropical or arid systems), where the phenological patterns and spectral separability may differ, remains untested. Fourth, this study focuses only on Landsat data; integrating multi-source data (e.g., Sentinel-2, SAR, DEM, or climate variables) may change model rankings by enhancing spatial resolution, temporal density, or feature richness. Future work should systematically evaluate cross-region transferability and multi-source fusion to better understand the robustness of lightweight architectures under more heterogeneous environmental conditions.
Another limitation is the use of parameter count as the primary measure of model efficiency. While parameter size provides a useful and architecture-independent measure for fair comparison, it does not fully capture operational efficiency in practical deployment scenarios. Different lightweight architectures may exhibit substantially different inference latency, memory access patterns, or hardware utilization characteristics even when parameter counts are similar. For example, recurrent operations, depthwise convolutions, and attention mechanisms may behave differently on GPUs, CPUs, or edge devices due to differences in parallelization efficiency and memory bandwidth requirements. Consequently, models with comparable parameter counts may still differ substantially in real-world inference speed and energy consumption. Future studies should therefore include hardware-aware benchmarking metrics such as inference latency, floating point operations (FLOPs), memory footprint, and power consumption across multiple deployment environments.
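A short calculation illustrates why parameter count alone does not determine computational cost. The sketch below uses standard analytic formulas for parameters and multiply-accumulate operations (MACs) of bias-free layers; the layer sizes are hypothetical and are not taken from the models in this study, and MACs are only a proxy for latency or energy, which would need to be measured on the target hardware.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a standard k x k convolution (no bias)."""
    params = c_in * c_out * k * k
    macs = params * h_out * w_out  # every weight is reused at each output pixel
    return params, macs

def depthwise_separable_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a depthwise k x k plus pointwise 1 x 1 convolution (no bias)."""
    params = c_in * k * k + c_in * c_out
    macs = params * h_out * w_out
    return params, macs

def dense_cost(n_in, n_out):
    """Parameters and MACs of a fully connected layer (no bias)."""
    params = n_in * n_out
    return params, params  # each weight is used exactly once per input vector

# Hypothetical layers: 8 -> 8 channels, 3 x 3 kernel, 7 x 7 output map.
print(conv2d_cost(8, 8, 3, 7, 7))               # (576, 28224)
print(dense_cost(24, 24))                       # (576, 576)
print(depthwise_separable_cost(8, 8, 3, 7, 7))  # (136, 6664)
```

In this example the standard convolution and the dense layer have identical parameter counts (576) but differ by a factor of 49 in MACs, because convolutional weights are reused at every spatial position. This is one reason why models matched on parameter count may still differ in inference cost.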
The current study evaluates yearly Landsat sequences as independent samples and does not explicitly model long-term ecological trajectories or abrupt disturbance events. Processes such as wildfire recovery, urban expansion, forest harvesting, or long-term wetland transitions often evolve across multiple years or decades. While yearly sequences capture intra-annual phenology effectively, they may not fully represent slower ecological dynamics that require multi-year temporal context. Similarly, the study focuses exclusively on classification performance and does not investigate model interpretability or uncertainty estimation. Understanding which temporal observations or spatial regions contribute most strongly to lightweight model predictions could provide valuable ecological insight and improve reliability for operational Earth-observation applications. Future work should investigate interpretable lightweight architectures, uncertainty-aware prediction frameworks, and long-term temporal modeling strategies for large-scale remote sensing monitoring tasks.