LACE-Net: A Swin Transformer with Local Frequency-Domain Energy and Adaptive Contrast Enhancement for Fine-Grained Land Cover Classification

Tan, Yongmei; Chen, Gong; Huang, Yan; Ye, Hengzhou; Tang, Jincheng

doi:10.3390/computers15050281


by Yongmei Tan ¹, Gong Chen ¹,²,³, Yan Huang ²,⁴, Hengzhou Ye ¹,³,* and Jincheng Tang ²,⁴,*

¹ College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China
² Technology Innovation Center for Natural Resources Monitoring and Evaluation of Beibu Gulf Economic Zone, Ministry of Natural Resources, Nanning 530219, China
³ Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin 541006, China
⁴ Guangxi Institute of Natural Resources Survey and Monitoring, Nanning 530201, China
* Authors to whom correspondence should be addressed.

Computers 2026, 15(5), 281; https://doi.org/10.3390/computers15050281

Submission received: 27 March 2026 / Revised: 19 April 2026 / Accepted: 22 April 2026 / Published: 28 April 2026

(This article belongs to the Special Issue Advanced Image Processing and Computer Vision (2nd Edition))


Abstract

The Swin Transformer exhibits limitations in fine-grained land use and land cover (LULC) classification, particularly in capturing high-frequency texture details and representing low-contrast regions. To address these issues, we propose a novel network model, termed LACE-Net, which integrates local frequency-domain energy and adaptive contrast enhancement. Built upon the Swin Transformer backbone, the model introduces a Local Frequency-Domain Energy-Adaptive Contrast Enhancement Multi-Scale Attention (LACE) block. This block consists of parallel branches for frequency-domain perception and contrast enhancement, which combine physical priors on texture and illumination. In addition, a texture-adaptive momentum adjustment mechanism is incorporated to refine the spatial enhancement attention weights dynamically. Consequently, LACE-Net strengthens the modeling and representation of high-frequency details and complex spatial structural features. Experiments are performed on a self-constructed Guangxi regional dataset (denoted as GLC-30) and the publicly available remote sensing scene classification benchmark dataset NWPU-RESISC45. The results show that LACE-Net achieves a Top-1 accuracy (Top-1 Acc) of 96.48% and a macro-averaged F1 score (mF1) of 93.13%. These results outperform current mainstream vision models, particularly in mitigating the spectral confusion issue of “same spectrum, different objects.” The model exhibits superior fine-grained classification performance and robust generalization across datasets.

Keywords: fine-grained land cover classification; Swin Transformer; local frequency-domain energy; adaptive contrast enhancement

1. Introduction

Land resources constitute the fundamental material basis for human survival and development, serving as the core medium for ecological civilization construction. With the rapid development of Earth big data and multi-source remote sensing technologies, high-resolution remote sensing imagery has become an essential foundation for natural resource surveys, spatial planning and precision agriculture management [1]. However, while Very High Resolution (VHR) remote sensing imagery provides rich spatial details, it simultaneously amplifies intra-class spectral variability and diminishes subtle inter-class disparities [2]. As a result, land cover classes are easily confused during recognition. In particular, traditional feature extraction methods often fail to meet high-precision requirements when faced with complex scenarios, such as the phenological variations of crops during different growth stages or the subtle differences between similar tree species within forest lands. In recent years, the evolution of deep learning (DL) [3,4] has provided new opportunities to overcome these bottlenecks. Models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) leverage hierarchical feature learning to significantly outperform traditional methods across various remote sensing (RS) tasks [5].

Despite recent progress, existing mainstream vision models still face significant challenges in fine-grained land cover recognition. As the complexity of land categories increases, inter-class variances diminish while intra-class variations expand [6]. This leads to frequent misclassifications, particularly in scenarios involving weak textures, low-contrast regions, or high levels of background noise. For instance, the subtle textural disparities among various tree species in dense forest regions often exhibit feature convergence under complex illumination, which underscores the limitations of current models in texture enhancement, local structural representation, and contrast sensitivity. Specifically, while the Swin Transformer [7] balances local and global modeling via its shifted window mechanism, it fundamentally relies on spatial–pixel domain interactions. Consequently, it struggles to capture high-frequency textural details that are more apparent in the frequency domain. Similarly, although standard ViTs excel at modeling long-range dependencies, they perform poorly on low-contrast regions and blurred boundaries. Most existing models make insufficient use of spatial–frequency domain information and yield low feature saliency under varying lighting conditions. This results in limited feature discriminability when dealing with the “same spectrum, different objects” phenomenon or highly similar textures, such as tree species within a forest. To address these limitations, we propose a fine-grained land cover recognition network named LACE-Net, which integrates local frequency-domain energy and adaptive contrast enhancement.

The primary contributions of this work are summarized as follows:

(1): A novel network architecture, LACE-Net, is proposed by integrating local frequency domain energy with adaptive contrast enhancement. Employing the Swin Transformer as the backbone, the model innovatively embeds the Local Frequency-Domain Energy-Adaptive Contrast Enhancement Multi-Scale Attention (LACE) block, which incurs minimal parameter overhead. This design effectively mitigates key challenges such as the confusion of similar land cover categories, insufficient utilization of high-frequency textural information, and limited feature representation in low-contrast regions.
(2): A texture-adaptive momentum adjustment mechanism is designed to enhance feature discriminability. By leveraging the physical priors extracted from the local frequency-domain energy and contrast enhancement branches, this mechanism utilizes dynamically adjusted momentum coefficients to adaptively optimize the weight distribution of spatial attention. This approach significantly improves the model’s ability to distinguish complex textures.
(3): Systematic experimental validation is conducted on both a self-constructed regional dataset and a public benchmark. Through field data collection and rigorous screening, this study has developed the Guangxi Regional Dataset (GLC-30), which includes 30 categories of fine-grained land features. Experimental results on GLC-30 and NWPU-RESISC45 demonstrate that LACE-Net achieves higher classification accuracy than mainstream vision models, including ResNet, EfficientNet, and the original Swin Transformer. These findings support the effectiveness of the proposed model in fine-grained land cover classification tasks.

2. Related Work

2.1. Traditional LULC Classification Research

Early research in LULC primarily focused on the extraction of spectral statistical features and the application of shallow machine learning (ML) algorithms. Supervised classification algorithms, exemplified by Maximum Likelihood Classification (MLC), serve as traditional benchmarks for land cover recognition due to their well-established statistical decision-making mechanisms. As a parametric statistical approach, MLC assigns pixels to the category with the highest probability. This allows it to achieve superior accuracy compared to the Minimum Distance and Mahalanobis Distance methods, particularly when handling complex land features characterized by overlapping signatures [8]. However, the performance limitations of purely spectral-based statistical methods in complex scenarios prompted a shift toward integrating ML algorithms with multidimensional features. Algorithms such as Random Forest (RF) and Support Vector Machines (SVMs) effectively addressed the constraints of traditional statistical assumptions by operating within high-dimensional feature spaces [9]. Specifically, RF has been extensively applied to multi-source data classification due to its inherent robustness to noise. For instance, previous studies have integrated Sentinel-2 imagery with auxiliary geographic data, utilizing the Normalized Difference Vegetation Index (NDVI) and red-edge indices to construct multidimensional feature sets. This approach has significantly improved classification accuracy for complex vegetation types, such as various forest species [10].

To further enhance feature discriminability and mitigate the salt-and-pepper noise effect, feature engineering and spatial analysis techniques have undergone extensive development. On one hand, constructing multidimensional feature spaces by integrating spectral indices (e.g., NDVI) with textural features (e.g., Gray-Level Co-occurrence Matrix, GLCM) has proven effective. This approach alleviates the confusion between spectrally similar classes in heterogeneous tropical landscapes—such as distinguishing primary from secondary mangroves—while strengthening the model’s ability to discriminate between specific vegetation, cropland, and open spaces [11]. On the other hand, Object-Based Image Analysis (OBIA) provides a structured solution by using image segmentation to aggregate homogeneous pixels into object units. By leveraging the shape, texture, and topological relationships of these objects, OBIA effectively addresses the high intra-class spectral variation inherent in high-resolution imagery [12]. Furthermore, to address the challenge of limited labeled samples, probabilistic frameworks based on the Variational Bayesian Gaussian Mixture Model (VBGMM) have demonstrated effectiveness in high-resolution crop classification [13]. Overall, while optimizing training data selection strategies can improve classification results to some extent [14], it does not overcome the fundamental limitations of traditional methods. These methods rely heavily on manual feature engineering and struggle to adaptively extract deep discriminative features. Especially in fine-grained classification tasks, traditional approaches lack deep mechanisms for mining subtle spectral energy differences and explicit contrast enhancement tools, making them insufficient for high-precision mapping requirements.

2.2. Research on LULC Classification Based on Deep Learning

Driven by the rapid advancement of computational power, land cover recognition has shifted from manual feature engineering toward data-driven deep feature learning. Convolutional Neural Networks (CNNs), with their powerful local feature extraction capabilities, significantly outperform traditional algorithms in representation robustness and have become the mainstream architecture for remote sensing (RS) image interpretation [15]. By leveraging translation invariance and local receptive fields, CNNs can automatically extract multi-level spatial features, demonstrating excellent performance in land use scene classification [16]. In the field of semantic segmentation, the DeepLab series has addressed the limitation of restricted receptive fields in standard CNNs. By introducing dilated convolution, these models capture long-range contextual information without reducing spatial resolution, serving as a critical cornerstone for high-resolution land cover recognition [17]. For hyperspectral data, 3D convolutional architectures combined with spectral–spatial residual networks (SSRN) extract discriminative features directly from spectral–spatial cubes. This approach underscores the importance of jointly utilizing spectral and spatial information to resolve inter-class confusion [18].

The integration of attention mechanisms, which simulate the focusing process of the human visual system, has further enhanced feature discriminability. Modules such as Squeeze-and-Excitation (SE) [19] and Convolutional Block Attention Module (CBAM) [20] improve the network’s responsiveness to critical land cover features by adaptively weighting channel and spatial characteristics. Furthermore, the Efficient Multi-Scale Attention (EMA) module [21] reduces computational overhead while preserving channel information. This method reshapes the channel dimension into the batch dimension for feature grouping, utilizing parallel sub-networks to extract global coordinate context and local texture information through a cross-spatial learning mechanism. Such a strategy—generating attention maps by aggregating global channel weights with local feature matrices—provides a significant theoretical foundation for the Spatial Enhanced Attention (SEA) branch proposed in this paper. Building on these advancements, dual-stream network architectures that fuse spatial and spectral attention have successfully suppressed redundant noise in optical and SAR image fusion, thereby improving feature fusion effectiveness [22]. However, CNNs remain inherently limited by the local receptive fields of their convolutional kernels, presenting a fundamental drawback in establishing long-range global dependencies.

To overcome this bottleneck, the Transformer architecture has emerged as the current technological frontier. Vision Transformer (ViT) models global features through the self-attention mechanism [23], demonstrating superior classification performance compared to traditional CNNs when processing satellite imagery with complex spatial relationships [24]. Building on this, the Swin Transformer introduces a hierarchical shifted window mechanism, which maintains global modeling capabilities while significantly reducing computational complexity [7]. Consequently, it has become a highly effective backbone network for fine-grained image analysis tasks, such as land cover recognition.

Current research trends have long shifted toward hybrid architectures, fine-grained models, and foundational large models. To balance local details with global semantics, hybrid frameworks combining Transformers and CNNs have become a significant research focus. For instance, the land cover classification network (LCC-Net) composite model utilizes the self-attention mechanism of the Swin Transformer to capture complex spatial features and temporal dynamics, while leveraging the feature extraction capabilities of CNNs, demonstrating excellent performance in natural disaster monitoring [25]. For large-scale, fine-grained object classification, the PatchOut framework extracts long-range dependencies and local complementary features by coupling simplified Transformer and CNN modules. Combined with a multi-scale spatial–spectral feature fusion mechanism, it demonstrates exceptional accuracy and computational efficiency in complex large-scale classification tasks [26]. To address the challenge of blurred boundaries in fine-grained classification, recent focused attention-based deformable convolutional networks have introduced multi-scale contour rendering mechanisms, significantly improving the segmentation of small objects in high-resolution imagery [27]. Furthermore, foundation models such as the Segment Anything Model (SAM) have been introduced to the remote sensing domain; models like FlexiSAM have further enhanced the flexibility and generalization capabilities of multi-modal land cover recognition [28]. While SpectralFormer introduces Transformers from a spectral sequence perspective, such pure attention mechanisms risk over-abstraction in very deep networks, potentially neglecting the original physical spectral responses [29]. Although existing attention mechanisms facilitate feature weighting, they lack explicit adaptive contrast enhancement modules to linearly stretch the distances of hard-to-classify samples within the feature space.

Consequently, in scenarios involving complex illumination or the “same spectrum, different objects” phenomenon, classification boundaries remain insufficiently distinct.

To address the aforementioned challenges in high-resolution LULC classification, we propose a novel approach termed LACE-Net, which integrates local frequency-domain energy with adaptive contrast enhancement. The proposed method aims to compensate for the loss of high-frequency spatial structural information in deep networks by introducing a frequency-aware mechanism. Simultaneously, it employs an adaptive contrast enhancement strategy to explicitly amplify the discriminative disparities of hard-to-classify samples within the feature space. This design is not only capable of recovering subtle textural variations—which are often smoothed out by deep layers—from a physical prior perspective, but also effectively resolves the challenges of “same spectrum, different objects” and blurred boundaries in fine-grained classification by enhancing local feature contrast.

3. Methodology

3.1. System Overview

This paper proposes a fine-grained land cover classification model that integrates local frequency-domain energy and adaptive contrast enhancement. The model is designed to overcome the feature representation bottlenecks of existing vision models in extracting subtle textures and handling low-contrast scenarios. Unlike traditional methods that rely solely on the aggregation of spatial-domain features, our approach introduces a “frequency–spatial dual-stream collaborative” processing paradigm. The overall system architecture is illustrated in Figure 1 and consists of three core processing stages:

(1): Utilizing the Swin Transformer as the backbone, a hierarchical feature extraction path is constructed through its shifted window mechanism. This allows the model to capture multi-scale spatial semantics ranging from local details to global contexts.
(2): To address the inherent insensitivity of Transformer architectures to high-frequency information, the LACE block is embedded between Stage 2 and Stage 3 of the backbone network. Serving as a physics-aware corrector, this module guides feature reconstruction by decoupling frequency-domain textural energy from spatial-domain contrast features, thereby incorporating physical priors into the learning process.
(3): Feature maps calibrated through both frequency and spatial domains are fed into subsequent Transformer blocks for high-level semantic abstraction. Finally, the land cover recognition results are generated via Global Average Pooling (GAP) and a linear classification head.

3.2. LACE Block

Traditional neural networks often suffer from over-smoothing effects when processing fine image textures and experience feature degradation under complex illumination conditions. To address these limitations, we design the LACE block. The core innovation of this module lies in preserving spatial structures via the SEA branch, while simultaneously extracting physical constraints from the frequency and contrast domains through the Local Frequency Energy (LFE) branch and the Contrast Enhancement Branch (CEB). Finally, a momentum mechanism is employed to achieve dynamic feature weighting. The detailed architecture is illustrated in Figure 2.

3.2.1. SEA Branch

The SEA branch is a simplified variant derived from the Efficient Multi-Scale Attention (EMA) module [21]. Both architectures employ a group-based interaction strategy. Specifically, given an input feature tensor X ∈ ℝ^{b×c×h×w}, where c is the total number of channels, the tensor is first partitioned into g sub-groups along the channel dimension, denoted as X = [X₁, X₂, …, X_g], where each sub-group is X_i ∈ ℝ^{b×(c/g)×h×w}. For each sub-group X_i, a direction-aware coordinate pooling mechanism is introduced. This process aggregates features along the height (h) and width (w) dimensions separately to generate a pair of direction-aware feature vectors, Z_h ∈ ℝ^{(c/g)×h×1} and Z_w ∈ ℝ^{(c/g)×1×w}. For the k-th channel, the calculation formulas are as follows:

Z_{k}^{h} (h, 1) = \frac{1}{w} \sum_{j = 1}^{w} X_{i, k} (h, j)

(1)

Z_{k}^{w} (1, w) = \frac{1}{h} \sum_{j = 1}^{h} X_{i, k} (j, w)

(2)
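As an illustration of Equations (1) and (2), the following minimal NumPy sketch pools a single channel sub-group along each spatial direction. The function name `coordinate_pool` and the toy tensor are assumptions made for this example and are not taken from the authors' implementation.

```python
import numpy as np

def coordinate_pool(x_group):
    """Directional average pooling for one channel sub-group X_i of one sample.

    x_group has shape (c_g, h, w) with c_g = c / g.
    Returns z_h of shape (c_g, h, 1) and z_w of shape (c_g, 1, w).
    """
    z_h = x_group.mean(axis=2, keepdims=True)  # Eq. (1): average over the width
    z_w = x_group.mean(axis=1, keepdims=True)  # Eq. (2): average over the height
    return z_h, z_w

# Toy sub-group with c_g = 2, h = 3, w = 4 (hypothetical values).
x = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
z_h, z_w = coordinate_pool(x)
print(z_h.shape, z_w.shape)
```

Each entry of `z_h` summarizes one row and each entry of `z_w` summarizes one column, which is what lets the later 1 × 1 convolution relate positions that are far apart along either axis.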

To capture cross-dimensional long-range dependencies, Z_h is concatenated with the transposed Z_w along the spatial dimension. Cross-channel information interaction is then performed via a 1 × 1 convolution, followed by the Sigmoid activation function σ to scale the original group features. To enhance the stability of intra-group features, Group Normalization (GroupNorm), with the number of groups equal to the number of channels, is applied to the adjusted features to obtain the intermediate attention feature A. Unlike the original EMA mechanism, which employs complex cross-spatial weight aggregation through bidirectional parallel paths (i.e., utilizing attention weights from two paths for weighted summation), the SEA branch introduces a streamlined interaction logic. In EMA, the system must simultaneously compute dual attention maps based on spatially enhanced features and local convolutional features. In contrast, the SEA branch simplifies this process into a unidirectional feature reconstruction mechanism. Specifically, a 3 × 3 convolution in a parallel branch is first employed to compute the local feature X₂. Following this, global average pooling (GAP) is applied to the attention feature A, and a Softmax function is utilized to calculate the channel-wise adaptive weight vector. Subsequently, this weight vector is fused with the reshaped local feature X’₂ via matrix multiplication. Finally, a Sigmoid activation function is applied to generate the spatial attention weight matrix W:

W = σ (\mathrm{Softmax} (\mathrm{GAP} (A))^{T} \otimes X'_{2})

(3)

where ⊗ denotes matrix multiplication, GAP stands for Global Average Pooling, and X’₂ represents the flattened spatial representation of the local features X₂. Ultimately, each sub-group produces reconstructed features through spatial weighting. All sub-groups are then concatenated to restore the original dimensions, resulting in the spatially weighted feature representation X_sea. This refinement allows the SEA branch to retain the advantages of EMA in capturing long-range spatial dependencies while eliminating redundant cross-aggregation computations. Consequently, this significantly reduces the model’s computational overhead and parameter complexity.
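The shape bookkeeping in Equation (3) can be sketched as follows. This is a minimal NumPy illustration for one sample and one sub-group, assuming that A and X₂ both have shape (c/g, h, w). The helper names are hypothetical, and the convolutions and GroupNorm that produce A and X₂ are omitted.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract the max for numerical stability
    return e / e.sum()

def sea_spatial_weight(a, x2):
    """Eq. (3) for one sub-group: W = Sigmoid(Softmax(GAP(A))^T (x) X'_2).

    a:  intermediate attention feature A, shape (c_g, h, w)
    x2: local feature X_2 from the 3x3 branch, shape (c_g, h, w)
    """
    c_g, h, w = x2.shape
    channel_w = softmax(a.mean(axis=(1, 2)))   # Softmax(GAP(A)), shape (c_g,)
    x2_flat = x2.reshape(c_g, h * w)           # X'_2, shape (c_g, h*w)
    logits = channel_w[None, :] @ x2_flat      # (1, c_g) @ (c_g, h*w) -> (1, h*w)
    weight = 1.0 / (1.0 + np.exp(-logits))     # Sigmoid
    return weight.reshape(h, w)                # one weight per spatial position

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3, 5))
x2 = rng.standard_normal((4, 3, 5))
w_map = sea_spatial_weight(a, x2)
print(w_map.shape)
```

The matrix product collapses the channel axis, so the result is a single h × w map per sub-group whose entries lie strictly between 0 and 1.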

3.2.2. Physics-Based Frequency-Domain Energy Sensing

This represents the core innovation of the proposed module. Traditional texture extraction typically relies on “black-box” convolutional operations, which lack interpretability. Based on signal processing theory, this work constructs the LFE and CEB to quantitatively characterize textural complexity and illumination saliency from a physical perspective.

According to Parseval’s Theorem, the energy of a signal in the time (or spatial) domain equals its energy in the frequency domain, up to a normalization constant. In image processing, textural details primarily correspond to high-frequency components. Since performing a direct Fast Fourier Transform (FFT) is computationally expensive, we employ gradient operators as a spatial approximation of high-frequency components. Given a local image patch I(x, y), the gradient magnitude |∇I| = √(f_h² + f_v²) reflects the intensity of signal variations (i.e., high-frequency strength), where f_h and f_v are the horizontal and vertical responses. To capture subtle textural oscillations, we designed orthogonal frequency-aware filters, namely a horizontal kernel Conv_1×5(x) and a vertical kernel Conv_5×1(x), to approximate gradient calculations and extract these orthogonal high-frequency responses. Based on this, we define the normalized local frequency energy map E_f^norm to quantify texture intensity. Mathematically, this is similar to calculating the energy density of the local high-frequency signal:

E_{f}^{\mathrm{norm}} = \frac{\sqrt{\mathrm{GAP} [\max (f_{h}^{2} + f_{v}^{2}, 10^{-6})] + ε}}{\max_{b} (\sqrt{\mathrm{GAP} [\max (f_{h}^{2} + f_{v}^{2}, 10^{-6})] + ε})}

(4)

In this context, f_h² and f_v² represent the local gradient energy, which serves as a spatial-domain approximation for the energy of high-frequency components. GAP[·] denotes a global average pooling operation over the spatial dimensions. The term max_b(·) extracts the maximum energy scalar value within the current batch. To suppress background noise, max(·, 10⁻⁶) is employed to clamp extremely low-energy responses to a floor of 10⁻⁶. Additionally, a numerical stability constant ε is set to 10⁻⁵ to prevent division by zero or other numerical instabilities. The normalized energy map E_f^norm provides an intuitive quantification of the textural complexity of the input image, establishing a physical basis for subsequent dynamic parameter adjustments.
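The normalized energy of Equation (4) can be illustrated with a small NumPy sketch. In the paper the 1 × 5 and 5 × 1 kernels are learned; here a fixed antisymmetric kernel `K` is assumed purely for illustration, the input is a batch of single-channel images, and all function names are hypothetical.

```python
import numpy as np

# Hypothetical fixed 5-tap derivative-like kernel; the paper's kernels are learned.
K = np.array([-1.0, -2.0, 0.0, 2.0, 1.0]) / 6.0

def directional_response(batch, axis):
    """Apply the 5-tap kernel along one spatial axis with zero 'same' padding."""
    return np.apply_along_axis(lambda r: np.convolve(r, K, mode="same"), axis, batch)

def normalized_frequency_energy(batch, eps=1e-5, floor=1e-6):
    """Eq. (4) for a batch of shape (b, h, w); returns one scalar per sample."""
    f_h = directional_response(batch, axis=2)         # horizontal (1x5) response
    f_v = directional_response(batch, axis=1)         # vertical (5x1) response
    energy = np.maximum(f_h ** 2 + f_v ** 2, floor)   # max(., 1e-6) floor
    e = np.sqrt(energy.mean(axis=(1, 2)) + eps)       # GAP over space, then sqrt
    return e / e.max()                                # divide by the batch maximum

rng = np.random.default_rng(0)
flat = np.zeros((8, 8))         # textureless sample
textured = rng.random((8, 8))   # sample with strong local variation
e_norm = normalized_frequency_energy(np.stack([flat, textured]))
print(e_norm)
```

Because the normalization divides by the batch maximum, the most textured sample in a batch always receives the value 1, so the score is relative to the other samples in that batch rather than absolute.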

The CEB is designed to extract edge features that are robust to illumination variations. This branch first utilizes a 3 × 3 convolution to capture local brightness changes, followed by an absolute value operation to reinforce the edge response. Subsequently, the initial contrast feature B is generated through Batch Normalization and a Sigmoid activation function. To adaptively handle regions with shadows or overexposure, a learnable temperature parameter τ is introduced. This parameter facilitates a power-law transformation to non-linearly regulate B, resulting in the final contrast mask B_τ:

B_{τ} = B^{\frac{1}{τ}}

(5)

In this formulation, τ controls the steepness of the activation function. Specifically, when τ < 1, the mechanism amplifies subtle disparities, whereas when τ > 1, it effectively suppresses noise. This adjustment results in the generation of a high-saliency contrast feature map.
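A few numbers make the effect of Equation (5) concrete. The sketch below assumes B lies in (0, 1), as it does after a Sigmoid. The sample values and the function name `contrast_mask` are chosen only for illustration.

```python
def contrast_mask(b, tau):
    """Eq. (5): power-law transform B_tau = B ** (1 / tau) for B in (0, 1)."""
    return b ** (1.0 / tau)

weak, strong = 0.1, 0.2  # two hypothetical low-contrast responses

for tau in (0.5, 1.0, 2.0):
    lo, hi = contrast_mask(weak, tau), contrast_mask(strong, tau)
    print(f"tau={tau}: {lo:.3f}, {hi:.3f}, ratio={hi / lo:.2f}")
```

With τ = 0.5 the exponent is 2, so both values shrink while the ratio between them grows from 2 to 4. With τ = 2 the exponent is 0.5, so both values rise and the ratio falls to about 1.41. With τ = 1 the mask is unchanged.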

3.2.3. Texture-Adaptive Momentum Adjustment Mechanism

To achieve a dynamic equilibrium in feature fusion, this paper proposes an adaptive momentum adjustment mechanism based on textural complexity. This mechanism first utilizes the local frequency-domain energy E_f^norm to quantify textural complexity and dynamically calculate the momentum coefficient m. In high-frequency complex regions, a larger E_f^norm results in a smaller m, thereby reducing historical inertia to enhance the instantaneous capture of local detail variations. Conversely, in low-frequency smooth regions, a smaller E_f^norm leads to a larger m, increasing historical inertia to suppress random noise fluctuations effectively. Secondly, by fusing the contrast mask with frequency-domain energy, we extract an instantaneous prediction weight α_pre that reflects the current input. Using m as a regulatory valve, we perform a weighted update between the global historical weight α_lfe and the instantaneous prediction weight α_pre. Finally, the updated global historical weight and the instantaneous prediction weight are fused via weighted averaging to obtain the final enhancement factor α, which is then applied to the feature map to complete the adaptive reconstruction. The instantaneous prediction weight α_pre is defined by the following formula:

α_{\mathrm{pre}} = σ (W_{2} \cdot φ (W_{1} \cdot X_{\mathrm{fusion}}))

(6)

In this formulation, W₁ and W₂ represent 1 × 1 convolutional kernels; φ(·) denotes the GELU activation function; and σ(·) is the Sigmoid function. X_fusion is a multimodal feature constructed by concatenating the original features, the contrast-enhanced features B_τ fused with E_f^norm, and the frequency energy map E_f^norm along the channel dimension.
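Equation (6) is a two-layer pointwise network, which can be sketched in NumPy by treating each 1 × 1 convolution as a matrix applied across channels at every pixel. The channel counts, the hidden width, and the tanh approximation of GELU are all assumptions made for this illustration. The source does not specify them.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; the exact variant is not stated)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(weight, x):
    """1x1 convolution without bias: weight (c_out, c_in), x (c_in, h, w)."""
    return np.einsum("oc,chw->ohw", weight, x)

def predict_alpha(x_fusion, w1, w2):
    """Eq. (6): alpha_pre = Sigmoid(W2 . GELU(W1 . X_fusion))."""
    return sigmoid(conv1x1(w2, gelu(conv1x1(w1, x_fusion))))

rng = np.random.default_rng(0)
x_fusion = rng.standard_normal((6, 4, 4))  # hypothetical 6 fused channels
w1 = rng.standard_normal((3, 6))           # hypothetical hidden width of 3
w2 = rng.standard_normal((1, 3))           # hypothetical single output channel
alpha_pre = predict_alpha(x_fusion, w1, w2)
print(alpha_pre.shape)
```

The final Sigmoid keeps every entry of `alpha_pre` in (0, 1), so it can act as a gating weight.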

To maintain a balance between historical priors and instantaneous features, we define the textural complexity-based momentum coefficient m as follows:

m = m_{\mathrm{base}} - (m_{\mathrm{base}} - m_{\mathrm{min}}) \cdot E (E_{f}^{\mathrm{norm}})

(7)

In this formulation, m_base is a constant set to 0.9, and m_min is a constant set to 0.6. Using this momentum coefficient, the model maintains a global historical weight α_lfe, which is updated according to the following Exponential Moving Average rule:

α_{\mathrm{lfe}} = m α_{\mathrm{lfe}} + (1 - m) E (α_{\mathrm{pre}})

(8)

In this context, the global historical weight α_lfe is initialized to a constant value of 0.5. Its primary purpose is to smooth instantaneous prediction fluctuations by retaining memory of historical statistical features. The operator E(·) denotes the averaging function across the current training batch dimension. The final feature adjustment coefficient α is jointly determined by the instantaneous weight and the historical weight. Finally, the enhancement factor α is applied alongside the inter-group channel calibration weights to the spatial features output by the SEA branch. To correct for channel information bias introduced by grouping, a residual connection is employed to complete the adaptive reconstruction.
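Equations (7) and (8) reduce to two scalar updates per batch, sketched below in plain Python with the constants stated in the text (m_base = 0.9, m_min = 0.6, α_lfe initialized to 0.5). The batch statistics fed in are made-up values, and the function names are hypothetical. The final fusion of α_lfe and α_pre into α is left out because its weights are not given.

```python
def momentum_coefficient(mean_energy, m_base=0.9, m_min=0.6):
    """Eq. (7): more texture (higher mean E_f^norm) gives a smaller momentum."""
    return m_base - (m_base - m_min) * mean_energy

def update_history(alpha_lfe, mean_alpha_pre, m):
    """Eq. (8): exponential moving average of the batch-mean prediction weight."""
    return m * alpha_lfe + (1.0 - m) * mean_alpha_pre

alpha_lfe = 0.5  # initial global historical weight
# Hypothetical (mean E_f^norm, mean alpha_pre) pairs for three consecutive batches.
batches = [(0.1, 0.8), (0.5, 0.8), (1.0, 0.8)]
for mean_energy, mean_alpha_pre in batches:
    m = momentum_coefficient(mean_energy)
    alpha_lfe = update_history(alpha_lfe, mean_alpha_pre, m)
    print(f"m={m:.2f}  alpha_lfe={alpha_lfe:.4f}")
```

Since the batch mean of E_f^norm lies in (0, 1], m stays between 0.6 and 0.9, so the historical weight moves toward the current prediction faster on texture-rich batches and slower on smooth ones.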

4. Experiments

To validate the effectiveness of the proposed method, we conducted extensive experiments on a self-constructed dataset (GLC-30) and a publicly available benchmark dataset (NWPU-RESISC45). This section provides a detailed discussion of the dataset characteristics, experimental configurations, evaluation metrics, and the analysis of both qualitative and quantitative results.

4.1. Dataset

In natural resource survey and monitoring, photographic evidence captured by field personnel with handheld mobile devices is a critical source of monitoring data. The experiments use the self-constructed GLC-30 dataset, which consists of close-range ground images captured in this way by field personnel in Guangxi. The dataset covers 30 typical land cover categories and comprises a total of approximately 21,877 images. As shown in Figure 3, the large areas of homogeneity and environmental complexity exhibited by this dataset constitute the core physical motivation for the design of the LACE block in this paper. Examining Figure 3a–j reveals that forest samples such as eucalyptus, pine, and cypress exhibit strong spectral convergence at the macro level, while peanut and soybean seedlings in their early growth stages show tonal values that nearly coincide with the population envelope. From a signal processing perspective, the key features distinguishing these categories are hidden in high-frequency spatial components, such as the impulse response at leaf edges and the periodic spatial frequencies of plant arrangements. The standard Swin Transformer employs a shifted-window-based self-attention mechanism that tends to capture long-range semantic dependencies during deep feature aggregation, which can easily produce global smoothing effects and blunt the perception of local high-frequency details. Furthermore, Figure 3k–t reveals the environmental complexities inherent to ground-based imaging perspectives. Due to the uncontrollable shooting environment, ground objects are often affected by non-uniform illumination, cloud shadows, fog, and complex background noise. This environmental complexity causes significant fluctuations in local image contrast and weakens the edge contours and textural salience of ground objects, making it difficult for traditional fixed-weight attention mechanisms (such as the standard attention in Swin-B) to capture such low-salience features stably.

4.2. Experimental Environment and Setup

During data preprocessing, a multi-stage augmentation pipeline was adopted to improve the model’s generalization ability. First, the input images were uniformly resized to 224 × 224 pixels. During training, two operators were randomly selected from the following four and applied in combination: automatic contrast adjustment, histogram equalization, random rotation of up to 30°, and a shear transformation with an intensity of 0.3. Random erasing was also applied with a probability of 0.25 to simulate occlusion by masking local regions. Finally, the images were normalized using the ImageNet mean ([123.68, 116.28, 103.53]) and standard deviation ([58.40, 57.12, 57.38]). During the validation phase, images were first resized to 256 pixels and then center-cropped to 224 × 224 to ensure consistent evaluation. The dataset was randomly split into training and test sets at an 8:2 ratio, and the test set was excluded from training.
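The sampling and normalization steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the operator names are placeholders for the four augmentations listed in the text, the two operators are assumed to be drawn without replacement, and the helper names (`pick_ops`, `should_erase`, `normalize_pixel`) are hypothetical.

```python
import random

# Placeholder names for the four operators listed in the text; the image
# transforms themselves are omitted from this sketch.
AUGMENT_OPS = ["auto_contrast", "equalize", "rotate_up_to_30deg", "shear_0.3"]

# ImageNet statistics on the 0-255 scale, as quoted in the text.
MEAN = (123.68, 116.28, 103.53)
STD = (58.40, 57.12, 57.38)


def pick_ops(rng, k=2):
    """Choose k distinct operators for one image (assumption: no replacement)."""
    return rng.sample(AUGMENT_OPS, k)


def should_erase(rng, p=0.25):
    """Decide whether random erasing is applied to this image."""
    return rng.random() < p


def normalize_pixel(rgb):
    """Normalize one RGB pixel given on the 0-255 scale, channel by channel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))


rng = random.Random(0)
print(pick_ops(rng), should_erase(rng))
print(normalize_pixel(MEAN))  # a pixel equal to the mean maps to (0.0, 0.0, 0.0)
```

Because the statistics are given on the 0–255 scale, the normalization is applied to raw pixel values; dividing both vectors by 255 yields the equivalent statistics for images scaled to [0, 1].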

The proposed method was implemented using the PyTorch (2.3.1+cu118) deep learning framework. The model was trained for 50 epochs with a batch size of 16, using the AdamW optimizer with an initial learning rate of 1.25 × 10⁻⁴; the learning rate for the LACE block was set independently to 2.5 × 10⁻⁴. To ensure stable convergence after incorporating the new module, a cosine annealing scheduler was paired with a linear warm-up mechanism. During the warm-up phase, which spans the first three epochs, the learning rate increases linearly from 0.001 times the initial learning rate to the full initial learning rate; after the warm-up phase ends, the cosine annealing formula is applied:

η_{t} = η_{min} + \frac{1}{2} (η_{max} - η_{min}) (1 + \cos (π \cdot \frac{t}{T}))

(9)

Here, η_{max} is the initial learning rate, η_{min} = η_{max} × 0.01, t is the current training iteration, and T is the total number of iterations. This combination of linear warm-up and cosine annealing ensures that the model converges stably after new modules are introduced.
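As an illustration of the schedule, the following sketch evaluates the learning rate at a given step under the stated settings (warm-up from 0.001 × the initial rate, η_min = 0.01 × η_max). The function name `learning_rate` and its parameters are hypothetical, and counting t and T from the end of warm-up is an assumption, since the text does not specify whether the warm-up iterations are included in T.

```python
import math


def learning_rate(step, total_steps, warmup_steps, base_lr=1.25e-4,
                  warmup_factor=1e-3, min_ratio=0.01):
    """Linear warm-up followed by cosine annealing in the form of Equation (9)."""
    if step < warmup_steps:
        # Ramp linearly from warmup_factor * base_lr up to base_lr.
        start = warmup_factor * base_lr
        return start + (base_lr - start) * step / warmup_steps
    eta_min = min_ratio * base_lr
    # Assumption: t and T are counted from the end of the warm-up phase.
    t = step - warmup_steps
    T = total_steps - warmup_steps
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / T))


# Example with 100 steps in total, of which the first 10 are warm-up.
for s in (0, 5, 10, 55, 100):
    print(s, learning_rate(s, total_steps=100, warmup_steps=10))
```

With this convention the schedule is continuous at the end of warm-up (the cosine term starts at η_max) and reaches η_min at the final step.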

4.3. Evaluation Metrics

To provide a comprehensive assessment of the model’s performance, we selected Top-1 accuracy (Top-1 Acc), macro-averaged precision (mPre), macro-averaged recall (mRec), and macro-averaged F1-score (mF1) as the primary evaluation metrics.

Top-1 Acc represents the proportion of instances where the category with the highest predicted probability matches the ground truth label. It is formally defined by (10):

\text{Top-1 Acc} = \frac{1}{N} \sum_{i = 1}^{N} [\arg \max (\hat{y}_{i}) = y_{i}]

(10)

where N denotes the total number of test samples, \hat{y}_{i} represents the predicted probability vector generated by the model for the i-th sample, and y_{i} signifies the ground truth label of the i-th sample. The term \arg \max (\hat{y}_{i}) retrieves the index corresponding to the maximum value within the probability vector, which identifies the predicted category assigned by the model.
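Equation (10) can be checked on a toy example. The sketch below is illustrative only; the function name `top1_accuracy` and the sample values are made up for the demonstration.

```python
def top1_accuracy(prob_vectors, labels):
    """Fraction of samples whose arg-max class index equals the true label."""
    correct = 0
    for probs, y in zip(prob_vectors, labels):
        # Index of the largest probability = predicted class.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == y)
    return correct / len(labels)


# Four toy samples over three classes; the last one is misclassified.
probs = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
labels = [1, 0, 2, 1]
print(top1_accuracy(probs, labels))  # 0.75
```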

The mPre is calculated by determining the ratio of true positive predictions to the total number of positive predictions for each individual class, and subsequently computing the arithmetic mean across all categories. It is formally defined in (11):

\text{mPre} = \frac{1}{K} \sum_{i = 1}^{K} (\frac{{TP}_{i}}{{TP}_{i} + {FP}_{i}})

(11)

where TP_{i} denotes the number of true positives for class i, that is, samples correctly predicted as class i, and FP_{i} denotes the number of false positives, that is, samples incorrectly predicted as class i by the model. TN denotes the number of true negatives. Furthermore, K represents the total number of distinct categories within the classification task.

Recall is defined as the proportion of actual positive samples that are correctly identified by the model. Furthermore, the mRec is obtained by calculating the recall for each category and subsequently computing the arithmetic mean across all classes. It is formally defined by (12):

\text{mRec} = \frac{1}{K} \sum_{i = 1}^{K} (\frac{{TP}_{i}}{{TP}_{i} + {FN}_{i}})

(12)

In this context, FN denotes False Negatives, which represent the number of positive samples incorrectly predicted as the negative class.

The mF1 is defined as the arithmetic mean of the F1-scores calculated independently for each class in the multi-class classification task. It is formally expressed in (13):

\text{mF1} = \frac{1}{K} \sum_{i = 1}^{K} (\frac{2 \cdot {TP}_{i}}{2 {TP}_{i} + {FP}_{i} + {FN}_{i}})

(13)
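The three macro-averaged metrics in Equations (11)–(13) can all be derived from a confusion matrix. The following sketch is a minimal illustration rather than the evaluation code used in the experiments; the function name `macro_metrics`, the row/column convention, and the choice to score a class with an empty denominator as 0 are assumptions.

```python
def macro_metrics(cm):
    """Return (mPre, mRec, mF1) for a K x K confusion matrix.

    Assumed convention: cm[i][j] counts samples of true class i
    that were predicted as class j.
    """
    K = len(cm)
    precisions, recalls, f1s = [], [], []
    for i in range(K):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(K)) - tp  # predicted i, true class differs
        fn = sum(cm[i]) - tp                       # true i, predicted as another class
        # Assumption: a class with an empty denominator contributes 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(precisions) / K, sum(recalls) / K, sum(f1s) / K


# Toy two-class example.
m_pre, m_rec, m_f1 = macro_metrics([[8, 2], [1, 9]])
print(round(m_pre, 4), round(m_rec, 4), round(m_f1, 4))
```

Note that mF1 is the mean of the per-class F1 scores, so it generally differs from the harmonic mean of mPre and mRec.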

4.4. Quantitative Comparison Experiment

To objectively evaluate the effectiveness and generalization capability of the proposed LACE-Net method in land cover classification tasks, this study conducted a comprehensive comparison against both mainstream Convolutional Neural Networks (CNNs)—including ResNet-152 [30], EfficientNet-B4 [31], and ConvNeXt-Base [32]—and advanced Vision Transformer (ViT) architectures, such as ViT [23] and Swin-B [7]. Table 1 presents a detailed quantitative assessment of each model across four core performance metrics: Top-1 Acc, mPre, mRec, and mF1.

To ensure a fair comparison under optimal configurations, EfficientNet-B4 was evaluated at its recommended 380 × 380 input resolution, while all other backbone networks used the standard 224 × 224 resolution commonly adopted in computer vision. The experimental results show that our method outperforms the baseline models on all evaluation metrics. On the challenging GLC-30 dataset, the CNN-based ResNet-152 and EfficientNet-B4 (380 × 380) performed relatively poorly, achieving Top-1 Acc of 94.10% and 95.84%, respectively. This suggests that traditional CNNs, constrained by their local receptive fields, struggle to fully capture long-range semantic dependencies in complex scenes. Compared to the baseline Swin-B, LACE-Net improves further, reaching a Top-1 Acc of 96.48% and an mF1 of 93.13%. The mPre rose from Swin-B’s 92.02% to 93.81%, an increase of 1.79 percentage points, which supports the effectiveness of the LACE block in enhancing frequency-domain features. By supplementing high-frequency texture details, the model reduced misclassifications in “spectral similarity but different objects” scenarios and improved its ability to distinguish between highly similar land classes. Additionally, its leading performance on the public dataset NWPU-RESISC45 indicates that the model generalizes across different geographical scenarios rather than overfitting to a specific task.

Furthermore, this study evaluates the computational complexity (FLOPs) and the number of model parameters (Parameters) of the proposed method against mainstream benchmark models, as summarized in Table 2. The data indicate that LACE-Net requires 15.17 G FLOPs and contains 87.34 M parameters. Compared to the backbone network Swin-B, the integration of the LACE block results in a marginal increase of only 0.03 G in FLOPs and 0.06 M in parameters. This negligible resource overhead provides strong evidence of the lightweight nature of the proposed module. Simultaneously, when compared to models of similar scale such as ViT (16.86 G) and ConvNeXt-Base (15.36 G), LACE-Net not only maintains lower computational complexity but also achieves superior classification accuracy (as shown in Table 1). This suggests that LACE-Net does not rely on increasing parameter counts to trade for accuracy; instead, it achieves an optimal balance between performance and computational cost through efficient frequency-domain feature mining. Consequently, the proposed architecture is better suited for large-scale remote sensing data processing tasks.
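The overhead quoted above can be put in relative terms with a short calculation on the Table 2 values. The helper name `relative_overhead` is hypothetical.

```python
# FLOPs (G) and parameters (M) for the two models, taken from Table 2.
swin_b = {"flops_g": 15.14, "params_m": 87.28}
lace_net = {"flops_g": 15.17, "params_m": 87.34}


def relative_overhead(new, base):
    """Increase of `new` over `base`, in percent of `base`."""
    return 100.0 * (new - base) / base


for key in ("flops_g", "params_m"):
    print(key, f"{relative_overhead(lace_net[key], swin_b[key]):.2f}%")
# flops_g 0.20%
# params_m 0.07%
```

In relative terms, the LACE block therefore adds roughly 0.2% to the FLOPs and under 0.1% to the parameter count of Swin-B.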

4.5. Ablation Experiments and Analysis

This section conducts a series of comparative experiments to thoroughly examine the impact of the LACE block’s core components, hyperparameter settings, and insertion points on model performance and to validate the effectiveness of each design decision.

4.5.1. Analysis of the Contributions of Each Component in LACE

To verify the individual contributions of the components within the LACE block, we conducted an ablation study based on the Swin-B baseline, as shown in Table 3. After incorporating EMA into Swin-B, Top-1 Acc and mRec declined (from 96.23% to 95.89% and from 92.62% to 92.32%, respectively), mPre rose slightly to 92.25%, and mF1 remained at 92.26%. This suggests that in complex remote sensing scenarios, although EMA’s bidirectional parallel paths perform complex cross-spatial weighted aggregation, this highly integrated attention mapping tends to over-smooth features, thereby weakening the discriminative power between fine-grained objects. In contrast, the SEA branch mitigates over-smoothing while preserving spatial enhancement information. The model’s mPre improved to 93.44% and the mF1 increased to 92.81%, indicating that this branch enhances the model’s ability to represent class boundaries and discriminative features. With the addition of the CEB, Top-1 Acc, mPre, and mF1 improve over the baseline, reaching 96.30%, 93.71%, and 92.99%, respectively, while mRec is marginally lower (92.56% versus 92.62%). Notably, the CEB yields the largest gains among the individual components, highlighting the positive impact of the local contrast enhancement mechanism on distinguishing fine-grained land cover textures. The LFE branch also raises the mF1, to 92.58%, which supports the utility of frequency-domain features in complementing spatial information, although its improvement margin is smaller than that of the CEB.

4.5.2. Analysis of Hierarchical Synergistic Effects at the LACE Embedding Location

A comparative analysis of the LACE block at different insertion points within the Swin-B backbone network reveals that the model’s performance follows a distinct inverted U-shaped trend, peaking between Stages 2 and 3 with a Top-1 Acc of 96.48% and an mF1 of 93.13%, as shown in Table 4.

From the perspective of the feature representation hierarchy, before Stage 1 in the shallow layers of the network, feature maps have extremely high spatial resolution and contain a large amount of raw high-frequency noise. Introducing contrast enhancement at this stage is highly likely to amplify non-discriminative redundant information, thereby interfering with the extraction of fundamental features. Conversely, in the very deep layers (Stage 4 and beyond), features have been highly abstracted through multiple downsampling steps. The sharp drop in resolution leads to the loss of fine-grained frequency signals, limiting the marginal contribution of frequency enhancement to semantic reconstruction. In contrast, embedding this module in the intermediate layers allows for the precise capture of features that possess a certain degree of semantic context while retaining moderate spatial detail, effectively compensating for the texture degradation in Swin Transformer’s feature propagation during the intermediate layers.

4.5.3. Comparison of Local Frequency Operators in Different Directions

This study found that orthogonal frequency detection operators in the horizontal and vertical directions achieve the best mPre and mF1 among the tested configurations, as shown in Table 5. The diagonal-only 45° + 135° pair attains a higher Top-1 Acc (96.64% versus 96.48%) and mRec (92.87% versus 92.72%), but a lower mPre and mF1, and adding the diagonal directions to the orthogonal pair lowers all four metrics. The 0° + 90° combination achieves efficient coverage of core structural features through asymmetric convolution, avoiding anisotropic noise introduced by excessive spectral overlap. This ensures that local frequency energy exhibits higher spatial selectivity and signal-to-noise ratio when serving as the enhancement factor α in the texture-guided adaptive momentum regulation mechanism.

4.5.4. Performance Comparison of Frequency Kernels of Different Sizes

Table 6 evaluates the impact of different kernel dimensions on model performance. The experiments found that a combination of (1 × 5) & (5 × 1) split convolutional kernels achieved the optimal balance of performance.

From the perspective of spatio-frequency integration, smaller kernel sizes (1 × 3) & (3 × 1) are limited by insufficient receptive fields, making it difficult to capture discriminative long-range texture patterns; conversely, excessively large kernel sizes (1 × 7) & (7 × 1), while expanding spatial coverage, tend to introduce unnecessary background bias, leading to distorted frequency responses. The (1 × 5) & (5 × 1) configurations achieve an accurate balance between sensitivity to local features and global energy distribution while maintaining computational efficiency.
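To make the idea of horizontal and vertical split kernels concrete, the sketch below computes a simple local energy map from a 1 × k and a k × 1 filter. It is an illustration under stated assumptions, not the LFE branch itself: the zero-mean high-pass weights, the replicate padding, and the function name `directional_energy` are all hypothetical choices.

```python
def directional_energy(img, k=5):
    """Per-pixel sum of squared responses of a 1 x k and a k x 1 zero-mean filter."""
    h, w = len(img), len(img[0])
    r = k // 2
    # Hypothetical high-pass weights: centre k - 1, every neighbour -1 (sum is 0).
    weights = [-1.0] * k
    weights[r] = float(k - 1)
    energy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            horiz = vert = 0.0
            for d in range(-r, r + 1):
                # Clamp indices at the border (replicate padding, an assumption).
                xx = min(max(x + d, 0), w - 1)
                yy = min(max(y + d, 0), h - 1)
                horiz += weights[d + r] * img[y][xx]   # 1 x k, along the row
                vert += weights[d + r] * img[yy][x]    # k x 1, along the column
            energy[y][x] = horiz ** 2 + vert ** 2
    return energy


# Vertical stripes vary only along the horizontal direction.
stripes = [[float(x % 2) for x in range(6)] for _ in range(6)]
flat = [[1.0] * 6 for _ in range(6)]
print(directional_energy(stripes)[2][2])  # 4.0
print(directional_energy(flat)[2][2])     # 0.0
```

A pair of one-dimensional kernels of length 5 uses 10 weights per channel, compared with 25 for a full 5 × 5 kernel, which is one reason split kernels keep the computational cost low.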

4.5.5. Validation of the Dynamic Momentum Strategy Based on Texture Complexity

Table 7 compares the fixed-value and dynamic adjustment schemes, verifying the influence of the momentum coefficient m in Equation (7) on the evolutionary stability of the feature representation. The results indicate that dynamically adjusting m ∈ [0.6, 0.9] based on texture complexity yields the best results.

When m is set to a fixed, relatively high value of 0.9, the model exhibits strong historical smoothing capabilities and is able to suppress random noise interference; however, this often results in the model being slow to capture transient object details in the current scene. Conversely, a lower fixed value of 0.6 enhances the model’s responsiveness to current inputs but is highly prone to feature representation oscillations in complex backgrounds. The dynamic strategy establishes a dynamic equilibrium between suppressing random noise interference and maintaining the sensitivity of feature updates by sensing the complexity of object textures, thereby significantly enhancing the robustness of feature evolution.
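The trade-off between a fixed and a dynamic momentum can be sketched as follows. This is a hedged illustration only: the linear mapping from texture complexity to m, its direction (richer texture gives a lower m), the normalization of the complexity score to [0, 1], and the exponential-moving-average form of the update are assumptions; the names `dynamic_momentum` and `ema_update` are hypothetical and do not reproduce Equation (7).

```python
def dynamic_momentum(complexity, m_low=0.6, m_high=0.9):
    """Map a texture-complexity score in [0, 1] to a momentum in [m_low, m_high]."""
    c = min(max(complexity, 0.0), 1.0)
    # Assumption: richer texture -> lower momentum (faster response to the current input).
    return m_high - (m_high - m_low) * c


def ema_update(previous, current, m):
    """Assumed moving-average update: keep a share m of the history."""
    return m * previous + (1.0 - m) * current


# A smooth region keeps most of the history; a textured region follows the input.
for complexity in (0.0, 0.5, 1.0):
    m = dynamic_momentum(complexity)
    print(complexity, round(m, 2), round(ema_update(1.0, 0.0, m), 2))
```

With a fixed m = 0.9 every region would retain 90% of its history, whereas the dynamic mapping lets highly textured regions update faster while smooth regions remain strongly smoothed.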

4.5.6. Confusion Matrix Analysis

To thoroughly evaluate the classification performance of the LACE-Net model in complex geographical environments, this study conducted a quantitative comparison of the confusion matrices for the baseline model Swin-B and the proposed model on the public dataset NWPU-RESISC45 and the in-house dataset GLC-30, as shown in Figure 4 and Figure 5.

LACE-Net demonstrates clear performance gains in the classification task on the GLC-30 dataset, particularly in the fine-grained discrimination between forest and crop categories. Within the forest category, the classification accuracy for pines (label 15) improved by 6 percentage points compared to the baseline Swin-B model, while that for cypresses (label 12) improved by 2 percentage points. This improvement is primarily attributed to the LFE module’s ability to capture high-frequency texture information, which counters the over-smoothing commonly observed in deep features of traditional Transformers. This enables the model to extract the periodic high-frequency distribution characteristics unique to coniferous forests, thereby distinguishing them from interfering classes such as mixed forests (label 21). The accuracy for paddy fields (label 14) and vegetable fields (label 25) improved simultaneously, reducing confusion with objects such as dry land that share similar spectral characteristics. This supports the effectiveness of the texture-adaptive momentum adjustment mechanism, which enhances the model’s ability to resolve the semantic boundaries of objects under complex lighting conditions by strengthening the texture of crop ridges and furrows and local contrast. In contrast, the number of correctly classified eucalyptus samples (label 24) fell by 3, with the majority of the misclassifications shifting toward pine and miscellaneous trees. This may indicate that while the LACE block enhances certain textural features, it inadvertently exacerbates the overlap between eucalyptus and pine under specific lighting conditions.

On the NWPU-RESISC45 remote sensing image dataset, LACE-Net also demonstrated strong cross-scale feature modeling capabilities, particularly in addressing the semantic confusion caused by high intra-class variability and high inter-class similarity. For transportation and industrial facilities with strong structural features, the number of correct classifications for the railway (label 42) category increased by 4, alleviating the difficulty Swin-B faced in distinguishing between railway and railway station (label 43). Meanwhile, improved accuracy in categories such as thermal power station (label 11) and runway (label 2) further suggests that LACE-Net can compensate for the loss of spatial-domain features through local frequency enhancement, enabling the model to achieve higher geometric fidelity when processing industrial-scale complex structures. The model still shows some confusion when resolving the extremely subtle geometric differences between semantic pairs with highly overlapping architectural styles, such as church (label 20) and palace (label 40), but the overall separation of class boundaries has improved. In summary, whether from a ground-level micro-perspective or a remote sensing macro-perspective, LACE-Net improves the discrimination of fine-grained objects through the synergistic enhancement of local frequency and contrast, demonstrating strong generalization capability and application value.

4.6. Visual Analytics

To visually analyze the differences in the model’s focus during feature extraction, we used Grad-CAM to visualize and compare the class activation maps of the baseline Swin-B model and the improved LACE-Net model on the GLC-30 dataset. As illustrated in Figure 6, the activation maps for Swin-B, shown in (b) and (e), reveal a significant diffusion of attention and background bias, indicating that the baseline model struggles to accurately localize key land cover features. In contrast, the LACE-Net maps in (c) and (f) exhibit superior feature focusing and noise robustness. Attributable to the integration of the frequency-domain perception and adaptive enhancement modules, the high-response regions in LACE-Net’s heatmaps closely align with the canopy structures and high-frequency textural details of the trees. This effectively suppresses interference from complex background noise, such as ground-level vegetation. Such a clear refinement in attentional focus provides strong evidence that our method can successfully guide the network to decouple highly discriminative fine-grained textures from the global background. Consequently, this provides a visual explanation for the underlying factors driving the model’s improved classification performance.

To further validate the discriminative power of our model from the perspective of feature space geometry, we employed the UMAP algorithm to perform dimensionality reduction and visualization of the high-dimensional feature distributions for both the Swin-B baseline and the proposed LACE-Net. As illustrated in Figure 7, while the feature space of the Swin-B model exhibits a general clustering trend, there is significant feature overlap among highly similar land cover categories, and the inter-cluster boundaries remain blurred. This suggests that the baseline model lacks sufficient feature discriminability when processing fine-grained textures. In contrast, the feature distribution generated by LACE-Net (Figure 8) demonstrates superior clustering characteristics. Specifically, the sample distribution within each category is more compact (significantly reduced intra-class distance), while the separation between different clusters is markedly widened (enhanced inter-class separability). This compact within-class and dispersed between-class structure intuitively confirms that the LACE block effectively guides the network to extract more discriminative semantic and textural representations. This significantly reduces the probability of confusion in fine-grained land cover recognition and provides interpretable, feature-level evidence for the improved classification performance.

5. Conclusions

This paper addresses the limitations of existing land cover classification models, specifically their inadequate use of local texture information in the frequency domain and low feature saliency in complex lighting conditions, by proposing a reconstructive LACE-Net architecture. Integrating local frequency-domain energy with adaptive contrast enhancement within the Swin Transformer backbone network enables the architecture to effectively extract high-frequency texture features from land cover images and adaptively enhance the contrast of feature maps. The experimental results demonstrate that our method delivers superior performance on a multi-class land cover dataset from the Guangxi region. It consistently outperforms current state-of-the-art models across key metrics, including Top-1 Acc, mPre, and mF1, effectively resolving the misclassification challenges associated with highly similar land cover categories.

Despite its promising performance, the proposed method still has room for further improvement. In future work, we will explore more effective integration of local geometric modeling and global contextual representation to further enhance the discrimination of fine structures and irregular land cover boundaries [33,34]. In addition, extending the proposed framework to more diverse remote sensing datasets and complex real-world scenarios may further improve its robustness and generalization capability.

Author Contributions

Conceptualization, H.Y., Y.T. and J.T.; methodology, H.Y. and Y.T.; software, Y.T.; validation, H.Y., J.T. and G.C.; formal analysis, H.Y., Y.T. and G.C.; investigation, G.C. and Y.H.; resources, Y.H. and J.T.; data curation, Y.T. and Y.H.; writing—original draft preparation, Y.T.; writing—review and editing, H.Y., Y.T. and J.T.; visualization, Y.T.; supervision, J.T. and G.C.; project administration, H.Y. and J.T.; funding acquisition, J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Technical Services for Intelligent Identification of Land Categories in Guangxi Based on ‘One Map’ Technology (No. 2026SC500), the National Natural Science Foundation of China (No. 62262011), and Guangxi Major S&T Special Program (No. GuikeAA23062035-2).

Data Availability Statement

The data that support the findings of this study are available from the corresponding authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Guo, H.; Liang, D.; Sun, Z.; Chen, F.; Wang, X.; Li, J.; Zhu, L.; Bian, J.; Wei, Y.; Huang, L.; et al. Measuring and evaluating SDG indicators with Big Earth Data. Sci. Bull. 2022, 67, 1792–1801.
2. Qin, R.; Liu, T. A review of landcover classification with very-high resolution remotely sensed optical images—Analysis unit, model scalability and transferability. Remote Sens. 2022, 14, 646.
3. Li, K.; Wan, G.; Cheng, G.; Cao, L.; Liu, G. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
4. Paheding, S.; Saleem, A.; Siddiqui, M.F.H.; Rawashdeh, N.; Essa, A.; Reyes, A.A. Advancing horizons in remote sensing: A comprehensive survey of deep learning models and applications in image classification and beyond. Neural Comput. Appl. 2024, 36, 16727–16767.
5. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2022, 54, 200.
6. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130.
7. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
8. Mahmon, N.A.; Ya’acob, N.; Yusof, A.L. Differences of image classification techniques for land use and land cover classification. In Proceedings of the 2015 IEEE 11th International Colloquium on Signal Processing & Its Applications (CSPA), Kuala Lumpur, Malaysia, 6–8 March 2015; pp. 90–94.
9. Maxwell, A.E.; Warner, T.A.; Fang, F. Implementation of machine-learning classification in remote sensing: An applied review. Int. J. Remote Sens. 2018, 39, 2784–2817.
10. You, H.; Huang, Y.; Qin, Z.; Chen, J.; Liu, Y. Forest tree species classification based on Sentinel-2 images and auxiliary data. Forests 2022, 13, 1416.
11. Maulani, Y.; Surendro, K. Detailed land use classification model based on vegetation indices and texture features. Remote Sens. Appl. Soc. Environ. 2025, 40, 101786.
12. Hossain, M.D.; Chen, D. Segmentation for object-based image analysis (OBIA): A review of algorithms and challenges from remote sensing perspective. ISPRS J. Photogramm. Remote Sens. 2019, 150, 115–134.
13. Le, M.T.; Tran, K.H.; Dao, P.D.; El-Askary, H.; Ha, T.V.; Park, T. High spatial resolution crop type and land use land cover classification without labels: A framework using multi-temporal PlanetScope images and variational Bayesian Gaussian mixture model. Sci. Remote Sens. 2025, 12, 100264.
14. Hermosilla, T.; Wulder, M.A.; White, J.C.; Coops, N.C. Land cover classification in an era of big and open data: Optimizing localized implementation and training data selection to improve mapping outcomes. Remote Sens. Environ. 2022, 268, 112780.
15. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177.
16. Wang, X.; Zhang, X.; Su, C. Land use classification of remote sensing images based on multi-scale learning and deep convolution neural network. J. Zhejiang Univ. Sci. Ed. 2020, 47, 715–723.
17. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, Atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
18. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858.
19. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
21. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
22. Jiang, W.; Pan, J.; Yue, X. Feature fusion classification for optical image and SAR image based on spatial-spectral attention. J. Electron. Inf. Technol. 2023, 45, 987–995.
23. Karishma, S.; Anitha, V.; Kalaiselvi, S.; Manimaran, V. Enhancing land use and land cover classification in satellite imagery using vision transformers: A comparative analysis with convolutional neural networks. In Proceedings of the 2025 3rd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), Bengaluru, India, 5–7 January 2025; pp. 1613–1618.
24. Dosovitskiy, A. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. Available online: https://arxiv.org/abs/2010.11929 (accessed on 2 April 2026).
25. Shailaja, P.; Kumar, P.M.; Nikhitha, N.; Reddy, K.N.K.; Reddy, E.M.; Reddy, G.G.; Indu, V. LCC-Net: Swin Transformer-CNN hybrid for enhanced land cover classification in natural disaster monitoring. Syst. Soft Comput. 2025, 7, 200303.
26. Ji, R.; Tan, K.; Wang, X.; Jiao, L.; Wang, L. PatchOut: A novel patch-free approach based on a transformer-CNN hybrid framework for fine-grained land-cover classification on large-scale airborne hyperspectral images. Int. J. Appl. Earth Obs. Geoinf. 2025, 138, 104457.
27. Fan, Y.; Zhang, D.; Li, J.; Xiao, J. A focusing-attention deformable convolution and transformer network with multi-scale contour-render for land cover classification in high-resolution remote-sensing images. Eng. Appl. Artif. Intell. 2025, 160, 111949.
28. Zhang, Z.; Shu, D.; Liao, C.; Liu, C.; Zhao, Y.; Wang, R.; Huang, X.; Zhang, M.; Gong, J. FlexiSAM: A flexible SAM-based semantic segmentation model for land cover classification using high-resolution multimodal remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2025, 227, 594–612.
29. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5511815.
30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
31. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
32. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986.
33. Yang, C.; Chen, M.; Xiong, Z.; Yuan, Y.; Wang, Q. Cm-net: Concentric mask based arbitrary-shaped text detection. IEEE Trans. Image Process. 2022, 31, 2864–2877.
34. Yang, C.; Chen, M.; Yuan, Y.; Wang, Q. Zoom text detector. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 15745–15757.

Figure 1. LACE-Net network architecture.

Figure 2. LACE block structure. Here, “g” means the divided groups, “X Avg Pool” represents the 1D horizontal global pooling and “Y Avg Pool” indicates the 1D vertical global pooling. ⊛ represents element-wise multiplication.

Figure 3. Examples of dataset samples. (a–j) are features that are highly similar at the macro-scale. (k–t) represent ground features in scenes with cloud cover, light fog, and shadows.

Figure 4. Confusion matrix of Swin-B and LACE-Net on GLC-30 test set. (a) Confusion matrix of Swin-B on GLC-30; (b) LACE-Net confusion matrix on GLC-30.

Figure 5. Confusion matrix of Swin-B and LACE-Net on NWPU-RESISC45 test set. (a) Confusion matrix of Swin-B on NWPU-RESISC45; (b) LACE-Net confusion matrix on NWPU-RESISC45.

Figure 6. Comparison of Grad-CAM between LACE-Net and Swin-B. (a) Original image of Chinese fir; (b) Swin-B Grad-CAM for Chinese fir; (c) LACE-Net Grad-CAM for Chinese fir; (d) original image of pine; (e) Swin-B Grad-CAM for pine; (f) LACE-Net Grad-CAM for pine.

Figure 7. Swin-B UMAP feature visualization.

Figure 8. LACE-Net UMAP feature visualization.

Table 1. Comparison of land cover classification experiments.

Methods	GLC-30 Top-1 Acc (%)	GLC-30 mPre (%)	GLC-30 mRec (%)	GLC-30 mF1 (%)	NWPU-RESISC45 Top-1 Acc (%)	NWPU-RESISC45 mPre (%)	NWPU-RESISC45 mRec (%)	NWPU-RESISC45 mF1 (%)
ResNet-152	94.10	89.15	89.29	89.03	95.63	95.74	95.63	95.64
ConvNeXt-Base	95.47	90.91	90.01	90.31	96.84	96.87	96.84	96.84
EfficientNet-B4 (380 × 380)	95.84	92.19	92.08	92.10	96.98	97.02	96.98	96.98
ViT-B/16	94.17	90.21	89.26	89.37	96.37	96.40	96.37	96.37
Swin-B	96.23	92.02	92.62	92.26	97.16	97.17	97.16	97.15
LACE-Net	96.48	93.81	92.72	93.13	97.32	97.34	97.32	97.31
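
In Table 1, the Swin-B mF1 on NWPU-RESISC45 (97.15) is lower than both its mPre (97.17) and its mRec (97.16). A harmonic mean of those two averages could not fall below both, so the figures are consistent with mF1 being the mean of per-class F1 scores. The following is a minimal sketch of that computation, assuming standard macro averaging over a confusion matrix with true classes in rows. The function name `macro_metrics` and the 3-class matrix are hypothetical and are not taken from the paper.

```python
import numpy as np

def macro_metrics(cm):
    """Return (accuracy, macro precision, macro recall, macro F1).

    Assumes rows of `cm` are true classes and columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correctly classified samples per class
    pred_total = cm.sum(axis=0)      # samples predicted as each class
    true_total = cm.sum(axis=1)      # samples that belong to each class
    # Classes with an empty denominator contribute 0 to the average.
    precision = np.divide(tp, pred_total, out=np.zeros_like(tp), where=pred_total > 0)
    recall = np.divide(tp, true_total, out=np.zeros_like(tp), where=true_total > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return tp.sum() / cm.sum(), precision.mean(), recall.mean(), f1.mean()

# Hypothetical 3-class confusion matrix (illustrative only, not paper data).
cm = [[50, 2, 0],
      [5, 40, 5],
      [0, 1, 9]]
acc, m_pre, m_rec, m_f1 = macro_metrics(cm)

# Harmonic mean of the two macro averages, shown only for contrast.
harmonic = 2 * m_pre * m_rec / (m_pre + m_rec)
print(f"acc={acc:.4f} mPre={m_pre:.4f} mRec={m_rec:.4f} mF1={m_f1:.4f}")
print(f"harmonic mean of mPre and mRec = {harmonic:.4f}")
```

Because the per-class F1 scores are averaged after the harmonic mean is taken within each class, classes with imbalanced precision and recall pull mF1 down more than they pull down mPre or mRec individually.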

Table 2. Comparison of computational complexity and parameter counts among different models.

Methods	FLOPs (G)	Parameters (M)
ResNet-152	11.58	58.21
ConvNeXt-Base	15.36	87.6
EfficientNet-B4	4.51	17.6
ViT-B/16	16.86	88.25
Swin-B	15.14	87.28
LACE-Net	15.17	87.34
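
The cost of adding the LACE block can be read off Table 2 by comparing LACE-Net with its Swin-B backbone. The short calculation below uses only the two rows from the table; the helper name `relative_overhead` is a hypothetical name chosen for illustration.

```python
# FLOPs (G) and parameters (M) copied from the Swin-B and LACE-Net rows of Table 2.
swin_b = {"flops_g": 15.14, "params_m": 87.28}
lace_net = {"flops_g": 15.17, "params_m": 87.34}

def relative_overhead(base, new):
    """Percentage increase of `new` over `base`."""
    return (new - base) / base * 100.0

flops_pct = relative_overhead(swin_b["flops_g"], lace_net["flops_g"])
params_pct = relative_overhead(swin_b["params_m"], lace_net["params_m"])

print(f"FLOPs overhead: {flops_pct:.2f}%")
print(f"Parameter overhead: {params_pct:.2f}%")
```

The absolute differences are 0.03 G FLOPs and 0.06 M parameters, which is roughly 0.2% and 0.07% of the Swin-B figures respectively.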

Table 3. Ablation experiment results.

Methods	Top-1 Acc (%)	mPre (%)	mRec (%)	mF1 (%)
Swin Transformer	96.23	92.02	92.62	92.26
Swin Transformer + EMA	95.89	92.25	92.32	92.26
Swin Transformer + SEA	96.14	93.44	92.45	92.81
Swin Transformer + CEB	96.30	93.71	92.56	92.99
Swin Transformer + LFE	96.18	92.94	92.54	92.58
LACE-Net	96.48	93.81	92.72	93.13

Table 4. Comparison of LACE across different stages of Swin-B.

Insertion Point	Top-1 Acc (%)	mPre (%)	mRec (%)	mF1 (%)
Patch Partition–Stage 1	96.00	91.35	92.47	91.85
Stage 1–2	96.32	92.59	92.61	92.58
Stage 2–3	96.48	93.81	92.72	93.13
Stage 3–4	96.27	92.01	92.60	92.25
Stage 4–GAP	96.07	92.33	92.33	92.30

Table 5. Comparative results of frequency components in different directions in LACE.

Configuration	Top-1 Acc (%)	mPre (%)	mRec (%)	mF1 (%)
45° + 135°	96.64	92.50	92.87	92.66
0° + 90° + 45° + 135°	96.05	93.43	92.43	92.80
0° + 90°	96.48	93.81	92.72	93.13

Table 6. Performance comparison of frequency kernels of different sizes in LACE.

Kernel Size	Top-1 Acc (%)	mPre (%)	mRec (%)	mF1 (%)
(1 × 3) & (3 × 1)	96.21	92.56	92.45	92.48
(1 × 5) & (5 × 1)	96.48	93.81	92.72	93.13
(1 × 7) & (7 × 1)	96.07	93.42	92.42	92.78

Table 7. Impact of momentum coefficient m on model performance.

m	Top-1 Acc (%)	mPre (%)	mRec (%)	mF1 (%)
Static 0.9	96.00	93.07	92.37	92.58
Static 0.75	96.41	93.79	92.63	93.08
Static 0.6	96.16	92.45	92.39	92.39
Dynamic [0.6, 0.9]	96.48	93.81	92.72	93.13

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

