Hierarchical Multiscale Fusion with Coordinate Attention for Lithologic Mapping from Remote Sensing

Xie, Fuyuan; Yang, Yongguo

doi:10.3390/rs18030413

Open Access Article

by Fuyuan Xie ^1,2,* and Yongguo Yang ^1,2

¹ Key Laboratory of Coalbed Methane Resources and Reservoir Formation Process, Ministry of Education, China University of Mining and Technology, Xuzhou 221008, China

² School of Resources and Geosciences, China University of Mining and Technology, Xuzhou 221116, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(3), 413; https://doi.org/10.3390/rs18030413

Submission received: 21 November 2025 / Revised: 16 January 2026 / Accepted: 20 January 2026 / Published: 26 January 2026

(This article belongs to the Section Remote Sensing for Geospatial Science)

Highlights

What are the main findings?

We propose SegNeXt-HFCA, a SegNeXt-based hierarchical multiscale fusion network with coordinate attention for lithologic mapping from Sentinel-2 and DEM data.
Class-frequency-aware and boundary-weighted losses, combined with seamless sliding-window inference and DenseCRF refinement, significantly improve boundary fidelity and the recognition of long-tailed lithologic units, yielding mIoU gains of roughly 3–4 percentage points over strong baselines in two structurally complex areas.

What are the implications of the main findings?

The proposed framework better preserves thin lithologic belts and clarifies ambiguous contacts, producing geologically more plausible lithologic maps in arid, structurally complex terranes.
The methodology is transferable to other geological remote sensing applications using multispectral satellite imagery and auxiliary terrain data.

Abstract

Accurate lithologic maps derived from satellite imagery underpin structural interpretation, mineral exploration, and geohazard assessment. However, automated mapping in complex terranes remains challenging because spectrally similar units, narrow anisotropic bodies, and ambiguous contacts can degrade boundary fidelity. In this study, we propose SegNeXt-HFCA, a hierarchical multiscale fusion network with coordinate attention for lithologic segmentation from a Sentinel-2/DEM feature stack. The model builds on SegNeXt and introduces a hierarchical multiscale encoder with coordinate attention to jointly capture fine textures and scene-level structure. It further adopts a class-frequency-aware hybrid loss that combines boundary-weighted online hard-example mining cross-entropy with Lovász-Softmax to better handle long-tailed classes and ambiguous contacts. In addition, we employ a robust training and inference scheme, including entropy-guided patch sampling, exponential moving average of parameters, test-time augmentation, and a DenseCRF-based post-refinement. Two study areas in the Beishan orogen, northwestern China (Huitongshan and Xingxingxia), are used to evaluate the method with a unified 10-channel Sentinel-2/DEM feature stack. Compared with U-NetFormer, PSPNet, DeepLabV3+, DANet, LGMSFNet, SegFormer, BiSeNetV2, and the SegNeXt backbone, SegNeXt-HFCA improves mean intersection-over-union (mIoU) by about 3.8 percentage points in Huitongshan and 2.6 percentage points in Xingxingxia and increases mean pixel accuracy by approximately 3–4%. Qualitative analyses show that the proposed framework better preserves thin-unit continuity, clarifies lithologic contacts, and reduces salt-and-pepper noise, yielding geologically more plausible maps. These results demonstrate that hierarchical multiscale fusion with coordinate attention, together with class- and boundary-aware optimization, provides a practical route to robust lithologic mapping in structurally complex regions.

Keywords: geological remote sensing; coordinate attention; lithologic mapping; semantic segmentation; deep learning

1. Introduction

Lithology is a fundamental property of geological bodies that reflects the composition and fabric of the Earth’s crust. Its spatial distribution underpins geological mapping, mineral exploration, and geohazard assessment [1]. Traditionally, lithologic information is obtained through field surveys and manual interpretation of imagery—procedures that are labor-intensive and time-consuming and, in structurally complex terranes, prone to interpreter subjectivity and ambiguous contacts [2]. These limitations motivate automated approaches that improve both efficiency and reliability.

Satellite remote sensing provides wide spatial coverage, rapid revisit, and rich spectral content, forming a solid basis for lithologic mapping [3]. With the broad availability of medium-/high-resolution multispectral data (e.g., Sentinel-2) and advances in deep-learning-based semantic segmentation [4,5,6], automated lithologic interpretation has received renewed attention [7,8]. CNN-based [9], U-Net-style [10], and transformer-based models [11] have achieved encouraging results. However, robust lithologic mapping in structurally complex terranes remains challenging. Three issues are particularly prominent in geological remote-sensing-based lithologic mapping:

(1): Spectral ambiguity and intra-class heterogeneity. Weathering and alteration, mixed pixels, and vegetation cover often yield low inter-class contrast among different lithologies, while the same lithology can exhibit substantial within-class variability in tone, texture, and grain size [12]. This “different materials–similar spectra/same material–varying spectra” phenomenon undermines the stability of pixel-wise decisions, induces systematic confusion between adjacent units, and degrades class separability and lithologic contact delineation [13,14]. Therefore, a practical model should strengthen discrimination under low inter-class contrast while leveraging context to reduce systematic confusion.
(2): Multi-scale anisotropy and receptive-field mismatch. Banded and laminated units, as well as slender, stripe-like bodies, span multiple scales and exhibit strong directional anisotropy. When the network’s effective receptive field is misaligned with object geometry, predictions often exhibit stripe breakage, width distortion, and strike offsets, accompanied by cross-band misclassification and shape deformation [15,16]. Shallow or rigid multi-scale encoders struggle to capture fine textures and scene-level structure simultaneously, degrading stripe continuity and topological consistency. Hence, the representation should preserve high-resolution details while fusing multi-level semantics to better align the effective receptive field with anisotropic geometries.
(3): Ambiguous contacts, weak spatial consistency, and long-tailed imbalance. Transitional zones and mixed contacts yield diffuse boundaries; pixel-independent decisions are prone to salt-and-pepper noise and topological artifacts (holes and islands). Pronounced class-frequency imbalance further drives the collapse of rare classes—especially along contacts—and exacerbates cross-region and cross-season semantic drift [17]. Without explicit spatial-consistency and boundary-aware constraints, boundary stability and object connectivity deteriorate. Accordingly, optimization and inference should emphasize boundary and contact pixels, hard examples, and rare classes while promoting spatial consistency.

Motivated by these failure modes, we develop SegNeXt-HFCA by jointly optimizing the network architecture and the learning and inference scheme in a challenge-driven manner.

For Issues (1–2), we adopt a hierarchical multi-scale (HMS) fusion encoder with coordinate attention (CA) to jointly model local textures and long-range contextual cues, strengthening discrimination among spectrally similar units [18,19]. HMS fusion also preserves high-resolution representations while aggregating multi-level semantics, and a global feature-fusion head further improves stripe continuity and preserves topology across scales. For Issue (3), we employ a class-frequency-aware hybrid objective combining boundary-weighted online hard example mining (OHEM) cross-entropy with Lovász-Softmax to emphasize hard pixels and contact zones [20,21]. In addition, entropy-guided patch sampling increases exposure to rare and uncertain patterns during training; exponential moving average (EMA) and test-time augmentation (TTA) improve prediction stability; and a lightweight DenseCRF refinement sharpens lithologic contacts and suppresses salt-and-pepper noise, enhancing spatial consistency with limited overhead. This combination consistently benefits rare classes and transition zones, striking a practical balance between accuracy and deployability.

In summary, the contributions of this article include the following:

(1): We propose SegNeXt-HFCA, a framework tailored for lithologic segmentation that augments a SegNeXt backbone with HMS fusion and CA plus a global feature-fusion head, improving the tracking of stripe-like units and the delineation of lithologic contacts in the presence of spectral ambiguity and anisotropic geometries.
(2): We develop a robust training/inference scheme driven by class-frequency priors and hard examples: class-frequency reweighting and boundary-weighted OHEM cross-entropy are combined with Lovász-Softmax under a two-stage learning-rate schedule with EMA averaging; inference uses multi-scale/flip TTA, overlapping sliding windows with seamless stitching, and a lightweight DenseCRF refinement to sharpen contacts and enhance spatial consistency for rare classes and transition zones.
(3): We provide a unified evaluation on two large Sentinel-2 study areas (Huitongshan and Xingxingxia), showing consistent gains over strong baselines (U-NetFormer, PSPNet, DeepLabV3+, DANet, LGMSFNet, BiSeNetV2, and SegNeXt-base). SegNeXt-HFCA improves mIoU by ≈3.8 and ≈2.6 percentage points, respectively, with notably better stripe continuity, stabilized contacts, and separation of spectrally similar units across two structurally contrasting blocks.

2. Related Work

Automated geological interpretation from Earth observation data has evolved from spectral-feature engineering to machine learning and, more recently, deep learning. While these lines of work have substantially improved lithologic mapping, three recurring challenges remain particularly relevant to structurally complex terranes: (1) weak texture and low inter-class contrast caused by spectral ambiguity and intra-class heterogeneity, (2) narrow, stripe-like anisotropic bodies that span multiple scales and stress the effective receptive field of segmentation models, and (3) long-tailed class distributions and ambiguous contacts that degrade boundary fidelity and spatial consistency. Below we review representative efforts through the lens of these challenges.

2.1. Weak-Texture Units and Spectral Ambiguity Under Low Contrast

Early geological remote-sensing interpretation relied heavily on spectral-feature methods that enhance diagnostic signals. Band-ratio and color-composite techniques (e.g., [22]) strengthened iron oxide and clay alteration responses across multiscene imagery and became foundational for alteration mapping. Within the continuum-removal framework, the shape and position of diagnostic absorption features were quantified for mineral and lithologic identification [23,24]. At the image-mapping level, Kruse et al. [25] advanced spectral-library-driven approaches based on spectral angle mapper (SAM), which inspired rule-based expert systems such as Tetracorder. To mitigate ambiguity from “same material–different spectra” and “different materials–similar spectra”, Chang [26] introduced information-theoretic metrics such as spectral information measure (SIM) and spectral information divergence (SID), improving discrimination among near-spectral classes. Despite these advances, hand-crafted spectral descriptors can remain brittle when low contrast, mixed pixels, weathering/alteration, and vegetation jointly reduce class separability. This limitation motivates representation learning that couples local evidence with broader contextual cues, as pursued in later learning-based methods.

2.2. Narrow Anisotropic Geological Bodies and Multi-Scale Receptive-Field Mismatch

Capturing stripe-like and anisotropic geological bodies requires models that preserve fine-scale structure while integrating larger-scale context. In the feature-engineering and supervised-classification phase, geographic object-based image analysis (GEOBIA) introduced shape/texture/context features to reduce pixel-level “salt-and-pepper” effects and to better respect object geometry [27]. Meanwhile, a transferable multisource-fusion baseline gradually emerged: support vector machines (SVMs) highlighted small-sample efficiency and margin-based robustness [28,29], random forests (RFs) were shown to be robust under noisy high-dimensional variables [30,31], and multisource integration (e.g., imagery with topographic cues) became common in regional mapping.

However, shallow representations and rigid multi-scale schemes often struggle to maintain the continuity of slender, banded units across scales, leading to stripe breakage and topology inconsistencies. Deep-learning-based geological remote sensing interpretation (GRSI) therefore increasingly leverages encoder–decoder segmentation backbones and multibranch/multimodal designs to integrate multi-level semantics. For instance, Latifovic et al. [32] integrated multiplatform imagery and DEMs within CNNs for regional mapping; Sang et al. [33] developed a CNN-based pipeline using UAV imagery; and Han et al. [34] proposed a multimodal, multibranch segmentation network to improve robustness for lithologic mapping. Such designs highlight the importance of multi-scale fusion and structural context for preserving anisotropic body continuity.

2.3. Long-Tailed Imbalance, Ambiguous Contacts, and Spatial Consistency

A further difficulty in lithologic mapping is that transitional zones and mixed contacts yield diffuse boundaries, while class-frequency imbalance suppresses rare units and increases boundary errors. Pixel-wise predictions may exhibit salt-and-pepper noise and fragmented objects, particularly near contacts and for under-represented classes. Recent deep segmentation studies in GRSI have therefore explored context modeling and regularization strategies to improve spatial coherence. Chen et al. [4], for example, introduced an adversarial semi-supervised network with object–context modeling and global attention to mitigate label scarcity and improve contextual reasoning, which is also relevant for stabilizing predictions near ambiguous contacts. Nevertheless, robust handling of long-tailed classes and boundary-sensitive optimization remains an open issue in practice, especially when mapping large regions with heterogeneous acquisition conditions. This motivates training objectives that emphasize hard examples and contact zones and inference-time refinement that enforces lightweight spatial consistency.

2.4. Summary and Positioning of This Work

Overall, prior work has made progress on (i) enhancing discriminative spectral cues under low contrast [22,23,24], (ii) incorporating shape/context or multi-scale fusion to better preserve narrow anisotropic bodies [32,33,34], and (iii) improving contextual reasoning and spatial coherence in deep segmentation [4,27]. However, a unified and lightweight framework that simultaneously strengthens discrimination among spectrally similar units, preserves stripe continuity across scales, and improves robustness for long-tailed classes and ambiguous contacts remains limited. Motivated by these gaps, we develop SegNeXt-HFCA, which combines hierarchical multi-scale fusion with coordinate attention and a geometry-aware head, together with class-/boundary-aware optimization and lightweight post-refinement, to improve lithologic mapping in structurally complex regions.

3. Study Area and Data

3.1. Study Areas

The regional tectonic framework of the eastern Tianshan–Beishan segment in the southern Central Asian Orogenic Belt (CAOB) is summarized in Figure 1, together with the locations of the two study areas. Huitongshan is situated within the Beishan Rift domain, whereas Xingxingxia lies closer to the Yamansu back-arc basin and is adjacent to the Kangguer–Huangshan ductile shear zone. Although both areas are overprinted by EW- to NE-striking dextral strike-slip fault systems, they differ in map-scale lithologic architecture: Huitongshan is characterized by more fragmented, narrow stripe-like units and dense faulting, while Xingxingxia contains comparatively broader granitoid bodies and more laterally continuous contacts. These contrasts are evident in the Sentinel-2 composites and the corresponding lithologic maps shown in Figure 2 and Figure 3, and they partly explain the higher spectral mixing and overall mapping difficulty in Huitongshan at 10 m resolution.

We consider two representative study areas in the Beishan orogen, Gansu Province, NW China—Huitongshan and Xingxingxia—located in an arid to semi-arid setting with sparse vegetation, where extensive bedrock exposure facilitates lithologic interpretation from satellite imagery [34]. Representative Sentinel-2 composites and the corresponding geological maps are shown in Figure 2 (Huitongshan) and Figure 3 (Xingxingxia). Huitongshan chiefly exposes syenogranite, two-mica quartz schist, diorite, sedimentary successions, basalt, quartz diorite, amphibole–mica schist, gabbro, and monzonite; Xingxingxia comprises syenogranite, two-mica quartz schist, diorite, sedimentary rocks, migmatitic gneiss, dolomitic marble, coarse- and fine-grained monzogranite, granodiorite, and trachyandesitic tuff. The areas contain narrow, stripe-like units and dense faulting, providing a challenging yet representative test bed for automated lithologic mapping.

3.2. Data and Preprocessing

Sentinel-2 Level-2A bottom-of-atmosphere (BOA) reflectance imagery was used for both study areas. For the Huitongshan area, we selected a Sentinel-2B Level-2A scene (product ID S2B_MSIL2A_20240917T042659_N0511_R133_T46TFL_20240917T074601) acquired on 17 September 2024, within the local snow-free season (June–October). A similarly low-cloud Level-2A scene from the same season was used for the Xingxingxia area. All scenes were downloaded from the Copernicus Open Access Hub and required to have less than 10% scene-level cloud cover. Clouds and shadows were further masked using the QA60 bitmask and the s2cloudless product with a 2–3-pixel dilation. Bands at 20 m (B5, B6, B7, B8A, B11, and B12) were resampled to 10 m (bilinear interpolation for reflectance; nearest-neighbor resampling for masks). Sentinel-2 MSI provides 10 m spatial resolution for B2–B4 and B8, 20 m for B5–B7, B8A, B11, and B12, and 60 m for B1, B9, and B10 [35]. For lithologic mapping, visible bands (B2–B4) delineate landforms and bare-rock exposures, red-edge bands (B5–B7 and B8A) are sensitive to canopy/chlorophyll and weathering, and NIR/SWIR bands (B8, B11, and B12) capture mineralogical and moisture contrasts [36]. The topography was represented by a 12.5 m resolution digital elevation model (DEM) derived from a province-wide photogrammetric DEM mosaic for Gansu Province and resampled to 10 m to match the Sentinel-2 grid. All channels were standardized using z-score statistics computed on the training set only and then fixed for validation and testing.
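The train-only standardization described above can be illustrated with a minimal, dependency-free sketch. The function names `fit_zscore` and `apply_zscore`, the `eps` guard, and the toy reflectance values are assumptions introduced here for illustration; they are not taken from the authors' code.

```python
import math

def fit_zscore(train_values):
    """Mean and standard deviation of one channel, from training pixels only."""
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((v - mean) ** 2 for v in train_values) / n
    return mean, math.sqrt(var)

def apply_zscore(values, mean, std, eps=1e-8):
    """Standardize any split with the statistics fitted on the training split."""
    return [(v - mean) / (std + eps) for v in values]

# Toy single-channel example: statistics come from `train` and are reused as-is.
train = [0.10, 0.20, 0.30, 0.40]
test = [0.25, 0.50]
mean, std = fit_zscore(train)
test_standardized = apply_zscore(test, mean, std)
```

Fitting the statistics once and freezing them keeps test pixels from influencing the input scaling.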

3.3. Feature Stack Construction (10 Channels)

To reduce inter-band redundancy and suppress sensor/atmospheric noise while preserving diagnostically useful spectral–textural cues, we apply principal component analysis (PCA) and the minimum noise fraction (MNF) transform to the atmospherically corrected Sentinel-2 reflectance stack and retain only the leading components that capture the dominant variability of the scene [37]. For PCA, we mask no-data pixels and perform per-band z-score normalization; the first four principal components (PC1–PC4) explain 99.94% of the total variance in the 12-band stack, whereas higher-order PCs contribute marginal variance and tend to amplify residual artifacts. MNF further performs noise whitening and ranks components by effective signal-to-noise ratio; accordingly, we keep the first three MNF components (MNF1–MNF3) as the most informative noise-robust features. The final 10-channel input stack is therefore {PC1–PC4, MNF1–MNF3, Elevation, Slope, Topographic Position Index (TPI)}.

All layers are resampled to 10 m (bilinear interpolation for continuous variables), co-registered, and standardized (per-band z-score) prior to training. Slope is derived using the Horn operator, and TPI is computed using an approximately 250 m neighborhood. Grayscale base images of the 10 input layers are provided in Figure 4 to illustrate their spatial characteristics. Elevation, slope, and TPI are incorporated as non-causal contextual descriptors. We note that the topography in the study region is largely controlled by tectonic structure and erosion rather than lithology alone. Nevertheless, at the mapping scale, DEM derivatives can highlight geomorphic expressions associated with contacts and structural relief (e.g., ridge–valley patterns and breaks in slope), which helps suppress speckle noise and improves spatial continuity without implying a causal lithology–topography relationship. In addition, class-wise pixel-count histograms for Huitongshan and Xingxingxia (Figure 5) show a pronounced long-tailed distribution across lithologic units; these statistics are subsequently used to derive class-frequency priors in Section 4 [36].
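As an illustration of the two DEM derivatives, the following dependency-free sketch computes slope with the Horn 3 × 3 operator and a simple square-window TPI. It is a sketch under stated assumptions: border cells are left undefined for slope, the TPI window is a clipped square that excludes the center cell, and `radius_cells` is a hypothetical parameter, because the text does not state whether the approximately 250 m neighborhood is a radius or a diameter.

```python
import math

def horn_slope_deg(dem, cell_size):
    """Slope in degrees for interior cells of a 2D grid (list of rows)."""
    rows, cols = len(dem), len(dem[0])
    out = [[None] * cols for _ in range(rows)]  # border cells stay None
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            nw, n, ne = dem[r - 1][c - 1], dem[r - 1][c], dem[r - 1][c + 1]
            w, e = dem[r][c - 1], dem[r][c + 1]
            sw, s, se = dem[r + 1][c - 1], dem[r + 1][c], dem[r + 1][c + 1]
            # Horn operator: weighted central differences over the 3x3 window.
            dzdx = ((ne + 2 * e + se) - (nw + 2 * w + sw)) / (8 * cell_size)
            dzdy = ((sw + 2 * s + se) - (nw + 2 * n + ne)) / (8 * cell_size)
            out[r][c] = math.degrees(math.atan(math.hypot(dzdx, dzdy)))
    return out

def tpi(dem, radius_cells):
    """Elevation minus the mean of the surrounding square window (center excluded)."""
    rows, cols = len(dem), len(dem[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, count = 0.0, 0
            for rr in range(max(0, r - radius_cells), min(rows, r + radius_cells + 1)):
                for cc in range(max(0, c - radius_cells), min(cols, c + radius_cells + 1)):
                    if (rr, cc) != (r, c):
                        total += dem[rr][cc]
                        count += 1
            out[r][c] = dem[r][c] - total / count
    return out

# Toy DEM: a plane rising 10 m per 10 m cell in the column direction.
plane = [[10.0 * c for c in range(5)] for _ in range(5)]
slope = horn_slope_deg(plane, cell_size=10.0)
position = tpi(plane, radius_cells=1)
```

On a uniformly tilted plane the slope is constant and the TPI of interior cells is zero, which makes the sketch easy to sanity-check before applying it to a real DEM.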

3.4. Annotation and Splits

Reference labels were compiled from regional geological maps and limited field checks and harmonized into a unified taxonomy of K = 15 lithologic classes across both areas (Huitongshan: 9 classes; Xingxingxia: 10 classes). Polygons were rasterized to 10 m (nearest neighbor) and cross-checked by two annotators; low-confidence transition zones were masked. The co-registered images and labels were tiled into non-overlapping 128 × 128 patches (10 m GSD). For each area, we formed an independent dataset and performed an 80%/20% random split into training and test sets with a fixed seed to ensure reproducibility (no cross-patch overlap between splits). For area-wise reporting, classes absent in a given area are marked “n/a” and excluded from the mean-over-classes metrics.
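The tiling and split procedure can be sketched as follows. This is a minimal illustration, assuming that partial tiles at the raster edge are dropped and that the split is a seeded shuffle; the seed value 42 and the function names are hypothetical, since the text only states that a fixed seed was used.

```python
import random

def tile_origins(height, width, tile=128):
    """Top-left corners of non-overlapping tiles that fit fully inside the raster."""
    return [(r, c)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]

def split_tiles(origins, train_frac=0.8, seed=42):
    """Seeded random split of tile origins into disjoint train and test lists."""
    rng = random.Random(seed)      # local generator keeps the split reproducible
    shuffled = list(origins)
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

# Toy raster of 1280 x 1280 pixels (12.8 km x 12.8 km at 10 m GSD).
origins = tile_origins(1280, 1280)
train_tiles, test_tiles = split_tiles(origins)
```

Because the tiles do not overlap, splitting at the tile level guarantees that no pixel appears in both sets.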

4. Methodology

The overall architecture of SegNeXt-HFCA is shown in Figure 6. Building on the preceding problem analysis and related work, this section details the proposed framework: we first introduce the hierarchical multi-scale encoder with coordinate-aware re-calibration (HMS-CA), then describe the loss functions and class-frequency priors, followed by the robust training and inference strategy and the DenseCRF-based refinement. Each component is designed to address a specific geological difficulty: (i) the HMS encoder mitigates receptive-field mismatch and improves topology preservation for thin, anisotropic lithologic belts; (ii) the coordinate-aware boundary re-calibration injects boundary/positional cues to stabilize predictions in weak-texture or low-contrast regions; (iii) robust hard-example training (RHT) together with class-frequency priors strengthens the learning of rare classes under long-tailed distributions; and (iv) DenseCRF refinement sharpens ambiguous contacts and suppresses salt-and-pepper artifacts. The effectiveness of these designs is evaluated through the ablation study and qualitative analyses presented in the Results section.

4.1. Hierarchical Multi-Scale Encoder with Coordinate-Aware Re-Calibration (HMS-CA)

To jointly capture local textures, inter-band correlations, and cross-scale spatial relations, we adopt a stage-wise multi-scale fusion strategy with coordinate-aware boundary re-calibration. Let the backbone outputs be feature maps at four stages, $\{A^{(1)}, A^{(2)}, A^{(3)}, A^{(4)}\}$, corresponding to $\{C1, C2, C3, C4\}$. We perform hierarchical multi-scale fusion on $\{C2, C3, C4\}$ and use $C1$ only for edge-guided boundary cues.

For $s \in \{2, 3, 4\}$ (corresponding to $\{C2, C3, C4\}$), each feature map $A^{(s)} \in \mathbb{R}^{C_s \times H_s \times W_s}$ is projected to a unified channel width $C$ and resized to a common spatial resolution $H \times W$ (set to the spatial size of $C2$) as follows:

$$\tilde{A}^{(s)} = \phi_s(A^{(s)}) \in \mathbb{R}^{C \times H \times W}, \quad s \in \{2, 3, 4\}. \tag{1}$$

In our implementation, $C = 256$, and $\phi_s(\cdot)$ is a $1 \times 1$ pointwise convolution followed by bilinear interpolation to match $H \times W$. The hierarchical sampling ratios (relative to the input) are $C2: 1/2$, $C3: 1/4$, and $C4: 1/8$, and all stages are aligned to $H \times W = H_{C2} \times W_{C2}$.
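To make the alignment concrete, the shape bookkeeping for a 128 × 128 input patch can be sketched as below. The helper `hms_aligned_shapes` is hypothetical and only tracks tensor shapes under the stated ratios and channel width; it performs no convolution or interpolation.

```python
def hms_aligned_shapes(in_h, in_w, width=256, ratios=None):
    """Shapes (channels, height, width) before and after alignment to C2."""
    ratios = ratios or {"C2": 2, "C3": 4, "C4": 8}   # downsampling factors
    native = {name: (in_h // r, in_w // r) for name, r in ratios.items()}
    target = native["C2"]                             # common H x W
    aligned = {name: (width, *target) for name in native}
    concat = (width * len(native), *target)           # channel-wise concatenation
    return native, aligned, concat

native, aligned, concat = hms_aligned_shapes(128, 128)
```

For a 128 × 128 patch, the three stages have native sizes 64 × 64, 32 × 32, and 16 × 16. After projection and resizing each is 256 × 64 × 64, so concatenating them gives a 768-channel tensor at 64 × 64.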

Instead of a weighted summation, the aligned multi-scale features are concatenated and fused by lightweight depthwise-separable operators:

$$X = \mathrm{Concat}(\tilde{A}^{(2)}, \tilde{A}^{(3)}, \tilde{A}^{(4)}) \in \mathbb{R}^{3C \times H \times W}. \tag{2}$$

We apply a local fusion block $\psi(\cdot)$ composed of a depthwise $3 \times 3$ + pointwise $1 \times 1$ convolution and an anisotropic strip convolution to enhance elongated/banded structures:

$$F_0 = \psi(X) \in \mathbb{R}^{C \times H \times W}. \tag{3}$$

To further aggregate multi-scale contextual information, a lightweight ASPP module (ASPP-lite) with dilated depthwise convolutions is employed:

$$F_1 = \mathrm{ASPP}(F_0) \in \mathbb{R}^{C \times H \times W}. \tag{4}$$

Specifically, ASPP-lite uses dilation rates $\{1, 6, 12, 18\}$ with an image-level pooling branch, while the anisotropic strip convolution uses kernel size $k = 9$.

Next, we generate a coarse segmentation prediction and use it to compute class-wise context via a lightweight OCR-style aggregation:

$$Y_{coarse} = \mathrm{Conv}_{1 \times 1}(F_1), \quad F_2 = \mathrm{OCR}(F_1, Y_{coarse}) \in \mathbb{R}^{C \times H \times W}. \tag{5}$$

Here, OCR-lite uses $\mathrm{softmax}(Y_{coarse})$ as soft masks to form per-class prototypes and re-project them back to the pixel space, yielding class-conditioned contextual enhancement.
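One plausible reading of this step, following the usual object-contextual representation formulation, is written out below. The symbols $m_{k,p}$, $f_k$, $w_{p,k}$, and the fusion map $g(\cdot)$ are introduced here for illustration only and are assumptions; the exact re-projection used in OCR-lite may differ.

```latex
\begin{aligned}
m_{k,p} &= \big[\operatorname{softmax}_k (Y_{coarse})\big]_{k,p}
  && \text{(soft mask of class } k \text{ at pixel } p\text{)} \\
f_k &= \frac{\sum_{p} m_{k,p}\, F_{1,p}}{\sum_{p} m_{k,p}}
  && \text{(per-class prototype)} \\
w_{p,k} &= \operatorname{softmax}_k \big(F_{1,p}^{\top} f_k\big)
  && \text{(pixel-to-class affinity)} \\
F_{2,p} &= g\Big(\big[\,F_{1,p}\,;\ \textstyle\sum_{k} w_{p,k}\, f_k\,\big]\Big)
  && \text{(re-projection to pixel space)}
\end{aligned}
```

Under this reading, both the prototype pooling and the re-projection cost on the order of $HW \cdot K$ operations per channel, so the step grows linearly with the spatial size and the number of classes.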

Finally, we inject boundary cues from the high-resolution shallow feature $A^{(1)}$ (i.e., $C1$) through an edge-guided gating mechanism. Let the edge attention map be

$$M_e = \sigma(\mathrm{Conv}_{1 \times 1}(\|\nabla A^{(1)}\|_2)), \tag{6}$$

where $\nabla(\cdot)$ is implemented by Sobel filters, $\sigma(\cdot)$ is a sigmoid, and $M_e$ is resized to $H \times W$. The refined feature is obtained by

$$E = \mathrm{Align}(F_2) \odot (1 + \alpha \cdot \uparrow M_e), \tag{7}$$

where $\odot$ denotes element-wise multiplication, $\uparrow$ denotes bilinear resizing, $\mathrm{Align}(\cdot)$ is a $1 \times 1$ projection, and $\alpha = 0.7$ in our implementation.
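The gating in Eqs. (6) and (7) can be illustrated on a single channel with a dependency-free sketch. Several simplifications are assumed here and are not from the source: the shallow map and the feature share one resolution (no resizing), the $1 \times 1$ convolution is reduced to a scalar `weight` and `bias`, the Align projection is the identity, and border pixels receive zero gradient.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))

def sobel_magnitude(img):
    """L2 norm of the Sobel gradient for interior pixels; borders are 0."""
    rows, cols = len(img), len(img[0])
    mag = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    v = img[r + dr][c + dc]
                    gx += SOBEL_X[dr + 1][dc + 1] * v
                    gy += SOBEL_Y[dr + 1][dc + 1] * v
            mag[r][c] = math.hypot(gx, gy)
    return mag

def edge_gate(feature, shallow, alpha=0.7, weight=1.0, bias=0.0):
    """E = feature * (1 + alpha * sigmoid(weight * |grad shallow| + bias))."""
    mag = sobel_magnitude(shallow)
    gated = []
    for f_row, m_row in zip(feature, mag):
        row = []
        for f, m in zip(f_row, m_row):
            m_e = 1.0 / (1.0 + math.exp(-(weight * m + bias)))  # edge attention
            row.append(f * (1.0 + alpha * m_e))
        gated.append(row)
    return gated

feature = [[2.0] * 4 for _ in range(4)]
flat = [[1.0] * 4 for _ in range(4)]                  # no edges
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]       # vertical contact
gated_flat = edge_gate(feature, flat)
gated_step = edge_gate(feature, step)
```

Because the gate is $1 + \alpha M_e$ rather than $M_e$ alone, features are never suppressed below their original magnitude; pixels near a contact are amplified more than pixels in homogeneous regions.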

Optionally, we further apply a memory-safe classwise transformer attention $A(\cdot)$ on $E$, followed by the final prediction layer:

$$\hat{E} = E + \gamma A(E), \quad Y = \mathrm{Conv}_{1 \times 1}(\hat{E}). \tag{8}$$

Here, $A(\cdot)$ is implemented with a linear-attention approximation and an adaptive token downsampling strategy that ensures the effective token number is $\leq 4096$ (max_tokens), and it includes a learnable residual scaling coefficient $\gamma$ (initialized to zero). In our implementation, $A(\cdot)$ uses $h = 4$ heads and $d = 32$ per head (i.e., inner dimension $hd = 128$). The prediction head $\mathrm{Head}(\cdot)$ is a lightweight convolutional classifier (a $3 \times 3$ convolution followed by a $1 \times 1$ prediction layer).

Computational complexity: The HMS fusion in (1)–(4) is linear in the number of pixels, dominated by $1 \times 1$ projections, bilinear resizing, and depthwise-separable convolutions, i.e., $O(HW)$ at the aligned resolution. The OCR-lite aggregation in (5) scales linearly with the spatial size and the number of classes and remains lightweight in practice. For the optional attention in (8), compared with full self-attention $O(N^2)$, our implementation avoids forming an $N \times N$ attention matrix and achieves approximately linear complexity in $N$; moreover, the adaptive downsampling caps $N \leq 4096$, keeping additional computation and memory manageable for high-resolution inputs.
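The effect of the token cap can be illustrated with a small calculation. The source does not say how the adaptive downsampling is performed; the sketch below assumes integer-stride pooling and picks the smallest stride that satisfies the cap, so `token_stride` is a hypothetical helper rather than the authors' routine.

```python
import math

def token_stride(height, width, max_tokens=4096):
    """Smallest integer stride s with ceil(H/s) * ceil(W/s) <= max_tokens."""
    stride = 1
    while math.ceil(height / stride) * math.ceil(width / stride) > max_tokens:
        stride += 1
    return stride

def token_count(height, width, stride):
    """Number of tokens left after pooling with the given stride."""
    return math.ceil(height / stride) * math.ceil(width / stride)

stride_64 = token_stride(64, 64)      # 4096 tokens already fit
stride_128 = token_stride(128, 128)   # 16384 tokens must be reduced
```

For comparison, a full self-attention map over the capped 4096 tokens would hold 4096² ≈ 1.7 × 10⁷ entries per head; the linear-attention approximation avoids materializing this matrix.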

4.2. Loss Functions and Class-Frequency Prior

To cope with the long-tailed class distribution, sample imbalance, and blurred boundaries in GRSI, SegNeXt-HFCA adopts a class-frequency-weighted online hard example mining cross-entropy (OHEM-CE) combined with the Lovász-Softmax loss. Together with auxiliary-branch supervision and a class-frequency prior, this design improves the consistency between the training objective and the mIoU-oriented evaluation metric. Let the logits of the main branch be $Z \in \mathbb{R}^{K \times H \times W}$ (channel-first implementation), where $K$ is the number of lithological classes. The predicted probabilities are $P = \mathrm{softmax}(Z)$, with the softmax applied along the class dimension, and the ground-truth labels are $Y \in \{0, \dots, K - 1\}^{H \times W}$. In each mini-batch, OHEM selects a set $T$ of hard pixels ranked by per-pixel loss (with $|T|$ determined by the hard-pixel ratio $\rho$) and computes the averaged cross-entropy over $T$:

$$L_{ce} = \frac{1}{|T|} \sum_{p \in T} l_{ce}(P_p, Y_p), \tag{9}$$

where $l_{ce}(\cdot)$ denotes the class-weighted cross-entropy with label smoothing (set to 0.02), and the class weights are computed from empirical class frequencies. In our implementation, the hard-pixel ratio $\rho$ follows a piecewise schedule in Phase I: $\rho = 1.0$ (0–2k iterations), 0.8 (2k–6k), 0.6 (6k–12k), and 0.5 (12k–15k); in Phase II, $\rho$ is further annealed from 1.0 to a floor value of 0.5. This schedule makes early iterations emphasize stable global convergence, while later iterations focus more on difficult pixels and boundary regions.
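The Phase I schedule and the hard-pixel average of Eq. (9) can be sketched as follows. Two details are assumptions made for illustration: the iteration intervals are treated as half-open (for example, iteration 2000 already uses 0.8), and $|T|$ is taken as the ceiling of $\rho$ times the number of valid pixels.

```python
import math

# (upper iteration bound, rho) pairs for Phase I, as listed in the text.
PHASE1_SCHEDULE = ((2000, 1.0), (6000, 0.8), (12000, 0.6), (15000, 0.5))

def hard_pixel_ratio(iteration):
    """Piecewise-constant hard-pixel ratio rho for a Phase I iteration."""
    for upper, rho in PHASE1_SCHEDULE:
        if iteration < upper:
            return rho
    return PHASE1_SCHEDULE[-1][1]   # hold the final value afterwards

def ohem_mean(pixel_losses, rho):
    """Average the largest ceil(rho * N) per-pixel losses (the set T)."""
    k = max(1, math.ceil(rho * len(pixel_losses)))
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / k

losses = [0.1, 0.2, 0.9, 1.1]
loss_all = ohem_mean(losses, hard_pixel_ratio(0))        # rho = 1.0
loss_hard = ohem_mean(losses, hard_pixel_ratio(14000))   # rho = 0.5
```

With $\rho = 1.0$ the objective reduces to the ordinary mean cross-entropy; lowering $\rho$ discards the easiest pixels, so the same batch yields a larger loss that is concentrated on difficult regions.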

To better align the optimization objective with the mIoU-oriented metric, we additionally apply the Lovász-Softmax loss $L_{lovasz}$ on the main branch. The total loss for the main branch is

$$L_{main} = L_{ce} + \lambda_{lov} L_{lovasz}, \tag{10}$$

where $\lambda_{lov}$ is a balancing coefficient. On the decoder side, we retain two auxiliary prediction heads, attached to the intermediate and shallow backbone stages ($C3$ and $C2$) through $1 \times 1$ predictors, and supervise them using the same OHEM-CE formulation. The corresponding auxiliary losses are denoted by $L_{ce}^{(1)}$ and $L_{ce}^{(2)}$. Their contribution is incorporated into the overall objective with weight $\lambda_{aux} = 0.1$:

$$L_{aux} = \lambda_{aux} (L_{ce}^{(1)} + L_{ce}^{(2)}), \tag{11}$$

and the final training objective is

$$L = L_{main} + L_{aux}. \tag{12}$$

To mitigate the bias introduced by class imbalance, we estimate the empirical class prior $\pi_k$ from the training set (e.g., $\pi_k = n_k / \sum_{j=1}^{K} n_j$). Let $b_k$ denote the bias term of the final segmentation layer for class $k$. We initialize $b_k = \log \pi_k$ to reduce the variance of early logit estimates and accelerate convergence to a useful regime. In addition, median-frequency balancing is used to define the class weights:

$$w_k = \left( \frac{\mathrm{median}(\{\pi_j\}_{j=1}^{K})}{\pi_k} \right)^{\gamma}, \tag{13}$$

where $\gamma$ is a scaling exponent (distinct from the residual scaling coefficient $\gamma$ in Eq. (8)). We use a two-stage setting of $\gamma$: it is set to 0.45 initially and switched to 0.60 at 2k iterations to strengthen the reweighting of rare classes. For numerical stability, $w_k$ is clipped to $[0.65, 2.2]$.
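A worked instance of the prior, the bias initialization, and Eq. (13) with clipping is sketched below. The pixel counts are toy values chosen for illustration, and the sketch assumes every class has a nonzero count (classes absent from an area would need to be excluded first).

```python
import math
from statistics import median

def class_prior(counts):
    """Empirical prior pi_k = n_k / sum_j n_j."""
    total = sum(counts)
    return [n / total for n in counts]

def bias_init(prior):
    """Initial bias b_k = log(pi_k) of the final segmentation layer."""
    return [math.log(p) for p in prior]

def median_frequency_weights(prior, gamma, lo=0.65, hi=2.2):
    """w_k = (median(pi) / pi_k) ** gamma, clipped to [lo, hi]."""
    med = median(prior)
    return [min(hi, max(lo, (med / p) ** gamma)) for p in prior]

counts = [100, 10, 1]                      # toy long-tailed pixel counts
prior = class_prior(counts)
early = median_frequency_weights(prior, gamma=0.45)   # before 2k iterations
late = median_frequency_weights(prior, gamma=0.60)    # after 2k iterations
```

In this toy case the unclipped weight of the rarest class at $\gamma = 0.60$ would be $10^{0.6} \approx 3.98$ and that of the dominant class $10^{-0.6} \approx 0.25$. Clipping brings them to 2.2 and 0.65, which bounds the ratio between the largest and smallest weight at about 3.4.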

To make the loss explicitly boundary-sensitive, we derive a thin boundary mask $M_{bd} \in \{0, 1\}^{H \times W}$ from the ground-truth labels by applying a morphological gradient followed by a slight dilation. For pixels on the boundary mask, we up-weight the per-pixel cross-entropy loss by a factor $(1 + \lambda_{bd})$ before applying OHEM, whereas non-boundary pixels keep a weight of 1.0. Concretely, the reweighted per-pixel loss is

$$l_{ce}^{*}(p) = \left( 1 + \lambda_{bd} M_{bd}(p) \right) l_{ce}(p), \tag{14}$$

and OHEM is performed on $l_{ce}^{*}(p)$. We set $\lambda_{bd} = 0.5$ in our experiments. This design biases optimization toward structurally important contact zones, which are typically under-represented in a class-balanced objective.
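
The boundary weighting of Equation (14) can be illustrated with a small NumPy sketch. Several choices here are assumptions rather than the authors' settings: the 3 × 3 structuring element, a single dilation step, and a simple top-fraction selection standing in for OHEM (`keep_ratio` is hypothetical, and the OHEM rule defined earlier in the paper may differ). Ignore labels are not handled.

```python
import numpy as np

def _neighborhood_stack(a, pad_mode):
    """Stack the 3x3 neighbourhood of every pixel along a new leading axis."""
    p = np.pad(a, 1, mode=pad_mode)
    h, w = a.shape
    return np.stack([p[i:i + h, j:j + w] for i in range(3) for j in range(3)])

def boundary_mask(labels, dilate_iters=1):
    """Morphological gradient of the label map, then a slight dilation."""
    nb = _neighborhood_stack(labels, "edge")
    mask = nb.max(axis=0) != nb.min(axis=0)      # gradient is non-zero
    for _ in range(dilate_iters):
        mask = _neighborhood_stack(mask, "constant").any(axis=0)
    return mask.astype(np.uint8)

def boundary_weighted_ce(probs, labels, lambda_bd=0.5):
    """Eq. (14): per-pixel CE scaled by (1 + lambda_bd * M_bd).

    probs  : (K, H, W) softmax probabilities
    labels : (H, W) integer ground-truth labels
    """
    h, w = labels.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    p_true = probs[labels, rows, cols]           # probability of the true class
    ce = -np.log(np.clip(p_true, 1e-12, 1.0))
    return (1.0 + lambda_bd * boundary_mask(labels)) * ce

def ohem_mean(pixel_loss, keep_ratio=0.25):
    """Stand-in for OHEM: average the hardest fraction of pixels."""
    flat = np.sort(pixel_loss.ravel())[::-1]
    k = max(1, int(keep_ratio * flat.size))
    return flat[:k].mean()
```

Because the reweighting is applied before the hard-pixel selection, boundary pixels are both scaled up and more likely to be selected, which is the behaviour the text describes.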

4.3. Robust Training and Inference Strategies

To achieve stable optimization and reliable generalization in large-area lithologic mapping—where long-tailed class distributions, mixed pixels at contacts, and fragmented unit boundaries are common—we adopt a training–inference protocol that separates representation learning from hard-structure refinement.

We use a two-phase schedule. Phase I targets stable convergence and learns transferable cross-scale representations with the HMS-CA encoder. Phase II performs a short refinement from the best Phase-I checkpoint using a smaller step size, where the peak learning rate is reduced to 0.3× of Phase I (from $2 \times 10^{-4}$ to $6 \times 10^{-5}$ under the same warm-up + cosine decay policy). This reduction stabilizes optimization when the training focus shifts toward informative hard pixels and contact regions, preventing oscillation or excessive drift after Phase-I convergence. In addition, robust hard-example training (RHT) concentrates updates on hard-but-reliable pixels to better delineate lithologic contacts and mitigate confusion among spectrally similar units; the sensitivity of the threshold parameter $\tau$ is analyzed in Section 5.3.
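
The two-phase learning-rate policy can be sketched as a plain function of the iteration index. The peak rates follow the text; the warm-up length (`warmup_steps`) and the final learning-rate floor (`min_lr`) are not given in this section and are assumptions made only for illustration.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps=500, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay towards min_lr.

    warmup_steps and min_lr are assumed values, not taken from the paper.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

PHASE1_PEAK = 2e-4                 # Phase I peak learning rate
PHASE2_PEAK = 0.3 * PHASE1_PEAK    # Phase II peak: 0.3x of Phase I = 6e-5
```

A training loop would call `warmup_cosine_lr(step, total_steps, PHASE1_PEAK)` during Phase I and restart the same policy with `PHASE2_PEAK` when refinement begins from the best Phase-I checkpoint.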

For full-scene prediction, we apply overlapping tiling with probability fusion (“seamless tiling”) to reduce border artifacts and improve spatial continuity. Optional test-time augmentation (TTA) averages a small set of geometric/scale transforms to enhance robustness in heterogeneous or low-contrast areas. When enabled, DenseCRF refinement is applied to the probability maps to sharpen contacts and suppress salt-and-pepper noise.
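
A minimal sketch of overlapping tiling with probability fusion is given below, using a 2D Hann window for blending. The 128-pixel tile matches the patch size used in the experiments; the 50% overlap (`stride=64`), the small `eps` added to the window, and the `predict_fn` callable are assumptions for illustration. The sketch also assumes the scene is at least one tile in each dimension and omits TTA and DenseCRF.

```python
import numpy as np

def hann2d(size, eps=1e-3):
    """Separable 2D Hann window; eps keeps the weights strictly positive."""
    w = np.hanning(size + 2)[1:-1]   # drop the two zero endpoints
    return np.outer(w, w) + eps

def tile_starts(length, tile, stride):
    """Start offsets that cover [0, length) with a final edge-aligned tile."""
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts

def seamless_predict(image, predict_fn, num_classes, tile=128, stride=64):
    """Fuse overlapping tile probabilities with window-weighted averaging.

    image      : (channels, H, W) array with H, W >= tile
    predict_fn : maps a (channels, tile, tile) patch to (num_classes, tile, tile)
                 class probabilities (hypothetical model wrapper)
    """
    _, h, w = image.shape
    acc = np.zeros((num_classes, h, w))
    norm = np.zeros((h, w))
    win = hann2d(tile)
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            probs = predict_fn(image[:, y:y + tile, x:x + tile])
            acc[:, y:y + tile, x:x + tile] += probs * win
            norm[y:y + tile, x:x + tile] += win
    return acc / norm
```

Down-weighting tile borders in this way means each output pixel is dominated by the tiles in which it lies near the centre, which is what suppresses seams at tile edges.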

4.4. Dense Conditional Random Field (DenseCRF) Refinement

To further improve boundary adherence and intra-region consistency, a dense conditional random field (DenseCRF) is applied to the network output probability maps for posterior refinement. Let $y$ denote the pixelwise label field, whose energy is

$$E(y) = \sum_{p} \psi_{u}(y_{p}) + \sum_{p < q} \psi_{p}(y_{p}, y_{q}), \tag{15}$$

where the unary potential $\psi_{u}(y_{p})$ is given by the network logits, and the pairwise term $\psi_{p}(y_{p}, y_{q})$ is modeled as a linear combination of spatial and bilateral kernels under a Potts compatibility function, jointly exploiting spatial proximity and spectral similarity between pixels. Inference is performed using a mean-field approximation accelerated by high-dimensional filtering, which typically converges within 5–10 iterations. With properly chosen weights and bandwidths for the spatial and bilateral kernels, the DenseCRF suppresses salt-and-pepper noise and small holes while preserving sharp edges, yielding more continuous contact zones and smoother interiors of lithologic bodies. As a post-processing module, the DenseCRF does not modify the backbone architecture and instead provides a complementary refinement to the front-end predictions of SegNeXt-HFCA. In the ablation study (Section 5.6), we refer to the combination of this DenseCRF refinement and the seamless sliding-window inference of Section 4.3 as the “BCRF + Seamless” configuration.
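
For reference, the pairwise term described above can be written out under the assumption that it follows the standard fully connected CRF formulation with one bilateral and one spatial Gaussian kernel. The symbols $s_{p}$ (pixel position), $I_{p}$ (per-pixel feature or spectral vector), the kernel weights $w_{1}, w_{2}$, and the bandwidths $\theta_{\alpha}, \theta_{\beta}, \theta_{\gamma}$ are notation introduced here for illustration; their values are not given in this section.

```latex
\psi_{p}(y_{p}, y_{q}) = \mu(y_{p}, y_{q}) \left[
    w_{1} \exp\!\left( -\frac{\lVert s_{p} - s_{q} \rVert^{2}}{2\theta_{\alpha}^{2}}
                       -\frac{\lVert I_{p} - I_{q} \rVert^{2}}{2\theta_{\beta}^{2}} \right)
  + w_{2} \exp\!\left( -\frac{\lVert s_{p} - s_{q} \rVert^{2}}{2\theta_{\gamma}^{2}} \right)
\right],
\qquad
\mu(y_{p}, y_{q}) = [\, y_{p} \neq y_{q} \,]
```

The first (bilateral) kernel penalizes different labels on nearby pixels with similar spectra, while the second (spatial) kernel removes small isolated regions; the Potts function $\mu$ applies the penalty only when the two labels differ.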

5. Experiments and Results

This section presents a comprehensive evaluation of the proposed method. We first describe the experimental setup and evaluation protocol, including dataset partitioning, training/inference settings, and primary metrics—overall pixel accuracy (PA) and mean intersection-over-union (mIoU). We then compare the model, under identical configurations, with representative semantic-segmentation baselines to verify its effectiveness. Finally, we conduct ablation studies and report model complexity and inference speed to quantify the accuracy–efficiency trade-off.

5.1. Experimental Settings

In both study areas (Huitongshan and Xingxingxia), the co-registered multispectral images and labels were tiled into non-overlapping 128 × 128 patches. For each area, we formed an independent dataset and randomly split it 80/20 into training and test sets. To ensure a fair and informative evaluation across different design paradigms, we benchmark against conventional CNN-based segmentation models (PSPNet [38], DeepLabV3+ [39], and DANet [40]), a lightweight real-time CNN (BiSeNetV2 [41]), hybrid architectures (U-NetFormer [42] and LGMSFNet [43]), transformer-based baselines (SegFormer and Swin-UperNet), and the backbone-matched SegNeXt-base model [18]. All baseline models were retrained under a unified protocol: identical data splits and image augmentations, the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴, a batch size of 8, and 15 k iterations with linear warm-up followed by cosine decay; other hyper-parameters were kept at their default settings. For SegNeXt-HFCA, we adopt the two-phase schedule described in Section 4.3: 15 k iterations in Phase I followed by 4 k iterations in Phase II, with a Phase-I peak learning rate of 2 × 10⁻⁴, the same warm-up/cosine policy, a weight decay of 5 × 10⁻³, and both robust hard-example training and EMA model averaging enabled. Apart from these optimization-related differences, the data splits and augmentations are identical across all methods. At test time, we report pixel accuracy (PA), mean pixel accuracy (mPA), and mean intersection-over-union (mIoU).

All experiments were implemented in Python 3.10.18 using PyTorch 2.9.0.dev20250703+cu128 with CUDA 12.8. Data preparation and cartographic post-processing were performed in ArcGIS 10.8.2 (Esri, Redlands, CA, USA). Model training and inference were conducted on a workstation equipped with an NVIDIA GeForce RTX 5060 Ti GPU (NVIDIA, Santa Clara, CA, USA; 16 GB VRAM).

5.2. Evaluation Metrics

To quantitatively evaluate segmentation performance, we report pixel accuracy (PA) and intersection-over-union (IoU), together with their class-averaged counterparts (mPA and mIoU). Let $C$ be the number of lithologic classes, and let $n_{ij}$ denote the number of pixels whose ground-truth label is class $i$ and whose predicted label is class $j$ (i.e., the entries of the confusion matrix). Pixels with the ignore label (e.g., 255) are excluded from all computations.

Pixel accuracy (PA): The overall pixel accuracy is defined as

$$PA = \frac{\sum_{i=1}^{C} n_{ii}}{\sum_{i=1}^{C} \sum_{j=1}^{C} n_{ij}}. \tag{16}$$

The per-class pixel accuracy (i.e., class-wise recall) for class $i$ is

$$PA_{i} = \frac{n_{ii}}{\sum_{j=1}^{C} n_{ij}}, \tag{17}$$

and the mean pixel accuracy is

$$mPA = \frac{1}{|\Omega_{PA}|} \sum_{i \in \Omega_{PA}} PA_{i}, \tag{18}$$

where $\Omega_{PA} = \{ i \mid \sum_{j=1}^{C} n_{ij} > 0 \}$ denotes the set of classes present in the evaluation set.

Intersection-over-union (IoU): For class $i$, IoU is defined as

$$IoU_{i} = \frac{n_{ii}}{\sum_{j=1}^{C} n_{ij} + \sum_{j=1}^{C} n_{ji} - n_{ii}}, \tag{19}$$

where the denominator corresponds to the union of the predicted and ground-truth regions for class $i$. The mean IoU is computed by

$$mIoU = \frac{1}{|\Omega_{IoU}|} \sum_{i \in \Omega_{IoU}} IoU_{i}, \tag{20}$$

with $\Omega_{IoU} = \{ i \mid \sum_{j=1}^{C} n_{ij} + \sum_{j=1}^{C} n_{ji} - n_{ii} > 0 \}$ indicating classes whose union is non-empty.

In practice, $PA$ reflects the overall correctness dominated by frequent classes, whereas $mPA$ reduces the influence of class imbalance by averaging class-wise recall. IoU/mIoU additionally penalizes both over- and under-segmentation and is more sensitive to boundary errors and fragmented predictions.
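
Equations (16)–(20) can be computed directly from the confusion matrix. The following NumPy sketch is illustrative rather than the evaluation code used in the paper; the function names are ours, and the ignore label of 255 follows the example given in the text.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes, ignore_label=255):
    """n[i, j] = number of pixels with ground truth i predicted as j."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    keep = y_true != ignore_label            # drop ignored pixels
    idx = (y_true[keep].astype(np.int64) * num_classes
           + y_pred[keep].astype(np.int64))
    return np.bincount(idx, minlength=num_classes ** 2).reshape(
        num_classes, num_classes)

def segmentation_metrics(n):
    """Return (PA, mPA, mIoU) following Eqs. (16)-(20)."""
    n = n.astype(np.float64)
    tp = np.diag(n)
    gt = n.sum(axis=1)                       # row sums: ground-truth pixels
    pred = n.sum(axis=0)                     # column sums: predicted pixels
    union = gt + pred - tp
    pa = tp.sum() / n.sum()                  # Eq. (16)
    present = gt > 0                         # Omega_PA
    mpa = (tp[present] / gt[present]).mean() # Eqs. (17)-(18)
    valid = union > 0                        # Omega_IoU
    miou = (tp[valid] / union[valid]).mean() # Eqs. (19)-(20)
    return pa, mpa, miou
```

For example, with ground truth `[0, 0, 1, 1, 2, 255]` and predictions `[0, 1, 1, 1, 0, 2]` over three classes, the ignored pixel is dropped and the remaining five pixels give PA = 0.6, mPA = 0.5, and mIoU = 1/3.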

5.3. Sensitivity Analysis

To assess the sensitivity of the proposed refinement to the threshold parameter $\tau$, we performed a sweep on the Huitongshan validation set with $\tau \in [0, 1]$. In our implementation, $\tau$ is a confidence-gating threshold that controls the activation region of the refinement: pixels satisfying $\max_{c} P(c \mid x_{p}) < \tau$ are treated as low-confidence and refined, while the remaining pixels keep the original prediction. The degenerate case $\tau = 0$ yields a gate ratio of 0% and therefore corresponds to the baseline without refinement.
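
The gating rule can be sketched as follows. The refinement operator itself is not specified in this section, so `refined_labels` is a hypothetical input standing for whatever the refinement step would predict; the function names are also ours.

```python
import numpy as np

def confidence_gate(probs, tau):
    """Mask of low-confidence pixels (max_c P(c | x_p) < tau) and gate ratio in %.

    probs : (C, H, W) class probabilities
    """
    conf = probs.max(axis=0)
    gate = conf < tau
    return gate, 100.0 * gate.mean()

def gated_refinement(probs, refined_labels, tau):
    """Keep the original argmax label except where the gate is active."""
    base = probs.argmax(axis=0)
    gate, ratio = confidence_gate(probs, tau)
    return np.where(gate, refined_labels, base), ratio
```

With $\tau = 0$ no pixel can satisfy the strict inequality, so the gate ratio is 0% and the output equals the unrefined prediction; with $\tau = 1$ every pixel whose maximum probability is below 1 is gated.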

As shown in Figure 7, the performance is essentially unchanged for $\tau \leq 0.3$ (gate ratio $\leq 0.23\%$), where mPA and mIoU remain at 78.36% and 66.34%, respectively. When $\tau$ increases beyond this range, the gate ratio grows rapidly (e.g., 1.72% at $\tau = 0.4$, 11.10% at $\tau = 0.6$, and 100% at $\tau = 1.0$), and both mPA and mIoU exhibit a consistent, albeit mild, decrease (mPA: 78.36% $\to$ 77.66%; mIoU: 66.34% $\to$ 66.03%). This trend suggests that refining an excessively large region tends to introduce over-smoothing around thin lithologic bodies and complex contact zones, where IoU is particularly sensitive to small boundary displacements under 10 m resolution and mixed pixels. Considering the stable plateau and the subsequent degradation when the gate becomes large, we set $\tau = 0.3$ for all experiments.

5.4. Comparative Experiments and Analysis on the Huitongshan Dataset

On the Huitongshan study area with nine lithologic classes, we compare SegNeXt-HFCA against a diverse set of baselines spanning CNNs (PSPNet, DeepLabV3+, and DANet), hybrid designs (U-NetFormer and LGMSFNet), a lightweight model (BiSeNetV2), transformer-based methods (SegFormer and Swin-UperNet), and the SegNeXt-base backbone. Table 1 and Table 2 summarize the quantitative results, and Figure 8 provides the confusion matrices.

At the class level, segmentation performance varies markedly with the intrinsic characteristics of each geological unit. Two-mica quartz schist, quartz diorite, amphibole–mica schist, and monzonite generally achieve PA values above 85% and IoU values above 75%, which can be attributed to their relatively uniform tone, homogeneous structure, and distinctive texture patterns. In contrast, syenogranite, gabbro, Quaternary deposits, and basalt have fuzzy contacts and occur as narrow, ribbon-like outcrops, making them difficult for deep-learning models to delineate accurately. Diorite forms more extensive bodies in the imagery, and its PA and IoU exceed 75% and 60%, respectively, despite residual confusion along unit boundaries.

Table 1 and Table 2 summarize the class-wise pixel accuracy (PA) and intersection-over-union (IoU) for the nine lithologic units in the Huitongshan study area (SyG, 2MQS, DI, Q, BA, QDI, HMS, GB, and MZ). Across methods, several units are consistently mapped with high fidelity—most notably 2MQS, QDI, and MZ, which maintain high PA (typically ~83–96%) and strong IoU (generally >70%). This pattern suggests that these units exhibit comparatively coherent spatial occurrences and more diagnostic spectral/texture cues at the mapping scale. By contrast, Quaternary deposits (Q) remain the principal source of uncertainty, showing the lowest region overlap for all models (IoU~25–35%) together with only moderate PA (about 50–57%), which is consistent with discontinuous surficial cover, mixed pixels along contacts, and frequent adjacency to multiple bedrock units.

In terms of overall performance, SegNeXt-HFCA yields the best mean scores, achieving mPA = 79.18% (Table 1) and mIoU = 67.29% (Table 2). Relative to the SegNeXt backbone alone (mPA = 75.68%, mIoU = 63.45%), this corresponds to improvements of +3.50 mPA and +3.84 mIoU, indicating a more consistent balance between pixel-wise recognition and area-level delineation. The gains are most evident for units that are sensitive to boundary ambiguity and local fragmentation: for Q, PA increases from 35.90% to 54.57% and IoU from 25.48% to 34.32%, and for BA, PA rises from 68.30% to 78.20%. Meanwhile, robust overlap is maintained for the easier units, e.g., 2MQS (IoU 87.64%) and QDI (IoU 82.97%), supporting stable extraction of their lithologic bodies. Notably, the best single-class scores are not always produced by the same model (e.g., Q IoU = 35.43% with U-NetFormer; DI IoU = 70.10% with DANet), underscoring that class-wise performance is strongly controlled by outcrop geometry and inter-unit similarity in addition to network design. As expected, mPA is generally higher than mIoU, because IoU penalizes boundary displacement and small-object omission more strictly, which becomes critical in narrow or heterogeneous contact zones.

In addition, PSPNet and DeepLabV3+ introduce multi-scale context through pyramid pooling and ASPP, which partly mitigates the uncertainty caused by scale differences and texture variability. On the Huitongshan dataset, they achieve mPA values of approximately 76% and 77%, respectively, indicating that contextual modeling is important for geological-element segmentation. Building on this, SegNeXt-HFCA further performs hierarchical feature fusion along the encoder–decoder path and employs coordinate attention to explicitly guide the segmentation of slender bands and lithologic contacts in the spatial domain, while a combination of Lovász loss and OHEM is used to better align optimization with the IoU metric and to enhance learning from hard examples. Under the same data and training settings, this strategy yields an mPA of about 79% and an mIoU of roughly 67%, representing a consistent improvement over the multi-scale baselines. At the class level, units with clear boundaries and homogeneous textures show relatively small differences across models, whereas in more ambiguous settings—such as narrow sedimentary bodies and contact zones between mafic and intermediate intrusive rocks—the cross-layer aggregation and spatial guidance of HFCA effectively reduce breaks and “sticking” along boundaries, leading to more coherent and continuous maps. Overall, under medium-resolution multispectral conditions, the combination of multi-scale context, structured spatial attention, and IoU-aligned losses appears to be critical for improving the accuracy of geological remote-sensing interpretation.

Figure 9 compares segmentation results of six representative Huitongshan scenes obtained with different methods. “Label” denotes the lithologic ground truth rasterized from the regional geological map and co-registered to the Sentinel-2 10 m grid; the Sentinel-2 false-color composites are shown for visual context only. Qualitatively, benefiting from the hierarchical multi-scale encoder with coordinate attention and the robust training strategy, SegNeXt-HFCA achieves superior stripe continuity, boundary adherence, and discrimination between spectrally similar lithologies compared with the competing models. In panels (1) and (2), the banded, strongly layered structural units are continuously tracked by SegNeXt-HFCA, with stripe width and strike closely matching the reference labels and with markedly fewer cross-stripe misclassifications, illustrating the ability of multi-scale fusion and coordinate attention to represent slender targets. In panel (3), where syenogranite and diorite exhibit limited spectral contrast, class-frequency re-weighting and OHEM suppress holes and speckle, yielding a more stable spatial extent. Panels (4) and (5) show mixed zones involving schist/gneiss, basalt, and Quaternary deposits; here, the combination of boundary-weighted losses and DenseCRF refinement produces more complete contacts and clearer transition zones, while preserving narrow lithologic bodies. Under the more complex structural configuration of panel (6), the separation between granitic units and schistose rocks is also more stable, and the overall topology is more consistent with geological expectations. Compared with U-NetFormer, PSPNet, DeepLabV3+, DANet, LGMSFNet, BiSeNetV2, and the SegNeXt-base backbone, SegNeXt-HFCA yields fewer misclassifications and less random noise, improving not only mIoU and related quantitative metrics but also visual coherence and practical usefulness.

5.5. Comparative Experiments and Analysis on the Xingxingxia Dataset

In this part, we repeat the experiments on the Xingxingxia study area, which comprises ten lithologic classes. Unlike Huitongshan, where geological elements are highly fragmented, Xingxingxia exhibits relatively continuous spatial patterns, comparable sample sizes across classes, and pronounced within-unit homogeneity. The segmentation results in terms of PA and IoU are reported in Table 3 and Table 4.

Among the ten lithologic classes, diorite and Quaternary deposits are the easiest to interpret, with class-wise PA values close to or exceeding 90%. Trachyandesitic tuff and dolomitic marble also achieve high IoU scores (>80%). By contrast, two-mica quartz schist and monzogranite, whose textures resemble those of the surrounding granitic units and whose boundaries are locally fragmented, are more difficult to delineate, with IoU values mostly in the 65–70% range. Dolomitic marble and trachyandesitic tuff are widely distributed within the area and exhibit relatively stable lithologic signatures, maintaining accuracies of about 90% PA and consistently high IoU. In contrast, migmatitic gneiss commonly occurs interleaved with adjacent granitic bodies and within strongly fractured structural zones, which increases confusion with neighboring lithologies and leads to slightly lower accuracies.

From the PA/mPA and IoU/mIoU statistics in Table 3 and Table 4, mainstream architectures deliver overall stable performance on the Xingxingxia dataset, yet pronounced gaps remain across both backbones and lithologic classes. U-NetFormer achieves mPA = 83.11% and mIoU = 70.53%. PSPNet and DeepLabV3+ provide comparable overall accuracies (mPA = 87.15–87.26%, mIoU = 73.28–73.38%), while DANet attains mPA = 83.23% and mIoU = 71.16%. LGMSFNet is slightly lower (mPA = 80.34%, mIoU = 68.20%). The lightweight BiSeNetV2 remains competitive (mPA = 86.05%, mIoU = 73.71%), indicating that compact designs can still be effective in this setting. Among the transformer-based baselines, Swin-UperNet yields mPA = 80.73% and mIoU = 68.58%, whereas SegFormer is clearly weaker (mPA = 75.41%, mIoU = 59.96%). SegNeXt-base obtains mPA = 84.46% and mIoU = 73.05%. In comparison, the proposed SegNeXt-HFCA achieves the strongest overall performance, reaching mPA = 87.40% (highest in Table 3) and mIoU = 75.69% (highest in Table 4), indicating more accurate and more consistent lithologic delineation.

At the class level, SegNeXt-HFCA shows particularly robust discrimination for lithologies with relatively coherent bodies and clearer contacts, including DI (PA = 93.89%, IoU = 71.56%), Q (PA = 93.11%, IoU = 77.81%), FMG (PA = 91.76%, IoU = 84.59%), and DCT (PA = 91.28%, IoU = 83.41%); DM also remains reliably mapped (PA = 89.15%, IoU = 86.36%). The more challenging units are mainly 2MQS, GDI, and CMG, with IoU values of 67.50%, 68.31%, and 64.09%, respectively, which is consistent with their strong spectral/textural similarity to neighboring granitoid units and their locally fragmented spatial occurrence. Overall, Table 3 and Table 4 jointly indicate that mPA is generally higher than mIoU, reflecting the stricter penalty of IoU for boundary displacement and small-object omission under class imbalance and inter-class resemblance. Within this setting, SegNeXt-HFCA provides the most balanced gains across classes and achieves the best overall mapping quality in Xingxingxia.

Across the seven representative examples in Figure 10, the proposed method yields more stable segmentations that better conform to the spatial distribution of geological units in the Xingxingxia dataset. In rows (1)–(3), lithologies such as monzogranite, two-mica quartz schist, and diorite occur as banded or sheet-like bodies. Compared with the mainstream models (U-NetFormer, PSPNet, DeepLabV3+, DANet, LGMSFNet, BiSeNetV2, and the SegNeXt backbone alone), SegNeXt-HFCA not only preserves the correct semantic classes but also better restores band continuity and width, substantially reducing cross-band “bleeding” and salt-and-pepper noise within bodies, while small patches and pinch-out terminations are more completely retained. In rows (4)–(6), several obliquely arranged lithologic units (e.g., coarse- and fine-grained monzogranite and Neogene–Quaternary cover) are mutually interdigitated, with highly intricate boundaries. Competing methods tend to produce over-smoothed boundaries, holes, and misclassifications, particularly between the coarse/fine monzogranite units and in the contact zones where migmatitic gneiss–schist adjoins carbonate rocks. In contrast, SegNeXt-HFCA achieves the highest boundary adherence: contact zones are continuous and smooth, and heterogeneous blocks remain more clearly separable and structurally intact. In the large-scale, high-contrast scene of row (7), SegNeXt-HFCA simultaneously preserves the overall geometry and fine contours of the main bodies and avoids confusing Quaternary deposits with granitoid units, resulting in the most consistent map overall. Taken together, these examples indicate that the proposed method outperforms the baselines in terms of object completeness, boundary consistency, and fine-grained class discrimination—especially for coarse/fine units and in granitoid–gabbro/diorite adjacency zones—thereby enhancing the practical usability and reliability of lithologic mapping.

5.6. Ablation Study

In this section, SegNeXt is taken as the baseline model, and a stepwise ablation study is conducted on the Huitongshan dataset to assess the contribution of each component in SegNeXt-HFCA. As shown in Table 5, the plain SegNeXt backbone achieves 75.68% mPA and 63.45% mIoU. Adding only coordinate attention (SegNeXt + CA, row 2) already raises the scores to 77.35% mPA and 65.10% mIoU (+1.67/+1.65 points), indicating that lightweight channel-and-spatial re-calibration is beneficial even without modifying the multi-scale fusion. When the full HMS-CA encoder is enabled (SegNeXt-HFCA, row 3), mPA and mIoU further increase to 77.69% and 65.37% (+0.34/+0.27 points over SegNeXt + CA), a small additional gain that is consistent with hierarchical multi-scale aggregation coupling texture–spectral cues and alleviating receptive-field mismatch for stripe-like units. Incorporating the robust hard-example training (RHT, row 4) yields a modest improvement to 77.75% mPA and 65.69% mIoU (+0.06/+0.32 points over row 3), mainly stabilizing optimization and enhancing the representation of rare or weak-contrast classes. When the BCRF + seamless inference pipeline is activated on top of HMS-CA while RHT is disabled (row 5), the performance rises to 78.63% mPA and 67.15% mIoU (+0.94/+1.78 points over row 3); this suggests that overlapping sliding-window inference with Hann blending and DenseCRF refinement is particularly effective at sharpening contacts and suppressing isolated misclassified pixels. Finally, combining all components (row 6) yields the best performance, 79.18% mPA and 67.29% mIoU, corresponding to cumulative gains of +3.50 mPA and +3.84 mIoU points over the SegNeXt baseline. These results indicate that encoder-level HFCA, training-level RHT, and inference-level BCRF + seamless inference provide complementary rather than redundant benefits.
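
The gains quoted above can be re-derived from the Table 5 scores reported in the text. The short sketch below does only this arithmetic; the row labels are our shorthand for the configurations described in the paragraph, not the table's exact wording.

```python
# (configuration, mPA %, mIoU %) as reported in the text for Table 5.
ABLATION = [
    ("SegNeXt", 75.68, 63.45),
    ("+ CA", 77.35, 65.10),
    ("+ HMS-CA", 77.69, 65.37),
    ("+ HMS-CA + RHT", 77.75, 65.69),
    ("+ HMS-CA + BCRF/seamless", 78.63, 67.15),
    ("full", 79.18, 67.29),
]

def gains_over_baseline(rows):
    """Percentage-point gains (mPA, mIoU) of each row over the first row."""
    _, base_mpa, base_miou = rows[0]
    return {
        name: (round(mpa - base_mpa, 2), round(miou - base_miou, 2))
        for name, mpa, miou in rows
    }

gains = gains_over_baseline(ABLATION)
```

Reading the dictionary row by row shows that the inference-level step (BCRF + seamless) contributes the largest single mIoU increment after coordinate attention.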

Module-wise mechanism analysis: HMS primarily improves the recognition of thin, anisotropic lithologic belts by alleviating receptive-field mismatch across scales and preserving elongated topology after multi-scale alignment and fusion; this explains why enabling HMS-CA yields additional gains over CA-only and reduces broken/fragmented stripe-like bodies in the qualitative examples (Figure 9 and Figure 10). The coordinate-aware boundary re-calibration (CA) contributes most to weak-texture or low-contrast regions by injecting boundary/positional cues and suppressing ambiguous responses around contacts, which is consistent with the reduced inter-class confusion observed in Figure 8. RHT + class-frequency priors mainly target rare classes under long-tailed distributions: by prioritizing reliable hard pixels and re-balancing gradients, it prevents majority classes from dominating the optimization, leading to consistent (though modest) improvements and better stability for minority/weak-contrast lithologies. Finally, DenseCRF refinement operates at inference time to enforce local label consistency under spectral similarity while preserving sharp edges, which directly explains the reduction in salt-and-pepper artefacts and the sharpening of ambiguous contacts reported for the BCRF + seamless setting.

Loss strategy comparison: To ensure a fair comparison, we fix the network architecture and all training settings for all strategies, including the optimizer and learning-rate schedule, data augmentation, classifier bias initialization using class priors, class reweighting schedule, EMA, TTA, and the same fine-tuning protocol. The only change is the loss formulation for handling class imbalance: (1) weighted cross-entropy (WCE) + Lovász, (2) class-weighted focal loss (γ = 2) + Lovász, and (3) OHEM cross-entropy + Lovász. We report mIoU and mPA on the validation set.

Effect of loss strategies: As shown in Table 6, our OHEM-CE + Lovász configuration achieves the best performance among the compared loss strategies, reaching 67.29% mIoU and 79.18% mPA. It improves upon WCE + Lovász by +0.24 mIoU and +0.07 mPA points, and upon Focal (γ = 2) + Lovász by +0.73 mIoU and +0.50 mPA points. These results suggest that, under the same training settings in Huitongshan, hard-pixel mining can provide a modest but measurable benefit.

As shown in Table 7, the proposed 10-channel stack consistently outperforms the full-band MSI baseline on both study areas. For Huitongshan, the 10-channel input achieves 67.29% mIoU and 79.18% mPA, exceeding the 12-band MSI baseline by +6.40 mIoU and +4.75 mPA. For Xingxingxia, the 10-channel input yields 75.69% mIoU and 87.40% mPA, improving upon the full-band baseline by +4.23 mIoU and +4.29 mPA. These results indicate that the PCA/MNF-based compressed representation together with DEM-derived constraints retains (and enhances) the most discriminative information for lithologic segmentation.

6. Discussion

6.1. Overall Performance in the Two Study Areas

Table 1, Table 2, Table 3 and Table 4 show that SegNeXt-HFCA achieves the best or near-best scores in both study areas under a unified training protocol. On the more fragmented Huitongshan dataset, our method attains an mPA of about 79% and an mIoU of about 67%, outperforming the SegNeXt-base backbone by roughly 3.5 percentage points in mPA and 3.8 points in mIoU and exceeding other strong baselines such as U-NetFormer, PSPNet, and DeepLabV3+. On the structurally more continuous Xingxingxia dataset, SegNeXt-HFCA still delivers the highest mPA (87.40%) and mIoU (75.69%), with an mIoU gain of about 2.6 points over SegNeXt-base (73.05%) and about 2.0 points over the best competing model, BiSeNetV2 (73.71%).

These improvements are not limited to a single class but are reflected in most lithologic units. In Huitongshan, the model maintains high PA and IoU for well-behaved classes such as two-mica quartz schist and quartz diorite, while notably improving the performance of spectrally ambiguous or thin-stripe classes such as syenogranite, basalt, and Quaternary deposits. In Xingxingxia, SegNeXt-HFCA achieves PA values above 90% for several key lithologies (e.g., diorite and coarse- and fine-grained monzogranite), and clearly improves IoU for dolomitic marble and dacitic tuff, which are often confused with surrounding granitoids by other networks.

In addition, we performed a controlled comparison between the proposed 10-channel input (PCA + MNF + DEM) and a full MSI-band baseline (12-band MSI), while keeping the network architecture and the entire training/inference protocol identical (Table 7). The 10-channel configuration yields consistently higher performance across both study areas (Huitongshan: 67.29% vs. 60.89% mIoU, 79.18% vs. 74.43% mPA; Xingxingxia: 75.69% vs. 71.46% mIoU, 87.40% vs. 83.11% mPA), indicating that the compact PCA/MNF representation combined with DEM-derived topographic cues is more effective than directly using all MSI bands. This improvement is likely due to two complementary effects: (i) PCA/MNF alleviates inter-band redundancy and enhances signal-to-noise characteristics, producing a more compact and better-separated feature space; and (ii) DEM derivatives provide a complementary geomorphological context that supports the delineation of lithologic contacts. By contrast, feeding all MSI bands increases input dimensionality and may introduce redundant or weakly informative channels, which can reduce sample efficiency and generalization under limited labeled data.

6.2. Effect of the HFCA Design

The ablation study in Table 5 clarifies how each component of SegNeXt-HFCA contributes to the final performance. Starting from the SegNeXt backbone, adding the hierarchical multi-scale encoder with coordinate attention already increases mPA and mIoU by roughly 2 percentage points (+2.01 mPA and +1.92 mIoU). This indicates that the proposed HMS-CA encoder aggregates multi-scale contextual information while preserving high-resolution details, which is crucial for tracking narrow lithologic ribbons and subtle facies transitions.

Subsequently introducing the robust hard-example training (RHT) and the BCRF + seamless inference pipeline yields further, more moderate gains. RHT improves optimization stability and enhances the learning of rare or weak-contrast classes, while the overlapping sliding-window inference and DenseCRF refinement mainly sharpen contacts and suppress isolated misclassified pixels. Overall, the full configuration of SegNeXt-HFCA achieves the best trade-off between accuracy and complexity, with cumulative gains of about +3.5 mPA and +3.8 mIoU over the plain SegNeXt baseline on the Huitongshan dataset.

6.3. Class-Wise Behavior and Geological Implications

The confusion matrices in Figure 8 and the qualitative comparisons in Figure 9 and Figure 10 provide further insight into how SegNeXt-HFCA behaves for different geological elements. In Huitongshan (Figure 8), our method substantially reduces confusion among five of the nine classes compared with PSPNet and SegNeXt-base, particularly for syenogranite, basalt, and gabbro, whose spectral signatures partly overlap. The number of pixels misclassified between syenogranite and quartz diorite, or between basalt and Quaternary deposits, is markedly reduced, which is consistent with the higher IoU reported for these classes in Table 1 and Table 2.

From a geological perspective, these improvements translate into more continuous and geologically plausible lithologic units. In Figure 9 and Figure 10, SegNeXt-HFCA preserves the continuity of thin, stripe-like schist and basalt bodies, maintains the correct thickness of sedimentary and volcanic layers, and yields smoother but still sharp contacts between granitoids and surrounding metamorphic rocks. Other methods tend to either over-smooth boundaries—merging adjacent units—or produce “salt-and-pepper” noise and broken stripes, which complicate geological interpretation and downstream mapping. The proposed framework therefore better reflects the expected structural patterns and contact relationships in both study areas. In particular, the improved continuity of thin stripe-like bodies is mainly attributable to HMS-based cross-scale aggregation, whereas the stabilized predictions in weak-texture/low-contrast areas are more closely related to the coordinate-aware boundary re-calibration. The suppression of isolated noisy pixels and the sharpening of ambiguous contacts are further enhanced by the DenseCRF refinement used in our inference pipeline.

At the same time, the results highlight some remaining challenges. First, spectrally very similar granitoid sub-types (e.g., different monzogranites or the syenogranite–monzonite pair) still exhibit moderate confusion, especially in weathered or shadowed areas. Second, in zones of intense deformation and strong topographic relief, small slivers and lens-shaped bodies may still be partially missed or slightly displaced, indicating that even richer structural priors could be beneficial.

6.4. Geological Contrast Between Huitongshan and Xingxingxia and Its Expression in Remote-Sensing Mapping

The two study areas differ fundamentally in lithologic architecture and surface expression, and this difference directly controls the difficulty of remote-sensing lithologic mapping. Huitongshan is characterized by a highly fragmented mosaic of lithologies with abundant thin, stripe-like bodies, complex contact networks, and frequent small-scale alternations between intrusive and metamorphic units. In such settings, lithologic boundaries are short, tortuous, and locally discontinuous, and spectral responses are easily mixed within a 10 m pixel due to sub-pixel heterogeneity, weathering, and topographic shading. These properties explain why the overall accuracy in Huitongshan is lower and why confusion is concentrated among spectrally or texturally similar units (e.g., granitoids vs. intermediate intrusives; volcanic/mafic units vs. Quaternary cover), as reflected by the confusion patterns and the visually evident boundary complexity in Figure 9 and Figure 10.

In contrast, Xingxingxia is dominated by broad, laterally continuous lithologic domains (e.g., large granitoid plutons and comparatively smoother contacts) and exhibits stronger regional-scale spatial coherence. Such geological organization yields larger homogeneous patches, more persistent textural trends, and fewer abrupt alternations at the pixel scale. Consequently, the mapping task is less affected by sub-pixel mixing, and model outputs display higher spatial stability and class separability, consistent with the systematically higher mIoU/mPA reported for Xingxingxia.

Importantly, these geological contrasts are clearly expressed in our remote-sensing results. In Huitongshan, the key challenge is preserving the continuity of narrow lithologic ribbons and maintaining contact fidelity under fragmentation; the proposed SegNeXt-HFCA reduces stripe breakage and cross-stripe leakage, producing contacts that better respect the reference geological boundaries in the qualitative comparisons. In Xingxingxia, the dominant challenge shifts toward suppressing over-smoothing across gentle contacts and avoiding large-area drift within extensive plutonic domains; here, SegNeXt-HFCA improves within-unit consistency while keeping contacts sharp, leading to cleaner domain partitioning and fewer spurious islands. This geology-aware comparison indicates that the framework does not merely optimize pixel-wise metrics but also improves map products in ways aligned with geological expectations: (i) continuity of stratiform/lineated units in fragmented terrains and (ii) stable domain delineation in regionally coherent plutonic terrains.

6.5. Robustness Across Contrasting Geological Settings

An important observation is that SegNeXt-HFCA generalizes well across the two study areas, despite their different geological styles. Huitongshan features highly fragmented units and multiple narrow intrusive bodies, whereas Xingxingxia is dominated by broad granitoid plutons and gently varying contacts. The consistent mIoU improvements on both datasets suggest that the combination of HMS-CA encoding, class-frequency-aware loss design, and robust training strategy captures transferable patterns rather than over-fitting to a specific structural style. Moreover, the model maintains competitive performance across both majority and minority classes, which is non-trivial given the pronounced class imbalance seen in the pixel-count histograms (Figure 5). This supports the effectiveness of the class-frequency prior, OHEM-based sampling, and entropy-guided tiling in enhancing the representation of rare lithologies without sacrificing performance on dominant units.
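The exact form of the class-frequency prior is not restated in this section. Purely to illustrate the general idea of turning pixel-count histograms into per-class weights, the sketch below uses median-frequency balancing, a common scheme that is swapped in here as an assumption and may differ from the prior used in the paper. The pixel counts are invented.

```python
import numpy as np

def median_frequency_weights(pixel_counts):
    """Per-class weights: median class frequency divided by each class frequency."""
    counts = np.asarray(pixel_counts, dtype=np.float64)
    freq = counts / counts.sum()
    return np.median(freq) / freq

# Hypothetical pixel counts for a dominant, an intermediate, and a rare unit.
counts = [800_000, 150_000, 50_000]
weights = median_frequency_weights(counts)
print(weights)
```

Under this scheme the rare unit receives a weight above one and the dominant unit a weight below one, which is the qualitative behavior a class-frequency prior is meant to provide.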

6.6. Limitations and Future Work

Despite the encouraging results, several limitations remain. First, the experiments are restricted to Sentinel-2 MSI and two regions in northwestern China; extending the evaluation to additional sensors (e.g., Landsat-8/9 and hyperspectral data) and tectonic settings would provide a more comprehensive assessment of robustness. Second, although the proposed framework is relatively lightweight compared to very deep transformer-based models, the two-phase training and TTA + DenseCRF inference introduce extra computational cost, which may be non-negligible for continental-scale mapping.
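A back-of-the-envelope count shows why sliding-window inference combined with TTA becomes costly at scale. The sketch below counts network forward passes for one scene; the tile size, stride, and number of TTA views are assumed values chosen for illustration, not the settings reported in the paper. The scene size of 10,980 × 10,980 pixels corresponds to one Sentinel-2 granule at 10 m resolution.

```python
import math

def tiles_per_axis(size, tile, stride):
    """Window positions needed to cover one axis, with the last window clamped to the edge."""
    if size <= tile:
        return 1
    return math.ceil((size - tile) / stride) + 1

def forward_passes(height, width, tile, stride, tta_views):
    """Total forward passes: windows along both axes times the number of TTA views."""
    rows = tiles_per_axis(height, tile, stride)
    cols = tiles_per_axis(width, tile, stride)
    return rows * cols * tta_views

# Assumed settings: 512-pixel tiles, 50% overlap, four TTA views.
n_passes = forward_passes(10980, 10980, tile=512, stride=256, tta_views=4)
print(n_passes)
```

Because the count scales with the product of the tile grid and the number of TTA views, halving the overlap or dropping TTA views reduces the cost roughly proportionally, before the DenseCRF step is even considered.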

Future work will focus on incorporating more explicit geological priors—such as structural lineaments, stratigraphic rules, or multi-task learning with fault/lineament detection—into the network and exploring more efficient approximate inference schemes that retain most of the boundary-refinement benefits of DenseCRF at lower cost. In addition, integrating semi-supervised or domain-adaptation strategies could further improve performance in poorly labeled regions and enhance the applicability of SegNeXt-HFCA to truly large-scale lithologic mapping campaigns.

7. Conclusions

This study addresses two persistent challenges in geological remote-sensing interpretation (GRSI) from Sentinel-2 imagery: (i) high intra-class variability and strong inter-class similarity among lithologic units, and (ii) complex, fragmented contacts under pronounced class imbalance, where narrow bands and small bodies are easily missed and boundary errors can dominate IoU. To tackle these issues in a unified framework, we developed SegNeXt-HFCA by augmenting the SegNeXt backbone with hierarchical multi-scale fusion and coordinate attention, enabling stronger cross-scale interaction while improving spatial localization of lithologic patterns. In parallel, we designed a class-balanced hybrid objective that integrates OHEM cross-entropy, Lovász-Softmax, and a class-frequency prior so that optimization is less biased toward dominant units and more consistent with the IoU-oriented evaluation target. A two-stage training protocol (including OHEM scheduling, EMA, AMP, and TTA) is further adopted to stabilize convergence and improve generalization on large-area scenes, and a lightweight DenseCRF post-refinement is used to enhance spatial coherence along geological contacts without altering the backbone structure.
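To make the role of hard-example mining in the hybrid objective more concrete, the sketch below implements one common formulation of class-weighted OHEM cross-entropy in NumPy. It covers only the cross-entropy term; the Lovász-Softmax term and the boundary weighting are omitted. The threshold semantics (a pixel counts as hard when the predicted probability of its true class falls below `thresh`), the `min_kept` fallback, and all names are assumptions and may differ from the authors' implementation and from their parameter τ.

```python
import numpy as np

def ohem_cross_entropy(probs, labels, class_weights, thresh=0.3, min_kept=2):
    """Class-weighted cross-entropy averaged over hard pixels only.

    probs: (N, C) softmax outputs; labels: (N,) integer class indices.
    At least `min_kept` of the least-confident pixels are always retained.
    """
    n = labels.shape[0]
    p_true = probs[np.arange(n), labels]   # probability assigned to the true class
    order = np.argsort(p_true)             # least confident pixels first
    n_hard = int((p_true < thresh).sum())
    k = min(max(min_kept, n_hard), n)
    keep = order[:k]
    loss = -np.log(np.clip(p_true[keep], 1e-12, 1.0))
    w = class_weights[labels[keep]]
    return float((w * loss).sum() / w.sum()), keep

# Four toy pixels, two classes; the rarer class 1 gets twice the weight.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.25, 0.75]])
labels = np.array([0, 0, 1, 1])
class_weights = np.array([1.0, 2.0])
loss, kept = ohem_cross_entropy(probs, labels, class_weights)
print(loss, kept)
```

In this toy case only one pixel falls below the threshold, so the `min_kept` fallback adds the next least-confident pixel; easy pixels contribute nothing to the loss, which is what shifts the optimization toward ambiguous contacts and rare units.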

Extensive experiments on two large-scale datasets (Huitongshan and Xingxingxia) demonstrate that the proposed framework yields consistent gains over a strong SegNeXt baseline, improving mIoU by approximately 3.8 percentage points for Huitongshan and 2.6 percentage points for Xingxingxia, while delivering more balanced class-wise behavior under class imbalance and strong lithologic similarity. Beyond numerical improvements, the predicted maps show better continuity of banded/sheet-like units and more coherent contact geometry, which is critical for practical lithologic mapping and subsequent geological analysis. These results support the view that lithologic units can be treated as regionalized variables whose spatial continuity and contact relationships should be explicitly respected during learning and inference. Future work will focus on tighter coupling of geostatistical priors with deep networks to further improve interpretability and boundary reliability, and on developing lightweight task-specific variants for operational, large-area lithologic mapping.
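The reported gains are absolute differences in mIoU, which should not be confused with relative improvements. The short calculation below uses the SegNeXt-base and SegNeXt-HFCA mIoU values from Table 2 and Table 4 to show both quantities side by side; the variable names are illustrative.

```python
# mIoU (%) copied from Table 2 (Huitongshan) and Table 4 (Xingxingxia).
baseline = {"Huitongshan": 63.45, "Xingxingxia": 73.05}  # SegNeXt-base
proposed = {"Huitongshan": 67.29, "Xingxingxia": 75.69}  # SegNeXt-HFCA

gains = {}
for area in baseline:
    points = proposed[area] - baseline[area]      # absolute gain in percentage points
    relative = 100.0 * points / baseline[area]    # gain as a percentage of the baseline
    gains[area] = (round(points, 2), round(relative, 2))
    print(f"{area}: +{points:.2f} points ({relative:.2f}% relative)")
```

The absolute gains of 3.84 and 2.64 points match the rounded figures quoted above, while the corresponding relative improvements are larger because the baseline values are well below 100%.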

Author Contributions

Conceptualization, F.X.; Methodology, F.X.; Validation, F.X.; Resources, F.X. and Y.Y.; Data curation, F.X.; Writing—original draft, F.X.; Writing—review & editing, Y.Y.; Supervision, Y.Y.; Funding acquisition, F.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Sentinel-2 MSI imagery used in this study is publicly available from the Copernicus Open Access Hub. The derived 10-band feature stack and lithologic annotation data are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the China University of Mining and Technology for general support. The authors also thank the anonymous reviewers for their constructive comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Regional tectonic framework of the Beishan orogen (NW China) and locations of the Huitongshan and Xingxingxia study areas. Major tectonic units are shown; the Huitongshan and Xingxingxia study areas are outlined by blue and red rectangles, respectively.

Figure 2. Sentinel-2 false-color composite and lithologic map of the Huitongshan study area. (a) False-color Sentinel-2 MSI image highlighting lithologic and structural patterns. (b) Simplified lithologic map interpreted from regional geological mapping.

Figure 3. Sentinel-2 false-color composite and lithologic map of the Xingxingxia study area. (a) False-color Sentinel-2 MSI image highlighting lithologic and structural patterns. (b) Simplified lithologic map interpreted from regional geological mapping.

Figure 4. Base images (grayscale) of the 10-channel input stack used in this study: (a–d) PC1–PC4, (e–g) MNF1–MNF3, (h) elevation, (i) slope, and (j) TPI. All layers are co-registered to the same grid; no-data regions are shown in white.

Figure 5. Class-wise pixel-count histograms of lithologic units in the (a) Huitongshan and (b) Xingxingxia study areas. Abbreviations: (a) SyG, syenogranite; 2MQS, two-mica quartz schist; DI, diorite; Q, Quaternary deposits; BA, basalt; QDI, quartz diorite; HMS, hornblende–mica schist; GB, gabbro; MZ, monzonite. (b) SyG, syenogranite; 2MQS, two-mica quartz schist; DI, diorite; Q, Quaternary deposits; MGN, migmatitic gneiss; DM, dolomitic marble; CMG, coarse-grained monzogranite; FMG, fine-grained monzogranite; GDI, granodiorite; DCT, dacitic tuff.

Figure 6. Architecture of SegNeXt-HFCA. Arrows indicate the information flow. Solid arrows denote feature propagation between modules, and dashed boxes indicate components used only during training or only during inference.

Figure 7. Sensitivity of SegNeXt-HFCA performance to the loss-threshold parameter τ: (a) mean pixel accuracy (mPA); (b) mean intersection-over-union (mIoU). The vertical dashed line marks the selected threshold (τ = 0.3) used in all experiments.

Figure 8. Normalized confusion matrices for the Huitongshan study area: (a) PSPNet; (b) SegNeXt baseline; (c) proposed SegNeXt-HFCA.

Figure 9. Example predictions of different methods on the Huitongshan dataset. Red boxes highlight representative areas with challenging boundaries/contacts for visual comparison.

Figure 10. Example predictions of different methods on the Xingxingxia dataset. Red boxes highlight representative areas with challenging boundaries/contacts for visual comparison.

Table 1. PA and mPA (%) of different methods in the Huitongshan study area *.

Method Name	SyG	2MQS	DI	Q	BA	QDI	HMS	GB	MZ	mPA
U-NetFormer	70.97	96.07	84.86	57.26	71.70	88.16	85.17	68.42	78.00	77.84
PSPNet	67.43	94.61	80.13	52.26	78.69	83.28	85.16	66.02	80.10	76.41
DeepLabV3+	68.37	93.13	78.63	52.94	75.15	86.20	87.57	68.13	89.73	77.76
DANet	61.79	93.59	84.43	52.46	74.64	85.61	88.21	65.70	88.32	77.19
LGMSFNet	59.15	83.02	81.52	56.94	69.33	87.46	87.80	58.13	88.26	74.62
BiSeNetV2	69.03	90.35	73.00	50.02	69.69	85.96	83.62	69.36	92.41	75.94
SegFormer	45.35	84.14	67.96	55.17	63.47	76.18	79.79	37.67	83.99	65.97
Swin-UperNet	56.61	88.87	81.53	52.23	60.54	87.68	89.16	65.38	79.58	73.51
SegNeXt-base	69.67	90.52	78.57	35.90	68.30	91.65	90.03	71.45	85.03	75.68
SegNeXt-HFCA (ours)	66.19	90.95	80.61	54.57	78.20	91.10	88.63	72.39	90.01	79.18

* Syenogranite (SyG), two-mica quartz schist (2MQS), diorite (DI), Quaternary deposits (Q), basalt (BA), quartz diorite (QDI), hornblende–mica schist (HMS), gabbro (GB), and monzonite (MZ). Bold values indicate the best performance in each column.

Table 2. IoU and mIoU (%) of different methods in the Huitongshan study area *.

Method Name	SyG	2MQS	DI	Q	BA	QDI	HMS	GB	MZ	mIoU
U-NetFormer	52.97	86.19	67.51	35.43	58.11	83.12	76.09	57.95	69.43	65.20
PSPNet	55.90	83.64	64.37	26.29	59.87	75.94	75.28	56.56	71.20	63.23
DeepLabV3+	58.64	80.27	65.18	31.95	61.33	76.87	77.85	59.60	77.31	65.44
DANet	52.90	87.22	70.10	29.60	59.36	77.30	80.88	54.38	75.20	65.21
LGMSFNet	51.84	77.95	62.52	29.31	54.80	78.13	77.53	53.67	76.47	62.47
BiSeNetV2	58.08	83.62	61.00	29.73	54.70	71.76	74.89	57.68	77.56	63.23
SegFormer	33.84	76.00	51.37	27.76	46.47	64.44	68.97	32.96	63.53	51.71
Swin-UperNet	43.42	81.48	63.56	28.64	54.52	76.71	77.74	55.32	66.70	60.90
SegNeXt-base	52.96	83.29	61.37	25.48	58.58	78.32	80.48	55.33	75.28	63.45
SegNeXt-HFCA (ours)	58.07	87.64	67.77	34.32	59.58	82.97	80.85	58.02	76.42	67.29

* Bold values indicate the best performance in each column.

Table 3. PA and mPA (%) of different methods in the Xingxingxia study area *.

Method Name	SyG	2MQS	DI	Q	MGN	DM	CMG	FMG	GDI	DCT	mPA
U-NetFormer	82.47	85.69	88.76	86.22	81.23	92.96	60.17	80.19	86.29	87.08	83.11
PSPNet	77.66	84.58	93.31	86.63	83.84	90.75	83.00	91.10	86.52	94.12	87.15
DeepLabV3+	83.26	87.77	83.41	89.03	83.44	87.50	89.51	84.64	90.80	93.21	87.26
DANet	87.57	76.17	74.78	84.40	85.94	91.83	67.03	88.05	88.40	88.17	83.23
LGMSFNet	83.44	78.85	66.28	89.09	85.82	90.97	66.63	81.78	76.87	83.67	80.34
BiSeNetV2	87.15	81.86	84.56	87.86	87.35	91.84	80.47	88.42	86.73	84.25	86.05
SegFormer	75.17	69.47	79.19	69.62	77.89	83.65	72.81	84.04	72.09	70.22	75.41
Swin-UperNet	83.23	78.00	74.43	88.84	85.13	90.06	63.57	88.89	74.32	80.45	80.73
SegNeXt-base	89.94	62.74	93.01	88.49	85.16	89.76	79.41	77.36	85.88	92.82	84.46
SegNeXt-HFCA (ours)	84.49	80.29	93.89	93.11	83.15	89.15	84.91	91.76	81.99	91.28	87.40

* Syenogranite (SyG), two-mica quartz schist (2MQS), diorite (DI), Quaternary deposits (Q), migmatitic gneiss (MGN), dolomitic marble (DM), coarse-grained monzogranite (CMG), fine-grained monzogranite (FMG), granodiorite (GDI), and dacitic tuff (DCT). Bold values indicate the best performance in each column.

Table 4. IoU and mIoU (%) of different methods in the Xingxingxia study area *.

Method Name	SyG	2MQS	DI	Q	MGN	DM	CMG	FMG	GDI	DCT	mIoU
U-NetFormer	73.32	62.37	67.70	78.20	70.93	84.35	49.34	76.26	59.34	83.47	70.53
PSPNet	69.81	68.79	66.42	77.45	75.44	83.09	63.35	79.41	62.00	88.03	73.38
DeepLabV3+	75.67	64.90	57.18	80.23	75.68	83.58	65.97	76.25	65.59	87.73	73.28
DANet	77.71	63.94	58.02	76.94	76.66	83.92	57.86	74.05	62.08	80.38	71.16
LGMSFNet	76.15	63.52	49.65	76.80	75.77	81.92	53.32	69.51	53.46	81.86	68.20
BiSeNetV2	75.92	67.94	65.27	76.07	75.68	80.88	71.83	80.41	64.07	79.10	73.71
SegFormer	54.68	48.92	59.27	58.80	65.21	73.06	52.18	69.51	54.53	63.48	59.96
Swin-UperNet	71.64	60.49	70.24	76.60	72.85	79.31	56.85	79.44	63.71	75.91	68.58
SegNeXt-base	77.26	56.68	74.83	77.52	77.41	84.11	61.49	74.40	59.44	87.34	73.05
SegNeXt-HFCA (ours)	76.21	67.50	71.56	77.81	77.02	86.36	64.09	84.59	68.31	83.41	75.69

* Bold values indicate the best performance in each column.

Table 5. Ablation study of SegNeXt-HFCA in the Huitongshan study area *.

Method	HMS	CA	RHT	BCRF + Seamless	mPA (%)	mIoU (%)
SegNeXt					75.68	63.45
SegNeXt + CA		√			77.35	65.10
SegNeXt-HFCA	√	√			77.69	65.37
SegNeXt-HFCA	√	√	√		77.75	65.69
SegNeXt-HFCA	√	√		√	78.63	67.15
SegNeXt-HFCA	√	√	√	√	79.18	67.29

* √ indicates that the corresponding module is enabled in the ablation setting; blank cells indicate it is disabled.
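The incremental contribution of each component can be read off Table 5 by differencing rows that differ in a single module. The sketch below does this for mIoU; the dictionary layout and the step labels are illustrative choices, while the numbers are copied from the table.

```python
# (HMS, CA, RHT, BCRF + Seamless) -> mIoU (%), copied from Table 5.
ablation = {
    (0, 0, 0, 0): 63.45,
    (0, 1, 0, 0): 65.10,
    (1, 1, 0, 0): 65.37,
    (1, 1, 1, 0): 65.69,
    (1, 1, 0, 1): 67.15,
    (1, 1, 1, 1): 67.29,
}

def delta(with_cfg, without_cfg):
    """mIoU difference (percentage points) between two ablation settings."""
    return round(ablation[with_cfg] - ablation[without_cfg], 2)

steps = {
    "CA added to SegNeXt": delta((0, 1, 0, 0), (0, 0, 0, 0)),
    "HMS added to SegNeXt + CA": delta((1, 1, 0, 0), (0, 1, 0, 0)),
    "RHT added without BCRF + Seamless": delta((1, 1, 1, 0), (1, 1, 0, 0)),
    "BCRF + Seamless added without RHT": delta((1, 1, 0, 1), (1, 1, 0, 0)),
    "RHT added with BCRF + Seamless": delta((1, 1, 1, 1), (1, 1, 0, 1)),
}
for name, gain in steps.items():
    print(f"{name}: {gain:+.2f}")
```

Read this way, coordinate attention and the BCRF + Seamless inference step account for the largest single-step mIoU gains in Table 5, with the remaining components adding smaller increments.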

Table 6. Comparison of loss strategies under the same training settings on the Huitongshan dataset *.

Strategy	mIoU (%)	mPA (%)
WCE + Lovász	67.05	79.11
Focal (γ = 2) + Lovász	66.56	78.68
Ours (OHEM-CE + Lovász)	67.29	79.18

* Bold values indicate the best performance in each column.

Table 7. Comparison of input feature stacks (10-channel vs. full MSI bands) in the two study areas *.

Study Area	Input Feature Stack	mIoU (%)	mPA (%)
Huitongshan	10-ch(PCA + MNF + DEM)	67.29	79.18
Huitongshan	12-band MSI (full-band baseline)	60.89	74.43
Xingxingxia	10-ch(PCA + MNF + DEM)	75.69	87.40
Xingxingxia	12-band MSI (full-band baseline)	71.46	83.11

* Bold values indicate the best performance in each column.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
