This section presents a comprehensive evaluation of the proposed method. We first describe the experimental setup and evaluation protocol, including dataset partitioning, training/inference settings, and primary metrics—overall pixel accuracy (PA) and mean intersection-over-union (mIoU). We then compare the model, under identical configurations, with representative semantic-segmentation baselines to verify its effectiveness. Finally, we conduct ablation studies and report model complexity and inference speed to quantify the accuracy–efficiency trade-off.
5.1. Experimental Settings
In both study areas (Huitongshan and Xingxingxia), the co-registered multispectral images and labels were tiled into non-overlapping 128 × 128 patches. For each area, we formed an independent dataset and randomly split it 80/20 into training and test sets. To ensure a fair and informative evaluation across different design paradigms, we benchmark against conventional CNN-based segmentation models (PSPNet [38], DeepLabV3+ [39], and DANet [40]), a lightweight real-time CNN (BiSeNetV2 [41]), hybrid architectures (U-NetFormer [42] and LGMSFNet [43]), transformer-based baselines (SegFormer and Swin-UperNet), and the backbone-matched SegNeXt-base model [18]. All baseline models were retrained under a unified protocol: identical data splits and image augmentations, the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴, a batch size of 8, and 15k iterations, using linear warm-up followed by cosine decay; other hyper-parameters were kept at their default settings. For SegNeXt-HFCA, we adopt the two-phase schedule described in Section 4.3, with 15k iterations in Phase I and a further 4k iterations in Phase II, an initial learning rate of 2 × 10⁻⁴ (with the same warm-up/cosine policy and a weight decay of 5 × 10⁻³), and with the robust hard-example training scheme and EMA model averaging enabled. Apart from these optimization-related differences, the data splits and augmentations are identical across all methods. At test time, we report pixel accuracy (PA), mean pixel accuracy (mPA), and mean intersection-over-union (mIoU).
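As an illustration of the warm-up/cosine policy used for the baselines, the following minimal sketch computes the learning rate at a given iteration. The base learning rate and the 15k-iteration budget follow the protocol above; the warm-up length (`warmup_steps`) and the final learning rate (`min_lr`) are not stated in the text and are hypothetical values chosen only for the example.

```python
import math


def lr_at(step, total_steps=15_000, base_lr=1e-4, warmup_steps=1_500, min_lr=0.0):
    """Learning rate at iteration `step`: linear warm-up, then cosine decay.

    warmup_steps and min_lr are illustrative assumptions, not values from the paper.
    """
    if step < warmup_steps:
        # Linear ramp that reaches base_lr at the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 750, 1_500, 8_250, 15_000):
        print(s, f"{lr_at(s):.3e}")
```

In a PyTorch training loop, the same shape could be obtained by passing such a function (divided by the base learning rate) to a lambda-based scheduler; the standalone version is shown here only to make the schedule explicit.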
All experiments were implemented in Python 3.10.18 using PyTorch (2.9.0.dev20250703+cu128) with CUDA 12.8. Data preparation and cartographic post-processing were performed in ArcGIS 10.8.2 (Esri, Redlands, CA, USA). Model training and inference were conducted on a workstation equipped with an NVIDIA GeForce RTX 5060 Ti GPU (NVIDIA, Santa Clara, CA, USA; 16 GB VRAM).
5.2. Evaluation Metrics
To quantitatively evaluate segmentation performance, we report pixel accuracy (PA) and intersection-over-union (IoU), together with their class-averaged counterparts (mPA and mIoU). Let $C$ be the number of lithologic classes, and let $n_{ij}$ denote the number of pixels whose ground-truth label is class $i$ but are predicted as class $j$ (i.e., entries of the confusion matrix). Pixels with the ignore label are excluded from all computations.
Pixel accuracy (PA): The overall pixel accuracy is defined as
$$\mathrm{PA} = \frac{\sum_{i=1}^{C} n_{ii}}{\sum_{i=1}^{C}\sum_{j=1}^{C} n_{ij}}.$$
The per-class pixel accuracy (i.e., class-wise recall) for class $i$ is
$$\mathrm{PA}_i = \frac{n_{ii}}{\sum_{j=1}^{C} n_{ij}},$$
and the mean pixel accuracy is
$$\mathrm{mPA} = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}} \mathrm{PA}_i,$$
where $\mathcal{S}$ denotes the set of classes present in the evaluation set.
Intersection-over-union (IoU): For class $i$, IoU is defined as
$$\mathrm{IoU}_i = \frac{n_{ii}}{\sum_{j=1}^{C} n_{ij} + \sum_{j=1}^{C} n_{ji} - n_{ii}},$$
where the denominator corresponds to the union of the predicted and ground-truth regions for class $i$. The mean IoU is computed by
$$\mathrm{mIoU} = \frac{1}{|\mathcal{S}'|}\sum_{i \in \mathcal{S}'} \mathrm{IoU}_i,$$
with $\mathcal{S}'$ indicating classes whose union is non-empty.
In practice, PA reflects the overall correctness dominated by frequent classes, whereas mPA reduces the influence of class imbalance by averaging class-wise recall. IoU/mIoU additionally penalizes both over- and under-segmentation and is more sensitive to boundary errors and fragmented predictions.
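The metric definitions above can be made concrete with a small dependency-free sketch that derives PA, mPA, and mIoU from a confusion matrix. The function name and the toy matrix are hypothetical; ignore-label pixels are assumed to have been removed before the matrix is accumulated.

```python
def segmentation_metrics(conf):
    """Compute PA, mPA and mIoU from a square confusion matrix.

    conf[i][j] is the number of pixels whose ground-truth class is i and
    whose predicted class is j (ignore-label pixels already excluded).
    """
    num_classes = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(num_classes))
    pa = correct / total if total else float("nan")

    class_pa, class_iou = {}, {}
    for i in range(num_classes):
        gt = sum(conf[i])                    # pixels labelled as class i
        pred = sum(row[i] for row in conf)   # pixels predicted as class i
        union = gt + pred - conf[i][i]
        if gt > 0:                           # class present in the evaluation set
            class_pa[i] = conf[i][i] / gt
        if union > 0:                        # class with a non-empty union
            class_iou[i] = conf[i][i] / union

    mpa = sum(class_pa.values()) / len(class_pa) if class_pa else float("nan")
    miou = sum(class_iou.values()) / len(class_iou) if class_iou else float("nan")
    return {"PA": pa, "mPA": mpa, "mIoU": miou,
            "class_PA": class_pa, "class_IoU": class_iou}


if __name__ == "__main__":
    # Toy three-class example; class 2 never occurs and is skipped in the averages.
    toy = [[8, 2, 0],
           [1, 4, 0],
           [0, 0, 0]]
    print(segmentation_metrics(toy))
```

In the toy example, both present classes have a recall of 0.8, so PA and mPA coincide, while mIoU is lower (50/77) because the false positives of each class enlarge its union.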
5.3. Sensitivity Analysis
To assess the sensitivity of the proposed refinement to its threshold parameter, we performed a sweep over the threshold on the Huitongshan validation set. In our implementation, the threshold acts as a confidence gate that controls the activation region of the refinement: pixels whose confidence falls below it are treated as low-confidence and refined, while the remaining pixels keep the original prediction. The degenerate setting that yields a gate ratio of 0% corresponds to the baseline without refinement.
As shown in Figure 7, the performance is essentially unchanged over the lower part of the sweep, where the gate ratio stays low and mPA and mIoU remain at 78.36% and 66.34%, respectively. When the threshold increases beyond this range, the gate ratio grows rapidly (through 1.72% and 11.10% up to 100%), and both mPA and mIoU exhibit a consistent, albeit mild, decrease (mPA: 78.36% to 77.66%; mIoU: 66.34% to 66.03%). This trend suggests that refining an excessively large region tends to introduce over-smoothing around thin lithologic bodies and complex contact zones, where IoU is particularly sensitive to small boundary displacements under 10 m resolution and mixed pixels. Considering the stable plateau and the subsequent degradation when the gate becomes large, we fix the threshold within the plateau for all experiments.
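The gating mechanism can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: confidence is taken to be the maximum class probability of a pixel, the gate is assumed to fire when that confidence is strictly below the threshold `tau`, and `refined_labels` stands in for the output of whatever refinement step is applied. All names and the example values are hypothetical.

```python
def gated_refinement(probs, refined_labels, tau):
    """Apply refinement only to low-confidence pixels.

    probs: list of per-pixel class-probability lists.
    refined_labels: per-pixel labels proposed by the refinement step.
    tau: confidence threshold (assumed gate: max probability < tau).
    Returns (final_labels, gate_ratio).
    """
    labels, gated = [], 0
    for p, refined in zip(probs, refined_labels):
        confidence = max(p)
        original = p.index(confidence)   # arg-max label of the network
        if confidence < tau:
            labels.append(refined)       # low confidence: take the refined label
            gated += 1
        else:
            labels.append(original)      # confident: keep the original prediction
    gate_ratio = gated / len(probs) if probs else 0.0
    return labels, gate_ratio


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
    refined = [1, 1, 0]
    for tau in (0.0, 0.6, 1.01):
        print(tau, gated_refinement(probs, refined, tau))
```

With `tau = 0.0` no pixel is gated and the original predictions are returned unchanged, which mirrors the baseline case of a 0% gate ratio; a threshold above 1 gates every pixel.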
5.4. Comparative Experiments and Analysis on the Huitongshan Dataset
On the Huitongshan study area with nine lithologic classes, we compare SegNeXt-HFCA against a diverse set of baselines spanning CNNs (PSPNet, DeepLabV3+, and DANet), hybrid designs (U-NetFormer and LGMSFNet), a lightweight model (BiSeNetV2), transformer-based methods (SegFormer and Swin-UperNet), and the SegNeXt-base backbone. Table 1 and Table 2 summarize the quantitative results, and Figure 8 presents the corresponding confusion-matrix results.
At the class level, segmentation performance varies markedly with the intrinsic characteristics of each geological unit. Two-mica quartz schist, quartz diorite, amphibole–mica schist, and monzonite generally achieve PA values above 85% and IoU values above 75%, which can be attributed to their relatively uniform tone, homogeneous structure, and distinctive texture patterns. In contrast, syenogranite, gabbro, Quaternary deposits, and basalt have fuzzy contacts and occur as narrow, ribbon-like outcrops, making them difficult for deep-learning models to delineate accurately. Diorite forms more extensive bodies in the imagery, and its PA and IoU exceed 75% and 60%, respectively, despite residual confusion along unit boundaries.
Table 1 and Table 2 summarize the class-wise pixel accuracy (PA) and intersection-over-union (IoU) for the nine lithologic units in the Huitongshan study area (SyG, 2MQS, DI, Q, BA, QDI, HMS, GB, and MZ). Across methods, several units are consistently mapped with high fidelity, most notably 2MQS, QDI, and MZ, which maintain high PA (typically about 83–96%) and strong IoU (generally >70%). This pattern suggests that these units exhibit comparatively coherent spatial occurrences and more diagnostic spectral/texture cues at the mapping scale. By contrast, Quaternary deposits (Q) remain the principal source of uncertainty, showing the lowest region overlap for all models (IoU of about 25–35%) together with only moderate PA (about 50–57%), which is consistent with discontinuous surficial cover, mixed pixels along contacts, and frequent adjacency to multiple bedrock units.
In terms of overall performance, SegNeXt-HFCA yields the best mean scores, achieving mPA = 79.18% (Table 1) and mIoU = 67.29% (Table 2). Relative to the SegNeXt backbone alone (mPA = 75.68%, mIoU = 63.45%), this corresponds to improvements of +3.50 percentage points in mPA and +3.84 percentage points in mIoU, indicating a more consistent balance between pixel-wise recognition and area-level delineation. The gains are most evident for units that are sensitive to boundary ambiguity and local fragmentation: for Q, PA increases from 35.90% to 54.57% and IoU from 25.48% to 34.32%, and for BA, PA rises from 68.30% to 78.20%. Meanwhile, robust overlap is maintained for the easier units, e.g., 2MQS (IoU 87.64%) and QDI (IoU 82.97%), supporting stable extraction of their lithologic bodies. Notably, the best single-class scores are not always produced by the same model (e.g., Q IoU = 35.43% with U-NetFormer; DI IoU = 70.10% with DANet), underscoring that class-wise performance is strongly controlled by outcrop geometry and inter-unit similarity in addition to network design. As expected, mPA is generally higher than mIoU, because IoU penalizes boundary displacement and small-object omission more strictly, which becomes critical in narrow or heterogeneous contact zones.
In addition, PSPNet and DeepLabV3+ introduce multi-scale context through pyramid pooling and ASPP, which partly mitigates the uncertainty caused by scale differences and texture variability. On the Huitongshan dataset, they achieve mPA values of approximately 76% and 77%, respectively, indicating that contextual modeling is important for geological-element segmentation. Building on this, SegNeXt-HFCA further performs hierarchical feature fusion along the encoder–decoder path and employs coordinate attention to explicitly guide the segmentation of slender bands and lithologic contacts in the spatial domain, while a combination of Lovász loss and OHEM is used to better align optimization with the IoU metric and to enhance learning from hard examples. Under the same data and training settings, this strategy yields an mPA of about 79% and an mIoU of roughly 67%, representing a consistent improvement over the multi-scale baselines. At the class level, units with clear boundaries and homogeneous textures show relatively small differences across models, whereas in more ambiguous settings—such as narrow sedimentary bodies and contact zones between mafic and intermediate intrusive rocks—the cross-layer aggregation and spatial guidance of HFCA effectively reduce breaks and “sticking” along boundaries, leading to more coherent and continuous maps. Overall, under medium-resolution multispectral conditions, the combination of multi-scale context, structured spatial attention, and IoU-aligned losses appears to be critical for improving the accuracy of geological remote-sensing interpretation.
Figure 9 compares segmentation results of six representative Huitongshan scenes obtained with different methods. “Label” denotes the lithologic ground truth rasterized from the regional geological map and co-registered to the Sentinel-2 10 m grid; the Sentinel-2 false-color composites are shown for visual context only. Qualitatively, benefiting from the hierarchical multi-scale encoder with coordinate attention and the robust training strategy, SegNeXt-HFCA achieves superior stripe continuity, boundary adherence, and discrimination between spectrally similar lithologies compared with the competing models. In panels (1) and (2), the banded, strongly layered structural units are continuously tracked by SegNeXt-HFCA, with stripe width and strike closely matching the reference labels and with markedly fewer cross-stripe misclassifications, illustrating the ability of multi-scale fusion and coordinate attention to represent slender targets. In panel (3), where syenogranite and diorite exhibit limited spectral contrast, class-frequency re-weighting and OHEM suppress holes and speckle, yielding a more stable spatial extent. Panels (4) and (5) show mixed zones involving schist/gneiss, basalt, and Quaternary deposits; here, the combination of boundary-weighted losses and DenseCRF refinement produces more complete contacts and clearer transition zones, while preserving narrow lithologic bodies. Under the more complex structural configuration of panel (6), the separation between granitic units and schistose rocks is also more stable, and the overall topology is more consistent with geological expectations. Compared with U-NetFormer, PSPNet, DeepLabV3+, DANet, LGMSFNet, BiSeNetV2, and the SegNeXt-base backbone, SegNeXt-HFCA yields fewer misclassifications and less random noise, improving not only mIoU and related quantitative metrics but also visual coherence and practical usefulness.
5.5. Comparative Experiments and Analysis on the Xingxingxia Dataset
In this part, we repeat the experiments on the Xingxingxia study area, which comprises ten lithologic classes. Unlike Huitongshan, where geological elements are highly fragmented, Xingxingxia exhibits relatively continuous and homogeneous spatial patterns with comparable sample sizes across classes. The segmentation results in terms of PA and IoU are reported in Table 3 and Table 4.
Among the ten lithologic classes, diorite and Quaternary deposits are the easiest to interpret, with class-wise PA values close to or exceeding 90%. Trachyandesitic tuff and dolomitic marble also achieve high IoU scores (>80%). By contrast, two-mica quartz schist and monzogranite, whose textures resemble those of the surrounding granitic units and whose boundaries are locally fragmented, are more difficult to delineate, with IoU values mostly in the 65–70% range. Dolomitic marble and trachyandesitic tuff are widely distributed within the area and exhibit relatively stable lithologic signatures, maintaining accuracies of about 90% PA and consistently high IoU. In contrast, migmatitic gneiss commonly occurs interleaved with adjacent granitic bodies and within strongly fractured structural zones, which increases confusion with neighboring lithologies and leads to slightly lower accuracies.
From the PA/mPA and IoU/mIoU statistics in Table 3 and Table 4, mainstream architectures deliver overall stable performance on the Xingxingxia dataset, yet pronounced gaps remain across both backbones and lithologic classes. U-NetFormer achieves mPA = 83.11% and mIoU = 70.53%. PSPNet and DeepLabV3+ provide comparable overall accuracies (mPA = 87.15–87.26%, mIoU = 73.28–73.38%), while DANet attains mPA = 83.23% and mIoU = 71.16%. LGMSFNet is slightly lower (mPA = 80.34%, mIoU = 68.20%). The lightweight BiSeNetV2 remains competitive (mPA = 86.05%, mIoU = 73.71%), indicating that compact designs can still be effective in this setting. Among the transformer-based baselines, Swin-UperNet yields mPA = 80.73% and mIoU = 68.58%, whereas SegFormer shows a clear performance drop (mPA = 75.41%, mIoU = 59.96%). SegNeXt-base obtains mPA = 84.46% and mIoU = 73.05%. In comparison, the proposed SegNeXt-HFCA achieves the strongest overall performance, reaching mPA = 87.40% (highest in Table 3) and mIoU = 75.69% (highest in Table 4), indicating more accurate and more consistent lithologic delineation.
At the class level, SegNeXt-HFCA shows particularly robust discrimination for lithologies with relatively coherent bodies and clearer contacts, including DI (PA = 93.89%, IoU = 71.56%), Q (PA = 93.11%, IoU = 77.81%), FMG (PA = 91.76%, IoU = 84.59%), and DCT (PA = 91.28%, IoU = 83.41%); DM also remains reliably mapped (PA = 89.15%, IoU = 86.36%). The more challenging units are mainly 2MQS, GDI, and CMG, with IoU values of 67.50%, 68.31%, and 64.09%, respectively, which is consistent with their strong spectral/textural similarity to neighboring granitoid units and their locally fragmented spatial occurrence. Overall, Table 3 and Table 4 jointly indicate that mPA is generally higher than mIoU, reflecting the stricter penalty of IoU for boundary displacement and small-object omission under class imbalance and inter-class resemblance. Within this setting, SegNeXt-HFCA provides the most balanced gains across classes and achieves the best overall mapping quality in Xingxingxia.
Across the seven representative examples in Figure 10, the proposed method yields more stable segmentations that better conform to the spatial distribution of geological units in the Xingxingxia dataset. In rows (1)–(3), lithologies such as monzogranite, two-mica quartz schist, and diorite occur as banded or sheet-like bodies. Compared with the mainstream models (U-NetFormer, PSPNet, DeepLabV3+, DANet, LGMSFNet, BiSeNetV2, and the SegNeXt backbone alone), SegNeXt-HFCA not only preserves the correct semantic classes but also better restores band continuity and width, substantially reducing cross-band “bleeding” and salt-and-pepper noise within bodies, while small patches and pinch-out terminations are more completely retained. In rows (4)–(6), several obliquely arranged lithologic units (e.g., coarse- and fine-grained monzogranite and Neogene–Quaternary cover) are mutually interdigitated, with highly intricate boundaries. Competing methods tend to produce over-smoothed boundaries, holes, and misclassifications, particularly between the coarse/fine monzogranite units and in the contact zones where migmatitic gneiss–schist adjoins carbonate rocks. In contrast, SegNeXt-HFCA achieves the highest boundary adherence: contact zones are continuous and smooth, and heterogeneous blocks remain more clearly separable and structurally intact. In the large-scale, high-contrast scene of row (7), SegNeXt-HFCA simultaneously preserves the overall geometry and fine contours of the main bodies and avoids confusing Quaternary deposits with granitoid units, resulting in the most consistent map overall. Taken together, these examples indicate that the proposed method outperforms the baselines in terms of object completeness, boundary consistency, and fine-grained class discrimination (especially for coarse/fine units and in granitoid–gabbro/diorite adjacency zones), thereby enhancing the practical usability and reliability of lithologic mapping.
5.6. Ablation Study
In this section, SegNeXt is taken as the baseline model, and a stepwise ablation study is conducted on the Huitongshan dataset to assess the contribution of each component in SegNeXt-HFCA. As shown in Table 5, the plain SegNeXt backbone achieves 75.68% mPA and 63.45% mIoU. Adding only coordinate attention (SegNeXt + CA, row 2) already raises the scores to 77.35% mPA and 65.10% mIoU (+1.67/+1.65 percentage points), indicating that lightweight channel-and-spatial re-calibration is beneficial even without modifying the multi-scale fusion. When the full HMS-CA encoder is enabled (SegNeXt-HFCA, row 3), mPA and mIoU further increase to 77.69% and 65.37% (+0.34/+0.27 percentage points over SegNeXt + CA), suggesting that hierarchical multi-scale aggregation helps couple texture–spectral cues and alleviates receptive-field mismatch for stripe-like units. Incorporating the robust hard-example training (RHT, row 4) yields a modest but consistent improvement to 77.75% mPA and 65.69% mIoU, mainly stabilizing optimization and enhancing the representation of rare or weak-contrast classes. When the BCRF + seamless inference pipeline is activated on top of HMS-CA while RHT is disabled (row 5), the performance rises to 78.63% mPA and 67.15% mIoU; this suggests that overlapping sliding-window inference with Hann blending and DenseCRF refinement is particularly effective at sharpening contacts and suppressing isolated misclassified pixels. Finally, combining all components (row 6) yields the best performance, 79.18% mPA and 67.29% mIoU, corresponding to cumulative gains of +3.50 percentage points in mPA and +3.84 percentage points in mIoU over the SegNeXt baseline. These results indicate that encoder-level HFCA, training-level RHT, and inference-level BCRF + seamless inference provide complementary rather than redundant benefits.
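To illustrate the idea behind overlapping sliding-window inference with Hann blending, the sketch below blends window-wise predictions along a single axis; a 2-D version would typically use the outer product of two such weight vectors. It is a simplified, dependency-free illustration rather than the pipeline used here: the window size, stride, weight floor, and the `predict` callback are hypothetical, and the signal is assumed to be at least one window long.

```python
import math


def hann_weights(size, floor=1e-3):
    """Hann window sampled at pixel centres; the floor keeps all weights positive."""
    return [max(floor, 0.5 - 0.5 * math.cos(2.0 * math.pi * (k + 0.5) / size))
            for k in range(size)]


def blend_windows(length, window, stride, predict):
    """Blend overlapping window predictions along one axis.

    predict(start, window) must return `window` scores for the positions
    start .. start + window - 1. Assumes length >= window.
    """
    assert length >= window, "signal must be at least one window long"
    weights = hann_weights(window)
    acc = [0.0] * length    # weighted sum of scores
    norm = [0.0] * length   # sum of weights, used for normalisation
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        starts.append(length - window)   # extra window so the tail is covered
    for s in starts:
        scores = predict(s, window)
        for k in range(window):
            acc[s + k] += weights[k] * scores[k]
            norm[s + k] += weights[k]
    return [a / n for a, n in zip(acc, norm)]


if __name__ == "__main__":
    # Dummy predictor whose score equals the absolute position.
    out = blend_windows(11, 4, 2, lambda s, w: [float(s + k) for k in range(w)])
    print([round(v, 6) for v in out])
```

Because window centres receive larger weights than window borders, predictions near tile edges, where context is truncated, contribute less to the blended result, which is the motivation for this kind of weighting.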
Module-wise mechanism analysis: HMS primarily improves the recognition of thin, anisotropic lithologic belts by alleviating receptive-field mismatch across scales and preserving elongated topology after multi-scale alignment and fusion; this explains why enabling HMS-CA yields additional gains over CA-only and reduces broken/fragmented stripe-like bodies in the qualitative examples (Figure 9 and Figure 10). The coordinate-aware boundary re-calibration (CA) contributes most to weak-texture or low-contrast regions by injecting boundary/positional cues and suppressing ambiguous responses around contacts, which is consistent with the reduced inter-class confusion observed in Figure 8. RHT + class-frequency priors mainly target rare classes under long-tailed distributions: by prioritizing reliable hard pixels and re-balancing gradients, it prevents majority classes from dominating the optimization, leading to consistent (though modest) improvements and better stability for minority/weak-contrast lithologies. Finally, DenseCRF refinement operates at inference time to enforce local label consistency under spectral similarity while preserving sharp edges, which directly explains the reduction in salt-and-pepper artefacts and the sharpening of ambiguous contacts reported for the BCRF + seamless setting.
Loss strategy comparison: To ensure a fair comparison, we fix the network architecture and all training settings for all strategies, including the optimizer and learning-rate schedule, data augmentation, classifier bias initialization using class priors, class reweighting schedule, EMA, TTA, and the same fine-tuning protocol. The only change is the loss formulation for handling class imbalance: (1) weighted cross-entropy (WCE) + Lovász, (2) class-weighted focal loss (γ = 2) + Lovász, and (3) OHEM cross-entropy + Lovász. We report mIoU and mPA on the validation set.
Effect of loss strategies: As shown in Table 6, the adopted OHEM cross-entropy + Lovász combination achieves the best performance among the compared loss strategies, reaching 67.29% mIoU and 79.18% mPA. It improves upon WCE + Lovász by +0.24 mIoU and +0.07 mPA, and upon Focal (γ = 2) + Lovász by +0.73 mIoU and +0.50 mPA (all in percentage points). These results suggest that, under the same training settings in Huitongshan, hard-pixel mining can provide a modest but measurable benefit.
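The hard-pixel mining idea behind OHEM cross-entropy can be sketched as follows. This is a generic top-k variant written without dependencies for illustration; it is not claimed to match the exact selection rule used in the experiments (many implementations instead combine a probability threshold with a minimum number of kept pixels). The `keep_ratio` and `min_kept` values are hypothetical, and the Lovász term is omitted.

```python
import math


def ohem_cross_entropy(probs, targets, keep_ratio=0.25, min_kept=1):
    """Mean cross-entropy over the hardest pixels only (top-k OHEM variant).

    probs[n][c]: predicted probability of class c at pixel n.
    targets[n]: ground-truth class index of pixel n.
    keep_ratio, min_kept: illustrative assumptions controlling how many pixels are kept.
    """
    # Per-pixel cross-entropy; the clamp avoids log(0).
    losses = [-math.log(max(p[t], 1e-12)) for p, t in zip(probs, targets)]
    if not losses:
        return 0.0
    n_keep = max(min_kept, int(len(losses) * keep_ratio))
    n_keep = min(n_keep, len(losses))
    hardest = sorted(losses, reverse=True)[:n_keep]   # largest losses first
    return sum(hardest) / n_keep


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
    targets = [0, 0, 1, 0]
    print(ohem_cross_entropy(probs, targets, keep_ratio=0.5))  # two hardest pixels
    print(ohem_cross_entropy(probs, targets, keep_ratio=1.0))  # plain mean CE
```

Setting `keep_ratio=1.0` recovers ordinary mean cross-entropy, so the mined loss is never smaller than the plain one on the same batch; the gradient is thereby concentrated on pixels the model currently gets wrong or is unsure about.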
As shown in Table 7, the proposed 10-channel stack consistently outperforms the full-band MSI baseline on both study areas. For Huitongshan, the 10-channel input achieves 67.29% mIoU and 79.18% mPA, exceeding the 12-band MSI baseline by +6.40 mIoU and +4.75 mPA. For Xingxingxia, the 10-channel input yields 75.69% mIoU and 87.40% mPA, improving upon the full-band baseline by +4.23 mIoU and +4.29 mPA. These results indicate that the PCA/MNF-based compressed representation together with DEM-derived constraints retains (and enhances) the most discriminative information for lithologic segmentation.