1. Introduction
Geological outcrops serve as natural archives exposed at the Earth’s surface by tectonic and sedimentary processes, preserving multi-scale information ranging from microscopic lithological fabrics to macroscopic structural deformations [1]. Within sedimentary basins, alternating sequences of sandstone and mudstone are particularly prevalent. The accurate delineation of these lithologies, especially when identifying extremely thin mudstone interbeds, is fundamental for the transition from qualitative description to quantitative, refined characterization [2,3,4]. Such detailed characterization provides critical evidence for unraveling geodynamic processes and reconstructing paleogeographic environments [5]. However, field outcrops are frequently compromised by long-term weathering and are often obscured by surface covers such as vegetation and debris, rendering geological semantic information fragmented and ambiguous [6]. Consequently, the precise and automated extraction of key geological semantics from complex field settings remains a significant challenge in geological exploration [7,8].
The lithological identification of geological outcrops has traditionally relied primarily on field observation, geophysical interpretation, and geochemical analysis by geological experts [9,10,11]. Although these traditional methods can usually achieve satisfactory results, they demand extensive domain expertise and are time-consuming and labor-intensive [12,13]. Crucially, constrained by the physical accessibility of sampling sites, traditional approaches pose safety hazards and struggle to generate continuous, high-resolution representations of the entire outcrop face [14,15,16]. Consequently, there is a critical need to develop more efficient, automated methodologies to overcome these limitations.
In recent years, the advancement of oblique aerial photogrammetry using Unmanned Aerial Vehicles (UAVs) and high-precision digital sensing technology has generated a wealth of outcrop imagery, providing a new perspective for geological research [17,18]. In this context, integrating computer vision technology into geological image analysis has emerged as an essential strategy to overcome the spatiotemporal limitations of traditional methods [19,20]. Driven by breakthroughs in deep learning, image segmentation techniques grounded in Convolutional Neural Networks (CNNs) have demonstrated considerable efficacy across medical diagnosis, autonomous driving, and remote sensing [21,22,23]. In the earth sciences, segmentation methods such as FCN, U-Net, and the DeepLab series have gradually replaced traditional shallow machine learning algorithms, showing the potential for pixel-level classification [24,25]. For instance, Zhu et al. [26] introduced RockNet, which captures textural fingerprints of rock surfaces via an end-to-end feature enhancement strategy. Jing et al. [27] constructed a multi-modal network architecture that improves the characterization of complex lithologies by fusing spectral and texture information. Addressing data scarcity, Qin et al. [28] employed unsupervised superpixel segmentation techniques to effectively discriminate clastic rock particles in scenarios with limited annotated data. Collectively, these studies validate the capacity of deep learning to extract complex, non-linear features in images, laying a robust foundation for the intelligent identification of geological outcrops [12].
Despite these advancements, existing semantic segmentation models encounter significant bottlenecks when applied to specific geological scenarios, particularly sandstone–mudstone interbedded formations [29]. This research gap primarily stems from two geological challenges:
First, the “Noise–Boundary” Dilemma: Extensive weathering and environmental noise (e.g., uneven illumination, vegetation shadows) cause pronounced boundary ambiguity between lithologies [30]. Standard CNNs often struggle to filter out this background noise, leading to misclassified boundaries [31,32]. Recently, several advanced frequency-based architectures have been proposed in remote sensing to address similar issues by isolating noise and emphasizing boundary features in the frequency domain. For instance, Guo et al. [33] developed a knowledge distillation framework that decouples high-frequency structural boundaries from low-frequency semantic backgrounds, effectively minimizing background interference in UAV-based assessments. Similarly, Yu [34] proposed FHSA, a framework that employs hybrid sequence attention to specifically suppress environmental noise in the frequency domain while preserving critical textures. Furthermore, Li et al. [35] introduced a frequency-aware Transformer that utilizes wavelet decomposition to adaptively integrate multi-resolution subbands, filtering out irrelevant noise subbands to sharpen semantic boundaries. However, the efficacy of these methods in geological outcrops remains limited, as the stochastic texture of weathered rock surfaces often overlaps with semantic information in the high-frequency spectrum, making pure frequency-based separation challenging.
Second, the “Scale–Loss” Effect: Sedimentary strata exhibit extreme scale variations, characterized by extremely thin mudstone layers embedded within thick sandstone beds. The downsampling operations inherent to deep networks often cause the texture information of these thin rock layers to be diluted or even lost across successive convolutions, resulting in poor segmentation accuracy [36,37]. Recently, various multi-scale context aggregation and attention-based modules have been proposed to dynamically capture features across different receptive fields. For example, Chen et al. [38] developed GLDSFNet, a segmentation network integrating global attention mechanisms with multi-size deformable convolutions to dynamically expand receptive fields, effectively capturing fine boundaries and global–local multi-scale information. Similarly, Liang and Li [39] proposed ScaleRSNet, a contextual attention-based framework that utilizes multi-rate dilated convolutions alongside spatial attention to precisely extract and combine features across different receptive fields. However, despite these modifications, CNN-based methods fundamentally rely on localized window operations. While techniques like dilated or deformable convolutions expand the receptive field to some extent, they still suffer from an inherent local bias and struggle to model long-range spatial dependencies. Consequently, they often fail to track continuous but extremely thin geological features that span an entire image.
To overcome the spatial limitations and inherent local bias of traditional convolution operations, researchers have increasingly turned to Vision Transformers (ViTs) as a robust structural solution [40]. By establishing spatial correlations across the entire image regardless of distance, the self-attention mechanisms of ViTs theoretically excel at maintaining the structural integrity and continuity of elongated thin beds [40,41]. Nevertheless, applying these architectures to geological scenarios is severely hindered by their massive appetite for training data [41]. The prohibitive cost and profound scarcity of expert-level, pixel-wise annotations in this specific domain make the training of high-performing ViTs largely impractical [40,41,42].
In this study, to bridge this critical technical gap, we propose the AFPN-ResUNet framework, a high-precision semantic segmentation architecture tailored specifically for ultra-thin sand–mudstone interbeds in complex field outcrops. The core of this framework lies in two architectural modifications. First, regarding feature fusion, we depart from the direct concatenation conventionally used in standard skip connections, adopting an Asymptotic Spatial Feature Fusion (ASFF) strategy [43] instead. By executing a dynamic and smooth fusion based on the contribution weights of different feature levels, this strategy mitigates the semantic loss of thin layers caused by successive convolutional operations in deep networks. Consequently, it bridges the semantic gap between low-level structural details and high-level abstract contexts [43,44]. Second, to counteract the noise interference inherent in complex outcrop backgrounds, a Convolutional Block Attention Module (CBAM) is integrated into the residual network [45,46]. Through its channel and spatial attention mechanisms, this integration enables the model to dynamically filter out cluttered background noise, concentrating the network’s focus on fundamental lithological textures and micro-geological features.
The primary innovations presented in this research include the following: (1) we introduce a symmetric semantic segmentation framework, designated as AFPN-ResUNet, designed to tackle the difficulties associated with classifying lithology in complex field outcrops with extreme scale variations; (2) a CBAM is embedded into the ResNet50 encoder, which allows the model to adaptively emphasize discriminative lithological textures while suppressing environmental interference; (3) an asymptotic pyramidal architecture with a cross-scale feature fusion mechanism is deployed along the skip pathways, which aims to mitigate the semantic gap across different deep layers.
Section 2 of this paper introduces the study area and data sources, detailing the data processing workflow, evaluation metrics, and training methodologies. Subsequently, Section 3 focuses on the proposed AFPN-ResUNet architecture, elaborating on specific module designs, the loss function, and the Grad-CAM++ visualization method. Section 4 comprehensively presents the experimental results, encompassing model training and segmentation performance, comparative experiments, and ablation studies. Finally, Section 5 provides a relevant discussion of our findings, and Section 6 delivers the conclusions.
4. Results
4.1. Experimental Training Results
As illustrated in Figure 8, the training loss experiences a steep initial decline, subsequently plateauing near the 100-epoch mark after some minor variations. The trajectory of the validation loss closely mirrors that of the training phase, implying the model does not suffer from significant overfitting. Additionally, the mean intersection-over-union (mIoU) progression is remarkably consistent across the training and validation sets. These metrics ultimately stabilize at 94.82% for the training set and 94.15% for the validation set, leaving a marginal gap of merely 0.67 percentage points. Although short-term variations emerged during the learning phase (likely driven by mini-batch variance and the inherent complexity of sample features), they did not disrupt the general trajectory. The persistent drop in the loss function, coupled with climbing mIoU scores, confirms that stable convergence was reached. In summary, the proposed architecture exhibits reliable training behavior.
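For readers less familiar with the metric, the mIoU tracked in the training curves can be computed from per-class intersections and unions. The following is a minimal sketch, assuming integer class labels in flattened masks; the function name `mean_iou`, the label coding, and the toy masks are illustrative assumptions and are not taken from the authors' code.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in either mask."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy flattened masks (hypothetical coding: 0 = sandstone, 1 = mudstone).
# A single mislabelled pixel next to a two-pixel mudstone bed is enough
# to pull the mudstone IoU down to 2/3.
target = [0, 0, 0, 1, 1, 0, 0, 0]
pred = [0, 0, 1, 1, 1, 0, 0, 0]
print(round(mean_iou(pred, target, num_classes=2), 4))  # 0.75
```

The toy case also shows why thin beds are demanding for this metric: a one-pixel boundary error costs a thin class far more IoU than it costs the thick surrounding class.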
4.2. Segmentation Results
To thoroughly evaluate the proposed AFPN-ResUNet framework, we tested it on an unseen, independent dataset. Figure 9 presents the segmentation results under different surface outcrop conditions (illumination, shelter, interbedding, crumbling).
The experimental results demonstrate that AFPN-ResUNet achieved reliable segmentation performance across various challenging conditions. First, in regions with uneven illumination, the model accurately restored the spatial continuity of sedimentary layers, with mIoU scores ranging from 87.35% to 95.57%. Furthermore, in sheltered areas covered by weathered debris and vegetation, the mIoU consistently exceeded 90%, supporting the feature refinement capability of the RE-CBAM and its effectiveness in mitigating noise-induced misclassification.
Notably, in regions with extremely thin interbedded sandstone and mudstone layers, the model maintained stable segmentation performance, with mIoU scores exceeding 93% and a precise delineation of ultra-thin interlayer boundaries. Finally, even when confronting severely weathered, crumbling outcrops with blurred boundaries, the model yielded mIoU scores ranging from 88.30% to 94.82%. These results indicate that the proposed model performs robustly in complex and fractured field outcrop segmentation tasks, accurately identifying and localizing specific lithologies.
4.3. Analysis of Module Mechanisms
4.3.1. Multi-Dimensional Visualization Study on the Mechanism of AFPN-ResUNet
To reveal the internal learning mechanism of AFPN-ResUNet, a multi-dimensional visualization analysis was conducted using Grad-CAM++ during the training process. In Figure 10, the horizontal axis captures the shifting focus of attention regions over various epochs, while the vertical dimension portrays the progressive development of features across successive network layers.
Examining the vertical axis (network depth), the visualization results clearly demonstrate the process of feature refinement and spatial reconstruction. At the front end of the encoder (e.g., RCB-E1), the model’s activation signals appear diffuse, primarily capturing low-level visual information such as color gradients, minor surface fractures, weathering marks, and basic textures on rock surfaces. As the network depth increases toward deeper layers (RCB-E4), the activated regions exhibit noticeable spatial concentration and intensified activation, shifting focus away from superficial noise to strongly highlight the prominent horizontal bedding planes and macro-structural boundaries. Subsequently, in the decoder stage (from SYM-D1 to SYM-D4), abstracted semantic features are progressively upsampled and refined. The activation focus transitions from blob-like abstract regions back to precise spatial activations, ultimately generating sharp, continuous heatmaps that accurately align with the actual physical boundaries of the thin interbedded strata.
Examining the horizontal axis (training epochs), in the early stage of training, the attention mechanism anchors on highly contrasted and prominent lithological features, such as the thickest visually distinct sedimentary bands (e.g., the dominant reddish strata). Entering the middle stage, the model begins to finely identify previously overlooked weak feature regions and edges, specifically activating along the thinner interbedded layers and subtle transitional boundary zones. By the convergence stage of training, the model can clearly identify the target lithological areas, generating concentrated, high-intensity activation maps that precisely trace the continuous morphological strike of the specific geological strata, while effectively suppressing non-geological background elements like shadows or surface debris.
This process demonstrates that RE-CBAM effectively filters out complex background noise in field settings by gradually decoupling lithological signals from the complex geological environment, achieving a transformation from fragmented local feature perception to systematic global feature representation. Furthermore, combined with the asymptotic fusion of AFPN, the model effectively reconstructs abstracted spatial information during the decoding stage.
4.3.2. Visual Analysis of the Asymptotic Fusion Mechanism in AFPN-ResUNet
To intuitively demonstrate the AFPN asymptotic fusion mechanism, this study conducted a visualization analysis of its workflow and generated feature heatmaps for mudstone lithological units in the thin interbedded area of the outcrop, as illustrated in Figure 11. The input is downsampled through the backbone network, and a large amount of irrelevant background noise is filtered by the RE-CBAM attention module, thereby providing a refined initial feature input for the subsequent AFPN module. Feature maps from different levels are then each processed through a 1 × 1 convolution (indicated by the red arrows) to unify the number of channels (adjusted to 128), which significantly reduces computational complexity while preserving the representative features of each level.
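To give a rough sense of why unifying channels to 128 lowers the cost of later fusion, the weight count of a convolution can be tallied directly. This is a back-of-the-envelope sketch: the stage widths (256, 512, 1024, 2048) are the standard ResNet50 stage outputs and are assumed here, since the exact tap points are not stated in this section, and `conv_params` is a hypothetical helper.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Number of weights in a k x k convolution (optionally plus biases)."""
    return c_in * c_out * k * k + (c_out if bias else 0)


# Assumed encoder stage widths (standard ResNet50); only the 128-channel
# target comes from the text above.
stage_channels = [256, 512, 1024, 2048]
target_channels = 128

# Cost of the four 1 x 1 projections that unify the channel count.
projection = sum(conv_params(c, target_channels, 1) for c in stage_channels)

# A 3 x 3 convolution after projection versus one at the deepest native width.
fuse_reduced = conv_params(target_channels, target_channels, 3)
fuse_native = conv_params(stage_channels[-1], stage_channels[-1], 3)

print(projection)                   # 491520
print(fuse_native // fuse_reduced)  # 256
```

Under these assumptions, every 3 × 3 convolution applied after projection is 256 times cheaper in weights than the same operation at 2048 channels, while the projections themselves cost under half a million weights.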
As illustrated in the AFPN structure, asymptotic feature fusion proceeds progressively from shallow to deep levels. In this process, shallow features (e.g., C1 and C2)—which distinctly capture localized visual information such as granular surface textures, minor fractures, and the precise geometric boundary lines of thin mudstone layers—participate in the fusion first (Step 1). These are then weighted and integrated through the ASFF nodes, with the output progressively fused with deeper features (such as C3 in Step 2 and C4 in Step 3) via downsampling (indicated by the green arrows). This design effectively compensates for the limitations of deep features, which possess strong semantic information (successfully localizing the macro-structural strike of the strata) but lack sufficient spatial details (often appearing as diffuse, blurred activation blobs).
Simultaneously, deep features propagate back to shallow layers via upsampling (indicated by the orange arrows), enriching the shallow features with higher-level semantic contexts. This overcomes the inherent limitation of shallow features containing only fragmented local details without overarching geological associations. Through this bidirectional, dense feature interaction, AFPN achieves a complementary integration of multi-scale features. As seen in the final fusion stage (Step 4), the model successfully generates concentrated activation bands. These refined multi-scale features are then finally fed into the decoder (from P1 to P4), where spatial resolution is progressively restored. The resulting continuous, high-intensity activation maps precisely delineate the thin interbedded mudstone units, contributing to the accurate and robust final segmentation output.
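The weighted integration performed at an ASFF node can be illustrated with per-pixel softmax weights across levels. The sketch below is a simplified, single-channel illustration and not the authors' implementation: it assumes the input maps have already been resized to a common resolution, and it takes the weight logits as given inputs, whereas in a trained network they would be predicted from the features themselves. The names `asff_fuse` and `softmax` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def asff_fuse(features, weight_logits):
    """Fuse L same-sized H x W maps using per-pixel softmax weights.

    features[l][i][j]      : response of level l at pixel (i, j)
    weight_logits[l][i][j] : unnormalised importance of level l at (i, j)
    """
    levels = len(features)
    height, width = len(features[0]), len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            w = softmax([weight_logits[l][i][j] for l in range(levels)])
            fused[i][j] = sum(w[l] * features[l][i][j] for l in range(levels))
    return fused


shallow = [[1.0, 0.0], [0.0, 1.0]]  # sharp, boundary-like response
deep = [[0.5, 0.5], [0.5, 0.5]]     # smooth, context-like response
equal = [[0.0, 0.0], [0.0, 0.0]]    # equal logits at every pixel

fused = asff_fuse([shallow, deep], [equal, equal])
print(fused)  # [[0.75, 0.25], [0.25, 0.75]]
```

With equal logits the fusion reduces to a plain average. Raising the shallow-level logit at a pixel pushes the fused value toward the sharp response there, which is the behaviour the text attributes to boundary locations.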
4.4. Comparison Study and Performance Assessment
4.4.1. Comparative Analysis of Segmentation Results Across Mainstream Models
To validate the AFPN-ResUNet framework, we compared it against five established networks, namely UNet, ViT, DeepLabV3+, PSPNet, and SegNeXt. All baseline models were evaluated under the same data splits and hardware setup. The hyperparameter configurations and augmentation strategies applied uniformly to all models are summarized in Table 1.
Table 2 reports the performance of the different segmentation methods. AFPN-ResUNet outperforms the other models across all evaluation metrics in the task of field outcrop lithological segmentation. Specifically, the baseline UNet, the Transformer-based ViT, and the extremely lightweight SegNeXt struggle with this complex task, yielding mIoU scores of only 70.21%, 69.49%, and 67.37%, respectively. While classical semantic segmentation architectures such as DeepLabV3+ and PSPNet perform relatively better (achieving mIoU scores of around 81%), AFPN-ResUNet reaches an mIoU of 93.41%, exceeding the second-best model (PSPNet) by a substantial margin of 12.38 percentage points. This superiority extends beyond mIoU. AFPN-ResUNet achieves an F1-Score and DSC of 96.58%, whereas the lowest-performing model, SegNeXt, only manages an F1-Score of 69.54%. The consistently higher recall, precision, and F1-Score demonstrate the model’s capacity to mitigate prediction errors and omissions in complex geological scenes.
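The reported F1-Score can be cross-checked against the precision (96.11%) and recall (97.05%) given for the full model in the ablation study. The sketch below also illustrates why the F1-Score and the DSC coincide: when both are computed from the same true-positive, false-positive, and false-negative counts, they are algebraically identical. The function names and the toy counts are illustrative assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def dice_from_counts(tp, fp, fn):
    """Dice similarity coefficient from pixel counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Precision and recall reported for the full AFPN-ResUNet.
precision, recall = 0.9611, 0.9705
print(round(100 * f1_score(precision, recall), 2))  # 96.58

# Toy counts showing that Dice and F1 agree on the same confusion counts.
tp, fp, fn = 90, 5, 10
same = abs(dice_from_counts(tp, fp, fn) - f1_score(tp / (tp + fp), tp / (tp + fn)))
print(same < 1e-12)  # True
```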
Figure 12 illustrates key visual comparisons of the segmentation tasks, with the layout organized such that the original image crops and their corresponding manual annotations (ground truth) occupy the leftmost two columns. The remaining sections detail the predicted masks produced by our AFPN-ResUNet, as well as those derived from UNet, ViT, DeepLabV3+, PSPNet, and SegNeXt.
As shown in Figure 12a, in regions with uneven illumination, models like ViT and SegNeXt are heavily disturbed by shadows, misclassifying shadowed rock layers into incorrect categories. In contrast, AFPN-ResUNet remains largely insensitive to illumination changes. In areas covered by weathered debris and vegetation (Figure 12b), UNet and ViT exhibit severe noise susceptibility, generating large, fragmented patches of misclassified pixels around the occlusions. AFPN-ResUNet, however, suppresses this environmental noise and delineates the continuous outcrop boundaries. Figure 12c–e depict regions with extremely thin interbeds. Here, the limitations of the comparative models are most apparent: SegNeXt largely fails to capture the thin structures (fusing them into solid background blocks), while UNet and ViT produce disjointed and fragmented predictions. Even DeepLabV3+, which performs moderately well, tends to over-smooth the edges and lose fine geometric details. Conversely, AFPN-ResUNet accurately delineates the boundaries and maintains the spatial continuity of these ultra-thin layers. Finally, in severely weathered and crumbling areas (Figure 12f,g), ViT suffers from severe blocky artifacts, and SegNeXt exhibits massive misclassifications, whereas AFPN-ResUNet consistently isolates the fragmented lithology with high precision.
Overall, while other baseline models can identify macro-lithological distributions, they fundamentally struggle with severe noise, occlusions, and fine-grained structural details. Meanwhile, AFPN-ResUNet consistently delivers accurate, continuous, and robust segmentation results under various challenging field conditions.
4.4.2. Comparison of Inference Speed and Computational Cost
A comprehensive evaluation of a model’s true capability demands looking beyond mere segmentation accuracy, since computational overhead is equally vital for real-world deployment. Consequently, we benchmarked the operational efficiency of our AFPN-ResUNet against five established networks using identical software and hardware configurations. To quantify architectural complexity, we tracked total trainable weights (Params) alongside floating-point operations (FLOPs). Concurrently, the execution speed was measured via per-image processing times and frames per second (FPS). These detailed metrics are documented in Table 3.
Regarding model complexity, the proposed AFPN-ResUNet comprises 38.67 M parameters, placing it between DeepLabV3+ (40.35 M) and PSPNet (30.39 M), while remaining substantially more compact than the Transformer-based ViT (91.66 M). This demonstrates that AFPN-ResUNet maintains a relatively moderate parameter footprint despite the integration of multi-scale asymptotic fusion modules. In terms of computational overhead, the model requires 179.79 G FLOPs. While this computational cost is markedly lower than that of classical architectures such as UNet (321.19 G) and PSPNet (222.63 G), the dense multi-level feature interactions within the AFPN and the weighted fusion operations inherent to the ASFF mechanism inevitably introduce structural complexity.
Regarding inference efficiency, AFPN-ResUNet yields an inference time of 33.99 ms per image, translating to a throughput of 29.42 FPS. This throughput is comparatively lower than that of standard architectures like UNet (40.94 FPS) and DeepLabV3+ (52.98 FPS) and falls short of the optimized SegNeXt (97.74 FPS) and ViT (141.45 FPS). The elevated inference latency—despite having fewer FLOPs than UNet and PSPNet—can be attributed to the complex, sequential feature fusion pathways. These fragmented structures inherently introduce substantial memory access costs and reduce the degree of hardware parallelism, shifting the bottleneck from computation to memory bandwidth.
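The relationship between the per-image latency and the throughput quoted above is a simple reciprocal. The sketch below reproduces the reported figure and derives per-image latencies for the other models from their FPS; it assumes single-image (batch size one) inference, and the helper names are illustrative.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def latency_ms_from_fps(fps):
    """Per-image latency in milliseconds for a given throughput."""
    return 1000.0 / fps


print(round(fps_from_latency_ms(33.99), 2))  # 29.42

# Latencies implied by the reported throughputs (derived, not reported values).
reported_fps = {"UNet": 40.94, "DeepLabV3+": 52.98, "SegNeXt": 97.74, "ViT": 141.45}
for name, fps in reported_fps.items():
    print(name, round(latency_ms_from_fps(fps), 2), "ms")
```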
Notably, despite possessing the largest parameter count, ViT achieves the highest throughput. This efficient hardware utilization is primarily attributed to the parallelizable matrix operations inherent to its self-attention blocks, which bypass the sequential constraints of convolutions. However, this architectural trait represents an inherent trade-off. While it maximizes FPS, the pure Transformer architecture lacks the local inductive biases necessary for fine-grained texture extraction. Consequently, without the support of massive pre-training datasets, ViT struggles to accurately delineate complex geological boundaries, resulting in an mIoU of only 69.49%.
Conversely, SegNeXt represents an extreme approach to lightweight design. While its minimal parameter count (3.70 M) and low computational cost (18.99 G FLOPs) yield an exceptionally high throughput, its poor segmentation accuracy (mIoU of 67.37%) indicates that excessive structural simplification severely compromises spatial feature representation. This contrast supports the design of the proposed AFPN-ResUNet, which accepts a moderate increase in inference latency to achieve superior and reliable segmentation accuracy in complex field environments.
In summary, while AFPN-ResUNet successfully constrains its parameter scale and achieves superior segmentation accuracy, its sophisticated feature fusion mechanisms impose non-trivial inference latency. Consequently, future work will focus on structural pruning, the integration of lightweight convolutional adaptations, and the optimization of fusion modules to enhance real-time applicability without compromising model accuracy.
4.5. Ablation Study
The individual impacts of the RE-CBAM and the AFPN architectures on segmentation efficacy were systematically analyzed through a series of component-removal tests. Table 4 details the exact structural arrangements used in this phase. To prevent experimental bias and ensure a fair evaluation, all ablated model versions were trained using the identical unified configurations previously outlined in Table 1. The recorded performance indicators (mIoU, F1-Score, recall, precision, and DSC) are cataloged in Table 5. Visual evidence of these improvements across the Baseline, Baseline + RE-CBAM, Baseline + AFPN, and full AFPN-ResUNet configurations is displayed in Figure 13. Overall, the empirical data confirm that every added component improves the core network’s capabilities.
As presented in Table 5, the incremental integration of each module leads to a consistent performance uplift. The Baseline + RE-CBAM configuration achieves a 13.11% improvement in mIoU over the Baseline. This is qualitatively evidenced in Figure 13a,b, where the attention mechanism effectively suppresses shadows and vegetative occlusions. By recalibrating channel and spatial importance, RE-CBAM prevents the encoder from being distracted by environmental noise, thereby enhancing the discriminative signatures of lithological textures.
Notably, the Baseline + AFPN model yields a substantial 13.98% gain in mIoU, surpassing the attention-only version. This underscores the critical necessity of asymptotic feature fusion in geological tasks. In the standard ResUNet, the abrupt concatenation of shallow and deep features often leads to a “semantic gap,” manifesting as fragmented or over-smoothed boundaries in thin interbedded sequences. AFPN mitigates this by facilitating a smoother semantic transition across adjacent levels. As shown in Figure 13c, AFPN significantly preserves the spatial continuity of ultra-thin layers, which is further reflected by the highest precision (96.72%) among all ablation models, indicating its ability to reject false-positive background pixels in complex structures.
Finally, AFPN-ResUNet shows better performance across most comprehensive metrics, with the highest mIoU (93.41%) and recall (97.05%). A nuanced comparison between the AFPN-only model and the full architecture reveals an advantageous performance trade-off. While the final model experiences a slight decline in precision (96.72% to 96.11%), it achieves definitive gains in recall, F1-Score, DSC, and mIoU. This indicates that the integration of RE-CBAM enables a more inclusive feature capture. By filtering out environmental noise prior to fusion, RE-CBAM acts as a “feature purifier,” freeing the AFPN module from being overly conservative. Consequently, the model excels at recovering severely fragmented or weathered geological units, reducing missed detections. In complex geological scenes, such as the crumbling outcrops in Figure 13d, prioritizing structural completeness (high recall) over strict pixel-level conservatism (precision) is a crucial requirement. The rise in all comprehensive metrics confirms that this gain in structural integrity outweighs the minor pixel-level trade-off.
In conclusion, the integration of a ResNet50 backbone for robust hierarchy, RE-CBAM for local discriminative power, and AFPN for global semantic consistency allows this model to effectively adapt to the complexities of field geological environments.
Table 6 compares the operational efficiency of different ablation configurations. The Baseline model has 73.28 M parameters, which increases slightly to 75.80 M after incorporating RE-CBAM. However, the introduction of AFPN significantly reduces the parameter count to 36.16 M, a remarkable decrease of 50.7%. This substantial reduction is primarily attributed to AFPN’s architectural design, which replaces the redundant, dimension-inflating skip connections of the original architecture with early dimensionality reduction via 1 × 1 convolutions.
Despite this drastic reduction, the segmentation performance of the model is preserved through the complementary roles of the two modules. Instead of acting as a filter itself, AFPN relies on its progressive multi-scale strategy to efficiently integrate features. The true role of the information filter is fulfilled by RE-CBAM, which removes unrefined, noisy low-level features before fusion. As a result, AFPN successfully eliminates bloated convolution operations, while RE-CBAM optimizes feature representation. This allows the final AFPN-ResUNet to maintain a relatively compact parameter count of 38.67 M without sacrificing its strong representational capacity.
Regarding computational complexity, the Baseline exhibits a high number of FLOPs, at 304.62 G, which remains essentially unchanged after adding RE-CBAM (304.88 G). In contrast, incorporating AFPN drastically reduces FLOPs to 179.49 G, a 41.1% decrease, fully demonstrating the advantage of the asymptotic fusion mechanism in reducing redundant computations. AFPN-ResUNet achieves 179.79 G FLOPs, nearly identical to Baseline + AFPN.
In terms of inference speed, the Baseline achieves an inference time of 24.10 ms per image. After introducing RE-CBAM, the inference time increases to 30.44 ms due to the sequential computation of the attention mechanism. In contrast, the addition of AFPN alone reduces the inference time to 22.41 ms, a 7.0% improvement over the Baseline. The final AFPN-ResUNet has an inference time of 33.99 ms, which exceeds that of the Baseline and of both single-module variants, even though its parameter count and FLOPs remain close to those of Baseline + AFPN. These results reflect the trade-off between the two modules: RE-CBAM accepts additional latency in exchange for cleaner features, while AFPN significantly reduces the parameter count and computational load by streamlining the fusion path. Together, they allow the model to reach high geological identification accuracy at the cost of a moderate increase in inference latency.
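The percentage changes quoted in the efficiency comparison follow directly from the reported parameter counts, FLOPs, and latencies. The short check below recomputes them; `percent_reduction` is an illustrative helper, and a negative result means the quantity increased.

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Baseline versus Baseline + AFPN, using the values reported above.
print(round(percent_reduction(73.28, 36.16), 1))    # parameters (M): 50.7
print(round(percent_reduction(304.62, 179.49), 1))  # FLOPs (G): 41.1
print(round(percent_reduction(24.10, 22.41), 1))    # latency (ms): 7.0

# Baseline versus the full AFPN-ResUNet latency (negative means slower).
print(round(percent_reduction(24.10, 33.99), 1))
```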
5. Discussions
5.1. Comparative Analysis of Segmentation Performance Across Different Models
Accurate lithology segmentation in complex outcrops is constrained by severe background interference and extreme scale variations. While conventional architectures like DeepLabV3+ expand the receptive field, they lack adaptive mechanisms to suppress background noise—often misclassifying shadows with similar spectra. To address this, AFPN-ResUNet incorporates an RE-CBAM that recalibrates spatial and channel representations, adaptively suppressing non-geological interference early in the network to preserve fundamental lithological features before deep processing.
Furthermore, preserving the geometric continuity of ultra-thin interbeds demands meticulous multi-scale feature integration—a challenge for standard architectures. UNet’s direct concatenation produces coarse boundaries under occlusion; PSPNet’s global pooling blurs localized structures; and ViT’s patch-embedding mechanism disrupts the spatial continuity of narrow geological bodies (mIoU 69.49%). Even efficient designs like SegNeXt compromise fine edge preservation in weathered zones. To overcome these bottlenecks, our framework employs an asymptotic feature pyramid network with an ASFF mechanism. Rather than utilizing abrupt concatenation or aggressive pooling, this strategy fuses features progressively through dynamic spatial weighting. This approach resolves cross-scale semantic discrepancies while balancing global contextual coherence with the precise retention of high-frequency boundaries in extremely thin layers.
5.2. Contribution and Synergy of RE-CBAM and AFPN Modules
The RE-CBAM primarily functions by isolating valid lithological signals from high-noise backgrounds. In outcrop areas characterized by common interferences such as shadows, vegetation, and weathered debris, this module employs a serialized channel and spatial attention mechanism for feature recalibration. Channel attention highlights the channels encoding key lithological features, while spatial attention helps to localize them. By suppressing the background responses of non-geological objects during the early stages of feature extraction, RE-CBAM provides a foundation for refined features, contributing a 13.11% improvement in mIoU.
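To make the serialized channel-then-spatial recalibration concrete, the following is a minimal, dependency-free sketch of a generic CBAM-style block. It is an illustration under stated assumptions, not the authors' RE-CBAM: the RE-specific modifications are not reproduced, the weights `w1`, `w2`, `w_avg`, and `w_max` are hypothetical placeholders for learned parameters, and the spatial branch uses a 1×1 mix of the pooled maps where standard CBAM uses a 7×7 convolution.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat, w1, w2):
    """Per-channel weights in (0, 1). feat is C x H x W (nested lists);
    w1 is (C/r) x C and w2 is C x (C/r): a shared two-layer MLP."""
    def mlp(v):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]  # ReLU
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]  # global average pool
    mx = [max(map(max, ch)) for ch in feat]                            # global max pool
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(feat, w_avg, w_max, bias=0.0):
    """Per-pixel weights in (0, 1) from channel-wise mean and max maps.
    Simplification: a 1x1 mix of the two maps instead of a 7x7 convolution."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    att = []
    for i in range(H):
        row = []
        for j in range(W):
            vals = [feat[c][i][j] for c in range(C)]
            row.append(sigmoid(w_avg * sum(vals) / C + w_max * max(vals) + bias))
        att.append(row)
    return att


def cbam(feat, w1, w2, w_avg, w_max):
    """Apply channel attention first, then spatial attention (serial order)."""
    ca = channel_attention(feat, w1, w2)
    feat = [[[ca[c] * v for v in row] for row in ch] for c, ch in enumerate(feat)]
    sa = spatial_attention(feat, w_avg, w_max)
    return [[[sa[i][j] * v for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in feat]
```

Because both attention maps lie in (0, 1), the block can only attenuate responses; which channels and locations are suppressed is determined by the learned weights.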
However, deep networks are prone to losing fine lithological details during repeated downsampling processes, which is particularly pronounced in extremely thin interbedded layers. To address this, the AFPN module abandons the standard feature stacking typical of traditional FPNs by utilizing ASFF to dynamically assign multi-level weights to each pixel location. This mechanism allows the model to prioritize high-frequency spatial details at lithological boundaries while emphasizing deep semantic context within rock bodies. It compensates for the limitations of traditional networks in segmenting thin layers, yielding a 13.98% performance gain.
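The per-pixel weighting described above can be sketched as a softmax over pyramid levels followed by a weighted sum. The example below is a simplified, single-channel illustration that assumes the feature maps have already been resized to a common resolution; in ASFF the weight logits are typically produced by 1×1 convolutions on the resized features, whereas here they are passed in directly. The function name `asff_fuse` is hypothetical.

```python
import math


def asff_fuse(features, logits):
    """Fuse L same-sized feature maps with per-pixel softmax weights.

    features, logits: lists of L maps, each H x W (nested lists).
    At every pixel the weights over the L levels are positive and sum to 1.
    """
    L, H, W = len(features), len(features[0]), len(features[0][0])
    fused = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            z = [logits[k][i][j] for k in range(L)]
            m = max(z)  # subtract the max for numerical stability
            e = [math.exp(v - m) for v in z]
            s = sum(e)
            fused[i][j] = sum(e[k] / s * features[k][i][j] for k in range(L))
    return fused


# Two levels on a 1 x 2 map: the first pixel favours the shallow (detail) level,
# the second favours the deep (semantic) level.
shallow = [[10.0, 10.0]]
deep = [[2.0, 2.0]]
weights = [[[4.0, -4.0]], [[-4.0, 4.0]]]
fused = asff_fuse([shallow, deep], weights)
```

In this toy case the fused value at the first pixel stays close to the shallow feature and the value at the second pixel stays close to the deep feature, which mirrors the boundary-versus-interior behaviour described in the text.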
Overall, AFPN-ResUNet constructs a coupled segmentation paradigm of feature purification and semantic fusion. If multi-scale fusion is applied directly to unfiltered features, background noise is carried along with the lithological signal and propagates across pyramid levels. RE-CBAM mitigates this risk by feeding relatively clean signals into the fusion process. Subsequently, AFPN reconstructs the thin-layer lithological features lost during the downsampling and denoising processes through a progressive feature fusion approach. This complementary design preserves the delicate texture boundaries within extremely thin interbedded structures and maintains the overall semantic coherence of rock masses, thereby surpassing the segmentation performance of either module used alone.
5.3. Balance Between Segmentation Precision and Inference Speed
While AFPN-ResUNet achieves a significant increase in accuracy, it also incurs the highest single-image inference latency (33.99 ms) among the compared models. This increased load stems from three primary architectural factors. First, the attention mechanism creates a bottleneck in parallelization, as the core channel-wise global pooling and multi-layer perceptron (MLP) within the RE-CBAM rely on serial computation. This dependency limits GPU acceleration efficiency, increasing the inference time by 26.3% relative to the Baseline. Second, the complexity of feature interaction tends to offset the parameter reduction: despite AFPN’s lower parameter count, its frequent resolution alignment and ASFF weight generation require high-density matrix operations. Third, system-level module coordination adds scheduling overhead, because the feature flow must sequentially traverse RE-CBAM recalibration, AFPN multi-stage fusion, and decoder upsampling, thereby accumulating latency.
The justification for this computational cost generally depends on the application context. For offline precision mapping, where the cost of rectifying geological misinterpretations can be significant, the 12.38% accuracy gain over PSPNet often outweighs the computational investment. In such cases, the 33.99 ms latency typically falls within acceptable limits for routine data processing workflows. Conversely, for real-time outcrop segmentation tasks, a throughput of 29.42 FPS could potentially limit the overall system responsiveness, and alternatives like DeepLabV3+ (52.98 FPS) might present a more practical balance between speed and accuracy.
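The throughput figures above are the reciprocal of the single-image latency. The sketch below shows the conversion, assuming a batch size of one and no overlap between consecutive inferences; the function names are illustrative.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Frames per second for a given single-image latency in milliseconds."""
    return 1000.0 / latency_ms


def latency_from_fps(fps: float) -> float:
    """Single-image latency in milliseconds for a given throughput."""
    return 1000.0 / fps


print(f"AFPN-ResUNet: {fps_from_latency(33.99):.2f} FPS")
print(f"DeepLabV3+:   {latency_from_fps(52.98):.2f} ms per image")
```

Applied to the reported values, a latency of 33.99 ms corresponds to the stated 29.42 FPS, and the 52.98 FPS reported for DeepLabV3+ corresponds to roughly 19 ms per image.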
5.4. Study Limitations and Future Directions
5.4.1. Dataset Limitations and Spatial Generalization
In this study, 20 high-resolution outcrop images (5472 × 3648 pixels) from the Zhuozishan area were selected for manual annotation. Constrained by the relatively small initial sample size, even with the training set expanded through cropping and data augmentation, the model may still be restricted to specific scenes and affected by spatial autocorrelation, which may limit generalization to unseen scenarios. Additionally, this research focused primarily on sandstone and mudstone within the deep-water sedimentary environment of the Zhuozishan area, without covering other geological settings, even though outcrop textures and structural morphologies can vary considerably across different geological environments. While the model performed well on an independent and unseen test set, this performance reflects its reliability only within the Zhuozishan outcrop region. Cross-regional validation would be required to assess broader generalizability.
Furthermore, outcrop images are frequently affected by variations in illumination, weathering, and vegetation. Coupled with the fact that pixel-level annotation of high-frequency interbedded structures relies heavily on expert experience, high-quality public datasets in this field remain scarce. Given these data constraints, we aimed to design a segmentation model for ultra-thin sandstone-mudstone interbeds. While the cross-domain generalizability of the model requires further verification, its performance in the detailed characterization of complex thin beds in the Zhuozishan area demonstrates its application value. Future work will focus on constructing comprehensive outcrop datasets that encompass multiple lithologies, environments, and regions to further evaluate and enhance the applicability and robustness under complex geological conditions.
5.4.2. Limitations of Spatial Resolution and Boundary Ambiguity
The experimental results indicate that, despite the introduction of class weights to mitigate majority-class bias, the recall rate for thin mudstone layers remains comparatively low. This bottleneck highlights a complex interplay between the physical limits of image resolution and inherent annotation uncertainty.
Fundamentally, although our imagery possesses millimeter-level resolution, the extremely thin sandstone–mudstone interbeds (e.g., 1–2 cm) often occupy only a few pixels in the image. At this scale, the mixed-pixel effect remains significant: pixels at the boundaries of thin layers frequently incorporate spectral information from the surrounding rock, causing intrinsic textural features to be smoothed or obscured. This physically constrains the model’s capacity to resolve subtle boundaries.
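The scale argument can be illustrated with two simple relations: the number of pixels spanning a layer is its thickness divided by the ground sampling distance (GSD), and a boundary pixel can be approximated by a linear mixture of the two lithologies it covers. In the sketch below, the 5 mm GSD and the gray levels 60 and 180 are hypothetical values chosen for illustration, not measurements from this study.

```python
def pixels_across(thickness_mm: float, gsd_mm: float) -> float:
    """Number of pixels spanning a layer of the given thickness."""
    return thickness_mm / gsd_mm


def mixed_pixel(value_a: float, value_b: float, frac_a: float) -> float:
    """Linear mixing model: pixel value when a fraction `frac_a` of the
    pixel footprint is covered by material A and the rest by material B."""
    return frac_a * value_a + (1.0 - frac_a) * value_b


GSD_MM = 5.0  # hypothetical ground sampling distance
for thickness_mm in (10.0, 20.0):  # the 1-2 cm interbeds mentioned in the text
    n = pixels_across(thickness_mm, GSD_MM)
    print(f"{thickness_mm:.0f} mm layer spans {n:.0f} pixels")

# Hypothetical gray levels: mudstone 60, sandstone 180.
# A boundary pixel half covered by each lithology matches neither.
print(mixed_pixel(60.0, 180.0, 0.5))
```

Under these assumed values a thin bed spans only two to four pixels, and up to two of those are boundary pixels whose intensity lies between the two lithologies, which is consistent with the smoothing effect described above.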
Moreover, this physical limitation amplifies the challenges of ground-truth delineation. Since dataset construction relies on manual interpretation, boundary ambiguity caused by both mixed pixels and natural lithological transitions introduces inherent variance into the annotation process. Consequently, observational noise and uncertainty are structurally integrated into the training data.
Given these compounding factors, the precise extraction of thin-layer details at the resolution limit remains inherently constrained at the current scale. Future research could address this bottleneck by exploring multi-source data fusion or sub-pixel analysis techniques to transcend current feature representation constraints.
5.4.3. Dimensional Constraints and Extension to 3D Analysis
The current methodology operates within a two-dimensional (2D) optical space, focusing on the extraction of surface lithological patterns. However, comprehensive geological interpretation relies on understanding three-dimensional (3D) spatial structures, such as true stratigraphic thickness, dip angles, and subsurface structural continuity. Because 2D optical segmentation is based on surface projections, it captures apparent boundaries rather than volumetric parameters, presenting a natural constraint for full spatial analysis.
Despite these dimensional constraints, highly accurate 2D segmentation serves as a crucial prerequisite for advanced 3D geological modeling. Reliable 2D pixel-level classification provides an essential semantic foundation—precisely delineating lithological boundaries and surface facies distributions. Without this high-fidelity semantic mapping, 3D point clouds or structural models lack the necessary geological context. Therefore, the proposed 2D segmentation functions not merely as a preliminary tool, but as a “semantic layer” required for informing and enriching spatial models.
Building upon this foundation, a logical trajectory for future research involves the fusion of these 2D semantic masks with 3D UAV photogrammetric products. By projecting the high-resolution 2D classifications into a 3D coordinate system, future studies can transition from surface mapping to quantitative volumetric analysis, enabling the automatic extraction of true spatial metrics and offering a more comprehensive framework for 3D geological characterization.
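One common way to attach 2D semantic labels to 3D photogrammetric points is to project each point into the image and read the class at the resulting pixel. The sketch below illustrates this with an ideal pinhole camera; it is a hypothetical outline rather than the procedure of this study, and it assumes the points are already expressed in camera coordinates while ignoring lens distortion and occlusion. The function `label_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative names.

```python
def label_points(points_cam, mask, fx, fy, cx, cy):
    """Assign a 2D mask label to each 3D point given in camera coordinates.

    points_cam: iterable of (x, y, z) with z pointing away from the camera.
    mask: H x W nested list of class labels.
    Returns one label per point, or None when the point lies behind the
    camera or projects outside the image.
    """
    h, w = len(mask), len(mask[0])
    labels = []
    for x, y, z in points_cam:
        if z <= 0:
            labels.append(None)  # behind the camera
            continue
        u = int(round(fx * x / z + cx))  # nearest pixel column
        v = int(round(fy * y / z + cy))  # nearest pixel row
        inside = 0 <= v < h and 0 <= u < w
        labels.append(mask[v][u] if inside else None)
    return labels


# Toy example: a 2 x 2 mask (0 = sandstone, 1 = mudstone) and three points.
mask = [[0, 1],
        [1, 0]]
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
labels = label_points(points, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

A practical pipeline would additionally need the world-to-camera transform for each image, a visibility test so that hidden points do not inherit labels, and a rule for reconciling labels from overlapping images.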
5.4.4. Data Modality and Geological Domain Knowledge
Finally, two further limitations concern data modality and domain knowledge. Relying solely on a single optical source often restricts the ability to penetrate vegetation canopies or deep weathering crusts. Integrating multi-modal data, such as hyperspectral imagery, can help to capture fine mineralogical signatures and significantly enhance surface interpretation. Furthermore, the current framework operates primarily as a data-driven paradigm with limited explicit constraints from earth science principles. Future endeavors could explore constructing Geology-Informed Neural Networks by embedding prior knowledge, such as stratigraphic superposition rules and spatial topological constraints, into the deep learning architecture. This approach would help the network to internalize geological logic, thereby minimizing segmentation predictions that contradict fundamental geological evolution laws.
6. Conclusions
This research addresses the intricate task of delineating lithological boundaries within field geological outcrops, a scenario frequently complicated by background complexity and significant scale variations within thin interbeds. To tackle these issues, we introduce AFPN-ResUNet, a semantic segmentation framework designed to capture multi-scale features and mitigate background interference. Through rigorous empirical testing and visual feature analysis, we confirm the robust performance of our network while simultaneously shedding light on its underlying decision-making processes.
The segmentation results show that the model exhibits favorable segmentation capability under complex field outcrop conditions. Even in the presence of severe illumination unevenness and vegetation occlusion, the model can accurately extract lithological features. For extremely thin sandstone–mudstone interbeds, the proposed method preserves high geometric continuity and sharp edge detail, surpassing traditional approaches. Comparative experiments show that the AFPN-ResUNet model achieves a 93.41% mIoU on the test set. This result suggests that architecture optimization for the scale variations and textural characteristics of geological objects may offer advantages over directly applying generic computer vision models.
Ablation experiments and mechanism analysis further indicate that the observed performance gain is largely attributable to the complementary roles of the RE-CBAM and AFPN modules. Within this framework, RE-CBAM functions as an attention-weighted filter that adaptively suppresses non-geological background interference, while AFPN serves as a gradual semantic bridge, reconciling hierarchical discrepancies across feature scales. This dual design helps mitigate the loss of high-frequency spatial details in deep networks and notably strengthens the model’s capacity to delineate thin interbedded structures.
Future work could explore lightweight models to accommodate real-time UAV-based geological interpretation. In parallel, the incorporation of multispectral or hyperspectral data may provide additional lithological constraints that benefit segmentation in complex outcrop settings. Furthermore, coupling such segmentation architectures with explicit geological prior knowledge offers a potential pathway toward a “data–knowledge” dual-driven paradigm. Such an integration could improve generalization and interpretability across geologically diverse environments.