AFPN-ResUNet: A Residual Attention Mechanism-Guided Asymptotic Feature Pyramid Network for Complex Outcrop Lithology Segmentation

Tang, Mingming; Fu, Kang; Tian, Lei; Chen, Wanxin; Li, Yuhan; Zhang, Zongxu; Ma, Zhiyuan

doi:10.3390/rs18101457


by Mingming Tang ¹,², Kang Fu ¹,*, Lei Tian ¹, Wanxin Chen ¹, Yuhan Li ¹, Zongxu Zhang ¹ and Zhiyuan Ma ¹

¹ School of Geosciences, China University of Petroleum (East China), Qingdao 266580, China

² State Key Laboratory of Deep Oil and Gas, China University of Petroleum (East China), Qingdao 266580, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1457; https://doi.org/10.3390/rs18101457

Submission received: 6 March 2026 / Revised: 24 April 2026 / Accepted: 1 May 2026 / Published: 7 May 2026


Highlights

What are the main findings?

An AFPN-ResUNet architecture, integrating a residual attention mechanism and an asymptotic Feature Pyramid Network, is proposed for the high-precision lithological segmentation of complex outcrops.
The structurally optimized RE-CBAM effectively suppresses environmental artifacts while precisely preserving critical geological boundary details.

What are the implications of the main findings?

By dynamically recalibrating network attention toward salient lithological features, RE-CBAM significantly enhances segmentation robustness and accuracy in highly heterogeneous field environments.
The AFPN paradigm progressively resolves semantic discrepancies across feature levels, unlocking a robust solution for the precise and continuous delineation of ultra-thin sand–mudstone interbeds.

Abstract

Although the accurate lithological segmentation of outcrops plays a key role in hydrocarbon exploration, complex field environments and substantial scale variations within outcrops, particularly in extremely thin sand–mudstone interbeds, present considerable obstacles to precise segmentation. To overcome these complexities, we propose a Residual Attention Mechanism-Guided Asymptotic Feature Pyramid Network (AFPN-ResUNet). This architecture employs a structurally optimized RE-CBAM, which seamlessly integrates a Convolutional Block Attention Module (CBAM) into the residual network framework. This mechanism dynamically recalibrates channel and spatial feature responses, thereby effectively suppressing background artifacts while accentuating salient geological boundaries. Furthermore, we abandon traditional naive feature concatenation and instead utilize automatically generated spatially adaptive weights to guide the asymptotic fusion of features across different layers. This asymptotic fusion strategy effectively resolves the semantic discrepancies between distinct network levels, preserving the fine-grained spatial details crucial for delineating ultra-thin interbedded lithologies. To evaluate the architecture, a dedicated outcrop dataset was constructed. Compared to representative baselines (UNet, Vision Transformer, DeepLabV3+, PSPNet, and SegNeXt), AFPN-ResUNet achieves an mIoU of 93.41%, outperforming the baseline models by margins of 23.20%, 23.92%, 12.40%, 12.38%, and 26.04%, respectively. Additionally, ablation studies indicate that incorporating RE-CBAM and AFPN modules improves the mIoU by 13.11% and 13.98% over the backbone, respectively. These quantitative results demonstrate that AFPN-ResUNet effectively mitigates boundary blurring and preserves spatial continuity, an advantage visually corroborated by the Grad-CAM heatmaps. 
Notably, despite a relatively longer inference latency (33.99 ms), the model maintains a low computational overhead (179.79 G FLOPs), underscoring its practical application potential for outcrop lithology segmentation.

Keywords:

outcrop lithology segmentation; AFPN-ResUNet; attention mechanism; feature pyramid network; thin layers

1. Introduction

Geological outcrops serve as natural archives exposed at the Earth’s surface by tectonic and sedimentary processes, preserving multi-scale information ranging from microscopic lithological fabrics to macroscopic structural deformations [1]. Within sedimentary basins, alternating sequences of sandstone and mudstone are particularly prevalent. The accurate delineation of these lithologies, especially when identifying extremely thin mudstone interbeds, is fundamental for the transition from qualitative description to quantitative, refined characterization [2,3,4]. Such detailed characterization provides critical evidence for unraveling geodynamic processes and reconstructing paleogeographic environments [5]. However, field outcrops are frequently compromised by long-term weathering and are often obscured by surface covers such as vegetation and debris, rendering geological semantic information fragmented and ambiguous [6]. Consequently, the precise and automated extraction of key geological semantics from complex field settings remains a significant challenge in geological exploration [7,8].

The lithological identification of geological outcrops has traditionally relied primarily on field observation, geophysical interpretation, and geochemical analysis by geological experts [9,10,11]. Although these traditional methods can usually achieve satisfactory results, they demand extensive domain expertise and are time-consuming and labor-intensive [12,13]. Crucially, constrained by the physical accessibility of sampling sites, traditional approaches pose safety hazards and struggle to generate continuous, high-resolution representations of the entire outcrop face [14,15,16]. Consequently, there is a critical need to develop more efficient, automated methodologies to overcome these limitations.

In recent years, the advancement of oblique aerial photogrammetry using Unmanned Aerial Vehicles (UAVs) and high-precision digital sensing technology has generated a wealth of outcrop imagery, providing a new perspective for geological research [17,18]. In this context, integrating computer vision technology into geological image analysis has emerged as an essential strategy to overcome the spatiotemporal limitations of traditional methods [19,20]. Driven by breakthroughs in deep learning, image segmentation techniques grounded in Convolutional Neural Networks (CNNs) have demonstrated considerable efficacy across medical diagnosis, autonomous driving, and remote sensing [21,22,23]. In the earth sciences, segmentation methods such as FCN, U-Net, and the DeepLab series have gradually replaced traditional shallow machine learning algorithms, showing the potential for pixel-level classification [24,25]. For instance, Zhu et al. [26] introduced RockNet, which captures textural fingerprints of rock surfaces via an end-to-end feature enhancement strategy. Jing et al. [27] constructed a network architecture designed to fuse multi-modal features that enhances the characterization ability of complex lithologies by fusing spectral and texture information. Addressing data scarcity, Qin et al. [28] employed unsupervised superpixel segmentation techniques to achieve an effective discrimination of clastic rock particles in scenarios with limited annotated data. Collectively, these studies validate the capacity of deep learning to extract complex, non-linear features in images, laying a robust foundation for the intelligent identification of geological outcrops [12].

Despite these advancements, the existing semantic segmentation models encounter significant bottlenecks when applied to specific geological scenarios, particularly sandstone–mudstone interbedded formations [29]. This research gap primarily stems from two geological challenges:

First, the “Noise–Boundary” Dilemma: Extensive weathering and environmental noise (e.g., uneven illumination, vegetation shadows) cause pronounced boundary ambiguity between lithologies [30]. Standard CNNs often struggle to filter out this background noise, leading to misclassified boundaries [31,32]. Recently, several advanced frequency-based architectures have been proposed in remote sensing to address similar issues by isolating noise and emphasizing boundary features in the frequency domain. For instance, Guo et al. [33] developed a knowledge distillation framework that decouples high-frequency structural boundaries from low-frequency semantic backgrounds, effectively minimizing background interference in UAV-based assessments. Similarly, Yu [34] proposed FHSA, a framework that employs hybrid sequence attention to specifically suppress environmental noise in the frequency domain while preserving critical textures. Furthermore, Li et al. [35] introduced a frequency-aware Transformer that utilizes wavelet decomposition to adaptively integrate multi-resolution subbands, filtering out irrelevant noise subbands to sharpen semantic boundaries. These methods attempt to isolate noise and decorrelate complex features in the frequency domain to sharpen semantic boundaries. However, their efficacy in geological outcrops remains limited, as the stochastic texture of weathered rock surfaces often overlaps with semantic information in the high-frequency spectrum, making pure frequency-based separation challenging.

Second, the “Scale–Loss” Effect: Sedimentary strata exhibit extreme scale variations, characterized by extremely thin mudstone layers embedded within thick sandstone beds. The downsampling operations inherent to deep networks often cause the texture information of these thin rock layers to be diluted or even lost during multiple convolutions, resulting in poor segmentation accuracy [36,37]. Recently, various multi-scale context aggregation and attention-based modules have been proposed to dynamically capture features across different receptive fields. As an illustration, Chen et al. [38] developed GLDSFNet, a segmentation network integrating global attention mechanisms with multi-size deformable convolutions to dynamically expand receptive fields, effectively capturing fine boundaries and global–local multi-scale information. Similarly, Liang and Li [39] proposed ScaleRSNet, a contextual attention-based framework that utilizes multi-rate dilated convolutions alongside spatial attention to precisely extract and combine features across different receptive fields. However, despite these modifications, CNN-based methods fundamentally rely on localized window operations. While techniques like dilated or deformable convolutions expand the receptive field to some extent, they still suffer from an inherent local bias and struggle to model long-range spatial dependencies. Consequently, they often fail to track continuous but extremely thin geological features that span across an entire image.

To overcome the spatial limitations and inherent local bias of traditional convolution operations, researchers have increasingly turned to Vision Transformers (ViTs) as a robust structural solution [40]. Because traditional CNNs struggle to track continuous but extremely thin geological features across an entire image, ViTs provide a distinct advantage through their self-attention mechanisms. By establishing spatial correlations across the entire image regardless of distance, ViTs theoretically excel at maintaining the structural integrity and continuity of elongated thin beds [40,41]. Nevertheless, applying these architectures to geological scenarios is severely hindered by their massive appetite for training data [41]. The prohibitive cost and profound scarcity of expert-level, pixel-wise annotations in this specific domain make the training of high-performing ViTs largely impractical [40,41,42].

In this study, to bridge this technical gap, we propose the AFPN-ResUNet framework, a high-precision semantic segmentation architecture tailored to ultra-thin sand–mudstone interbeds in complex field outcrops. The framework rests on two architectural changes. First, regarding feature fusion, we depart from the direct concatenation conventionally used in standard skip connections and adopt an Adaptive Spatial Feature Fusion (ASFF) strategy [43] instead. By fusing features dynamically and smoothly according to the contribution weights of different feature levels, this strategy mitigates the semantic loss of thin layers caused by successive convolutional operations in deep networks and bridges the semantic gap between low-level structural details and high-level abstract contexts [43,44]. Second, to counteract the noise interference inherent in complex outcrop backgrounds, a Convolutional Block Attention Module (CBAM) is integrated into the residual network [45,46]. Through its channel and spatial attention mechanisms, this integration enables the model to filter out cluttered background noise and concentrate on fundamental lithological textures and micro-geological features.

The primary innovations presented in this research include the following: (1) we introduce a symmetric semantic segmentation framework, designated as AFPN-ResUNet, designed to tackle the difficulties associated with classifying lithology in complex field outcrops with extreme scale variations; (2) a CBAM is embedded into the ResNet50 encoder, which allows the model to adaptively emphasize discriminative lithological textures while suppressing environmental interference; (3) an asymptotic pyramidal architecture with a cross-scale feature fusion mechanism is deployed along the skip pathways, which aims to mitigate the semantic gap across different deep layers.

Section 2 of this paper introduces the study area and data sources, detailing the data processing workflow, evaluation metrics, and training methodologies. Subsequently, Section 3 focuses on the proposed AFPN-ResUNet architecture, elaborating on specific module designs, the loss function, and the Grad-CAM++ visualization method. Section 4 comprehensively presents the experimental results, encompassing model training and segmentation performance, comparative experiments, and ablation studies. Finally, Section 5 provides a relevant discussion of our findings, and Section 6 delivers the conclusions.

2. Data

2.1. Study Area

The research site is located on the southern edge of the Ordos Basin within Inner Mongolia, in the Zhuozishan region of the Hainan District, Wuhai City (106°53′4″E, 39°25′33″N), as illustrated in Figure 1a. In ascending stratigraphic order, the Ordovician strata consist of the Wulalike, Lashizhong, Gongwusu, and Sheshan formations. The target interval, the Lashizhong Formation, consists primarily of gray-green sandstone and siltstone, with interbedded mudstone. Although the study area generally exhibits distinct lithological assemblages, pervasive surface weathering, erosional debris, and vegetation cover significantly complicate lithological segmentation. The field outcrop of the Zhuozishan area is shown in Figure 1b.

The sampling locations were spatially distributed across the target area, covering approximately 20 km². This design systematically captures complex field conditions while ensuring data representativeness and minimizing spatial autocorrelation. Rather than concentrating on a single outcrop, the sampling sites were selected along a north–south transect of the Lashizhong Formation. The distance between adjacent sampling sites ranged from 0.2 to 1.5 km. This geographically dispersed sampling strategy ensures the images encompass a wide spectrum of depositional micro-environments, weathering degrees, and illumination conditions from different slope aspects, maximizing the independence and diversity of the source data.

2.2. Data Acquisition

The foundation of our deep learning model is a novel dataset consisting of 20 raw outcrop photographs (5472 × 3648 pixels). RGB imagery was captured using a drone-mounted high-resolution optical sensor (24 mm equivalent focal length, f/2.8–f/11 aperture range; refer to Figure 2). By strictly maintaining a flight altitude of 5–10 m and a 45° oblique perspective, our specific flight strategy ensured precise spatial accuracy and high-fidelity inputs, yielding a Ground Sampling Distance (GSD) of roughly 1.9–3.9 mm/pixel.

Following data acquisition, fine pixel-level semantic annotation was performed using CorelDRAW Graphics Suite 2020. To ensure geological accuracy, the annotations were conducted by two scholars from our team, each with a Ph.D. in sedimentology and over five years of specialized experience in sedimentary petrology. Furthermore, a cross-validation protocol was adopted, wherein ambiguous boundaries were resolved through consensus. The geological scene was divided into five core categories, each assigned a unique RGB color code for mask generation. The target classes were sandstone (216, 38, 40) and mudstone (115, 193, 90), and the background classes were sky (136, 189, 215), weathering detritus (161, 161, 161), and vegetation (235, 237, 143). The raw data encompassed a variety of lighting conditions and complex backgrounds, ensuring baseline diversity.

Due to the large dimensions of the raw field images, feeding them directly into the neural network would cause GPU memory overflow and reduce computational efficiency. Therefore, the raw images needed to be cropped into standardized patches of 512 × 512 pixels. To prevent spatial autocorrelation and guarantee strict separation, the twenty original images were distributed according to an 8:1:1 split. This allocation provided sixteen full images for training purposes, while the validation and testing phases each received two distinct images. After this image-level partitioning, the raw images and their corresponding ground-truth masks were cropped synchronously. The final base dataset comprised 5950 image patches and their respective masks, consisting of 4760 training patches, 595 validation patches, and 595 test patches. The data processing flowchart is shown in Figure 3.
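The order of operations described above (split whole images first, crop afterwards) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the random seed, and the non-overlapping stride of 512 pixels are assumptions. The stride in particular is not stated in the text, so the patch count printed here is not expected to reproduce the 5950 patches reported above.

```python
import random

def split_images(image_ids, ratios=(8, 1, 1), seed=0):
    """Shuffle whole images and split them by ratio before any cropping."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def patch_origins(width, height, patch=512, stride=512):
    """Top-left corners of all patches that fit fully inside the image.

    The same origins would be used for the image and its mask so that
    both are cropped synchronously.
    """
    xs = range(0, width - patch + 1, stride)
    ys = range(0, height - patch + 1, stride)
    return [(x, y) for y in ys for x in xs]

train, val, test = split_images(range(20))
print(len(train), len(val), len(test))      # 16 2 2
print(len(patch_origins(5472, 3648)))       # 70 (assumed stride of 512)
```

Because the split happens at the image level, no patch from a validation or test photograph can share pixels with a training patch, whatever stride is chosen.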

Moreover, to broaden the dataset’s variety and improve the network’s predictive performance on unseen data, we applied dynamic data augmentation during training. The augmentation comprised photometric changes (color jittering with brightness and contrast at 0.3, saturation at 0.2, and hue at 0.1) and spatial flips (random horizontal and vertical flips with a 50% probability), and was executed exclusively on the training examples. Because the training, validation, and test sets are drawn from separate source images, this strategy avoids data leakage and, to some extent, strengthens the network’s robustness and capacity to generalize.
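A point worth making explicit is that spatial flips must be applied identically to an image and its mask, whereas photometric changes touch the image only. The sketch below illustrates this under simplifying assumptions: it works on a single-channel image stored as nested lists, models only a brightness gain (not the full contrast, saturation, and hue jitter), and the function name `augment` is hypothetical.

```python
import random

def augment(image, mask, rng, p_flip=0.5, brightness=0.3):
    """image: H x W grey values in [0, 1]; mask: H x W class ids."""
    # Photometric change: applied to the image only, never to the labels.
    gain = 1.0 + rng.uniform(-brightness, brightness)
    image = [[min(1.0, max(0.0, v * gain)) for v in row] for row in image]
    # Geometric changes: applied identically to image and mask.
    if rng.random() < p_flip:       # horizontal flip
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    if rng.random() < p_flip:       # vertical flip
        image = image[::-1]
        mask = mask[::-1]
    return image, mask

rng = random.Random(0)
img = [[0.1, 0.2], [0.3, 0.4]]
msk = [[0, 1], [2, 3]]
print(augment(img, msk, rng))
```

Calling the function anew for every training batch gives the "dynamic" behaviour described above, since each epoch sees a different random variant of the same patch.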

3. Materials and Methods

3.1. Overview of the Network Framework

The complex background interference and extreme variations in rock layer thickness in geological outcrop images pose a severe challenge to traditional segmentation methods. To overcome this issue, we propose a symmetrical segmentation framework, AFPN-ResUNet. As shown in Figure 4, this architecture primarily consists of two core components working in synergy: a ResNet encoder enhanced by the CBAM (denoted as RE-CBAM) and the AFPN. The corresponding source code is provided in the Supplementary Materials.

The encoder employs ResNet-50, with the CBAM embedded in residual bottleneck blocks. By applying channel and spatial dual attention, this module suppresses background noise and amplifies key lithological features, ensuring the purity of semantic input from the outset. Furthermore, to address the segmentation challenge of sedimentary rock layers with extreme thickness variations, an AFPN is introduced at the skip connections. As the encoder–decoder hub, the AFPN uses ASFF to enable multi-level feature interaction and capture high-level semantics from deeper layers. This mechanism effectively bridges the semantic gap between hierarchical layers. It enables the model to dynamically balance global context and local detail, improving the detection of thin interbeds and small geological bodies.

Finally, a symmetric decoding architecture is employed to accurately map the fused high-quality multi-scale features back to the original resolution. Instead of directly utilizing the raw features from the bottom of the encoder, this module takes the bottom features enhanced by AFPN (P₁) as input and incorporates the refined multi-scale features (P₂, P₃, P₄) via skip connections. Spatial details are then restored through cascaded upsampling. Then, to recover edge information lost during multiple downsampling, an Initial Feature Extraction Stem (IFES) layer is connected to the final decoder block (Dec4) through a skip connection. This ensures that the geometric continuity of lithological boundaries and the accuracy of pixel-level classification are effectively preserved when restoring the original image resolution.

3.2. CBAM Residual Module (RE-CBAM)

Rather than simply appending attention mechanisms at the network’s tail, the CBAM is seamlessly integrated into the main pathway of each residual block, as illustrated in Figure 5. This design ensures that the attention mechanism can adaptively refine the convolutional features at multiple network levels. Crucially, by applying attention within the residual branch, the architecture preserves the identity mapping pathway. This allows the network to benefit from enhanced feature representation without compromising the residual skip connections, thereby maintaining stable gradient flow and avoiding the vanishing gradient problem inherent in deep networks.

Specifically, for the input feature tensor $X \in \mathbb{R}^{C \times H \times W}$, let $\mathcal{F}(\cdot)$ denote the convolution operations in the residual block. The intermediate feature $F = \mathcal{F}(X)$, obtained after the convolution transformation, is then refined sequentially by the channel attention module ($M_{c}$) and the spatial attention module ($M_{s}$). The final residual block output $Y$ is formulated as follows:

$$F' = M_{c}(F) \otimes F \tag{1}$$

$$F'' = M_{s}(F') \otimes F' \tag{2}$$

$$Y = \mathrm{ReLU}(F'' + X) \tag{3}$$

where $\otimes$ indicates element-wise multiplication, and $F''$ represents the residual feature weighted by the dual attention mechanisms.
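The order of operations in Equations (1)–(3) can be traced with a small pure-Python sketch. It is illustrative only: tensors are stored as nested lists, the function name `re_cbam_block` is hypothetical, and the two attention maps are passed in as precomputed values, whereas in the network they are computed from $F$ and $F'$.

```python
def re_cbam_block(x, f, channel_map, spatial_map):
    """Combine Equations (1)-(3) for tensors stored as nested lists.

    x, f        : [C][H][W] block input X and convolved feature F
    channel_map : [C]       values of M_c(F), one per channel
    spatial_map : [H][W]    values of M_s(F'), one per pixel
    """
    out = []
    for c, plane in enumerate(f):
        out_plane = []
        for i, row in enumerate(plane):
            out_row = []
            for j, value in enumerate(row):
                f1 = channel_map[c] * value        # Eq. (1): channel reweighting
                f2 = spatial_map[i][j] * f1        # Eq. (2): spatial reweighting
                # Eq. (3): add the untouched identity path, then ReLU
                out_row.append(max(0.0, f2 + x[c][i][j]))
            out_plane.append(out_row)
        out.append(out_plane)
    return out

x = [[[1.0, -2.0]]]   # C = 1, H = 1, W = 2
f = [[[2.0, 1.0]]]
print(re_cbam_block(x, f, channel_map=[0.5], spatial_map=[[1.0, 0.0]]))
# [[[2.0, 0.0]]]
```

Note that the attention weights scale only the residual branch; the input `x` reaches the sum unchanged, which is what keeps the identity mapping intact.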

3.2.1. Channel Attention Module

This sub-module focuses on modeling the inter-dependencies across feature channels, allowing the network to emphasize channels carrying crucial geological semantics. By jointly executing global max-pooling and average-pooling, the spatial resolution of the incoming feature tensor is collapsed, effectively summarizing the spatial details. For a given input tensor $F$, this yields two distinct spatial context descriptors, $F_{avg}^{c}$ and $F_{\max}^{c}$. To learn complex, non-linear cross-channel relationships, these pooled descriptors are passed through a shared multi-layer perceptron (MLP) consisting of two fully connected layers joined by a Rectified Linear Unit (ReLU) activation. The resulting channel attention map, $M_{c}(F) \in \mathbb{R}^{C \times 1 \times 1}$, is defined as follows:

$$M_{c}(F) = σ(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) = σ(W_{1}(δ(W_{0}(F_{avg}^{c}))) + W_{1}(δ(W_{0}(F_{\max}^{c})))) \tag{4}$$

where $σ$ and $δ$ denote the Sigmoid and ReLU activation functions, respectively. The variables $W_{0} \in \mathbb{R}^{C / r \times C}$ and $W_{1} \in \mathbb{R}^{C \times C / r}$ are the learnable weight matrices of the MLP, and $r$ is the reduction factor of the bottleneck. By applying this attention formulation, the model diminishes the influence of irrelevant channel representations that do not contribute to the geological analysis.
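Equation (4) can be illustrated with a small pure-Python sketch. The weights below are hand-picked toy values for a two-channel input with $r = 2$, not trained parameters, and the helper names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(w, v):
    return [sum(a * b for a, b in zip(row, v)) for row in w]

def shared_mlp(w0, w1, v):
    """W1(ReLU(W0 v)); the same weights serve both pooled descriptors."""
    hidden = [max(0.0, h) for h in matvec(w0, v)]
    return matvec(w1, hidden)

def channel_attention(f, w0, w1):
    """f: [C][H][W]; w0: [C/r][C]; w1: [C][C/r]. Returns M_c(F) as a list of C gates."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in f]
    mx = [max(max(row) for row in ch) for ch in f]
    a = shared_mlp(w0, w1, avg)
    m = shared_mlp(w0, w1, mx)
    return [sigmoid(p + q) for p, q in zip(a, m)]

f = [[[1.0, 3.0], [1.0, 3.0]],      # channel 0: avg 2, max 3
     [[0.0, 0.0], [0.0, 0.0]]]      # channel 1: avg 0, max 0
w0 = [[1.0, 1.0]]                   # shape C/r x C = 1 x 2
w1 = [[1.0], [-1.0]]                # shape C x C/r = 2 x 1
print([round(v, 4) for v in channel_attention(f, w0, w1)])  # [0.9933, 0.0067]
```

With these toy weights the first channel is passed almost unchanged and the second is almost fully suppressed, which is the gating behaviour the module is meant to learn.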

3.2.2. Spatial Attention Module

Building upon the channel dimension enhancements, a spatial focusing mechanism is applied to filter out background clutter and accentuate key geological targets. The initial step squeezes the input feature map along the channel axis via parallel max-pooling and average-pooling, which results in a pair of single-channel matrices, $F_{avg}^{s}, F_{\max}^{s} \in \mathbb{R}^{1 \times H \times W}$. These two matrices are then concatenated along the channel axis. To assimilate wide-ranging spatial context, the concatenated tensor is passed through a convolution with a large $7 \times 7$ kernel. The resulting spatial mask, $M_{s}(F') \in \mathbb{R}^{1 \times H \times W}$, is given by:

$$M_{s}(F') = σ(f^{7 \times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')])) \tag{5}$$

where $f^{7 \times 7}$ denotes a convolutional layer employing a 7 × 7 kernel. The spatial mask generated by this step locates the spatial positions of geological bodies, enhancing the ability of the model to preserve lithological boundaries against a complex field background.
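A pure-Python sketch of Equation (5) follows. It assumes zero padding so that the mask keeps the input's height and width (the padding scheme is not stated in the text), and it uses a toy kernel whose only non-zero weights are at the centre; the function names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def spatial_attention(f, kernel, bias=0.0):
    """f: [C][H][W] feature F'; kernel: [2][k][k] weights for the (avg, max) maps."""
    channels, height, width = len(f), len(f[0]), len(f[0][0])
    # Pool across the channel axis to obtain two H x W descriptors.
    avg = [[sum(f[c][i][j] for c in range(channels)) / channels
            for j in range(width)] for i in range(height)]
    mx = [[max(f[c][i][j] for c in range(channels))
           for j in range(width)] for i in range(height)]
    stacked = [avg, mx]
    k = len(kernel[0])
    pad = k // 2
    mask = []
    for i in range(height):
        row = []
        for j in range(width):
            acc = bias
            for ch in range(2):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the map count as zeros (assumed padding).
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += kernel[ch][di][dj] * stacked[ch][ii][jj]
            row.append(sigmoid(acc))
        mask.append(row)
    return mask

# Toy 7 x 7 kernel: weight 1 at the centre of both input maps, 0 elsewhere.
kernel = [[[1.0 if (a, b) == (3, 3) else 0.0 for b in range(7)]
           for a in range(7)] for _ in range(2)]
features = [[[0.0, 2.0]]]           # C = 1, H = 1, W = 2
print(spatial_attention(features, kernel))
```

The first pixel receives a gate of exactly 0.5 (zero response), while the second, where both pooled descriptors are large, receives a gate close to 1.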

3.3. Asymptotic Feature Pyramid Network (AFPN)

Unlike traditional Feature Pyramid Networks (FPNs), which force a drastic and direct merger between low-level (highly detailed) and high-level (highly abstract) features, the AFPN adopts a gradual, step-by-step fusion approach. As shown in Figure 6, after unifying the channel dimensions of the four encoder levels to 128, AFPN decomposes the feature fusion process into multiple transitional stages.

Rather than undergoing abrupt scale transformations, features are progressively refined through parallel convolutional blocks, allowing adjacent levels to interact gradually before final fusion. Consequently, the network retains both the high-resolution geometric details and the robust semantic context. The former enables precise delineation of lithological boundaries, while the latter supports accurate lithological classification.

3.3.1. Cross-Scale Feature Alignment Mechanism

Before feature fusion, the inconsistency in resolution among features from different levels must be addressed. Assuming that a larger index denotes a deeper network level with lower spatial resolution, for the target resolution at level $l$, features $X^{n}$ from other levels $n$ need to be adjusted to that resolution.

When $n > l$, an upsampling operation is performed to enhance the shallow features by introducing deeper semantics. After compressing the channels through a 1 × 1 convolution, the resolution is enlarged by a factor of $2^{(n - l)}$ through bilinear interpolation.

When $n < l$, the architecture performs downsampling to embed high-resolution textural cues within more abstract semantic layers. By deploying a strided convolution with a stride of $2^{(l - n)}$, the model decreases the spatial scale while enlarging its receptive field.

The aligned feature is denoted as $X^{n \to l}$, representing feature maps originally from level $n$ but aligned to the spatial scale of level $l$.
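The two alignment rules reduce to a single scale factor that depends on the level difference. The sketch below only tracks spatial sizes; the side lengths assigned to each level are hypothetical, and the function name `align` is not from the paper.

```python
def align(size_n, n, l):
    """Return (operation, factor, aligned_size) for moving level n to level l.

    A larger level index means a deeper level with lower resolution.
    """
    if n > l:                       # deeper source: bilinear upsampling
        factor = 2 ** (n - l)
        return "upsample", factor, size_n * factor
    if n < l:                       # shallower source: strided convolution
        factor = 2 ** (l - n)
        return "downsample", factor, size_n // factor
    return "identity", 1, size_n

sizes = {1: 128, 2: 64, 3: 32}      # assumed side lengths per level
for n, size in sizes.items():
    print(n, align(size, n, l=2))
# 1 ('downsample', 2, 64)
# 2 ('identity', 1, 64)
# 3 ('upsample', 2, 64)
```

After alignment every source level has the same side length as the target level, which is the precondition for the pixel-wise weighted sum used in the fusion step.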

3.3.2. Adaptive Spatial Feature Fusion Mechanism

In multi-scale fusion, features from different levels contribute differently at different spatial locations within an image. Treating all levels as equal contributors and simply fusing them through element-wise addition is therefore inadequate in outcrop segmentation scenarios. For instance, identifying small geological bodies relies more on low-level textural features, whereas recognizing large-area sandstone depends on high-level semantic information.

Therefore, this study introduces an ASFF mechanism that generates spatially adaptive weights to address this issue. As shown in Figure 7, taking three-layer feature fusion as an example, for each spatial position $(i, j)$ of the target level $l$, the fused output $Y_{ij}^{l}$ is calculated as follows:

$$Y_{ij}^{l} = α_{ij}^{l} \cdot X_{ij}^{1 \to l} + β_{ij}^{l} \cdot X_{ij}^{2 \to l} + γ_{ij}^{l} \cdot X_{ij}^{3 \to l} \tag{6}$$

where $α_{ij}^{l}$, $β_{ij}^{l}$, and $γ_{ij}^{l}$ represent the fusion weights of levels 1, 2, and 3 for the current pixel, respectively, and satisfy the normalization constraint $α_{ij}^{l} + β_{ij}^{l} + γ_{ij}^{l} = 1$.

To enable the network to learn these weights autonomously, a lightweight weight generation branch is introduced, as shown in Figure 7:

(1) Compression: A $1 \times 1$ convolution is applied at each level to reduce the channel depth, mapping the features into a more compact space.
(2) Prediction: The resulting compact features are concatenated and fed into a subsequent convolutional block, which generates the raw weight tensors ($λ_{α}, λ_{β}, λ_{γ}$).
(3) Normalization: Finally, a Softmax function normalizes these values across the three levels at each spatial position. Taking $α_{ij}^{l}$ as an example, the normalized weight is:

$$α_{ij}^{l} = \frac{e^{λ_{α}^{ij}}}{e^{λ_{α}^{ij}} + e^{λ_{β}^{ij}} + e^{λ_{γ}^{ij}}} \tag{7}$$

Through this mechanism, ASFF can dynamically filter out conflicting information between features from different levels. For example, if the low-level texture at a certain position contains significant noise, training drives the network to reduce the value of $α$ at that position while increasing the weight $γ$ for high-level features as compensation. This spatially adaptive balancing enhances the model’s robustness in complex geological outcrop scenarios. Finally, the fused features are smoothed by a 3 × 3 convolution and fed to the decoder for lithology prediction.
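Equations (6) and (7) amount to a per-pixel softmax over three logits followed by a weighted sum. The following pure-Python sketch shows this for single-channel maps; the raw logits are supplied by hand rather than predicted by a convolutional branch, and the function names are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def asff_fuse(aligned, logits):
    """aligned, logits: three [H][W] maps each, ordered level 1, 2, 3.

    aligned holds X^{n->l}; logits holds the raw weights (lambda_alpha,
    lambda_beta, lambda_gamma) before normalization.
    """
    height, width = len(aligned[0]), len(aligned[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            weights = softmax([lam[i][j] for lam in logits])       # Eq. (7)
            fused[i][j] = sum(w * x[i][j]                          # Eq. (6)
                              for w, x in zip(weights, aligned))
    return fused

aligned = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]]]
logits = [[[0.0, -20.0]], [[0.0, -20.0]], [[0.0, 20.0]]]
print(asff_fuse(aligned, logits))
```

At the first pixel the three logits are equal, so the output is the plain average of the levels; at the second pixel the level-3 logit dominates and the output is essentially the level-3 value. This is the position-by-position trade-off between $α$ and $γ$ described above.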

3.4. Symmetric Decoding Architecture

In this architecture, the decoder receives the multi-scale fused features from the AFPN module, where $P_{2}$, $P_{3}$, and $P_{4}$ serve as skip connection inputs, and $P_{1}$ serves as the foundational deep-level feature input. This design ensures that features entering the decoder retain robust semantic context while also benefiting from denoising and enhancement by the preceding attention and asymptotic fusion mechanisms. Finally, the IFES layer is connected via a skip connection to Dec4 to recover edge information lost during multiple downsampling.

The decoder consists of four cascaded upsampling modules. For decoder layer $i$ ($i = 1, 2, 3$), the input comprises the output $D_{i - 1}$ of the previous decoder layer (with $D_{0} = P_{1}$) and the AFPN fusion feature $P_{i + 1}$ from the same level. The feature fusion process is defined as follows:

$$D_{i}' = \mathrm{up}(D_{i - 1}) \tag{8}$$

$$D_{i} = F_{conv}(\mathrm{Concat}(D_{i}', P_{i + 1})) \tag{9}$$

where $\mathrm{up}(\cdot)$ denotes bilinear interpolation upsampling, aimed at restoring spatial resolution, and $\mathrm{Concat}(\cdot)$ represents concatenation along the channel dimension. Additionally, $F_{conv}$ denotes a module formed by a pair of 3 × 3 convolutions, each followed by Batch Normalization (BN) and a ReLU non-linearity.

Notably, for the final decoder layer (Dec4), the input consists of the upsampled feature ($D_{3}$) and the high-resolution features from the IFES layer. Finally, the output of Dec4 undergoes upsampling and a 1 × 1 convolution to generate a pixel-level classification prediction mask $Y_{pred} \in \mathbb{R}^{N_{class} \times H \times W}$ that matches the spatial dimensions of the input image.
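To make the resolution bookkeeping of the decoder concrete, the sketch below traces spatial sizes only. It rests on assumptions that the text does not confirm: a standard ResNet-50 layout in which the stem (IFES) output has stride 2 and the four stages have strides 4, 8, 16, and 32, a 2× upsampling in every decoder step, and $P_{1}$ being the deepest AFPN output. The function name `decoder_trace` is hypothetical.

```python
def decoder_trace(input_size=512):
    """Trace the side length of each decoder output for a square input."""
    # Assumed skip resolutions (standard ResNet-50 strides, see lead-in).
    skips = {
        "P2": input_size // 16,
        "P3": input_size // 8,
        "P4": input_size // 4,
        "IFES": input_size // 2,
    }
    size = input_size // 32                 # D0 = P1, the deepest feature
    trace = [("D0", size)]
    for idx, name in enumerate(["P2", "P3", "P4", "IFES"], start=1):
        size *= 2                           # Eq. (8): upsample the previous output
        if size != skips[name]:             # Eq. (9): concat needs equal sizes
            raise ValueError(f"size mismatch at {name}")
        trace.append((f"D{idx}", size))
    trace.append(("Y_pred", size * 2))      # final upsampling before the 1 x 1 conv
    return trace

print(decoder_trace(512))
# [('D0', 16), ('D1', 32), ('D2', 64), ('D3', 128), ('D4', 256), ('Y_pred', 512)]
```

Under these assumptions, four upsampling modules plus the final upsampling recover the full 512 × 512 patch resolution, which is consistent with the statement that the prediction mask matches the input dimensions.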

3.5. Loss Function

The geological outcrop segmentation task faces severe class imbalance, where macroscopic backgrounds dominate the pixel count, while thin-bedded mudstones and tectonic fracture zones account for only a small proportion. Using only the traditional cross-entropy loss tends to bias the model toward fitting the majority classes while neglecting critical minority classes. To address this issue, the total loss L_{total} combines a class-weighted cross-entropy term (L_{wce}) with a Dice loss term (L_{dice}), where λ serves as the weighting factor:

L_{total} = L_{wce} + λ \cdot L_{dice}

(10)

L_{wce} = - \frac{1}{N} \sum_{n = 1}^{N} \sum_{c = 1}^{C} w_{c} \cdot y_{n, c} \log (p_{n, c})

(11)

L_{dice} = 1 - \frac{1}{C} \sum_{c = 1}^{C} \frac{2 \sum_{n = 1}^{N} (p_{n, c} \times y_{n, c}) + ϵ}{\sum_{n = 1}^{N} (p_{n, c} + y_{n, c}) + ϵ}

(12)

where N is the total number of pixels and C is the number of categories. The variable y_{n, c} is the one-hot encoded ground-truth label, and p_{n, c} is the network’s predicted probability. The class weight w_{c} is derived from the inverse pixel frequency of each category within the dataset; this weighting compels the model to focus more heavily on underrepresented lithological classes. Finally, ϵ acts as a smoothing factor.
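As a minimal sketch of Eqs. (10) to (12), the combined loss can be written in plain Python over a flat list of pixels. The function names, the default values of λ (`lam`) and ϵ (`eps`), and the toy inputs are assumptions for illustration; the paper does not state these values here, and a practical implementation would operate on tensors.

```python
import math
from typing import List, Sequence

Pixels = List[List[float]]  # one row per pixel, one column per class


def weighted_ce(probs: Pixels, onehot: Pixels, weights: Sequence[float]) -> float:
    """Eq. (11): class-weighted cross-entropy averaged over N pixels."""
    total = 0.0
    for p_n, y_n in zip(probs, onehot):
        for p, y, w in zip(p_n, y_n, weights):
            if y:  # only the true class contributes for one-hot labels
                total -= w * y * math.log(max(p, 1e-12))  # clamp avoids log(0)
    return total / len(probs)


def dice_loss(probs: Pixels, onehot: Pixels, eps: float = 1.0) -> float:
    """Eq. (12): one minus the mean smoothed Dice score over C classes."""
    num_classes = len(probs[0])
    score = 0.0
    for c in range(num_classes):
        inter = sum(p[c] * y[c] for p, y in zip(probs, onehot))
        denom = sum(p[c] + y[c] for p, y in zip(probs, onehot))
        score += (2.0 * inter + eps) / (denom + eps)
    return 1.0 - score / num_classes


def total_loss(probs: Pixels, onehot: Pixels, weights: Sequence[float],
               lam: float = 1.0, eps: float = 1.0) -> float:
    """Eq. (10): L_total = L_wce + lambda * L_dice."""
    return weighted_ce(probs, onehot, weights) + lam * dice_loss(probs, onehot, eps)


# Toy example: three pixels, two classes (values are illustrative only).
onehot = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
loss_uniform = total_loss(probs, onehot, weights=[1.0, 1.0])
loss_weighted = total_loss(probs, onehot, weights=[1.0, 3.0])
```

Raising the weight of the second class increases the penalty for the misclassified third pixel, which is the behavior the inverse-frequency weighting is meant to produce for minority classes.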

3.6. Grad-CAM++

The inherent opacity of deep neural networks often obscures their internal decision-making processes. To elucidate the specific spatial areas emphasized by the convolutional layers, we apply the Grad-CAM++ (Gradient-weighted Class Activation Mapping) algorithm. By incorporating higher-order derivatives and calibrating the disproportionate impact of channel weights, this advanced approach facilitates a more precise delineation of diminutive targets and subtle local patterns. In our research, Grad-CAM++ serves as a visual diagnostic tool throughout the training continuum, designed to reveal how the network incrementally captures the morphological traits of extremely thin mudstone beds. The mathematical computation of Grad-CAM++ is defined as follows:

L_{Grad - CAM + +}^{c} = ReLU (\sum_{k} α_{k}^{c} A^{k})

(13)

α_{k}^{c} = \sum_{i} \sum_{j} ω_{ij}^{c} \frac{\partial y^{c}}{\partial A_{ij}^{k}}

(14)

where α_{k}^{c} denotes the importance weight, c represents the target class, k represents the channel index, and y^{c} is the score for class c. Furthermore, A_{ij}^{k} is the activation of the feature map at spatial coordinates (i, j) within the k-th channel of layer A.

3.7. Performance Assessment and Experimental Configurations

To comprehensively assess the efficacy of the proposed network from various angles, several quantitative indicators were utilized. The mathematical formulations for these metrics are provided below:

Precision = \frac{TP}{TP + FP}

(15)

Recall = \frac{TP}{TP + FN}

(16)

F1-score = \frac{2 \times Precision \times Recall}{Precision + Recall}

(17)

DSC = \frac{2 \times TP}{FP + 2 \times TP + FN}

(18)

IoU = \frac{TP}{TP + FP + FN}

(19)

mIoU = \frac{1}{N} \sum_{i = 1}^{N} IoU_{i}

(20)

where TP, FP, and FN represent the counts for true positives, false positives, and false negatives, while the total number of categories is denoted by N. The accuracy of the localized areas is measured by Precision, whereas Recall evaluates the network’s ability to fully identify the target regions. The F1-score is the harmonic mean of Precision and Recall. Furthermore, DSC indicates the degree of spatial intersection. The IoU metric calculates the intersection-over-union proportion between the model’s predictions and the actual ground truth, and taking the mean of these IoU values across all categories yields the mIoU. Because it consistently provides strict and robust performance evaluations, mIoU is widely recognized as a reliable metric for segmentation tasks and was therefore chosen as the primary evaluation criterion for our research.
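Eqs. (15) to (20) can be computed directly from a confusion matrix. The sketch below assumes rows index the true class and columns the predicted class; the function names and the example matrix are hypothetical and are not taken from the paper's results.

```python
from typing import Dict, List, Tuple


def per_class_counts(conf: List[List[int]], c: int) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for class c; rows = true class, cols = predicted."""
    tp = conf[c][c]
    fp = sum(conf[r][c] for r in range(len(conf))) - tp  # predicted c, truly other
    fn = sum(conf[c]) - tp                               # truly c, predicted other
    return tp, fp, fn


def class_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Eqs. (15)-(19), returning 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    union = tp + fp + fn
    dsc = 2 * tp / (fp + 2 * tp + fn) if union else 0.0
    iou = tp / union if union else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "dsc": dsc, "iou": iou}


def mean_iou(conf: List[List[int]]) -> float:
    """Eq. (20): unweighted mean of the per-class IoU values."""
    ious = [class_metrics(*per_class_counts(conf, c))["iou"]
            for c in range(len(conf))]
    return sum(ious) / len(ious)


# Hypothetical 3-class confusion matrix (pixel counts).
conf = [[50, 2, 0],
        [3, 40, 5],
        [1, 4, 45]]
m0 = class_metrics(*per_class_counts(conf, 0))
miou = mean_iou(conf)
```

For a single class, Eq. (17) and Eq. (18) are algebraically identical, which is consistent with the F1-score and DSC values coinciding in the reported results.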

All experiments were run on a Windows 11 machine equipped with an NVIDIA RTX 4060 graphics card (NVIDIA Corporation, Santa Clara, CA, USA) and 128 GB of memory. The algorithm was implemented in Python (3.12) with the PyTorch (2.9.0+cu126) framework. We trained the proposed AFPN-ResUNet on the prepared dataset for 150 epochs, using the Adam optimizer to update the network parameters. To mitigate the class imbalance problem, specifically to prevent dominant categories from overshadowing underrepresented ones such as mudstone, we applied the following class weights: 1.5 for sandstone, 3.76 for mudstone, 0.8 for sky, and 1.0 for both weathered debris and vegetation. All baseline models were trained using the same hyperparameter configurations and data augmentation strategies as AFPN-ResUNet. The unified training settings are detailed in Table 1.

4. Results

4.1. Experimental Training Results

As illustrated in Figure 8, the training loss experiences a steep initial decline, subsequently plateauing near the 100-epoch mark after some minor variations. The trajectory of the validation loss closely mirrors that of the training phase, implying the model does not suffer from significant overfitting. Additionally, the mean intersection-over-union (mIoU) progression is remarkably consistent across both datasets. These metrics ultimately stabilize at 94.82% for the training set and 94.15% for the validation set, leaving a marginal gap of merely 0.67%. Although short-term variations emerged during the learning phase—likely driven by mini-batch variance and the inherent complexity of sample features—they did not disrupt the general trajectory. The persistent drop in the loss function, coupled with climbing mIoU scores, confirms that stable convergence was successfully reached. In summary, the proposed architecture exhibits reliable training capabilities.

4.2. Segmentation Results

To thoroughly evaluate the proposed AFPN-ResUNet framework, we conducted extensive evaluations using an unseen, independent dataset. Figure 9 presents the segmentation results under different surface outcrop conditions (illumination, shelter, interbedding, crumbling).

The experimental results demonstrate that AFPN-ResUNet achieved strong segmentation performance across various challenging conditions. First, in regions with uneven illumination, the model accurately restored the spatial continuity of sedimentary layers, with mIoU scores ranging from 87.35% to 95.57%. Furthermore, in sheltered areas covered by weathered debris and vegetation, the mIoU consistently exceeded 90%, supporting the feature refinement capability of the RE-CBAM and its effectiveness in reducing noise-induced misclassification.

Notably, in regions with extremely thin interbedded sandstone and mudstone layers, the model maintained stable segmentation performance, with mIoU scores exceeding 93% and precise delineation of ultra-thin interlayer boundaries. Finally, in severely weathered and crumbling outcrop areas with blurred boundaries, AFPN-ResUNet remained reliable, yielding mIoU scores ranging from 88.30% to 94.82%. These results indicate that the proposed model performs robustly in complex and fractured field outcrop segmentation tasks, accurately identifying and localizing specific lithologies.

4.3. Analysis of Module Mechanisms

4.3.1. Multi-Dimensional Visualization Study on the Mechanism of AFPN-ResUNet

To reveal the internal learning mechanism of AFPN-ResUNet, a multi-dimensional visualization analysis was conducted using Grad-CAM++ during the training process. In Figure 10, the horizontal axis captures the shifting focus of attention regions over various epochs, while the vertical dimension portrays the progressive development of features across successive network layers.

Examining the vertical axis (network depth), the visualization results clearly demonstrate the process of feature refinement and spatial reconstruction. At the front end of the encoder (e.g., RCB-E1), the model’s activation signals appear diffuse, primarily capturing low-level visual information such as color gradients, minor surface fractures, weathering marks, and basic textures on rock surfaces. As the network depth increases toward deeper layers (RCB-E4), the activated regions exhibit noticeable spatial concentration and intensified activation, shifting focus away from superficial noise to strongly highlight the prominent horizontal bedding planes and macro-structural boundaries. Subsequently, in the decoder stage (from SYM-D1 to SYM-D4), abstracted semantic features are progressively upsampled and refined. The activation focus transitions from blob-like abstract regions back to precise spatial activations, ultimately generating sharp, continuous heatmaps that accurately align with the actual physical boundaries of the thin interbedded strata.

Examining the horizontal axis (training epochs), in the early stage of training, the attention mechanism anchors on highly contrasted and prominent lithological features, such as the thickest visually distinct sedimentary bands (e.g., the dominant reddish strata). Entering the middle stage, the model begins to finely identify previously overlooked weak feature regions and edges, specifically activating along the thinner interbedded layers and subtle transitional boundary zones. By the convergence stage of training, the model can clearly identify the target lithological areas, generating concentrated, high-intensity activation maps that precisely trace the continuous morphological strike of the specific geological strata, while effectively suppressing non-geological background elements like shadows or surface debris.

This process demonstrates that RE-CBAM effectively filters out complex background noise in field settings by gradually decoupling lithological signals from the complex geological environment, achieving a transformation from fragmented local feature perception to systematic global feature representation. Furthermore, combined with the asymptotic fusion of AFPN, the model effectively reconstructs abstracted spatial information during the decoding stage.

4.3.2. Visual Analysis of the Asymptotic Fusion Mechanism in AFPN-ResUNet

To intuitively demonstrate the AFPN asymptotic fusion mechanism, this study conducted a visualization analysis of its workflow and generated feature heatmaps for mudstone lithological units in the thin interbedded area of the outcrop, as illustrated in Figure 11. The input is downsampled through the backbone network, and a large amount of irrelevant background noise is effectively filtered by the RE-CBAM attention module, thereby providing a refined initial feature input for the subsequent AFPN module. Feature maps from different levels are then respectively processed through 1 × 1 convolution (indicated by the red arrows) to unify the number of channels (adjusted to 128), which significantly reduces computational complexity while preserving the representative features of each level.

As illustrated in the AFPN structure, asymptotic feature fusion proceeds progressively from shallow to deep levels. In this process, shallow features (e.g., C₁ and C₂)—which distinctly capture localized visual information such as granular surface textures, minor fractures, and the precise geometric boundary lines of thin mudstone layers—participate in the fusion first (Step 1). These are then weighted and integrated through the ASFF nodes, with the output progressively fused with deeper features (such as C₃ in Step 2 and C₄ in Step 3) via downsampling (indicated by the green arrows). This design effectively compensates for the limitations of deep features, which possess strong semantic information (successfully localizing the macro-structural strike of the strata) but lack sufficient spatial details (often appearing as diffuse, blurred activation blobs).

Simultaneously, deep features propagate back to shallow layers via upsampling (indicated by the orange arrows), enriching the shallow features with higher-level semantic contexts. This overcomes the inherent limitation of shallow features containing only fragmented local details without overarching geological associations. Through this bidirectional, dense feature interaction, AFPN achieves a complementary integration of multi-scale features. As seen in the final fusion stage (Step 4), the model successfully generates concentrated activation bands. These refined multi-scale features are then finally fed into the decoder (from P₁ to P₄), where spatial resolution is progressively restored. The resulting continuous, high-intensity activation maps precisely delineate the thin interbedded mudstone units, contributing to the accurate and robust final segmentation output.
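The per-pixel weighting performed at an ASFF node can be illustrated with a small sketch. It assumes the features from all levels have already been resized to a common resolution and channel count, handles a single channel, and normalizes the weights with a softmax across levels. The softmax choice, the function names, and the toy inputs are assumptions, and in a real network the weight logits would be predicted by learned layers rather than supplied by hand.

```python
import math
from typing import List

Grid = List[List[float]]  # a single-channel H x W feature map


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the level axis."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(features: List[Grid], weight_logits: List[Grid]) -> Grid:
    """Fuse aligned feature maps with per-pixel weights that sum to 1."""
    levels = len(features)
    h, w = len(features[0]), len(features[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            weights = softmax([weight_logits[l][i][j] for l in range(levels)])
            fused[i][j] = sum(weights[l] * features[l][i][j]
                              for l in range(levels))
    return fused


# Toy inputs: a sharp shallow map and a smooth deep map (illustrative only).
shallow = [[1.0, 0.0], [0.0, 1.0]]
deep = [[0.5, 0.5], [0.5, 0.5]]
# The shallow level is strongly preferred at pixel (0, 0); elsewhere the two
# levels receive equal weight.
logits_shallow = [[10.0, 0.0], [0.0, 0.0]]
logits_deep = [[0.0, 0.0], [0.0, 0.0]]
fused = adaptive_fuse([shallow, deep], [logits_shallow, logits_deep])
```

Where one level dominates, the fused value follows that level almost exactly; where the weights are equal, the result is the plain average. This is the sense in which spatially adaptive weights can favor shallow detail at boundaries and deep context inside rock bodies.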

4.4. Comparison Study and Performance Assessment

4.4.1. Comparative Analysis of Segmentation Results Across Mainstream Models

To validate the AFPN-ResUNet framework, a comparative analysis was conducted against five established networks: UNet, ViT, DeepLabV3+, PSPNet, and SegNeXt. All baseline models were evaluated under the same data splits and hardware setup. The hyperparameter configurations and augmentation strategies applied uniformly to all models are summarized in Table 1.

Table 2 shows the performances of different segmentation methods. AFPN-ResUNet clearly outperforms other models across all evaluation metrics in the task of field outcrop lithological segmentation. Specifically, baseline and extremely lightweight models like UNet, ViT, and SegNeXt struggle with this complex task, yielding mIoU scores of only 70.21%, 69.49%, and 67.37%, respectively. While classical semantic segmentation architectures such as DeepLabV3+ and PSPNet perform relatively better (achieving mIoU scores of around 81%), AFPN-ResUNet significantly eclipses them, with an mIoU of 93.41%, exceeding the second-best model (PSPNet) by a substantial margin of 12.38 percentage points. Furthermore, this superiority extends beyond mIoU. AFPN-ResUNet achieves a remarkable F1-Score and DSC of 96.58%, whereas the lowest-performing model, SegNeXt, only manages an F1-Score of 69.54%. The consistent superiority of its recall, precision, and F1-score clearly demonstrates the model’s capacity to effectively mitigate prediction errors and omissions in complex geological scenes.
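The margins quoted above are simple differences in percentage points. The short check below recomputes them from the mIoU values stated in this paragraph; the variable names are illustrative, and the PSPNet score is backed out from the stated 12.38-point margin rather than read from Table 2.

```python
# mIoU values (in %) quoted in the text.
OURS = 93.41
reported = {"UNet": 70.21, "ViT": 69.49, "SegNeXt": 67.37}

# Margin of AFPN-ResUNet over each baseline, in percentage points.
margins = {name: round(OURS - value, 2) for name, value in reported.items()}

# PSPNet mIoU implied by the stated 12.38-point margin.
pspnet_miou = round(OURS - 12.38, 2)
```

The implied PSPNet value of 81.03% agrees with the statement that DeepLabV3+ and PSPNet both reach mIoU scores of around 81%.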

Figure 12 illustrates key visual comparisons of the segmentation tasks, with the layout organized such that the original image crops and their corresponding manual annotations (ground truth) occupy the leftmost two columns. The remaining sections detail the predicted masks produced by our AFPN-ResUNet, as well as those derived from UNet, ViT, DeepLabV3+, PSPNet, and SegNeXt.

As shown in Figure 12a, in regions with uneven illumination, models like ViT and SegNeXt are heavily disturbed by shadows, misclassifying shadowed rock layers into incorrect categories. In contrast, AFPN-ResUNet remains robust to these illumination changes. In areas covered by weathered debris and vegetation (Figure 12b), UNet and ViT exhibit severe noise susceptibility, generating large, fragmented patches of misclassified pixels around the occlusions. AFPN-ResUNet, however, effectively suppresses this environmental noise and delineates the continuous outcrop boundaries. Figure 12c–e depict regions with extremely thin interbeds. Here, the limitations of the comparative models are apparent: SegNeXt largely fails to capture the thin structures (fusing them into solid background blocks), while UNet and ViT produce disjointed and fragmented predictions. Even DeepLabV3+, which performs moderately well, tends to over-smooth the edges and lose fine geometric details. Conversely, AFPN-ResUNet accurately delineates the boundaries and maintains the spatial continuity of these ultra-thin layers. Finally, in severely weathered and crumbling areas (Figure 12f,g), ViT suffers from severe blocky artifacts, and SegNeXt exhibits massive misclassifications, whereas AFPN-ResUNet consistently isolates the fragmented lithology with high precision.

Overall, while other baseline models can identify macro-lithological distributions, they fundamentally struggle with severe noise, occlusions, and fine-grained structural details. Meanwhile, AFPN-ResUNet consistently delivers accurate, continuous, and robust segmentation results under various challenging field conditions.

4.4.2. Comparison of Inference Speed and Computational Cost

A comprehensive evaluation of a model’s true capability demands looking beyond mere segmentation accuracy—computational overhead is equally vital for real-world deployment. Consequently, we benchmarked the operational efficiency of our AFPN-ResUNet against five established networks using identical software and hardware configurations. To quantify architectural complexity, we tracked total trainable weights (Params) alongside floating-point operations (FLOPs). Concurrently, the execution speed was measured via per-image processing times and frames per second (FPS). These detailed metrics are documented in Table 3.

Regarding model complexity, the proposed AFPN-ResUNet comprises 38.67 M parameters, placing it between DeepLabV3+ (40.35 M) and PSPNet (30.39 M), while remaining substantially more compact than the Transformer-based ViT (91.66 M). This demonstrates that AFPN-ResUNet maintains a relatively moderate parameter footprint despite the integration of multi-scale asymptotic fusion modules. In terms of computational overhead, the model requires 179.79 G FLOPs. While this computational cost is markedly lower than that of classical architectures such as UNet (321.19 G) and PSPNet (222.63 G), the dense multi-level feature interactions within the AFPN and the weighted fusion operations inherent to the ASFF mechanism inevitably introduce structural complexity.

Regarding inference efficiency, AFPN-ResUNet yields an inference time of 33.99 ms per image, translating to a throughput of 29.42 FPS. This throughput is comparatively lower than that of standard architectures like UNet (40.94 FPS) and DeepLabV3+ (52.98 FPS) and falls short of the optimized SegNeXt (97.74 FPS) and ViT (141.45 FPS). The elevated inference latency—despite having fewer FLOPs than UNet and PSPNet—can be attributed to the complex, sequential feature fusion pathways. These fragmented structures inherently introduce substantial memory access costs and reduce the degree of hardware parallelism, shifting the bottleneck from computation to memory bandwidth.
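The latency and throughput figures are related by FPS = 1000 / latency (ms), assuming images are processed one at a time with no batching or pipelining. The helper names below are illustrative; the numbers are those quoted in the text.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Throughput in frames per second for a given per-image latency."""
    return 1000.0 / latency_ms


def latency_ms_from_fps(fps: float) -> float:
    """Per-image latency in milliseconds implied by a throughput."""
    return 1000.0 / fps


# AFPN-ResUNet: 33.99 ms per image corresponds to 29.42 FPS after rounding.
afpn_fps = round(fps_from_latency_ms(33.99), 2)

# Per-image latencies implied by the FPS values reported for the baselines.
reported_fps = {"UNet": 40.94, "DeepLabV3+": 52.98,
                "SegNeXt": 97.74, "ViT": 141.45}
implied_latency_ms = {name: latency_ms_from_fps(fps)
                      for name, fps in reported_fps.items()}
```

Under this assumption, every baseline listed in this paragraph has an implied per-image latency below 33.99 ms.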

Notably, despite possessing the largest parameter count, ViT achieves the highest throughput. This efficient hardware utilization is primarily attributed to the parallelizable matrix operations inherent to its self-attention blocks, which bypass the sequential constraints of convolutions. However, this architectural trait represents an inherent trade-off. While it maximizes FPS, the pure Transformer architecture lacks the local inductive biases necessary for fine-grained texture extraction. Consequently, without the support of massive pre-training datasets, ViT struggles to accurately delineate complex geological boundaries, resulting in an mIoU of only 69.49%.

Conversely, SegNeXt represents an extreme approach to lightweight design. While its minimal parameter count (3.70 M) and low computational cost (18.99 G) yield an exceptionally high throughput, its poor segmentation accuracy (mIoU of 67.37%) indicates that excessive structural simplification severely compromises spatial feature representation. This contrast underscores the rationality of the proposed AFPN-ResUNet, which strategically accepts a moderate increase in inference latency to achieve superior and reliable segmentation accuracy in complex field environments.

In summary, while AFPN-ResUNet successfully constrains its parameter scale and achieves superior segmentation accuracy, its sophisticated feature fusion mechanisms impose non-trivial inference latency. Consequently, future work will focus on structural pruning, the integration of lightweight convolutional adaptations, and the optimization of fusion modules to enhance real-time applicability without compromising model accuracy.

4.5. Ablation Study

The individual impacts of the RE-CBAM and the AFPN architectures on segmentation efficacy were systematically analyzed through a series of component-removal tests. Table 4 details the exact structural arrangements used in this phase. To prevent experimental bias and ensure a fair evaluation, all ablated model versions were trained using the identical unified configurations previously outlined in Table 1. The recorded performance indicators—specifically mIoU, F1-Score, recall, precision, and DSC—are cataloged in Table 5. Visual evidence of these improvements across the Baseline, Baseline + RE-CBAM, Baseline + AFPN, and full AFPN-ResUNet configurations is displayed in Figure 13. Ultimately, the empirical data confirms that every added component successfully boosts the core network’s capabilities.

As presented in Table 5, the incremental integration of each module leads to a consistent performance uplift. The Baseline + RE-CBAM configuration achieves a 13.11% improvement in mIoU over the Baseline. This is qualitatively evidenced in Figure 13a,b, where the attention mechanism effectively suppresses shadows and vegetative occlusions. By recalibrating channel and spatial importance, RE-CBAM prevents the encoder from being distracted by environmental noise, thereby enhancing the discriminative signatures of lithological textures.

Notably, the Baseline + AFPN model yields a substantial 13.98% gain in mIoU, surpassing the attention-only version. This underscores the critical necessity of asymptotic feature fusion in geological tasks. In the standard ResUNet, the abrupt concatenation of shallow and deep features often leads to a “semantic gap,” manifesting as fragmented or over-smoothed boundaries in thin interbedded sequences. AFPN mitigates this by facilitating a smoother semantic transition across adjacent levels. As shown in Figure 13c, AFPN significantly preserves the spatial continuity of ultra-thin layers, which is further reflected by the highest precision (96.72%) among all ablation models, indicating its ability to reject false-positive background pixels in complex structures.

Finally, AFPN-ResUNet shows better performance across most comprehensive metrics, with the highest mIoU (93.41%) and recall (97.05%). A nuanced comparison between the AFPN-only model and the full architecture reveals an advantageous performance trade-off. While the final model experiences a slight decline in precision (96.72% to 96.11%), it achieves definitive gains in recall, F1-Score, DSC, and mIoU. This indicates that the integration of RE-CBAM enables a more inclusive feature capture. By filtering out environmental noise prior to fusion, RE-CBAM acts as a “feature purifier,” freeing the AFPN module from being overly conservative. Consequently, the model excels at recovering severely fragmented or weathered geological units, reducing missed detections. In complex geological scenes, such as the crumbling outcrops in Figure 13d, prioritizing structural completeness (high recall) over strict pixel-level conservatism (precision) is a crucial requirement. The rise in all comprehensive metrics confirms that this gain in structural integrity outweighs the minor pixel-level trade-off.

In conclusion, the integration of a ResNet50 backbone for robust hierarchy, RE-CBAM for local discriminative power, and AFPN for global semantic consistency allows this model to effectively adapt to the complexities of field geological environments.

Table 6 compares the operational efficiency of different ablation configurations. The Baseline model has 73.28 M parameters, which increases slightly to 75.80 M after incorporating RE-CBAM. However, the introduction of AFPN significantly reduces the parameter count to 36.16 M, a remarkable decrease of 50.7%. This substantial reduction is primarily attributed to AFPN’s architectural design, which replaces the redundant, dimension-inflating skip connections of the original architecture with early dimensionality reduction via 1 × 1 convolutions.

Despite this drastic reduction, the segmentation performance of the model is preserved through the complementary roles of the two modules. Instead of acting as a filter itself, AFPN relies on its progressive multi-scale strategy to efficiently integrate features. The true role of the information filter is fulfilled by RE-CBAM, which removes unrefined, noisy low-level features before fusion. As a result, AFPN successfully eliminates bloated convolution operations, while RE-CBAM optimizes feature representation. This allows the final AFPN-ResUNet to maintain a relatively compact parameter count of 38.67 M without sacrificing its strong representational capacity.

Regarding computational complexity, the Baseline exhibits a high number of FLOPs, at 304.62 G, which remains essentially unchanged after adding RE-CBAM (304.88 G). In contrast, incorporating AFPN drastically reduces FLOPs to 179.49 G, a 41.1% decrease, fully demonstrating the advantage of the asymptotic fusion mechanism in reducing redundant computations. AFPN-ResUNet achieves 179.79 G FLOPs, nearly identical to Baseline + AFPN.

In terms of inference speed, the Baseline achieves an inference time of 24.10 ms per image. After introducing RE-CBAM, the inference time increases to 30.44 ms due to the sequential computation of the attention mechanism. In contrast, Baseline + AFPN reduces the inference time to 22.41 ms, a 7.0% improvement over the Baseline. The final AFPN-ResUNet has an inference time of 33.99 ms, the highest of the four configurations, which indicates that the latency costs of the two modules accumulate when they are combined. This reflects the trade-off between RE-CBAM and AFPN: RE-CBAM accepts additional latency in exchange for cleaner features, while AFPN reduces the parameter count and computational load by streamlining the fusion path. Together, they allow the model to reach high geological identification accuracy at a moderate efficiency cost.
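The relative changes quoted for the ablation configurations follow from a simple percentage-change calculation. The check below uses the values reported in the text for Table 6; the function and variable names are illustrative.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Baseline -> Baseline + AFPN
params_change = round(pct_change(73.28, 36.16), 1)    # parameters (M): -50.7
flops_change = round(pct_change(304.62, 179.49), 1)   # FLOPs (G): -41.1
time_afpn_change = round(pct_change(24.10, 22.41), 1)  # latency (ms): -7.0

# Baseline -> Baseline + RE-CBAM
time_cbam_change = round(pct_change(24.10, 30.44), 1)  # latency (ms): +26.3
```

Negative values denote reductions relative to the Baseline, and positive values denote increases.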

5. Discussions

5.1. Comparative Analysis of Segmentation Performance Across Different Models

Accurate lithology segmentation in complex outcrops is constrained by severe background interference and extreme scale variations. While conventional architectures like DeepLabV3+ expand the receptive field, they lack adaptive mechanisms to suppress background noise—often misclassifying shadows with similar spectra. To address this, AFPN-ResUNet incorporates an RE-CBAM that recalibrates spatial and channel representations, adaptively suppressing non-geological interference early in the network to preserve fundamental lithological features before deep processing.

Furthermore, preserving the geometric continuity of ultra-thin interbeds demands meticulous multi-scale feature integration—a challenge for standard architectures. UNet’s direct concatenation produces coarse boundaries under occlusion; PSPNet’s global pooling blurs localized structures; and ViT’s patch-embedding mechanism disrupts the spatial continuity of narrow geological bodies (mIoU 69.49%). Even efficient designs like SegNeXt compromise fine edge preservation in weathered zones. To overcome these bottlenecks, our framework employs an asymptotic feature pyramid network with an ASFF mechanism. Rather than utilizing abrupt concatenation or aggressive pooling, this strategy fuses features progressively through dynamic spatial weighting. This approach resolves cross-scale semantic discrepancies while balancing global contextual coherence with the precise retention of high-frequency boundaries in extremely thin layers.

5.2. Contribution and Synergy of RE-CBAM and AFPN Modules

The RE-CBAM primarily functions by isolating valid lithological signals from high-noise backgrounds. In outcrop areas characterized by common interferences such as shadows, vegetation, and weathered debris, this module employs a serialized channel and spatial attention mechanism for feature recalibration. Channel attention highlights the channels encoding key lithological features, while spatial attention helps to localize them. By suppressing the background responses of non-geological objects during the early stages of feature extraction, RE-CBAM provides a foundation for refined features, contributing a 13.11% improvement in mIoU.

However, deep networks are prone to losing fine lithological details during repeated downsampling processes, which is particularly pronounced in extremely thin interbedded layers. To address this, the AFPN module abandons the standard feature stacking typical of traditional FPNs by utilizing ASFF to dynamically assign multi-level weights to each pixel location. This mechanism allows the model to prioritize high-frequency spatial details at lithological boundaries while emphasizing deep semantic context within rock bodies. It compensates for the limitations of traditional networks in segmenting thin layers, yielding a 13.98% performance gain.

Overall, AFPN-ResUNet constructs a coupled segmentation paradigm of feature purification and semantic fusion. If multi-scale fusion techniques are applied directly to unprocessed feature data, background noise is extracted alongside it and propagates across different pyramid levels. The RE-CBAM effectively prevents this risk by feeding relatively clean signals into the fusion process. Subsequently, AFPN reconstructs the thin-layer lithological features lost during the downsampling and denoising processes through a progressive feature fusion approach. This complementary design preserves the delicate texture boundaries within extremely thin interbedded structures and maintains the overall semantic coherence of rock masses, thereby surpassing the segmentation performance of a single module.

5.3. Balance Between Segmentation Precision and Inference Speed

While AFPN-ResUNet achieves a significant increase in accuracy, it necessitates a higher inference latency, registering the highest single-image inference latency (33.99 ms) among the compared models. This increased load stems from three primary architectural factors. Specifically, the attention mechanism creates a bottleneck in parallelization, as the core channel-wise global pooling and multi-layer perceptron (MLP) within the RE-CBAM rely on serial computation. This dependency limits GPU acceleration efficiency, increasing the inference time by 26.3%. Additionally, the complexity of feature interaction tends to offset parameter reduction. Despite AFPN’s lower parameter count, its frequent resolution alignment and ASFF weight generation require high-density matrix operations. Furthermore, system-level module coordination adds scheduling overhead, where the feature flow must sequentially traverse RE-CBAM recalibration, AFPN multi-stage fusion, and decoder upsampling, thereby accumulating latency.

The justification for this computational cost depends on the application context. For offline precision mapping, where the cost of rectifying geological misinterpretations can be significant, the 12.38 percentage-point mIoU gain over PSPNet often outweighs the computational investment. In such cases, the 33.99 ms latency typically falls within acceptable limits for routine data processing workflows. Conversely, for real-time outcrop segmentation tasks, a throughput of 29.42 FPS could limit overall system responsiveness, and alternatives like DeepLabV3+ (52.98 FPS) might present a more practical balance between speed and accuracy.

5.4. Study Limitations and Future Directions

5.4.1. Dataset Limitations and Spatial Generalization

In this study, 20 high-resolution outcrop images (5472 × 3648 pixels) from the Zhuozishan area were selected for manual annotation. Because the initial sample is relatively small, even with the training set expanded through cropping and data augmentation, the model may still be restricted to specific scenes and affected by spatial autocorrelation, which may limit generalization to unseen scenarios. Additionally, this research focused primarily on sandstone and mudstone within the deep-water sedimentary environment of the Zhuozishan area, without covering other geological settings. Outcrop textures and structural morphologies can vary considerably across different geological environments, so although the model performed well on an independent and unseen test set, this performance reflects its reliability only within the Zhuozishan outcrop region. Cross-regional validation would be required to assess broader generalizability.

Furthermore, outcrop images are frequently affected by variations in illumination, weathering, and vegetation. Because pixel-level annotation of high-frequency interbedded structures also relies heavily on expert experience, high-quality public datasets in this field remain scarce. Given these data constraints, we aimed to design a segmentation model for ultra-thin sandstone–mudstone interbeds. While the cross-domain generalizability of the model requires further verification, its performance in the detailed characterization of complex thin beds in the Zhuozishan area demonstrates its application value. Future work will focus on constructing comprehensive outcrop datasets that encompass multiple lithologies, environments, and regions to further evaluate and enhance the model’s applicability and robustness under complex geological conditions.

5.4.2. Limitations of Spatial Resolution and Boundary Ambiguity

The experimental results indicate that, despite the introduction of class weights to mitigate majority-class bias, the recall rate for thin mudstone layers remains comparatively low. This bottleneck highlights a complex interplay between the physical limits of image resolution and inherent annotation uncertainty.
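To make the role of the class weights concrete, the sketch below applies the weights listed in Table 1 in a per-pixel weighted cross-entropy. This is a generic formulation rather than the loss actually used in this study (see Section 3.5); the function names are illustrative, and no mapping between weight positions and lithological classes is assumed.

```python
import math

# Class weights from Table 1; which class each position refers to is not assumed here.
CLASS_WEIGHTS = [1.5, 3.76, 0.8, 1.0, 1.0]


def log_softmax(logits):
    """Numerically stable log-softmax over one pixel's class scores."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - lse for v in logits]


def weighted_pixel_loss(logits, label, weights=CLASS_WEIGHTS):
    """Weighted negative log-likelihood of the true class at one pixel."""
    return -weights[label] * log_softmax(logits)[label]


def weighted_cross_entropy(pixel_logits, labels, weights=CLASS_WEIGHTS):
    """Weighted mean over pixels: sum of weighted losses divided by sum of weights."""
    total = sum(weighted_pixel_loss(lg, y, weights) for lg, y in zip(pixel_logits, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Two pixels with uninformative (uniform) scores, one from class 1 and one from class 3.
uniform = [0.0] * 5
loss = weighted_cross_entropy([uniform, uniform], [1, 3])
```

A misclassified pixel of the class weighted 3.76 contributes 3.76 times as much to the numerator as one from a class weighted 1.0. This raises the cost of missing that class, but it cannot recover detail that is absent from the pixels themselves.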

Fundamentally, although our imagery possesses millimeter-level resolution, the extremely thin sandstone–mudstone interbeds (e.g., 1–2 cm) often occupy only a few pixels in the image. At this scale, the mixed-pixel effect remains significant: pixels at the boundaries of thin layers frequently incorporate spectral information from the surrounding rock, causing intrinsic textural features to be smoothed or obscured. This physically constrains the model’s capacity to resolve subtle boundaries.
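A quick calculation illustrates the scale of the problem. The sketch below converts bed thickness into the number of pixels spanned across the bed for a few ground sampling distances (GSDs). The GSD values are hypothetical placeholders, not the values of this dataset, and the boundary share is a crude estimate that assumes one potentially mixed pixel row along each of the two bed boundaries.

```python
def pixels_across(thickness_mm, gsd_mm):
    """Number of pixel rows spanned by a bed of the given thickness."""
    return thickness_mm / gsd_mm


def boundary_share(thickness_mm, gsd_mm):
    """Rough share of a bed's pixel rows that touch a boundary (capped at 1)."""
    rows = pixels_across(thickness_mm, gsd_mm)
    return min(1.0, 2.0 / rows)


# Bed thicknesses of 1-2 cm as quoted in the text; GSD values are hypothetical.
for gsd_mm in (2.0, 5.0):
    for thickness_mm in (10.0, 20.0):
        rows = pixels_across(thickness_mm, gsd_mm)
        share = boundary_share(thickness_mm, gsd_mm)
        print(f"GSD {gsd_mm} mm, bed {thickness_mm} mm: "
              f"{rows:.1f} rows, boundary share {share:.0%}")
```

The thinner the bed relative to the GSD, the larger the share of its pixels that sit on a boundary and are therefore likely to be mixed.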

Moreover, this physical limitation amplifies the challenges of ground-truth delineation. Since dataset construction relies on manual interpretation, boundary ambiguity caused by both mixed pixels and natural lithological transitions introduces inherent variance into the annotation process. Consequently, observational noise and uncertainty are structurally integrated into the training data.

Given these compounding factors, the precise extraction of thin-layer details remains inherently constrained at the current image resolution. Future research could address this bottleneck by exploring multi-source data fusion or sub-pixel analysis techniques to overcome current feature representation constraints.

5.4.3. Dimensional Constraints and Extension to 3D Analysis

The current methodology operates within a two-dimensional (2D) optical space, focusing on the extraction of surface lithological patterns. However, comprehensive geological interpretation relies on understanding three-dimensional (3D) spatial structures, such as true stratigraphic thickness, dip angles, and subsurface structural continuity. Because 2D optical segmentation is based on surface projections, it captures apparent boundaries rather than volumetric parameters, presenting a natural constraint for full spatial analysis.

Despite these dimensional constraints, highly accurate 2D segmentation serves as a crucial prerequisite for advanced 3D geological modeling. Reliable 2D pixel-level classification provides an essential semantic foundation—precisely delineating lithological boundaries and surface facies distributions. Without this high-fidelity semantic mapping, 3D point clouds or structural models lack the necessary geological context. Therefore, the proposed 2D segmentation functions not merely as a preliminary tool, but as a “semantic layer” required for informing and enriching spatial models.

Building upon this foundation, a logical trajectory for future research involves the fusion of these 2D semantic masks with 3D UAV photogrammetric products. By projecting the high-resolution 2D classifications into a 3D coordinate system, future studies can transition from surface mapping to quantitative volumetric analysis, enabling the automatic extraction of true spatial metrics and offering a more comprehensive framework for 3D geological characterization.
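One simple way to attach 2D labels to 3D data is to project each 3D point into the image and read the class at the pixel it lands on. The sketch below does this with an ideal pinhole camera. It is only an illustration of the idea: the intrinsics (fx, fy, cx, cy), the assumption that points are already expressed in the camera frame, and the neglect of lens distortion and occlusion are all simplifications that a real photogrammetric workflow would need to address.

```python
import math


def project(point, fx, fy, cx, cy):
    """Ideal pinhole projection of a camera-frame point (X, Y, Z) to pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        return None  # behind the camera: not visible
    return fx * x / z + cx, fy * y / z + cy


def label_points(points, mask, fx, fy, cx, cy, unlabeled=-1):
    """Assign each 3D point the class of the mask pixel it projects into.

    mask: 2D list of class ids (rows x cols), e.g. a predicted segmentation.
    Points that fall outside the image or behind the camera get `unlabeled`.
    """
    height, width = len(mask), len(mask[0])
    labels = []
    for point in points:
        uv = project(point, fx, fy, cx, cy)
        if uv is None:
            labels.append(unlabeled)
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        inside = 0 <= row < height and 0 <= col < width
        labels.append(mask[row][col] if inside else unlabeled)
    return labels


# Toy example with a 2 x 2 mask and hypothetical unit intrinsics.
toy_mask = [[0, 1], [2, 3]]
toy_points = [(0.5, 1.5, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
toy_labels = label_points(toy_points, toy_mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Once points carry lithology labels, quantities such as bed thickness can in principle be measured in 3D rather than in the image plane.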

5.4.4. Data Modality and Geological Domain Knowledge

Finally, relying solely on a single optical source limits the ability to penetrate vegetation canopies or deep weathering crusts. Integrating multi-modal data, such as hyperspectral imagery, can help to capture fine mineralogical signatures and significantly enhance surface interpretation. Furthermore, the current framework operates primarily as a data-driven paradigm with limited explicit constraints from earth science principles. Future endeavors could explore constructing Geology-Informed Neural Networks by embedding prior knowledge, such as stratigraphic superposition rules and spatial topological constraints, into the deep learning architecture. This approach would help the network to internalize geological logic, thereby minimizing segmentation predictions that contradict fundamental geological evolution laws.

6. Conclusions

This research addresses the intricate task of delineating lithological boundaries within field geological outcrops, a scenario frequently complicated by background complexity and significant scale variations within thin interbeds. To tackle these issues, we introduce AFPN-ResUNet, a semantic segmentation framework designed to capture multi-scale features and mitigate background interference. Through rigorous empirical testing and visual feature analysis, we confirm the robust performance of our network while simultaneously shedding light on its underlying decision-making processes.

The segmentation results show that the model exhibits favorable segmentation capability under complex field outcrop conditions. Even in the presence of severe illumination unevenness and vegetation occlusion, the model can accurately extract lithological features. For extremely thin sandstone–mudstone interbeds, the proposed method preserves high geometric continuity and sharp edge detail, surpassing traditional approaches. Comparative experiments show that the AFPN-ResUNet model achieves a 93.41% mIoU on the test set. This result suggests that architecture optimization for the scale variations and textural characteristics of geological objects may offer advantages over directly applying generic computer vision models.

Ablation experiments and mechanism analysis further indicate that the observed performance gain is largely attributable to the complementary roles of the RE-CBAM and AFPN modules. Within this framework, RE-CBAM functions as an attention-weighted filter that adaptively suppresses non-geological background interference, while AFPN serves as a gradual semantic bridge, reconciling hierarchical discrepancies across feature scales. This dual design helps mitigate the loss of high-frequency spatial details in deep networks and notably strengthens the model’s capacity to delineate thin interbedded structures.

Future work could explore lightweight models to accommodate real-time UAV-based geological interpretation. In parallel, the incorporation of multispectral or hyperspectral data may provide additional lithological constraints that benefit segmentation in complex outcrop settings. Furthermore, coupling such segmentation architectures with explicit geological prior knowledge offers a potential pathway toward a “data–knowledge” dual-driven paradigm. Such an integration could improve generalization and interpretability across geologically diverse environments.

Supplementary Materials

Codes and models that support this study are available at the GitHub link: https://github.com/fk989/AFPN-ResUNet (accessed on 17 April 2026).

Author Contributions

Conceptualization, M.T. and K.F.; methodology, K.F.; software, K.F. and L.T.; validation, L.T., W.C. and Y.L.; formal analysis, M.T.; investigation, M.T., K.F. and Z.M.; resources, M.T.; data curation, L.T. and W.C.; writing—original draft preparation, M.T.; writing—review and editing, K.F.; visualization, Y.L. and Z.Z.; supervision, K.F.; project administration, K.F.; funding acquisition, M.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Major Project of China, grant number 2025ZD1403003, the National Natural Science Foundation of China, grant numbers 42072163 and 62305196, and the Natural Science Foundation of Shandong Province, grant number ZR2019MD006.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AFPN-ResUNet	A Residual Attention Mechanism-Guided Asymptotic Feature Pyramid Network
Grad-CAM++	Gradient-weighted Class Activation Mapping
RE-CBAM	A ResNet Encoder enhanced by the CBAM
CBAM	Convolutional Block Attention Module
AFPN	Asymptotic Feature Pyramid Network
ASFF	Adaptive Spatial Feature Fusion
CNNs	Convolutional Neural Networks
IFES	Initial Feature Extraction Stem
mIoU	Mean intersection-over-union
GSD	Ground Sampling Distance
UAVs	Unmanned Aerial Vehicles
FPNs	Feature Pyramid Networks
ViTs	Vision Transformers
BN	Batch Normalization
ReLU	Rectified Linear Unit
L_dice	Dice Loss
Dec_i	Decoder i

References

Davies, N.S.; Veenma, Y.P.; Craig, J.A.; Allport, H.A.; McMahon, W.J.; Shillito, A.P. Time, space and synoptic topography: How to read outcrops as a granular record of Earth history. Geol. Soc. Lond. Spec. Publ. 2025, 556, 15–71.
Dong, Z.; Tang, P.; Chen, G.; Yin, S. Synergistic application of digital outcrop characterization techniques and deep learning algorithms in geological exploration. Sci. Rep. 2024, 14, 22948.
Thiele, S.T.; Lorenz, S.; Kirsch, M.; Contreras Acosta, I.C.; Tusa, L.; Herrmann, E.; Möckel, R.; Gloaguen, R. Multi-scale, multi-sensor data integration for automated 3-D geological mapping. Ore Geol. Rev. 2021, 136, 104252.
Wang, M.; Wang, C.; Wang, E.; Liu, X.; Lu, Y. HVPS-DFN-DL: Intelligent capture and characterization of geological fracture outcrops based on a hybrid vision-photogrammetric system and discrete fracture network. J. Ind. Inf. Integr. 2024, 42, 100685.
Perozzo, M.; Menegoni, N.; Crispini, L.; Federico, L.; Seno, S.; Maino, M. Quantitative characterization of fracture network in large sheath-fold: Field and UAV-based digital outcrop model analysis (Ligurian Alps, Italy). J. Struct. Geol. 2025, 201, 105551.
Brigaud, B.; Vincent, B.; Pagel, M.; Gras, A.; Noret, A.; Landrein, P.; Huret, E. Sedimentary architecture, depositional facies and diagenetic response to intracratonic deformation and climate change inferred from outcrops for a pivotal period (Jurassic/Cretaceous boundary, Paris Basin, France). Sediment. Geol. 2018, 373, 48–76.
Juliani, C.; Juliani, E. Deep learning of terrain morphology and pattern discovery via network-based representational similarity analysis for deep-sea mineral exploration. Ore Geol. Rev. 2021, 129, 103936.
Pei, S.; Fan, T.; Shen, J.; Zhang, X.; Du, X. CBAM-U-Net and SCT-based outcrop fracture extraction and connectivity characterization with limited labeled data. Geoenergy Sci. Eng. 2025, 254, 214057.
Wang, Z.; Zuo, R. Intelligent Lithological Mapping: Challenges and Future Prospective. Nat. Resour. Res. 2026, 35, 279–312.
Xu, Y.; Zuo, R. Geochemical survey data cube: A useful tool for lithological classification and geochemical anomaly identification. Geochemistry 2024, 84, 125959.
Gan, B.; Jing, R.; Shao, Y.; Liu, Y.; Duan, X.; Li, P.; Li, L. 3D point cloud lithology identification based on stratigraphically constrained continuous clustering. Sci. Rep. 2025, 15, 34988.
Wu, S.; Wang, Q.; Zeng, Q.; Zhang, Y.; Shao, Y.; Deng, F.; Liu, Y.; Wei, W. Automatic extraction of outcrop cavity based on a multiscale regional convolution neural network. Comput. Geosci. 2022, 160, 105038.
Diao, M.; Liu, K.; Wang, S.; Zhang, C. Research on Intelligent and High-Precision Structure-Recognition Methods for Field Geological Outcrop Images. IET Image Process. 2025, 19, e70087.
Noguchi, R.; Shoji, D. Extraction of stratigraphic exposures on visible images using a supervised machine learning technique. Front. Earth Sci. 2023, 11, 1264701.
Aabø, T.M.; Oldfield, S.J.; Yuan, H.; Kammann, J.; Sørensen, E.V.; Stemmerik, L.; Nielsen, L. Establishing a High Resolution 3D Fracture Dataset in Chalk: Possibilities and Obstacles Working with Outcrop Data. In Geomechanical Controls on Fracture Development in Chalk and Marl in the Danish North Sea: Understanding and Predicting Fracture Systems; Welch, M.J., Lüthje, M., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 9–46.
Menegoni, N.; Giordan, D.; Perotti, C.; Tannant, D.D. Detection and geometric characterization of rock mass discontinuities using a 3D high-resolution digital outcrop model generated from RPAS imagery—Ormea rock slope, Italy. Eng. Geol. 2019, 252, 145–163.
de Roda Husman, S.; Lhermitte, S.; Bolibar, J.; Izeboud, M.; Hu, Z.; Shukla, S.; van der Meer, M.; Long, D.; Wouters, B. A high-resolution record of surface melt on Antarctic ice shelves using multi-source remote sensing data and deep learning. Remote Sens. Environ. 2024, 301, 113950.
Yin, S.-L.; Wu, Y.-X.; Zhu, B.-Y.; Cheng, L.-L.; Zhao, J.-W.; Chen, W.-C. 3-D tight sandstone gas outcrop simulation based on unmanned aerial vehicle oblique photography data—A case study from the Pingtouxiang outcrop in North Shanxi, China. Unconv. Resour. 2023, 3, 93–102.
da Silva Bomfim, L.; Soares, M.V.T.; Vidal, A.C.; Pedrini, H. Geological reservoir characterization tasks based on computer vision techniques. Mar. Pet. Geol. 2025, 173, 107231.
Kazemi, A. A comprehensive review of computer vision for reservoir modelling and data assimilation. Discov. Appl. Sci. 2025, 7, 1399.
Li, S.; Huang, C. Using convolutional neural networks for image semantic segmentation and object detection. Syst. Soft Comput. 2024, 6, 200172.
Gui, B.; Sam, L.; Bhardwaj, A.; Gómez, D.S.; Peñaloza, F.G.; Buchroithner, M.F.; Green, D.R. SAGRNet: A novel object-based graph convolutional neural network for diverse vegetation cover classification in remotely-sensed imagery. ISPRS J. Photogramm. Remote Sens. 2025, 227, 99–124.
Chen, C.; Mat Isa, N.A.; Liu, X. A review of convolutional neural network based methods for medical image classification. Comput. Biol. Med. 2025, 185, 109507.
Liu, J.; Wei, Z.; Gong, X.; Sun, M.; Cheng, Y.; Zhang, Y.; Zhang, Z. Lithological Mapping from UAV Imagery Based on Lightweight Semantic Segmentation Methods. Drones 2025, 9, 866.
Marques, A.; Racolte, G.; Souza, E.M.d.; Domingos, H.V.; Horota, R.K.; Motta, J.G.; Zanotta, D.C.; Cazarin, C.L.; Gonzaga, L.; Veronez, M.R. Deep Learning Application for Fracture Segmentation Over Outcrop Images from UAV-Based Digital Photogrammetry. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 4692–4695.
Zhu, X.; Li, M.; Lan, Z.; Chen, J.; Li, Z.; Li, K. RockNet: Deep progressive lithology recognition model based on feature saliency and fusion. Neurocomputing 2025, 616, 128898.
Jing, R.; Shao, Y.; Zeng, Q.; Liu, Y.; Wei, W.; Gan, B.; Duan, X. Multimodal feature integration network for lithology identification from point cloud data. Comput. Geosci. 2025, 194, 105775.
Qin, S.; Wang, Q.; Zeng, Q.; Ye, M.; Fu, A.; Chen, G. Automatic recognition of debris rock lithology based on unsupervised semantic segmentation. Comput. Geosci. 2025, 196, 105790.
Wang, Z.; Zuo, R.; Liu, H. Lithological Mapping Based on Fully Convolutional Network and Multi-Source Geological Data. Remote Sens. 2021, 13, 4860.
He, K.; Feng, R.; Zhang, Z.; Dong, Y. Remote Sensing Interpretation of Geological Elements via a Synergistic Neural Framework with Multi-Source Data and Prior Knowledge. Remote Sens. 2025, 17, 2772.
Hu, Y.; Deng, N.; Ye, F.; Zhang, Q.; Yan, Y. Rock Surface Crack Recognition Based on Improved Mask R-CNN with CBAM and BiFPN. Buildings 2025, 15, 3516.
Jogin, M.; Mohana; Madhulika, M.S.; Divya, G.D.; Meghana, R.K.; Apoorva, S. Feature Extraction using Convolution Neural Networks (CNN) and Deep Learning. In Proceedings of the 2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, India, 18–19 May 2018; pp. 2319–2323.
Guo, J.; Wan, Y.; Wang, J.; Ma, A.; Zhong, Y. DisasterKD: Frequency-guided cross-decoder knowledge distillation for UAV real-time disaster damage assessment. Pattern Recognit. 2026, 174, 112968.
Yu, H. FHSA: Frequency-Guided Remote Sensing Image Denoising Based on Hybrid Sequence Attention. In Proceedings of the 2025 11th International Conference on Communication and Information Processing, Hainan, China, 12–15 November 2025; pp. 78–83.
Li, X.; Xu, F.; Li, J.; Su, Y.; Li, L.; Lyu, X.; Xu, Z.; Kaup, A. Frequency domain-enhanced spectral-spatial fusion transformer for semantic segmentation of remote sensing images. Inf. Fusion 2026, 132, 104248.
Liu, Q.; Dong, L.; Zeng, Z.; Zhu, W.; Zhu, Y.; Meng, C. SSD with multi-scale feature fusion and attention mechanism. Sci. Rep. 2023, 13, 21387.
Xu, H.; Wang, L.; Shu, B.; Zhang, Q.; Li, X. Automatic Detection of Landslide Surface Cracks from UAV Images Using Improved U-Network. Remote Sens. 2025, 17, 2150.
Chen, N.; Yang, R.; Zhao, Y.; Dai, Q.; Wang, L. Remote Sensing Image Segmentation Network That Integrates Global–Local Multi-Scale Information with Deep and Shallow Features. Remote Sens. 2025, 17, 1880.
Liang, Z.; Li, M. ScaleRSNet: Advancing Remote Sensing Image Segmentation With Multi-Scale Contextual Attention Mechanisms. IEEE Access 2026, 14, 2188–2205.
Liu, J.; Li, L.; Zhao, X.; Lv, M.; Jia, Z.; Zhang, X.; Vivone, G.; Ma, H. CMNet: Global–Local Feature Fusion CNN-Mamba Network for Remote Sensing Object Detection. Remote Sens. 2026, 18, 591.
Zhang, Z.; Wang, M.; Qi, Y.; Su, X.; Kong, D. Deep Learning-Based Methods for Lithology Classification and Identification in Remote Sensing Images. IEEE Access 2025, 13, 3038–3050.
Li, F.; Gu, Y.; Zhao, M.; Chen, D.; Wang, Q. GLMAFuse: A Dual-Stream Infrared and Visible Image Fusion Framework Integrating Local and Global Features with Multi-Scale Attention. Electronics 2024, 13, 5002.
Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. arXiv 2023, arXiv:2306.15988.
Gupta, A.K.; Mathur, P.; Sheth, F.; Travieso-Gonzalez, C.M.; Chaurasia, S. Advancing geological image segmentation: Deep learning approaches for rock type identification and classification. Appl. Comput. Geosci. 2024, 23, 100192.
Ren, D.; He, T.; Dong, H. Joint Cross-Consistency Learning and Multi-Feature Fusion for Person Re-Identification. Sensors 2022, 22, 9387.
Xu, H.; Zhou, L.; Huang, B.; Chen, S. Multi-feature fusion and multi-attention deep network for enhancing road extraction in remote sensing images. Eur. J. Remote Sens. 2024, 57, 2414008.

Figure 1. Location and geological background of the study area. (a) Location of the Zhuozishan outcrops in Inner Mongolia. (b) Field photo of the Zhuozishan outcrops.

Figure 2. Drone used for field data collection: the DJI Phantom 4 RTK (DJI Technology Co., Ltd., Shenzhen, China), equipped with an integrated high-resolution RGB sensor.

Figure 3. Workflow of data processing and model training.

Figure 4. AFPN-ResUNet overall architecture diagram.

Figure 5. Illustration of the RE-CBAM residual structure, which integrates channel-wise and spatial-wise attention sequentially.

Figure 6. Asymptotic Feature Pyramid Network architecture diagram. Orange arrows indicate upsampling, green arrows indicate downsampling, and black arrows indicate no processing.

Figure 7. Adaptive spatial feature fusion mechanism. The left panel depicts the weight generation branch, which computes normalized spatial weights (α, β, γ) for different levels. The right panel illustrates the weighted fusion process, allowing the network to adaptively filter conflicting information and optimally combine multi-scale features.

Figure 8. The loss curve and mIoU curve of the AFPN-ResUNet framework.

Figure 9. Demonstration of partial prediction results of the AFPN-ResUNet model under different outcrop conditions. The columns are grouped into four typical challenges: illumination (lighting variations), shelter (vegetation occlusion), interbedding (thin alternating layers), and crumbling (fragmented rock structures).

Figure 10. Visual analysis of the AFPN-ResUNet learning mechanism.

Figure 11. Asymptotic feature fusion mechanism visualization.

Figure 12. Comparison of segmentation performance among different models. The subfigures highlight specific environmental and structural interferences: (a) uneven illumination and shadows; (b) occlusions from weathered debris and vegetation; (c–e) extremely thin interbedded layers; and (f,g) severely weathered and crumbling outcrop regions. The yellow dotted frames are used to highlight areas with significant differences.

Figure 13. Comparison of segmentation results of different models in ablation experiments. The subfigures illustrate the performance of different models under complex outcrop environments: (a) uneven illumination, (b) vegetation occlusion, (c) thin alternating layers, and (d) crumbling outcrops. The yellow dotted frames are used to highlight areas with significant differences.

Table 1. Experimental parameters.

Basic Configuration	Value
Learning Rate	1 × 10⁻⁴
Optimizer	Adam
Max Epochs	150
Batch Size	4
Input Size	512 × 512 × 3
Class Weights	[1.5, 3.76, 0.8, 1.0, 1.0]
Random Horizontal/Vertical Flips	Probability = 0.5
Color Jitter	Brightness = 0.3; Contrast = 0.3; Saturation = 0.2; Hue = 0.1

Table 2. Comparison of segmentation performance metrics across mainstream models.

Model	mIoU (%)	F1-Score (%)	Recall (%)	Precision (%)	DSC (%)
AFPN-ResUNet	93.41	96.58	97.05	96.11	96.58
UNet	70.21	82.19	83.01	81.42	82.21
ViT	69.49	81.40	80.93	81.89	81.41
DeepLabV3+	81.01	89.31	89.75	88.93	89.34
PSPNet	81.03	89.35	91.12	87.89	89.48
SegNeXt	67.37	69.54	70.61	69.15	69.87

Table 3. Computational performance and speed metrics across the evaluated architectures.

Model	Params (M)	FLOPs (G)	Inference Time (ms)	FPS
AFPN-ResUNet	38.67	179.79	33.99	29.42
UNet	17.26	321.19	24.43	40.94
ViT	91.66	42.22	7.70	141.45
DeepLabV3+	40.35	138.93	18.88	52.98
PSPNet	30.39	222.63	25.32	39.50
SegNeXt	3.70	18.99	10.23	97.74

Table 4. Design of ablation experimental models. “Baseline” refers to the standard ResUNet architecture without the proposed RE-CBAM and AFPN modules. “√” indicates that the module is included, and “×” indicates that it is not included.

Model	RE-CBAM	AFPN
Baseline	×	×
Baseline + RE-CBAM	√	×
Baseline + AFPN	×	√
AFPN-ResUNet	√	√

Table 5. Segmentation performance metrics of different models in ablation experiments.

Model	mIoU (%)	F1-Score (%)	Recall (%)	Precision (%)	DSC (%)
Baseline	78.07	87.49	90.05	85.18	87.55
Baseline + RE-CBAM	91.18	95.34	95.76	94.94	95.35
Baseline + AFPN	92.05	95.89	96.43	96.72	95.67
AFPN-ResUNet	93.41	96.58	97.05	96.11	96.58

Table 6. Operational efficiency metrics of different ablation experiments.

Model	Params (M)	FLOPs (G)	Inference Time (ms)	FPS
Baseline	73.28	304.62	24.10	41.49
Baseline + RE-CBAM	75.80	304.88	30.44	32.85
Baseline + AFPN	36.16	179.49	22.41	44.63
AFPN-ResUNet	38.67	179.79	33.99	29.42
