2.1. Base Network
This study employs Swin-Unet as the foundational architecture, which deeply integrates the hierarchical feature fusion advantages of classic UNet with the efficient global modeling capability of Swin Transformer. Swin Transformer achieves a balance between local window partitioning and cross-window dependency modeling through the synergistic design of Window-Based Multi-Head Self-Attention (W-MSA) and Shifted Window-Based Multi-Head Self-Attention (SW-MSA), enabling hierarchical capture of multi-scale contextual information [
13].
As a classic framework in image signal processing, UNet’s symmetric encoder–decoder architecture extracts multi-scale semantic features via convolutional downsampling (encoder), restores resolution through deconvolutional upsampling (decoder), and directly fuses shallow spatial details with deep semantic features via skip connections, establishing a hierarchical feature representation system [
14].
Building upon UNet’s encoder–decoder framework and skip connection mechanism, Swin-Unet embeds Transformer modules into the feature extraction stage, forming a composite processing pipeline of “local feature extraction—global contextual modeling—multi-scale feature fusion”. This architecture not only inherits UNet’s sensitivity to detailed features but also enhances long-range dependency modeling through Transformer’s attention mechanisms, offering a technically sound solution for seismic profile reconstruction that balances accuracy and efficiency. The overall architecture is illustrated in
Figure 1 (overall architecture diagram of Swin-Unet).
The overall architecture of Swin-Unet in this study consists of four components: encoder, bottleneck, decoder, and three groups of skip connections. The fundamental computational unit is the Swin Transformer block.
In the encoder stage, spatial partitioning of seismic profile images is performed to convert input data into Sequence Embeddings, with a block size set to 4 × 4 pixels [15]. This partitioning strategy calculates the feature dimension of a single image block as 4 × 4 × 3 (number of channels) = 48 dimensions, thereby transforming the original $H \times W \times 3$ input data into a patch-token representation of size
$$\frac{H}{4} \times \frac{W}{4} \times 48.$$
Furthermore, the linear embedding layer is applied to project the input into an arbitrary dimension, denoted $C$, yielding features of size
$$\frac{H}{4} \times \frac{W}{4} \times C.$$
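To make this partitioning step concrete, a minimal PyTorch-style sketch of the 4 × 4 patch partition and linear embedding is given below; the class name, embedding dimension, and the use of a strided convolution are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative 4 x 4 patch partition followed by a linear embedding."""
    def __init__(self, patch_size=4, in_channels=3, embed_dim=96):
        super().__init__()
        # A stride-4 convolution is equivalent to cutting the image into 4x4 blocks
        # (4 * 4 * 3 = 48 values each) and applying a shared linear projection to C dims.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, H/4 * W/4, C) patch tokens

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96])
```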
The transformed patch tokens are processed through multi-stage Swin Transformer modules and Patch Merging layers to generate Hierarchical Feature Representations. The Patch Merging layer achieves feature dimension expansion via downsampling operations, while Swin Transformer modules are responsible for deep feature representation learning.
The Transformer decoder, inspired by the U-Net architecture, is composed of Swin Transformer modules and Patch Expanding layers. Cross-scale fusion of Context Features and multi-scale encoder features is achieved via Skip Connections, effectively compensating for spatial information loss during downsampling. In contrast to the downsampling mechanism of Patch Merging, the Patch Expanding layer performs upsampling operations, achieving 2× resolution enhancement through neighborhood dimension reshaping. The final Patch Expanding layer completes 4× upsampling to restore the input resolution ($H \times W$), and pixel-level segmentation predictions are generated via a Linear Projection Layer.
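Similarly, the Patch Merging downsampling step can be sketched as a 2 × 2 neighborhood concatenation followed by a linear channel reduction (Patch Expanding performs the inverse reshaping); the dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Sketch of Patch Merging: 2x spatial downsampling by regrouping 2x2 neighbors."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)  # 4C -> 2C channels

    def forward(self, x):                      # x: (B, H, W, C)
        # Concatenate the four pixels of every 2x2 neighborhood along the channel axis.
        x = torch.cat([x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :],
                       x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(x)               # (B, H/2, W/2, 2C)

print(PatchMerging(96)(torch.randn(1, 56, 56, 96)).shape)  # torch.Size([1, 28, 28, 192])
```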
The core architectural component of Swin-Unet is the Swin Transformer, a Transformer variant specifically optimized for visual tasks. Its key innovation lies in the synergistic design of Window-Based Multi-Head Self-Attention (W-MSA) and Shifted Window-Based Multi-Head Self-Attention (SW-MSA), which enables efficient global contextual modeling while maintaining linear computational complexity [
16]. Specifically, W-MSA partitions features into non-overlapping local windows to capture intra-window dependencies while reducing self-attention computational costs, whereas SW-MSA establishes cross-window feature interactions through window position shifting, achieving effective global context fusion without significant computation overhead.
The fundamental module of Swin Transformer consists of Layer Normalization (LN), Window-Based Multi-Head Self-Attention (W-MSA) units, Shifted Window-Based Multi-Head Self-Attention (SW-MSA) units, and Multi-Layer Perceptrons (MLP). Here, LN normalizes features to provide stable inputs for subsequent attention calculations [
17]. W-MSA and SW-MSA form a hierarchical feature extraction mechanism through alternating stacking, enabling joint modeling of local details and global dependencies. The MLP enhances feature representational power via nonlinear transformations. The specific architecture of this module is illustrated in
Figure 2 (architecture of the Swin Transformer block), while two consecutively stacked Swin Transformer units form a progressive contextual information aggregation system through hierarchical attention connections, as schematically depicted in Figure 3 (schematic diagram of two stacked Swin Transformer blocks).
Based on this window partitioning mechanism, two consecutive Swin Transformer blocks can be formulated as
$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1},$$
$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l},$$
$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1},$$
where $z^{l-1}$ denotes the input feature matrix of the $(l-1)$-th layer; $\hat{z}^{l}$ and $z^{l}$ represent the intermediate and output features of the $l$-th layer after W-MSA and MLP operations, respectively; $\hat{z}^{l+1}$ and $z^{l+1}$ are the intermediate and output features of the $(l+1)$-th layer after SW-MSA and MLP operations; and LN stands for Layer Normalization.
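For reference, the four equations above can be written as a compact PyTorch-style sketch; the w_msa and sw_msa arguments stand in for window-based and shifted-window attention modules and are assumed placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

class ConsecutiveSwinBlocks(nn.Module):
    """Sketch of two consecutive Swin Transformer blocks (W-MSA block, then SW-MSA block)."""
    def __init__(self, dim, w_msa, sw_msa):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.w_msa, self.sw_msa = w_msa, sw_msa   # window / shifted-window attention modules
        self.mlp1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):                         # z: (B, L, C) patch tokens
        z_hat = self.w_msa(self.norm1(z)) + z     # \hat{z}^{l}
        z = self.mlp1(self.norm2(z_hat)) + z_hat  # z^{l}
        z_hat = self.sw_msa(self.norm3(z)) + z    # \hat{z}^{l+1}
        z = self.mlp2(self.norm4(z_hat)) + z_hat  # z^{l+1}
        return z

# Shape check with identity placeholders in place of the two attention modules.
blocks = ConsecutiveSwinBlocks(96, nn.Identity(), nn.Identity())
print(blocks(torch.randn(1, 3136, 96)).shape)  # torch.Size([1, 3136, 96])
```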
The Window-Based Multi-Head Self-Attention (W-MSA) serves as a core component of Transformer architecture. It reduces computational complexity by partitioning the input feature map into non-overlapping windows and computing multi-head self-attention within each window. The computational process of W-MSA is as follows:
Suppose the size of the input feature map is $H \times W \times C$, and it is divided into non-overlapping windows of size $M \times M$. The calculation formula for multi-head self-attention within each window is
$$\text{Attention}(Q, K, V) = \text{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V.$$
In self-attention mechanisms, Q (Query), K (Key), and V (Value) correspond to feature matching, similarity measurement, and information aggregation, respectively, where $d_k$ denotes the key dimension for dot-product scaling and the SoftMax function enables normalization. The Window-Based Multi-Head Self-Attention (W-MSA) controls computational complexity effectively while preserving local feature dependencies by performing attention calculations within non-overlapping local windows. As a core innovation of Swin Transformer, the Shifted Window-Based Multi-Head Self-Attention (SW-MSA) captures long-range dependencies by shifting feature windows right and down by half the window size, creating overlapping regions for cross-window pixel interaction. Although its fundamental equations align with W-MSA, the input feature map is shifted prior to computation:
$$\hat{X} = \text{Shift}\!\left(X, \left(\left\lfloor \tfrac{M}{2} \right\rfloor, \left\lfloor \tfrac{M}{2} \right\rfloor\right)\right).$$
The difference lies in the fact that the input feature map of SW-MSA undergoes this shift operation before the window is partitioned, thereby achieving cross-window attention calculation.
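The window mechanics can be illustrated with a short sketch of window partitioning and the half-window cyclic shift used before SW-MSA; the (B, H, W, C) tensor layout and the window size M = 7 are assumptions for illustration.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

M = 7                                       # window size
x = torch.randn(1, 56, 56, 96)              # (B, H, W, C) tokens on a 56 x 56 grid

# W-MSA: self-attention is computed independently inside each of the (56/7)^2 = 64 windows.
windows = window_partition(x, M)            # (64, 49, 96)

# SW-MSA: cyclically shift the map by half the window size before partitioning,
# so pixels near former window borders now fall into a common window.
shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
shifted_windows = window_partition(shifted, M)   # (64, 49, 96)
print(windows.shape, shifted_windows.shape)
```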
Layer Normalization (LN) is a commonly used regularization technique, which is used to accelerate training and improve the generalization ability of the model. Its formula is
$$\text{LN}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta,$$
Here, $x$ represents the input vector, $\mu$ is the input mean, $\sigma$ is the input standard deviation, and $\gamma$ and $\beta$ are learnable parameters for scaling and shifting. Layer Normalization is typically applied to the input and output of Transformer modules to maintain data distribution stability. The Multi-Layer Perceptron (MLP), a nonlinear transformation module in Swin Transformer, further maps features nonlinearly. The MLP is computed as follows
$$\text{MLP}(x) = \text{GELU}\left(x W_1 + b_1\right) W_2 + b_2,$$
where $W_1$ and $W_2$ are weight matrices, $b_1$ and $b_2$ are bias vectors, and GELU is the activation function. The MLP enhances the model’s expressive power through nonlinear transformations, enabling Swin Transformer to better model complex feature relationships. By integrating Layer Normalization (LN), Window-Based Multi-Head Self-Attention (W-MSA), Shifted Window-Based Multi-Head Self-Attention (SW-MSA), and Multi-Layer Perceptron (MLP), Swin Transformer significantly enhances global contextual modeling while maintaining computational efficiency. The synergistic effect of these components makes Swin Transformer an efficient and powerful feature extraction tool, particularly suitable for tasks such as image segmentation and signal reconstruction.
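As a small numerical check of the two formulas above (assuming a PyTorch backend), the manual LN computation matches nn.LayerNorm, and the MLP is sketched as the usual two-layer GELU network; the 4× hidden expansion is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 96)                              # four token vectors of dimension 96

# LN(x) = gamma * (x - mu) / sigma + beta, computed per token over the feature dimension.
mu = x.mean(dim=-1, keepdim=True)
sigma = x.var(dim=-1, keepdim=True, unbiased=False).sqrt()
ln_manual = (x - mu) / sigma                        # gamma = 1, beta = 0
ln_torch = nn.LayerNorm(96, eps=0.0, elementwise_affine=False)(x)
print(torch.allclose(ln_manual, ln_torch, atol=1e-5))   # True

# MLP(x) = GELU(x W1 + b1) W2 + b2
mlp = nn.Sequential(nn.Linear(96, 384), nn.GELU(), nn.Linear(384, 96))
print(mlp(ln_torch).shape)                          # torch.Size([4, 96])
```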
2.3. Swin-ReshoUnet Model with Hierarchical Convolution and Regional Coordinate Attention Enhancement
The architecture of the improved Swin-ReshoUnet model, as shown in Figure 5 (network structure diagram of Swin-ReshoUnet), continues the encoder–decoder symmetric structure of Swin-Unet, with its core innovations embodied in the synergistic optimization of three technical dimensions: the encoder stage constructs a hierarchical feature extraction system with multi-scale receptive fields by using a hierarchical convolution module for pre-feature enhancement of the Swin-Transformer units; the decoder replaces the traditional window attention module with the Overall Regional Coordinate Attention Mechanism (ORCA) to achieve balanced optimization of computational complexity and long-range dependency modeling capability; and the skip connections embed a dual-channel Residual Channel Attention Mechanism (RCAM), in which the dynamic allocation of feature channel weights and the upgrade of the nonlinear activation function (SiLU instead of ReLU) significantly improve the gradient conduction efficiency and feature expression ability of deep networks. Through the above iterative optimizations, this architecture retains the inherent advantage of multi-scale feature fusion in U-shaped networks and forms a three-level optimization mechanism: “local feature refinement–global dependency modeling–cross-layer information enhancement”. This mechanism synergistically enhances the ability to capture fine geological features, model long-range structural correlations, and mitigate cross-layer feature mismatch, thereby providing a structured technical solution for high-fidelity reconstruction of seismic profile signals.
In the Swin-Transformer modules of the encoder, hierarchical convolution is introduced before W-MSA, while in the decoder’s Swin-Transformer modules, the original W-MSA and SW-MSA are replaced with the ORCA module.
Figure 6—Structures of Swin-Transformer block, Swin HT block, and Swin OT block—illustrates the original Swin-Transformer architecture, the Swin HT module with embedded hierarchical convolution, and the Swin OT module where ORCA substitutes for W-MSA and SW-MSA. This modification enhances multi-scale feature extraction in the encoder and optimizes long-range dependency modeling in the decoder, as visually compared in the figure.
2.3.1. Hierarchical Dilated Convolutions
In the research on seismic profile data reconstruction, we adopt a hierarchical convolution approach to extract multi-scale feature information through convolutional operations at different levels. Specifically, the network architecture is divided into four stages, each employing distinct convolutional kernels and dilation rates to effectively capture features at various scales in seismic profile data [
18,
19,
20]. This hierarchical convolution mechanism is schematically illustrated in
Figure 7 (hierarchical convolution structure diagram), which demonstrates how multi-level receptive fields enable comprehensive characterization of geological structures with diverse spatial resolutions.
Level 1 employs 1 × 1 convolution kernels to preserve spatial dimensions, focusing on extracting high-frequency shallow features. This operation rapidly captures detailed information, laying a foundation for subsequent feature extraction.
Level 2 utilizes 3 × 3 convolution kernels with a dilation rate of 6. This configuration expands the receptive field moderately without increasing computational cost, effectively capturing medium-scale geological structures.
Level 3 maintains 3 × 3 kernels but increases the dilation rate to 12. By further expanding the receptive field, this level captures broader contextual information, crucial for reconstructing complex geological features in seismic data.
Level 4 uses 3 × 3 kernels with a dilation rate of 18, providing the largest receptive field to capture low-frequency features distributed across the seismic profile, thereby enhancing overall reconstruction quality [
21]. This hierarchical convolution design enables the network to extract multi-scale features efficiently, significantly improving its ability to characterize geological structures with varying spatial resolutions. Compared to traditional convolution, this approach balances receptive field expansion and computational efficiency while mitigating information dilution. Mathematically, the feature extraction process of hierarchical convolution can be expressed as follows.
For the $l$-th layer, the convolution operation is formulated as
$$Y_k^{l} = W_k^{l} * X + b_k^{l}, \qquad k = 1, 2, \ldots, K,$$
where $W_k^{l}$ denotes the weight of the $k$-th convolution kernel in the $l$-th layer, $*$ is the (dilated) convolution operator, $X$ represents the input feature map, $b_k^{l}$ is the bias term, and $K$ is the number of convolution kernels in this layer [22].
Through convolutional operations at different layers, the network can gradually extract feature information from local details to global structures, enabling efficient reconstruction of seismic profile data. This hierarchical design not only enhances the model’s expressive power but also improves its adaptability to geological features at different scales, providing strong support for the accurate reconstruction of seismic profile data.
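To illustrate the four-level configuration described above (a 1 × 1 convolution, and 3 × 3 convolutions with dilation rates 6, 12, and 18), a minimal sketch is given below; the channel widths and the concatenation-based fusion are assumptions rather than the exact Swin-ReshoUnet settings.

```python
import torch
import torch.nn as nn

class HierarchicalDilatedConv(nn.Module):
    """Sketch of the four-level hierarchical (dilated) convolution block."""
    def __init__(self, in_ch=96, out_ch=96):
        super().__init__()
        self.level1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)                          # shallow detail
        self.level2 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=6, dilation=6)   # medium scale
        self.level3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=12, dilation=12) # broad context
        self.level4 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=18, dilation=18) # low-frequency trends
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, kernel_size=1)                       # assumed fusion layer

    def forward(self, x):                      # x: (B, C, H, W)
        feats = [self.level1(x), self.level2(x), self.level3(x), self.level4(x)]
        return self.fuse(torch.cat(feats, dim=1))

print(HierarchicalDilatedConv()(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```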
2.3.2. Overall Regional Coordinate Attention Module
In the task of seismic profile data reconstruction, the ORCA (Overall Regional Coordinate Attention) module is specifically designed to address two key limitations of traditional convolutional neural networks in this domain: their fixed receptive fields struggle to adapt to the multi-scale geological features (e.g., thin layers vs. broad formations) in seismic signals, and they have limited capacity to model long-range structural correlations (e.g., fault continuity and stratum trends). Traditional methods often struggle to simultaneously capture global information in both the height and width dimensions, restricting feature representation capability. To this end, the ORCA module performs global average pooling and max pooling along the height and width directions, respectively, to effectively capture multi-dimensional global information and enhance the comprehensiveness of feature extraction [
23].
Moreover, considering the importance of disparities of features across different channels and spatial locations, the ORCA module introduces an attention mechanism. By generating attention maps for height and width dimensions, it weights the input feature maps to enhance key features and suppress trivial ones [
24,
25]. To tackle the computational complexity and memory overhead of high-resolution images and large-scale data, the ORCA module employs grouped processing, dividing the input feature maps into groups along the channel dimension to reduce per-group computation while maintaining feature diversity.
By integrating multi-dimensional global information with attention mechanisms, ORCA not only leverages global context from both height and width but also boosts the model’s ability to capture complex geological structures in seismic profiles [
26], significantly improving reconstruction accuracy and robustness. This design enables the ORCA module to excel in seismic profile reconstruction, efficiently handling multi-scale features for high-quality results. The architecture of the ORCA module is illustrated in
Figure 8 (ORCA module structure diagram).
The ORCA module enhances feature representation by capturing global information along both height and width dimensions of the feature map, generating corresponding attention maps to weight the input feature maps. This design enables ORCA to focus on critical regions, improving the attention mechanism’s effectiveness and ultimately achieving higher-quality feature extraction and model performance [
27]. The module consists of five main components, which are denoted by different colors in the structure diagram.
The first component is feature map grouping, represented by the yellow box in the diagram. The input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ is divided into $G$ groups along the channel dimension, with each group containing $C/G$ channels, where $B$ denotes the batch size, $C$ the number of channels, and $H$, $W$ the height and width of the feature map. The grouped feature maps are denoted as $X_g \in \mathbb{R}^{B \times (C/G) \times H \times W}$, $g = 1, 2, \ldots, G$.
The second component is global pooling processing, represented by the green box in the diagram, where global average pooling and global max pooling operations are performed on the grouped feature maps along the height and width directions, respectively:
$$z^{h}_{avg} = \frac{1}{W}\sum_{j=1}^{W} X_g(:,:,:,j), \qquad z^{h}_{max} = \max_{1 \le j \le W} X_g(:,:,:,j),$$
$$z^{w}_{avg} = \frac{1}{H}\sum_{i=1}^{H} X_g(:,:,i,:), \qquad z^{w}_{max} = \max_{1 \le i \le H} X_g(:,:,i,:).$$
Here, $X_g$ is the (grouped) input feature map; $z^{h}_{avg}$ and $z^{h}_{max}$ are the horizontal average/max pooling results, with the spatial dimension compressed along the width; $z^{w}_{avg}$ and $z^{w}_{max}$ are the vertical average/max pooling results, with the spatial dimension compressed along the height. The third component is shared convolutional layers, indicated by the red box in the diagram. For each grouped feature map, we apply shared convolutional layers for feature processing. The shared convolutional layers consist of two 1 × 1 convolutional layers, batch normalization layers, and ReLU activation functions, which are used to reduce and restore the channel dimensions:
$$\tilde{z}^{h}_{avg} = \text{Conv}\!\left(z^{h}_{avg}\right), \quad \tilde{z}^{h}_{max} = \text{Conv}\!\left(z^{h}_{max}\right), \quad \tilde{z}^{w}_{avg} = \text{Conv}\!\left(z^{w}_{avg}\right), \quad \tilde{z}^{w}_{max} = \text{Conv}\!\left(z^{w}_{max}\right).$$
Here, $\text{Conv}(\cdot)$ is the shared convolution operation, used to refine spatial features; $\tilde{z}^{h}_{avg}$ and $\tilde{z}^{h}_{max}$ are the convolved results of the horizontal average/max pooling features; $\tilde{z}^{w}_{avg}$ and $\tilde{z}^{w}_{max}$ are the convolved results of the vertical average/max pooling features; the tensor dimensions of the convolved outputs remain consistent with the corresponding pooling results. The fourth component is attention weight computation, shown as the blue box in the diagram. By summing the outputs of the convolutional layers and applying the Sigmoid activation function (where $\sigma$ denotes the Sigmoid function), attention weights for both height and width directions are generated:
$$A^{h} = \sigma\!\left(\tilde{z}^{h}_{avg} + \tilde{z}^{h}_{max}\right), \qquad A^{w} = \sigma\!\left(\tilde{z}^{w}_{avg} + \tilde{z}^{w}_{max}\right).$$
Here, $A^{h}$ and $A^{w}$ are the horizontal/vertical attention maps, which assign weights to different spatial regions; $\sigma$ is the activation function, used to normalize the attention weights to the range [0, 1]; $\tilde{z}^{h}_{avg} + \tilde{z}^{h}_{max}$ and $\tilde{z}^{w}_{avg} + \tilde{z}^{w}_{max}$ are fusions of the convolved average and max pooling features, integrating global context and local saliency. The tensor dimensions of $A^{h}$ and $A^{w}$ are consistent with the input pooling results, ensuring compatibility with the original feature map for recalibration. The fifth component is the application of attention weights, denoted by the purple section in the diagram. The input feature maps are weighted by the computed attention weights to obtain the output feature maps:
$$Y_g = X_g \odot A^{h} \odot A^{w}.$$
Here, the attention weights $A^{w}$ and $A^{h}$ are expanded along the height and width dimensions, respectively, to match the spatial dimensions of the input feature map.
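Combining the five components, an ORCA-style block can be sketched as follows; the group count, reduction ratio, and the exact ordering of normalization and activation are assumptions based on the description above, not the released implementation.

```python
import torch
import torch.nn as nn

class ORCA(nn.Module):
    """Sketch of the Overall Regional Coordinate Attention (ORCA) block."""
    def __init__(self, channels, groups=4, reduction=4):
        super().__init__()
        self.groups = groups
        c = channels // groups
        # Shared 1x1 convolution stack: reduce and then restore the channel dimension.
        self.shared = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1), nn.BatchNorm2d(c // reduction), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.BatchNorm2d(c))

    def forward(self, x):                                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        xg = x.view(B * self.groups, C // self.groups, H, W)     # 1) channel grouping
        # 2)-3) avg/max pooling along width (horizontal) and height (vertical), shared convs
        zh = self.shared(xg.mean(3, keepdim=True)) + self.shared(xg.amax(3, keepdim=True))
        zw = self.shared(xg.mean(2, keepdim=True)) + self.shared(xg.amax(2, keepdim=True))
        # 4) Sigmoid attention weights for the two directions
        ah, aw = torch.sigmoid(zh), torch.sigmoid(zw)            # (BG, c, H, 1), (BG, c, 1, W)
        # 5) broadcast the weights over width/height and recalibrate the input
        return (xg * ah * aw).view(B, C, H, W)

print(ORCA(96)(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```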
2.3.3. Residual Channel Attention Mechanism Module
After incorporating hierarchical convolution in the encoder and replacing the window attention with the ORCA module in the decoder, we observed gradient vanishing. To address this, a Residual Channel Attention Mechanism (RCAM) was introduced into the skip connections of Swin-Unet [
28]. The residual structure resolves gradient vanishing via identity mapping, enabling the network to learn input-output differences rather than directly fitting complex transformations. In seismic signal reconstruction, this enhances high-frequency detail capture and mitigates performance degradation in deep networks. The standard residual block is formulated as
$$y = F\!\left(x, \{W_i\}\right) + x,$$
where $x$ denotes the input feature map, $y$ the output feature map, and $F(x, \{W_i\})$ represents the residual function with $\{W_i\}$ as its parameters.
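A minimal sketch of such a residual block is shown below; the two-convolution residual branch and the SiLU activation (following the activation upgrade mentioned above) are illustrative choices.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: y = F(x, {W_i}) + x."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(                     # residual function F(x, {W_i})
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return self.residual(x) + x                        # identity mapping keeps gradients flowing

print(ResidualBlock(48)(torch.randn(1, 48, 32, 32)).shape)  # torch.Size([1, 48, 32, 32])
```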
Channel attention mechanism enhances feature representation by learning importance weights for each channel of the feature map, suppressing irrelevant channels and enhancing target channels [
29]. In seismic profile reconstruction, this mechanism adaptively weights signal channels with different frequencies and phases, improving the signal-to-noise ratio of reconstructed signals.
Traditional channel attention typically employs only global average pooling (GAP) to aggregate channel information, whereas RCAM introduces global max pooling (GMP) to form a dual-channel structure, capturing both global statistical features and salient characteristics of channels [
30,
31,
32]. Specifically, global average pooling calculates the mean value of all spatial locations within a channel, reflecting the overall response intensity of the channel:
$$z_c^{avg} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j).$$
Global max pooling highlights local peaks within channels by extracting the maximum values:
$$z_c^{max} = \max_{1 \le i \le H,\; 1 \le j \le W} x_c(i, j),$$
where $x_c(i, j)$ denotes the feature value of the $c$-th channel at position $(i, j)$, and $H \times W$ represents the spatial dimensions of the feature map. Finally, the dual-channel outputs are concatenated and processed through a Multi-Layer Perceptron (MLP) to generate channel weights:
$$M_c = \sigma\!\left(W_2\,\text{ReLU}\!\left(W_1\left[z^{avg};\, z^{max}\right]\right)\right).$$
Here $W_1 \in \mathbb{R}^{(C/r) \times 2C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$ are MLP weight matrices, $r$ is the dimensionality reduction ratio, and $\sigma$ is the Sigmoid activation function that normalizes the weights to [0, 1]. The final channel attention output is
$$Y = M_c \odot X,$$
where $\odot$ denotes channel-wise multiplication.
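The dual-pooling channel attention described above can be sketched as follows; the reduction ratio and the ReLU between the two linear layers are assumptions.

```python
import torch
import torch.nn as nn

class DualPoolChannelAttention(nn.Module):
    """Sketch of dual-channel (average- plus max-pooled) channel attention."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(                        # W1, ReLU, W2 on the concatenated vector
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                                # x: (B, C, H, W)
        z_avg = x.mean(dim=(2, 3))                       # global average pooling -> (B, C)
        z_max = x.amax(dim=(2, 3))                       # global max pooling     -> (B, C)
        m = torch.sigmoid(self.mlp(torch.cat([z_avg, z_max], dim=1)))  # channel weights M_c
        return x * m[:, :, None, None]                   # channel-wise re-weighting

print(DualPoolChannelAttention(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```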
The schematic diagram of the channel attention mechanism is shown in
Figure 9 (schematic diagram of the channel attention mechanism).
In the skip connections of Swin UNet, RCAM (Residual Channel Attention Mechanism) can be embedded either inside the residual blocks or within the skip paths, whose core structure is
$$y = x + M_c\!\left(F(x)\right) \odot F(x),$$
where $F(x)$ is the residual function, and $M_c(\cdot)$ is the channel attention module that performs channel-wise weighting on the residual features.
The complete procedure includes residual feature extraction, dual-channel global pooling, channel attention weight computation, and attention weighting with a residual connection, which are expressed by Equations (31)–(34), respectively:
$$u = F(x),$$
$$z^{avg} = \text{GAP}(u), \qquad z^{max} = \text{GMP}(u),$$
$$M_c = \sigma\!\left(W_2\,\text{ReLU}\!\left(W_1\left[z^{avg};\, z^{max}\right]\right)\right),$$
$$y = x + M_c \odot u.$$
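A sketch of the overall RCAM wiring on a skip connection is given below; it reuses the DualPoolChannelAttention sketch above, and the composition of the residual branch is an assumption rather than the exact published structure.

```python
import torch
import torch.nn as nn

class RCAM(nn.Module):
    """Sketch of the Residual Channel Attention Mechanism: y = x + M_c(F(x)) * F(x)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.residual = nn.Sequential(                   # residual function F(x)
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.attention = DualPoolChannelAttention(channels, reduction)  # defined in the sketch above

    def forward(self, x):                                # x: skip-connection feature (B, C, H, W)
        f = self.residual(x)                             # residual feature extraction
        return x + self.attention(f)                     # channel-weighted residual + identity

print(RCAM(96)(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```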
The schematic diagram of the residual channel attention mechanism is shown in
Figure 10 (structure diagram of the residual channel attention mechanism).