Article

Identification of Environmental Noise Traces in Seismic Recordings Using Vision Transformer and Mel-Spectrogram

1 College of Geophysics, China University of Petroleum (Beijing), Beijing 102249, China
2 Sinopec Geophysical Corporation Southern Branch, Chengdu 610213, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8586; https://doi.org/10.3390/app15158586
Submission received: 2 July 2025 / Revised: 29 July 2025 / Accepted: 31 July 2025 / Published: 1 August 2025

Abstract

Environmental noise is inevitable during seismic data acquisition, with major sources including heavy machinery, rivers, wind, and other environmental factors. During field data acquisition, it is important to assess the impact of environmental noise and evaluate data quality. In subsequent seismic data processing, these noise components also need to be eliminated. Accurate identification of noise traces facilitates rapid quality control (QC) during fieldwork and provides a reliable basis for targeted noise attenuation. Conventional environmental noise identification primarily relies on amplitude differences. However, in seismic data, high-amplitude signals are not necessarily caused by environmental noise. For example, surface waves or traces near the shot point may also exhibit high amplitudes. Therefore, relying solely on amplitude-based criteria has certain limitations. To improve noise identification accuracy, we use the Mel-spectrogram to extract features from seismic data and construct the dataset. Compared to raw time-series signals, the Mel-spectrogram more clearly reveals energy variations and frequency differences, helping to identify noise traces more accurately. We then employ a Vision Transformer (ViT) network to train a model for identifying noise in seismic data. Tests on synthetic and field data show that the proposed method performs well in identifying noise. Moreover, a denoising case based on synthetic data further confirms its general applicability, making it a promising tool in seismic data QC and processing workflows.

1. Introduction

Environmental noise is unavoidable during seismic data acquisition. This type of noise typically originates from external sources such as heavy machinery, natural wind, flowing rivers and other anthropogenic or environmental disturbances [1]. When noise sources are located near geophones, their vibrations are often captured and typically exhibit stronger amplitudes than subsurface reflections. Some sources, such as passing vehicles or wind, can cause continuous disturbances over extended periods and broad areas. Others, like workers’ footsteps, are brief and sporadic, leading to localized interference. Consequently, environmental noise often shows high amplitude and irregular temporal patterns, which can significantly degrade signal quality [2]. During field acquisition, real-time noise monitoring is important for ensuring the collection of high-quality seismic data [3,4]. If data quality is severely compromised, re-acquisition may be necessary. Meanwhile, quality control (QC) should be performed on the collected data to ensure their reliability and usability for downstream processing [5]. In addition to field-level monitoring, accurate noise identification plays a critical role in seismic data processing. Based on the results of noise detection, targeted and precise noise attenuation techniques can be applied to suppress contaminating energy without damaging useful signals [6,7]. Therefore, accurate identification of noise-contaminated traces is essential not only for effective noise monitoring and QC during the acquisition process, but also for improving the quality and interpretability of seismic images in subsequent data processing workflows.
The identification of noise traces is essentially a pattern recognition problem involving seismic waveforms [8]. In oil and gas seismic exploration, waveform-based pattern recognition is primarily used for seismic phase classification, with a focus on analyzing subsurface geological structures such as faults, horizons, and stratigraphic features [9,10,11]. In natural earthquake research, it is mainly applied to the detection of seismic events [12,13,14]. These approaches demonstrate the power of pattern recognition in identifying meaningful structures or events within complex seismic signals. However, research on noise trace identification in raw field seismic data remains limited. These pattern recognition techniques identify waveforms by extracting features such as amplitude, frequency, and phase from seismic data. Some studies also incorporate techniques from related fields such as acoustics and speech processing [15,16]. Noise trace identification must detect and characterize unwanted or anomalous patterns that may vary significantly in amplitude, frequency content, and temporal attenuation. Most existing methods for anomalous noise identification still rely primarily on amplitude difference information in the original data [17], which can be insufficient for distinguishing between valid high-amplitude reflections and true noise anomalies. This limitation underscores the need for more robust, data-driven approaches that can exploit richer feature representations beyond simple amplitude thresholds.
In recent years, machine learning–based methods for waveform classification have advanced rapidly, achieving notable improvements in both automation and accuracy [18,19]. These methods can be divided into unsupervised and supervised approaches, depending on the learning strategy. Common unsupervised methods include K-means clustering [20,21,22] and self-organizing maps (SOMs) [23,24,25,26], which have the advantage of not requiring labeled data; however, they often rely on manual feature selection. Supervised methods typically perform better, but require high-quality labeled data. Most studies using labeled data rely on verified field data for training [27,28,29]. Some studies also use a small amount of manually labeled data to fine-tune pre-trained models and enhance performance [30]. As for network structures, convolutional neural networks (CNNs) are widely used [28,29,31,32,33]. Other studies have used recurrent neural networks (RNNs) [34,35] and Transformer models [36,37,38]. These models demonstrate strong learning ability across various tasks.
Although most studies do not focus directly on environmental noise identification, their ideas and methods remain valuable references. Compared to post-stack seismic data, raw seismic data affected by environmental noise exhibit distinct characteristics. Specifically, noise in raw recordings is often more complex and shows large amplitude variations. Moreover, raw seismic waveforms display less regularity than those in post-stack seismic sections, making it more difficult to accurately identify noise traces in field recordings. Currently, environmental noise is mainly identified based on amplitude differences [17]. When interference occurs near the acquisition instrument, the recorded signals are usually stronger than normal signals. This method performs well at locations far from the source, but near the shot point, signals such as surface and shear waves may cause abnormal energy levels [1].
To make signal features clearer, we use Mel-spectrograms to extract features from field seismic recordings. Mel-spectrograms provide an intuitive and informative representation of seismic signals by converting single-trace seismic data into two-dimensional images that display frequency and amplitude variations over time. This time–frequency representation enhances the visibility of anomalous energy patterns while preserving key signal characteristics, making it well-suited for noise identification tasks. Most importantly, Mel-spectrogram features remain relatively stable across seismic signals from different areas and acquired under varying field parameters. This helps reduce the model’s dependence on survey-specific characteristics, such as source–receiver distance or time sampling rate. As a result, networks trained on Mel-spectrograms exhibit better generalization and are more robust when applied to new datasets. To train the model, we constructed a high-quality dataset by combining pure noise recordings with carefully selected low-noise field data. This ensures a clear distinction between noise and valid signal traces, improving training effectiveness. A Vision Transformer (ViT) model was then trained using this dataset. The trained model can effectively identify noise-contaminated traces, supporting both field data quality control (QC) and subsequent seismic data denoising. Tests conducted on data different from the training set, including migration imaging experiments based on synthetic data, demonstrate that the proposed method is both reliable and practical.

2. Methods

2.1. Mel-Spectrogram

The Mel-spectrogram is a time–frequency representation method originally developed for audio and speech signal processing [39]. It transforms a one-dimensional signal into a two-dimensional spectrogram, which can be used as input features for further analysis in machine learning models. The process involves dividing the signal into short, overlapping time windows, and applying a Short-Time Fourier Transform (STFT) to each window to extract its localized frequency content [40]. In the resulting spectrogram, the horizontal axis represents time, the vertical axis represents frequency, and the value at each point (t, f) shows the amplitude of frequency f at time t. The frequency f is converted to the Mel scale using the following formula:
$$ f_{\mathrm{mel}} = 2595 \times \log_{10}\left(1 + \frac{f}{700}\right) \tag{1} $$
where $f_{\mathrm{mel}}$ is the Mel frequency. Although various time–frequency analysis methods, such as wavelet transform decomposition, have been widely used to analyze non-stationary seismic signals [41,42], they often require careful parameter tuning and involve trade-offs between time and frequency resolution. In contrast, the Mel-spectrogram provides a compact and stable time–frequency representation by mapping the frequency axis onto a nonlinear Mel scale. The Mel scale compresses high-frequency components while enhancing resolution at low frequencies, making it particularly suitable for seismic applications. With appropriately selected parameters, the Mel-spectrogram effectively captures the low-frequency content where most seismic energy is concentrated. This nonlinear transformation helps preserve subtle variations and improves the localization of anomalies in seismic data. Additionally, the Mel-spectrogram is computationally efficient: compared with wavelet-based time–frequency methods, it is cheaper to compute and generates compact, image-like representations that are well suited for deep learning models.
We converted seismic signals into Mel-spectrograms and applied image recognition techniques to identify seismic traces. Compared with raw seismic signals, Mel-spectrograms make it easier to visually observe the time–frequency characteristics of the data. Figure 1 presents two seismic traces. Figure 1a shows a trace with low noise, while Figure 1b shows a trace severely affected by noise. Both traces were converted into Mel-spectrograms, as shown in Figure 2. When noise is minimal, the frequency range of seismic reflections and the amplitude decay over time remain relatively stable. Figure 2a shows the Mel-spectrogram of the low-noise trace, where energy is more concentrated along both the time and frequency axes. In contrast, when a seismic trace is affected by environmental noise, its amplitude pattern over time and frequency often deviates from that of a clean signal. Figure 2b shows the spectrogram of the noise trace, where strong energy appears outside the main frequency band of the seismic signal, as highlighted by the red circles in the figure.
We adopt a single-trace identification approach because it offers greater flexibility and efficiency in practical applications. Converting data into Mel-spectrograms helps avoid issues caused by varying sampling rates across different surveys or equipment, as the time–frequency representation normalizes these variations and provides a consistent feature space. Single-trace identification also mitigates problems related to missing traces, which are common in complex mountainous areas or challenging terrain where acquisition is often incomplete. In contrast, approaches based on multi-trace analysis require fully sampled data and can suffer significant degradation when traces are missing or corrupted. Additionally, converting multi-trace data into Mel-spectrograms results in three-dimensional representations, which substantially increase computational costs and memory requirements. This higher complexity reduces the ease of implementation and limits scalability, especially for large-scale field datasets. Therefore, the single-trace approach strikes a balance between computational efficiency and robust noise identification, making it well suited for field deployment and operational workflows.
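As a concrete illustration of this conversion, the following minimal sketch turns one trace into a Mel-spectrogram image. It is not the authors' implementation: it assumes the librosa library, a 1 ms sampling interval (sr = 1000 Hz), and illustrative window and hop lengths; the exact segmentation used to build the dataset is given in Section 2.2.

```python
import numpy as np
import librosa

def trace_to_mel(trace, sr=1000, n_mels=64, fmin=1.0, fmax=120.0,
                 win_length=256, hop_length=94):
    """Convert one seismic trace (1-D array) into a Mel-spectrogram image.
    The sampling rate corresponds to a 1 ms interval; window and hop lengths
    are illustrative placeholders."""
    trace = trace / (np.max(np.abs(trace)) + 1e-12)   # per-trace normalization
    mel = librosa.feature.melspectrogram(
        y=trace.astype(np.float32), sr=sr,
        n_fft=win_length, win_length=win_length, hop_length=hop_length,
        n_mels=n_mels, fmin=fmin, fmax=fmax)
    return librosa.power_to_db(mel, ref=np.max)       # log-amplitude image
```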

2.2. Training Dataset

We selected relatively clean seismic data with minimal noise and combined them with pure environmental recordings collected from noise-prone areas to create the training dataset. This combination ensures that the dataset is labeled more accurately, which is essential for effective supervised learning. Figure 3a shows an example of a pure environmental noise recording, characterized by irregular and high-amplitude fluctuations typical of field noise. In contrast, Figure 3b presents a relatively clean seismic recording, where coherent reflections are clearly visible with minimal contamination.
We combined the signals from Figure 3a,b. During this process, we adjusted the relative amplitude between the noise and clean seismic signals to determine whether a trace should be labeled as noise. Labels are defined as (1, 0) for traces with strong noise and (0, 1) for traces with weak noise.
$$ \mathrm{mark} = \begin{cases} (1, 0), & N_{\mathrm{Mean}} > m \times C_{\mathrm{Mean}} \\ (0, 1), & N_{\mathrm{Mean}} \leq m \times C_{\mathrm{Mean}} \end{cases} \tag{2} $$
where m is a threshold, NMean is the mean of the noise, and CMean is the mean of the seismic signal. After the training dataset is generated using this approach, each trace is first normalized by dividing it by its maximum value. Then, the one-dimensional time series is converted into a Mel-spectrogram. At present, most newly collected seismic data have a sampling rate of 1 ms and a recording time of about 8 s. Some older datasets may use a 4 ms sampling rate, with recording times usually longer than 6 s. When using Mel-spectrograms to build datasets, the effect of sampling rate is small. For example, the data in Figure 1a were resampled to 1 ms, 2 ms, and 4 ms, and then converted to Mel-spectrograms. The results are shown in Figure 4. As seen in the figure, the Mel-spectrograms look very similar under different sampling rates. This shows one advantage of the proposed method—it is not sensitive to the sampling rate. Since most seismic data have recording times longer than 6 s, we used 6 s as the standard length for building the dataset. All field data used in our experiments have a sampling rate of 1 ms. The data are segmented into 64 equally spaced windows, each overlapping by 50 samples. A Fourier transform is applied to each window, covering frequencies from 1 to 120 Hz. The resulting Mel-spectrograms have dimensions of 64 × 64.
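A minimal sketch of the labeling rule in Equation (2) is shown below. It assumes that NMean and CMean are mean absolute amplitudes, that the first-break sample index of the clean trace is available, and that m = 1.4 as selected in Section 3.1; the function and variable names are illustrative.

```python
import numpy as np

def label_trace(noise, clean, first_break, m=1.4):
    """One-hot label per Equation (2): (1, 0) = strong noise, (0, 1) = weak noise.
    Only the part of the clean trace below the first arrival contributes to
    CMean, as described in Section 3.1."""
    n_mean = np.mean(np.abs(noise))
    c_mean = np.mean(np.abs(clean[first_break:]))
    return (1, 0) if n_mean > m * c_mean else (0, 1)
```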

2.3. Network Structure

ViT is a deep learning model based on the Transformer architecture, first introduced by Google Research in 2020 [43]. Unlike traditional convolutional neural networks, it uses a self-attention mechanism to capture global features, improving feature extraction. The model structure is shown in Figure 5, where the ViT follows the classic design without modifications. Given an input Mel-spectrogram $X \in \mathbb{R}^{H \times W \times C}$, where H and W denote the height and width of the image, and C represents the number of channels, the image is first divided into $N = HW/P^2$ non-overlapping patches of size P × P. Each patch is then flattened into a one-dimensional vector and passed through a linear projection layer that maps the original dimension $P^2 C$ to a fixed dimension D, forming the initial patch embedding sequence $\{z_1, z_2, \ldots, z_N\}$. Since the Transformer architecture itself lacks an inherent mechanism for modeling spatial positions, ViT introduces learnable positional embeddings $\{p_1, p_2, \ldots, p_N\}$, which are added to the corresponding patch vectors to retain the spatial structure of the input. The resulting patch embedding sequence is then fed into the Transformer encoder, as shown in the red box of Figure 5. The encoder consists of alternating layers of Multi-Head Self-Attention (MSA) and Multilayer Perceptron (MLP) blocks. Each sub-layer is followed by a residual connection and layer normalization (Equations (7) and (9)), ensuring training stability and effective convergence. This completes one Transformer encoder block, which can be stacked multiple times to build a deeper ViT architecture.
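The patch embedding step can be written compactly in code. The following PyTorch sketch, not the authors' implementation, splits a 64 × 64 single-channel Mel-spectrogram into 8 × 8 patches, projects each flattened patch to dimension D, and adds learnable positional embeddings; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping P x P patches, flatten each patch,
    project it to a D-dimensional embedding, and add positional embeddings."""
    def __init__(self, img_size=64, patch_size=8, in_channels=1, dim=64):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2        # N = HW / P^2
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                                # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # rearrange into (B, N, P*P*C) patch vectors
        x = x.unfold(2, P, P).unfold(3, P, P)            # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        return self.proj(x) + self.pos_embed             # (B, N, D)
```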
The MSA module serves as one of the core components, responsible for capturing long-range dependencies between different image patches from a global perspective. Unlike traditional convolutional operations, which rely on local receptive fields, MSA explicitly models the interactions between any pair of positions within the input sequence, enabling the network to learn global contextual relationships. As prepared in the previous stage, the input to the MSA module consists of N vectors, each of dimension D, which together form an input matrix $X \in \mathbb{R}^{N \times D}$. The MSA module first applies three independent linear projections to the input, generating the Query (Q), Key (K), and Value (V) matrices:
$$ Q = X W_Q, \quad K = X W_K, \quad V = X W_V \tag{3} $$
where $W_Q, W_K, W_V \in \mathbb{R}^{D \times d}$ are learnable weight matrices, and d denotes the dimensionality of each attention head. The computation performed by each attention head is as follows:
$$ \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V \tag{4} $$
This mechanism computes attention weights by measuring similarity between Queries and Keys, followed by a weighted sum of the Value vectors. In ViT, a multi-head attention mechanism is employed to capture diverse relationships among tokens. Specifically, h independent attention heads are applied in parallel:
$$ \mathrm{head}_i = \mathrm{Attention}(X W_i^{Q}, X W_i^{K}, X W_i^{V}), \quad i = 1, 2, \ldots, h \tag{5} $$
Each head uses its own set of projection weights. The outputs from all heads are then concatenated and projected back to the original embedding dimension:
$$ \mathrm{MSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W_O \tag{6} $$
where $W_O \in \mathbb{R}^{hd \times D}$ is the output projection matrix that restores the sequence to dimension D. To ensure stable training and better gradient flow, each MSA module is followed by a residual connection and layer normalization:
$$ X' = \mathrm{LayerNorm}(X + \mathrm{MSA}(X)) \tag{7} $$
The resulting feature X′ is then passed into an MLP block to introduce nonlinearity and enhance representation learning. The MLP typically consists of two fully connected layers with an activation function in between, and may include dropout for regularization:
$$ \mathrm{MLP}(X') = \mathrm{Dropout}\big(W_2\, \sigma(W_1 X' + b_1) + b_2\big) \tag{8} $$
where $W_1 \in \mathbb{R}^{D \times D'}$ and $W_2 \in \mathbb{R}^{D' \times D}$ are learnable weights, $\sigma(\cdot)$ is a nonlinear activation function, such as the rectified linear unit (ReLU), $D'$ is the hidden dimension (often set as a multiple of D), and $b_1$ and $b_2$ are bias terms. The MLP block is also followed by a residual connection and layer normalization:
$$ Y = \mathrm{LayerNorm}(X' + \mathrm{MLP}(X')) \tag{9} $$
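A compact sketch of one encoder block following Equations (3)–(9) as written is given below. This is an illustrative implementation rather than the authors' code: PyTorch's built-in multi-head attention handles the per-head projections of Equations (3)–(6) internally, and the number of heads, the MLP expansion ratio, and the dropout rate are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block per Equations (3)-(9): multi-head
    self-attention and an MLP, each followed by a residual connection and
    layer normalization."""
    def __init__(self, dim=64, num_heads=4, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                      # Equation (8)
            nn.Linear(dim, mlp_ratio * dim), nn.ReLU(),
            nn.Linear(mlp_ratio * dim, dim), nn.Dropout(dropout))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                              # x: (B, N, D)
        attn, _ = self.msa(x, x, x)                    # Equations (3)-(6)
        x = self.norm1(x + attn)                       # Equation (7)
        return self.norm2(x + self.mlp(x))             # Equation (9)
```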
In this study, the parameters are configured as follows: the number of input channels is set to C = 1, and the input image size is H = W = 64. Each image is divided into non-overlapping patches of size 8 × 8, resulting in a total of N = 64 patches. Each patch is projected to an embedding dimension of D = 64. After passing through the Transformer encoder, the output is processed by a linear layer to produce the classification result. The model performs a binary classification task using cross-entropy as the loss function.
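Putting the pieces together, the sketch below assembles the PatchEmbedding and EncoderBlock classes shown above into the stated configuration (1-channel 64 × 64 input, 8 × 8 patches, D = 64, linear head, cross-entropy loss) and runs one illustrative training step with the hyperparameters of Section 3.3. The encoder depth and the mean-pooled readout are assumptions; the text only states that a linear layer follows the encoder.

```python
import torch
import torch.nn as nn

class ViTClassifier(nn.Module):
    """Minimal ViT-style classifier built from the sketches above."""
    def __init__(self, dim=64, depth=4, num_classes=2):
        super().__init__()
        self.embed = PatchEmbedding(img_size=64, patch_size=8, in_channels=1, dim=dim)
        self.blocks = nn.Sequential(*[EncoderBlock(dim=dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 1, 64, 64)
        tokens = self.blocks(self.embed(x))    # (B, N = 64, D = 64)
        return self.head(tokens.mean(dim=1))   # class logits

# One illustrative training step (learning rate 0.001 and batch size 64 as in
# Section 3.3) on stand-in data.
model = ViTClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

spectrograms = torch.randn(64, 1, 64, 64)      # a batch of Mel-spectrogram images
labels = torch.randint(0, 2, (64,))            # class 0 = strong noise, 1 = weak noise
optimizer.zero_grad()
loss = criterion(model(spectrograms), labels)
loss.backward()
optimizer.step()
```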

3. Experimental Results

3.1. Data Preparation

To support model training and evaluation, we curated a diverse dataset composed of real noise, field seismic data, and synthetic simulations. Specifically, two sets of pure noise recordings were extracted from different seismic survey areas and labeled as NA (Noise Data A) and NB (Noise Data B). These recordings contain only ambient or acquisition-related noise, without any coherent subsurface reflections, and are used to simulate challenging noise scenarios. In addition, we collected two sets of field seismic data with relatively low inherent noise levels. These data were further processed using conventional denoising workflows to suppress anomalous noise and enhance signal fidelity. The resulting datasets are referred to as SA (Seismic Data A) and SB (Seismic Data B), and they represent clean seismic signals suitable for model learning. To enable quantitative analysis and assist in parameter testing during dataset construction, we also generated a synthetic seismic dataset using a finite-difference solution of the acoustic wave equation. This dataset is referred to as AA (Acoustic Data A). The training set was constructed by combining NA and SA, allowing the model to learn the distinction between pure noise and clean seismic signals. The test set was independently assembled using NB and SB, enabling the evaluation of model robustness on unseen data from different survey areas. Figure 6 presents representative examples from each dataset, illustrating the noise levels, structural features, and data quality across the different sources.
In Equation (2), the parameter m determines whether a seismic trace should be classified as a noise trace. This threshold is critical for guiding downstream processes such as noise suppression and data labeling, and thus must be carefully selected. In this study, to determine an appropriate and practically meaningful value for m, we established a set of evaluation criteria based on the impact of noise on imaging quality. We combined the NA and AA datasets and randomly selected one-tenth of the seismic traces to add noise, followed by anomalous noise attenuation processing. This process was designed to simulate real-world scenarios in which only a small fraction of traces are severely contaminated, allowing us to evaluate the sensitivity and robustness of the threshold parameter. The amplitude of the added noise is calculated using the following formula:
$$ N_{\mathrm{Mean}} = m \times C_{\mathrm{Mean}} \tag{10} $$
where m, NMean and CMean have the same meanings as in Equation (2). When calculating CMean, only the portion below the first arrival is considered, excluding both the signal-free region above the first arrival and the influence of the first arrival itself.
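For clarity, the following sketch shows one way a noise recording might be scaled before being added to a synthetic trace so that it satisfies Equation (10); the use of mean absolute amplitudes and the variable names are assumptions.

```python
import numpy as np

def add_scaled_noise(clean, noise, first_break, m):
    """Scale the noise trace so that its mean absolute amplitude equals
    m times CMean (computed only below the first arrival), then add it
    to the clean synthetic trace."""
    c_mean = np.mean(np.abs(clean[first_break:]))
    n_mean = np.mean(np.abs(noise)) + 1e-12
    return clean + noise * (m * c_mean / n_mean)
```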
Figure 7 shows an example of synthetic data. Based on migration imaging characteristics, some weak noise does not need to be removed, as noise attenuation may damage valid signals. This trade-off highlights the importance of distinguishing between tolerable noise and anomalous noise, particularly when dealing with weak reflection events. Therefore, we performed migration imaging on the three sets of data shown in Figure 7 and calculated the imaging signal-to-noise ratio (SNR) before and after noise attenuation. This migration imaging evaluation enables a quantitative assessment of how the denoising process affects both the preservation of useful signals and the suppression of undesired components. The formula used to calculate the SNR is as follows:
$$ \mathrm{SNR} = 10 \log_{10} \frac{\left\| \hat{y} \right\|_2^2}{\left\| y - \hat{y} \right\|_2^2} \tag{11} $$
where y represents the actual results, and $\hat{y}$ represents the results with added noise or those computed by traditional methods.
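A direct transcription of Equation (11) as reconstructed above (a sketch; array names and shapes are illustrative):

```python
import numpy as np

def snr_db(y, y_hat):
    """SNR in decibels, with y the reference image and y_hat the image
    computed from noisy or denoised data (Equation (11))."""
    return 10.0 * np.log10(np.sum(y_hat ** 2) / np.sum((y - y_hat) ** 2))
```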
We tested m values ranging from 0.2 to 2 with intervals of 0.2. This parameter sweep was designed to systematically evaluate the sensitivity of the noise classification threshold and its effect on imaging quality. Figure 8 shows the corresponding SNR as a function of m. As shown in Figure 8, when m exceeds 1.4, the noise-attenuated data achieve a higher imaging SNR. Below this threshold, excessive noise attenuation adversely affects signal integrity. Under these conditions, imaging directly with noisy data may produce superior results, since denoising unavoidably compromises valid signals. When the noise is weak, the advantages of noise attenuation do not outweigh the associated signal loss. Therefore, in this test, we selected 1.4 as the noise threshold for dataset preparation.
By combining the NA and SA datasets, we created approximately 3.41 million training samples with an equal proportion of weak and strong noise traces. This balanced distribution ensures that the model is adequately exposed to both subtle and severe noise scenarios during training, promoting generalization across a wide range of noise conditions. Figure 9 presents some examples from the training dataset. It can be observed that the Mel-spectrogram features of low-noise data are relatively consistent and simple, whereas those of high-noise traces exhibit greater variability and differ significantly from low-noise signals. The test dataset, composed of NB and SB, contains about 810,000 samples, with strong noise traces accounting for roughly 12%. This unbalanced distribution more accurately reflects real-world data, where strong noise is relatively sparse.

3.2. Comparison Method

To verify the advantage of using Mel-spectrograms for training, we also trained the model with raw seismic data as input. This comparison allows us to evaluate whether time–frequency representation provides better feature separability and learning efficiency compared to raw waveform input. The network structure remains the same as shown in Figure 5, except for an added linear layer at the input stage to adjust the data dimensions and ensure compatibility with the patch embedding sequence processing framework. Each raw seismic trace consists of 6001 samples, which are first compressed to 4096 samples through a linear layer. This dimensionality reduction helps reduce computational cost and aligns the input length with the patching sequence used for spectrograms. These compressed traces are then divided into 64 non-overlapping segments, each containing 64 samples. The remaining processing steps, including patch embedding, Transformer encoding, and classification, follow the same procedure as in Figure 5. Moreover, the network is trained using the same set of hyperparameters.
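The input stage of this raw-waveform baseline can be sketched as follows. This is an illustrative interpretation, not the authors' code: a linear layer compresses each 6001-sample trace to 4096 samples, which are then reshaped into 64 segments of 64 samples that take the place of image patches before the usual embedding projection.

```python
import torch
import torch.nn as nn

class RawTraceEmbedding(nn.Module):
    """Input adaptation for the raw-waveform comparison model."""
    def __init__(self, in_len=6001, dim=64):
        super().__init__()
        self.compress = nn.Linear(in_len, 64 * 64)     # 6001 -> 4096 samples
        self.proj = nn.Linear(64, dim)                 # segment -> D = 64 embedding

    def forward(self, x):                              # x: (B, 6001)
        segments = self.compress(x).view(-1, 64, 64)   # (B, N = 64, 64)
        return self.proj(segments)                     # (B, 64, D)
```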

3.3. Training and Test

During training, the learning rate was set to 0.001 and the batch size to 64. These hyperparameters were selected based on preliminary experiments to ensure stable convergence across both input types. Figure 10 shows the model accuracy on both the training and test sets. The solid line represents the model trained with Mel-spectrograms, while the dashed line represents the model trained with raw seismic data. The model trained with Mel-spectrograms consistently outperforms the one trained with raw data across all epochs. This performance gap suggests that Mel-spectrograms provide a more structured and informative representation, allowing the model to extract discriminative features more effectively. Training with raw signals is also less stable, as indicated by the noticeable fluctuations in the red dashed line. These instabilities may stem from the complexity of raw seismic waveforms, which increase the difficulty of model optimization. In contrast, the Mel-spectrogram-based model achieves higher and more consistent accuracy on the test set, demonstrating superior generalization capability and robustness. This highlights the advantage of the model under a limited training set, where it is impossible to fully cover all possible noise conditions.
Table 1 shows the accuracy of both models on the test dataset. The model trained with Mel-spectrograms performs better overall, especially in detecting strong-noise traces. Single-trace normalization was applied during training to stabilize the learning process, as the wide amplitude range in seismic data can otherwise cause the model to diverge. However, normalization also reduces the difference in amplitude between normal and abnormal traces, making them harder to separate. Mel-spectrograms provide more information, including frequency and attenuation, not just amplitude. These extra features help the model detect anomalies more effectively. As a result, the Mel-spectrogram model shows higher accuracy, especially for strong-noise traces. Figure 11 shows the identification results of the test data. As illustrated in Figure 11, the Mel-spectrogram-based model produces fewer errors in these weak-noise traces, as indicated by the greater number of green traces. Both models successfully identify most noise traces, but the model trained with Mel-spectrograms achieves higher overall accuracy.

4. Identification Results to Support Noise Attenuation

Few studies have directly applied deep learning to anomalous amplitude attenuation, primarily due to the unstable effect that anomalous values have on model training. Most research combines deep learning for noise identification or uses it to provide better parameter settings for traditional denoising methods, thereby improving denoising performance. For example, Tian et al. [6] employed deep learning to directly predict optimal parameters for a threshold-based anomalous amplitude attenuation method; Mao et al. [2] used noise identification results from deep learning to adaptively attenuate anomalous noise; Sun et al. [44] leveraged a U-Net to identify spatiotemporal variations of strong energy noise in seismic data, enabling more effective determination of denoising thresholds.
Accurate noise identification supports QC during field data acquisition and improves seismic data processing by enabling more accurate noise attenuation. Effective identification of noise-contaminated traces allows for targeted processing strategies, reducing the risk of overprocessing clean data and weak-noise data. We present an example of noise attenuation using synthetic data to demonstrate the practical value of the proposed method. Traditional noise attenuation methods typically detect anomalies by comparing the amplitude of each sample with that of its neighboring samples. When a sample is deemed anomalous, its amplitude is reduced to the average of the surrounding values. This approach is commonly referred to as Anomalous Amplitude Attenuation (AAA). While simple and computationally efficient, this method depends entirely on amplitude deviations. As a result, methods that rely solely on amplitude may misclassify and damage valid signals, especially when the signal amplitude is naturally high or rapidly changing. By incorporating the results of noise identification, attenuation is applied only to traces explicitly recognized as noise, thereby better preserving valid signals. This selective processing helps minimize signal distortion while still effectively reducing anomalous energy. The recognition results are used as weights to blend the denoised data with the original data:
$$ D = (1 - W) A + W S \tag{12} $$
where D is the result of weighting the AAA output, A represents the result of the AAA method, S is the original seismic data, and W is the weight, where 0 indicates a strong noise trace and 1 indicates a weak noise trace.
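A sketch of the weighting step in Equation (12), assuming the data are arranged as a (time samples × traces) array and the classifier provides one weight per trace:

```python
import numpy as np

def blend_with_identification(aaa_result, original, trace_weights):
    """Blend the AAA output with the original data per Equation (12):
    a weight of 0 keeps the AAA result (strong-noise trace) and a weight
    of 1 keeps the original data (weak-noise trace)."""
    w = np.asarray(trace_weights)[np.newaxis, :]    # broadcast over time samples
    return (1.0 - w) * aaa_result + w * original
```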
We used a portion of the Marmousi model as the velocity model for forward modeling and added noise from the NB dataset to the synthetic data. The Marmousi model, known for its complex structural features and strong velocity contrasts, provides a challenging test case for evaluating noise attenuation performance. Figure 12 shows an example of synthetic data, which exhibits both coherent reflections and superimposed noise. Noise attenuation and migration imaging were then performed to evaluate the effectiveness of the proposed method. Migration results allow for a direct assessment of whether signal integrity has been preserved and whether noise has been sufficiently suppressed to enhance reflector continuity. This ensures that our evaluation reflects real processing practices and highlights the potential advantages of incorporating intelligent noise identification into the workflow.
Here, we briefly introduce the conventional AAA method used in this article [2]. Noise attenuation is applied only to data below the first breaks. The data are divided into subsets using a spatial window of nx traces. For each data subset $Seis(i, j)$ (where i denotes the time sample and j the spatial trace within the window), the following operations are performed:
First, the absolute value of each sample is computed. Then, a smoothing filter is applied in the time direction. The resulting smoothed data are denoted as $A(i, j)$.
Second, the reference amplitude trace $B(i)$ is derived from the smoothed data by computing the median values along the spatial direction.
Third, when the raw amplitude exceeds the threshold parameter $m_a$ times the reference amplitude, i.e., $m_a B(i)$, the attenuation coefficient is computed as the reference amplitude multiplied by the attenuation scale and divided by the smoothed amplitude:
$$ \mathrm{Coef}(i, j) = \begin{cases} \dfrac{B(i) \times \alpha}{A(i, j)}, & Seis(i, j) > B(i) \times m_a \\ 1, & Seis(i, j) \leq B(i) \times m_a \end{cases} \tag{13} $$
where α is the attenuation scale, and $\mathrm{Coef}(i, j)$ is the attenuation coefficient.
Finally, each sampling point in the raw data is multiplied by its corresponding attenuation coefficient to produce the denoised data $\mathrm{denoise}(i, j)$:
$$ \mathrm{denoise}(i, j) = \mathrm{Coef}(i, j) \times Seis(i, j) \tag{14} $$
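The full AAA procedure for one spatial window can be sketched as follows. This is an illustrative reimplementation of the steps and Equations (13) and (14), not the authors' code; the default threshold and attenuation scale match the settings reported below (window of 60 traces, $m_a$ = 2, scale 1.5), the time-smoothing length is an assumption, and data above the first breaks are assumed to have been excluded beforehand.

```python
import numpy as np

def aaa_window(seis, ma=2.0, alpha=1.5, smooth_len=11):
    """Conventional AAA on one subset of nx traces (array shaped time x trace)."""
    amp = np.abs(seis)
    kernel = np.ones(smooth_len) / smooth_len
    # time-direction smoothing of the absolute amplitudes -> A(i, j)
    A = np.apply_along_axis(lambda t: np.convolve(t, kernel, mode="same"), 0, amp)
    B = np.median(A, axis=1, keepdims=True)          # reference amplitude trace B(i)
    ratio = alpha * B / (A + 1e-12)                  # candidate coefficients, Eq. (13)
    coef = np.where(amp > ma * B, ratio, 1.0)        # attenuate anomalous samples only
    return coef * seis                               # denoise(i, j), Eq. (14)
```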
For the AAA method, the window length nx was set to 60, the threshold $m_a$ to 2, and the attenuation scale to 1.5. Figure 13 compares the noise attenuation results of the conventional AAA method and the weighting AAA approach. Red arrows indicate the loss of valid signals in the AAA result, highlighting the method’s tendency to over-attenuate regions with strong but meaningful reflections. The weighting AAA method, which selectively attenuates only the samples identified as noise, effectively restores most of the lost valid signals and better preserves waveform integrity. This demonstrates the benefit of incorporating noise identification into the attenuation process, especially for strong but geologically significant events. From an SNR perspective, the overall improvement appears limited because blending the denoised data with raw traces reintroduces a small amount of noise. Nevertheless, in practical seismic data processing, preserving valid signals is often more critical than achieving maximal noise suppression, particularly in areas of structural complexity or strong reflectors. Figure 14 shows the migration imaging results. In Figure 14b, noise dominates the section due to the presence of anomalous amplitudes that were not suppressed. In contrast, Figure 14c,d present more coherent reflectors. However, SNR analysis reveals that the image produced by the weighting AAA method achieves better quality. The imaging region in Figure 14 is enlarged, as indicated by the red box. The magnified result is shown in Figure 15. Figure 15 further supports this conclusion by comparing the denoised images against the original noise-free result. Since the velocity model used for imaging is accurate and identical across all datasets, the resulting imaging structures remain the same. As shown in Figure 15, the primary difference between the two denoised datasets lies in the amplitude preservation. With the application of weighting AAA, less amplitude information is lost, resulting in an image that more closely approximates the ideal, noise-free reference. This confirms that the proposed method not only ensures the effect of noise attenuation, but more importantly, safeguards signal fidelity during denoising.

5. Conclusions

In this study, we proposed a method for identifying environmental noise in raw seismic data by combining Mel-spectrogram representations with a ViT model. Unlike traditional amplitude-based approaches, our method captures both temporal and frequency–domain features, allowing for more accurate and robust discrimination between noise and valid seismic signals. Experimental results confirm that the proposed method performs well in identifying noise-contaminated traces across both synthetic and field datasets. Furthermore, by integrating noise identification into the denoising workflow, we achieve improved signal preservation compared to conventional methods. This makes the approach not only a powerful tool for quality control during field data acquisition, but also a valuable component in enhancing the effectiveness of subsequent seismic data processing.
Future research will aim to extend the proposed framework to identify specific types of noise, rather than broadly classifying them as environmental noise. For example, distinguishing between powerline interference, heavy machinery vibrations, and neighboring shot noise would enable the application of more targeted denoising strategies, as each noise type may require different handling. We also plan to perform localized noise identification to better capture temporal variations in noise, enabling the model to detect short-term or transient disturbances that may be overlooked in whole-trace classification. However, one major challenge in this direction is obtaining labeled data for each specific type of noise at the appropriate temporal resolution. Manually annotating such data is time-consuming and often impractical for large-scale datasets, which highlights the need for more efficient labeling strategies or the adoption of self-supervised learning techniques.

Author Contributions

Data curation, B.W.; Methodology, Q.D.; Supervision, S.C. and J.S.; Writing—original draft, Q.D.; Writing—review and editing, B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of China National Petroleum Corporation, grant number 2023ZZ05. Funding was also provided by The National Natural Science Foundation of China, grant number 42074127, and The Science and Technology Project of China National Petroleum Corporation, grant number 2023ZZ15YJ02.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

Author Borui Wang was employed by the company Sinopec Geophysical Corporation Southern Branch. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
QC    Quality control
ViT    Vision Transformer
CNNs    Convolutional neural networks
RNNs    Recurrent neural networks
STFT    Short-Time Fourier Transform
MSA    Multi-Head Self-Attention
MLP    Multilayer Perceptron
Q    Query
K    Key
V    Value
ReLU    Rectified linear unit
SNR    Signal-to-noise ratio
NA    Noise Data A
NB    Noise Data B
SA    Seismic Data A
SB    Seismic Data B
AA    Acoustic Data A
AAA    Anomalous Amplitude Attenuation

References

  1. Li, X.; Qi, Q.; Yang, Y.; Duan, P.; Cao, Z. Removing Abnormal Environmental Noise in Nodal Land Seismic Data Using Deep Learning. Geophysics 2024, 89, WA143–WA156. [Google Scholar] [CrossRef]
  2. Mao, X.; Yang, W.; Pang, Z.; Zhou, Q.; Mao, H.; Zhang, D.; Wang, X.; Pan, L. Combined Identification and Attenuation of Anomalous Amplitude Noises in Nodal Land Seismic Data. Front. Earth Sci. 2025, 13, 1535990. [Google Scholar] [CrossRef]
  3. Tian, X.; Cai, C.; Qu, W.; Meng, X.; Lu, J. Research and Application of the Relationship between Wireless Node Status QC and Seismic Data Quality. Geophys. Prospect. Pet. 2024, 63, 735–745. [Google Scholar] [CrossRef]
  4. Zou, S. Research and Application of Quality Monitoring Technology Based on Node Instrument Data Acquisition. Prog. Geophys. 2024, 39, 1493–1500. [Google Scholar] [CrossRef]
  5. Evans, B.J. A Handbook for Seismic Data Acquisition in Exploration; Society of Exploration Geophysicists: Tulsa, Okla, 1997; ISBN 978-1-56080-041-5. [Google Scholar]
  6. Tian, X.; Lu, W.; Li, Y. Improved Anomalous Amplitude Attenuation Method Based on Deep Neural Networks. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  7. Li, X.; Qi, Q.; Wang, T.; Zhang, D. Removing Anomalous Noise from Seismic Data. In Proceedings of the SEG 1st Tarim Ultra-Deep Oil & Gas Exploration Technology Workshop, Korla, China, 3–5 June 2024; Society of Exploration Geophysicists: Tulsa, Okla, 2024; pp. 181–184. [Google Scholar]
  8. Song, C.; Liu, Z.; Wang, Y.; Li, X.; Hu, G. Multi-Waveform Classification for Seismic Facies Analysis. Comput. Geosci. 2017, 101, 1–9. [Google Scholar] [CrossRef]
  9. Roksandić, M.M. Seismic Facies Analysis Concepts. Geophys. Prospect. 1978, 26, 383–398. [Google Scholar] [CrossRef]
  10. Coléou, T.; Poupon, M.; Azbel, K. Unsupervised Seismic Facies Classification: A Review and Comparison of Techniques and Implementation. Lead. Edge 2003, 22, 942–953. [Google Scholar] [CrossRef]
  11. Xu, G.; Haq, B.U. Seismic Facies Analysis: Past, Present and Future. Earth-Sci. Rev. 2022, 224, 103876. [Google Scholar] [CrossRef]
  12. Chen, C.H. Seismic Pattern Recognition. Geoexploration 1978, 16, 133–146. [Google Scholar] [CrossRef]
  13. Tjøstheim, D. Improved Seismic Discrimination Using Pattern Recognition. Phys. Earth Planet. Inter. 1978, 16, 85–108. [Google Scholar] [CrossRef]
  14. Ming, Z.; Shi, C.; Yuen, D. Waveform Classification and Seismic Recognition by Convolution Neural Network. Chin. J. Geophys. 2019, 62, 374–382. [Google Scholar] [CrossRef]
  15. Amendola, A.; Gabbriellini, G.; Dell’Aversana, P.; Marini, A.J. Seismic Facies Analysis through Musical Attributes. Geophys. Prospect. 2017, 65, 49–58. [Google Scholar] [CrossRef]
  16. Xie, T.; Zheng, X.; Zhang, Y. Seismic Facies Analysis Based on Speech Recognition Feature Parameters. Geophysics 2017, 82, O23–O35. [Google Scholar] [CrossRef]
  17. Anderson, R.G.; McMechan, G.A. Automatic Editing of Noisy Seismic Data. Geophys. Prospect. 1989, 37, 875–892. [Google Scholar] [CrossRef]
  18. Shen, S.; Wang, B.; Zeng, L.; Chen, S.; Xie, L.; She, Z.; Huang, L. Methods for Identifying Effective Microseismic Signals in a Strong-Noise Environment Based on the Variational Mode Decomposition and Modified Support Vector Machine Models. Appl. Sci. 2024, 14, 2243. [Google Scholar] [CrossRef]
  19. Shakeel, M.; Nishida, K.; Itoyama, K.; Nakadai, K. 3D Convolution Recurrent Neural Networks for Multi-Label Earthquake Magnitude Classification. Appl. Sci. 2022, 12, 2195. [Google Scholar] [CrossRef]
  20. Di, H.; Shafiq, M.; AlRegib, G. Multi-Attribute k-Means Clustering for Salt-Boundary Delineation from Three-Dimensional Seismic Data. Geophys. J. Int. 2018, 215, 1999–2007. [Google Scholar] [CrossRef]
  21. Song, C.; Li, L.; Li, L.; Li, K. Robust K-Means Algorithm with Weighted Window for Seismic Facies Analysis. J. Geophys. Eng. 2021, 18, 618–626. [Google Scholar] [CrossRef]
  22. Troccoli, E.B.; Cerqueira, A.G.; Lemos, J.B.; Holz, M. K-Means Clustering Using Principal Component Analysis to Automate Label Organization in Multi-Attribute Seismic Facies Analysis. J. Appl. Geophys. 2022, 198, 104555. [Google Scholar] [CrossRef]
  23. Köhler, A.; Ohrnberger, M.; Scherbaum, F. Unsupervised Pattern Recognition in Continuous Seismic Wavefield Records Using Self-Organizing Maps. Geophys. J. Int. 2010, 182, 1619–1630. [Google Scholar] [CrossRef]
  24. Saraswat, P.; Sen, M.K. Artificial Immune-Based Self-Organizing Maps for Seismic-Facies Analysis. Geophysics 2012, 77, O45–O53. [Google Scholar] [CrossRef]
  25. Du, H.; Cao, J.; Xue, Y.; Wang, X. Seismic Facies Analysis Based on Self-Organizing Map and Empirical Mode Decomposition. J. Appl. Geophys. 2015, 112, 52–61. [Google Scholar] [CrossRef]
  26. Liu, Z.; Cao, J.; Chen, S.; Lu, Y.; Tan, F. Visualization Analysis of Seismic Facies Based on Deep Embedded SOM. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1491–1495. [Google Scholar] [CrossRef]
  27. Alaudah, Y.; Michałowicz, P.; Alfarraj, M.; AlRegib, G. A Machine-Learning Benchmark for Facies Classification. Interpretation 2019, 7, SE175–SE187. [Google Scholar] [CrossRef]
  28. Liu, M.; Jervis, M.; Li, W.; Nivlet, P. Seismic Facies Classification Using Supervised Convolutional Neural Networks and Semisupervised Generative Adversarial Networks. Geophysics 2020, 85, O47–O58. [Google Scholar] [CrossRef]
  29. Zhang, H.; Chen, T.; Liu, Y.; Zhang, Y.; Liu, J. Automatic Seismic Facies Interpretation Using Supervised Deep Learning. Geophysics 2021, 86, IM15–IM33. [Google Scholar] [CrossRef]
  30. Chikhaoui, K.; Alfarraj, M. Self-Supervised Learning for Efficient Seismic Facies Classification. Geophysics 2024, 89, IM61–IM76. [Google Scholar] [CrossRef]
  31. Chai, X.; Nie, W.; Lin, K.; Tang, G.; Yang, T.; Yu, J.; Cao, W. An Open-Source Package for Deep-Learning-Based Seismic Facies Classification: Benchmarking Experiments on the SEG 2020 Open Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  32. You, J.; Zhao, J.; Huang, X.; Zhang, G.; Chen, A.; Hou, M.; Cao, J. Explainable Convolutional Neural Networks Driven Knowledge Mining for Seismic Facies Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–18. [Google Scholar] [CrossRef]
  33. Yang, N.-X.; Li, G.-F.; Li, T.-H.; Zhao, D.-F.; Gu, W.-W. An Improved Deep Dilated Convolutional Neural Network for Seismic Facies Interpretation. Pet. Sci. 2024, 21, 1569–1583. [Google Scholar] [CrossRef]
  34. dos Santos, D.T.; Roisenberg, M.; dos Santos Nascimento, M. Deep Recurrent Neural Networks Approach to Sedimentary Facies Classification Using Well Logs. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  35. Zhou, Y.; Chen, W. Recurrent Autoencoder Model for Unsupervised Seismic Facies Analysis. Interpretation 2022, 10, T451–T460. [Google Scholar] [CrossRef]
  36. Wang, Z.; Wang, Q.; Yang, Y.; Liu, N.; Chen, Y.; Gao, J. Seismic Facies Segmentation via a Segformer-Based Specific Encoder–Decoder–Hypercolumns Scheme. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11. [Google Scholar] [CrossRef]
  37. Huo, J.; Liu, N.; Xu, Z.; Wang, X.; Gao, J. Seismic Facies Classification Using Label-Integrated and VMD-Augmented Transformer. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–10. [Google Scholar] [CrossRef]
  38. Zhou, L.; Gao, J.; Chen, H. Seismic Facies Classification Based on Multilevel Wavelet Transform and Multiresolution Transformer. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–12. [Google Scholar] [CrossRef]
  39. Zhang, T.; Feng, G.; Liang, J.; An, T. Acoustic Scene Classification Based on Mel Spectrogram Decomposition and Model Merging. Appl. Acoust. 2021, 182, 108258. [Google Scholar] [CrossRef]
  40. Ustubioglu, A.; Ustubioglu, B.; Ulutas, G. Mel Spectrogram-Based Audio Forgery Detection Using CNN. Signal Image Video Process. 2023, 17, 2211–2219. [Google Scholar] [CrossRef]
  41. Sinha, S.; Routh, P.S.; Anno, P.D.; Castagna, J.P. Spectral Decomposition of Seismic Data with Continuous-Wavelet Transform. Geophysics 2005, 70, P19–P25. [Google Scholar] [CrossRef]
  42. Chakraborty, A.; Okaya, D. Frequency-time Decomposition of Seismic Data Using Wavelet-based Methods. Geophysics 1995, 60, 1906–1916. [Google Scholar] [CrossRef]
  43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  44. Sun, L.; Zheng, X.; Ding, L.; Shou, H.; Li, H. Automatic Identification of Strong Energy Noise in Seismic Data Based on U-Net. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11. [Google Scholar] [CrossRef]
Figure 1. Raw seismic signals: (a) seismic signal with minimal noise, (b) seismic signal contaminated by noise.
Figure 2. Mel-spectrograms: (a) seismic signal with minimal noise, (b) seismic signal contaminated by noise. The red circle in (b) marks noise signals beyond the dominant frequency of the seismic data.
Figure 3. Pure environmental recording (a) and relatively clean seismic recording (b).
Figure 4. Mel-spectrograms at different sampling intervals: (a) 1 ms, (b) 2 ms, and (c) 4 ms.
Figure 5. ViT structure.
Figure 6. Examples of collected data: (a) Noise Data A, (b) Noise Data B, (c) Seismic Data A, (d) Seismic Data B, (e) Acoustic Data A.
Figure 7. An example of synthetic data: (a) noise-free, (b) with added noise, (c) after noise attenuation.
Figure 8. Imaging SNR versus the parameter m.
Figure 9. Samples from the training set: (a) weak noise traces, (b) strong noise traces. Purple clusters represent the signal’s energy distribution.
Figure 10. Accuracy curves during model training.
Figure 11. Identification results of noise traces on test data: (a) raw seismic data, (b) results from the model trained with raw data, (c) results from the model trained with Mel-spectrograms. In the figures, red marks indicate correctly identified noise traces, green marks represent weak noise traces misclassified as strong noise, and blue marks indicate strong noise traces misclassified as weak noise.
Figure 12. Synthetic data with added noise.
Figure 13. Noise attenuation results: (a) denoised data by AAA, (b) noise removed by AAA, (c) denoised data by weighting AAA, (d) noise removed by weighting AAA. The red arrows indicate the loss of valid signals caused by the conventional AAA method.
Figure 14. Imaging results: (a) raw data, (b) data with added noise, (c) AAA denoising, (d) weighting AAA denoising. The red box highlights a subsurface structural anomaly, which is shown in an enlarged view in Figure 15.
Figure 15. Local imaging results: (a) AAA, (b) weighting AAA, (c) difference of AAA, (d) difference of weighting AAA.
Table 1. Model accuracy on test data.

Method               Type            Accuracy Rate    Overall Accuracy Rate
Mel-spectrogram      Weak noise      90.2%            89.2%
                     Strong noise    82.5%
Raw seismic trace    Weak noise      94.7%            94.5%
                     Strong noise    93.1%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
