Identification of Environmental Noise Traces in Seismic Recordings Using Vision Transformer and Mel-Spectrogram
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Editor,
After a detailed reading of the manuscript entitled "Identification of Environmental Noise Traces in Seismic Recordings Using Vision Transformer and Mel-Spectrogram", it is noted that accurate identification of noise traces facilitates rapid quality control during fieldwork and provides a reliable basis for targeted noise attenuation. Compared with the raw time series, the Mel-spectrogram more clearly reveals energy variations and frequency differences, helping to identify noise traces more accurately and resulting in good noise-identification performance, which should arouse the broad curiosity of the Journal's readers. However, major corrections are needed before the manuscript can be accepted; I point out the following:
1 - Regarding the following excerpt at the end of the "Abstract": "Tests on synthetic and field data show that the proposed method performs well in noise identification." This gives a very local perspective on the study; its relevance for other applications should be highlighted. This would better arouse readers' curiosity.
2 - Note the existence of "Keywords" similar to the title. These "keywords" need to be modified to enhance the manuscript's broader scope.
3 - In the introduction (page 1, lines 33 to 37), the writing is confusing and lacks a reference: “Such environmental noise often exhibits high amplitude and irregular temporal patterns, which can significantly degrade the signal quality and obscure subsurface reflections. During field acquisition, real-time monitoring of noise interference is essential to identify problematic sources and mitigate their impact in a timely manner”. I suggest rewriting the sentence and inserting a reference.
4 - On page 2, lines 64 to 67, it is necessary to include the citation corresponding to the excerpt.
5 - On page 2, lines 84 to 87, it is necessary to include the citation corresponding to the excerpt.
6 - In "Methods" (2.1. Mel-spectrogram), between pages 109 and 122, a greater theoretical basis is needed. I did not find any bibliographic citations in this section. It is necessary to include the sources to strengthen your arguments. This increases the reliability of the developed method. It's also worth noting that your methodology requires a theoretical foundation, which is essential in a scientific study.
7 - Figures are in order and have good resolution. They meet the journal's standards.
8 - To validate the results, it is necessary to include bibliographic support in the Results section of the manuscript, as befits a scientific article.
9 - In "Conclusions" the potential for future studies could be addressed in greater detail.
10 - I believe the number of references used is small; I suggest adding more references.
Author Response
Response letter to reviewers' comments
We thank the editor and two anonymous reviewers for providing many constructive and valuable comments. We accepted the suggestions made by the reviewers and made corresponding changes in the revised manuscript. We believe that the overall quality of our manuscript has significantly improved. The changes are highlighted or noted by remarks in the revised manuscript. Please find the point-by-point response to the reviewers' comments as follows.
Reviewer: 1
Thank you for your valuable comments. We have carefully addressed all the issues you raised, with corresponding revisions or responses provided below.
Comments 1 - Regarding the following excerpt at the end of the "Abstract": "Tests on synthetic and field data show that the proposed method performs well in noise identification." This gives a very local perspective on the study; its relevance for other applications should be highlighted. This would better arouse readers' curiosity.
Response 1: The table below shows the original and revised versions of the sentence. The revision was made to better highlight the specific applications of the proposed method. (Lines 26–28 in the manuscript)
Original:
Tests on synthetic and field data show that the proposed method performs well in identifying noise.
Revised:
Tests on synthetic and field data show that the proposed method performs well in identifying noise. Moreover, a denoising case based on synthetic data further confirms its general applicability, making it a promising tool in seismic data QC and processing workflows.
Comments 2 - Note the existence of "Keywords" similar to the title. These "keywords" need to be modified to enhance the manuscript's broader scope.
Response 2: We have revised some Keywords, and the modifications are as follows:
Original Keywords:
Environmental noise; Noise identification; Mel-spectrogram; ViT; Deep learning
Revised Keywords:
Automated noise identification; Noise attenuation; Mel-spectrogram; Seismic data quality control; Deep learning
Comments 3 - In the introduction (page 1, lines 33 to 37), the writing is confusing and lacks a reference: “Such environmental noise often exhibits high amplitude and irregular temporal patterns, which can significantly degrade the signal quality and obscure subsurface reflections. During field acquisition, real-time monitoring of noise interference is essential to identify problematic sources and mitigate their impact in a timely manner”. I suggest rewriting the sentence and inserting a reference.
Response 3: We have added a more detailed explanation and included relevant references to support the revision. (Lines 35–43 in the manuscript)
Original:
Such environmental noise often exhibits high amplitude and irregular temporal patterns, which can significantly degrade the signal quality and obscure subsurface reflections. During field acquisition, real-time monitoring of noise interference is essential to identify problematic sources and mitigate their impact in a timely manner.
Revised:
When noise sources are located near geophones, their vibrations are often captured and typically exhibit stronger amplitudes than subsurface reflections. Some sources, such as passing vehicles or wind, can cause continuous disturbances over extended periods and broad areas. Others, like workers’ footsteps, are brief and sporadic, leading to localized interference. Consequently, environmental noise often shows high amplitude and irregular temporal patterns, which can significantly degrade signal quality [2]. During field acquisition, real-time noise monitoring is important for ensuring the collection of high-quality seismic data [3,4]. If data quality is severely compromised, re-acquisition may be necessary.
Comments 4 & 5- On page 2, lines 64 to 67, it is necessary to include the citation corresponding to the excerpt. On page 2, lines 84 to 87, it is necessary to include the citation corresponding to the excerpt.
Response 4 & 5: Thank you for pointing this out. We have added appropriate citations to support the corresponding statements.
Comments 6 - In "Methods" (2.1. Mel-spectrogram), between pages 109 and 122, a greater theoretical basis is needed. I did not find any bibliographic citations in this section. It is necessary to include the sources to strengthen your arguments. This increases the reliability of the developed method. It's also worth noting that your methodology requires a theoretical foundation, which is essential in a scientific study.
Response 6: We have added relevant references (e.g., [39] and [40]) and further clarified the rationale and theoretical basis for selecting the Mel-spectrogram. (Lines 125–138 in the manuscript)
The Mel-spectrogram is a time-frequency representation method originally developed for audio and speech signal processing [39]. It transforms a one-dimensional signal into a two-dimensional spectrogram, which can be used as input features for further analysis in machine learning models. The process involves dividing the signal into short, overlapping time windows and applying a Short-Time Fourier Transform (STFT) to each window to extract its localized frequency content [40]. In the resulting spectrogram, the horizontal axis represents time, the vertical axis represents frequency, and the value at each point (t, f) shows the amplitude of frequency f at time t. The frequency f is converted to the Mel scale using the following formula:

m = 2595 log10(1 + f / 700),

where m is the Mel frequency. Although various time-frequency analysis methods, such as wavelet transform decomposition, have been widely used to analyze non-stationary seismic signals [41,42], they often require careful parameter tuning and involve trade-offs between time and frequency resolution. In contrast, the Mel-spectrogram provides a compact and stable time-frequency representation by mapping the frequency axis onto a nonlinear Mel scale. The Mel scale compresses high-frequency components while enhancing resolution at low frequencies, making it particularly suitable for seismic applications. With appropriately selected parameters, the Mel-spectrogram effectively captures the low-frequency content where most seismic energy is concentrated. This nonlinear transformation helps preserve subtle variations and improves the localization of anomalies in seismic data. Additionally, the Mel-spectrogram is computationally efficient: compared to wavelet-based time-frequency methods, it offers higher efficiency and generates compact, image-like representations that are well suited for deep learning models.
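As an editorial illustration of the Hz-to-Mel conversion discussed above, the following sketch uses the common O'Shaughnessy (HTK-style) constants 2595 and 700; this is one standard variant of the formula, assumed here rather than taken from the manuscript:

```python
import numpy as np

def hz_to_mel(f_hz):
    """O'Shaughnessy / HTK variant: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping from the Mel scale back to Hz."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# The mapping is nonlinear: a 100 Hz band at low frequencies (where most
# seismic energy lies) spans more Mel units than the same band at 1 kHz.
low_band = hz_to_mel(100) - hz_to_mel(0)
high_band = hz_to_mel(1100) - hz_to_mel(1000)
```

This nonlinearity is exactly the property the response appeals to: low-frequency content is stretched (finer resolution) while high frequencies are compressed.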
Comments 7 - Figures are in order and have good resolution. They meet the journal's standards.
Response 7: Thank you for your positive feedback on the figures. We can also provide all figures in EPS vector format to ensure clarity and suitability for publication.
Comments 8 - To validate the results, it is necessary to include bibliographic support in the Results section of the manuscript, as befits a scientific article.
Response 8: Thank you for your suggestion. We have added relevant background research and references in the denoising results section to demonstrate that our approach is well-founded and scientifically justified. The detailed descriptions can be found in lines 389–398 of the revised manuscript and are also summarized below.
Few studies have directly applied deep learning to anomalous amplitude attenuation, primarily due to the unstable effect that anomalous values have on model training. Most research combines deep learning for noise identification or uses it to provide better parameter settings for traditional denoising methods, thereby improving denoising performance. For example, Tian et al. [6] employed deep learning to directly predict optimal parameters for a threshold-based anomalous amplitude attenuation method; Mao et al. [2] used noise identification results from deep learning to adaptively attenuate anomalous noise; Sun et al. [44] leveraged a U-Net to identify spatio-temporal variations of strong energy noise in seismic data, enabling more effective determination of denoising thresholds.
In addition, we have included a more detailed description of the method’s computational procedure along with relevant citations in this section. (lines 439–457 in the manuscript).
Here, we briefly introduce the conventional AAA method used in this article [2]. Noise attenuation is applied only to data below the first breaks. The data are processed by dividing them into subdata blocks using a spatial window of nx traces. For each subdata block of nx traces, with i indexing the time sample and j the spatial trace, the following operations are performed. First, the absolute value of each sample is computed, and a smoothing filter is applied in the time direction, yielding the smoothed data. Second, the reference amplitude trace is derived from the smoothed data by computing the median values along the spatial direction. Third, when the raw amplitude exceeds ma (the threshold parameter) times the reference amplitude, the attenuation coefficients are calculated as the reference amplitudes multiplied by the attenuation scale and divided by the amplitude of the smoothed data. Finally, each sampling point in the raw data is multiplied by its corresponding attenuation coefficient to produce the denoised data.
For the AAA method, the window length nx was set to 60, the threshold ma to 2, and the attenuation scale to 1.5. …
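The steps described in this response can be sketched in NumPy as follows. The parameter names (nx, ma, attenuation scale) follow the text, but the smoothing length and the exact form of the coefficient computation are our assumptions for illustration, not details confirmed by the paper:

```python
import numpy as np

def aaa_denoise(data, nx=60, ma=2.0, scale=1.5, smooth=11):
    """Sketch of a conventional anomalous-amplitude-attenuation (AAA) pass.

    data: 2-D array, shape (time samples, traces).
    nx: spatial window (traces per block); ma: threshold; scale: attenuation
    scale. `smooth` (time-smoothing length) is an assumed parameter.
    """
    out = data.astype(float, copy=True)
    nt, ntr = out.shape
    kernel = np.ones(smooth) / smooth
    for j0 in range(0, ntr, nx):
        block = out[:, j0:j0 + nx]          # view: edits apply in place
        # 1) absolute value, smoothed along the time axis (axis 0)
        smoothed = np.apply_along_axis(
            lambda t: np.convolve(t, kernel, mode="same"), 0, np.abs(block))
        # 2) reference amplitude trace: spatial median of the smoothed data
        ref = np.median(smoothed, axis=1, keepdims=True)
        # 3) where smoothed amplitude exceeds ma * reference, compute
        #    coefficients = scale * reference / smoothed amplitude
        mask = smoothed > ma * ref
        coeff = np.ones_like(smoothed)
        ref_full = np.broadcast_to(ref, smoothed.shape)
        coeff[mask] = scale * ref_full[mask] / smoothed[mask]
        # 4) multiply each raw sample by its attenuation coefficient
        block *= coeff
    return out
```

With the paper's settings (nx=60, ma=2, scale=1.5), a trace whose amplitude greatly exceeds the spatial median is rescaled toward scale times the reference level, while normal traces pass through unchanged.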
Comments 9 - In "Conclusions" the potential for future studies could be addressed in greater detail.
Response 9: Thank you for the suggestion. We agree that the previous discussion of future research was too general. So, we have expanded the “Conclusions” section to provide a more detailed discussion of potential future research directions. (Lines 502–512 in the manuscript)
Original:
Future research will focus on extending the proposed framework to handle more complex noise types, such as coherent noise and source-generated artifacts. We also aim to explore self-supervised or semi-supervised training strategies to reduce reliance on labeled data, which remains a challenge in large-scale seismic applications. Additionally, integrating spatial context across multiple traces and leveraging 2D or 3D representations of seismic sections may further enhance the model’s capacity to identify subtle noise patterns and improve generalization.
Revised:
Future research will aim to extend the proposed framework to identify specific types of noise, rather than broadly classifying them as environmental noise. For example, distinguishing between powerline interference, heavy machinery vibrations, and neighboring shot noise would enable the application of more targeted denoising strategies, as each noise type may require different handling. We also plan to perform localized noise identification to better capture temporal variations in noise, enabling the model to detect short-term or transient disturbances that may be overlooked in whole-trace classification. However, one major challenge in this direction is obtaining labeled data for each specific type of noise at the appropriate temporal resolution. Manually annotating such data is time-consuming and often impractical for large-scale datasets, which highlights the need for more efficient labeling strategies or the adoption of self-supervised learning techniques.
Comments 10 - I believe the number of references used is small; I suggest adding more references.
Response 10: We have increased the number of references from 34 to 44, with most of the newly added citations published within the last five years.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for AuthorsThe article "Detection of environmental noise traces in seismic records using Vision Transformer and Mel-Spectrogram" has practical significance in processing seismograms in conditions of extraneous interference. The method proposed by the authors, using neural network models, allows us to obtain a new solution to this important problem.
Comments for author File: Comments.pdf
Author Response
Reviewer: 2
We sincerely thank the reviewer for their valuable suggestions. We have carefully considered all comments and provided detailed revisions or responses below.
Comments 1 - (lines 80-90) A more formal explanation is needed of why the authors use Mel-spectrograms, which are oriented towards human hearing in telephony. Spectrograms provide an objective assessment (not by ear). Wavelet packets exist for flexible partitioning of the frequency domain; wavelet trees allow dividing the frequency space into arbitrary intervals. For more accurate localization, the Zak, Gabor, Wigner, and Cohen bilinear transforms are available. The authors do not mention these methods.
Response 1: We have added relevant descriptions and citations to further clarify our idea for selecting Mel spectrograms. The related content can be found in lines 125–138 of the manuscript and is summarized in the table below.
Although various time-frequency analysis methods, such as wavelet transform decomposition, have been widely used to analyze non-stationary seismic signals [41,42], they often require careful parameter tuning and involve trade-offs between time and frequency resolution. In contrast, the Mel-spectrogram provides a compact and stable time-frequency representation by mapping the frequency axis onto a nonlinear Mel scale. The Mel scale compresses high-frequency components while enhancing resolution at low frequencies, making it particularly suitable for seismic applications. With appropriately selected parameters, the Mel-spectrogram effectively captures the low-frequency content where most seismic energy is concentrated. This nonlinear transformation helps preserve subtle variations and improves the localization of anomalies in seismic data. Additionally, the Mel-spectrogram is computationally efficient: compared to wavelet-based time-frequency methods, it offers higher efficiency and generates compact, image-like representations that are well suited for deep learning models.
Comments 2 - (line 138) The authors consider a clean signal and a noise signal that are combined by weighted summation. This is not evident from the spectrograms in Fig. 2.
Response 2: We have further annotated Figure 2 to highlight the main differences (lines 150 to 151 in the manuscript).
Original:
Figure 2(b) shows the spectrogram of the noise trace, where strong energy appears outside the main frequency band of the seismic signal.
Revised:
Figure 2(b) shows the spectrogram of the noise trace, where strong energy appears outside the main frequency band of the seismic signal, as highlighted by the red circles in the figure.
Comments 3 - (lines 172-174) The signal representation characteristics need to be explained. (The seismic signal is sampled in a 6-second window with an interval of 1 ms. The data are segmented into 64 equally spaced windows, each of which is overlapped by 50 samples).
Response 3: Thank you for your valuable comment. We have added a detailed explanation regarding the setting of sampling parameters for seismic signals. The relevant descriptions can be found in lines 188–197 of the revised manuscript.
Original:
The seismic signal is sampled in a 6-second window with an interval of 1 ms. The data are segmented into 64 equally spaced windows, each overlapping by 50 samples.
Revised:
At present, most newly collected seismic data have a sampling rate of 1 ms and a recording time of about 8 seconds. Some older datasets may use a 4 ms sampling rate, with recording times usually longer than 6 seconds. When using Mel-spectrograms to build datasets, the effect of sampling rate is small. For example, the data in Figure 1a were resampled to 1 ms, 2 ms, and 4 ms, and then converted to Mel-spectrograms. The results are shown in Figure 4. As seen in the figure, the Mel-spectrograms look very similar under different sampling rates. This shows one advantage of the proposed method: it is not sensitive to the sampling rate. Since most seismic data have recording times longer than 6 seconds, we used 6 seconds as the standard length for building the dataset. All field data used in our experiments have a sampling rate of 1 ms. The data are segmented into 64 equally spaced windows, each overlapping by 50 samples.
Figure 4. Mel-spectrograms at different sampling intervals: (a) 1 ms, (b) 2 ms, and (c) 4 ms.
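The segmentation described above (a 6 s trace at 1 ms sampling, cut into 64 equally spaced windows overlapping by 50 samples) can be sketched as follows. The window-length rounding and the zero-padding of the final segment are our assumptions, since the manuscript does not state how the boundary is handled:

```python
import numpy as np

def segment_trace(trace, n_windows=64, overlap=50):
    """Split a 1-D trace into n_windows segments overlapping by `overlap`
    samples. Window length is rounded up so the windows tile the trace;
    the last segment is zero-padded if it runs past the end (assumption).
    """
    n = len(trace)
    win = int(np.ceil((n + (n_windows - 1) * overlap) / n_windows))
    hop = win - overlap
    segments = np.zeros((n_windows, win))
    for k in range(n_windows):
        chunk = trace[k * hop: k * hop + win]
        segments[k, :len(chunk)] = chunk
    return segments

trace = np.arange(1.0, 6002.0)   # 6 s at 1 ms sampling -> 6001 samples
segs = segment_trace(trace)      # consecutive windows share 50 samples
```

With these numbers the derived window length is 143 samples and the hop is 93 samples, so each pair of adjacent windows shares exactly 50 samples.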
Comments 4 - (line 185) when converting the spectrogram image matrix to a vector, some of the image structure is lost
Response 4: Thank you for the thoughtful comment. This is indeed a key feature of the Vision Transformer (ViT). Unlike convolutional networks, which gradually expand the receptive field through stacked layers to capture global information, ViT divides the image into fixed-size patches and converts each patch into a one-dimensional vector. It then applies a multi-head attention mechanism to directly and efficiently model the global context, enabling the network to capture the overall structure of the image more effectively.
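The patch-flattening step that this response describes (ViT cutting the image into fixed-size patches and turning each into a one-dimensional vector) can be illustrated as below. The 64 x 64 image size and 16 x 16 patch size are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def image_to_patch_vectors(img, patch=16):
    """Cut an H x W image into non-overlapping patch x patch tiles and
    flatten each tile row-major into a 1-D vector, as in ViT's input
    tokenization (before the learnable linear embedding)."""
    H, W = img.shape
    assert H % patch == 0 and W % patch == 0
    tiles = (img.reshape(H // patch, patch, W // patch, patch)
                .transpose(0, 2, 1, 3)          # group by patch position
                .reshape(-1, patch * patch))    # one row per patch
    return tiles  # shape: (num_patches, patch * patch)

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
vecs = image_to_patch_vectors(img, patch=16)    # 16 patches of 256 values
```

Each row of the result is one patch token; the multi-head attention then relates all tokens to each other directly, which is how ViT recovers the global structure that the per-patch flattening alone does not encode.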
Comments 5 - (lines 283-288, Fig. 5) A simple noise model is considered. Can the noise be dynamic and change over time?
Response 5: The noise traces we selected all come from real seismic field data. Most of these noises persist throughout the entire trace, while a small portion varies over time. Currently, we do not differentiate between these two cases. For example, in Figure 11c, the red-marked trace near trace 50 shows four significant amplitude fluctuations within the 6-second recording time. In the future, we will focus on local noise recognition algorithms.
Comments 6 - Fig. 7. The case of high SNR is considered. Does the method work at low SNR?
Response 6: The example shown in Figure 7 (Figure 8 in revised manuscript) is primarily used to help select an appropriate threshold for determining the noise level. This example does not apply the proposed method for noise identification or denoising.
Comments 7 - (line 318) The authors write that 6001 samples are first compressed to 4096 samples using a linear layer. If the number of samples is not equal to a power of two, it is customary to increase (not decrease) it to the nearest power of two by padding with zeros, to increase the resolution.
Response 7: Thank you for the reviewer’s question. We reduce the 6001 sampling points to 4096 features because the Mel-spectrogram we designed has a feature dimension of 4096. To ensure a fair comparison between methods, we use the same network architecture. Moreover, the Transformer architecture is naturally designed for sequence data, and the mapping from the original sequence to 4096 features can be easily performed with a learnable fully connected layer. So, it is unnecessary to pad the 6001 samples to 8192 (2^13) before reducing the dimension to 4096.
Comments 8 - Fig. 9. Training ends when, with a decrease in the training error, the testing error begins to grow. This is not visible from Figure 9. Should training be continued?
Response 8: The reviewer is likely referring to overfitting. Test accuracy may decline during extended training, but overfitting does not always occur. A well-designed training process should maintain stable test accuracy in later stages. Algorithm design should also aim to minimize the risk of overfitting as much as possible.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Editor,
My recommendations have been addressed, I suggest ACCEPTING the manuscript for publication.