
Peer-Review Record

Improved Patch-Mix Transformer and Contrastive Learning Method for Sound Classification in Noisy Environments

Appl. Sci. 2024, 14(21), 9711; https://doi.org/10.3390/app14219711

by Xu Chen ¹, Mei Wang ¹,²,*, Ruixiang Kan ³ and Hongbing Qiu ³

Reviewer 1: Young Cheol Park

Reviewer 2: Anonymous


Submission received: 5 September 2024 / Revised: 18 October 2024 / Accepted: 20 October 2024 / Published: 24 October 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a transfer learning approach to enhance sound event detection accuracy by incorporating an Audio Spectrogram Transformer (AST) with data augmentation, contrastive learning, and a patch-mix scheme. Overall, the paper presents adequate technical novelty for sound event detection, and the experimental results demonstrate its effectiveness and superior performance to the previous methods.

**Eqs (1)~(3)**: These equations represent a general SNR adjustment process without any specific technical significance. Additionally, they do not seem to have a clear connection with the rest of the paper and thus could be omitted.

**Figs. (2) and (3)**: These figures convey overlapping content. It would be beneficial to combine them into a single figure if possible.

**ABSTRACT, Page 2, Line 60**: Please provide the full name of the CL transformer. What does "CL" stand for? Isn't it based on AST?

**Figures 1, 2, 3, 4, 5**: Increase the font size in the figures to improve readability.

**Table 2**: So, did you use alpha=0.5 in all experiments?

**Page 16, Table 7**: The CL-Transformer in Table 7 seems to be the final proposed model. If so, why is the accuracy for the highest SNR case (92.23%) not shown in the other tables (Tables 5 and 6)? The best cases in Tables 5 and 6 show a different accuracy (92.95%).

Comments on the Quality of English Language

The expressions "some scholars," "many scholars," and "most scholars" do not seem to be commonly used. Instead, using phrases like "some research" or "many previous studies" might be better.

Author Response

Response to Reviewer 1

Comments

Dear professors,

The revisions have been completed according to the comments of all the reviewers. Each comment was of great value and has improved the quality of this manuscript considerably, for which we are very grateful. In response to these comments, all sections have been improved, some code has been rewritten, and some experiments have been redone. The revised manuscript, Manuscript Version 2 (With Track Changes), has been uploaded for comparison with Version 1 in the MDPI system.

Point 1: **Eqs (1)~(3)**: These equations represent a general SNR adjustment process without any specific technical significance. Additionally, they do not seem to have a clear connection with the rest of the paper and thus could be omitted.

Response 1: Many thanks for your useful comments. We have re-evaluated the necessity and relevance of these equations and found that omitting them can improve the overall coherence and focus of the manuscript. Therefore, we have removed these equations. Additionally, to maintain coherence, we have added necessary explanations in the text and updated the numbering of subsequent equations.

Point 2: **Figs. (2) and (3)**: These figures convey overlapping content. It would be beneficial to combine them into a single figure if possible. **Figures 1, 2, 3, 4, 5**: Increase the font size in the figures to improve readability.

Response 2: Many thanks for your useful comments. We realized that the latter part of Figure 2 overlaps with the content of Figure 3. Therefore, we have combined Figs. (2) and (3) into a single figure to present the relevant information more effectively and to improve overall clarity and readability. Additionally, we have adjusted the figure order in the manuscript and enlarged the font sizes of Figures 1, 2, 3, and 4 in Manuscript Version 2 (With Track Changes).

Point 3: **ABSTRACT, Page 2, Line 60**: Please provide the full name of the CL transformer. What does "CL" stand for? Isn't it based on AST?

Response 3: Many thanks for your useful comments. We have revised the manuscript to provide the full name in the ABSTRACT: Contrastive Learning-based Audio Spectrogram Transformer (CL-Transformer). The CL-Transformer is based on the Audio Spectrogram Transformer (AST) and incorporates Contrastive Learning.

Point 4: **Table 2**: So, did you use alpha=0.5 in all experiments?

Response 4: Many thanks for your useful comments. Regarding Table 2, the purpose of this set of experiments was to determine the optimal value of alpha in Equation 5. When alpha is 0.01, the mixed loss primarily comes from the cross-entropy loss; when alpha is 0.99, the supervised contrastive loss dominates. Only when alpha is 0.5, where the contributions of the cross-entropy and contrastive loss functions are equal, do we achieve higher accuracy. Therefore, we determined the optimal value of alpha through experimentation and applied this value in subsequent experiments. We have modified the manuscript and added an explanation below Table 2.
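The weighting described in Response 4 can be illustrated with a minimal sketch. Equation 5 is not reproduced in this record, so the form below, (1 - alpha) times the cross-entropy loss plus alpha times the supervised contrastive loss, is an assumption that is merely consistent with the behaviour the authors describe at alpha = 0.01, 0.5, and 0.99. The function names and the placeholder contrastive-loss value are hypothetical and are not taken from the authors' code.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed from raw logits via a stable log-softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mixed_loss(ce_loss, supcon_loss, alpha=0.5):
    """Assumed form of the mixed loss: alpha weights the contrastive term,
    (1 - alpha) weights the cross-entropy term."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * ce_loss + alpha * supcon_loss


# Illustrative values only: the contrastive loss here is a placeholder number,
# not a result from the paper.
ce = cross_entropy([2.0, 0.5, -1.0], label=0)
supcon = 1.2
for alpha in (0.01, 0.5, 0.99):
    print(alpha, mixed_loss(ce, supcon, alpha))
```

Under this assumed form, alpha = 0.01 leaves the total almost equal to the cross-entropy term, alpha = 0.99 leaves it almost equal to the contrastive term, and alpha = 0.5 weights the two equally, which matches the three cases discussed in the response.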

Point 5: **Page 16, Table 7**: The CL-Transformer in Table 7 seems to be the final proposed model. If so, why is the accuracy for the highest SNR case (92.23%) not shown in the other tables (Tables 5 and 6)? The best cases in Tables 5 and 6 show a different accuracy (92.95%).

Response 5: Many thanks for your useful comments. This is a critical issue, because the validation results in Table 7 should indeed be consistent with those in Tables 5 and 6. We therefore carefully examined the code and the evaluation process, and found an error in the evaluation code. The training procedure itself is designed as follows: the model undergoes one round of training and is then evaluated on the validation set; this process is repeated, and finally the model parameters corresponding to the highest accuracy are saved. This procedure is correct. However, in the evaluation experiment of Section 4.5, we loaded the saved model parameters and then ran the same code without disabling the training round that precedes evaluation on the validation set. As a result, the model underwent one additional round of training, which altered its parameters before the validation set was evaluated and consequently affected our experimental results. We have modified the evaluation code to disable the training round before assessing the validation set, re-conducted the experiments for this section, and updated the results accordingly. In short, the results in Tables 5 and 6 come from the same set of training experiments, whereas the results in Table 7 were produced by the separate evaluation experiments in Section 4.5, which were affected by this error.
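The kind of mistake described in Response 5 can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the model is a plain dictionary, and `run`, `toy_train`, `toy_eval`, and the `evaluate_only` flag are hypothetical names introduced here.

```python
def run(model, train_one_epoch, evaluate, epochs, evaluate_only=False):
    """Train-then-validate loop that keeps the best-scoring parameters.

    With evaluate_only=True the training step is skipped, so parameters that
    were loaded from disk are scored exactly as they were saved.
    """
    best_acc, best_params = float("-inf"), None
    rounds = 1 if evaluate_only else epochs
    for _ in range(rounds):
        if not evaluate_only:
            train_one_epoch(model)  # mutates the model parameters
        acc = evaluate(model)       # must leave the parameters untouched
        if acc > best_acc:
            best_acc, best_params = acc, dict(model)
    return best_acc, best_params


def toy_train(model):
    """Hypothetical training step: shifts the single parameter."""
    model["w"] += 1.0


def toy_eval(model):
    """Hypothetical validation accuracy, highest when w equals 3."""
    return 1.0 - abs(model["w"] - 3.0) / 10.0


# Correct evaluation: the loaded parameters are scored unchanged.
loaded = {"w": 3.0}
print(run(loaded, toy_train, toy_eval, epochs=1, evaluate_only=True))

# The reported mistake: one training round runs before validation,
# so the score no longer belongs to the saved parameters.
reloaded = {"w": 3.0}
print(run(reloaded, toy_train, toy_eval, epochs=1))
```

Keeping training and evaluation behind an explicit switch, rather than reusing the training script unchanged, is one simple way to make sure a loaded checkpoint is never updated before it is scored.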

Point 6: The expressions "some scholars," "many scholars," and "most scholars" do not seem to be commonly used. Instead, using phrases like "some research" or "many previous studies" might be better.

Response 6: Many thanks for your useful comments. We thoroughly reviewed the entire manuscript and revised the expressions "some scholars," "many scholars," and "most scholars."

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

Overall, this manuscript is very well-written, informative, and useful. Congratulations! However, there are some minor revisions needed for it to be published.

Please ensure that readers from different backgrounds can follow this manuscript. Even though the main issue is well explained, several technical terms and terminologies are hard to follow. For example, the terms “local minima entrapment,” “cross-entropy loss,” and “supervised contrastive loss” should be explained.

Please attend to the following minor revisions:

L37-39: Please rephrase these lines; the “one of which” part is confusing.

L40-44: A reference would be appropriate here.

L45-46: Mentioning a model here seems out of place.

L46-47: The in-text transition from real life to models is rough. A smoother transition after line 43 is necessary.

L83-84: Please connect the two split phrases.

L570: Please provide information for the audio category labels and define the abbreviations used in the confusion matrices (Figures 7 & 8).

L580-582: The phrase feels incomplete. Starting with "Considering" lacks a clear main clause or conclusion. It introduces an idea but doesn't state what follows from "Considering."

L605: Briefly explain the term “local minima entrapment,” even if it is well-known among experts.

Kind regards

Comments on the Quality of English Language

Overall, this manuscript is well-written.

Author Response

Response to Reviewer 2 Comments

Dear professors,

Point 1: Please ensure that readers from different backgrounds can follow this manuscript. For example, the terms “local minima entrapment,” “cross-entropy loss,” and “supervised contrastive loss” should be explained.

Response 1: Many thanks for your useful comments. We have added explanations for the terms "local minima entrapment," "cross-entropy loss," and "supervised contrastive loss" in Manuscript Version 2 (With Track Changes) to help readers from different backgrounds better understand the content. Clear explanations of these terms will enhance the readability and comprehensibility of the manuscript. Thank you for your suggestions!

Point 2: L37-39: Please rephrase these lines; the “one of which” part is confusing.

Response 2: Many thanks for your useful comments. We have revised the sentence to: "In Environmental Sound Classification (ESC), the recognition and classification of sound signals remain challenging." This modification removes the confusion caused by the phrase "one of which" and makes the statement clearer and more comprehensible.

Point 3: L40-44: A reference would be appropriate here.

Response 3: Many thanks for your useful comments. This is an error that should not have occurred. The sentence in question is derived from the original reference 17, which states, “The structural impact of noise on audio” (now cited as reference 3). We have made the necessary adjustments to ensure proper citation and have also revised the order of all references throughout the manuscript for consistency.

Point 4: L45-46: Mentioning a model here seems out of place; L46-47: The in-text transition from real life to models is rough. A smoother transition after line 43 is necessary.

Response 4: Many thanks for your useful comments. The mention of "model" here is indeed inappropriate. Additionally, to enhance the flow of this section, we have rephrased the content, resulting in the following modification: “This is evidenced by the diminished clarity at low Signal-to-Noise Ratios (SNR) conditions, the potential masking of the target sound source, and the requirement for extensive annotated datasets. These factors significantly complicate feature extraction, ultimately leading to recognition errors. Furthermore, obtaining high-quality, accurately annotated audio data presents a significant challenge in real-world scenarios. As a result, this hinders classification tasks.”

Point 5: L83-84: Please connect the two split phrases.

Response 5: Many thanks for your useful comments. To make the expression more concise, we have merged the two phrases into a single sentence: "Experiments demonstrate that some feature fusions, such as MFCC and Chroma STFT, can enhance classification model accuracy."

Point 6: L570: Please provide information for the audio category labels and define the abbreviations used in the confusion matrices (Figures 7 & 8).

Response 6: Many thanks for your useful comments. We have added an explanation of the audio label sources and label information in Section 3.1, Public Environment Sound Datasets. Additionally, in Section 4.4, The Impact of the Classification Model Framework on Model Performance, we have redefined the audio abbreviations used in the confusion matrices (Figures 7 & 8).

Point 7: L580-582: The phrase feels incomplete. Starting with "Considering" lacks a clear main clause or conclusion. It introduces an idea but doesn't state what follows from "Considering."

Response 7: Many thanks for your useful comments. Our research aims to improve the accuracy of audio classification in urban noise scenarios. The purpose of Section 4.5, Robustness of the CL-Transformer Classification Model Framework to Noise, is to evaluate whether the CL-Transformer model trained with Patch-Mix mixed noise can better handle noise in real-world situations. We did not express this intent clearly in our writing, so we have updated this part in Manuscript Version 2 (With Track Changes).

Author Response File: Author Response.pdf
