Review Reports
- Min Cui1,*,
- Yang Liu1 and
- Yanbo Wang2
- et al.
Reviewer 1: Patricia Rodríguez
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Please read and attend to the general comments I wrote in the PDF file. I hope they will be useful for the structure of your work.
Comments for author File:
Comments.pdf
Author Response
Please see the attachment
Author Response File:
Author Response.pdf
Reviewer 2 Report
This paper presents a new model (MFF-Resnet) in which the authors propose fusing audio and MFCC features by means of an attention-based residual neural network. A transfer learning method is proposed to enable training with few samples, and results are shown comparing the authors' proposal with those provided by a pre-trained network (R-Resnet).
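For orientation only, the sketch below shows one generic way that hand-crafted MFCC features could be fused with a learned residual-network embedding through an attention gate. It is an illustrative assumption in PyTorch, not the authors' MFF-Resnet implementation, and all dimensions and names are hypothetical.

```python
# Illustrative sketch only -- NOT the authors' MFF-Resnet.
# Fuses hand-crafted MFCC features with a learned embedding via an attention gate.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, mfcc_dim=40, embed_dim=512, fused_dim=256):
        super().__init__()
        self.proj_mfcc = nn.Linear(mfcc_dim, fused_dim)    # project MFCC vector
        self.proj_embed = nn.Linear(embed_dim, fused_dim)  # project network embedding
        self.attn = nn.Sequential(                         # per-feature gating weights
            nn.Linear(2 * fused_dim, fused_dim),
            nn.Sigmoid(),
        )

    def forward(self, mfcc, embed):
        a = self.proj_mfcc(mfcc)
        b = self.proj_embed(embed)
        w = self.attn(torch.cat([a, b], dim=-1))  # attention weights in (0, 1)
        return w * a + (1.0 - w) * b              # convex combination of both views

# Hypothetical usage with a batch of 8 clips:
# fused = AttentionFusion()(torch.randn(8, 40), torch.randn(8, 512))
```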
The work presents a number of weaknesses that require rethinking so that it can be scientifically sound for the community working in MIR or ESC.
1. I strongly recommend a literature review more adapted to the field of ESC. It is noteworthy that not even one of Piczak's early works is cited:
[1] K. J. Piczak, ESC: Dataset for environmental sound classification, in: Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 1015–1018. doi:10.1145/2733373.2806390.
A good starting point would be to review these two papers (among many others), which would put the authors' approach in context:
[2] Tripathi, A. M., & Mishra, A. (2021). Environment sound classification using an attention-based residual neural network. Neurocomputing, 460, 409-423.
[3] Bansal, A., & Garg, N. K. (2022). Environmental Sound Classification: A descriptive review of the literature. Intelligent Systems with Applications, 200115.
The first is a paper very similar to the one presented by the authors but with a much better set of references, methodological approach and presentation of results. The second is a general review necessary to adequately frame the work.
2. A methodological review of the research is essential. This involves:
a) adequately describing the database used,
b) comparing the results with other databases such as ESC-10 or DCASE data,
c) comparing the results obtained with state-of-the-art techniques, and
d) providing a repository (e.g., on GitHub) that can be accessed to verify the obtained results.
I recommend the article [2] as one of many from which the authors can draw inspiration for the development of their work.
3. Finally, a correction of the English language is essential. The number of errors is very high and in many cases makes reading the work very difficult.
In my humble opinion, the work is at a very preliminary stage and needs additional time and work to be presented in a solid way.
Author Response
Please see the attachment
Author Response File:
Author Response.pdf
Reviewer 3 Report
Please explain what MVDR is (Minimum Variance Distortionless Response beamformer?) and why it is useful.
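For context only (a textbook statement, not drawn from the manuscript): the MVDR beamformer chooses weights that minimize output power while passing the look direction undistorted,

```latex
% Textbook MVDR beamformer (context for the question above, not from the manuscript).
\[
\min_{\mathbf{w}} \ \mathbf{w}^{H}\mathbf{R}\,\mathbf{w}
\quad \text{s.t.} \quad \mathbf{w}^{H}\mathbf{d} = 1,
\qquad
\mathbf{w}_{\mathrm{MVDR}} = \frac{\mathbf{R}^{-1}\mathbf{d}}{\mathbf{d}^{H}\mathbf{R}^{-1}\mathbf{d}},
\]
```

where R is the spatial covariance matrix of the multichannel input and d is the steering vector; this is why MVDR is often used as a noise-suppressing front end for multichannel audio.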
Please reformulate (manuscript lines 118–119): "As shown in Figure 1, Sound is a kind of multi-channel waveform data, however, considers only the sound's time domain information that leaves out its frequency domain information."
Please be clearer regarding Fig. 3, i.e., the dimensions of F(X) and X.
Regarding Fig. 4, it is not clear whether the authors performed the acquisition themselves or used an annotated database.
Please reformulate (Fig. 4): "Acoustic data is waveform data in essence."
Fig. 7: Please reformulate (manuscript line 287): "1 × 1 convolutions are used to adjust channel number in."
Please be clearer regarding the acquisition of the training set and the test set.
Maybe such details could be included in the Abstract
Author Response
Please see the attachment
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Please review the file attached
Comments for author File:
Comments.pdf
Author Response
Please see the attachment
Author Response File:
Author Response.pdf
Reviewer 2 Report
I understand that the authors' intended contribution to sound classification may be interesting, especially the merging of conventional audio features with the latent feature space provided by a deep neural network, all realized through an attention layer.
The authors have addressed some of the comments made in my previous review, which I appreciate.
However, the manuscript still contains many errors, some of which invalidate the work presented, in my humble opinion.
I thank the authors for including a comparison with other results, presented in Table 3. However, the results shown do not clarify the actual ability of the network to classify sounds, as only training results are presented. It is essential to provide results on test data; the same is true for Table 2. It is necessary to know the generalization capability of the developed models. Similarly, it is not specified with which network the figures in Table 3 were calculated (which were the training data and which were the test data), nor what value is presented in the table. No ranking measures, such as the F-score or any other, are given.
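As a point of reference, here is a minimal sketch of how held-out test metrics such as the F-score could be reported, assuming scikit-learn; the arrays y_test and y_pred are hypothetical and not taken from the manuscript.

```python
# Minimal sketch: reporting held-out test metrics (not the authors' code).
# y_test and y_pred are hypothetical label arrays for the test split.
from sklearn.metrics import accuracy_score, f1_score, classification_report

def report_test_metrics(y_test, y_pred, class_names=None):
    """Print accuracy, macro F1, and a per-class breakdown on test data."""
    print(f"Test accuracy : {accuracy_score(y_test, y_pred):.4f}")
    print(f"Macro F1-score: {f1_score(y_test, y_pred, average='macro'):.4f}")
    print(classification_report(y_test, y_pred, target_names=class_names))
```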
In addition, the manuscript still has many problems with the wording in English. A revision of the language by a translator is mandatory.
Just to give a few examples:
- Page 3, line 136: "In waveform" -> "waveforms represented in the time domain".
- Page 4, line 141: "deep convolutional network can be used to high level feature representation" — lacks a verb.
- Page 5, line 173: "and completed work from signal acquisition to signal identification" — it is not clear.
- Page 5, line 191: "which are original in waveform" — it is not clear what it means.
- Page 6: Essentia and Freesound must be properly referenced.
- The words in the boxes in Figure 5 are poorly justified.
- Page 7, line 256: "start" -> "input".
- Figure 7 is hard to see.
- Page 7, line 262: "Due to the convolution is a low pass filter" — the convolution can be whatever type of filter is desired (see the brief sketch after this list).
- Page 8, line 292: "is consist" -> "consist".
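A brief illustrative sketch of the point about convolution above (not from the manuscript): the same convolution operation acts as a low-pass or high-pass filter depending on the kernel coefficients, shown here with NumPy and hypothetical kernels.

```python
# Illustrative only: a convolution kernel can be low-pass OR high-pass,
# depending on its coefficients.
import numpy as np

x = np.array([1.0, 1.0, 1.0, 5.0, 1.0, 1.0])          # signal with a spike
low_pass = np.convolve(x, [0.5, 0.5], mode="same")     # averaging kernel smooths the spike
high_pass = np.convolve(x, [1.0, -1.0], mode="same")   # difference kernel emphasizes the spike
print(low_pass)
print(high_pass)
```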
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 3
Reviewer 2 Report
The manuscript is much improved over the original version; the methodology and results are more adequately presented. The results obtained seem relevant. It would be interesting if the code were made available on a platform commonly used by the scientific community, such as GitHub, which provides a lot of visibility for this type of work.
Author Response
Dear Professor:
Thank you in advance for your time and collegiality. We have uploaded the code to GitHub according to your suggestion. Here is the address:
https://github.com/nucmail/sound_ESC_rcnn
We are looking forward to hearing from you. Please feel free to contact us if there are any questions.
Best Regards