Article
Peer-Review Record

Efficient Speech Detection in Environmental Audio Using Acoustic Recognition and Knowledge Distillation

Sensors 2024, 24(7), 2046; https://doi.org/10.3390/s24072046
by Drew Priebe 1, Burooj Ghani 2 and Dan Stowell 1,2,*
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 27 January 2024 / Revised: 1 March 2024 / Accepted: 7 March 2024 / Published: 22 March 2024
(This article belongs to the Special Issue Acoustic Sensing and Monitoring in Urban and Natural Environments)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Within this paper, the authors investigate the reduction of network size (complexity, latency, computational load, etc.) for voice activity detection in eco-acoustic monitoring. Four effective student architectures are derived from the MobileNetV3-Small-Pi model and compared against the larger EcoVAD teacher model. Furthermore, transfer learning to the lightweight student models is examined using three different knowledge distillation techniques. The investigated models are trained on application-specific large datasets and further validated on separate data. Evaluation is based on the standardized F1 score, which combines precision and recall, as well as the AUC-ROC score, indicating how reliably speech and non-speech regions in eco-acoustic recordings are identified. Additionally, network-related measures such as the number of parameters, layers, floating-point operations (FLOPs), multiplications, required memory, and inference time are considered.
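For readers unfamiliar with the two metrics named above, both can be computed from first principles. The sketch below uses hypothetical predictions, not the paper's data, and the 0.5 decision threshold is an illustrative assumption.

```python
# Illustrative from-scratch computation of F1 and AUC-ROC
# (hypothetical labels and scores, not the manuscript's results).

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def auc_roc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation: the probability
    that a random positive example outranks a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative 0.5 threshold
f1 = f1_score(y_true, y_pred)    # ≈ 0.667 on this toy example
auc = auc_roc(y_true, scores)    # ≈ 0.889 on this toy example
```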

Line 26: (minor hint) The abbreviation MLP is not introduced but is already mentioned in Line 25.

Line 156: The model architecture is described on a meta-level. It might be more helpful to provide details on the applied physical background, as the detailed time-frequency resolution used cannot be found throughout the paper (cf. Line 217 – also not very forthcoming).

Lines 166 to 188: The different student architectures are addressed. However, it is not clear from this description what the theoretical foundations or backgrounds for the selected structures are. Rather, only the implemented changes and their impact on the network size are described.

Table 1 on page 5: The term "Inference Time" is not described. How is it measured?
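One common way such a figure is obtained can be sketched as follows: a few warm-up passes to exclude one-off setup costs, then wall-clock averaging over repeated forward passes. The model below is a hypothetical stand-in, not the paper's EcoVAD or student models, and this is not necessarily the authors' exact protocol.

```python
# A generic sketch of average inference-time measurement.
import time

def avg_inference_time(model_fn, sample, warmup=3, repeats=20):
    """Return mean wall-clock seconds per call to model_fn(sample)."""
    for _ in range(warmup):           # warm-up: exclude caching/setup costs
        model_fn(sample)
    start = time.perf_counter()
    for _ in range(repeats):
        model_fn(sample)
    return (time.perf_counter() - start) / repeats

# Hypothetical stand-in for a model's forward pass.
def dummy_model(x):
    return sum(v * v for v in x)

t = avg_inference_time(dummy_model, list(range(1000)))
```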

 

Lines 247 to 253: (minor hint) The abbreviations F1 and AUC can be better introduced.

Table 2 on page 7: Average score values of F1 and AUC are given, but the scattering parameters are missing. Based on the given average score values and the subsequently described observations, it is evident that these parameters (F1 and AUC scores) are not normally distributed. Therefore, relying solely on the mean values for interpretation can be misleading. Median values plus confidence intervals of the median values would be much more informative. Additionally, a simple statistical evaluation of the observed score measures is requested to prevent unverified comments as addressed subsequently.
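The alternative suggested above, a median with a confidence interval, can be sketched with a percentile bootstrap. The scores below are synthetic and right-skewed for illustration; they are not the manuscript's results.

```python
# Percentile-bootstrap confidence interval for the median (illustrative).
import random
import statistics

def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Return (median, ci_low, ci_high) via the percentile bootstrap."""
    rng = random.Random(seed)  # seeded for reproducibility
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo = medians[int(n_boot * alpha / 2)]
    hi = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.median(values), lo, hi

# Hypothetical, skewed F1 scores across evaluation runs.
f1_scores = [0.91, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95, 0.79]
med, lo, hi = bootstrap_median_ci(f1_scores)
```

Unlike the mean, the median here is barely moved by the 0.79 outlier, which is the reviewer's point about skewed score distributions.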

Line 275: The most "significant" improvement...! (unverified)

Line 282: ...F1 scores of the student models did not "significantly" decrease...! (unverified)

Figure 1, page 7: The ordinate indicating "F1 score" is extremely stretched, showing only a small range. Indicated average mean values are not sufficient in this case to display the full details of the different models.

Figure 2, page 7: Similar remarks apply as for Figure 1. Furthermore, to better visually differentiate the examined models, use qualified markers as well as better distinguishable colors. Moreover, only distinct distance positions are evaluated. Only dashed lines should be used to continue the possible trend of evaluated models.

Considering the above-mentioned suggestions (median values + IQRs + CI of median values) and basic statistical evaluation will lead to a modified discussion and conclusion.

Author Response

Reviewer 1 Responses:

 

Line 26: (minor hint) The abbreviation MLP is not introduced but is already mentioned in Line 25.

 

  • The abbreviation is now introduced prior to its first mention. The update can be found on Line 25.

 

Line 156: The model architecture is described on a meta-level. It might be more helpful to provide details on the applied physical background, as the detailed time-frequency resolution used cannot be found throughout the paper (cf. Line 217 – also not very forthcoming).

  • Section 2.2 Model Architectures has been updated to give a more detailed account of the time-frequency approach undertaken, with theoretical references to support the decisions for structuring model inputs. The update can be found starting from line 260.

Lines 166 to 188: The different student architectures are addressed. However, it is not clear from this description what the theoretical foundations or backgrounds for the selected structures are. Rather, only the implemented changes and their impact on the network size are described.

  • Section 2.2 Model Architectures was further updated to address these concerns. Our revisions establish a direct link between the architectural decisions made for each student model and the theoretical foundations and motivations that guided these choices.

Table 1 on page 5: The term "Inference Time" is not described. How is it measured?

  • Table 1 has been updated with context on how the average inference time is calculated.

Lines 247 to 253: (minor hint) The abbreviations F1 and AUC can be better introduced.

  • Adjustments have been made to better introduce these metrics to the audience.

Table 2 on page 7: Average score values of F1 and AUC are given, but the scattering parameters are missing. Based on the given average score values and the subsequently described observations, it is evident that these parameters (F1 and AUC scores) are not normally distributed. Therefore, relying solely on the mean values for interpretation can be misleading. Median values plus confidence intervals of the median values would be much more informative. Additionally, a simple statistical evaluation of the observed score measures is requested to prevent unverified comments as addressed subsequently.

  • These comments were assessed and addressed by revamping Table 2 with a more reliable measure for interpretation. Furthermore, a statistical analysis has been implemented and reported within the manuscript to address these concerns.

Line 275: The most "significant" improvement...! (unverified)

  • In line with the adjustments to Table 2, these concerns were addressed through the reporting of a statistical analysis.

Line 282: ...F1 scores of the student models did not "significantly" decrease...! (unverified)

  • Section 3.2. Impact of Parameter Reduction and Efficiency on Model Accuracy has been updated to address the unverified statement.

Figure 1, page 7: The ordinate indicating "F1 score" is extremely stretched, showing only a small range. Indicated average mean values are not sufficient in this case to display the full details of the different models.

  • Figure 1 has been updated with median score values in order to address the concerns with respect to mean values.

Figure 2, page 7: Similar remarks apply as for Figure 1. Furthermore, to better visually differentiate the examined models, use qualified markers as well as better distinguishable colors. Moreover, only distinct distance positions are evaluated. Only dashed lines should be used to continue the possible trend of evaluated models.

  • The remarks on Figure 1 cannot be applied to the Figure 2 reporting. The evaluation dataset and metrics come from a paper cited extensively throughout the manuscript, whose authors did not report median values; it is therefore difficult to address these concerns with respect to the evaluation data. Furthermore, only distinct distance positions were evaluated, as those were the distances collected for testing and evaluation.

The discussion was updated to address any statements that were unverified in previous versions.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript entitled "Efficient speech detection in environmental audio using acoustic recognition and knowledge distillation" presents an approach that leverages knowledge distillation techniques to design efficient, lightweight student models for speech detection in bioacoustics. The quality of the manuscript is good, and I believe it is suitable for publication. In addition, there are several points that should be improved.
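For context, a common knowledge-distillation loss (the response-based formulation of Hinton et al.) can be sketched generically as follows. The temperature and logits are illustrative assumptions, and this is not necessarily the exact formulation the authors implement.

```python
# Generic response-based knowledge-distillation loss (illustrative).
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between the teacher's softened distribution (soft
    targets) and the student's, scaled by T^2 as is conventional."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

# Hypothetical two-class (speech / non-speech) logits.
loss = distillation_loss([1.2, -0.3], [2.0, -1.0])
```

Training a student then minimizes a weighted sum of this term and the ordinary supervised loss on hard labels.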

1. Please supplement the number of audio files in the Evaluation Playback Dataset.

2. It is recommended to discuss in the discussion section whether the proposed approach is suitable for deployment on edge devices.

3. Lines 275-276: "The most significant improvement was seen in the relational-based distillation method" is inconsistent with the data. AUC scores (feature-based): 0.9870 to 0.9893; AUC scores (relational-based): 0.9856 to 0.9898.

Author Response

Reviewer 2 Responses:

 

Comment 1: Please supplement the number of audio files in the Evaluation Playback Dataset.

 

  • The number of audio files in the Evaluation Playback Dataset has been supplemented. Updates can be found on lines 218-219.

 

Comment 2: It is recommended to discuss in the discussion section whether the proposed approach is suitable for deployment on edge devices.

 

  • The current version addresses this fundamentally through the design of efficient architectures. This is addressed in the discussion: "Given their reduced computational demands, these student models are well-suited for deployment in edge devices, where efficiency is paramount."

 

  • Furthermore, the architectures were built on a backbone architecture designed for mobile phones and built for efficiency.

 

Comment 3: Lines 275-276: "The most significant improvement was seen in the relational-based distillation method" is inconsistent with the data. AUC scores (feature-based): 0.9870 to 0.9893; AUC scores (relational-based): 0.9856 to 0.9898.

 

  • This comment was addressed through the reporting of median scores as a more reliable measure for interpretation. Furthermore, a statistical analysis has been implemented and reported within the manuscript to address these concerns. Updates can be found starting from line 260.



Author Response File: Author Response.pdf
