Empirical Analysis of Learning Improvements in Personal Voice Activity Detection Frameworks
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This study presents a systematic enhancement of the personal voice activity detection (PVAD) framework through four key methodological improvements, which is then verified through experiments. The research represents a meaningful contribution to the field; however, the following points should be addressed to further strengthen the work:
- The abstract should be expanded to include a more comprehensive description of the research background.
- The full name should be provided when an abbreviation first appears, for example, CALR.
- In Section 4.3, is it appropriate to set the dropout rate to 0.1? This may lead to the loss of key information.
- In Section 4.4, computational efficiency indicators such as FLOPs should be considered, especially for PVAD tasks with strict real-time requirements (see the sketch after this list).
- The conclusion should reflect more of the quantitative results.
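To make the FLOPs/parameter point concrete, a minimal PyTorch-style sketch of how such figures could be reported is given below; the layer sizes, the dropout rate of 0.1, and the use of the third-party thop profiler are illustrative assumptions, not the authors' actual configuration.

```python
import torch
import torch.nn as nn

# Toy frame-level classifier standing in for a PVAD head (sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(40, 128),   # e.g. 40-dim acoustic features per frame
    nn.ReLU(),
    nn.Dropout(p=0.1),    # the dropout rate questioned in the comment above
    nn.Linear(128, 3),    # 3 output classes (non-speech / non-target / target speech)
)

# The parameter count is straightforward to report.
num_params = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {num_params}")

# MACs/FLOPs can be estimated with a profiler such as thop (pip install thop).
from thop import profile
dummy = torch.randn(1, 40)
macs, _ = profile(model, inputs=(dummy,))
print(f"MACs per frame: {macs:.0f} (FLOPs are roughly 2x MACs)")
```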
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The article describes modifications of the Personal VAD 2.0 system:
- incorporation of bidirectional gated recurrent units for better temporal modelling
- addition of a cross-attention mechanism to dynamically adjust the influence of the speaker embedding on the features
- introduction of an AUROC loss function to address class imbalance
- use of cosine annealing as the learning rate schedule
The authors evaluate each technique independently on the LibriSpeech dataset, comparing several accuracy metrics and the number of model parameters.
Although each modification improves the VAD system, the authors state (with no results given) that combining the techniques does not bring additional improvement.
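For illustration only, a minimal PyTorch-style sketch of how a bidirectional GRU and a cross-attention block conditioned on a speaker embedding might fit together; all module names, dimensions, and the use of nn.MultiheadAttention are assumptions and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn

class BiGRUCrossAttentionPVAD(nn.Module):
    """Illustrative sketch: frame features attend to a single speaker embedding."""
    def __init__(self, feat_dim=40, hidden=64, spk_dim=256, num_classes=3):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.spk_proj = nn.Linear(spk_dim, 2 * hidden)   # match the Bi-GRU output size
        self.cross_attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames, spk_embedding):
        # frames: (batch, time, feat_dim); spk_embedding: (batch, spk_dim)
        h, _ = self.bigru(frames)                         # (batch, time, 2*hidden)
        spk = self.spk_proj(spk_embedding).unsqueeze(1)   # (batch, 1, 2*hidden)
        # Cross-attention: frame states as queries, speaker embedding as key/value.
        attended, _ = self.cross_attn(query=h, key=spk, value=spk)
        return self.classifier(h + attended)              # per-frame class logits

model = BiGRUCrossAttentionPVAD()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 256))  # shape (2, 100, 3)
```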
The paper is well written and the techniques are well described. Only the parts on lines 154-156 and 163-166 are very similar; they convey the same message. Please combine them or leave one part out. Also, move the three lines from the top of page 7 below the figure; the current layout does not look good.
I miss a comparison with other modifications made to the baseline system since its publication in 2022; Personal VAD (1.0) was published in 2019, so it is now about time for a PVAD 3.0. What other modifications have been made to the system, and what is their performance?
The authors also do not state clearly what the system is trained on. They name the dataset splits but do not write which are used for training. They also created additional recordings to simulate multi-speaker conditions, but it is not clear whether these were used for training, for testing, or for both. This makes it impossible to compare the results to other published research. The authors should report results on a standard training/test set and on their special set separately.
The authors claim that the Bi-GRU does not excessively increase the model complexity, but this technique doubles the number of parameters. Is that not an excessive increase in complexity?
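The parameter argument is easy to verify with a quick sketch (layer sizes are arbitrary and chosen only for illustration):

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

uni = nn.GRU(input_size=256, hidden_size=128, batch_first=True)
bi  = nn.GRU(input_size=256, hidden_size=128, batch_first=True, bidirectional=True)

# The bidirectional layer holds one GRU per direction, i.e. exactly twice the parameters.
print(count(uni), count(bi))
```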
I would like to see the results for the combination of the techniques, even though it does not bring additional improvements.
The relevant citation for the d-vector approach should be given. Currently, the authors refer only to PVAD and PVAD 2.0.
In the citations, "In Proceedings of the Proceedings of the" appears many times; please fix it.
Why do some cited paper titles capitalize the first letter of every word while others do not?
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The research proposes four improvements for personal voice activity detection (PVAD):
- Bidirectional GRU for temporal modeling,
- Cross-attention for context-aware speaker integration,
- CE-AUROC loss for addressing class imbalance,
- Cosine annealing learning rate for training convergence.
Overall, the research provides a practical analysis of PVAD, supported by a review of related work, clear motivations, and well-designed experiments. However, several issues should be addressed:
- Some equations in the paper are not numbered.
- The authors use pairwise hinge loss to approximate the “AUROC loss.” Why not retain the original name, "pairwise hinge loss"? Upon checking citations [18] and [19], the reviewer found that the provided formula has little to no relation to these sources—citation [19] differs in implementation, and citation [18] is based on a different theoretical framework.
- Experiments were conducted solely on the LibriSpeech dataset, which limits the generalizability of the methods. Since this is an empirical analysis, the reviewer suggests including at least one additional dataset, if feasible. Nonetheless, the reviewer understands that conducting additional experiments requires significant time and resources.
- The authors demonstrate improvements by incorporating each new module individually into the baseline, but omit the potential of combining them, for example using both CALR and AUROC loss together (see the sketch after this list).
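A minimal sketch of how a pairwise hinge surrogate for AUROC could be combined with cross-entropy and a cosine annealing learning rate schedule in one training loop; the margin, the loss weighting, the optimizer settings, and the reduction to a binary target-speaker score are all illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

def pairwise_hinge_auroc(scores, labels, margin=1.0):
    """Hinge surrogate for AUROC: every positive score should exceed every
    negative score by at least `margin` (binary labels assumed)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())                  # no positive-negative pairs in this batch
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)       # all positive-negative pairs
    return torch.clamp(margin - diff, min=0).mean()

# Toy model and data; in PVAD the positive class would be target-speaker speech.
model = nn.Linear(40, 2)                             # binary logits, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=100)  # cosine annealing learning rate (CALR)
ce = nn.CrossEntropyLoss()

for step in range(100):
    feats = torch.randn(32, 40)
    labels = torch.randint(0, 2, (32,))
    logits = model(feats)
    # Combined objective: cross-entropy plus the AUROC-style pairwise hinge term.
    loss = ce(logits, labels) + 0.5 * pairwise_hinge_auroc(logits[:, 1], labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # anneal the learning rate each step
```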
Author Response
Please see the attachment.
Author Response File: Author Response.pdf