Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- The proposed augmentations rely on the Montreal Forced Aligner (MFA) for phoneme boundaries. However, the computational cost of MFA during training is ignored. Is alignment applied once during preprocessing or once per epoch? Also, no analysis of MFA's alignment accuracy on the Cantonese Common Voice data is provided. Errors in boundary estimation could propagate into the augmentation and degrade performance. The authors should address these issues (see the one-off alignment-caching sketch after these comments).
- The authors should explain how the hybrid CTC-CRF loss in Eq. 1 is optimized.
- MFA alignment and attention-weighted sampling may be too expensive for edge deployment. Also, no latency or throughput analysis is provided. The authors should address these issues.
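As a point of reference, a minimal sketch of the one-off preprocessing route the first comment alludes to: alignment is run once with MFA from the command line, and the resulting phone boundaries are parsed and cached so no alignment cost is incurred per epoch. The sketch assumes the `textgrid` Python package and hypothetical directory/file names; it is illustrative, not the authors' pipeline.

```python
# Minimal sketch: parse MFA TextGrid output once during preprocessing and
# cache phone boundaries, so no alignment cost is incurred per training epoch.
# Assumes the `textgrid` package (pip install textgrid) and that MFA has
# already been run, e.g. `mfa align corpus/ lexicon.dict acoustic_model out/`.
import json
from pathlib import Path

import textgrid  # one possible TextGrid parser; other libraries would also work


def extract_phone_boundaries(textgrid_dir: str, out_json: str) -> None:
    """Collect (phone, start_sec, end_sec) tuples for each utterance."""
    boundaries = {}
    for tg_path in Path(textgrid_dir).rglob("*.TextGrid"):
        tg = textgrid.TextGrid.fromFile(str(tg_path))
        # MFA writes a "phones" tier alongside the "words" tier.
        phone_tier = next(t for t in tg.tiers if t.name == "phones")
        boundaries[tg_path.stem] = [
            (iv.mark, iv.minTime, iv.maxTime)
            for iv in phone_tier
            if iv.mark  # skip silence / empty intervals
        ]
    with open(out_json, "w", encoding="utf-8") as f:
        json.dump(boundaries, f, ensure_ascii=False)


if __name__ == "__main__":
    extract_phone_boundaries("mfa_output/", "phone_boundaries.json")  # hypothetical paths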
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
This is a well-structured study that aims to improve Cantonese automatic speech recognition (ASR) under low-resource conditions by introducing two phoneme-aware augmentation techniques. The proposed approach is built upon the Whistle model and utilizes weak phonetic supervision. The reported results demonstrate promising reductions in phoneme error rate (PER) on the 50h and 100h subsets of the Cantonese corpus.
However, several important aspects require clarification and improvement:
1. The reported improvements in PER are promising, but no statistical significance testing is provided. Small changes in error rates may not reflect meaningful differences. Please consider applying standard significance tests to verify whether the observed differences are statistically valid.
2. The study does not include comparisons with other state-of-the-art or widely used approaches for low-resource ASR, especially those involving phoneme-level modeling.
3. Since the proposed methods rely heavily on phoneme boundaries obtained via MFA, a brief analysis of alignment quality is needed. This may include examples of typical boundary errors, phonological ambiguities, etc.
4. Although Whistle’s configuration is described in detail, key training settings for the baseline models (Wav2Vec 2.0 and Zipformer) are not provided.
5. Dynamic phoneme dropout is a core contribution, but its implementation is not clearly described. Several variables in equations (4)–(8) are left undefined, and the dropout mechanism is difficult to follow. Please revise Section 2.5 (an illustrative boundary-aligned dropout sketch is given after this list).
6. The study would benefit from including 2–3 examples of phoneme-level predictions, comparing target vs. predicted IPA sequences (with and without augmentation).
7. Whistle with CTC-CRF is used as the core ASR system, but the choice is not justified. Given the availability of well-established open-source architectures (e.g., Conformer), it would be helpful to briefly explain why Whistle was selected, and whether it offers specific advantages in performance, efficiency, or compatibility with phoneme-aware augmentation.
8. A more detailed discussion of the pros and cons of using the Whistle model with CTC-CRF would improve clarity. For instance, why was CRF used instead of pure CTC or attention-based decoding?
9. Overly dense paragraphs and readability: some sections, particularly Sections 2.2, 2.5, and 4.3, are difficult to read due to overly long paragraphs with multiple complex ideas.
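For context on item 5, a minimal sketch of what boundary-aligned phoneme dropout could look like on the input features, assuming cached (phone, start, end) boundaries and a 10 ms frame shift; this is an illustration only, not the manuscript's exact realization of equations (4)–(8).

```python
# Minimal illustrative sketch of boundary-aligned phoneme dropout (not the
# authors' exact implementation): whole phone segments, selected with
# probability p, are zeroed out in the input features. Assumes a 10 ms frame
# shift and cached (phone, start_sec, end_sec) boundaries from preprocessing.
import torch


def phoneme_dropout(
    feats: torch.Tensor,                          # (T, F) log-Mel features of one utterance
    boundaries: list[tuple[str, float, float]],   # (phone, start_sec, end_sec)
    p: float = 0.1,                               # per-phone dropout probability
    frame_shift: float = 0.01,                    # seconds per frame
) -> torch.Tensor:
    out = feats.clone()
    for _, start, end in boundaries:
        if torch.rand(1).item() < p:
            s = int(round(start / frame_shift))
            e = min(int(round(end / frame_shift)), feats.size(0))
            out[s:e, :] = 0.0                     # drop the entire phone segment
    return out
```

A curriculum variant could simply ramp p up from zero over the first training epochs.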
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
Comments and Suggestions for Authors
Review report for journal article: Phoneme-Aware Augmentation for Robust Cantonese ASR under Low-Resource Conditions
The article explores two low-resource data augmentation strategies, dynamic boundary-aligned phoneme dropout and phoneme-aware SpecAugment, using 50 h and 100 h subsets of the Cantonese Common Voice dataset and a Whistle ASR model.
Major Comments
- Section 2 also introduces the Wav2vec 2.0 and Zipformer ASR models in detail. Could the authors state whether these models were trained or fine-tuned in their paper? The authors should also clearly state why these models are compared to the Whistle model, given that the optimization approaches are clearly different.
- There are several repeated paragraphs which impede the flow of the article. For example, lines 179-187 contain information already stated in lines 156-164, lines 239-245 are already presented in lines 208-215, the information in lines 222-223 is already stated in lines 219-220, lines 523-528 duplicate lines 512-516, etc.
- Line 293: How is the frequency masking operation aligned with the full duration of a phoneme segment, and what is the motivation? Do the authors only mask the frequency segment where the time segments are also masked? This is unclear. The authors could also include figures to clearly explain the process if this is the case.
- Line 302: The phoneme-aware SpecAugment relies on several other components, such as (a) concentrating masking on the regions the model relies on most, via attention-weighted sampling, and (b) curriculum learning. How could one determine the impact of these tweaks on the method itself? The authors could provide ablation studies that compare the performance of the method incrementally, with and without the extra bells and whistles.
- Line 318: The results presented in the paper are relative to the error rate of the forced-alignment model and process. Therefore, the authors should provide more information about this. For example: How accurate is MFA at producing time-aligned phoneme annotations? What is the initial accuracy of the MFA model? Why did the authors choose MFA over other available forced-alignment libraries? How do these other libraries/models compare to MFA?
- Similarly, did the authors in any way manually validate the results of the forced alignment?
- Section 4: The authors could include more informative plots, e.g., curriculum-learning variables versus loss or PER.
- Table 1: The CER results of Wav2vec and Zipformer are not comparable to the PER results of Whistle in the same table. The table cannot be used to reach a conclusion among the models because their metrics differ. It would be preferable for the authors to convert the predictions from Wav2vec and Zipformer to phones and then compute the PER directly (a conversion sketch is given after these comments).
- Line 436: The authors should define static and dynamic dropout before using the terms in the table. Although these can be inferred from lines 444-450, that explanation comes after the terms have been used.
- Line 450: Could the authors describe how they determined that their method leads to a significant improvement? Conclusions regarding significance cannot be reached without statistical tests, especially for small datasets.
- Do these ideas transfer to other low-resource languages or other tonal languages? This is also not mentioned in future work; the authors could provide some experiments to demonstrate it.
- The authors could provide more qualitative results for their approach compared to other methods, e.g., plots of the phoneme embedding space before and after augmentation.
- Lines 540-541: Should the toolkit selection impact the results?
- Lines 554-556: Trained models are typically what is deployed on embedded devices, so the model architecture and parameter size are usually the performance factors. However, the authors also seem to suggest deploying their data augmentation method on these devices. Could the authors state the intended use of their approach?
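Regarding the Table 1 comment, a minimal sketch of the suggested conversion, assuming the pycantonese and editdistance packages; the syllable-level Jyutping granularity used here is a simplification of a full phone inventory and is purely illustrative.

```python
# Minimal sketch (assuming the pycantonese and editdistance packages):
# convert character-level references/hypotheses to Jyutping syllables and
# score PER by edit distance. Splitting each Jyutping string at the tone
# digit yields one token per syllable; a finer onset/nucleus/coda split
# would require an additional mapping step.
import re

import editdistance
import pycantonese


def to_phone_seq(text: str) -> list[str]:
    syllables = []
    for _, jyutping in pycantonese.characters_to_jyutping(text):
        if jyutping:                                   # None for unmapped tokens
            syllables += re.findall(r"[a-z]+[1-6]", jyutping)
    return syllables


def per(ref_text: str, hyp_text: str) -> float:
    ref, hyp = to_phone_seq(ref_text), to_phone_seq(hyp_text)
    return editdistance.eval(ref, hyp) / max(len(ref), 1)
```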
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I have no further concerns; the manuscript is recommended for acceptance in its present form.
Author Response
We sincerely thank the reviewer for the positive evaluation and recommendation.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed nearly all of the previously raised concerns with clarity. However, the issue of statistical significance testing remains unaddressed. Without this analysis (e.g., paired t-tests, or reporting the mean and standard deviation over multiple training runs with different random seeds), it is difficult to assess whether the improvements can be reliably attributed to the proposed methods or may be due to random variation.
Author Response
Comments 1: The authors have addressed nearly all of the previously raised concerns with clarity. However, the issue of statistical significance testing remains unaddressed. Without this analysis (e.g., paired t-tests, or reporting the mean and standard deviation over multiple training runs with different random seeds), it is difficult to assess whether the improvements can be reliably attributed to the proposed methods or may be due to random variation.
Response 1: We thank the reviewer for raising this important point. We agree that statistical validation is essential to assess whether the observed improvements are due to the proposed methods rather than random variation. In our revision, we have included confidence interval estimation for the final results. Specifically, on the test set with 50 h training, the full augmentation strategy achieved a PER of 15.9% with a 95% confidence interval of ±0.5 percentage points (reported in Section 4.5). This interval is substantially smaller than the absolute improvements (1.5–2.0 PER points) observed in our experiments, indicating that the performance gains are statistically significant and unlikely to be explained by random fluctuations. We have further clarified this in the manuscript.
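A minimal sketch of how such an interval could be obtained with a nonparametric bootstrap over utterances, assuming per-utterance phone edit distances and reference lengths are available; this is illustrative and not necessarily the exact procedure used in the manuscript.

```python
# Minimal sketch of a 95% bootstrap confidence interval for PER, resampling
# utterances with replacement. Assumes per-utterance phone edit distances and
# reference lengths; illustrative only, not necessarily the manuscript's procedure.
import numpy as np


def per_bootstrap_ci(errors, ref_lens, n_boot=10000, seed=0):
    rng = np.random.default_rng(seed)
    errors, ref_lens = np.asarray(errors), np.asarray(ref_lens)
    n = len(errors)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample utterances
        samples.append(errors[idx].sum() / ref_lens[idx].sum())
    lo, hi = np.percentile(samples, [2.5, 97.5])
    return lo, hi
```

A paired variant, resampling the same utterance indices for two systems and bootstrapping the PER difference, would directly test whether one system's gain over the other is robust.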
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you to the authors for their rebuttal; however, the authors have not fully addressed all of the concerns.
Major concerns
1. Figures would make the ideas clearer for readers. For example, for phoneme dropout, are the same phonemes masked at each layer of the model, or are random phonemes selected? For phoneme-aware SpecAugment, are the time segments masked for some phonemes and the frequencies masked for other phonemes? How can time and frequency masking be applied to the same phoneme segments? (One possible arrangement is sketched after these comments.)
- The authors should provide the ablation studies raised in my previous review. These help identify which factors actually lead to the improvements and support reproducibility by other readers.
- Comparing CER to WER is not acceptable in any way, and no informative conclusions can be made. Either all models are trained and evaluated using PER, or the texts from Wav2vec and Zipformer are converted into phonemes and the PER is then calculated on those phonemes for consistency.
- Lines 366-369 mention the objective of the augmentation method, i.e., to enhance the model's sensitivity to phonemic structure and bolster its robustness to pronunciation blurring or deletion. Without qualitative analysis, the approach can hardly be justified as having met this objective. Therefore, it is still a good idea to provide embedding plots before, during, and after augmentation, as requested.
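One possible arrangement, given as a minimal hedged sketch rather than the authors' exact method: for each selected phone, either the full time span is masked or a narrow frequency band is masked within that same span, so both operations stay anchored to phoneme boundaries.

```python
# Minimal sketch (not the authors' exact method) of phoneme-aware SpecAugment:
# the time mask covers the full span of a selected phone, and a frequency mask
# is applied only within that same span, keeping both operations anchored to
# phoneme boundaries.
import torch


def phoneme_aware_specaugment(
    feats: torch.Tensor,                          # (T, F) log-Mel spectrogram
    boundaries: list[tuple[str, float, float]],   # (phone, start_sec, end_sec)
    p: float = 0.2,                               # per-phone masking probability
    max_freq_width: int = 20,                     # max width of a frequency mask (bins)
    frame_shift: float = 0.01,                    # seconds per frame
) -> torch.Tensor:
    out = feats.clone()
    T, F = out.shape
    for _, start, end in boundaries:
        if torch.rand(1).item() >= p:
            continue
        s = int(round(start / frame_shift))
        e = min(int(round(end / frame_shift)), T)
        if torch.rand(1).item() < 0.5:
            out[s:e, :] = 0.0                     # time mask over the whole phone
        else:
            width = int(torch.randint(1, max_freq_width + 1, (1,)).item())
            f0 = int(torch.randint(0, max(F - width, 1), (1,)).item())
            out[s:e, f0:f0 + width] = 0.0         # frequency mask within the phone span
    return out
```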
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 3
Reviewer 3 Report
Comments and Suggestions for Authors
I would like to thank the authors for responding to my comments.
Minor comment
1. To improve on the qualitative analysis done in Figure 5, the authors could include statistics, e.g., the average ratio of spectrogram frame length to unmasked frame length for random masking and phone-aware masking. Comparing the values for both could also provide insights into the optimal masking ratio for achieving good results. This is optional.
2. Figure 1: Text in the second rectangle is not readable.
Author Response
Comments 1: To improve on the qualitative analysis done in Figure 5, the authors could include statistics, e.g., the average ratio of spectrogram frame length to unmasked frame length for random masking and phone-aware masking. Comparing the values for both could also provide insights into the optimal masking ratio for achieving good results. This is optional.
Response 1: We sincerely thank the reviewer for this constructive suggestion. We agree that adding statistical comparisons (e.g., the average ratio of spectrogram frame length to unmasked frames) could provide additional insights into the masking behavior. However, the main objective of this work is to highlight the effectiveness of phoneme-aware augmentation strategies under low-resource settings, rather than to exhaustively optimize masking ratios. To keep the study focused, we have not introduced these extra statistics. Instead, we emphasize the qualitative results already shown in Figure 5, which demonstrate that phoneme-aware masking aligns better with linguistic boundaries and provides greater robustness compared to random masking.
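For completeness, a minimal sketch of how the suggested statistic could be computed over a set of augmented utterances, assuming masked frames are zero-filled as in the illustrative sketches above; this is not taken from the manuscript.

```python
# Minimal sketch of the statistic suggested by the reviewer: the average ratio
# of total spectrogram frames to unmasked frames over a set of augmented
# utterances. A masked frame is assumed to be zero-filled, matching the
# illustrative masking sketches above.
import torch


def avg_total_to_unmasked_ratio(augmented_feats: list[torch.Tensor]) -> float:
    ratios = []
    for feats in augmented_feats:                       # each tensor is (T, F)
        unmasked = int((feats.abs().sum(dim=1) > 0).sum().item())
        if unmasked > 0:
            ratios.append(feats.size(0) / unmasked)
    return sum(ratios) / max(len(ratios), 1)
```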
Comments 2: Figure 1: Text in the second rectangle is not readable.
Response 2: We thank the reviewer for pointing out this issue. In the revised manuscript, Figure 1 has been enlarged and the font size of all labels has been increased. The text in the second rectangle, along with the other blocks and annotations, is now clearly visible and easy to read. Figure 1 is provided in the attachment for your reference.
Author Response File: Author Response.docx