Deep Models for Low-Resourced Speech Recognition: Livvi-Karelian Case
Round 1
Reviewer 1 Report
This article proposes an automatic speech recognition system for the Livvi-Karelian language using deep learning models. There are a few observations and suggestions for the authors:
1. A sentence indicating how the proposed work is novel as compared with existing techniques could be added in the abstract section.
2. The Introduction section could have more information on the Livvi-Karelian language constructs and which of these constructs are useful in the proposed methodology.
3. The contributions of the proposed work are missing. The authors need to address this in the Introduction section.
4. Related work is well written. A research gap needs to be highlighted at the end of this section.
5. Is it necessary to include Figure 3? If so, it needs to be cited.
6. The Results section is poorly discussed. The authors need to compare the results with existing language models on similar or different data sets. Only WER is used for evaluation. The authors should consider other metrics, such as accuracy and F1 score, and should include analysis graphs. Without this, the Results section is incomplete.
7. Future enhancements could be added in the conclusion section.
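For reference, the WER mentioned in point 6 is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below illustrates that definition under the assumption of whitespace tokenization; the function name `word_error_rate` is hypothetical and this is not the authors' scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.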
Overall Evaluation:
This article proposes an automatic speech recognition system for the Livvi-Karelian language using deep learning models. The Methodology section is well discussed, but the Results section needs improvement. The article may be accepted only after the suggestions provided have been incorporated.
Author Response
Please see the attachment
Author Response File:  Author Response.docx
Reviewer 2 Report
Thank you very much for the article.
The paper is difficult to follow, and it takes time for the audience to read. The article pays a lot of attention to the literature review of previous work and techniques involved, but not many details are provided regarding the results and discussion. I would suggest that the author could make the manuscript sound a bit more scientific.
Line 83: Please describe speech decoding and the best hypothesis selection in more detail.
Table 2: You do not have a validation set. How do you decide when to stop training the machine learning models? What is the stopping criterion?
Line 385: Should the results of perplexities be moved to the results section?
Section 436: Please compare the results of this work with others.
I would suggest that the author could make the manuscript sound a bit more scientific. Some parts, especially in the results and discussion, need corrections.
Author Response
Please see the attachment
Author Response File:  Author Response.docx
Reviewer 3 Report
The main issues of this paper concern overall innovation, experimental set-up, and organization.
Innovation: The progress beyond the state of the art is not well justified. A compact SOTA table should be included, indicating the main progress in the field, the assumptions made, and the limitations and advantages of this method.
Second, there are several papers in the field dealing with speech recognition and deep learning, so the innovative component is not well justified.
Third, there is no clear mathematical treatment of the method adopted. It is rather an application-driven paper.
Experimental Set-up: Though this work focuses on an application domain, it is unclear how the experiments support the main research objectives. The experiments are rather poor. There is no clear justification of how the proposed method is better, in several respects, than existing ones. The experiments also do not justify the use of this methodology, its advantages, or the assumptions made.
Organization: The overall organization of this paper is rather poor. The progress beyond the state of the art is not well justified, and the experiments fail to reveal these issues.
Author Response
Please see the attachment
Author Response File:  Author Response.docx
Reviewer 4 Report
This paper presents an automatic speech recognition system for the low-resource language, Livvi-Karelian. It is a commendable initiative as it probes into the possibilities of tech adaptation for low-resource languages. The authors built a corpus with both speech and text sets as the groundwork for their proposed system, utilizing a CNN-based acoustic model and an LSTM-based language model. The topic is intriguing and addresses an important gap in the field; however, some concerns need further elaboration and clarification:
1. Resource construction is critical in low-resource language research, as it facilitates continuous advancement. Although the authors reported constructing a corpus for Livvi-Karelian, they have not publicly released, or mentioned the accessibility of, the essential data, program code, or other materials relevant to their automatic speech recognition system. These are critical for the research community, and their absence may limit the work's reproducibility and impact.
2. Concerning the experiments, the speech dataset was divided into training and testing sets. However, without a dedicated development set, it remains unclear how the system parameters were fine-tuned. Details such as whether cross-validation experiments were performed, or how many times the repeated experiments were carried out would provide insights into the system's reliability and robustness.
3. It's also unclear how the authors decided to split the text set of the corpus; more details need to be provided.
4. The proposed deep learning-based hybrid system doesn't introduce significant novelty, as CNN/LSTM methods have been extensively used across various tasks. A clear articulation of the unique aspects of the proposed system would be beneficial.
5. The authors omitted critical specifics regarding the configuration of their deep neural networks. Initial setup details, text encoding approaches considering the low-resource nature of Karelian, choice of word embeddings, batch size, optimizer for the loss function, and the GPU type used for the study should be included for comprehensive understanding and replication.
6. To validate the proposed method, implementing a baseline model, perhaps a traditional machine learning approach, is essential for a fair performance comparison. This would strengthen the argument for the proposed hybrid method's effectiveness.
7. The experimental section would benefit from an in-depth error analysis. A thorough discussion of current system errors, possible methodological limitations, and potential areas of improvement could provide a more robust evaluation of the proposed system.
In summary, while the paper offers an interesting perspective on speech recognition for low-resource languages, more details are required to ensure the study's reproducibility and to enhance the proposed system's clarity and novelty.
The article is well written.
Author Response
Please see the attachment
Author Response File:  Author Response.docx
Round 2
Reviewer 1 Report
The suggestions provided have been incorporated in the revised manuscript. The article may be accepted for publication.
Author Response
We thank the reviewer for the positive comment.
Reviewer 2 Report
All concerns are met.
Minor language edits are still required.
Author Response
Point 1: All concerns are met.
We thank the reviewer for the positive comment.
Point 2: Minor language edits are still required.
We have carefully read the text of our paper and corrected a number of mistakes. All changes are highlighted in yellow. Thank you very much for bringing this issue to our attention!
Reviewer 3 Report
The authors have adequately addressed all my previous concerns.
Author Response
We thank the reviewer for the positive comment.
Reviewer 4 Report
I appreciate the efforts made to address my previous concerns. Nevertheless, there are some areas that still require clarification:
1. In the manuscript and responses, the authors have acknowledged the limited size of the dataset used in this study. Given that there isn't a distinct development set for model training and fine-tuning, and the absence of cross-validation techniques to validate the trained model, I am curious about the methodology used to determine the best-performing model for testing. Typically, the development set is employed to ascertain the optimal model parameter combination for testing due to the unknown true answers of the test data. A smaller dataset can often lead to overfitting, causing experimental results to fluctuate. In machine learning tasks, it is common to utilize cross-validation to repeat experiments multiple times, which allows for the calculation of average performance and variance. How have the authors addressed the potential overfitting issue in this context?
2. While I understand that data can be sensitive and may not always be feasible to release publicly, it would be beneficial if the authors could share their code along with a few sample data points. This would provide readers and the research community with a clearer insight into the significance and value of this study.
The manuscript is well written.
Author Response
Please see the attachment.
Author Response File:  Author Response.docx
Round 3
Reviewer 4 Report
I continue to have concerns regarding the experimental design and the robustness of the system presented by the authors. My previous reservations remain unaddressed.
From the authors' response, it appears that there were significant design flaws in their experiments. For standard machine learning tasks, the test data should never be used for model training or fine-tuning. This is rooted in the principle that we should have no prior knowledge of the test set, including the true labels. Typically, a separate development dataset is utilized to fine-tune the model, allowing for the exploration of various parameter combinations to identify the best-performing model. Only after this step should the model be applied to the test set for final predictions. Using the test set directly for model optimization risks overfitting, as the model would be exposed to the true answers of the test data during training. This situation is indicative of a label leakage issue, which must be avoided in machine learning tasks. Merely employing L2 regularization does not address the label leakage problem. In scientific research, time constraints should not justify deviating from rigorous research standards.
While it is understood that some datasets may be limited in terms of size, making it challenging to allocate a separate development dataset, the common practice in such cases is to employ cross-validation. This approach involves repeating the experiments multiple times to ensure the model's performance is both reliable and robust.
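The cross-validation procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the names `k_fold_indices` and `cross_validate` are hypothetical, and `train_and_score` stands in for whatever routine trains a model on the training indices and returns a score (for example, WER) on the held-out indices.

```python
import random
from statistics import mean, pstdev


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatability
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def cross_validate(n_items, k, train_and_score, seed=0):
    """Run k-fold cross-validation; return (mean score, standard deviation)."""
    scores = [
        train_and_score(train, held_out)
        for train, held_out in k_fold_indices(n_items, k, seed)
    ]
    return mean(scores), pstdev(scores)
```

Reporting the mean and the spread across folds is what lets a reader judge whether a result on a small corpus is stable. For speech data, one common design choice is to build the folds per speaker rather than per utterance, so that no speaker appears on both sides of a fold.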
In the case of the authors, if cross-validation is deemed too time-consuming, a viable alternative would be to partition the original training set into two new subsets, ideally in a 9:1 or 8:2 ratio, for training and development respectively. The authors are then advised to retrain the model using the subset designated for training and employ the development subset for parameter tuning.
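A minimal sketch of the suggested partition, assuming a 9:1 ratio and a list of training utterances, is given below. The helper names `split_train_dev` and `best_epoch_with_early_stopping` are hypothetical, and the early-stopping rule (stop after `patience` epochs without improvement in development loss) is one common choice rather than a procedure taken from the manuscript.

```python
import random


def split_train_dev(items, dev_fraction=0.1, seed=0):
    """Shuffle items and split them into (train, dev) subsets."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def best_epoch_with_early_stopping(dev_losses, patience=2):
    """Return the index of the epoch with the lowest dev loss seen before
    `patience` consecutive epochs pass without improvement."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch
```

With this arrangement, model selection and the stopping decision depend only on the development subset, and the test set is used once, for the final evaluation.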
It is unfortunate that the authors appear to have bypassed this foundational principle in their experimental design, casting doubts on the reliability and validity of their results.
The manuscript is well-written.
Author Response
Please see the attachment
Author Response File:  Author Response.docx