Inter-Sentence Segmentation of YouTube Subtitles Using Long-Short Term Memory (LSTM)

Round 1
Reviewer 1 Report
This paper addresses an interesting topic, which is to predict the punctuation mark in a given short span by using LSTM and Word2Vec. I have some considerations listed below:
1. Nowadays, ASR can achieve amazing recognition results, and the punctuation marks can also be output simultaneously. Thus, I think the paper should compare the results with some SOTA ASR outputs.
2. The major advantage of RNN-based networks is that they can accept variable-length sequences and thus leverage long-term dependency information for prediction and/or generation. However, in this study, the inputs are fixed-length spans. Consequently, I wonder whether other kinds of networks could be used for this task. In other words, why do we have to use RNN-based networks for this project? I suggest the paper emphasize the motivation.
3. In section 3.1, the paper mentions that “CBOW … This method is know to have good performance when data is small. On the … This method is know to show good performance when there is a lot of data.” Please provide references!
Author Response
1. Nowadays, ASR can achieve amazing recognition results, and the punctuation marks can also be output simultaneously. Thus, I think the paper should compare the results with some SOTA ASR outputs.
=> As the reviewer mentioned, ASR systems now generate punctuation marks automatically. In fact, Siri, Android, and the Google API insert punctuation marks after recognition. However, when recognizing multiple sentences at once, it is more important to distinguish whether a sentence ends with a period or a question mark than to find and recognize the appropriate boundary points. We added this explanation to the paper, and we compared our research with similar studies in the introduction and experiment sections.
2. The major advantage of RNN-based networks is that they can accept variable-length sequences and thus leverage long-term dependency information for prediction and/or generation. However, in this study, the inputs are fixed-length spans. Consequently, I wonder whether other kinds of networks could be used for this task. In other words, why do we have to use RNN-based networks for this project? I suggest the paper emphasize the motivation.
=> We think the question of why the data is cut to a fixed length is very valuable. The main reason is that the scripts provided by YouTube are cut into short segments, as shown in Figure 1. Of course, we have the full-length scripts, but we deliberately truncated the data to match the YouTube environment. However, the YouTube segments can still differ by one or two words, which we will consider further in future research. We selected the RNN family because it is the most widely used in NLP, but we plan to extend our research to CNN and attention-based architectures. We added this explanation to the introduction and conclusion.
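For concreteness, the following is a minimal sketch, not our exact model, of the kind of fixed-span LSTM classifier described above, assuming TensorFlow/Keras; the vocabulary size, span length, embedding dimension, layer widths, and label set are all illustrative.

```python
# Minimal sketch of a fixed-span LSTM punctuation classifier
# (illustrative only; not the exact model from the paper).
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # illustrative vocabulary size
SPAN_LEN = 10        # fixed span length, mimicking short YouTube segments
EMBED_DIM = 100      # e.g., a Word2Vec-sized embedding dimension
NUM_CLASSES = 2      # illustrative label set, e.g., sentence boundary vs. none

model = models.Sequential([
    layers.Input(shape=(SPAN_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # could be initialized from Word2Vec
    layers.LSTM(128),                         # encodes the fixed-length span
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data with the same shape as the fixed-length spans.
x = np.random.randint(0, VOCAB_SIZE, size=(32, SPAN_LEN))
y = np.random.randint(0, NUM_CLASSES, size=(32,))
model.fit(x, y, epochs=1, verbose=0)
```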
3. In section 3.1, the paper mentions that “CBOW … This method is know to have good performance when data is small. On the … This method is know to show good performance when there is a lot of data.” Please provide references!
=> These statements can be found at https://www.tensorflow.org/tutorials/representation/word2vec. We added the reference as a footnote because it is not a book or a paper. On that page, you can find the sentence, "For the most part, this turns out to be a ... better when we have larger datasets."
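For illustration, here is a minimal sketch of the CBOW vs. skip-gram choice, assuming the gensim library (an assumption for this example; it is not necessarily the toolkit used in the paper). The sg flag selects the architecture, and the toy corpus is illustrative only.

```python
# Minimal sketch of the CBOW vs. skip-gram choice in Word2Vec
# (illustrative; assumes gensim, not necessarily the paper's toolkit).
from gensim.models import Word2Vec

corpus = [["how", "do", "you", "do"],
          ["what", "is", "your", "name"]]  # toy corpus

# sg=0: CBOW, often cited as working well on smaller datasets.
cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

# sg=1: skip-gram, often cited as working well on larger datasets.
skip_gram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["you"].shape)       # (100,)
print(skip_gram.wv["you"].shape)  # (100,)
```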
Thank you.
Reviewer 2 Report
This paper explores methods to enrich automatic transcriptions with punctuation for downstream applications such as translation. Overall, this paper is easy to follow.
The motivation is meaningful, but the proposed method is not new. Therefore, this paper lacks novelty.
This paper also has serious problems with the quality of presentation. Figures 4 and 5 are quite well known. The authors need to cite their sources if these figures are to be used in the paper.
Furthermore, the authors claimed that inserting punctuation improves machine translation, but no experimental results were provided to support this statement.
Author Response
The motivation is meaningful, but the proposed method is not new. Therefore, this paper lacks novelty.
=> In the field of speech recognition, researchers have tried to extract sentences from speech for further processing. As you said, RNNs and LSTMs are not new methodologies in this area. In fact, [17] used an LSTM for the punctuation problem, but it combined text features with pause durations, which we do not use. Our research focuses mainly on YouTube scripts. We added these comments to the introduction.
This paper also has serious problems with the quality of presentation. Figures 4 and 5 are quite well known. The authors need to cite their sources if these figures are to be used in the paper.
=> We now cite both figures. Figure 4 already appeared in a paper cited in the article, but we added the citation to the figure caption as well. Unfortunately, we mistakenly believed Figure 5 was already covered by the references; we added a footnote to its caption.
Furthermore, the authors claimed that inserting punctuation improves machine translation, but no experimental results were provided to support this statement.
=> Our research is based on the hypothesis that punctuation can improve translation. This hypothesis was verified in [12], which showed robust performance on Chinese-to-English and English-to-Spanish speech translation.
Thank you.
Reviewer 3 Report
My comments concern some issues that, if addressed, should improve the presentation and relevance of the paper.
'Subtitling' or 'automatic subtitling' should be one of the key words.
The definition of machine translation is poor and should be reformulated to be more accurate and clear.
While LSTM is given in full in the title, RNN also needs to be given in full (p. 2, line 78).
Peculiar, if not awkward, usage of the verb 'learn' (p. 2, line 78)
The data description on p. 3 (line 89) is not clear. It needs elaboration to help the reader understand what you are trying to say.
p 3, line 104: the sentence would be much better by removing 'process'.
p 3, line 105: Your formulation of "the exclamation mark (!) and the question mark (?) are changed to a punctuation mark" is contradictory, if only because these are punctuation marks in the first place and thus cannot be changed to a punctuation mark. I would think that you meant that they were changed into ANOTHER punctuation mark.
p 4, line 114: could the sentence "the preprocessed data is processed" be changed into "the data is preprocessed"? If not, the repetition within the sentence is not recommended.
Author Response
'Subtitling' or 'automatic subtitling' should be one of the key words.
=> We added 'Subtitling' to the keywords list.
The definition of machine translation is poor and should be reformulated to be more accurate and clear.
=> We corrected the definition. We deleted the entire MT definition and wrote new sentences.
While LSTM is given in full in the title, RNN also needs to be given in full (p. 2, line 78).
=> We inserted the full name of RNN.
Peculiar, if not awkward, usage of the verb 'learn' (p. 2, line 78)
=> We corrected the sentence to "... in natural language processing, to build a model with data and predict the punctuation."
The data description on p. 3 (line 89) is not clear. It needs elaboration to help the reader understand what you are trying to say.
=> We corrected the sentence to "In this paper, we collect ... provided by Stanford University."
p 3, line 104: the sentence would be much better by removing 'process'.
=> We deleted 'process' from the sentence.
p 3, line 105: Your formulation of "the exclamation mark (!) and the question mark (?) are changed to a punctuation mark" is contradictory, if only because these are punctuation marks in the first place and thus cannot be changed to a punctuation mark. I would think that you meant that they were changed into ANOTHER punctuation mark.
=> We changed the word 'punctuation' to 'period (.)'.
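For clarity, this normalization step can be sketched as follows; this is illustrative Python, and normalize_punctuation is a name introduced here for the example, not our actual code.

```python
# Illustrative sketch of the normalization step: exclamation marks (!)
# and question marks (?) in the scripts are replaced with a period (.).
# normalize_punctuation is a hypothetical helper, not the paper's code.
import re

def normalize_punctuation(text: str) -> str:
    return re.sub(r"[!?]", ".", text)

print(normalize_punctuation("What is this? Amazing!"))  # "What is this. Amazing."
```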
p 4, line 114: could the sentence "the preprocessed data is processed" be changed into "the data is preprocessed"? If not, the repetition within the sentence is not recommended.
=> We deleted the word 'preprocessed'.
Round 2
Reviewer 1 Report
Thanks for the response, I have no further questions.
Author Response
Thanks for your comments.
Reviewer 2 Report
Thanks to the authors for their responses.
Author Response
Thanks for your comments.