Inter-Sentence Segmentation of YouTube Subtitles Using Long-Short Term Memory (LSTM)



Introduction
Speech to Text (STT) [1,2] is a process in which a computer interprets a person's speech and converts its content into text. One of the most popular algorithms is the Hidden Markov Model (HMM) [3], which constructs an acoustic model by statistically modeling voices spoken by various speakers [4] and builds a language model from a corpus [5].
Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another (https://en.wikipedia.org/wiki/Machine_translation). MT has been approached through rules [6], examples [7], and statistics [8]. More recently, Neural Machine Translation (NMT) [9] has dramatically improved MT performance, and many translation apps, such as iTranslate (https://www.itranslate.com/) and Google Translate (https://translate.google.com/), compete in the market. The automated captioning system [10] of YouTube, a popular video-sharing site, combines STT with MT technology. When a video is uploaded, YouTube extracts the voice data, writes subtitle files, and translates the files into the desired language. These techniques are of great help to users who do not speak the language used in the content. However, if the speaker does not deliberately separate sentences, for example by pausing between them, multiple sentences are recognized as a single sentence. This problem significantly degrades MT performance. What is worse, YouTube manages subtitles by time slots, not by utterances. The subtitles automatically generated on YouTube are divided into units of scene, and these units are what get translated.
In this paper, we propose an automatic sentence segmentation method that inserts period marks using deep neural networks to improve the accuracy of automatic translation of YouTube subtitles.
In natural language processing, the period is a very important factor in determining the meaning of a sentence. Table 1 shows sentences whose meanings differ according to the position of the period. The example sentence is an STT-generated sentence from which periods and capital letters have been removed. Depending on the position of the period, this sentence is divided into completely different sentences, as in Case 1 and Case 2, and the resulting sentences have different meanings. If the original text is divided as in Case 1, "he" refers to "Sam John". On the other hand, if it is divided as in Case 2, "he" refers to "John", and "John" should eat "Sam".

Table 1. An example of segmentation ambiguity in an STT-generated sentence without periods and capital letters.

Example Sentence: i told him eat sam john called imp followed the command
Case 1: I told him eat. Sam John called imp followed the command.
Case 2: I told him eat Sam. John called imp followed the command.
Past studies on real-time sentence boundary detection and period position prediction [11][12][13] attempted to combine words and other features (pause duration, pitch, etc.) into one framework. One study [14] verified that the detection process could improve translation quality. In addition, research has been conducted to automatically detect sentence boundaries based on a combination of N-gram language models and decision trees [15]. Currently, Siri, Android, and the Google API (https://cloud.google.com/speech-to-text/docs/automatic-punctuation) insert punctuation marks after recognition. However, for these systems it is more important to decide whether the punctuation of a sentence is a period or a question mark than to find the appropriate positions when recognizing multiple sentences at once. Because these studies focused on speech recognition and translation, they relied on acoustic features, which YouTube scripts cannot provide. A Chinese sentence segmentation study [16] created a statistical model using data derived from the Chinese Treebank and predicted the positions of periods based on that model. This research differed from the other studies mentioned because it used only text data.
This paper presents a new sentence segmentation approach, based on neural networks and YouTube scripts, that is relatively less dependent on word order and sentence structure, and we measure its performance. We used the 27,826 subtitles included in the online lectures provided by Stanford University for this study. These lecture videos provide subtitle data that are well separated into sentence units. Therefore, these subtitles can be converted into the format of the automatic subtitles provided by YouTube and used as training data for a model that classifies whether a period is present or not. We use Long-Short Term Memory (LSTM) [17], a variant of the Recurrent Neural Network (RNN) [18] with excellent performance in natural language processing, to build a model from the data and predict the positions of punctuation marks. LSTM has shown potential for punctuation restoration in speech transcripts [19], combining textual features and pause duration. Although RNNs perform well with inputs of various lengths, we sacrificed some of this benefit by making the data length similar to that of YouTube subtitles. In this study, we build the input as closely as possible to YouTube scripts and try to locate the punctuation marks using only text features.
This paper is composed as follows. Section 2 describes the experimental data used in this study and their preprocessing. Section 3 explains neural network-based machine learning and describes the algorithms and models used in this study. Section 4 explains the period prediction and sentence segmentation experiments in detail. Finally, Section 5 summarizes the study and presents conclusions.

Data
In this paper, we collect 27,826 sentences: 11,039 sentences from the "Natural Language Processing with Deep Learning" class and 16,787 sentences from the "Human Behavioral Biology" class provided by Stanford University. The videos of these two classes provide caption data consisting of complete sentences. Therefore, the subtitles can be transformed into the automatic subtitle format provided by YouTube and used as training data to learn the positions of periods. To do this, we convert the complete caption data into the format shown in Figure 1.
Appl. Sci. 2019, 9, x FOR PEER REVIEW

Preprocessing
Figure 1 shows that company names such as Google (00:24), Microsoft (00:26), Facebook (00:28), and Apple (00:32) are written with capital letters. However, because some subtitle authors did not capitalize first letters, we converted the whole data set to lower case. Moreover, without lowercase conversion, word embedding would recognize "Google" and "google" as different words.
Figure 2 is a flowchart of the preprocessing in this study. First, exclamation marks (!) and question marks (?) are changed to periods (.) to indicate the ends of sentences. After that, sentences with fewer than 7 words ending in a period are excluded from the data; 7 words is the minimum length required for learning. The eliminated sentences consist mainly of simple greetings or nouns. Automatically generated subtitles do not have apostrophes (') or other punctuation. Therefore, only the period positions of the actual subtitles are stored as a reference, and all remaining punctuation marks, such as commas (,), double quotation marks (" "), and single quotation marks (' '), are removed.
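The steps above can be sketched in a few lines of Python (a minimal illustration; the function name, regular expressions, and the threshold constant are our own, not code from the paper):

```python
import re

MIN_WORDS = 7  # sentences shorter than this are discarded

def preprocess(text):
    """Normalize a caption string as described above: lowercase,
    unify '!'/'?' to '.', strip all other punctuation, and drop
    sentences with fewer than MIN_WORDS words."""
    text = text.lower()
    text = re.sub(r"[!?]", ".", text)          # sentence-final marks -> '.'
    text = re.sub(r"[^a-z0-9.\s]", " ", text)  # drop commas, quotes, apostrophes, ...
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = [" ".join(s.split()) for s in sentences if len(s.split()) >= MIN_WORDS]
    return " . ".join(kept) + " ." if kept else ""

print(preprocess('Hello! Google, Microsoft, Facebook and Apple are "big" companies.'))
# -> google microsoft facebook and apple are big companies .
```

Note that "hello" is dropped by the length filter, matching the removal of short greeting sentences.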

The preprocessed data are then arranged as in Table 2 to train the long-short term memory (LSTM). First, given an original data sentence, a window of size 5 is created by moving one word at a time. In this study, only windows containing a period were considered as training data. That is, in the first line of Table 2, "no one use them. and", "one use them. and so", "use them. and so the", "them. and so the cool", and ". and so the cool thing" are generated as candidates for the example (Ex.). Only one of these candidates is selected as the Ex.

Table 2 (excerpt):
Raw Data: and so the cool thing was that by doing this as neural network dependency parser we were able to get much better accuracy. we were able . . .
Ex. 1: ['.', 'and', 'so', 'the', 'cool', 'thing']; X_DATA[0]: ['and', 'so', 'the', 'cool', 'thing']; Y_DATA[0]: [1, 0, 0, 0, 0]
Ex. 2: ['get', 'much', 'better', 'accuracy', '.', 'we']; X_DATA[1]: ['get', 'much', 'better', 'accuracy', 'we']; Y_DATA[1]: [0, 0, 0, 0, 1]

Each Ex. is converted into X_DATA containing only its words, and these words become the input for learning. For learning, the words are converted to vector representations using word embedding. Y_DATA, the learning reference, marks the word that has a period to its left. In Table 2, the first example has a period at the left of the first word, and the second example has a period at the left of the fifth word. Through this process, the LSTM is trained with X_DATA and Y_DATA, and a model that can predict the end of a sentence is created.
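The window extraction just described can be sketched as follows (the function name and tokenization are our own illustration; unlike the paper, which then selects a single candidate per period, this sketch returns every candidate window):

```python
def make_examples(tokens, window=5):
    """Build (X, Y) pairs from a token stream in which '.' is a separate
    token.  X is a list of `window` words; Y is a one-hot list marking
    the word that has a period to its left."""
    words, left_period = [], []
    prev_was_period = False
    for t in tokens:
        if t == ".":
            prev_was_period = True
        else:
            words.append(t)
            left_period.append(prev_was_period)
            prev_was_period = False
    examples = []
    for i in range(len(words) - window + 1):
        y = [1 if flag else 0 for flag in left_period[i:i + window]]
        if sum(y) == 0:
            continue  # keep only windows that contain a period
        examples.append((words[i:i + window], y))
    return examples

tokens = "get much better accuracy . we were able".split()
for x, y in make_examples(tokens):
    print(x, y)
```

The first pair printed reproduces Ex. 2 from Table 2: X = ['get', 'much', 'better', 'accuracy', 'we'] with Y = [0, 0, 0, 0, 1].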

Methods
In this paper, among machine learning methods, we use the LSTM artificial neural network to predict whether or not to split the sentence. To do this, we create a dictionary with a number (index) for each word, express each word as a vector using word embedding, and concatenate the 5 word vectors to form X_DATA.
Figure 3 shows how the data are processed in the program. The five words in Figure 3 are taken from the second example in Table 2. First, all words obtain 100-dimensional vector values through word embedding using Word2Vec. These vector values also contain information about the position of the period. Second, the words that have been converted to vectors enter the LSTM layer, and each output is calculated while being affected by the previous output. Finally, from these outputs, a softmax function predicts that a period must be generated between "accuracy" and "we".
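The embedding → LSTM → softmax pipeline of Figure 3 can be sketched as a forward pass in plain NumPy. This is a toy illustration with randomly initialized weights: the dimensions EMB = 100 and WIN = 5 follow the paper, while the vocabulary size, hidden size, and all names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID, WIN = 1000, 100, 64, 5

# word embedding table (100-dimensional vectors, as in the paper)
E = rng.normal(scale=0.1, size=(VOCAB, EMB))
# LSTM parameters: one weight matrix and bias per gate (input, forget, output, cell)
W = {g: rng.normal(scale=0.1, size=(EMB + HID, HID)) for g in "ifoc"}
b = {g: np.zeros(HID) for g in "ifoc"}
# softmax classifier over the 5 window positions
Wy = rng.normal(scale=0.1, size=(HID, WIN))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(word_ids):
    """Run the 5 embedded words through an LSTM cell step by step and
    classify which of the 5 positions has a period to its left."""
    h = np.zeros(HID)
    c = np.zeros(HID)
    for wid in word_ids:
        x = np.concatenate([E[wid], h])
        i = sigmoid(x @ W["i"] + b["i"])   # input gate
        f = sigmoid(x @ W["f"] + b["f"])   # forget gate
        o = sigmoid(x @ W["o"] + b["o"])   # output gate
        g = np.tanh(x @ W["c"] + b["c"])   # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
    logits = h @ Wy
    p = np.exp(logits - logits.max())
    return p / p.sum()                     # softmax over the 5 positions

probs = forward([3, 17, 42, 7, 99])        # 5 hypothetical word indices
print(probs.shape)
```

With trained weights, the argmax of `probs` would indicate the predicted period position; here the weights are random, so only the shape and normalization are meaningful.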

Word Embedding
Word embedding is a method of expressing all the words in a given corpus in a vector space [20]. Word embedding allows us to estimate the similarity of words, which makes it possible to achieve higher performance in various natural language processing tasks. Typical word embedding models are Word2Vec [21][22][23], GloVe [24], and FastText [25]. In this paper, words are represented as vectors using Word2Vec, the most widely known of these models.
Word2Vec is a continuous word embedding learning model created in 2013 by several researchers, including Mikolov at Google. Word2Vec has two learning models: Continuous Bag of Words (CBOW) and Skip-gram. CBOW creates a network that infers a given word using its surrounding words as input. This method is known to perform well when data are scarce. In contrast, Skip-gram uses the given word as input to infer the surrounding words. This method is known to perform well when there is a lot of data (https://www.tensorflow.org/tutorials/representation/word2vec).
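The difference between the two training schemes comes down to how (input, target) pairs are built from a sentence; a minimal sketch (function and variable names are our own):

```python
def training_pairs(words, window=2, mode="skipgram"):
    """Generate (input, target) training pairs.
    CBOW: context words -> center word.  Skip-gram: center word -> each context word."""
    pairs = []
    for i, center in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        if mode == "cbow":
            pairs.append((context, center))
        else:
            pairs.extend((center, ctx) for ctx in context)
    return pairs

sent = "the quick brown fox".split()
print(training_pairs(sent, window=1, mode="cbow"))
print(training_pairs(sent, window=1, mode="skipgram"))
```

CBOW averages the context vectors to predict the center word, which is why it smooths over sparse data, while Skip-gram produces one pair per context word and thus benefits from larger corpora.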

RNN (Recurrent Neural Network)
An RNN (Recurrent Neural Network) is a neural network model suitable for learning time-series data. Earlier neural network structures assume that inputs and outputs are independent of each other. In an RNN, however, the same activation function is applied to every element of a sequence, and each output is affected by the previous output. Nevertheless, practical implementations of the basic RNN can effectively handle only relatively short sequences. This is called the vanishing gradient problem, and the long-short term memory (LSTM) was proposed to overcome it.
As shown in Figure 4, LSTM addresses the vanishing gradient problem by adding an input gate, a forget gate, and an output gate to the paths that compute the data in the basic RNN structure.
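In the standard formulation (our notation, following the common convention rather than equations taken from this paper), the three gates and the cell update are:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here the forget gate scales the old cell state, the input gate admits new information, and the output gate controls what reaches the hidden state.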

LSTM has been successfully applied to many natural language processing problems. In particular, it performs well in language modelling [26], which calculates the probability of the next word in a sentence, and in machine translation [27], which decides which sentence to output as the result of automatic translation. In this study, we use this characteristic of LSTM to learn the sentence segmentation features of a series of consecutive words. We also create a model that can automatically segment sentences by producing a degree of segmentation as the output value of the network.

Experimental Results
The data in this paper consist of 27,826 subtitle examples (X_DATA, Y_DATA) in total. Training data and test data were randomly split at a 7:3 ratio. Table 3 shows the hyperparameters used in this study and the resulting performance. As can be seen in Table 3, words were represented as 100-dimensional vectors and training was repeated 2000 times on the training set. As a result, the final cost is 0.181 and the accuracy is 70.84%.
We created the confusion matrix in Table 4 to evaluate our method in detail. In this table, A, B, C, D, and E correspond to the classes of Y_DATA in Table 2, in which A is "1, 0, 0, 0, 0", B is "0, 1, 0, 0, 0", and so on. We excluded the prediction case of "0, 0, 0, 0, 0", since this case could be further addressed in future work. Table 5 shows the precision, recall, and f-measure values of each class. The average f-measure is over 80%, a relatively higher performance than previous studies [11][12][13][14][15], even though we did not use acoustic features. The five classes did not show distinct differences in prediction performance. We did not consider which words could be candidates for segmentation, because that process requires additional computation and resources. Unlike earlier works, we did not use any manual annotation. Given those considerations, the performance shows much potential. Figure 5 shows the cross validation and training data cost over the learning epochs. By checking the cross validation, it is possible to determine the optimal number of epochs and to check for overfitting or underfitting. In this graph, after 1000 repetitions, the CV curve and the training curve differ only slightly. Therefore, it can be concluded that further iterations of learning provide no significant benefit.
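The per-class scores in Table 5 follow from the confusion matrix in the usual way; a small sketch with an illustrative matrix (the numbers below are invented for illustration, not the paper's data):

```python
def prf(confusion):
    """Per-class precision, recall, and F-measure from a square confusion
    matrix where rows are true classes and columns are predicted classes."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[r][k] for r in range(n)) - tp  # predicted k, wrongly
        fn = sum(confusion[k]) - tp                       # true k, missed
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores.append((p, r, f))
    return scores

# toy 3-class confusion matrix (illustrative only)
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 1, 9]]
for p, r, f in prf(cm):
    print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

The average of the F column over the five classes A-E gives the average f-measure reported above.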

Conclusions
In this paper, we presented a method, based on Word2Vec and LSTM, to find the proper positions of period marks in YouTube subtitles, which can help improve the accuracy of automatic translation. We cut the data so as to make them similar to YouTube subtitles and predicted whether a period can occur between each pair of words. In the experiment, the accuracy of the approach was measured at 70.84%.
In future work, we will apply other neural network models, such as CNNs, attention, and BERT, to larger data sets to improve the accuracy. To do this, we need to collect subtitle data from a wider range of online education sites. Second, we want to collect more YouTube subtitle data in various areas and translate them into Korean to build a parallel corpus. We will develop the software tools needed to build the corpus. Finally, we will develop a speech recognition tool based on the open API to create the actual YouTube subtitle files. With those steps, we will be able to build the complete process from YouTube voice to Korean subtitle creation.

Figure 1. Examples of captions generated automatically in YouTube.


Figure 3. Schematic of the program that learns and generates periods using Word2Vec and LSTM.

These three gates determine what existing information in the cell will be discarded, whether to store new information, and what output value to generate.

Figure 5. Graph of training cost and cross validation (CV) cost over learning epochs.


Table 2. Examples of preprocessed data.

Table 3. Hyperparameters and resulting performance.


Table 5. Precision, recall, and f-measure values of each class.


Table 6 shows the predicted test data based on the learned model. The A data show correct predictions and the B data show wrong predictions.

Table 6. Examples of correct prediction and wrong prediction.