Summarization of Spanish Talk Shows with Siamese Hierarchical Attention Networks

In this paper, we present an approach to the summarization of Spanish talk shows. Our approach is based on the use of Siamese Neural Networks on the transcriptions of the show audio. Specifically, we propose to use Hierarchical Attention Networks to select the most relevant sentences for each speaker about a given topic in the show, in order to summarize the speaker's opinion about the topic. We train these networks in a siamese way to determine whether a summary is appropriate or not. A previous evaluation of this approach on the summarization of English newspapers achieved performance similar to other state-of-the-art systems. In the absence of enough transcribed or recognized speech data to train our system for talk show summarization in Spanish, we acquire a large corpus of document-summary pairs from Spanish newspapers and use it to train our system. We choose the newspaper domain because its topics are highly similar to those addressed in talk shows. A preliminary evaluation of our summarization system on Spanish TV programs shows the adequacy of the proposal.


Introduction
Nowadays, the development of automatic summarization systems is an important issue due to the great amount of information in different formats that is accessible on the web or in other repositories. Some summarization systems are based on unsupervised learning approaches, such as those that consider statistical features of words [1], topic modeling such as Latent Semantic Analysis [2], or graph-based approaches such as LexRank [3,4] (for a more exhaustive review see [5,6]). There are also systems based on supervised learning techniques such as Conditional Random Fields [7], Support Vector Machines [8] and Neural Networks [9][10][11][12][13][14][15].
Although most works focus on applying automatic summarization techniques to collections of purely textual documents, summarization systems are not limited to text input. Other works address the problem of adapting these techniques to audio recordings as input, typically broadcast news, lectures or meetings [16][17][18]. These systems have to deal with specific problems derived from errors introduced in the speech recognition phase, such as misrecognized words and errors in punctuation marks.
Just as the volume of textual documents available on the web has grown dramatically in recent years, so has the volume of TV program collections. Television channels make the programs they produce available to the public, which generates large collections of videos. Access to these collections is valuable for viewers who could not follow the live broadcast. Therefore, in addition to an adequate information retrieval system for searching the program collections, an automatic summarization system applied to these programs will help users find information. This is especially interesting for talk shows, which consist of several speakers giving opinions on various topics introduced by the program's presenter.
The application of supervised methods to automatic summarization, such as those based on Neural Networks, requires adequate corpora consisting of document-summary pairs. Building large, high-quality corpora for this purpose is not an easy task, because it takes a great human effort to write thousands of manual summaries or to design new approaches that obtain these summaries in a semiautomatic way. The first important resource for learning corpus-based summarization models is the CNN/DailyMail summarization corpus, originally constructed in Reference [19] for the task of passage-based question answering and later adapted to the document summarization task [9,10]. It consists of news stories from CNN and DailyMail and contains 312,077 document-summary pairs.
Appropriate corpora have been developed for English, but this is not the case for other languages such as Spanish. With the aim of building a corpus for Spanish, this work follows a strategy similar to the one proposed in Reference [20] for the construction of the NEWSROOM corpus. Reference [20] takes advantage of the highlights or summaries provided by authors or editors in the newsroom in order to obtain the summaries. A crawling process on the newspaper websites extracts articles and summaries in a straightforward way. The NEWSROOM corpus contains news, sports, entertainment, financial and other kinds of publications from several English newspapers.
In this work, we have built a corpus of Spanish newspapers, the ES-NEWS corpus. It consists of a set of 277,675 article-summary pairs extracted from 11 different Spanish newspapers. The ES-NEWS corpus contains articles and summaries of news, sports, politics, culture and other topics. The use of this corpus in this work is two-fold: on the one hand, we evaluate our summarization system [15] with it in order to study the transferability of our system from English to Spanish; on the other hand, we use it to train the system that we apply to the summarization of Spanish talk shows.
Another contribution of this work is the study of the transferability of our summarization system to the domain of talk shows in Spanish. In this case, the documents are fragments of audio transcriptions of TV programs and the summaries consist of sentences written manually. Since we do not have a sufficiently large corpus of TV programs to train our summarization system for the talk shows, we use the ES-NEWS corpus. It should be noted that the characteristics of the corpus used for training are different from those of the TV programs. In the case of newspaper articles, there are some sentences that usually appear at the beginning of the article, that contain the main ideas or the relevant information underlying the article. However, the TV programs are conversations where the ideas and information are more scattered in the speaker turns. Even so, both of them expose content of similar topics.
In a previous work [15], we proposed the SHA-NN system, a supervised approach to text summarization based on Siamese Hierarchical Attention Neural Networks that uses distributed vector representations of words. Siamese Neural Networks are capable of learning from positive and negative samples. During the training phase, we provide the network with positive and negative document-summary pairs: a positive pair is a document and its summary, and a negative pair is a document and the summary of a different document randomly drawn from the training set. Furthermore, the model is enriched with an attention mechanism that provides the final score associated with each sentence of the input document, which allows us to rank the sentences and select the most salient ones to build the summary. Experiments on the CNN/DailyMail corpus confirmed the good behaviour of our approach to the summarization problem.
In this work, we address the application of the SHA-NN system to summarize TV programs, in particular, Spanish TV talk shows. First, in order to train our summarization system, the ES-NEWS corpus was built. Second, the SHA-NN system was evaluated on this text corpus. Third, a test corpus has been built with talk shows, the LN24-SUMM corpus. Finally, the summarization system trained with the ES-NEWS corpus has been applied to the LN24-SUMM test corpus. We present a preliminary evaluation of our summarization system on the transcribed speech of the LN24-SUMM test corpus. Despite the different characteristics between the two corpora, the results of transferability between domains are promising.

Corpora
In this section we present the characteristics of the two corpora built for this summarization work, one of them for training purposes (ES-NEWS) and the other one for evaluation (LN24-SUMM).

ES-NEWS Corpus
We have built a corpus of written Spanish article-summary pairs, the ES-NEWS corpus. It consists of a set of 277,675 article-summary pairs extracted from 11 different Spanish online newspapers. ES-NEWS contains articles and summaries of news, sports, politics, culture and other subjects. We used this corpus for two purposes: to evaluate the transferability of the SHA-NN summarization system from English to Spanish, and to train the system that we apply to the summarization of Spanish talk shows.
The ES-NEWS corpus is composed of newspaper articles extracted from around 1 million URLs, which were collected during the last week of June 2018. To ensure a diversity of summarization styles, the websites of 11 relevant Spanish newspapers were used: Elconfidencial, EconomiaDigital, HuffingtonPost, ABC, FormulaTV, EldiarioCantabria, Publico, Vozpopuli, Rioja2, PeriodistaDigital and Eldiario. We excluded newspapers that do not include highlights, such as ElMundo; newspapers that only provide a single short highlight per article, such as ElPais; and newspapers whose crawled content is not articles but other web content (advertising, keywords, etc.), such as EuropaPress.
Once all the URLs were crawled, following Reference [20], we used the og:description field to extract the highlights, which were concatenated to form the summaries. We preprocessed the data to remove noise such as duplicated URLs, empty summaries and articles, and non-journalistic articles. All the text was lowercased and tokenized using the Spanish version of Stanford CoreNLP.
The corpus consists of 277,675 article-summary pairs. From it, training, development and test partitions were defined, following proportions similar to those of the CNN/DailyMail corpus [11] (90%, 5.5% and 4.5%, respectively). This results in a training set of 249,919 pairs, a development set of 15,266 pairs and a test set of 12,490 pairs. Some ES-NEWS corpus statistics are shown in Table 1. To compare the ES-NEWS and NEWSROOM corpora, we used the Extractive Fragment Coverage, Extractive Fragment Density and Compression Ratio measures. These metrics, proposed in Reference [20], measure the overlap between summaries and articles in order to analyze the diversity of summarization styles. The metrics are defined in Equations (1)-(3), where A is the sequence of words of the article, S is the sequence of words of the summary, F is the set of fragments (word sequences) shared by A and S, computed using the greedy algorithm proposed in Reference [20], and | · | denotes the length, in words, of a sequence.
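For reference, the three measures can be written as below. This follows the definitions given for the NEWSROOM corpus [20]; writing the fragment set as F(A, S) and the measure names in capitals are notational assumptions, and the equations are listed in the order in which the measures are named above.

```latex
\begin{align*}
\mathrm{COVERAGE}(A, S) &= \frac{1}{|S|} \sum_{f \in F(A, S)} |f| \\
\mathrm{DENSITY}(A, S) &= \frac{1}{|S|} \sum_{f \in F(A, S)} |f|^{2} \\
\mathrm{COMPRESSION}(A, S) &= \frac{|A|}{|S|}
\end{align*}
```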
The Extractive Fragment Coverage is computed as the sum of the lengths of all the common fragments between the article and its summary divided by the size of the summary. Thus, the greater its value, the more common fragments or larger common fragments have been found between the article and its summary.
The Extractive Fragment Density is a measure similar to the Extractive Fragment Coverage, but it uses the square of the common fragment lengths. Therefore, for the same number of common words, summaries with longer common fragments obtain higher values than those with more but shorter common fragments. Figure 1 shows the density and coverage distributions along with the compression ratio for each newspaper in the ES-NEWS corpus. Each box is a normalized bivariate density plot of the Extractive Fragment Coverage (x-axis) and Extractive Fragment Density (y-axis) of a newspaper. The distribution for the full ES-NEWS corpus is shown in the bottom-right box. As Figure 1 shows, the distribution of Extractive Fragment Density and Extractive Fragment Coverage of ES-NEWS is led by ElDiario (coverage between 0.6 and 0.8 with a high density), because that newspaper contributes the largest number of articles to the corpus. Despite this, the mean coverages and densities of the newspapers generally show that the introduction of new words in summaries and the use of long extractive fragments are moderate, although both are higher than in the NEWSROOM corpus [20].
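The measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce Table 1 or Figure 1: the function names are hypothetical, the inputs are assumed to be already lowercased and tokenized, and the fragment search is a simplified variant of the greedy procedure of Reference [20] that keeps, for each summary position, the longest run of tokens also found in the article.

```python
from typing import List, Sequence


def greedy_fragments(article: Sequence[str], summary: Sequence[str]) -> List[List[str]]:
    """Collect shared token runs by scanning the summary from left to right."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest article match starting at summary[i]
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(list(summary[i:i + best]))
        i += max(best, 1)  # skip past the fragment, or past one unmatched token
    return fragments


def coverage(article: Sequence[str], summary: Sequence[str]) -> float:
    """Fraction of summary words that belong to a shared fragment."""
    return sum(len(f) for f in greedy_fragments(article, summary)) / len(summary)


def density(article: Sequence[str], summary: Sequence[str]) -> float:
    """Like coverage, but each fragment contributes its squared length."""
    return sum(len(f) ** 2 for f in greedy_fragments(article, summary)) / len(summary)


def compression(article: Sequence[str], summary: Sequence[str]) -> float:
    """Ratio between article length and summary length, in words."""
    return len(article) / len(summary)
```

For the toy article "the cat sat on the mat" and summary "the cat on the mat", the shared fragments are "the cat" and "on the mat", so the coverage is 5/5, the density is (4 + 9)/5 and the compression ratio is 6/5.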

LN24-SUMM Corpus
We have built a corpus for the summarization of Spanish talk shows, the LN24-SUMM corpus. It consists of a set of 30 document-summary pairs. The documents are extracted from 5 talk shows of "La Noche en 24 horas", a program of Spanish television (TVE). They were obtained from the transcriptions of these TV programs, which were manually segmented first by the Twitter hashtags appearing in the program videos and then by the interventions of the different speakers. This segmentation was made with the aim of summarizing the opinion of a speaker about a given topic (hashtag). Four members of the research group wrote the reference summaries, following a common strategy based on paraphrasing the most representative sentences of each document. Consequently, the generated summaries, although abstractive, have very high Extractive Fragment Coverage and Density values, as can be seen in Table 2, which also shows some additional LN24-SUMM statistics.
It is interesting to see that the Mean Compression Ratio is lower in LN24-SUMM than in the ES-NEWS corpus. The LN24-SUMM corpus also has a very high mean Extractive Fragment Density and Coverage in comparison to the ES-NEWS corpus. These differences could arise because the newspaper summaries are written by many different journalists, who are skilled at compressing information and produce more diverse kinds of summaries. In addition, extracting headlines from newspaper articles is simpler than from speaker interventions in talk shows. In newspaper articles there are some sentences, mainly at the beginning of the article, that contain the main ideas or information of the article. Talk shows, however, are conversations where the relevant information is more scattered across the speaker turns and which exhibit spontaneous speech phenomena. Therefore, they are more difficult to summarize. Additionally, the LN24-SUMM documents, that is, the transcribed and manually segmented talk shows, are very heterogeneous. Two examples that illustrate the differences between the two corpora are shown in Figure 2. In this figure it is possible to see that the LN24-SUMM summaries are composed of more scattered sentences than the summaries of ES-NEWS.
Reference Summary (ES-NEWS): One of the best scientific minds in the world suffered from amyotrophic lateral sclerosis and lived 53 years longer than the doctors diagnosed him (1/13). "Their courage and persistence with their brilliance and their humor inspired people all over the world," their children say in a statement (4/13).

Reference Summary (LN24-SUMM): More than 72 h have passed since the attack on Friday (3/28). The city suffered a heinous attack on several fronts, leisure centers, in football and in a concert hall (11/28). Terrorists have attacked our way of life, even children are accompanied by security forces (16/28). Francois Hollande is going to meet in Paris with John Kerry (18/28). The five terrorists have been identified, four were French and one would have a Syrian passport (23/28).

System Description
The SHA-NN system [15] is based on Hierarchical Attention Networks (HAN) [21] trained in a Siamese way, where the left branch extracts representations for whole documents and the right branch extracts representations for summaries. HAN allows us to extract a vector representation for documents and summaries from the representations of their sentences. In turn, the representation of each sentence is obtained from the representations of its words. These representations are trained to address a binary classification task that consists of determining whether a summary S is correct for a document D. This acts as an intermediate task in order to extract the most relevant sentences of the document to make a summary. Figure 3 shows an outline of the system architecture. As input, the SHA-NN system uses $d_e$-dimensional skip-gram word embeddings, trained on the training set of the ES-NEWS corpus, to represent both documents, $D \in \mathbb{R}^{T \times W \times d_e}$, and summaries, $S \in \mathbb{R}^{Q \times V \times d_e}$, where $T$ and $Q$ are the maximum number of sentences for document and summary and $W$ and $V$ are the maximum number of words per sentence for document and summary. These representations are used as input for the two Hierarchical Attention Networks (HAN$_1$ and HAN$_2$), whose BLSTM layers are shared both at sentence level (BLSTM$_2$, with dimensionality $d_s = 512$) and at word level (BLSTM$_1$, with dimensionality $d_w = 512$). However, the attention mechanisms at both levels are not shared.
From these inputs, $H \in \mathbb{R}^{T \times d_w}$ and $G \in \mathbb{R}^{Q \times d_w}$ are computed, following Equations (4) and (6), as proposed in [22]. They are the outputs of the word-level $d_w$-dimensional BLSTM$_1$ with attention, where each row $i$ is computed as the average of the hidden vectors of sentence $i$ weighted by the attention $\alpha \in \mathbb{R}^{T \times W}$ (Equation (5)) for the document and $\beta \in \mathbb{R}^{Q \times V}$ (Equation (7)) for the summary.
In these equations, $W_u \in \mathbb{R}^{d_w}$ and $W_v \in \mathbb{R}^{d_w}$ are the weights of the word-level attention mechanism for document and summary, respectively. Once $H$ and $G$ are computed, $r \in \mathbb{R}^{d_s}$ and $p \in \mathbb{R}^{d_s}$ can be obtained, following Equations (8) and (10), similarly to the word level but using BLSTM$_2$ and the attentions $\hat{\alpha} \in \mathbb{R}^{T}$ and $\hat{\beta} \in \mathbb{R}^{Q}$ for document and summary, respectively.
In these equations, $W_{\hat{u}} \in \mathbb{R}^{d_s}$ and $W_{\hat{v}} \in \mathbb{R}^{d_s}$ are the weights of the sentence-level attention mechanism for document and summary, respectively. The vector representations $r$ and $p$ capture bidirectional relationships among the sentence representations, which are obtained from the representations of their words. They can then be used to distinguish correct summaries for documents, which forces the attention mechanisms to focus on the most similar sentences of both. To do this, the vector representation of the document $r$, that of the summary $p$ and the absolute difference between them $|r - p|$ are concatenated and fed to a fully-connected output layer with a softmax activation function [23], as defined in Equation (12).
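The attention pooling and the output layer described above can be illustrated with a small pure-Python sketch. It is an approximation under stated assumptions, not the SHA-NN implementation: the BLSTM layers are omitted (the hidden vectors are taken as given), the scoring function, a dot product between an attention weight vector and the tanh of each hidden vector, is an assumed form, and the names attention_pool, classify, W and b are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden: List[Vector], weights: Vector) -> Tuple[Vector, Vector]:
    """Return the attention distribution and the attention-weighted average.

    Assumed scoring: score_k = weights . tanh(hidden_k), normalized by softmax.
    """
    scores = [sum(w * math.tanh(x) for w, x in zip(weights, h)) for h in hidden]
    attention = softmax(scores)
    size = len(hidden[0])
    pooled = [sum(a * h[d] for a, h in zip(attention, hidden)) for d in range(size)]
    return attention, pooled


def classify(r: Vector, p: Vector, W: List[Vector], b: Vector) -> Vector:
    """Softmax layer over the concatenation [r; p; |r - p|].

    W holds one row of weights per class and b one bias per class (both hypothetical).
    """
    features = r + p + [abs(x - y) for x, y in zip(r, p)]
    logits = [sum(w * f for w, f in zip(row, features)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)
```

The same attention_pool function would be applied twice per branch: once over the word-level hidden vectors of each sentence, and once over the resulting sentence-level hidden vectors, with separate weight vectors for the document and the summary.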
In order to train the model, for each document we built positive pairs $(D_j, S_j)$, provided by the ES-NEWS corpus, and negative pairs $(D_j, S_k)$ with $j \neq k$, where $S_k$ is chosen randomly from the summaries of the remaining documents. For the positive pairs, the ground truth was $y_i = 1$, whereas for the negative pairs, the ground truth was $y_i = 0$. In this work, we used batches of 64 document-summary pairs (32 positive pairs and 32 negative pairs). The training objective was to minimize the cross-entropy between the prediction $\hat{y}$ and the class $y$. The model was trained for 20 epochs of 1500 batches on a Titan X GPU, which took 3 h.
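The construction of training batches described above can be sketched as follows. This is a minimal sketch under assumptions: make_batch and binary_cross_entropy are hypothetical names, the corpus is assumed to be a list of (document, summary) tuples, and sampling the documents of a batch without replacement is a choice made here rather than a detail given in the text.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def make_batch(pairs: Sequence[Tuple[str, str]], batch_size: int = 64,
               rng: Optional[random.Random] = None) -> List[Tuple[str, str, int]]:
    """Build a batch with one positive and one negative pair per sampled document."""
    rng = rng or random.Random()
    batch = []
    for j in rng.sample(range(len(pairs)), batch_size // 2):
        document, summary = pairs[j]
        batch.append((document, summary, 1))  # positive pair (D_j, S_j)
        # Draw k uniformly from the remaining documents, so that k != j.
        k = rng.randrange(len(pairs) - 1)
        if k >= j:
            k += 1
        batch.append((document, pairs[k][1], 0))  # negative pair (D_j, S_k)
    return batch


def binary_cross_entropy(labels: Sequence[int], predictions: Sequence[float]) -> float:
    """Mean cross-entropy between labels and predicted positive-class probabilities."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(labels, predictions)]
    return sum(losses) / len(losses)
```

With the default batch_size of 64 this yields 32 positive and 32 negative pairs per batch, matching the setting reported above.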
To carry out document summarization with SHA-NN, once the network has been trained to distinguish correct summaries for documents, the sentence-level attention mechanism can be used to rank the sentences and then select the most relevant ones based on this ranking. That is, for the summarization process, given a document $D$, a forward pass is performed on the left branch of the siamese network (HAN$_1$ in Figure 3) to obtain the attention score $\hat{\alpha}_j$ of each document sentence. From the ranking of the document sentences based on those scores, the $k = 3$ sentences with the highest attention scores are selected to build the summary.
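The selection step can be illustrated with a short sketch. It assumes the sentence-level attention scores have already been computed by a forward pass; select_sentences is a hypothetical name, and returning the selected sentences in document order is an assumption, since the text does not state how they are ordered in the summary.

```python
from typing import List, Sequence


def select_sentences(sentences: Sequence[str], attention: Sequence[float],
                     k: int = 3) -> List[str]:
    """Pick the k sentences with the highest attention score."""
    ranked = sorted(range(len(sentences)), key=lambda i: attention[i], reverse=True)
    top = sorted(ranked[:k])  # assumption: restore document order in the summary
    return [sentences[i] for i in top]
```

For example, with five sentences and attention scores 0.10, 0.40, 0.05, 0.30 and 0.15, the second, fourth and fifth sentences are selected.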

Related Work
Many Neural Network based approaches for text summarization have been proposed in the last few years. Most of them are based on encoder-decoder architectures. Generally, in these approaches, the encoder reads the source sequence as a list of continuous-space representations from which the decoder generates the target sequence. Some of these approaches also incorporate attention mechanisms. In particular, Reference [9] proposed an attentional encoder-decoder approach for extractive single-document summarization, and Reference [11] presented an extractive summarization approach based on sentence classification using Neural Networks, which required a previous adaptation of the corpus based on ROUGE. Both works used the CNN/DailyMail corpus, since its large size makes it attractive for training deep Neural Networks. Other Neural Network based works try to improve the behavior of the models by incorporating more information. This is the case of Reference [24], which jointly learns the attention mechanism, used to score the sentences, and the selection mechanism, used to extract the most salient sentences. One recent trend consists of addressing the summarization problem as a sentence ranking task by means of Reinforcement Learning. This is typically done with the aim of optimizing discrete metrics that are not differentiable, such as the ROUGE evaluation metric [25,26].
There have been some attempts to address the problem of speech summarization. Most of them are based on techniques successfully used for text summarization, which are directly applied to the output of the speech recognition process. Reference [16] presented an approach to this problem where salient sentences or segments are extracted using only the textual information, and the final summaries are generated by means of concatenation and reordering mechanisms. Experiments were performed on monologues such as lectures, presentations and news commentaries. Reference [17] presented an extractive summarization approach that is also based on the textual representation of the audio. Salient sentences are extracted based on Language Model measures. Experiments were performed on the Mandarin broadcast news corpus MATBN [27], which was manually segmented and transcribed for evaluation purposes. It must be noted that this kind of corpus has a more regular structure than other speech programs such as interviews or debates. Reference [18] also addresses the speech summarization problem but, in this case, by using Convolutional Neural Networks for sentence selection. The system includes two convolutional networks, one working on the document and the other working on a sentence of the same document. For each document sentence, the system learns a score that represents its probability of belonging to the summary. The MATBN corpus is also used for evaluation purposes.
The SHA-NN system [15] addresses a binary classification problem in order to select the most relevant sentences by means of the attention mechanisms. Unlike some of the Neural Network based works mentioned above, this system does not require a previous preparation of the corpus [11], since the system itself learns the alignment between document and summary. Moreover, our system addresses the problem as a binary classification task that distinguishes correct summaries for documents, instead of performing sentence classification to score the document sentences [9,11,18].

Experiments
We carried out two different experiments: first, we trained and evaluated the SHA-NN system with the ES-NEWS corpus and second, we evaluated the trained system with the LN24-SUMM test corpus.

We compared SHA-NN with the following summarization systems:

• Lead is a very popular and robust strategy to generate snippets and summaries of newspaper articles that consists of extracting the first k sentences of the document. This strategy is typically used as a baseline in the automatic summarization of newspaper articles, since in the writing style of this type of document the most relevant information is usually condensed in the first paragraphs to capture the attention of the reader.

• LexRank is an unsupervised, graph-based summary generation system inspired by both PageRank and HITS. It is based on the idea that the relevance of a sentence depends on its similarity with the rest of the sentences in the text. The nodes of the graph are the document sentences and the edges measure the similarity between two sentences using an idf-based cosine similarity. Two sentences are connected if the cosine similarity between them is greater than a certain threshold. If a sentence is similar to many others, it is considered salient in the document, and the summary is built from the most salient sentences.

• TextRank, like LexRank, is an unsupervised graph-based system inspired by PageRank. It uses a variation of PageRank to extract the most salient sentences of the document. Its most significant difference from LexRank is the way in which the weights of the edges are calculated. In this case, the edges measure the similarity between nodes based on the number of words that the sentences have in common.

• LSA is a method based on Singular Value Decomposition, where a word-sentence matrix is decomposed into three new matrices. One of these matrices represents the association of underlying topics with sentences. This matrix is used to select the most salient sentences.

• SumBasic exploits frequency-related properties of the words to compose summaries, arguing that high-frequency words in the documents are very likely to appear in human-generated summaries. It is a greedy search approximation: first, the probability distribution of the words is computed; second, these probabilities are used to assign a weight to each document sentence; finally, the best-scoring sentence is added to the summary, and this is repeated until the desired summary length has been reached.
In all the experimentation, we used the implementation of these systems provided by the Python sumy library (https://github.com/miso-belica/sumy) using the default configuration. All these systems extract 3 sentences in order to compose the summary.
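To make two of the baselines above concrete, the following sketch shows Lead and a simplified SumBasic in plain Python. It is not the sumy implementation used in the experiments: the function names are hypothetical, sentences are assumed to be pre-tokenized lists of words, the squaring update of word probabilities comes from the original SumBasic formulation rather than from the description above, and returning the selected sentences in document order is an assumption.

```python
from collections import Counter
from typing import List, Sequence

Sentence = List[str]


def lead(sentences: Sequence[Sentence], k: int = 3) -> List[Sentence]:
    """Lead baseline: the first k sentences of the document."""
    return list(sentences[:k])


def sum_basic(sentences: Sequence[Sentence], k: int = 3) -> List[Sentence]:
    """Simplified SumBasic: greedily pick sentences with high average word probability."""
    words = [w for s in sentences for w in s]
    prob = {w: c / len(words) for w, c in Counter(words).items()}
    candidates = [i for i, s in enumerate(sentences) if s]
    chosen = []
    while candidates and len(chosen) < k:
        # Weight of a sentence: average probability of its words.
        best = max(candidates,
                   key=lambda i: sum(prob[w] for w in sentences[i]) / len(sentences[i]))
        chosen.append(best)
        candidates.remove(best)
        # Down-weight words already covered, to discourage redundant picks.
        for w in set(sentences[best]):
            prob[w] **= 2
    return [sentences[i] for i in sorted(chosen)]
```

With k = 3, both functions produce summaries of the same length as the systems compared in the experiments.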
The performance of the systems was evaluated by using variants of the ROUGE measure [30], specifically Rouge-N with unigrams and bigrams (Rouge-1 and Rouge-2) and Rouge-L. Furthermore, the compression ratio of the generated summaries (Compression) was also analyzed. In order to compute the confidence intervals, we used the Bootstrap Confidence Intervals [31] approach. First, from the set of hypotheses provided by the system that we want to evaluate, we generated up to 1000 resamples by sampling with replacement from this original set of hypotheses. Each resample had the same size as the original set. Next, the value of the evaluation measure was calculated for each of the resamples. Finally, we computed the 95% confidence interval using the bootstrap distribution.
Table 3 shows the results of our system compared to other summarization systems using the test set of the ES-NEWS corpus. All the results in this experimentation are statistically significant. The Rouge-2 results of all the systems in Table 3 on the ES-NEWS corpus (in Spanish) are in line with those on the English CNN/DailyMail corpus [15]; for instance, SHA-NN and Lead achieved 14.7 and 15.1, respectively, on the CNN/DailyMail corpus. However, in terms of Rouge-1 and Rouge-L, all the systems show a slight decrease in performance on the ES-NEWS corpus with respect to the CNN/DailyMail corpus; for instance, SHA-NN and Lead achieved a Rouge-1 of 35.4 and 37.3, respectively, on the CNN/DailyMail corpus. This illustrates small differences between the two corpora that affected the results of all the compared systems. Regarding the SHA-NN system, the results show a good transferability between languages, as we hypothesized.
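The evaluation procedure described above can be sketched as follows. This is an illustration under assumptions, not the evaluation script used for Tables 3 and 4: rouge_n and bootstrap_ci are hypothetical names, Rouge-N is computed here as an F1 score over n-gram overlap (the text does not say which variant is reported), the resampled statistic is the mean score, and the interval is taken with the percentile method.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(hypothesis: Sequence[str], reference: Sequence[str], n: int = 1) -> float:
    """Rouge-N as the F1 score of clipped n-gram overlap (assumed variant)."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def bootstrap_ci(scores: List[float], resamples: int = 1000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-document scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Each resample has the same size as the original set, drawn with replacement.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(resamples))
    low = means[round((1 - level) / 2 * (resamples - 1))]
    high = means[round((1 + level) / 2 * (resamples - 1))]
    return low, high
```

A typical use would compute rouge_n for every hypothesis-reference pair of the test set and pass the resulting list of scores to bootstrap_ci.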
We then evaluated the summarization system trained in the above experimentation on the LN24-SUMM test corpus, which contains 30 document-summary pairs (a small test set compared to the training set). Table 4 shows the results of the SHA-NN system compared to other summarization systems. This table shows that the results of our summarization system are better than those of the other systems for all ROUGE variants, although the small size of the test set does not allow us to obtain statistically significant results. It should be noted that on the ES-NEWS corpus, the Lead system, which consists of extracting the first 3 sentences of the article as a summary, outperforms the rest of the systems, including ours, as Table 3 shows. However, this system performed worse on the LN24-SUMM corpus. This is due to the fact that in the ES-NEWS corpus, unlike the LN24-SUMM corpus, the summary tends to be a very close version of the first sentences of the article. It is also interesting that, although SHA-NN was trained under the bias towards the first sentences (ES-NEWS), it is capable of generalizing when the relevant sentences are more scattered in the document (LN24-SUMM).
In relation to the transferability between domains, it is possible to see that all the results, in terms of ROUGE, obtained on the LN24-SUMM corpus are higher than those obtained on the ES-NEWS corpus. This is because the reference summaries of the LN24-SUMM corpus have a very high density (i.e., they are composed of long extractive fragments of the transcribed talk shows) and a very low compression ratio in comparison to the ES-NEWS corpus, as can be seen in Tables 1 and 2. Furthermore, it is interesting to see in Table 4 that, in general, when the compression ratio of the generated summaries increases, the results in terms of ROUGE decrease. Our system provides the best trade-off between ROUGE and compression among all systems. Moreover, although TextRank obtains the results most similar to those of SHA-NN, it suffers from a very low compression ratio because it tends to extract the longest sentences.

Conclusions
We have studied the transferability of the SHA-NN summarization system, which is based on Siamese Hierarchical Attention Neural Networks, between languages and between application domains. Regarding the languages, the results of our system on the ES-NEWS corpus, in Spanish, are in line with those on the CNN/DailyMail corpus, in English. Regarding the application domains, we trained our summarization system on the ES-NEWS corpus, a text corpus of newspaper articles, and we applied it to the summarization of transcribed speech from talk shows. The experimental results confirm the good behaviour of our proposal. We presented experiments on transcribed speech and, as future work, we will address its application to recognized speech. We will also study how to extend our proposal to abstractive summarization based on the weights provided by SHA-NN, in order to reduce the impact of recognition errors on the generation of summaries.