BART-IT: An Efficient Sequence-to-Sequence Model for Italian Text Summarization

Abstract: The emergence of attention-based architectures has led to significant improvements in the performance of neural sequence-to-sequence models for text summarization. Although these models have proved to be effective in summarizing English-written documents, their portability to other languages is limited, thus leaving plenty of room for improvement. In this paper, we present BART-IT, a sequence-to-sequence model based on the BART architecture that is specifically tailored to the Italian language. The model is pre-trained on a large corpus of Italian-written pieces of text to learn language-specific features and then fine-tuned on several benchmark datasets established for abstractive summarization. The experimental results show that BART-IT outperforms other state-of-the-art models in terms of ROUGE scores in spite of a significantly smaller number of parameters. The use of BART-IT can foster the development of interesting NLP applications for the Italian language. Beyond releasing the model to the research community to foster further research and applications, we also discuss the ethical implications behind the use of abstractive summarization models.


Introduction
Automatic Text Summarization (ATS) is a natural language processing task that consists of creating a shorter version of a text document, which is coherent and maintains the most relevant information of the original text [1]. Text summarization techniques are usually divided into two main categories: extractive and abstractive summarization. In extractive summarization, the summary is created by selecting a subset of the original text and thus the output text is composed of sentences that are present in the original document. Abstractive summarization, on the other hand, generates the output summary, which does not necessarily include sentences in the original document. Abstractive summarization is usually considered more challenging than extractive summarization as it entails creating new text containing coherent and summarized content. Due to the inherent complexity of the abstractive summarization task, it is often applied on top of an extractive summarization process to refine the previously selected content [1].
In recent years, ATS has received a lot of attention as it can be applied to a wide range of applications such as the extraction of highlights from scientific papers [2], the generation of summaries of news articles [3], and the creation of multimodal summaries of audio podcasts [4]. The summarization task can be also instrumental for other NLP tasks, e.g., by reducing the size of large documents to make them more suitable for downstream tasks [5].
Transformer-based architectures have been shown to be effective in modeling long-range dependencies and have already been applied to several NLP tasks such as machine translation [6] and question answering [7]. BART [8] is a sequence-to-sequence model based on the Transformer architecture, which is trained using a denoising objective to learn effective representations of the input text that can be used for a wide range of downstream tasks. The model takes as input a sequence of tokens and generates a sequence of tokens as output. It can be exploited to solve several tasks including machine translation, abstractive question answering, and text summarization. The effectiveness of BART has been demonstrated both in English and in multilingual scenarios. However, limited efforts have been devoted to exploring its potential in language-specific, non-English scenarios.
In this paper, we present BART-IT, a sequence-to-sequence model based on the BART architecture that is specifically tailored to the Italian language. To learn effective language representations, BART-IT is first trained on a large corpus of Italian documents and then fine-tuned on several datasets for text summarization. To evaluate BART-IT, we consider both single-language and multilingual models that have previously been applied to the Italian language [9][10][11]. The experimental results show that the BART-IT summarization performance is superior to that achieved by models with a comparable (or even higher) number of parameters. Performance comparable to that of larger multilingual models makes BART-IT competitive, especially in resource-constrained scenarios [12].
To foster further research in the field of sequence-to-sequence models for the Italian language, we release the pre-trained and fine-tuned BART-IT models as well as the code used to train and evaluate BART-IT (details on the public repository are given in the Data Availability Statement). Since the uninformed use of deep learning models for text abstraction is potentially subject to bias and unfairness, in this paper we also discuss the ethical implications of using deep learning models for generating text summaries.
The main contributions of this work are the release of a new Italian language model, which can be used to solve a range of NLP tasks, its fine-tuning for abstractive text summarization, and its testing on benchmark data. More specifically,

• We present BART-IT, a sequence-to-sequence BART-based model specifically designed for the Italian language (see Section 3);
• We release the model pre-trained on a large Italian corpus. It can be used to solve various NLP tasks on Italian documents (see the Data Availability Statement);
• We fine-tune BART-IT for the abstractive text summarization task (see Section 4.3);
• We assess model performance on Italian benchmark data, showing that BART-IT achieves the best balance between efficiency and effectiveness compared to various baselines (see Figure 1);
• We discuss the ethical aspects behind the practical use of the BART-IT model (see Section 5).

Figure 1. [...] The last four bars represent the performance of the models in terms of summaries/second on a single NVIDIA A6000 GPU.

Related Works
Abstractive text summarization aims at generating fluent and coherent summaries that are able to accurately convey the main ideas of the original text. Unlike extractive text summarization [13], in which the summary is generated by selecting relevant sentences from the original text, abstractive summarization entails understanding the meaning of the raw input text and generating a new summary that is grammatically consistent and semantically coherent.
In recent years, deep learning methods have made significant progress, achieving state-of-the-art results in several NLP tasks, including text summarization [14,15]. The generalization capabilities of deep learning models and their ability to automatically extract high-level features from text data have allowed them to outperform traditional methods in many NLP tasks. In the field of abstractive text summarization, sequence-to-sequence models [16] have gained significant popularity. Thanks to their ability to encode the input text into internal representations, sequence-to-sequence models can be used to condition the generation process to produce a coherent summary.
Recurrent Neural Networks (RNNs) [17,18] have been exploited to solve sequence-to-sequence tasks since the early days of neural networks. These architectures have been shown to be effective in modeling sequential data and have been successfully applied to abstractive summarization [19,20]. However, the training of RNNs can become very slow, especially when coping with long sequences, because RNNs must be unfolded to model the temporal dependencies in the sequence. In addition, long-term dependencies can be modeled very poorly due to the vanishing gradient problem [21].
Transformer-based models [6] mitigate the aforesaid issues by using the self-attention mechanism, which allows the network to explicitly attend to all the words in the sequence, regardless of their position. These models not only substantially improve the quality of sequential data models but also gain efficiency thanks to parallel computation. Transformers have been shown to be particularly effective in addressing the abstractive summarization task [8,14] and are thus the backbone of state-of-the-art models.
Most transformer-based models for abstractive text summarization rely on the established encoder-decoder architecture. The encoder is a bidirectional transformer and is responsible for acquiring contextual information from the raw input text. The decoder is also a transformer and is trained to generate the summary by attending both to the encoder output and to the previously generated summary tokens. BART [8] is among the best-performing transformer-based abstractive summarization models. It is trained using a denoising autoencoder objective, where the input sequence is corrupted using different transformations, such as token permutations and token deletions. The training objective forces the model to learn to reconstruct the original sequence by attending to the contextual information and by generating the correct sequences.
Alternative transformer-based models for abstractive text summarization adopt similar architectures. For example, T5 [22] is a transformer-based model that is trained using multitask learning. The key idea is to train the model on multiple tasks at once, thus transferring useful knowledge from one task to another. This is particularly beneficial for achieving a high level of generality of the trained models, as they are exposed to different text types. PEGASUS [14] is a transformer-based model trained relying on the Gap Sentences Generation and Masked Language Modeling objectives. By learning to generate sequences and predict masked tokens, PEGASUS acquires multiple skills that are deemed as beneficial in the training phase.

Italian Language Modeling
Most of the existing transformer-based language models are available only for the English language. This strongly limits their portability to multilingual scenarios. To overcome this limitation, prior works have trained transformer-based models in languages other than English, such as French [23], Vietnamese [24], or Chinese [25]. Limited research efforts have been devoted to the Italian language, i.e., [9,26-28].
Refs. [26][27][28] are encoder-only models. They encode pieces of text according to an internal representation that can be used for discriminative tasks such as text classification. However, they cannot be applied to solve generative tasks such as abstractive text summarization. IT5 [9] is a sequence-to-sequence model. Similar to BART, it can be used to generate text conditioned on the input text. However, IT5 is trained using a multi-task learning objective, which offers a different training objective than BART and thus may show different performance on the text summarization task. In this work, we aim at developing a transformer-based model specifically designed for the Italian language and trained using the same training objective as BART [8]. We also carry out an empirical comparison between BART-IT and IT5 to analyze model performance on Italian documents.

BART-IT
BART-IT is a sequence-to-sequence transformer-based model based on BART [8]. It is specifically designed for the Italian language and exploits the same denoising objectives used for the training of BART.

Denoising Objectives
To learn effective representations for the summarization task, the original BART model employs multiple denoising objectives. The input text is first corrupted using different transformations and then the model is trained to reconstruct the original text. The corruption transformations used in BART are the following:
• Document rotation: the input document is first divided into sentences using full stops as separators. Then, one sentence is randomly selected as the first sentence and the remaining sentences follow the selected one using the original order of the input document.
• Sentence permutation: similar to the previous case, the input document is first divided into sentences using full stops as separators. Then, the sentences are randomly shuffled and the resulting sequence is used as input for the model. In this case, no sentence of the corrupted sequence is forced to follow the order of the original document.
• Token deletion: each token is randomly deleted (with a probability of 0.15) from the input sequence. The remaining tokens remain unchanged.
Table 1 shows examples of the different denoising objectives. Each corruption is applied to the original input sequence and the associated corrupted sequence is shown. The choice of the corruption transformations is random, and the model is trained to reconstruct the original sequence using the corrupted sequence as input. While each corruption is applied independently and only one corruption is applied at a time, during training the model is exposed to all possible corruptions of the input sequence, thus allowing it to learn to generalize on different noise patterns. The use of several corruptions having different granularities allows the model to better learn the different aspects of the input text, such as the token, sentence, and document structure.

Table 1. Examples of the different denoising objectives. Capital letters represent sentences and numbers represent tokens. Each corruption is applied to the original sequence, and the resulting corrupted sequence is shown. The model is trained to reconstruct the original sequence using the corrupted sequence as input. Selected tokens or sentences are highlighted using a yellow background.
Corruption | Original Sequence | Corrupted Sequence
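The three corruptions described in this section can be illustrated with a minimal sketch. It is not the released training code: the whitespace tokenization, the cyclic interpretation of document rotation, the uniform choice among corruptions, and all function names are assumptions made for illustration only.

```python
import random


def split_sentences(document):
    """Split on full stops, as described in the text; re-attach the stop."""
    return [s.strip() + "." for s in document.split(".") if s.strip()]


def document_rotation(sentences, rng):
    """Pick a random sentence as the new start; the rest keep their order.

    Assumption: sentences preceding the selected one wrap around to the end.
    """
    start = rng.randrange(len(sentences))
    return sentences[start:] + sentences[:start]


def sentence_permutation(sentences, rng):
    """Shuffle the sentences without preserving the original order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def token_deletion(tokens, rng, p=0.15):
    """Drop each token independently with probability p."""
    return [tok for tok in tokens if rng.random() >= p]


def corrupt(document, rng):
    """Apply exactly one corruption, chosen at random (uniformly, by assumption)."""
    sentences = split_sentences(document)
    choice = rng.choice(["rotation", "permutation", "deletion"])
    if choice == "rotation":
        corrupted = " ".join(document_rotation(sentences, rng))
    elif choice == "permutation":
        corrupted = " ".join(sentence_permutation(sentences, rng))
    else:
        # Assumption: whitespace tokens stand in for the model's BPE tokens.
        corrupted = " ".join(token_deletion(document.split(), rng))
    return choice, corrupted


if __name__ == "__main__":
    rng = random.Random(13)
    text = "Il gatto dorme. Il cane corre. La pioggia cade."
    name, noisy = corrupt(text, rng)
    # The (noisy, text) pair would form one denoising training example.
    print(name, "->", noisy)
```

In this sketch the original document is the reconstruction target and the corrupted string is the model input, mirroring the training setup described above.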

Model Architecture
The BART-IT model follows the same architecture as BART [8] and is trained from scratch on Italian text. We use the base architecture of BART to create an efficient transformer-based model that can both learn effective representations and be trained on a single GPU with a reasonable amount of memory and time. BART-IT is a sequence-to-sequence model composed of an encoder and a decoder. Both the encoder and the decoder are composed of 12 layers, the number of attention heads is 12, and the hidden size of the internal representation is 768. To effectively learn language-specific representations, we also train a tokenizer for the Italian language using the Byte-Pair Encoding (BPE) algorithm [29]. The tokenizer is trained on the same data collection used for the training of BART-IT, and it is used to tokenize the input text before feeding it to the model. Table 2 reports the full list of parameters used for the training of BART-IT and the corresponding tokenizer. The resulting model has a total of 140 million parameters, which is comparable to the number of parameters of the base version of BART [8].
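To make the role of the BPE tokenizer more concrete, the following is a minimal character-level sketch of how BPE merge rules can be learned and applied. It is a simplification written for illustration: the toy corpus, the function names, and the tie-breaking rule are assumptions, and production tokenizers add further machinery (special tokens, pre-tokenization, a fixed vocabulary size) that is omitted here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus_words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically
        # (an arbitrary but deterministic choice made for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode_word(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_corpus = ["casa"] * 5 + ["cosa"] * 3 + ["sasso"] * 2  # hypothetical data
    rules = learn_bpe(toy_corpus, num_merges=2)
    print(rules)
    print(encode_word("casa", rules))
```

Training the merges on Italian text rather than on English or multilingual text is what makes the resulting subword vocabulary language-specific.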

Training Data Collection
Modern transformer-based models are generally pre-trained using large corpora to learn the syntactic and semantic properties of the language. With the goal of learning effective representations of the Italian language, we train BART-IT using a large collection of Italian documents. Specifically, we use the Clean Italian mC4 Corpus used for training IT5 [9]. It is a cleaned version of the multilingual Colossal Clean Crawled Corpus (mC4) [11] containing only Italian documents. The dataset is cleaned to remove noisy or corrupted text that could negatively affect the training of sequence-to-sequence models. Using the same data collection allows us to compare the performance of BART-IT with the performance of IT5 [9] on the abstractive summarization task. The final data collection consists of approximately 103 million documents and 41 billion words.

Experiments
In this section, we present the datasets used for fine-tuning BART-IT for the abstractive summarization task, the evaluation metrics, the experimental setup, and describe the achieved results.

Fine-Tuning Datasets
Abstractive summarization is a challenging natural language processing task because it entails conveying the key information contained in a long piece of text into a short summary. Most of the benchmarks available for this task consist of a collection of text documents and the corresponding manually written summaries. In our experiments, we fine-tune BART-IT on three different Italian summarization datasets to evaluate its performance on the abstractive summarization task.
[...] December 2022) and extracting the leading section of each Wikipedia article and using it as the summary. The dataset is the largest among the three, containing a total of 700,000 article-summary pairs. Analogously to the original authors, we randomly select 10,000 articles as the test set, 10,000 articles as the validation set, and the remaining articles are used for training. Given the different nature of the articles, the average length of the documents is 956.66 words, and the average length of the summaries is 70.93 words.
Document type, documents' average length, and summaries' average length are rather diversified across the considered benchmarks, making them suitable for evaluating the performance of BART-IT under multiple aspects.
We use the same fine-tuning procedure for all the datasets, using a maximum sequence length of 1024 tokens for the input documents (i.e., longer documents are truncated) and a maximum sequence length of 128 tokens for the summaries (i.e., longer summaries are truncated).

Evaluation Metrics
The evaluation of abstractive summarization models is usually performed by comparing the generated summaries with the corresponding reference summaries. The most common evaluation metric is the ROUGE [32] score, which is a set of metrics that measure the similarity between the generated and the reference summaries by comparing the number of n-grams (i.e., sequences of n words) that they have in common. In this work, we use the ROUGE-1, ROUGE-2, and ROUGE-L scores, which measure the similarity between the generated and reference summaries by comparing the number of unigrams, bigrams, and the longest common subsequence, respectively.
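The n-gram and longest-common-subsequence comparisons behind ROUGE-1, ROUGE-2, and ROUGE-L can be sketched as follows. This is a simplified illustration rather than the evaluation script used in the experiments: it assumes whitespace tokenization, no stemming, and reports the F1 variant of each score, and the function names are ours.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 from clipped n-gram overlap."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(candidate, reference):
    """ROUGE-L F1 from the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    generated = "il gatto dorme sul divano"
    gold = "il gatto dorme sul tappeto"
    print(rouge_n(generated, gold, 1), rouge_n(generated, gold, 2), rouge_l(generated, gold))
```

Because all three scores depend only on surface token matches, a correct paraphrase that uses different words receives a low score, which motivates complementing ROUGE with an embedding-based metric.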
The ROUGE score cannot capture semantic similarity between the generated and reference summaries. To overcome this limitation, we also use the BERTScore [33] metric, which is a state-of-the-art evaluator that measures the token-level semantic similarity between the generated and the reference summaries. Unlike the ROUGE score, the BERTScore metric computes the cosine similarity between the embeddings of the tokens in the generated and the reference summaries. These embeddings are usually generated using a pre-trained encoder model; in our case, we use the multilingual cased version of BERT [7], as suggested by the authors of the BERTScore metric. The use of this metric allows us to capture the semantic similarity between the generated and the reference summaries and to better evaluate the quality of summaries that are not extracted from the input documents but are generated and thus may contain words that do not appear in the input documents.
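The token-level matching at the core of BERTScore can be sketched with plain vectors. The example below is only an illustration of the greedy cosine-similarity matching: the two-dimensional vectors are hypothetical stand-ins for contextual BERT embeddings, and optional features of the metric such as IDF weighting and baseline rescaling are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(candidate_vecs, reference_vecs):
    """BERTScore-style F1 over token embeddings.

    Precision: each candidate token is matched to its most similar
    reference token. Recall: each reference token is matched to its most
    similar candidate token. Both are averaged over tokens.
    """
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical embeddings: one candidate token, two reference tokens.
    candidate = [[1.0, 0.0]]
    reference = [[1.0, 0.0], [0.0, 1.0]]
    print(greedy_match_f1(candidate, reference))
```

Since matching happens in embedding space, two different words with similar contextual embeddings can still contribute to the score, which is what lets the metric reward valid paraphrases.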

Experimental Setup
The pre-training of large transformer-based models is a computationally expensive task that requires both a large amount of text data and ample computation resources. To train the BART-IT model, we use a machine with the following characteristics:
• CPU: Intel® Core™ i9-10980XE CPU @ 3.00 GHz;
• GPU: 2 × NVIDIA® RTX A6000 GPU, with 48 GB of VRAM each;
• RAM: 128 GB.
The training of the model is performed using the transformers library [34] and leveraging the PyTorch framework for deep learning [35].

Pre-training phase
For the pre-training phase of BART-IT, we use a total batch size of 64, a maximum sequence length of 1024 tokens for both the input and the output sequences, and the AdamW optimizer [36] with a maximum learning rate of 10⁻⁴ and a weight decay of 10⁻². The pre-training phase is performed for 1.7 million steps (i.e., roughly 1 epoch as suggested in recent studies [37]) with 17,000 warmup steps. We use a linear scheduler for the learning rate, which means that the learning rate starts at 0 and linearly increases to reach the value of 10⁻⁴ over the first 17,000 training steps. After the warmup phase is complete, the learning rate starts to decrease according to the decay factor. This means that the optimizer will take smaller and smaller steps as training progresses, which can help to improve the convergence of the training process and the performance of the trained model. To reduce both training time and memory usage, we also use floating-point 16-bit precision during model pre-training.
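The shape of the learning-rate schedule can be sketched as a small function. The warmup part follows the description above; the decay part is an assumption, namely that the rate decreases linearly to zero at the final training step, as is common for linear schedulers. The function name and its defaults are illustrative.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=17_000, total_steps=1_700_000):
    """Linear warmup to `peak_lr`, then (by assumption) linear decay to zero."""
    if step < warmup_steps:
        # Warmup: grows linearly from 0 at step 0 to peak_lr at warmup_steps.
        return peak_lr * step / warmup_steps
    # Decay: shrinks linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 8_500, 17_000, 858_500, 1_700_000):
        print(step, learning_rate(step))
```

With these settings the warmup covers 1% of the 1.7 million training steps, so almost all of pre-training happens in the decay phase.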

Fine-tuning phase
To ensure consistency and fairness in our comparison of model performance, we fine-tune BART-IT using the same set of hyperparameters for all the datasets. This allows us to directly compare the results and evaluate the effectiveness of our model without introducing any potential biases or inconsistencies. Specifically, we use a batch size equal to 32, a maximum sequence length of 1024 tokens for the input documents, and a maximum sequence length of 128 tokens for the summaries. The fine-tuning step is performed using the AdamW optimizer [36] for a maximum of 10 epochs, with a maximum learning rate of 10⁻⁵, 500 warmup steps, and a weight decay of 10⁻². We also use floating-point 16-bit precision during model fine-tuning. These parameters were chosen to fit the computational constraints of our experimental setup. The best model is selected using the ROUGE-2 score on the validation set.

Baseline Models
To evaluate the effectiveness of BART-IT, we compare it against the following strong baselines for abstractive summarization:
• IT5 [9] is a state-of-the-art sequence-to-sequence model that relies on the same architecture proposed by T5 [22] but is trained on the Italian language. This model is trained on the same dataset used for the pre-training of BART-IT and is fine-tuned on the same summarization datasets used for its evaluation. The model is available in three different sizes: small, base, and large. We use the base version of the model since it is the most similar to the proposed model in terms of the number of parameters (i.e., 220 million parameters);
• mBART [10] is a multilingual sequence-to-sequence model that uses the same architecture as the original BART model. It is trained on a multilingual corpus of 25 languages. By construction, the model size for the base model is more than four times larger than the model size of BART-IT (i.e., 610 million parameters). Even in this case, the model is fine-tuned on the same summarization datasets used for the evaluation of BART-IT;
• mT5 [11] is a multilingual sequence-to-sequence model that uses the same architecture as T5 [22] and is trained on a multilingual corpus of 101 languages. The model is available in five different sizes: small, base, large, xlarge, and xxlarge. We use the base version of the model, which has 390 million parameters (i.e., more than 2.5 times larger than the model size of BART-IT). Similar to the previous models, the model is fine-tuned on the same datasets used for the evaluation of BART-IT.
The summarization models are also compared against well-established extractive summarization baselines to evaluate the effectiveness of the abstractive models in generating summaries that are more fluent and coherent than the extractive summaries:
• Lead-K is a baseline method that extracts the first k sentences of the input document. In our experiments, we set k to 2;
• LexRank [38] is an established summarization method that extracts the most important sentences from the input document by first modeling the input document as a graph and then computing the PageRank score [39] of each sentence. The similarity between two sentences is computed using the IDF-weighted cosine similarity between the TF-IDF vectors of the sentences;
• TextRank [40] is a baseline method that, similar to LexRank, extracts the most important sentences of the input document by modeling the input document as a graph and computing the PageRank score [39] of each sentence. The pairwise sentence similarity is computed by exploiting the word-level overlap between each pair of sentences.
All the models are evaluated using both the ROUGE and BERTScore metrics to evaluate the effectiveness of the models in generating fluent and coherent summaries.
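The extractive baselines can be sketched in a few lines. The example below implements Lead-K and a TextRank-style ranker; it is not the implementation used in the experiments. Sentence splitting on full stops, whitespace tokenization, the log-normalized overlap similarity, the damping factor of 0.85, and the fixed number of iterations are all assumptions of this sketch. LexRank would follow the same graph-ranking structure with a TF-IDF cosine similarity in place of the word overlap.

```python
import math


def split_sentences(text):
    """Naive sentence splitter on full stops (an assumption of this sketch)."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def lead_k(text, k=2):
    """Lead-K: return the first k sentences of the document."""
    return split_sentences(text)[:k]


def overlap_similarity(s1, s2):
    """Word overlap normalized by the log sentence lengths."""
    w1 = set(s1.lower().rstrip(".").split())
    w2 = set(s2.lower().rstrip(".").split())
    denom = math.log(len(w1)) + math.log(len(w2)) if w1 and w2 else 0.0
    return len(w1 & w2) / denom if denom > 0 else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on a square weight matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] * scores[j] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def textrank_summary(text, k=2):
    """Return the k top-ranked sentences, in their original document order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    n = len(sentences)
    weights = [
        [overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    doc = "Il modello riassume testi italiani. Il modello usa testi italiani lunghi. Oggi piove."
    print(lead_k(doc, k=2))
    print(textrank_summary(doc, k=1))
```

Both baselines copy sentences verbatim from the input, so their output is always grammatical but cannot compress or rephrase content the way the abstractive models do.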

Experimental Results
This section outlines the main experimental results achieved on the datasets used during the evaluation phase. To analyze both model efficiency and effectiveness, we report the results obtained on the test set and the number of parameters separately for each model. The model size can be used as a proxy for the complexity of the model and as a way to gauge the performance of models of different sizes. Tables 3-5 separately report the results of both extractive and abstractive summarization models on the three different datasets using ROUGE and BERTScore metrics.
News summarization datasets. FanPage and IlPost [30] are two datasets containing Italian news articles and their corresponding summaries. Both datasets can be classified as medium-sized since they contain fewer than 100,000 documents across the training, validation, and test sets. BART-IT significantly outperforms all extractive baseline models, as well as IT5, which has a comparable model size, on both datasets. Tables 3 and 4 show the comparison between BART-IT and the other models on the two datasets. Regarding ROUGE-1 and ROUGE-2 scores, BART-IT outperforms IT5 by a margin of 1.43 and 0.29 points (i.e., 4.2% and 1.8% relative improvement) on FanPage and 4.43 and 3.91 points (i.e., 13.4% and 25.17% relative improvement) on IlPost. In terms of BERTScore, BART-IT outperforms IT5 by a margin of 2.93 (4.2%) and 4.3 (6.11%) points on FanPage and IlPost, respectively.
When comparing BART-IT with multilingual models, we observe that BART-IT performs better than mT5 but is outperformed by mBART on both datasets. The advantage of mBART, however, comes at the cost of significantly more computationally intensive model training. More specifically, both mBART and mT5 are trained on a multilingual corpus and contain significantly more parameters than BART-IT (i.e., 610 M parameters for mBART and 390 M parameters for mT5 compared to 140 M parameters for BART-IT), which can be a reason for the performance gap between the models.
Wikipedia summarization dataset. WITS [31] is a large dataset of Wikipedia articles and their corresponding summaries. The dataset contains roughly 700,000 documents and has been used in previous works to evaluate the performance of abstractive summarization models [31]. Although a large number of documents can be beneficial for training accurate models, the Wikipedia articles are rather diversified in length and topic, making the task of summarization significantly more challenging. Analyzing the results on the WITS dataset reported in Table 5, we observe that BART-IT performs significantly better than all the other models, including the multilingual ones. Even though BART-IT has significantly fewer parameters than both mBART and mT5, it is able to outperform them in terms of ROUGE metrics. Considering the BERTScore metric, mT5 performs slightly better than BART-IT. The results indicate that multilingual models are able to learn the mapping between source and target languages, but they might be less effective in learning to produce a summary of long and diverse documents.
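The relative improvements quoted for FanPage and IlPost are presumably computed with respect to the IT5 score. Under that assumption (the symbols below are introduced here for illustration and do not appear in the original text), the absolute margin and the relative gain are related by:

```latex
\Delta_{\mathrm{rel}} \;=\; 100 \times \frac{s_{\text{BART-IT}} - s_{\text{IT5}}}{s_{\text{IT5}}}
```

For instance, a ROUGE-1 margin of 1.43 points corresponding to a 4.2% relative gain would imply an IT5 ROUGE-1 score of roughly 1.43 / 0.042 ≈ 34 on FanPage.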
The empirical analyses reported in this section show that BART-IT is competitive yet efficient in tackling the Italian document summarization task, especially when compared with language-specific abstractive models. The model also outperforms the IT5 sequence-to-sequence model with a similar number of parameters and is competitive against multilingual models having significantly larger model sizes. Figure 1 compares the performance of the models in terms of a combined evaluation score mixing effectiveness (evaluated using the syntactic ROUGE-2 score) and efficiency (defined by the number of summaries generated per second on a single NVIDIA A6000 GPU). BART-IT achieves the best performance mix, addressing summary generation more efficiently than all the other tested models, outperforming IT5 and mT5 in terms of ROUGE-2 scores, and being competitive against mBART. In view of the achieved results, BART-IT can be deemed as an efficient yet effective solution for tackling the abstractive summarization task for the Italian language.

Discussions of Model Limitations and Ethical Issues
In this section, we address the limitations of the proposed approach and the ethical aspects related to the use of abstractive models. Sequence-to-sequence models are able to generate fluent and coherent text conditioned on the input document. Although their characteristics are particularly suitable for addressing tasks such as summarization or machine translation, abstractive models are also prone to generating hallucinated text that may contain non-factual information or offensive content [41]. In this work, we focus on the summarization task and assess the performance of BART-IT using automatic evaluation metrics. The adopted metrics provide an estimate of the quality of the generated summaries, but they neither demonstrate the factual correctness of the generated summaries nor prevent the inclusion of offensive or inappropriate content. Ensuring the factual correctness of the generated summaries is a challenging task that has been investigated by the NLP community [42] and is well beyond the scope of the current work. The Italian NLP community would benefit from a BART model pre-trained on the Italian language to address several tasks, including but not limited to summarization. Hence, it is crucial to be aware of the ethical implications of using such models.
We provide BART-IT as an open-source model in a single model size, which is significantly smaller than the multilingual models (e.g., 140 million parameters vs. 610 million parameters for mBART). Researchers and practitioners can experiment with this model and evaluate its performance on different tasks. The model size allows them to fine-tune the model even on a single consumer GPU, which can be a significant advantage for researchers and practitioners who do not have access to large computational resources. However, although the quality of the generated summaries is competitive with models having significantly more parameters, it would be interesting to investigate whether the performance of BART-IT can be further improved by training a larger version of the model. Training a larger version of BART-IT can be easily accomplished by using the same training procedure and leveraging the code used in the current work (see Section 3). However, the higher number of parameters would require a larger amount of computational resources. For example, pre-training the large version of the model proposed by the original BART authors [8] on the same dataset as in the current work and in a reasonable amount of time (e.g., less than 30 days) would require a system configuration having at least double the number of GPUs and amount of associated memory. The rest of the system configuration would be similar to the one used in the current work. This is beyond the scope of the current work.

Conclusions
In this work, we presented BART-IT, an efficient and effective sequence-to-sequence model for the Italian language. We pre-trained BART-IT on a large Italian corpus and evaluated its performance on the summarization task. The results show that BART-IT is able to outperform the language-specific abstractive models, achieving competitive performance against multilingual models with a significantly larger number of parameters.
We release the BART-IT pre-trained model (as well as the code needed to retrain it) to encourage the Italian Natural Language Processing community to develop new applications tailored to the Italian language. Although this work focuses on the summarization task, we believe that the proposed model can be used for other tasks such as question answering and machine translation. Investigating the performance of BART-IT on other tasks will be addressed as future work.