WATS-SMS: A T5-based French Wikipedia Abstractive Text Summarizer for SMS

Abstract: Text summarization remains a challenging task in the field of Natural Language Processing despite its plethora of applications in enterprises and daily life. One common use case is the summarization of web pages, which has the potential to provide an overview of web pages to devices with limited features. In fact, despite the increasing penetration rate of mobile devices in rural areas, the bulk of those devices offer limited features, and these areas are often covered only by limited connectivity such as the GSM network. Summarizing web pages into SMS therefore becomes an important way to deliver information to limited devices. This work introduces WATS-SMS, a T5-based French Wikipedia Abstractive Text Summarizer for SMS, built through a transfer learning approach. The T5 English pre-trained model is used to generate a French text summarization model by retraining it on 25,000 Wikipedia pages; the resulting model is then compared with different approaches from the literature. The objective is twofold: (1) to check the assumption made in the literature that abstractive models provide better results than extractive ones; and (2) to evaluate the performance of our model compared to other existing abstractive models. A score based on ROUGE metrics gave us a value of 52% for articles with a length of up to 500 characters, against 34.2% for Transformer-ED and 12.7% for seq2seq-attention; and a value of 77% for larger articles, against 37% for Transformer-DMCA. Moreover, an architecture including a software SMS gateway has been developed to allow owners of mobile devices with limited features to send requests and to receive summaries through the GSM network.


Introduction
One of the most fascinating advances in the field of artificial intelligence is the ability of computers to understand natural language. Natural Language Processing (NLP) is an active field of research encompassing text and speech processing. With the huge amount of textual content available on the Internet, the search for relevant information is becoming difficult. Therefore, one tremendous application of NLP remains text summarization. According to Maybury [1], text summarization aims to distil the most important information from one or several sources to generate an abridged version of the original document(s) for a particular user or task. Text summarization is usually composed of three steps: pre-processing, processing, and post-processing [2]. The first step prepares the text to summarize by performing tasks such as sentence segmentation, word tokenization or stop-word removal. The second step performs the actual summarization of the text using a technique based on one of three approaches: extractive, abstractive, or hybrid. The third step refines the generated summary through tasks such as anaphora resolution before producing the final summary [3]-[5].
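The pre-processing step described above (sentence segmentation, word tokenization, stop-word removal) can be sketched as follows. This is an illustrative minimal version: the regex-based segmentation and the tiny stop-word list are assumptions for the example, not the paper's actual implementation.

```python
import re

# A tiny illustrative French stop-word list; a real system would use a full one.
STOP_WORDS = {"le", "la", "les", "de", "des", "un", "une", "et", "en", "est"}

def preprocess(text):
    """Segment text into sentences, tokenize each, and drop stop-words."""
    # Naive sentence segmentation on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    processed = []
    for sentence in sentences:
        # Word tokenization: keep word tokens, lowercased.
        tokens = re.findall(r"\w+", sentence.lower())
        processed.append([t for t in tokens if t not in STOP_WORDS])
    return sentences, processed
```

For example, `preprocess("Le chat dort. La souris mange du fromage !")` yields two sentences, with "le" filtered out of the first token list.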
Text summarization is applied for several purposes including news summarization [6], [7], opinion or sentiment summarization [8], [9], tweet summarization [10], email summarization [11], scientific papers summarization [12], [13] and even web page summarization [14]. The latter offers the potential to provide an overview of web pages to devices with limited features. However, although the rate of use of mobile devices is increasing in remote and/or rural areas of developing countries, the bulk of those devices offers limited features with no browser and limited connectivity options [15].
Despite efforts that have been made to provide Wi-Fi-based wireless connectivity in certain locations [16], [17], a large share of those devices works only with the GSM network. Therefore, the only means to provide relevant information to those users is through SMS.
An important sector in rural regions, especially in sub-Saharan Africa, remains education, which suffers from a teacher shortage [18], [19]. Providing relevant content to sustain the sector has become a priority. Among freely available educational resources, the Wikipedia encyclopedia represents one of the largest and most recognizable reference resources [20]. It is therefore an excellent choice to mitigate the teacher shortage by providing relevant information to students. But due to the character limit of SMS, a complete Wikipedia page cannot always be sent through SMS without the risk of losing relevant information. Thus, summarizing the page is mandatory to convey its quintessence.
Existing works on the summarization of Wikipedia pages are mostly based on extractive techniques that amount to ranking sentences and selecting the best ones. They usually produce summaries that are between a quarter and half the length of the original page.
Although some works, such as Lin et al. [23], deal with a word limit when generating the summary, it remains very difficult to account for the character limit imposed by SMS. Indeed, the summary should fit in a maximum of three SMS. Depending on the mobile carrier, the number of characters should not exceed 455; otherwise the message is converted into an MMS, which requires adapted devices and incurs additional cost compared to SMS. Moreover, because the relationship between the number of words or sentences and the number of characters is unpredictable, summarizing a web page into SMS is a real challenge.
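The character budget above can be made concrete with a small sketch. The 160/153 per-segment figures are the standard GSM-7 values for single and concatenated SMS and are given only for illustration; the paper itself works with the 455-character carrier limit.

```python
import math

def sms_segments(text, single=160, multi=153):
    """Number of SMS segments needed for `text`, assuming the GSM-7 alphabet.

    A single SMS holds 160 characters; concatenated messages hold 153
    characters per segment because of the concatenation header.
    """
    if len(text) <= single:
        return 1
    return math.ceil(len(text) / multi)

def fits_sms_budget(text, max_chars=455):
    """WATS-SMS constraint: the summary must not exceed 455 characters,
    beyond which carriers typically convert the message into an MMS."""
    return len(text) <= max_chars
```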
The first attempt to consider text summarization with a character limit is provided in [24]. The authors used an extractive approach based on the combination of the LSA and TextRank algorithms. Although their approach generates summaries with a limit on the number of characters, its major shortcomings are redundancy in the produced summary and the lack of context understanding. In fact, the generated summary is based on the computation of the weight of the words in the sentences, with complete unawareness of the context. Although some other works try to understand the context during summarization [25]-[30], they do not take the character limit into consideration. This paper introduces WATS-SMS, a French Wikipedia Abstractive Text Summarizer that aims to summarize French Wikipedia pages into SMS and to deliver summaries directly on the user's device. It is built by applying a transfer learning technique to fine-tune a pre-trained model on French Wikipedia pages. The length of the summary is checked before producing the final summary. Fine-tuning has quickly become the approach of choice for building custom models, owing to the huge resources required to train a model from scratch and the good performance shown by fine-tuned models [31].
The rest of the paper is organized as follows: Section 2 briefly presents the related works on text summarization. Section 3 explains the general architecture of WATS-SMS, as well as details of its main components. Section 4 presents and discusses the quality of summaries based on comparison with existing approaches. Section 5 presents a demonstration of the solution on a basic phone. This paper ends with conclusions and directions for future works, as discussed in section 6.

Related works on text summarization
The first significant contributions in text summarization date back to Luhn [32] in the late 1950s and Edmundson [33] in the early 1960s. The approaches were basic, focusing on the position of sentences and the frequency of words to produce a summary composed of extracts from the original text. Since then, many efforts have been made to develop new approaches to text summarization. Recent surveys such as [2], [34] propose a comprehensive review of the state of the art in automatic text summarization techniques and systems. Figure 1, adapted from [2], provides a classification of automatic text summarization systems. Text summarization can be classified based on the input size: some approaches such as [35], [36] have been designed for single documents, while others target multi-document summarization [37], [38]. The summary language is another important aspect. Although most works focus on English, some have developed multilingual text summarization approaches [8], [39], [40], and cross-lingual approaches have recently emerged [41], [42]. Text summarization can also be classified by the type of summary: extractive, abstractive, or hybrid. The extractive approach consists in scoring sentences and selecting the subset of high-scored sentences [43]. In an abstractive approach, the input document is mapped to an intermediate representation before the summary is generated with sentences that differ from the original ones [44]. The hybrid approach combines the two previous approaches [5].
Abstractive approaches can generate better summaries using words other than those in the original document [46], with the advantage that they can reduce the length of the summary compared to extractive approaches [47]. Several methods have been developed for abstractive text summarization, including rule-based, graph-based, tree-based, ontology-based, semantic-based, template-based, and deep-learning-based methods. The latter were made possible by the success of sequence-to-sequence (seq2seq) learning, which uses attention-based encoder-decoder Recurrent Neural Networks (RNN). However, RNN models are slow to train and cannot deal with long sequences [48]. Long Short-Term Memory networks (LSTM) [49] address this limitation, as they are capable of learning long sequences. Several works have reported promising results in a wide variety of applications where data exhibits long-sequence dependencies.
A breakthrough that revolutionized the NLP industry and helped to deal with long-term dependencies and parallelization during training is the concept of transformers introduced in [50]. Later, the BERT (Bidirectional Encoder Representations from Transformers) representation model, a pre-trained model that for the first time considers the context of a word from both sides (left and right), was developed in [51]. BERT provides a better understanding of the context in which a word is used, in addition to allowing multi-task learning. BERT can be fine-tuned to generate new models for other languages. More recently, BERT-based models such as CamemBERT [52] and FlauBERT [53] have been trained on very large and heterogeneous French corpora. The generated French models have been applied to various NLP tasks including text classification, paraphrasing, and word sense disambiguation. However, applying a BERT-based model to summarization is not straightforward: BERT-based models usually output either a class label or a span of the input.
Text summarization tasks using BERT-based models are addressed in [54] and [55].
The generated models have been evaluated on three datasets of news articles: CNN/DailyMail news highlights, the New York Times Annotated Corpus, and XSum. Lately, the Text-to-Text Transfer Transformer (T5) model and framework, which achieves state-of-the-art results on several benchmarks including summarization, has been proposed [56]. To the best of our knowledge, no T5-based work deals with text summarization on French Wikipedia. In addition, the limitation in terms of number of characters has not yet been considered in an abstractive summarization process.
Due to the homogeneity of Wikipedia in terms of genre and style, and because of the diversity of the domains we intend to cover with the WATS-SMS application (news, education, health, etc.), we used the T5 model instead of BERT. T5 casts all NLP tasks into a unified text-to-text format, in contrast to BERT-style models that usually output either a class label or a span of the input.

General architecture of WATS-SMS

This section presents the proposed WATS-SMS system, including the SMS mobile gateway and the summarizer module. The complete working of the system is illustrated in Figure 2. The process starts with the user sending an SMS (through the GSM network) to the SMS Gateway phone number. The first module of the SMS Gateway, namely the SMS Handler, retrieves the keyword and the phone number of the user. The second module of the SMS Gateway, namely the Request Handler, then generates the web request to send to the server. Once the request is received by the summarizer on the server, the Wikidump module retrieves and cleans the corresponding Wikipedia page and passes it to the Text Summarization module. The latter generates the summary, which takes the reverse path back to the user.

The SMS Gateway
The SMS Gateway is an application that can be installed on a smartphone to avoid traditional hardware SMS Gateway. It is composed of two modules: The SMS Handler and the Request Handler.
The SMS Handler performs two functions. Firstly, it reads and filters incoming messages to select those containing the keyword "resume". Since an ordinary smartphone can be used to play this role, not all incoming SMS messages are necessarily requests; only SMS messages containing the keyword are considered as such. The SMS-Reader therefore parses the content, and the text following the keyword "resume" is considered as the title of the requested Wikipedia page. Secondly, it passes the extracted title, together with the user's phone number, to the Request Handler.
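The keyword filtering described above can be sketched as follows. The function name and the case-insensitive matching are illustrative assumptions, not the paper's actual code:

```python
def parse_request(sms_text):
    """Extract the requested page title from an incoming SMS.

    Only messages starting with the keyword "resume" are treated as
    requests; the rest of the message is taken as the page title.
    Matching is made case-insensitive here as an illustrative choice.
    """
    parts = sms_text.strip().split(maxsplit=1)
    if not parts or parts[0].lower() != "resume":
        return None  # an ordinary SMS, not a request for the summarizer
    return parts[1].strip() if len(parts) > 1 else None
```

For example, `parse_request("resume Informatique")` returns `"Informatique"`, while an ordinary message returns `None` and is left untouched.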

The Summarizer
The proposed summarization approach consists of three main steps: 1. Pre-processing of the requested page (Retrieving and cleaning); 2. Summarization process; 3. Post-processing the summary.

Pre-processing of French Wikipedia pages
The pre-processing aims to retrieve and clean the content of a requested page before starting the summarization. It is important to know the structure of a Wikipedia page before starting the retrieval. A typical Wikipedia page is composed of four parts: the top of the page, the left sidebar, the body, and the footer. The body contains the main information of the page. It is itself divided into four main parts: the title, the introductory summary, the table of contents and the content itself. The page retrieval targets the body part, mainly the content and the introductory summary. This is done by using the Wikidump function.
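The paper does not list the Wikidump function itself; an equivalent retrieval can be sketched with the public MediaWiki API, whose `prop=extracts` query returns the plain-text body (introductory summary plus content) of a page. The endpoint and parameters below are the standard MediaWiki ones, not necessarily those used by the authors:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://fr.wikipedia.org/w/api.php"

def build_url(title):
    """Build the API URL asking for the plain-text extract of a page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # strip HTML and wiki markup, return plain text
        "redirects": 1,    # follow redirects to the canonical page
        "format": "json",
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)

def fetch_page_text(title):
    """Retrieve the body of a French Wikipedia page (network access required)."""
    with urlopen(build_url(title), timeout=10) as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract", "")
```

`fetch_page_text` requires network access; `build_url` alone shows the shape of the request.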
However, care must be taken during page retrieval. Some pages do not have a dense introductory summary, while others present relevant information only in the summary and not in the content section. Focusing only on the main content of the requested page may therefore miss relevant information, so it is necessary to combine the introductory summary and the content. For instance, the summary of the page "Eséka" 1 is only one sentence (112 characters with spaces) while the content of the page is 9,762 characters long. Another example is the page "Song Bassong" 2, which contains relevant information in the summary section that is not present in the content section. In addition, the total number of characters of the latter page is 345 (below the 455-character limit), meaning that the content of both sections could be sent back to the user even without summarization. Even though such situations are quite rare, the content retrieved from the requested page always goes through the summarization process.

We consider the default architecture of the T5 model and apply the Adam optimizer with the learning rate initialized to 0.0001. The batch size for training and testing is 2, and we train for 2 epochs. The training was done on a Tesla T4 GPU provided by Google Colab.
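The combination rule described above (merge the introductory summary with the content, and note whether the result already fits the SMS budget) can be sketched as follows; the function name and return shape are illustrative assumptions:

```python
SMS_LIMIT = 455  # carrier limit used by WATS-SMS (three concatenated SMS)

def prepare_input(intro, content, limit=SMS_LIMIT):
    """Combine the introductory summary and the body of the page.

    Returns the combined text and a flag telling whether summarization
    is needed: if the combined text already fits the SMS budget, it
    could in principle be sent back as-is.
    """
    combined = (intro.strip() + "\n" + content.strip()).strip()
    needs_summarization = len(combined) > limit
    return combined, needs_summarization
```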

Summarization process
We split the dataset into 80% for the training set and 20% for the test set. The Wikipedia summary section is used as the reference summary, as is done in [24].
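A minimal sketch of such an 80/20 split over (article, reference summary) pairs; the shuffling and the fixed seed are illustrative choices, not specified in the paper:

```python
import random

def train_test_split(pages, test_ratio=0.2, seed=42):
    """Split a list of (article, reference_summary) pairs 80/20.

    A fixed seed keeps the split reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = pages[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```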

Post-processing of the summary
The post-processing is done with the help of the length_Checker function, which checks the length of the summary and modifies it if necessary to obtain an understandable final summary. The maximal number of characters (455) is given as a parameter to the model. If the length of the generated summary exceeds 455 characters, the model outputs only the first 455 characters. After truncation, the last sentence may be incomplete and therefore not understandable; the easiest fix is to delete this incomplete last sentence.
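The truncation-and-cleanup rule above can be sketched as follows (an illustrative reconstruction of the length_Checker behaviour, not the authors' code):

```python
import re

MAX_CHARS = 455  # SMS budget: three concatenated messages

def length_checker(summary, max_chars=MAX_CHARS):
    """Truncate the summary to the SMS budget, then drop a trailing
    incomplete sentence so the final text remains understandable."""
    if len(summary) <= max_chars:
        return summary
    truncated = summary[:max_chars]
    # Keep everything up to the last sentence-ending punctuation mark.
    match = re.search(r"^(.*[.!?])", truncated, flags=re.DOTALL)
    return match.group(1) if match else truncated
```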

Results
The well-known ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics are used to evaluate the quality of a summary or machine translation [57]. ROUGE works by comparing an automatically produced summary against a set of reference summaries, typically human-produced. The measure is done by counting the number of matching units between the candidate summary and the reference summary; in other terms, ROUGE measures the content coverage of an automatic summary over the reference summary. Different variants of ROUGE have been proposed, such as ROUGE-N, ROUGE-L and ROUGE-S, which respectively count overlapping n-grams, the longest common word sequence, and word pairs (skip-bigrams) between the candidate summary and the reference summary. The formula for ROUGE-N is given in equation 1.
ROUGE-N = \frac{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count(gram_n)}    (1)

where gram_n is an n-gram, S is a reference summary, Count_{match}(gram_n) is the maximum number of n-grams co-occurring in the candidate summary and the reference summary, and Count(gram_n) is the number of n-grams in the reference summary.
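Equation 1 can be implemented directly. The sketch below is a minimal pure-Python version of ROUGE-N recall for a single reference summary (published evaluations typically rely on the official ROUGE toolkit, so this is for illustration only):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: clipped overlapping n-grams over reference n-grams."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    if not ref:
        return 0.0
    # Count_match: each reference n-gram matches at most as many times
    # as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())
```

For example, `rouge_n("the dog", "the cat")` is 0.5: one of the two reference unigrams is covered.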

Comparison with extractive approaches
We first compare the results of the proposed summarization model with the results in [24], which defined an approach combining the LSA and TextRank algorithms for the text summarization of Wikipedia articles. However, the tests carried out in [24] were done on English pages. For a fair comparison, we selected the French Wikipedia pages corresponding to the articles in [24]. We focused on the results for articles longer than 2,000 characters, for two main reasons. Firstly, most French Wikipedia pages are longer than their English counterparts; for instance, the French version of the page "Kousseri" is 5,797 characters, while the English version is only 1,386 characters. Secondly, the results in [24] show that the extractive approach struggles to provide good results as the length of the page increases, so the idea is also to assess how our model behaves on longer pages. From Table 1, we observe that the summary produced by the proposed abstractive model is most of the time better than the one produced by the extractive approach, even though the French pages are longer than their corresponding English pages. A closer observation reveals that the extractive approach struggles to provide a good summary when the reference summary is longer. For instance, on the Wikipedia page "Kousseri", the reference summary contains 459 characters and the best ROUGE metric is 0.57 for LT ROUGE-1. When the length of the reference summary increases, as on the page "Bafoussam", the LT ROUGE-1 drops to 0.52. However, the extractive approach provides better results when the reference summary is very short (up to a hundred characters).
WATS-SMS has also been compared to the Wikipedia-based summarizer, another extractive summarization system proposed by Sankarasubramaniam et al. [21]. Their approach consists in first constructing a bipartite sentence-concept graph and then ranking the input sentences through iterative updates on this graph. The work in [21] also uses English Wikipedia pages. We used the ROUGE metrics for the evaluation and computed the recall scores for 100-word summaries in order to reproduce the test conditions used by Sankarasubramaniam et al. [21], with the only difference that our proposed model uses French Wikipedia pages.

Comparison with abstractive models
The proposed approach is compared to other abstractive models defined in [58] and trained on English Wikipedia. We first compared our model to seq2seq-attention and Transformer-ED with an input sequence length of up to 500, using the ROUGE-L metric as in the original paper. The results are given in Table 3. Finally, the proposed model has been compared with Transformer-DMCA on input texts between 10,902 and 11,000 characters, where it achieves 77% against 37% for Transformer-DMCA. This demonstrates the ability of the proposed model to provide good summaries.

Demonstration
This section shows a demonstration of WATS-SMS on a phone with limited features.
In Figure 4 the user starts by sending a request with the keyword "resume", followed by the title of the requested page: "informatique".