1. Introduction
Text summaries give a reader an overview of a document or a collection of documents on the same topic. Summaries are a convenient tool to give an indication of the overall content and may also inform the reader of details of interest to them. The need for quality summaries has grown with the dramatic rise of information. Many individual documents have good-quality summaries; however, these summaries may not focus on the specific information of interest to a user.
There are two general approaches to address a user’s needs for a summary of one or more documents:
extractive and
abstractive. Extractive methods primarily seek to find good sentences from the source documents that can be concatenated to form a summary; abstractive methods may use words and phrases that do not appear in the original documents and try to present them meaningfully for a more human-level summary. In this paper, we present the software package
occams, an extractive summarization system using a combinatorial optimization approach whose performance is demonstrated to be among the best extractive methods [
1]. For a survey of the summarization problem and classical extractive summarization methods, see [
2,
3].
Until recently, abstractive summarization methods lagged behind extractive methods, as systems could not generate fluent text containing relevant information. One of the first successful approaches used a recurrent neural network (RNN) along with a
pointer-generator network [
4]. RNNs were used to encode a document, turning the sequence of token embeddings into a state vector. Conversely,
decoding takes the state vector and converts it into a sequence of tokens, forming the summary. This approach cleverly augmented the decoding of the RNN-encoded document with a “copy mechanism” to include extracted words or even segments of the document.
With the invention of the transformer model [
5] later that year, another big change came about. The transformer model then led to deep pre-trained language models, such as generative pre-trained transformers (GPT) [
6] and bidirectional encoder representations from transformers (BERT) [
7], both of which improved the performance on a range of natural language processing tasks, including recognizing relevant content to generate summaries. When these models are coupled as was done in bidirectional and auto-regressive transformers (BART) [
8], sequence-to-sequence models were trained to learn to transform a document into an abstractive summary. Over the last three years, the models have grown in size and scope for text summarization and other natural language processing tasks, including the introduction of sophisticated chatbots, such as ChatGPT [
9], which have the ability to answer questions, summarize, and more generally generate text based on a prompt (for example, a prompt could be “Summarize the following text with 100 words or less”
https://openai.com/blog/chatgpt, accessed on 8 December 2022).
Despite these great advances, there is still a need for classic extractive summarization. Abstractive methods suffer from “hallucinations”, factual errors or statements not supported or readily explained by the original text [
10]. Secondly, large language models can require great computing resources, and most have limited input sizes. Due to these concerns, an extractive summarization system can provide an alternative or be used in conjunction with an abstractive system as done by others, e.g., [
11].
The primary goal of
occams is to provide a state-of-the-art multilingual extractive summarization method using the first principles of the statistics of natural language in conjunction with an optimal approximation of a combinatorial covering algorithm. The covering problem approximates the integer linear programming problem formulation [
12]. The previous implementation of the approximation algorithm was reported [
13] to give over 90% of the optimal coverage with an order of magnitude speed up.
The current implementation maintains this coverage approximation and has improved its performance.
The system’s name evokes the principle of Occam’s razor, which can be summarized as “entities should not be multiplied beyond necessity”. The principle is attributed to William of Ockham (Occam), a 14th-century English Franciscan friar and philosopher (although this statement of principle occurs later, and the principle itself could be argued to date back to Aristotle [
14]).
We apply this principle first in choosing extractive over abstractive summarization. While neural-based summarization systems have produced great advances in text summarization, and human-quality abstractive summaries would be preferred, in many cases, a simple extractive method suffices. In addition to being computationally cheaper, extractive systems often benefit from greater flexibility and control of input and output lengths. They can be applied to a range of multi-document summarization tasks across document genres. As mentioned, abstractive methods suffer from hallucinations, and it is challenging to explain their output. In contrast, an extractive method, by definition, selects a subset of sentences from the document, so it is easy to find the source of information used in the generated summary.
The occams package decomposes the extractive summarization task into three steps.
The input document or documents (the input text) is segmented into sentences and terms.
Term weights are computed, indicating the relative importance of the terms in the input text.
Given a target summary length in characters or words, an optimal approximation algorithm is used to maximize the weighted coverage of terms by selecting a subset of sentences, the sum of whose lengths does not exceed the budgeted target length.
The paper is divided into three main sections according to these three parts. We then give a short example demonstrating how to use the package to extract a summary, and conclude with results compared to other summarization systems.
2. Document Segmentation
Document segmentation is a critical part of natural language processing. The quality of the resulting downstream tasks, summarization included, is affected by this first step. Document segmentation for extractive summarization consists of splitting the text of a document into sentences and tokenizing those sentences into smaller units of meaning, typically words. This process is language dependent.
As we will see in more detail in
Section 3,
occams views a document as a bipartite graph on the sets of sentences and
terms, with adjacency indicating that a given term occurs in a given sentence. Terms are user specified and can represent more abstract notions, such as concepts, as originally proposed by [
15]. Typically, we form terms as unigrams or bigrams of tokens, where tokens consist of words, word stems, or word lemmas. One might also want to apply some kind of text normalization, such as collapsing consecutive whitespace characters to a single ASCII space, removing punctuation, removing digits, or lowercasing characters. There is quite a bit of choice in this process.
Included in
occams is the subpackage
occams.nlp, which deals with these problems. This subpackage serves two roles for users. First, it defines an abstraction in the form of the two Python dataclasses
Sentence and
Document. These dataclasses are a means of bundling the result of document segmentation such that it is independent of any particular NLP library’s API.
Sentence has two attributes,
text and
terms; the former, a string, is the literal text of the sentence, while the latter is a list of the terms which appear in that sentence.
Document also has two attributes:
sentences is a list of
Sentence objects, and
language is a string indicating the language of the document. Thus, a
Document object encodes the bipartite graph resulting from segmenting the text of a document. The summarizer class, which we will discuss in
Section 3, takes as input a list of
Document objects. In this way, we decouple the APIs of NLP libraries that may be used for the document segmentation step from the API of the summarizer class, giving users the flexibility to use whatever tools they wish for the document segmentation step without requiring those tools to be directly supported by
occams.
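To make the bipartite sentence-term structure concrete, the following minimal sketch defines stand-in Sentence and Document dataclasses with the attributes named above and fills them with a deliberately naive regular-expression segmenter. The helper segment() and its splitting rules are assumptions made for this illustration; they are not the occams.nlp implementation, which delegates segmentation to nltk or stanza.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

Term = Tuple[str, str]  # a bigram of normalized tokens


@dataclass
class Sentence:
    """Stand-in mirroring the two attributes described in the text."""
    text: str
    terms: List[Term] = field(default_factory=list)


@dataclass
class Document:
    """Stand-in: a list of sentences plus a language code."""
    sentences: List[Sentence]
    language: str = "en"


def segment(text: str, language: str = "en") -> Document:
    """Hypothetical helper: naive sentence split and bigram term formation."""
    # Split after '.', '!' or '?' followed by whitespace (far cruder than nltk).
    raw = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    sentences = []
    for s in raw:
        # Lowercase and keep alphabetic runs only (drops punctuation and digits).
        tokens = re.findall(r"[a-z]+", s.lower())
        # Terms are overlapping pairs of consecutive tokens.
        bigrams = list(zip(tokens, tokens[1:]))
        sentences.append(Sentence(text=s, terms=bigrams))
    return Document(sentences=sentences, language=language)


doc = segment("Summaries save time. Good summaries save more time!")
for sent in doc.sentences:
    print(sent.text, sent.terms)
```

Because the summarizer only needs the sentence text and the list of terms per sentence, any segmenter that can populate these two fields can be substituted for the naive one used here.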
On the other hand, as we said above, there is quite a bit of choice involved in segmenting documents, from the choice of a library, to what constitutes a token, to whether or how to normalize the text. The second role of the
occams.nlp subpackage is to alleviate some of this burden for users who would prefer a more convenient and streamlined process. We provide a simple
DocumentProcessor class, which can be used to transform the text into
Document objects. By default, this class uses
nltk [
16] for document segmentation:
nltk.tokenize.sent_tokenize() to split the document into sentences,
nltk.tokenize.word_tokenize() to form word tokens, and one of
nltk’s Porter or Snowball stemmers to produce stemmed word tokens. We choose
nltk, as it is the most widely used Python package for natural language processing, is computationally efficient, and gives strong support for most European languages. To provide convenient support for a broader set of languages, we also include a class that uses
stanza [
17] for document segmentation, as
stanza supports 70 languages.
These classes have a few options to allow users to control some of the aspects of the document segmentation noted above. However, they are also meant to be somewhat opinionated about the process to save users from having to make too many choices. Users with their own opinions can quite easily segment documents however they wish and bundle the results into
Sentence and
Document objects. Finally,
occams.nlp contains a helper function,
process_document(), that instantiates a
DocumentProcessor object with its default options and uses it to process text into a
Document object. We demonstrate the use of this wrapper in
Section 5.
Now that our text, which may consist of one or more documents, is segmented into sentences and terms, we can formulate the text summarization problem as a combinatorial optimization problem.
3. Extractive Summarization as a Combinatorial Optimization Problem
The mathematical formulation of the problem used in
occams is adapted from [
12,
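15]. We let $A = (a_{ij})$ be the term-sentence incidence matrix of size $m$ by $s$ encoding the relationship among all of the terms and sentences found in the documents to be summarized, so that $a_{ij} = 1$ if and only if term $i$ appears in sentence $j$, and $a_{ij} = 0$ otherwise.
The objective function of the optimization problem credits each unique term that a selected sentence contributes to the summary according to its term weight, as discussed in Section 4, which we denote as $w_i$ and which represents the relative importance of term $i$ for the topic of the set of documents to be summarized. The constraint for the optimization problem is the target length $L$ of the summary. Let $c_j$ denote the length of sentence $j$ in whatever units are desired, e.g., words or characters. Let $x_j \in \{0, 1\}$ indicate whether sentence $j$ is selected for the summary, and let $z_i \in \{0, 1\}$ indicate whether term $i$ is covered by the summary.
The optimization problem can be formalized as follows:
$$\max_{x,\,z} \; \sum_{i=1}^{m} w_i z_i$$
subject to
$$\sum_{j=1}^{s} c_j x_j \le L, \tag{1}$$
$$a_{ij} x_j \le z_i \quad \text{for all } i, j, \tag{2}$$
$$\sum_{j=1}^{s} a_{ij} x_j \ge z_i \quad \text{for all } i, \tag{3}$$
$$x_j, z_i \in \{0, 1\} \quad \text{for all } i, j. \tag{4}$$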
Inequality (
1) says that the sum of the lengths of the sentences selected for the summary does not exceed the target length
L, while inequalities (
2)–(
4) together say that the set of terms “covered” by the summary is precisely the union of the sets of terms that occur in the sentences selected for the summary.
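As a concrete illustration of the objective and the length constraint, the sketch below solves a toy instance by exhaustive enumeration of sentence subsets. The function name best_cover and the toy data are assumptions made for this example; exhaustive search takes time exponential in the number of sentences and is shown only to make the formulation tangible, not as the method used by the package.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def best_cover(
    sentence_terms: List[Set[str]],
    lengths: List[int],
    weights: Dict[str, float],
    budget: int,
) -> Tuple[float, Tuple[int, ...]]:
    """Exhaustively maximize the weight of covered terms within the budget."""
    best_value, best_subset = 0.0, ()
    indices = range(len(sentence_terms))
    for r in range(1, len(sentence_terms) + 1):
        for subset in combinations(indices, r):
            # Length constraint: total length of chosen sentences <= budget.
            if sum(lengths[j] for j in subset) > budget:
                continue
            # Covered terms are the union of the terms of the chosen sentences,
            # so a term shared by two sentences is credited only once.
            covered = set().union(*(sentence_terms[j] for j in subset))
            value = sum(weights[t] for t in covered)
            if value > best_value:
                best_value, best_subset = value, subset
    return best_value, best_subset


# Toy instance: three sentences over four terms (illustrative data only).
sentence_terms = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
lengths = [4, 3, 4]
weights = {"a": 3.0, "b": 2.0, "c": 2.0, "d": 1.0}

value, chosen = best_cover(sentence_terms, lengths, weights, budget=8)
print(value, chosen)  # 8.0 (0, 2)
```

Sentences 0 and 2 together use the full budget of 8 and cover all four terms, whereas sentences 0 and 1 are shorter in total but share the term "b" and therefore cover less weight.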
The above formulation is an NP-hard problem [
15]. Still, this binary integer programming problem can be viewed as a budgeted maximum coverage problem, and solved with provably optimal approximation algorithms [
15,
18,
19] or treated as a problem in the more general class of sub-modular approximation algorithms [
20]. These algorithms are
optimal, in that they are guaranteed to achieve an optimal fraction of the best solution. The
occams package uses an optimal approximation algorithm for the budgeted maximum coverage problem and a dynamic programming algorithm for the knapsack problem. These methods were shown empirically to exceed the guaranteed lower bound of approximation (
) and achieve about 94% accuracy [
13] and are about 14 times faster than the corresponding integer programming algorithm. The new implementation, which we present in this paper, achieves or exceeds the same accuracy and has both improved runtime and memory requirements. (See
Appendix A for an example comparing the former implementation and the current one).
The algorithms for solving the budgeted maximum coverage problem and the knapsack problem have been implemented in occams as a Python extension module for better performance. Originally, these were written in C and Cython, but the extension module was recently rewritten in the Rust programming language. We chose Rust because it is a modern, memory-safe systems programming language that compiles to native machine code and achieves comparable speed and memory usage to C. Users of occams benefit from the speed and memory safety of Rust while retaining a familiar Python API.
The algorithm employed for solving the budgeted maximum coverage problem is a more efficient modification of the greedy algorithm given in [
18]. In [
18], each step of the algorithm greedily finds the sentence from those remaining, which maximizes the normalized marginal weight (the sum of the weights of the new terms it would add to the solution divided by its cost). If this optimal sentence is affordable with what remains of the budget, it is added to the solution; otherwise, it is thrown out. This process continues until every sentence has either been selected for the solution or thrown away. This requires
s iterations, where the
i-th iteration maximizes the normalized marginal weight over the remaining $s - i + 1$ sentences.
Our modification reverses the order in which weight and cost are considered. Rather than maximizing the normalized marginal weight over all remaining sentences, we consider only the sentences that are affordable with what remains of the budget. Our algorithm terminates when no remaining sentence is both affordable with the remaining budget and adds new terms not already in the solution. This requires at most $L$ steps since the length of each sentence is an integer and at least 1. In practice, typical sentence lengths measured in words or characters well exceed 1. Our implementation of the greedy algorithm for the budgeted maximum coverage problem requires space of order $O(n)$, since the incidence matrix is stored as a sparse matrix, and has a running time that is at most $O(nL)$, where $n$ is the number of nonzeros in the term–sentence matrix and $L$ is the budgeted length for the summary.
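The affordability-first greedy selection described above can be sketched in a few lines of pure Python. This is an illustrative reading of the prose, not the package's Rust implementation: the function name greedy_budgeted_cover, the set-based data layout, and the tie-breaking by lowest sentence index are all assumptions, and the sketch omits any additional steps the cited algorithm may use to obtain its approximation guarantee.

```python
from typing import Dict, List, Set


def greedy_budgeted_cover(
    sentence_terms: List[Set[str]],
    lengths: List[int],
    weights: Dict[str, float],
    budget: int,
) -> List[int]:
    """Greedy selection restricted to sentences that still fit the budget."""
    covered: Set[str] = set()
    chosen: List[int] = []
    remaining = budget
    candidates = set(range(len(sentence_terms)))
    while candidates:
        best_j, best_ratio = None, 0.0
        for j in sorted(candidates):
            if lengths[j] > remaining:
                continue  # only affordable sentences are considered
            # Marginal weight: weight of the terms not yet covered.
            gain = sum(weights[t] for t in sentence_terms[j] - covered)
            ratio = gain / lengths[j]  # normalized by the sentence length
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is None:
            break  # nothing affordable adds new weight
        chosen.append(best_j)
        covered |= sentence_terms[best_j]
        remaining -= lengths[best_j]
        candidates.remove(best_j)
    return chosen


# Toy instance (illustrative data only).
sentence_terms = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
lengths = [4, 3, 4]
weights = {"a": 3.0, "b": 2.0, "c": 2.0, "d": 1.0}

chosen = greedy_budgeted_cover(sentence_terms, lengths, weights, budget=8)
covered_terms = set().union(*(sentence_terms[j] for j in chosen))
covered_weight = sum(weights[t] for t in covered_terms)
print(chosen, covered_weight)  # [1, 0] 7.0
```

On this toy instance the greedy choice covers a weight of 7.0, whereas selecting sentences 0 and 2 would cover 8.0, which shows why the method is an approximation rather than an exact solver.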
As well as the budgeted maximum coverage problem,
occams also models the sentence selection problem as a knapsack problem; see [
19] for details on the correspondence. Our new Rust extension module improves on the dynamic programming algorithm for the knapsack problem given in that reference and implemented in the original C extension module. The dynamic programming table is now indexed by items and costs, whereas previously it was indexed by items and profits. Using costs instead of profits to index one dimension of the table has two advantages. First, by the very nature of the summarization problem, the maximum total knapsack cost, which is the budgeted summary length, is inherently small. Previously, our knapsack algorithm suffered from requiring space and time of order $O(s^2 P)$, where $P$ is the maximum sum of the weights of the terms of a sentence. Our new implementation requires space and time of order $O(sL)$, which is typically much smaller and, importantly, independent of the scaling of the term weights. Second, in the summarization problem, the costs are positive integers, whereas profits are floating point values. Indexing the table with profits requires approximating the profits with integers.
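A cost-indexed dynamic program for the 0/1 knapsack problem can be sketched as follows. This is the textbook formulation, written under the assumption that each sentence carries a fixed profit and a positive integer cost; the function name knapsack_by_cost and the toy data are illustrative and do not reproduce the package's Rust code. The table has one row per item and one column per unit of budget, and profits stay as floating point values without rounding.

```python
from typing import List, Tuple


def knapsack_by_cost(
    profits: List[float], costs: List[int], budget: int
) -> Tuple[float, List[int]]:
    """0/1 knapsack with a table indexed by (items considered, cost allowed)."""
    n = len(profits)
    # table[k][c] = best profit using the first k items with total cost <= c.
    table = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        p, w = profits[k - 1], costs[k - 1]
        for c in range(budget + 1):
            table[k][c] = table[k - 1][c]  # option 1: skip item k-1
            if w <= c and table[k - 1][c - w] + p > table[k][c]:
                table[k][c] = table[k - 1][c - w] + p  # option 2: take it
    # Walk back through the table to recover the chosen items.
    chosen, c = [], budget
    for k in range(n, 0, -1):
        if table[k][c] != table[k - 1][c]:
            chosen.append(k - 1)
            c -= costs[k - 1]
    return table[n][budget], sorted(chosen)


# Toy instance: float profits, integer costs (illustrative data only).
profits = [5.0, 4.0, 3.0]
costs = [4, 3, 4]
result = knapsack_by_cost(profits, costs, budget=8)
print(result)  # (9.0, [0, 1])
```

Rescaling all profits by a constant changes neither the table dimensions nor the selected items, which is the independence from term-weight scaling noted above.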
Users of
occams do not directly interact with the extension module and its functions implementing the above algorithms. Instead, the
occams API provides a high-level interface to these algorithms in the form of the
SummaryExtractor class. This class is responsible for collecting the required inputs for the summarization problem, converting those inputs to the more abstract formulation used in the combinatorial optimization algorithms, interacting with the extension module implementing those algorithms, and finally converting the results back into concrete, natural language constructs to produce the desired summary. The input to create an instance of
SummaryExtractor is primarily the data discussed in
Section 2 and
Section 4, namely, (a) documents segmented into sentences and terms in the form of
Document objects, and (b) term weights.
In addition to the base
SummaryExtractor class,
occams provides several subclass specializations, each with its baked-in notion of how to compute term weights. These
WeightedSummaryExtractors simplify the setup for a user by computing term weights automatically from the segmented documents the user provides. We discuss one of these subclasses,
TermFrequencySummaryExtractor, in
Section 4.
4. Term Weighting Methods
The linear objective function of the occams algorithm is the sum of the weights of the terms covered by an extractive summary. These term weights are critical to the quality of the summary produced. The occams package provides a number of different ways of computing term weights. Some of these depend only on the counts of the number of sentences or documents containing each of the terms. Such term weight methods are collected as members of a Python enum class called TermFrequencyScheme, described in more detail below. However, we emphasize that users are not limited to these provided schemes for computing term weights. In the end, users may assign term weights however they wish and simply hand these term weights to SummaryExtractor, the summarizer base class, in the form of a dictionary mapping the terms to their weights.
The following are members of the TermFrequencyScheme enumeration:
LOG_COUNTS: The logarithm of the Laplace smoothed number of occurrences of term i; $w_i = \log\left(1 + \sum_{j} a_{ij}\right)$, where $a_{ij}$ is 1 if term i occurs in sentence j and 0 otherwise.
ENTROPY: The scaled smoothed entropy over the sentences; , where .
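POSITIONAL_FIRST: A variation of
LOG_COUNTS which gives double weight to the first sentence in each document, as inspired by [
12]. Formally, $w_i = \log\left(1 + \sum_{j} a_{ij} + \sum_{j \in F} a_{ij}\right)$, where $a_{ij}$ is 1 if term i occurs in sentence j and 0 otherwise, and $F$ is the set of first sentences in the documents.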
POSITIONAL_DENSE: A variation of POSITIONAL_FIRST which replaces the first sentence of each document with the first sentence whose score is above a specified quantile of the sentence scores. Formally, $w_i = \log\left(1 + \sum_{j} a_{ij} + \sum_{j \in D} a_{ij}\right)$, where $D$ is the set of positional dense sentences described below.
POSITIONAL_MEAN: The mean count of the number of times a term occurs in the list of documents to be summarized. This scheme is inspired by the work of [
13].
POSITIONAL_MAX: The maximum count of the number of times a term occurs in the list of documents to be summarized.
POSITIONAL_MIN: The minimum count of the number of times a term occurs in the list of documents to be summarized.
CORE_TERMS: The principal left singular vector of the term-sentence matrix. Assuming
A is irreducible, we let
u be the left principal singular vector of
A.
u can be chosen to be non-negative and used to form term weights. The matrix
A can be assured to be irreducible by adding a small constant, currently set to 0.01, to each entry. This scheme, as well as the next, are inspired by the success of [
21,
22] as well as the recent paper [
23], which gives a theoretical justification. Specifically, spectral clustering of a graph’s adjacency matrix will tend to expose the “core–periphery” structure. Computing the left singular vectors of
A will then tend to partition the graph to separate the key terms from the least essential terms. A scaled copy of this singular vector is added to the overall term counts, and the logarithm of the entries is used for term weights.
CORE_SENTENCE: The principal right singular vector of the term-sentence matrix. Assuming A is irreducible, we let v be the right principal singular vector of A. v can be chosen to be non-negative, and its entries are used as sentence weights. Sentences above the median value are marked as “core” and contribute to a log count term weight. The sentence weights are converted to term weights by computing the matrix–vector product $Av$.
Term weights may be optionally scaled, via an affine transformation, to be between 1 and 100.
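The count-based schemes and the optional affine rescaling can be sketched in pure Python. The code below is one plausible reading of the descriptions of LOG_COUNTS and POSITIONAL_FIRST (add-one smoothing, with the first sentence of each document counted twice); the function names, the nested-list data layout, and the handling of the constant-weight case are assumptions made for this sketch rather than the package's implementation.

```python
import math
from typing import Dict, List

Docs = List[List[List[str]]]  # documents -> sentences -> terms


def sentence_counts(docs: Docs, double_first: bool = False) -> Dict[str, int]:
    """Number of sentences containing each term, optionally doubling firsts."""
    counts: Dict[str, int] = {}
    for doc in docs:
        for position, sentence in enumerate(doc):
            bump = 2 if (double_first and position == 0) else 1
            for term in set(sentence):  # presence in the sentence, not multiplicity
                counts[term] = counts.get(term, 0) + bump
    return counts


def log_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Logarithm of the add-one (Laplace) smoothed counts."""
    return {term: math.log(1 + c) for term, c in counts.items()}


def scale_affine(weights: Dict[str, float], lo: float = 1.0, hi: float = 100.0) -> Dict[str, float]:
    """Affinely map the weights onto the interval [lo, hi]."""
    w_min, w_max = min(weights.values()), max(weights.values())
    if w_max == w_min:
        return {term: lo for term in weights}  # assumption: constant weights map to lo
    slope = (hi - lo) / (w_max - w_min)
    return {term: lo + slope * (w - w_min) for term, w in weights.items()}


# Toy input: two documents with two sentences each (illustrative data only).
docs = [[["x", "y"], ["y", "z"]], [["y"], ["z", "z"]]]

plain = sentence_counts(docs)
first = sentence_counts(docs, double_first=True)
scaled = scale_affine(log_weights(plain))
print(plain, first, scaled)
```

The resulting dictionary maps terms to weights, which is the form in which user-defined weights are handed to the summarizer base class according to the text above.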
We now give more detail on, and motivation for, the POSITIONAL_DENSE scheme. The goal of this term weighting scheme, which is the default for occams, is to produce a robust method that works at least as well as POSITIONAL_FIRST and avoids giving double weight to the first sentence in a document if it has little content. Such low-content sentences in the first position may arise from an error in sentence splitting or a failure to remove a dateline, byline, or boilerplate sentence.
The relative importance of a sentence is given by its per-term log-likelihood score. A sentence is considered “positional dense” if it is the first sentence in a document whose score is above a specified quantile, currently set to 0.1. If all the sentences in a document are below the threshold, then no sentence from this document is chosen. (Such a situation only occurs for multi-document summarization). If we let $p_i$ be the maximum likelihood estimated probability of term $i$, then the average log-likelihood score for sentence $j$ is given by
$$\ell_j = \frac{1}{n_j} \sum_{i=1}^{m} a_{ij} \log p_i,$$
where $m$ is the number of terms in the list of documents, $a_{ij}$ is 1 if term $i$ occurs in sentence $j$ and 0 otherwise, and $n_j$ is the number of terms in sentence $j$.
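The selection of positional dense sentences can be sketched as follows. Several details are assumptions made for this illustration: term probabilities are estimated as relative frequencies over all input documents, the quantile is taken over the pooled scores of all sentences, every sentence is assumed to contain at least one term, and the helper names are hypothetical. The quantile helper uses linear interpolation between order statistics.

```python
import math
from typing import Dict, List, Optional, Sequence

Docs = List[List[List[str]]]  # documents -> sentences -> terms


def quantile_linear(values: Sequence[float], q: float) -> float:
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def sentence_scores(docs: Docs) -> List[List[float]]:
    """Average log-probability of the terms of each sentence."""
    counts: Dict[str, int] = {}
    total = 0
    for doc in docs:
        for sentence in doc:
            for term in sentence:
                counts[term] = counts.get(term, 0) + 1
                total += 1
    log_p = {term: math.log(c / total) for term, c in counts.items()}
    return [[sum(log_p[t] for t in s) / len(s) for s in doc] for doc in docs]


def positional_dense(docs: Docs, q: float = 0.1) -> List[Optional[int]]:
    """Index of the first sentence per document scoring above the quantile."""
    scores = sentence_scores(docs)
    threshold = quantile_linear([x for doc in scores for x in doc], q)
    return [
        next((j for j, x in enumerate(doc_scores) if x > threshold), None)
        for doc_scores in scores
    ]


# Toy input: the first document opens with a low-content, dateline-like sentence.
docs = [
    [["dateline"], ["peace", "talks"], ["talks", "stall"]],
    [["peace", "talks", "resume"], ["talks", "end"]],
]
picks = positional_dense(docs)
print(picks)  # [1, 0]
```

In the toy input, the dateline-like first sentence of the first document falls below the threshold, so the second sentence is marked instead, while the second document keeps its first sentence.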
5. A Simple Example
Before we give an evaluation of occams on a few common summarization datasets, let us illustrate the software with a simple example. We will use the abstract of this paper as the text to be summarized, about 130 words, and generate an extractive summary of at most 50 words. The code below generates such a summary.
The first step is to parse the document, a process which (a) segments the document as a list of sentences; (b) word tokenizes each sentence and stems the word tokens; and (c) forms bigrams as overlapping pairs of tokens. This information is collected into the dataclasses
Document and
Sentence, which we described in
Section 2. For example, the list of terms in the first sentence is given as follows:
[('extract', 'text'), ('text', 'summar'), ('summar', 'select'),
('select', 'a'), ('a', 'small'), ('small', 'subset'), ('subset', 'of'),
('of', 'sentenc'), ('sentenc', 'from'), ('from', 'a'), ('a', 'document'),
('document', 'which'), ('which', 'give'), ('give', 'a'), ('a', 'good'),
('good', 'coverag'), ('coverag', 'of'), ('of', 'a'), ('a', 'document')].
Note that bigrams are denoted as 2-tuples of Porter stemmed and lower-cased tokens, which is the default.
Once the document is prepared, we can create an extract of the document by calling the extract_summary() function. Note that extract_summary() takes a list of documents as input. Here, we give it a list with only one element, but more generally, a list of any number of documents may be provided to compute a multi-document summary. We ask for term weights to be computed with the POSITIONAL_DENSE scheme (which is the default). This method of computing term weights gives double weight to terms in the first sentence in each document, which is dense in the sense that its log probability score exceeds the bottom quantile of 10% (the quantile is computed by numpy.quantile, which uses linear interpolation to estimate the quantile by default). The resulting Extract object contains various information about the sentences that were selected. Calling its summary() method returns those sentences concatenated to form a single string.
For our example, the resulting summary includes the first sentence and one additional sentence, and has a combined total length exactly matching our budget of 50 words:
Extractive text summarization selects a small subset of sentences from a document which give a good “coverage” of a document. The occams package is written in Python and provides an easy-to-use, modular interface, allowing it to work in conjunction with popular Python NLP packages, such as nltk, stanza or spacy.
This summary, while not fluent, does give a good indication of the content of the abstract.
The package also includes a command line interface. Assuming the text we wish to summarize is in the file input.txt, we can reproduce the results of the example above by running the occams command from a shell.
The CLI offers many of the options available in the Python package. See
Appendix B for the full usage of the program.
6. ROUGE Evaluation of Summaries
The
occams package employs
py_rouge (
https://github.com/Diego999/py-rouge, accessed on 8 December 2022), a Python implementation of the ROUGE metric for the evaluation of summaries with English language data. We also provide support for multilingual summary evaluation. This is accomplished by processing the summaries with the same
DocumentProcessor class and then computing a ROUGE score based on the overlap of token n-grams in the machine and human summaries. In this section, we illustrate the performance of the
occams summarizer on several classic summarization datasets.
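For readers unfamiliar with the metric, the following sketch computes ROUGE-N recall as the clipped n-gram overlap between a machine summary and one or more human summaries, pooled over the references. It follows the basic definition of the metric and is not the py_rouge implementation, which may differ in details such as stemming, stop-word handling, and how multiple references are combined. The function names and the example sentences are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of the n-grams occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(
    candidate: Sequence[str], references: List[Sequence[str]], n: int = 2
) -> float:
    """Fraction of reference n-grams that also occur in the candidate."""
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        # Clipped overlap: a reference n-gram is matched at most as many
        # times as it appears in the candidate.
        matched += sum(min(c, cand[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


machine = "the cabinet postponed the vote".split()
human = "the cabinet delayed the vote".split()
score = rouge_n_recall(machine, [human], n=2)
print(score)  # 0.5
```

Two of the four reference bigrams, ("the", "cabinet") and ("the", "vote"), appear in the machine summary, giving a recall of 0.5.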
Table 1 gives results on the Document Understanding Conference 2004 task 2 using
occams with the default term weighting scheme,
POSITIONAL_DENSE, (labeled
occams) and compares it to top systems, including the previous version of OCCAMS (labeled
OCCAMS_V) as reported in [
1]. The results demonstrate that
occams produces summaries among the best scoring on this widely studied dataset, giving the highest ROUGE-2 score.
While the DUC 2004 automatic evaluation is noteworthy, it is instructive to note that when the human summaries are scored in the same way, their scores bracket all but two of the systems listed. For ROUGE-2, the lowest-scoring human summary has a score of about 8.6 and the highest about 10.7. While ROUGE correlates well with content coverage of a summary, it is only a surrogate for human evaluation. It is helpful to compare two summaries, one generated by
occams and one from an abstractive summarization system, ChatGPT (text generated by ChatGPT, 21 December 2022, OpenAI,
https://chat.openai.com/chat: the prompt was “summarize the following text”, and then when requested, the text was pasted into the chat). Here, we have both systems summarize document set d30010t from DUC 2004, a set of 10 documents on the Israeli–Palestinian peace negotiations, the Wye River Accord.
occams: | ChatGPT: |
A defiant Prime Minister Benjamin Netanyahu said on Saturday that Israel would continue to build Jewish neighborhoods throughout Jerusalem, including at a controversial site in the traditionally Arab sector of the city. An Islamic Jihad official in the Syrian capital of Damascus confirmed that the group’s leader Ramadan Abdullah Shallah claimed responsibility for Friday’s bombing in an interview with the Paris-based Radio Monte Carlo. The radical group Islamic Jihad claimed responsibility. The militant Palestinian movement Islamic Holy War said Saturday that it carried out the suicide bombing in a Jerusalem market on Friday, which prompted arrests by the Palestinian Authority overnight. | The Israeli Cabinet has announced that it will postpone a vote on whether to ratify the Wye River Accord until the Palestinian Authority cracks down further on terrorism. The Cabinet also insisted that Yasser Arafat convene the Palestinian National Council to abrogate the Palestinian National Charter, which calls for Israel’s destruction. The decision follows a market bombing in Jerusalem in which two people were killed and 21 were wounded. Hamas, which has tried to stop the peace agreement, claimed responsibility for the attack. In response, the Israeli Cabinet suspended ratification of the Wye agreement until “there is verification that the Palestinian authority is indeed fighting terrorism”, and announced plans to build houses in the controversial Jewish settlement of Har Homa in east Jerusalem. |
We make a few observations about the two automatically generated summaries. First, both contain good content and are informative. However, the ChatGPT abstractive summary flows more naturally than the occams extractive counterpart. The occams summary satisfies the specified word limit of 100 words, whereas the ChatGPT summary uses 120 words. When the prompt was changed (“please limit the summary size to 100 words”) to specify that the length should be less than 100 words, ChatGPT generated 80 words. This is an example of length being more difficult to control with a neural system. On the other hand, abstracts will generally be more concise than extracts.
Finally, we note that there are two hallucinations in the ChatGPT summary. First, the summary gives a partial quote:
In response, the Israeli Cabinet suspended ratification of the Wye agreement until “there is verification that the Palestinian authority is indeed fighting terrorism”, …
While this reads well and is largely consistent with the text, the word “verification” does not appear in any quotes from the document set. The closest matches, found by reading the documents, are the following:
APW19981106.0520: “The government of Israel will resume the discussion of the agreement after it verifies that the Palestinian Authority is taking vigorous steps for a relentless fight against terrorist organizations and their infrastructure”, a Cabinet statement said.
APW19981106.0572: The Cabinet said in a statement that it will only reconvene after “it verifies that the Palestinian Authority takes vigorous steps for an all-out war against terrorist organizations and their infrastructure”.
Secondly, the ChatGPT summary says the following:
The Cabinet also insisted that Yasser Arafat convene the Palestinian National Council to abrogate the Palestinian National Charter, which calls for Israel’s destruction.
However, the original text states the following:
APW19981106.0520: The Cabinet also demanded clarifications from Palestinian leader Yasser Arafat on the procedure for removing sections of the PLO charter calling for Israel’s destruction.
APW19981106.0572: The Cabinet also demanded clarifications from Palestinian leader Yasser Arafat on the procedure for revoking clauses in the PLO founding charter calling for Israel’s destruction.
We note that abrogating (repealing) the entire charter is much stronger language than removing specific sections of the charter. Additionally, demanding clarifications about procedures differs from insisting that a council convene.
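The first check above, whether a quoted span in a generated summary actually occurs in the source documents, can be partly automated. The sketch below is a hypothetical helper written for this illustration (the function name and the case-insensitive, whitespace-normalized matching rule are assumptions); it flags quoted spans with no verbatim match and does not detect paraphrased or unquoted errors such as the second hallucination.

```python
import re
from typing import List


def unsupported_quotes(summary: str, sources: List[str]) -> List[str]:
    """Return quoted spans in the summary that no source contains verbatim."""

    def normalize(text: str) -> str:
        # Collapse whitespace and ignore case when matching.
        return re.sub(r"\s+", " ", text).strip().lower()

    # Capture text between straight or curly double quotation marks.
    quotes = re.findall(r'[“"]([^”"]+)[”"]', summary)
    haystack = [normalize(s) for s in sources]
    return [q for q in quotes if not any(normalize(q) in h for h in haystack)]


sources = [
    "The Cabinet said in a statement that it will only reconvene after "
    "“it verifies that the Palestinian Authority takes vigorous steps for an "
    "all-out war against terrorist organizations and their infrastructure”."
]
summary = (
    "In response, the Israeli Cabinet suspended ratification of the Wye "
    "agreement until “there is verification that the Palestinian authority "
    "is indeed fighting terrorism”."
)
flagged = unsupported_quotes(summary, sources)
print(flagged)
```

An extractive summary passes this check by construction, since every sentence it contains is copied from the input.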
The second dataset we include is the Multi-News dataset [
24]. The default term weighting,
POSITIONAL_DENSE, holds up well against the neural abstractive systems as shown in
Table 2, achieving the third highest ROUGE-2-F1 score as of the writing of this paper [
25]. (These ROUGE-F1 scores were computed using the ROUGE package provided by
huggingface (
https://huggingface.co/spaces/evaluate-metric/rouge, accessed on 8 December 2022), as
occams’ automatic evaluation is designed to use only ROUGE-R (recall), and summaries are limited to a bound length, which parallels the
occams summarization design, in keeping with the “Recall Oriented [Understudy for Gisting Evaluation]” origin of this metric). The alternative approaches are supervised methods; tuning the parameters of the currently supported term weighting methods or learning fully supervised term weights for
occams are logical next steps.
We conclude with an example of
occams’ support for multilingual summarization as discussed in
Section 2 via
nltk [
16],
ersatz [
26], and
stanza [
17]. The performance is illustrated using the default term weighting on the MultiLing 2015 multi-document summarization evaluation (MMS 2015) [
27] dataset. For both the
stanza and
ersatz options, data are sentence split and tokenized. Terms are formed as bigrams of word lemmas or stems, and the
POSITIONAL_DENSE term weighting scheme is employed. Note that for the
ersatz approach, this multilingual sentence splitter is followed by
nltk for tokenization. Of the 10 languages in the MMS 2015 dataset,
nltk supports English, French, and Spanish. For the remaining languages, tokenization is achieved using
nltk English tokenization. In
Table 3, we give the ROUGE-2 Recall results for human-to-human as well as the system from West Bohemia University (WBU) [
28] from MMS 2015 for comparison with
occams. WBU was the highest scoring system in 2015 [
27,
29].
Overall, the performance of
occams varies greatly in Arabic, Chinese, and Romanian. The variation is largely due to the quality of sentence splitting. As an example,
stanza splits the first 10 documents of MMS 2015 Arabic into only 18 sentences, whereas
ersatz finds 89. Some of the sentences are longer than the 250-word budget and cannot be chosen, as
occams summaries are designed to consistently produce a summary not exceeding the target length. On the other hand, the ROUGE scores for Chinese are lower for
ersatz, as
nltk uses white space to tokenize. So regardless of the quality of the sentence splits, the resulting sentence will be broken down into tokens poorly. The
occams(ersatz) results would improve by using a Chinese tokenizer, for example,
jieba (
https://github.com/fxsjy/jieba, accessed on 8 December 2022), or even single-character tokenization.
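The effect of whitespace-based tokenization on Chinese can be seen in a small sketch. Here str.split() stands in for a whitespace-oriented tokenizer and is an assumption of this example rather than the nltk tokenizer itself; the sample sentence (roughly, "the Israeli cabinet postpones the vote") is illustrative and contains no punctuation.

```python
from typing import List, Tuple


def whitespace_tokens(text: str) -> List[str]:
    """Split on whitespace only, as a stand-in for an English-style tokenizer."""
    return text.split()


def char_tokens(text: str) -> List[str]:
    """Single-character tokenization, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Overlapping pairs of consecutive tokens."""
    return list(zip(tokens, tokens[1:]))


sentence = "以色列内阁推迟投票"  # written without spaces, as is usual for Chinese

print(len(whitespace_tokens(sentence)), bigrams(whitespace_tokens(sentence)))
print(len(char_tokens(sentence)), len(bigrams(char_tokens(sentence))))
```

With whitespace splitting, the whole sentence becomes a single token and yields no bigram terms, so it contributes nothing to term coverage, whereas single-character tokenization yields nine tokens and eight bigrams.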
7. Conclusions
In this paper, we introduced occams, a fast, robust, and flexible software package for multilingual single- and multi-document extractive summarization. occams allows users to choose their own NLP library for the document segmentation step, but it also comes with ready-to-go, built-in support for nltk and stanza, the latter of which supports 70 languages. The package provides a number of means of computing term weights; we gave an overview of one family of these methods and went into detail on the default term weight method POSITIONAL_DENSE. We explained the budgeted maximum coverage problem used by occams to model the extractive summarization problem and an optimal approximation algorithm for solving it. A new Python extension module for occams written in Rust contains efficient implementations of this optimal approximation algorithm as well as a dynamic programming algorithm for the knapsack problem.
We gave a simple example showing how to use occams with its default options to extract a summary from the text. Then, we illustrated the performance of occams with its default term weighting method on DUC 2004, MultiLing 2015, and Multi-News. In each case, the approach was shown to be very competitive with the state-of-the-art methods, including neural-net-based methods, which have more demanding computational requirements as well as the need for training data. Finally, we illustrated with an example that while large neural language models generate more fluent summaries, hallucinations are possible and sometimes subtle. In the near term, extractive summarization provides an alternative to abstractive summarization and could be used in conjunction with abstractive methods to facilitate fact checking. User interfaces will need to be adapted to support this hybrid approach.