1. Introduction
In our data-driven digital era, automatic text summarization has become indispensable for efficient information management. This technology aims to produce concise summaries from source texts without human intervention, offering significant time savings across various domains. Beyond facilitating rapid content digestion, it addresses information storage challenges by compressing documents. Its applications span numerous fields including news aggregation, medical record synthesis, educational material condensation, and web content browsing.
The field traces its origins to the pioneering work on technical paper abstracts by Luhn [1]. Modern summarization systems handle both single-document and multi-document inputs, producing outputs that are either extractive (selecting key sentences) or abstractive (generating novel formulations). These outputs range from headlines to full summaries [2]. While extractive methods dominate current research, abstractive approaches, particularly for morphologically rich languages like Arabic, remain underdeveloped due to their inherent complexity involving paraphrasing, sentence fusion, and novel word generation. This challenge is compounded by the slow progress in developing appropriate evaluation metrics.
Current approaches to abstractive summarization employ diverse techniques, including deep learning, discourse analysis, graph-based methods, and hybrid systems. Notably absent from this landscape is the application of Swarm Intelligence (SI) algorithms, despite their proven success in other NLP tasks. While SI methods like Ant Colony Optimization [3,4], Particle Swarm Optimization [5], and Cat Swarm Optimization [6] have shown promise in extractive summarization, their potential for abstractive summarization remains largely unexplored, particularly for Arabic.
We address this gap by framing abstractive summarization as a multi-objective optimization problem and proposing AASAC (Arabic Abstractive Summarization using Ant Colony). Our approach builds upon the ant colony system’s (ACS) proven effectiveness in pathfinding problems like the Traveling Salesman Problem (TSP), adapting it for linguistic optimization.
A key advantage of nature-inspired optimization methods, and non-deep learning approaches in general, is their transparency and credibility, in contrast to the opaque nature of deep neural networks. While deep learning models often suffer from interpretability issues and hallucination [7], AASAC provides full traceability, allowing step-by-step scrutiny of its summarization process. This explainability is particularly valuable in abstractive summarization, where transparency ensures credible and controllable output generation. Key contributions include the following:
Nature-Inspired Algorithm: We introduce AASAC (Arabic Abstractive Summarization using Ant Colony), a novel approach for Arabic abstractive summarization that leverages the ant colony system (ACS), a nature-inspired algorithm. This innovative technique leads to superior summarization results.
Expanded Dataset: We have expanded the dataset introduced in [8] by incorporating human-generated abstractive summaries. This dataset expansion facilitates a more comprehensive evaluation process, and the dataset is readily accessible to fellow researchers.
Semantic Feature Integration: To enhance the efficacy of summarization, we incorporate semantic features as the foundation for fitness functions. This approach significantly enhances the capacity to generate high-quality summaries.
Linguistically Aware Evaluation: Recognizing the unique linguistic characteristics of the Arabic language, we advocate the use of the LemmaRouge evaluation metric. This measure accounts for the subtleties and intricacies specific to Arabic, providing more linguistically aware assessment.
The rest of the paper is organized as follows: In Section 2, we provide an overview of related work in the field of abstractive text summarization. Section 3 details the formulation of the ATS problem. Our proposed summarizer is explained in Section 4. Experimental results and discussion are presented in Section 5. Finally, in Section 6, we conclude the paper and discuss potential directions for future research.
2. Related Works
The literature recognizes that abstractive text summarization systems are limited by the complexities of natural language processing. These challenges have nonetheless attracted considerable research interest, prompting the exploration of various methods to obtain abstractive summaries, such as graph-based and semantic-based techniques.
Graph-based methods have traditionally dominated the field. These methods represent the text with a graph data structure and determine an optimal path for generating a summary. For instance, Opinosis [9] and Kumar et al. [10] adopted this approach, with Opinosis incorporating shallow NLP techniques and the latter employing a bigram model to identify significant bigrams for summary generation. These techniques excel at extracting essential information and producing concise summaries, but they do not involve sentence paraphrasing or the use of synonyms. Some summarization systems initially utilize extractive methods and then transition to generating abstractive summaries, as demonstrated by COMPENDIUM [11,12].
Another approach involves employing a semantic graph reduction technique, as demonstrated in [13]. Their summarization method starts by creating a rich semantic graph (RSG), which serves as an ontology-based representation. This RSG is then transformed into a more abstracted graph, from which the abstractive summary is generated. The same authors further explored the use of RSGs for Arabic text representation in subsequent work on Arabic summarization [14,15]. Additionally, another study applied this technique to summarize Hindi texts [16]. In their model, the authors harnessed Domain Ontology and Hindi WordNet to select diverse synonyms, thus enriching the summary generation process.
Furthermore, Le and Le [17] introduced an abstractive summarization method for Vietnamese that distinguishes itself from [9,18] by incorporating anaphora resolution. This approach effectively tackles the challenge of handling diverse words or phrases that represent the same concept, even when they appear in different nodes of the graph. The summarizer uses Rhetorical Structure Theory (RST) to streamline sentences: it removes less important and redundant clauses from the beginning of a sentence and then reconstructs the refined sentence based on syntactic rules. The technique is elegantly straightforward, amalgamating multiple sentences represented within a word graph by means of three predefined cases.
One of the earliest Arabic abstractive summarization systems, presented by Azmi and Altmami [8], was built on a successful RST-based Arabic extractive summarizer [19,20]. In this approach, the sentences generated from the original text are first shortened by removing certain words, such as position names and days. Sentence reduction rules are then applied to create an abstractive summary. However, this method may produce non-coherent summaries owing to the absence of paraphrasing. For the Malayalam language, Kabeer and Idicula [21] employed a semantic graph based on POS (Part-of-Speech) tagging. They used a set of features to assign weights to the relationships between two nodes representing the subject and object of a sentence, culminating in a reduced graph from which an abstractive summary was derived. For Kannada, a guided summarizer called sArAmsha [22] relied on lexical analysis, Information Extraction (IE), and domain templates to generate sentences. A similar approach has also been implemented for the Telugu language [23].
Another relevant line of work emphasizes the role of text segmentation in improving summarization quality. SEGBOT [24] introduced a neural end-to-end segmentation model that leverages bidirectional recurrent networks and a pointer mechanism to detect text boundaries without hand-crafted features. It addresses key challenges in segmenting documents into topical sections and sentences into elementary discourse units (EDUs), which are crucial to structuring coherent summaries. SEGBOT's outputs have also been shown to enhance downstream applications such as sentiment analysis, suggesting its broader utility in discourse-aware summarization pipelines.
In a related direction, Chau et al. [25] introduced DocSum, a domain-adaptive abstractive summarization framework specifically designed for administrative documents. This work addresses key challenges in processing such texts, including noisy OCR outputs, domain-specific terminology, and scarce annotated data. DocSum employs a two-phase training strategy: (1) pre-training on noisy OCR-transcribed text to enhance robustness, followed by (2) fine-tuning with integrated question–answer pairs to improve semantic relevance. When evaluated on the RVL-CDIP dataset, DocSum demonstrated consistent improvements over a BART-base baseline, with ROUGE-1 scores increasing from 49.52 to 50.72 (+1.20%). Smaller but statistically significant gains were observed in ROUGE-2 (+1.14%) and ROUGE-L (+0.96%). These results highlight the framework's ability to handle domain-specific nuances while maintaining summary coherence.
Sagheer and Sukkar [26] introduced a hybrid system that combines knowledge base and fuzzy logic techniques for processing domain-specific Arabic text. The system leverages predefined concepts associated with the domain: the knowledge base identifies concepts within the input text and extracts the semantic relations between them. The generated sentences comprise three essential components: subject, verb, and object. Multiple sentences are produced from the identified concepts and their relations, and a fuzzy logic system then computes a fuzzy value for each word in a sentence, using fuzzy rules and defuzzification to rank the summary sentences in descending order of their fuzzy values. The system was evaluated on texts from the Essex Arabic Summaries Corpus; however, no specific evaluation method was applied to systematically assess and compare it against other techniques.
In the realm of machine learning and deep learning, Rush et al. [27] introduced a neural attention-based summarization model that utilizes a feed-forward neural network language model (NNLM) [28] and an attention-based model [29] to generate headlines with fixed word lengths. However, this model has certain limitations: it tends to summarize each sentence independently, relies on the source text's vocabulary, and sometimes constructs sentences with incorrect syntax, as it focuses on reordering words. To address some of these issues, Chopra et al. [30] developed the recurrent attention summarizer (RAS), which incorporates word positions and their word-embedding representations to handle word-ordering challenges. Additionally, the encoder–decoder recurrent neural network (RNN) [29] has been a fundamental component of many abstractive summarization models. Nallapati et al. [31] enhanced the model of [29] by adding an attention mechanism and applying the large vocabulary trick (LVT) [32]. They also tackled the problem of out-of-vocabulary (OOV) words by introducing a switching generator–pointer model [33]. However, a drawback of this model was the generation of repetitive phrases, which was mitigated to some extent by employing a Temporal Attention model [34]. Another approach improved the handling of OOV words by learning when to use the pointer and when to generate a word [35], a technique that enhanced the model of [31].
The issue of repetition was further addressed by incorporating the coverage model [36] and by implementing an intra-decoder attention mechanism [37]. In Arabic headline summarization [38], a pointer–generator model [35] with an attention mechanism [29] served as a baseline, and a variant with a copy mechanism [33] was developed. The latter model demonstrated improved results compared with the baseline. Notably, the model with a copy mechanism and a length penalty outperformed other variants that incorporated coverage penalties or length and coverage penalties, largely due to considerations related to summary length limitations. To evaluate these models, an Arabic headline summary (AHS) dataset was created. Additionally, another Arabic study explored sequence-to-sequence models with global attention for generating abstractive headline summaries [39]. The authors examined the impact of the number of encoder layers for three types of networks: gated recurrent units (GRUs), LSTM, and bidirectional LSTM (BiLSTM). Evaluation using ROUGE and BLEU measures on the AHS dataset [38] and Arabic Mogalad_Ndeef (AMN) [40] indicated that the two-layer encoder for GRUs and LSTM achieved better results than the single-layer and three-layer configurations, whereas the three-layer BiLSTM encoder outperformed the single-layer and two-layer configurations. Notably, utilizing AraBERT [41] in the data preprocessing stage contributed to improved results.
Furthermore, the RNN architecture initially proposed by [31] was modified into a multi-layer encoder, single-layer decoder summarization model tailored to Arabic [42]. The encoder incorporates three hidden-state layers for the input text, keywords, and named entities in the text. These layers employ bidirectional LSTM and feature a global attention mechanism for enhanced performance.
Pre-trained language models, including BERT (Bidirectional Encoder Representations from Transformers) [43] and BART (Bidirectional and Auto-Regressive Transformers) [44], have found applications in abstractive summarization tasks. BART, in particular, distinguishes itself from BERT by pre-training both the bidirectional encoder and the auto-regressive decoder. To harness the power of BERT for text summarization, BertSum (a BERT architecture for summarization) [45] was introduced, offering two variants: BertSumExt and BertSumAbs. The former focuses on extractive summarization, while the latter targets abstractive summarization. BertSumAbs employs an encoder–decoder architecture [35] in which the encoder is a pre-trained BertSum and the decoder is a randomly initialized transformer. To strike a balance between overfitting and underfitting, the encoder and decoder were trained with Adam optimizers using distinct learning rates and warm-up steps.
For Arabic abstractive summarization, Elmadani et al. [46] utilized multilingual BERT [47] to train BertSumAbs and evaluated its performance on the KALIMAT dataset by using ROUGE scores. In a separate effort, Kahla et al. [48] fine-tuned multilingual transformer models, including BERT and BART, for the Arabic abstractive summarization task. They also leveraged AraBERT [41], a BERT model specifically trained for Arabic. These models were fine-tuned using a corpus collected from Arabic Deutsche Welle (DW) news (https://www.dw.com/ar/, accessed on 5 August 2021). Furthermore, they introduced a cross-lingual transfer-based approach by initially fine-tuning multilingual BERT for abstractive Hungarian and English text summarization and subsequently fine-tuning it for Arabic using the same corpus. Automatic and human evaluations demonstrated that multilingual BERT trained from English outperformed the other models, producing summaries that closely resembled the original lead, whereas the other models tended to generate summaries longer than the lead or containing more grammar and context errors.
Another noteworthy development is AraBART [49], an Arabic pre-trained BART model. AraBART underwent fine-tuning for Arabic abstractive summarization tasks, utilizing datasets from Arabic Gigaword [50] and XL-Sum [51]. The evaluation results revealed that AraBART outperformed the pre-trained Arabic BERT-based model [52], the multilingual mBART model [53], and the mT5 model [54].
Additive Manufacturing (AM) constructs objects layer by layer from digital models, generating extensive unstructured textual data such as design rationales, process parameters, and material specifications. Efficient organization of this knowledge is essential to Design for Additive Manufacturing (DFAM), where traceability and interpretability are critical to informed decision making. Abstractive summarization offers a scalable solution by condensing complex AM content into coherent and actionable insights. Recent advances enhance factual consistency in summarization by integrating structured knowledge extraction methods, including triple classification and knowledge graphs (KGs). For example, AddManBERT [55] employs dependency parsing to extract semantic relations between AM entities (e.g., material–process dependencies) and encodes them as vector representations. Complementary work utilizes neural models with meta-paths to capture hierarchical semantics between entities and relations, while KG-based methods support scalable triple classification from multi-source Fused Deposition Modeling data. These techniques have demonstrated superior classification accuracy and computational efficiency compared with rule-based systems, underscoring the value of KG-augmented summarization in AM knowledge management.
Table 1 provides an overview of abstractive text summarization studies, including details about the corpus used and the scope of the summary. The summary scope can fall into one of three categories: headline, sentence level (where a single sentence serves as the summary), or document level (which generates multiple or a few lengthy sentences).
Today, deep learning forms the backbone of most abstractive summarization models [7]. However, its effectiveness typically assumes access to large-scale training data and substantial computational resources, conditions often unmet for Arabic and other morphologically rich languages. Our ACS-based approach offers a strategically compelling alternative by addressing four critical limitations of neural methods. First, it operates effectively with limited labeled data, making it suitable for specialized Arabic domains where annotated corpora are scarce. Second, its explicit modeling of Arabic root patterns and collocations through interpretable fitness functions introduces morphological awareness, which is often lacking in transformer-based systems without extensive pre-training. Third, the framework inherently supports multi-objective optimization, enabling precise trade-offs between competing priorities such as content density and readability, a capability that requires complex architectural modifications in neural models. Finally, ACS achieves competitive performance without reliance on GPUs, thereby democratizing access to abstractive summarization for Arabic NLP researchers and practitioners with constrained infrastructure. Rather than opposing the neural paradigm, this work expands the methodological repertoire for Arabic NLP, particularly in low-resource, high-interpretability scenarios. The success of AASAC further suggests promising directions for hybrid systems combining neural fluency with nature-inspired optimization.
3. Problem Formulation
The ant colony system (ACS) algorithm, an enhanced variant of Ant Colony Optimization (ACO) [56], provides an effective framework for addressing our multi-objective Arabic abstractive summarization challenge. As a population-based metaheuristic, ACS mimics the emergent intelligence of natural ant colonies, where artificial ants collaboratively explore solution paths while dynamically updating pheromone trails to guide subsequent searches toward optimal solutions. This biologically inspired approach is particularly suited to our task, as it efficiently balances multiple competing objectives (content coverage, linguistic coherence, and summary conciseness) through its distributed optimization mechanism.
We formulate the abstractive summarization task as a graph-based optimization problem in which the source document is represented as a connected network of word nodes. Unlike previous ACO applications in extractive summarization [3] that treated entire sentences as nodes, our AASAC approach operates at the word level to enable finer-grained abstractive generation. Each node encapsulates lexical, collocational, and semantic features that collectively inform our multi-component fitness function. The ACS agents navigate this linguistic landscape, with pheromone dynamics reflecting both local heuristic information (word relations) and global summary quality metrics.
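To make this representation concrete, the following minimal sketch builds a word-level graph in which each node carries simple lexical features (first position and frequency) and each directed edge records sequential adjacency. The whitespace tokenizer, attribute names, and the use of networkx are purely illustrative assumptions; the system described later in this paper stores its graph in Neo4j.

```python
import networkx as nx
from collections import Counter

def build_word_graph(text):
    """Build a directed word-level graph: one node per distinct word,
    one edge per observed word-to-word transition (illustrative only)."""
    words = text.split()                      # placeholder tokenizer
    freq = Counter(words)
    g = nx.DiGraph()
    for pos, w in enumerate(words):
        if w not in g:
            # node features: first position and frequency (hypothetical weighting)
            g.add_node(w, first_pos=pos, freq=freq[w])
    for prev, nxt in zip(words, words[1:]):
        # edge weight counts how often this word-to-word transition occurs
        if g.has_edge(prev, nxt):
            g[prev][nxt]["count"] += 1
        else:
            g.add_edge(prev, nxt, count=1)
    return g
```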
This formulation advances prior work in three key aspects: (1) the graph representation preserves Arabic-specific morphological and syntactic relationships critical to abstractive generation; (2) the optimization process simultaneously considers semantic preservation and linguistic fluency through specialized fitness functions; and (3) the solution path directly feeds into a generation module that produces human-like summaries rather than extracted fragments.
Table 2 summarizes the mathematical notation for our ACS adaptation to this novel domain.
Our approach can be outlined as follows. Consider a set of words $W = \{w_1, w_2, \ldots, w_{|W|}\}$ representing the words in the original document. Each word $w_i$ is associated with a cost that takes into account factors such as its position in the document and its frequency of occurrence. These words are interconnected through edges $e_j$, with each edge carrying a cost that signifies the sequential relationship between the connected words. Importantly, a word can be linked to multiple other words. Our ultimate objective is to construct a summary by identifying a set of words that maximizes the following expression:
$$\sum_{i=1}^{|W|} \sum_{j=1}^{E} c_{ij}\, x_i, \qquad (1)$$
subject to the length constraint
$$\sum_{i=1}^{|W|} x_i \le L, \qquad (2)$$
where $x_i$ assumes a value of 1 when word $w_i$ is chosen and 0 otherwise, $|W|$ signifies the total number of words in the document, $E$ represents the count of edges in the document, $c_{ij}$ denotes the cost attributed to the combination of word $i$ and edge $j$, and $L$ serves as the constraint that limits the overall length of the selected summary words.
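A minimal sketch of how a candidate selection could be scored against this objective, assuming the costs $c_{ij}$ are supplied as a dictionary keyed by (word index, edge index) and that the selection set and length budget follow the definitions above (all names are illustrative):

```python
def summary_score(costs, selected, length_budget):
    """Evaluate a candidate word selection against the objective above.

    costs         : dict mapping (word_index, edge_index) -> cost c_ij
    selected      : set of chosen word indices (those with x_i = 1)
    length_budget : maximum number of words allowed in the summary (L)
    """
    if len(selected) > length_budget:      # violates the length constraint
        return float("-inf")
    # sum c_ij over every chosen word i and every edge j incident to it
    return sum(cost for (i, _j), cost in costs.items() if i in selected)
```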
The ACS algorithm consists of three main steps in each iteration. Initially, every ant constructs a solution path, essentially creating a word summary. Subsequently, it identifies the best path among all those generated by the ants up to that point. Lastly, there is a global update of the pheromone level for this best path.
In the process of constructing a solution, unlike in the TSP, where all nodes are explored, each ant $k$ adds an edge labeled $j$ to its path and adjusts the edge's pheromone level. This process continues until the path reaches a predefined threshold $L$, which limits the length of the summary. The selection of edge $j$ over another candidate edge follows a pseudo-random-proportional rule described by Equation (3):
$$j = \begin{cases} \arg\max_{u \in N_k} \left\{ \tau_u \, [\eta_u]^{\beta} \right\} & \text{if } q \le q_0, \\ J & \text{otherwise,} \end{cases} \qquad (3)$$
where $\tau_u$ represents the pheromone level of edge $u$, $\eta_u$ denotes the heuristic information value for edge $u$, $\beta$ is a parameter that determines the relative importance of the heuristic information value, and $q$ and $q_0$ are real values ranging from 0 to 1. Additionally, $N_k$ represents the set of nearest-neighbor nodes that have not been selected by ant $k$, which essentially comprises the $n$-gram words originating from the current word. The value of $J$ is given by Equation (4),
$$p_j^k = \frac{\tau_j \, [\eta_j]^{\beta}}{\sum_{u=1}^{U} \tau_u \, [\eta_u]^{\beta}}, \quad j \in N_k, \qquad (4)$$
where $j \in N_k$ signifies that $j$ belongs to the set of nearest-neighbor nodes not chosen by ant $k$, and $U$ represents the count of available nodes that have not yet been selected by ant $k$; note that the denominator of the sum is not zero. When an ant selects an edge, the local update of the edge's pheromone level takes place using Equation (5),
$$\tau_j = (1 - \xi)\, \tau_j + \xi\, \tau_0, \qquad (5)$$
where the evaporation rate $\xi$ is a real value within the range of 0 to 1 and $\tau_0$ is the initial pheromone level.
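As a rough, generic ACS-style sketch of the edge-selection rule and the local pheromone update (the candidate list, score form, and parameter names are assumptions for illustration, not the exact AASAC implementation):

```python
import random

def choose_next_edge(candidates, tau, eta, beta, q0):
    """Pseudo-random-proportional choice in the spirit of Equations (3) and (4).

    candidates : edge ids in the ant's candidate list N_k
    tau, eta   : dicts of pheromone levels and heuristic values per edge
    """
    score = {j: tau[j] * (eta[j] ** beta) for j in candidates}
    if random.random() <= q0:
        # exploitation: take the best-scoring edge
        return max(score, key=score.get)
    # exploration: sample an edge proportionally to its score
    total = sum(score.values())
    r, acc = random.uniform(0, total), 0.0
    for j, s in score.items():
        acc += s
        if acc >= r:
            return j
    return j  # fallback for floating-point rounding

def local_update(tau, j, xi, tau0):
    """Local pheromone update after traversing edge j (cf. Equation (5))."""
    tau[j] = (1.0 - xi) * tau[j] + xi * tau0
```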
In the ACS algorithm's second step, the aim is to identify the best-so-far path among the set of solutions created by the ants, and this is determined based on a fitness function. Finally, the pheromones associated with the best-so-far path are updated globally using Equation (6),
$$\tau_j = (1 - \rho)\, \tau_j + \rho\, F_{bs}, \quad \forall j \in \text{best-so-far path}, \qquad (6)$$
where $\rho$ represents the global pheromone evaporation rate and $F_{bs}$ signifies the fitness value for the best-so-far solution. We will introduce and define two functions relevant to this process in Section 4.3.
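The sketch below ties the three steps of each iteration together. Here `construct_path` is assumed to repeatedly apply the edge-selection and local-update routines sketched above until the length budget is reached, and `fitness` stands in for one of the fitness functions of Section 4.3; both are placeholders, not the paper's exact procedures.

```python
def global_update(tau, best_path, rho, best_fitness):
    """Global pheromone update along the best-so-far path (cf. Equation (6))."""
    for j in best_path:
        tau[j] = (1.0 - rho) * tau[j] + rho * best_fitness

def run_acs(graph, tau, eta, params, construct_path, fitness, iterations):
    """Skeleton of one ACS run: construct paths, keep the best, update globally."""
    best_path, best_fit = None, float("-inf")
    for _ in range(iterations):
        # step 1: every ant builds a candidate word path (a draft summary)
        paths = [construct_path(graph, tau, eta, params)
                 for _ in range(params["num_ants"])]
        # step 2: keep the best-so-far path according to the fitness function
        for p in paths:
            f = fitness(p)
            if f > best_fit:
                best_path, best_fit = p, f
        # step 3: reinforce the best-so-far path globally
        global_update(tau, best_path, params["rho"], best_fit)
    return best_path
```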
5. Results and Discussion
A set of experiments was conducted to evaluate the AASAC system. This section describes the ant colony system (ACS) parameter settings, the dataset used for evaluation, the evaluation metric, the experimental results, the human evaluation, and the discussion.
To facilitate the integration of ACS with Neo4j, we developed a custom Cypher procedure using the Java programming language. This procedure was then called from Python 2.7 for executing our experiments. All experiments were conducted on a Mac OS X 11.7.3 system equipped with a 2.3 GHz Quad-Core Intel Core i7 processor and 32 GB of RAM.
5.1. Experimental Setup
ACS employs a range of parameters, each with specific values. The termination condition was a fixed number of iterations. Given that ACS incorporates local search, the number of ants, $m$, was deliberately chosen to be relatively small. Following the guidelines presented in [56], the global evaporation rate $\rho$ and the local evaporation rate $\xi$ were both set to 0.1. The weight assigned to heuristic information, $\beta$, was configured to 2.0, and $q_0$ was set to 0.7. Finally, the initial pheromone trail was calculated as $\tau_0 = 1/(n \cdot d)$, where $n$ represents the number of nodes (words) in the graph and $d$ denotes the nearest-neighbor distance.
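For reference, a hedged sketch of this parameter initialization; only the values stated above ($\beta$, the evaporation rates, $q_0$, and the form of $\tau_0$) come from the setup, while the ant and iteration counts below are placeholders rather than the paper's actual settings.

```python
def initial_pheromone(num_nodes, nn_distance):
    """tau_0 = 1 / (n * d): n graph nodes, d the nearest-neighbor distance."""
    return 1.0 / (num_nodes * nn_distance)

ACS_PARAMS = {
    "beta": 2.0,        # weight of heuristic information (stated in the setup)
    "xi": 0.1,          # local evaporation rate (stated in the setup)
    "rho": 0.1,         # global evaporation rate (stated in the setup)
    "q0": 0.7,          # exploitation threshold (stated in the setup)
    "num_ants": 10,     # placeholder: the exact value is not reproduced here
    "iterations": 100,  # placeholder: the exact value is not reproduced here
}
```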
We performed several experiments to generate summaries at 30% and 50% of the original text length. One experiment, called H1FF1, utilized the first heuristic information function along with the first fitness function. Another experiment, named H2FF2, employed the second heuristic information function with the second fitness function.
Table 4 lists all the experiments.
We employed three different settings for the candidate list, namely, 1-gram, 2-gram, and 3-gram, to generate both H1FF1 and H2FF2 summaries. To indicate the n-gram used, the experiment names are further marked with -n. Consequently, the variations are denoted by HiFFj-n, indicating the ACS variation that uses heuristic information function i with fitness function j and an n-gram candidate list, where i, j ∈ {1, 2} and n ∈ {1, 2, 3}. For instance, H1FF1-1, H1FF1-2, and H1FF1-3 are our ACS variations that use the first heuristic information function with the first fitness function and 1-gram, 2-gram, and 3-gram candidate lists, respectively.
5.2. Dataset
Due to the absence of a standardized Arabic single-document abstractive summary dataset, we utilized a subset of the data shared by [8]. The subset comprises 104 documents of varying lengths, with an average of 239 words each. These documents were accompanied by system-generated summaries at 30% and 50% of the original document size, which we considered a baseline.
The documents were collected from different sources, including 79 documents from the Saudi Arabian newspapers Al-Riyadh (https://www.alriyadh.com, accessed on 5 August 2021) and Al-Jazirah (https://www.al-jazirah.com, accessed on 5 August 2021), and 25 documents from the Lebanese newspapers Al-Joumhouria (https://www.aljoumhouria.com, accessed on 5 August 2021) and An-Nahar (https://www.annahar.com, accessed on 5 August 2021). The topics covered in these documents encompassed a range of subjects such as general health, sports, politics, business, and religion.
Figure 8 displays the distribution of the document lengths based on the number of words. The x-axis depicts the document word counts grouped into categories, while the y-axis represents the number of documents falling into each category. The majority of documents contained between 200 and 300 words.
Since the dataset lacked human-authored summaries for comparison, we sought human professionals to summarize the documents to 30% and 50% of their original size. The complete dataset comprising all 104 documents, alongside their corresponding 30% and 50% human summaries, is freely accessible to interested parties.
5.3. Evaluation Metric
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [59] is a common evaluation measure for summaries. It compares a generated summary with one or more reference summaries according to a set of metrics: ROUGE-N compares $n$-grams, ROUGE-L compares the longest common word sequence, and ROUGE-SU counts skip-bigrams together with unigrams and any pair of words between them. For example, ROUGE-N is computed as follows:
$$\text{ROUGE-N} = \frac{\sum_{S \in RS} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in RS} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)},$$
where $S$ is a reference summary, $RS$ is the set of reference summaries, $N$ (in ROUGE-N) is the length of the $n$-gram, and $\text{Count}_{\text{match}}(\text{gram}_n)$ is the maximum number of $n$-grams co-occurring in a candidate summary and a set of reference summaries. However, Arabic is a morphologically rich language; therefore, using surface-form ROUGE to evaluate Arabic texts does not yield a valid comparison.
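A small sketch of ROUGE-N recall under the definition above; the tokenization is simplified and the clipped matching is one straightforward reading of the co-occurrence count.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate_tokens, reference_token_lists, n):
    """ROUGE-N recall: matched n-grams over total reference n-grams."""
    cand_counts = Counter(ngrams(candidate_tokens, n))
    matched = total = 0
    for ref in reference_token_lists:
        ref_counts = Counter(ngrams(ref, n))
        total += sum(ref_counts.values())
        # clipped match: an n-gram cannot match more often than it
        # appears in the candidate summary
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0
```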
To solve this problem, we used the LemmaRouge metric [60], which converts summary words into lemmas as a unified surface form before applying the ROUGE toolkit [61]. LemmaRouge-N is given by
$$\text{LemmaRouge-N} = \frac{\sum_{S \in RS} \sum_{\text{lemma-}n \in S} \text{Count}_{\text{match}}(\text{lemma-}n)}{\sum_{S \in RS} \sum_{\text{lemma-}n \in S} \text{Count}(\text{lemma-}n)},$$
where lemma-$n$ is a sequence of the lemmas of $n$ consecutive words from a given text.
The use of a lemma-based ROUGE metric for evaluating Arabic text summarization systems is advantageous because of the morphological complexity of the Arabic language. By considering the lemma form of words, which captures the root meaning while disregarding variations in inflection and derivation, the metric provides a more accurate measure of semantic similarity between system-generated summaries and the original text. This approach better accounts for the unique linguistic characteristics of Arabic and improves the evaluation and development of abstractive text summarization systems for Arabic and other Semitic languages. An example highlighting the advantages of using LemmaRouge for Arabic text is shown in Figure 9.
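Building on the ROUGE-N sketch above, LemmaRouge can be illustrated as lemmatization followed by the same n-gram matching. The `lemmatize` callable is a placeholder for an Arabic morphological analyzer and is not specified here.

```python
def lemma_rouge_n(candidate, references, n, lemmatize):
    """LemmaRouge-N: map every word to its lemma, then apply ROUGE-N.

    lemmatize : callable mapping a surface word to its lemma, e.g. backed by
                an Arabic morphological analyzer (placeholder).
    """
    cand_lemmas = [lemmatize(w) for w in candidate.split()]
    ref_lemmas = [[lemmatize(w) for w in ref.split()] for ref in references]
    return rouge_n_recall(cand_lemmas, ref_lemmas, n)  # reuses the sketch above
```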
5.4. Evaluation Results
As most Arabic abstractive summarizers generate a single sentence, we compared the performance of our system against the results obtained from the summarizer in [8], which we considered a baseline and refer to as ANSum. We also applied lemmatization to the ANSum summary texts and the reference summary texts.
The LemmaRouge-1 and LemmaRouge-2 metrics were used to measure the coverage of salient information, and LemmaRouge-L was used to evaluate fluency.
Table 5 reports the LemmaRouge recall and F-measure scores for summaries that are 30% of the original length. Each reported score represents the average of three runs. The ANSum LemmaRouge scores are the results of comparing the reference summaries with the summaries generated by the system in [8]. As mentioned earlier, we have variants of each of the experiments listed in Table 4; the specific variant depends on the candidate list, which can be 1-gram, 2-gram, or 3-gram. For example, H2FF1-3 indicates an experiment that uses the second heuristic information function with the first fitness function and a 3-gram candidate list.
The results show that the H2FF2-1 variant achieved the highest recall, outperforming all other variants and the ANSum system. It also attained superior scores across all metrics except for LemmaRouge-2, indicating that relation and collocation features enhance performance.
When these features were excluded, the H1FF1-2 variant—using a 2-gram candidate list—performed better than 1-gram or 3-gram lists in terms of LemmaRouge-1 and LemmaRouge-L. The 3-gram candidate lists showed no improvement regardless of feature inclusion. We attribute this to three key factors: (1) the combinatorial sparsity of possible 3-gram combinations reduces their utility as building blocks, (2) longer n-grams impose rigid structural constraints that limit sentence generation flexibility, and (3) the fixed-length structure hinders dynamic optimization between salient coverage and fluency.
The LemmaRouge recall and F-measure scores for the evaluations of summaries at 50% of the original text size are reported in Table 6. All our AASAC system variants gained a higher recall score than ANSum, as well as higher F-measure scores for LemmaRouge-1 and LemmaRouge-L. In general, variant H2FF2-2 achieved the best results, except for LemmaRouge-2 recall and the LemmaRouge-2 and LemmaRouge-L F-measure scores. As in the case of the 30% summary size, the variant H1FF1-2 gave better results than the other 1-gram or 3-gram candidate lists for all scores except LemmaRouge-2 recall and F-measure.
5.5. Human Evaluation
ROUGE, originally designed for extractive summaries, falls short in assessing abstractive summaries due to their divergence from the source text wording. Abstractive summaries focus on conveying meaning rather than verbatim representation, making ROUGE’s word-matching approach inadequate. Alternative evaluation methods are needed to capture the semantic and conceptual aspects of abstractive summarization accurately.
To minimize the human effort required for manual evaluation, we selected 20 random documents from the dataset and enlisted three human evaluators to assess our summarizer, AASAC. Given that the H2FF2 variant achieved higher LemmaRouge scores, we specifically asked the evaluators to choose the best summary among the H2FF2 variants based on 1-gram, 2-gram, and 3-gram candidate lists. Figure 10a shows the evaluators' preferences for each H2FF2 summary, with the scores representing the total number of times a variant was chosen by the three evaluators. Each selection by an evaluator increases the score by one, so a maximum score of 60 indicates unanimous agreement among the evaluators on a particular variant across all 20 documents. The results showed that the H2FF2 summary using a 2-gram candidate list received the highest score compared with the other n-grams.
Following that, the evaluators provided assessments of the H2FF2 summary using a 2-gram candidate list, answering four questions: (Q1) Does the summary effectively cover the document’s most important issues? (Q2) Does the summary enable readers to understand the main points of the article? (Q3) How would you rate the summary’s readability? (Q4) What is your overall assessment of the summary’s quality? Each question was answered on a scale of 1 to 5, with 1 indicating strong disagreement, 3 for a neutral response, and 5 for strong agreement.
Figure 10b summarizes the questionnaire results, indicating that the summary effectively captures the document’s most important aspects, enabling readers to understand its content with average scores of 3.7 and 3.8, respectively. Additionally, the summary received ratings of 2.9 and 3.1 for readability and overall quality, respectively.
5.6. Discussion
The experimental results show the ability of ACS to select salient words and generate an informative summary. They also show that incorporating relation and collocation features into the heuristic information and the fitness function boosts the results. Moreover, the results indicate that 3-gram candidate lists do not improve the summary results at either summary size, even when relation and collocation features are added.
Nevertheless, there are occurrences in which the word segmentation performed by the tokenizer is inaccurate, resulting in an adverse impact on the generated summary. For instance, the term (بقايا: “remains”) is incorrectly split into two tokens, namely, (ب) and (قايا). While the character (ب) is a valid Arabic letter, the token (قايا) does not correspond to a valid Arabic word. Similarly, the word (بغداد: “Baghdad”) is divided into (ب) and (غداد). Although both the character (ب) and the token (غداد) exist in the Arabic dictionary, the token (غداد) conveys a different meaning than (بغداد). Utilizing a robust tokenizer is anticipated to address this issue effectively.
Another limitation of our AASAC summarizer arises when the selection between multiple candidate words is ambiguous, as the summarizer outputs all of the possibilities. This could be addressed by incorporating a grammar model. Repeated words are rarely generated, and they can be addressed by adding a penalty or cost to the fitness function.
The human evaluators expressed positivity regarding the content of the summary, indicating that our AASAC summarizer effectively captured the main points of the document. However, they remained neutral when evaluating the readability and overall quality of the summary. This suggests that our summarizer may benefit from incorporating language guidance to enhance these aspects during the summary generation process.