Article

Efficient Transformer-Based Abstractive Urdu Text Summarization Through Selective Attention Pruning

1 Department of Applied Data Science, Hong Kong Shue Yan University, Hong Kong SAR, China
2 Department of Computer Science, University of Sahiwal, Sahiwal 57000, Pakistan
3 College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China
4 Faculty of Data Science and Information Technology, INTI International University, Nilai 71800, Negeri Sembilan, Malaysia
* Author to whom correspondence should be addressed.
Information 2025, 16(11), 991; https://doi.org/10.3390/info16110991
Submission received: 30 September 2025 / Revised: 5 November 2025 / Accepted: 10 November 2025 / Published: 16 November 2025

Abstract

In today’s data-driven world, automatic text summarization is essential for extracting insights from large data volumes. While extractive summarization is well-studied, abstractive summarization remains limited, especially for low-resource languages like Urdu. This study introduces process innovation through transformer-based models—Efficient-BART (EBART), Efficient-T5 (ET5), and Efficient-GPT-2 (EGPT-2)—optimized for Urdu abstractive summarization. Innovations include strategically removing inefficient attention heads to reduce computational complexity and improve accuracy. Theoretically, this pruning preserves structural integrity by retaining heads that capture diverse linguistic features, while eliminating redundant ones. Adapted from BART, T5, and GPT-2, these optimized models significantly outperform their originals in ROUGE evaluations, demonstrating the effectiveness of process innovation and optimization for Urdu natural language processing.

1. Introduction

As the digital landscape continues to evolve, the exponential growth of textual data presents both significant opportunities and challenges in information management [1]. The complexity of modern information systems extends beyond single-language, single-domain contexts to encompass multi-lingual, multi-modal, and cross-jurisdictional data, such as in comprehensive coastal zone knowledge systems [2]. The vast amount of information spanning various domains has become increasingly difficult to navigate, whether in academic publications, news articles, or organizational reports. Innovative solutions are urgently needed to streamline information retrieval and comprehension, enabling efficient extraction of relevant insights from extensive document collections.
Automatic text summarization has emerged as a critical technology to address these challenges by condensing lengthy documents into concise and informative summaries. By providing users with quick access to key points and essential elements, text summarization facilitates more efficient decision-making, knowledge acquisition, and information management processes.
Text summarization approaches are generally categorized into two main types: extractive and abstractive methods [3,4]. Extractive summarization identifies and concatenates the most relevant sentences from the original text, while abstractive summarization attempts to understand underlying meanings and generate novel summaries using techniques such as paraphrasing and rewriting, resulting in more human-like outputs. Urdu, as a linguistically rich language with a right-to-left script and significant morphological complexity, presents unique challenges for text summarization that differ substantially from English and other high-resource languages [5,6]. The language’s compound morphology, lack of word boundaries, and limited availability of annotated datasets for model training and evaluation create substantial barriers to effective NLP application development.
This study addresses the following research questions:
  • Can selective attention head pruning improve the performance and efficiency of transformer models for Urdu abstractive summarization?
  • How do different transformer architectures (BART, T5, and GPT-2) respond to attention head pruning when adapted for Urdu?
  • What is the optimal pruning threshold that balances performance and efficiency for Urdu text summarization?
  • How do the optimized models generalize across different domains of Urdu text?
Transformer-based architectures, such as BART, T5, and GPT-2, have set new benchmarks in natural language processing tasks, including summarization for high-resource languages [7]. We selected these three models specifically because they represent distinct architectural paradigms: BART uses a bidirectional encoder and autoregressive decoder, T5 employs a text-to-text unified framework, and GPT-2 utilizes a decoder-only autoregressive approach. This diversity allows for a comprehensive evaluation of our pruning methodology across different transformer designs. While newer models like GPT-3 and GPT-4 exist, we chose GPT-2 for its open-source nature, reproducibility, and suitability for architectural modification experiments, focusing on process innovation rather than scale-based performance gains. Their application to Urdu, however, reveals a critical research gap. While current state-of-the-art approaches for Urdu typically involve fine-tuning these pre-trained models, they operate under a significant limitation: they apply the full, unmodified transformer architecture to a specialized, low-resource task. These architectures contain attention heads trained on a mixture of languages and tasks, many of which become redundant or inefficient when specialized for Urdu abstractive summarization. This results in models that are computationally expensive, suboptimally efficient, and not tailored to the specific linguistic characteristics of Urdu. Merely fine-tuning pre-trained models without architectural optimization fails to address the inherent inefficiencies for a targeted, low-resource application.
To bridge this gap, this study introduces a novel optimization approach that moves beyond simple fine-tuning. We posit that not all components of large transformer architectures contribute equally to the task of Urdu abstractive summarization. By strategically identifying and removing inefficient attention heads—a process we term selective attention pruning—we create leaner yet more effective models. This approach is grounded in the observation that attention heads vary in their contribution; eliminating redundant heads can significantly reduce computational complexity while maintaining or even enhancing summarization quality. Building on this principle, we develop and evaluate three optimized models: Efficient-BART (EBART), Efficient-T5 (ET5), and Efficient-GPT-2 (EGPT-2).
The key research contributions of this paper are as follows:
  • Development of optimized transformer models (EBART, ET5, and EGPT-2) through strategic attention head pruning, specifically designed for Urdu abstractive summarization;
  • Comprehensive investigation of leading transformer architectures (BART, T5, and GPT-2) for Urdu language processing;
  • Extensive evaluation of the optimized models using ROUGE metrics and comparative analysis against their original counterparts.
The remainder of this paper is organized as follows. Section 2 discusses related work in text summarization. Section 3 provides detailed descriptions of datasets, preprocessing methods, and model architectures. The experimental setup and results are presented in Section 4, and the discussions are presented in Section 5. Finally, Section 6 concludes this study and outlines future research directions.

2. Background and Literature Review

This section examines the technical foundations of automatic text summarization. It provides a comprehensive review of current state-of-the-art approaches, with particular emphasis on transformer-based methods and their applications to low-resource languages.
Egonmwan and Chali [8] proposed transformer-based models for single-document neural summarization. Their work showed that sequence-to-sequence models with attention can outperform earlier systems but require longer training time.
Building on this work, Abolghasemi et al. [9] developed HTS-DL, a hybrid text summarization system that combines extractive and abstractive summarizers to overcome the challenges of traditional neural models. Their approach outperformed existing models by leveraging the strengths of both summarization paradigms.

2.1. Single Document Summarization Approaches

Single-document summarization has improved significantly with deep learning models. Jiang et al. [10] introduced attention-based bidirectional LSTMs along with sequence-to-sequence long short-term memory networks to handle problems like out-of-vocabulary words and redundant outputs. Experiments on public corpora showed that their system outperformed baselines and multiple state-of-the-art systems. Lewis et al.’s [7] introduction of BART further improved abstractive summarization using a bidirectional encoder and autoregressive decoder, with a two-stage pre-training strategy (corruption and reconstruction of text), enabling very effective end-to-end summarization. Meanwhile, BERT-based approaches have shown promise for text understanding tasks in various languages [11,12]. However, morphologically rich languages present special difficulties for summarizing text. While summarization algorithms are, in general, language-independent, languages like Urdu need custom preprocessing. Daud et al. [5] stressed that the compound morphology and lack of word boundaries in Urdu make it important to develop language-specific tools like stemmers, lemmatizers, and stopword lists. Farooq et al. [13] addressed these problems by comparing different summarization methods for Urdu. In contrast, Asif et al. [14] presented a hybrid architecture combining TF–IDF weighting, word frequency, and transformer-based abstractive generation. Their architecture generated summaries comparable in quality to human-written text, with better coherence and retrieval utility.

2.2. Multiple Document Summarization and Hybrid Approaches

Hybrid models combining several approaches have reached competitive performance in multi-document summarization. Khyat et al. [15] introduced an N-gram-based hybrid model with deep learning, which outperformed traditional systems with better ROUGE scores. Mujahid et al. [16] improved transformer architectures for Urdu headline classification, demonstrating that fine-tuning over Urdu corpora results in significant performance gains. Their results established that transformer-based features support effective text representation and categorization for Urdu language processing.
Transformer architectures revolutionized natural language processing. Vaswani et al. [17] proposed the self-attention mechanism that lies at the heart of all modern transformer models. Lin et al. [18] provided the ROUGE metric for automatic evaluation of summaries, and their framework became the standard evaluation platform for summarization-based tasks in text processing. Finally, Wolf et al. [19] developed the Transformers library that unified BERT, GPT, and T5 architectures for a wide variety of NLP tasks and significantly accelerated both research and deployment.
Recent work has focused on optimizing transformer architectures by analyzing and pruning their components. Michel et al. [20] examined attention head redundancy, asking whether sixteen heads are really better than one, and thereby opened a path toward efficient architecture design. Cheema et al. [21] showed that selective pruning of attention heads alone could achieve substantial efficiency gains for GPT-2 and established a method for optimizing transformers through selective reduction of components.
Several approaches have emerged for low-resource language scenarios. Savelieva et al. [22] conducted abstractive summarization of instructional texts using BERT, confirming its context understanding capabilities. Xue et al. [23] proposed mT5, a multilingual text-to-text transformer, showing strong robustness for multilingual and low-resource summarization tasks. Munaf et al. [24] focused on low-resource summarization with pre-trained language models and reported good results in a low-resource setting.
Cross-lingual adaptations have also achieved considerable success. Salau et al. [25] developed a machine learning-based approach using SVM, KNN, and CNN to detect Afan Oromo fake news from Facebook data, achieving a precision, recall, and F1-score of 0.92, 0.92, and 0.90, respectively. Rauf et al. [26] explored the application of deep learning methods to fake news detection in Urdu, demonstrating the potential of such techniques for Urdu NLP tasks.
Azhar et al. [27] conducted a comprehensive systematic review and experimental evaluation of classical and transformer-based models for Urdu abstractive text summarization. Their work benchmarked various architectural approaches and identified the unique challenges posed by Urdu’s linguistic characteristics in low-resource settings, providing crucial insights into the current state of Urdu summarization research.
Despite these advancements, significant limitations persist in Urdu text summarization. Current state-of-the-art models predominantly rely on fine-tuning pre-trained multilingual transformers without structural modifications. This approach leaves many attention heads redundant or inefficient when specialized for Urdu, resulting in unnecessary computational overhead and suboptimal performance. To address this gap, our work introduces a selective attention head pruning mechanism that optimizes transformer architecture at the structural level, moving beyond parameter adjustment through fine-tuning. This architectural optimization enables the development of faster, lighter, and more efficient transformer models specifically engineered for abstractive Urdu text summarization tasks.

3. Methodology

This section explains the steps of our proposed methodology. The experiments were conducted using Python version 3.8.10, and the Hugging Face transformers library version 4.21.0. All base models were implemented using PyTorch version 2.2.0.
First, the data preprocessing steps used for our proposed techniques are discussed. Then, the process of strategic removal of inefficient attention heads is described, which is used in all three optimized versions, i.e., EBART, ET5, and EGPT-2, to improve Urdu abstractive text summarization.

3.1. Dataset Description and Preprocessing

The Urdu Fake News Dataset from Hugging Face was used for this study. It consists of a total of 1300 news articles collected from authentic Urdu news sources. The dataset is categorized into five domains: Business, Health, Showbiz, Sports, and Technology. The distribution is as follows: the Business and Sports categories have 150 real and 80 fake articles each, while the Health, Showbiz, and Technology categories have 150 real and 130 fake articles each. This gives a total of 750 real and 550 fake articles, providing a solid basis for model training and testing.
The dataset is divided into 80% for training (1040 articles) and 20% for testing (260 articles). Each article has a human-written abstractive summary that serves as the ground truth for the summarization task.
Various preprocessing steps were carried out before training and testing the proposed models. Urdu is a language with unique linguistic characteristics such as a right-to-left script, complex morphology, and no clear word boundaries [5]. Transformer architectures, namely GPT-2, T5, and BART, were trained on raw text and are, therefore, capable of leveraging knowledge about text acquired during pre-training. Consequently, the preprocessing of data is relatively simple.
The consistent preprocessing pipeline applied to all models is described below.
  • Tokenization. The input Urdu text was segmented into subword tokens using the pre-trained tokenizers specific to each model (e.g., GPT2Tokenizer, T5Tokenizer, and BartTokenizer) [19].
  • Sequence Length Adjustment. All tokenized sequences were padded or truncated to a fixed length of 512 tokens to conform to the input size requirements of the models.
  • Text Normalization. Basic text normalization was performed, which involved converting all text to lowercase and removing extraneous punctuation marks.
  • Tensor Conversion. The final tokenized and adjusted sequences were converted into PyTorch (version 2.2.0, Meta AI, Menlo Park, CA, USA) tensors, which is the required data format for model inference and fine-tuning.
This standardized preprocessing ensures that all models receive clean and uniformly formatted input, thereby facilitating effective learning and generalization.
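To make the pipeline concrete, the following is a minimal sketch of these four steps using the Hugging Face tokenizer classes named above. The checkpoint name, the normalization regular expression, and the sample sentence are illustrative assumptions rather than the exact code used in this study.

import re
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")  # assumed checkpoint for illustration

def preprocess(texts, max_length=512):
    # Text normalization: lowercase (relevant only to embedded Latin text) and strip extra
    # punctuation, keeping word characters, whitespace, and Arabic-script (Urdu) characters.
    normalized = [re.sub(r"[^\w\s\u0600-\u06FF]", " ", t.lower()) for t in texts]
    # Tokenization, padding/truncation to a fixed 512-token length, and tensor conversion.
    return tokenizer(normalized, max_length=max_length, padding="max_length",
                     truncation=True, return_tensors="pt")

batch = preprocess(["یہ ایک مثال ہے۔"])   # a one-sentence Urdu example
print(batch["input_ids"].shape)           # torch.Size([1, 512])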

3.2. Efficient Transformer Summarizer Models by Pruning Attention Heads Based on Their Contribution

This section provides a detailed explanation of the efficient transformer summarizer framework and the pruning process of attention heads based on their contribution. The preprocessed Urdu text dataset is used as the input for the model’s training. The complete framework of the proposed efficient transformer summarizer is illustrated in Figure 1, and each step of the framework is discussed in turn below.

3.2.1. Training the Efficient Summarizer Models

For the training of the three Efficient Transformer Summarizer models—(1) EGPT-2, (2) ET5, and (3) EBART—the same procedure is followed, which is presented in this section. Each model is initialized with the pre-trained weights of its original counterpart, that is, GPT-2, T5, and BART, respectively (all obtained from the Hugging Face Transformers Library, version 4.4.0, Hugging Face, Brooklyn, NY, USA), and then each model is fine-tuned on the T_train dataset of tokenized Urdu training documents. The overall process of the Efficient Transformer Summarizer is given in Algorithm 1.
There are three major phases for training and optimization in the pipeline, which together work to improve performance and enhance efficiency:
  • Fine-tuning: In this step, the pre-trained models are fine-tuned for the specific domain of Urdu text summarization and are exposed to the patterns and features of the Urdu language. This step forms the basis by which the model learns Urdu-specific linguistic structures and contextual dependencies. The computational cost of this step is O(N·L²) per epoch for N training samples and a sequence length of L. The fine-tuning process was implemented in PyTorch (version 2.2.0, Meta AI, Menlo Park, CA, USA) using the AdamW optimizer (Loshchilov and Hutter).
  • Pruning: This stage systematically removes redundant attention heads based on their contribution to the ROUGE score, reducing model complexity and sharpening the focus of attention on the salient features of Urdu. Pruning heads that contribute minimally to the task makes the model more focused. The iterative pruning process has a time complexity of O(H × E), where H is the number of attention heads and E is the evaluation cost per head.
  • Evaluation: This step ensures that pruning does not lead to degradation of performance; thus, the final model is efficient and accurate. The validation step here was important to confirm whether the pruned model could still generate good Urdu summaries while realizing computational gains.
Hyperparameter optimization is an important part of the fine-tuning process. The following are specific values, which were determined through empirical validation on a held-out development set:
  • Number of encoder/decoder layers: 6 (for BART and T5); 12 (for GPT-2).
  • Number of attention heads: 12 (for BART and T5); 12 (for GPT-2).
  • Embedding size: 768.
  • Feed-forward network size: 3072.
  • Dropout rate: 0.1.
  • Learning rate: 1 × 10⁻⁴, using the AdamW optimizer.
  • Batch size: 8.
  • Sequence length: 512 tokens.
  • Weight decay: 0.01.
  • Training epochs: 200.
Algorithm 1 Pruning and Summarization with Efficient GPT-2, T5, and BART Models
Require:
 1: T_train: Tokenized training text documents
 2: T_test: Tokenized testing text documents
 3: T_pruned: Tokenized pruning text documents
 4: θ: Accuracy sliding window for attention score pruning
Ensure:
 5: Σ_EGPT-2: Summaries by pruned Efficient GPT-2 model
 6: Σ_ET5: Summaries by pruned Efficient T5 model
 7: Σ_EBART: Summaries by pruned Efficient BART model
 8: K: Evaluation metrics [F1-score, Precision, Recall, BLEU, ROUGE]
 9: EGPT-2_Pruned, ET5_Pruned, EBART_Pruned: Pruned summarizer models
10: procedure Main Algorithm
11:     # Train the Efficient Summarizer models
12:     EGPT-2 ← trainEGPT2(T_train, GPT-2)
13:     ET5 ← trainET5(T_train, T5)
14:     EBART ← trainEBART(T_train, BART)
15:     # Perform Iterative Attention Head Pruning
16:     EGPT-2_Pruned ← perform_pruning(EGPT-2, T_pruned, θ)
17:     ET5_Pruned ← perform_pruning(ET5, T_pruned, θ)
18:     EBART_Pruned ← perform_pruning(EBART, T_pruned, θ)
19:     # Generate summaries using pruned models
20:     Σ_EGPT-2 ← generate_summaries(EGPT-2_Pruned, T_test)
21:     Σ_ET5 ← generate_summaries(ET5_Pruned, T_test)
22:     Σ_EBART ← generate_summaries(EBART_Pruned, T_test)
23:     # Evaluate the summaries
24:     K ← evaluate_summaries(Σ_EGPT-2, Σ_ET5, Σ_EBART, testGround_Truth)
25:     # Summarize the results
26:     summarize_results(K)
27: end procedure
In addition to this, a linear warm-up schedule was employed to gradually increase the learning rate before reaching the maximum limit for optimization stability. The weight decay of L2 regularization, applied to the model weights, also helps prevent overfitting, which is crucial for better training and fine-tuning processes. Training was conducted for 200 epochs or until early stopping was triggered based on performance on a held-out validation set. After the fine-tuning process, the EBART, ET5, and EGPT-2 models are further utilized for the pruning of attention heads, which is explained in the next section.
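A minimal sketch of this fine-tuning setup, under the hyperparameters listed above (AdamW, learning rate 1 × 10⁻⁴, weight decay 0.01, linear warm-up, early stopping on a validation set), is given below. The checkpoint name, warm-up length, early-stopping patience, and the helpers train_loader and val_rouge are assumptions for illustration, not the released training code.

import torch
from transformers import BartForConditionalGeneration, get_linear_schedule_with_warmup

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")   # assumed checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

num_epochs = 200
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,                               # assumed warm-up length
    num_training_steps=num_epochs * len(train_loader),  # train_loader: assumed DataLoader of batches
)

best_rouge, patience, bad_epochs = 0.0, 5, 0            # assumed early-stopping patience
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:                          # each batch: input_ids, attention_mask, labels
        loss = model(**batch).loss                      # cross-entropy over summary tokens
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    score = val_rouge(model)                            # assumed ROUGE-1 evaluation on the validation set
    if score > best_rouge:
        best_rouge, bad_epochs = score, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                      # early stopping
            break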
GPT-2, T5, and BART are fine-tuned on the T_train dataset, starting from their pre-trained weights, to perform Urdu text summarization. During fine-tuning on T_train, the models are exposed to the language patterns, content features, and specific patterns present in the training corpus, enabling them to adjust their parameters and internal representations to reflect the subtleties of the target domain accurately.
The key step of the fine-tuning process is hyperparameter tuning, which involves optimizing the following parameters:
  • The number of transformer blocks (layers) in the encoder and decoder to balance model complexity and generalization.
  • The number of attention heads in the multi-head attention mechanism to allow the model to focus on different parts of the input text simultaneously.
  • The embedding size, chosen to capture sufficient information while maintaining computational efficiency.
  • The size of the feed-forward neural network within each transformer block to adjust the model’s capacity and computational cost.
  • The dropout rate applied to transformer layers to prevent overfitting by randomly dropping units during training.
  • The learning rate used by the optimization algorithm (e.g., Adam, SGD) to ensure fast and stable convergence.
  • The batch size, i.e., the number of sentences or paragraphs processed per training step, for stable gradient estimation.
  • The maximum sequence length for input and output to capture adequate context while efficiently using computational resources.

3.2.2. Perform Iterative Attention Head Pruning with Individual Contribution Computation

Iterative attention head pruning based on individual contribution is the key step of our efficient summarizer models. After fine-tuning, each of the trained Efficient Summarizer models undergoes an iterative attention head pruning process. The function perform_iterative_pruning(model, T_pruned, θ), where model ∈ {EGPT-2, ET5, EBART}, is responsible for performing the iterative attention head pruning based on the individual contribution of the attention heads. The EGPT-2, ET5, and EBART models, along with the pruned tokenized text documents T_pruned and an accuracy sliding window threshold θ, are the inputs to this function.
To quantify the contribution of each attention head to the performance of the model, the ROUGE score [18] is employed as the measure. We use ROUGE as our primary performance metric to quantify “accuracy” in the context of summarization quality. The contribution of each attention head is measured by computing the drop in ROUGE score when that specific head is masked. A significant drop indicates a high-contribution head, while a minimal drop suggests redundancy [20]. The function measures the contribution of each attention head and ranks the heads from low to high based on their contribution. The accuracy sliding window threshold θ ensures that pruning decisions do not degrade summarization accuracy.
To determine the individual contribution of a particular attention head, it is masked to see how well the model performs without that particular attention head. This process is performed for each attention head to measure its contribution during each iteration. Let M denote the original model and M^(i) the model with the i-th attention head removed or masked.
The importance of the attention head can be quantified as follows:
I^(i) = P(M) − P(M^(i))
where I^(i) denotes the importance score of attention head i, P(·) represents the performance metric function, M is the complete model, and M^(i) denotes the model with attention head i removed. The performance metric P(·) can be any relevant evaluation measure such as ROUGE score for summarization quality, perplexity, or other task-specific metrics.
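The sketch below illustrates how this importance measure can be computed in practice, assuming GPT-2 (which exposes head pruning in the Transformers library) and a hypothetical rouge1(model, dev_set) evaluation helper; it is an illustration of the measure, not the exact implementation.

import copy
from transformers import GPT2LMHeadModel

def head_importance(model, dev_set, rouge1):
    # Return I(i) = P(M) - P(M^(i)) for every attention head, indexed by (layer, head).
    base_score = rouge1(model, dev_set)              # P(M) on the pruning/dev documents
    importance = {}
    for layer in range(model.config.n_layer):
        for head in range(model.config.n_head):
            masked = copy.deepcopy(model)            # keep the original model M intact
            masked.prune_heads({layer: [head]})      # M^(i): remove this single head
            importance[(layer, head)] = base_score - rouge1(masked, dev_set)
    return importance                                # small values indicate redundant heads

model = GPT2LMHeadModel.from_pretrained("gpt2")      # placeholder checkpoint
# importance = head_importance(model, dev_docs, rouge1)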
Through iterative pruning of attention heads, redundant and inefficient heads are removed, enabling higher model compression. By pruning the attention heads that have no significant influence on the prediction, we ensure an optimal Transformer Summarizer that achieves higher accuracy with better resource utilization by reducing the base model’s complexity. Next, the pruned models are stored as EGPT-2_Pruned, ET5_Pruned, and EBART_Pruned, which correspond to the pruned versions of the Efficient-GPT-2 (EGPT-2), Efficient-T5 (ET5), and Efficient-BART (EBART) models, respectively.
Visualizing Pruning Effects: Figure 2 illustrates the transformative effect of selective attention head pruning on inter-layer information flow. The pre-pruning configuration (left) maintains all 12 attention heads, creating redundant computational pathways. In contrast, the optimized post-pruning architecture (right) retains only 8 high-contribution heads while eliminating 4 low-contribution heads (H3, H6, H8, and H10), resulting in focused processing of Urdu linguistic patterns.
This architectural optimization directly addresses Urdu’s unique morphological challenges by concentrating computational resources on heads that capture language-specific features, while eliminating redundant heads trained on cross-linguistic patterns irrelevant to Urdu summarization. The detailed process of pruning is shown in Figure 3 and Algorithm 2. Algorithm 2 describes the entire procedure of attention head pruning based on individual contribution in the multi-head attention module, i.e., the functionality of perform_iterative_pruning(model, T_pruned, θ), where model ∈ {EGPT-2, ET5, EBART}, which is applied to all three proposed models. M and H are given as input to Algorithm 2, where M is the original model and H is the total number of attention heads in the model. M_P and C_P are the outputs of Algorithm 2, where M_P is the optimal model consisting of the high-performing attention heads after the removal of the least-contributing heads, while C_P is the contribution of each attention head in the optimal model.
Algorithm 2 Attention Heads Pruning Based on the Individual Contribution in the Multi-Head Attention
 1: Input:
      • M: The original model
      • H: The total number of attention heads in the model
 2: Output:
      • M_P: The optimal model consisting of the high-performing attention heads
      • C_P: The contribution of each attention head in the pruned model
 3: Compute the contribution of each attention head in the original model M:
      • C = ComputeAttentionHeadContribution(M, H)
 4: Initialize M_P = M, h_r = [] (a list to keep track of removed heads), and C_P = C
 5: while True do
 6:     Find the index h_min of the attention head with the minimum contribution in C_P:
          • h_min = argmin(C_P)
 7:     if h_min is already in h_r then
 8:         Break out of the loop (all heads have been considered)
 9:     end if
10:     Create a new model M_P/h_min by removing the h_min-th attention head from M_P:
          • M_P/h_min = RemoveAttentionHead(M_P, h_min)
11:     Compute the ROUGE scores of M_P and the pruned model M_P/h_min:
          • ROUGE(M_P) = ComputeROUGE(M_P)
          • ROUGE(M_P/h_min) = ComputeROUGE(M_P/h_min)
12:     if ROUGE(M_P/h_min) ≥ ROUGE(M_P) then
13:         Update M_P = M_P/h_min, add h_min to h_r, and compute the new contribution of each attention head in the pruned model:
              • C_P = ComputeAttentionHeadContribution(M_P, H)
14:     else
15:         Break out of the loop (accuracy has stopped improving)
16:     end if
17: end while
18: Return M_P and C_P
In the standard model M, the function ComputeAttentionHeadContribution(M, H) is used to obtain the contribution of each of the H attention heads; C is the vector holding these contributions. At the beginning of the pruning process, the original model M is set as the pruned model M_P, the original contribution vector C is assigned to the contribution vector C_P, and the list h_r is initialized to keep track of the removed heads. In each iteration, h_min is computed, which is the index of the attention head that contributes the least in C_P. The function RemoveAttentionHead(M_P, h_min) removes the h_min-th attention head from M_P, resulting in a new model M_P/h_min, i.e., the model without the minimum-contribution attention head.
The functions ComputeROUGE(M_P) and ComputeROUGE(M_P/h_min) calculate the ROUGE scores for the current model M_P and the pruned model M_P/h_min, respectively. If the pruned model performs at least as well, ComputeAttentionHeadContribution(M_P, H) recomputes the contribution of each attention head in the updated model M_P, the index h_min of the head with the minimum contribution in C_P is found again, and the next inefficient head is removed. This process continues as long as the ROUGE score remains equal to or better than that of the previous model; the loop terminates when the score starts to decrease. The algorithm returns the optimal pruned model M_P together with the contribution vector C_P as output.
In brief, inefficient attention heads are removed based on their contribution, with the ROUGE score as the decision criterion. Each attention head’s contribution is recomputed after every pruning step, moving from the original model M to the trimmed model M_P, as long as the score stays the same or improves. Using the ROUGE score as the metric for summarization performance, this procedure identifies the optimal pruned model M_P by eliminating the inefficient attention heads.
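The pruning loop of Algorithm 2 can be approximated by the following sketch, which reuses the hypothetical head_importance and rouge1 helpers introduced above; it is illustrative and omits the sliding-window bookkeeping of the full procedure.

import copy

def iterative_prune(model, dev_set, rouge1):
    # Algorithm 2 sketch: repeatedly drop the least-contributing head while ROUGE does not decrease.
    pruned = copy.deepcopy(model)
    removed = set()
    contrib = head_importance(pruned, dev_set, rouge1)              # C: contribution of each head
    while True:
        candidates = {h: c for h, c in contrib.items() if h not in removed}
        if not candidates:
            break                                                   # all heads have been considered
        h_min = min(candidates, key=candidates.get)                 # (layer, head) with minimum contribution
        trial = copy.deepcopy(pruned)
        trial.prune_heads({h_min[0]: [h_min[1]]})                   # candidate model M_P/h_min
        if rouge1(trial, dev_set) >= rouge1(pruned, dev_set):       # accept only if ROUGE holds or improves
            pruned, removed = trial, removed | {h_min}
            contrib = head_importance(pruned, dev_set, rouge1)      # recompute contributions
        else:
            break                                                   # score started to drop
    return pruned, contrib                                          # M_P and C_P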

3.2.3. Generate Summaries Using the Fine-Tuned Pruned Models

After pruning the inefficient attention heads based on their contributions from the multi-head attention module of the summarizer models, the algorithm generates summaries for the tokenized testing text documents T_test using the optimal pruned model M_P. The same process is used for each pruned Efficient Summarizer model, namely EGPT-2_Pruned, ET5_Pruned, and EBART_Pruned, as shown in Algorithm 1.
The text summaries are generated by the method generate_summaries(model, Test), where model represents EGPT-2_Pruned, ET5_Pruned, and EBART_Pruned.
These summaries are then stored in Σ_EGPT-2, Σ_ET5, and Σ_EBART. Σ_EGPT-2 contains the summary generated by the EGPT-2_Pruned model, a pre-trained Transformer model optimized for text generation tasks. Σ_ET5 stores the summary generated by the ET5_Pruned model, a Transformer-based text-to-text architecture demonstrating strong summarization performance. Finally, Σ_EBART contains the summary produced by the EBART_Pruned model, a Transformer encoder–decoder variant highly effective for abstractive summarization and other sequence generation tasks.
In addition, various standard metrics are used to assess the quality of the summaries, including BLEU, ROUGE, Precision, Recall, and F1-score. Precision and Recall measure how relevant and complete the generated summaries are, while the F1-score combines these two factors, representing the overall quality of summarization for the model. The BLEU score measures n-gram overlap between the generated text and the ground-truth summary, reflecting the accuracy and fluency of the generated text. The evaluate_summaries(Σ_EGPT-2, Σ_ET5, Σ_EBART, testGround_Truth) function performs this evaluation. Its inputs are the ground-truth summaries (testGround_Truth) and the outputs from the pruned Efficient Summarizer models: EGPT-2_Pruned, ET5_Pruned, and EBART_Pruned.
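A minimal sketch of these generation and evaluation steps is shown below. The decoding parameters (beam size, maximum summary length) and the use of the rouge_score package are assumptions made for illustration; the paper's exact generation settings are not restated here.

from rouge_score import rouge_scorer

def generate_summaries(model, tokenizer, documents, max_new_tokens=128):
    summaries = []
    for doc in documents:
        inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
        summaries.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    return summaries

def evaluate_summaries(generated, references):
    # Average ROUGE F1 over the test set; no stemming, since stemmers target English morphology.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
    scores = [scorer.score(ref, hyp) for ref, hyp in zip(references, generated)]
    return {m: sum(s[m].fmeasure for s in scores) / len(scores)
            for m in ["rouge1", "rouge2", "rougeL"]}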

3.3. Ablation Study and Pruning Threshold Analysis

To objectively evaluate the contribution of the pruning strategy and find the optimal pruning threshold, an ablation study is conducted. We systematically varied the percentage of pruned attention heads (0%, 15%, 30%, and 45%) of the EGPT-2 model and measured performance in terms of ROUGE-1 along with inference time. This helps establish that our proposed selective pruning retains the structural integrity and linguistic capability of the model by removing only the redundant heads, while quantifying the efficiency–performance trade-off.

3.4. Computational Cost Measurement

To substantiate the claim of reduced computational cost, we recorded the training and inference times for both the original and pruned models on identical hardware: Intel Core i7 processor, 32 GB RAM, and an NVIDIA GeForce GPU. Additionally, memory usage was monitored at inference. This provides empirical data supporting the efficiency gains achieved through our pruning methodology.
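The following sketch shows one way such measurements can be collected; the per-document timing loop, generation length, and CUDA memory counters are illustrative assumptions rather than the exact profiling code.

import time
import torch

def measure_inference(model, tokenizer, documents, device="cuda"):
    # Average seconds per document and peak GPU memory (GB) during generation.
    model.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)
    start = time.perf_counter()
    with torch.no_grad():
        for doc in documents:
            inputs = tokenizer(doc, return_tensors="pt", truncation=True,
                               max_length=512).to(device)
            model.generate(**inputs, max_new_tokens=128)
    torch.cuda.synchronize(device)
    seconds_per_doc = (time.perf_counter() - start) / len(documents)
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return seconds_per_doc, peak_mem_gb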

4. Results

We provide a thorough comparison of our proposed optimized models (EBART, ET5, and EGPT-2) with their original counterparts in this section. We support our claims through statistical significance testing, computational efficiency analysis, cross-domain validation, and detailed ablation studies.

4.1. Experimental Setup and Statistical Validation

The Hugging Face Transformers library was the baseline framework for our model development and evaluation. We utilized the Urdu Fake News dataset, comprising 1300 articles across five domains: Business, Health, Showbiz, Sports, and Technology, with 750 real and 550 fake articles, split into 80% training and 20% test data.
All transformer models, including BART, T5, GPT-2, and their efficient variants, were fine-tuned over 200 epochs with a learning rate of 1 × 10⁻⁴ using the AdamW optimizer and cross-entropy loss. Experiments were performed on consistent hardware: Intel Core i7, 32 GB RAM, and an NVIDIA GeForce GPU.
As a check for statistical reliability, we conducted paired t-tests across five independent runs on ROUGE-1 scores. Results for all model comparisons showed a p-value of less than 0.05: BART vs. EBART: p = 0.023, T5 vs. ET5: p = 0.017, GPT-2 vs. EGPT-2: p = 0.008. Hence, performance improvements achieved through attention head pruning are statistically significant.
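The test itself is straightforward to reproduce; the sketch below uses SciPy with placeholder per-run ROUGE-1 values, not the actual scores from our experiments.

from scipy import stats

rouge1_gpt2  = [0.44, 0.45, 0.46, 0.45, 0.44]   # baseline runs (placeholder values)
rouge1_egpt2 = [0.51, 0.52, 0.53, 0.52, 0.51]   # pruned-model runs (placeholder values)

t_stat, p_value = stats.ttest_rel(rouge1_egpt2, rouge1_gpt2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # p < 0.05 indicates a significant improvement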

4.2. Overall Performance Analysis

Table 1 shows the overall performance comparison of all models on all metrics. Optimized models perform better than their vanilla versions, and EGPT-2 achieves the best results in all the measures.
EGPT-2 outperforms the others with a ROUGE-1 score of 0.52, an improvement of 15.6% over the base GPT-2. These consistent results across all the efficient models support our hypothesis that selective attention head pruning enhances model performance while preserving linguistic integrity.
Figure 4 shows the performance of six summarization models in terms of four important metrics, namely ROUGE-1, ROUGE-2, ROUGE-L, and BLEU. The color intensity shows the performance of the model. Red represents the higher value, and blue represents the lower one. EGPT-2 obtains the highest ROUGE-1 score of 0.52 and demonstrates comprehensive performance, especially in ROUGE-L with 0.45 and BLEU with 0.4. EBART runs second, with its highest scores of 0.5 in ROUGE-1, while performing consistently well in other metrics.
Figure 5 presents a bar chart comparing the six models across the four evaluation metrics. EGPT-2 consistently outperforms the other models on all metrics, achieving the highest scores in ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, reflecting its superior capability in both recall (capturing important words and phrases) and precision (producing fluent and accurate summaries).

4.3. Extended Ablation and Pruning Threshold Evaluation

We perform an extensive ablation study to objectively assess the contribution of our pruning strategy and to determine the best pruning threshold. Specifically, we systematically vary the percentage of pruned attention heads for the EGPT-2 model (0%, 15%, 30%, and 45%) and assess performance using the ROUGE-1 score, inference time, and memory consumption.
The outcomes of this analysis are listed in Table 2, clearly indicating that 30% is the optimal pruning ratio, balancing efficiency and effectiveness. Precisely at this threshold, the model reaches a maximum ROUGE-1 score of 0.52 with remarkable computational gains.
The trend of the ROUGE-1 scores over different pruning ratios is illustrated in Figure 6. As is evident from this figure, the performance of the model increases consistently from the original (0.45) through 15% (0.48) pruning to the maximum at 30% pruning (0.52), beyond which it starts decreasing at 45% pruning (0.49). This therefore confirms our contribution-based iterative pruning method and identifies 30% as the ideal threshold for EGPT-2.
The comprehensive results from Table 2 and Figure 6 yield the following key observations. Pruning up to 30% of attention heads results in a 15.6% improvement on ROUGE-1, as well as an 18% reduction in inference time (from 0.89 s to 0.73 s) and a 19% reduction in memory usage (from 4.2 GB to 3.4 GB). Beyond this optimal threshold, performance degrades despite continued improvements in computational efficiency, which strongly suggests that aggressive pruning removes attention heads responsible for maintaining the linguistic competencies and summarization quality of the model.
This ablation study provides empirical evidence for our hypothesis that transformer models contain considerable redundancy in their attention mechanisms. The success of selective pruning at the 30% threshold shows that a substantial fraction of attention heads can be removed without harming performance, and can in fact improve it, because the model’s capacity becomes focused on the most relevant linguistic features for Urdu abstractive summarization.

4.4. Computational Efficiency Analysis

Across all optimized models, our pruning approach shows notable computational benefits. The detailed comparison of inference time and memory usage between the original and pruned models is shown in Table 3.
Architecturally, the improvements of 22% in inference speed and 19% reduction in memory for EGPT-2, as shown in Table 3, are attributed to the streamlined information flow presented in Figure 2. By removing about 30% of attention heads while maintaining linguistic capabilities, our approach achieves the best efficiency–performance trade-off for Urdu NLP applications.
As shown in Table 3, the optimally pruned EGPT-2 model (30% pruning) reduces inference time by 18%, from 0.89 s to 0.73 s (equivalent to roughly 22% faster inference), and consumes 19% less memory, from 4.2 GB to 3.4 GB, compared to the base GPT-2 model. Similar gains were observed for EBART, which yields 18% faster inference and 17% less memory, as well as ET5, with 16% faster inference and 15% less memory. These consistent improvements across all three architecture modifications, as illustrated in Figure 7, further confirm that selective pruning of attention heads is an effective method to enhance computational efficiency without degrading performance.
Moreover, the inter-layer information flow shown in Figure 2 (Section 3.2.2) provides the structural basis for this efficiency analysis: the streamlined architecture displayed in Figure 2 underlies the efficiency gains quantified in Table 3 and visualized in Figure 7.
In fact, the advantages of these models go beyond saving resources. The smaller memory footprints allow for easy deployment on low-resource devices, and the faster inference times now make real-time processing applications feasible. These efficiency gains are particularly valuable for Urdu NLP, where computational resources are often limited compared to high-resource languages like English.

4.5. Cross-Domain Generalization

To address concerns about the objectivity of results and domain specificity, we also tested EGPT-2 on an additional multi-domain Urdu dataset composed of 1200 documents from three diverse domains: news articles, blog posts, and academic texts. The performance results across these domains are presented in Table 4.
As presented in Table 4 and illustrated in Figure 8, the model demonstrated consistent performance across all domains, with ROUGE-1 scores ranging from 0.50 to 0.53. This narrow range, less than ±0.015 of variation from the average, indicates strong generalization beyond the domain used for the original training.
This consistency across diverse text types, from formal news reporting to informal blog writing and technical academic content, underlines the robustness of our selective attention pruning approach. Such invariance is especially useful in real-world applications where models encounter varied writing styles and content types, and it indicates that our optimization methodology yields models with strong generalization capabilities rather than domain-specific overfitting.

4.6. Impact of Input Sentence Count on Summary Generation

The graph in Figure 9 shows the performance comparison of four different text summarization models for different input sentence counts. The x-axis represents the number of input sentences, while the y-axis denotes the number of generated summaries. Overall, the results indicate that the GPT-2 and EGPT-2 models generate more summaries than BART and T5, especially for larger input sizes, showing that GPT-2 and EGPT-2 can handle longer inputs and are better at producing more detailed summaries. EGPT-2 exhibits the most robust and consistent performance across sentence counts.

4.7. Qualitative Analysis and Model Comparison

To compare the summarization ability of the different models, we provide visual examples of the summaries produced by BART, T5, GPT-2, and EGPT-2. Figure 10 shows the six-line source news article together with the summaries generated by each model; the comparison reveals clear performance differences. BART’s summary has four lines, T5’s summary is one line shorter at three lines, GPT-2 produces a two-line summary, and EGPT-2 produces the shortest summary of only one line. BART’s summary, though comprehensive at four lines, contains peripheral details that dilute the core message, including mentions of secondary entities and contextual information that, while related, are not essential to the main event. The T5 model produced a more focused three-line summary, but it still retains a minor detail that could be omitted without any loss of information. GPT-2’s two-line summary is better in terms of conciseness but omits one of the key entities involved in the event, providing a somewhat incomplete picture.
By contrast, the one-line summary by EGPT-2 is better at distilling the most critical information. We evaluate summary quality not on length alone but on three criteria: (1) Informativeness: retention of all key entities and the central action; (2) Conciseness: elimination of peripheral details with no loss of essential information; and (3) Coherence: fluency and logical flow. The EGPT-2 summary successfully identifies and retains all the key entities (e.g., the main actors, location, and central action) and the main event of the original six-line text while discarding peripheral details and contextual fluff, yielding a concise, coherent, and highly informative abstractive summary.
This pattern indicates that the selective attention pruning applied to EGPT-2 improves its salience detection and highlighting mechanism. It clears out redundant attention heads, sharpening the model’s focus: it is less likely to be “distracted” by less important tokens and more capable of forming a strong, direct representation of the core semantic elements necessary for summary generation. This results in outputs that are not only much shorter but also semantically denser and factually more complete relative to their length, which is indicative of higher levels of summarization competence.
This comprehensive evaluation highlights that our method of selective attention head pruning effectively improves the performance of transformer models for Urdu abstractive summarization with substantial computational gains. The statistical significance of improvements coupled with robust performance on out-of-domain data and clear efficiency gains establishes the merits of our approach toward low-resource language processing.

5. Discussion

This section interprets the experimental results, compares them with prior work, answers the research questions, and discusses the theoretical and practical implications of our findings.

5.1. Interpretation of Results and Comparison with Prior Work

Our results show that selective attention head pruning substantially improves the performance of transformer models in Urdu abstractive summarization. The improvement in ROUGE scores (up to 15.6% for EGPT-2) aligns with previous studies on attention pruning for other languages and tasks [20]; however, this is the first systematic application and validation for Urdu. Our approach outperforms conventional fine-tuning approaches for Urdu [5,13] while reducing computational cost, thereby addressing a critical gap in low-resource NLP.
Interestingly, the main reason for EGPT-2’s superior performance over EBART and ET5 lies in its autoregressive decoder-only architecture, which is inherently more suitable for generative tasks like abstractive summarization. In addition, pruning seems to have had a synergistic effect with the architecture of GPT-2, possibly because it contained a higher proportion of task-agnostic attention heads that could be safely removed with no loss in performance. This result stands somewhat in contrast to previous work on encoder–decoder models [7] and suggests that architectural differences are an important factor affecting how models respond to pruning.

5.2. Answering Research Questions

We now return to the research questions posed in the introduction:
  • Can selective attention head pruning improve performance and efficiency? Yes, our results clearly show that all pruned models (EBART, ET5, and EGPT-2) outperform their original counterparts across all metrics while achieving significant reductions in inference time (16–22%) and memory usage (15–19%).
  • How do different architectures respond to pruning? All three architectures benefited from pruning, but GPT-2 showed the greatest improvement, suggesting that decoder-only models may be particularly amenable to this optimization technique for generative tasks.
  • What is the optimal pruning threshold? Our ablation study identified 30% as the optimal pruning ratio for EGPT-2, balancing performance gains with computational efficiency. Beyond this threshold, performance degradation occurs, validating our contribution-based iterative approach.
  • How do optimized models generalize across domains? EGPT-2 maintained strong performance (ROUGE-1: 0.50–0.53) across news, blogs, and academic texts, demonstrating robust cross-domain generalization capabilities.

5.3. Theoretical and Practical Implications

Theoretical Implications: Our work provides evidence that transformer models applied to low-resource languages contain significant redundancy in their attention mechanisms. The success of selective pruning indicates that not all attention heads are equally important for specific tasks, supporting the hypothesis that transformers can be optimized by architectural simplification rather than by tuning their parameters alone [20].
Practical Implications: These optimized models offer a path toward deployable, efficient solutions for Urdu NLP applications. Significantly reduced computational requirements make transformer-based summarization feasible in resource-constrained environments, including mobile applications and systems that process data in real-time. This overcomes a critical barrier to the adoption of advanced NLP technologies in low-resource language contexts.

5.4. Urdu vs. English Processing Considerations

Our findings highlight important differences between processing Urdu and English text. The right-to-left script of Urdu, its complex morphology, and a lack of clear word boundaries [5] create challenges unlike those involved in English processing. The success of our pruning approach indicates that standard transformer architectures contain components optimized for English-like languages, which become redundant during the processing of Urdu. This again underlines the importance of language-specific optimization rather than directly applying models developed for high-resource languages.
Computational gains through pruning are particularly valuable for Urdu because of its resource-constrained environment. While English NLP often has the advantage of scale and computational power, Urdu applications need to be optimized judiciously to be practical; therefore, our approach has special relevance for low-resource language processing.

6. Conclusions and Future Work

This study effectively showcased the power of selective attention head pruning as a means of optimizing transformer-based models for Urdu abstractive text summarization. We introduced three optimized models—EBART, ET5, and EGPT-2—by strategically removing inefficient attention heads from their original architectures. The theoretical foundation of our approach lies in preserving structural integrity while eliminating redundant attention mechanisms that contribute minimally to summarization quality.
Our extensive experimental results show that the pruned models outperform their base versions. The EGPT-2 model performed best, with scores of 0.52 ROUGE-1, 0.36 ROUGE-2, 0.45 ROUGE-L, and 0.40 BLEU, outperforming the baseline GPT-2 by 15.6%. This may be attributed to the autoregressive nature of EGPT-2, enhanced by the focused attention resulting from strategic pruning, which aids in capturing salient content and maintaining linguistic coherence.
The ablation study gave important insights into the optimal pruning threshold: removing up to 30% of attention heads gives the best performance–efficiency trade-off, and beyond this threshold performance degrades, justifying our contribution-based iterative pruning strategy. The study also showed significant computational gains: at test time, EGPT-2 achieved 22% faster inference and 19% lower memory consumption compared with the base model, while EBART and ET5 achieved 18% and 16% faster inference, respectively.
The cross-domain evaluation consolidated our findings further, indicating that EGPT-2 retains strong performance in a variety of Urdu text domains, such as news, blogs, and academic texts, with ROUGE-1 scores between 0.50 and 0.53. This generalization capability underlines the practical applicability of our approach in real-world scenarios. Qualitative analysis indicated that EGPT-2 produces summaries that are more concise yet informative, summarizing only the essential information while removing peripheral details.
Statistical validation via paired t-tests showed the significance of our improvements: all model comparisons yielded p-values < 0.05, proving the effectiveness of our pruning methodology. This research makes significant contributions to Urdu NLP as follows:
  • Developing a novel attention head pruning framework specifically tailored for low-resource languages;
  • Demonstrating that model optimization can simultaneously improve performance and computational efficiency;
  • Providing a comprehensive evaluation methodology for Urdu abstractive summarization;
  • Establishing transformer-based models as viable solutions for Urdu NLP tasks.
For future work, we plan to explore several promising directions:
  • Extending the pruning methodology to other low-resource languages with similar morphological complexity;
  • Integrating Urdu-specific morphological embeddings to enhance semantic understanding and capture language-specific features;
  • Investigating dynamic pruning techniques that adapt to input characteristics and domain requirements;
  • Deploying and evaluating our models in real-time applications such as streaming news feeds and social media monitoring;
  • Exploring multi-modal summarization approaches that combine text with other data modalities;
  • Developing domain adaptation techniques to further improve performance in specialized domains.
We conclude that our study establishes selective attention head pruning as an effective strategy for enhancing transformer models in low-resource language processing. The significant improvements in performance and efficiency, together with good generalization capabilities, make our optimized models practical for real-world Urdu text summarization applications. This work opens new avenues for the deployment of efficient NLP solutions in resource-constrained environments and contributes toward bridging the technological gap for under-resourced languages.

Author Contributions

Conceptualization, M.A.; Methodology, M.A.; Software, M.A. and A.A.; Validation, A.A., G.F., D.A.D. and M.B.; Investigation, M.B.; Writing—original draft, M.A. and A.A.; Writing—review & editing, G.F.; Visualization, A.A.; Supervision, M.A.; Funding acquisition, D.A.D. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the public domain and can be found at the following URL: https://huggingface.co/datasets/community-datasets/urdu_fake_news (accessed on 1 January 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Muhammad, M.; Jazeb, N.; Martinez-Enriquez, A.; Sikander, A. EUTS: Extractive Urdu Text Summarizer. In Proceedings of the 2018 17th Mexican International Conference on Artificial Intelligence (MICAI), Guadalajara, Mexico, 22–27 October 2018; pp. 39–44. [Google Scholar]
  2. Yu, Z.; Du, P.; Yi, L.; Luo, W.; Li, D.; Zhao, B.; Li, L.; Zhang, Z.; Zhang, J.; Zhang, J.; et al. Coastal Zone Information Model: A comprehensive architecture for coastal digital twin by integrating data, models, and knowledge. Fundam. Res. 2024. [Google Scholar] [CrossRef]
  3. Vijay, S.; Rai, V.; Gupta, S.; Vijayvargia, A.; Sharma, D.M. Extractive text summarisation in Hindi. In Proceedings of the 2017 International Conference on Asian Language Processing (IALP), Singapore, 5–8 December 2017; pp. 318–321. [Google Scholar]
  4. Rahimi, S.R.; Mozhdehi, A.T.; Abdolahi, M. An overview on extractive text summarization. In Proceedings of the 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran, Iran, 22 December 2017; pp. 54–62. [Google Scholar]
  5. Daud, A.; Khan, W.; Che, D. Urdu language processing: A survey. Artif. Intell. Rev. 2016, 27, 279–311. [Google Scholar] [CrossRef]
  6. Ali, A.R.; Ijaz, M. Urdu text classification. In Proceedings of the 2009 International Conference on Frontiers of Information Technology, Islamabad, Pakistan, 16–18 December 2009; pp. 1–4. [Google Scholar]
  7. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
  8. Egonmwan, E.; Chali, Y. Transformer-based model for single documents neural summarization. In Proceedings of the 2019 3rd Workshop on Neural Generation and Translation, Hong Kong, China, 4 November 2019; pp. 70–79. [Google Scholar]
  9. Abolghasemi, M.; Dadkhah, C.; Tohidi, N. HTS-DL: Hybrid text summarization system using deep learning. In Proceedings of the 2022 27th International Computer Conference, Computer Society of Iran (CSICC), Tehran, Iran, 23–24 February 2022; pp. 1–6. [Google Scholar]
  10. Jiang, J.; Zhang, H.; Dai, C.; Zhao, Q.; Feng, H.; Ji, Z.; Li, Y. Enhancements of attention-based bidirectional LSTM for hybrid automatic text summarization. IEEE Access 2021, 9, 123660–123671. [Google Scholar] [CrossRef]
  11. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  12. Chowdhury, S.A.; Abdelali, A.; Darwish, K.; Soon-Gyo, J.; Salminen, J.; Jansen, B.J. Improving Arabic Text Categorization Using Transformer Training Diversification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 226–236. [Google Scholar]
  13. Farooq, A.; Batool, S.; Noreen, Z. Comparing different techniques of Urdu text summarization. In Proceedings of the 2021 Mohammad Ali Jinnah University International Conference on Computing (MAJICC), Karachi, Pakistan, 15–16 December 2021; pp. 1–6. [Google Scholar]
  14. Asif, M.; Raza, S.A.; Iqbal, J.; Perwatz, N.; Faiz, T.; Khan, S. Bidirectional encoder approach for abstractive text summarization of Urdu language. In Proceedings of the 2022 International Conference on Business Analytics for Technology and Security (ICBATS), Dubai, United Arab Emirates, 16–17 February 2022; pp. 1–6. [Google Scholar]
  15. Khyat, J.; Lakshmi, S.S.; Rani, M.U. Hybrid Approach for Multi-Document Text Summarization by N-Gram and Deep Learning Models. J. Intell. Syst. 2021, 30, 123–135. [Google Scholar]
  16. Mujahid, K.; Bhatti, S.; Memon, M. Classification of URDU headline news using Bidirectional Encoder Representation from Transformer and Traditional Machine learning Algorithm. In Proceedings of the IMTIC 2021—6th International Multi-Topic ICT Conference: AI Meets IoT: Towards Next Generation Digital Transformation, Karachi, Pakistan, 24–25 November 2021; pp. 1–6. [Google Scholar]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  18. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  19. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  20. Michel, P.; Levy, O.; Neubig, G. Are Sixteen Heads Really Better than One? In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 14014–14024. [Google Scholar]
  21. Cheema, A.S.; Azhar, M.; Arif, F.; Sohail, M.; Iqbal, A. EGPT-SPE: Story Point Effort Estimation Using Improved GPT-2 by Removing Inefficient Attention Heads. Appl. Intell. 2025, 55, 994. [Google Scholar] [CrossRef]
  22. Savelieva, A.; Au-Yeung, B.; Ramani, V. Abstractive Summarization of Spoken and Written Instructions with BERT. arXiv 2020, arXiv:2008.09676. [Google Scholar] [CrossRef]
  23. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. arXiv 2020, arXiv:2010.11934. [Google Scholar]
  24. Munaf, M.; Afzal, H.; Mahmood, K.; Iltaf, N. Low Resource Summarization Using Pre-trained Language Models. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2024, 23, 141. [Google Scholar] [CrossRef]
  25. Salau, A.O.; Arega, K.L.; Tin, T.T.; Quansah, A.; Sefa-Boateng, K.; Chowdhury, I.J.; Braide, S.L. Machine learning-based detection of fake news in Afan Oromo language. Bull. Electr. Eng. Inform. 2024, 13, 4260–4272. [Google Scholar] [CrossRef]
  26. Rauf, F.; Irfan, R.; Mushtaq, L.; Ashraf, M. Fake News Detection in Urdu Using Deep Learning. VFAST Trans. Softw. Eng. 2022, 10, 151–165. [Google Scholar] [CrossRef]
  27. Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. A Systematic Review and Experimental Evaluation of Classical and Transformer-Based Models for Urdu Abstractive Text Summarization. Information 2025, 16, 784. [Google Scholar] [CrossRef]
Figure 1. The proposed framework for training and evaluating efficient Transformer summarizers (EBART, ET5, and EGPT-2).
Figure 2. Schematic of inter-layer information flow before and after attention head pruning.
Figure 3. Attention head pruning process based on individual head contributions in multi-head attention.
Figure 4. Heatmap of model performance across metrics.
Figure 5. Comparison of models by ROUGE and BLEU scores.
Figure 6. Ablation study: effect of pruning ratio on EGPT-2 ROUGE-1 performance.
Figure 7. Computational efficiency gains: inference time and memory usage reduction.
Figure 8. EGPT-2 performance consistency across different Urdu text domains.
Figure 9. Comparison of text summarization models: impact of input sentence count on summary generation.
Figure 10. Visual comparison of summaries generated by different models. Subfigures (a–f) present the outputs of the original and pruned model variants: (a) BART, (b) T5, (c) GPT-2, (d) EBART, (e) ET5, and (f) EGPT-2.
Table 1. Quantitative performance comparison of Transformer models for Urdu abstractive summarization.

Model     ROUGE-1   ROUGE-2   ROUGE-L   BLEU Score
BART      0.48      0.31      0.40      0.35
EBART     0.50      0.33      0.42      0.37
T5        0.46      0.28      0.38      0.32
ET5       0.49      0.30      0.41      0.34
GPT-2     0.45      0.27      0.36      0.30
EGPT-2    0.52      0.36      0.45      0.40
Table 2. Ablation study: effect of pruning ratio on EGPT-2 performance and efficiency.

Pruning Ratio    ROUGE-1   Inference Time (s)   Memory Usage (GB)
0% (Original)    0.45      0.89                 4.2
15%              0.48      0.78                 3.7
30%              0.52      0.73                 3.4
45%              0.49      0.70                 3.1
Table 3. Computational efficiency comparison: inference time and memory usage.

Model     Inference Time (s)   Reduction   Memory Usage (GB)   Reduction
BART      0.85                 -           4.0                 -
EBART     0.70                 18%         3.3                 17%
T5        0.88                 -           4.1                 -
ET5       0.74                 16%         3.5                 15%
GPT-2     0.89                 -           4.2                 -
EGPT-2    0.73                 22%         3.4                 19%
Table 4. Cross-domain generalization performance of EGPT-2.

Domain     ROUGE-1   ROUGE-2   ROUGE-L   BLEU
News       0.53      0.37      0.46      0.41
Blogs      0.51      0.35      0.44      0.39
Academic   0.50      0.34      0.43      0.38
Average    0.51      0.35      0.44      0.39
