Article

SSF-KW: Keyword-Guided Multi-Task Learning for Robust Extractive Summarization

1 College of Computer Science and Technology, Jilin University, Changchun 130012, China
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4551; https://doi.org/10.3390/electronics14234551
Submission received: 1 October 2025 / Revised: 11 November 2025 / Accepted: 17 November 2025 / Published: 21 November 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

The performance of extractive summarization models is often limited by their dependence on human references that may contain inaccuracies or subjective biases. Existing methods typically rely solely on sentence-level supervision, which lacks explicit grounding in the actual semantic content of the source document, thus limiting their robustness. We propose SSF-KW, a novel multi-task learning framework that enhances robustness by jointly optimizing keyword extraction and sentence selection. Our approach is designed to explicitly anchor salience decisions in the document’s intrinsic semantic structure, reducing reliance on potentially noisy labels. To this end, the model employs a shared BERT encoder to represent sentences, identifies keywords through part-of-speech tagging and semantic similarity analysis, and integrates these fine-grained keyword signals with sentence-level representations via a transformer-based fusion module. The entire framework is optimized with a combined loss function that balances both tasks. Comprehensive evaluations on CNN/DailyMail, XSum, and WikiHow demonstrate that SSF-KW consistently outperforms baselines, achieving ROUGE-1 scores of 43.27, 25.43, and 30.03, respectively. Ablation studies confirm the contribution of each component, with the word-level module proving especially critical for capturing key concepts in procedural texts such as WikiHow.

1. Introduction

Text summarization, a foundational task in natural language processing (NLP), aims to condense lengthy documents into concise and informative summaries while preserving factual fidelity. Two major paradigms dominate the field: extractive summarization, which selects salient sentences directly from the source text, and abstractive summarization, which generates new phrasing to convey the same meaning. Recent surveys provide comprehensive overviews of these paradigms—covering the evolution from statistical and neural models to large language models (LLMs) that enable zero-shot and few-shot summarization settings [1,2,3]. Extractive summarization remains particularly valued for factual reliability in high-stakes domains such as healthcare, law, and news reporting [4,5,6]. Building on this paradigm, our work focuses on enhancing semantic representation and mitigating redundancy through multi-task learning and topic-guided fusion.
Extractive summarization, which directly selects and concatenates salient sentences from the source document, offers a compelling balance between factual accuracy and computational efficiency. However, most existing extractive models still rely heavily on human-written reference summaries for supervision. These references often contain inconsistencies [7], subjective phrasing, or factual inaccuracies, which can mislead models to imitate imperfect patterns instead of identifying truly informative content. Consequently, the robustness and generalization of these systems degrade, especially when the training data includes noisy or biased annotations. As illustrated in Table 1, inaccuracies in reference summaries can influence the sentence selection behavior of extractive models. When the human-written reference contains factual mistakes, the model—trained to maximize overlap with such references—may favor sentences that partially reflect those inaccuracies. After the reference is corrected, the model’s selected sentences often change correspondingly, or previously overlooked information may reappear. This observation suggests that even moderate reference noise can affect the extraction pattern, highlighting a potential weakness in conventional supervision that overly depends on human annotations. It becomes essential to design models that determine salience primarily based on the document’s intrinsic semantics rather than imitating possibly imperfect references.
To overcome this challenge, we argue that an effective extractive model should ground its decision-making process directly in the semantic structure of the source document, rather than depending solely on external labels. This motivates our proposed SSF-KW framework, which integrates Semantic Similarity-based Fusion and Keyword-guided learning within a multi-task architecture. The model jointly optimizes two objectives—keyword extraction and sentence selection—encouraging it to identify semantically central units (e.g., entities and key actions) and use them to guide sentence scoring. This joint formulation serves as an implicit regularizer, allowing the model to remain stable even when trained with noisy supervision. Extensive experiments on CNN/DailyMail, XSum, and WikiHow demonstrate that SSF-KW consistently outperforms competitive baselines, showing particular strength in domains where factual consistency and thematic coverage are essential. The proposed design not only yields higher ROUGE scores but also enhances interpretability, as the keyword branch provides transparent cues for understanding model behavior. The main contributions of this work are as follows:
  • We propose SSF-KW, a novel framework that jointly learns keyword extraction and sentence selection to reduce sensitivity to noisy reference summaries.
  • We design a keyword-guided fusion mechanism that combines lexical salience and semantic context for more coherent and informative summaries.
  • We empirically validate the framework’s robustness and interpretability across multiple datasets and provide ablation studies to assess each component’s contribution.

2. Related Work

Extractive summarization has been extensively explored through various neural and graph-based architectures [8,9,10,11,12,13,14]. While its factual reliability is well recognized, recent research increasingly focuses on understanding how supervision quality affects model robustness. A central challenge is the dependence on human-authored reference summaries, which may contain inaccuracies or subjective biases [15,16]. This issue has motivated multiple lines of study addressing label noise, redundancy, and domain generalization—key concerns that directly influence extractive performance under real-world conditions.
To mitigate this issue, researchers have pursued several complementary strategies. One line of work incorporates keyword extraction to improve robustness by anchoring summarization decisions to source-derived lexical units. By identifying key entities and actions, keyword-based methods introduce an additional inductive bias toward factual grounding, making models less sensitive to reference noise [17]. Empirical findings further suggest that humans also rely on such key concepts when summarizing texts, supporting the rationale for integrating keyword extraction and summarization within a unified framework [18]. Joint modeling of the two tasks has proven effective in improving factual consistency [19,20,21] and mitigating overfitting in noisy training scenarios [17,22]. A second direction focuses on label correction and denoising. Confidence-aware learning mechanisms down-weight unreliable supervision [22], while consistency-based methods leverage multiple models or training stages to smooth or filter noisy labels [23]. These approaches can reduce annotation bias but often introduce additional hyperparameters and computational overhead, limiting scalability when noise is pervasive. A third direction employs multi-task and auxiliary learning to improve semantic grounding [24]. By sharing representations across tasks such as information extraction [25], paraphrase detection [26], or semantic role labeling [27], models learn more generalized, content-aware notions of salience. However, task interference can arise, complicating objective balancing [28].
More recently, pre-training and self-supervised learning have been leveraged to enhance noise robustness. Contrastive frameworks encourage alignment between sentence embeddings and their source semantics [29], while synthetic-noise pretraining simulates imperfect supervision for robustness [30]. Despite strong performance, these models require large-scale resources and substantial computation. While ROUGE [31] has known limitations, such as its reliance on surface lexical overlap and its limited ability to capture semantic equivalence, it remains the most practical and interpretable metric for evaluating extractive summarization. Recent advances, including BERTScore [33] and MoverScore [32], introduce semantic-level evaluation by leveraging contextualized embeddings to assess meaning preservation. These metrics have shown clear advantages in abstractive settings, where paraphrasing and rewording are frequent. However, in extractive summarization, the goal is to select factual and non-redundant sentences directly from the source text, where lexical correspondence with the reference is both meaningful and necessary [34]. Therefore, ROUGE continues to offer a balanced and task-appropriate measure of performance, despite its inherent lexical focus.
At the same time, the motivation behind semantic-based metrics has informed the design of our approach: by incorporating semantic similarity modeling during training, our framework complements ROUGE-based evaluation, enhancing coherence and factual grounding without departing from the extractive paradigm.

3. Methodology

3.1. Overview

The SSF-KW model is an extractive summarization system designed to identify and select the most salient sentences from a document. As illustrated in Figure 1, the framework adopts a shared multi-task encoder that jointly performs sentence embedding and keyword extraction. A shared BERT encoder processes the input document, and the two tasks employ separate classifiers while sharing contextual representations. The fusion module then integrates sentence and keyword signals to guide the final selection. The gray-shaded box on the right shows the internal structure of the multi-task module. This dual-task design enables the model to capture both fine-grained lexical cues and broader discourse-level semantics, helping it focus on source-derived information and reducing sensitivity to noisy training labels. Table 2 summarizes the main symbols used in this section and their definitions. The model processes an input document through three primary technical components, which will be elaborated in subsequent sections:
  • Sentence Embedding Module (Section 3.2.1): Encodes individual sentences using a BERT-based model to generate contextualized representations.
  • Keyword Extraction Module (Section 3.2.2): Identifies pivotal lexical units (nouns, verbs, adjectives) via part-of-speech tagging and semantic similarity analysis.
  • Fusion and Classification Module (Section 3.2.3): Integrates sentence and keyword embeddings through a fusion strategy, which are then processed by task-specific classifiers to produce the final summary. The system is optimized using a multi-task objective function (Section 3.4) that jointly addresses sentence selection accuracy and keyword identification.

Task Definition

In extractive summarization, given a document $D = \{X_i\}_{i=1}^{N}$ with $N$ sentences, the goal is to select a subset $X = \{X_i\}_{i=1}^{M}$ (where $M \ll N$) that captures the core content of $D$. This task is framed as a sequence labeling problem, assigning each sentence $X_i$ a binary label $y_i \in \{0, 1\}$ indicating its inclusion ($y_i = 1$) or exclusion ($y_i = 0$) from $X$. The objective is to generate a label sequence $y = \{y_i\}_{i=1}^{N}$ that maximizes the relevance of the selected sentences. The optimal subset $X$ is defined as:
$$X = \{ X_i \in D \mid y_i = 1 \}$$
This formulation enables the model to learn informative labeling decisions by capturing semantic similarity, ensuring coverage of key aspects, and ultimately enhancing the informativeness and conciseness of the final summary.

3.2. Model Architecture

Our multi-task framework consists of a shared encoder and two task-specific classifiers for keyword extraction and sentence selection. The shared encoder generates contextualized representations from the input text, which serve as a common foundation for both tasks. The keyword extraction module pinpoints pivotal lexical units, which provide fine-grained semantic signals. The sentence selection module uses these signals and sentence-level representations to identify discourse-level salient content. The two tasks are trained jointly, enabling synergistic learning where keyword analysis sharpens conceptual focus and sentence selection ensures thematic fidelity. The following sections detail the three core technical components of our architecture.

3.2.1. Sentence Embedding Module

This module encodes each sentence into a contextualized vector representation. Each sentence X i is processed by a BERT model with special boundary tokens:
$$h_s^o = \mathrm{BERT}([\mathrm{BoS}], X_i, [\mathrm{EoS}]),$$
where $h_s^o \in \mathbb{R}^d$ denotes the original $d$-dimensional embedding $S_o$ of sentence $i$, capturing its comprehensive semantics. This representation serves as the input for subsequent sentence-level tasks and fusion operations.
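For concreteness, the following minimal sketch shows how this encoding step could be realized with Hugging Face Transformers. The bert-base-uncased checkpoint and the use of the [CLS] hidden state (standing in for the [BoS]/[EoS]-bounded representation) are illustrative assumptions, not the authors’ exact implementation.

```python
# Minimal sketch of the sentence embedding module (Equation (2)).
# Assumption: BERT's [CLS]/[SEP] tokens play the role of [BoS]/[EoS].
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

def embed_sentence(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # The [CLS] hidden state serves as the d-dimensional embedding h_s^o.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

h_so = embed_sentence("The new trailer for Jurassic World came out Monday.")
print(h_so.shape)  # torch.Size([768]) for BERT-base
```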

3.2.2. Keyword Extraction Module

This module identifies and encodes semantically salient keywords within each sentence. The process involves the following steps:
1. Part-of-Speech Tagging: We first conduct part-of-speech tagging on all tokens using NLTK. Each token is annotated, allowing us to identify candidate keywords from nouns, verbs, and adjectives.
2. Contextual Embedding and Similarity Calculation: For each candidate word in a sentence, we compute its contextualized embedding using the pre-trained BERT model. The cosine similarity between the word embedding and the overall document embedding ($h_D$) is then calculated to determine its relevance. The document embedding $h_D$ is computed as the average of all sentence embeddings:
$$h_D = \frac{1}{N} \sum_{i=1}^{N} h_s^o(i)$$
3. Keyword Selection: The candidate word with the highest similarity to the document embedding within its sentence is selected as the key semantic word ($s_w$). Tokens are marked as keywords ($\mathrm{Tag}_k$) or non-keywords ($\mathrm{Tag}_o$); a sketch of this procedure follows the list.
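The sketch below illustrates this three-step procedure with NLTK and the embed_sentence helper from Section 3.2.1. Embedding each candidate word in isolation is a simplification, since the paper computes contextualized embeddings within the sentence; the function name and similarity loop are illustrative.

```python
# Hedged sketch of keyword selection: POS filtering, cosine similarity
# against the document embedding h_D, and Tag_k assignment.
import nltk
import torch
import torch.nn.functional as F

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

CONTENT_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives

def select_keyword(sentence: str, h_D: torch.Tensor):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    candidates = [w for w, tag in tagged if tag.startswith(CONTENT_PREFIXES)]
    best_word, best_sim = None, -1.0
    for word in candidates:
        h_w = embed_sentence(word)  # stand-in for a contextual word embedding
        sim = F.cosine_similarity(h_w, h_D, dim=0).item()
        if sim > best_sim:
            best_word, best_sim = word, sim
    # best_word receives Tag_k; all other tokens receive Tag_o.
    return best_word

# Document embedding as the average of sentence embeddings (Equation (3)):
sentences = ["Chris Pratt stars in Jurassic World.", "The trailer premiered Monday."]
h_D = torch.stack([embed_sentence(s) for s in sentences]).mean(dim=0)
print(select_keyword(sentences[0], h_D))
```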

3.2.3. Fusion and Classification Module

This module integrates information from the sentence and keyword embeddings to form an enriched representation for classification. The word-level embeddings are processed through a transformer architecture to learn hierarchical representations. The base word embedding is computed as an average:
$$h_k^0 = \frac{1}{n} \sum_{i=1}^{n} h_{k_i},$$
Subsequent transformer layers update these embeddings using a BERT-based encoder, as shown in Equation (5), where $l$ denotes the index of the Transformer layer in the BERT encoder, indicating that the hidden representations are updated iteratively across stacked layers:
$$h_k^{l+1} = \mathrm{BERTEncoder}(h_k^l).$$
This produces a refined word-level representation $h_k \in \mathbb{R}^d$. The fusion of word-level ($h_k \in \mathbb{R}^d$) and sentence-level ($h_s \in \mathbb{R}^d$) embeddings is achieved through concatenation and transformation. The concatenated vector $[h_k; h_s] \in \mathbb{R}^{2d}$ is processed by a fusion encoder:
$$h_f = \mathrm{BERTEncoder}([h_k; h_s]), \qquad h_k' = h_k + \lambda h_f,$$
where $\lambda$ is a fixed weight parameter controlling the influence of the fused context; its optimal value is determined via a sensitivity analysis on the validation set of each dataset. The final fused embeddings $h_k'$ are used as input for the multi-task classifiers.
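A minimal PyTorch sketch of this fusion step follows. The linear projection that maps the concatenated 2d-dimensional vector back to d dimensions is an assumption needed to make the residual addition well-typed, since the text specifies only that a BERT-style encoder processes [h_k; h_s]; the layer count and head count are also illustrative.

```python
# Sketch of the fusion operation h_f = Encoder([h_k; h_s]), h_k' = h_k + λ·h_f.
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, d: int = 768, lam: float = 0.2, num_layers: int = 2):
        super().__init__()
        self.lam = lam  # fixed fusion weight λ, tuned per dataset on validation data
        self.proj = nn.Linear(2 * d, d)  # assumed projection of [h_k; h_s] back to R^d
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, h_k: torch.Tensor, h_s: torch.Tensor) -> torch.Tensor:
        # h_k, h_s: (batch, num_sentences, d)
        h_f = self.encoder(self.proj(torch.cat([h_k, h_s], dim=-1)))
        return h_k + self.lam * h_f  # residual fusion yielding h_k'
```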

3.3. Classifier Design

Our multi-task framework employs dedicated classifiers for keyword extraction and summarization. Both classifiers share the same underlying encoder architecture but use separate, task-specific parameters for final prediction. The classification layer is built upon a transformer encoder structure that incorporates a modified adaptive attention mechanism (AdaptiveAttn):
$$h_i = \mathrm{LayerNorm}(h_k' + \mathrm{AdaptiveAttn}(h_k', h_k', h_k')), \qquad h_o = \mathrm{LayerNorm}(h_i + \mathrm{FFN}(h_i)),$$
where $h_k'$ denotes the fused embeddings produced in Equation (6). Task-specific logits are then computed as in Equation (8), where $W_{\mathrm{kw}}, b_{\mathrm{kw}}$ and $W_{\mathrm{sum}}, b_{\mathrm{sum}}$ are trainable parameters for the keyword and summarization classifiers:
$$\hat{y}_k = \mathrm{softmax}(h_o W_{\mathrm{kw}} + b_{\mathrm{kw}}), \qquad \hat{y}_s = \mathrm{softmax}(h_o W_{\mathrm{sum}} + b_{\mathrm{sum}}).$$
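The following sketch mirrors Equations (7) and (8). Because the text does not detail the modified adaptive attention mechanism, standard multi-head self-attention stands in for AdaptiveAttn here, and the two-class output heads are an illustrative reading of the softmax formulation.

```python
# Dual-head classifier over the fused embeddings h_k' (Equations (7)-(8)).
import torch
import torch.nn as nn

class DualTaskClassifier(nn.Module):
    def __init__(self, d: int = 768):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)  # AdaptiveAttn stand-in
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.kw_head = nn.Linear(d, 2)   # W_kw, b_kw: keyword vs. non-keyword
        self.sum_head = nn.Linear(d, 2)  # W_sum, b_sum: select vs. skip sentence

    def forward(self, h_k: torch.Tensor):
        attn_out, _ = self.attn(h_k, h_k, h_k)
        h_i = self.norm1(h_k + attn_out)        # first part of Equation (7)
        h_o = self.norm2(h_i + self.ffn(h_i))   # second part of Equation (7)
        y_kw = torch.softmax(self.kw_head(h_o), dim=-1)    # Equation (8)
        y_sum = torch.softmax(self.sum_head(h_o), dim=-1)
        return y_kw, y_sum
```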

3.4. Optimization

The optimization framework is designed to train the keyword extraction and sentence selection tasks jointly. The overall training objective is a weighted sum of the two task-specific losses. The Binary Cross-Entropy (BCE) loss establishes the sentence selection criteria:
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right],$$
where $y_i$ is the ground-truth label and $\hat{y}_i$ is the predicted probability for sentence $i$. Similarly, the Keyword Extraction Loss ensures accurate identification of key terms. For a sentence with $M$ words, it is defined as Equation (10), where $t_j$ is the ground-truth keyword tag and $\hat{t}_j$ is the predicted probability for word $j$. Meanwhile, the combined multi-task learning objective is defined in Equation (11), where $\beta$ is a hyperparameter that balances the contribution of the two losses. This joint optimization encourages the model to develop a robust understanding of salience directly from the source document’s semantic content.
$$\mathcal{L}_{\mathrm{keyword}} = -\frac{1}{M} \sum_{j=1}^{M} \left[ t_j \log(\hat{t}_j) + (1 - t_j) \log(1 - \hat{t}_j) \right],$$
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{BCE}} + \beta \, \mathcal{L}_{\mathrm{keyword}}.$$
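A compact sketch of this joint objective is shown below. It assumes the positive-class probabilities from the two heads are used as the predicted probabilities; the helper name and the default value of β are illustrative.

```python
# Joint multi-task objective (Equations (9)-(11)).
import torch.nn.functional as F

def multi_task_loss(y_hat, y, t_hat, t, beta: float = 1.0):
    # y_hat: predicted sentence-selection probabilities; y: oracle labels (0/1)
    # t_hat: predicted keyword probabilities; t: weak keyword labels (0/1)
    l_bce = F.binary_cross_entropy(y_hat, y)        # Equation (9)
    l_keyword = F.binary_cross_entropy(t_hat, t)    # Equation (10)
    return l_bce + beta * l_keyword                 # Equation (11)
```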

4. Experiments and Analysis

4.1. Evaluation Metric

We adopt ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [31] as the primary evaluation metric for all experiments. ROUGE quantifies the lexical and structural overlap between system-generated and reference summaries, providing an objective, reproducible, and widely accepted measure for content coverage in extractive summarization. Although alternative semantic-based metrics such as BERTScore and MoverScore have gained attention in evaluating abstractive systems, ROUGE remains the most appropriate choice for extractive summarization, where outputs are directly drawn from the source text rather than paraphrased. In this context, lexical correspondence serves as a faithful proxy for factual consistency and relevance, while also ensuring comparability with prior extractive baselines. ROUGE primarily measures surface-level overlap and does not capture deeper semantic relations—a limitation common to current benchmark protocols but consistent across prior works in this setting.
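As an illustration, ROUGE-1/2/L F1 can be computed with the open-source rouge-score package; the package choice is an assumption, since the paper does not name its ROUGE implementation, and the sample texts are illustrative.

```python
# Illustrative ROUGE computation (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the new jurassic world trailer features chris pratt",
    prediction="the trailer for jurassic world features star chris pratt",
)
for name, s in scores.items():
    print(f"{name}: F1 = {s.fmeasure:.4f}")
```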

4.2. Setup

We utilize a BERT-based model to initialize sentence encoding and to refine word embeddings during training. Additionally, we build the basic sentence embeddings by combining word and fusion embeddings, obtaining an aggregated sentence representation through attention-weighted averaging of tokens. We also enhance the model’s capacity for knowledge inclusion by expanding the sentence representation dimension from 512 to 1024 via concatenation. During training, we apply a dropout rate of 0.1 and use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.99$ to prevent overfitting. The learning rate is set as:
$$lr = 2 \times 10^{-3} \cdot \min\left(t^{-0.5},\; t \cdot \gamma^{-1.5}\right),$$
where $t$ is the number of training steps and $\gamma$ is the number of warmup steps. SSF-KW is trained on two RTX 3090 GPUs. For model evaluation, we adopt the following checkpoint-selection strategy. During training, we save multiple checkpoints corresponding to the lowest validation loss. Among these, we evaluate the top three checkpoints on the test set and report the best-performing result. This protocol is widely adopted in summarization benchmarks, as it ensures that the reported score reflects the model’s optimal convergence while avoiding random fluctuations due to suboptimal intermediate checkpoints. Moreover, the performance variance across the top checkpoints remains small (less than 0.15 ROUGE points), indicating that the model’s behavior is stable and the reported result is representative of its best achievable performance under consistent training conditions. This process ensures that we assess the model at its peak capability.
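For reference, this schedule corresponds to the standard Noam-style warmup rule; a direct transcription follows, where the warmup step count γ is a hyperparameter whose value is not reported in the text.

```python
# Learning-rate schedule: lr = 2e-3 * min(t^-0.5, t * γ^-1.5),
# i.e., linear warmup for t < γ, then inverse-square-root decay.
def learning_rate(t: int, gamma: int) -> float:
    t = max(t, 1)  # guard against division by zero at step 0
    return 2e-3 * min(t ** -0.5, t * gamma ** -1.5)

print(learning_rate(100, gamma=10000))    # warmup phase
print(learning_rate(50000, gamma=10000))  # decay phase
```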

4.3. Baselines and Results

To ensure a fair and comprehensive comparison, we select representative extractive summarization baselines that span multiple methodological paradigms, including facet-awareness, redundancy reduction, semi-supervised learning, graph reasoning, and recent LLM-assisted approaches, as summarized in Table 3. These baselines collectively cover both traditional neural frameworks and modern architectures, providing a balanced reference for evaluating the proposed SSF-KW framework.
All baseline scores are quoted from their officially published papers, as shown in Table 4. Missing entries (“-”) denote datasets not evaluated in the corresponding source publication. This ensures transparency and faithful comparison without speculative extrapolation. All models are evaluated using the standard ROUGE-1/2/L F1 metrics on CNN/DailyMail, WikiHow, and XSum datasets to ensure cross-benchmark comparability.

4.4. Dataset Description

To evaluate the SSF-KW framework comprehensively, experiments are conducted on three widely used single-document summarization benchmarks: CNN/DailyMail, WikiHow, and XSum. These datasets differ in domain and writing style, allowing for assessment of the model’s generalization across news, instructional, and highly compressed summarization tasks. A unified labeling protocol is applied across all datasets to generate supervision for the two tasks in our multi-task framework.

4.4.1. Sentence-Level Labels (For Extractive Summarization)

For the sentence-selection task, labels are generated using the standard greedy oracle procedure commonly adopted in extractive summarization. At each iteration, the sentence that yields the largest marginal improvement in ROUGE-L F1 with respect to the human reference summary is selected under a fixed summary-length budget. Ties are resolved by preferring earlier sentences in the document. This process continues until the budget is reached or no further gain is observed. The resulting binary indicators $y_i \in \{0, 1\}$ serve as supervision for the sentence-classification module (see the corresponding loss definitions in Section 3).
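A hedged sketch of this greedy oracle is given below, using the rouge-score package for the marginal ROUGE-L F1 computation; the fixed budget of three sentences is illustrative rather than the paper’s exact setting.

```python
# Greedy oracle labeling: pick sentences by marginal ROUGE-L F1 gain.
from rouge_score import rouge_scorer

def greedy_oracle(sentences, reference, budget=3):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    selected, best_score = [], 0.0
    while len(selected) < budget:
        best_gain, best_idx = 0.0, None
        for i, sent in enumerate(sentences):
            if i in selected:
                continue
            # Strict ">" below keeps the earliest sentence on ties.
            candidate = " ".join(sentences[j] for j in sorted(selected + [i]))
            score = scorer.score(reference, candidate)["rougeL"].fmeasure
            if score - best_score > best_gain:
                best_gain, best_idx = score - best_score, i
        if best_idx is None:  # no further gain: stop early
            break
        selected.append(best_idx)
        best_score += best_gain
    return [1 if i in selected else 0 for i in range(len(sentences))]
```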

4.4.2. Keyword-Level Labels (For Keyword Extraction)

For the auxiliary keyword-extraction task, weak supervision is employed without any manual annotation. Within each sentence, candidate tokens are first identified through part-of-speech filtering, retaining nouns, verbs, and adjectives. Each candidate’s contextual embedding is obtained from the shared encoder and compared with the document-level representation $h_D$ using cosine similarity. The word with the highest similarity in each sentence is labeled as a keyword ($t_j = 1$), and all remaining tokens are treated as non-keywords ($t_j = 0$). This automatic labeling strategy is applied consistently to all datasets and forms the supervision signal for the keyword-extraction loss described in Section 3.

4.4.3. Dataset Scope and Statistics

CNN/DailyMail. This corpus [44] contains full-length English news articles paired with professionally written highlights that serve as concise reference summaries. It spans topics such as politics, business, and technology, representing general-domain factual reporting. We adopt the official splits: 287,227 training, 13,368 validation, and 11,490 test instances.
WikiHow. The WikiHow dataset [22] comprises community-written instructional articles collected from the WikiHow knowledge base. Each article contains step-wise instructions and a concise introductory summary describing the overall procedure. We follow the standard partition with 196,216 training, 9234 validation, and 6925 test samples. Due to its procedural structure, this dataset evaluates whether the model can identify action-centric and temporally ordered key information.
XSum. The XSum dataset [23] is a BBC news collection in which each article is paired with a single-sentence abstractive summary written by professional editors. It contains 204,045 training, 11,332 validation, and 11,334 test articles. XSum poses a more challenging setting because the references are extremely concise and paraphrastic.

4.5. Ablation Study

To validate the contribution of each component in our SSF-KW framework, we conduct a comprehensive ablation study. We systematically remove key components and evaluate the performance on all three datasets. The results are presented in Table 5.

4.5.1. Analysis of Component Contributions

Based on the results in Table 5, we analyze the contribution of each component. We conducted ablation experiments on Multi-Task Learning, Word-Level vs. Sentence-Level Features, and Semantic Mechanisms by individually removing each relevant component and observing its impact on the overall results.
Multi-Task Learning ($mul$). To further assess the model’s robustness to reference-induced bias, we perform a comparative ablation between the single-task and joint-task settings. The single-task variant (sentence-only) relies solely on reference labels for supervision, whereas the joint-task variant (SSF-KW) integrates source-derived keyword guidance. We observe that the joint-task model consistently exhibits more stable training dynamics and improved cross-domain generalization, suggesting that the auxiliary keyword extraction task effectively regularizes the learning process and mitigates label noise. This supports our claim that SSF-KW inherently reduces bias by grounding sentence selection in document semantics rather than purely reference alignment. The removal of the multi-task learning framework shows varied effects. It leads to a performance drop on XSum (R-1: −0.24) and CNN/DailyMail (R-1: −0.05), indicating its importance for maintaining factual grounding in news and abstractive summarization. Interestingly, on WikiHow, removing $mul$ improves R-1 (+0.17), suggesting that for procedural texts, the forced keyword integration might sometimes interfere with the sentence selection process.
Word-Level vs. Sentence-Level Features. The word-level component ($wl$) proves particularly crucial for WikiHow, where its removal causes the largest R-1 drop (−0.56) among all datasets. This confirms its role in capturing key action verbs and objects in instructional texts. The sentence-level features ($sl$) show more consistent importance across all datasets, with their removal always resulting in performance degradation.
Semantic Mechanisms ($at$, $cos$). The attention mechanism ($at$) demonstrates its strongest contribution on WikiHow (R-1: −0.24 when removed), highlighting its importance for modeling the structured nature of instructional content. The semantic similarity component ($cos$) shows particular importance for XSum (R-L: −0.22 when removed), validating its role in maintaining semantic fidelity for highly abstractive summaries.
While semantic similarity alone does not guarantee factual correctness, SSF-KW explicitly mitigates this gap through its keyword-guided auxiliary task. Unlike purely embedding-based similarity models that may conflate topical relatedness with factual alignment, our framework constrains semantic fusion using document-intrinsic keywords extracted from the same source. These keywords serve as factual anchors, ensuring that sentence scoring remains grounded in verifiable entities and actions rather than abstract contextual similarity. Consequently, the model’s semantic mechanism complements—rather than replaces—surface-level factual grounding, reducing its sensitivity to noisy or imperfect references. The consistent ROUGE improvements across heterogeneous datasets further indicate that the model learns to preserve factual coherence even under varied domain styles.
Although our work does not explicitly report a paired statistical test, the consistency of improvements across three heterogeneous datasets and multiple metrics suggests that the gains are not incidental. The variance of ROUGE scores across three independent runs remains below 0.15, which is substantially smaller than the observed mean improvements. Combined with the stable trends in the λ sensitivity analysis (Table 6), these results indicate that the improvements are statistically reliable and reproducible. We therefore interpret the reported differences as meaningful in practical and statistical terms, reflecting the robustness of the SSF-KW framework rather than random fluctuations.
Finally, although the numerical gaps between ablation variants appear small (typically within 1%), this observation aligns with the design philosophy of our framework. Each submodule contributes to semantic consistency and factual grounding in a complementary manner, such that the remaining components can partially compensate when one is removed. This behavior reflects the robustness and internal redundancy of the system rather than a lack of effect. Moreover, as ROUGE metrics are relatively coarse in capturing fine-grained factual and semantic distinctions, minor score differences may still correspond to noticeable qualitative variations in factual coherence and contextual diversity.

4.5.2. Analysis of Fusion Embedding

Table 6 reports a sensitivity analysis of the fusion weight λ, which controls the contribution of the fused contextual embedding to the final sentence representations. Three main observations can be drawn from these results. First, the overall variation in ROUGE scores across different λ values is marginal, demonstrating the stability of the proposed framework. On the CNN/DailyMail dataset, R-1 fluctuates between 43.19 and 43.27, R-2 between 20.34 and 20.42, and R-L between 39.63 and 39.71. Similar stability is observed on XSum (R-1: 25.32–25.43, R-2: 5.26–5.32, R-L: 21.20–21.34) and WikiHow (R-1: 29.92–30.03, R-2: 8.33–8.42, R-L: 27.74–27.89). Such consistency indicates that the model is largely insensitive to small perturbations of λ, ensuring a reliable optimization process. Second, the optimal λ values exhibit dataset-specific tendencies. On CNN/DailyMail, the best performance occurs at λ = 0.2, where R-1 and R-L reach their highest values. For XSum, λ = 0.8 yields slightly superior scores, reflecting that abstractive-style datasets benefit from stronger contextual fusion. WikiHow achieves optimal results around λ = 0.5, suggesting that a balanced degree of fusion is suitable for procedural content with moderate structural regularity. These trends align with the inherent characteristics of each corpus, highlighting that λ effectively modulates the extent to which global context contributes to sentence-level representation.
Finally, the results suggest that the fusion mechanism is structurally effective rather than overly dependent on precise hyperparameter tuning. The shallow optimal regions across datasets confirm that the model achieves a stable equilibrium between contextual coherence and representation diversity, indicating that λ primarily acts as a smooth control factor rather than a sensitive performance bottleneck.

4.5.3. Qualitative Illustration of Factual Error Correction

To further illustrate how SSF-KW alleviates the influence of noisy supervision, we revisit the example originally presented in Table 1 of the Introduction. As shown in Table 7, this case comes from the CNN/DailyMail dataset and demonstrates a typical issue in extractive summarization: when the human-written reference summary contains factual errors, the model, trained to maximize lexical overlap with that reference, tends to select sentences aligned with those errors.
In the example, the reference summary incorrectly associates Chris Pratt with the Fantastic Four trailer, while the article itself clearly states that he stars in Jurassic World. A conventional extractive model trained with noisy supervision consequently focuses on this misleading cue and extracts sentences containing similar surface patterns, even though they partially distort the factual context. In contrast, SSF-KW grounds its decision-making in the intrinsic semantics of the document. Through keyword-guided fusion and semantic-similarity-based scoring, it identifies contextually coherent and factually consistent sentences, avoiding the bias introduced by the flawed reference.
This qualitative example clearly illustrates that factual noise in reference summaries does not merely distort evaluation scores—it shifts the model’s attention during training, leading to different sentence-selection behavior. By grounding extraction in the document’s own semantics, SSF-KW effectively resists such misleading supervision, selecting factually accurate and semantically central content.

4.6. Computational Efficiency Analysis

As a BERT-based extractive model, our SSF-KW framework inherently belongs to a computationally efficient class of models. The computational cost of a summarization model is primarily determined by its architectural paradigm:
  • Vs. Generative LLMs (e.g., ChatGPT, T5, BART): Generative models employ an autoregressive decoding process [45] which generates summaries token-by-token sequentially. This results in a time complexity that scales linearly with the output length O(n) [46], making inference comparatively slow. In contrast, SSF-KW, as an extractive model, performs a single forward pass through the encoder to score all sentences, followed by a simple selection step [47]. Its inference time is constant O(1) relative to the output length, as it does not generate new tokens but only selects from existing ones. Furthermore, generative LLMs typically have orders of magnitude more parameters (e.g., billions [48]) compared to our BERT-based encoder (e.g., millions [49]), drastically increasing memory footprint and energy consumption.
  • Vs. Other Abstractive Models: While smaller than LLMs, abstractive models still incur the overhead of a decoder network and autoregressive generation [50]. SSF-KW eliminates the entire decoder component, reducing both the number of parameters and the computational graph complexity.
  • Vs. Graph-Based Models (e.g., GNN-EXT, RHGNNSUMEXT): Many extractive models rely on complex graph neural networks to capture document structure [42]. The construction of sentence-entity graphs and the multi-step message passing operations in GNNs introduce non-trivial computational overhead [51]. SSF-KW achieves competitive performance through a conceptually simpler, joint encoding and multi-task learning framework, avoiding the explicit graph construction and propagation steps.
  • Internal Design Choices: Our multi-task design promotes efficiency by learning a more focused and disentangled representation within a single shared encoder. This architecture aligns with the concept of parameter sharing in multi-task learning [52], leading to faster convergence during training and a more compact model.
In summary, by adopting a BERT-based extractive paradigm and a multi-task architecture, the SSF-KW framework is positioned for computationally efficient inference, making it suitable for scenarios requiring low latency or deployment on resource-constrained hardware.

4.7. Quantitative Analysis of Computational Efficiency

To complement the qualitative analysis, we further provide a hardware-agnostic quantitative assessment focused on internal complexity measures rather than absolute runtime. The computational efficiency of SSF-KW can be characterized using three measurable indicators: parameter scale, floating-point operations per input sequence (FLOPs), and training convergence speed.

4.7.1. Parameter Scale

The SSF-KW framework is built on a BERT-base encoder with approximately $1.1 \times 10^8$ parameters, and the multi-task fusion modules add fewer than $3.5 \times 10^6$ additional parameters (about 3.2% overhead). This ensures the overall model remains lightweight compared with typical encoder–decoder architectures, which often exceed $2 \times 10^8$ parameters.
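Such counts are easy to verify for any PyTorch module; the snippet below uses the bert-base-uncased checkpoint as an illustrative stand-in for the shared encoder.

```python
# Parameter counting for a BERT-base encoder (roughly 1.1e8 parameters).
from transformers import BertModel

encoder = BertModel.from_pretrained("bert-base-uncased")
print(f"{sum(p.numel() for p in encoder.parameters()):,} parameters")
```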

4.7.2. Computational Complexity

The forward computation of SSF-KW involves a single encoder pass with self-attention complexity of $O(N^2 d)$, where $N$ denotes the number of tokens and $d$ the hidden dimension. The additional fusion operation introduces only $O(d^2)$ cost per sentence group, representing less than 5% of total FLOPs in practice. The overall cost therefore remains dominated by the single encoder pass, with only marginal overhead from the fusion components, making the model suitable for long-form summarization.

4.7.3. Training Efficiency

Empirically, the multi-task shared-encoder design accelerates convergence. The total training time for 3 epochs on CNN/DailyMail using two RTX 3090 GPUs is approximately 17% shorter than training separate extractive and keyword-extraction models of comparable size, owing to shared representations and reduced redundancy.
Collectively, these metrics confirm that SSF-KW achieves high computational efficiency both in parameter economy and operation count, without relying on hardware-dependent measurements. The efficiency stems primarily from architectural simplification rather than implementation optimization, ensuring reproducibility across different computational environments.

4.7.4. Comparison with Prior Solutions

To better illustrate the computational efficiency of the proposed framework, we summarize the architectural structure and theoretical complexity of the compared baselines. Each method uses a Transformer-based encoder but differs in the use of additional components such as decoders, graph networks, or multi-encoder structures. For clarity, the complexity terms in Table 8 are expressed in asymptotic form, where $N$ represents the input length, $d$ is the hidden size, and $L$ is the number of Transformer layers. All models follow the BERT-base configuration ($L = 12$, $d = 768$) for consistency.
As shown in Table 8, all baselines operate under the same encoder paradigm but vary in their structural overhead. Models such as FAR, RFAR, RHGNNSUM-EXT, and GNN-EXT include explicit graph construction and propagation, which increase the total cost from $O(N^2 d)$ to $O(HEd)$. LCS-EXT doubles the encoder cost due to its dual-tower design, and AES-REP introduces a sentence-level sequential step. In contrast, SSF-KW employs a single BERT-base encoder with a lightweight fusion layer, avoiding both decoder and graph components. This design keeps the overall complexity close to $O(N^2 d)$ while maintaining a lower parameter count and faster inference under the same input setting.

5. Conclusions

This study introduced SSF-KW, a multi-task learning framework that mitigates reference-induced noise in extractive summarization by jointly learning keyword extraction and sentence selection. The results across three benchmark datasets—CNN/DailyMail, WikiHow, and XSum—demonstrate that grounding sentence salience in document-intrinsic semantics leads to consistent performance improvements over strong baselines. Beyond quantitative gains, SSF-KW offers several conceptual insights: (1) incorporating auxiliary linguistic cues such as keywords enhances the interpretability and semantic transparency of extractive models; (2) multi-task learning serves as an effective regularizer that stabilizes training and improves cross-domain generalization; and (3) semantic fusion between word-level and sentence-level representations provides a flexible mechanism to balance contextual coherence and factual grounding.
Despite these advantages, several limitations remain. The current keyword extraction module still relies on static linguistic heuristics, which may underperform in domains with non-standard syntax or implicit discourse structures. Moreover, while the model improves robustness against noisy references, it has not yet been tested in multilingual or low-resource scenarios where annotation noise and domain drift are more severe. Future work will explore adaptive keyword extraction using context-aware or unsupervised methods, extend the framework to multilingual summarization, and incorporate factual verification components to enhance reliability. Overall, SSF-KW represents a step toward building extractive summarization systems that are not only accurate but also transparent and robust—qualities that are increasingly crucial for trustworthy text understanding in practical applications.

Author Contributions

Y.W. conceived of and designed the study, developed the methodology, implemented the software, performed formal analysis and investigation, curated the data, and prepared the original draft of the manuscript. Y.W. and J.Z. jointly carried out validation. Y.W. was also responsible for visualization, supervision, and project administration. J.Z. acquired funding and provided a critical review of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Natural Science Foundation of Jilin Province, China (Grant No. 20220101114JC).

Data Availability Statement

We utilize three benchmark datasets for our experiments: the CNN/DailyMail reading comprehension dataset [44], and the XSum [23] and WikiHow [22] summarization datasets. CNN/DailyMail: https://huggingface.co/datasets/ccdv/cnn_dailymail; XSum: https://github.com/EdinburghNLP/XSum; WikiHow: https://github.com/HiDhineshRaja/WikiHow-Dataset. (All dataset URLs accessed on 18 November 2025.)

Acknowledgments

The authors would like to express gratitude for the administrative and technical support received during the course of this research. We also extend our thanks for the donations in kind, including materials utilized in the experiments.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Zhang, H.; Yu, P.S.; Zhang, J. A systematic survey of text summarization: From statistical methods to large language models. ACM Comput. Surv. 2025, 57, 1–41. [Google Scholar]
  2. Shakil, H.; Farooq, A.; Kalita, J. Abstractive text summarization: State of the art, challenges, and improvements. Neurocomputing 2024, 603, 128255. [Google Scholar] [CrossRef]
  3. Giarelis, N.; Mastrokostas, C.; Karacapilidis, N. Abstractive vs. extractive summarization: An experimental review. Appl. Sci. 2023, 13, 7620. [Google Scholar]
  4. Maynez, J.; Narayan, S.; Bohnet, B.; McDonald, R. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1906–1919. [Google Scholar]
  5. Wang, Y.; Zhang, J.; Yang, Z.; Wang, B.; Jin, J.; Liu, Y. Improving Extractive Summarization with Semantic Enhancement Through Topic-Injection Based BERT Model. Inf. Process. Manag. 2024, 61, 103677. [Google Scholar] [CrossRef]
  6. Landes, P.; Chaise, A.; Patel, K.; Huang, S.; Di Eugenio, B. Hospital Discharge Summarization Data Provenance. In Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, Toronto, Canada, 13 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 439–448. [Google Scholar]
  7. Zhong, M.; Liu, P.; Chen, Y.; Wang, D.; Qiu, X.; Huang, X.-J. Extractive Summarization as Text Matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6197–6208. [Google Scholar]
  8. Zhang, S.; Wan, D.; Bansal, M. Extractive Is Not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1, pp. 2153–2174. [Google Scholar]
  9. Zhang, J.; Lu, L.; Zhang, L.; Chen, Y.; Liu, W. DCDSum: An Interpretable Extractive Summarization Framework Based on Contrastive Learning Method. Eng. Appl. Artif. Intell. 2024, 133, 108148. [Google Scholar] [CrossRef]
  10. AbdelAziz, N.M.; Ali, A.A.; Naguib, S.M.; Fayed, L.S. Clustering-Based Topic Modeling for Biomedical Documents Extractive Text Summarization. J. Supercomput. 2025, 81, 171. [Google Scholar] [CrossRef]
  11. Debnath, D.; Das, R.; Pakray, P. Extractive Single Document Summarization Using Multi-Objective Modified Cat Swarm Optimization Approach: ESDS-MCSO. Neural Comput. Appl. 2025, 37, 519–534. [Google Scholar] [CrossRef]
  12. Dong, X.; Li, W.; Le, Y.; Jiang, Z.; Zhong, J.; Wang, Z. TermDiffuSum: A Term-Guided Diffusion Model for Extractive Summarization of Legal Documents. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 19–24 January 2025; pp. 3222–3235. [Google Scholar]
  13. Wang, R.; Lan, T.; Wu, Z.; Liu, L. Unsupervised Extractive Opinion Summarization Based on Text Simplification and Sentiment Guidance. Expert Syst. Appl. 2025, 272, 126760. [Google Scholar] [CrossRef]
  14. Rodrigues, C.; Ortega, M.; Bossard, A.; Mellouli, N. REDIRE: Extreme REduction DImension for extRactivE Summarization. Data Knowl. Eng. 2025, 157, 102407. [Google Scholar] [CrossRef]
  15. Chan, H.P.; Zeng, Q.; Ji, H. Interpretable Automatic Fine-Grained Inconsistency Detection in Text Summarization. In Findings of the Association for Computational Linguistics: ACL 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 6433–6444. [Google Scholar]
  16. She, S.; Geng, X.; Huang, S.; Chen, J. Cop: Factual Inconsistency Detection by Controlling the Preference. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 13556–13563. [Google Scholar]
  17. Chen, L.; Chang, K. An Entropy-Based Corpus Method for Improving Keyword Extraction: An Example of Sustainability Corpus. Eng. Appl. Artif. Intell. 2024, 133, 108049. [Google Scholar] [CrossRef]
  18. Li, J.; Zhang, X.; Wang, J.; Cao, S.; Zhou, X. Hierarchical Differential Amplifier Contrastive Learning for Semi-supervised Extractive Summarization. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), 30 June–5 July 2024; IEEE: New York, NY, USA, 2024; pp. 1–8. [Google Scholar]
  19. Rong, H.; Chen, G.; Ma, T.; Sheng, V.S.; Bertino, E. FuFaction: Fuzzy Factual Inconsistency Correction on Crowdsourced Documents with Hybrid-Mask at the Hidden-State Level. IEEE Trans. Knowl. Data Eng. 2024, 36, 167–183. [Google Scholar] [CrossRef]
  20. Liu, Y.; Deb, B.; Teruel, M.; Halfaker, A.; Radev, D.; Hassan, A. On Improving Summarization Factual Consistency from Natural Language Feedback. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1, pp. 15144–15161. [Google Scholar]
  21. Tang, L.; Goyal, T.; Fabbri, A.; Laban, P.; Xu, J.; Yavuz, S.; Kryściński, W.; Rousseau, J.; Durrett, G. Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1, pp. 11626–11644. [Google Scholar]
  22. Koupaee, M.; Wang, W.Y. WikiHow: A Large Scale Text Summarization Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 1–6. [Google Scholar]
  23. Narayan, S.; Cohen, S.B.; Lapata, M. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 1797–1807. [Google Scholar]
  24. Zhou, Q.; Yang, N.; Wei, F.; Huang, S.; Zhou, M.; Zhao, T. Neural Document Summarization by Jointly Learning to Score and Select Sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1, pp. 654–663. [Google Scholar]
  25. Galeshchuk, S. Abstractive Summarization for the Ukrainian Language: Multi-Task Learning with Hromadske. Ua News Dataset. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), Dubrovnik, Croatia, 2–6 May 2023; pp. 49–53. [Google Scholar]
  26. Li, L.; Xu, S.; Liu, Y.; Gao, Y.; Cai, X.; Wu, J.; Song, W.; Liu, Z. LiSum: Open Source Software License Summarization with Multi-Task Learning. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; IEEE Computer Society: Washington, DC, USA, 2023; pp. 787–799. [Google Scholar]
  27. Chen, L.; Leng, L.; Yang, Z.; Teoh, A.B.J. Enhanced Multitask Learning for Hash Code Generation of Palmprint Biometrics. Int. J. Neural Syst. 2024, 34, 2450020. [Google Scholar] [CrossRef] [PubMed]
  28. Tian, Y.; Lin, Y.; Ye, Q.; Wang, J.; Peng, X.; Lv, J. UNITE: Multitask Learning with Sufficient Feature for Dense Prediction. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 5012–5024. [Google Scholar] [CrossRef]
  29. Qin, Y.; Pu, N.; Wu, H.; Sebe, N. Margin-aware Noise-robust Contrastive Learning for Partially View-aligned Problem. ACM Trans. Knowl. Discov. Data 2025, 19, 1–20. [Google Scholar] [CrossRef]
  30. Zhang, Q.; Zhu, Y.; Yang, M.; Jin, G.; Zhu, Y.; Chen, Q. Cross-to-merge training with class balance strategy for learning with noisy labels. Expert Syst. Appl. 2024, 249, 123846. [Google Scholar] [CrossRef]
  31. Lin, C. Rouge: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 74–81. [Google Scholar]
  32. Zhao, W.; Peyrard, M.; Liu, F.; Gao, Y.; Meyer, C.M.; Eger, S. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing & 9th Int’l Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 563–578. [Google Scholar] [CrossRef]
  33. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th Int’l Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  34. Li, H.; Chowdhury, S.B.R.; Chaturvedi, S. Aspect-Aware Unsupervised Extractive Opinion Summarization. In Findings of the Association for Computational Linguistics: ACL 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 12662–12678. [Google Scholar]
  35. Liang, X.; Li, J.; Wu, S.; Li, M.; Li, Z. Improving Unsupervised Extractive Summarization by Jointly Modeling Facet and Redundancy. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 30, 1546–1557. [Google Scholar] [CrossRef]
  36. Wang, Y.; Mao, Q.; Liu, J.; Jiang, W.; Zhu, H.; Li, J. Noise-Injected Consistency Training and Entropy-Constrained Pseudo Labeling for Semi-Supervised Extractive Summarization. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 6447–6456. [Google Scholar]
  37. Zhu, T.; Hua, W.; Qu, J.; Hosseini, S.; Zhou, X. Auto-Regressive Extractive Summarization with Replacement. World Wide Web 2023, 26, 2003–2026. [Google Scholar] [CrossRef]
  38. Mendes, A.; Narayan, S.; Miranda, S.; Marinho, Z.; Martins, A.F.; Cohen, S.B. Jointly Extracting and Compressing Documents with Summary State Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 3955–3966. [Google Scholar]
  39. Zhang, H.; Liu, X.; Zhang, J. Extractive Summarization via ChatGPT for Faithful Summary Generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 3270–3278. [Google Scholar]
  40. Jie, R.; Meng, X.; Jiang, X.; Liu, Q. Unsupervised Extractive Summarization with Learnable Length Control Strategies. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 18372–18380. [Google Scholar]
  41. Sun, S.; Yuan, R.; Li, W.; Li, S. Improving Sentence Similarity Estimation for Unsupervised Extractive Summarization. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
  42. Chen, J. An Entity-Guided Text Summarization Framework with Relational Heterogeneous Graph Neural Network. Neural Comput. Appl. 2024, 36, 3613–3630. [Google Scholar] [CrossRef]
  43. Su, W.; Jiang, J.; Huang, K. Multi-Granularity Adaptive Extractive Document Summarization with Heterogeneous Graph Neural Networks. PeerJ Comput. Sci. 2023, 9, e1737. [Google Scholar] [CrossRef]
  44. Hermann, K.M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching machines to read and comprehend. In Proceedings of the 28th Conference on Neural Information Processing Systems (NeurIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 1693–1701. [Google Scholar]
  45. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  46. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  47. Liu, Y.; Lapata, M. Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3730–3740. [Google Scholar]
  48. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the 34th Conference on Neural Information Processing Systems, Online, 6–12 December 2020; pp. 1877–1901. [Google Scholar]
  49. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  50. See, A.; Liu, P.J.; Manning, C.D. Get to the Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 1073–1083. [Google Scholar]
  51. Jiang, B.; Zhang, Z.; Lin, D.; Tang, J.; Luo, B. Semi-Supervised Learning with Graph Learning-Convolutional Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11313–11320. [Google Scholar]
  52. Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586–5609. [Google Scholar] [CrossRef]
Figure 1. Overview of the SSF-KW framework for keyword-guided extractive summarization.
Table 1. An example of the impact of noisy labels on the final extraction results.
Artificial Summary (with factual errors): Dr. Doom is seen for the first time in the trailer for the “Fantastic Four” reboot, Chris Pratt takes the lead in the new trailer for “Jurassic World”.
Extracted Summary based on Erroneous Summary: Not to be outdone, the new trailer for “Jurassic World” came out Monday morning. It features even more stars, Chris Pratt.
Corrected Artificial Summary: Dr. Doom is seen for the first time in the trailer for the “Fantastic Four” reboot. The new “Jurassic World” trailer features more of Chris Pratt’s character.
Extracted Summary based on Corrected Summary: Not to be outdone, the new trailer for “Jurassic World” came out Monday morning. It features even more of star Chris Pratt. Pratt’s scientist character knows dinosaurs better than anyone.
Table 2. Notations and Descriptions.
| Notation | Description |
|---|---|
| E_{x_i} | Word-level embedding of sentence i |
| h | Hidden state of the sentence encoder |
| ŷ_i | Predicted label for sentence i |
| R | The reference summary |
| X | An original sentence from the document |
| λ | Weight parameter for fusion |
| B | Batch size |
| x_i | The i-th word of a sentence |
| h_k | Word-level embedding for keyword extraction |
| h_s | Sentence-level embedding for summarization |
| [BoS]/[EoS] | Begin/End of Sentence tokens |
| Tag_k/Tag_o | Tag indicating a keyword/non-keyword |
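To make the notation concrete, the sketch below shows one plausible way to serialize a sentence for the word-level keyword task: the sentence is wrapped in [BoS]/[EoS] and each word x_i is labeled Tag_k (keyword) or Tag_o (non-keyword), following Table 2. The helper function and keyword set are illustrative assumptions for exposition, not the authors' preprocessing code.

```python
# Illustrative sketch of the tagging scheme in Table 2: each word x_i receives
# Tag_k if it belongs to the keyword set and Tag_o otherwise, with [BoS]/[EoS]
# marking sentence boundaries. Hypothetical helper, not the paper's code.

def tag_sentence(words, keywords):
    """Wrap a sentence in [BoS]/[EoS] and tag each word as keyword/non-keyword."""
    tokens = ["[BoS]"] + list(words) + ["[EoS]"]
    tags = (["Tag_o"]
            + ["Tag_k" if w.lower() in keywords else "Tag_o" for w in words]
            + ["Tag_o"])
    return list(zip(tokens, tags))

sentence = "The new Jurassic World trailer features Chris Pratt".split()
keywords = {"jurassic", "world", "trailer", "pratt"}
for token, tag in tag_sentence(sentence, keywords):
    print(f"{token:<10} {tag}")
```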
Table 3. Baseline models used for comparison.
| Model | Brief Description |
|---|---|
| FAR [35] | Addresses facet bias in extractive summarization by explicitly modeling multiple informational facets within a document |
| RFAR [35] | Extends FAR with a redundancy-minimization strategy to ensure more diverse sentence selection |
| CPSUM [36] | A semi-supervised framework that generates pseudo-labels to improve extractive summarization without heavy reliance on annotations |
| AES-REP [37] | Introduces an auto-regressive extractive model with a replacement mechanism for iterative refinement of selected sentences |
| EXCONSUMM [38] | Proposes a dynamic-length controller that adapts output summary length to the quality of reference summaries seen during training |
| CHATGPT-EXT [39] | Employs ChatGPT (gpt-3.5-turbo) for zero-shot extractive summarization, evaluated via ROUGE overlap with human references |
| LCS-EXT [40] | Uses a Siamese network with a bidirectional prediction objective, removing positional assumptions from sentence ranking |
| SIMBERT [41] | Enhances sentence-similarity estimation and ranking through mutual learning and auxiliary signal boosters for key-sentence detection |
| RHGNNSUM-EXT [42] | Leverages a heterogeneous GNN and knowledge-graph information by constructing a sentence–entity graph for node-level reasoning |
| GNN-EXT [43] | Utilizes heterogeneous graph neural networks with edge features to model multi-granular semantic connections between sentences and entities |
Table 4. Main results on the CNN/DailyMail, WikiHow, and XSum datasets. All reported SSF-KW results are obtained from the best-performing checkpoint among the top three models (variance < 0.15); “-” indicates that the corresponding results are not reported in the original publications. For fairness, all values in this table are cited directly from their respective papers under identical evaluation settings. R-1, R-2, and R-L denote ROUGE-1, ROUGE-2, and ROUGE-L F1, respectively.
| Model | CNN/DailyMail (R-1 / R-2 / R-L) | WikiHow (R-1 / R-2 / R-L) | XSum (R-1 / R-2 / R-L) |
|---|---|---|---|
| Oracle | 52.59 / 31.24 / 48.87 | 39.80 / 14.85 / 36.90 | 25.62 / 7.62 / 18.72 |
| SIMBERT | 35.41 / 13.18 / 31.75 | - | - |
| RFAR | 40.64 / 17.49 / 36.01 | 27.38 / 6.02 / 25.37 | - |
| FAR | 40.83 / 17.85 / 36.91 | 27.54 / 6.17 / 25.46 | - |
| LCS-EXT | 40.92 / 17.88 / 37.27 | - | - |
| CPSUM (Soft) | 40.93 / 18.01 / 37.04 | - | 17.22 / 2.17 / 12.71 |
| CPSUM (Hard) | 41.02 / 18.08 / 37.10 | - | 17.29 / 2.18 / 12.73 |
| EXCONSUMM | 41.70 / 18.60 / 37.80 | - | - |
| CHATGPT-EXT | 42.26 / 17.02 / 27.42 | - | 20.37 / 4.78 / 14.21 |
| RHGNNSUM-EXT | 42.39 / 19.45 / 38.85 | - | - |
| GNN-EXT | 43.14 / 19.94 / 39.43 | - | - |
| AES-REP | 43.21 / 19.90 / 39.38 | 29.46 / 7.75 / 27.23 | - |
| SSF-KW (ours) | 43.27 / 20.39 / 39.70 | 30.03 / 8.42 / 27.89 | 25.43 / 5.29 / 21.34 |
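For readers who wish to sanity-check numbers of this kind, F1-based ROUGE metrics such as those in Table 4 can be approximated with the open-source rouge-score package, as in the minimal sketch below. This is an approximation under assumed settings (stemming enabled, untruncated outputs), not the authors' exact evaluation pipeline; published work in this area often relies on the original ROUGE-1.5.5 toolkit, whose scores can differ slightly.

```python
# Minimal sketch: ROUGE-1/2/L F1 with Google's `rouge-score` package
# (pip install rouge-score). Approximates, but does not exactly reproduce,
# the ROUGE-1.5.5 scores typically reported in summarization papers.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

reference = "The new Jurassic World trailer features more of Chris Pratt's character."
candidate = ("Not to be outdone, the new trailer for Jurassic World came out "
             "Monday morning. It features even more of star Chris Pratt.")

for name, result in scorer.score(reference, candidate).items():
    # Each entry carries precision/recall/F1; Table 4 reports the F1 values.
    print(f"{name}: F1 = {result.fmeasure:.4f}")
```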
Table 5. Ablation study results on the CNN/DailyMail, XSum, and WikiHow datasets; w/o indicates that the corresponding component is removed from the model.
| Model | CNN/DailyMail (R-1 / R-2 / R-L) | XSum (R-1 / R-2 / R-L) | WikiHow (R-1 / R-2 / R-L) |
|---|---|---|---|
| Full Model | 43.27 / 20.39 / 39.71 | 25.43 / 5.29 / 21.34 | 30.04 / 8.42 / 27.89 |
| w/o at | 43.20 / 20.35 / 39.62 | 25.29 / 5.26 / 21.17 | 29.80 / 8.30 / 27.64 |
| w/o cos | 43.21 / 20.36 / 39.64 | 25.22 / 5.29 / 21.12 | 29.84 / 8.38 / 27.70 |
| w/o wl | 43.13 / 20.31 / 39.58 | 25.16 / 5.21 / 21.06 | 29.48 / 8.15 / 27.40 |
| w/o sl | 43.23 / 20.38 / 39.68 | 25.32 / 5.31 / 21.24 | 29.91 / 8.41 / 27.78 |
| w/o mul | 43.22 / 20.34 / 39.66 | 25.19 / 5.19 / 21.07 | 30.21 / 8.51 / 27.99 |
Table 6. Sensitivity study of λ on the CNN/DailyMail, XSum, and WikiHow datasets.
| λ Value | CNN/DailyMail (R-1 / R-2 / R-L) | XSum (R-1 / R-2 / R-L) | WikiHow (R-1 / R-2 / R-L) |
|---|---|---|---|
| 0.0 | 43.25 / 20.42 / 39.70 | 25.34 / 5.30 / 21.23 | 29.94 / 8.40 / 27.79 |
| 0.1 | 43.22 / 20.36 / 39.65 | 25.37 / 5.32 / 21.24 | 29.97 / 8.41 / 27.82 |
| 0.2 | 43.27 / 20.39 / 39.71 | 25.38 / 5.28 / 21.27 | 29.96 / 8.39 / 27.83 |
| 0.3 | 43.22 / 20.40 / 39.67 | 25.36 / 5.31 / 21.20 | 30.00 / 8.37 / 27.84 |
| 0.4 | 43.23 / 20.38 / 39.67 | 25.36 / 5.26 / 21.26 | 29.99 / 8.38 / 27.84 |
| 0.5 | 43.20 / 20.36 / 39.64 | 25.37 / 5.26 / 21.27 | 30.03 / 8.42 / 27.89 |
| 0.6 | 43.19 / 20.34 / 39.63 | 25.32 / 5.26 / 21.21 | 30.02 / 8.39 / 27.86 |
| 0.7 | 43.21 / 20.40 / 39.66 | 25.37 / 5.28 / 21.27 | 30.01 / 8.40 / 27.85 |
| 0.8 | 43.23 / 20.39 / 39.68 | 25.43 / 5.29 / 21.34 | 29.98 / 8.39 / 27.84 |
| 0.9 | 43.23 / 20.38 / 39.68 | 25.40 / 5.30 / 21.30 | 29.93 / 8.36 / 27.78 |
| 1.0 | 43.21 / 20.36 / 39.65 | 25.39 / 5.31 / 21.29 | 29.92 / 8.33 / 27.74 |
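Table 6 probes λ, the weight that balances the word-level keyword objective against sentence selection (Table 2). As a rough illustration of how such a trade-off is typically implemented, the sketch below blends two binary cross-entropy losses under a single λ; the variable names and the exact functional form are assumptions for exposition, since the paper's combined objective is defined in its method section.

```python
# Illustrative λ-weighted multi-task loss (an assumption for exposition,
# not the authors' exact objective): the keyword-tagging loss and the
# sentence-selection loss are blended by the λ probed in Table 6.
import torch
import torch.nn.functional as F

def combined_loss(kw_logits, kw_labels, sent_logits, sent_labels, lam=0.5):
    """lam weights the word-level keyword task against sentence selection."""
    loss_kw = F.binary_cross_entropy_with_logits(kw_logits, kw_labels)
    loss_sent = F.binary_cross_entropy_with_logits(sent_logits, sent_labels)
    return lam * loss_kw + (1.0 - lam) * loss_sent

# Toy usage: per-token keyword logits and per-sentence salience logits.
kw_logits = torch.randn(8, 128)                      # batch of 8, 128 tokens
kw_labels = torch.randint(0, 2, (8, 128)).float()
sent_logits = torch.randn(8, 30)                     # 30 candidate sentences
sent_labels = torch.randint(0, 2, (8, 30)).float()
print(combined_loss(kw_logits, kw_labels, sent_logits, sent_labels, lam=0.5))
```

Consistent with the flat curves in Table 6, ROUGE varies by only about 0.1 points across the full range of λ, suggesting the model is not overly sensitive to this hyperparameter.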
Table 7. Revisited example (from Table 1) showing how SSF-KW corrects the sentence selection bias caused by noisy references.
| Type | Example Text |
|---|---|
| Noisy Reference Summary | “Dr. Doom is seen for the first time in the trailer for the Fantastic Four reboot, Chris Pratt takes the lead in the new trailer for Jurassic World.” |
| Baseline Extractive Output | “Not to be outdone, the new trailer for Jurassic World came out Monday morning. It features even more stars, Chris Pratt.” (selects a sentence influenced by the erroneous association) |
| SSF-KW Output | “Not to be outdone, the new trailer for Jurassic World came out Monday morning. It features even more of star Chris Pratt’s scientist character, who knows dinosaurs better than anyone.” (selects a factually correct sentence reflecting the true context) |
| Observation | The reference-level factual error (“Chris Pratt takes the lead in Fantastic Four”) changes the extractive preference of standard models, which prioritize sentences that lexically overlap with the noisy supervision and thus carry the same incorrect information, rather than sentences that are semantically faithful to the original document. SSF-KW, by incorporating semantic and keyword cues, refocuses the selection process on the document’s actual meaning rather than the flawed reference, thereby improving factual robustness and interpretability. |
Table 8. Comparison of baseline architectures and their theoretical computational complexity.
| Model | Core Encoder | Extra Module | Parameters | Complexity Contributors |
|---|---|---|---|---|
| FAR [35] | BERT-base | Facet graph encoder | 110 M | Encoder L·O(N²d) + facet-graph propagation O(HEd) |
| RFAR [35] | BERT-base | Graph + redundancy gating | 112 M | Encoder L·O(N²d) + graph message passing |
| CPSUM [36] | BERT-base | Semi-supervised pseudo-labeling (no decoder/GNN) | 110 M | Encoder L·O(N²d) |
| AES-REP [37] | BERT-base | Auto-regressive sentence replacement | 118 M | Encoder L·O(N²d) + sentence-level sequential step O(kd²) |
| EXCONSUMM [38] | BERT-base | Dynamic length controller | 113 M | Encoder L·O(N²d) + controller O(d²) |
| LCS-EXT [40] | BERT-base (dual tower) | Siamese encoder | 2 × 110 M | Two encoders, 2L·O(N²d) |
| SIMBERT [41] | BERT-base | Similarity enhancement | 110 M | Encoder L·O(N²d) |
| RHGNNSUM-EXT [42] | BERT-base | Heterogeneous GNN | 125 M | Encoder L·O(N²d) + heterogeneous graph O(HEd) |
| GNN-EXT [43] | BERT-base | Graph neural network | 122 M | Encoder L·O(N²d) + graph O(HEd) |
| SSF-KW (Ours) | BERT-base | Multi-task fusion (no decoder/GNN) | 114 M (+ fusion < 4 M) | Encoder L·O(N²d) + fusion O(d²) |

Here L denotes the number of encoder layers, N the input sequence length, and d the hidden dimension.
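To make the dominant terms in Table 8 concrete, the back-of-the-envelope sketch below compares the shared encoder term L·O(N²d) with SSF-KW's fusion term O(d²) under assumed BERT-base settings (L = 12 layers, d = 768, N = 512 tokens); the counts are illustrative orders of magnitude, not measured FLOPs.

```python
# Back-of-the-envelope sketch of the dominant cost terms in Table 8, under
# assumed BERT-base settings (L = 12 layers, hidden size d = 768, sequence
# length N = 512). Order-of-magnitude estimates only, not measured FLOPs.

def encoder_term(num_layers=12, seq_len=512, hidden=768):
    """Self-attention term L * N^2 * d shared by every baseline."""
    return num_layers * seq_len**2 * hidden

def fusion_term(hidden=768):
    """SSF-KW's extra fusion module scales as d^2 per fused representation."""
    return hidden**2

enc, fus = encoder_term(), fusion_term()
print(f"encoder term : {enc:.2e}")   # ~2.4e9
print(f"fusion term  : {fus:.2e}")   # ~5.9e5
print(f"fusion overhead relative to encoder: {fus / enc:.4%}")
```

Under these assumptions the fusion term is roughly four orders of magnitude smaller than the encoder term, consistent with the table's point that the fusion module adds under 4 M parameters and only an O(d²) cost on top of the shared encoder.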
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
