GravRank: A Gravitational Extractive Preprocessing Framework for Abstractive Summarization of Long Documents

Bashir, Abubakar Salisu; Bichi, Abdulkadir Abubakar; Ado, Abubakar

doi:10.3390/engproc2026124065

Open AccessProceeding Paper

GravRank: A Gravitational Extractive Preprocessing Framework for Abstractive Summarization of Long Documents^†

by

Abubakar Salisu Bashir

^1,*

,

Abdulkadir Abubakar Bichi

² and

Abubakar Ado

¹

Department of Computer Science, Northwest University, Kano 700252, Nigeria

²

Department of Software Engineering, Northwest University, Kano 700252, Nigeria

^*

Author to whom correspondence should be addressed.

^†

Presented at the 6th International Electronic Conference on Applied Sciences, 9–11 December 2025; Available online: https://sciforum.net/event/ASEC2025.

Eng. Proc. 2026, 124(1), 65; https://doi.org/10.3390/engproc2026124065

Published: 10 March 2026

(This article belongs to the Proceedings of The 6th International Electronic Conference on Applied Sciences)

Download Versions Notes

Abstract

Transformer-based models face persistent challenges in long-document summarization due to fixed input-length constraints. Hybrid approaches address this limitation by applying extractive preprocessing to select salient sentences for downstream abstractive summarization. However, many unsupervised extractive methods, including TextRank and LexRank, rely on heuristic graph centrality and often struggle to preserve semantic coherence or control redundancy. This paper proposes GravRank, an unsupervised and deterministic extractive summarization framework that models sentence importance as an emergent property of pairwise semantic interactions governed by a softened Plummer potential. Sentences are embedded in a shared semantic space, and a global energy function is defined over all sentence pairs using a softened interaction kernel. This formulation jointly encodes relevance and redundancy within a single scoring function, avoiding iterative graph propagation, supervised training, and post hoc diversity filtering. The deterministic extractive output is used as input to a BART-based abstractive summarization model, forming a hybrid pipeline for long and semantically dense documents. Experiments on the BillSum, PubMed, and GovReport datasets show that GravRank improves over classical unsupervised baselines, remains competitive with recent extractive methods, and yields a competitive result in downstream abstractive summarization when combined with BART.

Keywords:

extractive summarization; sentence ranking; long-document summarization; energy-based modeling; hybrid summarization

1. Introduction

Transformer-based models have advanced abstractive summarization for short documents by leveraging large pre-trained encoder–decoder architectures with self-attention mechanisms that capture contextual dependencies [1]. However, their effectiveness diminishes on long documents due to fixed token-length constraints and the hierarchical structure of lengthy texts, such as legal or scientific documents [2,3]. Standard transformer architectures process documents as linear sequences, limiting the integration of long-range semantic information and structural context necessary for comprehensive summary generation. These limitations are particularly pronounced in legal, government report and scientific documents, where salient content may be distributed across multiple sections and critical information can be truncated or lost.

Hybrid summarization approaches address this challenge by combining extractive and abstractive paradigms [4]. An initial extractive stage selects salient sentences, which are then passed to an abstractive model to generate coherent summaries [5]. Such hybrid strategies mitigate transformer token-length limitations and reduce redundancy in generated outputs, demonstrating improved performance on long and domain-specific documents [6]. Nonetheless, the effectiveness of these pipelines depends heavily on the extractive module.

Existing extractive methods differ in how they define sentence importance and handle redundancy. Statistical and heuristic approaches, such as TF–IDF and lead-based selection, are efficient and interpretable but fail to capture semantic dependencies, often resulting in redundancy [7]. Graph-based methods, including TextRank [8], LexRank [9], and their extensions PacSum and RankSum [10,11], estimate importance through similarity graphs and centrality but may inadequately account for global semantic interactions. Embedding-based and neural approaches leverage contextual sentence representations and supervised learning to capture relevance [12], but require substantial training data and computational resources. Interaction-aware methods model sentence importance as an emergent property of pairwise interactions, allowing joint consideration of relevance and redundancy.

To address these limitations, this paper introduces GravRank, an unsupervised, deterministic extractive ranking framework for hybrid summarization of long documents. GravRank computes sentence importance through global pairwise interactions in a semantic embedding space, suppressing excessive reinforcement among semantically similar sentences using a softened Plummer potential adapted from Physics. Unlike graph-based methods such as LexRank and TextRank, which estimate sentence importance through heuristic centrality measures defined over a similarity graph and computed by iterative propagation, GravRank defines salience through a closed-form global interaction energy. Each sentence contributes to and is influenced by all others through a bounded kernel derived from a softened Plummer potential. Relevance and redundancy are therefore encoded jointly within the same scoring function, rather than handled by separate centrality estimation and post-selection diversity mechanisms. The bounded behavior of the interaction kernel suppresses excessive reinforcement among near-duplicate sentences while preserving long-range semantic influence, which replaces heuristic graph propagation and post hoc redundancy filtering with a unified energy-based ranking principle.

The main contributions of this work are:

Interaction-Based Extractive Ranking: A global pairwise interaction model is introduced to compute sentence importance beyond independent scoring and graph-based centrality.
Integrated Redundancy Control: A softened Plummer potential is embedded in the ranking formulation to suppress reinforcement among semantically similar sentences.
Deterministic and Unsupervised Formulation: Sentence scores are computed in closed form without iterative propagation, supervised learning, or stochastic optimization.
Hybrid Summarization Integration and Evaluation: The proposed extractive framework is integrated into a transformer-based hybrid pipeline and evaluated on long-document benchmarks.

2. Materials and Methods

This section describes GravRank, an unsupervised extractive framework designed to select salient sentences from long documents through global interaction modeling in a semantic embedding space. The extractive output serves as a preprocessing stage for downstream abstractive summarization.

Let a document consist of

N

sentences

{s_{1}, s_{2}, \dots ., s_{N}}

. Each sentence is embedded into a shared semantic space and assigned a non-negative scalar value referred to as sentence charge. Sentence importance is defined as a deterministic function of the total interaction energy induced by all other sentences. The resulting energy landscape yields a global ranking that favors sentences that are semantically central while limiting redundancy through interaction regularization.

2.1. Relationship to Kernel Density Estimation and Graph Centrality

The interaction energy used by GravRank is a weighted sum over a distance-based kernel in the sentence embedding space. This structure is related to kernel density estimation (KDE), where a score at a point is computed as a sum of kernel evaluations centered at data points. However, GravRank is not equivalent to KDE for two reasons. First, KDE estimates a probability density and therefore requires kernel normalization and bandwidth selection to ensure a valid density function. GravRank does not aim to estimate a density; it defines a deterministic relevance score whose scale is not constrained by probabilistic normalization. Second, GravRank introduces sentence charges that modulate pairwise interactions. The resulting score is not a uniform aggregation of kernel responses but a charge-weighted interaction field that encodes document-specific notions of importance. The score therefore represents an energy induced by pairwise semantic relations rather than an estimate of sample density.

GravRank is also related to graph-based centrality methods such as TextRank and LexRank, which construct a similarity graph and compute sentence importance through iterative propagation of centrality scores. In those methods, the score vector is defined as the stationary solution of a Markov process or power iteration on a normalized adjacency matrix. GravRank does not define or normalize a transition matrix and does not rely on iterative updates. Instead, the relevance of a sentence is computed in closed form as the sum of its interactions with all other sentences under a bounded kernel. This removes dependence on convergence criteria, damping factors, and graph normalization choices. The absence of iterative propagation also implies that relevance and redundancy are handled within a single objective through the kernel shape, rather than through separate centrality estimation and post-selection filtering.

Formally, if

e_{i}

denotes the embedding of sentence

i

and

q_{i}

its charge, graph centrality methods define scores

s

as a fixed point of

s = P^{T} s

, where

P

is a normalized similarity matrix. Kernel density methods define a scalar field

\hat{f} (x) = \sum_{j} K (||x - e_{j}||; h)

with a bandwidth parameter

h

. In contrast, GravRank defines sentence relevance as

E_{i} = \sum_{i \neq j} q_{i} q_{j} K_{α} (|{| e}_{i} - e_{j} ||),

where

K_{α}

is a softened interaction kernel. The score is therefore neither a stationary distribution of a random walk nor a density estimate. It is a deterministic energy that aggregates pairwise semantic interactions under document-specific weights. This distinction explains why GravRank can encode relevance and redundancy within a single scoring function without iterative propagation or probabilistic normalization.

2.2. Datasets and Preprocessing

Experiments are conducted on three benchmark datasets for long-document summarization: GovReport, PubMed, and BillSum. GovReport contains long U.S. government reports paired with human-written summaries [13]. PubMed consists of biomedical research articles with structured abstracts serving as reference summaries [14]. BillSum comprises U.S. Congressional and California State legislative bills annotated with professional summaries [7]. All datasets are used for both extractive and hybrid evaluations.

For each dataset, the standard train–validation–test splits defined by the original authors are used. Models are trained and evaluated independently on each dataset, and no cross-dataset training is performed.

Documents are segmented into sentences using the spaCy sentence tokenizer. Non-textual elements such as tables, figure captions, and section headers are removed when present. Sentences containing fewer than three tokens are discarded as they do not carry semantic content relevant for summarization.

Each sentence

s_{i}

is represented as a vector

x_{i} \in R^{384}

using the pretrained all-MiniLM-L6-v2 encoder, whose parameters remain fixed during all experiments. Sentence embeddings are normalized to unit length to ensure consistent distance scaling across documents. No sentence-level truncation is applied.

2.3. Sentence Charge Computation

Each sentence

s_{i}

is assigned a non-negative sentence charge

q_{i}

, which modulates its influence on the global interaction energy. Charges are computed as a convex combination of

K

normalized sentence-level features:

q_{i} = \sum_{k = 1}^{K} w_{k} F_{k} (s_{i}), \sum_{k = 1}^{K} w_{k} = 1, w_{k} \geq 0 .

(1)

All feature terms

F_{k} (s_{i}) \in [0, 1]

, with

w_{k}

assigned deterministically using document-level semantic statistics to ensure reproducibility across datasets and experiments.

Lexical significance

Lexical significance

F_{1} (s_{i})

captures the informativeness of a sentence based on token importance:

F_{1} (s_{i}) = \frac{1}{| s_{i} |} \sum_{t \in s_{i}} T F - I D F (t),

where

|s_{i}|

is the number of tokens in sentence

s_{i}

, and TF–IDF values are computed across the document. Higher values indicate sentences with rare and informative terms.

b.: Semantic centrality

Semantic centrality

F_{2} (s_{i})

measures the alignment of a sentence with the overall semantic content of the document:

F_{2} (s_{i}) = \cos (x_{i}, x_{c}), x_{c} = \frac{1}{N} \sum_{j = 1}^{N} x_{j}

where

x_{i}

is the embedding of sentence

s_{i}

and

x_{c}

is the document centroid. Sentences closer to the centroid are considered more central.

c.: Residual semantic novelty score

Residual semantic novelty

F_{3} (s_{i})

encourages selection of sentences that are semantically distinct from their nearest neighbors:

F_{3} (s_{i}) = 1 - \frac{1}{k} \sum_{j \in δ_{k} (i)} \cos (x_{i}, x_{j}),

where

δ_{k} (i)

contains the indices of the

k

nearest neighbors of

s_{i}

in cosine similarity space

(k = 5)

in all experiments. To account for overall document-level semantic dispersion, the novelty score is rescaled:

D = \sqrt{\frac{1}{N (N - 1)}} \times \sqrt{\sum_{i \neq j} {(\cos (x_{i}, x_{j}) - μ)}^{2},} F_{3}^{'} (s_{i}) = D \times F_{3} (s_{i})

where μ is the mean pairwise cosine similarity across the document. This scaling ensures that novelty is appropriately weighted in semantically homogeneous or heterogeneous documents.

d.: Deterministic feature weighting

Feature weights are assigned using a fixed, rule-based procedure to ensure reproducibility and eliminate manual tuning. A constant baseline weight is allocated to lexical significance

{(w}_{1} = 0.2)

to guarantee that informative terms are consistently represented across all documents. The remaining weight mass (0.8) is distributed between semantic centrality and residual novelty based on the document’s mean pairwise sentence similarity μ:

w_{3} = 0.8 * (1 - μ), w_{2} = 0.8 * μ

This allocation increases the contribution of novelty in semantically homogeneous documents (low μ) and emphasizes centrality in semantically diverse documents (high μ), providing a principled and fully deterministic weighting scheme that adapts to document-level semantic structure.

e.: Feature Summary

The final sentence charge is computed as:

q_{i} = w_{1} F_{1} (s_{i}) + w_{2} F_{2} (s_{i}) + w_{3} F_{3}^{'} (s_{i})

where the weights

w_{1}, w_{2}, w_{3}

sum to 1. This formulation jointly captures lexical importance, semantic centrality, and novelty, providing a deterministic measure of sentence influence for the subsequent global interaction modeling.

2.4. Global Interaction Energy Modeling

GravRank defines sentence relevance through a global interaction field over all sentence pairs. Each sentence

s_{i}

interacts with every other sentence

s_{j}

via a softened kernel, and the total interaction energy

E_{i}

serves as a deterministic measure of its importance. Formally, the interaction energy for sentence

s_{i}

is:

E_{i} = \sum_{j \neq i} \frac{q_{i} q_{j}}{\sqrt{‖ x_{i} - x_{j} ‖^{2} + α^{2}}}

(2)

where

$x_{i} a n d x_{j}$ are the embeddings of sentences $s_{i}$ and $s_{j}$ ;
$q_{i} a n d q_{j}$ are the sentence charges computed in Section 2.3;
$α > 0$ is a softening parameter that prevents singularities at small distances.

This formulation is equivalent to a softened Plummer potential [15], providing a bounded interaction kernel. The bounded behavior near zero prevents excessive reinforcement among semantically near-duplicate sentences, while the slow attenuation preserves long-range semantic influence. Compared with inverse-distance or Gaussian kernels, this formulation maintains sensitivity to both local and global semantic structure.

The softening parameter is adapted per document to account for variations in semantic dispersion:

α = γ \times m e a n_{i \neq j} |{| x}_{i} - x_{j}| |, γ = 0.1

This adaptive scaling ensures stability of the interaction field across documents of varying lengths and semantic diversity. Sentence ranking is then performed in descending order of

E_{i}

. Higher interaction energy corresponds to higher relevance, and the ranking is deterministic; no iterative propagation, graph construction, or local neighborhood pruning is required.

2.5. Sentence Selection

After computing the interaction energy

E_{i}

for each sentence

s_{i}

(Section 2.4), sentences are ranked in descending order of

E_{i}

. The extractive summary is constructed by greedily selecting sentences from this ranked list until a predefined token budget

B

is reached:

\sum_{s_{i} \in S_{s e l e c t e d}} |s_{i}| \leq B

where

| s_{i} |

denotes the number of tokens in sentence

s_{i}

, and

S_{s e l e c t e d}

is the set of sentences included in the summary. To avoid bias toward longer sentences, the interaction energy is normalized at selection time:

E^{'} = \frac{E_{i}}{| s_{i} |}

Ties in normalized energy are resolved using lexical significance

F_{1} (s_{i})

as a secondary criterion. This ensures that more informative sentences are preferred when interaction scores are equal. Selected sentences are returned in their original document order to preserve discourse coherence. This ordering maintains logical flow without requiring additional structural analysis. The token budget

B

is fixed to 512 tokens for all extractive experiments. This value provides a balance between summary conciseness and coverage for long documents across all datasets. A link to the full implementation is provided in the Supplementary Material.

2.6. Abstractive Summarization Stage

The extractive summary produced by GravRank serves as input to a pretrained abstractive summarization model. In this work, BART-large-CNN [16] is employed due to its capacity for long-form conditional generation and established performance on summarization benchmarks.

The model is fine-tuned on the BillSum, GovReport and PubMed training set, using the extractive outputs as inputs and the corresponding human-written summaries as targets. Training is performed for five epochs with the AdamW optimizer, a learning rate of

3 \times 10^{- 5}

, and a batch size of 4. Linear warm-up is applied over 10% of the training steps to stabilize convergence.

During inference, decoding is performed using beam search with a beam size of 4, a length penalty of 1.0, and a maximum output length of 256 tokens. This configuration balances fluency and summary conciseness while preventing excessive truncation of content.

By using the extractive stage as a preprocessing step, the abstractive model focuses on the most relevant and semantically diverse sentences, reducing input length and improving both computational efficiency and summary quality.

2.7. Evaluation Protocol

GravRank is evaluated in two stages: extractive summarization and hybrid extractive–abstractive summarization. In the extractive stage, summary quality is measured using Rouge scores [17]. Each experiment is repeated three times with fixed random seeds. Statistical significance is assessed using paired bootstrap resampling with 100 samples at the 95 percent confidence level, which provides stable variance estimates while limiting computational cost. Performance comparisons with prior methods are conducted at this stage using results reported on the same datasets.

To improve the comparability of GravRank with prior work, we followed the standard train–validation–test splits for BillSum, GovReport, and PubMed and adopted widely used preprocessing steps for our own experiments, including sentence segmentation with spaCy and the removal of sentences shorter than three tokens. For all extractive settings, we enforced a fixed token budget of 512 tokens to ensure consistent input length constraints across datasets and to match common practice in hybrid summarization studies. Reported results for baselines and recent models were taken from the corresponding publications, which evaluate on the same benchmark datasets using ROUGE F1 metrics. We acknowledge that differences in preprocessing pipelines, hyperparameter choices, and evaluation scripts across studies can introduce residual inconsistencies that cannot be fully eliminated without reimplementing all systems. To mitigate this, we report statistical variability for GravRank using repeated runs and paired bootstrap resampling with N = 1000 iterations to obtain 95 percent confidence intervals, and we focus our analysis on consistent performance trends rather than isolated point estimates.

In the hybrid stage, extractive outputs are used as inputs to the abstractive model, and the resulting summaries are evaluated using ROUGE and BERTScore. Comparisons with existing extractive–abstractive systems are also conducted at this stage using published results obtained under comparable evaluation settings.

3. Results

3.1. Extractive Summarization Performance (GravRank)

Table 1 presents ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) F1 scores for GravRank in comparison with classical baselines, recent unsupervised methods, and selected supervised models on the BillSum, PubMed, and GovReport test sets. All comparative results are taken from the cited sources unless otherwise indicated.

3.2. Statistical Analysis

To quantify variability in GravRank performance, paired bootstrap resampling was performed at the document level with N = 1000 iterations. Table 2 reports the mean ROUGE-1 scores along with 95% confidence intervals.

3.3. Sensitivity Analysis of the Softening Parameter α

This section reports the sensitivity of GravRank to variations in the softening parameter α. Table 3 presents ROUGE-1, ROUGE-2, and ROUGE-L scores on the GovReport, BillSum, and PubMed datasets for multiple α values.

3.4. Ablation Study

To quantify the contribution of each charge component, two complementary ablation settings were evaluated. First, the sentence charge was computed using only one component at a time, namely lexical significance (

f_{1}

), semantic centrality (

f_{2}

), or residual novelty (

f_{3}

), while keeping all other parts of the GravRank pipeline unchanged. Second, a leave-one-out analysis was conducted in which each component was removed in turn from the full model (minus

f_{1}

, minus

f_{2}

, minus

f_{3}

). Performance is reported using ROUGE-1 F1, which serves as a primary indicator of content coverage in extractive summarization. Table 4 reports the results on BillSum, PubMed, and GovReport.

3.5. Hybrid Summarization Performance

This subsection reports ROUGE and BERTScore-F1 results for hybrid summarization models, where extractive outputs are used as inputs to downstream abstractive models. In addition to comparative results from prior studies, we include a controlled reimplementation of BART, denoted by † in Table 5a–c, to provide a direct baseline under the same preprocessing and evaluation conditions. In this setting, BART is fine-tuned and evaluated using the same data splits, input token budget, and decoding configuration as in the proposed pipeline, which allows the effect of the extractive preprocessing stage to be isolated. The proposed system, GravRank + BART, uses the same BART configuration and the same input budget, with the only difference being that the input is produced by the GravRank extractive stage rather than by direct document truncation. BERTScore is computed using the default model configuration with F1 aggregation. Missing entries indicate results not reported in the corresponding studies.

4. Discussion

4.1. Effect of Global Interaction Modeling

The extractive results (GravRank) across BillSum, PubMed, and GovReport indicate that GravRank performs competitively with recent unsupervised methods. This behavior follows from defining sentence importance through global interaction energy rather than independent relevance scores or local graph centrality. By aggregating interactions from all sentences, GravRank captures document-wide semantic structure, allowing sentences located in semantically dense regions to receive higher importance. This global formulation reduces dependence on positional heuristics and local neighborhood topology, which characterize heuristic and graph-based baselines. Compared with embedding-based unsupervised methods that optimize relevance and diversity separately, GravRank integrates both within a single interaction model. This unified formulation explains its consistent behavior across datasets with different lengths and discourse structures.

The ablation results in Table 4 show that each component contributes measurably to the final performance. Semantic centrality (

f_{2}

) yields the strongest results on PubMed and GovReport among the single-component variants. This reflects the importance of global semantic alignment in long scientific and governmental documents. In contrast, lexical significance (

f_{1}

) performs competitively on BillSum, which indicates that salient content in legislative text is strongly associated with domain-specific terminology and named entities. Residual novelty (

f_{3}

) achieves substantially lower scores in isolation across all datasets, which is expected because it is designed as a regularizing term rather than a primary relevance signal.

The leave-one-out analysis further confirms the complementary roles of the three components. Removing any single feature leads to a consistent drop in ROUGE-1 relative to the full model across datasets, with the largest degradations observed when either semantic centrality or lexical significance is excluded. The smaller but systematic decrease observed when removing the novelty term indicates that its primary contribution lies in redundancy suppression rather than standalone content coverage. Taken together, these results support the use of a convex combination of features and show that GravRank benefits from jointly encoding relevance and redundancy within a single interaction-based scoring function.

4.2. Redundancy Control and the Softened Potential

The sensitivity analysis shows smooth variation in performance across different values of the softening parameter α. This stability is a direct consequence of the softened Plummer potential, which bounds interactions between near-duplicate sentences and prevents excessive reinforcement among highly similar embeddings. At small α, interactions become sharper and more sensitive to local differences, while larger α attenuates discrimination between semantically related and unrelated sentences. The intermediate range therefore reflects a balance between local sensitivity and global stability, consistent with the intended redundancy control mechanism. These results support the choice of embedding redundancy control directly into the interaction kernel rather than applying diversity constraints as a post hoc step.

4.3. Determinism and Statistical Stability

The narrow confidence intervals obtained from bootstrap resampling reflect the deterministic nature of GravRank. Because sentence charges and interaction energies are computed in closed form, without stochastic optimization or training dynamics, performance variability is limited primarily to document content and embedding representations. This contrasts with neural extractive models, where training instability and parameter initialization introduce additional variance.

4.4. Hybrid Summarization Behavior

In the hybrid setting, GravRank-based preprocessing yields competitive ROUGE and BERTScore values compared to several baselines. This behavior follows from providing the abstractive model with a compact input that jointly reflects relevance and non-redundancy. By compressing long documents into semantically structured representations, GravRank reduces the burden on the generative model to recover missing context or resolve redundancy caused by truncation. This explains the consistent hybrid improvements across legal, governmental, and scientific datasets.

4.5. Future Directions

Although GravRank provides a deterministic and unsupervised formulation for long-document extractive summarization, its current implementation relies on all-pair sentence interactions, which leads to quadratic complexity in the number of sentences. For very large documents, this cost can become a practical bottleneck. Several directions can be pursued to improve scalability. First, approximate nearest-neighbor search or locality-sensitive hashing can be used to restrict interactions to semantically relevant subsets while preserving the global structure of the interaction field. Second, hierarchical or block-wise processing can be introduced to compute interactions at multiple granularity levels, which would reduce the effective problem size and align naturally with the structure of long documents. Third, low-rank or sparse approximations of the sentence similarity matrix can be used to accelerate energy computation without altering the underlying objective. Finally, the current fixed-length token budget can be replaced by adaptive budget allocation strategies that account for document length and section importance. These directions provide a clear path for improving computational efficiency and applicability to very large documents without changing the core formulation of the model.

5. Conclusions

This paper introduced GravRank, an unsupervised and deterministic extractive ranking framework designed as a preprocessing stage for abstractive summarization of long documents. GravRank models sentence importance through global pairwise semantic interactions governed by a softened Plummer potential, enabling joint treatment of relevance and redundancy within a unified interaction formulation. The framework avoids iterative propagation, supervised training, and heuristic redundancy filtering, while remaining compatible with transformer-based summarization architectures.

Experimental results on BillSum, GovReport, and PubMed demonstrate that GravRank provides competitive extractive performance and supports effective hybrid summarization when integrated with BART. Statistical and sensitivity analyses further indicate stability with respect to sampling variation and kernel parameters. These findings suggest that interaction-based energy modeling offers a viable alternative to graph-based and learning-dependent extractive methods for long-document summarization. Future work may explore adaptive interaction kernels and the incorporation of structural and discourse-level information to further extend the framework.

Supplementary Materials

The following supporting information can be downloaded at: https://github.com/bbash/-ArewaDS-Machine-Learning-Assignment/blob/main/GavRank.ipynb (accessed on 13 January 2026).

Author Contributions

Conceptualization, A.S.B. and A.A.B.; methodology, A.S.B. and A.A.B.; software, A.S.B.; validation, A.A.; formal analysis, A.S.B. and A.A.; investigation, A.S.B. and A.A.; data curation, A.A. and A.S.B.; writing—original draft preparation, A.S.B.; writing—review and editing, A.S.B., A.A.B. and A.A.; visualization, A.S.B.; supervision, A.A.B. and A.A.; project administration, A.A. and A.A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository: The data presented in this study are openly available in: (1) BillSum: [Hugging Face] at [10.18653/v1/D19-5406], link [https://huggingface.co/datasets/FiscalNote/billsum] (accessed on 16 January 2026). (2) PubMed: [Hugging Face] at [10.1002/bjs.1800650203], link [https://huggingface.co/datasets/ncbi/pubmed] (accessed on 16 January 2026). (3) GovReport: [Hugging Face] at [10.18653/v1/2021.naacl-main.112], link [https://huggingface.co/datasets/launch/gov_report] (accessed on 16 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Shakil, H.; Ortiz, Z.; Forbes, G.C.; Kalita, J. Utilizing GPT to Enhance Text Summarization: A Strategy to Minimize Hallucinations. Procedia Comput. Sci. 2024, 244, 238–247. [Google Scholar] [CrossRef]
Chen, X.; Chen, Z.; Cheng, S. CoTHSSum: Structured long-document summarization via chain-of-thought reasoning and hierarchical segmentation. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 40. [Google Scholar] [CrossRef]
Wang, Y.; Zhang, J.; Yang, Z.; Wang, B.; Jin, J.; Liu, Y. Improving extractive summarization with semantic enhancement through topic-injection based BERT model. Inf. Process. Manag. 2024, 61, 103677. [Google Scholar] [CrossRef]
Bashir, A.S.; Bichi, A.A.; Mahmud, U.; Bello, A.M. Long-Text Abstractive Summarization using Transformer Models: A Systematic Review. J. Braz. Comput. Soc. 2025, 31, 1264–1279. [Google Scholar] [CrossRef]
Jain, D.; Borah, M.D.; Biswas, A. Summarization of Lengthy Legal Documents via Abstractive Dataset Building: An Extract-then-Assign Approach. Expert Syst. Appl. 2024, 237, 121571. [Google Scholar] [CrossRef]
Koh, H.Y.; Ju, J.; Liu, M.; Pan, S. An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics. ACM Comput. Surv. 2022, 55, 154. [Google Scholar] [CrossRef]
Kornilova, A.; Eidelman, V. BillSum: A Corpus for Automatic Summarization of US Legislation. In Proceedings of the 2nd Workshop on New Frontiers in Summarization; Wang, L., Cheung, J.C.K., Carenini, G., Liu, F., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 48–56. [Google Scholar] [CrossRef]
Mihalcea, R.; Tarau, P. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing; Lin, D., Wu, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 404–411. Available online: https://aclanthology.org/W04-3252/ (accessed on 2 June 2025).
Erkan, G.; Radev, D.R. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. Artif. Intell. Res. 2004, 22, 457–479. [Google Scholar] [CrossRef]
Joshi, A.; Fidalgo, E.; Alegre, E.; Alaiz-Rodriguez, R. RankSum An unsupervised extractive text summarization based on rank fusion. arXiv 2024, arXiv:2402.05976. [Google Scholar] [CrossRef]
Zheng, H.; Lapata, M. Sentence Centrality Revisited for Unsupervised Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6236–6247. [Google Scholar] [CrossRef]
Liu, Y.; Lapata, M. Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3728–3738. [Google Scholar] [CrossRef]
Huang, Y.; Yu, Z.; Guo, J.; Xiang, Y.; Xian, Y. Element graph-augmented abstractive summarization for legal public opinion news with graph transformer. Neurocomputing 2021, 460, 166–180. [Google Scholar] [CrossRef]
Cohan, A.; Dernoncourt, F.; Kim, D.S.; Bui, T.; Kim, S.; Chang, W.; Goharian, N. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. arXiv 2018, arXiv:1804.05685. [Google Scholar] [CrossRef]
Saitoh, T.R.; Makino, J. A Natural Symmetrization for the Plummer Potential. New Astron. 2012, 17, 76–81. [Google Scholar] [CrossRef]
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv 2020, arXiv:1910.13461. [Google Scholar] [CrossRef]
Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 74–81. Available online: https://aclanthology.org/W04-1013/ (accessed on 13 July 2025).
Liang, X.; Li, J.; Wu, S.; Zeng, J.; Jiang, Y.; Li, M.; Li, Z. An Efficient Coarse-to-Fine Facet-Aware Unsupervised Summarization Framework based on Semantic Blocks. arXiv 2022, arXiv:2208.08253. [Google Scholar] [CrossRef]
Zhang, H.; Liu, X.; Zhang, J. HEGEL: Hypergraph Transformer for Long Document Summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 10167–10176. [Google Scholar] [CrossRef]
Gu, N.; Ash, E.; Hahnloser, R. MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 6507–6522. [Google Scholar] [CrossRef]
Jain, D.; Borah, M.D.; Biswas, A. Summarization of legal documents: Where are we now and the way forward. Comput. Sci. Rev. 2021, 40, 100388. [Google Scholar] [CrossRef]
Zhang, Y.; Ni, A.; Mao, Z.; Wu, C.H.; Zhu, C.; Deb, B.; Awadallah, A.H.; Radev, D.; Zhang, R. Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents. arXiv 2022, arXiv:2110.10150. [Google Scholar] [CrossRef]
Xie, J.; Cheng, P.; Liang, X.; Dai, Y.; Du, N. Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 13500–13519. [Google Scholar] [CrossRef]
Moro, G.; Ragazzi, L.; Valgimigli, L.; Frisoni, G.; Sartori, C.; Marfia, G. Efficient Memory-Enhanced Transformer for Long-Document Summarization in Low-Resource Regimes. Sensors 2023, 23, 3542. [Google Scholar] [CrossRef] [PubMed]
An, C.; Zhong, M.; Geng, Z.; Yang, J.; Qiu, X. RetrievalSum: A Retrieval Enhanced Framework for Abstractive Summarization. arXiv 2021, arXiv:2109.07943. Available online: http://arxiv.org/abs/2109.07943 (accessed on 10 February 2024).
Han, C.; Feng, J.; Qi, H. Topic model for long document extractive summarization with sentence-level features and dynamic memory unit. Expert Syst. Appl. 2024, 238, 121873. [Google Scholar] [CrossRef]

Table 1. ROUGE Performance on GovReport, BillSum, and PubMed Datasets.

Models	GovReport			BillSum			PubMed			Source
Models	R-1	R-2	R-L	R-1	R-2	R-L	R-1	R-2	R-L	Source
ORACLE	74.87	49.02	72.48	65.24	47.09	58.81	55.05	27.48	38.66	[18]
BASELINES
LEAD	50.94	19.53	48.45	40.53	18.28	34.15	35.63	12.28	25.17	[18]
LEXRANK	40.16	8.85	37.65	34.39	10.05	28.93	39.19	15.87	34.53	[18]
TEXTRANK (BERT)	56.00	22.42	52.86	38.05	12.99	31.46	39.43	12.89	34.66	[18]
RECENT UNSUPERVISED
PACSUM	56.89	26.88	54.33	41.11	17.24	34.54	39.79	14.00	36.09	[18]
C2F-FAR	57.98	27.63	55.33	42.53	17.85	35.58	40.12	14.79	36.91	[18]
HIPORANK	--	--	--	--	--	--	43.58	17.00	39.31	[19]
GRAVRANK	58.08	23.03	52.97	43.58	21.42	34.51	43.21	16.73	38.65
SUPERVISED MODELS
HEGEL	--	--	--	--	--	--	47.13	21.00	42.18	[19]
HISTRUCT+	--	--	--	--	--	--	46.59	20.39	42.11	[19]
DANCER-LSTM	--	--	--	--	--	--	44.09	17.69	40.27	[19]
SUMMARUNNER	--	--	--	--	--	--	43.89	18.78	30.36	[20]
NEUSUM	58.94	25.38	55.80	--	--	--	47.46	21.92	42.87	[20]
MEMSUM	59.43	28.60	56.69	--	--	--	49.25	22.94	44.42	[20]
LSTM WITH W2V	--	--	--	28.89	15.26	27.83	--	--	--	[21]
LSTM WITH GLOVE	--	--	--	29.46	15.51	28.24	--	--	--	[21]
SUMM^N	56.77	23.25	53.90	--	--	--	--	--	--	[22]

-- indicates that no results were reported or provided for that specific dataset in the reviewed literature/existing studies.

Table 2. Bootstrapped ROUGE-1 scores and 95% confidence intervals for GravRank on the BillSum, PubMed and GovReport test set.

Dataset	GravRank Mean (R-1)	95% CI Lower	95% CI Upper
BillSum	43.45	42.75	44.14
PubMed	43.22	41.31	45.12
GovReport	58.08	57.20	58.95

Table 3. Sensitivity Analysis of the Softening Parameter α for GravRank.

α Value	GovReport			BillSum			PubMed
	R-1	R-2	R-L	R-1	R-2	R-L	R-1	R-2	R-L
0.05	57.12	21.85	51.94	42.11	19.84	33.62	41.96	15.98	37.82
0.10	57.74	22.46	52.51	42.93	20.63	34.17	42.74	16.41	38.24
0.20	58.08	23.03	52.97	43.58	21.42	34.51	43.21	16.73	38.65
0.40	57.91	22.81	52.73	43.12	21.01	34.29	42.98	16.55	38.44
0.80	57.46	22.12	52.05	42.47	20.18	33.74	42.21	16.02	37.91

Table 4. Ablation Analysis of GravRank Charge Components (ROUGE-1).

Configuration	BillSum (R-1)	PubMed (R-1)	GovReport (R-1)
Full GravRank	43.58	43.21	58.08
Lexical Only ( $f_{1}$ )	40.22	35.14	53.56
Centrality Only ( $f_{2}$ )	39.71	40.73	54.19
Novelty Only ( $f_{3}$ )	35.86	29.90	38.17
Minus $f_{1}$	42.21	41.88	57.20
Minus $f_{2}$	42.36	40.78	56.05
Minus $f_{3}$	42.34	41.32	57.38

Table 5. a: Hybrid Summarization Performance on GovReport. b: Hybrid Summarization Performance on BillSum. c: Hybrid Summarization Performance on PubMed.

a
Study	Model	R-1	R-2	R-L	BERTScore-F1
^†	BART	51.66	20.13	24.22	68.02
[23]	BARTbase	51.72	19.37	23.11	64.12
[24]	LED (Longformer Encoder–Decoder)	59.42	26.53	56.63	--
[23]	SimCAS + BARTlarge	59.30	25.95	27.07	68.17
[2]	CoTHSSum	42.56	21.54	24.36	76.12
[2]	T5	27.26	8.24	18.61	50.78
This study	GravRank + BART	61.18	39.93	47.49	83.24
b
Study	Model	R-1	R-2	R-L	BERTScore-F1
^†	BART	50.98	31.34	40.46	69.89
[5]	ETAROUGE	33.85	15.33	30.58	69.00
[5]	ETABERTScore	35.23	16.58	32.09	60.86
[5]	Legal PEGASUS	34.19	16.25	30.16	59.45
[5]	BigBird-PEGASUS	30.76	13.70	27.63	56.49
[25]	Retrieval + BART	56.26	34.90	52.51	—
[25]	BART	51.80	33.05	47.72	—
This study	GravRank + BART	52.55	32.32	46.45	71.01
c
Study	Model	R-1	R-2	R-L	BERTScore-F1
^†	BART	44.87	22.64	41.90	70.89
[26]	PEGASUS	45.97	20.15	41.34	--
[26]	BigBird	46.32	20.65	42.33	--
[26]	HEPOS	45.80	18.61	40.69	--
[23]	BARTbase	40.36	16.60	35.02	61.77
[23]	SimCAS + BARTlarge	48.65	21.40	44.14	66.52
[2]	CoTHSSum	52.14	28.46	36.45	77.36
[2]	T5	44.65	21.54	40.65	65.48
This study	GravRank + BART	47.56	24.29	43.76	76.89

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bashir, A.S.; Bichi, A.A.; Ado, A. GravRank: A Gravitational Extractive Preprocessing Framework for Abstractive Summarization of Long Documents. Eng. Proc. 2026, 124, 65. https://doi.org/10.3390/engproc2026124065

AMA Style

Bashir AS, Bichi AA, Ado A. GravRank: A Gravitational Extractive Preprocessing Framework for Abstractive Summarization of Long Documents. Engineering Proceedings. 2026; 124(1):65. https://doi.org/10.3390/engproc2026124065

Chicago/Turabian Style

Bashir, Abubakar Salisu, Abdulkadir Abubakar Bichi, and Abubakar Ado. 2026. "GravRank: A Gravitational Extractive Preprocessing Framework for Abstractive Summarization of Long Documents" Engineering Proceedings 124, no. 1: 65. https://doi.org/10.3390/engproc2026124065

APA Style

Bashir, A. S., Bichi, A. A., & Ado, A. (2026). GravRank: A Gravitational Extractive Preprocessing Framework for Abstractive Summarization of Long Documents. Engineering Proceedings, 124(1), 65. https://doi.org/10.3390/engproc2026124065

Article Menu

GravRank: A Gravitational Extractive Preprocessing Framework for Abstractive Summarization of Long Documents^†

Abstract

1. Introduction

2. Materials and Methods

2.1. Relationship to Kernel Density Estimation and Graph Centrality

2.2. Datasets and Preprocessing

2.3. Sentence Charge Computation

2.4. Global Interaction Energy Modeling

2.5. Sentence Selection

2.6. Abstractive Summarization Stage

2.7. Evaluation Protocol

3. Results

3.1. Extractive Summarization Performance (GravRank)

3.2. Statistical Analysis

3.3. Sensitivity Analysis of the Softening Parameter α

3.4. Ablation Study

3.5. Hybrid Summarization Performance

4. Discussion

4.1. Effect of Global Interaction Modeling

4.2. Redundancy Control and the Softened Potential

4.3. Determinism and Statistical Stability

4.4. Hybrid Summarization Behavior

4.5. Future Directions

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

GravRank: A Gravitational Extractive Preprocessing Framework for Abstractive Summarization of Long Documents †

Abstract

1. Introduction

2. Materials and Methods

2.1. Relationship to Kernel Density Estimation and Graph Centrality

2.2. Datasets and Preprocessing

2.3. Sentence Charge Computation

2.4. Global Interaction Energy Modeling

2.5. Sentence Selection

2.6. Abstractive Summarization Stage

2.7. Evaluation Protocol

3. Results

3.1. Extractive Summarization Performance (GravRank)

3.2. Statistical Analysis

3.3. Sensitivity Analysis of the Softening Parameter α

3.4. Ablation Study

3.5. Hybrid Summarization Performance

4. Discussion

4.1. Effect of Global Interaction Modeling

4.2. Redundancy Control and the Softened Potential

4.3. Determinism and Statistical Stability

4.4. Hybrid Summarization Behavior

4.5. Future Directions

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

GravRank: A Gravitational Extractive Preprocessing Framework for Abstractive Summarization of Long Documents^†