Article

NewsSumm: The World’s Largest Human-Annotated Multi-Document News Summarization Dataset for Indian English

1 Shri Ramdeobaba College of Engineering and Management, Affiliated to Rashtrasant Tukdoji Maharaj Nagpur University, Nagpur 440013, India
2 School of Medicine, Stanford University, Stanford, CA 94305, USA
3 Department of Artificial Intelligence and Cyber Security, Ramdeobaba University, Nagpur 440013, India
* Authors to whom correspondence should be addressed.
Computers 2025, 14(12), 508; https://doi.org/10.3390/computers14120508
Submission received: 1 October 2025 / Revised: 6 November 2025 / Accepted: 18 November 2025 / Published: 23 November 2025

Abstract

The rapid growth of digital journalism has heightened the need for reliable multi-document summarization (MDS) systems, particularly in underrepresented, low-resource, and culturally distinct contexts. However, current progress is hindered by a lack of large-scale, high-quality non-Western datasets. Existing benchmarks, such as CNN/DailyMail, XSum, and MultiNews, are limited by language, regional focus, or reliance on noisy, auto-generated summaries. We introduce NewsSumm, the largest human-annotated MDS dataset for Indian English, curated by over 14,000 expert annotators through the Suvidha Foundation. Spanning 36 Indian English newspapers from 2000 to 2025 and covering more than 20 topical categories, NewsSumm includes 317,498 articles paired with factually accurate, professionally written abstractive summaries. We detail its robust collection, annotation, and quality control pipelines, and present extensive statistical, linguistic, and temporal analyses that underscore its scale and diversity. To establish benchmarks, we evaluate PEGASUS, BART, and T5 models on NewsSumm, reporting aggregate and category-specific ROUGE scores, as well as factual consistency metrics. All NewsSumm dataset materials are openly released via Zenodo. NewsSumm offers a foundational resource for advancing research in summarization, factuality, timeline synthesis, and domain adaptation for Indian English and other low-resource language settings.

1. Introduction

The explosive growth of digital news media has resulted in vast volumes of unstructured text, making efficient information synthesis a critical challenge in natural language processing (NLP) [1,2,3,4]. Text summarization, particularly in the news domain, condenses multiple sources into concise, coherent narratives, supporting faster and more informed decision-making. Existing summarization benchmarks, however, fall short of this need in distinct ways. Single-document datasets such as CNN/DailyMail [5] and XSum [6,7] excel at capturing individual article narratives but struggle with synthesis across multiple sources, a critical requirement for comprehensive news coverage [8]. Multi-document datasets like MultiNews [9] address scale but rely on automatically generated summaries from metadata or heuristics [10,11,12], introducing noise and factual inconsistencies that limit their reliability for high-stakes applications. Meanwhile, all major benchmarks, whether single or multi-document, predominantly feature Western English from North American or European sources, embedding cultural assumptions, editorial conventions, and entity distributions that diverge substantially from non-Western contexts. These three constraints (document scope, annotation quality, geographic bias) collectively restrict progress on factual, large-scale, multi-document summarization in diverse linguistic settings.
India represents one of the world’s largest and most complex news ecosystems, yet lacks a high-quality, large-scale, human-annotated [13] dataset for multi-document summarization in Indian English. This gap limits both practical applications and research on factual [14,15,16,17,18], scalable summarization in low-resource linguistic [19,20,21] settings. To address this gap, we introduce NewsSumm, the largest human-annotated multi-document news summarization dataset for Indian English. It comprises 317,498 articles from 36 newspapers across 25 years (2000–2025) and 20+ topical domains, with professionally written summaries designed for coherence, factuality, and abstraction. NewsSumm uniquely combines (1) comprehensive temporal coverage enabling timeline-aware and diachronic summarization research; (2) human-curated, abstractive summaries avoiding noisy automatic extraction; (3) geographic and linguistic diversity reflecting Indian news ecosystem complexity; and (4) extensive quality control ensuring factual consistency across a 25-year span.
Our contributions include the following:
  • Introducing NewsSumm, with detailed construction and annotation protocols ensuring reproducibility;
  • Providing linguistic, statistical, and comparative analyses positioning NewsSumm relative to global benchmarks;
  • Benchmarking state-of-the-art summarization models (BART, PEGASUS, T5) on NewsSumm with aggregate and category-specific evaluation;
  • Releasing NewsSumm openly on Zenodo (https://zenodo.org/records/17670865, accessed on 21 November 2025), supporting open research and broad reproducibility.
Our experiments, described in Section 6.3 and Table 1, demonstrate that state-of-the-art summarization models trained exclusively on Western English datasets (CNN/DailyMail, XSum) perform substantially worse on Indian English news, quantitatively validating the need for culturally specific benchmark resources such as NewsSumm.
Section 2 reviews related work; Section 3 describes dataset construction and annotation methodology; Section 4 presents statistical and linguistic analyses of the dataset; Section 5 provides comparative insights with existing benchmarks; Section 6 details benchmarking experiments; Section 7 discusses emerging and potential applications; Section 8 outlines current limitations; and Section 9 suggests directions for future work.

2. Related Work

Automatic text summarization has progressed rapidly with the development of neural architectures [22,23,24,25] and large-scale benchmark datasets [5,6,7,9,13,21,26,27,28,29,30]. This section reviews foundational work in single-document summarization (SDS), multi-document summarization (MDS), and factual consistency evaluation, focusing on dataset evolution, scope, and limitations relevant to multilingual and regionally diverse contexts.
Early SDS efforts were shaped by datasets such as DUC 2002 [31], which introduced standardized tasks for abstractive summarization. Subsequent benchmarks like CNN/DailyMail, Gigaword [32] and XSum supported the rise of neural encoder-decoder [33] models and innovations in headline generation, pointer-generator networks, and single-sentence abstractive summaries. Domain-specific datasets extended SDS research: Newsroom [34] explored stylistic variety; Reddit TIFU [35] and SAMSum [13] targeted informal and conversational content; PubMed [36,37], arXiv [38], WikiHow [39], and BigPatent covered scientific, instructional, and legal domains. Despite this progress, most SDS datasets remain Western-centric and extractive, limiting their relevance for multi-source or culturally diverse summarization needs [40,41]; we therefore turn to MDS resources that explicitly address multi-source synthesis.

2.1. Multi-Document Summarization Datasets

MDS tackles the challenge of synthesizing information from multiple related documents, with applications in news aggregation, legal analysis [42], and scientific discourse [43,44,45]. Foundational datasets like DUC and TAC [46] provided early benchmarks for multi-document synthesis and coherence [47].
Later corpora scaled up MDS. MultiNews introduced 56k news clusters but relied on auto-generated summaries that used lead sections as references for aggregated source articles, and CCSUM [47] merged multilingual sources with mixed annotation quality. More recent datasets extend MDS beyond news: EventSum [48], CivilSum, and PeerSum [49] target legal, biomedical, and peer-review domains, respectively. Two collections sit in distinct subdomains: CRD focuses on opinion synthesis from conversational user reviews of products and services, while QMSum [50] addresses query-focused, multi-domain meeting summarization driven by information-seeking queries rather than standard article clusters. Unlike classic news/scientific MDS, CRD and QMSum are conversational or query-driven and use different annotation protocols and evaluation criteria, broadening MDS benchmarks beyond traditional narrative and event aggregation.
Figure 1 shows a timeline matrix of major SDS and MDS summarization datasets (2002–2026). Each row corresponds to one dataset; columns mark the release year. Bubble area encodes dataset size (log scale); color encodes annotation strategy (green = extractive, cyan = human-abstractive, gold = auto/weakly supervised, purple = mixed). SDS datasets appear in the upper block; MDS datasets in the lower block. NewsSumm is highlighted with a bold black outline to indicate its position as the largest human-annotated MDS dataset for Indian English news.
Yet, most of these corpora are either automatically labeled, domain-specific, or derived from Western media, limiting their applicability in multilingual, low-resource environments such as Indian English news. The timeline in Figure 1 also traces the evolution of dataset scale, abstraction level, and domain coverage over time. This contextual overview underscores the growing need for regionally grounded, human-annotated resources that support factual, abstractive summarization across diverse editorial styles and languages.

2.2. Abstractive vs. Extractive Summarization Paradigms

A key methodological distinction in summarization is between extractive and abstractive approaches. Extractive summarization selects source sentences for output, preserving original phrasing and reducing hallucination risk but producing fragmentary, less coherent summaries. Abstractive summarization rephrases and compresses content, enabling fluency and conciseness but risking factual inconsistencies. Recent comparative reviews document key tradeoffs: extractive methods preserve factuality but sacrifice coherence and brevity; abstractive methods improve readability but require robust quality control [51,52].
For multi-document news summarization, abstractive approaches are strongly preferred when (i) source articles contain redundant phrasing across clusters, requiring paraphrasing to avoid verbatim duplication; (ii) compression from multi-source inputs demands synthesis and narrative coherence; (iii) readers expect fluent, contextualized narratives rather than sentence fragments. NewsSumm adopts abstractive annotation (50–250 word summaries, required rephrasing) to support these demands. While our primary focus is abstractive modeling, we establish extractive baselines (LexRank [51], TextRank [52]) for comparison in Section 6.4, Table 2. Preliminary results show extractive methods achieve ROUGE-L 20–22%, substantially lower than abstractive models (37–40%), validating our abstractive annotation choice for NewsSumm’s compression ratios and coherence requirements.

2.3. Datasets for Factual Consistency and Evaluation

Factuality remains a key challenge in neural summarization, with models often generating hallucinations [53,54,55] or misrepresenting source content. FactCC [56] provides labeled pairs for factuality classification. SummEval [57] and FRANK offer human judgments and taxonomy-based annotations of hallucinations. QAGS [58] applies QA techniques to measure alignment between source and summary, while FactCheckSum focuses on verifying factual claims in generated summaries.
However, these resources are predominantly built from Western media and Wikipedia, limiting their reliability for evaluating summaries in other linguistic or cultural contexts, including Indian English news. This gap further positions NewsSumm as both a benchmark and an evaluation platform.
Although more than 30 SDS and MDS datasets are publicly available—including CNN/DailyMail, XSum, MultiNews, WikiSum, EventSum, PubMed, and CivilSum—there remains a lack of large-scale, human-annotated corpora for multi-document summarization in linguistically diverse and underrepresented regions such as India. This limits research on inclusive and factually reliable summarization systems. We now turn from limitations in prior work to the resource itself, describing how NewsSumm is built and annotated in Section 3.

3. Dataset Construction and Annotation

NewsSumm is a large-scale, human-annotated multi-document summarization dataset curated from Indian English news sources. This section outlines the dataset’s construction, including source selection, annotation workflow, and quality assurance measures. The dataset comprises 317,498 articles collected from 36 prominent Indian English newspapers, including national (e.g., The Hindu, Hindustan Times) and regional outlets (e.g., The Hitavada, The Assam Tribune). The articles span a 25-year period (2000–2025) and cover over 20 topical domains, such as politics, science, health, technology, crime, and the environment. To ground this overview, Section 3.1 details how articles were collected and summarized before we describe quality controls (Section 3.2), cleaning/normalization (Section 3.3), and alignment with global standards (Section 3.4).

3.1. Collection and Annotation Pipeline

A distributed team of over 14,000 trained volunteers collected and annotated the data using a standardized protocol. Annotators extracted metadata including headline, publication date, source, and URL, and composed 50–250-word abstractive summaries in their own words. Each article was tagged with one or more topical categories from a predefined taxonomy. The process involved manual article entry through structured forms, daily and archival collection, and summarization with emphasis on neutrality, coherence, and informativeness. Annotator metadata and timestamps were recorded to ensure reproducibility. Figure 2 illustrates the pipeline from source selection through annotation and quality control. Having described how entries are created, we next formalize the rules that guide annotators and the checks that enforce them in Section 3.2.

3.2. Annotation Guidelines and Quality Assurance

Annotators received detailed instructions and participated in hands-on training sessions. Each summary was expected to be written using abstraction rather than direct extraction, maintain strict factual consistency with the source content, and follow a logical structure while remaining neutral and free of personal opinions or bias. Quality control involved multiple stages, including independent peer review of each summary, random audits by senior editors, and iterative feedback and revision cycles to ensure adherence to guidelines and high annotation quality. Summaries that failed to meet standards were excluded. Table 3 outlines the roles and outcomes at each stage of the QA process. In this manuscript, the terms “annotators” and “volunteers” refer to the same group—over 14,000 trained individuals coordinated by the Suvidha Foundation. Each participant completed a structured 7–11 h training program that included an orientation session, detailed annotation manuals, and 10–15 practice tasks. Only those achieving at least 90% accuracy on gold-standard references proceeded to live annotation. Weekly recalibration sessions were conducted to minimize drift and maintain consistency. Valid summaries were required to range between 50 and 250 words and demonstrate non-trivial abstraction, comprehensive coverage, factual correctness, neutrality, and at least 90% reviewer agreement. The annotators represented a strong and diverse demographic: 61% held university degrees, 34% had journalism experience, 91% possessed C1-level or higher English proficiency, and 80% were fluent in Hindi (refer to Table 4). With guidelines and multi-stage QA in place, we then standardized and validated the corpus through data cleaning procedures in Section 3.3.
To address temporal validity across the 25-year span (2000–2025), NewsSumm employed a date-stamped fact-verification protocol. Annotators were instructed to ground all summaries exclusively in source article content and to flag claims that could be disputed or time-sensitive. When factual status was uncertain, summaries were escalated to senior editors, who documented decisions with timestamps for auditability. No summary was approved without all claims being verifiable in sources contemporary with the original publication date. Annotators were explicitly prohibited from applying hindsight or present-day knowledge to “correct” past events. A 5% year-by-year random QA audit verified absence of retrospective bias. This protocol ensures that summaries faithfully represent facts as understood at publication time, maintaining temporal consistency and reproducibility.
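The length and abstraction requirements above lend themselves to simple automated pre-checks before peer review. The following is a minimal sketch, not the Foundation’s actual tooling: the 50–250 word bounds come from the guidelines, while the novel-bigram threshold of 0.3 is a hypothetical proxy for “non-trivial abstraction”.

```python
import re


def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))


def summary_checks(article: str, summary: str,
                   min_words=50, max_words=250, min_novelty=0.3):
    """Return (ok, reasons) for two automatable guideline checks:
    word count within bounds, and a minimum fraction of summary
    bigrams not copied verbatim from the article (abstraction proxy)."""
    a_toks = re.findall(r"\w+", article.lower())
    s_toks = re.findall(r"\w+", summary.lower())
    reasons = []
    if not (min_words <= len(s_toks) <= max_words):
        reasons.append(f"length {len(s_toks)} outside {min_words}-{max_words}")
    s_bi = bigrams(s_toks)
    novelty = len(s_bi - bigrams(a_toks)) / len(s_bi) if s_bi else 0.0
    if novelty < min_novelty:
        reasons.append(f"novel-bigram fraction {novelty:.2f} < {min_novelty}")
    return (not reasons), reasons
```

Checks like these cannot assess factual correctness or neutrality; in the described pipeline those remain the job of peer review and editor audits.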

3.3. Data Cleaning

Post-annotation, the dataset underwent cleaning to ensure consistency and integrity. Duplicate entries were identified and removed through a combination of automated similarity checks and manual review. Dates and category labels were standardized for uniformity across the corpus. Incomplete or noisy records—such as those with missing headlines or summaries below the minimum length—were flagged and either corrected or excluded. Finally, all records were validated against a predefined schema to ensure field completeness and correct formatting (see Figure 3a,b). Each record contains eight core fields: source (newspaper), publication date, headline, article text, human summary, category/tags, URL (for source traceability), and annotator ID (for quality auditing and reproducibility). After normalization and schema validation, we situate NewsSumm relative to established practices to facilitate comparability and reuse.
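The automated similarity check described above can be sketched with shingle-based Jaccard similarity; the 5-word shingle size and 0.8 threshold here are illustrative assumptions, not the pipeline’s actual parameters, and flagged pairs would go to manual review rather than automatic deletion.

```python
import re
from itertools import combinations


def shingles(text, k=5):
    """Set of overlapping k-word shingles from a text."""
    toks = re.findall(r"\w+", text.lower())
    return {" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def find_near_duplicates(records, threshold=0.8):
    """Flag record pairs whose shingle sets have Jaccard similarity
    at or above `threshold` (O(n^2) sketch; a production pipeline
    would use MinHash/LSH to avoid all-pairs comparison)."""
    sets = {rid: shingles(text) for rid, text in records.items()}
    flagged = []
    for (a, sa), (b, sb) in combinations(sets.items(), 2):
        union = sa | sb
        if union and len(sa & sb) / len(union) >= threshold:
            flagged.append((a, b))
    return flagged
```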

3.4. Alignment with Global Standards

NewsSumm follows practices established by CNN/DailyMail, XSum, and MultiNews, adapted for Indian-English news and multi-document synthesis. Emphasis on human annotation, rigorous QA, and linguistic diversity makes it globally interoperable while uniquely valuable for low-resource, culturally specific NLP. Figure 3a,b show the structured schema with eight core fields centered on the main article, ensuring completeness, traceability, and consistency. The URL field enables source verification and reproducibility; the annotator ID supports quality auditing and longitudinal analysis of annotation patterns. This alignment preserves compatibility with existing evaluation workflows and enables downstream analyses and benchmarking described next.

4. Dataset Statistics and Analysis

NewsSumm is the largest human-annotated multi-document summarization resource for Indian English news, distinguished by its scale, linguistic richness, and high-quality human-written summaries. The corpus comprises 317,498 articles spanning from 2000 to 2025, covering 5121 unique sub-categories. Articles and their corresponding summaries collectively contain over 123.67 million tokens and exhibit consistent editorial standards, with summaries ranging between 50 and 250 words. Table 5 provides comprehensive dataset statistics, including total counts of articles, sentences, words, and tokens, along with averages per article and summary. Section roadmap: Section 4.1 examines temporal coverage; Section 4.2 profiles topical composition; Section 4.3 reports tokenization statistics; Section 4.4 quantifies compression ratios; Section 4.5 grounds these with a concrete example entry; and Section 4.6 presents analytical insights of NewsSumm.
To better characterize the scale and structure of the NewsSumm dataset, we analyzed distributions of article and summary lengths. Figure 4 presents a histogram of article lengths in words, capturing the wide spectrum from short news briefs to long-form investigative reports. Figure 5 similarly illustrates the distribution of summary lengths, demonstrating consistency in human annotation and a balance between conciseness and informativeness. Together, Figure 4 and Figure 5 establish the basic length profiles, setting up a closer look at how coverage evolves over time in Section 4.1.

4.1. Temporal Distribution

NewsSumm spans articles published from 2000 to 2025, enabling longitudinal analyses of media trends across a 25-year period. As shown in Figure 6, the dataset exhibits steady growth over time, with marked surges during major national events, reflecting both evolving news volume and collection dynamics. Having established temporal coverage, we next examine topical breadth and category balance in Section 4.2.

4.2. Category Distribution

The dataset covers a diverse topical space with over 5000 unique sub-categories across more than 20 major domains. Figure 7 highlights the top 10 most frequent categories, demonstrating the dataset’s wide thematic range, from politics and economy to health and education. Given this thematic diversity, we turn to token-level characteristics relevant for model design and sequence handling in Section 4.3.

4.3. Tokenization Statistics

To support downstream model design, we examined token-level distributions using HuggingFace’s AutoTokenizer. Figure 8 shows boxplots of token counts in both articles and summaries, capturing central tendencies and variance, along with the presence of long-tailed outliers. Building on these sequence length statistics, Section 4.4 quantifies summarization difficulty via article-to-summary compression ratios.

4.4. Compression Ratios

We calculated the compression ratio (article length divided by summary length) to evaluate the summarization challenge presented by NewsSumm. With a mean ratio of 3.58:1 (words) and 3.55:1 (tokens), the summaries are concise yet content-rich, attesting to the annotation quality. Figure 9 displays the distribution of these ratios, highlighting typical and extreme cases. To contextualize the statistics above, Section 4.5 presents a concrete example entry that reflects the schema and annotation style.
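The reported ratios follow directly from the definition; as a minimal word-level sketch over (article, summary) pairs:

```python
def compression_ratios(pairs):
    """Word-level compression ratio (article length / summary length)
    for each (article, summary) pair, plus the corpus mean."""
    ratios = [len(article.split()) / len(summary.split())
              for article, summary in pairs]
    return ratios, sum(ratios) / len(ratios)
```

The token-level variant reported above (3.55:1) is computed the same way after replacing `str.split` with a subword tokenizer.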

4.5. Example Entry

To illustrate the annotation schema, Table 6 presents a representative NewsSumm entry. This includes metadata such as source, publication date, category, and URL, along with the full article text and its corresponding human-written summary. The example showcases the depth of curation and fidelity to content. These analyses collectively establish NewsSumm’s scale, diversity, and modeling characteristics; the next section situates NewsSumm alongside existing datasets to contextualize its contributions.

4.6. Analytical Insights of NewsSumm

While Table 5 establishes scale, the true value of NewsSumm emerges through comparative analysis. NewsSumm achieves the rare combination of N > 300k articles, C > 20 categories, Y > 20 years, and 100% human-generated summaries, a feature set unattained by any existing MDS benchmark (see Section 5.1, Table 7). Furthermore, the compression ratio of 3.58 (words) and 3.55 (tokens) positions NewsSumm between extractive datasets like CNN/DM (lower compression) and abstractive resources like XSum (higher demands for rephrasing). Linguistic analysis (Table 8) quantifies the gap between Indian and Western English: vocabulary divergence (crore vs. million, ration card vs. welfare), code-switching frequency (8–12% Hindi/Urdu terms), and political-administrative references (Lok Sabha, MLA vs. Congress, Senator), all of which Western-trained models struggle with, as demonstrated empirically in Section 6.3 (PEGASUS Western-trained: ROUGE-L 29.4 vs. fine-tuned: 39.6). These insights establish NewsSumm not as “just another dataset,” but as a methodologically grounded, comparative contribution addressing gaps in culturally specific, large-scale abstractive summarization.

5. Comparison with Existing Datasets

To contextualize NewsSumm within the existing summarization landscape, we compare it with prominent datasets across both single-document (SDS) and multi-document summarization (MDS) paradigms. Key axes of comparison include dataset scale, annotation quality, document structure, domain and linguistic diversity, and temporal coverage. Section roadmap: Section 5.1 maps the dataset landscape; Section 5.2 focuses on linguistic specificity; Section 5.3 examines domain/temporal breadth and event modeling; Section 5.4 discusses annotation quality and ethics.

5.1. Dataset Landscape

Table 7 presents a comparative overview of major summarization datasets, highlighting NewsSumm’s distinctiveness in its large-scale (317,498 documents), full human abstraction, multi-document structure, and 25-year temporal span. Unlike CNN/DailyMail (extractive), MultiNews (auto-generated), or WikiSum (Wikipedia-based), NewsSumm provides manually written summaries with rich metadata and factual verification across 20+ domains in Indian English. In short, NewsSumm uniquely combines large-scale, full human abstraction, and multi-document structure over two-and-a-half decades—setting the stage to examine its linguistic distinctiveness in Section 5.2.
NewsSumm uniquely fulfills the combination of N > 300k articles, C > 20 categories, Y > 20 years, and 100% human-generated summaries—features not simultaneously present in any other benchmark. Having situated the broader landscape, we next compare linguistic and cultural properties in Section 5.2.

5.2. Linguistic and Cultural Specificity

Unlike datasets built on US or UK news (e.g., CNN/DailyMail, XSum), NewsSumm captures the nuances of Indian English. Table 8 compares key linguistic traits between NewsSumm and Western datasets, highlighting divergences in vocabulary, sentence structure, idiomatic expressions, and political references. These contrasts underscore why Indian English requires dedicated benchmarks; we now turn to breadth across domains and time in Section 5.3. These features support research into model generalization across underrepresented linguistic varieties.

5.3. Domain, Temporal Breadth, and Event Modeling

NewsSumm spans a 25-year period (2000–2025) and includes 5000+ fine-grained topical categories ranging from legislative proceedings and rural governance to caste-based social issues and cultural events. The dataset is structured as multi-document clusters, enabling modeling of evolving events and discourse-level coherence as shown in Figure 10. A mathematical summary of NewsSumm’s distinctive attributes:
  • N(NewsSumm) = 317,498 (articles);
  • C(NewsSumm) > 20 (topical categories);
  • Y(NewsSumm) = 2000–2025 (years).
No other human-annotated MDS resource simultaneously satisfies all three criteria: N > 100,000, C > 10, and Y > 10.

5.4. Annotation Quality and Ethical Construction

All summaries are manually authored, peer-reviewed, and audited for factual accuracy and neutrality. The dataset was constructed through a transparent and ethical process, employing over 14,000 trained annotators under fair-labor guidelines via the Suvidha Foundation. This contrasts with many Western benchmarks, which rely on web-scraped or weakly supervised sources. Taken together, Section 5 shows that NewsSumm uniquely integrates scale, linguistic specificity, temporal breadth, and rigorous human annotation—context that motivates the downstream benchmarks and analyses that follow.

6. Benchmarking Experiments

To evaluate the utility of NewsSumm as a benchmark, we conducted experiments with three state-of-the-art transformer models: PEGASUS, BART, and T5. These models were selected for their strong performance on existing abstractive summarization tasks. Section roadmap: Section 6.1 describes the experimental setup; Section 6.2 reports overall and category-wise results, including factuality; we conclude with takeaways that motivate the next section.

6.1. Experimental Setup

The dataset was split into training (80%), validation (10%), and testing (10%) sets with mutually exclusive document clusters to avoid data leakage. Inputs comprised concatenated articles from each cluster, truncated to model-specific token limits (512–1024), and tokenized using Hugging Face’s AutoTokenizer. Fine-tuning was performed using official model checkpoints with the AdamW optimizer and early stopping based on validation ROUGE-L. Training was conducted on NVIDIA A100 GPUs.
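Cluster-exclusive splitting can be sketched as follows. The seed and the exact assignment procedure are illustrative assumptions (the official splits are distributed with the dataset); the key property is that entire clusters, not individual articles, are assigned to a split, so no source document leaks across train/validation/test.

```python
import random


def split_by_cluster(cluster_ids, seed=13, frac=(0.8, 0.1, 0.1)):
    """Assign whole document clusters to train/val/test so that no
    cluster's articles appear in more than one split."""
    ids = sorted(set(cluster_ids))
    rng = random.Random(seed)  # fixed seed for reproducible splits
    rng.shuffle(ids)
    n_train = int(frac[0] * len(ids))
    n_val = int(frac[1] * len(ids))
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))
```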
We evaluated performance using ROUGE-1, ROUGE-2, and ROUGE-L metrics, and assessed factual consistency on a representative subset using FactCC. Additionally, we performed category-wise evaluations across the top 10 news domains to explore domain-specific challenges. All code, splits, and checkpoints are available via the NewsSumm repository for reproducibility. With the setup established, we now present aggregate model performance in Section 6.2 and then drill down by domain and factuality.
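For intuition, ROUGE-1 F1 reduces to unigram-overlap precision and recall between candidate and reference. The toy re-implementation below is only illustrative; the reported scores use a standard ROUGE package, which additionally applies stemming and aggregates over the test set.

```python
from collections import Counter


def rouge_1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate summary and a
    reference summary (whitespace tokenization, no stemming)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```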

6.2. Results

Table 9 shows the ROUGE scores for each model on the test set. PEGASUS [60] consistently outperformed BART [61] and T5 [24], particularly in ROUGE-L.
Category-wise analysis (Figure 11) revealed higher scores in domains such as Business and Sports, which tend to have more structured narratives. Lower performance in Politics and International News reflects the increased difficulty of summarizing multi-perspective, event-rich content.
FactCC revealed persistent factual inconsistencies, including incorrect attributions and hallucinations, particularly in complex or ambiguous stories. While PEGASUS led overall, the margin of difference narrowed in challenging categories. Detailed category-wise scores are provided in Table 10.
These results highlight the limitations of current summarization models when applied to a large, diverse, multi-document dataset like NewsSumm. Although the models generate fluent summaries, maintaining factual accuracy remains a key challenge. NewsSumm thus offers a realistic, high-quality benchmark for advancing research in factual consistency, cross-document synthesis, and domain adaptation in abstractive summarization.

6.3. Cross-Lingual Generalization of Western-Trained Models on NewsSumm

To empirically assess the impact of linguistic and cultural adaptation, we evaluated PEGASUS, BART, and T5 models trained solely on Western English corpora (CNN/DailyMail + XSum) directly on the NewsSumm test split, without any Indian in-domain fine-tuning. These “zero-shot” models yielded a marked drop in performance, with PEGASUS (Western-trained) achieving a ROUGE-L of just 29.4, versus 39.6 for the same architecture fine-tuned on NewsSumm. We analyzed a sample of outputs and found frequent failures on Indian idioms (“ration card” misinterpreted as “welfare form”), misattributed legal/policy contexts, and loss of crucial local detail. These findings underline the real gap between general English and Indian English news summarization, and reinforce the critical benchmarking potential of NewsSumm.

6.4. Extractive Baseline Comparison

To ensure robust and transparent benchmarking, we evaluate both abstractive and extractive models on NewsSumm. While our dataset annotations are designed for abstractive summarization—requiring paraphrasing and deep information compression—extractive techniques provide important baselines. We implemented two classical extractive methods: LexRank (graph-based sentence ranking) and TextRank (eigenvector centrality). As shown in Table 2, extractive baselines underperform abstractive models by 15–19 ROUGE-L points, confirming that NewsSumm’s compressed, paraphrased summaries are better suited to abstractive approaches. These results validate both our annotation strategy and the necessity of neural abstractive models for high-quality multi-document news summarization in the Indian English context. In sum, Section 6 demonstrates strong baselines but persistent factual deficits; the next section discusses implications, ablations, and avenues for improving cross-document faithfulness.
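LexRank and TextRank both rank sentences by centrality in a sentence-similarity graph. The sketch below is a simplified stand-in using plain degree centrality over term-frequency cosine similarity (no power iteration, no damping factor), so it approximates rather than reproduces either published method; sentence count `k` is an arbitrary budget.

```python
import math
import re
from collections import Counter


def _tf(sentence):
    return Counter(re.findall(r"\w+", sentence.lower()))


def _cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def extractive_baseline(sentences, k=2):
    """Rank sentences by summed cosine similarity to all other
    sentences (degree centrality) and return the top-k in their
    original document order."""
    vecs = [_tf(s) for s in sentences]
    scores = [sum(_cosine(v, u) for j, u in enumerate(vecs) if j != i)
              for i, v in enumerate(vecs)]
    top = sorted(sorted(range(len(sentences)),
                        key=lambda i: -scores[i])[:k])
    return [sentences[i] for i in top]
```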

7. Applications

NewsSumm’s scale, human-verified abstractive summaries, and diverse multi-document structure make it a valuable resource for developing and evaluating advanced summarization models. Its extensive temporal coverage enables timeline-aware modeling for historical and policy-related narratives, while the inclusion of headlines supports tasks like headline generation and news compression. As an Indian English corpus, NewsSumm facilitates research in domain adaptation and cross-lingual transfer, helping models generalize to low-resource, culturally distinct settings. Its professionally curated summaries also serve as a reliable benchmark for evaluating factual consistency in both standard and large language models. Finally, NewsSumm’s open access, standardized splits, and supporting code foster reproducibility and fair benchmarking in summarization research. Together, these applications motivate careful reflection on scope and biases, which we address next in Section 8 (Limitations).

8. Limitations

NewsSumm exclusively represents English-language Indian news, which may reflect regional or urban biases typical of mainstream media. Its evolving annotation pipeline across a long temporal span could introduce slight inconsistencies in style or coverage. These challenges are mitigated through versioning, clear documentation, and continuous quality monitoring.

9. Future Work

To further enhance the utility and research impact of NewsSumm, future developments will focus on three key areas. First, the dataset will incorporate event-centric clustering and fine-grained topical categorization to support narrative modeling and event timeline summarization; multilingual versions in major Indian languages such as Hindi, Marathi, and Tamil are also planned to facilitate cross-lingual and comparative summarization research. Second, structured human evaluation protocols will be introduced to complement automated metrics like ROUGE and FactCC, enabling a more nuanced assessment of model coherence, informativeness, and factual accuracy. Finally, the open Zenodo release will be actively maintained, with detailed dataset documentation and standardized metadata kept up to date to ensure accessibility, transparency, and alignment with open science best practices.

10. Conclusions

We present NewsSumm, the largest human-annotated, multi-document summarization dataset curated from Indian English news sources. Spanning 317,000+ articles across 36 newspapers and 25 years, NewsSumm fills a critical gap in culturally diverse, low-resource NLP benchmarking. Its abstractive, fact-checked summaries—produced by over 14,000 trained annotators—offer high-quality supervision for modeling real-world summarization challenges. Through extensive benchmarking with PEGASUS, BART, and T5, we provide reproducible baselines and highlight performance trends across domains. NewsSumm uniquely supports tasks such as timeline summarization, factual consistency evaluation, and cross-domain modeling. Its ethical construction, linguistic uniqueness, and scale make it an essential resource for advancing summarization technologies in global, multilingual settings.
We invite the community to build on NewsSumm for research in factuality, domain adaptation, and inclusive NLP. Its public availability, comprehensive metadata, and reproducible benchmarks ensure wide accessibility and continued relevance.

Author Contributions

All authors have contributed equally to this work. M.M., M.A. and A.A. jointly contributed to the conceptualization and methodology of the study. M.M. developed the software and handled data curation, resources, and visualization. Validation and formal analysis were carried out collaboratively by M.M., M.A. and A.A. The investigation and overall research design were conducted collectively by all three authors. M.M. prepared the original draft, while M.A. and A.A. contributed to the review and editing process. A.A. supervised the project. M.M. managed project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The complete NewsSumm dataset, including all associated preprocessing scripts, model configurations, training routines, evaluation metrics, and supporting documentation, has been publicly released via Zenodo and is freely accessible at https://zenodo.org/records/17670865. All resources are available for direct download, with no access request required. Comprehensive documentation, annotation protocols, dataset schema, and user instructions are included to ensure ease of use, transparency, and reproducibility. Users are requested to cite the following paper in any publications or research utilizing NewsSumm: Motghare, M.; Agarwal, M.; Agrawal, A. NewsSumm: The World’s Largest Human-Annotated Multi-Document News Summarization Dataset for Indian English. Computers 2025, 14, 508. https://doi.org/10.3390/computers14120508. Contributions and feedback are welcomed to further advance research in Indian English and multilingual NLP.
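The exact file layout of the Zenodo record should be taken from its bundled documentation. Assuming a JSON Lines export with the eight core schema fields described in the paper (the field names below are hypothetical placeholders, not the confirmed keys), a loader might look like:

```python
import json

# Hypothetical field names mirroring the eight core schema fields
# (source, date, headline, article text, summary, category, URL, annotator ID);
# check the Zenodo documentation for the actual keys.
CORE_FIELDS = ["source", "date", "headline", "article_text",
               "summary", "category", "url", "annotator_id"]

def parse_newssumm(lines):
    """Parse NewsSumm records from an iterable of JSON Lines strings,
    keeping only the core fields and filling missing ones with None."""
    for line in lines:
        if line.strip():
            rec = json.loads(line)
            yield {k: rec.get(k) for k in CORE_FIELDS}

def load_newssumm(path):
    """Load records lazily from a JSON Lines file on disk."""
    with open(path, encoding="utf-8") as fh:
        yield from parse_newssumm(fh)
```

Separating parsing from file I/O keeps the schema mapping testable and makes it easy to adapt if the released files turn out to use CSV or a different key naming.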

Acknowledgments

We gratefully acknowledge the Suvidha Foundation for its foundational support, strategic guidance, and leadership throughout the creation of NewsSumm. This work would not have been possible without the tireless contributions of over 14,000 trained annotators and volunteers, whose dedication and expertise ensured the dataset’s quality and consistency. We thank the editorial and quality assurance teams for their critical role in validating and refining the annotations. We also extend special appreciation to Shobha Motghare, Payal Badhe, and the core technical team at Suvidha Foundation for their exceptional project management and coordination.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MDS: Multi-Document Summarization
NLP: Natural Language Processing
AI: Artificial Intelligence
ML: Machine Learning
LLM: Large Language Model
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
BLEU: Bilingual Evaluation Understudy
BERTScore: Bidirectional Encoder Representations from Transformers Score
CNN: Convolutional Neural Network
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
T5: Text-To-Text Transfer Transformer

References

  1. DeYoung, J.; Martinez, S.C.; Marshall, I.J.; Wallace, B.C. Do Multi-Document Summarization Models Synthesize? Trans. Assoc. Comput. Linguist. 2024, 12, 1043–1062. [Google Scholar] [CrossRef]
  2. Ahuja, O.; Xu, J.; Gupta, A.; Horecka, K.; Durrett, G. ASPECTNEWS: Aspect-Oriented Summarization of News Documents. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 6494–6506. [Google Scholar]
  3. Alambo, A.; Lohstroh, C.; Madaus, E.; Padhee, S.; Foster, B.; Banerjee, T.; Thirunarayan, K.; Raymer, M. Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020. [Google Scholar]
  4. Benedetto, I.; Cagliero, L.; Ferro, M.; Tarasconi, F.; Bernini, C.; Giacalone, G. Leveraging large language models for abstractive summarization of Italian legal news. Artif. Intell. Law 2025. [Google Scholar] [CrossRef]
  5. See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1073–1083. [Google Scholar]
  6. Rao, A.; Aithal, S.; Singh, S. Single-Document Abstractive Text Summarization: A Systematic Literature Review. ACM Comput. Surv. 2025, 57, 1–37. [Google Scholar] [CrossRef]
  7. Narayan, S.; Cohen, S.B.; Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1797–1807. [Google Scholar]
  8. Li, H.; Zhang, Y.; Zhang, R.; Chaturvedi, S. Coverage-Based Fairness in Multi-Document Summarization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025. [Google Scholar]
  9. Fabbri, A.; Li, I.; She, T.; Li, S.; Radev, D. Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
  10. Langston, O.; Ashford, B. Automated Summarization of Multiple Document Abstracts and Contents Using Large Language Models. TechRxiv 2024. [Google Scholar] [CrossRef]
  11. Wang, M.; Wang, M.; Yu, F.; Yang, Y.; Walker, J.; Mostafa, J. A Systematic Review of Automatic Text Summarization for Biomedical Literature and EHRs. J. Am. Med. Inform. Assoc. 2021, 28, 2287–2297. [Google Scholar] [CrossRef]
  12. Yu, Z.; Sun, N.; Wu, S.; Wang, Y. Research on Automatic Text Summarization Using Transformer and Pointer-Generator Networks. In Proceedings of the 2025 4th International Symposium on Computer Applications and Information Technology (ISCAIT), Xi’an, China, 21 March 2025; pp. 1601–1604. [Google Scholar]
  13. Gliwa, B.; Mochol, I.; Biesek, M.; Wawer, A. SAMSum Corpus: A Human-Annotated Dialogue Dataset for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, Hong Kong, China, 4 November 2019; pp. 70–79. [Google Scholar]
  14. Cao, M.; Dong, Y.; Wu, J.; Cheung, J.C.K. Factual Error Correction for Abstractive Summarization Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6251–6258. [Google Scholar]
  15. Kryscinski, W.; McCann, B.; Xiong, C.; Socher, R. Evaluating the Factual Consistency of Abstractive Text Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 9332–9346. [Google Scholar]
  16. Luo, Z.; Xie, Q.; Ananiadou, S. Factual Consistency Evaluation of Summarization in the Era of Large Language Models. Expert Syst. Appl. 2024, 254, 124456. [Google Scholar] [CrossRef]
  17. Pagnoni, A.; Balachandran, V.; Tsvetkov, Y. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 4812–4829. [Google Scholar]
  18. Wang, A.; Cho, K.; Lewis, M. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5008–5020. [Google Scholar]
  19. Tulajiang, P.; Sun, Y.; Zhang, Y.; Le, Y.; Xiao, K.; Lin, H. A Bilingual Legal NER Dataset and Semantics-Aware Cross-Lingual Label Transfer Method for Low-Resource Languages. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2025, 24, 1–21. [Google Scholar] [CrossRef]
  20. Mi, C.; Xie, S.; Li, Y.; He, Z. Loanword Identification in Social Media Texts with Extended Code-Switching Datasets. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2025, 24, 1–19. [Google Scholar] [CrossRef]
  21. Mohammadalizadeh, P.; Safari, L. A Novel Benchmark for Persian Table-to-Text Generation: A New Dataset and Baseline Experiments. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2025, 24, 1–17. [Google Scholar] [CrossRef]
  22. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
  23. Fodor, J.; Deyne, S.D.; Suzuki, S. Compositionality and Sentence Meaning: Comparing Semantic Parsing and Transformers on a Challenging Sentence Similarity Dataset. Comput. Linguist. 2025, 51, 139–190. [Google Scholar] [CrossRef]
  24. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  25. Zandie, R.; Mahoor, M.H. Topical Language Generation Using Transformers. Nat. Lang. Eng. 2023, 29, 337–359. [Google Scholar] [CrossRef]
  26. Al-Thubaity, A. A Novel Dataset for Arabic Domain Specific Term Extraction and Comparative Evaluation of BERT-Based Models for Arabic Term Extraction. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2025, 24, 1–12. [Google Scholar] [CrossRef]
  27. Kurniawan, K.; Louvan, S. IndoSum: A New Benchmark Dataset for Indonesian Text Summarization. In Proceedings of the 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia, 15–17 November 2018. [Google Scholar]
  28. Sharma, E.; Li, C.; Wang, L. BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2204–2213. [Google Scholar]
  29. Malik, M.; Zhao, Z.; Fonseca, M.; Rao, S.; Cohen, S.B. CivilSum: A Dataset for Abstractive Summarization of Indian Court Decisions. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 10 July 2024; pp. 2241–2250. [Google Scholar]
  30. Wang, H.; Li, T.; Du, S.; Wei, X. Mixed Information Bottleneck for Location Metonymy Resolution Using Pre-trained Language Models. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2025. [Google Scholar] [CrossRef]
  31. Gutiérrez-Hinojosa, S.J.; Calvo, H.; Moreno-Armendáriz, M.A.; Duchanoy, C.A. Sentence Embeddings for Document Sets in DUC 2002 Summarization Task; IEEE Dataport: Piscataway, NJ, USA, 2018. [Google Scholar]
  32. Rush, A.M.; Chopra, S.; Weston, J. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 379–389. [Google Scholar]
  33. Fatima, W.; Rizvi, S.S.R.; Ghazal, T.M.; Kharma, Q.M.; Ahmad, M.; Abbas, S.; Furqan, M.; Adnan, K.M. Abstractive Text Summarization in Arabic-Like Script Using Multi-Encoder Architecture and Semantic Extraction Techniques. IEEE Access 2025, 13, 104977–104991. [Google Scholar] [CrossRef]
  34. Grusky, M.; Naaman, M.; Artzi, Y. Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 708–719. [Google Scholar]
  35. Kim, B.; Kim, H.; Kim, G. Abstractive Summarization of Reddit Posts with Multi-level Memory Networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 2519–2531. [Google Scholar]
  36. Gupta, V.; Bharti, P.; Nokhiz, P.; Karnick, H. SumPubMed: Summarization Dataset of PubMed Scientific Articles. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research, Online, 1–6 August 2021; pp. 292–303. [Google Scholar]
  37. Xia, T.C.; Bertini, F.; Montesi, D. Large Language Models Evaluation for PubMed Extractive Summarisation. ACM Trans. Comput. Healthc. 2025, 3766905. [Google Scholar] [CrossRef]
  38. Cohan, A.; Dernoncourt, F.; Kim, D.S.; Bui, T.; Kim, S.; Chang, W.; Goharian, N. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 615–621. [Google Scholar]
  39. Koupaee, M.; Wang, W.Y. WikiHow: A Large Scale Text Summarization Dataset. arXiv 2018, arXiv:1810.09305. [Google Scholar] [CrossRef]
  40. Zhang, Y.; Zhang, N.; Liu, Y.; Fabbri, A.; Liu, J.; Kamoi, R.; Lu, X.; Xiong, C.; Zhao, J.; Radev, D.; et al. Fair Abstractive Summarization of Diverse Perspectives. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024. [Google Scholar]
  41. Huang, K.-H.; Laban, P.; Fabbri, A.; Choubey, P.K.; Joty, S.; Xiong, C.; Wu, C.-S. Embrace Divergence for Richer Insights: A Multi-Document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; pp. 570–593. [Google Scholar]
  42. Datta, D.; Soni, S.; Mukherjee, R.; Ghosh, S. MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of Indian Legal Case Judgments. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 5291–5302. [Google Scholar]
  43. Zhao, C.; Zhou, X.; Xie, X.; Zhang, Y. Hierarchical Attention Graph for Scientific Document Summarization in Global and Local Level. arXiv 2024, arXiv:2405.10202. [Google Scholar] [CrossRef]
  44. To, H.Q.; Liu, M.; Huang, G.; Tran, H.-N.; Greiner-Petter, A.; Beierle, F.; Aizawa, A. SKT5SciSumm—Revisiting Extractive-Generative Approach for Multi-Document Scientific Summarization. arXiv 2024, arXiv:2402.17311. [Google Scholar]
  45. Lu, Y.; Dong, Y.; Charlin, L. Multi-XScience: A Large-Scale Dataset for Extreme Multi-Document Summarization of Scientific Articles. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 8068–8074. [Google Scholar]
  46. Deutsch, D.; Dror, R.; Roth, D. A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods. Trans. Assoc. Comput. Linguist. 2021, 9, 1132–1146. [Google Scholar] [CrossRef]
  47. Jiang, X.; Dreyer, M. CCSum: A Large-Scale and High-Quality Dataset for Abstractive News Summarization. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; pp. 7306–7336. [Google Scholar]
  48. Zhu, M.; Zeng, K.; Wang, M.; Xiao, K.; Hou, L.; Huang, H.; Li, J. EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents. Proc. AAAI Conf. Artif. Intell. 2025, 39, 26138–26147. [Google Scholar] [CrossRef]
  49. Li, M.; Qi, J.; Lau, J.H. PeerSum: A Peer Review Dataset for Abstractive Multi-Document Summarization. arXiv 2022, arXiv:2203.01769. [Google Scholar]
  50. Zhong, M.; Yin, D.; Yu, T.; Zaidi, A.; Mutuma, M.; Jha, R.; Awadallah, A.H.; Celikyilmaz, A.; Liu, Y.; Qiu, X.; et al. QMSum: A New Benchmark for Query-Based Multi-Domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5905–5921. [Google Scholar]
  51. Giarelis, N.; Mastrokostas, C.; Karacapilidis, N. Abstractive vs. Extractive Summarization: An Experimental Review. Appl. Sci. 2023, 13, 7620. [Google Scholar] [CrossRef]
  52. Shakil, H.; Farooq, A.; Kalita, J. Abstractive Text Summarization: State of the Art, Challenges, and Improvements. Neurocomputing 2024, 603, 128255. [Google Scholar] [CrossRef]
  53. Farquhar, S.; Kossen, J.; Kuhn, L.; Gal, Y. Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature 2024, 630, 625–630. [Google Scholar] [CrossRef] [PubMed]
  54. Chrysostomou, G.; Zhao, Z.; Williams, M.; Aletras, N. Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization. Trans. Assoc. Comput. Linguist. 2024, 12, 1163–1181. [Google Scholar] [CrossRef]
  55. Alansari, A.; Luqman, H. Large Language Models Hallucination: A Comprehensive Survey. arXiv 2025, arXiv:2510.06265. [Google Scholar] [CrossRef]
  56. Gang, L.; Yang, W.; Wang, T.; He, Z. Corpus Fusion and Text Summarization Extraction for Multi-Feature Enhanced Entity Alignment. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2025, 24, 89. [Google Scholar] [CrossRef]
  57. Fabbri, A.R.; Kryściński, W.; McCann, B.; Xiong, C.; Socher, R.; Radev, D. SummEval: Re-Evaluating Summarization Evaluation. Trans. Assoc. Comput. Linguist. 2021, 9, 391–409. [Google Scholar] [CrossRef]
  58. Shi, H.; Xu, Z.; Wang, H.; Qin, W.; Wang, W.; Wang, Y.; Wang, Z.; Ebrahimi, S.; Wang, H. Continual Learning of Large Language Models: A Comprehensive Survey. ACM Comput. Surv. 2025, 58, 120. [Google Scholar] [CrossRef]
  59. Hindustan Times. Available online: https://www.hindustantimes.com/ (accessed on 5 November 2025).
  60. Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P.J. PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020. [Google Scholar]
  61. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
Figure 1. Timeline matrix of major single-document (SDS) and multi-document (MDS) summarization datasets (2002–2026).
Figure 2. Overview of the NewsSumm dataset collection and annotation process.
Figure 3. NewsSumm Dataset Schema: (a) NewsSumm dataset entry schema showing the eight core fields associated with each article. Each dataset entry includes: source (newspaper name), publication date, headline, article text, human-written summary, category/tags, URL, and annotator ID. (b) Structured annotation workflow and data schema for NewsSumm. The figure illustrates how each article cluster (containing a total of 317K articles) connects to its associated metadata fields: source newspaper, publication date, headline, article text, human summary (50–250 words), topical category/tags, URL, and annotator ID(s).
Figure 4. Distribution of article lengths in words with a logarithmic y-axis. The log scale reveals the long-tail region (≥1250 words) that is not visible under a linear scale. Dashed and dotted lines mark the mean and median, respectively.
Figure 5. Distribution of summary lengths (in words).
Figure 6. Number of articles per year from 2000 to 2025.
Figure 7. Top 10 most frequent news categories in the NewsSumm dataset. The figure illustrates category distribution across major domains, highlighting prevalent topics such as politics, business, and sports.
Figure 8. Boxplots of token counts for articles and summaries. The figure captures central tendencies and variance, along with long-tailed outliers, illustrating token-level distribution relevant for summarization difficulty.
Figure 9. Histogram of article-to-summary compression ratios. The figure illustrates the distribution of compression ratios across the dataset, highlighting the balance between article length and summary length, and indicating summarization difficulty levels.
Figure 10. Scatter plot of summarization datasets by size and annotation quality. NewsSumm is uniquely positioned in the top-right quadrant, combining large-scale and high-quality human-written multi-document summaries, demonstrating its strength as a benchmark for future research.
Figure 11. ROUGE-L and FactCC scores across news domains for PEGASUS, BART, and T5. The figure compares summarization quality and factual consistency across categories, highlighting domain-wise performance variations for each model.
Table 1. Comparison of PEGASUS model performance on the NewsSumm test set.
Model | Training Corpus | ROUGE-L (NewsSumm Test)
PEGASUS (base) | CNN/DailyMail, XSum | 29.4
PEGASUS (fine-tuned) | NewsSumm | 39.6
Table 2. Extractive vs. abstractive model performance on NewsSumm.
Model Type | Model | ROUGE-1 | ROUGE-2 | ROUGE-L
Extractive | LexRank | 24.30 | 10.20 | 21.80
Extractive | TextRank | 23.10 | 9.80 | 20.50
Abstractive | PEGASUS | 42.87 | 20.45 | 39.62
Abstractive | BART | 40.12 | 18.67 | 37.30
Abstractive | T5 | 39.05 | 17.54 | 35.98
Table 3. Roles and outcomes at each QA stage.
Step | Responsible | Key Criteria | Outcome
Annotator Training | Senior Team | Comprehension, exercises | Eligibility
Manual Summarization | Annotator | Abstraction, factuality | Draft summary
Peer Review | Peer Annotator | Coherence, adherence | Feedback/Approve
Senior Audit | Senior Editor | Protocol adherence | Approve/Revision
Feedback & Revision | Annotator | Incorporate feedback | Final/Resubmit
Final Check | Senior Team | All criteria met | Accept/Exclude
Table 4. Annotator demographics and attributes.
Attribute | Value
University degree | 61%
Journalism/writing experience | 34%
English C1+ proficiency | 91%
Hindi fluency | 80%
Geographically diverse | 21 states
Table 5. Overall statistics of the NewsSumm dataset.
Metric | Value
Total articles | 317,498
Total sentences in articles | 4,652,998
Total words in articles | 106,112,021
Total tokens in articles | 123,674,109
Total sentences in summaries | 1,481,410
Total words in summaries | 30,268,515
Total tokens in summaries | 34,845,065
Average sentences per article | 14.66
Average words per article | 334.21
Average tokens per article | 389.53
Average sentences per summary | 4.67
Average words per summary | 95.45
Average tokens per summary | 109.82
Time span | 2000–2025
Number of unique categories | 5121
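The per-article averages in Table 5 follow directly from the reported totals, which makes a quick arithmetic sanity check possible:

```python
# Totals from Table 5 of the paper.
total_articles = 317_498
total_article_words = 106_112_021
total_summary_words = 30_268_515

avg_article_words = total_article_words / total_articles    # ~334.2 words per article
avg_summary_words = total_summary_words / total_articles    # ~95.3 words per summary
compression_ratio = avg_article_words / avg_summary_words   # ~3.5x compression
```

The article-length average reproduces the reported 334.21 exactly; the summary-length average recomputed this way (~95.3) sits marginally below the reported 95.45, presumably a rounding or filtering difference in the original computation. Either way, the implied article-to-summary compression ratio is roughly 3.5x.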
Table 6. Example entry from the NewsSumm dataset with full metadata and human-written summary.
Field | Article Sample
News published in prominent newspapers of India | Yes
Newspaper Name | Hindustan Times [59]
Published Date | 11 May 2017
URL | https://www.hindustantimes.com/more-lifestyle/the-truth-is-out-there-tales-from-india-s-ufo-investigators/story-9crhwh8tiWKU4LSvwCfrpM.html (accessed on 10 October 2025)
Headline | The truth is out there: Tales from India’s UFO investigators
Article Text | At the stroke of the midnight hour, when the world slept, Sunita Yadav awoke to a levitating alien. She watched, petrified, as it hovered a foot above the ground just behind her home. Standing over 4 ft tall, with grey skin and big, black eyes, it proceeded from the Yadavs’ backyard toward their front door before—as her son Hitesh remembers—it “just vanished” …
Human Summary | A young tech student in Gurgaon, Hitesh Yadav, has dedicated years to investigating UFO sightings after claiming to witness a levitating alien as a child in Panchkula. Through the Disclosure Team India and UFO Magazine India, he and others collect and assess reports, blending scientific curiosity and skepticism. Fellow enthusiast Ramkrishan Vaishnav, inspired by early scientific pursuits and regional sightings, continues research while probing India’s unique patterns of UFO phenomena, especially in Rajasthan and Ladakh. Both exemplify India’s growing, reasoned approach to ufology.
News Category | National News
Table 7. Comparative overview of major summarization datasets, including scale, annotation type, temporal range, and document structure.
Dataset | Articles/Clusters | Human | Multi | Years | Cats. | Domain | Accessibility
NewsSumm (Proposed) | 317,498 | Yes | Yes | 2000–2025 | 20+ | Indian Eng. | Open (on request)
CNN/DailyMail | 287,227 | Extracted | No | 2007–2015 | 1 | W. News | Open
XSum | 226,711 | Yes | No | 2016–2018 | 1 | W. News | Open
MultiNews | 56,216 clusters | Auto | Yes | 2013–2018 | 1 | W. News | Open
WikiSum | 1.7 M clusters | Auto | Yes | Wiki-wide | Many | Wikipedia | Open
CCSUM | 1 M clusters | Auto/Human | Yes | 2016–2021 | Many | Multiling. News | Open
EventSum | 330,000 clusters | Auto | Yes | 2000–2022 | Events | Chinese News | Open
CivilSum | 50,000 cases | Yes | Yes | 1950–2020 | Legal | Indian Courts | Open
WikiHow | 230,843 | Extracted | No | – | How-To | Instructional | Open
SAMSum | 16,000 dialogues | Yes | No | 2015–2019 | Dialogs | Conversational | Open
Table 8. Linguistic comparison between NewsSumm (Indian English) and Western news.
Table 8. Linguistic comparison between NewsSumm (Indian English) and Western news.
| Feature | NewsSumm (Indian English) | CNN/XSum (Western English) |
|---|---|---|
| Vocabulary | crore, lakh, petrol, prepone, dowry, ration card, RTI, Lok Sabha, Chief Minister | million, gasoline, postpone, welfare, Congress, Governor, Senate |
| Formal Register | Often formal; influenced by British usage (e.g., “thereafter”, “amongst”, “pursuant”) | More casual; influenced by American English (e.g., “after that”, “among”) |
| Code-Switching | Frequent Hindi/Urdu terms or cultural words (e.g., “aam aadmi”, “swachh”) | Rarely present; mostly standard English |
| Naming Conventions | Frequent use of full names and titles (e.g., “Shri Narendra Modi”, “IAS officer”) | Typically shorter names or titles (e.g., “President Biden”) |
| Sentence Structure | Longer, complex, hierarchical clauses; formal style | Shorter, simpler, conversational structure |
| Idiomatic Usage | Region-specific idioms (e.g., “vote bank”, “bandh”, “fast unto death”) | Standardized idioms (e.g., “campaign promise”, “strike”, “protest”) |
| Numerical References | Uses crore, lakh; large-unit numerals | Uses thousand, million, billion |
| Political & Administrative References | Lok Sabha, Rajya Sabha, Panchayat, MLA, Collector, Tehsildar | Congress, Parliament, Senate, Mayor, Governor |
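One practical consequence of the vocabulary differences above is that systems trained on Western corpora mishandle Indian large-number units (1 lakh = 100,000; 1 crore = 10,000,000). As an illustrative preprocessing sketch (not part of the NewsSumm pipeline; the function name and regex are our own), such units can be normalized to plain numerals:

```python
import re

# Indian large-number units from Table 8: 1 lakh = 1e5, 1 crore = 1e7.
_UNITS = {"lakh": 100_000, "lakhs": 100_000,
          "crore": 10_000_000, "crores": 10_000_000}

def normalize_indian_numerals(text: str) -> str:
    """Rewrite 'N lakh' / 'N crore' phrases as plain integers, e.g. '5 lakh' -> '500000'."""
    pattern = re.compile(r"(\d+(?:\.\d+)?)\s+(lakhs?|crores?)\b", re.IGNORECASE)

    def repl(match: re.Match) -> str:
        value = float(match.group(1)) * _UNITS[match.group(2).lower()]
        return str(int(value)) if value.is_integer() else str(value)

    return pattern.sub(repl, text)

print(normalize_indian_numerals("The scheme allotted Rs 5 lakh to 2.5 crore households."))
# The scheme allotted Rs 500000 to 25000000 households.
```

A reverse mapping (numerals back to lakh/crore) would be needed when generating summaries in the Indian English register.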
Table 9. ROUGE scores on the NewsSumm test set.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| PEGASUS | 42.87 | 20.45 | 39.62 |
| BART | 40.12 | 18.67 | 37.30 |
| T5 | 39.05 | 17.54 | 35.98 |
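ROUGE-1, ROUGE-2, and ROUGE-L in Table 9 measure unigram overlap, bigram overlap, and longest-common-subsequence overlap between a system summary and its reference. As a minimal, self-contained sketch of the F1 variants (whitespace tokenization, no stemming, so scores will differ slightly from standard implementations such as the `rouge-score` package):

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-N F1 over lowercased whitespace tokens (simplified: no stemming)."""
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence (LCS) of tokens."""
    a, b = candidate.lower().split(), reference.lower().split()
    # Classic O(|a|*|b|) LCS dynamic program.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_n("the cat", "the cat sat")` gives precision 1.0 and recall 2/3, hence F1 = 0.8; the reported corpus scores are such F1 values averaged over the test set and scaled by 100.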
Table 10. Category-wise ROUGE-L and FactCC scores for PEGASUS, BART, and T5 on the NewsSumm test set.
| Category | PEGASUS ROUGE-L | PEGASUS FactCC | BART ROUGE-L | BART FactCC | T5 ROUGE-L | T5 FactCC |
|---|---|---|---|---|---|---|
| Politics | 44.1 | 78.2 | 41.3 | 73.8 | 42.5 | 75.1 |
| Business | 47.3 | 81.0 | 44.5 | 76.5 | 45.6 | 77.8 |
| National | 45.9 | 79.6 | 43.0 | 75.2 | 44.2 | 76.5 |
| Sports | 48.7 | 82.5 | 46.2 | 78.9 | 47.4 | 80.3 |
| International | 42.2 | 76.3 | 39.9 | 70.4 | 41.0 | 73.1 |
| Health | 46.1 | 80.1 | 43.7 | 75.0 | 44.8 | 77.0 |
| Technology | 45.0 | 79.5 | 43.3 | 73.9 | 43.9 | 75.4 |
| Education | 44.8 | 77.9 | 41.5 | 72.1 | 43.0 | 74.2 |
| Environment | 43.5 | 76.5 | 40.1 | 71.8 | 42.2 | 73.6 |
| Law | 45.3 | 78.7 | 42.0 | 74.0 | 43.6 | 75.9 |
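Category-wise scores like those in Table 10 can be summarized by a macro average (unweighted mean over categories). Note that a macro average need not match the corpus-level aggregate in Table 9, since categories contain different numbers of test articles. A small sketch, transcribing the PEGASUS ROUGE-L column:

```python
# PEGASUS category-wise ROUGE-L scores, transcribed from Table 10.
pegasus_rouge_l = {
    "Politics": 44.1, "Business": 47.3, "National": 45.9, "Sports": 48.7,
    "International": 42.2, "Health": 46.1, "Technology": 45.0,
    "Education": 44.8, "Environment": 43.5, "Law": 45.3,
}

# Macro average: each category contributes equally, regardless of its size.
macro_avg = sum(pegasus_rouge_l.values()) / len(pegasus_rouge_l)
print(f"PEGASUS macro-averaged ROUGE-L: {macro_avg:.2f}")  # 45.29
```

The same aggregation applies to the FactCC columns, where the spread between the best (Sports) and worst (International) categories is a quick indicator of domain sensitivity.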
Motghare, M.; Agarwal, M.; Agrawal, A. NewsSumm: The World’s Largest Human-Annotated Multi-Document News Summarization Dataset for Indian English. Computers 2025, 14, 508. https://doi.org/10.3390/computers14120508