Open Access Article
NewsSumm: The World’s Largest Human-Annotated Multi-Document News Summarization Dataset for Indian English
by Manish Motghare 1,*, Megha Agarwal 2,* and Avinash Agrawal 1,3
1 Shri Ramdeobaba College of Engineering and Management, Affiliated to Rashtrasant Tukdoji Maharaj Nagpur University, Nagpur 440013, India
2 School of Medicine, Stanford University, Stanford, CA 94305, USA
3 Department of Artificial Intelligence and Cyber Security, Ramdeobaba University, Nagpur 440013, India
* Authors to whom correspondence should be addressed.
Computers 2025, 14(12), 508; https://doi.org/10.3390/computers14120508 (registering DOI)
Submission received: 1 October 2025 / Revised: 6 November 2025 / Accepted: 18 November 2025 / Published: 23 November 2025
Abstract
The rapid growth of digital journalism has heightened the need for reliable multi-document summarization (MDS) systems, particularly in underrepresented, low-resource, and culturally distinct contexts. However, progress is hindered by a lack of large-scale, high-quality non-Western datasets. Existing benchmarks such as CNN/DailyMail, XSum, and MultiNews are limited by language, regional focus, or reliance on noisy, auto-generated summaries. We introduce NewsSumm, the largest human-annotated MDS dataset for Indian English, curated by over 14,000 expert annotators through the Suvidha Foundation. Spanning 36 Indian English newspapers from 2000 to 2025 and covering more than 20 topical categories, NewsSumm includes 317,498 articles paired with factually accurate, professionally written abstractive summaries. We detail its collection, annotation, and quality control pipelines, and present statistical, linguistic, and temporal analyses that underscore its scale and diversity. To establish benchmarks, we evaluate PEGASUS, BART, and T5 models on NewsSumm, reporting aggregate and category-specific ROUGE scores as well as factual consistency metrics. All NewsSumm dataset materials are openly released via Zenodo. NewsSumm offers a foundational resource for advancing research in summarization, factuality, timeline synthesis, and domain adaptation for Indian English and other low-resource language settings.