In this section, we present the results of our experiments with various LLMs on the DACSA dataset. First, we describe the evaluation metrics used to assess summarization quality. Next, we provide a detailed analysis of the quantitative results across models and approaches. Finally, we compare the impact of our proposed bottleneck strategy to standard zero-shot and one-shot baselines.
4.1. Quantitative Results
Table 3 summarizes the results of the fine-tuned baselines (mBART and mT5) compared to a simple highlight-join heuristic. As expected, mT5 outperforms the other baselines. It achieves ROUGE-2 and ROUGE-L scores of 13.40 and 24.73, respectively, demonstrating its strong ability to generate abstractive summaries in Spanish. In contrast, the highlight-join heuristic underperforms, confirming that simply concatenating salient sentences cannot match the fluency and abstraction of pretrained seq2seq models.
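As a point of reference, the lexical and semantic metrics reported throughout this section can be computed with the rouge_score and bert_score packages, as in the minimal sketch below. The example texts are placeholders, and running BERTScore with a BETO checkpoint (our BETOScore variant) would correspond to passing that checkpoint via the model_type argument; we note this as an assumption about the exact configuration.

```python
# Minimal sketch: lexical (ROUGE) and semantic (BERTScore) evaluation of
# generated summaries against references. The texts below are placeholders.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidates = ["El Ministerio anuncia nuevas medidas sanitarias."]   # system outputs
references = ["El Ministerio de Salud presenta nuevas medidas."]    # gold summaries

# ROUGE-1 / ROUGE-2 / ROUGE-L F-measures, averaged over the corpus
# (stemming disabled because the default Porter stemmer targets English).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
per_doc = [scorer.score(ref, cand) for ref, cand in zip(references, candidates)]
rouge = {k: sum(s[k].fmeasure for s in per_doc) / len(per_doc) for k in per_doc[0]}
print(rouge)

# Multilingual BERTScore F1 for Spanish; a BETO checkpoint could be supplied
# via model_type= to obtain a BETOScore-style variant (assumption).
_, _, f1 = bert_score(candidates, references, lang="es")
print("BERTScore F1:", f1.mean().item())
```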
Table 4 shows the results of the zero-shot setting. Instruction-tuned LLMs such as LLaMA-3.1-70B and Qwen2.5-7B achieve competitive results, with ROUGE-1 scores above 30, demonstrating their ability to generalize to summarization tasks without fine-tuning. Notably, smaller models such as Gemma-2-2B and Mistral-7B reach comparable performance, highlighting the efficiency of recent small- and medium-sized LLMs.
Table 5 shows the results of the one-shot setting, in which a single in-context example is provided. Here, we observe consistent improvements across all models except Gemma-2-2B and Phi-3.5. For example, Gemma-2-9B and LLaMA-3.1-70B achieve ROUGE-1 scores of 33.68 and 33.41, respectively, and their BERTScore F1 values approach 69. These results demonstrate the effectiveness of in-context learning in aligning summaries more closely with human-written references, even with minimal supervision.
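To make the two prompting settings concrete, the sketch below shows one way a zero-shot and a one-shot prompt can be assembled and passed to an instruction-tuned model through the Hugging Face transformers text-generation pipeline. The instruction wording, the model checkpoint, and the placeholder article and demonstration pair are illustrative assumptions rather than the exact prompts used in our experiments.

```python
# Sketch: zero-shot vs. one-shot prompt construction for Spanish news summarization.
# Instruction text, checkpoint, and placeholder inputs are illustrative only.
from transformers import pipeline

INSTRUCTION = "Resume el siguiente artículo de noticias en dos o tres frases."

def build_prompt(article: str, example: tuple[str, str] | None = None) -> str:
    """Zero-shot prompt if `example` is None; one-shot prompt otherwise."""
    parts = [INSTRUCTION]
    if example is not None:  # one in-context (article, summary) demonstration
        parts += ["Artículo:", example[0], "Resumen:", example[1]]
    parts += ["Artículo:", article, "Resumen:"]
    return "\n".join(parts)

generator = pipeline("text-generation", model="google/gemma-2-9b-it")  # illustrative checkpoint
article = "..."            # a DACSA source article
demo = ("...", "...")      # one (article, summary) pair from the training split

zero_shot = generator(build_prompt(article), max_new_tokens=128, return_full_text=False)
one_shot = generator(build_prompt(article, demo), max_new_tokens=128, return_full_text=False)
```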
Table 6 reports the results of our proposed bottleneck prompting method. Gemma-2-9B achieves the highest ROUGE-1 (35.29) and ROUGE-L (24.02) scores among the evaluated systems, demonstrating that generating content based on entity-rich highlights yields measurable improvements. LLaMA-3.1-70B also benefits from bottleneck prompting, improving its ROUGE-2 score to 14.28, the highest value across all experiments. These results confirm that the bottleneck strategy improves factual consistency without sacrificing fluency.
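For illustration, the following sketch outlines one way such a two-stage bottleneck can be implemented: a first generation step extracts entity-rich highlights, and a second step writes the summary conditioned only on those highlights. The prompt wording, checkpoint, and helper names are illustrative assumptions, not the exact pipeline used in our experiments.

```python
# Sketch: two-stage bottleneck prompting (entity-rich highlights first, summary second).
# Prompts, checkpoint, and helper names are illustrative placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-2-9b-it")  # illustrative checkpoint

HIGHLIGHT_PROMPT = (
    "Extrae de este artículo una lista breve de puntos clave, incluyendo las "
    "entidades nombradas (personas, organizaciones, lugares):\n{article}\nPuntos clave:"
)
SUMMARY_PROMPT = (
    "Usando únicamente los siguientes puntos clave, escribe un resumen fiel del "
    "artículo en dos o tres frases:\n{highlights}\nResumen:"
)

def bottleneck_summarize(article: str) -> str:
    # Stage 1: force the model through an entity-rich "bottleneck".
    highlights = generator(HIGHLIGHT_PROMPT.format(article=article),
                           max_new_tokens=128, return_full_text=False)[0]["generated_text"]
    # Stage 2: generate the final summary conditioned only on the highlights.
    summary = generator(SUMMARY_PROMPT.format(highlights=highlights),
                        max_new_tokens=128, return_full_text=False)[0]["generated_text"]
    return summary
```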
Finally, Table 7 compares the best-performing model from each approach. The results show a clear progression: fine-tuned mT5 establishes a robust supervised baseline, zero-shot Qwen2.5-7B exhibits strong out-of-the-box performance, and bottleneck-prompted Gemma-2-9B achieves the best overall ROUGE-1 and ROUGE-2 scores, together with the highest BERTScore and BETOScore. These results suggest that bottleneck prompting can complement standard prompting methods and provide an effective, low-cost alternative to full fine-tuning.
Table 8 presents the results of the paired t-test comparing the performance of the mT5 baseline and the proposed bottleneck prompting approach across lexical and semantic evaluation metrics (ROUGE-1, ROUGE-2, ROUGE-L, BERTScore F1, BETOScore F1). The analysis was conducted to determine whether the observed performance differences between the two models are statistically significant. The results show that the bottleneck prompting method achieves statistically significant improvements in ROUGE-1, BERTScore, and BETOScore (p < 0.05), while the differences in ROUGE-2 and ROUGE-L are not statistically significant (p > 0.1).
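The significance test reduces to a standard paired t-test over the per-document scores of the two systems for each metric, as in the sketch below; the score arrays are illustrative placeholders for the per-document values.

```python
# Sketch: paired t-test over per-document metric scores of two systems.
# The arrays are placeholders: one value per test document for the same metric
# (e.g., ROUGE-1 F-measure), aligned by document.
import numpy as np
from scipy.stats import ttest_rel

mt5_scores = np.array([0.31, 0.28, 0.35, 0.30, 0.29])          # baseline (illustrative values)
bottleneck_scores = np.array([0.34, 0.30, 0.36, 0.33, 0.31])   # bottleneck prompting (illustrative)

t_stat, p_value = ttest_rel(bottleneck_scores, mt5_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # significant improvement when p < 0.05
```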
Figure 4 summarizes the results using bar plots, allowing a more intuitive comparison of the methods. The plots highlight that bottleneck prompting improves ROUGE-1 and ROUGE-L scores for most models, particularly LLaMA-3.1-70B and Gemma-2-9B.
Overall, the results highlight three key findings. First, one-shot prompting improves upon zero-shot performance for nearly all models. Second, bottleneck prompting improves content selection and entity preservation, yielding additional gains. Third, fine-tuned baselines remain competitive, but modern open LLMs achieve comparable or better performance without requiring supervised training.
4.2. Analysis of Results
In this section, we analyze the performance of different models and approaches using the DACSA dataset. First, we review the supervised, fine-tuned baselines. Next, we discuss the performance of zero-shot and one-shot prompting. Lastly, we evaluate the impact of our proposed bottleneck prompting strategy. The section concludes with a comparative analysis of the best model from each approach.
As expected, the supervised baselines confirm that mT5 outperforms mBART by a small margin, achieving a ROUGE-2 score of 13.40 and a ROUGE-L score of 24.73. In contrast, the heuristic highlight-join approach performed considerably worse, showing that simply concatenating salient sentences cannot match the abstraction and fluency of neural models.
In the zero-shot setting, large, instruction-tuned LLMs demonstrate robust summarization abilities. For instance, LLaMA-3.1-70B and Qwen2.5-7B both achieve ROUGE-1 scores above 30 and BERTScore F1 values around 68. This shows that modern open-weight models can generalize to summarization tasks without task-specific training. Although smaller models such as Gemma-2-2B and Mistral-7B perform slightly worse, they still deliver competitive results, showcasing the effectiveness of recent compact open-weight models.
In the one-shot setting, adding a single in-context example consistently improves performance across families. On average, ROUGE-1 increases by nearly one point, and ROUGE-L increases by more than one point, compared to zero-shot. This confirms that minimal supervision helps models better align with the target summarization style. Gemma-2-9B and LLaMA-3.1-70B are the strongest one-shot models, achieving ROUGE-1 scores above 33 and BERTScore F1 values close to 69. These results demonstrate the effectiveness of in-context learning for Spanish summarization.
Compared to one-shot prompting, the bottleneck prompting approach further improves ROUGE-1 and ROUGE-L scores, particularly for larger models such as LLaMA-3.1-70B and Gemma-2-9B. However, BERTScore F1 decreases slightly, suggesting that, although bottleneck prompting enhances factual grounding and lexical alignment, it reduces paraphrastic flexibility.
The improvement in lexical overlap indicates that generating content based on entity-rich highlights strengthens factual alignment and sequence-level coherence. However, this approach may limit the paraphrasing flexibility captured by embedding-based metrics. At the model level, the gains are most evident for larger LLMs. For instance, LLaMA-3.1-70B increases its ROUGE-2 and ROUGE-L scores to 14.28 and 24.00, respectively, surpassing the fine-tuned mT5 model in bigram and subsequence overlap. Gemma-2-9B achieves the highest ROUGE-1 score, 35.29, across all settings. Qwen2.5-7B also benefits, improving its ROUGE scores across the board with only a negligible drop in BERTScore F1. Conversely, smaller models, such as Gemma-2-2B, perform worse under the bottleneck strategy, indicating that their limited capacity hinders their ability to integrate structured guidance effectively.
When comparing the best-performing model from each approach, a clear progression emerges. While the fine-tuned mT5 baseline remains a strong supervised reference, the zero-shot Qwen2.5-7B model performs competitively without additional training. Ultimately, bottleneck prompting delivers the best ROUGE-1, ROUGE-2, BERTScore, and BETOScore results, with Gemma-2-9B and LLaMA-3.1-70B leading the way. These results confirm the effectiveness of the proposed bottleneck prompting approach.
To better understand the numerical results, we evaluated model outputs using representative DACSA articles. Two illustrative cases, one institutional and one biographical, highlight the main behavioral patterns and qualitative advantages of our approach.
Figure 5 clearly illustrates differences in information coverage and factual control. The fine-tuned mT5 misinterprets the source, introducing a new action that is not present in the gold summary (“reporting non-compliant establishments”). Qwen2.5-7B produces a fluent yet overly detailed summary, adding fabricated elements such as “sanctions” and “mask use”. The one-shot Gemma-2-9B aligns better with the reference, preserving the intended meaning (“prudence and responsibility”) and avoiding hallucinated content. The bottleneck version yields the most structured and contextually rich summary and adds entities like “Ministerio de Salud”. Thus, this example demonstrates that the bottleneck prompting effectively enhances coherence and factual control.
Figure 6 shows mT5’s complete failure, producing unrelated political content and indicating poor domain generalization. Qwen2.5-7B generates a rich, albeit largely fabricated, biography, adding plausible yet nonexistent facts. The one-shot Gemma-2-9B model produces a balanced and contextually coherent summary. The bottleneck version provides the most factually faithful and semantically grounded text. It correctly retains entities (“Vw8vc”, “Roque Company”, “Ciencia de la Información”) and structures the narrative around real information from the gold summary.
Figure 7 presents a comparative analysis between the baseline (mT5) summarization model and the proposed bottleneck prompting approach, focusing on linguistic, morphological, and lexical characteristics of the generated summaries. To conduct this deeper linguistic analysis, we employed the UMUTextStats [29] tool. After obtaining the linguistic features, we calculated the information gain for each set of summaries. Then, we ordered the features by their information gain coefficients and calculated the mean for each feature. We normalized their values to 100% to observe the differences between the two models. As can be seen, the bottleneck model generates longer summaries than the baseline model, but the baseline model has a higher type-token ratio (TTR), meaning it uses a greater variety of words. Other morphological features, such as the use of first-person verbs, proper nouns, and adjectives, did not show significant variation among the produced summaries.
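As a rough illustration of this analysis, the sketch below computes two simple features (summary length and type-token ratio) for each output and ranks them by information gain with respect to the generating system, using scikit-learn's mutual information estimator. It is a simplified stand-in for the UMUTextStats feature set, and the toy summaries are placeholders.

```python
# Sketch: comparing simple linguistic features of two sets of summaries and
# ranking them by information gain (mutual information with the system label).
# Simplified stand-in for the UMUTextStats features; summaries are toy examples.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def features(summary: str) -> list[float]:
    tokens = summary.lower().split()
    ttr = len(set(tokens)) / max(len(tokens), 1)   # type-token ratio
    return [float(len(tokens)), ttr]               # [length, TTR]

baseline_summaries = [                              # mT5 outputs (toy placeholders)
    "El gobierno anuncia medidas.",
    "La empresa presenta resultados anuales.",
    "El club ficha a un nuevo entrenador.",
    "El ayuntamiento aprueba el presupuesto.",
]
bottleneck_summaries = [                            # bottleneck outputs (toy placeholders)
    "El Ministerio de Salud anuncia nuevas medidas sanitarias.",
    "La empresa presenta sus resultados anuales en Madrid.",
    "El club deportivo ficha a un nuevo entrenador argentino.",
    "El ayuntamiento de Valencia aprueba el presupuesto municipal.",
]

X = np.array([features(s) for s in baseline_summaries + bottleneck_summaries])
y = np.array([0] * len(baseline_summaries) + [1] * len(bottleneck_summaries))

gains = mutual_info_classif(X, y, discrete_features=False, random_state=0)
for name, gain in sorted(zip(["length", "TTR"], gains), key=lambda x: -x[1]):
    print(f"{name}: information gain = {gain:.3f}")
```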
Overall, three findings stand out from the results. First, one-shot prompting consistently outperforms zero-shot summarization across all models, highlighting the benefits of minimal supervision. Second, bottleneck prompting enhances factual grounding and content selection, particularly for higher-capacity LLMs. This yields higher ROUGE-1, ROUGE-2, and ROUGE-L scores. Third, fine-tuned baselines, such as mT5, remain competitive. However, modern open LLMs with structured prompting can achieve comparable or superior results without requiring domain-specific training. These results suggest a promising path for multilingual summarization in settings with limited resources, such as Spanish news.