Article

A Modular Approach to Automated News Generation Using Large Language Models

by Omar Juárez Gambino 1,*, Consuelo Varinia García Mendoza 1, Braulio Hernandez Minutti 1, Carol-Michelle Zapata-Manilla 1, Marco-Antonio Bernal-Trani 1 and Hiram Calvo 2

1 Escuela Superior de Cómputo (ESCOM), Instituto Politécnico Nacional (IPN), Mexico City 07738, Mexico
2 Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Mexico City 07738, Mexico
* Author to whom correspondence should be addressed.
Information 2026, 17(4), 319; https://doi.org/10.3390/info17040319
Submission received: 6 February 2026 / Revised: 19 March 2026 / Accepted: 23 March 2026 / Published: 25 March 2026

Abstract

Advances in Generative Artificial Intelligence have enabled the development of models capable of generating text, images, and audio that are similar to what humans can create. These models often have valuable general knowledge thanks to their training on large datasets. Through fine-tuning or prompt-based adaptation, this knowledge can be applied to specific tasks. In this work, we propose a modular approach to automated news generation using Large Language Models, composed of an information retrieval module and a text generation module. The proposed system leverages both publicly available (open-weight) and proprietary Large Language Models, enabling a comparative evaluation of their behavior within the proposed news generation pipeline. We describe the experiments carried out with a total of five representative Large Language Models spanning both categories, detailing their configurations and performance. The results demonstrate the feasibility of using Large Language Models to automate this task and identify systematic differences in behavior across model categories, as well as the problems that remain to be solved to enable fully autonomous news generation.


1. Introduction

From its inception, journalism has continuously evolved in its practices and technologies. The invention of the printing press in the 15th century laid the foundation for the emergence of the first newspapers, while the development of radio, television, and, more recently, the Internet has diversified the media and accelerated news dissemination [1]. The proliferation of social media and online news platforms has changed how we consume information. Users now demand timely updates on events of interest, forcing news organizations to publish information more quickly.
To meet this demand, media outlets are increasingly turning to tools that use Artificial Intelligence to assist them in creating news articles. Prestigious media outlets have successfully used automated news production from structured data, such as financial reports or sports results [2]. Text generation tools allow media organizations to create and publish news quickly and efficiently, freeing journalists to focus on more analytical and investigative tasks [3].
Generative Artificial Intelligence (GAI) enables the creation of models that can produce highly realistic synthetic data. These models have been used to create images, text, and even music [4]. With the advent of Large Language Models (LLMs) like GPT, the use of GAI has become widespread, enabling faster and more flexible application development. While LLMs have demonstrated their potential to generate text in general domains (e.g., chatbots), their use in specific domains still faces many challenges.
Automated news generation from unstructured data emerges as a potential application for such models. In this paper, we explore the use of LLMs for generating news articles in Spanish. In our proposal, we use basic information about the event of interest and subsequently augment it with data gathered from news sources. This information is provided as context into an LLM to generate a news article. We evaluate this approach using both publicly available (open-weight) and proprietary LLMs, analyzing their configurations and performance to assess differences in behavior between these model categories.
The experimental results demonstrate the feasibility of using LLMs to automate news generation while highlighting current limitations that prevent fully autonomous deployment.
The rest of the paper is organized as follows. First, Section 2 presents related work. Next, Section 3 explains the proposed model for news article generation. Section 4 describes the process of retrieving similar news to augment the initial data. Then, Section 5 explains the use of LLMs for news article generation. Experiments and results are presented in Section 6. A discussion of the generative model’s performance is conducted in Section 7. Finally, Section 8 presents the conclusions and future work.

2. Related Work

The task of news generation automation has been referred to as robotic journalism, automated journalism, algorithmic journalism and computational journalism. All of these terms refer to the process of using software or algorithms to generate news stories without direct human intervention [5]. The need to automate this process is not only to free journalists from this task but also because the speed at which information is shared in specific sectors, such as finance, is crucial for decision-making.
The surge in demand for real-time online news has driven the development of numerous automated news generation methods. These methods span from basic templates to advanced language models. In [6], the authors describe some of the earliest efforts to automate news generation using structured data to fill templates, such as the work done by Thomson Reuters in 2006 [7].
GAI has driven significant progress in fields like Computer Vision and Natural Language Processing. GAI models can create data that is highly similar to human-generated content. In the realm of Natural Language, they have proven capable of producing summaries and essays [8].
Currently, Artificial Intelligence has had an impact on the automation of journalism tasks such as automated content production, data extraction, news dissemination and content optimization [5]. In the particular case of automated content production, the resurgence of neural networks through neural generation models, the development of the transformer architecture and access to large amounts of information on the Internet allow automated content generation. For example, LLMs have been shown to generate high-quality and contextually appropriate headlines and summaries, especially when combined with human-guided prompts [9,10]. The integration of AI into journalism showcases its potential to address various challenges and optimize workflows across different domains. While organizations like Bloomberg utilize AI-driven tools to enhance financial journalism with customized models for precision and speed, general-purpose systems like ChatGPT demonstrate the adaptability of language models in generating high-quality content across broader contexts.
Bloomberg has implemented AI technologies for tasks such as dynamic headline generation, structured content planning, and chart and table analysis. The use of artificial intelligence has enabled partial and complete automation of financial news, improving both the speed of reaction to breaking news and the transparency of complex data analyses. The company developed BloombergGPT, a model tailored for financial applications, and integrated ethical evaluation to mitigate the risks of misinformation and inaccuracies [11].
In addition, ChatGPT-4 has been analyzed for its role as a journalist, particularly in generating content on sensitive topics such as migration. The findings indicate that ChatGPT prioritizes factual accuracy over sensationalism and demonstrates racial awareness and objectivity compared to traditional media outlets. However, it reproduces biases inherent in its training data. This highlights the need for frameworks to critically evaluate LLMs in shaping societal narratives and their potential impact on public opinion [12].
The use of LLMs in journalism has also highlighted significant challenges. As noted in [13], these models are not only capable of generating accurate and high-quality news articles, but can also be exploited to produce and disseminate disinformation. The sophistication of LLMs in understanding context and crafting compelling narratives raises ethical concerns, as they can be misused to create misleading or biased content [14]. To address these concerns, different methods are required to guide LLMs in generating ethical and accurate content. For example, combining LLMs with fine-tuned datasets has shown promise in reducing the generation of undesired outputs while maintaining the high-quality text generation capabilities of the models. Furthermore, ongoing advancements in evaluation metrics and alignment techniques aim to mitigate risks and enhance the reliability of LLM-based systems in journalism.
Automated news generation in Spanish from unstructured data has received limited attention in the literature and has mainly been explored through related tasks such as automatic summarization and headline generation. In this context, the NoticIA dataset introduces a collection of Spanish news articles paired with human-written summaries and evaluations of large language models, highlighting challenges related to factual consistency and hallucinations in automatically generated news content [15]. These limitations are closely linked to the scarcity of resources and task-specific studies for Spanish, as discussed in recent surveys on automatic text summarization for the language [16]. In addition, previous work has explored aspects of headline processing and generation. For instance, some studies focus on detecting misleading headlines and modeling the semantic relationship between headlines and article bodies in Spanish news data, facilitating the automatic recognition of contradictions using pre-trained language models [17]. Overall, while existing research demonstrates progress in partial tasks related to news generation, the end-to-end generation of full news articles in Spanish from unstructured sources, particularly through systematic comparisons between open-weight and proprietary large language models, remains relatively underexplored.

3. Proposed Automated News Generation Model

Traditionally, writing a news article involves several steps that require significant human effort and time. A reporter first gathers information about an event, selects the relevant details, and writes the article. This draft is then reviewed and approved by an editor before it is finally published. While efficient, this process can be time-consuming, particularly when rapid response is required for current events, such as natural disasters.
The proposed generative model aims to automate information gathering and article drafting. This automation will significantly reduce the time required to produce a news article, enabling faster dissemination of information.
One scenario where the generative model can be beneficial is when a reporter has limited information about a recent event and needs to publish an article quickly. In this case, the model can automatically collect data from various sources, filter it, and, after the writer curates the relevant information, write a coherent article. This enables reporters to produce timely news without sacrificing quality or accuracy.
The generative model comprises two modules: Information gathering and selection, and News article generation. Figure 1 shows the layout of the proposed modules. The diagram illustrates the overall workflow of the model for generating news articles using a Large Language Model (LLM) applying the modules described below.
  • Information gathering and selection
    1. Initial data: The process begins when the writer provides the initial data, which includes the place, date, and keywords (e.g., event, people, organizations) related to the journalistic article they want to generate.
    2. RSS feeds: The model retrieves news articles from various RSS feeds. These sources provide structured data, including titles, content, publication dates, and links.
    3. Sentence Embedding: The initial data (place, date, keywords) and the content of the news articles from RSS feeds are converted into embeddings.
    4. Evaluation of similarity: The model compares the embeddings of the writer’s initial data with those of the news articles to find the most similar ones, based on a set threshold, and thus filters the information presented to the writer.
    5. Selection of relevant information: The writer selects the most relevant articles from those returned by the similarity evaluation process.
  • News article generation
    1. Augmented data: The relevant information selected in the previous stage is used to augment the initial data provided by the writer. This enriched information provides a better context for the news article.
    2. Prompt: Serves as a query or instruction to specify the task to be performed by the model. This prompt incorporates the augmented data and helps the model understand the task, ensuring the generated output is coherent and relevant to the news article.
    3. Model behavior: Sets top-level directives that shape how the model responds, particularly focusing on adapting the writing style to meet the specific requirements of journalistic writing.
    4. Large Language Model: The created prompt is fed to an LLM to generate news articles aligned with the information provided.
    5. Generated news article: The output of the model is a document that follows the structure of a news article, aligned with the information provided in the augmented data.
The following sections provide a more detailed description of the two modules that comprise the generative model.

4. Information Gathering and Selection

Newspapers typically organize their information into sections such as sports, business, and politics. Despite the differences in the content reported in each section, the structure of that content is often similar. In [18], the authors point out that, for a piece of information to be considered news, it must include details such as a place, a date, and an event. On this basis, the first step for the writer is to provide the model with these minimum data.
The initial data provided are used to search for additional information about the event from various news sources. The search process has two purposes: firstly, it verifies whether the data supplied by the writer can be found in news published by reputable media outlets. The fact that other media outlets report on the event allows for a certain degree of certainty that the reported event actually occurred. Secondly, it complements the initial information provided by the writer. The additional information provides the model with a better context of the event.
In the search process, we use RSS (Really Simple Syndication) feeds from three Mexican newspapers: La Jornada https://www.jornada.com.mx/v7.0/cgi/rss.php (accessed on 1 October 2023), Expansión https://expansion.mx/canales-rss (accessed on 1 October 2023), and Reforma https://www.reforma.com/libre/estatico/rss/ (accessed on 1 October 2023). These feeds are a standard way to efficiently distribute up-to-date information. The selected media outlets use them to share their latest content for free.
The selected newspapers cover a wide range of topics and offer comprehensive news coverage. La Jornada focuses on national and international news, social issues, and cultural topics. Expansión specializes in business, economics, and finance, while Reforma offers a broad spectrum of news, including politics, society, and technology. This diversity ensures that the information obtained is comprehensive and covers multiple sections. Another consideration for this selection is the political ideology of these newspapers’ editorial boards. We wanted a variety of editorial focuses to present different points of view to the writer before selecting relevant information. La Jornada is known for its left-wing ideology, while Reforma is right-wing conservative [19,20]. Expansión has no clearly defined ideological stance, which reduces the risk of a one-sided political slant. Although these sources provide diversity in topics and editorial perspectives, the retrieval process is limited to Mexican news outlets. Therefore, the results of this study should be interpreted within the context of Spanish-language news from these sources, and may not generalize to other regions or media.
RSS feeds provide information such as the news title, description, publication date, and a link to the full article. Figure 2 shows an example of an article extracted from the La Jornada RSS feed (translated from Spanish to English) in XML format. We use Feedparser 6.0.10 https://pypi.org/project/feedparser/ (accessed on 1 October 2023), a Python 3 library that parses RSS feeds, for retrieving the news articles.
For each retrieved RSS entry, the title and description are extracted. Then, they are concatenated as: news article = title + “ ” + description.
Minimal preprocessing was applied to news articles. HTML tags and non-textual elements were removed, while punctuation, casing, and stopwords were preserved to retain contextual information.
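The retrieval and cleaning steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the entry dictionary mimics what the feedparser library returns for one RSS item, and the tag-stripping regex is an assumed implementation of the "minimal preprocessing" described (HTML removed; punctuation, casing, and stopwords preserved).

```python
import re
from html import unescape

def entry_to_text(entry: dict) -> str:
    """Concatenate an RSS entry's title and description into a single text:
    news article = title + " " + description.
    `entry` mimics one item parsed by feedparser from an RSS feed."""
    def clean(raw: str) -> str:
        text = re.sub(r"<[^>]+>", " ", raw)       # strip HTML tags
        text = unescape(text)                     # decode HTML entities
        return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

    return clean(entry.get("title", "")) + " " + clean(entry.get("description", ""))
```

In the actual pipeline, each `entry` would come from iterating over `feedparser.parse(feed_url).entries` for the three newspapers' feeds.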
To compute semantic similarity between the initial data and news articles, we used the OpenAI text-embedding-3-small embedding model, which produces dense vector representations suitable for semantic retrieval tasks. This model supports multilingual input and can represent Spanish texts effectively. Embeddings for the initial data and news articles (the single text resulting from the concatenation described earlier) were created using this model.
The model returns a 1536-dimensional embedding vector, which represents the semantic content of the input text. The maximum input length supported by the model is 8192 tokens, which was sufficient to encode the full article content in our dataset. When articles exceeded this limit, they were truncated to the maximum allowed length. The embedding model internally produces a fixed-length vector representation for each input text, without requiring explicit pooling operations at the user level.
Cosine similarity was used to compare the embeddings of the initial data and the retrieved news articles. Cosine similarity measures the cosine of the angle between two vectors in a multidimensional space, providing a value between 0 and 1 [21]. Cosine similarity is defined in Equation (1):
    Cosine similarity = (A · B) / (‖A‖ ‖B‖)
where:
  • A and B are the vectors representing the texts. Each vector is constructed from the text embeddings, which are high-dimensional numerical representations that capture the semantic meaning of words and phrases.
  • A · B is the dot product of vectors A and B, calculated as the sum of the products of the corresponding entries of the two vectors: A · B = Σᵢ₌₁ⁿ Aᵢ Bᵢ.
  • ‖A‖ and ‖B‖ are the magnitudes (Euclidean norms) of vectors A and B, respectively, calculated as the square root of the sum of the squares of their components: ‖A‖ = √(Σᵢ₌₁ⁿ Aᵢ²) and ‖B‖ = √(Σᵢ₌₁ⁿ Bᵢ²).
Cosine similarity values range from 0 to 1, where 1 indicates that two documents are identical and 0 indicates that they share no features in common. Different similarity thresholds were tested to determine which news articles should be shown to the writers, so they can select those they deem relevant and use their content to enrich the information about the event. Details of these experiments can be found in Section 6.1.
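Equation (1) and the retrieval filter translate directly into code. The following is a minimal sketch: the 0.6–0.9 range is the one selected in Section 6.1, and in practice the inputs would be the 1536-dimensional embedding vectors described above rather than the toy vectors in the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (Equation (1))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes_threshold(sim, low=0.6, high=0.9):
    """Retrieval filter using the similarity range chosen in Section 6.1."""
    return low <= sim <= high
```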
Additionally, this module was used to retrieve the news articles later employed for fine-tuning the LLMs. This stage is essential for improving the accuracy and coherence of the articles generated by our model. Details of the collection of these articles are given in the following section.

5. News Article Generation

Large language models can follow instructions, or prompts, to complete specific tasks [8]. In the news article generation process, the primary input for the model is created through a combination of an instruction and the augmented data. This is called the prompt construction. By providing a clear and specific prompt, combined with relevant contextual data, it is possible to guide the model to produce a consistent and factually sound output.
Additionally, the behavior of the model is regulated by a set of top-level instructions, which ensure that the generated articles adhere to journalistic standards. For instance, the model is instructed to always include the essential elements of date, place, and event, while ensuring that no fabricated information is introduced, and that formal language is maintained.
The specific prompt and model behavior used to instruct the model may have the following structure:
  • Prompt: “Create a news article with this information: <Statement with the augmented data>.”
  • Model behavior: “Your task is to write news articles that always include a date, a place, and an event. You cannot hallucinate information that is not given, use formal language.”
This combination of prompt and model behavior with carefully curated augmented data ensures that the model generates high-quality, accurate, and context-rich news articles.
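Assembled as chat-style messages, the prompt and model behavior above might look like this. The message texts are taken from the bullets in this section; the surrounding API client and generation parameters are omitted, so this is a sketch rather than the authors' exact request code.

```python
def build_messages(augmented_data: str) -> list:
    """Combine the top-level model-behavior directives (system role)
    with the task prompt carrying the augmented data (user role)."""
    system = ("Your task is to write news articles that always include a "
              "date, a place, and an event. You cannot hallucinate "
              "information that is not given, use formal language.")
    user = f"Create a news article with this information: {augmented_data}."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The returned list is the shape expected by chat-completion APIs such as OpenAI's, where it would be passed as the `messages` argument.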
The expected output of the model is a news article that meets the following characteristics:
  • It should follow the traditional structure of a news article, including location, date, and event, as described in [18].
  • The model should not hallucinate information. All generated information must be based on the provided context.
  • The augmented data should be used effectively to generate an informative article.
In addition, a fine-tuning process was conducted to improve the performance of LLMs in the specific task of generating news. Models were trained with a dataset of Spanish news articles, ensuring that it captured the particularities of the language and journalistic style.
A total of 15,952 news articles were collected over an eight-month period (October 2023 to May 2024) from the three RSS feeds mentioned above. These sources provide content that is freely accessible for non-commercial use. Duplicate and near-duplicate articles were removed based on URL and title matching. The collection was divided into a training set comprising 90% (14,356 articles) and a validation set comprising 10% (1596 articles). The GPT-2 model was fine-tuned on the full training set and evaluated on the validation set. The other fine-tuned models were trained on a subset of 100 news articles from the training set, given their ability to generalize from a small number of examples. This 100-article subset was divided proportionally by article length in tokens to avoid length bias. The goal of fine-tuning was not to maximize model performance, but to adapt the models to journalistic structure and Spanish stylistic conventions under constrained data and computational resources. The distribution was as follows:
  • 33 articles with more than 400 tokens provide substantial content and context, allowing the model to learn from longer and more detailed stories;
  • 33 articles with 200 to 400 tokens help the model adapt to generating concise yet informative news articles;
  • 34 articles with 50 to 199 tokens ensure the model can handle brief news without losing essential information.
This composition addresses a bias in which the language model tended to generate articles averaging around 150 tokens; articles of that length are often too short to convey the essential information. By diversifying the length of the training articles, we aimed to create a more balanced and versatile model.
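The bucketing above can be sketched as a simple grouping function. The paper does not specify which tokenizer defines the token counts, so whitespace splitting is assumed here as a stand-in; articles under 50 tokens fall outside all three published ranges and are left unassigned.

```python
def bucket_by_length(articles, tokenize=str.split):
    """Group articles into the three length buckets used for the
    100-article fine-tuning subset. `tokenize` is an assumed stand-in
    for whatever tokenizer produced the paper's token counts."""
    buckets = {"short (50-199)": [], "medium (200-400)": [], "long (>400)": []}
    for article in articles:
        n = len(tokenize(article))
        if n > 400:
            buckets["long (>400)"].append(article)
        elif n >= 200:
            buckets["medium (200-400)"].append(article)
        elif n >= 50:
            buckets["short (50-199)"].append(article)
        # articles under 50 tokens fall outside the published ranges
    return buckets
```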
Once the models were trained, 20 evaluation instances were defined, based on initial data and corresponding augmented data drawn from news articles not included in the full training set and, therefore, not in the smaller training subset. This separation was enforced to prevent data leakage.
We used both publicly available and proprietary models to evaluate their performance on the proposed task. The publicly available models were GPT-2 and LLaMA 3, while the proprietary models were GPT-3.5, Gemini 1.0 Pro, and GPT-4o-mini. The parameters used, the dataset processing, and the characteristics of the fine-tuning process applied to these models are described below.

5.1. Fine-Tuning Process for GPT-2

The dataset was preprocessed for fine-tuning the GPT-2 model. The key preprocessing steps were as follows:
1. The spaCy https://spacy.io/ (accessed on 1 October 2023) library 3.7.0 was used to extract named entities (NER) from the text, specifically locations (LOC), and to enrich the dataset by combining these entities with other fields, such as the title, date, and content of each news article. This structured information provides a more complete context during fine-tuning of the GPT-2 model.
2. Unlike other models, GPT-2 does not allow explicitly defining a prompt or its behavior. Although the augmented data is preprocessed to enrich the dataset, GPT-2 uses the input text autonomously, without direct control over how each part of the context influences the generation.
3. A special separator, “->”, was used to divide the augmented information from the content the model is expected to generate. This separation helps to clearly distinguish between the metadata added during preprocessing and the main content expected from GPT-2 during the fine-tuning process.
Figure 3 shows an example of a preprocessed news article, illustrating how the text was reorganized and enriched.
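A single training line for GPT-2 might be assembled as follows. The exact field layout is an assumption for illustration (Figure 3 shows the actual format); in the real pipeline the location field would be filled from spaCy's LOC entities rather than passed in directly.

```python
def format_gpt2_example(title: str, date: str, location: str, content: str) -> str:
    """Build one GPT-2 fine-tuning line: structured context fields to the
    left of the '->' separator, the expected article body to the right.
    Field names and ordering are hypothetical; see Figure 3 for the
    format actually used."""
    context = f"Title: {title} | Date: {date} | Location: {location}"
    return f"{context} -> {content}"
```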
Table 1 lists the key parameters used for fine-tuning the model.

5.2. Fine-Tuning Process for GPT-3.5

For the fine-tuning of the GPT-3.5 language model, the data was preprocessed using the following steps:
1. The news articles were formatted into a structured JSON format suitable for the model. The formatted data included each article’s title, date, and content.
2. Each example was transformed into a sequence of messages, including roles such as “system,” “user,” and “assistant.” The system message set the model behavior, the user message provided the prompt for news generation, and the assistant message contained the expected output.
3. The content of each news article was adjusted to ensure clarity and coherence. Dates were reformatted to natural expressions, and the location was extracted and included if available.
Figure 4 illustrates an example of the structured format used for the fine-tuning process, including the system instructions, user prompt, and the expected assistant output:
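One such example can be serialized as a JSON line in the messages format used by OpenAI chat fine-tuning. The wording of the system and user messages below is an assumption based on the description in this section; Figure 4 shows the actual instructions used.

```python
import json

def to_chat_example(title: str, date: str, location: str, content: str) -> str:
    """Serialize one fine-tuning example as a JSONL record with
    system/user/assistant roles. Message wording is illustrative."""
    messages = [
        {"role": "system",
         "content": "Your task is to write news articles that always "
                    "include a date, a place, and an event."},
        {"role": "user",
         "content": f"Create a news article about: {title}, on {date} in {location}."},
        {"role": "assistant", "content": content},
    ]
    return json.dumps({"messages": messages}, ensure_ascii=False)
```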
The fine-tuning process involved several key parameters, which are summarized in Table 2.

5.3. Fine-Tuning Process for Gemini

The fine-tuning process for the Gemini language model involved several steps that were similar to the process used for GPT-3.5.
These similarities include formatting the data into structured JSON and transforming each example into a sequence of messages, with roles such as “system”, “user”, and “model”. The system message set the model behavior, the user message provided the prompt for news generation, and the model message contained the expected output. For both models, the content of each news article was also adjusted to ensure clarity and coherence: dates were reformatted to natural expressions, and locations were extracted and included if available.
The key difference is that the code for Gemini included slight variations in how the conversation data was structured and formatted, reflecting differences in model requirements.
Figure 5 illustrates an example of the structured format used for the fine-tuning process, including the system instructions, user prompt, and the expected assistant output:
The fine-tuning process involved key parameters, summarized in Table 3.

5.4. Fine-Tuning Process for LLaMA 3

The fine-tuning process for the LLaMA 3 language model involved several differences compared to GPT-3.5 and Gemini. Here, we outline the unique aspects of the process:
1. Unlike GPT-3.5 and Gemini, the LLaMA 3 fine-tuning process did not use a chat format with roles like “user”, “assistant” or “system”.
2. LLaMA 3 used specific instructional tags like “Instruction”, “Input” and “Response” to mark the beginning and end of the instructions and content. This approach helped the model distinguish between the instructional context and the actual news content more clearly.
Figure 6 shows an example of the structured format utilized during the fine-tuning process, encompassing the system instructions and the formatted news content.
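A LLaMA 3 training example built from these tags might look like the following Alpaca-style layout. The `###` markers and section wording are assumptions for illustration; Figure 6 shows the exact template used in the paper.

```python
def format_llama3_example(instruction: str, input_text: str, response: str) -> str:
    """Build one LLaMA 3 fine-tuning example using Instruction/Input/
    Response tags to separate instructions from content. The '###'
    delimiters are a common convention, assumed here."""
    return (
        "### Instruction:\n" + instruction + "\n\n"
        "### Input:\n" + input_text + "\n\n"
        "### Response:\n" + response
    )
```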
The fine-tuning process for LLaMA 3 included some important parameters, which are detailed in Table 4.

5.5. Fine-Tuning Process for GPT-4o-Mini

The decision was made to evaluate the proprietary language model GPT-4o-mini without adjustments. This design choice is based on both methodological and practical reasons. From a methodological perspective, evaluating proprietary models without fine-tuning allows for a controlled comparison with open-weight models, whose fine-tuning procedures, accessibility, and resource requirements differ substantially. From a practical standpoint, fine-tuning closed models often involves high computational and economic costs and may be limited by platform-specific access policies, making immediate use a common implementation scenario. Finally, given the powerful zero-shot and few-shot capabilities of recent proprietary models, this evaluation focuses on assessing their basic ability to generate coherent news articles using only contextual cues, rather than maximizing task-specific performance through additional training. Despite the absence of fine-tuning, text generation parameters were adjusted to control output style and coherence. In particular, a temperature of 1 was selected to balance fidelity to the input data and the ability to generate coherent, non-redundant news articles. Lower temperatures produced more deterministic outputs with high lexical overlap, leading to extractive behavior. In contrast, higher temperatures increase variability and improve the integration of multiple sources, although they may introduce hallucinations [22]. In our approach, this risk is mitigated by grounding the model in curated augmented data and constraining generation through prompt instructions. Under these conditions, a temperature of 1 provides a suitable trade-off between diversity and factual consistency.
The final configuration of these parameters is reported in Table 5.
The model behavior and prompt used in the experiments were the same as those described for the GPT-3.5 model.

6. Experiments and Results

In this section, we describe the experiments conducted to gather information from news articles similar to the initial data provided by the writer. We emphasize the tests that led to the selection of the similarity threshold used to retrieve news that may be relevant for complementing the initial data. We also describe the experiments related to news generation, showing the different configurations of the models and their performance.

6.1. Experiments on the Information Gathering and Selection Module

The objective of this module is to identify and retrieve news articles similar to the initial data provided by the author, including location, date, and keywords. To determine the most appropriate similarity threshold for retrieving relevant news articles, we conducted an empirical evaluation using human annotations. A set of 20 queries representing the initial event data was defined. For each query, the retrieval module generated candidate articles using cosine similarity between the query embedding and the embeddings of news articles obtained from RSS feeds.
To analyze the retrieval module’s performance at different similarity levels, the retrieved articles were grouped into similarity ranges: r1 (0.10–0.30), r2 (0.31–0.59), r3 (0.60–0.90) and r4 (0.91–1.00). For each query, the system returned the articles that met the similarity ranges.
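The retrieval and binning step can be sketched as follows. This is a minimal illustration assuming precomputed embeddings as NumPy vectors; the embedding model itself and the RSS ingestion are outside the sketch:

```python
import numpy as np

# Similarity ranges used to group retrieved articles, as in the evaluation.
RANGES = {"r1": (0.10, 0.30), "r2": (0.31, 0.59),
          "r3": (0.60, 0.90), "r4": (0.91, 1.00)}

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bin_by_similarity(query_emb, article_embs):
    """Group article indices into the similarity ranges r1-r4."""
    grouped = {name: [] for name in RANGES}
    for i, emb in enumerate(article_embs):
        sim = cosine_similarity(query_emb, emb)
        for name, (lo, hi) in RANGES.items():
            if lo <= sim <= hi:
                grouped[name].append(i)
                break
    return grouped
```

An article whose similarity falls below 0.10 is simply not retrieved under any range.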
The relevance of the retrieved articles was evaluated by two independent human evaluators using a rubric consisting of three criteria:
1.
The article contains the information specified in the initial description of the event;
2.
The article provides additional relevant information about the event;
3.
The article reflects objective information rather than a personal opinion.
Each criterion was assigned a value of 0 if it was not met and 1 if it was met, and an overall relevance score was calculated as the average of the three criteria. For each query, the similarity range yielding the highest average relevance score was identified, and this value was used to rank the similarity ranges that retrieved the most relevant articles. Table 6 shows the ranking of similarity ranges according to the relevance scores assigned by the two annotators. The agreement between annotators was measured using Kendall’s τ rank correlation coefficient [23]. The average τ across the evaluated queries was 0.967, indicating very strong agreement between annotators. As shown in Table 6, both annotators consistently identified cosine similarity range r3 (0.60–0.90) as yielding the most relevant retrieved articles for the majority of queries. Only one case (Query 15) was assigned a different optimal similarity range (r2) by both annotators. Finally, the inter-annotator agreement was measured using Cohen’s kappa [24], resulting in κ = 1.00, indicating perfect agreement [25] between annotators in identifying the cosine similarity range r3 as the one that recovered the most relevant news articles. Based on these results, the retrieval threshold used in the system was defined as a similarity range between 0.6 and 0.9.
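The rubric scoring and agreement measures described above can be reproduced with a few lines of code. The following sketch implements the per-article relevance score, a simple Kendall's τ (tau-a, assuming no ties), and Cohen's kappa directly, rather than calling a statistics library, to make the computations explicit:

```python
def relevance_score(c1, c2, c3):
    """Average of the three binary rubric criteria (each 0 or 1)."""
    return (c1 + c2 + c3) / 3

def kendall_tau(x, y):
    """Kendall's tau-a rank correlation between two rankings (no ties)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # concordant minus discordant
    return s / (n * (n - 1) / 2)

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over categorical labels."""
    n = len(labels_a)
    cats = set(labels_a) | set(labels_b)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in cats)                                   # chance
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)
```

For instance, if both annotators select r3 for 19 queries and r2 for the remaining one, the observed agreement is 1 and κ = 1.00, matching the reported perfect agreement.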
To further analyze the behavior of the similarity ranges, we conducted a systematic examination of false positives and false negatives observed across the retrieval results for the full set of queries. Table 7 presents representative examples (translated from Spanish) illustrating these situations. Although false positives and false negatives may occur in any similarity range, their frequency tends to vary with the degree of similarity. In particular, the lower similarity ranges (r1 and r2) are more prone to retrieving noisy or weakly related articles, which increases the likelihood of false positives. This behavior is illustrated using the same query across different ranges: the articles retrieved in r1 and r2 are judged as irrelevant by the annotators, whereas the article retrieved in r3 corresponds to the correct event and is considered relevant. In contrast, the table also includes examples of false negatives, in which relevant articles receive similarity scores below r3. These cases occur when relevant news articles describe the same event using different wording or narrative focus, resulting in lower cosine similarity despite their semantic relevance. Overall, the analysis performed over the complete set of queries, together with the inter-annotator agreement results, suggests that the selected threshold provides a reasonable balance between retrieving relevant contextual information and limiting irrelevant content.
After identifying r3 as the optimal similarity threshold, an additional analysis was conducted to examine how the retrieved news articles contribute to the generation process and to assess the role of the selection step in controlling the quality of the generated content. This analysis was performed using the same set of 20 queries as before.
For each query, the news articles retrieved within the selected threshold were analyzed to identify how different sources describe the same event. Given the length and scope of this analysis, Table 8 presents a representative example. As shown, each retrieved article provides complementary information about the same event, emphasizing different aspects without introducing factual inconsistencies.
To further evaluate the effect of the selection step, three retrieval schemes were considered:
1.
Using only the initial query data without incorporating retrieved news articles (initial data only).
2.
Using a subset of retrieved articles selected by the writer based on relevance and focus (human selection).
3.
Using all retrieved articles without filtering (top k).
This setup can be interpreted as a qualitative ablation of the retrieval selection mechanism. Table 9, Table 10 and Table 11 present the generated news articles following the three schemes, respectively.
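A minimal sketch of how the three schemes differ in assembling the generation context is shown below; the prompt layout and labels are illustrative assumptions, not the exact prompt used in the experiments:

```python
def build_context(initial_data, retrieved, scheme, human_selection=None):
    """Assemble the generation context under one of the three retrieval schemes."""
    if scheme == "initial_only":        # Scheme 1: no retrieved articles
        articles = []
    elif scheme == "human_selection":   # Scheme 2: writer-curated subset
        articles = human_selection or []
    elif scheme == "top_k":             # Scheme 3: all retrieved, unfiltered
        articles = retrieved
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    parts = [f"Initial data: {initial_data}"]
    parts += [f"Source article: {a}" for a in articles]
    return "\n".join(parts)
```

The only difference between the schemes is which retrieved articles, if any, are appended to the initial event data before prompting the model.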
The results show consistent patterns across the evaluated queries. When only the initial data is used (Scheme 1), the generated articles tend to include hallucinated or inaccurate information due to the lack of sufficient contextual grounding. When all retrieved articles are included without filtering (Scheme 3), the generated content often exhibits inconsistencies and occasional hallucinations, likely caused by the presence of multiple perspectives, loosely related information, or conflicting details across sources. In contrast, when the writer selects a subset of relevant articles (Scheme 2), the generated news articles maintain a clear focus, preserve coherence, and avoid factual inaccuracies. This selection process allows the model to benefit from enriched contextual information while preventing the introduction of noise or unrelated content.
These qualitative observations were consistent across all evaluation instances. In particular, the use of unfiltered retrieved content frequently led to conflicting narratives, while the absence of retrieved context resulted in under-specified and error-prone outputs. Manual selection, on the other hand, provided a balance between contextual richness and content control.
Based on these findings, we conclude that the selection of retrieved articles is a critical component of the proposed pipeline. The writer’s intervention ensures that the augmented data remains relevant, coherent, and aligned with the intended narrative of the news article. For this reason, the selection step is retained as a manual process in the current implementation.
Although a quantitative ablation study was not conducted, the consistent qualitative patterns observed across all queries indicate that manual curation significantly improves the reliability and coherence of the generated content. The automatic selection of retrieved articles and its quantitative evaluation remain important directions for future work, particularly given the sensitivity of the generation process to noisy or conflicting contextual inputs.

6.2. Experiments to Evaluate the News Generation Module

Experiments to evaluate the effectiveness of the news article generation module were conducted from two perspectives, quantitative and qualitative. Combining these two perspectives facilitates a more detailed comparative analysis of the performance of the three fine-tuned large language models.
The models were used to generate 20 distinct news articles that were not included in the training dataset used for fine-tuning. They were provided with the same input information (augmented data) previously selected from the newspapers mentioned in Section 4. These articles covered various topics and journalistic styles, such as sports, economics, politics, and entertainment. This variety allows for a thorough evaluation of the adaptability of the models.

6.2.1. Quantitative Evaluation

The evaluation of text generation tasks, such as machine translation and automatic summarization, presents complexities given the particularities of the language. Metrics such as BLEU and ROUGE allow comparing the similarity between the generated text and the reference text by looking for overlap between n-grams. However, it has been shown that such metrics are insufficient for capturing qualities evaluated by human judges [26].
In our task, we do not intend to directly compare model-generated text and human-written text, as the model aims to create a news article with appropriate wording based on the input data. Therefore, the model’s ability to create a news article with basic information that is not identical to that generated by a human but that does adhere to quality aspects should be evaluated. Given these expectations, traditional text generation metrics such as BLEU or ROUGE will not be used.
Some authors have proposed to evaluate the text quality based on several dimensions [27,28]. The most commonly used are coherence, consistency, fluency and relevance. Each of these dimensions establishes desirable aspects in the generated text. These dimensions are described below.
  • Coherence: Determines whether the sentences in a text are logically and organically related to each other.
  • Consistency: Refers to the uniformity in style, tone, and structure of the text.
  • Fluency: Allows a text to be easily readable, conveying ideas clearly.
  • Relevance: Verifies whether the generated text contains essential information considering the input text.
In [29], the authors propose a unified multidimensional evaluator for automatically generated text. The evaluator was implemented in a tool called UniEval, which we use to evaluate the quality of the generated text. UniEval assigns numerical scores between 0 and 1 to each dimension, allowing us to quantify the presence or absence of specific features. The tool is only capable of analyzing text in English, and due to the limited availability of standardized evaluation frameworks with comparable coverage for Spanish text, the evaluation pipeline involves translating the generated news articles from Spanish to English prior to scoring with UniEval. The text translation was performed using GPT-4o-mini.
While automatic translation introduces a potential dependency on the translation quality, several factors mitigate its impact. First, the translation process was applied uniformly across all models, ensuring a consistent evaluation setting. Second, the evaluation focuses on high-level dimensions such as coherence, consistency, and relevance, which are largely preserved under accurate translation. Finally, manual inspection of translated outputs confirmed that the semantic content and structural properties of the generated articles were maintained. More importantly, the qualitative evaluation described below was conducted by humans on the Spanish version of the generated news articles. When these articles were translated to include the English version, the human evaluators identified the same problems in terms of structure, accuracy, and coherence that were in the Spanish versions. This suggests that the translation process preserves the semantic content and structural properties of the generated articles. The full qualitative evaluation can be found in the complementary material repository (See Section 7). Therefore, although translation is used as part of the automatic evaluation pipeline, it does not affect the comparative conclusions of this study.
Additional experiments were conducted to explore the effect of using base (non-fine-tuned) versions of the open-weight models. In particular, GPT-2 and LLaMA3 were evaluated under the same experimental conditions using prompt-based generation on the same set of 20 test instances. Their performance was also measured using the UniEval framework. The mean scores obtained for each of the four dimensions in the 20 news articles generated by the models are shown in Table 12.
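Aggregating the per-article UniEval outputs into per-dimension means, as reported in Table 12, can be sketched as follows, assuming each generated article yields a dict of dimension scores in [0, 1]:

```python
from statistics import mean

DIMENSIONS = ("coherence", "consistency", "fluency", "relevance")

def summarize_scores(per_article_scores):
    """Mean score per dimension plus an overall mean.

    `per_article_scores` is a list of dicts, one per generated article,
    mapping each UniEval dimension to a score in [0, 1].
    """
    means = {d: mean(s[d] for s in per_article_scores) for d in DIMENSIONS}
    means["overall"] = mean(means[d] for d in DIMENSIONS)
    return means
```

The "overall" value here is the unweighted mean of the four dimension means, which is one common way to summarize UniEval output; the exact aggregation used for the table is an assumption of this sketch.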
The first observation from the table is the different effect that fine-tuning has on the open-weight models. For GPT-2, the difference in overall performance between the base and fine-tuned versions is small, and the base model in fact achieved a better overall result, even though the fine-tuned version was trained on a substantial number of samples (15,952 news articles). For LLaMA3, the improvement from fine-tuning is evident: using only 100 training samples, it improved across all evaluated aspects and achieved a 20% improvement in overall performance. These results suggest that the impact of fine-tuning depends on model capacity rather than on the amount of training data alone.
When comparing all the models, the performance of GPT-2 was poor on almost all metrics, for both the base and fine-tuned versions: the generated text is not coherent, consistent, or relevant, although it is fluent. In contrast, GPT-3.5 and the fine-tuned version of LLaMA3 show similar results, with good overall performance except for fluency. Gemini was more balanced, showing good and consistent results on all metrics. Finally, GPT-4o-mini obtained the best results in all dimensions except consistency, illustrating the significant evolution of the model family since GPT-2.

6.2.2. Qualitative Evaluation

In addition to the quantitative evaluation, it is important to assess whether the journalistic articles generated conform to the journalistic writing style and accurately reflect the augmented data. Due to the time-intensive nature of qualitative analysis, this evaluation was applied only to the fine-tuned models, which represent the primary configuration of interest in this study. The qualitative evaluation focuses on aspects such as structure, factual consistency, and adherence to journalistic conventions, which are more relevant in models adapted to the task. Although the base version of GPT-2 shows slightly higher quantitative scores, the differences are relatively small, and automatic metrics may not fully capture qualitative aspects of the generated text. Therefore, the qualitative evaluation is centered on the fine-tuned models, which are more representative of the intended use of the proposed system.
Since the qualitative evaluation was conducted by native Spanish speakers, the generated text was evaluated directly in Spanish without running it through the translation pipeline. An English translation was produced afterward so that the results could be shown to non-native Spanish speakers, and the evaluators noted the same inconsistencies as in the original versions.
A rubric was developed to assess these qualitative aspects. News articles generated by the LLMs must meet the following criteria:
1.
News article structure:
The generated content must conform to a news article format, presenting key information in the following order: date, place and event of interest.
2.
Accuracy of the text:
The generated text must not contain inaccurate or hallucinated information.
3.
Generation of additional information:
The generated text should include additional information relevant to the news context, which helps to emphasize the reported event. This information should not be a mere repetition or paraphrase of the provided data.
4.
Coherence in writing: The writing must make sense and flow logically. Ideas should be connected through logical transitions and remain focused on the main topic.
5.
Full text: The generated text must be complete. The models should not produce truncated or incomplete text.
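The bookkeeping behind this rubric can be applied programmatically once each article has been judged. The following sketch, with hypothetical criterion names, records the five binary judgments per article and counts how many articles satisfy each criterion:

```python
# Hypothetical short names for the five rubric criteria listed above.
CRITERIA = ("structure", "accuracy", "additional_info", "coherence", "full_text")

def score_article(checks):
    """Binary rubric result for one article: dict criterion -> bool."""
    return {c: bool(checks.get(c, False)) for c in CRITERIA}

def criterion_pass_counts(evaluations):
    """How many of the evaluated articles satisfy each criterion."""
    return {c: sum(e[c] for e in evaluations) for c in CRITERIA}
```

Applied to the 20 test articles, these counts correspond to the per-criterion results summarized in Figure 7.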
The following is a specific example to illustrate how the models handled content generation based on the augmented data:
  • Augmented data provided to LLMs:
    Paris. The heirs of Diego Maradona will file a lawsuit in an effort to stop the auction of the trophy that the Argentine football legend received after the 1986 World Cup, won by Argentina, the family’s lawyer said yesterday to The Associated Press. Date: 15 May 2024, Place: Paris.
  • News article generated by GPT-2:
    On 15 May 2024. The heirs of Diego Maradona and his wife, the Argentine star2, filed an appeal before the International Court of Justice (ICJ)2 to revoke the rights over the trophy for selling it as an official trophy with no commercial value2; but it did not4. Date: 15 May 2024.1. The court rejected this request2 because it cannot be denied that the footballer is considered one of its most important and influential idols worldwide. As it occurs to him, the court confirmed that the trophy was sold2 with public or private money while the club paid 5 million dollars each4. Date: 15 May 2024.1 The appellate judge, the family’s lawyer, pointed out that the ruling was “a mistake” as the trophy is registered under international legal protection norms2. Date: 15 May 2024.1 The court rejected this appeal because it cannot be denied that the trophy5
  • News article generated by Gemini:
    Date: 15 May 2024 Location: Paris.1
  • News article generated by GPT-3.5:
    15 May 2024, in Paris. Paris. The heirs of Diego Maradona will file a lawsuit to stop the auction of the trophy that the myth of Argentine football from the 1986 World Cup, won by Argentina, revealed yesterday the lawyer of the family.
  • News article generated by LLaMA3:
    On 15 May 2024, in Paris. Paris. The heirs of Diego Armando Maradona will file a lawsuit to stop the auction of the trophy that the Argentine football idol received in 1986.
  • News article generated by GPT-4o-mini:
    Paris, 15 May 2024—The heirs of the Argentine football legend, Diego Maradona, have announced their intention to file a lawsuit aimed at stopping the auction of the trophy obtained by the Argentine star following the victory of the Argentine national team in the 1986 FIFA World Cup. This information was communicated by the lawyer for the Maradona family yesterday to the news agency The Associated Press. The auction of the trophy, which symbolizes one of the greatest achievements in the history of Argentine football, has generated significant interest in the market. However, the heirs of the immortal footballer argue that the object should remain in the hands of the Maradona family and not be auctioned to the highest bidder. This announcement comes in a context where debates about the proper management of Maradona’s legacy, who passed away in November 2020, still persist. The family seeks to protect the honor and memory of a figure that transcends the sports field and has become an icon of Argentine culture. The lawsuit is expected to be filed in the coming days, which could open a legal dispute over the ownership of the trophy and its future. As the situation unfolds, both football fans and followers of Maradona’s career will be attentive to developments and the possible resolution of this crucial matter for the history of the sport.
In the example above, generated text that does not meet a specific rubric criterion is highlighted in red and marked with a superscript number indicating the corresponding criterion. Additionally, generated text that complies with criteria 2 and 3, which require the model to ensure accuracy and to produce relevant additional information, is highlighted in blue.
As can be seen, the GPT-2 model does not meet most of the established criteria: it does not respect the structure of a newspaper article, the generated text is not accurate, as it often hallucinates information, the wording is not coherent, and the text is truncated. The Gemini model could not generate the news article and was limited to indicating the date and location. On the other hand, the LLaMA3 and GPT-3.5 models meet the remaining criteria but cannot generate additional information. Finally, the GPT-4o-mini model, in addition to meeting the established criteria, generated additional information about the reported event. This additional information is not a mere paraphrase but emphasizes the relevant data; using its prior knowledge, the model was even able to include the date of death of the person mentioned in the news item (bold text).
Figure 7 shows the overall performance of the models on the 20 generated news articles. As can be observed, the GPT-4o-mini model achieved the best results in meeting the criteria of structure, coherence, additional information, and complete text across the 20 generated news articles. However, it generated inaccurate information in four of them. The LLaMA3 and GPT-3.5 models respected the news structure, generating accurate and coherent text, but could generate additional information on very few occasions. Furthermore, in half of the generated news articles, the LLaMA3 model produced incomplete or truncated text. The Gemini model was able to generate complete and coherent texts for all news articles, although it had difficulty being accurate and respecting the structure. As with LLaMA3 and GPT-3.5, the number of news articles for which Gemini could generate additional information was minimal. Finally, the GPT-2 model, as observed in the previous example, had the worst performance: it adhered to the quality criteria on very few occasions and hallucinated information in all cases.
Figure 8, Figure 9, Figure 10 and Figure 11 show the performance obtained by the models in each of the four sections considered in the experiments.

7. Discussion

The first module developed in our proposal enables the retrieval of news similar to the initial data provided by the writer. The threshold set to a range of 0.60–0.90 cosine similarity achieved the best results for retrieving relevant articles. Another relevant result of the first module was establishing the importance of human-guided selection of complementary news articles to avoid noisy and contradictory information. Thanks to the augmented data, the prompt provided to the language models improved the quality of the generated news, reinforcing the importance of combining a retrieval mechanism with human-guided selection strategies. We also observed that the selected embedding model was very effective at representing the context of the news, so information retrieval performed well. These results demonstrate that both the selection of similarity thresholds and the controlled use of retrieved information are critical factors in balancing contextual relevance and generation reliability within the proposed pipeline.
In the second module of our proposal, we observed that despite the success of LLMs in generating text, these models continue to face several challenges in the specific case of news articles. Adhering to a specific structure and writing accurately and coherently are important criteria for this task, and most models still show deficiencies in meeting them. The GPT-2 model, the simplest of those used, could not adhere to the expected structure of a news item and, more importantly, showed a high tendency to hallucinate information and to generate incoherent text. An interesting finding with this model was that the base version without fine-tuning performed slightly better on the quantitative evaluation than the fine-tuned version. On the other hand, more complex models such as GPT-3.5, LLaMA3, and Gemini show better qualitative and quantitative results.
However, even these more advanced models were limited to generating text almost identical to the input, and were thus unable to generate information that would emphasize or complement the reported event. The results of the GPT-4o-mini model stand out from the rest across all measurement criteria. According to comparative studies of previous versions, the model has evolved to the point of passing knowledge tests in areas such as medicine and physics, with results similar to those obtained by humans [30,31,32]. In the particular case of journalistic text generation, version 4 has been reported to adhere more closely to news structure and human writing, with less tendency toward informational bias [33]. This is consistent with our experiments, in which GPT-4o-mini obtained very high scores in both the qualitative and quantitative evaluations. However, it is important to point out that all models, including GPT-4o-mini, present cases of inaccurate information, even though the prompt explicitly instructed them to avoid this. Specifically, GPT-4o-mini hallucinated information about locations, dates, and people involved in the news. This problem is especially relevant for news, since untrue information cannot be reported, and it indicates that much work remains before models can strike a balance between adding information that improves the writing and avoiding hallucinated information.
When comparing open-weight models to proprietary models, we see that GPT-2 had the lowest performance, while LLaMA3 performed comparably to, and in some tests even outperformed, Gemini. These results show that it is feasible to fine-tune publicly available models to generate text, particularly in complex domains such as news generation, while always accounting for the limitations described above before the task can be fully automated.
The implementation of the modules described in this work and the full reported experimental results are available in the following repository: https://tinyurl.com/bdcanw8m, accessed on 5 February 2026.

8. Conclusions

Digital media users demand constant and immediate access to information, which has led to the consideration of automated news writing. News writing must adhere to strict quality criteria due to the ethical commitment to report events accurately and without bias. While LLMs have achieved outstanding results in text generation, meeting the requirements of news writing remains a significant challenge for automation. This paper explored the use of LLMs for the automated generation of news articles. A modular approach was proposed, combining information retrieval from news sources with text generation using different Large Language Models. Experiments were conducted using both open-weight and proprietary models, and their performance was evaluated using automatic metrics and human-based qualitative analysis. According to the results, the selected language models benefited from the increase in data provided by the retrieval module, as richer contextual information enabled improvements in news writing. Additionally, the empirical evaluation of similarity thresholds using human annotation, together with the analysis of retrieval strategies, provides insight into how contextual information affects the quality and reliability of generated news articles.
With the exception of GPT-2, the tested models were able to adhere to most of the established evaluation criteria, with the performance of GPT-4o-mini standing out, particularly in the generation of additional information. Beyond demonstrating feasibility, the experimental results highlight clear differences in behavior between open-weight and proprietary models. While proprietary models exhibit strong baseline performance without task-specific fine-tuning, open-weight models provide greater flexibility and adaptability at the cost of additional engineering effort and computational resources. This distinction is particularly relevant for real-world deployments, where access constraints, cost, and customization requirements play a significant role. However, all models exhibited cases of hallucinated information. The findings suggest that current LLM-based systems are best suited for semi-automated news generation workflows, where they can assist journalists by accelerating content production rather than entirely replacing editorial oversight, as was the case in selecting relevant news articles for data augmentation. Future work will focus on improving the reliability and factual grounding of generated content, reducing hallucinations, and enhancing adherence to journalistic conventions. In particular, reinforcement learning strategies could be explored to explicitly reward factual consistency and penalize the generation of unsupported or fabricated information. Additionally, strategies for fully automating the selection of relevant news articles based on writers’ narrative intentions can be explored.

Author Contributions

Conceptualization, O.J.G. and C.V.G.M.; methodology, O.J.G. and C.V.G.M.; software, B.H.M., C.-M.Z.-M. and M.-A.B.-T.; validation, O.J.G., C.V.G.M., B.H.M. and H.C.; formal analysis, B.H.M.; investigation, O.J.G. and B.H.M.; resources, B.H.M., C.-M.Z.-M. and M.-A.B.-T.; data curation, B.H.M., C.-M.Z.-M. and M.-A.B.-T.; writing—original draft preparation, O.J.G., C.V.G.M.; writing—review and editing, O.J.G., C.V.G.M., B.H.M. and H.C.; visualization, B.H.M.; supervision, O.J.G., C.V.G.M. and H.C.; project administration, O.J.G.; funding acquisition, O.J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Mexican Government through Secretaría de Ciencia, Humanidades, Tecnología e Innovación (SECIHTI), SNII, and the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico, under Grant 20254348, EDI and SIBE-COFAA.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it involved the annotation of publicly available news articles and did not include sensitive personal data or human subject intervention.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. Participants were informed about the purpose of the research and participated voluntarily.

Data Availability Statement

Data and trained models used in this work can be found in the following repository: https://tinyurl.com/bdcanw8m, accessed on 5 February 2026.

Acknowledgments

The authors would like to thank Instituto Politécnico Nacional, Escuela Superior de Cómputo, and Centro de Investigación en Computación.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pavlik, J.V. Journalism and New Media; Columbia University Press: New York, NY, USA, 2001. [Google Scholar]
  2. Thurman, N.; Dörr, K.; Kunert, J. When Reporters Get Hands-on with Robo-Writing. Digit. Journal. 2017, 5, 1240–1259. [Google Scholar] [CrossRef]
  3. Graefe, A. Guide to Automated Journalism; Technical Report; Columbia Journalism School: New York, NY, USA, 2016. [Google Scholar]
  4. Bandi, A.; Adapa, P.V.S.R.; Kuchi, Y.E.V.P.K. The power of generative ai: A review of requirements, models, input–output formats, evaluation metrics, and challenges. Future Internet 2023, 15, 260. [Google Scholar] [CrossRef]
  5. Kotenidis, E.; Veglis, A. Algorithmic journalism—Current applications and future perspectives. Journal. Media 2021, 2, 244–257. [Google Scholar] [CrossRef]
  6. Van Dalen, A. The algorithms behind the headlines: How machine-written news redefines the core skills of human journalists. Journal. Pract. 2012, 6, 648–658. [Google Scholar] [CrossRef]
  7. Robots Wrote This. 2006. Available online: https://www.wired.com/2006/08/robots-wrote-this/ (accessed on 9 September 2024).
  8. Feuerriegel, S.; Hartmann, J.; Janiesch, C.; Zschech, P. Generative AI. Bus. Inf. Syst. Eng. 2024, 66, 111–126. [Google Scholar] [CrossRef]
  9. Sasaki, T.; Kuge, T.; Shoji, Y.; Yamamoto, T.; Ohshima, H. Generating News Headline Containing Specific Person Name. In Proceedings of the Database Systems for Advanced Applications. DASFAA 2024 International Workshops; Morishima, A., Li, G., Ishikawa, Y., Amer-Yahia, S., Jagadish, H.V., Lu, K., Eds.; Springer: Singapore, 2025; pp. 220–227. [Google Scholar]
  10. Francis, M.; Rinaldi, M.; Gili, J.; De Cosmo, L.; Iannaccone, S.; Nissim, M.; Patti, V. GATTINA-GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-It 2024); CEUR Workshop Proceedings: Pisa, Italy, 2024. [Google Scholar]
  11. Quinonez, C.; Meij, E. A New Era of AI-Assisted Journalism at Bloomberg. AI Mag. 2024, 45, 187–199. [Google Scholar] [CrossRef]
  12. Breazu, P.; Katsos, N. ChatGPT-4 as a Journalist: Whose Perspectives is it Reproducing? Discourse Soc. 2024, 35, 687–707. [Google Scholar] [CrossRef]
  13. Dipto Barman, Z.G.; Conlan, O. The Dark Side of Language Models: Exploring the Potential of LLMs in Multimedia Disinformation Generation and Dissemination. Mach. Learn. Appl. 2024, 16, 100545. [Google Scholar] [CrossRef]
  14. Fang, X.; Che, S.; Mao, M.; Zhang, H.; Zhao, M.; Zhao, X. Bias of AI-generated content: An examination of news produced by large language models. Sci. Rep. 2024, 14, 5224. [Google Scholar] [CrossRef]
  15. García-Ferrero, I.; Altuna, B. NoticIA: A Clickbait Article Summarization Dataset in Spanish. Proces. Leng. Nat. 2024, 73, 191–207. [Google Scholar]
  16. Matías-Mendoza, G.A.; Ledeneva, Y.; García-Hernández, R.A. Spanish Automatic Text Summarization: A Survey. Comput. Sist. 2024, 28, 1361–1376. [Google Scholar] [CrossRef]
  17. Sepúlveda-Torres, R.; Bonet-Jover, A.; Saquete, E. Detecting Misleading Headlines Through the Automatic Recognition of Contradiction in Spanish. IEEE Access 2023, 11, 72007–72026. [Google Scholar] [CrossRef]
  18. Jimenez, D.; Gambino, O.J.; Calvo, H. Pseudo-labeling improves news identification and categorization with few annotated data. Comput. Sist. 2022, 26, 183–193. [Google Scholar] [CrossRef]
  19. Sánchez, J.C.A.; Lorenzo, Y.S.; Pereyra-Zamora, P. COVID-19 vaccine strategy, news and political parallelism in Mexico. A comparative analysis of La Jornada and Reforma. Journalism 2025. [Google Scholar] [CrossRef]
  20. Marañón, F.; Tiscareño-García, E. Media Coverage Analysis of Migrant Caravans in Mexican Digital Newspapers. In Media, Migrants, and U.S. Border(s); Rocha de Luna, R., Bañuelos Capistrán, J., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 183–200. [Google Scholar] [CrossRef]
  21. Vajjala, S.; Majumder, B.; Gupta, A.; Surana, H. Practical Natural Language Processing; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2020. [Google Scholar]
  22. Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; Choi, Y. The Curious Case of Neural Text Degeneration. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  23. Kendall, M.G. A new measure of rank correlation. Biometrika 1938, 30, 81–93. [Google Scholar] [CrossRef]
  24. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
  25. Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]
  26. Sai, A.B.; Mohankumar, A.K.; Khapra, M.M. A Survey of Evaluation Metrics Used for NLG Systems. ACM Comput. Surv. 2022, 55, 26. [Google Scholar] [CrossRef]
  27. Hashemi, H.; Eisner, J.; Rosset, C.; Van Durme, B.; Kedzie, C. LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. In Proceedings of the Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 13806–13834. [Google Scholar]
  28. Berber Sardinha, T. AI-generated vs human-authored texts: A multidimensional comparison. Appl. Corpus Linguist. 2024, 4, 100083. [Google Scholar] [CrossRef]
  29. Zhong, M.; Tang, T.; Zhang, Z.; Liu, P.; Liu, T.; Wang, Y.; Wan, X. Towards a Unified Multi-Dimensional Evaluator for Text Generation. arXiv 2022, arXiv:2210.07197. [Google Scholar] [CrossRef]
  30. Rosoł, M.; Gąsior, J.S.; Łaba, J.; Korzeniewski, K.; Młyńczak, M. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci. Rep. 2023, 13, 20512. [Google Scholar] [CrossRef]
  31. Lin, J.C.; Younessi, D.N.; Kurapati, S.S.; Tang, O.Y.; Scott, I.U. Comparison of GPT-3.5, GPT-4, and human user performance on a practice ophthalmology written examination. Eye 2023, 37, 3694–3695. [Google Scholar] [CrossRef] [PubMed]
  32. Yeadon, W.; Peach, A.; Testrow, C. A comparison of human, GPT-3.5, and GPT-4 performance in a university-level coding course. Sci. Rep. 2024, 14, 23285. [Google Scholar] [CrossRef] [PubMed]
  33. Castillo-Campos, M.; Varona-Aramburu, D.; Becerra-Alonso, D. Artificial Intelligence Tools and Bias in Journalism-related Content Generation: Comparison Between Chat GPT-3.5, GPT-4 and Bing. Tripodos Fac. Comun. I Relac. Int. Blanquerna-URL 2024, 55, 99–115. [Google Scholar] [CrossRef]
Figure 1. Diagram of the proposed generative model.
Figure 2. Example of a news article from La Jornada’s RSS feed.
Figure 3. Example of the data format used for fine-tuning GPT-2.
Figure 4. Example of the data format used for fine-tuning GPT-3.5.
Figure 5. Example of the data format used for fine-tuning Gemini.
Figure 6. Example of the data format used for fine-tuning LLaMA 3.
Figure 7. Overall performance of the models in the qualitative evaluation.
Figure 8. Qualitative evaluation of the generated news from the economy section.
Figure 9. Qualitative evaluation of the generated news from the entertainment section.
Figure 10. Qualitative evaluation of the generated news from the politics section.
Figure 11. Qualitative evaluation of the generated news from the sports section.
Table 1. Fine-Tuning Parameters for GPT-2.
Base model: dquisi/storyspanishgpt2v2
Epochs: 1
Steps: 10,000
Loss after fine-tuning: 2.96
Trained tokens: 109,384
Batch size: 4
LR multiplier: 0.0002
Table 2. Fine-Tuning Parameters for GPT-3.5.
Base model: gpt-3.5-turbo
Epochs: 4
Steps: 400
Loss after fine-tuning: 0.39
Trained tokens: 109,384
Batch size: 1
LR multiplier: 2
Table 3. Fine-Tuning Parameters for Gemini.
Base model: gemini-1.0-pro-002
Epochs: 4
Steps: 100
Loss after fine-tuning: 1.08
Trained tokens: 23,667
Batch size: 1
LR multiplier: 1
Table 4. Fine-Tuning Parameters for LLaMA 3.
Base model: Meta-Llama-3.1-8B
Epochs: 5
Steps: 60
Loss after fine-tuning: 0.54
Trained tokens: 81,076
Batch size: 1
LR multiplier: 2
Table 5. Configured parameters for text generation using GPT-4o-mini.
Model: gpt-4o-mini
Temperature: 1
Max tokens: 500
Top-P: 1
Frequency penalty: 0.0
Presence penalty: 0.0
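For illustration, the settings in Table 5 can be collected into a single request payload. The key names below follow OpenAI's Chat Completions API; the dictionary is a sketch of how such a configuration could be passed to the model, and the commented client call is hypothetical (the authors' actual prompt and client code are not reproduced here).

```python
# Generation settings from Table 5, expressed as a request payload.
# Key names follow OpenAI's Chat Completions API.
generation_params = {
    "model": "gpt-4o-mini",
    "temperature": 1,        # sampling temperature (API default)
    "max_tokens": 500,       # upper bound on the generated article length
    "top_p": 1,              # nucleus sampling effectively disabled
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}

# Hypothetical usage (requires the openai package and an API key):
# response = client.chat.completions.create(messages=[...], **generation_params)
```

Keeping the parameters in one dictionary makes it easy to log the exact configuration alongside each generated article, which matters when comparing models under identical decoding settings.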
Table 6. Threshold rankings assigned by the two annotators and Kendall’s τ agreement for each news query.
Query | Ranking Annotator 1 | Ranking Annotator 2 | Kendall τ
1 | [r3, r2, r1, r4] | [r3, r2, r1, r4] | 1.00
2 | [r3, r2, r1, r4] | [r3, r2, r1, r4] | 1.00
3 | [r3, r2, r1, r4] | [r3, r1, r2, r4] | 0.67
4 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
5 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
6 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
7 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
8 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
9 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
10 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
11 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
12 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
13 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
14 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
15 | [r2, r1, r3, r4] | [r2, r1, r3, r4] | 1.00
16 | [r3, r2, r1, r4] | [r3, r1, r2, r4] | 0.67
17 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
18 | [r3, r1, r2, r4] | [r3, r1, r2, r4] | 1.00
19 | [r3, r2, r1, r4] | [r3, r2, r1, r4] | 1.00
20 | [r3, r2, r1, r4] | [r3, r2, r1, r4] | 1.00
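The agreement values in Table 6 can be reproduced with a small pure-Python implementation of Kendall's τ for rankings without ties: count concordant and discordant item pairs and normalize by the total number of pairs. The example rankings are taken directly from the table; the function name is illustrative.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Pairs ordered the same way in both rankings are concordant.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Query 3 from Table 6: one adjacent swap (r1/r2) between annotators.
tau = kendall_tau(["r3", "r2", "r1", "r4"], ["r3", "r1", "r2", "r4"])
print(round(tau, 2))  # 0.67
```

With four ranked items there are six pairs; a single adjacent swap flips one pair, giving τ = (5 − 1)/6 ≈ 0.67, which matches queries 3 and 16.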
Table 7. Examples illustrating false positives and false negatives across the evaluated thresholds.
Evaluated Threshold | Query | Retrieved Article and Description | Classification
r1 | The Portuguese writer Lobo Antunes has died | A woman dies in a fire in Coyoacán. The article describes the death of a person unrelated to the person mentioned in the query. | False positive
r2 | The Portuguese writer Lobo Antunes has died | Elena Poniatowska: Farewell to Pedro Friedeberg. The article discusses a farewell given to someone other than the person mentioned in the query, although also related to the cultural field. | False positive
r3 | The Portuguese writer Lobo Antunes has died | Farewell to António Lobo Antunes, the author of the oppressed. The article is related to the query and provides additional information, such as the person’s full name, and mentions one of his most renowned works. | Relevant
r3 | Mexico at the World Baseball Classic | Mexico makes a triumphant debut in the World Baseball Classic, defeating Great Britain 8-2. This article obtained a cosine similarity of 0.57, slightly below the established threshold r3, so the system did not retrieve it, although the annotators considered it relevant. | False negative
r3 | USMCA Review Mexico, United States and Canada | Mexico and the U.S. will begin official USMCA talks on March 16. This article is related to the topic but omits the keyword Canada, so its similarity value was lower than r3, although the annotators considered it relevant. | False negative
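The false negatives in Table 7 arise from a hard similarity cutoff: an article whose embedding lands just below the threshold is discarded even when annotators judge it relevant. A minimal sketch of this thresholded retrieval decision, using toy 2-D vectors and a hypothetical threshold value (the actual r3 value and embedding model are not reproduced here):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def is_retrieved(query_vec, article_vec, threshold):
    """An article is returned only when its similarity meets the threshold,
    so relevant articles just below the cutoff become false negatives."""
    return cosine_similarity(query_vec, article_vec) >= threshold

# Toy vectors: similarity is 0.8, just below a hypothetical threshold of 0.9,
# so the article is discarded -- the same pattern as the 0.57 case in Table 7.
query, article = [1.0, 0.0], [0.8, 0.6]
print(is_retrieved(query, article, threshold=0.9))  # False: a false negative
```

This is why threshold selection (the ranking exercise of Table 6) matters: a cutoff tight enough to exclude the r1/r2 false positives inevitably risks clipping borderline relevant articles.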
Table 8. Retrieved news related to the query “Iran Mexico World Cup matches”.
Newspaper | Retrieved News | Focus
Expansión | Iran wants to play its World Cup matches in Mexico for fear of reprisals in the US. “Given that Trump has stated that he cannot guarantee the safety of the Iranian team, we will definitely not go to the US,” says the president of the Iranian Football Federation. | The Iranian national team will not play in the US for fear of reprisals.
Reforma | Mexico awaits FIFA decision on Iran matches. Sheinbaum said that Mexico is awaiting FIFA’s decision on hosting Iran’s matches in the World Cup, which are currently scheduled to be played in the US. | Mexico hopes to host Iran’s matches in the World Cup.
La Jornada | Iran is negotiating with FIFA to play the 2026 World Cup in Mexico. Kuala Lumpur—Despite recent warnings from US President Donald Trump about the risks of war in the Middle East, Iran has not withdrawn from participating in the upcoming World Cup. Through the Iranian embassy in Mexico, the president of the Iranian Football Federation (FFIRI), Mehdi Taj, announced that negotiations are underway with FIFA to have the national team’s matches played in Mexico. | Iran is negotiating with FIFA to play the 2026 World Cup in Mexico.
Table 9. Analysis of generated news using scheme 1.
Generated news: Anticipation for the 2026 World Cup: Mexico and Iran Prepare for a High-Voltage Encounter. Iran and Mexico Prepare for an Imminent World Cup Encounter. 17 March 2026, Mexico City, Mexico—In the lead-up to the 2026 FIFA World Cup, to be held in the United States, Mexico and Iran are preparing for a crucial group stage encounter. While the exact date of the match has not yet been specified by FIFA, great interest from fans of both countries is anticipated. This match promises to be one of the highlights of the competition, given that Mexico and Iran have a rich football history and a passionate following. Expectations surrounding the event are rising, and the teams are intensifying their training to ensure peak performance. Both national teams have begun making tactical and strategic adjustments in their training sessions to arrive in top form on match day. The clash is expected to have an electrifying atmosphere, with thousands of local and visiting fans in attendance. This match reaffirms both countries’ commitment to excelling in the tournament and advancing as far as possible in the competition. The excitement surrounding this encounter is palpable, and fans of both Mexico and Iran are eagerly awaiting the final date and what promises to be a spectacular sporting event.
Inaccurate information: In the context of the 2026 FIFA World Cup, which will be held in the United States (the 2026 World Cup will be held in three venues: Mexico, Canada, and the US).
Hallucinations: 1. Iran and Mexico are preparing for a meeting in the World Cup group stage of the tournament (Iran will not play against Mexico in the group stage of the 2026 World Cup; Iran is in Group G and Mexico is in Group A). 2. Mexico and Iran have a rich football history (Mexico and Iran do not share a significant history with each other; there are no historic matches, finals, or relevant rivalries).
Table 10. Analysis of generated news using scheme 2.
Generated news: Iran is negotiating to play the World Cup in Mexico despite international tensions. Kuala Lumpur, 17 March 2026—Despite recent warnings issued by US President Donald Trump regarding the risks associated with the war in the Middle East, Iran has confirmed its intention to participate in the upcoming World Cup. Mehdi Taj, president of the Football Federation of Iran (FFIRI), announced through the Iranian Embassy in Mexico that negotiations are underway with FIFA to host the national team’s matches in Mexico. This announcement comes amid heightened international tensions, with security concerns escalating due to the situation in the Middle East. However, the FFIRI’s decision reflects a strong commitment to sport and participation in one of the world’s most important sporting events. The Federation is actively working to ensure its team can compete under all necessary conditions, prioritizing the safety and well-being of players and fans. The decision to play in Mexico, a country with a strong football tradition and the experience of having hosted two World Cups, could open significant opportunities for both Iranians and football fans in the region. FIFA has not yet issued an official statement regarding this request, but Iran’s interest in participating in the tournament remains strong, demonstrating the resilience of sport in times of political turmoil.
Inaccurate information: None.
Hallucinations: None.
Table 11. Analysis of generated news using scheme 3.
Generated news: Iran seeks alternatives to play in the 2026 World Cup amid warnings from the US. Kuala Lumpur, 17 March 2026—Despite recent statements by US President Donald Trump indicating that he cannot guarantee the safety of the Iranian national team during their stay in the country, the Football Federation of Iran maintains its intention to participate in the upcoming FIFA World Cup. In response to Trump’s expressed concerns, the president of the Football Federation of Iran (FFIRI), Mehdi Taj, announced that his country will not travel to the United States for the scheduled matches in the tournament. Through the Iranian embassy in Mexico, Taj reported that negotiations are underway between the FFIRI and FIFA with the aim of hosting the Iranian national team’s matches in Mexico. Mexico City Mayor Claudia Sheinbaum stated that Mexico is awaiting a decision from FIFA on the matter. This development highlights the complex situation Iran faces in international football, in a context marked by political tensions and concerns about the safety of its athletes. The situation continues to evolve as the start of the tournament approaches, generating anticipation about the future of the Iranian team in the World Cup.
Inaccurate information: Iran seeks alternative venues for the 2026 World Cup (the president of the Football Federation of Iran (FFIRI), Mehdi Taj, is in negotiations with FIFA to have the national team’s matches played specifically in Mexico).
Hallucinations: 1. In response to concerns expressed by Trump (Trump is not worried about the Iranian team playing in the 2026 World Cup in his country; Trump warned that he cannot guarantee the safety of the Iranian national team during their stay in the US). 2. Mexico City Mayor Claudia Sheinbaum (Claudia Sheinbaum was mayor of Mexico City from 2018 to 2023; she is currently the president of Mexico).
Table 12. Evaluation metrics of LLMs used.
Model | Coherence | Consistency | Fluency | Relevance | Overall
GPT-2 Base | 0.485 | 0.485 | 0.504 | 0.484 | 0.492
GPT-2 | 0.261 | 0.500 | 0.738 | 0.291 | 0.448
Gemini | 0.703 | 0.789 | 0.777 | 0.716 | 0.746
GPT-3.5 | 0.790 | 0.863 | 0.496 | 0.811 | 0.740
LLaMA3 Base | 0.511 | 0.510 | 0.517 | 0.511 | 0.512
LLaMA3 | 0.771 | 0.876 | 0.436 | 0.780 | 0.716
GPT-4o-mini | 0.972 | 0.799 | 0.956 | 0.973 | 0.925
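The Overall column in Table 12 is consistent with an unweighted mean of the four dimension scores. This reconstruction is inferred from the reported numbers, not a statement of the authors' exact aggregation procedure:

```python
def overall_score(coherence, consistency, fluency, relevance):
    """Unweighted mean of the four evaluation dimensions,
    rounded to three decimals as reported in Table 12."""
    return round((coherence + consistency + fluency + relevance) / 4, 3)

# GPT-4o-mini row: (0.972 + 0.799 + 0.956 + 0.973) / 4 = 0.925
print(overall_score(0.972, 0.799, 0.956, 0.973))  # 0.925
```

The same check reproduces the other rows, e.g., GPT-3.5: (0.790 + 0.863 + 0.496 + 0.811) / 4 = 0.740.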
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
