A Methodological Proposal to Evaluate Journalism Texts Created for Depopulated Areas Using AI

Abstract
The public service media Radio Televisión Española (RTVE) conducted a proof-of-concept study to automatically generate reports on the results of the local elections of 28 May 2023 in Spanish communities with fewer than 1000 inhabitants. This study describes the creation, testing and application of the methodological tool used to evaluate the quality of the reports generated using artificial intelligence in order to optimize the algorithm. The application of the proposed datasheet provided a systematic analysis, and the iterative use of the tool made it possible to gradually improve the results produced by the system until a suitable threshold was reached for publication. The study also showed that, despite the ability of AI systems to automatically generate a large volume of information, both human labour and the reliability of the data that feed the system are essential to ensure journalistic quality.


Introduction
The arrival of artificial intelligence (AI) systems in newsrooms has made it possible, amongst other things, to generate automated news content (Tejedor 2023). Although the algorithms were initially used to produce simple pieces based on structured data related to sports, finance or the weather, over time, initiatives have emerged that have enabled the creation of more elaborate texts. In addition to relieving journalists of less creative work, these have provided the means to fill information gaps that the limited resources of media companies would otherwise leave open, as noted by Aramburu Moncada et al. (2023). One such case concerns election information in sparsely populated towns.
With this aim in mind, in 2023, the Editorial Board for Technology, Innovation and Systems of Radio Televisión Española (RTVE) conducted a proof-of-concept study to produce reports on the results of the local elections held in Spain on 28 May 2023. A multidisciplinary team made up of engineers, journalists and information technology experts from public institutions and private enterprises worked to create a system based on AI that could automatically generate reports related to the election results in 4941 Spanish communities with fewer than 1000 inhabitants.
From the time the polling stations closed at 10 p.m., the system generated 59,052 pieces with text, images and graphics as the results came in. The information, based on official data from the Ministry of the Interior, could be consulted on the website www.rtveia.es. Additionally, 332 hours of audio using a synthetic voice were generated, accessible on the same page through the virtual assistant Alexa. Because of the good results, the system was used again in the general elections of 23 July 2023. On that occasion, 23,006 reports were generated on the evolution of the turnout and 53,034 on the results of the vote. The significance of the project was confirmed when it received the international IBC2023 award in the category of Social Impact, which recognizes groundbreaking initiatives that are redefining the landscape of the media industry. To train the system, it was necessary to create a methodological tool that could calibrate the quality of the texts being generated by the machine and propose any required improvements. This study examines that undertaking based on the experience of a team of researchers from the Universidad de Castilla-La Mancha, who worked alongside RTVE journalists and technicians from Narrativa, the first international automated news agency.

Automated Texts and Their Impact in Rural Areas
In the media ecosystem today, the use of new technologies based on artificial intelligence for election coverage has been accompanied by changes in working methods and dynamics, reducing costs and automating essential work that had, until recently, been considered to belong exclusively to humans in the journalism profession (Brennen et al. 2022). Although a number of works have approached this topic, few extensive field studies have been conducted. The impact of this tool on the generation of election news has been analysed by Fanta (2017), Sánchez Gonzales and Sánchez González (2017) and Digiday (2017), amongst others. For Diaz-Garcia et al. (2020), its functions range from providing and comparing information on the candidates and their proposals, to compiling data on the preferences of the electorate in order to formulate party strategies, campaign messages and other political communication material, to predicting non-official election outcome scenarios. In short, AI helps journalists to generate new news content by searching for and analysing data that are personalized and adapted to the needs of different audiences who did not previously receive this information due to a lack of resources (Canavilhas 2015).
However, Aramburu Moncada et al. (2023) argue that a key potential aspect of this technological tool lies in interpreting the election results from small towns in depopulated areas of Spain and transforming this information into automated reports without human intervention. They highlight the initiatives undertaken in this country in 2019 by the digital newspaper El Confidencial and the start-up Narrativa, the agency that went on to collaborate with the public broadcast entity RTVE in 2023 to automatically generate articles and audios about the election outcomes and developments in Spanish communities with fewer than 1000 inhabitants. The resulting accurate and intelligible synthetic words and texts once again demonstrated the scope and importance of artificial intelligence with regard to extending into areas that traditional coverage cannot reach, as proposed by LeCompte (2015).

Generative AI Tools and Models
The appearance of artificial intelligence models and tools has radically transformed the way the profession of journalism is understood and performed, changing the education process, the production of content and the knowledge and skills required by professionals. The arrival of AI has led academics to raise various questions about the efficacy and limitations of these applications, especially ChatGPT, a natural language processing (NLP) model (Guida and Mauri 1986) developed by the company OpenAI (OpenAI 2022).
A number of works have analysed the positive and negative sides of AI, including Aydın and Karaarslan (2022), Lopezosa (2023) and Wang et al. (2023), amongst others, who highlight the efficiency of generative artificial intelligence in the production of content but stress the need for responsible and ethical use, since this tool does not disclose the sources or body texts it draws on to retrieve information and create responses. Other studies, such as the work by Lopezosa et al. (2023), have expanded their scope to new tools like Midjourney, Dall-e and Stable Diffusion, recognizing that, although they are capable of reaching the general public, they make mistakes in their application and collection of data. Consequently, the authors recommend that humans provide feedback when these tools are used, in order to reinforce all the verification procedures.
Turning to image and video creation, Guerrero-Solé and Ballester (2023) argue that the applications Stable Diffusion, Midjourney and Dall-e have changed the paradigm, although they call for additional studies and research to better understand the opportunities, limitations and related challenges. In a study of the audiovisual creations generated by GenAI, López Delacruz (2023) observed that one of its primary limitations is that it imitates earlier visual styles without contributing any narrative innovation.
Faced with this situation, Van-Dis et al. (2023) recommend that both researchers and journalists apply a critical approach to the use of artificial intelligence tools, carefully verifying the results, the data and the references, since these tools contain biases resulting from the models used or their training data.

Previous Studies
Academic studies on the quality of news articles created by AI have increased and improved in recent years (Jung et al. 2017; Waddell 2019; Tandoc et al. 2020; Jia and Johnson 2021; Wölker and Powell 2018; Lermann Henestrosa et al. 2023). The first international research studies, which appeared in the mid-2010s, primarily focused on how these tools could write texts by themselves (Carlson 2015) and on how articles written by robots are perceived, using indicators to measure their quality (Clerwall 2014; Haim and Graefe 2017; Zheng et al. 2018).
In Spain, Ufarte Ruiz and Murcia Verdú (2018) were amongst the first to analyse the quality of the information in political and financial reports produced by Gabriele, the Narrativa software, using a questionnaire conducted with more than 100 journalists. Calvo-Rubio and Ufarte-Ruiz (2020), in turn, surveyed almost 200 Journalism and Audiovisual Communication students in a number of Spanish universities and concluded that the quality of automated news is deficient due to the lack of contrast, absence of interpretation, non-existence of humanity and sensitivity and poor wording.
Rojas Torrijos and Toural-Bran (2019) shifted the focus to the sports coverage produced by AnaFut, the bot developed by the digital paper El Confidencial, to identify the statistics that facilitate orderly data handling and the programming of routine news productions, given the cyclical and repetitive nature of matches and tournaments. In a similar vein, Murcia Verdú et al. (2022) analysed the content of 28 news stories to discover whether these types of texts have the same quality standards as pieces written by journalists. On the other hand, Rojas Torrijos (2021) and Calvo-Rubio and Rojas-Torrijos (2024) have addressed the importance of reinforcing the supervision of journalistic ethics in semiautomated journalism.
In general, the quality of automated news is perceived as excellent, although with some limitations, like the impossibility of adding context, different points of view and interpretation (Sandoval-Martín and Barrolleta 2023).
In the field of public service media (PSM), the demands of quality and compliance with the ethical principles of journalism reach the highest level due to their relevance, need for trust and social function. In this field, Fieiras-Ceide et al. (2023) have studied the use of AI in the recommender systems of 14 European public broadcasters. In a more general scope, Zaragoza and García Avilés (2022) and Direito-Rebollal and Donders (2023) have focused on analysing how PSMs are adapting to the new communicative context where technological innovation plays an essential role, with special emphasis on quality and content adaptation.

The RTVE Project
As Fieiras-Ceide et al. (2023, p. 354) explain, "the renewed digital context motivates public broadcasting corporations to internalize innovative processes that allow them to be relevant in people's lives. These processes are not only limited to the development and integration of sophisticated technological prototypes, but are closely linked to a philosophy of constant change and renewal of ideas and ways of thinking". In the case of Radio Televisión Española (RTVE), the importance of artificial intelligence as a fundamental tool for the creation, production and distribution of content has been emphasized.
Since 2021, the Editorial Board for Technology, Innovation and Systems of RTVE has been working to create a tool that would make it possible to generate news on election results in communities with fewer than 1000 inhabitants, in order to 'offer a service that is not possible to provide using traditional media' to citizens and 'assist RTVE professionals in their jobs by creating an initial version on which they can work' (RTVE 2023). The aim of these initiatives on the part of the public broadcasting service is to emphasize its mission of public service. Earlier, in 2019, the BBC News Lab had already experimented with the automated generation of news related to general elections using the engine called SALCO (Semi-Automated Local Content). Subsequently, in May 2023, they developed a project to generate hyper-local news based on official statistical data (Hatcher-Moore 2023).
For the necessary technological developments, RTVE used the services provided by the company Narrativa, specifically its natural language generation system known as Gabriel. The company trained the software to write election reports using texts written by journalists as the corpus. The first tests were conducted during the elections for the Madrid Assembly in 2021. At that time, 2600 pieces were generated in four hours; they were not posted but were used to adjust the system (Aramburu Moncada et al. 2023). After that trial, attention turned to the local elections of 2023. The challenge was to adjust the system to offer useful, real-time information on the vote count in Spanish communities with fewer than 1000 inhabitants. RTVE journalists designed the structure of the reports that the system needed to generate, using as the source the official structured data supplied by the Ministry of the Interior throughout election day. To train the system, data from earlier local elections were loaded and the phase of testing and adjusting the algorithm began. This work continued throughout the first four months of 2023. The project also collaborated with the studio Monoceros Labs and the Universidad de Granada to create the synthetic voice, with the technological support of Amazon Web Services (AWS). The Spanish National Organization of the Blind, ONCE, provided advice about information accessibility.
In order to evaluate the quality of the automatically generated reports and measure the advances, it was necessary to create a methodological tool.This task was assigned to the Universidad de Castilla-La Mancha, thus expanding the university's collaboration with RTVE's territorial hub in the region of Castille-La Mancha, a model of technological innovation for the public broadcasting service.

Methods
The proposed methodology uses content analysis; although mainly quantitative, it is complemented with value judgements in the form of observations to facilitate the implementation of improvements, in line with the approach of Odriozola-Chéné et al. (2020).
The design of the methodology began with the creation of a datasheet containing 11 variables and 58 dimensions that could be used to detect anomalies and provide a numerical evaluation of the final result (Table 1). Six experts in the fields of journalistic writing and cyber journalism were involved in the design and evaluation to ensure the reliability of the instrument. This tool was iteratively applied to a representative sample of the total reports generated by the system with the data loaded for the training. After each analysis, the system was adjusted before it generated new reports. The process was repeated until the result was accepted as valid by the multidisciplinary team created to develop the project led by RTVE.
The datasheet was configured on the basis of three characteristics: the journalistic quality of the text, its suitability for the medium used for transmission and its compliance with the policies of a public information service.
For the first characteristic, the textual elements of the reports were identified and, based on the academic literature (Trillo-Domínguez and Alberich-Pascual 2017; López Hidalgo 2001, 2019; Armentia Vizuete and Marcet 2003), the characteristics that needed to be accounted for were determined. As a result, dichotomous variables were defined to record the presence or absence of each characteristic (0: not present; 1: present), to which four variables were added to evaluate the overall clarity, concision, coherence and cohesion of the final result (Grijelmo 2014; Martínez Albertos 1974; Dovifat 1959).
For the second characteristic, the team evaluated the suitable appearance of the narrative elements characteristic of a webpage, the main platform for posting the content: links, photos, audios and graphics. Videos were omitted as they were not included in the proposed news format. All the elements were studied for their presence, or lack thereof, and, in the former case, whether they were relevant in terms of providing valuable information. The suitability of the anchor texts for the links, the use of photo captions and the presence of a descriptive headline for the graphics were also reviewed (Salaverría 2005; Cobo 2012).
Finally, a series of variables related to the suitability of the content for a public medium were included: the public relevance of the information, the accuracy, objectivity, impartiality, use of nondiscriminatory language, use of inclusive language and the precise use of the data (Mandato-Marco a la Corporación RTVE 2008; Estatuto de Información de la Corporación RTVE 2008; Manual de Estilo de RTVE 2010; Guía de Igualdad de RTVE 2022).
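The three groups of variables described in this section can be pictured as a simple coding sheet that is aggregated into per-variable percentages and mean scale scores. The sketch below is illustrative only: the variable names are hypothetical stand-ins, not the paper's actual 58 dimensions, and it assumes dichotomous items coded 0/1 alongside the four 0-3 global scales.

```python
# Hypothetical subset of the datasheet; the real instrument has 11 variables
# and 58 dimensions. Dichotomous items are coded 0 (absent) / 1 (present).
DICHOTOMOUS = ["lead_highlights_key_facts", "spelling_ok", "grammar_ok",
               "context_provided", "links_relevant"]
SCALES = ["clarity", "concision", "coherence", "cohesion"]  # scored 0-3


def summarize_sample(sheets):
    """Aggregate a list of coded datasheets into compliance percentages
    for the dichotomous items and mean values for the 0-3 scales,
    the two kinds of figures reported in the results tables."""
    n = len(sheets)
    pct = {v: round(100 * sum(s[v] for s in sheets) / n, 2)
           for v in DICHOTOMOUS}
    means = {v: round(sum(s[v] for s in sheets) / n, 2) for v in SCALES}
    return pct, means
```

Applied to a coded sample, the output mirrors the format of Tables 3, 5 and 6: one compliance percentage per dichotomous dimension and one mean per global scale.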
To test the tool and the reliability of the data obtained from the six coders (the experts involved in its development), a pilot test was conducted based on analysing 12 of the 632 automatically generated reports (1.9% of the total). Krippendorff's (2004) alpha was employed as the study index, yielding an average result of 0.681, divided by category as shown in Table 2. As the main discrepancies were detected in the evaluation of the headlines and the body text, the coding was revised and the analytical criteria unified in order to raise the index above 0.7 in all the categories, a figure considered sufficient to obtain reliable data (Hayes and Krippendorff 2007).
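The reliability check can be reproduced with a minimal implementation of Krippendorff's alpha for nominal data with no missing values, sketched below; this is a didactic version of the standard coincidence-matrix formulation, and for a real study a vetted library implementation would be preferable.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data, no missing values.
    `units` is a list of per-unit rating lists (one value per coder)."""
    o = Counter()  # coincidence matrix o[(c, k)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit rated by one coder contributes no pairs
        for a, b in permutations(ratings, 2):  # ordered pairs of ratings
            o[(a, b)] += 1 / (m - 1)
    n_c = Counter()  # marginal totals per category
    for (a, _), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_o = sum(w for (a, b), w in o.items() if a != b)  # observed disagreement
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)
    return 1.0 if d_e == 0 else 1 - d_o / d_e
```

With perfect agreement the function returns 1.0, and values below the 0.7 threshold cited above (Hayes and Krippendorff 2007) would flag a category for criteria unification, as was done for the headlines and body text.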

Results
After adjusting the datasheet and the codebook, the tool was applied to 106 reports generated by the system under study for a number of communities. The results (Table 3) identified the elements that required optimization.
The analysis of the body text showed that the first paragraphs reflected the most important elements of the reports (100%) without merely summarizing them (100%). However, all of them began with an adverb (100%). In 89.62% of the cases, the spelling was correct, and 95.28% of the reports followed the rules of grammar. The articles generated by AI included background (99.05%), provided context data (82.62%) and clearly explained the facts (98.11%). However, anomalies were detected in the interpretation of the data in 47.2% of the pieces, as well as in the spelling (42.45%) and grammar (16.98%). The overall evaluation received 2.13 out of 3 points for clarity, 2.06 for concision, 1.86 for coherence and 1.73 for cohesion.
In the section related to the elements of digital media that contribute to the story, the links were present and relevant and the anchor text was suitable in 100% of the cases. All the pieces also included photos, but it was concluded that they were only relevant in 27.36% of the reports, since many consisted of mere resource graphics that did not contribute any information. In no cases were captions present. Audios (100%) and graphics (99.06%) were also present, with a high degree of relevance (100% and 99.06%, respectively). However, in nearly half of the reports (49.06%), the headline for the graphics was either not present or incorrect.
Finally, the reports were determined to have public relevance (99.06%) and be accurate (90.57%), objective (93.4%) and impartial (94.34%). Additionally, the language was not discriminatory (100%) and the data were precise (87.74%). Possible improvements in the use of inclusive language were identified in every report.
The observations included by the coders were used to create a list of elements to improve in each of the sections, with references to the text where they had been identified (Table 4).

Second Analysis
After adjustments made by the Narrativa technical team based on the results obtained, 633 new reports were generated, of which 553 received a valid analysis. Since the callouts, which the team considered accurate, did not vary between the reports, they were eliminated from this evaluation.
The new data are compared with those obtained in the first analysis in Table 5.
The percentage of reports that complied with the characteristics established for the headline elements improved for almost all the items analysed. The only notable decrease was found in the typographical differentiation of the callouts. The reason for this was a discrepancy between coders: while some understood that these texts required larger letters than the body text, others understood that the full stop separating them from the beginning of the sentence was sufficient. In the end, the presentation was accepted as valid, as it adhered to the style used by RTVE on its website. The variables related to the text also showed improvement. However, there was a notable decline in the number of articles that met the spellcheck criteria (44.85%). A review of the observations included by the coders on the datasheets determined that the problem lay in the data used to generate the graphics; these texts did not include the accent marks required in Spanish.
The principal problems were identified in the precision and the interpretation of the data. Once again, the errors were largely related to the graphics: the graphic representation did not coincide with the text content; the percentages did not add up to 100%; and there were discrepancies between the number of votes and the census, amongst other problems. The result was a decrease in the average score for clarity (1.84).
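The kind of between-round check that surfaced these declines can be sketched as a simple delta computation over the per-variable compliance percentages of two coding rounds; the figures in the test below are taken from the first-sample results reported earlier, with the second-round values illustrative.

```python
def compare_iterations(prev_pct, curr_pct, tol=0.0):
    """Return per-variable percentage-point deltas between two coding
    rounds and flag any variable whose compliance declined by more than
    `tol`, so the technical team can inspect it before the next run."""
    deltas = {k: round(curr_pct[k] - prev_pct[k], 2)
              for k in prev_pct if k in curr_pct}
    regressions = {k: d for k, d in deltas.items() if d < -tol}
    return deltas, regressions
```

A drop such as the one observed in the spellcheck criterion would appear in `regressions`, prompting the review of coder observations that traced the problem back to the graphics data.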
RTVE made efforts to obtain at least one photograph from each community. The decision was also made to include another image related to election processes from their documentary archives, which increased the relevance of the images (49.37%). Photo captions also began to appear (38.88%).
All the parameters related to the suitability of the texts for a public medium improved, but the use of inclusive language only reached 2.47%. A detailed analysis of this section revealed that the reports complied with RTVE policies, although there was room for improvement in more general terms.
The proposals for improvement centre on three elements: (1) the problems detected in the graphics data; (2) the interpretation of the data in the text (incoherence, lack of clarity, contradictions, declaration of a winning party in some cases where a pact was necessary); and (3) style corrections (inaccurate or missing conjunctions, confusing wording, incorrect expressions).

Third Analysis
During the first week of April, following the introduction of the changes suggested by the technicians, the system generated a further 633 items, of which 565 received a valid analysis. The results are presented in Table 6.
After correcting the deficiencies detected in each of the reports, the working team validated the suitability of the content for final publication using the data provided by the Ministry of the Interior on election day. The results can be consulted at https://www.rtveia.es/elecciones-municipales-2023 (accessed on 20 March 2024).

Discussion
In view of the results, the use of the methodological tool created to evaluate the quality of the reports generated by artificial intelligence tools for the results of the local elections of 28 May 2023 in Spain improved the system. Based on the algorithm created for the elections to the Madrid Assembly in 2021, the RTVE journalists created the information structure that the AI system needed to generate, using the official election data provided by the Ministry of the Interior as its source. At that point, the tests to train the information system were carried out until a result was obtained that was considered valid for publication under the name of this public broadcasting entity.
During the training phase, a datasheet created ad hoc was used to facilitate the systematic study of the critical elements related to quality, the use of narrative mechanisms and compliance with the ethical policies of public service media. This tool made it possible to identify the weakest points, analyse them more thoroughly and make improvements when deemed necessary.
Initially, the principal problems were detected in the sections related to precision and data interpretation. The first analysis detected anomalies in the headlines (7.55%), callouts (6.6%) and body text (47.17%). The adjustments made it possible to conclude the training with a reduction in errors in these aspects, to 0.35% in the headlines, 0.71% in the callouts and 37.7% in the body text. However, the spelling and grammar checks and style were more problematic. In this case, despite the improvements to the text generation system, errors in the database used for the tests limited the extent to which the reports were able to improve in order to reach full accuracy. Nonetheless, there was a clear improvement in the texts. The overall assessment of the reports regarding concision (+0.21), coherence (+0.22) and cohesion (+0.15) increased after the successive revisions. There was almost no variation in the quality, with the final score above 2 out of 3.
The actions taken regarding the headlines, text, graphics and images improved the parameters that established the suitability of the pieces for the RTVE statute. Public relevance rose 0.94 percentage points, reaching 100% in the end. Accuracy increased to 99.12% of the content (+7.48%), while 99.12% of the pieces were considered objective in the final sample (+5.72%), and 97.52% were determined to be impartial (+3.18%). No article was found to contain discriminatory language. Although there was room to improve the use of inclusive language in a significant number of the reports generated, 95% of the final sample, the results complied with the policies established by the public broadcasting entity.
The other characteristics had acceptable values from the beginning. The accuracy of the headline elements surpassed 99% for all the variables, and these numbers stayed constant or improved throughout the training. The first paragraphs highlighted the most important elements of the information, avoided summarizing and met spelling standards (−1.66%).

Conclusions
The proposed methodological tool enabled the assessment of the different dimensions related to quality and was useful for identifying anomalies, proposing improvements to programmers and fine-tuning algorithms. The application of the datasheet and numerical results facilitated systematic analysis, and its iterative use gradually improved the results obtained with the system to a threshold suitable for publication. With slight modifications, the datasheet can be applied to reports from other domains and used as a means of training AI systems.
Despite the ability of AI systems to generate a large volume of automated information, human work is essential during each phase of the process, in line with previous research (Van-Dis et al. 2023; Aydın and Karaarslan 2022; Lopezosa 2023; Wang et al. 2023; Lopezosa et al. 2023; Calvo-Rubio and Rojas-Torrijos 2024). The choice of the subject, the design of the information structure, the selection of the database, the system programming, the detection of anomalies during training, decisions about what improvements to make, adjusting the software and conducting the final review are all critical phases that require human participation. Moreover, aside from the steps related to programming and managing technological systems, the journalist plays a fundamental role in this process. Be that as it may, the reliability of the databases used is critical when working with generative AI systems.
This collaboration between AI systems and journalists in the new media ecosystem reinforces the idea that news professionals should not see technological tools as enemies that come to replace jobs but as resources they can use to improve journalistic routines to limits never before reached. This union between journalists and machines (Túñez-López et al. 2018) revives the old debate over whether newsrooms should have new professional profiles and specialised work teams that connect the possibilities of artificial intelligence with the needs of journalism itself. In short, these are the so-called exo-journalists (Tejedor and Vila 2021), who know the computer language and are endowed with heterogeneous technical and linguistic skills that allow them to document, verify and generate content from a transmedia logic and from different approaches.
This study has some limitations. The analysis of the reports that were ultimately posted could have more precisely quantified the percentage of improvement in each dimension studied, but that does not diminish the validity of the tool. Moreover, this study opens up new avenues for research comparing texts produced by humans and machines. An investigation into how the residents of depopulated areas view this type of information and its relevance to their lives would also be worth pursuing.

Table 3.
Results of the first sample.

Table 4.
List of elements to improve after sample 1.

Table 5.
Results of the second sample.

Table 6.
Results of the third sample.