1. Introduction
Generative AI, a category within the realm of artificial intelligence, is characterized by its ability to produce new data by learning from existing datasets. These models can generate content spanning a wide spectrum of domains, including text, imagery, music, and education [
1,
2,
3,
4,
5]. The foundation of generative AI lies in the utilization of deep learning techniques and neural networks, which enable comprehensive analysis, comprehension, and generation of content that closely mimics the outputs created by humans. A noteworthy example in this realm is the introduction of OpenAI’s ChatGPT (
https://openai.com/blog/chatgpt (accessed on 1 January 2025)), a state-of-the-art generative AI model that has significantly impacted diverse domains, including the academic landscape. Notably, ChatGPT’s proficiency in processing open-ended text queries and producing responses that closely resemble human-written text has elicited both admiration and apprehension among researchers and scholars.
The rapid advancement of natural language processing (NLP) has been largely driven by the development of transformer-based architectures, particularly the Generative Pre-trained Transformer (GPT) models, such as ChatGPT. These models are designed to generate coherent, human-like text across a wide range of formats, including sentences, paragraphs, and documents.
Given that GPT models are trained on human-generated content, comparing ChatGPT’s output with that of humans presents both a critical and inevitable challenge. Recent studies have rigorously evaluated ChatGPT across various domains—including education [
1,
6,
7], healthcare [
8,
9,
10], research [
11,
12,
13,
14], programming [
15,
16,
17], translation [
18], and text generation [
19,
20,
21,
22,
23]. These evaluations suggest that while ChatGPT achieves near-human performance in some areas, its capabilities remain limited in others, as discussed in
Section 2.
Our study narrows this focus to the task of summarization—an essential and widely applicable NLP function. Summarization enhances communication efficiency, facilitates information management, and plays a vital role across fields such as education, research, and decision-making.
Summarization is also a core function that ChatGPT can perform. Accordingly, several studies have evaluated ChatGPT’s performance on this specific task. For instance, Goyal et al. [
24] investigated the impact of GPT-3 on text summarization, with a specific focus on the well-established news summarization domain. The primary objective was to compare GPT-3 with fine-tuned models trained on extensive summarization datasets. Their results demonstrated that GPT-3 summaries, generated solely based on task descriptions, not only garnered strong preference from human evaluators, but also exhibited a notable absence of common dataset-specific issues, such as factual inaccuracies. Gao et al. [
25] explored ChatGPT’s ability to perform summarization evaluations resembling human judgments. They employed four distinct human evaluation methods on five datasets. Their findings indicated that ChatGPT could proficiently complete annotations through Likert-scale scoring, pairwise comparisons, pyramid evaluations, and binary factuality assessments. Soni and Wade [
26] conducted an evaluation of ChatGPT’s performance in the realm of abstractive summarization. They employed both automated metrics and blinded human reviewers to assess summarization capabilities. Notably, their research revealed that, while text classification algorithms were effective in distinguishing between genuine and ChatGPT-generated summaries, human evaluators struggled to differentiate between the two. Additionally, Yang et al. [
27] evaluated ChatGPT’s performance using four widely recognized benchmark datasets that encompass diverse types of summaries from sources such as Reddit posts, news articles, dialogue transcripts, and stories. Their experiments demonstrated that ChatGPT’s performance, as measured by ROUGE scores, was comparable to that of traditional fine-tuning methods. The collective findings of these studies underscore the considerable potential of ChatGPT as a robust and effective instrument for text summarization.
Our study presents a novel approach that compares not only the quantitative differences but also the qualitative distinctions between summaries generated by human authors and those generated by a ChatGPT-based language model. In particular, the qualitative aspect of this comparison focuses on the degree of bias present in each summary.
The motivation for considering bias as a qualitative indicator lies in its potential to adversely influence public perception through skewed media coverage. Exposure to distorted or misleading information can lead to inaccurate understandings of events or individuals, thereby hindering fair and objective judgment. Moreover, biased reporting contributes to ideological polarization and intensifies social conflict, potentially fostering political division and antagonism among different groups.
Given these consequences, various studies have explored the nature and classification of media bias. For instance, Hamborg et al. [
28] highlighted that news reports often lack objectivity, with bias emerging from the selection of events, labeling, word choices, and accompanying images. They also advocated for automated approaches to identify such biases. Similarly, Meyer et al. [
29] argued that the rise of AI/ML technologies has amplified bias within language models, largely due to the nature of the training data. As such, they emphasized the importance of using large language models (LLMs) with caution and awareness of their embedded biases.
Recent research has examined the potential of LLMs in mitigating such biases. For example, Schneider et al. [
30] investigated the application of language models to promote diverse perspectives while avoiding the explicit transmission of biased information. While the final outputs are still dependent on the underlying training data, their findings indicate that few-shot learning can serve to partially mitigate such biases. Additionally, Törnberg [
31] evaluated the accuracy, reliability, and bias of language models by analyzing their classification of Twitter users’ political affiliations and comparing them to human experts. Their findings showed that the language model achieved higher accuracy and exhibited less bias than the human evaluators, thereby demonstrating its potential to perform large-scale interpretative tasks.
Consequently, this study proposes a novel approach that simultaneously compares the summaries generated by ChatGPT and those written by humans in both quantitative and qualitative dimensions. A key distinction of this research lies in its focus on examining differences in bias between human-authored and AI-generated summaries. While previous studies have primarily relied on human evaluations or text-based metrics to assess summary quality, our study uniquely emphasizes the dual analysis of both semantic similarity and bias by comparing each summary with the original text.
To conduct our investigation, we utilized a sample dataset from The New York Times, a reputable source of news content. Each article typically includes a title, body text, and an editorial summary. We compare the body of each article with the editorial summary provided by The New York Times as well as the corresponding summary generated by ChatGPT.
In terms of methodology, we applied BERTScore as a quantitative metric to assess the semantic similarity between the human-generated summary (reference sentence) and the ChatGPT-generated summary (candidate sentence). BERTScore enabled us to evaluate the extent of textual alignment between the two summaries. Additionally, to assess differences in bias from a qualitative standpoint, we conducted a moderation violation test to examine potential disparities in the expression and content of each summary. A detailed explanation of our research design is presented in
Section 3.1.
The primary contribution of this study lies in its empirical comparison of summarization performance between ChatGPT and human writers. While analyzing the full text of news articles in sentiment analysis can result in inaccuracies due to the overwhelming volume of data, using summaries introduces variation depending on their quality. Our findings demonstrate that summaries produced by ChatGPT are of comparable quality to those written by professional journalists, indicating their applicability in subsequent tasks such as sentiment analysis.
Moreover, the presence of immoderate or emotionally charged language can skew sentiment analysis results. By using ChatGPT-generated summaries, which tend to reduce linguistic excess, more neutral and unbiased sentiment assessments can be achieved.
Finally, from a data management perspective, maintaining and analyzing large-scale news archives entails significant costs. In cases where journalists do not provide summaries, automatically generated summaries can offer a cost-effective alternative by reducing storage requirements while preserving essential information.
The remainder of this paper is organized as follows. In the following section, several recent studies on ChatGPT are briefly reviewed.
Section 3 describes the data and methodologies (ChatGPT, BERTScore, moderation) used in this study. In
Section 4, we present comparison results and a case study. Finally,
Section 5 provides concluding remarks.
2. Literature Review
Recent academic studies have focused on conducting comprehensive assessments of ChatGPT’s performance, potentialities, and limitations.
2.1. ChatGPT Performance Across Diverse Fields
Research on the performance of ChatGPT is being conducted across diverse fields of study, including but not limited to education and medicine.
Several scholars have investigated multifaceted applications of ChatGPT in education. For instance, Rudolph et al. [
1] investigated its potential to foster creativity within educational environments, whereas Mhlanga [
6] examined its role in facilitating personalized learning experiences. Furthermore, Wang et al. [
32] explored the effectiveness of providing real-time feedback to students during online learning activities. Together, these studies enrich our understanding of the impact of ChatGPT on education and highlight various aspects of its utility and effectiveness in educational settings.
In the medical field, studies have analyzed ChatGPT’s performance and the concerns it raises. De Angelis et al. [
8] highlight ChatGPT’s significant impact on the general public and research community but raise concerns about ethical challenges, especially in medicine due to potential misinformation risks. Rao et al. [
9] demonstrate ChatGPT’s potential for radiologic decision making, suggesting it could enhance clinical workflow and the responsible use of radiology services. Hulman et al. [
10] assess ChatGPT’s ability to answer diabetes-related questions, revealing the need for careful integration into clinical practice, as participants only marginally outperformed random guessing, contrary to expectations.
Several studies have examined the performance of ChatGPT in computer programming tasks, highlighting its strengths and limitations. Koubaa et al. [
15] highlight human superiority in programming tasks, particularly evident in the IEEExtreme Challenge. Surameery and Shakor [
17] advocate for ChatGPT’s integration into a comprehensive debugging toolkit to augment bug identification and resolution. Similarly, Yilmaz and Yilmaz [
16] stressed the importance of cautious integration, and offered recommendations for effectively incorporating ChatGPT into programming education.
Finally, Minaee et al. [
33] conducted a comprehensive review of major large language models, including GPT, LLaMA, and PaLM, covering their architectures, training methodologies, applications, and performance comparisons, as well as the contributions and limitations of each model. According to their review, GPT-4 is regarded as a leading LLM that demonstrates state-of-the-art (SOTA) performance across most tasks, including commonsense reasoning and world knowledge.
2.2. ChatGPT Performance in Summarization
Generating human-like summaries is an important part of natural language processing (NLP). Summarizing a text into succinct sentences containing important information differs from the traditional method of extracting key sentences from the original text. The quality of the generated summaries has also improved rapidly with the development of LLMs. Recently, various studies have been conducted to generate summaries using GPT-3.
Zhang et al. [
34] conducted a systematic human evaluation of news summarization using ten different large language models, including GPT, to assess how effectively LLMs perform in this task. The results showed that zero-shot instruction-tuned GPT-3 achieved the best performance, and the summaries it generated were rated as comparable to those written by freelance writers. Furthermore, the study suggests that instruction tuning, rather than model size, is the key factor contributing to improved summarization performance in LLMs.
According to Gao et al. [
25], ChatGPT was used to evaluate summaries; the results showed that ChatGPT has human-like evaluation abilities and that its performance is highly dependent on prompt design. There has also been research interest in controllable text summarization [
19]. This methodology generates summaries by adding specific requirements or constraints to a prompt. This allows the user to control the length, style, or tone of the summarized sentences. The study showed that ChatGPT outperformed the previous language model (text-style transfer).
Furthermore, Alomari et al. [
35] provide a comprehensive review of recent approaches aimed at improving the performance of abstractive text summarization (ATS), focusing on deep reinforcement learning (RL) and transfer learning (TL) as key techniques to overcome the limitations of traditional sequence-to-sequence (seq2seq) models. According to the study, Transformer-based transfer learning models have significantly addressed the shortcomings of earlier seq2seq approaches, which often produced low-quality, low-novelty summaries. These models have optimized summary quality, novelty, and fluency, thereby substantially advancing SOTA performance in the field.
2.3. Evaluation Methodologies
According to Yang et al. [
27], ChatGPT’s summarization performance can be measured using aspect- or query-based summarization methods. Aspect-based summarization presents specific aspects in the prompt and asks for a summary of them, whereas query-based summarization performs summarization based on questions. The results showed that ChatGPT performed well in news summarization.
Moreover, several studies have assessed ChatGPT’s text-generation capabilities compared with human-authored content. Pu and Demberg [
19] find instances where humans excel in various styles, while Herbold et al. [
20] observe ChatGPT’s superiority in generating argumentative essays. Additionally, studies such as those by [
21,
23] examined the disparities between humans and ChatGPT-generated text. However, Pegoraro et al. [
22] note the difficulty of effectively distinguishing between the two. These findings highlight the nuances of ChatGPT’s performance and underscore the ongoing challenges in discerning between human- and AI-generated texts.
2.4. Bias and Moderation in AI-Generated Text
Research has also examined whether ChatGPT introduces bias into summaries. According to Barker and Kazakov [
36], an experiment was conducted to determine whether ChatGPT-generated summaries reduced some textual indicators compared to the original. The results show that some textual indicators were removed during the transformation from the original to the summary, which could reduce discrimination based on the region of origin. However, there are studies that show a bias in sentences generated by ChatGPT [
37]. The study found that when politicians were used as prompts to generate sentences, the frequency of positive and negative words differed between liberal and conservative politicians, indicating a bias. Lucy and Bamman [
38] also showed that sentences generated by GPT-3 exhibit gender-dependent patterns in specific contexts. When generating storytelling sentences, women were more likely to be described in terms of their physical appearance, whereas men were more likely to be described in terms of power, suggesting that GPT-3 has a gender bias.
While ChatGPT’s efficacy is acknowledged in the field of research, it also triggers discussions about its ethical implications and inaccuracy [
11,
12,
13,
14], signaling a multifaceted evaluation of its role in the field.
In addition, although our study addresses a similar topic to the aforementioned works, it introduces several important advancements that distinguish it from prior research. Compared to previous studies that primarily relied on human preference evaluations [
25], we quantitatively assessed semantic similarity by employing BERTScore. Furthermore, unlike previous research that focused solely on summarization performance [
24,
26,
27], our study uniquely incorporates an ethical evaluation, analyzing to what extent the generated summaries reduced violent, biased, or harmful expressions through the use of the Moderation API. This approach offers a novel perspective by simultaneously evaluating both the factual consistency and the social acceptability of the generated summaries.
3. Research Design and Models
3.1. Research Design
The research workflow in this study is outlined as follows. Initially, we gathered human-generated text summaries and created text summaries using ChatGPT. For this purpose, we used New York Times articles that provide both the original text and a summary for each article. We utilized ChatGPT to generate individual article summaries and subsequently conducted a comparative analysis with the corresponding New York Times summaries. In our analysis, we computed the BERTScore to assess the disparity between the two summaries. Furthermore, a moderation validation test was conducted on the summaries to identify instances in which they deviated from moderation principles. The research process is illustrated in
Figure 1. The specific prompt design used for ChatGPT in this phase is described below.
Figure 2 illustrates the process of generating a summary using ChatGPT. We used the GPT-3.5-turbo-16k model to accommodate the number of tokens in the news body. We selected GPT-3.5-turbo-16k rather than GPT-4 primarily because of the significant cost difference for text generation tasks. Specifically, GPT-4 costs approximately USD 30 per 1 million input tokens and USD 60 per 1 million output tokens, whereas GPT-3.5-turbo-16k costs only around USD 0.5 per 1 million input tokens and USD 1.5 per 1 million output tokens. This price difference, with GPT-4 roughly 60 times more expensive, makes GPT-3.5-turbo-16k substantially more practical and economically viable, especially for the large-scale summarization task involving thousands of news articles analyzed in this study. In total, we analyzed 1737 news bodies from The New York Times, each paired with its corresponding editorial summary; ChatGPT was assigned the role of generating summaries of The New York Times articles given their bodies. The prompt used is “We’ll take a News text as input and produce a news summary as output”. This standardized prompt was applied consistently across all 1737 pairs with the same parameters. We used the parameters from the ChatGPT API text-summarization example. The temperature was set to 0, producing deterministic outputs by removing sampling randomness; higher temperature values increase the probability of generating less predictable words. The frequency penalty and presence penalty were set to 0, imposing no penalties on token repetition or presence. Top_P determines what fraction of the highest-probability candidate words is considered when generating the next word. The maximum length determines the length of the generated summary; we matched the maximum token length to the number of tokens in the corresponding New York Times summary. These parameters are shown in
Table 1. After setting the parameters, we provided the news bodies to ChatGPT. Consequently, ChatGPT generated a summary of each article based on its internal probability calculations.
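To make this step concrete, the listing below is a minimal sketch of the generation call, assuming the OpenAI Python client (v1.x interface) and an API key available in the environment; the model name, prompt, and parameter values mirror Table 1, while the helper function itself is illustrative rather than the exact script used.

```python
# Minimal sketch of the summary-generation step (illustrative, not the exact script).
# Assumes the openai Python package (v1.x) with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

PROMPT = "We'll take a News text as input and produce a news summary as output"

def summarize(news_body: str, max_summary_tokens: int) -> str:
    """Generate a ChatGPT summary capped at the token length of the NYT editorial summary."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": news_body},
        ],
        temperature=0,        # deterministic output (no sampling randomness)
        top_p=1,              # consider the full candidate distribution
        frequency_penalty=0,  # no penalty on token repetition
        presence_penalty=0,   # no penalty on token presence
        max_tokens=max_summary_tokens,
    )
    return response.choices[0].message.content
```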
3.2. Data
We harvested articles from
The New York Times “Today’s Paper” section, which provides a daily list of the newspaper’s print articles. For each date from 26 January 2018 to 31 December 2022, we retrieved only the first textual news article listed on the “Today’s Paper” page. If the first entry was an advertisement or a multimedia-only item (photo essay, video, etc.), that date was excluded, maintaining a uniform one article per day sampling rule. Each selected item follows the standard New York Times article format—headline, editor-written abstract (summary), and full body text. Using Selenium for navigation and BeautifulSoup for parsing, we extracted these three components separately and stored them in distinct fields—title, summary, and body—so that the body–summary pair is explicitly available for downstream analysis. The resulting corpus comprises 1737 article–summary pairs. On average, the editor-written abstracts contain 30.13 tokens, whereas the corresponding full-text bodies contain 1420.24 tokens. New York Times articles have also been used as training data for machine-learning models in several studies [
39,
40,
41,
42].
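For transparency about the collection pipeline, the sketch below shows how Selenium and BeautifulSoup can be combined for this task; the CSS selectors are hypothetical placeholders, since the actual structure of the nytimes.com pages is not reproduced here and changes over time.

```python
# Illustrative sketch of the crawling step (hypothetical selectors, not the exact scraper).
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

def fetch_first_article(todays_paper_url: str):
    """Return the title/summary/body of the first textual article listed for one date."""
    driver.get(todays_paper_url)
    listing = BeautifulSoup(driver.page_source, "html.parser")
    first_link = listing.select_one("article a")   # hypothetical selector
    if first_link is None:
        return None                                # skip ad/multimedia-only days
    driver.get(first_link["href"])                 # assumes an absolute URL
    page = BeautifulSoup(driver.page_source, "html.parser")
    return {
        "title": page.select_one("h1").get_text(strip=True),
        "summary": page.select_one("#article-summary").get_text(strip=True),  # hypothetical id
        "body": " ".join(p.get_text(strip=True) for p in page.select("section p")),
    }
```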
3.3. Methods for Summary
3.3.1. GPT
The field of language processing has developed rapidly since the release of the transformer architecture, with the generative pretrained transformer (GPT) model receiving particular attention; one example is OpenAI’s ChatGPT. GPT is a large language model trained on massive amounts of language data. Early GPT models were trained differently from current ones. GPT-1 used a semi-supervised approach: it was pretrained on a large corpus, and transfer learning was then performed by fine-tuning on each downstream task using labeled data [
43]. GPT-2, by contrast, dropped the fine-tuning step of the traditional semi-supervised approach, because shifts in the data distribution or minor modifications would reduce the accuracy of discriminative downstream tasks. The ultimate goal of the GPT models is to avoid the manual creation and use of labeled data, enabling cost-effective learning. GPT-2 still uses a transformer architecture, but layer normalization was moved to the input of each sub-block and an additional layer normalization was added after the final self-attention block. It also increased the vocabulary size, context size, and batch size, and used 1.5 billion parameters. However, GPT-2 does not perform well on summarization and question-answering tasks [
44]. Subsequently, GPT-3 introduced no algorithmic changes but increased the number of parameters to 175 billion without fine-tuning during training. Performance improvements over previous models such as GPT-2 were observed across classification, extraction, rewriting, generation, chat, closed QA, open QA, and summarization; however, limitations remain, such as hallucinations and misaligned sentence generation [
45,
46,
47]. To address these issues, fine-tuning with human feedback was conducted so that the model follows broad instructions without violating conventions, resulting in the InstructGPT model, which, unlike the original GPT-3, generates results aligned with human preferences even for tasks it was not pretrained on. However, expanding the number of parameters inevitably increases costs, including those associated with fine-tuning, making discriminative downstream tuning difficult; nevertheless, providing example questions within prompts (few-shot learning) achieved human-level performance even on non-pretrained tasks [
48,
49,
50].
3.3.2. BERTScore
In this study, we used the BERTScore [
51] similarity measure to evaluate the similarity between the reference and candidate sentences. BERTScore is a metric that measures how similar the meanings of two sentences are. Unlike traditional similarity evaluation methods that rely on exact match of words and perform only surface-level comparisons, BERTScore analyzes the meanings of words even when the sentence structures differ. By leveraging the BERT language model to understand the meanings of individual words, BERTScore evaluates the semantic similarity between sentences based on their conveyed meanings.
Specifically, BLEU [
52] uses n-grams to compare similarities. However, this does not account for the similarity of sentences that have the same meaning but use different words. For example, BLEU and METEOR [
53], which use n-grams, assign higher scores to sentences that reuse the reference’s words even when the meaning differs than to sentences that convey the same meaning with different words, and they penalize sentences based on word order. BERTScore uses BERT’s [
54] contextual embedding method to tokenize the sentences and measure the similarities between tokens. Contextual embedding is a method in which the embedding of a word changes depending on its context. For example, in the sentences “I need a new mouse for my computer” and “There’s a mouse in the kitchen”, “mouse” is the same word, but it receives different embeddings under contextual embedding. To compute BERTScore, we used BERT to embed the reference and candidate sentences. The reference sentence is represented by $x = \langle x_1, \ldots, x_k \rangle$ and the candidate sentence is represented by $\hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle$, where each $x_i$ and $\hat{x}_j$ is a contextual token embedding. We perform pairwise cosine similarity between the embedded vectors of the two sentences; because the embeddings are normalized beforehand, the calculation simplifies to the inner product $x_i^{\top} \hat{x}_j$. The Recall, Precision, and F1-score for each sentence pair are calculated as follows:
$$
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top} \hat{x}_j, \quad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top} \hat{x}_j, \quad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}.
$$
Cosine similarity has the range $[-1, 1]$, as shown in the formula; however, in practice the scores occupy a more restricted interval owing to the geometry of contextual embeddings. It is therefore recommended to rescale BERTScore using an empirical lower bound $b$, e.g., $\hat{R}_{\mathrm{BERT}} = (R_{\mathrm{BERT}} - b)/(1 - b)$, to improve the interpretability of the calculated score.
Figure 3 presents two illustrative summary sentences, one provided by a human editor from The New York Times (“New Yorkers are standing in hours-long lines to get tested for coronavirus, highlighting the ongoing challenges the United States faces in public”) and the other generated by ChatGPT (“As the outbreak surges around the country, the testing delays show the basic public health challenges that the country still faces”). Each sentence was first tokenized into WordPiece tokens. Embeddings for each token were then extracted using BERT’s contextual embedding method, yielding a 768-dimensional vector per token. Using these token-level embeddings, we computed pairwise cosine similarities to produce the token-to-token similarity matrix shown in
Figure 4. Finally, following the BERTScore methodology, we calculated Recall, Precision, and F-score metrics from this similarity matrix, quantifying the semantic similarity between the two sentences.
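In practice, these scores can be computed with the open-source bert-score package; the snippet below is a minimal sketch under that assumption, using the two example sentences above and the baseline rescaling discussed earlier.

```python
# Minimal sketch of the BERTScore computation using the open-source bert-score package.
from bert_score import score

references = ["New Yorkers are standing in hours-long lines to get tested for coronavirus, "
              "highlighting the ongoing challenges the United States faces in public"]
candidates = ["As the outbreak surges around the country, the testing delays show the basic "
              "public health challenges that the country still faces"]

# rescale_with_baseline applies the empirical lower bound b described above.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"Precision={P.mean().item():.4f}, "
      f"Recall={R.mean().item():.4f}, F1={F1.mean().item():.4f}")
```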
3.3.3. Moderation
When performing sentiment analysis, violent, stigmatizing, and biased sentences often have a negative impact on the results. In other words, which words a sentence is composed of, and which nuances it carries, matters to both humans and machines. News, in particular, conveys the author’s intentions, which can be biased. Therefore, if a sentence contains extremely violent, simplistic, or sexual content, it can lead to misunderstandings or preconceived notions [
55,
56].
Therefore, this study conducted a moderation validation test using the Moderation API to verify whether our summaries contained any bias.
Moderation API is a tool that automatically checks whether text is safe and appropriate. Specifically, it is an endpoint API developed by OpenAI to detect policy violations in the outputs generated by large language models. It allows for the categorization of violent, emotionally charged, and sexual elements in sentences produced by either LLMs or humans. The system calculates a confidence score for each category, ranging from 0 to 1. Technically, OpenAI’s moderation API (
https://platform.openai.com/docs/guides/moderation (accessed on 1 May 2025)) is a tool used to identify and prevent inappropriate or dangerous content in generated sentences. It includes the following categories: hate, hate/threatening, harassment, harassment/threatening, self-harm, self-harm/intent, self-harm/instructions, sexual, sexual/minors, violence, and violence/graphic. OpenAI’s moderation determines whether the input sentence violates each category and the extent to which the sentence falls into it. In this study, we compared the number of moderation violations in summaries provided by The New York Times and summaries provided by ChatGPT, as well as the word composition of each summary, to show that summarization using ChatGPT is less biased than human summarization.
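As a concrete illustration, the snippet below sketches how each summary can be scored with the Moderation API, assuming the OpenAI Python client (v1.x); the category names and scores are defined by OpenAI and may change over time, as discussed further in Section 4.2.

```python
# Sketch of the moderation check applied to each summary (OpenAI Python client, v1.x).
from openai import OpenAI

client = OpenAI()

def moderation_scores(summary_text: str) -> dict:
    """Return the flagged status and per-category confidence scores (0-1) for a summary."""
    result = client.moderations.create(input=summary_text).results[0]
    return {
        "flagged": result.flagged,                      # True if any category is violated
        "scores": result.category_scores.model_dump(),  # e.g., {"violence": 0.83, ...}
    }
```

Counting how many NYT versus ChatGPT summaries are flagged then reduces to applying this function to both sets of summaries.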
4. Results and Discussion
Our approach involves two primary steps. Initially, we calculated and compared the BERTScore between the summary derived from the original New York Times article and that generated by ChatGPT. This analysis was supplemented by case studies that underscored the comparison results obtained. Subsequently, we turned our attention to the assessment and comparison of bias within the summaries of the original New York Times articles and those produced by ChatGPT. Case studies were presented to highlight and elucidate the outcomes of this comparative analysis.
The quantitative results demonstrate that ChatGPT outperforms T5 in semantic similarity to human-written summaries, as confirmed by statistically significant differences across all BERTScore metrics (Precision, Recall, and F1-score). This performance advantage may be attributed to ChatGPT’s underlying moderation-driven generation process, which inherently filters out violent or emotionally charged language. As a result, ChatGPT produces summaries that are more objective, fact-centered, and aligned with content moderation standards.
Furthermore, ChatGPT-generated summaries tend to avoid subjective or biased expressions often present in human-written summaries, particularly in politically or emotionally sensitive contexts. This contributes to higher moderation compliance and a more neutral tone overall.
However, several limitations must be acknowledged. First, ChatGPT may struggle with understanding cultural context, figurative language, or rhetorical nuance, which can lead to summaries that are factually correct but semantically shallow. Second, as a black-box model, ChatGPT does not provide transparency regarding its internal decision-making processes, making it difficult to verify or interpret how certain summarization choices are made. Third, since ChatGPT relies on OpenAI’s Moderation API to filter outputs, the moderation standards themselves—being proprietary and subject to change—introduce potential variability in outputs over time. This poses a reproducibility concern for studies or applications dependent on such moderation signals.
These findings highlight both the promise and the constraints of using large language models for automated summarization, especially in high-stakes domains like journalism.
4.1. Comparison Between Summary Sentences by NYT and ChatGPT
In this study, we employed the T5 model developed by [
57] to evaluate the performance of summary generation in comparison with ChatGPT. This involves contrasting the summaries produced by The NYT with those generated by both ChatGPT and T5.
The T5 model, proposed by Google, is a general-purpose natural language processing framework that unifies various language tasks by formulating them as a text-to-text problem. In contrast to ChatGPT, which adopts a decoder-only architecture, T5 is based on a Transformer encoder-decoder structure, enabling it to handle a wide range of tasks—such as translation, summarization, and question answering—within a single, consistent framework.
Studies have demonstrated that the T5 model excels at text transformation and outperforms various other models in generating summaries. It has shown superiority over models such as BART and ProphetNet in abstract summarization tasks, as well as greater efficiency compared to RNNs and LSTMs for news text summarization [
58,
59]. Moreover, T5 surpasses other pre-trained models such as BERT and GPT-2 in general natural language processing tasks [
60].
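For reference, the T5 baseline summaries can be produced with the Hugging Face transformers implementation; the checkpoint and generation settings in the sketch below are illustrative assumptions rather than the exact configuration used in this study.

```python
# Illustrative sketch of the T5 baseline summarizer (checkpoint and settings are assumptions).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def t5_summarize(news_body: str, max_summary_tokens: int = 60) -> str:
    # T5 casts every task as text-to-text, so the task is signalled by the "summarize:" prefix.
    inputs = tokenizer("summarize: " + news_body,
                       return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs,
                                max_new_tokens=max_summary_tokens,
                                num_beams=4, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```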
Through this comparative approach, we aimed to gauge the summary generation performance of the GPT model relative to the alternative models. In addition, we employed BERTScore to calculate the similarity between the NYT summary and both the ChatGPT and T5 summaries. This comparative analysis allowed us to assess the degree of similarity between the summaries produced by the ChatGPT (GPT 3.5) and T5 models and those provided by The New York Times.
Table 2 lists the basic statistics of the BERTScore Precision, Recall, and F1-score results for each summary. Based on the maximum, average, and minimum values of Precision, Recall, and F1-score, ChatGPT demonstrated greater similarity to the NYT summary than T5 across all three indicators. The high similarity between ChatGPT and NYT summaries is evident from their values approaching 1 for all three indicators, indicating a highly congruent summary generation. Furthermore, considering the standard deviation of these indicators, we inferred that ChatGPT consistently provided summaries that were more closely aligned with NYT summaries than those generated by the T5 model.
As shown in
Table 3, the independent two-sample
t-tests conducted to compare BERTScore metrics between ChatGPT and T5 yielded statistically significant differences in all three metrics: Precision, Recall, and F1-score. Furthermore, the 95% confidence interval for the mean difference in Precision between ChatGPT and T5 ranged from 0.0202 to 0.0227, indicating that ChatGPT scored approximately 2.02 to 2.27 percentage points higher than T5. These findings demonstrate the higher quality of the ChatGPT summaries relative to T5 and underscore the statistical reliability and significance of the observed differences.
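For clarity, the statistical comparison described above can be sketched as follows, assuming per-article BERTScore arrays for the two models (hypothetical variable names) and SciPy's independent two-sample t-test; the confidence interval uses a normal approximation.

```python
# Sketch of the significance test: independent two-sample t-test on per-article BERTScores.
import numpy as np
from scipy import stats

def compare_models(chatgpt_scores: np.ndarray, t5_scores: np.ndarray) -> None:
    """Report the t-test and an approximate 95% CI for the mean difference (ChatGPT - T5)."""
    t_stat, p_value = stats.ttest_ind(chatgpt_scores, t5_scores)
    diff = chatgpt_scores.mean() - t5_scores.mean()
    se = np.sqrt(chatgpt_scores.var(ddof=1) / len(chatgpt_scores)
                 + t5_scores.var(ddof=1) / len(t5_scores))
    print(f"t = {t_stat:.3f}, p = {p_value:.3g}, "
          f"95% CI for mean difference: [{diff - 1.96*se:.4f}, {diff + 1.96*se:.4f}]")
```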
Moreover, we conducted an in-depth analysis of the case studies that exhibited the highest and lowest F1-score values.
Table 4 and
Table 5 present the best and worst cases and the original article information for both cases. In the best-case scenario, it is apparent that the NYT summaries and those generated by ChatGPT are either aligned closely or are identical. Conversely, in the worst-case scenario, NYT summaries often echo the biases present in the news articles. Notably, the worst-case NYT summary adopted a critical tone, employing terms such as “brazen”. By contrast, summaries generated by ChatGPT tend to extract and synthesize information neutrally from the text. Consequently, when the original news content contains biases, particularly in political contexts, the similarity between human-generated and model-generated summaries may diminish because of inherent human biases and the impartial nature of ChatGPT. Conversely, if the original news content lacks biased elements, the similarity between the human- and model-generated summaries may increase significantly.
The problem of bias in LLMs has existed for a long time [
24,
61,
62]. The task of news summarization is to extract and condense a given text. If the news text is already biased, the summary is likely to be biased as well. However, as shown in
Table 4, the New York Times used the word “militants” and ChatGPT used “Taliban”, indicating that ChatGPT summarizes more factually than the human summarizer. In other words, this study showed that, even with biased news, summaries generated by ChatGPT are less biased than those generated by humans. Consequently, in assessing the summarization process undertaken by ChatGPT, we investigated whether ChatGPT tended to use more objective expressions than human-written summaries. This examination was conducted using the moderation validation test provided by OpenAI.
4.2. Moderation Test for the Summaries by NYT and ChatGPT
Table 6 presents the summaries in cases where both summaries violated moderation and when only the New York Times summary violated moderation. In instances where both summaries infringed moderation, the category of violation was violence, attributed to the simultaneous use of words like “beheaded” and “severed”. Despite this, both the ChatGPT and NYT summaries remained similar, as they encapsulated factual information that was challenging to paraphrase. Conversely, in cases where only the NYT’s summary violated moderation, terms like “con man” and “troublemaker” were employed to criticize a specific individual. However, the ChatGPT summary filtered out direct criticism to deliver a succinct account of the events described in the news text. For example, in the case of “Surgeons Labored to Save the Wounded in El Paso Mass Shooting”, the NYT’s summary conveyed the tense atmosphere without providing specific details, whereas ChatGPT’s summary precisely outlined the events while filtering out violent language.
Consequently, the likelihood of the NYT summary violating OpenAI’s moderation is higher than that of the ChatGPT summary. This suggests that ChatGPT is more adept at filtering out violent or biased expressions during sentence generation, focusing instead on summarizing objective facts.
We also examined whether the original text of the summary violating moderation also exhibited moderation violations.
Table 7 displays the titles of the articles in which the New York Times summary violated moderation. In one instance, the main body of a news article dated 18 October 2018, received a moderation violence score of 0.3, indicating the absence of violence within the article. However, both the NYT and ChatGPT summaries scored significantly higher at 0.8 and 0.9, respectively, suggesting the inclusion of more violent sentences in the summaries than in the body of the article.
Conversely, in a news article published on 30 July 2019, the disgust score was 0.02, signifying an absence of disgust-inducing content. The New York Times summary obtained a disgust score of 0.498, whereas the ChatGPT summary scored 0.001. This discrepancy highlights the higher aversive score of the human-written summary compared with the model-generated one. Additionally, in an article from 10 August 2019, with a violence score of 0.5, indicative of moderate violence, the New York Times summary generated a sentence with a violence score of 0.99, whereas ChatGPT produced a nonviolent summary with a violence score of 0.14.
These examples illustrate two distinct scenarios. First, when words from the news article appear verbatim in the summary, as in the first case, the violent nature of the sentence may persist if words like “severed” and “beheaded” are carried directly into the summary. Second, when violence in the news text is filtered and summarized into new sentences, as demonstrated in the second and third cases, the summaries tend to describe events rather than use accusatory language. Consequently, filtering out negative-impact words in the summary reduces the likelihood of misunderstanding or the formation of preconceived notions.
However, the moderation API of OpenAI utilized in this study has a black-box nature in that the parameters and moderation criteria used internally are determined by OpenAI’s internal policies, and this process is not disclosed to the outside world. It is also worth mentioning as a limitation that OpenAI can arbitrarily change the criteria and parameters of the Moderation API at any time, which may cause unexpected bias in future ChatGPT-based summarization systems. Therefore, when applying ChatGPT-based automated summarization systems in practice, it is recommended to recognize the possibility of changing the criteria of the Moderation API and perform periodic checks and additional supplementary evaluations.
5. Discussion and Concluding Remarks
In this study, we utilized ChatGPT to generate news summaries from journalist-provided articles and compared them with the original summaries using BERTScore. The analysis yielded an F1-score of 0.87, indicating substantial similarity. Notably, ChatGPT handled articles from periods not covered by its pretraining data, suggesting adaptability to future summarization tasks. In addition, employing OpenAI’s moderation API revealed that journalist-written summaries exhibited more moderation violations, potentially because of their use of violent or hazardous language. By contrast, ChatGPT-generated summaries maintain a focus on events without resorting to language that violates moderation criteria. This finding suggests that ChatGPT is adept at producing moderation-friendly news summaries.
The main results of this study are twofold.
First, by employing ChatGPT, we conducted a summarization task on news texts utilizing New York Times data up to 2022. Our analysis indicated a performance akin to that of human summarization, as evidenced by the close-to-unity BERTScore between the reference and ChatGPT-generated summaries. Notably, prior studies have often utilized datasets predating 2021, such as CNN/DM, XSum, and Newsroom [
63,
64,
65], comprising news texts and reference summaries.
Second, in our exploration of ChatGPT-generated news summaries, we observed a tendency for ChatGPT to employ words from the text verbatim, while avoiding the direct usage of violent or emotionally charged language. It is noteworthy that ChatGPT tends to mitigate the biases inherent in news events rather than merely altering wording. For instance, when a politician employs an overly simplistic term to denounce another individual in the news text, the ChatGPT-generated summary omits the derogatory term, focusing solely on the act of criticism by the politician. This finding underscores the potential of ChatGPT-generated summaries to exhibit reduced bias and extremism compared with human-generated summaries, which can vary depending on the author’s perspective and the information being conveyed. Consequently, our study suggests that ChatGPT may be able to moderate and diminish extreme expressions in summary generation more effectively than human writers.
Finally, when using automated summarization tools such as ChatGPT in news contexts, ethical concerns related to trust, misinformation, and bias can arise. ChatGPT-generated summaries may occasionally produce hallucinations or unintentionally reflect biases present in their training data, potentially affecting the accuracy and objectivity of news reporting. Additionally, unclear accountability for errors or biased outputs may negatively influence audience trust. Therefore, clear accountability guidelines and careful oversight are necessary when integrating automated summarization into journalism.
The implications of these results are as follows:
Our study underscores that news summaries generated by ChatGPT exhibit a high degree of contextual similarity to summaries derived directly from the original news text. This corroborates the findings of prior studies [
24,
27,
66], indicating ChatGPT’s proficiency in news summarization. Notably, when evaluating the contextual similarity between summaries produced by ChatGPT and those of the New York Times, a prominent international daily, using BERTScore, we observed an F1-score exceeding 0.87, signifying a high level of resemblance. This suggests a substantial advancement in the summarization capabilities of large language models, approaching human performance. Furthermore, our study indicates that ChatGPT’s commendable performance in news summarization may be attributed to its pretraining on extensive news data. In addition, the existing literature suggests that the quality of output can vary based on prompt utilization, hinting at the potential for further refining summary generation to align more closely with reference summaries [
19,
25].
Second, leveraging ChatGPT for summarization presents the advantage of mitigating moderation issues. Previous studies demonstrated ChatGPT’s efficacy in simplifying sentences to reduce bias and harmful language [
36]. In the context of news summarization, where objectivity is paramount and summaries shape public perception, the avoidance of biased or inflammatory language is crucial. Words employed in summaries can influence reader perceptions and sentiment analysis outcomes. Our study revealed that the ChatGPT-generated summaries tended to filter out violent or harmful language present in the original text, albeit without consistently using completely neutral language. Nevertheless, existing research highlights persistent biases in language generation based on gender and political orientation, likely stemming from the underlying training data [
37,
38]. Despite these challenges, appropriately formulated prompts offer a potential avenue for reducing bias and extremism in generated summaries [
25].
Our study proposes various potential applications based on these findings, particularly within the journalism industry. Given that ChatGPT-based summaries demonstrate high semantic similarity and reduced bias compared to human-written summaries, they can serve as editorial aids by minimizing unnecessary manual effort in the news summarization process. Moreover, they can function as moderation filters to detect and mitigate subjective emotions or biases that may be unconsciously introduced by human editors. This dual functionality offers substantial benefits in enhancing both the efficiency and objectivity of news production.
Nonetheless, despite these advantages, human oversight remains essential. Especially in media organizations that hold the power to shape public perception and values, it is critical to guard against potential misuse of AI systems. LLMs like GPT are vulnerable to biases embedded in their training data, and since such models are developed and operated by external entities (e.g., OpenAI), the risk of unintended bias persists. Therefore, media organizations should establish and adhere to rigorous ethical standards and implement bias auditing processes when integrating AI-based summarization systems, ensuring responsible and transparent use.
Under these conditions, the automatic summarization system proposed in this study can serve as an effective tool for improving summarization efficiency and enhancing content quality in the journalism industry.
The limitations of this study and avenues for future research are outlined below.
First, as discussed earlier, our study found that ChatGPT-generated summaries tended to reduce the bias present in the original text, as evaluated through the Moderation API. However, bias can still persist depending on the design of the prompts and the characteristics of the training data used for LLMs such as GPT. Such biases, whether implicit or explicit, have the potential to influence the outcomes of summarization tasks. Therefore, future research should not be limited to simple numerical comparisons using moderation APIs, but should develop more sophisticated methods to detect, measure, and mitigate these biases. Moreover, investigations should be expanded to encompass the ethical implications of AI-generated journalism, including how AI-generated content may influence public opinion, reinforce stereotypes, or introduce unintended biases.
Furthermore, our study utilized ChatGPT for news summarization through zero-shot generation. Given the known impact of prompt design on summary quality, future research should explore the influence of prompt design parameters on summarization outcomes.
Second, our analysis focused solely on articles from The New York Times for summary generation and comparisons. Restrictions on news articles necessitate future research exploring summary generation and comparisons across diverse text corpora, including books, research papers, and professional descriptions. Moreover, beyond English-language texts, future studies should also investigate cross-linguistic comparisons to evaluate the generalizability and robustness of ChatGPT-generated summaries across different languages and cultural contexts.
Third, as the application of ChatGPT continues to specialize across various fields, our summarization and evaluation system could similarly be extended by utilizing fine-tuned models tailored to specific domains such as healthcare, education, and others. Furthermore, conducting a comparative analysis of various LLMs such as Claude and Gemini can help elucidate performance differences across models. Integrating explainability tools like LIME and SHAP would also be a significant direction for future research, as they can provide deeper insights into the causes of moderation flags or inconsistencies in similarity scores.
Finally, a potential future research direction involves conducting sentiment analysis on summaries generated by ChatGPT. Comparing the sentiment analysis results between media-provided and ChatGPT-generated summaries can elucidate the summarization capabilities and limitations of ChatGPT. Furthermore, the reliability of ChatGPT can be assessed by scrutinizing the consistency of the sentiment-analysis outcomes between the two sets of summaries. In addition, we propose studying the hallucinations that may occur when ChatGPT generates text; in particular, it is worth investigating and analyzing the hallucinations that may arise when ChatGPT generates a summary.